优化Levenshtein距离的速度

编程入门 行业动态 更新时间:2024-10-07 10:15:22
本文介绍了优化Levenshtein距离的速度的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧! 问题描述

我有一个单元格字典,其中包含很多单词(大约15000个).

I have a cell array dictionary which contains a lot of words (ca. 15000).

我想为所有单词对计算函数strdist(以计算Levenshtein距离).我尝试了两种方法,但是它们都非常慢.什么是更有效的解决方案?

I want to compute the function strdist (to calculate the Levenshtein distance) for all the couples of words. I tried in two ways, but they are both really slow. What can be a more efficient solution?

这是我的代码(dict_keys是我的长度为m的单元格数组):

Here is my code (dict_keys is my cell array of length m):

1)

matrix = sparse(m,m); for i = 1:m-1; matrix(i,:) = cellfun( @(u) strdist(dict_keys{i},u), dict_keys ); end

2)

matrix = sparse(m,m); for i = 1:m-1; for j = i+1:m matrix(i,j) = strdist(dict_keys{i},dict_keys{j}); end end

推荐答案

函数'strdist'不是内置的matlab函数,所以我想如果您从文件交换.这就是为什么您的两种方法在时间上大致相等的原因,cellfun在内部只会扩展为一个循环.

Function 'strdist' is not an inbuilt matlab function, so I guess you took if from the File Exchange. That's also why both your approaches are roughly equal in time, cellfun internally just expands into a loop.

如果strdist是对称的,即strdist(a,b)== strdist(b,a),则可以节省一半的计算量.似乎是这种情况,因此仅在第二个循环(您正在执行的操作)中计算j<i的所有情况.

If strdist is symmetric, i.e. strdist(a,b)==strdist(b,a) you can however save half the computations. This seems to be the case, so only calculate all cases of j<i in the second loop (which you are doing).

否则,您可以在C中将strdist作为mex函数实现,并且可能会看到一些显着的速度改进.可以在 rosettacode 中找到Levenshtein距离的C实现.

Otherwise you could implement strdist in C as a mex function and probably see some significant speed improvements. A C implementation of the Levenshtein distance can be found for example at rosettacode.

或深入研究该算法如何计算两个字符串的距离的详细信息,看看是否可以对其向量化并减少二次方的运行时间.但是,这可能不是很容易.

Or dig into the details of how the algorithm computes the distance of two strings and see if you can vectorize it and reduce the runtime from quadratic so less. This however is probably not very easy.

最后,如果您拥有并行计算工具箱的许可和多核CPU,则由于strdist调用是完全相互独立的,因此可以轻松地并行化代码.

Finally if you have the Parallel Computing Toolbox licensed and a multicore CPU you can easily parallelize your code since the strdist calls are completely independent of each other.

更多推荐

优化Levenshtein距离的速度

本文发布于:2023-11-29 09:21:57,感谢您对本站的认可!
本文链接:https://www.elefans.com/category/jswz/34/1645990.html
版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。
本文标签:距离   速度   Levenshtein

发布评论

评论列表 (有 0 条评论)
草根站长

>www.elefans.com

编程频道|电子爱好者 - 技术资讯及电子产品介绍!