我有一个单元格字典,其中包含很多单词(大约15000个).
I have a cell array dictionary which contains a lot of words (ca. 15000).
我想为所有单词对计算函数strdist(以计算Levenshtein距离).我尝试了两种方法,但是它们都非常慢.什么是更有效的解决方案?
I want to compute the function strdist (to calculate the Levenshtein distance) for all the couples of words. I tried in two ways, but they are both really slow. What can be a more efficient solution?
这是我的代码(dict_keys是我的长度为m的单元格数组):
Here is my code (dict_keys is my cell array of length m):
1)
matrix = sparse(m,m); for i = 1:m-1; matrix(i,:) = cellfun( @(u) strdist(dict_keys{i},u), dict_keys ); end2)
matrix = sparse(m,m); for i = 1:m-1; for j = i+1:m matrix(i,j) = strdist(dict_keys{i},dict_keys{j}); end end推荐答案
函数'strdist'不是内置的matlab函数,所以我想如果您从文件交换.这就是为什么您的两种方法在时间上大致相等的原因,cellfun在内部只会扩展为一个循环.
Function 'strdist' is not an inbuilt matlab function, so I guess you took if from the File Exchange. That's also why both your approaches are roughly equal in time, cellfun internally just expands into a loop.
如果strdist是对称的,即strdist(a,b)== strdist(b,a),则可以节省一半的计算量.似乎是这种情况,因此仅在第二个循环(您正在执行的操作)中计算j<i的所有情况.
If strdist is symmetric, i.e. strdist(a,b)==strdist(b,a) you can however save half the computations. This seems to be the case, so only calculate all cases of j<i in the second loop (which you are doing).
否则,您可以在C中将strdist作为mex函数实现,并且可能会看到一些显着的速度改进.可以在 rosettacode 中找到Levenshtein距离的C实现.
Otherwise you could implement strdist in C as a mex function and probably see some significant speed improvements. A C implementation of the Levenshtein distance can be found for example at rosettacode.
或深入研究该算法如何计算两个字符串的距离的详细信息,看看是否可以对其向量化并减少二次方的运行时间.但是,这可能不是很容易.
Or dig into the details of how the algorithm computes the distance of two strings and see if you can vectorize it and reduce the runtime from quadratic so less. This however is probably not very easy.
最后,如果您拥有并行计算工具箱的许可和多核CPU,则由于strdist调用是完全相互独立的,因此可以轻松地并行化代码.
Finally if you have the Parallel Computing Toolbox licensed and a multicore CPU you can easily parallelize your code since the strdist calls are completely independent of each other.
更多推荐
优化Levenshtein距离的速度
发布评论