I am using the Levenshtein distance algorithm to compare a company name provided as user input against a database of known company names, to find the closest match. By itself the algorithm works okay, but I want to build in a bias so that the edit distance is considered lower if the initial parts of the strings match.
For example, if the search criterion is "ABCD", then "ABCD Co." and "XYX ABCD" have an identical edit distance (four insertions each). However, I want to give extra weight to the fact that the initial part of the first string matches the search criterion more closely than the initial part of the second string does.
One way of doing this might be to modify the insert/delete/replace costs so that they are higher at the beginning of the strings and lower towards the end. Does anyone have an example of a successful implementation of this? Is Levenshtein distance still the best way to do what I am trying to achieve? Is my assumption about the approach accurate?
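To make the idea concrete, here is a minimal sketch of a position-weighted variant. The names `position_weight` and `weighted_levenshtein`, the exponential decay, and the default parameters are all assumptions chosen for illustration, not an established algorithm or a tuned implementation.

```python
def position_weight(pos, start_weight=2.0, end_weight=1.0, decay=0.5):
    """Hypothetical cost of one edit at 0-based position `pos`.

    Starts at `start_weight` for pos 0 and decays towards `end_weight`.
    """
    return end_weight + (start_weight - end_weight) * decay ** pos


def weighted_levenshtein(query, candidate, weight=position_weight):
    """Levenshtein distance where each edit costs weight(position)."""
    n, m = len(query), len(candidate)
    # dist[i][j] = cheapest cost of turning query[:i] into candidate[:j]
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + weight(i - 1)  # delete query[i-1]
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + weight(j - 1)  # insert candidate[j-1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if query[i - 1] == candidate[j - 1]:
                substitution = 0.0
            else:
                # Assumption: a substitution is priced at the earlier index.
                substitution = weight(min(i - 1, j - 1))
            dist[i][j] = min(
                dist[i - 1][j] + weight(i - 1),      # deletion
                dist[i][j - 1] + weight(j - 1),      # insertion
                dist[i - 1][j - 1] + substitution,   # match or substitution
            )
    return dist[n][m]


if __name__ == "__main__":
    for name in ("ABCD Co.", "XYX ABCD"):
        print(name, weighted_levenshtein("ABCD", name))
```

With these weights the four insertions needed for "ABCD Co." fall at the cheap end of the string, while "XYX ABCD" needs its edits at the expensive start, so the first candidate ranks closer. Passing `weight=lambda pos: 1.0` recovers the ordinary Levenshtein distance.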
UPDATE: For my immediate purposes I have decided to forgo the above and instead use the Jaro-Winkler distance, which seems to solve the problem. However, I will leave this question open for further input.
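For anyone following along, here is a rough pure-Python sketch of Jaro-Winkler as I understand it. Note that it is a similarity score (higher means closer), not a distance. The function names are my own, and the prefix scale of 0.1 with a maximum prefix of 4 are the commonly quoted defaults, assumed here rather than taken from any particular library.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)


if __name__ == "__main__":
    for name in ("ABCD Co.", "XYX ABCD"):
        print(name, jaro_winkler("ABCD", name))
```

The prefix bonus is what supplies the bias towards matches at the start of the string: "ABCD Co." shares a four-character prefix with "ABCD" and scores well above "XYX ABCD".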
Recommended answer
What you're looking for sounds like a Smith-Waterman local alignment: en.wikipedia/wiki/Smith%E2%80%93Waterman_algorithm