Fuzzy matching multiple words in a string

Updated: 2024-10-14 18:19:40
Problem description

I'm trying to use the Levenshtein distance to find fuzzy keywords (static text) on an OCR'd page. To do this, I want to specify the percentage of errors that is allowed (say, 15%).

string Keyword = "past due electric service";

Since the keyword is 25 characters long, I want to allow 4 errors (25 * 0.15 = 3.75, rounded up). I need to be able to compare it to...

string Entire_OCR_Page = "previous bill amount payment received on 12/26/13 thank you! current electric service total balances unpaid 7 days after the total due date are subject to a late charge of 7.5% of the amount due or $2.00, whichever/5 greater. ";
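The error budget described above (15% of the keyword length, rounded up) can be sketched as follows. The class and method names are made up for illustration and are not part of the original post; decimal arithmetic is assumed so that a product such as 20 * 0.15 does not pick up a binary floating-point remainder and round up by accident.

```csharp
using System;

// Hypothetical helper, not from the original post.
public static class ErrorBudget
{
    // Number of edits tolerated for a keyword: errorPercent of its
    // length, with any fractional part rounded up.
    public static int AllowedErrors(string keyword, decimal errorPercent)
    {
        return (int)Math.Ceiling(keyword.Length * errorPercent / 100m);
    }
}
```

For the 25-character keyword and a 15% allowance this gives 3.75 rounded up, that is, 4 errors.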

This is how I am doing it now:

int LevenshteinDistance = LevenshteinAlgorithm(Keyword, Entire_OCR_Page); // = 202
int NumberOfErrorsAllowed = 4;
int Allowance = (Entire_OCR_Page.Length - Keyword.Length) + NumberOfErrorsAllowed; // = 205

Clearly, Keyword does not appear in Entire_OCR_Page, so it should not be found. But the Levenshtein distance (202) is within the allowance (205), which includes the 15% leeway, so my logic reports it as found.

Does anyone know of a better way to do this?

Recommended answer

I answered my own question by using substrings. Posting it in case others run into the same type of problem. It's a little unorthodox, but it works great for me.

// PossibleString, StaticText, ErrorAllowance, PossibleResult and StaticTextResult
// come from the surrounding code, which is not shown here.
decimal PossibleStringLength = (PossibleString.Length); //Length of string to search
decimal StaticTextLength = (StaticText.Length); //Length of text to search for
int TextLengthBuffer = (int)StaticTextLength - 1; //start looking for correct result with one less character than it should have.
int LowestLevenshteinNumber = 999999; //initialize insanely high maximum
decimal NumberOfErrorsAllowed = Math.Round((StaticTextLength * (ErrorAllowance / 100)), MidpointRounding.AwayFromZero); //Find number of errors allowed with given ErrorAllowance percentage

//Look for best match with 1 less character than it should have, then the correct amount of characters.
//And last, with 1 more character. (This is because one letter can be recognized as
//two (W -> VV) and vice versa)
for (int i = 0; i < 3; i++)
{
    for (int e = TextLengthBuffer; e <= (int)PossibleStringLength; e++)
    {
        string possibleResult = (PossibleString.Substring((e - TextLengthBuffer), TextLengthBuffer));
        int lAllowance = (int)(Math.Round((possibleResult.Length - StaticTextLength) + (NumberOfErrorsAllowed), MidpointRounding.AwayFromZero));
        int lNumber = LevenshteinAlgorithm(StaticText, possibleResult);

        if (lNumber <= lAllowance && ((lNumber < LowestLevenshteinNumber) || (TextLengthBuffer == StaticText.Length && lNumber <= LowestLevenshteinNumber)))
        {
            PossibleResult = (new StaticTextResult { text = possibleResult, errors = lNumber });
            LowestLevenshteinNumber = lNumber;
        }
    }
    TextLengthBuffer++;
}

public static int LevenshteinAlgorithm(string s, string t) // Levenshtein Algorithm
{
    int n = s.Length;
    int m = t.Length;
    int[,] d = new int[n + 1, m + 1];

    if (n == 0)
    {
        return m;
    }

    if (m == 0)
    {
        return n;
    }

    // First column and first row: distance from the empty string.
    for (int i = 0; i <= n; d[i, 0] = i++) { }
    for (int j = 0; j <= m; d[0, j] = j++) { }

    for (int i = 1; i <= n; i++)
    {
        for (int j = 1; j <= m; j++)
        {
            int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
            d[i, j] = Math.Min(
                Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                d[i - 1, j - 1] + cost);
        }
    }
    return d[n, m];
}
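For readers who want to try the idea in isolation, here is a minimal self-contained sketch of the same sliding-window approach. It is not the answerer's exact code: the names FuzzyKeywordFinder, Match and FindBest are hypothetical, the error budget is a fixed count (rounded up, as in the question) instead of being adjusted by the window length, and the Levenshtein routine keeps only two rows of the table.

```csharp
using System;

// Hypothetical names; a sketch of the substring idea, not the original code.
public static class FuzzyKeywordFinder
{
    public sealed class Match
    {
        public string Text = "";   // the window that matched
        public int Start;          // its offset in the page
        public int Errors;         // its edit distance to the keyword
    }

    // Slides windows of length keyword.Length - 1, keyword.Length and
    // keyword.Length + 1 across the page. Returns the window with the
    // smallest edit distance, or null if none is within the error budget.
    public static Match FindBest(string keyword, string page, decimal errorPercent)
    {
        int allowed = (int)Math.Ceiling(keyword.Length * errorPercent / 100m);
        Match best = null;

        for (int len = keyword.Length - 1; len <= keyword.Length + 1; len++)
        {
            if (len < 1 || len > page.Length) continue;
            for (int start = 0; start + len <= page.Length; start++)
            {
                string window = page.Substring(start, len);
                int errors = Levenshtein(keyword, window);
                // Strict comparison: on a tie, the earliest window found wins.
                if (errors <= allowed && (best == null || errors < best.Errors))
                {
                    best = new Match { Text = window, Start = start, Errors = errors };
                }
            }
        }
        return best;
    }

    // Levenshtein distance using two rows of the dynamic-programming table.
    public static int Levenshtein(string s, string t)
    {
        int[] prev = new int[t.Length + 1];
        int[] curr = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++) prev[j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[t.Length];
    }
}
```

Called as FuzzyKeywordFinder.FindBest("past due electric service", "your past due e1ectric servce total", 15m), this should return the window "past due e1ectric servce" with 2 errors. For the page from the question it should return null, since the closest phrase there, "current electric service", is more than 4 edits away.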

Published: 2023-10-23 05:28:55
Link: https://www.elefans.com/category/jswz/34/1519925.html