The following strings are considered equal. How can I match strings like these?

"Hazard Const. Company"
"hazard construction company"
"PETERSON-CHASE GENERAL ENGINEERING CONSTRUCTION INC"
"peterson-chase general engineering construction inc"
"TRAFFIC DEVELOPMENT SERVICES "
"traffic development services"

My environment is Ruby, but I'm mainly after general principles for matching strings. The examples above don't work with a rudimentary a == b comparison because of whitespace differences and abbreviations. I can mitigate casing issues with a case-insensitive regex or by downcasing the strings...
Recommended answer

The following sample compares all of your strings and computes the Levenshtein distance between them (the number of single-character edits it takes to turn one string into the other).
Based on a defined maximum distance, with a compensation for the length of the string (longer strings are allowed more edits), it then puts each string in a hash as a key, with the number of occurrences as the value.
    require 'levenshtein'

    MAX_DISTANCE, COMPENSATION = 3, 5

    strings = [
      "Hazard Const. Company",
      "hazard construction company",
      "PETERSON-CHASE GENERAL ENGINEERING CONSTRUCTION INC",
      "peterson-chase general engineering construction inc",
      "TRAFFIC DEVELOPMENT SERVICES ",
      "traffic development services"
    ]

    result = {}
    strings.each do |s|
      # Non-mutating downcase, so this also works with frozen string literals.
      s = s.downcase
      # Existing keys that are close enough; longer strings tolerate more edits.
      similar = result.keys.select { |key| Levenshtein.distance(key, s) < MAX_DISTANCE + (s.length / COMPENSATION) }
      if similar.any?
        result[similar.first] += 1
      else
        result[s] = 1
      end
    end

    puts result.inspect
    # {"hazard const. company"=>2, "peterson-chase general engineering construction inc"=>2, "traffic development services "=>2}
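Since the question also asks about whitespace and abbreviations, one option is to normalize the strings before measuring any distance. The following is only a rough sketch under stated assumptions: the ABBREVIATIONS table and the normalize and levenshtein helpers are hypothetical names introduced here for illustration, not part of the levenshtein gem or of the answer above. The distance is implemented in plain Ruby only so that the sketch runs without installing anything.

```ruby
# Hypothetical abbreviation table (an assumption); extend it for your own data.
ABBREVIATIONS = {
  "const." => "construction",
  "co."    => "company"
}.freeze

# Lowercase, trim, collapse runs of whitespace, and expand known abbreviations.
def normalize(name)
  name.downcase.strip.split(/\s+/).map { |word| ABBREVIATIONS.fetch(word, word) }.join(" ")
end

# Plain-Ruby Levenshtein distance using two rows of the usual
# dynamic-programming table, so no gem is required for this sketch.
def levenshtein(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,        # deletion
        current[j - 1] + 1,     # insertion
        previous[j - 1] + cost  # substitution (or match)
      ].min
    end
    previous = current
  end
  previous.last
end

raw_a = "Hazard Const. Company"
raw_b = "hazard construction company"

# Distance after downcasing only, then after full normalization.
puts levenshtein(raw_a.downcase, raw_b.downcase)
puts levenshtein(normalize(raw_a), normalize(raw_b))
```

With the abbreviation expanded up front, the two spellings of the first company become identical, so the result depends less on how generous the distance threshold is. The trade-off is that the abbreviation table has to be maintained by hand.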