I have to cross-validate some data based on names.
The problem I'm facing is that, depending on the source, names have slight variations, for example:
L & L AIR CONDITIONING vs L & L AIR CONDITIONING Service
BEST ROOFING vs ROOFING INC

I have several thousand records, so doing this manually would be very time-consuming; I want to automate the process as much as possible.
Since the variations involve additional words, simply lowercasing the names wouldn't be enough.
Which algorithms are good for handling this?
Maybe compute a similarity score that gives low weight to words like 'INC' or 'Service'?
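The idea of down-weighting generic words could be sketched as a normalization step that simply drops them before comparison. This is a minimal sketch, assuming a small illustrative `GENERIC` list (not exhaustive, and the word list here is hypothetical, not from the question):

```python
import re

# illustrative list of generic business words to ignore
GENERIC = {'inc', 'llc', 'corp', 'co', 'service', 'services', 'company'}

def normalize(name):
    """Lowercase, tokenize, and drop generic business words."""
    tokens = re.findall(r'\w+', name.lower())
    return ' '.join(t for t in tokens if t not in GENERIC)

normalize('L & L AIR CONDITIONING Service')  # 'l l air conditioning'
normalize('BEST ROOFING INC')                # 'best roofing'
```

After normalization, two variants of the same company are much more likely to compare as equal or near-equal.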
I tried the difflib library:

```python
difflib.SequenceMatcher(None, name_1.lower(), name_2.lower()).ratio()
```

and I got decent results.
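As a self-contained sketch, the difflib approach applied to the names from the question looks like this:

```python
import difflib

name_1 = 'L & L AIR CONDITIONING'
name_2 = 'L & L AIR CONDITIONING Service'

# character-level similarity in [0, 1]
ratio = difflib.SequenceMatcher(None, name_1.lower(), name_2.lower()).ratio()
print(round(ratio, 3))  # ≈ 0.846
```

`SequenceMatcher.ratio()` compares characters, so extra words like "Service" pull the score down proportionally to their length rather than being ignored.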
Answer

I would use cosine similarity for this. It will give you a matching score for how close the strings are.
Here is code to help you with that (I remember getting this code from Stack Overflow itself some months ago - I couldn't find the link now):
```python
import re, math
from collections import Counter

WORD = re.compile(r'\w+')

def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)
    sum1 = sum(vec1[x] ** 2 for x in vec1.keys())
    sum2 = sum(vec2[x] ** 2 for x in vec2.keys())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0
    return float(numerator) / denominator

def text_to_vector(text):
    return Counter(WORD.findall(text))

def get_similarity(a, b):
    a = text_to_vector(a.strip().lower())
    b = text_to_vector(b.strip().lower())
    return get_cosine(a, b)

get_similarity('L & L AIR CONDITIONING', 'L & L AIR CONDITIONING Service')
# returns 0.9258200997725514
```

Another version that I found useful is slightly NLP-based; I authored it myself:
```python
import re, math
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.corpus import wordnet as wn

stop = stopwords.words('english')
WORD = re.compile(r'\w+')
stemmer = PorterStemmer()

def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)
    sum1 = sum(vec1[x] ** 2 for x in vec1.keys())
    sum2 = sum(vec2[x] ** 2 for x in vec2.keys())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0
    return float(numerator) / denominator

def text_to_vector(text):
    # expand each word with its WordNet synonyms, then stem and
    # drop stopwords before counting
    words = WORD.findall(text)
    a = []
    for i in words:
        for ss in wn.synsets(i):
            a.extend(ss.lemma_names())
    for i in words:
        if i not in a:
            a.append(i)
    a = set(a)
    w = [stemmer.stem(i) for i in a if i not in stop]
    return Counter(w)

def get_similarity(a, b):
    a = text_to_vector(a.strip().lower())
    b = text_to_vector(b.strip().lower())
    return get_cosine(a, b)

def get_char_wise_similarity(a, b):
    a = text_to_vector(a.strip().lower())
    b = text_to_vector(b.strip().lower())
    s = []
    for i in a:
        for j in b:
            s.append(get_similarity(str(i), str(j)))
    try:
        return sum(s) / float(len(s))
    except ZeroDivisionError:  # len(s) == 0
        return 0

get_similarity('I am a good boy', 'I am a very disciplined guy')
# Returns 0.5491201525567068
```

You can call either `get_similarity` or `get_char_wise_similarity` to see which works better for your use case. I used both: normal similarity to weed out the really close matches, then character-wise similarity to weed out the close-enough ones. The remaining records had to be dealt with manually.
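To apply this across thousands of records, one option is to score every pair of names and flag those above a threshold as candidate duplicates. This is a minimal sketch, assuming an illustrative `names` list and threshold value (both hypothetical), with the word-level cosine similarity from the first snippet condensed inline to keep it self-contained. Note the pairwise loop is O(n²), which is workable for thousands of records but would need blocking or indexing for much larger sets:

```python
import re, math
from itertools import combinations
from collections import Counter

def get_similarity(a, b):
    # word-level cosine similarity, condensed from the snippet above
    va = Counter(re.findall(r'\w+', a.strip().lower()))
    vb = Counter(re.findall(r'\w+', b.strip().lower()))
    num = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    den = (math.sqrt(sum(v * v for v in va.values()))
           * math.sqrt(sum(v * v for v in vb.values())))
    return num / den if den else 0.0

# illustrative sample; in practice this would be your full record set
names = ['L & L AIR CONDITIONING', 'L & L AIR CONDITIONING Service',
         'BEST ROOFING', 'ROOFING INC']
THRESHOLD = 0.8  # illustrative; tune on a labelled sample of your data

candidates = [(a, b) for a, b in combinations(names, 2)
              if get_similarity(a, b) >= THRESHOLD]
# candidates holds pairs of names likely referring to the same company
```

With these inputs only the two "L & L AIR CONDITIONING" variants clear the threshold; "BEST ROOFING" vs "ROOFING INC" scores 0.5 and would fall into the manual-review pile.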