Fuzzy group by: grouping similar words

Question

This question has been asked here before:

What is a good strategy to group similar words?

but no clear answer was given on how to "group" items. The solution based on difflib is basically search: for a given item, difflib can return the most similar word from a list. But how can this be used for grouping?
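For concreteness, here is a minimal sketch of that search behaviour, using the small example list below (the cutoff of 0.8 is an arbitrary assumption). get_close_matches ranks candidates for one query word; it does not partition the list:

import difflib

words = ['ape', 'appel', 'apple', 'peach', 'puppy']
# a lookup for a single query word, not a grouping of the whole list
print(difflib.get_close_matches('appel', words, n=3, cutoff=0.8))
# typically prints ['appel', 'apple'] -- a ranked search result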

I want to reduce

['ape', 'appel', 'apple', 'peach', 'puppy']

to

['ape', 'appel', 'peach', 'puppy']

or

['ape', 'apple', 'peach', 'puppy']

One idea I tried was: for each item, iterate through the list; if get_close_matches returns more than one match, use it, otherwise keep the word as is. This partly worked, but it can suggest apple for appel and then appel for apple, so these words would simply switch places and nothing would change.
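A short sketch of that attempt (again assuming a cutoff of 0.8) shows the symmetry problem: each word just points at its nearest neighbour, so no single representative emerges:

import difflib

words = ['ape', 'appel', 'apple', 'peach', 'puppy']
for w in words:
    others = [x for x in words if x != w]
    match = difflib.get_close_matches(w, others, n=1, cutoff=0.8)
    print(w, '->', match[0] if match else w)
# 'appel -> apple' and 'apple -> appel': the two words merely swap places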

I would appreciate any pointers, names of libraries, etc.

Note: also in terms of performance, we have a list of 300,000 items, and get_close_matches seems a bit slow. Does anyone know of a C/C++ based solution?
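One possible answer to the performance question (an assumption beyond the original post) is RapidFuzz, a library with a C++ core that exposes a fast Levenshtein distance. It could replace both difflib and the pure-Python distance method shown in the answer below:

# pip install rapidfuzz  (assumed to be installed)
from rapidfuzz.distance import Levenshtein

# same edit-distance semantics as the DP matrix in the answer below, but in C++
print(Levenshtein.distance('appel', 'apple'))  # 2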

Thanks,

Note: further investigation revealed that k-medoids is the right algorithm (as well as hierarchical clustering), since k-medoids does not require "centers"; it uses the data points themselves as centers (these points are called medoids, hence the name). In the word-grouping case, the medoid would be the representative element of that group/cluster.
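The note above names k-medoids and hierarchical clustering; below is a hedged sketch of the hierarchical route using SciPy on a matrix of pairwise dissimilarities. The difflib-based dissimilarity and the 0.3 cut threshold are illustrative assumptions, not part of the original post:

import difflib
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

words = ['ape', 'appel', 'apple', 'peach', 'puppy']
n = len(words)

# pairwise dissimilarity: 1 - difflib similarity ratio
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = 1.0 - difflib.SequenceMatcher(None, words[i], words[j]).ratio()
        dist[i, j] = dist[j, i] = d

# average-linkage clustering, cut at a dissimilarity of 0.3
labels = fcluster(linkage(squareform(dist), method='average'),
                  t=0.3, criterion='distance')

groups = {}
for word, label in zip(words, labels):
    groups.setdefault(label, []).append(word)
print(list(groups.values()))
# e.g. [['ape', 'appel', 'apple'], ['peach'], ['puppy']]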

Answer

You need to normalize the groups. In each group, pick one word or code that represents the group, then group the words by their representative.

Some possible approaches:

  • Pick the first encountered word.
  • Pick the lexicographically first word.
  • Derive a pattern for all the words.
  • Pick a unique index.
  • Use the soundex code as the pattern (see the sketch after this list).
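As a concrete illustration of the last point, here is a minimal sketch that buckets words by a simplified Soundex code. The implementation below is an assumption for demonstration; real Soundex has a few extra rules, and libraries such as jellyfish provide it ready-made:

from itertools import groupby

# letter-to-digit table used by Soundex; vowels, h, w and y are simply dropped
CODES = {c: str(d) for d, letters in enumerate(
    ['bfpv', 'cgjkqsxz', 'dt', 'l', 'mn', 'r'], start=1) for c in letters}

def soundex(word):
    word = word.lower()
    digits = [CODES[c] for c in word[1:] if c in CODES]
    collapsed = [d for d, _ in groupby(digits)]  # collapse repeated digits
    return (word[0].upper() + ''.join(collapsed) + '000')[:4]

buckets = {}
for w in ['ape', 'appel', 'apple', 'peach', 'puppy']:
    buckets.setdefault(soundex(w), []).append(w)
print(buckets)
# e.g. {'A100': ['ape'], 'A140': ['appel', 'apple'], 'P200': ['peach'], 'P100': ['puppy']}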

Grouping the words could be difficult, though. If A is similar to B, and B is similar to C, A and C are not necessarily similar to each other. If B is the representative, both A and C could be included in the group; but if A or C is the representative, the other could not be included.

Going by the first alternative (first encountered word):

class Seeder:
    """Maps each word to the first previously seen word within edit distance LIMIT."""
    def __init__(self):
        self.seeds = set()
        self.cache = dict()

    def get_seed(self, word):
        LIMIT = 2
        seed = self.cache.get(word, None)
        if seed is not None:
            return seed
        # reuse an existing seed if the word is close enough to it
        for seed in self.seeds:
            if self.distance(seed, word) <= LIMIT:
                self.cache[word] = seed
                return seed
        # otherwise the word becomes a new seed (its own representative)
        self.seeds.add(word)
        self.cache[word] = word
        return word

    def distance(self, s1, s2):
        # plain Levenshtein edit distance, computed with a full DP matrix
        l1 = len(s1)
        l2 = len(s2)
        matrix = [list(range(zz, zz + l1 + 1)) for zz in range(l2 + 1)]
        for zz in range(l2):
            for sz in range(l1):
                if s1[sz] == s2[zz]:
                    matrix[zz + 1][sz + 1] = min(matrix[zz + 1][sz] + 1,
                                                 matrix[zz][sz + 1] + 1,
                                                 matrix[zz][sz])
                else:
                    matrix[zz + 1][sz + 1] = min(matrix[zz + 1][sz] + 1,
                                                 matrix[zz][sz + 1] + 1,
                                                 matrix[zz][sz] + 1)
        return matrix[l2][l1]


import itertools

def group_similar(words):
    seeder = Seeder()
    # sorting by seed puts members of the same group next to each other,
    # so itertools.groupby can collect them
    words = sorted(words, key=seeder.get_seed)
    groups = itertools.groupby(words, key=seeder.get_seed)
    return [list(v) for k, v in groups]

Example:

import pprint

pprint.pprint(group_similar([
    'the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have', 'I',
    'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you', 'do', 'at',
    'this', 'but', 'his', 'by', 'from', 'they', 'we', 'say', 'her', 'she',
    'or', 'an', 'will', 'my', 'one', 'all', 'would', 'there', 'their', 'what',
    'so', 'up', 'out', 'if', 'about', 'who', 'get', 'which', 'go', 'me',
    'when', 'make', 'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
    'people', 'into', 'year', 'your', 'good', 'some', 'could', 'them', 'see', 'other',
    'than', 'then', 'now', 'look', 'only', 'come', 'its', 'over', 'think', 'also',
    'back', 'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well', 'way',
    'even', 'new', 'want', 'because', 'any', 'these', 'give', 'day', 'most', 'us'
]), width=120)

Output:

[['after'],
 ['also'],
 ['and', 'a', 'in', 'on', 'as', 'at', 'an', 'one', 'all', 'can', 'no', 'want', 'any'],
 ['back'],
 ['because'],
 ['but', 'about', 'get', 'just'],
 ['first'],
 ['from'],
 ['good', 'look'],
 ['have', 'make', 'give'],
 ['his', 'her', 'if', 'him', 'its', 'how', 'us'],
 ['into'],
 ['know', 'new'],
 ['like', 'time', 'take'],
 ['most'],
 ['of', 'I', 'it', 'for', 'not', 'he', 'you', 'do', 'by', 'we', 'or', 'my', 'so', 'up', 'out', 'go', 'me', 'now'],
 ['only'],
 ['over', 'our', 'even'],
 ['people'],
 ['say', 'she', 'way', 'day'],
 ['some', 'see', 'come'],
 ['the', 'be', 'to', 'that', 'this', 'they', 'there', 'their', 'them', 'other', 'then', 'use', 'two', 'these'],
 ['think'],
 ['well'],
 ['what', 'who', 'when', 'than'],
 ['with', 'will', 'which'],
 ['work'],
 ['would', 'could'],
 ['year', 'your']]
