Python fuzzy string matching as a correlation-style table/matrix

Updated: 2024-10-22 14:38:44

Problem description

I have a file with x number of string names and their associated IDs. Essentially two columns of data.

What I would like is a correlation-style table in x-by-x format (with the data in question on both the x-axis and the y-axis), but instead of correlation, I would like the output of the fuzzywuzzy library's function fuzz.ratio(x, y), using the string names as input. Essentially, every entry is run against every entry.

This is sort of what I had in mind. Just to show my intent:

    import pandas as pd
    from fuzzywuzzy import fuzz

    df = pd.read_csv('random_data_file.csv')
    df = df[['ID','String']]
    df['String_Dup'] = df['String'] #creating duplicate of data in question
    df = df.set_index('ID')

    df = df.groupby('ID')[['String','String_Dup']].apply(fuzz.ratio())

But clearly this approach is not working for me at the moment. Any help appreciated. It doesn't have to be pandas, it is just an environment I am relatively more familiar with.

I hope my issue is clearly worded, and really, any input is appreciated.

Recommended answer

Use pandas' crosstab function, followed by a column-wise apply to compute the fuzz. This is considerably more elegant than my first answer.

    import pandas as pd
    from fuzzywuzzy import fuzz

    # Create sample data frame.
    df = pd.DataFrame([(1, 'abracadabra'), (2,'abc'), (3,'cadra'), (4, 'brabra')],
                      columns=['id', 'strings'])

    # Create the cartesian product between the strings column with itself.
    ct = pd.crosstab(df['strings'], df['strings'])
    # Note: for pandas versions <0.22, the two series must have different names.
    # In case you observe a "Level XX not found" error, the following may help:
    # ct = pd.crosstab(df['strings'].rename(), df['strings'].rename())

    # Apply the fuzz (column-wise). Argument col has type pd.Series.
    ct = ct.apply(lambda col: [fuzz.ratio(col.name, x) for x in col.index])

    # This results in the following:
    #       strings      abc  abracadabra  brabra  cadra
    #       strings
    #       abc          100           43      44     25
    #       abracadabra   43          100      71     62
    #       brabra        44           71     100     55
    #       cadra         25           62      55    100
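Since the question says the solution does not have to use pandas, the same every-entry-against-every-entry table can also be sketched with the standard library alone. This is only an illustration under stated assumptions: the helper names `similarity` and `score_matrix` are hypothetical, and `difflib.SequenceMatcher` stands in for `fuzz.ratio`, so scores may differ from fuzzywuzzy's when it runs on a different backend.

```python
from difflib import SequenceMatcher
from itertools import product


def similarity(a: str, b: str) -> int:
    """Stand-in scorer (assumption): difflib ratio scaled to 0-100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def score_matrix(strings):
    """Score every string against every string, stored as a nested dict."""
    matrix = {row: {} for row in strings}
    for row, col in product(strings, repeat=2):
        matrix[row][col] = similarity(row, col)
    return matrix


strings = ["abracadabra", "abc", "cadra", "brabra"]
matrix = score_matrix(strings)

# Print one row of scores per string, in the same column order.
for row in strings:
    print(row.ljust(12), [matrix[row][col] for col in strings])
```

The nested dict is keyed by row string and then column string, so `matrix["abc"]["cadra"]` reads like a cell lookup in the crosstab result.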

For simplicity, I omitted the groupby operation suggested in your question. In case you want to apply the fuzzy string matching on groups, simply create a separate function:

    def cross_fuzz(df):
        ct = pd.crosstab(df['strings'], df['strings'])
        ct = ct.apply(lambda col: [fuzz.ratio(col.name, x) for x in col.index])
        return ct

    df.groupby('id').apply(cross_fuzz)
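Group-wise matching only produces useful tables when a group holds more than one string; with the sample data, where every id is unique, each group would contain a single string. The sketch below illustrates the per-group idea in plain Python. The group labels `g1` and `g2`, the `records` list, and the helper names are all hypothetical, and `difflib.SequenceMatcher` is again assumed as a stand-in for `fuzz.ratio`.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Stand-in scorer (assumption): difflib ratio scaled to 0-100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def cross_scores(strings):
    """Pairwise score table for one group, as a nested dict."""
    return {a: {b: similarity(a, b) for b in strings} for a in strings}


# Hypothetical records: (group label, string).
records = [
    ("g1", "abracadabra"),
    ("g1", "brabra"),
    ("g2", "abc"),
    ("g2", "cadra"),
]

# Collect the strings that belong to each group.
groups = defaultdict(list)
for label, text in records:
    groups[label].append(text)

# Build one pairwise table per group; strings are never compared across groups.
tables = {label: cross_scores(members) for label, members in groups.items()}

for label, table in tables.items():
    print(label, table)
```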

Published: 2023-11-29 17:21:22
Article link: https://www.elefans.com/category/jswz/34/1647077.html