Python fuzzy matching of strings in a list: performance

Updated: 2024-10-14 16:25:26

Problem description

I'm checking whether there are similar results (fuzzy matches) across 4 columns of the same dataframe, and I have the following code as an example. When I apply it to the real dataset of 40,000 rows x 4 columns, it keeps running indefinitely. The issue is that the code is too slow. For example, if I limit the dataset to 10 users, it takes 8 minutes to compute, while for 20 users it takes 19 minutes. Is there anything I am missing? I do not know why this takes so long. I expect to have all results in 2 hours at most. Any hint or help would be greatly appreciated.

from fuzzywuzzy import process

dataframecolumn = ["apple", "tb"]
compare = ["adfad", "apple", "asple", "tab"]

# For each value, get the best candidates from `compare` with their scores
Ratios = [process.extract(x, compare) for x in dataframecolumn]

result = list()
for ratio in Ratios:
    for match in ratio:
        # Keep the first candidate that is not an exact (100) match
        if match[1] != 100:
            result.append(match)
            break

print(result)

Output: [('asple', 80), ('tab', 80)]

Recommended answer

Major speed improvements come from writing vectorized operations and avoiding loops.

Import the necessary packages

from fuzzywuzzy import fuzz
import pandas as pd
import numpy as np

Create a dataframe from the first list

dataframecolumn = pd.DataFrame(["apple", "tb"])
dataframecolumn.columns = ['Match']

Create a dataframe from the second list

compare = pd.DataFrame(["adfad", "apple", "asple", "tab"])
compare.columns = ['compare']

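Merge: build the Cartesian product by introducing a key (self join)

# A constant key on both sides makes the merge pair every row with every row
dataframecolumn['Key'] = 1
compare['Key'] = 1
combined_dataframe = dataframecolumn.merge(compare, on="Key", how="left")
# Drop exact matches. Bracket indexing is used for the 'compare' column because
# recent pandas versions also define a DataFrame.compare method.
combined_dataframe = combined_dataframe[~(combined_dataframe['Match'] == combined_dataframe['compare'])]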
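As a possible alternative to the dummy-key merge above, pandas 1.2 and later accept how="cross" in merge, which builds the same Cartesian product without a helper column. The sketch below assumes such a pandas version; the names left, right and pairs are illustrative and do not come from the original answer.

```python
import pandas as pd

# Same toy data as in the answer; `left`, `right` and `pairs` are illustrative names.
left = pd.DataFrame({"Match": ["apple", "tb"]})
right = pd.DataFrame({"compare": ["adfad", "apple", "asple", "tab"]})

# how="cross" (pandas >= 1.2) pairs every row of `left` with every row of `right`.
pairs = left.merge(right, how="cross")

# Drop exact matches; bracket indexing avoids a clash with the DataFrame.compare method.
pairs = pairs[pairs["Match"] != pairs["compare"]]

# 2 * 4 = 8 pairs, minus the single exact match ("apple", "apple")
print(len(pairs))
```

Either way, the number of rows grows with the product of the two input lengths, so memory use is worth checking before applying this to a large dataset.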

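Vectorization

def partial_match(x, y):
    return fuzz.ratio(x, y)

# np.vectorize applies the function element by element; NumPy documents it
# as a convenience wrapper rather than a performance tool
partial_match_vector = np.vectorize(partial_match)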

Use the vectorized function and set a threshold on the score to get the desired result

combined_dataframe['score'] = partial_match_vector(combined_dataframe['Match'], combined_dataframe['compare'])
combined_dataframe = combined_dataframe[combined_dataframe.score >= 80]

Result

+--------+-----+---------+-------+
| Match  | Key | compare | score |
+--------+-----+---------+-------+
| apple  | 1   | asple   | 80    |
| tb     | 1   | tab     | 80    |
+--------+-----+---------+-------+
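To see the pairing, scoring and threshold steps in one place without third-party packages, here is a minimal sketch using only the standard library. It assumes difflib.SequenceMatcher as a stand-in for fuzz.ratio (scores may differ from fuzzywuzzy's on other inputs), and the helper names similarity and fuzzy_pairs are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product


def similarity(a, b):
    """Return a 0-100 similarity score (assumed stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def fuzzy_pairs(left, right, threshold=80):
    """Score every non-identical (left, right) pair; keep scores >= threshold."""
    matches = []
    for a, b in product(left, right):  # Cartesian product of the two lists
        if a == b:
            continue  # skip exact matches, as the answer does
        score = similarity(a, b)
        if score >= threshold:
            matches.append((a, b, score))
    return matches


result = fuzzy_pairs(["apple", "tb"], ["adfad", "apple", "asple", "tab"])
print(result)  # [('apple', 'asple', 80), ('tb', 'tab', 80)]
```

On this toy data the sketch keeps the same two pairs as the result table. The number of comparisons still equals the product of the two list lengths, so the cost per comparison is what decides the run time on larger inputs.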

Published: 2023-10-23 05:28:36

Article link: https://www.elefans.com/category/jswz/34/1519924.html