A faster way to perform fuzzy string matching in pandas

Updated: 2024-10-17 05:38:16

Problem description

Is there any way to speed up fuzzy string matching with fuzzywuzzy in pandas?

I have a dataframe, extra_names, containing names that I want to fuzzy-match against the names in another dataframe, names_df.

>> extra_names.head()
       not_matching
0         Vij Sales
1  Crom Electronics
2       REL Digital
3        Bajaj Elec
4     Reliance Digi

>> len(extra_names)
6500

>> names_df.head()
               names  types
0        Vijay Sales      1
1  Croma Electronics      1
2   Reliance Digital      2
3  Bajaj Electronics      2
4    Pai Electricals      2

>> len(names_df)
250

So far I have been running the logic with the following code, but it's taking forever to complete.

from fuzzywuzzy import process  # import implied by the call to process.extractOne

choices = names_df['names'].unique().tolist()

def fuzzy_match(row):
    best_match = process.extractOne(row, choices)
    return best_match[0], best_match[1] if best_match else '',''

%%timeit
extra_names['best_match'], extra_names['match%'] = extra_names['not_matching'].apply(fuzzy_match)

As I post this question, the query is still running. Is there any way to speed up this fuzzy string matching process?

Recommended answer

Let's try difflib, which is part of the Python standard library:

import difflib
from functools import partial

f = partial(
    difflib.get_close_matches, possibilities=names_df['names'].tolist(), n=1)

matches = extra_names['not_matching'].map(f).str[0].fillna('')
scores = [
    difflib.SequenceMatcher(None, x, y).ratio()
    for x, y in zip(matches, extra_names['not_matching'])
]

extra_names.assign(best=matches, score=scores)

       not_matching               best     score
0         Vij Sales        Vijay Sales  0.900000
1  Crom Electronics  Croma Electronics  0.969697
2       REL Digital   Reliance Digital  0.666667
3        Bajaj Elec  Bajaj Electronics  0.740741
4     Reliance Digi   Reliance Digital  0.896552
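If extra_names contains repeated names, memoizing the lookup so that each distinct string is matched only once may save further time. The following is a minimal sketch of that idea and is not part of the original answer. The names best_match, choices, and queries are hypothetical, and the two sample lists stand in for names_df['names'] and extra_names['not_matching']. It uses only the standard library. Note that difflib.get_close_matches has a default cutoff of 0.6, so inputs with no sufficiently similar candidate produce an empty result.

```python
import difflib
from functools import lru_cache

# Hypothetical sample data standing in for the two dataframes in the question.
choices = ["Vijay Sales", "Croma Electronics", "Reliance Digital",
           "Bajaj Electronics", "Pai Electricals"]
queries = ["Vij Sales", "Crom Electronics", "REL Digital",
           "Bajaj Elec", "Reliance Digi", "Vij Sales"]  # last entry is a repeat


@lru_cache(maxsize=None)
def best_match(name, cutoff=0.6):
    """Return (closest choice, similarity ratio), or ('', 0.0) if nothing clears cutoff."""
    found = difflib.get_close_matches(name, choices, n=1, cutoff=cutoff)
    if not found:
        return "", 0.0
    match = found[0]
    # Same scoring call as in the answer: ratio between the match and the input.
    return match, difflib.SequenceMatcher(None, match, name).ratio()


# Repeated inputs are served from the cache instead of being re-matched.
results = [best_match(q) for q in queries]
for query, (match, score) in zip(queries, results):
    print(f"{query!r} -> {match!r} ({score:.6f})")
```

With pandas, the same function could be passed to extra_names['not_matching'].map(best_match), which yields a Series of tuples that would still need to be split into two columns. The cache only helps to the extent that the input contains duplicates; for all-distinct names the cost is the same as the uncached version.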

Published: 2023-10-23 05:31:20

Article link: https://www.elefans.com/category/jswz/34/1519934.html