Replace similar strings in a column with the same string

Updated: 2024-10-13 02:16:01
Problem description

I have a Pandas dataframe that collects the names of vendors at which a transaction was made. As this data is automatically collected from bank statements, lots of the vendors are similar... but not quite the same. In summary, I want to replace the different permutations of the vendors' names with a single name.

I think I can work out a way to do it (see below), but I'm a beginner and this seems to me like it's a complex problem. I'd be really interested to see how more experienced coders would approach it.

I have a dataframe like this (in real life, it's about 20 columns and a maximum of around 50 rows):

       Groceries         Car            Luxuries
    0  Sainsburys        Texaco wst453  Amazon
    1  Sainsburys bur    Texaco east    Firebox Ltd
    2  Sainsbury's east  Shell wstl     Sony
    3  Tesco             Shell p/stn    Sony ent nrk
    4  Tescos ref 657    Texac          Amazon EU
    5  Tesco 45783       Moto           Amazon marketplace

I'd like to find the similar entries and replace them with the first instance of those entries, so I'd end up with this:

       Groceries   Car            Luxuries
    0  Sainsburys  Texaco wst453  Amazon
    1  Sainsburys  Texaco wst453  Firebox Ltd
    2  Sainsburys  Shell wstl     Sony
    3  Tesco       Shell wstl     Sony
    4  Tesco       Texaco wst453  Amazon
    5  Tesco       Moto           Amazon

My solution might be far from optimum. I was thinking of sorting alphabetically, then going through bitwise and using something like SequenceMatcher from difflib to compare each pair of vendors. If the similarity is above a certain percentage (I'm expecting to play with this value until I'm happy) then the two vendors will be assumed to be the same. I'm concerned that I might be using a sledgehammer to crack a nut, or it might take a long time (I'm not obsessed with performance, but equally I don't want to wait hours for the result).
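
The pairwise comparison described above can be sketched with the standard library alone. This is only an illustration under assumptions: the helper names `similarity` and `similar_pairs`, the sample vendor list, and the 0.9 threshold are all hypothetical, and the threshold would need tuning against real data.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Return a score between 0.0 and 1.0; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


def similar_pairs(names, threshold):
    """Return every pair of distinct names whose score reaches the threshold."""
    unique = sorted(set(names))  # sort alphabetically and drop exact duplicates
    return [
        (first, second)
        for first, second in combinations(unique, 2)
        if similarity(first, second) >= threshold
    ]


# Hypothetical sample values, loosely based on the table in the question.
vendors = ["Tesco", "Tescos", "Shell", "Texaco", "Texac", "Tesco"]
pairs = similar_pairs(vendors, threshold=0.9)
print(pairs)
```

On the performance worry: a column of about 50 rows has at most 1,225 distinct pairs, so comparing every pair in each of 20 columns is a small amount of work rather than a matter of hours.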

Really interested to hear people's thoughts on this problem!

Recommended answer

At first the problem doesn't seem complicated, but it is. I don't like my code, and there must be a better way. However, it works.

I used a string-similarity package named fuzzywuzzy to decide which strings must be replaced. The package uses a Levenshtein-based similarity, and I used 90% as the threshold. Also, only the first word of each string is used for the comparison. Here is my code:

    import pandas
    from fuzzywuzzy import fuzz

    # Replaces strings that are 90% or more similar
    def func(input_list):
        # enumerate() walks the original list, so a later variant can
        # overwrite an earlier one (e.g. "Texac" replaces "Texaco").
        for count, item in enumerate(input_list):
            rest_of_input_list = input_list[:count] + input_list[count + 1:]
            new_list = []
            for other_item in rest_of_input_list:
                similarity = fuzz.ratio(item, other_item)
                if similarity >= 90:
                    new_list.append(item)
                else:
                    new_list.append(other_item)
            # Put the current item back at its original position
            input_list = new_list[:count] + [item] + new_list[count:]
        return input_list

    df = pandas.read_csv('input.txt')  # Read data from csv
    result = []
    for column in list(df):
        column_values = list(df[column])
        # Keep only the first word of each entry for the comparison
        first_words = [x[:x.index(" ")] if " " in x else x for x in column_values]
        result.append(func(first_words))

    new_df = pandas.DataFrame(result).transpose()
    new_df.columns = list(df)
    print(new_df)

Output:

       Groceries    Car    Luxuries
    0  Sainsbury's  Texac  Amazon
    1  Sainsbury's  Texac  Firebox
    2  Sainsbury's  Shell  Sony
    3  Tesco        Shell  Sony
    4  Tesco        Texac  Amazon
    5  Tesco        Moto   Amazon

I guess func() could be written better, but this is the first thing that came to mind.
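
The question asks for the first instance of each vendor, whereas the output above ends up with later variants such as Texac. One possible way to keep the first instance is sketched below. It is not the answer's method: it swaps in `difflib.SequenceMatcher` from the standard library for `fuzz.ratio`, so scores are on a 0 to 1 scale, and the names `first_word` and `canonicalize` are hypothetical.

```python
from difflib import SequenceMatcher


def first_word(text):
    """Return the text up to the first space, as the answer's code does."""
    return text.split(" ", 1)[0]


def canonicalize(values, threshold=0.9):
    """Replace each value with the first earlier value whose first word is similar.

    The threshold is an assumption (0.9 on SequenceMatcher's 0-1 scale).
    """
    seen = []    # (first word, full string) of each first-seen vendor, in order
    result = []
    for value in values:
        key = first_word(value)
        match = next(
            (full for word, full in seen
             if SequenceMatcher(None, key, word).ratio() >= threshold),
            None,
        )
        if match is None:
            # No similar vendor seen yet: this value becomes the representative.
            seen.append((key, value))
            match = value
        result.append(match)
    return result


# Columns copied from the table in the question.
car = ["Texaco wst453", "Texaco east", "Shell wstl", "Shell p/stn", "Texac", "Moto"]
print(canonicalize(car))
```

Because each value is compared only against the representatives already seen, the first spelling of a vendor always wins, and the full original string is kept rather than just its first word.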

Published: 2023-11-30 07:07:59
Article link: https://www.elefans.com/category/jswz/34/1649120.html