Exact same text strings not matching

Updated: 2024-10-25 16:22:08

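I have two columns in a dataframe, title and store, containing text strings by which I want to subset the dataframe:

2631  coffee‑mate sugar free french ...  jet.com
2633  nestle coffeemate natural bliss ...  jet.com
2634  coffee‑mate liquid coffee creamer, ...  jet.com
3085  coffee‑mate hazelnut ...  jet.com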

When I try:

df[(df.title.str.contains('coffee-mate')) & (df.store.str.contains('jet.com'))]

I get:

Out[84]:
Empty DataFrame
Columns: [title, store]
Index: []

However, when I do this:

df[(df.title.str.contains('coffee')) & (df.store.str.contains('jet.com'))]

I get:

2631  coffee‑mate sugar free french ...  jet.com
2633  nestle coffeemate natural bliss ...  jet.com
2634  coffee‑mate liquid coffee creamer, ...  jet.com
3085  coffee‑mate hazelnut ...  jet.com

I don't know what to make of this!

I tried copying the characters 'coffee‑mate' from the output and comparing them for equality with the string I typed, and got False.

'coffee‑mate' == 'coffee-mate'
Out[92]: False

I have a feeling this is something to do with encoding but don't know how to detect and fix the issue. Can someone help?

Best answer

The "coffee‑mate" in your dataframe uses a non-breaking hyphen (U+2011, u"\u2011"), while your search string uses the ordinary ASCII hyphen-minus (U+002D).

Non-breaking hyphen: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%E2%80%91&mode=char

Your hyphen: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=-&mode=char
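
If you would rather not depend on an online tool, the same check can be done locally. The following is a minimal sketch using only the standard library; the helper name describe_chars and the two sample strings are illustrative assumptions, not part of the original question.

```python
import unicodedata


def describe_chars(text):
    """Return (char, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]


# Sample strings (assumed): one as stored in the data, one as typed.
from_data = "coffee\u2011mate"  # contains the non-breaking hyphen
typed = "coffee-mate"           # contains the ASCII hyphen-minus

print(describe_chars(from_data))  # lists the non-breaking hyphen
print(describe_chars(typed))      # empty list: pure ASCII
print(from_data == typed)         # the two strings differ
```

Running the helper over a suspicious value copied from the dataframe shows which characters are not plain ASCII, which is usually enough to spot a look-alike hyphen, quote, or space.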

While they look the same to you and me, Python treats them as two different characters. If you run into this issue again, what worked for me was simply pasting the character into the UTF-8 tool linked above. You were wise to compare coffee-mate and coffee‑mate directly.
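
One possible way to fix the matching itself is to fold hyphen look-alikes into the ASCII hyphen before searching. This is a sketch under stated assumptions: the helper normalize_hyphens, the list of code points, and the sample titles are all illustrative, and the set of characters worth folding depends on your data.

```python
# Hyphen-like code points to fold into the ASCII hyphen-minus (U+002D).
# This list is an illustrative assumption, not an exhaustive one.
HYPHEN_LIKE = {
    "\u2010": "-",  # HYPHEN
    "\u2011": "-",  # NON-BREAKING HYPHEN
    "\u2012": "-",  # FIGURE DASH
    "\u2013": "-",  # EN DASH
    "\u2212": "-",  # MINUS SIGN
}
_TABLE = str.maketrans(HYPHEN_LIKE)


def normalize_hyphens(text):
    """Replace hyphen look-alikes with the ASCII hyphen-minus."""
    return text.translate(_TABLE)


# Sample titles (assumed values, shortened for illustration).
titles = [
    "coffee\u2011mate sugar free french",
    "nestle coffeemate natural bliss",
    "coffee\u2011mate liquid coffee creamer",
    "coffee\u2011mate hazelnut",
]

matches = [t for t in titles if "coffee-mate" in normalize_hyphens(t)]
print(len(matches))

# With pandas, the same idea might look like this (assumed usage):
#   clean = df.title.map(normalize_hyphens)
#   df[clean.str.contains("coffee-mate", regex=False)
#      & df.store.str.contains("jet.com", regex=False)]
```

Passing regex=False makes str.contains do a literal match, so the dot in 'jet.com' is not interpreted as a regex wildcard. Note that unicodedata.normalize('NFKC', ...) on its own would not solve this case, since it maps U+2011 to U+2010 (HYPHEN) rather than to the ASCII hyphen-minus.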


Published: 2023-04-29 03:56:00

Article link: https://www.elefans.com/category/jswz/34/1334582.html