I have two columns in a dataframe, title and store, containing text strings by which I want to subset the dataframe:

    In [84]:
    2631  coffee‑mate sugar free french ...        jet.com
    2633  nestle coffeemate natural bliss ...      jet.com
    2634  coffee‑mate liquid coffee creamer, ...   jet.com
    3085  coffee‑mate hazelnut ...                 jet.com

When I try:

    df[(df.title.str.contains('coffee-mate')) & (df.store.str.contains('jet.com'))]

I get:

    Out[84]:
    Empty DataFrame
    Columns: [title, store]
    Index: []

However, when I do this:

    df[(df.title.str.contains('coffee')) & (df.store.str.contains('jet.com'))]

I get:

    2631  coffee‑mate sugar free french ...        jet.com
    2633  nestle coffeemate natural bliss ...      jet.com
    2634  coffee‑mate liquid coffee creamer, ...   jet.com
    3085  coffee‑mate hazelnut ...                 jet.com

I don't know what to make of this!

I tried copying the characters 'coffee-mate' to do an equivalency test and got False.

    'coffee‑mate' == 'coffee-mate'
    Out[92]: False

I have a feeling this is something to do with encoding but don't know how to detect and fix the issue. Can someone help?
Best answer
The "coffee-mate" in your dataframe uses a non-breaking hyphen (u"\u2011", i.e. U+2011), while your search string uses the ordinary keyboard hyphen, the hyphen-minus (U+002D):

Non-breaking hyphen: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%E2%80%91&mode=char
Your hyphen: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=-&mode=char
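
If you would rather check this without a web tool, the same information is available from Python's standard library. This is a minimal sketch, not something from the original question: the variable names and the describe helper are made up for illustration, and the data string is typed with an explicit \u2011 escape to stand in for the value copied from the dataframe.

```python
import unicodedata

# Stand-ins for the two strings in the question (names are illustrative).
data_text = "coffee\u2011mate"   # as stored in the dataframe
search_text = "coffee-mate"      # as typed on the keyboard


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"


# Index 6 is the character right after "coffee" in both strings.
print(describe(data_text[6]))
print(describe(search_text[6]))
print(data_text == search_text)
```

The two characters report different code points and names, which is why the equality test and str.contains both fail.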
While they look the same to you and me, Python considers them two different characters. If you run into this again, you can identify the character the way I did: copy and paste it into the UTF-8 tool linked above. You were wise to run a comparison of coffee-mate and coffee‑mate.
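
As for fixing it, one possible approach is to list the non-ASCII characters in a string and then map the hyphen look-alikes to a plain "-" before searching. The sketch below is only an illustration under assumptions: non_ascii_chars, normalize_hyphens and HYPHEN_MAP are hypothetical names, and the mapping covers just U+2010 and U+2011, so other dash-like characters would need to be added as they turn up.

```python
import unicodedata


def non_ascii_chars(text):
    """List (index, code point, name) for every non-ASCII character in text."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


# Hypothetical mapping: hyphen look-alikes replaced by the keyboard hyphen.
HYPHEN_MAP = str.maketrans({"\u2010": "-", "\u2011": "-"})


def normalize_hyphens(text):
    """Return text with the mapped hyphen variants replaced by '-'."""
    return text.translate(HYPHEN_MAP)


title = "coffee\u2011mate hazelnut"          # sample value, typed with an escape
print(non_ascii_chars(title))                # shows where the odd character sits
print("coffee-mate" in normalize_hyphens(title))
```

On the dataframe itself, the same replacement could probably be done with df['title'].str.replace('\u2011', '-', regex=False) before calling str.contains. I have not run that against your data, so treat it as a suggestion.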