Exact same text strings not matching

Updated: 2024-10-25 16:22:08

I have two columns in a dataframe, title and store, containing text strings that I want to use to subset the dataframe:

In [84]:
2631  coffee‑mate sugar free french ...       jet.com
2633  nestle coffeemate natural bliss ...     jet.com
2634  coffee‑mate liquid coffee creamer, ...  jet.com
3085  coffee‑mate hazelnut ...                jet.com

When I try:

df[(df.title.str.contains('coffee-mate')) & (df.store.str.contains('jet.com'))]

I get:

Out[84]:
Empty DataFrame
Columns: [title, store]
Index: []

However, when I do this:

df[(df.title.str.contains('coffee')) & (df.store.str.contains('jet.com'))]

I get:

2631  coffee‑mate sugar free french ...       jet.com
2633  nestle coffeemate natural bliss ...     jet.com
2634  coffee‑mate liquid coffee creamer, ...  jet.com
3085  coffee‑mate hazelnut ...                jet.com

I don't know what to make of this!

I tried copying the characters 'coffee-mate' from the output to run an equality test and got False.

'coffee‑mate' == 'coffee-mate'
Out[92]: False

I have a feeling this has something to do with encoding, but I don't know how to detect and fix the issue. Can someone help?

Best answer

The "coffee-mate" in your dataframe uses a non-breaking hyphen (u"\u2011"), while your search string uses the ordinary keyboard hyphen (hyphen-minus, u"\u002d").

Non-breaking hyphen: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%E2%80%91&mode=char

Your hyphen: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=-&mode=char
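The same check can also be done locally instead of through a web tool. The following is a minimal sketch using Python's standard unicodedata module; the helper name describe_chars and the two sample strings are illustrative assumptions, not part of the original question.

```python
import unicodedata


def describe_chars(text):
    """Return (char, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]


# Sample strings (assumed): one as stored in the data, one as typed.
from_data = "coffee\u2011mate"  # contains a non-breaking hyphen
typed = "coffee-mate"           # contains the keyboard hyphen-minus

print(describe_chars(from_data))  # lists the non-breaking hyphen
print(describe_chars(typed))      # empty list: the string is pure ASCII
print(from_data == typed)         # False, as in the question
```

Running a helper like this over a few suspect values from the column shows whether any look-alike characters are hiding in the data.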

While they look the same to you and me, Python treats them as two different characters. If you run into this again, paste the suspect character into a UTF-8 tool such as the one linked above, which is how I identified it. You were wise to compare coffee-mate and coffee‑mate directly.
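Once the stray character is identified, one possible fix is to map hyphen look-alikes to the plain hyphen before searching. This is a sketch under stated assumptions: the set of characters in HYPHEN_LIKE, the helper normalize_hyphens, and the shortened sample titles are all illustrative, and plain Python lists stand in for the dataframe column.

```python
# Hyphen look-alikes mapped to the ASCII hyphen-minus (U+002D).
# Which characters belong here is an assumption; extend it to suit your data.
HYPHEN_LIKE = str.maketrans({
    "\u2010": "-",  # HYPHEN
    "\u2011": "-",  # NON-BREAKING HYPHEN
    "\u2012": "-",  # FIGURE DASH
    "\u2013": "-",  # EN DASH
    "\u2212": "-",  # MINUS SIGN
})


def normalize_hyphens(text):
    """Replace hyphen look-alikes with the plain keyboard hyphen."""
    return text.translate(HYPHEN_LIKE)


# Illustrative titles, shortened from the rows shown in the question.
titles = [
    "coffee\u2011mate sugar free french",
    "nestle coffeemate natural bliss",
    "coffee\u2011mate hazelnut",
]

# Search the normalized text, but keep the original titles in the result.
matches = [t for t in titles if "coffee-mate" in normalize_hyphens(t)]
print(len(matches))  # 2
```

In pandas the same idea could be applied with df.title.map(normalize_hyphens), or with df.title.str.replace("\u2011", "-", regex=False), before calling str.contains. Note that unicodedata.normalize("NFKC", ...) turns U+2011 into U+2010 (HYPHEN), not into "-", so Unicode normalization alone would not make the search match. Separately, str.contains treats its pattern as a regular expression by default, so the "." in 'jet.com' matches any character; passing regex=False makes it a literal match.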

Published: 2023-04-29 03:56:00
Link: https://www.elefans.com/category/jswz/34/1334582.html