Python: How to sort a list of strings by substring relevance?

Updated: 2024-10-27 12:32:38

Problem description

I have a list of strings, for example:

["foo bar SOME baz TEXT bob", "SOME foo bar baz bob TEXT", "SOME foo TEXT", "foo bar SOME TEXT baz", "SOME TEXT"]

I want it sorted by how exactly each string matches the substring SOME TEXT (case doesn't matter), in something like this order:

["SOME TEXT", "foo bar SOME TEXT baz", "SOME foo TEXT", "foo bar SOME baz TEXT bob", "SOME foo bar baz bob TEXT"]

The idea is that the best score goes to the string whose words best match the positions of the substring's words. The more "sloppy" words there are between the substring's words, the lower the string ranks. I have found some libraries, such as fuzzyset, and approaches like Levenshtein distance, but I'm not sure they are what I need: I know the exact substring I want to sort by, and as I understand it those libraries search for similar words.

I actually need to do this sort after a database query (PostgreSQL) in my Django project. I have already tried full-text search through the Django ORM, but didn't get this relevance ordering (it doesn't take the distance between the substring's words into account). Next I tried Haystack + Whoosh, but so far haven't found out how to do this kind of sort there either. So the idea now is to fetch the queryset and then sort it outside the database (yes, I know that might be a bad decision, but for now I just want it to work). If anybody can tell me how to do this within any of the technologies I have mentioned here, that would also be super cool. Thank you!

P.S. The substring is expected to be 2-10 words long, in strings of at most 20 words.

Recommended answer

You can use difflib.SequenceMatcher to achieve something very similar to your desired output:

>>> import difflib
>>> l = ["foo bar SOME baz TEXT bob", "SOME foo bar baz bob TEXT", "SOME foo TEXT", "foo bar SOME TEXT baz", "SOME TEXT"]
>>> sorted(l, key=lambda z: difflib.SequenceMatcher(None, z, "SOME TEXT").ratio(), reverse=True)
['SOME TEXT', 'SOME foo TEXT', 'foo bar SOME TEXT baz', 'foo bar SOME baz TEXT bob', 'SOME foo bar baz bob TEXT']

In case you can't tell, the only difference is that the positions of the two elements "foo bar SOME TEXT baz" and "SOME foo TEXT" are swapped compared to your desired output.
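The swap happens because ratio() is computed as 2*M/T (matched characters over the combined length of both strings), so a shorter string with one gap can outscore a longer string that contains the phrase intact. If the order from the question is needed exactly, one option is to count the "sloppy" words directly instead. The following is only a rough sketch, not part of the original answer: the helper names slop and relevance_key are made up for illustration, and it assumes words are separated by whitespace and that ties should be broken by total word count.

```python
import math

def slop(text, query):
    """Count the extra words inside the tightest window of `text` that
    contains the words of `query` in order (case-insensitive).
    Returns math.inf when the query words do not all appear in order."""
    words = text.lower().split()
    wanted = query.lower().split()
    if not wanted:
        return 0
    best = math.inf
    for start, word in enumerate(words):
        if word != wanted[0]:
            continue
        pos = start
        matched = 1
        # Greedily find each remaining query word at its earliest position.
        for nxt in wanted[1:]:
            try:
                pos = words.index(nxt, pos + 1)
            except ValueError:
                break
            matched += 1
        if matched == len(wanted):
            # Window length minus the query words themselves = gap words.
            best = min(best, pos - start + 1 - len(wanted))
    return best

def relevance_key(text, query):
    # Assumption: fewer gap words rank first; ties go to the shorter string.
    return (slop(text, query), len(text.split()))

l = ["foo bar SOME baz TEXT bob", "SOME foo bar baz bob TEXT",
     "SOME foo TEXT", "foo bar SOME TEXT baz", "SOME TEXT"]
ranked = sorted(l, key=lambda s: relevance_key(s, "SOME TEXT"))
print(ranked)
```

With the sample list this puts "foo bar SOME TEXT baz" (no gap words) ahead of "SOME foo TEXT" (one gap word). Strings that lack some of the query words sort last, since their slop is infinite.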