Intersecting texts to find common words

Updated: 2024-10-28 14:26:30

Problem description

I'm trying to find the most efficient way to intersect a set of texts and find the words they have in common. Given this scenario:

    var t1 = 'My name is Mary-Ann, and I come from Kansas!';
    var t2 = 'John, meet Mary, she comes from far away';
    var t3 = 'Hi Mary-Ann, come here, nice to meet you!';

The intersection result should be:

    var result = ["Mary"];

It should ignore punctuation marks such as .,!?-

Would a solution based on regular expressions be optimal?

Recommended answer

Here's a tested solution:
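The listing itself does not appear in this copy of the answer. What follows is a minimal sketch reconstructed from the detailed explanation further down. It is not necessarily the author's exact code: the `\w+` token pattern and the `|| []` guard for texts that contain no word are assumptions.

```javascript
function intersect() {
  // Plain object used as a set: its keys are the candidate common words.
  var set = {};
  [].forEach.call(arguments, function (a, i) {
    // Tokenize the text into words (assumed pattern: runs of \w characters).
    var tokens = a.match(/\w+/g) || [];
    if (!i) {
      // First text: every token becomes a key of the set.
      tokens.forEach(function (t) { set[t] = 1; });
    } else {
      // Later texts: drop every key that this text does not contain.
      for (var k in set) {
        if (tokens.indexOf(k) < 0) delete set[k];
      }
    }
  });
  // The remaining keys are the words present in every text.
  return Object.keys(set);
}
```

With this pattern, "Mary-Ann" is split into the two tokens "Mary" and "Ann", which is why "Mary" alone can match the second text.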

This function is variadic: you can call it with any number of texts:

    console.log(intersect(t1, t2, t3)) // -> ["Mary"]
    console.log(intersect(t1, t2)) // -> ["Mary", "from"]
    console.log(intersect()) // -> []

If you need to support non-English languages, this regex won't be enough because of the poor Unicode support in JavaScript regexes. Either use a regex library or define your regex by explicitly excluding characters, as in a.match(/[^\s\-.,!?]+/g); (this will probably be enough for you).
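As a rough illustration of that note, the sketch below compares the exclusion-based pattern from the answer with a second option that is not part of the original answer: Unicode property escapes with the `u` flag, which exist only in engines that implement ES2018 or later. Both function names are hypothetical.

```javascript
// Tokenize by excluding whitespace and the listed punctuation marks,
// using the pattern quoted in the answer.
function tokenizeByExclusion(text) {
  return text.match(/[^\s\-.,!?]+/g) || [];
}

// Assumption: the engine supports ES2018 Unicode property escapes.
// Matches runs of letters, digits and underscores in any script.
function tokenizeUnicode(text) {
  return text.match(/[\p{L}\p{N}_]+/gu) || [];
}

var greeting = 'Привет, Мэри-Энн!';
console.log(greeting.match(/\w+/g));        // -> null (\w only covers ASCII word characters)
console.log(tokenizeByExclusion(greeting)); // -> ["Привет", "Мэри", "Энн"]
console.log(tokenizeUnicode(greeting));     // -> ["Привет", "Мэри", "Энн"]
```

The exclusion approach treats any character outside the listed set as part of a word, so other punctuation such as quotes or semicolons would need to be added to the character class.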

Details:

The idea is to fill a set with the tokens of the first text and then remove from the set the tokens missing in the other texts.

The set is a JavaScript object used as a map. Some purists would have used Object.create(null) to avoid a prototype; I like the simplicity of {}.
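To make that trade-off concrete, here is a small sketch with a hypothetical helper, `keysAfterAdding`, that is not part of the answer. It shows the one case where the two kinds of object behave differently when used as a set of words: a token that is literally `__proto__`.

```javascript
// Adds a token to an object used as a set and returns the resulting keys.
function keysAfterAdding(set, token) {
  set[token] = 1;
  return Object.keys(set);
}

// On a plain object, assigning a number to __proto__ goes through the
// inherited setter and is silently ignored, so the word is lost.
console.log(keysAfterAdding({}, '__proto__'));                  // -> []

// An object without a prototype stores it like any other key.
console.log(keysAfterAdding(Object.create(null), '__proto__')); // -> ["__proto__"]

// Ordinary words behave the same in both cases.
console.log(keysAfterAdding({}, 'Mary'));                       // -> ["Mary"]
```

For ordinary prose this edge case rarely matters, which fits the author's preference for {}.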

As I want my function to be variadic, I use arguments instead of declaring the passed texts as explicit parameters.

arguments isn't a real array, so to iterate over it you need either a for loop or a trick like [].forEach.call. It works because arguments is "array-like".
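A short sketch of the options mentioned here, plus ES6 rest parameters as a third option that the answer does not use. The function names are hypothetical and exist only for this illustration.

```javascript
// Option 1: a plain for loop over the array-like arguments object.
function labelsWithForLoop() {
  var labels = [];
  for (var i = 0; i < arguments.length; i++) {
    labels.push(i + ':' + arguments[i]);
  }
  return labels;
}

// Option 2: borrow Array.prototype.forEach; this works because arguments
// has a length property and indexed elements.
function labelsWithForEachCall() {
  var labels = [];
  [].forEach.call(arguments, function (text, i) {
    labels.push(i + ':' + text);
  });
  return labels;
}

// Option 3 (assumes ES6): rest parameters produce a real array.
function labelsWithRest(...texts) {
  return texts.map(function (text, i) { return i + ':' + text; });
}

console.log(labelsWithForEachCall('a', 'b')); // -> ["0:a", "1:b"]
```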

To tokenize, I simply use match to match words; nothing special here (though see the note above about better support for other languages).

I use !i to check whether it's the first text. In that case, I simply copy the tokens as properties of the set. A value must be used, so I use 1. ES6 Set objects, now widely available, make the intent more obvious here.

For the following texts, I iterate over the elements of the set (its keys) and remove those that are not in the array of tokens (tokens.indexOf(k) < 0).

Finally, I return the elements of the set as an array, since that is what we want. The simplest way is to use Object.keys.
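The remark about ES6 sets can be sketched as follows. This is a variant written for illustration, not the author's code: `intersectWithSet` is a hypothetical name, and the `\w+` token pattern is an assumption. It follows the same steps (seed from the first text, prune with each later text, return an array) but replaces the object-as-map and the indexOf scan with Set objects.

```javascript
function intersectWithSet(...texts) {
  let common = null; // becomes the Set of candidate words after the first text
  for (const text of texts) {
    // Tokens of this text as a Set, so membership checks avoid an array scan.
    const tokens = new Set(text.match(/\w+/g) || []);
    if (common === null) {
      common = tokens; // first text: all of its tokens are candidates
    } else {
      // Later texts: remove candidates that this text does not contain.
      for (const word of common) {
        if (!tokens.has(word)) common.delete(word);
      }
    }
  }
  return common ? [...common] : [];
}

console.log(intersectWithSet(
  'My name is Mary-Ann, and I come from Kansas!',
  'John, meet Mary, she comes from far away',
  'Hi Mary-Ann, come here, nice to meet you!'
)); // -> ["Mary"]
```

Using `has` on a Set instead of `indexOf` on an array may help when texts are long, although for three short sentences the difference is unlikely to be measurable.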

Published: 2023-10-18 18:22:57

Article link: https://www.elefans.com/category/jswz/34/1505014.html