I'm trying to find the most efficient way to intersect a set of texts and find the words they have in common. Given this scenario:
    var t1 = 'My name is Mary-Ann, and I come from Kansas!';
    var t2 = 'John, meet Mary, she comes from far away';
    var t3 = 'Hi Mary-Ann, come here, nice to meet you!';

The intersection result should be:

    var result = ["Mary"];

It should be able to ignore punctuation marks like .,!?-
Would a solution with regular expressions be optimal?
Recommended answer

Here's a tested solution:
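    function intersect() {
      var set = {};
      [].forEach.call(arguments, function (a, i) {
        // Split the text into words; match() returns null when nothing matches
        var tokens = a.match(/\w+/g) || [];
        if (!i) {
          // First text: seed the set with all of its tokens
          tokens.forEach(function (t) { set[t] = 1; });
        } else {
          // Later texts: drop every token that this text does not contain
          for (var k in set) {
            if (tokens.indexOf(k) < 0) delete set[k];
          }
        }
      });
      return Object.keys(set);
    }

This function is variadic: you can call it with any number of texts: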
    console.log(intersect(t1, t2, t3)) // -> ["Mary"]
    console.log(intersect(t1, t2))     // -> ["Mary", "from"]
    console.log(intersect())           // -> []

If you need to support non-English languages, this regex won't be enough, because of the poor support for Unicode in JavaScript regexes: \w matches only ASCII letters, digits, and the underscore. Either use a regex library, or define the regex by explicitly excluding characters, as in a.match(/[^\s\-.,!?]+/g); (this will probably be enough for you).
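As a rough illustration of the Unicode point, here is a minimal sketch that tokenizes with Unicode property escapes instead of \w. It assumes an engine that supports the `u` flag with `\p{...}` (ES2018 or later), and `tokenizeUnicode` is a hypothetical helper name, not part of the answer's code.

```javascript
// Hypothetical helper (an assumption for illustration): split a text into
// words using Unicode property escapes, which require the `u` flag.
function tokenizeUnicode(text) {
  // \p{L} = letters, \p{M} = combining marks, \p{N} = numbers, in any script
  return text.match(/[\p{L}\p{M}\p{N}_]+/gu) || [];
}

// \w is ASCII-only, so the word is broken at "ü"
console.log('Zürich'.match(/\w+/g));                 // -> ["Z", "rich"]

// The property-escape version keeps the word whole
console.log(tokenizeUnicode("Zürich, c'est loin!")); // -> ["Zürich", "c", "est", "loin"]
```

Swapping this tokenizer in for `a.match(/\w+/g)` would leave the rest of the intersection logic unchanged.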
Explanation:
The idea is to fill a set with the tokens of the first text and then remove from the set the tokens missing in the other texts.
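The same idea can also be sketched with a native Set, which makes each membership test a lookup rather than an array scan. This is only an alternative sketch assuming an ES2015+ environment; `intersectWithSets` is a hypothetical name, and the tokenizer is the same ASCII-only \w+ regex used above.

```javascript
// Sketch of the fill-then-prune idea using ES2015 Sets (hypothetical variant).
function intersectWithSets(...texts) {
  if (texts.length === 0) return [];
  const tokenize = (text) => text.match(/\w+/g) || [];

  // Fill the set with the tokens of the first text
  const common = new Set(tokenize(texts[0]));

  // Remove every token that is missing from any of the other texts
  for (const text of texts.slice(1)) {
    const tokens = new Set(tokenize(text));
    for (const word of common) {
      if (!tokens.has(word)) common.delete(word);
    }
  }
  return [...common];
}

const t1 = 'My name is Mary-Ann, and I come from Kansas!';
const t2 = 'John, meet Mary, she comes from far away';
const t3 = 'Hi Mary-Ann, come here, nice to meet you!';

console.log(intersectWithSets(t1, t2, t3)); // -> ["Mary"]
console.log(intersectWithSets(t1, t2));     // -> ["Mary", "from"]
```

A Set keeps insertion order, so the result lists the surviving words in the order they appear in the first text.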