Determine whether text is in English?

Updated: 2024-10-21 15:52:38
Question

I am using both NLTK and scikit-learn to do some text processing. However, some of the documents in my list are not in English. For example, the list might look like this:

[ "this is some text written in English", "this is some more text written in English", "Ce n'est pas en anglais" ]

For the purposes of my analysis, I want all sentences that are not in English to be removed as part of pre-processing. Is there a good way to do this? I have been Googling, but cannot find anything specific that will let me recognize whether a string is in English. Is this functionality not offered in either NLTK or scikit-learn?

EDIT: I've seen questions like this and this, but both are about individual words, not a "document". Would I have to loop through every word in a sentence to check whether the whole sentence is in English?

I'm using Python, so libraries that are in Python would be preferable, but I can switch languages if needed, just thought that Python would be the best for this.

Recommended answer

There is a library called langdetect. It is ported from Google's language-detection available here:

https://pypi.python.org/pypi/langdetect

It supports 55 languages out of the box.
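
As a rough sketch of how this could be applied to the list in the question, the snippet below keeps only the documents that a detector labels as "en". The helper name keep_english and its detect parameter are made up for this illustration and are not part of langdetect. When no detector is passed, it falls back to langdetect's detect function and sets DetectorFactory.seed, since langdetect's results are otherwise not deterministic. It assumes langdetect is installed (pip install langdetect).

```python
from typing import Callable, Iterable, List, Optional


def keep_english(
    docs: Iterable[str],
    detect: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Return only the documents whose detected language code is 'en'.

    `detect` is any callable mapping a string to a language code.
    If omitted, langdetect.detect is used (hypothetical wiring, for illustration).
    """
    if detect is None:
        # Imported lazily so langdetect is only needed for the default detector.
        from langdetect import DetectorFactory, detect

        # Fix the seed so repeated runs give the same answer.
        DetectorFactory.seed = 0

    kept = []
    for doc in docs:
        try:
            lang = detect(doc)
        except Exception:
            # langdetect raises LangDetectException when the text has no
            # usable features (for example, an empty string); skip those.
            continue
        if lang == "en":
            kept.append(doc)
    return kept
```

Calling keep_english(docs) on the three-item list from the question would be expected to drop the French sentence, although detection on very short strings can be unreliable, so it is worth spot-checking the results on real data.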

Published: 2023-11-29 17:17:36
Source: https://www.elefans.com/category/jswz/34/1647068.html