Must use *unicode* string as text to tag, while tagging with TreeTagger?

Updated: 2024-06-14 17:00:14

Following the instructions on TreeTagger's website, I created a directory and downloaded the specified files. I then installed treetaggerwrapper and, following its documentation, tried to tag some text as follows:

    In [40]: import treetaggerwrapper
        ...: tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')
        ...: tags = tagger.TagText("This is a very short text to tag.")
        ...: print tags

Then I got the following warnings and error:

    WARNING:TreeTagger:Abbreviation file not found: english-abbreviations
    WARNING:TreeTagger:Processing without abbreviations file.
    ERROR:TreeTagger:Must use *unicode* string as text to tag, not <type 'str'>.
    ---------------------------------------------------------------------------
    TreeTaggerError                           Traceback (most recent call last)
    <ipython-input-40-37b912126580> in <module>()
          1 import treetaggerwrapper
          2 tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')
    ----> 3 tags = tagger.TagText("This is a very short text to tag.")
          4 print tags

    /usr/local/lib/python2.7/site-packages/treetaggerwrapper.pyc in TagText(self, text, numlines, tagonly, prepronly, tagblanks, notagurl, notagemail, notagip, notagdns, encoding, errors)
       1236         return self.tag_text(text, numlines=numlines, tagonly=tagonly,
       1237                  prepronly=prepronly, tagblanks=tagblanks, notagurl=notagurl,
    -> 1238                  notagemail=notagemail, notagip=notagip, notagdns=notagdns)
       1239
       1240     # --------------------------------------------------------------------------

    /usr/local/lib/python2.7/site-packages/treetaggerwrapper.pyc in tag_text(self, text, numlines, tagonly, prepronly, tagblanks, notagurl, notagemail, notagip, notagdns, nosgmlsplit)
       1302                 # Raise exception now, with an explicit message.
       1303                 logger.error("Must use *unicode* string as text to tag, not %s.", type(text))
    -> 1304                 raise TreeTaggerError("Must use *unicode* string as text to tag.")
       1305
       1306         if isinstance(text, six.text_type):

    TreeTaggerError: Must use *unicode* string as text to tag.

Where can I download the abbreviation files for English and Spanish, and how do I install treetaggerwrapper correctly?

Best answer

The method only accepts unicode strings. Add a u prefix to your string literal to make it a unicode string:

    tags = tagger.TagText(u"This is a very short text to tag.")

In Python 2, "This is a very short text to tag." is a str; once you add the u prefix it is unicode:

    In [12]: type("This is a very short text to tag.")
    Out[12]: str

    In [13]: type(u"This is a very short text to tag.")
    Out[13]: unicode

If you get the str from another source, you need to decode it first:

    In [15]: s = "This is a very short text to tag."

    In [16]: type(s)
    Out[16]: str

    In [17]: type(s.decode("utf-8"))
    Out[17]: unicode
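
The decoding step can be wrapped in a small helper so that text from any source is converted before it reaches the tagger. The sketch below is only an illustration and is not part of treetaggerwrapper: `to_text` is a hypothetical name, and UTF-8 is an assumed default that should be replaced with whatever encoding the source really uses. It should behave the same on Python 2 and Python 3, because `bytes` is an alias of `str` on Python 2.

```python
def to_text(value, encoding="utf-8", errors="strict"):
    """Return value as a text (unicode) string.

    Byte strings (str on Python 2, bytes on Python 3) are decoded with the
    given encoding; values that are already text are returned unchanged.
    The helper name and the UTF-8 default are assumptions for illustration.
    """
    if isinstance(value, bytes):
        return value.decode(encoding, errors)
    return value


# A byte string, as it might arrive from a file or a socket.
raw = b"This is a very short text to tag."
text = to_text(raw)

# Already-decoded text passes through untouched.
same = to_text(u"This is a very short text to tag.")
```

The resulting `text` can then be passed to `tagger.TagText(text)` in the same way as the `u"..."` literal in the answer.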

The tagging scripts can be downloaded here

Published: 2023-04-18 00:49:00