Problem description
I am new to Python and I have a tokenization assignment. The input is a .txt file with sentences and the output is a .txt file with tokens. When I say token I mean: a simple word, ',' , '!' , '?' , '.' or '"'.
I have this function. Its inputs: Elemnt is a word with or without punctuation, e.g. Hi or said: or said"; StrForCheck is an array of punctuation characters that I want to separate from the words; TokenFile is my output file.
def CheckIfSEmanExist(Elemnt, StrForCheck, TokenFile):
    FirstOrLastIsSeman = 0
    for seman in StrForCheck:
        WordSplitOnSeman = Elemnt.split(seman)
        if len(WordSplitOnSeman) > 1:
            if Elemnt[len(Elemnt)-1] == seman:
                FirstOrLastIsSeman = len(Elemnt)-1
            elif Elemnt[0] == seman:
                FirstOrLastIsSeman = 1
    if FirstOrLastIsSeman == 1:
        TokenFile.write(Elemnt[0])
        TokenFile.write('\n')
        TokenFile.write(Elemnt[1:-1])
        TokenFile.write('\n')
    elif FirstOrLastIsSeman == len(Elemnt)-1:
        TokenFile.write(Elemnt[0:-1])
        TokenFile.write('\n')
        TokenFile.write(Elemnt[len(Elemnt)-1])
        TokenFile.write('\n')
    elif FirstOrLastIsSeman == 0:
        TokenFile.write(Elemnt)
        TokenFile.write('\n')
The code loops over the punctuation array, and if it finds a match, I check whether the punctuation is the first or the last character of the word, and then write the word and the punctuation to my output file, each on its own line.
But my problem is that it works fine on the whole text except for these words: Jobs" , created" , public" , police"
Answer
Note that
for l in open('some_file.txt', 'r'):
    ...
iterates over each line, so you just need to consider what to do within a line.
Consider the following function:
def tokenizer(l):
    prev_i = 0
    for i, c in enumerate(l):
        if c in ',.?!-" ':          # separators; '"' added so quoted words split too
            if prev_i != i:
                yield l[prev_i:i]   # the word before the separator
            if c != ' ':
                yield c             # punctuation is a token of its own
            prev_i = i + 1
    if prev_i != len(l):
        yield l[prev_i:]            # trailing word, if any
It "spits out" tokens as it goes along. You can use it like this:
l = "hello, hello, what's all this shouting? We'll have no trouble here"
for tok in tokenizer(l):
    print(tok)
which prints:
hello
,
hello
,
what's
all
this
shouting
?
We'll
have
no
trouble
here
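To tie this back to the original file-to-file assignment, here is a minimal end-to-end sketch: read sentences from one file and write one token per line to another. The file names sentences.txt and tokens.txt are placeholders, and the tokenizer is a variant of the generator above with '"' in the separator set and spaces skipped.

```python
def tokenizer(l):
    # Yield words and punctuation marks one at a time, skipping spaces.
    prev_i = 0
    for i, c in enumerate(l):
        if c in ',.?!-" ':
            if prev_i != i:
                yield l[prev_i:i]   # the word before the separator
            if c != ' ':
                yield c             # punctuation is a token of its own
            prev_i = i + 1
    if prev_i != len(l):
        yield l[prev_i:]            # trailing word, if any

# Create a small demo input file (placeholder name and content).
with open('sentences.txt', 'w') as f:
    f.write('He said "hello", then left.\n')

# Tokenize each input line and write one token per output line.
with open('sentences.txt') as src, open('tokens.txt', 'w') as dst:
    for line in src:
        for tok in tokenizer(line.rstrip('\n')):
            dst.write(tok + '\n')
```

Because tokenizer is a generator, each line is processed lazily, token by token, without building an intermediate list.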