Python Tokenization

Programming Basics · Industry News · Updated: 2024-10-23 16:15:28

Problem Description


I am new to Python and I have a tokenization assignment. The input is a .txt file with sentences and the output is a .txt file with tokens. By "token" I mean a simple word or one of the punctuation marks ',', '!', '?', '.', '"'.

I have this function. Input: Elemnt is a word with or without punctuation; it could be a word like Hi, or said:, or said". StrForCheck is an array of the punctuation characters that I want to separate from the words. TokenFile is my output file.

def CheckIfSEmanExist(Elemnt, StrForCheck, TokenFile):
    FirstOrLastIsSeman = 0

    for seman in StrForCheck:
        WordSplitOnSeman = Elemnt.split(seman)
        if len(WordSplitOnSeman) > 1:
            if Elemnt[len(Elemnt)-1] == seman:
                FirstOrLastIsSeman = len(Elemnt)-1
            elif Elemnt[0] == seman:
                FirstOrLastIsSeman = 1

    if FirstOrLastIsSeman == 1:
        TokenFile.write(Elemnt[0])
        TokenFile.write('\n')
        TokenFile.write(Elemnt[1:-1])
        TokenFile.write('\n')

    elif FirstOrLastIsSeman == len(Elemnt)-1:
        TokenFile.write(Elemnt[0:-1])
        TokenFile.write('\n')
        TokenFile.write(Elemnt[len(Elemnt)-1])
        TokenFile.write('\n')

    elif FirstOrLastIsSeman == 0:
        TokenFile.write(Elemnt)
        TokenFile.write('\n')
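For reference, here is a sketch of how this function would be driven over a file. The punctuation list, the filenames, and the sample sentence are hypothetical (the question does not show them); the function body is repeated so the sketch is self-contained:

```python
def CheckIfSEmanExist(Elemnt, StrForCheck, TokenFile):
    FirstOrLastIsSeman = 0

    # Look for any punctuation character occurring in the word.
    for seman in StrForCheck:
        WordSplitOnSeman = Elemnt.split(seman)
        if len(WordSplitOnSeman) > 1:
            if Elemnt[len(Elemnt)-1] == seman:
                FirstOrLastIsSeman = len(Elemnt)-1
            elif Elemnt[0] == seman:
                FirstOrLastIsSeman = 1

    if FirstOrLastIsSeman == 1:
        # Punctuation first: write it, then the rest of the word.
        TokenFile.write(Elemnt[0])
        TokenFile.write('\n')
        TokenFile.write(Elemnt[1:-1])
        TokenFile.write('\n')
    elif FirstOrLastIsSeman == len(Elemnt)-1:
        # Punctuation last: write the word, then the punctuation.
        TokenFile.write(Elemnt[0:-1])
        TokenFile.write('\n')
        TokenFile.write(Elemnt[len(Elemnt)-1])
        TokenFile.write('\n')
    elif FirstOrLastIsSeman == 0:
        # No leading/trailing punctuation found: write the word as-is.
        TokenFile.write(Elemnt)
        TokenFile.write('\n')

# Hypothetical punctuation list and filenames -- not from the question.
StrForCheck = [',', '!', '?', '.', '"']

with open('sentences.txt', 'w') as f:
    f.write('He said" hello!\n')

with open('sentences.txt') as infile, open('tokens.txt', 'w') as out:
    for line in infile:
        for word in line.split():
            CheckIfSEmanExist(word, StrForCheck, out)
```

With this input, `tokens.txt` ends up containing `He`, `said`, `"`, `hello`, `!`, one per line.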

The code loops over the punctuation array; if it finds one, I check whether the punctuation is the first or the last character in the word, and write the word and the punctuation to my output file, each on its own line.

But my problem is that it works fine on the whole text except for these words: Jobs" , created" , public" , police"

Recommended Answer

Note that

for l in open('some_file.txt', 'r'):
    ...

iterates over each line, so you just need to consider what to do within a line.
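As a minimal sketch of that pattern (the filename and contents are just placeholders), note that each `l` keeps its trailing newline:

```python
# Write a tiny sample file so the loop below has something to read.
with open('some_file.txt', 'w') as f:
    f.write('first line\nsecond line\n')

# Iterating over a file object yields one line at a time,
# each including its trailing '\n'.
lines = []
for l in open('some_file.txt', 'r'):
    lines.append(l)

print(lines)  # ['first line\n', 'second line\n']
```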

Consider the following function:

def tokenizer(l):
    prev_i = 0
    for (i, c) in enumerate(l):
        if c in ',.?!- ':
            if prev_i != i:
                yield l[prev_i: i]
            yield c
            prev_i = i + 1
    if prev_i != 0:
        yield l[prev_i: ]

It "spits out" tokens as it goes along. You can use it like this:

l = "hello, hello, what's all this shouting? We'll have no trouble here"
for tok in tokenizer(l):
    print(tok)

This prints:

hello
,

hello
,

what's

all

this

shouting
?

We'll

have

no

trouble

here
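Putting it together, the tokenizer can be applied to each line of the input file, writing one token per line. The filenames below are placeholders, and skipping the whitespace tokens is an assumption about the desired output (the function yields the spaces themselves as tokens too). Note also that a line containing none of the delimiter characters yields nothing with this function as written:

```python
def tokenizer(l):
    prev_i = 0
    for (i, c) in enumerate(l):
        if c in ',.?!- ':
            if prev_i != i:
                yield l[prev_i: i]   # the word before the delimiter
            yield c                  # the delimiter itself
            prev_i = i + 1
    if prev_i != 0:
        yield l[prev_i:]             # trailing word, if any

# A tiny sample input standing in for the real sentences file.
with open('input.txt', 'w') as f:
    f.write("hello, hello, what's all this shouting?\n")
    f.write("We'll have no trouble here!\n")

# Tokenize line by line, one token per output line,
# skipping the pure-whitespace tokens the function also yields.
with open('input.txt') as infile, open('tokens.txt', 'w') as outfile:
    for line in infile:
        for tok in tokenizer(line.rstrip('\n')):
            if tok.strip():
                outfile.write(tok + '\n')
```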



Published: 2023-04-30 04:44:13.
Link: https://www.elefans.com/category/jswz/34/1389341.html