Problem description
I am new to Python and I have a tokenization assignment. The input is a .txt file with sentences and the output is a .txt file with tokens. When I say token I mean: a simple word, ',' , '!' , '?' , '.' or '"'.
I have this function. Its inputs: Elemnt is a word with or without punctuation, e.g. Hi or said: or said"; StrForCheck is an array of punctuation characters that I want to separate from the words; TokenFile is my output file.
def CheckIfSEmanExist(Elemnt, StrForCheck, TokenFile):
    FirstOrLastIsSeman = 0
    for seman in StrForCheck:
        WordSplitOnSeman = Elemnt.split(seman)
        if len(WordSplitOnSeman) > 1:
            if Elemnt[len(Elemnt)-1] == seman:
                FirstOrLastIsSeman = len(Elemnt)-1
            elif Elemnt[0] == seman:
                FirstOrLastIsSeman = 1
    if FirstOrLastIsSeman == 1:
        TokenFile.write(Elemnt[0])
        TokenFile.write('\n')
        TokenFile.write(Elemnt[1:-1])
        TokenFile.write('\n')
    elif FirstOrLastIsSeman == len(Elemnt)-1:
        TokenFile.write(Elemnt[0:-1])
        TokenFile.write('\n')
        TokenFile.write(Elemnt[len(Elemnt)-1])
        TokenFile.write('\n')
    elif FirstOrLastIsSeman == 0:
        TokenFile.write(Elemnt)
        TokenFile.write('\n')
The code loops over the punctuation array, and if it finds a match, I check whether the punctuation is the first or the last character of the word, and then write the word and the punctuation to my output file, each on its own line.
But my problem is that it works fine on the whole text except for these words: Jobs" , created" , public" , police"
Answer
Note that
for l in open('some_file.txt', 'r'):
    ...
iterates over each line, so you just need to consider what to do within a line.
Consider the following function:
def tokenizer(l):
    prev_i = 0
    for i, c in enumerate(l):
        if c in ',.?!-" ':          # separators; '"' added so quoted words split too
            if prev_i != i:
                yield l[prev_i:i]   # the word before the separator
            if c != ' ':
                yield c             # punctuation is a token of its own
            prev_i = i + 1
    if prev_i != len(l):
        yield l[prev_i:]            # trailing word, if any
It "spits out" tokens as it goes along. You can use it like this:
l = "hello, hello, what's all this shouting? We'll have no trouble here"
for tok in tokenizer(l):
    print(tok)
which prints:
hello
,
hello
,
what's
all
this
shouting
?
We'll
have
no
trouble
here
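To tie this back to the original file-to-file assignment, here is a minimal end-to-end sketch: read sentences from one file and write one token per line to another. The file names sentences.txt and tokens.txt are placeholders, and the tokenizer is a variant of the generator above with '"' in the separator set and spaces skipped.

```python
def tokenizer(l):
    # Yield words and punctuation marks one at a time, skipping spaces.
    prev_i = 0
    for i, c in enumerate(l):
        if c in ',.?!-" ':
            if prev_i != i:
                yield l[prev_i:i]   # the word before the separator
            if c != ' ':
                yield c             # punctuation is a token of its own
            prev_i = i + 1
    if prev_i != len(l):
        yield l[prev_i:]            # trailing word, if any

# Create a small demo input file (placeholder name and content).
with open('sentences.txt', 'w') as f:
    f.write('He said "hello", then left.\n')

# Tokenize each input line and write one token per output line.
with open('sentences.txt') as src, open('tokens.txt', 'w') as dst:
    for line in src:
        for tok in tokenizer(line.rstrip('\n')):
            dst.write(tok + '\n')
```

Because tokenizer is a generator, each line is processed lazily, token by token, without building an intermediate list.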