An Introduction to NLP Basics

Last updated: 2024-10-05 05:18:05

Original link: /post/c939a57a.html

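Definition

Natural language processing (NLP) is a field that combines linguistics, computer science, and artificial intelligence. Its goal is to enable machines to understand natural language.

Stages of development:

1950s: rule-based approaches;

1970s: statistical linguistics;

2003: neural networks.

Main research areas:

Lexical and phrase level: word segmentation, part-of-speech tagging, named entity recognition, chunking, term weighting, term proximity

Syntax and semantics: language models, dependency parsing, word sense disambiguation, semantic role labeling, deep semantic analysis

Discourse understanding: text classification and clustering, text summarization, text generation, discourse relation recognition, discourse cohesion, coreference resolution, semantic representation, semantic matching, topic models, sentiment analysis, public opinion monitoring

System applications: information extraction, knowledge graphs (representation, construction, completion, reasoning, etc.), information retrieval (indexing, recall, ranking, etc.), query analysis, question answering, conversational systems, reading comprehension, machine translation, speech recognition and synthesis, OCR, image caption generation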

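Tools for the lexical stage:

NLTK

Official site:
A well-known natural language processing library for Python, with the following advantages:
- Bundled corpora and part-of-speech tag sets
- Built-in tokenization, POS (part-of-speech) tagging, NER (named entity recognition), and more
- Strong community support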

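Lexical processing (pipeline)

Raw_Text is a sentence or a whole text; Tokenize splits it into tokens; POS Tag assigns part-of-speech tags; Lemma/Stemming normalizes word forms, for example mapping am, is, are to be, or worked, working to work; stopwords removes stop words; the final result is a Word_List.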
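The flow above can be sketched in plain Python. This is only an illustration under loud assumptions: the regex tokenizer, the tiny STOPWORDS set, and the LEMMAS lookup table are hypothetical stand-ins for the NLTK components introduced below, and the POS tagging step is skipped entirely.

```python
import re

# Hypothetical, hand-made resources for illustration only;
# a real pipeline would use NLTK's corpora and lemmatizer instead.
STOPWORDS = {"the", "a", "an", "at", "in", "of", "to"}
LEMMAS = {"am": "be", "is": "be", "are": "be",
          "worked": "work", "working": "work"}


def tokenize(raw_text):
    """Raw_Text -> tokens: lowercase and keep alphabetic runs."""
    return re.findall(r"[a-z]+", raw_text.lower())


def normalize(token):
    """Lemma/Stemming step: map a token to its base form if we know it."""
    return LEMMAS.get(token, token)


def to_word_list(raw_text):
    """Run tokenize -> normalize -> stop-word filtering."""
    lemmas = [normalize(t) for t in tokenize(raw_text)]
    return [w for w in lemmas if w not in STOPWORDS]


word_list = to_word_list("They are working at the lab")
print(word_list)
```

Each stage is a separate function so that any one of them can be swapped for the corresponding NLTK call without touching the others.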

Tokenize

Split a long sentence into "meaningful" smaller units.

import nltk
sentence = "hello, world"
tokens = nltk.word_tokenize(sentence)
tokens
# output
['hello', ',', 'world']

jieba: a word-segmentation tool for Chinese

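Word-form normalization:

Stemming: roughly speaking, chop off the inflectional endings that do not change the part of speech

walking, drop "ing" = walk

walked, drop "ed" = walk

Lemmatization: map the various inflected forms of a word to a single base form

went, lemmatized = go

are, lemmatized = be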
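To make the difference concrete, here is a deliberately crude sketch. The names naive_stem, naive_lemmatize, and LEMMA_TABLE are hypothetical and are not part of NLTK; real stemmers and lemmatizers use far richer rules and dictionaries. The point is only that stemming strips suffixes by rule, while lemmatization needs a lookup to handle irregular forms.

```python
# Hypothetical lookup table for irregular forms (illustration only).
LEMMA_TABLE = {"went": "go", "are": "be"}


def naive_stem(word):
    """Rule-based: strip a known suffix if enough of the word remains."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def naive_lemmatize(word):
    """Dictionary-based: look the word up, fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


for w in ("walking", "walked", "went", "are"):
    print(w, "| stem:", naive_stem(w), "| lemma:", naive_lemmatize(w))
```

The suffix rule handles walking and walked but leaves went untouched, which is exactly the kind of case the lookup covers.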

from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer("english")
snowball_stemmer.stem('maximum')
# output
'maximum'
snowball_stemmer.stem('presumably')
# output
'presum'

from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
wordnet_lemmatizer.lemmatize('dogs')
# output
'dog'
wordnet_lemmatizer.lemmatize('churches')
# output
'church'
wordnet_lemmatizer.lemmatize('aardwolves')
# output
'aardwolf'
# Without a POS tag, the word is treated as a noun (NN) by default
wordnet_lemmatizer.lemmatize('are')
# output
'are'
wordnet_lemmatizer.lemmatize('is')
# output
'is'
# With a POS tag
wordnet_lemmatizer.lemmatize('is', pos='v')
# output
'be'
wordnet_lemmatizer.lemmatize('are', pos='v')
# output
'be'

Part-of-speech tagging

import nltk
text = nltk.word_tokenize("what does the fox say")
text
# output
['what', 'does', 'the', 'fox', 'say']
# POS tagging
nltk.pos_tag(text)
# output
[('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'), ('fox', 'NNS'), ('say', 'VBP')]

Named entity recognition

# NER
import nltk
from nltk import ne_chunk, pos_tag, word_tokenize
sentence = "John studies at Stanford University."
ner = ne_chunk(pos_tag(word_tokenize(sentence)))
print(ner), type(ner)
# output
(S
  (PERSON John/NNP)
  studies/NNS
  at/IN
  (ORGANIZATION Stanford/NNP University/NNP)
  ./.)
(None, nltk.tree.Tree)
# Named entities are the subtrees; join the words of each one
[" ".join(w for w, t in elt) for elt in ner if isinstance(elt, nltk.Tree)]
# output
['John', 'Stanford University']

Stop words

# Download the stopwords corpus first: nltk.download('stopwords')
from nltk.corpus import stopwords
# First tokenize to get a word_list
# ...
# Then filter out the stop words
filter_words = [word for word in word_list if word not in stopwords.words('english')]

Discourse understanding: sentiment analysis

The simplest approach is a sentiment dictionary

like 1

good 2

bad -2

terrible -3

This works like a keyword scoring mechanism
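As a minimal sketch of that scoring mechanism, the four example entries above can be placed in a dict and summed over a sentence. The score function name and the whitespace tokenization are assumptions made for illustration; a real lexicon would be much larger.

```python
# The four example entries from the text above.
sentiment_dictionary = {"like": 1, "good": 2, "bad": -2, "terrible": -3}


def score(sentence):
    """Sum the scores of known words; unknown words contribute 0."""
    return sum(sentiment_dictionary.get(word, 0)
               for word in sentence.lower().split())


print(score("I like this good book"))
print(score("a terrible bad plot"))
print(score("an awesome book"))  # no known words, so the score is 0
```

The last call shows the main weakness of the approach: a word missing from the dictionary is silently treated as neutral.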

For example: AFINN-111

Download link:

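# Discourse understanding: sentiment analysis
sentiment_dictionary = {}
for line in open('./AFINN-111.txt'):
    word, score = line.split('\t')
    sentiment_dictionary[word] = int(score)
# With the score table stored in a dict,
# walk through the sentence and add up the matching values
words = ['like', 'love', 'beautiful']
total_score = sum(sentiment_dictionary.get(word, 0) for word in words)
'''
Description
dict.get() returns the value for the given key, or a default value if the key is not in the dictionary.
Syntax
dict.get(key, default=None)
Parameters
key -- the key to look up in the dictionary.
default -- the value to return if the key does not exist.
Return value
The value for the given key, or the default (None unless specified) if the key is not in the dictionary.
'''
# A word found in the dict contributes its score; any other word contributes 0
# The sum is the sentiment score
total_score
# output
8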

Clearly, this approach is too naive:

What about new words?

What about special vocabulary?

What about deeper levels of meaning?

Improvement:

# Improvement
from nltk.classify import NaiveBayesClassifier
# Make up a small training set
s1 = 'this is a good book'
s2 = 'this is a awesome book'
s3 = 'this is a bad book'
s4 = 'this is a terrible book'
def preprocess(s):
    # Func: sentence processing
    # Here we simply use split() to separate the words in the sentence
    # Clearly, many more processing methods could be used
    # The return value looks like:
    # {'this': True, 'is': True, 'a': True, 'good': True, 'book': True}
    # The key is the fname, one per word that appears in the text;
    # the value is the fval, the value associated with that word.
    # Here we use the simplest value, True, to mean "this word appears in the current sentence".
    # This function could be upgraded to produce richer fvals, such as word2vec
    return {word: True for word in s.lower().split()}
# Put the training set into the standard form
training_data = [[preprocess(s1), 'pos'],  # pos, neg are the labels
                 [preprocess(s2), 'pos'],
                 [preprocess(s3), 'neg'],
                 [preprocess(s4), 'neg']]
# Feed it to the model
model = NaiveBayesClassifier.train(training_data)
# Print the result
print(model.classify(preprocess('this is a good book')))
# output
pos

# First read in all the data
pos_data = []
with open('PATH_TO_rt-polarity-pos.txt', encoding='latin-1') as f:
    for line in f:
        pos_data.append([preprocess(line), 'pos'])
neg_data = []
with open('PATH_TO_rt-polarity-neg.txt', encoding='latin-1') as f:
    for line in f:
        neg_data.append([preprocess(line), 'neg'])
# Split into training and test sets
training_data = pos_data[:4000] + neg_data[:4000]
testing_data = pos_data[4000:] + neg_data[4000:]
# Train the model
model = NaiveBayesClassifier.train(training_data)
# Test
print(model.classify(preprocess('this is a bad movie')))
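To show roughly what a naive Bayes classifier does with these word-presence features, here is a from-scratch sketch on the four toy sentences. It is not NLTK's implementation: the train and classify functions, the add-one smoothing, and the choice to skip words never seen in training are simplifying assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def preprocess(s):
    """Same feature format as above: {word: True} for each word present."""
    return {word: True for word in s.lower().split()}


def train(data):
    """Count how many sentences of each label contain each word."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for feats, label in data:
        label_counts[label] += 1
        for word in feats:
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def classify(model, feats):
    """Pick the label maximizing log P(label) + sum of log P(word | label)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n in label_counts.items():
        score = math.log(n / total)
        for word in feats:
            if word not in vocab:
                continue  # assumption: ignore words never seen in training
            # add-one smoothing over the two outcomes (present / absent)
            score += math.log((word_counts[label][word] + 1) / (n + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


toy_data = [
    [preprocess("this is a good book"), "pos"],
    [preprocess("this is a awesome book"), "pos"],
    [preprocess("this is a bad book"), "neg"],
    [preprocess("this is a terrible book"), "neg"],
]
toy_model = train(toy_data)
print(classify(toy_model, preprocess("this is a bad movie")))
```

Words shared by both classes (this, is, a) contribute equally to each label, so the decision is driven by the few words that differ between them.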

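Bag-of-words model (BOW)

The bag-of-words model turns a sentence into a vector. It is a simple, direct method: it ignores the order of the words in the sentence and only counts how many times each word in the vocabulary appears in it.

Drawbacks:

Vocabulary: the vocabulary needs careful design, above all to keep its size manageable, since its size affects the sparsity of the document representation.
Sparsity: sparse representations are harder to model, both for computational reasons (space and time complexity) and for informational reasons: the model has to make use of very little information in a very large representation space.
Meaning: discarding word order ignores context, and with it the meaning (semantics) of the words in the document. Context and meaning have a lot to offer a model; if modeled, they could tell apart the same words in different orders ("this is interesting" vs "is this interesting"), synonyms ("old bike" vs "used bike"), and much more.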
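The word-order drawback is easy to see in a small sketch. The bow_vector helper and the whitespace tokenization are assumptions made for illustration, not part of any library: the statement and the question from the example above end up with the same vector.

```python
from collections import Counter


def bow_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]


vocabulary = sorted({"this", "is", "interesting"})
v1 = bow_vector("this is interesting", vocabulary)
v2 = bow_vector("is this interesting", vocabulary)
print(vocabulary)
print(v1, v2, v1 == v2)
```

Because only counts are kept, any permutation of the same words maps to the same point in the vector space.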

from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    "John likes to watch movies, Mary likes movies too",
    "John also likes to watch football games",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in recent scikit-learn releases;
# get_feature_names_out() is its replacement
print(vectorizer.get_feature_names_out().tolist())
print(X.toarray())
# output
['also', 'football', 'games', 'john', 'likes', 'mary', 'movies', 'to', 'too', 'watch']
[[0 0 0 1 2 1 2 1 1 1]
 [1 1 1 1 1 0 0 1 0 1]]

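TF-IDF

TF: Term Frequency, measures how frequently a term appears in a document.

TF(t) = (number of times t appears in the document) / (total number of terms in the document)

IDF: Inverse Document Frequency, measures how important a term is.

Some words appear a lot but are clearly not very useful, such as 'is', 'the', and 'and'.

To balance this, we raise the weight of rare words and lower the weight of common ones.

IDF(t) = log10(total number of documents / number of documents containing t)

TF-IDF = TF * IDF

Example:

A document contains 100 words, and the word baby appears 3 times.

Then TF(baby) = (3 / 100) = 0.03

Suppose we have 10M documents, and baby appears in 1000 of them.

Then IDF(baby) = log10(10000000 / 1000) = 4

So TF-IDF(baby) = TF(baby) * IDF(baby) = 0.03 * 4 = 0.12

The logarithm here is base 10, which is what makes the IDF come out to 4; with the natural logarithm it would be about 9.21.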
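The worked example can be checked with a few lines of Python. The tf and idf helper names are hypothetical, and the sketch assumes a base-10 logarithm, since that is the base under which IDF(baby) equals 4.

```python
import math


def tf(term_count, total_terms):
    """Term frequency: share of the document's terms that are this term."""
    return term_count / total_terms


def idf(n_docs, n_docs_with_term):
    """Inverse document frequency, base-10 logarithm (assumed)."""
    return math.log10(n_docs / n_docs_with_term)


tf_baby = tf(3, 100)
idf_baby = idf(10_000_000, 1_000)
tfidf_baby = tf_baby * idf_baby
print(tf_baby, idf_baby, round(tfidf_baby, 4))
```

Changing the base of the logarithm rescales every IDF by the same constant factor, so the ranking of terms stays the same.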

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in recent scikit-learn releases;
# get_feature_names_out() is its replacement
print(vectorizer.get_feature_names_out().tolist())
print(X)
print(X.toarray())
# output
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
  (0, 1)    0.46979138557992045
  (0, 2)    0.5802858236844359
  (0, 6)    0.38408524091481483
  (0, 3)    0.38408524091481483
  (0, 8)    0.38408524091481483
  (1, 5)    0.5386476208856763
  (1, 1)    0.6876235979836938
  (1, 6)    0.281088674033753
  (1, 3)    0.281088674033753
  (1, 8)    0.281088674033753
  (2, 4)    0.511848512707169
  (2, 7)    0.511848512707169
  (2, 0)    0.511848512707169
  (2, 6)    0.267103787642168
  (2, 3)    0.267103787642168
  (2, 8)    0.267103787642168
  (3, 1)    0.46979138557992045
  (3, 2)    0.5802858236844359
  (3, 6)    0.38408524091481483
  (3, 3)    0.38408524091481483
  (3, 8)    0.38408524091481483
[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
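These numbers do not follow the textbook TF * IDF formula directly. With its default settings, TfidfVectorizer uses raw counts as TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scales each row to unit L2 length. The sketch below re-derives the values in plain Python under those assumptions; the regex tokenizer and the tfidf_row helper are stand-ins written for illustration, not scikit-learn code.

```python
import math
import re

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Lowercase and keep tokens of two or more word characters
# (an approximation of the vectorizer's default tokenization).
docs = [re.findall(r"\b\w\w+\b", text.lower()) for text in corpus]
vocab = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

# Document frequency and smoothed IDF for every vocabulary term.
df = {t: sum(t in doc for doc in docs) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}


def tfidf_row(tokens):
    """Raw count * smoothed IDF, then L2-normalize the row."""
    raw = [tokens.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


rows = [tfidf_row(doc) for doc in docs]
print(vocab)
print([round(x, 8) for x in rows[0]])
```

The L2 normalization is why every weight lies between 0 and 1 and why terms with identical counts and document frequencies share the same value within a row.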

Published: 2024-02-28 09:27:15

Article link: https://www.elefans.com/category/jswz/34/1768817.html