Text Mining
Notes for the course “Introduction to Data Science” at RWTH Aachen, winter semester 2019/20, Professor Willibrordus van der Aalst
Contents
- Basic Definitions and Preliminaries
- Definition
- Including
- Applications
- Text Processing
- Corpus
- Three Steps
- Text Modeling
- Bag-of-Word Model
- Document-Term Matrix
- Document Clustering
- N-gram model
- Motivation
- Markov assumption
- Notation
- Model
- Maximum Likelihood Estimation (MLE)
- Limitations
- Autoencoders to reduce the dimensions and handle sparseness
- word2vec to capture context
- Skip-grams (k-skip n-grams)
- word2vec
Basic Definitions and Preliminaries
Definition
The extraction of (structured) knowledge from (unstructured) text.
Including
- Information Retrieval (IR): querying a collection to retrieve relevant documents
- Natural Language Processing (NLP)
Applications
- Document Clustering
- Document Classification
- Named Entity Recognition
- Part Of Speech Tagging
- Coreference Resolution: e.g., resolving pronouns like he/she to decide whether they refer to the same person
- Sentiment Analysis: positive/negative attitude
- Keyword Extraction
- Machine Translation
Text Processing
Corpus
- The collection of text that you want to analyze, and it is divided into consistent pieces.
- The pieces of text are often called documents.
- Annotated corpus: annotations are often done by hand.
Three Steps
- Tokenization: split the text into smaller units (tokens)
- Stopword removal (stop list): remove words that add little meaning, such as forms of “be” and “have” and articles — otherwise two unrelated documents can look similar just because they share many stopwords
- Token normalization: Stemming chops off part of the word (e.g., computer, computation -> comput); Lemmatization maps a word to its root form (lemma)
- Problem: Homonyms — identical words with different meanings
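The three steps can be sketched in Python; the stop list and the suffix-stripping stemmer below are toy stand-ins (real pipelines use curated stop lists and, e.g., the Porter stemmer):

```python
import re

# Toy stop list (assumption): real systems use much larger ones.
STOPWORDS = {"the", "a", "an", "is", "are", "be", "have", "has", "to", "of"}

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    """Crude suffix stripping, for illustration only."""
    for suffix in ("ation", "ers", "ing", "er", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = remove_stopwords(tokenize("The computers are computing a computation"))
stems = [stem(t) for t in tokens]
print(stems)  # ['comput', 'comput', 'comput']
```

Note how three different surface forms collapse to the same stem, which is exactly what makes the document-term matrix denser.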
Text Modeling
Bag-of-Word Model
If $W$ is the set of possible words, a document $d$ is a multiset of words: $d \in \mathbb{B}(W)$.
If $D = \mathbb{B}(W)$ is the set of all documents, a corpus $c$ is a multiset of documents: $c \in \mathbb{B}(D)$.
But: the multiset representation loses information such as word order.
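In Python, a multiset of words can be represented with `collections.Counter`; the example shows that counts survive while word order is lost:

```python
from collections import Counter

# Two word sequences with the same words in different order
# yield the same bag: the multiset keeps counts, not order.
d1 = Counter("to be or not to be".split())
d2 = Counter("be to not or be to".split())
print(d1 == d2)  # True
print(d1["to"])  # 2
```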
Document-Term Matrix
Table form of Bag-of-Word
- Term Frequency (TF): $tf(w, d)$ = number of occurrences of word $w$ in document $d$.
- Inverse Document Frequency (IDF): measures how specific a word is to the documents that contain it: $idf(w) = \log_{2}\left(\dfrac{N}{\#\text{documents that contain } w \text{ at least once}}\right)$, where $N$ is the number of documents in the corpus. Intuitively $idf(w) = \log_2\left(\dfrac{1}{P(w)}\right)$: the rarer the word, the higher the value. Words with a very low value occur in most documents, carry little information, and behave like stopwords.
- TF-IDF scoring: $tfidf(w, d) = tf(w, d) \cdot idf(w)$; it captures the relevance of a word in a corpus and the strength of the association between a word and a document.
- $score(query, d) = \sum_{w \in query} tfidf(w, d)$; rank documents by this score.
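The TF, IDF, and scoring formulas above can be sketched directly; the three-document corpus here is invented for illustration:

```python
import math

# Invented toy corpus, already tokenized.
corpus = {
    "d1": ["apple", "sells", "cell", "phones"],
    "d2": ["apple", "pie", "recipe"],
    "d3": ["cell", "biology", "notes"],
}

def tf(w, d):
    # Number of occurrences of word w in document d.
    return corpus[d].count(w)

def idf(w):
    # log2(N / #documents containing w); 0 for words never seen.
    df = sum(1 for doc in corpus.values() if w in doc)
    return math.log2(len(corpus) / df) if df else 0.0

def tfidf(w, d):
    return tf(w, d) * idf(w)

def score(query, d):
    # Rank documents for a query by summing tf-idf over query words.
    return sum(tfidf(w, d) for w in query)

score(["cell", "phones"], "d1")  # "phones" is rarer, so it contributes more
```

Here idf("phones") = log2(3/1) is larger than idf("cell") = log2(3/2), so the query word that appears in fewer documents dominates the score, as intended.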
Document Clustering
Imagine every document as a vector (e.g., its row of tf-idf scores in the document-term matrix); then we can compare documents with cosine similarity.
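A minimal cosine-similarity sketch for comparing such document vectors:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two document vectors
    (e.g., rows of the document-term matrix)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

cosine_similarity([1, 2, 0], [2, 4, 0])  # parallel vectors: ~1.0
cosine_similarity([1, 0], [0, 1])        # orthogonal vectors: 0.0
```

Cosine similarity ignores vector length, so a long document and a short one about the same topic still score as similar.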
N-gram model
- An n-gram is a sequence of n consecutive tokens; used, e.g., for completion prediction.
- N is the number of words in the sequence.
- Chain rule: $P(A, B, \ldots, Y, Z) = P(A \mid B, \ldots, Z)\, P(B \mid C, \ldots, Z) \cdots P(Y \mid Z)\, P(Z)$
- Can be used to generate sentences (auto-continuation).
Motivation
The motivation behind n-grams is estimating the probability of a word given the prior context: P(phones | Apple gets most of the revenues selling cell)
- Unigram: P(phones)
- Bigram: P(phones | cell)
- Trigram: P(phones | selling cell)
Markov assumption
- Markov assumption: the future behavior of a system depends only on its recent (fixed-length) history.
- A k-th order Markov model is a model where the next state depends on the previous k states.
- We can use n-grams to create (n−1)-th order Markov models.
Notation
- Word sequences: $w_{1}^{n} = w_{1} \dots w_{n}$
- Number of occurrences of a sequence: $C\left(w_{1}^{n}\right)$
- Use <s> as a sentence-start marker and </s> as a sentence-end marker, so the first words also have a conditioning context and no negative indices are needed.
Model
- Chain rule: $P\left(w_{1}^{n}\right) = P\left(w_{1}\right) P\left(w_{2} \mid w_{1}\right) P\left(w_{3} \mid w_{1}^{2}\right) \ldots P\left(w_{n} \mid w_{1}^{n-1}\right) = \prod_{k=1}^{n} P\left(w_{k} \mid w_{1}^{k-1}\right)$
- Markov assumption:
- Assumption on bigrams: $P\left(w_{n} \mid w_{1}^{n-1}\right) \approx P\left(w_{n} \mid w_{n-1}\right)$ (only the last word matters)
- Assumption on N-grams: $P\left(w_{n} \mid w_{1}^{n-1}\right) \approx P\left(w_{n} \mid w_{n-N+1}^{n-1}\right)$ (only the last N−1 words matter)
- Directly:
- Bigram approximation: $P\left(w_{1}^{n}\right) \approx \prod_{k=1}^{n} P\left(w_{k} \mid w_{k-1}\right)$
- N-gram approximation: $P\left(w_{1}^{n}\right) \approx \prod_{k=1}^{n} P\left(w_{k} \mid w_{k-N+1}^{k-1}\right)$
Maximum Likelihood Estimation (MLE)
- Bigram MLE estimate: $P\left(w_{n} \mid w_{n-1}\right) = \dfrac{C\left(w_{n-1} w_{n}\right)}{C\left(w_{n-1}\right)}$
- N-gram MLE estimate: $P\left(w_{n} \mid w_{n-N+1}^{n-1}\right) = \dfrac{C\left(w_{n-N+1}^{n-1} w_{n}\right)}{C\left(w_{n-N+1}^{n-1}\right)}$
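The bigram MLE estimate is just a ratio of counts; a minimal sketch over an invented two-sentence corpus (padded with <s> and </s>):

```python
from collections import Counter

# Invented toy training corpus.
sentences = [["<s>", "i", "like", "cats", "</s>"],
             ["<s>", "i", "like", "dogs", "</s>"]]

unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))

def p_mle(w, prev):
    # P(w | prev) = C(prev w) / C(prev)
    return bigrams[(prev, w)] / unigrams[prev]

p_mle("like", "i")    # 2/2 = 1.0
p_mle("cats", "like")  # 1/2 = 0.5
```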
Limitations
- sparseness
- many combinations never occur, so the table is almost empty
- solution: model characters (the alphabet) rather than words
- zeros
- we train an n-gram model on the training corpus
- we test it on sequences of length n-1 from the test corpus
- if the test corpus contains a sequence of length n unseen in the training data, our model will assign probability 0 to these events!
- Solution: smoothing: we “chip off” some probability mass from highly likely sequences and spread it among unseen sequences, assigning a small but non-zero probability.
- Laplace smoothing: add +1 to all cells of the N-gram matrix before calculating probabilities!
- $P_{\mathrm{Laplace}}^{*}\left(w_{n} \mid w_{n-1}\right) = \dfrac{C\left(w_{n-1} w_{n}\right) + 1}{C\left(w_{n-1}\right) + V}$ ($V$ is the size of the vocabulary)
- Unknown words
- Solution: replace rare or unseen words with a special token <UNK>
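A minimal sketch of add-one (Laplace) smoothing for bigrams, reusing the same invented two-sentence corpus idea; note that the unseen bigram now gets a small but non-zero probability:

```python
from collections import Counter

# Invented toy training corpus.
sentences = [["<s>", "i", "like", "cats", "</s>"],
             ["<s>", "i", "like", "dogs", "</s>"]]

unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))
V = len(unigrams)  # vocabulary size (6 here, including <s> and </s>)

def p_laplace(w, prev):
    # Add-one smoothing: every bigram count is bumped by 1,
    # and the denominator grows by V to keep probabilities normalized.
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

p_laplace("dogs", "i")  # unseen bigram: (0 + 1) / (2 + 6) = 0.125
p_laplace("like", "i")  # seen bigram:   (2 + 1) / (2 + 6) = 0.375
```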
Autoencoders to reduce the dimensions and handle sparseness
- We need to transform words into vectors; with one-hot encoding a vector has one component per vocabulary word, so it becomes extremely long and sparse.
- The neural network looks like this:
- Input and output layers of dimension V
- Hidden layer of dimension N, with N < V
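A sketch of why the hidden layer compresses the representation: with a one-hot input, the encoder's matrix product reduces to a single row lookup (the sizes V and N below are invented for illustration):

```python
import random

# Illustrative sizes (assumption): vocabulary V, hidden dimension N < V.
V, N = 1000, 50
random.seed(0)

# Encoder weights: a V x N matrix. Because the input is one-hot,
# multiplying one_hot(i) by W simply selects row i of W.
W = [[random.random() for _ in range(N)] for _ in range(V)]

def encode(word_index):
    # one_hot(word_index) @ W  ==  W[word_index]
    return W[word_index]

dense = encode(42)
print(len(dense))  # 50: far smaller than the V-dimensional one-hot input
```

In a trained autoencoder, W is learned so the decoder can reconstruct the input; the rows of W then serve as dense word representations.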
word2vec to capture context
storing the meaning of words rather than words themselves
Skip-grams (k-skip n-grams)
- k is the maximum number of words that may be skipped between the tokens of the gram; the skip-gram still contains n words.
- can associate a more general notion of context than n-grams
- Part of the surrounding context is left out (skipped): this means less overfitting when used in learning.
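A minimal sketch generating k-skip bigrams (n = 2) from a token list; the example sentence is just an illustration:

```python
def k_skip_bigrams(tokens, k):
    """All ordered pairs (tokens[i], tokens[j]) with at most k words
    skipped between the two tokens."""
    pairs = []
    for i in range(len(tokens)):
        # j ranges over the next 1 + k positions after i.
        for j in range(i + 1, min(i + 2 + k, len(tokens))):
            pairs.append((tokens[i], tokens[j]))
    return pairs

pairs = k_skip_bigrams(["insurgents", "killed", "in", "ongoing", "fighting"], 2)
print(len(pairs))  # 9 pairs, vs. only 4 plain bigrams
```

With k = 0 this degenerates to ordinary bigrams; larger k captures a wider notion of context from the same amount of text.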
word2vec
- Instead of learning a compressed representation of a word from itself, we will learn it from its context.
- For each word, we consider the context in a sliding window around it. We create 2-tuples with the word and every skip-gram in the window that does not contain it.
- Vectors for similar words are close to each other.
- Semantic relations can be expressed with vector arithmetic: (king − man + woman) ≈ queen
- For a document: combine its word vectors, e.g., by taking their sum, maximum, average, or minimum.
- Or use doc2vec: instead of inputting single words, input documents.
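The two ideas above can be sketched with hypothetical helpers: building (center word, context word) training pairs from a sliding window, which is the skip-gram training setup of word2vec, and averaging word vectors into a document vector:

```python
def training_pairs(tokens, window=2):
    """(center word, context word) pairs from a sliding window,
    as used to train the skip-gram variant of word2vec."""
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((w, tokens[j]))
    return pairs

def doc_vector(word_vectors):
    """A simple document vector: component-wise average of word vectors."""
    n = len(word_vectors)
    return [sum(v[d] for v in word_vectors) / n
            for d in range(len(word_vectors[0]))]

training_pairs(["the", "cat", "sat"], window=1)
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
doc_vector([[1, 2], [3, 4]])  # [2.0, 3.0]
```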