Text Mining
Notes for the course “Introduction to Data Science” at RWTH Aachen, winter semester 2019/20, Professor Willibrordus van der Aalst
Contents
- Basic Definitions and Preliminaries
- Definition
- Including
- Applications
- Text Processing
- Corpus
- Three Steps
- Text Modeling
- Bag-of-Word Model
- Document-Term Matrix
- Document Clustering
- N-gram model
- Motivation
- Markov assumption
- Notation
- Model
- Maximum Likelihood Estimation (MLE)
- Limitations
- Autoencoders to reduce the dimensions and handle sparseness
- word2vec to capture context
- Skip-grams (k-skip n-grams)
- word2vec
Basic Definitions and Preliminaries
Definition
The extraction of (structured) knowledge from (unstructured) text.
Including
- Information Retrieval (IR): querying a collection to retrieve relevant documents
- Natural Language Processing (NLP)
Applications
- Document Clustering
- Document Classification
- Named Entity Recognition
- Part Of Speech Tagging
- Coreference Resolution: e.g., resolving pronouns like he/she to decide whether they refer to the same person
- Sentiment Analysis: positive/negative attitude
- Keyword Extraction
- Machine Translation
Text Processing
Corpus
- The collection of text that you want to analyze, and it is divided into consistent pieces.
- The pieces of text are often called documents.
- Annotated corpus: annotations are often done by hand.
Three Steps
- Tokenization: split the text into smaller units (tokens)
- Stopword removal (stop list): remove words that add little meaning, such as forms of “be” and “have” and articles — otherwise two unrelated documents can look similar just because they share many stopwords
- Token normalization: Stemming chops off part of the word (e.g., computer, computation -> comput); Lemmatization maps a word to its root form (lemma)
- Problem: Homonyms — identical words with different meanings
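The three steps can be sketched in Python; the stop list and the suffix-stripping stemmer below are toy stand-ins (real pipelines use curated stop lists and, e.g., the Porter stemmer):

```python
import re

# Toy stop list (assumption): real systems use much larger ones.
STOPWORDS = {"the", "a", "an", "is", "are", "be", "have", "has", "to", "of"}

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    """Crude suffix stripping, for illustration only."""
    for suffix in ("ation", "ers", "ing", "er", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = remove_stopwords(tokenize("The computers are computing a computation"))
stems = [stem(t) for t in tokens]
print(stems)  # ['comput', 'comput', 'comput']
```

Note how three different surface forms collapse to the same stem, which is exactly what makes the document-term matrix denser.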
Text Modeling
Bag-of-Word Model
If $W$ is the set of possible words, a document $d$ is a multiset of words: $d \in \mathbb{B}(W)$.
If $D = \mathbb{B}(W)$ is the set of all documents, a corpus $c$ is a multiset of documents: $c \in \mathbb{B}(D)$.
But: the multiset representation loses information such as word order.
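In Python, a multiset of words can be represented with `collections.Counter`; the example shows that counts survive while word order is lost:

```python
from collections import Counter

# Two word sequences with the same words in different order
# yield the same bag: the multiset keeps counts, not order.
d1 = Counter("to be or not to be".split())
d2 = Counter("be to not or be to".split())
print(d1 == d2)  # True
print(d1["to"])  # 2
```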
Document-Term Matrix
Table form of Bag-of-Word
- Term Frequency (TF): $tf(w, d)$ = number of occurrences of word $w$ in document $d$.
- Inverse Document Frequency (IDF): measures how specific a word is to the documents that contain it: $idf(w) = \log_{2}\left(\dfrac{N}{\#\text{documents that contain } w \text{ at least once}}\right)$, where $N$ is the number of documents in the corpus. Intuitively $idf(w) = \log_2\left(\dfrac{1}{P(w)}\right)$: the rarer the word, the higher the value. Words with a very low value occur in most documents, carry little information, and behave like stopwords.
- TF-IDF scoring: $tfidf(w, d) = tf(w, d) \cdot idf(w)$; it captures the relevance of a word in a corpus and the strength of the association between a word and a document.
- $score(query, d) = \sum_{w \in query} tfidf(w, d)$; rank documents by this score.
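The TF, IDF, and scoring formulas above can be sketched directly; the three-document corpus here is invented for illustration:

```python
import math

# Invented toy corpus, already tokenized.
corpus = {
    "d1": ["apple", "sells", "cell", "phones"],
    "d2": ["apple", "pie", "recipe"],
    "d3": ["cell", "biology", "notes"],
}

def tf(w, d):
    # Number of occurrences of word w in document d.
    return corpus[d].count(w)

def idf(w):
    # log2(N / #documents containing w); 0 for words never seen.
    df = sum(1 for doc in corpus.values() if w in doc)
    return math.log2(len(corpus) / df) if df else 0.0

def tfidf(w, d):
    return tf(w, d) * idf(w)

def score(query, d):
    # Rank documents for a query by summing tf-idf over query words.
    return sum(tfidf(w, d) for w in query)

score(["cell", "phones"], "d1")  # "phones" is rarer, so it contributes more
```

Here idf("phones") = log2(3/1) is larger than idf("cell") = log2(3/2), so the query word that appears in fewer documents dominates the score, as intended.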
Document Clustering
Imagine every document as a vector (e.g., its row of tf-idf scores in the document-term matrix); then we can compare documents with cosine similarity.
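A minimal cosine-similarity sketch for comparing such document vectors:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two document vectors
    (e.g., rows of the document-term matrix)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

cosine_similarity([1, 2, 0], [2, 4, 0])  # parallel vectors: ~1.0
cosine_similarity([1, 0], [0, 1])        # orthogonal vectors: 0.0
```

Cosine similarity ignores vector length, so a long document and a short one about the same topic still score as similar.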
N-gram model
- An n-gram is a sequence of n consecutive tokens; used, e.g., for completion prediction.
- N is the number of words in the sequence.
- Chain rule: $P(A, B, \ldots, Y, Z) = P(A \mid B, \ldots, Z)\, P(B \mid C, \ldots, Z) \cdots P(Y \mid Z)\, P(Z)$
- Can be used to generate sentences (auto-continuation).
Motivation
The motivation behind n-grams is estimating the probability of a word given the prior context: P(phones | Apple gets most of the revenues selling cell)
- Unigram: P(phones)
- Bigram: P(phones | cell)
- Trigram: P(phones | selling cell)
Markov assumption
- Markov assumption: the future behavior of a system depends only on its recent (fixed-length) history.
- A k-th order Markov model is a model where the next state depends on the previous k states.
- We can use n-grams to create (n−1)-th order Markov models.
Notation
- Word sequences: $w_{1}^{n} = w_{1} \dots w_{n}$
- Number of occurrences of a sequence: $C\left(w_{1}^{n}\right)$
- Use <s> as a sentence-start marker and </s> as a sentence-end marker, so the first words also have a conditioning context and no negative indices are needed.
Model
- Chain rule: $P\left(w_{1}^{n}\right) = P\left(w_{1}\right) P\left(w_{2} \mid w_{1}\right) P\left(w_{3} \mid w_{1}^{2}\right) \ldots P\left(w_{n} \mid w_{1}^{n-1}\right) = \prod_{k=1}^{n} P\left(w_{k} \mid w_{1}^{k-1}\right)$
- Markov assumption:
- Assumption on bigrams: $P\left(w_{n} \mid w_{1}^{n-1}\right) \approx P\left(w_{n} \mid w_{n-1}\right)$ (only the last word matters)
- Assumption on N-grams: $P\left(w_{n} \mid w_{1}^{n-1}\right) \approx P\left(w_{n} \mid w_{n-N+1}^{n-1}\right)$ (only the last N−1 words matter)
- Directly:
- Bigram approximation: $P\left(w_{1}^{n}\right) \approx \prod_{k=1}^{n} P\left(w_{k} \mid w_{k-1}\right)$
- N-gram approximation: $P\left(w_{1}^{n}\right) \approx \prod_{k=1}^{n} P\left(w_{k} \mid w_{k-N+1}^{k-1}\right)$
Maximum Likelihood Estimation (MLE)
- Bigram MLE estimate: $P\left(w_{n} \mid w_{n-1}\right) = \dfrac{C\left(w_{n-1} w_{n}\right)}{C\left(w_{n-1}\right)}$
- N-gram MLE estimate: $P\left(w_{n} \mid w_{n-N+1}^{n-1}\right) = \dfrac{C\left(w_{n-N+1}^{n-1} w_{n}\right)}{C\left(w_{n-N+1}^{n-1}\right)}$
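The bigram MLE estimate is just a ratio of counts; a minimal sketch over an invented two-sentence corpus (padded with <s> and </s>):

```python
from collections import Counter

# Invented toy training corpus.
sentences = [["<s>", "i", "like", "cats", "</s>"],
             ["<s>", "i", "like", "dogs", "</s>"]]

unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))

def p_mle(w, prev):
    # P(w | prev) = C(prev w) / C(prev)
    return bigrams[(prev, w)] / unigrams[prev]

p_mle("like", "i")    # 2/2 = 1.0
p_mle("cats", "like")  # 1/2 = 0.5
```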
Limitations
- sparseness
- many combinations never occur, so the table is almost empty
- solution: model characters (the alphabet) rather than words
- zeros
- we train an n-gram model on the training corpus
- we test it on sequences of length n-1 from the test corpus
- if the test corpus contains a sequence of length n unseen in the training data, our model will assign probability 0 to these events!
- Solution: smoothing: we “chip off” some probability mass from highly likely sequences and spread it among unseen sequences, assigning a small but non-zero probability.
- Laplace smoothing: add +1 to all cells of the N-gram matrix before calculating probabilities!
- $P_{\mathrm{Laplace}}^{*}\left(w_{n} \mid w_{n-1}\right) = \dfrac{C\left(w_{n-1} w_{n}\right) + 1}{C\left(w_{n-1}\right) + V}$ ($V$ is the size of the vocabulary)
- Unknown words
- Solution: replace rare or unseen words with a special token <UNK>
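A minimal sketch of add-one (Laplace) smoothing for bigrams, reusing the same invented two-sentence corpus idea; note that the unseen bigram now gets a small but non-zero probability:

```python
from collections import Counter

# Invented toy training corpus.
sentences = [["<s>", "i", "like", "cats", "</s>"],
             ["<s>", "i", "like", "dogs", "</s>"]]

unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))
V = len(unigrams)  # vocabulary size (6 here, including <s> and </s>)

def p_laplace(w, prev):
    # Add-one smoothing: every bigram count is bumped by 1,
    # and the denominator grows by V to keep probabilities normalized.
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

p_laplace("dogs", "i")  # unseen bigram: (0 + 1) / (2 + 6) = 0.125
p_laplace("like", "i")  # seen bigram:   (2 + 1) / (2 + 6) = 0.375
```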
Autoencoders to reduce the dimensions and handle sparseness
- We need to transform words into vectors; with one-hot encoding a vector has one component per vocabulary word, so it becomes extremely long and sparse.
- The neural network looks like this:
- Input and output layers of dimension V
- Hidden layer of dimension N, with N < V
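A sketch of why the hidden layer compresses the representation: with a one-hot input, the encoder's matrix product reduces to a single row lookup (the sizes V and N below are invented for illustration):

```python
import random

# Illustrative sizes (assumption): vocabulary V, hidden dimension N < V.
V, N = 1000, 50
random.seed(0)

# Encoder weights: a V x N matrix. Because the input is one-hot,
# multiplying one_hot(i) by W simply selects row i of W.
W = [[random.random() for _ in range(N)] for _ in range(V)]

def encode(word_index):
    # one_hot(word_index) @ W  ==  W[word_index]
    return W[word_index]

dense = encode(42)
print(len(dense))  # 50: far smaller than the V-dimensional one-hot input
```

In a trained autoencoder, W is learned so the decoder can reconstruct the input; the rows of W then serve as dense word representations.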
word2vec to capture context
storing the meaning of words rather than words themselves
Skip-grams (k-skip n-grams)
- k is the maximum number of words that may be skipped between the tokens of the gram; the skip-gram still contains n words.
- can associate a more general notion of context than n-grams
- Part of the surrounding context is left out (skipped): this means less overfitting when used in learning.
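A minimal sketch generating k-skip bigrams (n = 2) from a token list; the example sentence is just an illustration:

```python
def k_skip_bigrams(tokens, k):
    """All ordered pairs (tokens[i], tokens[j]) with at most k words
    skipped between the two tokens."""
    pairs = []
    for i in range(len(tokens)):
        # j ranges over the next 1 + k positions after i.
        for j in range(i + 1, min(i + 2 + k, len(tokens))):
            pairs.append((tokens[i], tokens[j]))
    return pairs

pairs = k_skip_bigrams(["insurgents", "killed", "in", "ongoing", "fighting"], 2)
print(len(pairs))  # 9 pairs, vs. only 4 plain bigrams
```

With k = 0 this degenerates to ordinary bigrams; larger k captures a wider notion of context from the same amount of text.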
word2vec
- Instead of learning a compressed representation of a word from itself, we will learn it from its context.
- For each word, we consider the context in a sliding window around it. We create 2-tuples with the word and every skip-gram in the window that does not contain it.
- Vectors for similar words are close to each other.
- Semantic relations can be expressed with vector arithmetic: (king − man + woman) ≈ queen
- For a document: combine its word vectors, e.g., by taking their sum, maximum, average, or minimum.
- Or use doc2vec: instead of inputting single words, input documents.
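The two ideas above can be sketched with hypothetical helpers: building (center word, context word) training pairs from a sliding window, which is the skip-gram training setup of word2vec, and averaging word vectors into a document vector:

```python
def training_pairs(tokens, window=2):
    """(center word, context word) pairs from a sliding window,
    as used to train the skip-gram variant of word2vec."""
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((w, tokens[j]))
    return pairs

def doc_vector(word_vectors):
    """A simple document vector: component-wise average of word vectors."""
    n = len(word_vectors)
    return [sum(v[d] for v in word_vectors) / n
            for d in range(len(word_vectors[0]))]

training_pairs(["the", "cat", "sat"], window=1)
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
doc_vector([[1, 2], [3, 4]])  # [2.0, 3.0]
```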