Text Mining

Notes for the course “Introduction to Data Science” at RWTH Aachen, winter semester 19/20, Professor Willibrordus van der Aalst.

Contents

  • Basic Definitions and Preliminaries
      • Definition
      • Including
      • Applications
  • Text Processing
      • Corpus
      • Three Steps
  • Text Modeling
      • Bag-of-Words Model
      • Document-Term Matrix
      • Document Clustering
  • N-gram model
      • Motivation
      • Markov assumption
      • Notation
      • Model
      • Maximum Likelihood Estimation (MLE)
    • Limitations
  • Autoencoders to reduce the dimensions and handle sparseness
  • word2vec to capture context
    • Skip-grams (k-skip n-grams)
    • word2vec

Basic Definitions and Preliminaries

Definition

The extraction of (structured) knowledge from (unstructured) text.

Including

  1. Information Retrieval (IR): accessing a database to obtain the relevant data
  2. Natural Language Processing (NLP)

Applications

  1. Document Clustering
  2. Document Classification
  3. Named Entity Recognition
  4. Part-of-Speech Tagging
  5. Coreference Resolution: e.g., resolving pronouns such as he/she to decide whether they refer to the same person
  6. Sentiment Analysis: positive/negative attitude
  7. Keyword Extraction
  8. Machine Translation

Text Processing

Corpus

  • The collection of text you want to analyze, divided into consistent pieces.
  • The pieces of text are often called documents.
  • Annotated corpus: annotations are often done by hand.

Three Steps

  1. Tokenization: split the text into smaller units (tokens)
  2. Stopword removal (stop list): remove words that add little meaning, such as forms of ‘be’ and ‘have’ and articles; two unrelated documents can still share many stopwords
  3. Token normalization: stemming chops off part of the word (e.g., computer, computation -> comput); lemmatization reduces the word to its dictionary root (see the sketch below)
  • Problem: homonyms, i.e., words that are written the same but have different meanings
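
A minimal sketch of these three preprocessing steps in Python using NLTK (an illustration assumed by these notes, not prescribed by the lecture; the `punkt`, `stopwords`, and `wordnet` resources must be downloaded, and the example sentence is made up):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

text = "Computers and computation are changing how we analyze texts."

# 1. Tokenization: split the text into tokens (keep only alphabetic ones)
tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]

# 2. Stopword removal: drop words that appear on the stop list
stop_list = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_list]

# 3. Token normalization: stemming vs. lemmatization
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])        # e.g. 'comput' for computers/computation
print([lemmatizer.lemmatize(t) for t in tokens])
```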

Text Modeling

Bag-of-Words Model

If $W$ is the set of possible words, a document $d$ is a multiset of words: $d \in \mathbb{B}(W)$.
If $D = \mathbb{B}(W)$ is the set of all documents, a corpus $c$ is a multiset of documents: $c \in \mathbb{B}(D)$.
Note, however, that the multiset representation loses information such as word order.
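
As a small illustration (not from the lecture), a bag of words can be represented as a multiset with Python's `collections.Counter`:

```python
from collections import Counter

# A document as a multiset (bag) of words: counts matter, order does not
d1 = Counter("the cat sat on the mat".split())
d2 = Counter("the mat the cat sat on".split())

print(d1)        # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
print(d1 == d2)  # True: word order is lost in the bag-of-words representation
```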

Document-Term Matrix

Tabular form of the Bag-of-Words model

  • term frequency (TF): $tf(w,d)$ = number of occurrences of word $w$ in document $d$
  • Inverse Document Frequency (IDF): measures the specificity of a word in the corpus that contains it: $idf(w) = \log_{2}\left(\dfrac{N}{\#\text{ of documents that contain } w \text{ at least once}}\right)$, where $N$ is the number of documents in the corpus. Intuitively $idf(w) = \log_2\left(\dfrac{1}{P(w)}\right)$, i.e., the more unlikely the word, the higher the value. Words with a low value occur in most documents, are not very representative, and may be stopwords.
  • TF-IDF scoring: $tfidf(w,d) = tf(w, d) \cdot idf(w)$; it shows the relevance of a word in a corpus and the strength of the association between a word and a document.
  • $score(query, d) = \sum_{w\in query} tfidf(w, d)$; documents are ranked by this score.
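
A minimal sketch of these formulas in Python (the toy corpus and the query are made up; log base 2 as in the definition above):

```python
import math

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]
N = len(corpus)

def tf(w, d):
    return d.count(w)                        # occurrences of w in document d

def idf(w):
    df = sum(1 for d in corpus if w in d)    # documents containing w at least once
    return math.log2(N / df) if df else 0.0  # guard against words unseen in the corpus

def tfidf(w, d):
    return tf(w, d) * idf(w)

def score(query, d):
    return sum(tfidf(w, d) for w in query.split())

# Rank documents by their score for the query
ranking = sorted(corpus, key=lambda d: score("cat mat", d), reverse=True)
print(ranking[0])   # the document most strongly associated with the query
```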

Document Clustering

Imagining every document as a vector (e.g., its row of tf-idf values in the document-term matrix), we can use cosine similarity to compare documents and cluster similar ones, as sketched below.
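
A minimal sketch of cosine similarity between two document vectors (pure Python; the vectors are made up):

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

doc_a = [0.0, 1.2, 0.7, 0.0]   # e.g. tf-idf vectors over a fixed vocabulary
doc_b = [0.0, 0.9, 0.5, 0.1]
print(cosine_similarity(doc_a, doc_b))   # close to 1.0 for similar documents
```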

N-gram model

  • sequences of consecutive tokens, e.g., used for completion prediction
  • N is the number of words in the sequence
  • Chain rule: $P(A, B, \ldots, Y, Z) = P(A, B, \ldots, Y \mid Z)\,P(Z) = P(A \mid B, \ldots, Z)\, P(B \mid C, \ldots, Z) \cdots P(Y \mid Z)\, P(Z)$
  • can be used to generate sentences and to auto-continue text

Motivation

The motivation behind n-grams is estimating the probability of a word given the prior context: P(phones | Apple gets most of the revenues selling cell)

  • Unigram: P(phones)
  • Bigram: P(phones | cell)
  • Trigram: P(phones | selling cell)

Markov assumption

  • Markov assumption: the future behavior of a system depends only on its recent (fixed-length) history.
  • A k-th order Markov model is a model in which the next state depends only on the previous k states.
  • We can use n-grams to create (n-1)-th order Markov models.

Notation

  • word sequences: $w_{1}^{n}=w_{1} \dots w_{n}$
  • how often a sequence occurs in the corpus: $C\left(w_{1}^{n}\right)$
  • use <s> as a sentence-start marker and </s> as a sentence-end marker, so that the context of the first words is well defined (no negative indices).

Model

  • Chain rule: $P\left(w_{1}^{n}\right)=P\left(w_{1}\right) P\left(w_{2} \mid w_{1}\right) P\left(w_{3} \mid w_{1}^{2}\right) \ldots P\left(w_{n} \mid w_{1}^{n-1}\right)=\prod_{k=1}^{n} P\left(w_{k} \mid w_{1}^{k-1}\right)$
  • Markov assumption:
    • Assumption for bigrams: $P\left(w_{n} \mid w_{1}^{n-1}\right) \approx P\left(w_{n} \mid w_{n-1}\right)$ (only the last word matters)
    • Assumption for N-grams: $P\left(w_{n} \mid w_{1}^{n-1}\right) \approx P\left(w_{n} \mid w_{n-N+1}^{n-1}\right)$ (only the last N-1 words matter)
  • Directly:
    • Bigram approximation: $P\left(w_{1}^{n}\right) \approx \prod_{k=1}^{n} P\left(w_{k} \mid w_{k-1}\right)$
    • N-gram approximation: $P\left(w_{1}^{n}\right) \approx \prod_{k=1}^{n} P\left(w_{k} \mid w_{k-N+1}^{k-1}\right)$

Maximum Likelihood Estimation (MLE)

  • Bigram MLE estimation: $P\left(w_{n} \mid w_{n-1}\right)=\dfrac{C\left(w_{n-1} w_{n}\right)}{C\left(w_{n-1}\right)}$
  • N-gram MLE estimation: $P\left(w_{n} \mid w_{n-N+1}^{n-1}\right)=\dfrac{C\left(w_{n-N+1}^{n-1} w_{n}\right)}{C\left(w_{n-N+1}^{n-1}\right)}$
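
A minimal sketch of the bigram MLE estimate and of the bigram approximation of a sentence probability on a toy corpus (the sentences are made up; <s> and </s> are the markers from the notation above):

```python
import math
from collections import Counter

sentences = [
    "<s> the cat sat </s>".split(),
    "<s> the cat slept </s>".split(),
    "<s> the dog sat </s>".split(),
]

unigram_counts = Counter(w for s in sentences for w in s)
bigram_counts = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))

def p_mle(w, prev):
    # P(w | prev) = C(prev w) / C(prev)
    return bigram_counts[(prev, w)] / unigram_counts[prev]

def sentence_prob(sentence):
    # bigram approximation: P(w_1^n) ~ product of P(w_k | w_{k-1})
    words = sentence.split()
    return math.prod(p_mle(words[i + 1], words[i]) for i in range(len(words) - 1))

print(p_mle("cat", "the"))                      # 2/3: 'the' is followed by 'cat' in 2 of 3 cases
print(sentence_prob("<s> the dog slept </s>"))  # 0.0: the bigram (dog, slept) was never seen
```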

Limitations

  • sparseness
    • many word combinations never occur, so the n-gram table is almost empty
    • solution: work at the character level rather than the word level
  • zeros
    • we train an n-gram model on the training corpus
    • we then evaluate it on the test corpus
    • if the test corpus contains a sequence of length n that was unseen in the training data, the model assigns probability 0 to it!
    • solution: smoothing: we “chip off” some probability mass from highly likely sequences and spread it among unseen sequences, assigning them a small but non-zero probability (see the sketch below)
    • Laplace smoothing: add +1 to all cells of the N-gram matrix before calculating probabilities
    • $P_{\mathrm{Laplace}}^{*}\left(w_{n} \mid w_{n-1}\right)=\dfrac{C\left(w_{n-1} w_{n}\right)+1}{C\left(w_{n-1}\right)+V}$ ($V$ is the size of the vocabulary)
  • unknown words
    • solution: replace them with the special token <UNK>
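
Continuing the toy bigram setup from above, a minimal sketch of Laplace (add-one) smoothing (the sentences are made up):

```python
from collections import Counter

sentences = [
    "<s> the cat sat </s>".split(),
    "<s> the dog sat </s>".split(),
]
unigram_counts = Counter(w for s in sentences for w in s)
bigram_counts = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))
V = len(unigram_counts)   # vocabulary size, here including <s> and </s>

def p_laplace(w, prev):
    # P*_Laplace(w | prev) = (C(prev w) + 1) / (C(prev) + V)
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + V)

print(p_laplace("sat", "cat"))   # seen bigram: relatively high probability
print(p_laplace("cat", "dog"))   # unseen bigram: small but non-zero probability
```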

Autoencoders to reduce the dimensions and handle sparseness

  • We need to transform every word into a vector; with one-hot encoding the vectors have as many dimensions as there are words in the vocabulary, so they are extremely long and sparse.
  • The neural network looks as follows:
    • Input and output layers of dimension V
    • Hidden layer of dimension N, with N < V
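
A minimal sketch of such an autoencoder in PyTorch (an illustration assumed by these notes, not code from the lecture; V and N are arbitrary sizes):

```python
import torch
from torch import nn

V, N = 10000, 128   # vocabulary size and hidden (compressed) dimension, N < V

autoencoder = nn.Sequential(
    nn.Linear(V, N),   # encoder: compress the sparse V-dimensional input
    nn.ReLU(),
    nn.Linear(N, V),   # decoder: reconstruct the V-dimensional input
)

x = torch.zeros(1, V)
x[0, 42] = 1.0                   # one-hot vector for an arbitrary word index
reconstruction = autoencoder(x)  # trained to reproduce x; the N-dimensional hidden
                                 # activation serves as the dense word representation
print(reconstruction.shape)      # torch.Size([1, 10000])
```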

word2vec to capture context

Storing the meaning of words rather than the words themselves.

Skip-grams (k-skip n-grams)

  • k is the maximum distance allowed between two words; the resulting grams still consist of n words.
  • They capture a more general notion of context than plain n-grams.
  • Part of the surrounding context is left out (skipped): this means less overfitting when used in learning (see the sketch below).
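
A minimal sketch that enumerates k-skip bigrams from a token list (pure Python; the example sentence is made up):

```python
def skip_bigrams(tokens, k):
    # all pairs (tokens[i], tokens[j]) with at most k tokens skipped in between
    return [
        (tokens[i], tokens[j])
        for i in range(len(tokens))
        for j in range(i + 1, min(i + 2 + k, len(tokens)))
    ]

tokens = "insurgents killed in ongoing fighting".split()
print(skip_bigrams(tokens, 0))   # ordinary bigrams
print(skip_bigrams(tokens, 2))   # 2-skip bigrams: a more general notion of context
```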

word2vec

  • Instead of learning a compressed representation of a word from itself, we will learn it from its context.
  • For each word, we consider the context in a sliding window around it. We create 2-tuples with the word and every skip-gram in the window that does not contain it.
  • Vectors for similar words are close to each other.
  • It is possible to add and remove context with vector operations: (king – man + woman) ≈ queen.
  • For a document: aggregate its word vectors, e.g., take their sum, maximum, average, or minimum.
  • Or use doc2vec: instead of only words, the document itself is given as input (see the usage sketch below).
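
A hedged usage sketch with gensim (assuming gensim 4.x; the toy corpus is far too small to learn meaningful vectors and only shows the shape of the API):

```python
from gensim.models import Word2Vec

sentences = [
    "the king rules the country".split(),
    "the queen rules the country".split(),
    "a man and a woman walk in the park".split(),
]

# sg=1 selects the skip-gram training variant; window is the context size
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["king"].shape)                 # dense 50-dimensional word vector
print(model.wv.similarity("king", "queen"))   # cosine similarity between two words
# On a large corpus: model.wv.most_similar(positive=["king", "woman"], negative=["man"])
```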
