Working With Text
text1 = "Ethics are built right into the ideals and objectives of the United Nations "
len(text1) # The length of text1
76
text2 = text1.split(' ') # Return a list of the words in text1, separating by ' '.
len(text2)
14
text2
['Ethics', 'are', 'built', 'right', 'into', 'the', 'ideals', 'and', 'objectives', 'of', 'the', 'United', 'Nations', '']
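The empty string at the end of the list comes from the trailing space in text1: split(' ') treats every single space as a separator. A minimal sketch of the difference, using the same sentence (the variable names words_with_empty and words are only illustrative):

```python
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

# split(' ') splits on every single space, so the trailing space
# leaves an empty string at the end of the list.
words_with_empty = text1.split(' ')

# split() with no argument splits on runs of whitespace and
# drops empty strings at the start and end.
words = text1.split()

print(len(words_with_empty))        # 14
print(len(words))                   # 13
print(words_with_empty[-1] == '')   # True
```

If the empty token is not wanted, split() without an argument is usually the simpler choice.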
List comprehensions let us pick out specific words:
[w for w in text2 if len(w) > 3] # Words in text2 that are longer than 3 letters
['Ethics', 'built', 'right', 'into', 'ideals', 'objectives', 'United', 'Nations']
[w for w in text2 if w.istitle()] # Capitalized words in text2
[w for w in text2 if w.endswith('s')] # Words in text2 that end in 's'
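The notes do not show the results of the last two filters. As a small sketch, rebuilding text2 from the same sentence, they can be run like this (the names capitalized and ends_in_s are illustrative):

```python
text1 = "Ethics are built right into the ideals and objectives of the United Nations "
text2 = text1.split(' ')

# istitle() is True when the word starts with an uppercase letter
# and the remaining letters are lowercase.
capitalized = [w for w in text2 if w.istitle()]

# endswith('s') checks only the final character.
ends_in_s = [w for w in text2 if w.endswith('s')]

print(capitalized)  # ['Ethics', 'United', 'Nations']
print(ends_in_s)    # ['Ethics', 'ideals', 'objectives', 'Nations']
```

Note that istitle() is stricter than "starts with a capital": an all-caps token such as 'UNSG' does not count as title case.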
We can find unique words using set().
text3 = 'To be or not to be'
text4 = text3.split(' ')
len(text4)
6
len(set(text4))
5
set(text4)
{'To', 'be', 'not', 'or', 'to'}
len(set([w.lower() for w in text4])) # .lower converts the string to lowercase.
4
set([w.lower() for w in text4])
{'be', 'not', 'or', 'to'}
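The same case-insensitive set can also be written as a set comprehension, which avoids building the intermediate list. A minimal sketch (unique_lower is an illustrative name):

```python
text3 = 'To be or not to be'
text4 = text3.split(' ')

# A set comprehension lowercases each word and de-duplicates in one step.
unique_lower = {w.lower() for w in text4}

print(len(unique_lower))     # 4
print(sorted(unique_lower))  # ['be', 'not', 'or', 'to']
```

Sets are unordered, so sorted() is used here only to get a stable display order.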
Processing free-text
text5 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text6 = text5.split(' ')
text6
['"Ethics', 'are', 'built', 'right', 'into', 'the', 'ideals', 'and', 'objectives', 'of', 'the', 'United', 'Nations"', '#UNSG', '@', 'NY', 'Society', 'for', 'Ethical', 'Culture', 'bit.ly/2guVelr']
Finding hashtags:
[w for w in text6 if w.startswith('#')]
['#UNSG']
Finding callouts:
[w for w in text6 if w.startswith('@')]
['@']
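The bare '@' is returned because startswith('@') only checks the first character. One simple way to drop it, sketched here under the assumption that a real hashtag or callout needs at least one character after the symbol, is to add a length check:

```python
text5 = ('"Ethics are built right into the ideals and objectives of the United Nations" '
         '#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr')
text6 = text5.split(' ')

# Require at least one character after the leading symbol.
hashtags = [w for w in text6 if w.startswith('#') and len(w) > 1]
callouts = [w for w in text6 if w.startswith('@') and len(w) > 1]

print(hashtags)  # ['#UNSG']
print(callouts)  # []
```

This still accepts tokens such as '@!', so it is only a rough filter.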
Matching on startswith('@') also returns the bare '@', which is not a callout. Regular expressions help with more complex parsing.
For example, '@[A-Za-z0-9_]+' matches any word containing '@' followed by one or more characters, each of which is:
- a capital letter ('A-Z')
- a lowercase letter ('a-z')
- a digit ('0-9')
- or an underscore ('_')
text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text8 = text7.split(' ')
import re # module that provides support for regular expressions
[w for w in text8 if re.search(r'@[A-Za-z0-9_]+', w)]
['@UN', '@UN_Women']
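Two details of this approach may be worth spelling out. re.findall can scan the whole string directly, so the split step is optional, and re.search matches anywhere inside a token, while re.match only matches at the start. A sketch; the token 'someone@example' is a made-up example, not part of the original tweet:

```python
import re

text7 = ('@UN @UN_Women "Ethics are built right into the ideals and objectives '
         'of the United Nations" #UNSG @ NY Society for Ethical Culture bit.ly/2guVelr')

# findall scans the full string and returns every non-overlapping match.
callouts = re.findall(r'@[A-Za-z0-9_]+', text7)
print(callouts)  # ['@UN', '@UN_Women']

# Hypothetical tokens to contrast search (anywhere) with match (start only).
tokens = ['@UN', 'someone@example', '@']
anywhere = [w for w in tokens if re.search(r'@[A-Za-z0-9_]+', w)]
anchored = [w for w in tokens if re.match(r'@[A-Za-z0-9_]+', w)]
print(anywhere)  # ['@UN', 'someone@example']
print(anchored)  # ['@UN']
```

When only tokens that begin with '@' should count, re.match (or a pattern starting with '^') is the safer choice.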
Working with Text Data in pandas
import pandas as pd
time_sentences = ["Monday: The doctor's appointment is at 2:45pm.",
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]
df = pd.DataFrame(time_sentences, columns=['text'])
df
# find the number of characters for each string in df['text']
df['text'].str.len()
0 46
1 50
2 49
3 49
4 54
Name: text, dtype: int64
# find the number of tokens for each string in df['text']
df['text'].str.split().str.len()
0 7
1 8
2 8
3 10
4 10
Name: text, dtype: int64
# find which entries contain the word 'appointment'
df['text'].str.contains('appointment')
0 True
1 True
2 False
3 False
4 False
Name: text, dtype: bool
# find how many times a digit occurs in each string
df['text'].str.count(r'\d')
0 3
1 4
2 3
3 4
4 8
Name: text, dtype: int64
# find all occurrences of the digits
df['text'].str.findall(r'\d')
0 [2, 4, 5]
1 [1, 1, 3, 0]
2 [7, 0, 0]
3 [1, 1, 1, 5]
4 [0, 8, 1, 0, 0, 9, 0, 0]
Name: text, dtype: object
# group and find the hours and minutes
df['text'].str.findall(r'(\d?\d):(\d\d)')
0 [(2, 45)]
1 [(11, 30)]
2 [(7, 00)]
3 [(11, 15)]
4 [(08, 10), (09, 00)]
Name: text, dtype: object
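When the pattern contains groups, each match comes back as a tuple of strings, one entry per group. The sketch below uses plain re.findall on the Friday sentence to show this, then converts the strings to integers (the names pairs and times are illustrative):

```python
import re

sentence = "Friday: Take the train at 08:10 am, arrive at 09:00am."

# Each match is a tuple: (hour, minute), both still strings.
pairs = re.findall(r'(\d?\d):(\d\d)', sentence)
print(pairs)  # [('08', '10'), ('09', '00')]

# Convert to integers if numeric comparison or arithmetic is needed.
times = [(int(h), int(m)) for h, m in pairs]
print(times)  # [(8, 10), (9, 0)]
```

Keeping the values as strings preserves leading zeros; converting to int drops them.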
# replace weekdays with '???'
df['text'].str.replace(r'\w+day\b', '???', regex=True) # regex=True is required in pandas 2.0+, where regex defaults to False
0 ???: The doctor's appointment is at 2:45pm.
1 ???: The dentist's appointment is at 11:30 am.
2 ???: At 7:00pm, there is a basketball game!
3 ???: Be back home by 11:15 pm at the latest.
4 ???: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object
# replace weekdays with 3-letter abbreviations (a callable replacement requires regex=True)
df['text'].str.replace(r'(\w+day\b)', lambda x: x.groups()[0][:3], regex=True)
0 Mon: The doctor's appointment is at 2:45pm.
1 Tue: The dentist's appointment is at 11:30 am.
2 Wed: At 7:00pm, there is a basketball game!
3 Thu: Be back home by 11:15 pm at the latest.
4 Fri: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object
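The callable passed as the replacement receives a regular-expression match object for each match and returns the replacement string. A sketch of the same idea with plain re.sub on a single sentence; the helper name abbreviate is hypothetical:

```python
import re

sentence = "Monday: The doctor's appointment is at 2:45pm."

def abbreviate(match):
    # match.group(1) is the text captured by (\w+day\b), e.g. 'Monday'.
    return match.group(1)[:3]

shortened = re.sub(r'(\w+day\b)', abbreviate, sentence)
print(shortened)  # Mon: The doctor's appointment is at 2:45pm.
```

Note that the pattern matches any word ending in 'day', so words such as 'today' or 'holiday' would be shortened as well.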
# create new columns from first match of extracted groups
df['text'].str.extract(r'(\d?\d):(\d\d)')
    0   1
0   2  45
1  11  30
2   7  00
3  11  15
4  08  10
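Named groups give the extracted columns readable names instead of 0 and 1, and the values, which are extracted as strings, can then be converted to integers. A minimal sketch, rebuilding df so that it runs on its own (the column names hour and minute are chosen here for illustration):

```python
import pandas as pd

time_sentences = ["Monday: The doctor's appointment is at 2:45pm.",
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]
df = pd.DataFrame(time_sentences, columns=['text'])

# extract() keeps only the first match per row; named groups become column names.
first_time = df['text'].str.extract(r'(?P<hour>\d?\d):(?P<minute>\d\d)')

# Every row has a match here, so the string columns convert cleanly to int.
first_time = first_time.astype(int)

print(list(first_time.columns))       # ['hour', 'minute']
print(first_time['hour'].tolist())    # [2, 11, 7, 11, 8]
print(first_time['minute'].tolist())  # [45, 30, 0, 15, 10]
```

If some rows had no match, extract() would return missing values for them and the astype(int) step would need to handle that first.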
# extract the entire time, the hours, the minutes, and the period
df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')
# extract the entire time, the hours, the minutes, and the period with group names
df['text'].str.extractall(r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))')
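The notes do not show the result of extractall. Unlike extract, it returns one row per match, indexed by the original row label plus a 'match' level. A sketch that rebuilds df and inspects the result (the names pattern and matches are illustrative):

```python
import pandas as pd

time_sentences = ["Monday: The doctor's appointment is at 2:45pm.",
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]
df = pd.DataFrame(time_sentences, columns=['text'])

pattern = r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))'
matches = df['text'].str.extractall(pattern)

# One row per match: the Friday sentence contributes two rows.
print(matches.shape)                   # (6, 4)
print(list(matches.columns))           # ['time', 'hour', 'minute', 'period']
print(matches.loc[4]['time'].tolist()) # ['08:10 am', '09:00am']
```

The second index level numbers the matches within each original row, starting at 0.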