I have a list of words that I want to find, shown below:
word_list = ['one','three']
I also have a text column in a pandas DataFrame:
TEXT                                       |
-------------------------------------------|
"Perhaps she'll be the one for me."        |
"Is it two or one?"                        |
"Mayhaps it be three afterall..."          |
"Three times and it's a charm."            |
"One fish, two fish, red fish, blue fish." |
"There's only one cat in the hat."         |
"One does not simply code into pandas."    |
"Two nights later..."                      |
"Quoth the Raven... nevermore."            |
The desired output is shown below: it keeps the original text column and extracts any word from word_list into a new column.
TEXT                                       | EXTRACT
-------------------------------------------|---------------
"Perhaps she'll be the one for me."        | one
"Is it two or one?"                        | one
"Mayhaps it be three afterall..."          | three
"Three times and it's a charm."            | three
"One fish, two fish, red fish, blue fish." | one
"There's only one cat in the hat."         | one
"One does not simply code into pandas."    | one
"Two nights later..."                      |
"Quoth the Raven... nevermore."            |
Is there a way to do this in Python 2.7?
Use str.extract:
import re

df['EXTRACT'] = df.TEXT.str.extract('({})'.format('|'.join(word_list)),
                                    flags=re.IGNORECASE, expand=False).str.lower().fillna('')
df['EXTRACT']
0 one
1 one
2 three
3 three
4 one
5 one
6 one
7
8
Name: EXTRACT, dtype: object
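Each word in word_list is joined with the regex alternation operator |, wrapped in a capture group, and then passed to str.extract for pattern matching. str.extract returns only the first match in each row.
The re.IGNORECASE flag is turned on for case-insensitive matching, and the resulting matches are lowercased to match your expected output. Rows with no match come back as NaN, which fillna('') replaces with an empty string.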
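For reference, here is a self-contained sketch of the same approach that rebuilds the sample DataFrame from the question. It adds two things that are not in the answer above and are assumptions on my part: \b word boundaries, so that "one" would not match inside a longer word such as "someone", and re.escape, in case an entry in word_list ever contains regex metacharacters. Drop either one if you want plain substring matching.

```python
import re
import pandas as pd

word_list = ['one', 'three']

# Sample data copied from the question.
df = pd.DataFrame({'TEXT': [
    "Perhaps she'll be the one for me.",
    "Is it two or one?",
    "Mayhaps it be three afterall...",
    "Three times and it's a charm.",
    "One fish, two fish, red fish, blue fish.",
    "There's only one cat in the hat.",
    "One does not simply code into pandas.",
    "Two nights later...",
    "Quoth the Raven... nevermore.",
]})

# Assumption: whole-word matches are wanted, hence the \b anchors.
# re.escape protects against special characters in the word list.
pattern = r'\b({})\b'.format('|'.join(re.escape(w) for w in word_list))

df['EXTRACT'] = (
    df['TEXT']
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)  # first match per row, NaN if none
    .str.lower()   # normalise "One" / "Three" to lowercase
    .fillna('')    # empty string where nothing matched
)

print(df['EXTRACT'].tolist())
```

With the sample data this gives the same EXTRACT column as the original answer; the word boundaries only make a difference on text where a listed word appears inside another word.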