I am trying to count specific phrases in a string I built in Python. I have been able to make a list of individual words, but never anything involving two-word phrases. I just want to create a list in which each item is a two-word phrase.
import pandas as pd
import numpy as np
import re
import collections
import plotly.express as px
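# sep is a read_csv option and has no meaning for an Excel file, so it is dropped here;
# read_excel already returns a DataFrame, so no extra pd.DataFrame() wrap is needed
df = pd.read_excel("Datasets/realDonaldTrumprecent2020.xlsx",
                   names=["Tweet_ID", "Date", "Text"])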
df.head()
tweets = df["Text"]
# join with a space so the last word of one tweet is not glued to the first word of the next
raw_string = ' '.join(tweets)
no_links = re.sub(r'http\S+', '', raw_string)
no_unicode = re.sub(r"\\[a-z][a-z]?[0-9]+", '', no_links)
no_special_characters = re.sub('[^A-Za-z ]+', '', no_unicode)
no_capital_letters = no_special_characters.lower()
# split() with no argument also drops the empty strings left by repeated spaces
words_list = no_capital_letters.split()
phrases = ['fake news', 'lamestream media', 'sleepy joe', 'radical left', 'rigged election']
I was able to get a list of the individual words, but I want a list of the instances where the phrases show up. Is there a way to do this?
Pandas provides some nice tools for doing these things.
For example, if your DataFrame was as follows:
import pandas as pd
df = pd.DataFrame({'text': [
    'Encyclopedia Britannica is FAKE NEWS!',
    'What does Sleepy Joe read? Webster\'s Dictionary? Fake News!',
    'Sesame Street is lamestream media by radical leftist Big Bird!!!',
    '1788 was a rigged election! Landslide for King George! Fake News',
]})
...you could select tweets containing the phrase 'fake news' like so:
selector = df.text.str.lower().str.contains('fake news')
This produces the following Series of booleans:
0     True
1     True
2    False
3     True
Name: text, dtype: bool
You can count how many are True with sum:
selector.sum()
And you can use it to index the DataFrame to get an array of the matching tweets:
df.text[selector].values
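To cover every phrase from the question rather than just 'fake news', the same idea can be wrapped in a loop. This is only a sketch on the sample DataFrame above; the names `lowered`, `tweets_per_phrase` and `occurrences` are made up for illustration. Note that `Series.str.count` treats its argument as a regular expression, so the phrase is passed through `re.escape` first.

```python
import re
import pandas as pd

# Sample data mirroring the example DataFrame above
df = pd.DataFrame({'text': [
    'Encyclopedia Britannica is FAKE NEWS!',
    "What does Sleepy Joe read? Webster's Dictionary? Fake News!",
    'Sesame Street is lamestream media by radical leftist Big Bird!!!',
    '1788 was a rigged election! Landslide for King George! Fake News',
]})
phrases = ['fake news', 'lamestream media', 'sleepy joe', 'radical left', 'rigged election']

lowered = df['text'].str.lower()

# Number of tweets that contain each phrase at least once
# (regex=False treats the phrase as literal text)
tweets_per_phrase = {p: int(lowered.str.contains(p, regex=False).sum()) for p in phrases}

# Total number of occurrences of each phrase across all tweets
occurrences = {p: int(lowered.str.count(re.escape(p)).sum()) for p in phrases}

print(tweets_per_phrase)
print(occurrences)
```

Both are plain substring matches, so 'radical left' also matches inside 'radical leftist'.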
If you are trying to count the number of times those phrases appear in the text, count them in the cleaned string rather than in words_list, whose items are single words and so can never contain a two-word phrase:
for phrase in phrases:
    count = no_capital_letters.count(phrase)
    print(phrase, count)
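Since the question asks for a list whose items are two words each, another option is to build every pair of adjacent words (bigrams) from the word list and count those. The sketch below assumes the text has already been cleaned and lowercased as in the question; the `cleaned` string is a made-up stand-in for `no_capital_letters`, and `bigrams`, `bigram_counts` and `phrase_counts` are illustrative names.

```python
from collections import Counter

# Hypothetical stand-in for the cleaned, lowercased text from the question
cleaned = ("fake news again the lamestream media and more "
           "fake news from the radical left")
phrases = ['fake news', 'lamestream media', 'sleepy joe', 'radical left', 'rigged election']

words = cleaned.split()

# Pair each word with the one that follows it: a list of two-word items
bigrams = [' '.join(pair) for pair in zip(words, words[1:])]

# Count every bigram once, then look up only the phrases of interest
bigram_counts = Counter(bigrams)
phrase_counts = {p: bigram_counts[p] for p in phrases}

print(bigrams[:3])
print(phrase_counts)
```

Unlike a substring count, this matches whole words only, so 'radical leftist' would not be counted as 'radical left'.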
In terms of "a list of instances where the phrases show up", you can slightly modify the above for loop:
phrase_list = []
for phrase in phrases:
    for tweet in tweets:
        # the phrases are lowercase, so lowercase the tweet before checking
        if phrase in tweet.lower():
            phrase_list.append(tweet)
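If it is useful to know which phrase each tweet matched, a variation is to keep one list per phrase in a dictionary. This is a minimal sketch: the sample `tweets` list is invented for illustration and stands in for `df["Text"]`, and the check is `phrase in tweet.lower()` so that capitalized tweets still match the lowercase phrases.

```python
# Hypothetical sample tweets standing in for df["Text"]
tweets = [
    'Encyclopedia Britannica is FAKE NEWS!',
    'Sleepy Joe reads the lamestream media. Fake News!',
    'Nothing to see here.',
]
phrases = ['fake news', 'lamestream media', 'sleepy joe', 'radical left', 'rigged election']

# Map each phrase to the list of tweets that contain it (case-insensitive)
matches = {
    phrase: [tweet for tweet in tweets if phrase in tweet.lower()]
    for phrase in phrases
}

for phrase, found in matches.items():
    print(phrase, len(found))
```

A tweet that contains several phrases appears once under each of them, and phrases with no matches map to an empty list.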