Counting Specific Phrases Using Python

I am trying to count specific phrases in a string I built in Python. I have been able to make a list of individual words, but never anything involving two-word phrases. I just want to create a list in which each item is a two-word phrase.

import pandas as pd
import numpy as np
import re
import collections
import plotly.express as px

# read_excel has no sep parameter (that belongs to read_csv), so it is omitted here
df = pd.read_excel("Datasets/realDonaldTrumprecent2020.xlsx",
                   names=["Tweet_ID", "Date", "Text"])

df = pd.DataFrame(df)
df.head()

tweets = df["Text"]

# Join with a space so the last word of one tweet does not fuse with the first word of the next
raw_string = ' '.join(tweets)
no_links = re.sub(r'http\S+', '', raw_string)
no_unicode = re.sub(r"\\[a-z][a-z]?[0-9]+", '', no_links)
no_special_characters = re.sub('[^A-Za-z ]+', '', no_unicode)
no_capital_letters = no_special_characters.lower()

words_list = no_capital_letters.split(" ")

phrases = ['fake news', 'lamestream media', 'sleepy joe', 'radical left', 'rigged election']

I was initially able to get a list of just the individual words, but I want a list of the instances where the phrases show up. Is there a way to do this?

Pandas provides some nice tools for doing these things.

For example, if your DataFrame was as follows:

import pandas as pd

df = pd.DataFrame({'text': [
    'Encyclopedia Britannica is FAKE NEWS!',
    'What does Sleepy Joe read? Webster\'s Dictionary? Fake News!',
    'Sesame Street is lamestream media by radical leftist Big Bird!!!',
    '1788 was a rigged election! Landslide for King George! Fake News',
]})

...you could select tweets containing the phrase 'fake news' like so:

selector = df.text.str.lower().str.contains('fake news')

This produces the following Series of booleans:

0     True
1     True
2    False
3     True
Name: text, dtype: bool

You can count how many are True with sum:

sum(selector)

You can also use it to index the DataFrame to get an array of the matching tweets:

df.text[selector].values
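
The same approach can be extended to all of the phrases at once. The following is a minimal sketch, assuming the sample DataFrame above and the phrase list from the question; it uses Series.str.count, which takes a regular expression, so each phrase is passed through re.escape. Note that this is plain substring matching, so 'radical left' is also counted inside 'radical leftist'.

```python
import re
import pandas as pd

df = pd.DataFrame({'text': [
    "Encyclopedia Britannica is FAKE NEWS!",
    "What does Sleepy Joe read? Webster's Dictionary? Fake News!",
    "Sesame Street is lamestream media by radical leftist Big Bird!!!",
    "1788 was a rigged election! Landslide for King George! Fake News",
]})

phrases = ['fake news', 'lamestream media', 'sleepy joe',
           'radical left', 'rigged election']

# Lowercase once so matching is case-insensitive
lowered = df['text'].str.lower()

# For each phrase, count occurrences in every tweet and add them up
counts = {
    phrase: int(lowered.str.count(re.escape(phrase)).sum())
    for phrase in phrases
}

for phrase, count in counts.items():
    print(phrase, count)
```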

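If you are trying to count the number of times those phrases appear in the text, the following code should work. It counts in the cleaned string rather than in words_list, because a two-word phrase can never match a single word:

for phrase in phrases:
    # str.count returns the number of non-overlapping occurrences of the phrase
    count = no_capital_letters.count(phrase)
    print(phrase, count)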
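
If whole-word matches are wanted, one alternative is to count pairs of adjacent words from the word list. This is only a sketch: bigram_counts is a hypothetical helper name and the sample string is made up for illustration. It assumes every phrase is exactly two words, as in the question.

```python
from collections import Counter

def bigram_counts(words):
    """Count every pair of adjacent words in a list of tokens."""
    # Drop empty strings, which split(" ") produces for repeated spaces
    words = [w for w in words if w]
    return Counter(" ".join(pair) for pair in zip(words, words[1:]))

# Made-up sample text standing in for the cleaned, lowercased string
sample = "fake news again  the radical leftist media spreads fake news"
counts = bigram_counts(sample.split(" "))

for phrase in ['fake news', 'radical left']:
    # A Counter returns 0 for keys it has never seen
    print(phrase, counts[phrase])
```

Unlike substring counting, this does not count 'radical left' inside 'radical leftist', because the bigram there is 'radical leftist'.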

In terms of "a list of instances where the phrases show up", you should be able to slightly modify the above for loop:

phrase_list = []
for phrase in phrases:
    for tweet in tweets:
        # Test whether the phrase occurs in the tweet (case-insensitively), not the reverse
        if phrase in tweet.lower():
            phrase_list.append(tweet)
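
A flat list loses track of which phrase each tweet matched. As a hedged sketch, the loop could instead build a dictionary keyed by phrase; tweets_by_phrase is a hypothetical helper name, and the sample tweets are taken from the example DataFrame in the other answer. Word boundaries (\b) are an added assumption here, so that 'radical left' does not match 'radical leftist'.

```python
import re

def tweets_by_phrase(tweets, phrases):
    """Map each phrase to the list of tweets that contain it as whole words."""
    result = {}
    for phrase in phrases:
        # \b anchors the match at word boundaries; IGNORECASE handles capitals
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        result[phrase] = [tweet for tweet in tweets if pattern.search(tweet)]
    return result

tweets = [
    "Encyclopedia Britannica is FAKE NEWS!",
    "What does Sleepy Joe read? Webster's Dictionary? Fake News!",
    "Sesame Street is lamestream media by radical leftist Big Bird!!!",
    "1788 was a rigged election! Landslide for King George! Fake News",
]
phrases = ['fake news', 'lamestream media', 'sleepy joe',
           'radical left', 'rigged election']

result = tweets_by_phrase(tweets, phrases)
for phrase, matches in result.items():
    print(phrase, len(matches))
```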
