How do I filter data from an xlsx file based on keywords in a sentence using Python?
I scraped some online data using Twitter scraper. I know I can filter this fairly easily in Excel, and I did export the data to an xlsx file, but I want to filter it using Python. The data I scraped contains "Hurricane Dorian". I also want to filter out everything that does not include the word "Bahamas". How would I do this?
Thank you!
from twitterscraper import query_tweets
import datetime as dt
import pandas as pd
begin_date = dt.date(2019, 7, 1)
end_date = dt.date(2019, 9, 9)
limit = 1000
lang = 'english'
tweets = query_tweets('Hurricane Dorian', begindate = begin_date, enddate = end_date, limit = limit, lang = lang)
df = pd.DataFrame(t.__dict__ for t in tweets)
# to_excel writes the file and returns None, so there is nothing to assign
df.to_excel(r'C:\Users\victo\Desktop\HurricaneData.xlsx', index=False, header=True)
You can use the str functions in pandas to filter. See the pandas documentation on indexing. Here's the specific code for the question you posted:
from twitterscraper import query_tweets
import datetime as dt
import pandas as pd
begin_date = dt.date(2019, 7, 1)
end_date = dt.date(2019, 9, 9)
limit = 1000
lang = 'english'
tweets = query_tweets(
    'Hurricane Dorian',
    begindate=begin_date,
    enddate=end_date,
    limit=limit,
    lang=lang
)
# Convert to dataframe
df = pd.DataFrame(t.__dict__ for t in tweets)
# make a boolean mask
filt = df['text'].str.contains('Bahamas')
# compare the lengths of the dataframes
print(df.shape)
print(df.loc[filt].shape)
You can see that the unfiltered df has 340 rows. Restricting it to rows whose text contains 'Bahamas' reduces it to 55 rows.
(340, 16)
(55, 16)
To keep only the rows where the mask is True, reassign df using the filter:
df = df.loc[filt]
Or you could assign the result to a new dataframe if you want to preserve the original raw data.
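As a rough sketch of that second option, the snippet below keeps the raw data and puts the matches in a separate dataframe. The sample rows and the names raw, mask and bahamas are made up for illustration; with the exported file you could presumably load the data with pd.read_excel instead. It also passes case=False and na=False to str.contains, which is an optional extra beyond the answer above, so that capitalization differences and missing text do not cause trouble.

```python
import pandas as pd

# Hypothetical sample rows standing in for the scraped tweets;
# with the real file you could load it via pd.read_excel(path) instead.
raw = pd.DataFrame({
    "text": [
        "Hurricane Dorian hits the Bahamas",
        "Dorian moving toward Florida",
        "Relief efforts in the BAHAMAS continue",
        None,  # a row with missing tweet text
    ]
})

# case=False matches "Bahamas" in any capitalization;
# na=False treats missing text as "no match" instead of propagating NaN.
mask = raw["text"].str.contains("Bahamas", case=False, na=False)

# Keep the matches in a new dataframe so the raw data stays untouched.
bahamas = raw.loc[mask].copy()

print(raw.shape)
print(bahamas.shape)
```

Taking a .copy() of the filtered rows means later changes to bahamas cannot affect raw, and it avoids pandas warnings about modifying a slice.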