
How do I filter data from an xlsx file based on key words in a sentence using Python?

I scraped some online data using Twitter scraper. I know I can filter this fairly easily in Excel, and I did export the data to an xlsx file, but I want to filter using Python. The data I scraped contains tweets about Hurricane Dorian. I want to filter out every tweet that does not include the word "Bahamas". How would I do this?

Thank you!

from twitterscraper import query_tweets
import datetime as dt
import pandas as pd

begin_date = dt.date(2019, 7, 1)
end_date = dt.date(2019, 9, 9)

limit = 1000
lang = 'english'

tweets = query_tweets('Hurricane Dorian', begindate = begin_date, enddate = end_date, limit = limit, lang = lang)

df = pd.DataFrame(t.__dict__ for t in tweets)

# to_excel returns None, so there is nothing useful to assign
df.to_excel(r'C:\Users\victo\Desktop\HurricaneData.xlsx', index=False, header=True)

You can use the str functions in pandas to filter. See the pandas help on indexing. Here's the specific answer (code) for your posted question:

from twitterscraper import query_tweets 
import datetime as dt 
import pandas as pd

begin_date = dt.date(2019, 7, 1) 
end_date = dt.date(2019, 9, 9)

limit = 1000 
lang = 'english'

tweets = query_tweets(
    'Hurricane Dorian', 
    begindate = begin_date, 
    enddate = end_date, 
    limit = limit, 
    lang = lang
)

# Convert to dataframe
df = pd.DataFrame(t.__dict__ for t in tweets)

# make a boolean mask
filt = df['text'].str.contains('Bahamas')

# compare the lengths of the dataframes
print(df.shape)
print(df.loc[filt].shape)

You can see the unfiltered df has 340 rows. Restricting it to rows where the text contains 'Bahamas' reduces it to 55 rows.

(340, 16)
(55, 16)
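As a side note, the same counts can be read off the boolean mask directly, since True counts as 1 when summed. The sketch below uses a small made-up DataFrame in place of the scraped tweets, so the numbers are not the ones from the run above; only the 'text' column name is taken from the answer.

```python
import pandas as pd

# Toy data standing in for the scraped tweets (made up for illustration)
df = pd.DataFrame({
    "text": [
        "Hurricane Dorian hits the Bahamas",
        "Dorian moving toward Florida",
        "Relief efforts in Bahamas continue",
    ]
})

# Same kind of boolean mask as in the answer
filt = df["text"].str.contains("Bahamas")

kept = int(filt.sum())        # rows where the mask is True
dropped = int((~filt).sum())  # rows where the mask is False (~ inverts the mask)
print(kept, dropped)
```

The inverted mask `~filt` is also handy for inspecting the rows that would be thrown away, for example `df.loc[~filt]`.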

To keep only the rows where the mask is True, reassign df using the filter:

df = df.loc[filt]

Or you could assign it to a new dataframe if you want to preserve the original raw data.
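A minimal sketch of the "new dataframe" option, under a few assumptions that go beyond the answer: the helper name `filter_by_keyword` is hypothetical, the match is made case-insensitive with `case=False`, missing text is treated as "no match" with `na=False`, and `regex=False` makes the keyword a plain substring. The sample rows are made up.

```python
import pandas as pd

def filter_by_keyword(frame, keyword, column="text"):
    """Return a new DataFrame with only the rows whose `column` contains `keyword`.

    Hypothetical helper, not part of pandas or twitterscraper.
    """
    mask = frame[column].str.contains(keyword, case=False, na=False, regex=False)
    # .copy() so later edits to the result do not touch the raw data
    return frame.loc[mask].copy()

# Made-up rows, including a missing value and a differently cased match
raw = pd.DataFrame({
    "text": [
        "Dorian nears the BAHAMAS",
        "Stay safe, Florida",
        None,
        "Praying for the Bahamas",
    ]
})

bahamas = filter_by_keyword(raw, "Bahamas")
print(len(raw), len(bahamas))
```

The raw frame is left untouched, so both can be kept side by side. If the tweets were already exported, the same helper could presumably be applied to a frame loaded with `pd.read_excel`, and the result written back with `to_excel`.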
