
Searching for all matches in texts with Pandas

I have a list of particular words ('tokens') and need to find all of them (if any are present) in plain texts. I prefer to use pandas to load the text and perform the search. I'm using pandas because my collection of short texts is timestamped, and it is quite easy to organise these short texts in a single data structure such as a pandas DataFrame.

For example:

Consider a collection of fetched tweets loaded into pandas:

                                              twitts
0                       today is a great day for BMW
1                    prices of german cars increased
2             Japan introduced a new model of Toyota
3  German car makers, such as BMW, Audi and VW mo...

and a list of car makers:

list_of_car_makers = ['BMW', 'Audi','Mercedes','Toyota','Honda', 'VW']

Ideally, I need to get the following data frame:

                                              twitts  cars_mentioned
0                       today is a great day for BMW  [BMW]
1                    prices of german cars increased  []
2             Japan introduced a new model of Toyota  [Toyota]
3  German car makers, such as BMW, Audi and VW mo...  [BMW, Audi, VW]

I'm very new to NLP and text mining methods, and I have read a lot of material on the topic on the internet. My guess is that I can use regex with re.findall(), but then I need to iterate over the list of tokens (car makers) for the entire dataframe.

Are there more succinct ways of doing this simple task, especially with pandas?

You can use the pandas .str methods, in particular .findall:

df['cars_mentioned'] = df['twitts'].str.findall('|'.join(list_of_car_makers))
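A plain '|'.join pattern matches the names anywhere, including inside longer words, and treats any regex metacharacters in a name as syntax. The following is a minimal sketch, assuming whole-word matches are wanted; the escaping and the \b word boundaries are additions for illustration, not part of the original answer, and the sample frame is rebuilt from the question's example.

```python
import re
import pandas as pd

list_of_car_makers = ['BMW', 'Audi', 'Mercedes', 'Toyota', 'Honda', 'VW']

# Sample data rebuilt from the question (the last tweet is truncated there too).
df = pd.DataFrame({'twitts': [
    'today is a great day for BMW',
    'prices of german cars increased',
    'Japan introduced a new model of Toyota',
    'German car makers, such as BMW, Audi and VW mo...',
]})

# Assumption: only whole-word matches count, so each name is escaped
# and the alternation is wrapped in word boundaries.
pattern = r'\b(?:' + '|'.join(re.escape(m) for m in list_of_car_makers) + r')\b'

# One list of matches per row, in the order they appear in the text.
df['cars_mentioned'] = df['twitts'].str.findall(pattern)
print(df['cars_mentioned'].tolist())
```

The group is non-capturing on purpose: with a capturing group, findall would return the group contents rather than the full match, which happens to be the same here but is easy to get wrong once the pattern grows.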

Use apply on the column (pandas.Series.apply):

df['cars_mentioned'] = df['twitts'].apply(lambda x: [c for c in list_of_car_makers if c in x])
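The .str.findall and apply approaches give the same result on the question's data, but they are not quite equivalent in general. A small sketch of the difference on a single string, using a made-up tweet (hypothetical, not from the question): findall reports every occurrence in text order, while the membership test reports each maker at most once, in list order.

```python
import re

list_of_car_makers = ['BMW', 'Audi', 'Mercedes', 'Toyota', 'Honda', 'VW']

# Hypothetical tweet that mentions one maker twice.
text = 'VW and BMW, then BMW again'

# Regex alternation: every occurrence, in the order it appears in the text.
by_findall = re.findall('|'.join(list_of_car_makers), text)

# Substring membership: each maker at most once, in the order of the list.
by_membership = [c for c in list_of_car_makers if c in text]

print(by_findall)     # ['VW', 'BMW', 'BMW']
print(by_membership)  # ['BMW', 'VW']
```

Which behaviour is preferable depends on whether repeated mentions should be counted.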

You can use re.findall and filter.

list(filter((lambda x: re.findall(x, twitt)), list_of_car_makers))

Python sample:

import re

list_of_car_makers = ['BMW', 'Audi', 'Mercedes', 'Toyota', 'Honda', 'VW']

def cars_mentioned(twitt):
    # Keep each maker whose name matches somewhere in the text.
    return list(filter(lambda x: re.findall(x, twitt), list_of_car_makers))

cars_mentioned('German car makers, such as BMW, Audi and VW mo...')
# ['BMW', 'Audi', 'VW']
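To get the column the question asks for, a helper like this can be passed to apply. The sketch below is one possible variant rather than the answer's exact code: it swaps re.findall for re.search (only a yes/no answer is needed per maker) and escapes each name, and it rebuilds the question's sample frame so it runs on its own.

```python
import re
import pandas as pd

list_of_car_makers = ['BMW', 'Audi', 'Mercedes', 'Toyota', 'Honda', 'VW']

def cars_mentioned(twitt):
    # re.search stops at the first hit; re.escape treats each name literally.
    return [m for m in list_of_car_makers if re.search(re.escape(m), twitt)]

# Sample data rebuilt from the question (the last tweet is truncated there too).
df = pd.DataFrame({'twitts': [
    'today is a great day for BMW',
    'prices of german cars increased',
    'Japan introduced a new model of Toyota',
    'German car makers, such as BMW, Audi and VW mo...',
]})

# Call the helper once per row and store the resulting lists.
df['cars_mentioned'] = df['twitts'].apply(cars_mentioned)
print(df['cars_mentioned'].tolist())
```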
