
Searching for words in a CSV file column with str.contains

I have the following CSV file:

start_date,end_date,pollster,sponsor,sample_size,population,party,subject,tracking,text,approve,disapprove,url
    2020-02-02,2020-02-04,YouGov,Economist,1500,a,all,Trump,FALSE,Do you approve or disapprove of Donald Trump’s handling of the coronavirus outbreak?,42,29,https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/73jqd6u5mv/econTabReport.pdf
    2020-02-02,2020-02-04,YouGov,Economist,376,a,R,Trump,FALSE,Do you approve or disapprove of Donald Trump’s handling of the coronavirus outbreak?,75,6,https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/73jqd6u5mv/econTabReport.pdf
    2020-02-02,2020-02-04,YouGov,Economist,523,a,D,Trump,TRUE,Do you approve or disapprove of Donald Trump’s handling of the coronavirus outbreak?,21,51,https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/73jqd6u5mv/econTabReport.pdf
    2020-02-02,2020-02-04,YouGov,Economist,599,a,I,Trump,,Do you approve or disapprove of Donald Trump’s handling of the coronavirus outbreak?,39,25,https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/73jqd6u5mv/econTabReport.pdf
    2020-02-07,2020-02-09,Morning Consult,"",2200,a,all,Trump,TRUE,Do you approve or disapprove of the job each of the following is doing in handling the spread of coronavirus in the United States? President Donald Trump,57,22,https://morningconsult.com/wp-content/uploads/2020/02/200214_crosstabs_CORONAVIRUS_Adults_v4_JB.pdf

I want to find all the rows where the "text" column contains both the word "Trump" and the word "coronavirus".

I am using str.contains():

approval_polls[approval_polls.text.str.contains("Trump", "coronavirus")]

It seems that I am getting the correct output, but I am not sure whether str.contains() can take two words as parameters.

Can anyone help me with that?

Output:

start_date  end_date    pollster    sponsor     sample_size     population  party   subject     tracking    text    approve     disapprove  url
0   2020-02-02  2020-02-04  YouGov  Economist   1500.0  a   all     Trump   FALSE   Do you approve or disapprove of Donald Trump’s...   42.0    29.0    https://d25d2506sfb94s.cloudfront.net/cumulus_...
1   2020-02-02  2020-02-04  YouGov  Economist   376.0   a   R   Trump   FALSE   Do you approve or disapprove of Donald Trump’s...   75.0    6.0     https://d25d2506sfb94s.cloudfront.net/cumulus_...
2   2020-02-02  2020-02-04  YouGov  Economist   523.0   a   D   Trump   FALSE   Do you approve or disapprove of Donald Trump’s...   21.0    51.0    https://d25d2506sfb94s.cloudfront.net/cumulus_...

In your example, all rows contain both keywords, so you should get all five rows returned.

The call contains('Trump', 'coronavirus') does not search for two words. The signature is str.contains(pat, case=True, flags=0, na=None, regex=True), so the second positional argument is taken as the case flag. The non-empty string 'coronavirus' is truthy, which leaves you with an ordinary case-sensitive search for 'Trump' alone. To get only the rows that contain 'Trump' AND 'coronavirus', combine two boolean masks with &:

df[df['text'].str.contains('Trump') & df['text'].str.contains('coronavirus')] 
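As a minimal sketch of the mask-combination approach above, the following builds a small, made-up DataFrame (the rows are hypothetical stand-ins for the question's data, not the real poll file) and filters it with one mask per keyword. Passing na=False is an addition of mine, on the assumption that the column might contain empty cells; regex=False simply treats each keyword as a literal substring.

```python
import pandas as pd

# Hypothetical sample data standing in for the "text" column of approval_polls.
df = pd.DataFrame({
    "text": [
        "Do you approve of Donald Trump's handling of the coronavirus outbreak?",
        "Do you approve of the way Donald Trump is handling his job?",
        "How concerned are you about the coronavirus outbreak?",
        None,  # a missing value, as read_csv would produce for an empty cell
    ]
})

# One boolean mask per keyword; na=False counts missing text as "no match".
has_trump = df["text"].str.contains("Trump", regex=False, na=False)
has_corona = df["text"].str.contains("coronavirus", regex=False, na=False)

both = df[has_trump & has_corona]    # rows containing both words (AND)
either = df[has_trump | has_corona]  # rows containing at least one word (OR)

print(both.index.tolist())
print(either.index.tolist())
```

Keeping the masks in named variables makes it easy to switch between & (AND) and | (OR) without touching the search itself.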

Or you could use a regular expression with lookaheads, e.g.,

df[df['text'].str.contains(r'^(?=.*Trump)(?=.*coronavirus)')]

You can use a regex to do this, though it takes a small workaround, since regex has no direct "AND" operator. A pair of lookaheads fills that role:

import re

approval_polls[approval_polls.text.str.contains('(?=.*trump)(?=.*coronavirus)', regex=True, flags=re.IGNORECASE)]
