
Searching for words in a CSV file column with str.contains

I have the following csv file:

start_date,end_date,pollster,sponsor,sample_size,population,party,subject,tracking,text,approve,disapprove,url
    2020-02-02,2020-02-04,YouGov,Economist,1500,a,all,Trump,FALSE,Do you approve or disapprove of Donald Trump’s handling of the coronavirus outbreak?,42,29,https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/73jqd6u5mv/econTabReport.pdf
    2020-02-02,2020-02-04,YouGov,Economist,376,a,R,Trump,FALSE,Do you approve or disapprove of Donald Trump’s handling of the coronavirus outbreak?,75,6,https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/73jqd6u5mv/econTabReport.pdf
    2020-02-02,2020-02-04,YouGov,Economist,523,a,D,Trump,TRUE,Do you approve or disapprove of Donald Trump’s handling of the coronavirus outbreak?,21,51,https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/73jqd6u5mv/econTabReport.pdf
    2020-02-02,2020-02-04,YouGov,Economist,599,a,I,Trump,,Do you approve or disapprove of Donald Trump’s handling of the coronavirus outbreak?,39,25,https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/73jqd6u5mv/econTabReport.pdf
    2020-02-07,2020-02-09,Morning Consult,"",2200,a,all,Trump,TURE,Do you approve or disapprove of the job each of the following is doing in handling the spread of coronavirus in the United States? President Donald Trump,57,22,https://morningconsult.com/wp-content/uploads/2020/02/200214_crosstabs_CORONAVIRUS_Adults_v4_JB.pdf

I want to find all the rows where the "text" column contains both of the words "Trump" and "coronavirus".

I am using str.contains():

approval_polls[approval_polls.text.str.contains("Trump", "coronavirus")]

It seems that I am getting the correct output, but I am not sure whether str.contains() can take two words as parameters.

Can anyone help me with that?

Output:

start_date  end_date    pollster    sponsor     sample_size     population  party   subject     tracking    text    approve     disapprove  url
0   2020-02-02  2020-02-04  YouGov  Economist   1500.0  a   all     Trump   FALSE   Do you approve or disapprove of Donald Trump’s...   42.0    29.0    https://d25d2506sfb94s.cloudfront.net/cumulus_...
1   2020-02-02  2020-02-04  YouGov  Economist   376.0   a   R   Trump   FALSE   Do you approve or disapprove of Donald Trump’s...   75.0    6.0     https://d25d2506sfb94s.cloudfront.net/cumulus_...
2   2020-02-02  2020-02-04  YouGov  Economist   523.0   a   D   Trump   FALSE   Do you approve or disapprove of Donald Trump’s...   21.0    51.0    https://d25d2506sfb94s.cloudfront.net/cumulus_...

In your example case all rows contain both keywords, so you should get all five rows returned.

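The call contains('Trump', 'coronavirus') does not search for both words. The second positional parameter of str.contains is case, not a second pattern, so 'coronavirus' is never searched for and the call only finds rows that have 'Trump' in the text column. To get only the rows that contain 'Trump' AND 'coronavirus', you can use the following: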

df[df['text'].str.contains('Trump') & df['text'].str.contains('coronavirus')] 

Or you could use a regular expression with two lookaheads, e.g.:

df[df['text'].str.contains(r'^(?=.*Trump)(?=.*coronavirus)')]
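To check that the two approaches agree, here is a minimal sketch on a small made-up DataFrame. The rows are hypothetical and are not the poll data from the question. The column is created with dtype=object on the assumption that the pattern should be handled by Python's re module, which supports lookaheads.

```python
import pandas as pd

# Hypothetical sample data, not the poll file from the question.
# dtype=object keeps the values as plain Python strings.
df = pd.DataFrame(
    {
        "text": [
            "Trump and the coronavirus outbreak",
            "Trump on the economy",
            "coronavirus cases are rising",
            "unrelated text",
        ]
    },
    dtype=object,
)

# Approach 1: two boolean masks combined with & (element-wise AND).
both_mask = df["text"].str.contains("Trump") & df["text"].str.contains("coronavirus")

# Approach 2: one regex with two lookaheads, each checked from the start of the string.
both_regex = df["text"].str.contains(r"^(?=.*Trump)(?=.*coronavirus)")

# For comparison: a single pattern only tests for that one word.
trump_only = df["text"].str.contains("Trump")

print(df[both_mask])
print(df[both_regex])
print(df[trump_only])
```

Both AND variants should select only the first row, while the single-pattern search also keeps the row that mentions Trump without coronavirus.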

You can use a regex to do this, but it takes a small workaround, since regex has no direct "AND" operator. Chaining lookaheads gives the same effect, and passing re.IGNORECASE makes the match case-insensitive:

import re

approval_polls[approval_polls.text.str.contains('(?=.*trump)(?=.*coronavirus)', regex=True, flags=re.IGNORECASE)]
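If the list of keywords is longer, or the text column may contain missing values, the same idea can be written without a regex. The sketch below is only one possible way to do it: contains_all is a hypothetical helper name, and the sample strings are made up. It relies on the case, regex and na parameters of str.contains.

```python
import pandas as pd


def contains_all(series, words, case=False):
    """Return a boolean mask that is True where the text contains every word."""
    mask = pd.Series(True, index=series.index)
    for word in words:
        # regex=False treats the word as a literal substring;
        # na=False turns missing text into False instead of NaN.
        mask = mask & series.str.contains(word, case=case, regex=False, na=False)
    return mask


# Hypothetical sample data, including one missing value.
texts = pd.Series(
    ["Donald TRUMP and Coronavirus", "trump only", None, "coronavirus only"],
    dtype=object,
)

mask = contains_all(texts, ["trump", "coronavirus"])
print(texts[mask])
```

With na=False the mask contains only True and False, so it can be used directly for boolean indexing, e.g. approval_polls[contains_all(approval_polls.text, ["trump", "coronavirus"])].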
