Searching for words in a CSV file column with str.contains
I have the following CSV file:
start_date,end_date,pollster,sponsor,sample_size,population,party,subject,tracking,text,approve,disapprove,url
2020-02-02,2020-02-04,YouGov,Economist,1500,a,all,Trump,FALSE,Do you approve or disapprove of Donald Trump’s handling of the coronavirus outbreak?,42,29,https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/73jqd6u5mv/econTabReport.pdf
2020-02-02,2020-02-04,YouGov,Economist,376,a,R,Trump,FALSE,Do you approve or disapprove of Donald Trump’s handling of the coronavirus outbreak?,75,6,https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/73jqd6u5mv/econTabReport.pdf
2020-02-02,2020-02-04,YouGov,Economist,523,a,D,Trump,TRUE,Do you approve or disapprove of Donald Trump’s handling of the coronavirus outbreak?,21,51,https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/73jqd6u5mv/econTabReport.pdf
2020-02-02,2020-02-04,YouGov,Economist,599,a,I,Trump,,Do you approve or disapprove of Donald Trump’s handling of the coronavirus outbreak?,39,25,https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/73jqd6u5mv/econTabReport.pdf
2020-02-07,2020-02-09,Morning Consult,"",2200,a,all,Trump,TURE,Do you approve or disapprove of the job each of the following is doing in handling the spread of coronavirus in the United States? President Donald Trump,57,22,https://morningconsult.com/wp-content/uploads/2020/02/200214_crosstabs_CORONAVIRUS_Adults_v4_JB.pdf
I want to find all the rows where the "text" column contains both the word "Trump" and the word "coronavirus".
I am using str.contains():
approval_polls[approval_polls.text.str.contains("Trump", "coronavirus")]
It seems I am getting the correct output, but I am not sure whether str.contains() can take two words as parameters.
Can anyone help me with that?
Output:
start_date end_date pollster sponsor sample_size population party subject tracking text approve disapprove url
0 2020-02-02 2020-02-04 YouGov Economist 1500.0 a all Trump FALSE Do you approve or disapprove of Donald Trump’s... 42.0 29.0 https://d25d2506sfb94s.cloudfront.net/cumulus_...
1 2020-02-02 2020-02-04 YouGov Economist 376.0 a R Trump FALSE Do you approve or disapprove of Donald Trump’s... 75.0 6.0 https://d25d2506sfb94s.cloudfront.net/cumulus_...
2 2020-02-02 2020-02-04 YouGov Economist 523.0 a D Trump FALSE Do you approve or disapprove of Donald Trump’s... 21.0 51.0 https://d25d2506sfb94s.cloudfront.net/cumulus_...
In your example, every row contains both keywords, so you should get all five rows returned.
Note, however, that the call contains('Trump', 'coronavirus') does not search for both words. The second positional argument of str.contains is case, so 'coronavirus' is treated as a truthy case flag and you get all rows that have 'Trump' (matched case-sensitively) in the text column, whether or not they mention 'coronavirus'.
To get only the rows that contain 'Trump' AND 'coronavirus', you can use the following:
df[df['text'].str.contains('Trump') & df['text'].str.contains('coronavirus')]
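The sketch below checks both behaviors on a small made-up frame (the three text values are hypothetical, not taken from the poll file): the combined masks, and the two-argument call from the question, whose second positional argument lands in the case parameter of str.contains.

```python
import pandas as pd

# Hypothetical rows: both words, only "Trump", only "coronavirus".
df = pd.DataFrame({
    "text": [
        "Donald Trump's handling of the coronavirus outbreak",
        "Donald Trump's handling of the economy",
        "Concern about the coronavirus outbreak",
    ]
})

has_trump = df["text"].str.contains("Trump")
has_virus = df["text"].str.contains("coronavirus")

# Element-wise AND of the two boolean masks keeps rows containing both words.
both = df[has_trump & has_virus]

# Two-argument call: "coronavirus" fills the `case` parameter (truthy),
# so this is a case-sensitive search for "Trump" alone.
two_arg = df["text"].str.contains("Trump", "coronavirus")

print(both.index.tolist())
print(two_arg.tolist())
```

On this frame only the first row survives the AND filter, while the two-argument call also flags the second row, which never mentions the virus.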
Or you could use a regular expression, e.g.:
df[df['text'].str.contains(r'^(?=.*Trump)(?=.*coronavirus)')]
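To see why that pattern behaves like an AND, here is a small sketch with the standard re module, outside pandas. Each (?=...) is a lookahead that must succeed from the start of the string without consuming any text, so both words are required and their order does not matter. The sample strings are made up for illustration.

```python
import re

# Each lookahead scans ahead from position 0 for its word; both must succeed.
pattern = re.compile(r'^(?=.*Trump)(?=.*coronavirus)')

samples = [
    "Trump on the coronavirus outbreak",   # both words, Trump first
    "coronavirus response under Trump",    # both words, reversed order
    "Trump on the economy",                # only Trump
    "coronavirus outbreak spreads",        # only coronavirus
]

# str.contains with regex=True performs a search like this on each value.
matches = [bool(pattern.search(s)) for s in samples]
print(matches)
```

One caveat: by default . does not match a newline, so a multi-line text value would need the re.DOTALL flag for the lookaheads to see past the first line.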
You can use a regex to do this, but it has to be crafted somewhat, since regex has no direct equivalent of an "AND" operator. Lookaheads can stand in for it, and the IGNORECASE flag makes the match case-insensitive:
import re
approval_polls[approval_polls.text.str.contains('(?=.*trump)(?=.*coronavirus)', regex=True, flags=re.IGNORECASE)]
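If the list of words may grow, the same lookahead idea can be wrapped in a small helper. This is only a sketch: contains_all and its ignore_case parameter are hypothetical names, not part of pandas, and re.escape is added on the assumption that the words should be matched literally rather than as regex fragments.

```python
import re
import pandas as pd

def contains_all(series, words, ignore_case=True):
    """Return a boolean mask that is True where every word occurs in the string."""
    # One lookahead per word; re.escape keeps characters like "." literal.
    pattern = "".join(f"(?=.*{re.escape(word)})" for word in words)
    flags = re.IGNORECASE if ignore_case else 0
    # na=False treats missing text as "no match" instead of propagating NaN.
    return series.str.contains(pattern, regex=True, flags=flags, na=False)

# Made-up values for illustration, including a missing entry.
texts = pd.Series(["TRUMP on Coronavirus", "trump only", None])
mask = contains_all(texts, ["trump", "coronavirus"])
print(mask.tolist())
```

With the question's frame this would be used as approval_polls[contains_all(approval_polls.text, ["trump", "coronavirus"])].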