Searching for words in a CSV file column with str.contains
I have the following CSV file:
start_date,end_date,pollster,sponsor,sample_size,population,party,subject,tracking,text,approve,disapprove,url
2020-02-02,2020-02-04,YouGov,Economist,1500,a,all,Trump,FALSE,Do you approve or disapprove of Donald Trump’s handling of the coronavirus outbreak?,42,29,https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/73jqd6u5mv/econTabReport.pdf
2020-02-02,2020-02-04,YouGov,Economist,376,a,R,Trump,FALSE,Do you approve or disapprove of Donald Trump’s handling of the coronavirus outbreak?,75,6,https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/73jqd6u5mv/econTabReport.pdf
2020-02-02,2020-02-04,YouGov,Economist,523,a,D,Trump,TRUE,Do you approve or disapprove of Donald Trump’s handling of the coronavirus outbreak?,21,51,https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/73jqd6u5mv/econTabReport.pdf
2020-02-02,2020-02-04,YouGov,Economist,599,a,I,Trump,,Do you approve or disapprove of Donald Trump’s handling of the coronavirus outbreak?,39,25,https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/73jqd6u5mv/econTabReport.pdf
2020-02-07,2020-02-09,Morning Consult,"",2200,a,all,Trump,TURE,Do you approve or disapprove of the job each of the following is doing in handling the spread of coronavirus in the United States? President Donald Trump,57,22,https://morningconsult.com/wp-content/uploads/2020/02/200214_crosstabs_CORONAVIRUS_Adults_v4_JB.pdf
I want to find all rows whose text column contains both of the words "Trump" and "coronavirus".
I am using str.contains():
approval_polls[approval_polls.text.str.contains("Trump", "coronavirus")]
It looks like I get the correct output, but I am not sure whether str.contains() can take two words as arguments.
Can anyone help me?
Output:
start_date end_date pollster sponsor sample_size population party subject tracking text approve disapprove url
0 2020-02-02 2020-02-04 YouGov Economist 1500.0 a all Trump FALSE Do you approve or disapprove of Donald Trump’s... 42.0 29.0 https://d25d2506sfb94s.cloudfront.net/cumulus_...
1 2020-02-02 2020-02-04 YouGov Economist 376.0 a R Trump FALSE Do you approve or disapprove of Donald Trump’s... 75.0 6.0 https://d25d2506sfb94s.cloudfront.net/cumulus_...
2 2020-02-02 2020-02-04 YouGov Economist 523.0 a D Trump FALSE Do you approve or disapprove of Donald Trump’s... 21.0 51.0 https://d25d2506sfb94s.cloudfront.net/cumulus_...
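In your example, every row contains both keywords, so you should get all five rows back.
Note that the call contains('Trump', 'coronavirus') does not search for two words. The second positional parameter of str.contains is case, not a second pattern, so 'coronavirus' is only treated as a truthy value for case and the call tests for "Trump" alone. To get only the rows whose text column contains both "Trump" and "coronavirus", you can combine two masks: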
df[df['text'].str.contains('Trump') & df['text'].str.contains('coronavirus')]
Alternatively, you can use a regular expression, for example:
df[df['text'].str.contains(r'^(?=.*Trump)(?=.*coronavirus)')]
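The two approaches above can be checked against each other on a small example. This is only a minimal sketch: the three-row DataFrame is made up for illustration and is not the asker's CSV, and dtype=object is an assumption chosen so that Python's re engine handles the lookaheads (an Arrow-backed string column may use a regex engine without lookahead support).

```python
import pandas as pd

# Hypothetical toy data, not the asker's CSV.
# dtype=object keeps plain Python strings, so Python's `re` module is used.
df = pd.DataFrame(
    {
        "text": [
            "Trump's handling of the coronavirus outbreak",
            "Trump's handling of the economy",
            "Spread of coronavirus in the United States",
        ]
    },
    dtype=object,
)

# Approach 1: two boolean masks combined with & (element-wise AND).
both_masks = df["text"].str.contains("Trump") & df["text"].str.contains("coronavirus")

# Approach 2: one regex with a lookahead per required word.
both_regex = df["text"].str.contains(r"^(?=.*Trump)(?=.*coronavirus)")

# Only the first row mentions both words.
print(df[both_masks])
print(both_masks.equals(both_regex))
```

The mask version is usually easier to read and debug; the lookahead version keeps everything in a single pattern, which can be convenient when the pattern is built elsewhere.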
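You can do this with a regular expression, but it takes a slightly elaborate construction, because regular expressions have no direct "AND" operator. Chaining one lookahead per word has the same effect: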
import re
approval_polls[approval_polls.text.str.contains('(?=.*trump)(?=.*coronavirus)', regex=True, flags=re.IGNORECASE)]
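The same lookahead idea extends to any number of keywords. The sketch below is illustrative only: contains_all is a hypothetical helper name, not a pandas function, and the sample strings are invented. It escapes each word with re.escape so that characters such as "." or "(" in a keyword are matched literally.

```python
import re

import pandas as pd


def contains_all(series, words, ignore_case=True):
    """Return a boolean mask that is True where every word occurs in the string.

    Hypothetical helper for illustration; not part of pandas.
    """
    flags = re.IGNORECASE if ignore_case else 0
    # One lookahead per word; re.escape makes each word match literally.
    pattern = "^" + "".join(f"(?=.*{re.escape(word)})" for word in words)
    # na=False treats missing values as "no match" instead of propagating NaN.
    return series.str.contains(pattern, regex=True, flags=flags, na=False)


# Invented sample data (object dtype so Python's `re` engine is used).
texts = pd.Series(
    [
        "Donald Trump and the coronavirus",
        "CORONAVIRUS response by TRUMP",
        "coronavirus only",
        None,
    ],
    dtype=object,
)

mask = contains_all(texts, ["trump", "coronavirus"])
print(texts[mask])
```

With the asker's data this would be called as approval_polls[contains_all(approval_polls.text, ["trump", "coronavirus"])], assuming the DataFrame is named approval_polls as in the question.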