Filter pandas dataframe records based on condition with multiple quantifier regex

I am trying to filter some records from a pandas DataFrame. The DataFrame, named 'df', consists of two columns, Sl.No. and doc_id (which contains URLs), and looks like this:

df

Sl.No.                        doc_id  
1.            https://www.durangoherald.com/articles/ship-owners-sought-co2-exemption-when-the-sea-gets-too-wavy/
2.            https://allafrica.com/stories/202206100634.html
3.            https://www.sfgate.com/news/article/Ship-owners-sought-CO2-exemption-when-the-sea-17233413.php
4.            https://impakter.com/goldrush-for-fossil-fuels/
5.            https://www.streetinsider.com/Business+Wire/EDF+Renewables+North+America+Awarded+Three+Contracts+totaling+1+Gigawatt+of+Solar+++Storage+in+New+York/20201448.html
6.            https://markets.financialcontent.com/stocks/article/nnwire-2022-6-10-greenenergybreaks-fuelpositive-corporation-tsxv-nhhh-otcqb-nhhhf-addressing-price-volatility-supply-uncertainty-with-flagship-product
7.            https://www.breitbart.com:443/europe/2022/06/10/climate-crazy-bojo-ignored-calls-to-cut-green-taxes-to-ease-cost-of-living-pressures
8.            https://news.yahoo.com/ship-owners-sought-co2-exemption-171700544.html
9.            https://www.mychesco.com/a/news/regional/active-world-club-lists-new-carbon-credits-crypto-currency-token-carbon-coin
10.           http://www.msn.com/en-nz/health/nutrition/scientists-reveal-plans-to-make-plant-based-cheese-out-of-yellow-peas/ar-AAYjzHx
11.           https://www.chron.com/news/article/Ship-owners-sought-CO2-exemption-when-the-sea-17233413.php|https://wtmj.com/national/2022/06/10/ship-owners-sought-co2-exemption-when-the-sea-gets-too-wavy/

I want to filter a few records from the above DataFrame. The needed URLs are in a list. I used the following steps to subset the DataFrame:

  needed_url = ['https://impakter.com/goldrush-for-fossil-fuels/',
                'https://www.streetinsider.com/Business+Wire/EDF+Renewables+North+America+Awarded+Three+Contracts+totaling+1+Gigawatt+of+Solar+++Storage+in+New+York/20201448.html',
                'https://markets.financialcontent.com/stocks/article/nnwire-2022-6-10-greenenergybreaks-fuelpositive-corporation-tsxv-nhhh-otcqb-nhhhf-addressing-price-volatility-supply-uncertainty-with-flagship-product']

  df[df.doc_id.str.contains('|'.join(needed_url),na=False, regex=True)]

But it raises an error:

  error: multiple repeat at position 417

I presume it is due to the repeated quantifier '+++' in the following URL:

  https://www.streetinsider.com/Business+Wire/EDF+Renewables+North+America+Awarded+Three+Contracts+totaling+1+Gigawatt+of+Solar+++Storage+in+New+York/20201448.html 

I tried to escape the '+++' with re.escape(), but no luck. I am new to regex, so any help solving this would be appreciated. The objective is to filter the DataFrame based on the matching URLs in the list. Thanks in advance.
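For reference, the error can be reproduced with the standard re module alone, without pandas. The sketch below is only an illustration: the names risky_url, other_url, and pattern are made up here, and the question does not show how re.escape() was applied. One way that attempt can fail is escaping the already joined string, which also escapes the '|' separators; escaping each URL on its own before joining avoids that.

```python
import re

risky_url = (
    "https://www.streetinsider.com/Business+Wire/EDF+Renewables+North+America"
    "+Awarded+Three+Contracts+totaling+1+Gigawatt+of+Solar+++Storage+in+New+York"
    "/20201448.html"
)
other_url = "https://impakter.com/goldrush-for-fossil-fuels/"

# Unescaped, "+++" is parsed as stacked quantifiers, so compiling fails.
try:
    re.compile("|".join([other_url, risky_url]))
    compiled_unescaped = True
except re.error as exc:
    compiled_unescaped = False
    print("unescaped pattern failed:", exc)

# Escape each URL separately, then join with "|" so the pipe still means "or".
pattern = re.compile("|".join(re.escape(u) for u in [other_url, risky_url]))

matches_risky = pattern.search(risky_url) is not None
matches_unrelated = (
    pattern.search("https://allafrica.com/stories/202206100634.html") is not None
)
print(matches_risky, matches_unrelated)
```

The reported error position depends on the full joined string and on the Python version, so it will not necessarily be 417 here.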

The pandas isin function takes a list as input and searches for its values in the specified column.

  print(df)
  print(df[df['col2'].isin(needed_url)])

Output of print(df):

  col1                                               col2
0    1  https://www.durangoherald.com/articles/ship-ow...
1    2    https://allafrica.com/stories/202206100634.html
2    3  https://www.sfgate.com/news/article/Ship-owner...
3    4    https://impakter.com/goldrush-for-fossil-fuels/
4    5  https://www.streetinsider.com/Business+Wire/ED...
5    6  https://markets.financialcontent.com/stocks/ar...

Filtered output:

   col1                                               col2
5    6  https://markets.financialcontent.com/stocks/ar...
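As a self-contained check of the isin approach, here is a minimal sketch. It uses the question's column name doc_id rather than col2, and a hand-built three-row frame standing in for the real data; both are assumptions for illustration. Because isin compares whole cell values, no regex is involved and the '+++' in the URL causes no trouble.

```python
import pandas as pd

# Small stand-in for the question's DataFrame (rows picked for illustration).
df = pd.DataFrame({
    "Sl.No.": [2, 4, 5],
    "doc_id": [
        "https://allafrica.com/stories/202206100634.html",
        "https://impakter.com/goldrush-for-fossil-fuels/",
        "https://www.streetinsider.com/Business+Wire/EDF+Renewables+North+America"
        "+Awarded+Three+Contracts+totaling+1+Gigawatt+of+Solar+++Storage+in+New+York"
        "/20201448.html",
    ],
})

needed_url = [
    "https://impakter.com/goldrush-for-fossil-fuels/",
    "https://www.streetinsider.com/Business+Wire/EDF+Renewables+North+America"
    "+Awarded+Three+Contracts+totaling+1+Gigawatt+of+Solar+++Storage+in+New+York"
    "/20201448.html",
]

# Keep rows whose doc_id is exactly equal to one of the needed URLs.
filtered = df[df["doc_id"].isin(needed_url)]
print(filtered["Sl.No."].tolist())  # [4, 5]
```

One caveat: a cell that holds two URLs separated by '|' (like row 11 in the question) is not equal to either URL, so isin would skip it.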

I didn't quite understand why you joined the needed URLs with '|'. These lines returned the needed URLs for me:

mask = lambda x: x in needed_url
df[df.doc_id.apply(mask)]
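On the '|' question: in a regular expression '|' means "or", so the joined string matches any cell that contains one of the URLs. A rough sketch of that route, with each URL passed through re.escape first, is below. The sample rows and the second entry of needed_url are made up for illustration and are not from the original list.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "doc_id": [
        "https://allafrica.com/stories/202206100634.html",
        "https://www.streetinsider.com/Business+Wire/EDF+Renewables+North+America"
        "+Awarded+Three+Contracts+totaling+1+Gigawatt+of+Solar+++Storage+in+New+York"
        "/20201448.html",
        # One cell holding two URLs separated by "|", as in row 11 of the question.
        "https://www.chron.com/news/article/Ship-owners-sought-CO2-exemption-when-the-sea-17233413.php"
        "|https://wtmj.com/national/2022/06/10/ship-owners-sought-co2-exemption-when-the-sea-gets-too-wavy/",
        None,
    ]
})

needed_url = [
    "https://www.streetinsider.com/Business+Wire/EDF+Renewables+North+America"
    "+Awarded+Three+Contracts+totaling+1+Gigawatt+of+Solar+++Storage+in+New+York"
    "/20201448.html",
    # Hypothetical extra entry, chosen to show the multi-URL cell case.
    "https://wtmj.com/national/2022/06/10/ship-owners-sought-co2-exemption-when-the-sea-gets-too-wavy/",
]

# Escape every URL so "+", "." and similar characters are matched literally.
pattern = "|".join(re.escape(u) for u in needed_url)

contains_mask = df["doc_id"].str.contains(pattern, na=False, regex=True)
exact_mask = df["doc_id"].isin(needed_url)

print(df.index[contains_mask].tolist())  # [1, 2]
print(df.index[exact_mask].tolist())     # [1]
```

The substring match also picks up the cell that holds several URLs, which an exact-match test such as isin or the lambda above does not. Which behavior is wanted depends on the data.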
