Filter pandas dataframe records based on condition with multiple quantifier regex

I am trying to filter some records from a pandas dataframe. The dataframe, named 'df', consists of two columns, Sl.No. and doc_id (which contains URLs), as follows:

df

Sl.No.                        doc_id  
1.            https://www.durangoherald.com/articles/ship-owners-sought-co2-exemption-when-the-sea-gets-too-wavy/
2.            https://allafrica.com/stories/202206100634.html
3.            https://www.sfgate.com/news/article/Ship-owners-sought-CO2-exemption-when-the-sea-17233413.php
4.            https://impakter.com/goldrush-for-fossil-fuels/
5.            https://www.streetinsider.com/Business+Wire/EDF+Renewables+North+America+Awarded+Three+Contracts+totaling+1+Gigawatt+of+Solar+++Storage+in+New+York/20201448.html
6.            https://markets.financialcontent.com/stocks/article/nnwire-2022-6-10-greenenergybreaks-fuelpositive-corporation-tsxv-nhhh-otcqb-nhhhf-addressing-price-volatility-supply-uncertainty-with-flagship-product
7.            https://www.breitbart.com:443/europe/2022/06/10/climate-crazy-bojo-ignored-calls-to-cut-green-taxes-to-ease-cost-of-living-pressures
8.            https://news.yahoo.com/ship-owners-sought-co2-exemption-171700544.html
9.            https://www.mychesco.com/a/news/regional/active-world-club-lists-new-carbon-credits-crypto-currency-token-carbon-coin
10.           http://www.msn.com/en-nz/health/nutrition/scientists-reveal-plans-to-make-plant-based-cheese-out-of-yellow-peas/ar-AAYjzHx
11.           https://www.chron.com/news/article/Ship-owners-sought-CO2-exemption-when-the-sea-17233413.php|https://wtmj.com/national/2022/06/10/ship-owners-sought-co2-exemption-when-the-sea-gets-too-wavy/

I want to filter a few records from the above dataframe. The needed urls are in a list. I have used the following process to subset the dataframe.

   needed_url = ['https://impakter.com/goldrush-for-fossil-fuels/',
                 'https://www.streetinsider.com/Business+Wire/EDF+Renewables+North+America+Awarded+Three+Contracts+totaling+1+Gigawatt+of+Solar+++Storage+in+New+York/20201448.html',
                 'https://markets.financialcontent.com/stocks/article/nnwire-2022-6-10-greenenergybreaks-fuelpositive-corporation-tsxv-nhhh-otcqb-nhhhf-addressing-price-volatility-supply-uncertainty-with-flagship-product']

  df[df.doc_id.str.contains('|'.join(needed_url),na=False, regex=True)]

But it raises an error:

  error: multiple repeat at position 417

I presume it is due to the repeated quantifier '+++' in the following URL:

  https://www.streetinsider.com/Business+Wire/EDF+Renewables+North+America+Awarded+Three+Contracts+totaling+1+Gigawatt+of+Solar+++Storage+in+New+York/20201448.html 

I tried escaping the '+++' with re.escape(), but with no luck. I am new to regex, so any help would be appreciated. The objective is to filter the dataframe down to the rows whose URL matches one in the list. Thanks in anticipation.
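A minimal sketch of the failure and one way around it, using only the standard re module. It assumes the cause is the unescaped '+' characters, and that each URL is escaped separately before joining; calling re.escape() on the already joined string would also escape the '|' separators, which is one possible reason the attempt described above did not help. The names raw_pattern, safe_pattern and doc_ids are illustrative and not from the original post.

```python
import re

needed_url = [
    "https://impakter.com/goldrush-for-fossil-fuels/",
    "https://www.streetinsider.com/Business+Wire/EDF+Renewables+North+America+Awarded+Three+Contracts+totaling+1+Gigawatt+of+Solar+++Storage+in+New+York/20201448.html",
]

# Joining the raw URLs leaves '+++' in the pattern, which the regex parser rejects.
raw_pattern = "|".join(needed_url)
try:
    re.compile(raw_pattern)
    raw_compiles = True
except re.error:
    raw_compiles = False

# Escape each URL on its own, then join, so '|' is the only regex operator left.
safe_pattern = "|".join(re.escape(url) for url in needed_url)
compiled = re.compile(safe_pattern)

# A few values from the doc_id column, used here as a plain list.
doc_ids = [
    "https://allafrica.com/stories/202206100634.html",
    "https://impakter.com/goldrush-for-fossil-fuels/",
    "https://www.streetinsider.com/Business+Wire/EDF+Renewables+North+America+Awarded+Three+Contracts+totaling+1+Gigawatt+of+Solar+++Storage+in+New+York/20201448.html",
]
matches = [d for d in doc_ids if compiled.search(d)]
```

The same safe_pattern string can be passed to df.doc_id.str.contains(safe_pattern, na=False, regex=True) in place of the unescaped join.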

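The pandas isin method takes a list as input and checks each value in the specified column for an exact match against that list, so no regex is involved. (In this example the columns are named col1 and col2 rather than Sl.No. and doc_id.)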

   print(df)
   print(df[df['col2'].isin(needed_url)])

Output of print(df):

  col1                                               col2
0    1  https://www.durangoherald.com/articles/ship-ow...
1    2    https://allafrica.com/stories/202206100634.html
2    3  https://www.sfgate.com/news/article/Ship-owner...
3    4    https://impakter.com/goldrush-for-fossil-fuels/
4    5  https://www.streetinsider.com/Business+Wire/ED...
5    6  https://markets.financialcontent.com/stocks/ar...

Filtered output:

   col1                                               col2
5    6  https://markets.financialcontent.com/stocks/ar...
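One caveat worth illustrating: isin compares whole cell values, so a cell that holds two URLs joined by '|' (like row 11 in the question) would not match, whereas a substring search with an escaped pattern would. The sketch below compares the two on a small made-up frame; the example.com and example.org URLs are placeholders, not data from the question.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "doc_id": [
        "https://allafrica.com/stories/202206100634.html",
        "https://impakter.com/goldrush-for-fossil-fuels/",
        # Hypothetical cell with two URLs joined by '|', similar to row 11.
        "https://example.com/a+++b.html|https://example.org/other",
    ]
})
needed_url = [
    "https://impakter.com/goldrush-for-fossil-fuels/",
    "https://example.com/a+++b.html",
]

# Exact match: the cell must equal one of the needed URLs.
exact = df[df["doc_id"].isin(needed_url)]

# Substring match: escape every URL, then join with '|' as regex alternation.
pattern = "|".join(re.escape(u) for u in needed_url)
substring = df[df["doc_id"].str.contains(pattern, na=False, regex=True)]

print(exact.index.tolist())
print(substring.index.tolist())
```

Which of the two fits depends on whether cells with several URLs should count as matches.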

I didn't quite understand why you joined the needed urls with '|'. These lines returned the needed urls for me:

mask = lambda x: x in needed_url
df[df.doc_id.apply(mask)]
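A variation on the same idea, sketched under the assumption that some cells hold several URLs separated by '|' (as row 11 in the question does): split each cell on '|' and test every part against a set. The helper has_needed_url and the needed_url list used here are illustrative, not taken from the question.

```python
import pandas as pd

# Illustrative list of wanted URLs; a set gives fast membership tests.
needed_url = [
    "https://impakter.com/goldrush-for-fossil-fuels/",
    "https://wtmj.com/national/2022/06/10/ship-owners-sought-co2-exemption-when-the-sea-gets-too-wavy/",
]
needed = set(needed_url)

df = pd.DataFrame({"doc_id": [
    "https://allafrica.com/stories/202206100634.html",
    "https://impakter.com/goldrush-for-fossil-fuels/",
    "https://www.chron.com/news/article/Ship-owners-sought-CO2-exemption-when-the-sea-17233413.php|https://wtmj.com/national/2022/06/10/ship-owners-sought-co2-exemption-when-the-sea-gets-too-wavy/",
    None,  # missing value, to show it is skipped rather than raising
]})

def has_needed_url(cell):
    """Return True if any '|'-separated URL in the cell is in the needed set."""
    if not isinstance(cell, str):
        return False
    return any(part in needed for part in cell.split("|"))

result = df[df["doc_id"].apply(has_needed_url)]
print(result.index.tolist())
```

This keeps the comparison exact per URL, so no regex escaping is needed.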

