Filter pandas dataframe records based on condition with multiple quantifier regex

Question

I am trying to filter some records from a pandas dataframe. The dataframe, named 'df', consists of two columns, Sl.No. and doc_id (which contains URLs), as follows:

df

Sl.No.                        doc_id  
1.            https://www.durangoherald.com/articles/ship-owners-sought-co2-exemption-when-the-sea-gets-too-wavy/
2.            https://allafrica.com/stories/202206100634.html
3.            https://www.sfgate.com/news/article/Ship-owners-sought-CO2-exemption-when-the-sea-17233413.php
4.            https://impakter.com/goldrush-for-fossil-fuels/
5.            https://www.streetinsider.com/Business+Wire/EDF+Renewables+North+America+Awarded+Three+Contracts+totaling+1+Gigawatt+of+Solar+++Storage+in+New+York/20201448.html
6.            https://markets.financialcontent.com/stocks/article/nnwire-2022-6-10-greenenergybreaks-fuelpositive-corporation-tsxv-nhhh-otcqb-nhhhf-addressing-price-volatility-supply-uncertainty-with-flagship-product
7.            https://www.breitbart.com:443/europe/2022/06/10/climate-crazy-bojo-ignored-calls-to-cut-green-taxes-to-ease-cost-of-living-pressures
8.            https://news.yahoo.com/ship-owners-sought-co2-exemption-171700544.html
9.            https://www.mychesco.com/a/news/regional/active-world-club-lists-new-carbon-credits-crypto-currency-token-carbon-coin
10.           http://www.msn.com/en-nz/health/nutrition/scientists-reveal-plans-to-make-plant-based-cheese-out-of-yellow-peas/ar-AAYjzHx
11.           https://www.chron.com/news/article/Ship-owners-sought-CO2-exemption-when-the-sea-17233413.php|https://wtmj.com/national/2022/06/10/ship-owners-sought-co2-exemption-when-the-sea-gets-too-wavy/

I want to filter a few records from the above dataframe. The needed urls are in a list. I have used the following process to subset the dataframe.

    needed_url = [
        'https://impakter.com/goldrush-for-fossil-fuels/',
        'https://www.streetinsider.com/Business+Wire/EDF+Renewables+North+America+Awarded+Three+Contracts+totaling+1+Gigawatt+of+Solar+++Storage+in+New+York/20201448.html',
        'https://markets.financialcontent.com/stocks/article/nnwire-2022-6-10-greenenergybreaks-fuelpositive-corporation-tsxv-nhhh-otcqb-nhhhf-addressing-price-volatility-supply-uncertainty-with-flagship-product',
    ]

    # Join the URLs into one regex alternation and match it against doc_id
    df[df.doc_id.str.contains('|'.join(needed_url), na=False, regex=True)]

But it raises an error:

  error: multiple repeat at position 417

I presume it is due to the repeated quantifier '+++' in the following URL:

  https://www.streetinsider.com/Business+Wire/EDF+Renewables+North+America+Awarded+Three+Contracts+totaling+1+Gigawatt+of+Solar+++Storage+in+New+York/20201448.html

I have tried to escape the '+++' with re.escape(), but with no luck. I am new to regex, and any help solving this would be appreciated. The objective is to filter the dataframe down to the rows whose URL matches one in the list. Thanks in anticipation.

Answer 1

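The pandas isin method, called here on a single column, takes a list as input and keeps the rows whose value in that column exactly equals one of the list entries. No regex is involved, so the '+' characters need no escaping.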

    print(df)
    # col2 holds the URLs (doc_id in the question)
    print(df[df['col2'].isin(needed_url)])

Output of print(df):

  col1                                               col2
0    1  https://www.durangoherald.com/articles/ship-ow...
1    2    https://allafrica.com/stories/202206100634.html
2    3  https://www.sfgate.com/news/article/Ship-owner...
3    4    https://impakter.com/goldrush-for-fossil-fuels/
4    5  https://www.streetinsider.com/Business+Wire/ED...
5    6  https://markets.financialcontent.com/stocks/ar...

Filtered output:

   col1                                               col2
5    6  https://markets.financialcontent.com/stocks/ar...
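
For anyone who wants to try this without the original data, here is a minimal self-contained sketch of the same isin approach. The three-row frame is a hypothetical sample assembled from URLs in the question (not the asker's actual dataframe), and it uses the question's doc_id column name rather than col2.

```python
import pandas as pd

# Hypothetical sample: three URLs copied from the question, not the full dataframe.
df = pd.DataFrame({
    "Sl.No.": [2, 4, 5],
    "doc_id": [
        "https://allafrica.com/stories/202206100634.html",
        "https://impakter.com/goldrush-for-fossil-fuels/",
        "https://www.streetinsider.com/Business+Wire/EDF+Renewables+North+America+Awarded+Three+Contracts+totaling+1+Gigawatt+of+Solar+++Storage+in+New+York/20201448.html",
    ],
})

needed_url = [
    "https://impakter.com/goldrush-for-fossil-fuels/",
    "https://www.streetinsider.com/Business+Wire/EDF+Renewables+North+America+Awarded+Three+Contracts+totaling+1+Gigawatt+of+Solar+++Storage+in+New+York/20201448.html",
]

# isin compares whole strings, so '+' and '.' are treated as ordinary characters.
filtered = df[df["doc_id"].isin(needed_url)]
print(filtered["Sl.No."].tolist())  # [4, 5]
```

Because isin compares whole strings, a cell that holds two URLs joined by '|' (like row 11 in the question) would not match unless it is split first.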

Answer 2

I didn't quite understand why you joined the needed urls with '|'. These lines returned the needed urls for me:

    mask = lambda x: x in needed_url
    df[df.doc_id.apply(mask)]
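
If substring matching with str.contains is still wanted (for example because some cells hold several URLs joined by '|'), a sketch along the following lines may work: escape each URL separately with re.escape and only then join them with '|'. The sample frame and the example.com URL are hypothetical. One possible reason re.escape() did not seem to help is that it was applied to the already joined string, which also escapes the '|' separators; that is only a guess, since the question does not show the attempt.

```python
import re

import pandas as pd

needed_url = [
    "https://impakter.com/goldrush-for-fossil-fuels/",
    "https://www.streetinsider.com/Business+Wire/EDF+Renewables+North+America+Awarded+Three+Contracts+totaling+1+Gigawatt+of+Solar+++Storage+in+New+York/20201448.html",
]

# Hypothetical sample frame; the third row mimics a cell holding two URLs joined by "|".
df = pd.DataFrame({
    "doc_id": [
        "https://allafrica.com/stories/202206100634.html",
        needed_url[1],
        "https://example.com/other|" + needed_url[0],
        None,
    ],
})

# The raw join is not a valid pattern: "+++" is parsed as stacked quantifiers.
try:
    re.compile("|".join(needed_url))
    raw_join_compiles = True
except re.error:
    raw_join_compiles = False

# Escape each URL on its own, then join, so "|" remains the alternation operator.
pattern = "|".join(re.escape(u) for u in needed_url)
mask = df["doc_id"].str.contains(pattern, na=False, regex=True)

print(raw_join_compiles)  # False
print(mask.tolist())      # [False, True, True, False]
```

Unlike isin, this keeps any row whose doc_id contains one of the needed URLs anywhere in the string, so which of the two fits depends on whether exact or partial matches are wanted.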

solution1
1 2022-07-29 17:53:12

solution2
0 2022-07-29 17:59:17