簡體   English   中英

使用多個量詞正則表達式根據條件過濾 pandas dataframe 記錄

[英]Filter pandas dataframe records based on condition with multiple quantifier regex

我正在嘗試從 pandas dataframe 中過濾一些記錄。 名為“df”的 dataframe 由兩列 Sl.No. 組成。 和 doc_id(包含 url)如下:

df

Sl.No.                        doc_id  
1.            https://www.durangoherald.com/articles/ship-owners-sought-co2-exemption-when-the-sea-gets-too-wavy/
2.            https://allafrica.com/stories/202206100634.html
3.            https://www.sfgate.com/news/article/Ship-owners-sought-CO2-exemption-when-the-sea-17233413.php
4.            https://impakter.com/goldrush-for-fossil-fuels/
5.            https://www.streetinsider.com/Business+Wire/EDF+Renewables+North+America+Awarded+Three+Contracts+totaling+1+Gigawatt+of+Solar+++Storage+in+New+York/20201448.html
6.            https://markets.financialcontent.com/stocks/article/nnwire-2022-6-10-greenenergybreaks-fuelpositive-corporation-tsxv-nhhh-otcqb-nhhhf-addressing-price-volatility-supply-uncertainty-with-flagship-product
7.            https://www.breitbart.com:443/europe/2022/06/10/climate-crazy-bojo-ignored-calls-to-cut-green-taxes-to-ease-cost-of-living-pressures
8.            https://news.yahoo.com/ship-owners-sought-co2-exemption-171700544.html
9.            https://www.mychesco.com/a/news/regional/active-world-club-lists-new-carbon-credits-crypto-currency-token-carbon-coin
10.           http://www.msn.com/en-nz/health/nutrition/scientists-reveal-plans-to-make-plant-based-cheese-out-of-yellow-peas/ar-AAYjzHx
11.           https://www.chron.com/news/article/Ship-owners-sought-CO2-exemption-when-the-sea-17233413.php|https://wtmj.com/national/2022/06/10/ship-owners-sought-co2-exemption-when-the-sea-gets-too-wavy/

我想從上面的dataframe中篩選出幾條記錄。 所需的網址在列表中。 我使用以下過程對 dataframe 進行了子集化。

   needed_url = [https://impakter.com/goldrush-for-fossil-fuels/,   https://www.streetinsider.com/Business+Wire/EDF+Renewables+North+America+Awarded+Three+Contracts+totaling+1+Gigawatt+of+Solar+++Storage+in+New+York/20201448.html,
 https://markets.financialcontent.com/stocks/article/nnwire-2022-6-10-greenenergybreaks-fuelpositive-corporation-tsxv-nhhh-otcqb-nhhhf-addressing-price-volatility-supply-uncertainty-with-flagship-product]

  df[df.doc_id.str.contains('|'.join(needed_url),na=False, regex=True)]

但它顯示錯誤:

  error: multiple repeat at position 417

我認為這是由於以下網址中的多個量詞“+++”:

  https://www.streetinsider.com/Business+Wire/EDF+Renewables+North+America+Awarded+Three+Contracts+totaling+1+Gigawatt+of+Solar+++Storage+in+New+York/20201448.html 

我試圖通過 re.escape() 來逃避“+++”,但沒有運氣。 我是正則表達式的新手,如果可以解決它會很有幫助。 目標是根據列表中匹配的 url 過濾 dataframe。 感謝期待。

pandas dataframe isin function 將列表作為輸入並在指定列中搜索值。

   print(df)
print(df[df['col2'].isin(needed_url)])

output:df:

  col1                                               col2
0    1  https://www.durangoherald.com/articles/ship-ow...
1    2    https://allafrica.com/stories/202206100634.html
2    3  https://www.sfgate.com/news/article/Ship-owner...
3    4    https://impakter.com/goldrush-for-fossil-fuels/
4    5  https://www.streetinsider.com/Business+Wire/ED...
5    6  https://markets.financialcontent.com/stocks/ar...

格式化 output:

   col1                                               col2
5    6  https://markets.financialcontent.com/stocks/ar...

我不太明白你為什么用“|”加入所需的網址。 這些行為我返回了所需的網址:

mask = lambda x: x in needed_url
df[df.doc_id.apply(mask)]

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM