How to filter a dataframe and identify records based on a condition on multiple other columns

Filter pandas dataframe records based on a condition with a multiple-quantifier regex
I am trying to filter some records from a pandas dataframe. The dataframe, named "df", consists of two columns, Sl.No. and doc_id (containing URLs), as follows:
df
Sl.No. doc_id
1. https://www.durangoherald.com/articles/ship-owners-sought-co2-exemption-when-the-sea-gets-too-wavy/
2. https://allafrica.com/stories/202206100634.html
3. https://www.sfgate.com/news/article/Ship-owners-sought-CO2-exemption-when-the-sea-17233413.php
4. https://impakter.com/goldrush-for-fossil-fuels/
5. https://www.streetinsider.com/Business+Wire/EDF+Renewables+North+America+Awarded+Three+Contracts+totaling+1+Gigawatt+of+Solar+++Storage+in+New+York/20201448.html
6. https://markets.financialcontent.com/stocks/article/nnwire-2022-6-10-greenenergybreaks-fuelpositive-corporation-tsxv-nhhh-otcqb-nhhhf-addressing-price-volatility-supply-uncertainty-with-flagship-product
7. https://www.breitbart.com:443/europe/2022/06/10/climate-crazy-bojo-ignored-calls-to-cut-green-taxes-to-ease-cost-of-living-pressures
8. https://news.yahoo.com/ship-owners-sought-co2-exemption-171700544.html
9. https://www.mychesco.com/a/news/regional/active-world-club-lists-new-carbon-credits-crypto-currency-token-carbon-coin
10. http://www.msn.com/en-nz/health/nutrition/scientists-reveal-plans-to-make-plant-based-cheese-out-of-yellow-peas/ar-AAYjzHx
11. https://www.chron.com/news/article/Ship-owners-sought-CO2-exemption-when-the-sea-17233413.php|https://wtmj.com/national/2022/06/10/ship-owners-sought-co2-exemption-when-the-sea-gets-too-wavy/
I want to filter a few records from the dataframe above. The required URLs are in a list. I subset the dataframe with the following:
needed_url = ['https://impakter.com/goldrush-for-fossil-fuels/',
              'https://www.streetinsider.com/Business+Wire/EDF+Renewables+North+America+Awarded+Three+Contracts+totaling+1+Gigawatt+of+Solar+++Storage+in+New+York/20201448.html',
              'https://markets.financialcontent.com/stocks/article/nnwire-2022-6-10-greenenergybreaks-fuelpositive-corporation-tsxv-nhhh-otcqb-nhhhf-addressing-price-volatility-supply-uncertainty-with-flagship-product']
df[df.doc_id.str.contains('|'.join(needed_url), na=False, regex=True)]
But it raises an error:
error: multiple repeat at position 417
I think this is caused by the multiple quantifiers "+++" in the following URL:
https://www.streetinsider.com/Business+Wire/EDF+Renewables+North+America+Awarded+Three+Contracts+totaling+1+Gigawatt+of+Solar+++Storage+in+New+York/20201448.html
I tried escaping the "+++" with re.escape(), but had no luck. I am new to regular expressions, so any help resolving this would be appreciated. The goal is to filter the dataframe to the rows whose URL matches an entry in the list. Thanks in advance.
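For reference, the "multiple repeat" error comes from the unescaped `+` characters being read as regex quantifiers; escaping each list element individually before joining (rather than escaping the already-joined pattern) makes them literal. A minimal sketch using a few of the URLs above:

```python
import re
import pandas as pd

df = pd.DataFrame({
    "doc_id": [
        "https://impakter.com/goldrush-for-fossil-fuels/",
        "https://allafrica.com/stories/202206100634.html",
        "https://www.streetinsider.com/Business+Wire/EDF+Renewables+North+America+Awarded+Three+Contracts+totaling+1+Gigawatt+of+Solar+++Storage+in+New+York/20201448.html",
    ]
})
needed_url = [
    "https://impakter.com/goldrush-for-fossil-fuels/",
    "https://www.streetinsider.com/Business+Wire/EDF+Renewables+North+America+Awarded+Three+Contracts+totaling+1+Gigawatt+of+Solar+++Storage+in+New+York/20201448.html",
]

# re.escape() each URL so metacharacters like '+' match literally,
# then join with '|' into a single alternation pattern.
pattern = "|".join(re.escape(u) for u in needed_url)
filtered = df[df.doc_id.str.contains(pattern, na=False, regex=True)]
print(filtered)
```

With the escaped pattern, the two listed URLs match and the allafrica.com row is dropped.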
The pandas DataFrame isin function takes a list as input and searches for those values in the specified column.
print(df)
print(df[df['col2'].isin(needed_url)])
Output (df):
col1 col2
0 1 https://www.durangoherald.com/articles/ship-ow...
1 2 https://allafrica.com/stories/202206100634.html
2 3 https://www.sfgate.com/news/article/Ship-owner...
3 4 https://impakter.com/goldrush-for-fossil-fuels/
4 5 https://www.streetinsider.com/Business+Wire/ED...
5 6 https://markets.financialcontent.com/stocks/ar...
Filtered output:
col1 col2
5 6 https://markets.financialcontent.com/stocks/ar...
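A minimal runnable sketch of this isin approach, using a small hypothetical df with the same col1/col2 layout (note that isin matches the whole cell value exactly, unlike str.contains, which matches substrings):

```python
import pandas as pd

# Hypothetical minimal reproduction of the isin approach
df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": [
        "https://allafrica.com/stories/202206100634.html",
        "https://impakter.com/goldrush-for-fossil-fuels/",
        "https://www.durangoherald.com/articles/ship-owners-sought-co2-exemption-when-the-sea-gets-too-wavy/",
    ],
})
needed_url = ["https://impakter.com/goldrush-for-fossil-fuels/"]

# isin keeps only rows whose col2 value equals one of the list entries
filtered = df[df["col2"].isin(needed_url)]
print(filtered)
```

Because the match is exact, any URL in the list that is truncated or differs by even one character will not be returned.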
I don't quite understand why you joined the needed URLs with "|". These lines return the needed URLs for me:
mask = lambda x: x in needed_url
df[df.doc_id.apply(mask)]
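A self-contained sketch of this apply/mask approach on hypothetical data (the mask tests whole-cell membership in the list, so it is equivalent to isin, just computed row by row):

```python
import pandas as pd

# Hypothetical minimal data to demonstrate the apply/mask approach
df = pd.DataFrame({
    "doc_id": [
        "https://impakter.com/goldrush-for-fossil-fuels/",
        "https://allafrica.com/stories/202206100634.html",
    ]
})
needed_url = ["https://impakter.com/goldrush-for-fossil-fuels/"]

# True only when the entire cell value is one of the needed URLs
mask = lambda x: x in needed_url
filtered = df[df.doc_id.apply(mask)]
print(filtered)
```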