
Python regex: remove strings containing a substring

I am writing a script that scrapes a newsletter for URLs. Some URLs in the newsletter are irrelevant (e.g. links to articles, mailto links, social links, etc.). I added some logic to remove those links, but for some reason not all of them are being removed. Here is my code:

from os import remove
from turtle import clear
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

termSheet = "https://fortune.com/newsletter/termsheet"
html = requests.get(termSheet)
htmlParser = BeautifulSoup(html.text, "html.parser")
termSheetLinks = []

for companyURL in htmlParser.select("table#templateBody p > a"):
    termSheetLinks.append(companyURL.get('href'))

for link in termSheetLinks:
    if "fortune.com" in link in termSheetLinks:
        termSheetLinks.remove(link)
    if "forbes.com" in link in termSheetLinks:
        termSheetLinks.remove(link)
    if "twitter.com" in link in termSheetLinks:
        termSheetLinks.remove(link)

print(termSheetLinks)

When I ran it most recently, this was my output, despite trying to remove all links containing "fortune.com":

['https://fortune.com/company/blackstone-group?utm_source=email&utm_medium=newsletter&utm_campaign=term-sheet&utm_content=2022080907am', 'https://fortune.com/company/tpg?utm_source=email&utm_medium=newsletter&utm_campaign=term-sheet&utm_content=2022080907am', 'https://casproviders.org/asd-guidelines/', 'https://fortune.com/company/carlyle-group?utm_source=email&utm_medium=newsletter&utm_campaign=term-sheet&utm_content=2022080907am', 'https://ir.carlyle.com/static-files/433abb19-8207-4632-b173-9606698642e5', 'mailto:termsheet@fortune.com', 'https://www.afresh.com/', 'https://www.geopagos.com/', 'https://montana-renewables.com/', 'https://descarteslabs.com/', 'https://www.dealer-pay.com/', 'https://www.sequeldm.com/', 'https://pueblo-mechanical.com/', 'https://dealcloud.com/future-proof-your-firm/', 'https://apartmentdata.com/', 'https://www.irobot.com/', 'https://www.martin-bencher.com/', 'https://cell-matters.com/', 'https://www.lever.co/', 'https://www.sigulerguff.com/']

Any help would be greatly appreciated!

You do not need a regex for this, in my opinion. Instead of removing URLs from the list, append only those links that do not contain your substrings, e.g. with a list comprehension:

[companyURL.get('href') for companyURL in htmlParser.select("table#templateBody p > a") if not any(x in companyURL.get('href') for x in ["fortune.com","forbes.com","twitter.com"])]

Example

from bs4 import BeautifulSoup
import requests

termSheet = "https://fortune.com/newsletter/termsheet"
html = requests.get(termSheet)
htmlParser = BeautifulSoup(html.text, "html.parser")

myList = ["fortune.com","forbes.com","twitter.com"]
[companyURL.get('href') for companyURL in htmlParser.select("table#templateBody p > a") 
     if not any(x in companyURL.get('href') for x in myList)]

Output

['https://casproviders.org/asd-guidelines/',
 'https://ir.carlyle.com/static-files/433abb19-8207-4632-b173-9606698642e5',
 'https://www.afresh.com/',
 'https://www.geopagos.com/',
 'https://montana-renewables.com/',
 'https://descarteslabs.com/',
 'https://www.dealer-pay.com/',
 'https://www.sequeldm.com/',
 'https://pueblo-mechanical.com/',
 'https://dealcloud.com/future-proof-your-firm/',
 'https://apartmentdata.com/',
 'https://www.irobot.com/',
 'https://www.martin-bencher.com/',
 'https://cell-matters.com/',
 'https://www.lever.co/',
 'https://www.sigulerguff.com/']
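
Since the question title mentions a regex: one is not required here, but if you prefer that route, a minimal sketch (using a hypothetical urls list standing in for the scraped href values) could build a single alternation pattern over the unwanted domains:

import re

# Hypothetical sample data in place of the scraped hrefs.
urls = [
    "https://fortune.com/company/tpg?utm_source=email",
    "https://www.afresh.com/",
    "mailto:termsheet@fortune.com",
]

# Join the unwanted domains into one alternation; re.escape protects the dots.
blocklist = ["fortune.com", "forbes.com", "twitter.com"]
pattern = re.compile("|".join(re.escape(x) for x in blocklist))

filtered = [u for u in urls if not pattern.search(u)]
print(filtered)  # ['https://www.afresh.com/']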

Calling remove() on a list while iterating over it shifts the remaining items one position to the left, so the iterator skips whichever entry moves into the removed item's slot. (The chained test "fortune.com" in link in termSheetLinks evaluates as ("fortune.com" in link) and (link in termSheetLinks), so it is redundant but not the cause.) Collecting the matches first and removing them after the for loop has finished will not skip any entries.

from bs4 import BeautifulSoup
import requests

termSheet = "https://fortune.com/newsletter/termsheet"
html = requests.get(termSheet)
htmlParser = BeautifulSoup(html.text, "html.parser")
termSheetLinks = []

for companyURL in htmlParser.select("table#templateBody p > a"):
    termSheetLinks.append(companyURL.get('href'))

lRemove = []
for link in termSheetLinks:
    # Collect the matches first; the combined condition appends each link
    # at most once, so remove() below cannot raise ValueError on a duplicate.
    if "fortune.com" in link or "forbes.com" in link or "twitter.com" in link:
        lRemove.append(link)

# Mutate the original list only after iteration has finished.
for l in lRemove:
    termSheetLinks.remove(l)

print(termSheetLinks)
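
For reference, a minimal standalone sketch (with made-up sample strings) shows why calling remove() inside the loop skips entries:

links = ["https://fortune.com/a", "https://fortune.com/b", "https://keep.me/"]
for link in links:
    if "fortune.com" in link:
        links.remove(link)

# Removing index 0 shifts ".../b" into its slot; the iterator then moves
# on to index 1 and never examines the shifted item.
print(links)  # ['https://fortune.com/b', 'https://keep.me/']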
