
How to ignore / remove a specific text line using Python?

Here is my situation: I built a bot in Python that scrapes eBay product-listing links from HTML. Every link takes me to the product page except the first one, which takes me to a different page. How can I remove or ignore that link when running the script?

Here is the code, and thank you in advance for any help:

import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver


browser = webdriver.Chrome('/Users/admin/eBay/chromedriver')

#error = browser.find_element_by_xpath("//*[@id='wrapper']/div[1]/div/div/p")


url = "https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=New+Big+Frame+Square+Sunglasses+Fashion+Trend+All-match+Women%27s+Sunglasses+Cross-border+Hot+Sale+Sunglasses&_sacat=0"
r = requests.get(url)
soup = BeautifulSoup(r.content, features="lxml")

listings = soup.select("li a")

for a in listings:
    link = a["href"]
    if link.startswith("https://www.ebay.com/itm/"):
        page = browser.get(link)


If you want to skip the first link, you can slice the list with [1:]:

...

for a in listings[1:]:  # <--- ignore first link
    link = a["href"]
    if link.startswith("https://www.ebay.com/itm/"):
        page = browser.get(link)
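As a minimal sketch of the slicing approach, using an inline HTML string in place of the live eBay page (the URLs here are made up for illustration):

```python
from bs4 import BeautifulSoup

# Toy markup standing in for the scraped search-results page
html = """
<li><a href="https://www.ebay.com/sch/unwanted">first link</a></li>
<li><a href="https://www.ebay.com/itm/111">item 1</a></li>
<li><a href="https://www.ebay.com/itm/222">item 2</a></li>
"""
soup = BeautifulSoup(html, "html.parser")
listings = soup.select("li a")

# Slice off the first anchor, then keep only product links
links = [a["href"] for a in listings[1:]
         if a["href"].startswith("https://www.ebay.com/itm/")]
print(links)  # ['https://www.ebay.com/itm/111', 'https://www.ebay.com/itm/222']
```

The slice never raises, even when the list is empty, so no length check is needed before it.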

Cut out that link with an if statement:

import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver


browser = webdriver.Chrome(r'C:\Program Files (x86)\chromedriver.exe')

#error = browser.find_element_by_xpath("//*[@id='wrapper']/div[1]/div/div/p")


url = "https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=New+Big+Frame+Square+Sunglasses+Fashion+Trend+All-match+Women%27s+Sunglasses+Cross-border+Hot+Sale+Sunglasses&_sacat=0"
r = requests.get(url)
soup = BeautifulSoup(r.content, features="lxml")

listings = soup.select("li a")

error_page ='https://www.ebay.com/itm/01920391?epid=26039819083&_trkparms=ispr%3D1&hash=item3b542eae7a:g:FQkAAOSwK21gKvEZ&amdata=enc%3AAQAFAAACcBaobrjLl8XobRIiIML1V4Imu%252Fn%252BzU5L90Z278x5ickkrDx%252B2NLp21dg6hHbHAkGMYdiW1E6zjXxnQ0bf7c%252Fx%252Fvs5PW%252FYFw1ZdbGMi8wsGV6qXw8OFLl4Os1ACX3bnQxFkVpRib9hMb5gVyLha4q9L0xiporu5InbX0LrSgg7nCCCwtC7y3vOE3hc8PszsrXWLb5KFdj7%252BD98et12MdkEfMPFhJZuS%252BkFsp2esVTRCYctOhcwzPSdfzCOYprlr2miQc4czCv1Tcfs3LKUPJn8uQyRc%252BAnKY1oyTeYnJ7wYuGkBU%252FSVYjziLBaPhT%252FlVu0hR9ZX6OnAeRaJ1g0iCaDjrRXEXRwUO87riWeI8kExm1zzY7QicPeMnfWZdBvVhg05GOScPOlLTVPHakqGLX0y2GUXV6fkTLua3nSF5YBmLX%252FqdCxT6yS0dutVs5MPWvQYlN474hUzbubkZVAs7Y%252BBBEsHrGjVzCj0szZ6w1%252BHgkV5O9jrXGnyew5%252Bnxy7VCq5xEkUDIt1nSg996AeDksNmSNumhfsIOGltIXbqAbjqEUpPcVO%252BDPymxlh0iMxCZQalYnmljBRzKILYWkES0vfA14Gh5E7KWrztdC6WzEEFtgVuABakQ1eAOZnuEueqK6IakC%252BIfRbXv96Tv01IPDvwPeM8wMo6j8bMjY3D5KHS5EXPVdHKUnjCJiYCcVUqcKwhL6eN2MZ%252Bn9yxmWESUPN394NPrX%252FI2z7t0Bbo7iqmsWNQcyi0EHzDwJPMK%252FNSif8%252F2adRF7dT1JrbL9sryKSN2kv9OsdGQ0fMMC1LV3Ph43HivUJdqkgjGxqEqX5v1xQ%253D%253D%7Ccksum%3A25481541593068896952f4834d93a0bb998f5b5ba5fe%7Campid%3APL_CLK%7Cclp%3A2334524'
for a in listings:
    link = a["href"]
    if link.startswith("https://www.ebay.com/itm/") and link != error_page:
        page = browser.get(link)
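The same exclusion can be sketched without Selenium. Using a set makes it easy to block several known-bad URLs at once and keeps membership checks fast; the URLs below are made up for illustration:

```python
# Known-bad URLs to skip (hypothetical examples)
blocked = {"https://www.ebay.com/itm/01920391"}

candidates = [
    "https://www.ebay.com/itm/01920391",   # the unwanted link
    "https://www.ebay.com/itm/555",
    "https://www.ebay.com/sch/other",      # not a product page at all
]

links = [u for u in candidates
         if u.startswith("https://www.ebay.com/itm/") and u not in blocked]
print(links)  # ['https://www.ebay.com/itm/555']
```

Set lookup is O(1) on average, so this scales better than chaining `!=` comparisons if more URLs need excluding later.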

I would have gone a similar way to @SIM, relying on faster CSS filtering and on CSS classes (generally the second-fastest way of matching nodes in CSS, after ids).

links = [i['href'] for i in soup.select('#srp-river-results .s-item__link')]

The introduction of the leading id limits results to the actual listings block.

If you are worried that URLs with other prefixes might occur, which seems unlikely given the consistent design of these pages, you can add a CSS attribute = value selector with the ^ (starts with) operator:

links = [i['href'] for i in soup.select('#srp-river-results .s-item__link[href^="https://www.ebay.com/itm/"]')]
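A small self-contained sketch of how the id scoping and the ^ operator combine; the markup is a made-up stand-in for the real results page:

```python
from bs4 import BeautifulSoup

html = """
<div id="srp-river-results">
  <a class="s-item__link" href="https://www.ebay.com/itm/111">item</a>
  <a class="s-item__link" href="https://www.ebay.com/sch/other">not an item</a>
</div>
<a class="s-item__link" href="https://www.ebay.com/itm/999">outside the results block</a>
"""
soup = BeautifulSoup(html, "html.parser")

# The id restricts matches to the results block; [href^=...] keeps
# only anchors whose href starts with the product-page prefix.
links = [i["href"] for i in soup.select(
    '#srp-river-results .s-item__link[href^="https://www.ebay.com/itm/"]')]
print(links)  # ['https://www.ebay.com/itm/111']
```

Both the anchor outside the id block and the non-product URL inside it are filtered out in a single selector, with no Python-side if needed.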

If you want more info per result, set listings as:

listings = soup.select('#srp-river-results .s-item')

Then access links with:

links = [listing.select_one('.s-item__link[href^="https://www.ebay.com/itm/"]')['href'] for listing in listings]
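A sketch of the per-listing approach on toy markup; the `.s-item__price` class is an assumed extra field added here purely to show why keeping whole listings is useful:

```python
from bs4 import BeautifulSoup

html = """
<div id="srp-river-results">
  <div class="s-item">
    <a class="s-item__link" href="https://www.ebay.com/itm/111">item 1</a>
    <span class="s-item__price">$5.99</span>
  </div>
  <div class="s-item">
    <a class="s-item__link" href="https://www.ebay.com/itm/222">item 2</a>
    <span class="s-item__price">$7.49</span>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
listings = soup.select("#srp-river-results .s-item")

# Each listing node carries the link plus any sibling fields you need
rows = [
    (listing.select_one(
        '.s-item__link[href^="https://www.ebay.com/itm/"]')["href"],
     listing.select_one(".s-item__price").text)
    for listing in listings
]
print(rows)
# [('https://www.ebay.com/itm/111', '$5.99'),
#  ('https://www.ebay.com/itm/222', '$7.49')]
```

Note that `select_one` returns None when a listing lacks the element, so on real pages it is worth guarding against missing links before indexing with `["href"]`.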
