
How to ignore / remove a specific link using python?

So here is my situation: I built a bot in Python that scrapes eBay product listing links from the HTML. Every link navigates me to a product page except the first one; the first one navigates me to an error page instead. How do I remove or ignore that link when running the script?

Here is the code, and thanks in advance for any help:

import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver


browser = webdriver.Chrome('/Users/admin/eBay/chromedriver')

#error = browser.find_element_by_xpath("//*[@id='wrapper']/div[1]/div/div/p")


url = "https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=New+Big+Frame+Square+Sunglasses+Fashion+Trend+All-match+Women%27s+Sunglasses+Cross-border+Hot+Sale+Sunglasses&_sacat=0"
r = requests.get(url)
soup = BeautifulSoup(r.content, features="lxml")

listings = soup.select("li a")

for a in listings:
    link = a["href"]
    if link.startswith("https://www.ebay.com/itm/"):
        page = browser.get(link)


If you want to skip the first link, you can use [1:] list slicing:

...

for a in listings[1:]:  # <--- ignore first link
    link = a["href"]
    if link.startswith("https://www.ebay.com/itm/"):
        page = browser.get(link)
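The [1:] slice simply drops the first element of the result list and keeps the rest. A minimal standalone sketch, with placeholder strings standing in for the tags returned by soup.select("li a"):

```python
# Placeholder list standing in for the <a> tags from soup.select("li a")
listings = ["promo-link", "item-1", "item-2", "item-3"]

# Slicing with [1:] skips the first element; the loop never sees "promo-link"
for link in listings[1:]:
    print(link)
```

Note this only works if the unwanted link is reliably the first item the selector returns.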

Remove that link with an if statement:

import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver


browser = webdriver.Chrome(r'C:\Program Files (x86)\chromedriver.exe')  # raw string avoids backslash-escape surprises

#error = browser.find_element_by_xpath("//*[@id='wrapper']/div[1]/div/div/p")


url = "https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=New+Big+Frame+Square+Sunglasses+Fashion+Trend+All-match+Women%27s+Sunglasses+Cross-border+Hot+Sale+Sunglasses&_sacat=0"
r = requests.get(url)
soup = BeautifulSoup(r.content, features="lxml")

listings = soup.select("li a")

error_page ='https://www.ebay.com/itm/01920391?epid=26039819083&_trkparms=ispr%3D1&hash=item3b542eae7a:g:FQkAAOSwK21gKvEZ&amdata=enc%3AAQAFAAACcBaobrjLl8XobRIiIML1V4Imu%252Fn%252BzU5L90Z278x5ickkrDx%252B2NLp21dg6hHbHAkGMYdiW1E6zjXxnQ0bf7c%252Fx%252Fvs5PW%252FYFw1ZdbGMi8wsGV6qXw8OFLl4Os1ACX3bnQxFkVpRib9hMb5gVyLha4q9L0xiporu5InbX0LrSgg7nCCCwtC7y3vOE3hc8PszsrXWLb5KFdj7%252BD98et12MdkEfMPFhJZuS%252BkFsp2esVTRCYctOhcwzPSdfzCOYprlr2miQc4czCv1Tcfs3LKUPJn8uQyRc%252BAnKY1oyTeYnJ7wYuGkBU%252FSVYjziLBaPhT%252FlVu0hR9ZX6OnAeRaJ1g0iCaDjrRXEXRwUO87riWeI8kExm1zzY7QicPeMnfWZdBvVhg05GOScPOlLTVPHakqGLX0y2GUXV6fkTLua3nSF5YBmLX%252FqdCxT6yS0dutVs5MPWvQYlN474hUzbubkZVAs7Y%252BBBEsHrGjVzCj0szZ6w1%252BHgkV5O9jrXGnyew5%252Bnxy7VCq5xEkUDIt1nSg996AeDksNmSNumhfsIOGltIXbqAbjqEUpPcVO%252BDPymxlh0iMxCZQalYnmljBRzKILYWkES0vfA14Gh5E7KWrztdC6WzEEFtgVuABakQ1eAOZnuEueqK6IakC%252BIfRbXv96Tv01IPDvwPeM8wMo6j8bMjY3D5KHS5EXPVdHKUnjCJiYCcVUqcKwhL6eN2MZ%252Bn9yxmWESUPN394NPrX%252FI2z7t0Bbo7iqmsWNQcyi0EHzDwJPMK%252FNSif8%252F2adRF7dT1JrbL9sryKSN2kv9OsdGQ0fMMC1LV3Ph43HivUJdqkgjGxqEqX5v1xQ%253D%253D%7Ccksum%3A25481541593068896952f4834d93a0bb998f5b5ba5fe%7Campid%3APL_CLK%7Cclp%3A2334524'
for a in listings:
    
    link = a["href"]
    if link.startswith("https://www.ebay.com/itm/") and link != error_page:

        page = browser.get(link)
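If more than one known-bad URL can show up, the single != comparison generalizes naturally to a set membership test. A hedged sketch with made-up URLs standing in for the scraped hrefs:

```python
# Hypothetical hrefs; in the real script these come from soup.select("li a")
candidate_links = [
    "https://www.ebay.com/itm/111",
    "https://www.ebay.com/itm/222",
    "https://www.ebay.com/sch/promo",
]

# Any number of known-bad links can be excluded via a set (O(1) lookups)
excluded = {"https://www.ebay.com/itm/222"}

product_links = [
    link
    for link in candidate_links
    if link.startswith("https://www.ebay.com/itm/") and link not in excluded
]
print(product_links)  # only /itm/ links not in the excluded set remain
```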

I would go a similar route to @SIM and rely on faster css filtering, using a css class (generally the second-fastest way to match nodes in css, after an id).

links = [i['href'] for i in soup.select('#srp-river-results .s-item__link')]

The leading id restricts the results to the actual listings block.

If you are somehow worried that urls with other starting strings might appear, which seems unlikely given the consistent design of these pages, you can add a css attribute = value selector with the ^ starts-with operator:

links = [i['href'] for i in soup.select('#srp-river-results .s-item__link[href^="https://www.ebay.com/itm/"]')]

If you want more information per result, set the listings to

listings = soup.select('#srp-river-results .s-item')

then access the links:

links = [listing.select_one('.s-item__link[href^="https://www.ebay.com/itm/"]')['href'] for listing in listings]
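The combined selector can be checked offline against a tiny inline page. The html snippet below is entirely made up to mimic the structure of an eBay results list; it is not the real markup:

```python
from bs4 import BeautifulSoup

# Made-up html mimicking the shape of the results page
html = """
<div id="srp-river-results">
  <a class="s-item__link" href="https://www.ebay.com/itm/123">listing</a>
  <a class="s-item__link" href="https://www.ebay.com/sch/promo">promo</a>
</div>
<a class="s-item__link" href="https://www.ebay.com/itm/999">outside the river</a>
"""

soup = BeautifulSoup(html, "html.parser")

# The leading id scopes matches to the results block, and [href^="..."]
# keeps only hrefs starting with the product-page prefix
links = [
    i["href"]
    for i in soup.select(
        '#srp-river-results .s-item__link[href^="https://www.ebay.com/itm/"]'
    )
]
print(links)  # → ['https://www.ebay.com/itm/123']
```

Both the promo link (wrong prefix) and the link outside the id-scoped block are filtered out.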
