
How to ignore / remove a specific link using Python?

So here is my situation: I built a bot in Python that scrapes eBay product listing links from HTML. Every link takes me to a product page except the first one. The first one takes me to this page. How can I remove or ignore that link when running the script?

Here is the code, and thank you in advance for any help:

import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver


browser = webdriver.Chrome('/Users/admin/eBay/chromedriver')

#error = browser.find_element_by_xpath("//*[@id='wrapper']/div[1]/div/div/p")


url = "https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=New+Big+Frame+Square+Sunglasses+Fashion+Trend+All-match+Women%27s+Sunglasses+Cross-border+Hot+Sale+Sunglasses&_sacat=0"
r = requests.get(url)
soup = BeautifulSoup(r.content, features="lxml")

listings = soup.select("li a")

for a in listings:
    link = a["href"]
    if link.startswith("https://www.ebay.com/itm/"):
        page = browser.get(link)


If you want to skip the first link, you can use list slicing with [1:]:

...

for a in listings[1:]:  # <--- ignore first link
    link = a["href"]
    if link.startswith("https://www.ebay.com/itm/"):
        page = browser.get(link)
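
One caveat: `listings[1:]` drops the first `li a` element on the page, which is not necessarily the first item link if other links (navigation, for example) come before it. A minimal sketch, assuming the unwanted link is always the first one that starts with the item prefix, is to filter first and slice afterwards. The helper name `item_links_without_first` and the sample URLs are hypothetical.

```python
ITEM_PREFIX = "https://www.ebay.com/itm/"


def item_links_without_first(hrefs):
    """Keep only item links, then drop the first one (hypothetical helper)."""
    item_links = [h for h in hrefs if h.startswith(ITEM_PREFIX)]
    return item_links[1:]


# Made-up hrefs standing in for [a["href"] for a in listings]
sample = [
    "https://www.ebay.com/help",
    "https://www.ebay.com/itm/111",
    "https://www.ebay.com/itm/222",
    "https://www.ebay.com/itm/333",
]
print(item_links_without_first(sample))
```

The returned list can then be passed to `browser.get` one link at a time, as in the loop above.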

Cut out that link using an if statement:

import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver


browser = webdriver.Chrome(r'C:\Program Files (x86)\chromedriver.exe')  # raw string so the backslashes are not treated as escapes

#error = browser.find_element_by_xpath("//*[@id='wrapper']/div[1]/div/div/p")


url = "https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=New+Big+Frame+Square+Sunglasses+Fashion+Trend+All-match+Women%27s+Sunglasses+Cross-border+Hot+Sale+Sunglasses&_sacat=0"
r = requests.get(url)
soup = BeautifulSoup(r.content, features="lxml")

listings = soup.select("li a")

error_page ='https://www.ebay.com/itm/01920391?epid=26039819083&_trkparms=ispr%3D1&hash=item3b542eae7a:g:FQkAAOSwK21gKvEZ&amdata=enc%3AAQAFAAACcBaobrjLl8XobRIiIML1V4Imu%252Fn%252BzU5L90Z278x5ickkrDx%252B2NLp21dg6hHbHAkGMYdiW1E6zjXxnQ0bf7c%252Fx%252Fvs5PW%252FYFw1ZdbGMi8wsGV6qXw8OFLl4Os1ACX3bnQxFkVpRib9hMb5gVyLha4q9L0xiporu5InbX0LrSgg7nCCCwtC7y3vOE3hc8PszsrXWLb5KFdj7%252BD98et12MdkEfMPFhJZuS%252BkFsp2esVTRCYctOhcwzPSdfzCOYprlr2miQc4czCv1Tcfs3LKUPJn8uQyRc%252BAnKY1oyTeYnJ7wYuGkBU%252FSVYjziLBaPhT%252FlVu0hR9ZX6OnAeRaJ1g0iCaDjrRXEXRwUO87riWeI8kExm1zzY7QicPeMnfWZdBvVhg05GOScPOlLTVPHakqGLX0y2GUXV6fkTLua3nSF5YBmLX%252FqdCxT6yS0dutVs5MPWvQYlN474hUzbubkZVAs7Y%252BBBEsHrGjVzCj0szZ6w1%252BHgkV5O9jrXGnyew5%252Bnxy7VCq5xEkUDIt1nSg996AeDksNmSNumhfsIOGltIXbqAbjqEUpPcVO%252BDPymxlh0iMxCZQalYnmljBRzKILYWkES0vfA14Gh5E7KWrztdC6WzEEFtgVuABakQ1eAOZnuEueqK6IakC%252BIfRbXv96Tv01IPDvwPeM8wMo6j8bMjY3D5KHS5EXPVdHKUnjCJiYCcVUqcKwhL6eN2MZ%252Bn9yxmWESUPN394NPrX%252FI2z7t0Bbo7iqmsWNQcyi0EHzDwJPMK%252FNSif8%252F2adRF7dT1JrbL9sryKSN2kv9OsdGQ0fMMC1LV3Ph43HivUJdqkgjGxqEqX5v1xQ%253D%253D%7Ccksum%3A25481541593068896952f4834d93a0bb998f5b5ba5fe%7Campid%3APL_CLK%7Cclp%3A2334524'
for a in listings:
    link = a["href"]
    # Open every item page except the known bad link
    if link.startswith("https://www.ebay.com/itm/") and link != error_page:
        browser.get(link)
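
The exact string comparison only works if the long query string of that link is identical on every run, which the source does not confirm. A hedged alternative, sketched under the assumption that the part of the URL before `?` identifies the page, compares scheme, host and path only. The helper `same_item` and the shortened URLs are hypothetical; only the path `/itm/01920391` is taken from the `error_page` value above.

```python
from urllib.parse import urlsplit


def same_item(link, other):
    """Compare two URLs by scheme, host and path, ignoring query and fragment."""
    a, b = urlsplit(link), urlsplit(other)
    return (a.scheme, a.netloc, a.path) == (b.scheme, b.netloc, b.path)


# Shortened, made-up query strings for illustration
bad = "https://www.ebay.com/itm/01920391?epid=1&hash=abc"
print(same_item("https://www.ebay.com/itm/01920391?epid=2&hash=xyz", bad))  # True
print(same_item("https://www.ebay.com/itm/555?epid=1&hash=abc", bad))       # False
```

In the loop, the condition would then read `not same_item(link, error_page)` instead of `link != error_page`.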

I would have gone a similar way to @SIM and relied on faster CSS filtering using CSS classes (generally the second-fastest way of matching nodes in CSS, after id).

links = [i['href'] for i in soup.select('#srp-river-results .s-item__link')]

The leading id limits results to the actual listings block.

If you are worried that URLs with other prefixes might occur, which seems unlikely given the consistent design of these pages, you can add a CSS attribute selector with the ^= (starts with) operator:

links = [i['href'] for i in soup.select('#srp-river-results .s-item__link[href^="https://www.ebay.com/itm/"]')]

If you want more info from each listing, set listings as:

listings = soup.select('#srp-river-results .s-item')

Then access links with:

links = [listing.select_one('.s-item__link[href^="https://www.ebay.com/itm/"]')['href'] for listing in listings]
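
Note that `select_one` returns `None` when a listing has no matching link, in which case the `['href']` lookup in the comprehension above raises a `TypeError`. A minimal sketch of a guarded version follows. The HTML is made up to mimic the class names used in this answer, and the real eBay markup may differ; `html.parser` is used here only so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the class names used above; real eBay markup may differ.
html = """
<div id="srp-river-results">
  <ul>
    <li class="s-item"><a class="s-item__link" href="https://www.ebay.com/itm/111">A</a></li>
    <li class="s-item"><span>no link here</span></li>
    <li class="s-item"><a class="s-item__link" href="https://www.ebay.com/itm/222">B</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
listings = soup.select("#srp-river-results .s-item")

links = []
for listing in listings:
    # select_one returns None when nothing matches, so check before indexing
    a = listing.select_one('.s-item__link[href^="https://www.ebay.com/itm/"]')
    if a is not None:
        links.append(a["href"])

print(links)
```

Listings without a matching link are skipped rather than stopping the whole run.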
