
How to ignore / remove a specific text line using python?

So here is my situation: I built a bot in Python that scrapes eBay product listing links from the HTML. Every link except the first one navigates me to a product page. The first one navigates me to this page instead. How can I remove or ignore that link when running the script?

Here is the code, and thanks in advance for any help:

import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver


browser = webdriver.Chrome('/Users/admin/eBay/chromedriver')

#error = browser.find_element_by_xpath("//*[@id='wrapper']/div[1]/div/div/p")


url = "https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=New+Big+Frame+Square+Sunglasses+Fashion+Trend+All-match+Women%27s+Sunglasses+Cross-border+Hot+Sale+Sunglasses&_sacat=0"
r = requests.get(url)
soup = BeautifulSoup(r.content, features="lxml")

listings = soup.select("li a")

for a in listings:
    link = a["href"]
    if link.startswith("https://www.ebay.com/itm/"):
        page = browser.get(link)


If you want to skip the first link, you can use [1:] list slicing:

...

for a in listings[1:]:  # <--- ignore first link
    link = a["href"]
    if link.startswith("https://www.ebay.com/itm/"):
        page = browser.get(link)
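A minimal, self-contained illustration of the slicing approach, using a hypothetical inline HTML snippet in place of the live search results page (no network or browser needed):

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the live eBay search results page.
html = """
<ul>
  <li><a href="https://www.ebay.com/itm/000">unwanted first link</a></li>
  <li><a href="https://www.ebay.com/itm/111">item 1</a></li>
  <li><a href="https://www.ebay.com/itm/222">item 2</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
listings = soup.select("li a")

# [1:] drops the first anchor, so the unwanted leading link never reaches the loop.
links = [a["href"] for a in listings[1:]
         if a["href"].startswith("https://www.ebay.com/itm/")]
print(links)  # ['https://www.ebay.com/itm/111', 'https://www.ebay.com/itm/222']
```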

Remove that link with an if statement

import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver


browser = webdriver.Chrome(r'C:\Program Files (x86)\chromedriver.exe')

#error = browser.find_element_by_xpath("//*[@id='wrapper']/div[1]/div/div/p")


url = "https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=New+Big+Frame+Square+Sunglasses+Fashion+Trend+All-match+Women%27s+Sunglasses+Cross-border+Hot+Sale+Sunglasses&_sacat=0"
r = requests.get(url)
soup = BeautifulSoup(r.content, features="lxml")

listings = soup.select("li a")

error_page ='https://www.ebay.com/itm/01920391?epid=26039819083&_trkparms=ispr%3D1&hash=item3b542eae7a:g:FQkAAOSwK21gKvEZ&amdata=enc%3AAQAFAAACcBaobrjLl8XobRIiIML1V4Imu%252Fn%252BzU5L90Z278x5ickkrDx%252B2NLp21dg6hHbHAkGMYdiW1E6zjXxnQ0bf7c%252Fx%252Fvs5PW%252FYFw1ZdbGMi8wsGV6qXw8OFLl4Os1ACX3bnQxFkVpRib9hMb5gVyLha4q9L0xiporu5InbX0LrSgg7nCCCwtC7y3vOE3hc8PszsrXWLb5KFdj7%252BD98et12MdkEfMPFhJZuS%252BkFsp2esVTRCYctOhcwzPSdfzCOYprlr2miQc4czCv1Tcfs3LKUPJn8uQyRc%252BAnKY1oyTeYnJ7wYuGkBU%252FSVYjziLBaPhT%252FlVu0hR9ZX6OnAeRaJ1g0iCaDjrRXEXRwUO87riWeI8kExm1zzY7QicPeMnfWZdBvVhg05GOScPOlLTVPHakqGLX0y2GUXV6fkTLua3nSF5YBmLX%252FqdCxT6yS0dutVs5MPWvQYlN474hUzbubkZVAs7Y%252BBBEsHrGjVzCj0szZ6w1%252BHgkV5O9jrXGnyew5%252Bnxy7VCq5xEkUDIt1nSg996AeDksNmSNumhfsIOGltIXbqAbjqEUpPcVO%252BDPymxlh0iMxCZQalYnmljBRzKILYWkES0vfA14Gh5E7KWrztdC6WzEEFtgVuABakQ1eAOZnuEueqK6IakC%252BIfRbXv96Tv01IPDvwPeM8wMo6j8bMjY3D5KHS5EXPVdHKUnjCJiYCcVUqcKwhL6eN2MZ%252Bn9yxmWESUPN394NPrX%252FI2z7t0Bbo7iqmsWNQcyi0EHzDwJPMK%252FNSif8%252F2adRF7dT1JrbL9sryKSN2kv9OsdGQ0fMMC1LV3Ph43HivUJdqkgjGxqEqX5v1xQ%253D%253D%7Ccksum%3A25481541593068896952f4834d93a0bb998f5b5ba5fe%7Campid%3APL_CLK%7Cclp%3A2334524'
for a in listings:
    
    link = a["href"]
    if link.startswith("https://www.ebay.com/itm/") and link != error_page:

        page = browser.get(link)
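The single-URL comparison generalizes naturally to a set of excluded links, and `set` membership stays fast even if the blocklist grows. A sketch with placeholder URLs (not the answer's actual error page) and inline HTML so it runs standalone:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real script parses the live search results page.
html = """
<li><a href="https://www.ebay.com/itm/bad">error page</a></li>
<li><a href="https://www.ebay.com/itm/good">real listing</a></li>
<li><a href="https://www.ebay.com/sch/other">not an item</a></li>
"""
soup = BeautifulSoup(html, "html.parser")

# Placeholder blocklist; in the answer above it would hold the one known error page URL.
excluded = {"https://www.ebay.com/itm/bad"}

links = [a["href"] for a in soup.select("li a")
         if a["href"].startswith("https://www.ebay.com/itm/")
         and a["href"] not in excluded]
print(links)  # ['https://www.ebay.com/itm/good']
```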

I would take a similar approach to @SIM, relying on faster css filtering and on css classes (generally the second-fastest way to match nodes in css, after ids).

links = [i['href'] for i in soup.select('#srp-river-results .s-item__link')]

The leading ID restricts the results to the actual listings block.

If you are somehow worried that urls with other starting strings might appear, which seems unlikely given the consistent design of these pages, you can add a css attribute = value selector with the ^ starts-with operator:

links = [i['href'] for i in soup.select('#srp-river-results .s-item__link[href^="https://www.ebay.com/itm/"]')]

If you want more information per listing, set listings to

listings = soup.select('#srp-river-results .s-item')

and then access the links:

links = [listing.select_one('.s-item__link[href^="https://www.ebay.com/itm/"]')['href'] for listing in listings]
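To sketch how the per-listing version fits together: the markup below is a hypothetical stand-in for eBay's `#srp-river-results` / `.s-item` structure (class names taken from the answer, sample content invented), showing that once you hold the whole `.s-item` node you can pull out more than just the href:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the eBay search results markup.
html = """
<ul id="srp-river-results">
  <li class="s-item">
    <a class="s-item__link" href="https://www.ebay.com/itm/111">Sunglasses A</a>
    <span class="s-item__price">$9.99</span>
  </li>
  <li class="s-item">
    <a class="s-item__link" href="https://www.ebay.com/itm/222">Sunglasses B</a>
    <span class="s-item__price">$12.50</span>
  </li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
listings = soup.select("#srp-river-results .s-item")

# Each listing node exposes the link plus any sibling fields you care about.
items = [
    {
        "link": li.select_one('.s-item__link[href^="https://www.ebay.com/itm/"]')["href"],
        "price": li.select_one(".s-item__price").get_text(),
    }
    for li in listings
]
print(items)
```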


Statement: The technical posts on this site follow the CC BY-SA 4.0 license. If you need to repost, please cite this site's URL or the original source. For any questions, contact: yoyou2525@163.com.
