
How to scrape in csv from 2 templates with the same parent node? (python, web scraping)

How can I extract the information into a csv from 2 templates that share the same parent node? For the first template it works, but for the second one it writes the wrong information into the csv (I have attached the csv output and the code). This is a web scraping program that I wrote in Python. Any help would be greatly appreciated.

Here is the output (CSV):

[screenshot of the CSV output]

from selenium import webdriver
import csv
import io

# set the proxies to hide actual IP
proxies = {
    'http': 'http://5.189.133.231:80',
    'https': 'https://27.111.43.178:8080'
}
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server="%s"' % ';'.join(['%s=%s' % (k, v) for k, v in proxies.items()]))
driver = webdriver.Chrome(executable_path="C:\\Users\\Andrei-PC\\Downloads\\webdriver\\chromedriver.exe",
                          chrome_options=chrome_options)

header = ['Product title', 'Product price', 'ASIN', 'Product Weight', 'Product dimensions', 'URL']
with open('csv/sort_products.csv', "w") as output:
    writer = csv.writer(output)
    writer.writerow(header)

links = [
    'https://www.amazon.com/Instant-Pot-Multi-Use-Programmable-Packaging/dp/B00FLYWNYQ/ref=sr_1_1?s=home-garden&ie=UTF8&qid=1520264922&sr=1-1&keywords=-gggh',
    'https://www.amazon.com/Amagle-Flexible-Batteries-Operated-Included/dp/B01NGTKTDK/ref=sr_1_2?s=furniture&ie=UTF8&qid=1520353343&sr=1-2&keywords=-jhgf'
]

for i in range(len(links)):
    driver.get(links[i])
    asinFound = False
    product_title = driver.find_elements_by_xpath('//*[@id="productTitle"][1]')
    prod_title = [x.text for x in product_title]
    try:
        prod_price = driver.find_element_by_xpath('//span[@id="priceblock_ourprice"]').text
    except:
        prod_price = 'No price'
    if asinFound == False:
        # try template one
        try:
            asin = driver.find_element_by_xpath('//table[@id ="productDetails_detailBullets_sections1"]/tbody/tr[5]/td').text
            asinFound = True
        except:
            print('no ASIN template one')
        try:
            weight = driver.find_element_by_xpath('//table[@id ="productDetails_detailBullets_sections1"]/tbody/tr[2]/td').text
        except:
            print('no weight template one')
        try:
            dimension = driver.find_element_by_xpath('//table[@id ="productDetails_detailBullets_sections1"]/tbody/tr[1]/td').text
        except:
            print('no dimension template one')
    if asinFound == False:
        # try template two
        try:
            asin = driver.find_element_by_xpath('//table[@id ="productDetails_detailBullets_sections1"]/tbody/tr[1]/td').text
            asinFound = True
        except:
            print('no ASIN template two')
        try:
            weight = driver.find_element_by_xpath('//table[@id ="productDetails_techSpec_section_1"]/tbody/tr[2]/td').text
        except:
            print('no weight template two')
        try:
            dimension = driver.find_element_by_xpath('//table[@id ="productDetails_techSpec_section_1"]/tbody/tr[3]/td').text
        except:
            print('no dimension template two')
    try:
        data = [prod_title[0], prod_price, asin, weight, dimension, links[i]]
    except:
        print('no data')
    with io.open('csv/sort_products.csv', "a", newline="", encoding="utf-8") as output:
        writer = csv.writer(output)
        writer.writerow(data)

You can try the following to get the required information. I used selenium in combination with BeautifulSoup (though that is not strictly necessary). The main problem is that the Product information table in the first URL has the id name productDetails_detailBullets_sections1, whereas in the second URL the Product information is under the id name productDetails_techSpec_section_1. I had to write the selector in such a way that the script can grab the information from both links.
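The prefix selector used below, `[id^='productDetails_']`, matches any element whose id starts with `productDetails_`, so a single rule covers both table variants. The same matching logic can be sketched with only the standard library (the two-table HTML below is a stripped-down stand-in for the real pages, not their actual markup):

```python
from html.parser import HTMLParser

# Minimal stand-in for the CSS selector [id^='productDetails_']:
# collect the ids of elements whose id starts with the common prefix,
# so both Amazon table variants match with one rule.
class PrefixIdCollector(HTMLParser):
    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.matches = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value and value.startswith(self.prefix):
                self.matches.append(value)

html = """
<div id="prodDetails">
  <table id="productDetails_detailBullets_sections1"></table>
  <table id="productDetails_techSpec_section_1"></table>
</div>
"""
collector = PrefixIdCollector("productDetails_")
collector.feed(html)
print(collector.matches)
```

Both ids are collected by the same prefix test, which is exactly why one selector works for both product pages.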

Here is the modified code:

import csv
from selenium import webdriver
from bs4 import BeautifulSoup

links = [
    'https://www.amazon.com/Instant-Pot-Multi-Use-Programmable-Packaging/dp/B00FLYWNYQ/ref=sr_1_1?s=home-garden&ie=UTF8&qid=1520264922&sr=1-1&keywords=-gggh',
    'https://www.amazon.com/Amagle-Flexible-Batteries-Operated-Included/dp/B01NGTKTDK/ref=sr_1_2?s=furniture&ie=UTF8&qid=1520353343&sr=1-2&keywords=-jhgf'
]

def get_information(driver,urls):
    with open("productDetails.csv","w",newline="") as infile:
        writer = csv.writer(infile)
        writer.writerow(['Title','Dimension','Weight','ASIN'])

        for url in urls:
            driver.get(url)
            soup = BeautifulSoup(driver.page_source,"lxml")
            title = soup.select_one("#productTitle").get_text(strip=True)
            dimension = ([item.select_one("td").get_text(strip=True) for item in soup.select("#prodDetails [id^='productDetails_'] tr") if "Product Dimensions" in item.text]+["N/A"])[0]
            weight = ([item.select_one("td").get_text(strip=True) for item in soup.select("#prodDetails [id^='productDetails_'] tr") if "Item Weight" in item.text]+["N/A"])[0]
            ASIN = ([item.select_one("td").get_text(strip=True) for item in soup.select("#prodDetails [id^='productDetails_'] tr") if "ASIN" in item.text]+["N/A"])[0]

            writer.writerow([title,dimension,weight,ASIN])
            print(f'{title}\n{dimension}\n{weight}\n{ASIN}\n')

if __name__ == '__main__':
    driver = webdriver.Chrome()
    try:
        get_information(driver,links)
    finally:
        driver.quit()
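The `([...] + ["N/A"])[0]` pattern in the code above is a compact "first match or default" idiom: the default is appended as a sentinel, so index 0 is either the first real match or the fallback. It can be checked in isolation with plain lists (the rows below are illustrative sample data, not scraped values):

```python
# Sample (label, value) rows standing in for the scraped table cells.
rows = [
    ("Product Dimensions", "13 x 12.5 x 12.5 inches"),
    ("Item Weight", "11.8 pounds"),
    ("ASIN", "B00FLYWNYQ"),
]

def first_or_default(label, default="N/A"):
    # Build the list of matching values, append the default as a
    # sentinel, and take the first element: a match if one exists,
    # otherwise the default.
    return ([value for key, value in rows if label in key] + [default])[0]

print(first_or_default("ASIN"))       # -> B00FLYWNYQ
print(first_or_default("Batteries"))  # -> N/A
```

This avoids a try/except around an empty result at the cost of always evaluating the full comprehension.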

I skipped the proxy part. However, you can include it if you need to.
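If you do need the proxies, the `--proxy-server` argument the question builds from its proxies dict can be assembled and inspected without launching a browser. A minimal sketch, reusing the question's own (placeholder, not guaranteed working) proxy addresses:

```python
# Assemble Chrome's --proxy-server argument from a dict of per-scheme
# proxies, as the original question does. The addresses are the
# placeholders from the question, not known-working proxies.
proxies = {
    'http': 'http://5.189.133.231:80',
    'https': 'https://27.111.43.178:8080',
}
proxy_arg = '--proxy-server="%s"' % ';'.join(
    '%s=%s' % (scheme, proxy) for scheme, proxy in proxies.items()
)
print(proxy_arg)
```

The resulting string would then be passed to `chrome_options.add_argument(...)` before constructing the driver, exactly as in the question's code.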


Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license; if you need to repost, please credit this site or the original source. For any questions contact: yoyou2525@163.com.

 