
How to scrape to CSV from 2 templates with the same parent node? (Python, web scraping)

How can I extract information into a CSV from 2 templates with the same parent node? For the first template it works, but for the second it writes the wrong information to the CSV (I attached the CSV and the code). This is a web scraping program and I wrote it in Python. I will appreciate any help.

THIS IS THE OUTPUT (CSV)

[screenshot: CSV output]

from selenium import webdriver
import csv
import io

# set the proxies to hide actual IP
proxies = {
    'http': 'http://5.189.133.231:80',
    'https': 'https://27.111.43.178:8080'
}
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server="%s"' % ';'.join(['%s=%s' % (k, v) for k, v in proxies.items()]))
driver = webdriver.Chrome(executable_path="C:\\Users\\Andrei-PC\\Downloads\\webdriver\\chromedriver.exe",
                          chrome_options=chrome_options)

header = ['Product title', 'Product price', 'ASIN', 'Product Weight', 'Product dimensions', 'URL']
with open('csv/sort_products.csv', "w") as output:
    writer = csv.writer(output)
    writer.writerow(header)

links = [
    'https://www.amazon.com/Instant-Pot-Multi-Use-Programmable-Packaging/dp/B00FLYWNYQ/ref=sr_1_1?s=home-garden&ie=UTF8&qid=1520264922&sr=1-1&keywords=-gggh',
    'https://www.amazon.com/Amagle-Flexible-Batteries-Operated-Included/dp/B01NGTKTDK/ref=sr_1_2?s=furniture&ie=UTF8&qid=1520353343&sr=1-2&keywords=-jhgf'
]

for i in range(len(links)):
    driver.get(links[i])
    asinFound = False
    product_title = driver.find_elements_by_xpath('//*[@id="productTitle"][1]')
    prod_title = [x.text for x in product_title]
    try:
        prod_price = driver.find_element_by_xpath('//span[@id="priceblock_ourprice"]').text
    except:
        prod_price = 'No price'

    if asinFound == False:
        # try template one
        try:
            asin = driver.find_element_by_xpath('//table[@id="productDetails_detailBullets_sections1"]/tbody/tr[5]/td').text
            asinFound = True
        except:
            print('no ASIN template one')
        try:
            weight = driver.find_element_by_xpath('//table[@id="productDetails_detailBullets_sections1"]/tbody/tr[2]/td').text
        except:
            print('no weight template one')
        try:
            dimension = driver.find_element_by_xpath('//table[@id="productDetails_detailBullets_sections1"]/tbody/tr[1]/td').text
        except:
            print('no dimension template one')

    if asinFound == False:
        # try template two
        try:
            asin = driver.find_element_by_xpath('//table[@id="productDetails_detailBullets_sections1"]/tbody/tr[1]/td').text
            asinFound = True
        except:
            print('no ASIN template two')
        try:
            weight = driver.find_element_by_xpath('//table[@id="productDetails_techSpec_section_1"]/tbody/tr[2]/td').text
        except:
            print('no weight template two')
        try:
            dimension = driver.find_element_by_xpath('//table[@id="productDetails_techSpec_section_1"]/tbody/tr[3]/td').text
        except:
            print('no dimension template two')

    try:
        data = [prod_title[0], prod_price, asin, weight, dimension, links[i]]
    except:
        print('no data')

    with io.open('csv/sort_products.csv', "a", newline="", encoding="utf-8") as output:
        writer = csv.writer(output)
        writer.writerow(data)

You can try like this to get the information you would like to grab. I used selenium in combination with BeautifulSoup (not strictly necessary, but it makes the parsing easier). The main problem is that the product information in the first url sits in a table with the id productDetails_detailBullets_sections1, whereas in the second url it sits in a table with the id productDetails_techSpec_section_1. I therefore wrote the selectors in such a way that the script can get the information from both links.
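To see why a single selector can cover both layouts, here is a minimal standalone sketch. The two HTML fragments below are made up for illustration; only the table ids are taken from the real pages. The attribute-prefix selector [id^='productDetails_'] matches any id that starts with productDetails_, so it picks up both tables:

from bs4 import BeautifulSoup

# Made-up fragments standing in for the two Amazon detail-page layouts
html_one = """<div id="prodDetails">
  <table id="productDetails_detailBullets_sections1">
    <tr><th>ASIN</th><td>B00FLYWNYQ</td></tr>
  </table></div>"""
html_two = """<div id="prodDetails">
  <table id="productDetails_techSpec_section_1">
    <tr><th>Item Weight</th><td>2.4 pounds</td></tr>
  </table></div>"""

for html in (html_one, html_two):
    soup = BeautifulSoup(html, "lxml")
    # The prefix selector matches either table id, so one loop handles both templates
    for row in soup.select("#prodDetails [id^='productDetails_'] tr"):
        print(row.select_one("th").get_text(strip=True), "->",
              row.select_one("td").get_text(strip=True))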

This is the modified code:

import csv
from selenium import webdriver
from bs4 import BeautifulSoup

links = [
    'https://www.amazon.com/Instant-Pot-Multi-Use-Programmable-Packaging/dp/B00FLYWNYQ/ref=sr_1_1?s=home-garden&ie=UTF8&qid=1520264922&sr=1-1&keywords=-gggh',
    'https://www.amazon.com/Amagle-Flexible-Batteries-Operated-Included/dp/B01NGTKTDK/ref=sr_1_2?s=furniture&ie=UTF8&qid=1520353343&sr=1-2&keywords=-jhgf'
]

def get_information(driver,urls):
    with open("productDetails.csv","w",newline="") as infile:
        writer = csv.writer(infile)
        writer.writerow(['Title','Dimension','Weight','ASIN'])

        for url in urls:
            driver.get(url)
            soup = BeautifulSoup(driver.page_source,"lxml")
            title = soup.select_one("#productTitle").get_text(strip=True)
            # The prefix selector [id^='productDetails_'] matches the details
            # table in both templates; appending ["N/A"] gives a fallback
            # value when a field is missing from the page
            rows = soup.select("#prodDetails [id^='productDetails_'] tr")
            dimension = ([item.select_one("td").get_text(strip=True) for item in rows if "Product Dimensions" in item.text]+["N/A"])[0]
            weight = ([item.select_one("td").get_text(strip=True) for item in rows if "Item Weight" in item.text]+["N/A"])[0]
            ASIN = ([item.select_one("td").get_text(strip=True) for item in rows if "ASIN" in item.text]+["N/A"])[0]

            writer.writerow([title,dimension,weight,ASIN])
            print(f'{title}\n{dimension}\n{weight}\n{ASIN}\n')

if __name__ == '__main__':
    driver = webdriver.Chrome()
    try:
        get_information(driver,links)
    finally:
        driver.quit()
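One caveat: the script parses driver.page_source right after driver.get(), and Amazon fills in parts of the page with JavaScript, so on a slow connection the details table may not be in the DOM yet. Below is a hedged sketch of an explicit wait, assuming the #prodDetails container used by the selectors above is present on both pages; get_page_source is a hypothetical helper name, not part of the answer's code:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_page_source(driver, url, timeout=10):
    # Block until the product-details container exists before parsing
    driver.get(url)
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.ID, "prodDetails"))
    )
    return driver.page_source

You would call it inside the loop in place of the bare driver.get(url) / driver.page_source pair.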

I skipped the proxy part. However, you can include it as required; a minimal sketch follows.
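A sketch of wiring the proxy back in through ChromeOptions; the address below is the sample one from the question, not a proxy I have verified:

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
# Chrome also accepts per-scheme mappings, e.g. 'http=host:port;https=host:port'
chrome_options.add_argument('--proxy-server=http://5.189.133.231:80')
driver = webdriver.Chrome(chrome_options=chrome_options)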
