
How to scrape into a CSV from 2 templates with the same parent node? (Python, web scraping)

How can I extract information into a CSV from 2 templates that share the same parent node? The first template produces correct output, but the second writes wrong information to the CSV (I attached the CSV and the code). This is a web-scraping program I wrote in Python. I would appreciate any help.

THIS IS THE OUTPUT (CSV):

[screenshot of the generated CSV]

from selenium import webdriver
import csv
import io

# set the proxies to hide actual IP
proxies = {
    'http': 'http://5.189.133.231:80',
    'https': 'https://27.111.43.178:8080'
}

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server="%s"' % ';'.join(['%s=%s' % (k, v) for k, v in proxies.items()]))

driver = webdriver.Chrome(executable_path="C:\\Users\\Andrei-PC\\Downloads\\webdriver\\chromedriver.exe",
                          chrome_options=chrome_options)

header = ['Product title', 'Product price', 'ASIN', 'Product Weight', 'Product dimensions', 'URL']

with open('csv/sort_products.csv', "w") as output:
    writer = csv.writer(output)
    writer.writerow(header)

links = [
    'https://www.amazon.com/Instant-Pot-Multi-Use-Programmable-Packaging/dp/B00FLYWNYQ/ref=sr_1_1?s=home-garden&ie=UTF8&qid=1520264922&sr=1-1&keywords=-gggh',
    'https://www.amazon.com/Amagle-Flexible-Batteries-Operated-Included/dp/B01NGTKTDK/ref=sr_1_2?s=furniture&ie=UTF8&qid=1520353343&sr=1-2&keywords=-jhgf'
]

for i in range(len(links)):
    driver.get(links[i])
    asinFound = False
    product_title = driver.find_elements_by_xpath('//*[@id="productTitle"][1]')
    prod_title = [x.text for x in product_title]
    try:
        prod_price = driver.find_element_by_xpath('//span[@id="priceblock_ourprice"]').text
    except:
        prod_price = 'No price'

    if asinFound == False:
        # try template one
        try:
            asin = driver.find_element_by_xpath('//table[@id ="productDetails_detailBullets_sections1"]/tbody/tr[5]/td').text
            asinFound = True
        except:
            print('no ASIN template one')
        try:
            weight = driver.find_element_by_xpath('//table[@id ="productDetails_detailBullets_sections1"]/tbody/tr[2]/td').text
        except:
            print('no weight template one')
        try:
            dimension = driver.find_element_by_xpath('//table[@id ="productDetails_detailBullets_sections1"]/tbody/tr[1]/td').text
        except:
            print('no dimension template one')

    if asinFound == False:
        # try template two
        try:
            asin = driver.find_element_by_xpath('//table[@id ="productDetails_detailBullets_sections1"]/tbody/tr[1]/td').text
            asinFound = True
        except:
            print('no ASIN template two')
        try:
            weight = driver.find_element_by_xpath('//table[@id ="productDetails_techSpec_section_1"]/tbody/tr[2]/td').text
        except:
            print('no weight template two')
        try:
            dimension = driver.find_element_by_xpath('//table[@id ="productDetails_techSpec_section_1"]/tbody/tr[3]/td').text
        except:
            print('no dimension template two')

    try:
        data = [prod_title[0], prod_price, asin, weight, dimension, links[i]]
    except:
        print('no data')

    with io.open('csv/sort_products.csv', "a", newline="", encoding="utf-8") as output:
        writer = csv.writer(output)
        writer.writerow(data)

You can try it like this to get the information you want. I used selenium in combination with BeautifulSoup (not strictly necessary, though). The main problem was that the product information in the first URL sits in a table with id productDetails_detailBullets_sections1, whereas in the second URL it sits in a table with id productDetails_techSpec_section_1. I therefore had to write the selectors so that the script can get the information from both links, as the short illustration below shows.
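To make the selector idea concrete before the full script: the CSS attribute-prefix selector [id^='productDetails_'] matches any element whose id starts with that prefix, so one loop covers both table layouts. Here is a minimal, self-contained sketch of just that idea; the HTML below is a made-up stand-in for the two Amazon layouts, not scraped data:

from bs4 import BeautifulSoup

# Toy stand-in for the two Amazon templates: same parent, different table ids
html = """
<div id="prodDetails">
  <table id="productDetails_detailBullets_sections1">
    <tr><th>ASIN</th><td>B00FLYWNYQ</td></tr>
  </table>
  <table id="productDetails_techSpec_section_1">
    <tr><th>Item Weight</th><td>11.8 pounds</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, "lxml")
# [id^='productDetails_'] matches both table ids, so rows from both
# templates are collected in a single pass
for row in soup.select("#prodDetails [id^='productDetails_'] tr"):
    print(row.select_one("th").get_text(strip=True), "->",
          row.select_one("td").get_text(strip=True))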

This is the modified code:

import csv
from selenium import webdriver
from bs4 import BeautifulSoup

links = [
    'https://www.amazon.com/Instant-Pot-Multi-Use-Programmable-Packaging/dp/B00FLYWNYQ/ref=sr_1_1?s=home-garden&ie=UTF8&qid=1520264922&sr=1-1&keywords=-gggh',
    'https://www.amazon.com/Amagle-Flexible-Batteries-Operated-Included/dp/B01NGTKTDK/ref=sr_1_2?s=furniture&ie=UTF8&qid=1520353343&sr=1-2&keywords=-jhgf'
]

def get_information(driver,urls):
    with open("productDetails.csv","w",newline="") as infile:
        writer = csv.writer(infile)
        writer.writerow(['Title','Dimension','Weight','ASIN'])

        for url in urls:
            driver.get(url)
            soup = BeautifulSoup(driver.page_source,"lxml")
            title = soup.select_one("#productTitle").get_text(strip=True)
            dimension = ([item.select_one("td").get_text(strip=True) for item in soup.select("#prodDetails [id^='productDetails_'] tr") if "Product Dimensions" in item.text]+["N/A"])[0]
            weight = ([item.select_one("td").get_text(strip=True) for item in soup.select("#prodDetails [id^='productDetails_'] tr") if "Item Weight" in item.text]+["N/A"])[0]
            ASIN = ([item.select_one("td").get_text(strip=True) for item in soup.select("#prodDetails [id^='productDetails_'] tr") if "ASIN" in item.text]+["N/A"])[0]

            writer.writerow([title,dimension,weight,ASIN])
            print(f'{title}\n{dimension}\n{weight}\n{ASIN}\n')

if __name__ == '__main__':
    driver = webdriver.Chrome()
    try:
        get_information(driver,links)
    finally:
        driver.quit()
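One caveat with this approach: driver.page_source is read as soon as get() returns, so on a slow connection the details block may not exist yet and select_one("#productTitle") would return None. A minimal sketch of an explicit wait that could be placed before the BeautifulSoup call; the 10-second timeout and waiting on the #prodDetails container are my assumptions, not part of the answer above:

from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block (up to an assumed 10 s) until the product-details container
# is present in the DOM before handing page_source to BeautifulSoup
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "prodDetails"))
)
soup = BeautifulSoup(driver.page_source, "lxml")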

I skipped the proxy part; you can add it back as required.
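For completeness, a sketch of how the proxy setup from the question could be wired back in. It reuses the exact addresses from the original script (whether those proxies still respond is untested), and note that on Selenium 4 the keyword is options= rather than chrome_options=:

from selenium import webdriver

proxies = {
    'http': 'http://5.189.133.231:80',
    'https': 'https://27.111.43.178:8080'
}

chrome_options = webdriver.ChromeOptions()
# Same --proxy-server flag the original script built from the dict
chrome_options.add_argument('--proxy-server="%s"' % ';'.join('%s=%s' % (k, v) for k, v in proxies.items()))

driver = webdriver.Chrome(chrome_options=chrome_options)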
