Python Web Scraper issue

I'm new to programming and trying to learn by building some small side projects. I have this code and it works, but the output isn't formatted correctly in the CSV once it pulls all the information. It started adding weird spaces after I added price to the fields being pulled. If I comment out price and remove it from the write call, it works fine, but I can't figure out why I get weird spaces when I add it back.

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = "https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=graphics%20card&bop=And&PageSize=12&order=BESTMATCH"


# Opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()


#html parsing
page_soup = soup(page_html, "html.parser")


# grab each product container
containers = page_soup.findAll("div",{"class":"item-container"})


filename = "products.csv"
f = open(filename, "w")

headers = "brand,product_name,shipping,price\n"

f.write(headers)

for container in containers:
    brand = container.div.div.a.img["title"]

    title_container = container.findAll("a", {"class":"item-title"})
    product_name = title_container[0].text

    shipping_container = container.findAll("li", {"class":"price-ship"})
    shipping = shipping_container[0].text.strip()

    price_container = container.findAll("li", {"class":"price-current"})
    price = price_container[0].text.strip()

    print("brand: " + brand)
    print("product_name: " + product_name)
    print("Price: " + price)
    print("shipping: " + shipping)


    f.write(brand + "," + product_name.replace(",", "|") + "," + shipping + "," + price + "\n")

f.close()

You can write to a CSV file the way I've shown below. The output it produces should serve the purpose. Check out the documentation for more detail.

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

my_url = "https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=graphics%20card&bop=And&PageSize=12&order=BESTMATCH"

page_html = urlopen(my_url).read()
page_soup = BeautifulSoup(page_html, "lxml")

with open("outputfile.csv","w",newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["brand", "product_name", "shipping", "price"])

    for container in page_soup.findAll("div",{"class":"item-container"}):

        brand = container.find(class_="item-brand").img.get("title")
        product_name = container.find("a", {"class":"item-title"}).get_text(strip=True).replace(",", "|")
        shipping = container.find("li", {"class":"price-ship"}).get_text(strip=True)
        price = container.find("li", {"class":"price-current"}).get_text(strip=True).replace("|", "")

        writer.writerow([brand,product_name,shipping,price])
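To illustrate why csv.writer helps here, the following is a minimal sketch using only the standard library. The row values are made up for the example (they are not taken from the real page), and the data is written to an in-memory buffer rather than a file. It compares manual string concatenation with csv.writer when a field contains a comma or an embedded line break.

```python
import csv
import io

# Hypothetical scraped values: the product name contains a comma and the
# price contains an embedded newline, like text pulled from a multi-tag item.
row = ["ASUS", "GeForce GTX 1060, 6GB", "Free Shipping", "$299\n.99"]

# Manual concatenation: the comma and the newline break the record apart.
manual = ",".join(row) + "\n"

# csv.writer quotes any field that contains a delimiter or a line break.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(row)
written = buffer.getvalue()

# Read both versions back to compare how many records come out.
manual_rows = list(csv.reader(io.StringIO(manual, newline="")))
csv_rows = list(csv.reader(io.StringIO(written, newline="")))

print(len(manual_rows), "record(s) from manual concatenation")
print(len(csv_rows), "record(s) from csv.writer")
print("round trip preserved the row:", csv_rows[0] == row)
```

Because csv.writer handles the quoting, replacing commas with "|" in the product name becomes optional. A field with an embedded newline still shows up as a multi-line cell in a spreadsheet, though, so cleaning the scraped text remains worthwhile.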

You're getting the new lines and stray characters because that is the data you're getting back from BS4; it isn't a product of the writing process. This happens because you're grabbing all the text in the list item, and there's a lot going on in there. Having looked at the page, if you'd rather just get the price, you can concatenate the text of the strong tag within the list item with the text of the sup tag, e.g. price = price_container[0].find("strong").text + price_container[0].find("sup").text. That ensures you only pick out the data you need.
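The difference can be seen on a small inline snippet. This is only a sketch: the HTML below is a hypothetical stand-in loosely modeled on a price list item, and the real page's markup may differ. It uses the built-in "html.parser" so no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption, not copied from the real page).
html = """
<li class="price-current">
    <span class="price-current-label"></span>$<strong>299</strong><sup>.99</sup>
    <a class="price-current-num" href="#">(4 Offers)</a>
</li>
"""

item = BeautifulSoup(html, "html.parser").find("li", {"class": "price-current"})

# Whole-item text: strip() only trims the ends, so inner newlines survive.
whole = item.text.strip()

# Targeted extraction: only the dollars (strong) and the cents (sup).
price = item.find("strong").text + item.find("sup").text

print(repr(whole))
print(price)
```

On a live page, find() returns None when a tag is missing, so checking the result before reading .text would avoid an AttributeError on items without a listed price.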
