
Scraping Product Names using BeautifulSoup

I'm using BeautifulSoup (bs4) to build a scraper that pulls the product name, which sits between 'h1' tags, from any TopShop.com product page. I can't figure out why the code I've written isn't working!

from urllib2 import urlopen
from bs4 import BeautifulSoup
import re

TopShop_URL = raw_input("Enter a TopShop Product URL")
ProductPage = urlopen(TopShop_URL).read()

soup = BeautifulSoup(ProductPage)

ProductNames = soup.find_all('h1')

print ProductNames

I got this working using requests (http://docs.python-requests.org/en/latest/):

from bs4 import BeautifulSoup
import requests

# TopShop_URL is the product page URL, read in as in the question
content = requests.get(TopShop_URL).content
soup = BeautifulSoup(content, "html.parser")
product_names = soup.find_all("h1")
print(product_names)
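Note that find_all returns Tag objects, so the printed list includes the surrounding h1 markup rather than just the name. Below is a minimal sketch of pulling out only the text, assuming the heading is actually present in the HTML the server returns. The helper name heading_texts and the sample markup are made up for illustration and do not come from the original posts.

```python
from bs4 import BeautifulSoup


def heading_texts(html):
    """Return the stripped text of every <h1> element in the given HTML."""
    # Naming the parser explicitly avoids bs4's "no parser specified" warning.
    soup = BeautifulSoup(html, "html.parser")
    return [h1.get_text(strip=True) for h1 in soup.find_all("h1")]


# Made-up markup standing in for a downloaded product page (an assumption).
sample_html = "<html><body><h1> Floral Tea Dress </h1></body></html>"
print(heading_texts(sample_html))
```

If this returns an empty list for a real page, the heading is probably not in the raw HTML at all.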

Your code is correct, but the problem is that the div which includes the product name is dynamically generated via JavaScript. To parse this element successfully, you should consider using Selenium or a similar tool, which lets you parse the page after the DOM has fully loaded.
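A rough sketch of the Selenium approach could look like the following. It assumes Selenium 4 and a local Chrome installation, and that the product name ends up in an h1 element as the question describes. The function names fetch_rendered_html and first_h1_text, the 10-second timeout, and the product_url variable are all hypothetical choices, not something given in the thread.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, timeout=10):
    """Load url in a real browser and return the HTML once an <h1> exists."""
    # Imported here so the parsing helper below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome is available locally
    try:
        driver.get(url)
        # Block until JavaScript has inserted an <h1>; raises TimeoutException
        # if none appears within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "h1"))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()


def first_h1_text(html):
    """Return the stripped text of the first <h1>, or None if there is none."""
    h1 = BeautifulSoup(html, "html.parser").find("h1")
    return h1.get_text(strip=True) if h1 is not None else None


# Example call (product_url is a hypothetical variable holding the page URL):
# print(first_h1_text(fetch_rendered_html(product_url)))
```

Keeping the fetch and the parse separate means the BeautifulSoup part can be checked against saved HTML without launching a browser.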
