Web scraping with BeautifulSoup: retrieving a website's source code

Question

Good day! I am currently writing a web scraper for the Alibaba website. My problem is that the returned source code is missing some parts that I am interested in. The data is there when I inspect the source in the browser, but I can't retrieve it with BeautifulSoup. Any tips?

from urllib2 import urlopen  # Python 2; on Python 3 use: from urllib.request import urlopen
from bs4 import BeautifulSoup

def make_soup(url):
    try:
        html = urlopen(url).read()
    except:
        return None
    return BeautifulSoup(html, "lxml")

url = "http://www.alibaba.com/Agricultural-Growing-Media_pid144"
soup2 = make_soup(url)

I am interested in the highlighted part shown in the image, taken with Chrome's Developer Tools. But when I write the returned source to a text file, some parts, including the highlighted one, are nowhere to be found. Any tips? TIA!

Answer 1

You need to provide at least the User-Agent header.
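If you would rather stay with urlopen, the same header can be attached through a Request object. This is only a minimal sketch, assuming Python 3's urllib.request; the helper names build_request and fetch_html and the shortened User-Agent string are made up for illustration, and whether the site accepts that particular string is not something I have checked.

```python
from urllib.request import Request, urlopen

# Hypothetical value: substitute the User-Agent string of a real browser.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object that carries a User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url):
    """Fetch the page body as bytes (needs network access)."""
    with urlopen(build_request(url), timeout=10) as response:
        return response.read()
```

The bytes returned by fetch_html(url) can be passed to BeautifulSoup exactly as in the question.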

Example using the requests package instead of urllib2:

import requests
from bs4 import BeautifulSoup

def make_soup(url):
    try:
        html = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"}).content
    except:
        return None
    return BeautifulSoup(html, "lxml")

url = "http://www.alibaba.com/Agricultural-Growing-Media_pid144"
soup = make_soup(url)

print(soup.select_one("a.next").get('href'))

Prints http://www.alibaba.com/catalogs/products/CID144/2.
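One caveat: select_one returns None when nothing matches, so calling .get on the result raises AttributeError if the page comes back without the link. A slightly more defensive sketch follows. The helper name next_page_url is hypothetical, and it uses the built-in "html.parser" only so that it runs without lxml installed.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def next_page_url(html, base_url):
    """Return the absolute URL of the a.next link, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select_one("a.next")
    # select_one returns None when no element matches the selector.
    if link is None or not link.get("href"):
        return None
    # Resolve a relative href against the URL of the page it came from.
    return urljoin(base_url, link["href"])
```

A None result here is a quick hint that the server sent a different page than the browser shows, which is the symptom described in the question.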

Answer 1 | 0 | 2015-12-16 17:14:37