
Web scraping from the list of urls with python

I'm trying to scrape some property listings from the websites in a list. I wrote simple code to get data from one URL, but when I try with a list ['url1','url2'] I get nothing as the result. I also tried with a CSV list of URLs, but I still get nothing. I've checked a lot of similar topics, but the result is still empty. Could you please help me understand how to do this?

import requests
import pandas as pd  # imported in the original snippet; not used yet
from bs4 import BeautifulSoup

url = 'https://www.zillow.com/homedetails/105-Itasca-St-Boston-MA-02126/59137872_zpid/'

req_headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}

with requests.Session() as s:
    r = s.get(url, headers=req_headers)

# the 'lxml' parser requires the lxml package to be installed
soup = BeautifulSoup(r.content, 'lxml')

price = soup.find('span', {'class': 'ds-value'}).text
property_type = soup.find('span', {'class': 'ds-home-fact-value'}).text
address = soup.find('h1', {'class': 'ds-address-container'}).text

print(price, property_type, address)

To accomplish what you're asking to do with multiple urls, all you need to do is put them in a list and iterate over it:

import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.zillow.com/homedetails/105-Itasca-St-Boston-MA-02126/59137872_zpid/',
]

with requests.Session() as s:
    for url in urls:
        r = s.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')

        # do something with soup
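
If you also want to keep the scraped fields for every URL (for instance when the links come from a CSV file, as you mention), append the results to a list inside the loop and build a pandas DataFrame at the end. The sketch below reuses the ds-value / ds-home-fact-value / ds-address-container selectors from your snippet and assumes a hypothetical urls.csv file with a column named url:

import requests
import pandas as pd
from bs4 import BeautifulSoup

# assumed input file: urls.csv with a single column named "url"
urls = pd.read_csv('urls.csv')['url'].tolist()

req_headers = {'user-agent': 'Mozilla/5.0'}  # or reuse the full headers dict from your question

rows = []
with requests.Session() as s:
    for url in urls:
        r = s.get(url, headers=req_headers)
        soup = BeautifulSoup(r.text, 'html.parser')

        # find() returns None when a tag is missing, so guard before reading .text
        price = soup.find('span', {'class': 'ds-value'})
        property_type = soup.find('span', {'class': 'ds-home-fact-value'})
        address = soup.find('h1', {'class': 'ds-address-container'})

        rows.append({
            'url': url,
            'price': price.text if price else None,
            'property_type': property_type.text if property_type else None,
            'address': address.text if address else None,
        })

df = pd.DataFrame(rows)
print(df)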

However, the main issue here is that pretty much everything interesting on your example webpage seems to be generated by JavaScript. For example, if you:

print(soup.body)

You'll see the HTML body for this webpage has next to nothing (no price, no house details, etc.), save for a captcha mechanism to verify you're a human. You'll need to find a way to wait for the JavaScript to be rendered on the page to be able to scrape the details. Look into the python module selenium as a potential workaround for accomplishing this.
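
A minimal sketch of that workaround with selenium might look like the following. It assumes Chrome plus a matching chromedriver are available, waits for the span.ds-value element from your snippet to appear before parsing, and then hands the rendered HTML to BeautifulSoup; note that Zillow's captcha can still block an automated browser, so treat this as a starting point rather than a guaranteed solution:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

urls = [
    'https://www.zillow.com/homedetails/105-Itasca-St-Boston-MA-02126/59137872_zpid/',
]

driver = webdriver.Chrome()  # assumes Chrome and chromedriver are installed
try:
    for url in urls:
        driver.get(url)

        # wait up to 10 seconds for the JavaScript-rendered price element to appear
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, 'span.ds-value'))
        )

        # parse the fully rendered page with BeautifulSoup, as before
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        price = soup.find('span', {'class': 'ds-value'}).text
        print(url, price)
finally:
    driver.quit()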
