
Web scraping to get data from website

I am learning Python and trying to scrape a website that has 10 property listings on each page. I want to extract information from each listing on each page. My code for the first 5 pages is as follows:

import requests 
from bs4 import BeautifulSoup

urls = []
for i in range(1,5):
    pages = "http://www.realcommercial.com.au/sold/property-offices-retail-showrooms+bulky+goods-land+development-hotel+leisure-medical+consulting-other-in-vic/list-{0}?includePropertiesWithin=includesurrounding&activeSort=list-date&autoSuggest=true".format(i)
    urls.append(pages)
    for info in urls:
         page = requests.get(info)
         soup = BeautifulSoup(page.content, 'html.parser')
         links = soup.find_all('a', attrs ={'class' :'details-panel'})
         hrefs = [link['href'] for link in links]
         Data = []
         for urls in hrefs:
             pages = requests.get(urls)
             soup_2 =BeautifulSoup(pages.content, 'html.parser')
             Address_1 = soup_2.find_all('p', attrs={'class' :'full-address'})
             Address = [Address.text.strip() for Address in Address_1]
             Date = soup_2.find_all('li', attrs ={'class' :'sold-date'})
             Sold_Date = [Sold_Date.text.strip() for Sold_Date in Date]
             Area_1 =soup_2.find_all('ul', attrs={'class' :'summaryList'})
             Area = [Area.text.strip() for Area in Area_1]
             Agency_1=soup_2.find_all('div', attrs={'class' :'agencyName ellipsis'})
             Agency_Name=[Agency_Name.text.strip() for Agency_Name in Agency_1]
             Agent_1=soup_2.find_all('div', attrs={'class' :'agentName ellipsis'})
             Agent_Name=[Agent_Name.text.strip() for Agent_Name in Agent_1]
             Data.append(Sold_Date+Address+Area+Agency_Name+Agent_Name)

The above code is not working for me. Please let me know the correct code to achieve this.

One problem in your code is that you declared the variable urls twice. You need to update the code as below:

import requests 
from bs4 import BeautifulSoup

urls = []
for i in range(1,6):
    pages = "http://www.realcommercial.com.au/sold/property-offices-retail-showrooms+bulky+goods-land+development-hotel+leisure-medical+consulting-other-in-vic/list-{0}?includePropertiesWithin=includesurrounding&activeSort=list-date&autoSuggest=true".format(i)
    urls.append(pages)

Data = []
for info in urls:
    page = requests.get(info)
    soup = BeautifulSoup(page.content, 'html.parser')
    links = soup.find_all('a', attrs ={'class' :'details-panel'})
    hrefs = [link['href'] for link in links]

    for href in hrefs:
        pages = requests.get(href)
        soup_2 =BeautifulSoup(pages.content, 'html.parser')
        Address_1 = soup_2.find_all('p', attrs={'class' :'full-address'})
        Address = [Address.text.strip() for Address in Address_1]
        Date = soup_2.find_all('li', attrs ={'class' :'sold-date'})
        Sold_Date = [Sold_Date.text.strip() for Sold_Date in Date]
        Area_1 =soup_2.find_all('ul', attrs={'class' :'summaryList'})
        Area = [Area.text.strip() for Area in Area_1]
        Agency_1=soup_2.find_all('div', attrs={'class' :'agencyName ellipsis'})
        Agency_Name=[Agency_Name.text.strip() for Agency_Name in Agency_1]
        Agent_1=soup_2.find_all('div', attrs={'class' :'agentName ellipsis'})
        Agent_Name=[Agent_Name.text.strip() for Agent_Name in Agent_1]
        Data.append(Sold_Date+Address+Area+Agency_Name+Agent_Name)

print(Data)
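
If you also want to persist the scraped rows, here is a minimal sketch using Python's csv module (the file name and header row are assumptions based on the fields appended to Data above):

import csv

# Each entry in Data is a concatenation of per-listing lists, so row
# lengths can vary when a page is missing a field.
with open('sold_properties.csv', 'w', newline='') as f:  # file name is an assumption
    writer = csv.writer(f)
    writer.writerow(['Sold_Date', 'Address', 'Area', 'Agency_Name', 'Agent_Name'])
    writer.writerows(Data)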

Use headers in the code, and use string concatenation instead of .format(i).

The code looks like this:

import requests 
from bs4 import BeautifulSoup

urls = []
for i in range(1,6):
    pages = 'http://www.realcommercial.com.au/sold/property-offices-retail-showrooms+bulky+goods-land+development-hotel+leisure-medical+consulting-other-in-vic/list-' + str(i) + '?includePropertiesWithin=includesurrounding&activeSort=list-date&autoSuggest=true'
    urls.append(pages)

Data = []
for info in urls:
    headers = {'User-agent':'Mozilla/5.0'}
    page = requests.get(info,headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    links = soup.find_all('a', attrs ={'class' :'details-panel'})
    hrefs = [link['href'] for link in links]

    for href in hrefs:
        pages = requests.get(href, headers=headers)
        soup_2 = BeautifulSoup(pages.content, 'html.parser')
        Address_1 = soup_2.find_all('p', attrs={'class' :'full-address'})
        Address = [Address.text.strip() for Address in Address_1]
        Date = soup_2.find_all('li', attrs ={'class' :'sold-date'})
        Sold_Date = [Sold_Date.text.strip() for Sold_Date in Date]
        Area_1 = soup_2.find_all('ul', attrs={'class' :'summaryList'})
        Area = [Area.text.strip() for Area in Area_1]
        Agency_1 = soup_2.find_all('div', attrs={'class' :'agencyName ellipsis'})
        Agency_Name = [Agency_Name.text.strip() for Agency_Name in Agency_1]
        Agent_1 = soup_2.find_all('div', attrs={'class' :'agentName ellipsis'})
        Agent_Name = [Agent_Name.text.strip() for Agent_Name in Agent_1]
        Data.append(Sold_Date+Address+Area+Agency_Name+Agent_Name)

print(Data)
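
As a side note, a requests.Session lets you set the User-Agent header once and reuse it for every request (a sketch, not part of the original answer):

import requests

session = requests.Session()
session.headers.update({'User-agent': 'Mozilla/5.0'})
# The header is now sent automatically on every request made through the
# session, including the per-listing requests inside the inner loop.
page = session.get(urls[0])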

You can tell BeautifulSoup to only give you links containing an href, which makes your code safer. Also, rather than modifying your URL to include a page number, you could extract the next > link at the bottom of each page. The loop then stops automatically when the final page has been returned:

import requests 
from bs4 import BeautifulSoup

base_url = r"http://www.realcommercial.com.au"
url = base_url + "/sold/property-offices-retail-showrooms+bulky+goods-land+development-hotel+leisure-medical+consulting-other-in-vic/list-1?includePropertiesWithin=includesurrounding&activeSort=list-date&autoSuggest=true"
data = []

for _ in range(10):
    print(url)
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    hrefs = [link['href'] for link in soup.find_all('a', attrs={'class' : 'details-panel'}, href=True)]

    for href in hrefs:
         pages = requests.get(href)
         soup_2 = BeautifulSoup(pages.content, 'html.parser')
         Address_1 = soup_2.find_all('p', attrs={'class' :'full-address'})
         Address = [Address.text.strip() for Address in Address_1]
         Date = soup_2.find_all('li', attrs ={'class' :'sold-date'})
         Sold_Date = [Sold_Date.text.strip() for Sold_Date in Date]
         Area_1 = soup_2.find_all('ul', attrs={'class' :'summaryList'})
         Area = [Area.text.strip() for Area in Area_1]
         Agency_1 = soup_2.find_all('div', attrs={'class' :'agencyName ellipsis'})
         Agency_Name = [Agency_Name.text.strip() for Agency_Name in Agency_1]
         Agent_1 = soup_2.find_all('div', attrs={'class' :'agentName ellipsis'})
         Agent_Name = [Agent_Name.text.strip() for Agent_Name in Agent_1]

         data.append(Sold_Date+Address+Area+Agency_Name+Agent_Name)

    # Find next page (if any)
    next_button = soup.find('li', class_='rui-pagination-next')

    if next_button:
        url = base_url + next_button.parent['href']
    else:
        break


for entry in data:
    print(entry)
    print("---------")
