
Web scraping from multiple pages with for loop, part 2

My original problem:

"I have created a web scraping tool for picking data from listed houses.

I have a problem when it comes to changing pages. I made a for loop to go from 1 to some number.

The problem is this: on this website the last page can be different all the time. Now it is 70, but tomorrow it can be 68 or 72. And if I set the range to, for example, (1-74), it will print the last page many times, because if you go over the maximum, the site always loads the last page."

Then I got help from Ricco D, who wrote code that knows when to stop:

import requests
from bs4 import BeautifulSoup as bs

# Requesting a page number past the end (here 1000) makes the site load the
# last page, so its pagination buttons reveal the real page count.
url = 'https://www.etuovi.com/myytavat-asunnot/oulu?haku=M1582971026&sivu=1000'
page = requests.get(url)
soup = bs(page.content, 'html.parser')

pages = []
buttons = soup.find_all('button', class_="Pagination__button__3H2wX")
for button in buttons:
    pages.append(button.text)

print(pages)

This works just fine.
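The collected button texts can then be reduced to the actual last page number. A minimal sketch, with a made-up sample of button texts (real pagination buttons may also include arrows or an ellipsis, which is why non-numeric entries are filtered out):

```python
# Hypothetical sample of the scraped button texts, not real site output.
pages = ['<', '1', '2', '3', '...', '70', '>']

# Keep only the purely numeric labels, then take the largest one.
page_numbers = [int(p) for p in pages if p.isdigit()]
last_page = max(page_numbers)
print(last_page)  # 70
```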

But when I try to combine this with my original code, which also works by itself, I run into an error:

Traceback (most recent call last):
  File "C:/Users/Käyttäjä/PycharmProjects/Etuoviscaper/etuovi.py", line 29, in <module>
    containers = page_soup.find("div", {"class": "ListPage__cardContainer__39dKQ"})
  File "C:\Users\Käyttäjä\PycharmProjects\Etuoviscaper\venv\lib\site-packages\bs4\element.py", line 2173, in __getattr__
    raise AttributeError(
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

This is the error I get.

Any ideas how to get this to work? Thanks

import bs4
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import re
import requests

my_url = 'https://www.etuovi.com/myytavat-asunnot/oulu?haku=M1582971026&sivu=1'

filename = "asunnot.csv"
f = open(filename, "w")
headers = "Neliöt; Hinta; Osoite; Kaupunginosa; Kaupunki; Huoneistoselitelmä; Rakennusvuosi\n"
f.write(headers)

page = requests.get(my_url)
soup = soup(page.content, 'html.parser')

pages = []
buttons = soup.findAll("button", {"class": "Pagination__button__3H2wX"})
for button in buttons:
    pages.append(button.text)


last_page = int(pages[-1])

for sivu in range(1, last_page):

    req = requests.get(my_url + str(sivu))
    page_soup = soup(req.text, "html.parser")
    containers = page_soup.findAll("div", {"class": "ListPage__cardContainer__39dKQ"})

    for container in containers:
        size_list = container.find("div", {"class": "flexboxgrid__col-xs__26GXk flexboxgrid__col-md-4__2DYW-"}).text
        size_number = re.findall("\d+\,*\d+", size_list)
        size = ''.join(size_number)  # Asunnon koko neliöinä

        prize_line = container.find("div", {"class": "flexboxgrid__col-xs-5__1-5sb flexboxgrid__col-md-4__2DYW-"}).text
        prize_number_list = re.findall("\d+\d+", prize_line)
        prize = ''.join(prize_number_list[:2])  # Asunnon hinta

        address_city = container.h4.text

        address_list = address_city.split(', ')[0:1]
        address = ' '.join(address_list)  # osoite

        city_part = address_city.split(', ')[-2]  # kaupunginosa

        city = address_city.split(', ')[-1]  # kaupunki

        type_org = container.h5.text
        type = type_org.replace("|", "").replace(",", "").replace(".", "")  # asuntotyyppi

        year_list = container.find("div", {"class": "flexboxgrid__col-xs-3__3Kf8r flexboxgrid__col-md-4__2DYW-"}).text
        year_number = re.findall("\d+", year_list)
        year = ' '.join(year_number)

        print("pinta-ala: " + size)
        print("hinta: " + prize)
        print("osoite: " + address)
        print("kaupunginosa: " + city_part)
        print("kaupunki: " + city)
        print("huoneistoselittelmä: " + type)
        print("rakennusvuosi: " + year)

        f.write(size + ";" + prize + ";" + address + ";" + city_part + ";" + city + ";" + type + ";" + year + "\n")

f.close()
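As an aside, the manual f.write(...) with ';' separators can be replaced by the standard csv module, which takes care of quoting if a field ever contains the delimiter. A sketch with invented row values (not real scraped data), writing to an in-memory buffer instead of asunnot.csv:

```python
import csv
import io

# In-memory stand-in for the asunnot.csv file; values below are made up.
buf = io.StringIO()
writer = csv.writer(buf, delimiter=';')
writer.writerow(["Neliöt", "Hinta", "Osoite", "Kaupunginosa",
                 "Kaupunki", "Huoneistoselitelmä", "Rakennusvuosi"])
writer.writerow(["75,5", "185000", "Kirkkokatu 1", "Keskusta",
                 "Oulu", "3h k s", "1978"])
print(buf.getvalue())
```

With a real file you would pass `open("asunnot.csv", "w", newline="", encoding="utf-8")` instead of the StringIO buffer.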

Your main problem has to do with the way you use soup. You first import BeautifulSoup as soup - and then you override this name when you create your first BeautifulSoup instance:

soup = soup(page.content, 'html.parser')

From this point on, soup no longer names the BeautifulSoup class, but the object you just created. Hence, when you try to create a new instance some lines further down ( page_soup = soup(req.text, "html.parser") ), this fails, as soup no longer refers to BeautifulSoup.
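The same mistake can be reproduced with any aliased import. A minimal stdlib analogue, with json.loads standing in for BeautifulSoup:

```python
from json import loads as parse  # alias, like `from bs4 import BeautifulSoup as soup`

parse = parse('{"sivu": 1}')     # rebinds the name to the result (a dict)

try:
    parse('{"sivu": 2}')         # the alias no longer refers to the function
except TypeError as exc:
    print(exc)                   # 'dict' object is not callable
```

With BeautifulSoup the failure mode is subtler: a BeautifulSoup object is itself callable (calling it is a shorthand for find_all), so soup(req.text, "html.parser") returns a ResultSet instead of raising immediately, and the AttributeError only surfaces later when .find is called on that ResultSet - exactly the traceback above.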

So the best thing would be to import the library correctly, like so: from bs4 import BeautifulSoup (or import AND use it as bs - like Ricco D did), and then change the two instantiating lines like so:

soup = BeautifulSoup(page.content, 'html.parser') # this is Python2.7-syntax btw

and

page_soup = BeautifulSoup(req.text, "html.parser") # this is Python3-syntax btw

If you're on Python 3, the proper requests syntax would be page.text and not page.content, as .content returns bytes in Python 3, which is not what you want (as BeautifulSoup needs a str). If you're on Python 2.7, you should probably change req.text to req.content.
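The .content vs .text distinction is simply raw bytes versus the decoded str. A quick illustration with a hard-coded body standing in for an HTTP response:

```python
# Stand-ins for a requests.Response: .content is bytes, .text is str.
raw = "hinta: 185 000 \u20ac".encode("utf-8")  # like response.content
text = raw.decode("utf-8")                     # like response.text

print(type(raw).__name__)   # bytes
print(type(text).__name__)  # str
```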

Good luck.

Finding your element by class name doesn't seem to be the best idea, because of this: all the neighbouring elements share the same class name.

[Screenshot: multiple divs with the same class name]

I don't know exactly what you are looking for, because of the language. I suggest you go to the website, press F12, press Ctrl+F, and type an xpath to see what elements you get. If you don't know about XPaths, read this: https://blog.scrapinghub.com/2016/10/27/an-introduction-to-xpath-with-examples
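For experimenting with selectors offline, Python's built-in xml.etree.ElementTree supports a small XPath subset. The markup and class names below are invented stand-ins for the listing page, not the real site's HTML:

```python
import xml.etree.ElementTree as ET

# Toy, well-formed stand-in for the listing markup (class names invented).
html = """<div>
  <div class="card"><h4>Kirkkokatu 1, Keskusta, Oulu</h4></div>
  <div class="card"><h4>Koulutie 2, Kaijonharju, Oulu</h4></div>
</div>"""

root = ET.fromstring(html)
# ElementTree understands e.g. .//tag[@attr='value'] - enough to prototype
# a selector before moving to lxml or browser devtools for full XPath.
for card in root.findall(".//div[@class='card']"):
    print(card.find("h4").text)
```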
