Web scraping from multiple pages with for loop part 2
My original problem:
"I have created a web scraping tool for picking data from listed houses. I have a problem when it comes to changing pages. I made a for loop to go from 1 to some number.
The problem is this: on this website the last page can be different all the time. Now it is 70, but tomorrow it can be 68 or 72. And if I put the range, for example, to (1-74) it will print the last page many times, because if you go over the maximum the site always loads the last page."
Then I got help from Ricco D, who wrote code that knows when to stop:
import requests
from bs4 import BeautifulSoup as bs

# Requesting a page number beyond the maximum makes the site serve the last page,
# so the pagination buttons on it reveal the real page count
url = 'https://www.etuovi.com/myytavat-asunnot/oulu?haku=M1582971026&sivu=1000'
page = requests.get(url)
soup = bs(page.content, 'html.parser')

pages = []
buttons = soup.find_all('button', class_="Pagination__button__3H2wX")
for button in buttons:
    pages.append(button.text)
print(pages)
This works just fine. But when I try to combine it with my original code, which also works by itself, I run into an error:
Traceback (most recent call last):
File "C:/Users/Käyttäjä/PycharmProjects/Etuoviscaper/etuovi.py", line 29, in <module>
containers = page_soup.find("div", {"class": "ListPage__cardContainer__39dKQ"})
File "C:\Users\Käyttäjä\PycharmProjects\Etuoviscaper\venv\lib\site-packages\bs4\element.py", line 2173, in __getattr__
raise AttributeError(
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
This is the error I get. Any ideas how to get this to work? Thanks.
import bs4
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import re
import requests

my_url = 'https://www.etuovi.com/myytavat-asunnot/oulu?haku=M1582971026&sivu=1'

filename = "asunnot.csv"
f = open(filename, "w")
headers = "Neliöt; Hinta; Osoite; Kaupunginosa; Kaupunki; Huoneistoselitelmä; Rakennusvuosi\n"
f.write(headers)

page = requests.get(my_url)
soup = soup(page.content, 'html.parser')

pages = []
buttons = soup.findAll("button", {"class": "Pagination__button__3H2wX"})
for button in buttons:
    pages.append(button.text)
last_page = int(pages[-1])

for sivu in range(1, last_page):
    req = requests.get(my_url + str(sivu))
    page_soup = soup(req.text, "html.parser")
    containers = page_soup.findAll("div", {"class": "ListPage__cardContainer__39dKQ"})
    for container in containers:
        size_list = container.find("div", {"class": "flexboxgrid__col-xs__26GXk flexboxgrid__col-md-4__2DYW-"}).text
        size_number = re.findall(r"\d+\,*\d+", size_list)
        size = ''.join(size_number)  # Asunnon koko neliöinä
        prize_line = container.find("div", {"class": "flexboxgrid__col-xs-5__1-5sb flexboxgrid__col-md-4__2DYW-"}).text
        prize_number_list = re.findall(r"\d+\d+", prize_line)
        prize = ''.join(prize_number_list[:2])  # Asunnon hinta
        address_city = container.h4.text
        address_list = address_city.split(', ')[0:1]
        address = ' '.join(address_list)  # osoite
        city_part = address_city.split(', ')[-2]  # kaupunginosa
        city = address_city.split(', ')[-1]  # kaupunki
        type_org = container.h5.text
        type = type_org.replace("|", "").replace(",", "").replace(".", "")  # asuntotyyppi
        year_list = container.find("div", {"class": "flexboxgrid__col-xs-3__3Kf8r flexboxgrid__col-md-4__2DYW-"}).text
        year_number = re.findall(r"\d+", year_list)
        year = ' '.join(year_number)
        print("pinta-ala: " + size)
        print("hinta: " + prize)
        print("osoite: " + address)
        print("kaupunginosa: " + city_part)
        print("kaupunki: " + city)
        print("huoneistoselittelmä: " + type)
        print("rakennusvuosi: " + year)
        f.write(size + ";" + prize + ";" + address + ";" + city_part + ";" + city + ";" + type + ";" + year + "\n")
f.close()
Your main problem has to do with the way you use soup. You first import BeautifulSoup as soup, and then you override this name when you create your first BeautifulSoup instance: soup = soup(page.content, 'html.parser'). From this point on, soup no longer names the library class BeautifulSoup, but the object you just created. Hence, when some lines further down you try to create a new instance (page_soup = soup(req.text, "html.parser")), this fails, because soup no longer refers to BeautifulSoup.
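The failure mode can be reproduced in a few lines. A detail worth knowing: calling a BeautifulSoup/Tag object is a shorthand for find_all(), which is why the shadowed call does not blow up immediately but silently hands back a ResultSet (a minimal sketch with a made-up HTML snippet):

```python
from bs4 import BeautifulSoup as soup

# The name 'soup' starts out as the BeautifulSoup class...
soup = soup("<div class='card'>hello</div>", "html.parser")
# ...and from here on it names a BeautifulSoup *object* instead.

# Calling a soup/Tag object is an alias for find_all(), so this does not
# build a new soup; it searches the old one and returns a ResultSet:
page_soup = soup("<div>next page</div>", "html.parser")
print(type(page_soup).__name__)   # ResultSet

# Which is exactly where the AttributeError in the traceback comes from:
try:
    page_soup.find("div", {"class": "ListPage__cardContainer__39dKQ"})
except AttributeError as err:
    print("AttributeError raised")
```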
So the best thing would be to import the library correctly, like so: from bs4 import BeautifulSoup (or import AND use it as bs, like Ricco D did), and then change the two instantiating lines like so:
soup = BeautifulSoup(page.content, 'html.parser')  # this is Python2.7-syntax btw
and
page_soup = BeautifulSoup(req.text, "html.parser")  # this is Python3-syntax btw
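With the class imported under a name that is never reassigned, the pagination part of the combined script stops conflicting with itself. A minimal sketch against a made-up HTML stand-in (in the real script this string would come from requests.get):

```python
from bs4 import BeautifulSoup

# Stand-in for req.text from requests:
page_html = """
<button class="Pagination__button__3H2wX">1</button>
<button class="Pagination__button__3H2wX">2</button>
<button class="Pagination__button__3H2wX">70</button>
"""

page_soup = BeautifulSoup(page_html, "html.parser")   # the class keeps its name
pages = [b.text for b in page_soup.find_all("button", class_="Pagination__button__3H2wX")]
last_page = int(pages[-1])
print(last_page)  # 70
```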
If you're on Python 3, the proper requests syntax would be page.text and not page.content, as .content returns bytes in Python 3, which is not what you want (BeautifulSoup needs a str). If you're on Python 2.7, you should probably change req.text to req.content.
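The bytes/str distinction is easy to check without the website; this stand-in shows the types that .content and .text would hand you (the sample string is made up):

```python
# What response.content gives you in Python 3: raw, undecoded bytes
raw = "Neliöt: 57,5 m²".encode("utf-8")
print(type(raw))          # <class 'bytes'>

# What response.text gives you: a decoded str (requests picks the encoding)
decoded = raw.decode("utf-8")
print(type(decoded))      # <class 'str'>
```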
Good luck.
Finding your element by class name doesn't seem to be the best idea, because of this: the same class name is used for all the sibling elements. I don't know exactly what you are looking for, because of the language. I suggest you go to the website, press F12, press Ctrl+F and type an xpath to see which elements you get. If you don't know about xpaths, read this: https://blog.scrapinghub.com/2016/10/27/an-introduction-to-xpath-with-examples