
Web scraping multiple pages

I am scraping a website whose search results span multiple pages. I would very much appreciate your help with the following problem:

I have built a loop around the URL of the web page. However, when I look for the tags in the HTML, only information from page one appears. It seems the loop is not actually advancing through the pages. Unfortunately, I cannot find my mistake in the following code:

for pagenumber in range(1,50):
    url = "http://suchen.mobile.de/fahrzeuge/auto/search.html?zipcodeRadius=100&scopeId=C&ambitCountry=DE&makeModelVariant1.makeId=3500&makeModelVariant1.modelId=115%2C98%2C80%2C99%2C102%2C81%2C100%2C83%2C105%2C82%2C101%2C120%2C121&makeModelVariant1.modelGroupId=53&isSearchRequest=true&pageNumber + str(pageNumber)"
    r = requests.get(url)
    soup = BeautifulSoup(r.content,"lxml")

    # parsing the data from the webpage

    carTypeTemp=[]
    carTypeWeb = soup.find_all("span", {"class":"h3"})
    # writing the car type/description in a list
    for i in range(0,len(carTypeWeb),2):
        carTypeTemp.extend((carTypeWeb[i]))

In your for loop you are doing:

url = "* + str(pageNumber)"

This is literally what the URL will be: the "+ str(pageNumber)" part sits inside the quotes, so it is plain text and nothing is concatenated.

>>> "a url + str(pageNumber)"
'a url + str(pageNumber)'

You want:

url = "*" + str(pagenumber)

Or you could use string formatting, whichever you prefer.
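The options mentioned above can be sketched side by side. This is only an illustration: BASE_URL is a shortened, hypothetical stand-in for the long search URL in the question, and the three helper names are made up for this example. Note that the prefix ends in "pageNumber=" so that the number becomes the value of that query parameter.

```python
# Hypothetical, shortened stand-in for the long search URL in the question.
BASE_URL = "http://example.com/search.html?isSearchRequest=true&pageNumber="


def page_url_concat(pagenumber):
    # Plain concatenation: the closing quote comes BEFORE the plus sign.
    return BASE_URL + str(pagenumber)


def page_url_format(pagenumber):
    # str.format with positional placeholders.
    return "{}{}".format(BASE_URL, pagenumber)


def page_url_fstring(pagenumber):
    # f-string (Python 3.6+).
    return f"{BASE_URL}{pagenumber}"


for pagenumber in range(1, 4):
    print(page_url_concat(pagenumber))
```

All three helpers return the same string for a given page number, so the choice is a matter of taste.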

Edit: I didn't catch the difference in names/capitalization noted in the comment.

You want pagenumber, not pageNumber. The name pageNumber doesn't exist.
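To see why the capitalization matters once the quotes are fixed, here is a small sketch. It assumes a placeholder URL (example.com) and a made-up function name. Python names are case-sensitive, so referring to pageNumber when only pagenumber is defined raises a NameError.

```python
def build_url_with_typo():
    for pagenumber in range(1, 3):
        # Bug on purpose: the loop variable is "pagenumber" (lowercase n),
        # but the expression refers to "pageNumber", which is never defined.
        return "http://example.com/?pageNumber=" + str(pageNumber)


try:
    build_url_with_typo()
except NameError as exc:
    error_message = str(exc)
    print(error_message)
```

In the original code this error never surfaced because the whole expression was inside the string literal and was therefore never evaluated.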

Try changing the first two lines of your code to this:

for pagenumber in range(1,50):
    url = "http://suchen.mobile.de/fahrzeuge/auto/search.html?zipcodeRadius=100&scopeId=C&ambitCountry=DE&makeModelVariant1.makeId=3500&makeModelVariant1.modelId=115%2C98%2C80%2C99%2C102%2C81%2C100%2C83%2C105%2C82%2C101%2C120%2C121&makeModelVariant1.modelGroupId=53&isSearchRequest=true&pageNumber={pagenumber}".format(pagenumber=pagenumber)

Right now you're not sending the GET request with a proper URL.
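One way to make a malformed URL less likely is to build the query string from a dictionary instead of writing it by hand. The following is a minimal sketch using only the standard library's urllib.parse.urlencode. It assumes that a subset of the query parameters from the question is enough for illustration, and build_page_url is a name made up for this example.

```python
from urllib.parse import urlencode

SEARCH_URL = "http://suchen.mobile.de/fahrzeuge/auto/search.html"


def build_page_url(pagenumber):
    # Only some of the parameters from the question are listed here;
    # add the remaining ones in the same way.
    params = {
        "zipcodeRadius": 100,
        "scopeId": "C",
        "ambitCountry": "DE",
        "isSearchRequest": "true",
        "pageNumber": pagenumber,
    }
    # urlencode joins the pairs with "&" and percent-encodes the values.
    return SEARCH_URL + "?" + urlencode(params)


for pagenumber in range(1, 4):
    print(build_page_url(pagenumber))
```

Each iteration now yields a distinct URL ending in pageNumber=1, pageNumber=2, and so on, which can then be passed to requests.get.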

It seems you wrote the loop variable with a lowercase "n" (pagenumber) instead of the uppercase "N" used in pageNumber. Make the two names match, and change

  url = "https://.................. + str(pageNumber)" 

to

url = ("http://suchen.mobile.de/fahrzeuge..... " + str(pageNumber))

With that change, the loop gives me

['BMW 430d xDrive Coupé M Sportpaket Head-Up ACC LED', 'BMW 425d Gran Coupé M-Sportpaket Sport-Aut. Navi Pro', 'BMW 420d xDrive Coupé M Sportpaket Navi Apps PDC']

and

['BMW 435i xDrive Gran Coupé M Sportpaket Navi Prof. A', 'BMW 420 Gran Coupé M Sportpaket NEUES MODELL Nav LED', 'BMW 435i Coupé Sport Line GSD Navi Speed Limit Info']

