Dictionary loop calling same website when scraping

I'm very new to Python, so this is probably straightforward and might be an indentation issue. I'm trying to scrape several web pages using Beautiful Soup, creating a list of dictionaries that I can use afterwards to manipulate the data.

The code seems to work fine, but the list I end up with (liste_flat) is just a list of two identical dictionaries. I want a list of different dictionaries.

import requests
from bs4 import BeautifulSoup as bs

def scrap_post(url):
    url = "https://www.findproperly.co.uk/property-to-rent-london/commute/W3siaWQiOjkxMDYsImZyZXEiOjUsIm1ldGgiOiJwdWJ0cmFucyIsImxuZyI6LTAuMTI0Nzg5LCJsYXQiOjUxLjUwODR9XQ==/max-time/90/page/".format(i)
    dictionary = {}
    response = requests.get(url)
    soup = bs(response.text,"lxml")
    taille = len(soup.find_all("div", class_="col-sm-6 col-md-4 col-lg-3 pl-grid-prop not-viewed ")) #48 entries
    for num_ville in range(0,taille):
        print(num_ville)
        apt_id = soup.find_all("div", class_="col-sm-6 col-md-4 col-lg-3 pl-grid-prop not-viewed ")[num_ville]['data-id']
        entry = soup.find_all("div", class_="col-sm-6 col-md-4 col-lg-3 pl-grid-prop not-viewed ")[num_ville]
        pricepw = soup.find_all('div', class_='col-xs-5 col-sm-4 price')[num_ville].find('h3').text.encode('utf-8').replace('\xc2\xa3','',).replace('pw','',).strip()
        rooms = soup.find_all('div', class_='col-xs-6 type')[num_ville].find('p').text.encode('utf-8').strip()
        lat = soup.find_all('div', {"itemprop":"geo"})[num_ville].find('meta', {'itemprop':'latitude'})['content']
        lon = soup.find_all('div', {"itemprop":"geo"})[num_ville].find('meta', {'itemprop':'longitude'})['content']
        dictionary[num_ville]={'Price per week':pricepw,'Rooms':rooms,'Latitude':lat,'Longitude':lon}
    return dictionary

#get all URLs
liste_url = []
liste_url = ['https://www.findproperly.co.uk/property-to-rent-london/commute/W3siaWQiOjkxMDYsImZyZXEiOjUsIm1ldGgiOiJwdWJ0cmFucyIsImxuZyI6LTAuMTI0Nzg5LCJsYXQiOjUxLjUwODR9XQ==/max-time/90/page/''%i' %i for i in range(1,3)]

#get flats
liste_flat = [scrap_post(i) for i in liste_url] 

I must somehow be looping over the same website twice. Any advice on how to make sure I'm looping over different websites?

Thanks!

Yes, you are requesting the same page every time, because the first line of your function overwrites the url argument with a hardcoded string:

url = "https://www.findproperly.co.uk/property-to-rent-london/commute/W3siaWQiOjkxMDYsImZyZXEiOjUsIm1ldGgiOiJwdWJ0cmFucyIsImxuZyI6LTAuMTI0Nzg5LCJsYXQiOjUxLjUwODR9XQ==/max-time/90/page/".format(i)

This means that regardless of what you pass to the function, it will always use this URL, so you should remove that line. The string also contains no placeholder, so .format(i) essentially does nothing.
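To illustrate both points, here is a minimal sketch rather than a drop-in replacement. It assumes a hypothetical `fetch` parameter and a hypothetical `fake_fetch` helper standing in for `requests.get(url).text`, so it runs without network access; `PAGE_URL` is also a name introduced here. The Beautiful Soup parsing loop from the question is left out and would go where the comment indicates.

```python
PAGE_URL = (
    "https://www.findproperly.co.uk/property-to-rent-london/commute/"
    "W3siaWQiOjkxMDYsImZyZXEiOjUsIm1ldGgiOiJwdWJ0cmFucyIsImxuZyI6LTAuMTI0Nzg5LCJsYXQiOjUxLjUwODR9XQ=="
    "/max-time/90/page/{}"  # the {} placeholder is what .format() fills in
)

# Without a {} placeholder, str.format returns the string unchanged.
assert "page/".format(2) == "page/"
# With a placeholder, the argument is substituted.
assert "page/{}".format(2) == "page/2"


def scrap_post(url, fetch):
    """Scrape one page, using `url` exactly as passed in (never reassigned).

    `fetch` is a hypothetical stand-in for `requests.get(url).text`.
    """
    html = fetch(url)
    # The Beautiful Soup parsing loop from the question would go here.
    return {"url": url, "html": html}


def fake_fetch(url):
    """Hypothetical fetcher that echoes the URL instead of downloading it."""
    return "<html>page for {}</html>".format(url)


# Build one URL per page (pages 1 and 2), then scrape each distinct URL.
liste_url = [PAGE_URL.format(i) for i in range(1, 3)]
liste_flat = [scrap_post(url, fake_fetch) for url in liste_url]

for flat in liste_flat:
    print(flat["url"])
```

Because the function no longer reassigns `url`, each entry in `liste_flat` comes from a different page. In the real script, `fetch` could simply be dropped and `requests.get(url)` called directly, as in the question.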
