
Web scraping multiple similar pages

I'm new to Python web scraping, and I'm trying to collect the addresses of the different Winmar locations in Canada and write the results to a CSV file. So far, the only way I have found to tell the different locations' pages apart is the numeric code at the end of the URL. The problem is that the results do not change as the program runs; it prints, and writes to the CSV file, the results for the first location (305) every time. Thanks for your time and consideration!

Here's my code:

import csv
import requests
from bs4 import BeautifulSoup

x = 0
numbers = ['305', '405', '306', '307', '308', '309', '4273']

f = csv.writer(open('Winmar_locations.csv', 'w'))
f.writerow(['City:', 'Address:'])

for links in numbers:

    for x in range(0, 6):
        url = 'https://www.winmar.ca/find-a-location/' + str(numbers[x])
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "html.parser")

    location_name = soup.find("div", attrs={"class": "title_block"})
    location_name_items = location_name.find_all('h2')

    location_list = soup.find(class_='quick_info')
    location_list_items = location_list.find_all('p')

    for name in location_name_items:
        names = name.text
        names = names.replace('Location | ', '')

    for location in location_list_items:
        locations = location.text.strip()
        locations = locations.replace('24 Hour Emergency | (902) 679-1116','')

    print(names, locations)
    x = x+1

    f.writerow([names, locations])

You had a few things wrong in your code, and there is one quirk of the website you are scraping:

  • First, requesting the URL as https://www.winmar.ca/find-a-location/308 will not select the location properly; it needs to be https://www.winmar.ca/find-a-location/#308, with a hash before the number. Note that the part after `#` is a URL fragment, which browsers handle client-side; it is not sent to the server, which is why the scoping described below is what actually isolates the right location.

  • The website serves duplicate HTML with the same classes for every location. That means nearly all locations are present in the page all the time, and the site's JavaScript chooses which one to show (bad practice, of course). As a result, your matcher always finds the same first location, which explains why the same result was repeated.

  • Lastly, you had several unnecessary loops; you only need to loop over the numbers array, and that's it.
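To make the first point concrete: a URL fragment is purely client-side, so `requests` strips it before the request is sent and every fetch hits the same page regardless of the number. A quick stdlib sketch showing how the fragment is separated from the path:

```python
from urllib.parse import urlsplit

url = 'https://www.winmar.ca/find-a-location/#308'
parts = urlsplit(url)

# The fragment ('308') is separate from the path and is never
# sent to the server; only the path reaches it.
print(parts.path)      # /find-a-location/
print(parts.fragment)  # 308
```

This is why the answer below narrows the parse to the matching `data-id` block instead of relying on the URL alone.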

Here is a modified version of your code:

import csv
import requests
from bs4 import BeautifulSoup

numbers = ['305', '405', '306', '307', '308', '309', '4273']

names = []
locations = []
# note: range(0, 6) covers only the first six ids; use range(len(numbers)) to also visit '4273'
for x in range(0, 6):
    url = 'https://www.winmar.ca/find-a-location/#' + str(numbers[x])
    print(f"pinging url {url}")

    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    scope = soup.find(attrs={"data-id": str(numbers[x])})

    location_name = scope.find("div", attrs={"class": "title_block"})
    location_name_items = location_name.find_all('h2')

    location_list = scope.find(class_='quick_info')
    location_list_items = location_list.find_all('p')

    name = location_name_items[0].text
    print(name)

    names.append(name)

    for location in location_list_items:
        loc = location.text.strip()
        if '24 Hour Emergency' in loc: 
            continue
        print(loc)
        locations.append(loc)

Notice the scoping I did:

    scope = soup.find(attrs={"data-id": str(numbers[x])})

That makes your code immune to however many locations they have loaded into the HTML; you only target the scope containing the location you want.

This results in:

pinging url https://www.winmar.ca/find-a-location/#305
Location | Annapolis
70 Donald E Hiltz Connector Road
Kentville, NS
B4N 3V7
pinging url https://www.winmar.ca/find-a-location/#405
Location | Bridgewater
15585 Highway # 3
Hebbville, NS
B4V 6X7
pinging url https://www.winmar.ca/find-a-location/#306
Location | Halifax
9 Isnor Dr
Dartmouth, NS
B3B 1M1
pinging url https://www.winmar.ca/find-a-location/#307
Location | New Glasgow
5074 Hwy. #4, RR #1
Westville, NS
B0K 2A0
pinging url https://www.winmar.ca/find-a-location/#308
Location | Port Hawkesbury
8 Industrial Park Rd
Lennox Passage, NS
B0E 1V0
pinging url https://www.winmar.ca/find-a-location/#309
Location | Sydney
358 Keltic Drive
Sydney River, NS
B1R 1V7
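The scoping trick is plain BeautifulSoup usage and can be seen on a toy document. The HTML below is a made-up stand-in for the real Winmar markup, kept just structural enough to show why narrowing to the `data-id` block first avoids matching the wrong duplicate:

```python
from bs4 import BeautifulSoup

# Hypothetical page that, like the real site, contains every
# location's markup at once, distinguished only by data-id.
html = """
<div data-id="305"><div class="title_block"><h2>Location | Annapolis</h2></div></div>
<div data-id="306"><div class="title_block"><h2>Location | Halifax</h2></div></div>
"""

soup = BeautifulSoup(html, "html.parser")

# A bare class search would always hit the first title_block (305).
# Narrow to the one data-id block first, then match inside that scope.
scope = soup.find(attrs={"data-id": "306"})
title = scope.find("div", attrs={"class": "title_block"}).find("h2").text
print(title)  # Location | Halifax
```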

Although you already have a qualified answer, I thought I'd offer mine. I've tried to make the script concise, shaking off the verbosity. Make sure your bs4 version is 4.7.0 or later so that it supports the pseudo-selector I've used within the script to locate the address.

import csv
import requests
from bs4 import BeautifulSoup

base = 'https://www.winmar.ca/find-a-location/#{}'

numbers = ['305', '405', '306', '307', '308', '309', '4273']

with open("Winmar_locations.csv","w",newline="") as f:
    writer = csv.writer(f)
    writer.writerow(['City','Address'])

    while numbers:
        num = numbers.pop(0)
        r = requests.get(base.format(num))
        soup = BeautifulSoup(r.content,"html.parser")

        location_name = soup.select_one(f"[data-id='{num}'] .title_block > h2.title").contents[-1]
        location_address = soup.select_one(f"[data-id='{num}'] .heading:contains('Address') + p").get_text(strip=True)
        writer.writerow([location_name,location_address])
        print(location_name,location_address)
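One caveat about the `:contains()` pseudo-class used above: newer versions of soupsieve (the selector engine behind bs4's `select_one`) deprecate that spelling in favour of `:-soup-contains()`. On a recent install the equivalent selector would look like this, shown here on toy markup standing in for the real page:

```python
from bs4 import BeautifulSoup

# Minimal stand-in markup: a heading followed by the address paragraph,
# mirroring the structure the selector above relies on.
html = """
<div data-id="305">
  <p class="heading">Address</p>
  <p>70 Donald E Hiltz Connector Road</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# :-soup-contains is the non-deprecated spelling of :contains;
# '+ p' selects the paragraph immediately after the matching heading.
address = soup.select_one("[data-id='305'] .heading:-soup-contains('Address') + p")
print(address.get_text(strip=True))  # 70 Donald E Hiltz Connector Road
```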
