
How do I simultaneously scrape two pages and produce two distinct lists within one nested 'for-loop'?

I'm scraping from two URLs that have the same DOM structure, so I'm trying to find a way to scrape both of them at the same time.
The only caveat is that the data scraped from these two pages needs to end up in distinctly named lists.

To explain with an example, here is what I've tried:

import requests
from bs4 import BeautifulSoup as bs


urls = ['https://www.basketball-reference.com/leaders/ws_career.html',
        'https://www.basketball-reference.com/leaders/ws_per_48_career.html']

ws_list = []
ws48_list = []

categories = [ws_list, ws48_list]

# 'headers' is used below but was never defined in the snippet; a minimal stand-in
headers = {'User-Agent': 'Mozilla/5.0'}

for url in urls:
    response = requests.get(url, headers=headers)
    soup = bs(response.content, 'html.parser')
    section = soup.find('table', class_='stats_table')
    for a in section.find_all('a'):
        player_name = a.text
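        # note: this inner loop appends every name to BOTH lists,
        # which is why the two printed lists come out identical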
        for cat_list in categories:
            cat_list.append(player_name)
print(ws48_list)
print(ws_list)

This ends up printing two identical lists, when I was shooting for two lists each unique to its page.
How do I accomplish this? Would it be better practice to code it another way?

Just add each name to the appropriate list and the problem is solved:

for i, url in enumerate(urls):
    response = requests.get(url)
    soup = bs(response.content, 'html.parser')
    section = soup.find('table', class_='stats_table')
    for a in section.find_all('a'):
        player_name = a.text
        categories[i].append(player_name)
print(ws48_list)
print(ws_list)
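
The same pairing can also be written without the index by zipping the URLs with their target lists (a stylistic variant, not part of the original answer; it reuses urls and categories from the question):

for url, cat_list in zip(urls, categories):
    response = requests.get(url)
    soup = bs(response.content, 'html.parser')
    section = soup.find('table', class_='stats_table')
    for a in section.find_all('a'):
        # each page's names go only into its paired list
        cat_list.append(a.text)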

Instead of trying to append to already existing lists, just create new ones. Make a function to do the scraping and pass each url to it in turn.

import requests
from bs4 import BeautifulSoup as bs

urls = ['https://www.basketball-reference.com/leaders/ws_career.html',
        'https://www.basketball-reference.com/leaders/ws_per_48_career.html']

def parse_page(url, headers=None):

    response = requests.get(url, headers=headers)
    soup = bs(response.content, 'html.parser')
    section = soup.find('table', class_='stats_table')
    return [a.text for a in section.find_all('a')]


ws_list, ws48_list = [parse_page(url) for url in urls]

print('ws_list = %r' % ws_list)
print('ws48_list = %r' % ws48_list)
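
Note that the list comprehension preserves the order of urls, so the tuple unpacking assigns each page's names to the matching variable.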

You can use a function to define your scraping logic, then just call it for your urls.

import requests
from bs4 import BeautifulSoup as bs

def scrape(url):
    response = requests.get(url)
    soup = bs(response.content, 'html.parser')
    section = soup.find('table', class_='stats_table')
    names = []
    for a in section.find_all('a'):
        player_name = a.text
        names.append(player_name)
    return names

ws_list = scrape('https://www.basketball-reference.com/leaders/ws_career.html')
ws48_list = scrape('https://www.basketball-reference.com/leaders/ws_per_48_career.html')

print(ws_list)
print(ws48_list)
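
The question title also asks about scraping both pages simultaneously, which none of the snippets above actually do (they fetch one page after the other). As a minimal sketch of that idea, assuming the scrape function from the last answer, the standard-library ThreadPoolExecutor can run the two requests in parallel:

from concurrent.futures import ThreadPoolExecutor

urls = ['https://www.basketball-reference.com/leaders/ws_career.html',
        'https://www.basketball-reference.com/leaders/ws_per_48_career.html']

# executor.map preserves input order, so the results unpack to the right names
with ThreadPoolExecutor(max_workers=2) as executor:
    ws_list, ws48_list = executor.map(scrape, urls)

print(ws_list)
print(ws48_list)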
