Why does my Python for loop only keep the last table when I want to scrape tables from a list of URLs and convert them to a DataFrame?
I've got some issues converting tables from a list of URLs into one large DataFrame containing the rows from every URL. My code seems to run fine, but when I export a new CSV it only contains the last 10 rows from the last URL instead of the rows from each URL. Does someone know why?
PS: I tried to find the answer by browsing Stack Overflow, but I did not find one.
import pandas as pd
from bs4 import BeautifulSoup
import requests
# Pandas/numpy for data manipulation
import numpy as np

# URL 0 - 10 SCRAPE
BASE_URL = [
    'https://datan.fr/groupes/legislature-16/re',
    'https://datan.fr/groupes/legislature-16/rn',
    'https://datan.fr/groupes/legislature-16/lfi-nupes',
    'https://datan.fr/groupes/legislature-16/lr',
    'https://datan.fr/groupes/legislature-16/dem',
    'https://datan.fr/groupes/legislature-16/soc',
    'https://datan.fr/groupes/legislature-16/hor',
    'https://datan.fr/groupes/legislature-16/ecolo',
    'https://datan.fr/groupes/legislature-16/gdr-nupes',
    'https://datan.fr/groupes/legislature-16/liot',
]

Tous_les_groupes = []
b = 0
for b in BASE_URL:
    html = requests.get(b).text
    soup = BeautifulSoup(html, "html.parser")
    # identify table we want to scrape
    Tableau_groupe = soup.find('table', {"class": "table"})
    print(Tableau_groupe)

try:
    for row in Tableau_groupe.find_all('tr'):
        cols = row.find_all('td')
        print(cols)
        if len(cols) == 4:
            Tous_les_groupes.append((b, cols[0].text.strip(), cols[1].text.strip(), cols[2].text.strip(), cols[3].text.strip()))
        #print(Tous_les_groupes)
except:
    pass

Groupes_DF = np.asarray(Tous_les_groupes)
#print(Groupes_DF)
#print(len(Groupes_DF))
df = pd.DataFrame(Groupes_DF)
df.columns = ['url', 'G', 'Tx', 'note', 'Number']
#print(df.head(10))
df.to_csv('output.csv')
Thanks for your help, and all have a great day.
In the first loop you assign the result of soup.find to Tableau_groupe, but each iteration overwrites the previous value, so after the loop only the last table remains.
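The effect can be seen in a minimal, network-free sketch (the page names are made up purely for illustration):

```python
# A variable assigned inside a loop keeps only the value from the
# final iteration once the loop has finished.
pages = ["page-A", "page-B", "page-C"]

for page in pages:
    table = f"table-from-{page}"  # overwritten on every pass

# Any code placed after the loop sees only the last value:
print(table)  # table-from-page-C
```

That is exactly what happens to Tableau_groupe: the row-extraction code sits after the loop, so it only ever sees the table from the last URL.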
Try moving the second for loop inside the first one:
for b in BASE_URL:
    html = requests.get(b).text
    soup = BeautifulSoup(html, "html.parser")
    # identify table we want to scrape
    Tableau_groupe = soup.find('table', {"class": "table"})
    print(Tableau_groupe)
    try:
        for row in Tableau_groupe.find_all('tr'):
            cols = row.find_all('td')
            print(cols)
            if len(cols) == 4:
                Tous_les_groupes.append((b, cols[0].text.strip(), cols[1].text.strip(), cols[2].text.strip(), cols[3].text.strip()))
    except:
        pass
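As a side note, the bare `except: pass` silently hides pages where soup.find returns no matching table. A sketch of an explicit guard instead (the helper name extract_rows is made up; the BeautifulSoup calls are the same as above):

```python
def extract_rows(url, table):
    """Collect (url, col1..col4) tuples from a BeautifulSoup <table> tag.

    `table` is the result of soup.find('table', {"class": "table"})
    and may be None when the page has no matching table.
    """
    if table is None:
        print(f"no table found on {url}")  # failure is visible, not swallowed
        return []
    rows = []
    for row in table.find_all('tr'):
        cols = row.find_all('td')
        if len(cols) == 4:
            rows.append((url, *(c.text.strip() for c in cols)))
    return rows
```

This way a page with an unexpected layout is reported rather than silently skipped.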
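Finally, the detour through np.asarray is unnecessary: pd.DataFrame accepts the list of tuples directly, and the column names can be passed at construction time. A sketch, with invented sample rows that only mimic the shape of Tous_les_groupes:

```python
import pandas as pd

# Invented sample rows with the same shape as Tous_les_groupes:
# (url, G, Tx, note, Number)
Tous_les_groupes = [
    ("https://datan.fr/groupes/legislature-16/re", "A", "98%", "7.5", "170"),
    ("https://datan.fr/groupes/legislature-16/rn", "B", "96%", "7.1", "88"),
]

df = pd.DataFrame(Tous_les_groupes, columns=['url', 'G', 'Tx', 'note', 'Number'])
print(df.shape)  # (2, 5)
df.to_csv('output.csv', index=False)
```

Passing `index=False` to to_csv also keeps the pandas row index out of the output file.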