Why does my Python for loop only keep the last table when I want to scrape tables from a list of URLs and convert them to a DataFrame?
I've got some issues converting tables from a list of URLs into one large DataFrame containing the rows from every URL. My code seems to run fine, but when I export a new CSV it only contains the last 10 rows from the last URL instead of the rows from each URL. Does someone know why?
PS: I tried to find the answer by browsing Stack Overflow, but I did not find one.
import pandas as pd
from bs4 import BeautifulSoup
import requests
# Pandas/numpy for data manipulation
import numpy as np

# URL 0 - 10 SCRAPE
BASE_URL = [
    'https://datan.fr/groupes/legislature-16/re',
    'https://datan.fr/groupes/legislature-16/rn',
    'https://datan.fr/groupes/legislature-16/lfi-nupes',
    'https://datan.fr/groupes/legislature-16/lr',
    'https://datan.fr/groupes/legislature-16/dem',
    'https://datan.fr/groupes/legislature-16/soc',
    'https://datan.fr/groupes/legislature-16/hor',
    'https://datan.fr/groupes/legislature-16/ecolo',
    'https://datan.fr/groupes/legislature-16/gdr-nupes',
    'https://datan.fr/groupes/legislature-16/liot',
]

Tous_les_groupes = []
b = 0
for b in BASE_URL:
    html = requests.get(b).text
    soup = BeautifulSoup(html, "html.parser")
    # identify table we want to scrape
    Tableau_groupe = soup.find('table', {"class": "table"})
    print(Tableau_groupe)

try:
    for row in Tableau_groupe.find_all('tr'):
        cols = row.find_all('td')
        print(cols)
        if len(cols) == 4:
            Tous_les_groupes.append((b, cols[0].text.strip(), cols[1].text.strip(), cols[2].text.strip(), cols[3].text.strip()))
        #print(Tous_les_groupes)
except:
    pass

Groupes_DF = np.asarray(Tous_les_groupes)
#print(Groupes_DF)
#print(len(Groupes_DF))
df = pd.DataFrame(Groupes_DF)
df.columns = ['url', 'G', 'Tx', 'note', 'Number']
#print(df.head(10))
df.to_csv('output.csv')
Thanks for your help, and all have a great day.
In the first loop you assign the result of soup.find to Tableau_groupe, but each iteration overwrites the previous value, so after the loop only the last table remains.
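The effect can be seen in a minimal, network-free sketch (the page names are made up purely for illustration):

```python
# A variable assigned inside a loop keeps only the value from the
# final iteration once the loop has finished.
pages = ["page-A", "page-B", "page-C"]

for page in pages:
    table = f"table-from-{page}"  # overwritten on every pass

# Any code placed after the loop sees only the last value:
print(table)  # table-from-page-C
```

That is exactly what happens to Tableau_groupe: the row-extraction code sits after the loop, so it only ever sees the table from the last URL.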
Try moving the second for loop inside the first one:
for b in BASE_URL:
    html = requests.get(b).text
    soup = BeautifulSoup(html, "html.parser")
    # identify table we want to scrape
    Tableau_groupe = soup.find('table', {"class": "table"})
    print(Tableau_groupe)
    try:
        for row in Tableau_groupe.find_all('tr'):
            cols = row.find_all('td')
            print(cols)
            if len(cols) == 4:
                Tous_les_groupes.append((b, cols[0].text.strip(), cols[1].text.strip(), cols[2].text.strip(), cols[3].text.strip()))
    except:
        pass
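As a side note, the bare `except: pass` silently hides pages where soup.find returns no matching table. A sketch of an explicit guard instead (the helper name extract_rows is made up; the BeautifulSoup calls are the same as above):

```python
def extract_rows(url, table):
    """Collect (url, col1..col4) tuples from a BeautifulSoup <table> tag.

    `table` is the result of soup.find('table', {"class": "table"})
    and may be None when the page has no matching table.
    """
    if table is None:
        print(f"no table found on {url}")  # failure is visible, not swallowed
        return []
    rows = []
    for row in table.find_all('tr'):
        cols = row.find_all('td')
        if len(cols) == 4:
            rows.append((url, *(c.text.strip() for c in cols)))
    return rows
```

This way a page with an unexpected layout is reported rather than silently skipped.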
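Finally, the detour through np.asarray is unnecessary: pd.DataFrame accepts the list of tuples directly, and the column names can be passed at construction time. A sketch, with invented sample rows that only mimic the shape of Tous_les_groupes:

```python
import pandas as pd

# Invented sample rows with the same shape as Tous_les_groupes:
# (url, G, Tx, note, Number)
Tous_les_groupes = [
    ("https://datan.fr/groupes/legislature-16/re", "A", "98%", "7.5", "170"),
    ("https://datan.fr/groupes/legislature-16/rn", "B", "96%", "7.1", "88"),
]

df = pd.DataFrame(Tous_les_groupes, columns=['url', 'G', 'Tx', 'note', 'Number'])
print(df.shape)  # (2, 5)
df.to_csv('output.csv', index=False)
```

Passing `index=False` to to_csv also keeps the pandas row index out of the output file.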