Python BeautifulSoup and Pandas extract table from list of urls and save all the tables into single dataframe or save as csv
I am trying to extract tabular data from a list of URLs, and I want to save all of the tables into a single CSV file.
I am new to Python and a relative beginner from a non-CS background, but I am very eager to learn.
import pandas as pd
import urllib.request
import bs4 as bs

urls = ['A', 'B', 'C', 'D', ...'Z']

for url in urls:
    source = urllib.request.urlopen(url).read()
    soup = bs.BeautifulSoup(source, 'lxml')
    table = soup.find('table', class_='tbldata14 bdrtpg')
    table_rows = table.find_all('tr')
    data = []
    for tr in table_rows:
        td = tr.find_all('td')
        row = [tr.text for tr in td]
        data.append(row)

final_table = pd.DataFrame(data, columns=["ABC", "XYZ", ...])
final_table.to_csv(r'F:\Projects\McData.csv', index=False, header=True)
What I get from the above code in the newly created csv file is:
ABC XYZ PQR MNL CYP ZXS
1 2 3 4 5 6
My code above only gets the table from the last url, 'Z', which, as I have checked, is actually the table from the last url in the list.
What I am trying to achieve here is getting all the tables from the list of urls, i.e. A to Z, into a single csv file.
This is an issue with indentation and order. data gets reset every time through the for url in urls loop, so you only end up with the last URL's worth of data. If you want all of the URLs' worth of data in one final CSV, see the changes I made below.
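To see why the placement of data = [] matters, here is a minimal, self-contained sketch (no network access; the row values are made up for illustration) contrasting the two placements:

```python
urls = ['A', 'B', 'C']

# Buggy placement: the list is re-created inside the loop,
# so only the last iteration's rows survive.
for url in urls:
    data = []
    data.append([url, 1])
print(data)  # only the row from 'C'

# Fixed placement: the list is created once, before the loop,
# so every iteration's rows accumulate.
data = []
for url in urls:
    data.append([url, 1])
print(data)  # rows from 'A', 'B' and 'C'
```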
import pandas as pd
import urllib.request
import bs4 as bs

urls = ['A', 'B', 'C', 'D', ...'Z']

data = []  # moved to the start, before the URL loop
for url in urls:
    source = urllib.request.urlopen(url).read()
    soup = bs.BeautifulSoup(source, 'lxml')
    table = soup.find('table', class_='tbldata14 bdrtpg')
    table_rows = table.find_all('tr')
    # indented the following loop so it runs for every URL's data
    for tr in table_rows:
        td = tr.find_all('td')
        row = [cell.text for cell in td]  # renamed loop variable so it no longer shadows tr
        data.append(row)

final_table = pd.DataFrame(data, columns=["ABC", "XYZ", ...])
final_table.to_csv(r'F:\Projects\McData.csv', index=False, header=True)
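As a side note, pandas can parse HTML tables directly with pd.read_html, which can replace the manual row loop entirely. A sketch of the idea, using a small hypothetical HTML string in place of the downloaded page source (with real pages you would pass each fetched source, or the URL itself, to pd.read_html):

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML standing in for one fetched page.
html = """
<table class="tbldata14 bdrtpg">
  <tr><th>ABC</th><th>XYZ</th></tr>
  <tr><td>1</td><td>2</td></tr>
</table>
"""

pages = [html, html]  # stand-in for the page sources fetched in the URL loop

# read_html returns a list of DataFrames, one per matching table;
# attrs restricts the match to tables with the given class attribute.
frames = [pd.read_html(StringIO(page), attrs={"class": "tbldata14 bdrtpg"})[0]
          for page in pages]
final_table = pd.concat(frames, ignore_index=True)
print(final_table)
```

This stacks one DataFrame per page into a single table with pd.concat, so the per-row loop and the manual column list are no longer needed.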