Python BeautifulSoup and Pandas extract table from list of urls and save all the tables into single dataframe or save as csv
I am trying to extract tabular data from a list of URLs, and I want to save all of the tables into a single CSV file.
I am new to Python and a relative beginner from a non-CS background, but I am very eager to learn.
import pandas as pd
import urllib.request
import bs4 as bs

urls = ['A', 'B', 'C', 'D', ...'Z']

for url in urls:
    source = urllib.request.urlopen(url).read()
    soup = bs.BeautifulSoup(source, 'lxml')
    table = soup.find('table', class_='tbldata14 bdrtpg')
    table_rows = table.find_all('tr')
    data = []
    for tr in table_rows:
        td = tr.find_all('td')
        row = [tr.text for tr in td]
        data.append(row)

final_table = pd.DataFrame(data, columns=["ABC", "XYZ", ...])
final_table.to_csv(r'F:\Projects\McData.csv', index=False, header=True)
What I get from the above code in the newly created csv file is:
ABC XYZ PQR MNL CYP ZXS
1 2 3 4 5 6
My code above only gets the table from the last url, 'Z', which, as I have checked, is actually the table from the last url in the list.
What I am trying to achieve here is getting all the tables from the list of urls, i.e. A to Z, into a single csv file.
This is an issue with indentation and order. data gets reset every time through the for url in urls loop, so you only end up with the last URL's worth of data. If you want all of the URLs' worth of data in one final CSV, see the changes I made below.
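To see why the placement of data = [] matters, here is a minimal, self-contained sketch (no network access; the row values are made up for illustration) contrasting the two placements:

```python
urls = ['A', 'B', 'C']

# Buggy placement: the list is re-created inside the loop,
# so only the last iteration's rows survive.
for url in urls:
    data = []
    data.append([url, 1])
print(data)  # only the row from 'C'

# Fixed placement: the list is created once, before the loop,
# so every iteration's rows accumulate.
data = []
for url in urls:
    data.append([url, 1])
print(data)  # rows from 'A', 'B' and 'C'
```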
import pandas as pd
import urllib.request
import bs4 as bs

urls = ['A', 'B', 'C', 'D', ...'Z']

data = []  # moved to the start, before the URL loop
for url in urls:
    source = urllib.request.urlopen(url).read()
    soup = bs.BeautifulSoup(source, 'lxml')
    table = soup.find('table', class_='tbldata14 bdrtpg')
    table_rows = table.find_all('tr')
    # indented the following loop so it runs for every URL's data
    for tr in table_rows:
        td = tr.find_all('td')
        row = [cell.text for cell in td]  # renamed loop variable so it no longer shadows tr
        data.append(row)

final_table = pd.DataFrame(data, columns=["ABC", "XYZ", ...])
final_table.to_csv(r'F:\Projects\McData.csv', index=False, header=True)
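As a side note, pandas can parse HTML tables directly with pd.read_html, which can replace the manual row loop entirely. A sketch of the idea, using a small hypothetical HTML string in place of the downloaded page source (with real pages you would pass each fetched source, or the URL itself, to pd.read_html):

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML standing in for one fetched page.
html = """
<table class="tbldata14 bdrtpg">
  <tr><th>ABC</th><th>XYZ</th></tr>
  <tr><td>1</td><td>2</td></tr>
</table>
"""

pages = [html, html]  # stand-in for the page sources fetched in the URL loop

# read_html returns a list of DataFrames, one per matching table;
# attrs restricts the match to tables with the given class attribute.
frames = [pd.read_html(StringIO(page), attrs={"class": "tbldata14 bdrtpg"})[0]
          for page in pages]
final_table = pd.concat(frames, ignore_index=True)
print(final_table)
```

This stacks one DataFrame per page into a single table with pd.concat, so the per-row loop and the manual column list are no longer needed.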