
Python BeautifulSoup and Pandas extract table from list of urls and save all the tables into single dataframe or save as csv


I am trying to extract tabular data from a list of URLs, and I want to save all of the tables into a single CSV file.

I am new to Python and a relative beginner from a non-CS background, but I am very eager to learn.

import pandas as pd
import urllib.request
import bs4 as bs

urls = ['A', 'B','C','D',...'Z']

for url in urls:
    source = urllib.request.urlopen(url).read()
    soup = bs.BeautifulSoup(source,'lxml')
    table = soup.find('table', class_='tbldata14 bdrtpg')
    table_rows = table.find_all('tr')

data = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [tr.text for tr in td]
    data.append(row)

final_table = pd.DataFrame(data, columns=["ABC", "XYZ",...])
final_table.to_csv (r'F:\Projects\McData.csv', index = False, header=True)

What I get from the above code in the newly created CSV file is:

ABC XYZ PQR MNL CYP ZXS
1   2   3   4   5   6

My code only gets the table from the last URL, 'Z', which, as I have checked, is indeed the table from the last URL in the list.

What I am trying to achieve is getting the tables from all of the URLs in the list, i.e. A to Z, into a single CSV file.

This is an issue with indentation and order. `table_rows` gets reset on every pass through the `for url in urls` loop, and `data` is only built after the loop finishes, so you only end up with the last URL's worth of data. If you want the data from all of the URLs in one final CSV, see the changes I made below.

import pandas as pd
import urllib.request
import bs4 as bs

urls = ['A', 'B','C','D',...'Z']
data = [] # Moved to the start
for url in urls:
    source = urllib.request.urlopen(url).read()
    soup = bs.BeautifulSoup(source,'lxml')
    table = soup.find('table', class_='tbldata14 bdrtpg')
    table_rows = table.find_all('tr')

    # Indented the following loop so it runs for every URL's data
    for tr in table_rows:
        td = tr.find_all('td')
        row = [tr.text for tr in td]
        data.append(row)

final_table = pd.DataFrame(data, columns=["ABC", "XYZ",...])
final_table.to_csv(r'F:\Projects\McData.csv', index=False, header=True)
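The difference between the two versions can be seen without any scraping at all. Below is a minimal, self-contained sketch of the same pattern, using an in-memory dict of fake "pages" in place of the real URLs and BeautifulSoup parsing (the page contents here are made up purely for illustration):

```python
# Simulated row data for three "URLs"; in the real script each list
# would come from table.find_all('tr') on the fetched page.
pages = {
    'A': [[1, 2], [3, 4]],
    'B': [[5, 6]],
    'C': [[7, 8]],
}

# Buggy pattern: table_rows is rebound on every iteration, and the
# rows are collected only after the loop, so only the last page survives.
for url in ['A', 'B', 'C']:
    table_rows = pages[url]
data_buggy = [row for row in table_rows]

# Fixed pattern: create the accumulator once, before the loop, and
# append inside it, so every page's rows are kept.
data_fixed = []
for url in ['A', 'B', 'C']:
    table_rows = pages[url]
    for row in table_rows:
        data_fixed.append(row)

print(data_buggy)   # [[7, 8]]
print(data_fixed)   # [[1, 2], [3, 4], [5, 6], [7, 8]]
```

The same logic applies whether the rows come from a dict, a parsed HTML table, or anything else: the accumulator must outlive the loop, and the appending must happen inside it.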

