
Python: loop through Excel sheets and write to CSV

I have a very large dataset (>100 GB). It consists of many Excel files (.xlsx). Each .xlsx file has many sheets. The data in each sheet is shown in the picture below.

[image: enter image description here]

I would like to combine these sheets into a single CSV file, and change the wide format to a long format so that:

  1. The first column contains the Excel file name,
  2. The second column contains the sheet's name,
  3. The third, fourth, and fifth columns are the "ticker", "Name", and "Detail Holding Type" columns from the picture above,
  4. The sixth column is the "date" (the dates are in the first row), and
  5. The final column contains the number.

What would be the most effective way to do this? I have the code to loop through files and sheets, but I cannot reshape the wide data into the long format that I am after. Below is my attempt at the loop:

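import csv
from os import listdir
from os.path import isfile, join

import xlrd  # note: xlrd 2.0+ reads only .xls; .xlsx needs xlrd < 2.0 or another reader

mypath = "E:/data_download/Python_test_files/"
# Collect the names of all regular files in the directory
file_lists = [f for f in listdir(mypath) if isfile(join(mypath, f))]

for file in file_lists:
    book = xlrd.open_workbook(join(mypath, file))
    sheet_names = book.sheet_names()
    print(sheet_names)
    for sheet in book.sheets():
        for row in sheet.get_rows():
            pass  # the attempt stops here: reshaping each row to long format is the missing step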

Take things step by step (and keep in mind that, for the process to be as fast as possible, you should use native Python as much as you can and reach for other libraries only when you absolutely have to). You want one CSV file out of all those sheets. First build a 2D list of all the rows from all the sheets, each row laid out the way you described, and then write that list to the CSV file with the DataFrame class from the pandas library:

import pandas as pd

my_list = [...]  # your 2D list containing the rows
# Column names go here, one per column of my_list
dataset = pd.DataFrame(my_list, columns=['column1', 'column2', '...'])
dataset.to_csv('/PATH/file.csv')
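
The step the question is stuck on, turning each wide row into several long rows, can be sketched with the standard library alone. This is only an illustration under stated assumptions: that the first row of every sheet holds the three identifier labels followed by the dates, and that blank cells should be skipped. The names `wide_to_long`, `write_long_csv`, `ID_COLUMNS`, the output column labels "file", "sheet", and "value", and the sample data are all made up for the example. It also writes rows out as they are produced instead of collecting one big 2D list first, which may matter with more than 100 GB of input.

```python
import csv
import io

# Assumption: "ticker", "Name", "Detail Holding Type" are the first three columns.
ID_COLUMNS = 3


def wide_to_long(file_name, sheet_name, rows, id_columns=ID_COLUMNS):
    """Yield one long-format row per (holding, date) pair.

    `rows` is any iterable of row-value lists whose first row is the header
    (identifier labels followed by the dates).
    """
    rows = iter(rows)
    header = next(rows, None)
    if header is None:  # empty sheet: nothing to yield
        return
    dates = header[id_columns:]
    for row in rows:
        ids = list(row[:id_columns])
        for date, value in zip(dates, row[id_columns:]):
            if value is None or value == "":  # assumption: blank cells are skipped
                continue
            yield [file_name, sheet_name, *ids, date, value]


def write_long_csv(out, sheets):
    """Write long rows for each (file_name, sheet_name, rows) triple to `out`."""
    writer = csv.writer(out)
    writer.writerow(["file", "sheet", "ticker", "Name",
                     "Detail Holding Type", "date", "value"])
    for file_name, sheet_name, rows in sheets:
        writer.writerows(wide_to_long(file_name, sheet_name, rows))


# Made-up sample data standing in for one worksheet.
sample = [
    ["ticker", "Name", "Detail Holding Type", "2020-01-31", "2020-02-29"],
    ["AAA", "Alpha Corp", "Equity", 10.5, 11.0],
    ["BBB", "Beta Inc", "Bond", None, 3.2],
]
buffer = io.StringIO()
write_long_csv(buffer, [("book1.xlsx", "Sheet1", sample)])
print(buffer.getvalue())
```

To plug this into the loop from the question, each sheet's rows would need to be plain values rather than cell objects, for example `([cell.value for cell in row] for row in sheet.get_rows())` with xlrd, or `ws.iter_rows(values_only=True)` if openpyxl is used to read the .xlsx files. For the real output, `out` would be a file opened with `newline=""` rather than an in-memory buffer.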
