
Read multiple csv files zipped in one file

I have several csv files inside several zip files in one folder, for example:

  • A.zip (contains csv1, csv2, csv3)
  • B.zip (contains csv4, csv5, csv6)

These are in the folder C:/Folder/. When I load normal csv files from a folder, I use the following code:

import glob
import pandas as pd
files = glob.glob("C:/Folder/*.csv")
dfs = [pd.read_csv(f, header=None, sep=";") for f in files]

df = pd.concat(dfs,ignore_index=True)

I then followed this post: Reading csv zipped files in python

Reading one csv from a zip works like this:

import pandas as pd
import zipfile

zf = zipfile.ZipFile('C:/Users/Desktop/THEZIPFILE.zip') 
df = pd.read_csv(zf.open('intfile.csv'))

Any idea how to optimize this loop for me?

Use the ZipFile object's namelist() method to get the list of files inside the zip.

Ex:

import glob
import zipfile
import pandas as pd

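for zip_file in glob.glob("C:/Folder/*.zip"):
    # Open each archive and read every member directly, without extracting
    with zipfile.ZipFile(zip_file) as zf:
        dfs = [pd.read_csv(zf.open(f), header=None, sep=";") for f in zf.namelist()]
    # One combined DataFrame per archive
    df = pd.concat(dfs,ignore_index=True)
    print(df)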
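
The loop above produces one DataFrame per archive. If a single DataFrame across all archives is wanted, as in the question's original glob approach, something like the following might work. This is only a sketch: the function name `read_zipped_csvs` and its `folder` and `sep` parameters are made up for illustration, and it assumes every csv member has the same column layout and no header row.

```python
import glob
import os
import zipfile

import pandas as pd


def read_zipped_csvs(folder, sep=";"):
    """Read every csv member of every zip in `folder` into one DataFrame."""
    frames = []
    # sorted() makes the row order reproducible; glob order is not guaranteed
    for zip_path in sorted(glob.glob(os.path.join(folder, "*.zip"))):
        with zipfile.ZipFile(zip_path) as zf:
            for name in zf.namelist():
                # Skip directory entries and anything that is not a csv
                if not name.lower().endswith(".csv"):
                    continue
                with zf.open(name) as member:
                    frames.append(pd.read_csv(member, header=None, sep=sep))
    if not frames:
        # pd.concat raises on an empty list, so return an empty frame instead
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

Calling `read_zipped_csvs("C:/Folder")` would then take the place of the glob and concat lines in the question. Filtering on the `.csv` suffix keeps stray members such as readme files from reaching `pd.read_csv`.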

I would try to tackle it in two passes. First pass: extract the contents of each zip file onto the filesystem. Second pass: read all the extracted CSVs using the method you already have above:

import glob
import pandas as pd
import zipfile

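def extract_files(file_path):
  # extractall() returns None, so return the names of the extracted CSVs instead
  with zipfile.ZipFile(file_path, 'r') as archive:
    archive.extractall()  # extracts into the current working directory
    return [name for name in archive.namelist() if name.lower().endswith('.csv')]

zipped_files = glob.glob("C:/Folder/*.zip")
# Flatten the per-archive lists into one list of CSV paths
file_paths = [path for zf in zipped_files for path in extract_files(zf)]

dfs = [pd.read_csv(f, header=None, sep=";") for f in file_paths]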
df = pd.concat(dfs,ignore_index=True)
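
One thing to watch with the two-pass approach: calling extractall() without a target writes into the current working directory, so two archives that each contain a file with the same name would overwrite each other. A possible way around that is to extract each archive into its own subfolder. The sketch below is only an illustration; `extract_csvs` and `dest_dir` are hypothetical names, and it assumes the zip files have distinct base names (as A.zip and B.zip in one folder do).

```python
import os
import zipfile


def extract_csvs(zip_paths, dest_dir):
    """Extract the csv members of each archive into its own subfolder of dest_dir."""
    csv_paths = []
    for zip_path in zip_paths:
        # One subfolder per archive, named after the zip file ("A" for A.zip)
        stem = os.path.splitext(os.path.basename(zip_path))[0]
        target = os.path.join(dest_dir, stem)
        with zipfile.ZipFile(zip_path) as archive:
            for name in archive.namelist():
                if name.lower().endswith(".csv"):
                    # extract() creates missing folders and returns the written path
                    csv_paths.append(archive.extract(name, target))
    return csv_paths
```

The returned paths can be passed to pd.read_csv one by one, exactly like `file_paths` in the second pass above.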
