
Extracting csv files from multiple zipped files in python

I have written a script that extracts text from multiple csv files. Can someone help me extend it so that it reads csv data from different zipped files and creates multiple csv files (one for each zipped file) at one location? For example, if I have 10 csv files in zipped folder z1 and 5 in zipped folder z2, I want to extract the files from each zip and get the results in one location. In this case that would be z1.csv (with the concatenated data from the 10 csv files) and z2.csv (with the concatenated data from the 5 csv files). I am using the following script:

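import glob
import os

import pandas as pd

# input_fldr is the folder that holds the csv files (defined elsewhere)
allfiles = glob.glob(os.path.join(input_fldr, "*.csv"))
a = []  # file name for each matching line
b = []  # the matching line itself
for file_ in allfiles:
    dirname, filename = os.path.split(file_)
    with open(file_, 'r', encoding='UTF-8') as f:
        lines = f.readlines()
    for line in lines:
        # keep only the lines that start with 'Hello'
        if line.startswith('Hello'):
            a.append(filename)
            b.append(line)
df_a = pd.DataFrame(a, columns=list("A"))
df_b = pd.DataFrame(b, columns=list("B"))
df = pd.concat([df_a, df_b], axis=1)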

The Code

The code I came up with, which does roughly what I believe you want, is this (all the files you need for this example are available here):

import zipfile
import pandas as pd

virtual_csvs = []

with zipfile.ZipFile("test3.zip", "r") as f:
    for name in f.namelist():
        if name.endswith(".csv"):
            data = f.open(name)
            virtual_csvs.append(pd.read_csv(data, header=None))

pd.concat(virtual_csvs, axis=1).to_csv('output.csv', header=False, index=False)

Code Breakdown

virtual_csvs = []

We start by creating a list that will store all of the pandas DataFrames, much like your list [df_a, df_b].

with zipfile.ZipFile("test3.zip", "r") as f:

This opens the zip file "test3.zip" (replace it with your zip file's name) in read mode and binds it to the variable f.

for name in f.namelist():

This iterates over every member name in the zip file, assigning each one in turn to the variable name.

if name.endswith(".csv"):

This line is rather self-explanatory - if the file has an extension of .csv , run the following code.
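As a small illustration of the two steps above, here is a sketch that builds a throwaway archive in memory and lists its members. The member names are made up for the example. It also shows one thing to watch for: endswith(".csv") is case-sensitive, so a member called UPPER.CSV would be skipped unless the name is lowercased first.

```python
import io
import zipfile

# Build a small zip archive in memory so the example needs no files on disk.
# All member names here are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("a.csv", "1,2\n")
    zf.writestr("nested/b.csv", "3,4\n")
    zf.writestr("notes.txt", "not a csv\n")
    zf.writestr("UPPER.CSV", "5,6\n")

buffer.seek(0)
with zipfile.ZipFile(buffer, "r") as zf:
    # namelist() returns every member, including ones inside sub-folders
    all_names = zf.namelist()
    # Case-sensitive check, as used in the answer
    strict = [n for n in all_names if n.endswith(".csv")]
    # Case-insensitive variant
    relaxed = [n for n in all_names if n.lower().endswith(".csv")]

print(strict)
print(relaxed)
```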

data = f.open(name)

The f.open(name) call opens the archive member name and returns a binary file-like object; the closest equivalent for a file on disk would be with open(name, 'rb') as data.
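Because the object returned by f.open(name) yields bytes, a line filter like the startswith('Hello') check in the question needs the data decoded first. One way to do that, sketched here under the assumption that the members are UTF-8 text, is to wrap the member in io.TextIOWrapper. The archive and its contents are invented for the example.

```python
import io
import zipfile

# Hypothetical archive with a single csv member, built in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("greetings.csv", "Hello,world\nBye,world\nHello,again\n")

buffer.seek(0)
matches = []
with zipfile.ZipFile(buffer, "r") as zf:
    # Reading directly from the member gives bytes, not str
    with zf.open("greetings.csv") as raw:
        first_line = raw.readline()

    # Wrapping the member decodes it, so str methods work on each line
    with zf.open("greetings.csv") as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8")
        for line in text:
            if line.startswith("Hello"):
                matches.append(line.rstrip("\n"))

print(type(first_line))
print(matches)
```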

virtual_csvs.append(pd.read_csv(data, header=None))

pd.read_csv(data, header=None) loads that file into a pandas DataFrame (header=None tells pandas that the file has no header row, so the first line is read as data rather than as column names)

virtual_csvs.append adds the DataFrame to the virtual_csvs list
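To see what header=None changes, this short sketch reads the same two-line csv text twice. The sample data is made up, and io.StringIO stands in for a real file.

```python
import io

import pandas as pd

csv_text = "1,2\n3,4\n"  # made-up sample with no header row

# header=None: both lines are data, columns are numbered 0, 1, ...
with_header_none = pd.read_csv(io.StringIO(csv_text), header=None)

# Default behaviour: the first line is consumed as the column names
default_header = pd.read_csv(io.StringIO(csv_text))

print(with_header_none.shape, list(with_header_none.columns))
print(default_header.shape, list(default_header.columns))
```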

The final line of this code:

pd.concat(virtual_csvs, axis=1).to_csv('output.csv', header=False, index=False)

concatenates all of the csv data into one larger file ('output.csv'). pd.concat(virtual_csvs, axis=1) joins all of the DataFrames in virtual_csvs by column, placing them side by side (this returns a pd.DataFrame)
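The choice of axis matters for the shape of the result. The sketch below, using two small made-up DataFrames, compares axis=1 (side by side, as in the answer) with axis=0 (rows stacked). If the csv files share the same columns and you want their rows appended one after another, axis=0 may be closer to what you need.

```python
import pandas as pd

# Two made-up frames with different numbers of rows
first = pd.DataFrame([[1, 2], [3, 4]])
second = pd.DataFrame([[5, 6], [7, 8], [9, 10]])

# axis=1 places the frames next to each other; shorter frames are padded with NaN
side_by_side = pd.concat([first, second], axis=1)

# axis=0 stacks the rows; ignore_index=True renumbers the rows from 0
stacked = pd.concat([first, second], axis=0, ignore_index=True)

print(side_by_side.shape)
print(stacked.shape)
```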

to_csv('output.csv', header=False, index=False) means to convert the given DataFrame to a csv file, named 'output.csv'.

header=False leaves the column names out of the output file

index=False leaves the row index (row numbers) out of the output file
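To get one output csv per zip file, as the question asks, the same steps can be wrapped in a loop over the archives. The following is only a sketch under a few assumptions: the function name zips_to_csvs and its parameters are hypothetical, all zip files sit directly in input_dir, and the rows are stacked with axis=0 (swap in axis=1 if you want the side-by-side layout described above).

```python
import glob
import os
import zipfile

import pandas as pd


def zips_to_csvs(input_dir, output_dir):
    """Write one combined csv per zip archive found in input_dir.

    Hypothetical helper: z1.zip becomes z1.csv, z2.zip becomes z2.csv, etc.
    Returns the list of csv paths that were written.
    """
    os.makedirs(output_dir, exist_ok=True)
    written = []
    for zip_path in sorted(glob.glob(os.path.join(input_dir, "*.zip"))):
        frames = []
        with zipfile.ZipFile(zip_path, "r") as zf:
            for name in zf.namelist():
                if name.endswith(".csv"):
                    with zf.open(name) as data:
                        frames.append(pd.read_csv(data, header=None))
        if not frames:
            continue  # this archive has no csv members, so nothing to write
        # Name the output after the archive: /path/z1.zip -> z1.csv
        stem = os.path.splitext(os.path.basename(zip_path))[0]
        out_path = os.path.join(output_dir, stem + ".csv")
        # Assumption: rows are stacked (axis=0) rather than joined by column
        combined = pd.concat(frames, axis=0, ignore_index=True)
        combined.to_csv(out_path, header=False, index=False)
        written.append(out_path)
    return written
```

Calling zips_to_csvs("zips", "combined") would then read every zip file in a folder named zips and write the combined csv files into a folder named combined; both folder names are placeholders.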

