I have written a script which extracts text from multiple CSVs. Can someone help me extend it so that it reads CSV data from several zipped files and creates one CSV per zipped file at a given location? For example, if I have 10 CSVs in zipped folder z1 and 5 in zipped folder z2, I want to extract the files from each zipped file and write the results to one location. In this case that would be z1.csv (with the concatenated data from the 10 CSVs) and z2.csv (with the concatenated data from the 5 CSVs). I am using the following script:
import glob
import os
import pandas as pd

# input_fldr is the folder that holds the csv files
allfiles = glob.glob(os.path.join(input_fldr, "*.csv"))
a = []  # file name for each matching line
b = []  # the matching lines themselves
for file_ in allfiles:
    dirname, filename = os.path.split(file_)
    f = open(file_, 'r', encoding='UTF-8')
    lines = f.readlines()
    f.close()
    for line in lines:
        if line.startswith('Hello'):
            a.append(filename)
            b.append(line)
df_a = pd.DataFrame(a, columns=list("A"))
df_b = pd.DataFrame(b, columns=list("B"))
df = pd.concat([df_a, df_b], axis=1)
The code I came up with, which does roughly what I believe you want, is this (all the files you need for this example are available here):
import zipfile
import pandas as pd

virtual_csvs = []
with zipfile.ZipFile("test3.zip", "r") as f:
    for name in f.namelist():
        if name.endswith(".csv"):
            data = f.open(name)
            virtual_csvs.append(pd.read_csv(data, header=None))
pd.concat(virtual_csvs, axis=1).to_csv('output.csv', header=False, index=False)
virtual_csvs = []
We start by creating a list that will store all of the pandas DataFrames, much like your list [df_a, df_b].
with zipfile.ZipFile("test3.zip", "r") as f:
This opens the zip file "test3.zip" (replace this with your own zip file name) in read mode and binds it to the variable f.
for name in f.namelist():
This iterates over every file name in the zip file, binding each one in turn to the variable name.
if name.endswith(".csv"):
This line is rather self-explanatory: if the file has an extension of .csv, run the following code.
data = f.open(name)
The f.open(name) call opens the archive member called name and returns a binary file-like object. For a file on disk, the rough equivalent would be with open(name, 'rb') as data.
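Since your original script filters lines that start with 'Hello', here is a minimal sketch of doing the same thing directly on the members of a zip file. It assumes the members are UTF-8 text; the function name hello_lines_from_zip and its parameters are made up for this illustration and are not part of the code above.

```python
import io
import zipfile


def hello_lines_from_zip(zip_path, prefix="Hello", encoding="utf-8"):
    """Return (member_name, line) pairs for lines that start with `prefix`."""
    matches = []
    with zipfile.ZipFile(zip_path, "r") as zf:
        for name in zf.namelist():
            if not name.endswith(".csv"):
                continue
            # ZipFile.open yields bytes, so wrap it to decode each line to str.
            with zf.open(name) as raw:
                for line in io.TextIOWrapper(raw, encoding=encoding):
                    if line.startswith(prefix):
                        matches.append((name, line.rstrip("\n")))
    return matches
```

The two lists a and b from your script could then be built from the first and second element of each returned pair.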
virtual_csvs.append(pd.read_csv(data, header=None))
pd.read_csv(data, header=None) loads that file into a pandas DataFrame. header=None tells pandas that the file has no header row, so the first line is read as data rather than as column names. virtual_csvs.append(...) then adds the DataFrame to the virtual_csvs list.
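To see what header=None changes, here is a small sketch on made-up in-memory data (the two-line string is only an illustration, not one of the example files):

```python
import io

import pandas as pd

raw = "1,2\n3,4\n"

# Default behaviour: the first line is consumed as column names.
with_header = pd.read_csv(io.StringIO(raw))

# header=None: every line is kept as data and columns are numbered 0, 1, ...
no_header = pd.read_csv(io.StringIO(raw), header=None)

print(with_header.shape)  # (1, 2)
print(no_header.shape)    # (2, 2)
```

If your csv files do have a header row, dropping header=None would keep the column names instead.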
The final line of this code:
pd.concat(virtual_csvs, axis=1).to_csv('output.csv', header=False, index=False)
concatenates all of the csv files into one larger file ('output.csv'). pd.concat(virtual_csvs, axis=1) joins all the DataFrames in virtual_csvs by column, placing them side by side, and returns a single pd.DataFrame. Use axis=0 instead if the rows of each csv should be stacked one after another.
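The effect of the axis argument is easiest to see on two tiny frames. This is only an illustration with made-up values:

```python
import pandas as pd

first = pd.DataFrame([[1, 2], [3, 4]])
second = pd.DataFrame([[5, 6], [7, 8]])

# axis=1 places the frames side by side: 2 rows, 4 columns.
side_by_side = pd.concat([first, second], axis=1)

# axis=0 stacks the frames on top of each other: 4 rows, 2 columns.
stacked = pd.concat([first, second], axis=0, ignore_index=True)

print(side_by_side.shape)  # (2, 4)
print(stacked.shape)       # (4, 2)
```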
to_csv('output.csv', header=False, index=False) writes the resulting DataFrame to a csv file named 'output.csv'. header=False leaves the column names out of the file, and index=False leaves out the row index.
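The code above handles a single zip file. To get one output csv per zip file (z1.csv, z2.csv, and so on), the same steps can be wrapped in a loop over the archives. The following is a sketch under a few assumptions: the function name zips_to_csvs and both folder arguments are hypothetical, the zip files all sit in one folder, and the rows are stacked with axis=0 because the question asks for concatenated data. Switch to axis=1 if you want the side-by-side layout used above.

```python
import zipfile
from pathlib import Path

import pandas as pd


def zips_to_csvs(zip_dir, out_dir):
    """Write one combined csv per zip archive found in `zip_dir`."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for zip_path in sorted(Path(zip_dir).glob("*.zip")):
        frames = []
        with zipfile.ZipFile(zip_path, "r") as zf:
            for name in zf.namelist():
                if name.endswith(".csv"):
                    with zf.open(name) as data:
                        frames.append(pd.read_csv(data, header=None))
        if not frames:
            continue  # skip archives that contain no csv members
        # z1.zip becomes z1.csv in the output folder.
        target = out_dir / f"{zip_path.stem}.csv"
        pd.concat(frames, axis=0, ignore_index=True).to_csv(
            target, header=False, index=False
        )
        written.append(target)
    return written
```

Calling zips_to_csvs("zipped", "combined") would then read every .zip in the folder zipped and write the combined files into the folder combined; both folder names are placeholders.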