
Read multiple large compressed (ZIP format) CSV files using Python and extract data

I downloaded world trade (exports and imports) data from a trade database, by country and by year, as ZIP files covering 1989 to 2020. Each ZIP file holds one year of data and unzips into an extremely large CSV file (5 GB+). I would like to read all the ZIP files and extract certain information for a given country. However, the code I have below takes an extremely long time to run: for a single year it works without problems, but when combining all years it is painfully slow. At the end of the process I would like to merge all the dataframes into a single consolidated one. Any ideas on how to speed up processing and how to loop through all 32 dataframes to merge them would be greatly appreciated.

Sample CSV file data for 2008: https://i.stack.imgur.com/V9jDd.png


reporter='Egypt'
import csv
import pandas as pd

filename="All 2020.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df1 = pd.concat(tp, ignore_index=True)
rows = df1[df1['Reporter'] == reporter]
#rows.to_csv('filename.csv')

filename="All 2019.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df2 = pd.concat(tp, ignore_index=True)
rows = df2[df2['Reporter'] == reporter]
#rows.to_csv('filename.csv')

filename="All 2018.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df3 = pd.concat(tp, ignore_index=True)    
rows = df3[df3['Reporter'] == reporter]


filename="All 2017.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df4 = pd.concat(tp, ignore_index=True)    
rows = df4[df4['Reporter'] == reporter]

filename="All 2016.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df5 = pd.concat(tp, ignore_index=True)    
rows = df5[df5['Reporter'] == reporter]


filename="All 2015.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df6 = pd.concat(tp, ignore_index=True)    
rows = df6[df6['Reporter'] == reporter]

filename="All 2014.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df7 = pd.concat(tp, ignore_index=True)    
rows = df7[df7['Reporter'] == reporter]

filename="All 2013.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df8 = pd.concat(tp, ignore_index=True)    
rows = df8[df8['Reporter'] == reporter]


filename="All 2012.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df9 = pd.concat(tp, ignore_index=True)    
rows = df9[df9['Reporter'] == reporter]


filename="All 2011.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df10 = pd.concat(tp, ignore_index=True)    
rows = df10[df10['Reporter'] == reporter]

filename="All 2010.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df11 = pd.concat(tp, ignore_index=True)    
rows = df11[df11['Reporter'] == reporter]

filename="All 2009.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df12 = pd.concat(tp, ignore_index=True)    
rows = df12[df12['Reporter'] == reporter]

filename="All 2008.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df13 = pd.concat(tp, ignore_index=True)    
rows = df13[df13['Reporter'] == reporter]


filename="All 2007.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df14 = pd.concat(tp, ignore_index=True)    
rows = df14[df14['Reporter'] == reporter]

filename="All 2006.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df15 = pd.concat(tp, ignore_index=True)    
rows = df15[df15['Reporter'] == reporter]


filename="All 2005.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df16 = pd.concat(tp, ignore_index=True)    
rows = df16[df16['Reporter'] == reporter]


filename="All 2004.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df17 = pd.concat(tp, ignore_index=True)    
rows = df17[df17['Reporter'] == reporter]


filename="All 2003.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df18 = pd.concat(tp, ignore_index=True)    
rows = df18[df18['Reporter'] == reporter]

filename="All 2002.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df19 = pd.concat(tp, ignore_index=True)    
rows = df19[df19['Reporter'] == reporter]


filename="All 2001.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df20 = pd.concat(tp, ignore_index=True)    
rows = df20[df20['Reporter'] == reporter]

filename="All 2000.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df21 = pd.concat(tp, ignore_index=True)    
rows = df21[df21['Reporter'] == reporter]

filename="All 1999.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df22 = pd.concat(tp, ignore_index=True)    
rows = df22[df22['Reporter'] == reporter]

filename="All 1998.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df23 = pd.concat(tp, ignore_index=True)    
rows = df23[df23['Reporter'] == reporter]

filename="All 1997.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df24 = pd.concat(tp, ignore_index=True)    
rows = df24[df24['Reporter'] == reporter]

filename="All 1996.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df25 = pd.concat(tp, ignore_index=True)    
rows = df25[df25['Reporter'] == reporter]


filename="All 1995.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df26 = pd.concat(tp, ignore_index=True)    
rows = df26[df26['Reporter'] == reporter]

filename="All 1994.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df27 = pd.concat(tp, ignore_index=True)    
rows = df27[df27['Reporter'] == reporter]

filename="All 1993.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df28 = pd.concat(tp, ignore_index=True)    
rows = df28[df28['Reporter'] == reporter]

filename="All 1992.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df29 = pd.concat(tp, ignore_index=True)    
rows = df29[df29['Reporter'] == reporter]


filename="All 1991.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df30 = pd.concat(tp, ignore_index=True)    
rows = df30[df30['Reporter'] == reporter]

filename="All 1990.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df31 = pd.concat(tp, ignore_index=True)    
rows = df31[df31['Reporter'] == reporter]


filename="All 1989.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df32 = pd.concat(tp, ignore_index=True)    
rows = df32[df32['Reporter'] == reporter]

A faster approach is to loop over every ZIP file with glob and filter each chunk as it is read, so that only the chosen reporter's rows are ever kept in memory, writing one filtered CSV per year:

import gc
from glob import glob
import pandas as pd

reporter = 'Egypt'

for file in glob('*.zip'):  # loop over all ZIP files in the current directory
    data = []
    for chunk in pd.read_csv(file, chunksize=10000):
        # keep only this reporter's rows from the chunk before storing it
        chunk = chunk[chunk['Reporter'] == reporter]
        data.append(chunk)
    df = pd.concat(data)
    df.to_csv(file.split('.')[-2] + '.csv')  # e.g. "All 2008.zip" -> "All 2008.csv"
    gc.collect()  # free memory between files; may or may not help
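
To get the single consolidated dataframe the question asks for, the per-file results can be collected in one list across all years and concatenated once at the end, instead of (or in addition to) writing a CSV per year. Below is a minimal sketch along those lines; it assumes the archives are named like "All 1989.zip" through "All 2020.zip" and all share the same columns, and the added Year column and the output file name are only illustrative choices, not part of the original data.

import re
from glob import glob
import pandas as pd

reporter = 'Egypt'
frames = []

for file in sorted(glob('All *.zip')):
    # pandas infers the compression from the .zip extension, so the archives
    # never need to be unpacked by hand (each must contain a single CSV)
    for chunk in pd.read_csv(file, chunksize=100_000):
        filtered = chunk[chunk['Reporter'] == reporter]
        if not filtered.empty:
            # tag each slice with the year taken from the file name
            # ("All 2008.zip" -> 2008); drop this if the data already has a year column
            year = int(re.search(r'\d{4}', file).group())
            frames.append(filtered.assign(Year=year))

# a single concat at the very end is far cheaper than growing a dataframe inside the loop
consolidated = pd.concat(frames, ignore_index=True)
consolidated.to_csv(f'{reporter}_1989_2020.csv', index=False)

If only a few columns are actually needed, passing usecols to read_csv cuts both parsing time and memory further.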
