
Python: Multiple dataframes from multiple CSV, encoding cp1252 to utf8

I have a zip file of several CSV documents, which I have extracted into a folder called "staging". These documents are encoded in Windows CP1252. I would like to read each CSV file in as a separate dataframe, remove all of the null values, and then overwrite the old files with UTF-8 encoding. Alternatively, instead of rewriting the CSVs as UTF-8, I could encode the database directly from the pandas dataframes that are produced.

Any help would be greatly appreciated. I have browsed the Stack Overflow forums, and the main topic seems to be concatenating multiple CSVs into a single dataframe; what I need is a separate dataframe for each CSV. I also have to remove N/A values, but in the CSVs they have random numbers attached to them (e.g. N/A (3) or N/A(1)).

Here is the code I am working with:

import os

# Create the staging directory (no error if it already exists)
staging_dir = "staging"
os.makedirs(staging_dir, exist_ok=True)

# Confirm the staging directory path
os.path.isdir(staging_dir)

# Machine independent path to create files
zip_file = os.path.join(staging_dir, "Hospital_Revised_Flatfiles.zip")

# Write the files to the computer
# (r is assumed to be the response from an earlier download step that is not shown)
with open(zip_file, "wb") as zf:
    zf.write(r.content)

# Program to unzip the files
import zipfile

with zipfile.ZipFile(zip_file, "r") as z:
    z.extractall(staging_dir)

# Create the dataframes

import glob
import pandas as pd

# OS independent pattern for the extracted CSV files
files = glob.glob(os.path.join("staging", "*.csv"))

# Read each file (dfs is reassigned on every pass, so only the last dataframe is kept)
for file in files:
    dfs = pd.read_csv(file, header=0, encoding='cp1252')

Just add

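dfs.dropna().to_csv(file, encoding='utf-8', index=False)

to your last loop. It drops every row that contains a null value and then overwrites the old file with the UTF-8 version of the dataframe. Passing index=False keeps pandas from writing the row index as an extra column.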

Also remove the first opening bracket in your last line: you open two but only close one. That's where the EOF error is coming from.
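
One caveat with dropna(): it only removes values that pandas already treats as missing. The question mentions placeholders such as N/A (3) or N/A(1), which pandas would read as ordinary strings. Here is a minimal sketch of one way to handle them, assuming every placeholder is the text N/A optionally followed by a number in parentheses. The regular expression and the helper name drop_na_placeholders are illustrative and not from the original post.

```python
import numpy as np
import pandas as pd

# Assumed placeholder shape: "N/A", optionally followed by "(digits)",
# with optional surrounding whitespace, e.g. "N/A (3)" or "N/A(1)".
NA_PATTERN = r"^\s*N/A\s*(?:\(\d+\))?\s*$"


def drop_na_placeholders(df):
    """Turn N/A-style placeholder strings into NaN, then drop those rows."""
    cleaned = df.replace(to_replace=NA_PATTERN, value=np.nan, regex=True)
    return cleaned.dropna()


# Small made-up frame to show the effect
example = pd.DataFrame({
    "hospital": ["A", "B", "C"],
    "score": ["12", "N/A (3)", "N/A(1)"],
})
cleaned_example = drop_na_placeholders(example)
```

Inside the loop this would probably look like drop_na_placeholders(dfs).to_csv(file, encoding='utf-8', index=False). If the real files use other placeholder spellings, the pattern would need adjusting.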

I believe P.Tillmann's solution should have worked. Alternatively, you can load all of your dataframes into a dictionary first and then write them back.

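files = glob.glob(os.path.join("staging", "*.csv"))

# One dataframe per CSV, keyed by file path
dict_ = {}
for file in files:
    dict_[file] = pd.read_csv(file, header=0, encoding='cp1252').dropna()

# Overwrite each file as UTF-8; index=False avoids writing an extra index column
for file in dict_:
    dict_[file].to_csv(file, encoding='utf-8', index=False)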
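
Putting the pieces together, here is a sketch of a small helper that returns one dataframe per CSV and rewrites each file as UTF-8. The function name recode_csv_folder, its parameters, and the choice to key the dictionary by file name without the extension are my own assumptions, not part of the original answers.

```python
import glob
import os

import pandas as pd


def recode_csv_folder(folder, src_encoding="cp1252", dst_encoding="utf-8"):
    """Read every CSV in folder, drop rows with nulls, rewrite in dst_encoding.

    Returns a dict mapping file name (without extension) to its dataframe.
    """
    frames = {}
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        name = os.path.splitext(os.path.basename(path))[0]
        df = pd.read_csv(path, header=0, encoding=src_encoding).dropna()
        # Overwrite the original file; index=False leaves the columns unchanged
        df.to_csv(path, encoding=dst_encoding, index=False)
        frames[name] = df
    return frames
```

Calling frames = recode_csv_folder("staging") should then give a separate dataframe for each file, looked up by its file name. Because the originals are overwritten in place, it may be worth keeping the zip file around as a backup.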
