
How to (1) convert 4,550 DBF files to CSV files, (2) concatenate files based on names, and (3) concatenate all CSVs into one big data CSV for analysis?

I have many DBF files (~4,550) spread across many folders and sub-directories (~400), separated by state. The data was given to me weekly as DBF files, separated by state.

Ex.

"Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\FL\DATA5393.DBF"
"Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\FL\DATA5414.DBF"
"Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\NJ\DATA890.DBF"
"Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\NJ\DATA1071.DBF"

"Datafiles\DAT_01_JUly_2019\DAT_01_JUlY_2019\FL\DATA5393.DBF"
"Datafiles\DAT_01_JUly_2019\DAT_01_JUlY_2019\FL\DATA5414.DBF"
"Datafiles\DAT_01_JUly_2019\DAT_01_JUlY_2019\NJ\DATA890.DBF"
"Datafiles\DAT_01_JUly_2019\DAT_01_JUlY_2019\NJ\DATA1071.DBF"

How would I convert and merge all the DBF files into one CSV per state, i.e. keeping the states separate (for regional data analysis)?

I'm currently using Python 3 and Jupyter notebooks on Windows 10.

This problem seems solvable with Python; I have experimented with dbf2csv and other DBF and CSV functions.

The code below shows some good starting points, gathered from many posts and my own experimentation. I'm still getting started with Python for file handling, and I'm not sure how to automate the tedious parts.

I typically use the functions below to convert to CSV, followed by a line in the command prompt to combine all the CSV files into one.

The function below converts a single DBF file to CSV:

import csv
from dbfread import DBF

def dbf_to_csv(dbf_table_pth):
    """Convert a DBF file to a CSV with the same name and path."""
    csv_fn = dbf_table_pth[:-4] + ".csv"  # swap the .dbf extension for .csv
    table = DBF(dbf_table_pth)            # dbfread DBF object
    with open(csv_fn, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(table.field_names)  # header row
        for record in table:                # data rows
            writer.writerow(list(record.values()))
    return csv_fn  # path of the new CSV
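
For example, assuming one of the files from the listing above exists on disk, calling it would look like this (just an illustration; the path is taken from the example):

csv_path = dbf_to_csv(r"Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\FL\DATA5393.DBF")
print(csv_path)  # Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\FL\DATA5393.csv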

The script below converts all DBF files in a given folder to CSV format. This works great, but it doesn't take sub-folders and sub-directories into consideration.

import fnmatch
import os
import csv
import time
import sys
from dbfread import DBF, FieldParser, InvalidValue
# pip install dbfread if needed

class MyFieldParser(FieldParser):
    def parse(self, field, data):
        try:
            return FieldParser.parse(self, field, data)
        except ValueError:
            # Keep going on malformed fields instead of aborting the file
            return InvalidValue(data)


debugmode = 0  # Set to 1 to print every invalid value that is caught.

for infile in os.listdir('.'):
    if fnmatch.fnmatch(infile, '*.dbf'):
        outfile = infile[:-4] + ".csv"
        print("Converting " + infile + " to " + outfile + ". Each period represents 2,000 records.")
        counter = 0
        starttime = time.perf_counter()  # time.clock() was removed in Python 3.8
        with open(outfile, 'w', newline='') as csvfile:  # newline='' avoids blank rows on Windows
            table = DBF(infile, parserclass=MyFieldParser, ignore_missing_memofile=True)
            writer = csv.writer(csvfile)
            writer.writerow(table.field_names)
            for i, record in enumerate(table):
                for name, value in record.items():
                    if isinstance(value, InvalidValue) and debugmode == 1:
                        print('records[{}][{!r}] == {!r}'.format(i, name, value))
                writer.writerow(list(record.values()))
                counter += 1
                if counter % 100000 == 0:
                    sys.stdout.write('!' + '\r\n')
                elif counter % 2000 == 0:
                    sys.stdout.write('.')
        print("")
        endtime = time.perf_counter()
        print("Processed " + str("{:,}".format(counter)) + " records in " + str(endtime - starttime) + " seconds (" + str((endtime - starttime) / 60) + " minutes).")
        print(str(counter / (endtime - starttime)) + " records per second.")
        print("")

But this process is too tedious considering there are over 400 sub-folders.

Then, in the command prompt, I type copy *.csv combine.csv, but this could be done with Python as well. I'm currently experimenting with os.walk but haven't made any major progress.
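
For reference, here is a minimal Python sketch of what that copy command does, with the small improvement of writing the shared header only once; it assumes every CSV in the current folder has the same columns:

import glob

# Skip combine.csv itself so re-running doesn't fold the output back in
csv_files = sorted(p for p in glob.glob("*.csv") if p != "combine.csv")
with open("combine.csv", "w", newline="") as out:
    for i, path in enumerate(csv_files):
        with open(path, "r", newline="") as f:
            header = f.readline()
            if i == 0:
                out.write(header)  # write the shared header once
            out.writelines(f)      # append the remaining data rows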

Ideally, the output should be one CSV file with all the combined data for each individual state.

Ex.

"\Datafiles\FL.csv"
"\Datafiles\NJ.csv"

It would also be fine if the output were a pandas DataFrame for each individual state.

UPDATE: I was able to convert all the DBF files to CSV using os.walk. os.walk has also been helpful in providing a list of the directories that contain the DBF and CSV files. Ex.

fl_dirs= ['\Datafiles\\01_APRIL_2019\\01_APRIL_2019\\FL',
 '\Datafiles\\01_JUly_2019\\01_JUlY_2019\\FL',
 '\Datafiles\\03_JUNE_2019\\03_JUNE_2019\\FL',
 '\Datafiles\\04_MARCH_2019\\04_MARCH_2019\\FL']

I simply want to access the CSV files in those directories and combine them into one CSV file with Python.
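
A minimal sketch of that step, assuming fl_dirs is the list above and the CSVs in those folders share the same columns:

import glob
import os
import pandas as pd

csv_paths = []
for d in fl_dirs:
    csv_paths.extend(glob.glob(os.path.join(d, "*.csv")))  # every CSV in each FL folder

fl_df = pd.concat((pd.read_csv(p) for p in csv_paths), ignore_index=True)
fl_df.to_csv("FL.csv", index=False)  # one combined CSV for Florida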

UPDATE: SOLVED THIS! I wrote a script that can do everything I needed!

This problem can be simplified using os.walk ( https://docs.python.org/3/library/os.html#os.walk ).

The sub-directories can be traversed, and the absolute path of each DBF file appended to a separate list per state.

Then, the files can be converted to CSV using the dbf_to_csv function above, and the results combined with pandas.concat ( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html ).

EDIT: The following code might help. It's not tested, though.

import pandas as pd
import os

# base path here
base_path = ""
# output dir here
output_path = ""

# Dictionary to collect the absolute path of every converted CSV, keyed by state
path_dict = {"FL": [], "NJ": []}

# Recursively walk the base path; keep only .csv files so pd.read_csv
# doesn't choke on the original .dbf files sitting in the same folders
for abs_path, curr_dir, file_list in os.walk(base_path):
    if abs_path.endswith("FL"):
        path_dict["FL"].extend(os.path.join(abs_path, f)
                               for f in file_list if f.lower().endswith(".csv"))
    elif abs_path.endswith("NJ"):
        path_dict["NJ"].extend(os.path.join(abs_path, f)
                               for f in file_list if f.lower().endswith(".csv"))

for state in path_dict:
    df = pd.concat(
        [pd.read_csv(p) for p in set(path_dict[state])],
        ignore_index=True
    )
    df.to_csv(os.path.join(output_path, state + ".csv"), index=False)
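
For a fully end-to-end version (DBF in, one CSV per state out), here is a sketch that plugs in the dbf_to_csv function from earlier; it assumes the state code is the last path component and that all files for a state share the same columns:

import os
import pandas as pd

states = ("FL", "NJ")
base_path = ""    # root of the Datafiles tree
output_path = ""  # where FL.csv / NJ.csv should land

for state in states:
    frames = []
    for root, dirs, files in os.walk(base_path):
        if os.path.basename(root) != state:
            continue
        for name in files:
            if name.lower().endswith(".dbf"):
                csv_path = dbf_to_csv(os.path.join(root, name))  # convert first
                frames.append(pd.read_csv(csv_path))             # then load
    if frames:
        pd.concat(frames, ignore_index=True).to_csv(
            os.path.join(output_path, state + ".csv"), index=False
        )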
