
How to 1. convert 4,550 dbf files to csv files, 2. concatenate files based on names, 3. concatenate all csv's into one big data csv for analysis?

I have multiple dbf files (~4,550) in multiple folders and sub-directories (~400), separated by state. The data was given to me in dbf files on a weekly basis, separated by state.

Ex.

"Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\FL\DATA5393.DBF"
"Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\FL\DATA5414.DBF"
"Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\NJ\DATA890.DBF"
"Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\NJ\DATA1071.DBF"

"Datafiles\DAT_01_JUly_2019\DAT_01_JUlY_2019\FL\DATA5393.DBF"
"Datafiles\DAT_01_JUly_2019\DAT_01_JUlY_2019\FL\DATA5414.DBF"
"Datafiles\DAT_01_JUly_2019\DAT_01_JUlY_2019\NJ\DATA890.DBF"
"Datafiles\DAT_01_JUly_2019\DAT_01_JUlY_2019\NJ\DATA1071.DBF"

How would I convert and merge all the dbf files into one csv per state, i.e. keeping the states separate (for regional data analysis)?

I'm currently using Python 3 and Jupyter notebooks on Windows 10.

This problem seems solvable with Python; I have experimented with dbf2csv and other dbf and csv functions.

The code below shows some good starting points. Research was done through many posts and my own experimentation. I'm still getting started with using Python for working with files, and I'm not entirely sure how to automate these tedious tasks.

I typically use the functions below to convert to csv, followed by a line in the command prompt to combine all csv files into one.

The function below converts one specific dbf to csv:

import csv
from dbfread import DBF

def dbf_to_csv(dbf_table_pth):
    """Input a dbf, output a csv with the same name and path, except the extension."""
    csv_fn = dbf_table_pth[:-4] + ".csv"  # set the csv file name
    table = DBF(dbf_table_pth)  # table variable is a DBF object
    with open(csv_fn, 'w', newline='') as f:  # create a csv file, fill it with dbf content
        writer = csv.writer(f)
        writer.writerow(table.field_names)  # write the column names
        for record in table:  # write the rows
            writer.writerow(list(record.values()))
    return csv_fn  # return the csv file name
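
For example, calling it on one of the paths above writes the csv next to the dbf (a usage sketch):

csv_path = dbf_to_csv(r"Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\FL\DATA5393.DBF")
# csv_path == r"Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\FL\DATA5393.csv"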

The script below converts all dbf files in a given folder to csv format. This works great, but doesn't take the sub-folders and sub-directories into consideration.

import fnmatch
import os
import csv
import time
import sys
from dbfread import DBF, FieldParser, InvalidValue          
# pip install dbfread if needed

class MyFieldParser(FieldParser):
    def parse(self, field, data):
        try:
            return FieldParser.parse(self, field, data)
        except ValueError:
            return InvalidValue(data)


debugmode = 0  # Set to 1 to print every invalid value encountered.

for infile in os.listdir('.'):
    if fnmatch.fnmatch(infile, '*.dbf'):
        outfile = infile[:-4] + ".csv"
        print("Converting " + infile + " to " + outfile + ". Each period represents 2,000 records.")
        counter = 0
        starttime = time.perf_counter()  # time.clock() was removed in Python 3.8
        with open(outfile, 'w', newline='') as csvfile:  # newline='' prevents blank rows on Windows
            table = DBF(infile, parserclass=MyFieldParser, ignore_missing_memofile=True)
            writer = csv.writer(csvfile)
            writer.writerow(table.field_names)
            for i, record in enumerate(table):
                for name, value in record.items():
                    if isinstance(value, InvalidValue):
                        if debugmode == 1:
                            print('records[{}][{!r}] == {!r}'.format(i, name, value))
                writer.writerow(list(record.values()))
                counter +=1
                if counter%100000==0:
                    sys.stdout.write('!' + '\r\n')
                    endtime = time.perf_counter()
                    # print("{:,}".format(counter) + " records in " + str(endtime - starttime) + " seconds.")
                elif counter%2000==0:
                    sys.stdout.write('.')
                else:
                    pass
        print("")
        endtime = time.perf_counter()
        print("Processed " + str("{:,}".format(counter)) + " records in " + str(endtime - starttime) + " seconds (" + str((endtime - starttime) / 60) + " minutes).")
        print(str(counter / (endtime - starttime)) + " records per second.")
        print("")

But this process is too tedious considering there are over 400 sub-folders.
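
A recursive version can be sketched with os.walk, reusing the dbf_to_csv function from above (untested; base_dir is a placeholder for the actual root folder, and every .dbf found under it is converted in place):

import os

base_dir = "Datafiles"  # placeholder: root folder containing the weekly sub-folders

for dirpath, dirnames, filenames in os.walk(base_dir):
    for name in filenames:
        if name.lower().endswith(".dbf"):
            dbf_path = os.path.join(dirpath, name)
            print("Converting " + dbf_path)
            dbf_to_csv(dbf_path)  # writes the csv next to the dbf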

Then, using the command prompt, I type copy *.csv combine.csv, but this can be done with Python as well. I'm currently experimenting with os.walk, but have not made any major progress.
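
A rough Python equivalent of that copy command (a sketch, untested; like the shell version it assumes all files share the same column layout, but it keeps only the first header row):

import glob

csv_files = sorted(glob.glob("*.csv"))
header_written = False
with open("combine.csv", "w", newline="") as out:
    for path in csv_files:
        if path == "combine.csv":
            continue  # don't append the output file to itself
        with open(path, "r", newline="") as f:
            header = f.readline()
            if not header_written:
                out.write(header)  # write the header row only once
                header_written = True
            out.write(f.read())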

Ideally, the output should be one csv file with all the combined data for each individual state.

Ex.

"\Datafiles\FL.csv"
"\Datafiles\NJ.csv"

It would also be fine if the output went into a pandas DataFrame for each individual state.

UPDATE: I was able to convert all the dbf files to csv using os.walk. os.walk has also been helpful in giving me a list of the directories which contain the dbf and csv files. Ex.

fl_dirs= ['\Datafiles\\01_APRIL_2019\\01_APRIL_2019\\FL',
 '\Datafiles\\01_JUly_2019\\01_JUlY_2019\\FL',
 '\Datafiles\\03_JUNE_2019\\03_JUNE_2019\\FL',
 '\Datafiles\\04_MARCH_2019\\04_MARCH_2019\\FL']

I simply want to access the matching csv files in those directories and combine them into one csv file with Python.
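
Given a list of state directories like fl_dirs above, one way is to glob the csv files in each directory and concatenate them with pandas (a sketch, untested; it assumes the per-file column layouts match):

import os
import glob
import pandas as pd

frames = []
for d in fl_dirs:
    for path in sorted(glob.glob(os.path.join(d, "*.csv"))):
        frames.append(pd.read_csv(path))

fl_df = pd.concat(frames, ignore_index=True)
fl_df.to_csv(os.path.join("Datafiles", "FL.csv"), index=False)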

UPDATE: SOLVED THIS! I wrote a script that can do everything I needed!

This problem can be simplified using os.walk ( https://docs.python.org/3/library/os.html#os.walk ).

The sub-directories can be traversed, and the absolute path of each dbf file can be appended to separate lists based on the state.

Then the files can be converted to csv using the function dbf_to_csv, and the results can be combined using the concat feature included in pandas ( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html ).

EDIT: The following code might help. It's not tested, though.

import pandas as pd
import os

# base path here
base_path = ""
# output dir here
output_path = ""

# dictionary to store the absolute path of every csv, keyed by state
path_dict = {"FL": [], "NJ": []}

# recursively walk the base path
for abs_path, curr_dirs, file_list in os.walk(base_path):
    if abs_path.endswith("FL"):
        path_dict["FL"].extend(
            os.path.join(abs_path, file)
            for file in file_list
            if file.lower().endswith(".csv")  # skip the original .dbf files
        )
    elif abs_path.endswith("NJ"):
        path_dict["NJ"].extend(
            os.path.join(abs_path, file)
            for file in file_list
            if file.lower().endswith(".csv")  # skip the original .dbf files
        )

# concatenate each state's csv files into one combined csv
for state in path_dict:
    df = pd.concat(
        [pd.read_csv(i) for i in set(path_dict[state])],
        ignore_index=True
    )
    df.to_csv(os.path.join(output_path, state + ".csv"), index=False)
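
As a possible follow-up, the intermediate csv files could be skipped entirely by loading each dbf straight into a DataFrame, using the pd.DataFrame(iter(DBF(...))) pattern from the dbfread documentation (a sketch, untested; assumes everything fits in memory):

import os
import pandas as pd
from dbfread import DBF

base_path = ""    # root folder containing the weekly sub-folders, as above
output_path = ""  # output folder, as above

for state in ("FL", "NJ"):
    frames = []
    for abs_path, dirs, files in os.walk(base_path):
        if abs_path.endswith(state):
            for name in sorted(files):
                if name.lower().endswith(".dbf"):
                    table = DBF(os.path.join(abs_path, name), ignore_missing_memofile=True)
                    frames.append(pd.DataFrame(iter(table)))  # dbfread's documented pandas idiom
    df = pd.concat(frames, ignore_index=True)
    df.to_csv(os.path.join(output_path, state + ".csv"), index=False)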
