How do I 1. convert 4,550 dbf files to csv files, 2. concatenate files by name, and 3. concatenate all csvs into one big-data csv for analysis?

Question

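I have multiple dbf files (~4,550) spread across multiple folders and sub-directories (~400), separated by state. The data is given to me as dbf files on a weekly basis, separated by state.

For example: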

"Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\FL\DATA5393.DBF"
"Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\FL\DATA5414.DBF"
"Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\NJ\DATA890.DBF"
"Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\NJ\DATA1071.DBF"

"Datafiles\DAT_01_JUly_2019\DAT_01_JUlY_2019\FL\DATA5393.DBF"
"Datafiles\DAT_01_JUly_2019\DAT_01_JUlY_2019\FL\DATA5414.DBF"
"Datafiles\DAT_01_JUly_2019\DAT_01_JUlY_2019\NJ\DATA890.DBF"
"Datafiles\DAT_01_JUly_2019\DAT_01_JUlY_2019\NJ\DATA1071.DBF"

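How can I convert and merge all the dbf files for each state into one csv, i.e. keeping the states separate (for regional data analysis)?

I am currently using Python 3 and Jupyter notebooks on Windows 10.

This problem seems solvable with Python. I have experimented with dbf2csv and other dbf and csv functions.

The code below shows some good starting points. The research was done through many posts and my own experimentation. I am still getting started with using Python to work with files, and I am not sure how to code around the tedious parts of the task.

I typically use the function below to convert to csv, followed by one line at the command prompt to combine all the csv files into one.

The function below converts one specific dbf to csv: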

import csv
from dbfread import DBF

def dbf_to_csv(dbf_table_pth):#Input a dbf, output a csv, same name, same path, except extension
    csv_fn = dbf_table_pth[:-4]+ ".csv" #Set the csv file name
    table = DBF(dbf_table_pth)# table variable is a DBF object
    with open(csv_fn, 'w', newline = '') as f:# create a csv file, fill it with dbf content
        writer = csv.writer(f)
        writer.writerow(table.field_names)# write the column name
        for record in table:# write the rows
            writer.writerow(list(record.values()))
    return csv_fn# return the csv name

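The script below converts all dbf files in a given folder to csv format. This works well, but it does not take sub-folders and sub-directories into account.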

import fnmatch
import os
import csv
import time
import datetime
import sys
from dbfread import DBF, FieldParser, InvalidValue          
# pip install dbfread if needed

class MyFieldParser(FieldParser):
    def parse(self, field, data):
        try:
            return FieldParser.parse(self, field, data)
        except ValueError:
            return InvalidValue(data)


debugmode=0         # Set to 1 to catch all the errors.            

for infile in os.listdir('.'):
    if fnmatch.fnmatch(infile, '*.dbf'):
        outfile = infile[:-4] + ".csv"
        print("Converting " + infile + " to " + outfile + ". Each period represents 2,000 records.")
        counter = 0
        starttime=time.perf_counter()   # time.clock() was removed in Python 3.8
        with open(outfile, 'w', newline='') as csvfile:   # newline='' avoids blank rows on Windows
            table = DBF(infile, parserclass=MyFieldParser, ignore_missing_memofile=True)
            writer = csv.writer(csvfile)
            writer.writerow(table.field_names)
            for i, record in enumerate(table):
                for name, value in record.items():
                    if isinstance(value, InvalidValue):
                        if debugmode == 1:
                            print('records[{}][{!r}] == {!r}'.format(i, name, value))
                writer.writerow(list(record.values()))
                counter +=1
                if counter%100000==0:
                    sys.stdout.write('!' + '\r\n')
                elif counter%2000==0:
                    sys.stdout.write('.')
        print("")
        endtime=time.perf_counter()
        print ("Processed " + str("{:,}".format(counter)) + " records in " + str(endtime-starttime) + " seconds (" + str((endtime-starttime)/60) + " minutes.)")
        print (str(counter / (endtime-starttime)) + " records per second.")
        print("")

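But this process is too tedious considering there are over 400 sub-folders.

Then, at the command prompt, I type copy *.csv combine.csv, but this could be done with Python as well. I am currently experimenting with os.walk, but have not made any major progress.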
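Since the folder-level script above stops at one directory, here is a minimal sketch of how os.walk could be used to reach every sub-folder. The helper name find_files and its parameters are hypothetical, not part of dbfread or the standard library, and the extension check is made case-insensitive on the assumption that both .DBF and .dbf occur.

```python
import os

def find_files(root, extension=".dbf"):
    """Yield the full path of every file under root with the given extension.

    Hypothetical helper: the match is case-insensitive, so DATA1.DBF and
    data1.dbf are both found.
    """
    wanted = extension.lower()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(wanted):
                yield os.path.join(dirpath, name)
```

Each yielded path could then be passed to the dbf_to_csv function shown earlier, for example `for path in find_files("Datafiles"): dbf_to_csv(path)`.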

Ideally, the output would be one csv file per state, containing all the combined data for that state.

For example:

"\Datafiles\FL.csv"
"\Datafiles\NJ.csv"

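It would also be fine if the output went into a pandas dataframe for each individual state.

UPDATE (edit): I was able to convert all the dbf files to csv using os.walk. os.walk has also helped by giving me a list of the directories that contain the dbf and csv files. For example: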

fl_dirs= ['\\Datafiles\\01_APRIL_2019\\01_APRIL_2019\\FL',
 '\\Datafiles\\01_JUly_2019\\01_JUlY_2019\\FL',
 '\\Datafiles\\03_JUNE_2019\\03_JUNE_2019\\FL',
 '\\Datafiles\\04_MARCH_2019\\04_MARCH_2019\\FL']

I simply want to access the matching csv files in those directories and combine them into one csv file with Python.
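One way this could look, using only the standard library, is sketched below. The function name merge_csvs is hypothetical, and the sketch assumes every csv in the listed directories shares the same header row. Unlike copy *.csv combine.csv, it writes the header only once.

```python
import csv
import glob
import os

def merge_csvs(directories, out_path):
    """Append every *.csv in the given directories to out_path.

    Hypothetical helper. Assumes all input files share one header row,
    which is written once. Returns the number of data rows written.
    """
    header = None
    rows_written = 0
    out_abs = os.path.abspath(out_path)
    with open(out_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        for directory in directories:
            for csv_path in sorted(glob.glob(os.path.join(directory, "*.csv"))):
                if os.path.abspath(csv_path) == out_abs:
                    continue  # never read the output file back in
                with open(csv_path, newline="") as in_file:
                    reader = csv.reader(in_file)
                    file_header = next(reader, None)
                    if file_header is None:
                        continue  # skip empty files
                    if header is None:
                        header = file_header
                        writer.writerow(header)
                    elif file_header != header:
                        raise ValueError(csv_path + " has different columns")
                    for row in reader:
                        writer.writerow(row)
                        rows_written += 1
    return rows_written
```

With the list above, a call such as `merge_csvs(fl_dirs, "FL.csv")` would produce the combined file for one state. Raising on a header mismatch is a design choice for this sketch. If the weekly files can legitimately differ in columns, pandas.concat may be a better fit because it aligns columns by name.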

UPDATE: SOLVED! I wrote a script that does everything I needed!

Answer 1

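The problem can be simplified using os.walk ( https://docs.python.org/3/library/os.html#os.listdir ).

The sub-directories can be traversed, and the absolute path of each dbf file can be appended to a separate list based on the state.

Then the files can be converted to csv using the dbf_to_csv function, and combined using the concat feature included in pandas ( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html ).

EDIT: The following code might help. It has not been tested, though.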

import pandas as pd
import os

# basepath here
base_path="" 
#output dir here
output_path=""


#Create dictionary to store all absolute path
path_dict={"FL":[],"NJ":[]}

#recursively look up into base path
for abs_path,curr_dir,file_list in os.walk(base_path):
    #the folder name is the state; keep only csv files,
    #because the same folders still contain the original dbf files
    state=os.path.basename(abs_path)
    if state in path_dict:
        path_dict[state].extend([os.path.join(abs_path,file) for file in file_list if file.lower().endswith(".csv")])

for state in path_dict:
    if not path_dict[state]:
        continue #pd.concat raises ValueError on an empty list
    df=pd.concat(
        [pd.read_csv(i) for i in sorted(set(path_dict[state]))],
        ignore_index=True
    )
    df.to_csv(os.path.join(output_path,state+".csv"),index=False)
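The dictionary above names FL and NJ explicitly. If more states are present, the grouping could instead be derived from the folder names, as in the following sketch. The function name group_by_parent_folder is hypothetical, and the sketch assumes that the immediate parent folder of each file is the state code, as in the example paths in the question.

```python
import os
from collections import defaultdict

def group_by_parent_folder(root, extension=".csv"):
    """Map each parent folder name (e.g. 'FL') to the matching files below root.

    Hypothetical helper: assumes the folder directly containing a file
    is named after its state.
    """
    wanted = extension.lower()
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        state = os.path.basename(dirpath)
        for name in sorted(filenames):
            if name.lower().endswith(wanted):
                groups[state].append(os.path.join(dirpath, name))
    return dict(groups)
```

The returned dictionary has the same shape as path_dict, so it could be fed to the pd.concat loop unchanged. Files sitting directly in the root folder would be grouped under the root folder's own name, so writing the combined output to a separate directory avoids picking it up on a later run.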
