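How do I 1. convert 4,550 dbf files to csv files, 2. concatenate files based on their names, and 3. concatenate all the csv files into one big csv for analysis?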

Question

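I have multiple dbf files (about 4,550) spread across multiple folders and sub-directories (about 400), separated by state. The data arrives as dbf files on a weekly basis, split by state.

For example: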

"Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\FL\DATA5393.DBF"
"Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\FL\DATA5414.DBF"
"Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\NJ\DATA890.DBF"
"Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\NJ\DATA1071.DBF"

"Datafiles\DAT_01_JUly_2019\DAT_01_JUlY_2019\FL\DATA5393.DBF"
"Datafiles\DAT_01_JUly_2019\DAT_01_JUlY_2019\FL\DATA5414.DBF"
"Datafiles\DAT_01_JUly_2019\DAT_01_JUlY_2019\NJ\DATA890.DBF"
"Datafiles\DAT_01_JUly_2019\DAT_01_JUlY_2019\NJ\DATA1071.DBF"

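How can I convert and merge all the dbf files for each state into one csv per state, i.e. keeping the states separate (for regional data analysis)?

I am currently using Python 3 and Jupyter notebooks on Windows 10.

This problem seems solvable with Python. I have experimented with dbf2csv and with other dbf and csv functions.

The code below shows some good starting points. The research was done through many posts and my own experimentation. I am still getting started with using Python to work with files, and I am not sure how to write code around the tedious parts.

I typically use the functions below to convert to csv, followed by one line at the command prompt to combine all the csv files into one.

The function below converts one specific dbf file to csv: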

import csv
from dbfread import DBF

def dbf_to_csv(dbf_table_pth):#Input a dbf, output a csv, same name, same path, except extension
    csv_fn = dbf_table_pth[:-4]+ ".csv" #Set the csv file name
    table = DBF(dbf_table_pth)# table variable is a DBF object
    with open(csv_fn, 'w', newline = '') as f:# create a csv file, fill it with dbf content
        writer = csv.writer(f)
        writer.writerow(table.field_names)# write the column name
        for record in table:# write the rows
            writer.writerow(list(record.values()))
    return csv_fn# return the csv name

The script below converts all dbf files in a given folder to csv format. It works well, but it does not look into subfolders and sub-directories.

import fnmatch
import os
import csv
import time
import datetime
import sys
from dbfread import DBF, FieldParser, InvalidValue          
# pip install dbfread if needed

class MyFieldParser(FieldParser):
    def parse(self, field, data):
        try:
            return FieldParser.parse(self, field, data)
        except ValueError:
            return InvalidValue(data)


debugmode=0         # Set to 1 to catch all the errors.            

for infile in os.listdir('.'):
    # Lower-case the name so .DBF and .dbf match on every platform
    if fnmatch.fnmatch(infile.lower(), '*.dbf'):
        outfile = infile[:-4] + ".csv"
        print("Converting " + infile + " to " + outfile + ". Each period represents 2,000 records.")
        counter = 0
        starttime = time.perf_counter()  # time.clock() was removed in Python 3.8
        # newline='' stops the csv module from writing blank lines on Windows
        with open(outfile, 'w', newline='') as csvfile:
            table = DBF(infile, parserclass=MyFieldParser, ignore_missing_memofile=True)
            writer = csv.writer(csvfile)
            writer.writerow(table.field_names)
            for i, record in enumerate(table):
                if debugmode == 1:
                    # Report every value that could not be parsed
                    for name, value in record.items():
                        if isinstance(value, InvalidValue):
                            print('records[{}][{!r}] == {!r}'.format(i, name, value))
                writer.writerow(list(record.values()))
                counter += 1
                if counter % 100000 == 0:
                    sys.stdout.write('!' + '\r\n')
                elif counter % 2000 == 0:
                    sys.stdout.write('.')
        print("")
        elapsed = time.perf_counter() - starttime
        print("Processed {:,} records in {} seconds ({} minutes).".format(counter, elapsed, elapsed / 60))
        print("{} records per second.".format(counter / elapsed))
        print("")

But with more than 400 subfolders, this process is too tedious.
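One way to avoid running the script once per folder is to search the whole tree in a single pass. The following is a minimal sketch using `pathlib`; `iter_dbf_files` is a hypothetical helper name, and it assumes that DBF files can be recognized by their extension alone (compared in lower case, since the sample paths use `.DBF`).

```python
from pathlib import Path

def iter_dbf_files(root):
    """Yield every .dbf file under root, searching all subfolders."""
    # rglob("*") visits every entry below root; sorting gives a stable order
    for path in sorted(Path(root).rglob("*")):
        # Compare the extension in lower case so .DBF and .dbf both match
        if path.is_file() and path.suffix.lower() == ".dbf":
            yield path
```

Each yielded path could then be handed to the `dbf_to_csv` function shown earlier, for example `dbf_to_csv(str(path))`, because that function slices its argument as a string.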

I then type copy *.csv combine.csv at the command prompt, but this could also be done in Python. I am currently trying os.walk, but have not made any significant progress.
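The combining step can be sketched in Python with only the standard `csv` module. A plain byte-level concatenation repeats each file's header row in the output, so this version writes the header once. `combine_csvs` is a hypothetical name, and the sketch assumes every input file starts with the same header row.

```python
import csv

def combine_csvs(csv_paths, out_path):
    """Concatenate CSV files into out_path, writing the header row only once.

    Assumes every input file starts with the same header row.
    Returns the number of data rows written.
    """
    rows_written = 0
    header_written = False
    with open(out_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        for path in csv_paths:
            with open(path, newline="") as in_file:
                reader = csv.reader(in_file)
                header = next(reader, None)
                if header is None:
                    continue  # skip empty files
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

The output file should not be one of the input paths, otherwise it would be truncated before it is read.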

Ideally, the output would be one csv file per state containing all of that state's combined data.

For example:

"\Datafiles\FL.csv"
"\Datafiles\NJ.csv"

It would also be fine if the output went into a pandas DataFrame for each individual state.
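For the DataFrame variant, the intermediate csv files could be skipped. The sketch below assumes a mapping from state code to a list of DBF paths has already been built; `read_dbf_records` and `frames_by_state` are hypothetical names, and the `read_records` parameter exists only so that the reader can be swapped out. Records from `dbfread` behave like dictionaries, so a list of them can be passed to `pandas.DataFrame`.

```python
import pandas as pd

def read_dbf_records(path):
    """Return the records of one DBF file as a list of dict-like rows."""
    from dbfread import DBF  # imported lazily; pip install dbfread if needed
    return list(DBF(path))

def frames_by_state(paths_by_state, read_records=read_dbf_records):
    """Build one DataFrame per state from a {state: [dbf paths]} mapping."""
    frames = {}
    for state, paths in paths_by_state.items():
        parts = [pd.DataFrame(read_records(p)) for p in paths]
        if parts:  # pd.concat raises ValueError on an empty list
            frames[state] = pd.concat(parts, ignore_index=True)
    return frames
```

Loading every file into memory at once may not be practical for very large inputs; in that case writing one csv per state first is the safer route.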

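Update: I was able to convert all the dbf files to csv using os.walk. os.walk also helped me build a list of the directories that contain the dbf and csv files. For example: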

fl_dirs = ['\\Datafiles\\01_APRIL_2019\\01_APRIL_2019\\FL',
 '\\Datafiles\\01_JUly_2019\\01_JUlY_2019\\FL',
 '\\Datafiles\\03_JUNE_2019\\03_JUNE_2019\\FL',
 '\\Datafiles\\04_MARCH_2019\\04_MARCH_2019\\FL']

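I just want to access the same-state csv files in those directories and then merge them into one csv file with Python.

Update: SOLVED! I wrote a script that does everything I needed!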

Answer 1

This problem can be simplified by using os.walk ( https://docs.python.org/3/library/os.html#os.walk ).

The subdirectories can be traversed, and the absolute path of each dbf file can be appended to a separate list based on its state.
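The grouping step described above can be sketched without hard-coding the state names. `group_dbf_paths_by_state` is a hypothetical helper, and it assumes that the folder directly containing each file is named after the state, as in the sample paths (`...\FL\DATA5393.DBF`).

```python
import os
from collections import defaultdict

def group_dbf_paths_by_state(base_path):
    """Map each state folder name to the absolute paths of its DBF files."""
    paths_by_state = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(base_path):
        # Assumption: the folder that directly holds the files is the state code
        state = os.path.basename(dirpath).upper()
        for name in filenames:
            # Compare in lower case so .DBF and .dbf both match
            if name.lower().endswith(".dbf"):
                full_path = os.path.abspath(os.path.join(dirpath, name))
                paths_by_state[state].append(full_path)
    return dict(paths_by_state)
```

Only folders that actually contain DBF files end up as keys, so the weekly parent folders do not appear in the result.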

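The files can then be converted to csv with the dbf_to_csv function and combined using the concat function included in pandas ( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html ).

Edit: the following code may help. It has not been tested.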

import os
import pandas as pd

# Base path to search
base_path = ""
# Output directory
output_path = ""

# Dictionary storing the csv paths found for each state
path_dict = {"FL": [], "NJ": []}

# Recursively walk the base path
for abs_path, dir_list, file_list in os.walk(base_path):
    # The state is the name of the folder that directly holds the files
    state = os.path.basename(abs_path)
    if state in path_dict:
        # Keep only the csv files; the folders also hold the original dbf files
        path_dict[state].extend(
            os.path.join(abs_path, file)
            for file in file_list
            if file.lower().endswith(".csv")
        )

for state, paths in path_dict.items():
    if not paths:
        continue  # pd.concat raises ValueError on an empty list
    df = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)
    df.to_csv(os.path.join(output_path, state + ".csv"), index=False)

Solution 1
0 2019-07-02 05:34:23
