Concatenating multiple csv files in Apache Beam

Question

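I am trying to read several csv files with fileio.MatchFiles, convert each of them to a pd.DataFrame, and then concatenate them into a single csv file. To do this I created two ParDo classes: one converts the files to DataFrames and the other merges them into the merged csv. The whole snippet is shown below: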

import logging

import apache_beam as beam
import pandas as pd
from apache_beam.io import fileio  # makes beam.io.fileio available below

class convert_to_dataFrame(beam.DoFn):
    def process(self, element):
        return pd.DataFrame(element)

class merge_dataframes(beam.DoFn):
    def process(self, element):
        logging.info(element)
        logging.info(type(element))
        return pd.concat(element).reset_index(drop=True)

p = beam.Pipeline()
concating = (p
             | beam.io.fileio.MatchFiles("C:/Users/firuz/Documents/task/mobilab_da_task/concats/**")
             | beam.io.fileio.ReadMatches()
             | beam.Reshuffle()
             | beam.ParDo(convert_to_dataFrame())
             | beam.combiners.ToList()
             | beam.ParDo(merge_dataframes())
             | beam.io.WriteToText('C:/Users/firuz/Documents/task/mobilab_da_task/output_tests/merged', file_name_suffix='.csv'))

p.run()

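After running it, I get a ValueError at ParDo(merge_dataframes). I suspect that either ReadMatches is not picking up any files or ParDo(convert_to_dataFrame) is returning None objects. Any ideas about this approach, or any other way to read and merge the files, would be appreciated. The error output: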

ValueError: No objects to concatenate [while running 'ParDo(merge_dataframes)']

Answer 1

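As for the error ValueError: No objects to concatenate [while running 'ParDo(merge_dataframes)']: you are on a Windows file system, so you need to use the separator \ rather than /. Alternatively, you can use os.path.join, and then you do not need to worry about the file system: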

import glob
import os

# path1 is the directory that contains the csv files
all_files1 = glob.glob(os.path.join(path1, "*.csv"))
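
To make the os.path.join suggestion concrete, here is a minimal standard-library sketch. The helper name list_csv_files and the temporary directory with its sample files are assumptions made up for the illustration; they do not come from the question.

```python
import glob
import os
import tempfile


def list_csv_files(directory):
    """Return the sorted paths of all .csv files directly inside `directory`."""
    # os.path.join inserts the separator that the current platform expects.
    pattern = os.path.join(directory, "*.csv")
    return sorted(glob.glob(pattern))


# Hypothetical demo data: two csv files and one unrelated file.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("a.csv", "b.csv", "notes.txt"):
        with open(os.path.join(tmp_dir, name), "w") as handle:
            handle.write("x,y\n1,2\n")
    csv_names = [os.path.basename(path) for path in list_csv_files(tmp_dir)]

print(csv_names)  # ['a.csv', 'b.csv']
```

If the list comes back empty, the pattern matched nothing, which is worth checking before looking further down the pipeline.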

As for the error ValueError: DataFrame constructor not properly called, [while running 'ParDo(convert_to_dataFrame)']: you are passing the DataFrame constructor a value of some other type rather than a dictionary itself. That is why you get this error.

You can do this:

DataFrame(eval(data))

Solution 1
0 2021-12-30 17:39:16
