Concatenating multiple CSV files in Apache Beam

Question

I am trying to read several CSV files using fileio.MatchFiles, convert them into pd.DataFrame objects, and then concatenate them into one CSV file. To do this, I have created two ParDo classes: one to convert the files into DataFrames and one to merge them into a single merged CSV. The whole snippet looks like this:

class convert_to_dataFrame(beam.DoFn):
    def process(self, element):
        return pd.DataFrame(element)

class merge_dataframes(beam.DoFn):
    def process(self, element):
        logging.info(element)
        logging.info(type(element))
        return pd.concat(element).reset_index(drop=True)

p = beam.Pipeline() 
concating = (p
             | beam.io.fileio.MatchFiles("C:/Users/firuz/Documents/task/mobilab_da_task/concats/**")
             | beam.io.fileio.ReadMatches()
             | beam.Reshuffle()
             | beam.ParDo(convert_to_dataFrame())
             | beam.combiners.ToList()
             | beam.ParDo(merge_dataframes())
             | beam.io.WriteToText('C:/Users/firuz/Documents/task/mobilab_da_task/output_tests/merged', file_name_suffix='.csv'))

p.run()

After running it, I receive a ValueError in ParDo(merge_dataframes). I presume that either ReadMatches doesn't pick up any files or ParDo(convert_to_dataFrame) is returning None objects. Any ideas on this approach, or on other approaches to reading and merging files, would be appreciated. The error output:

ValueError: No objects to concatenate [while running 'ParDo(merge_dataframes)']

Answer 1

To answer the first question, regarding the error ValueError: No objects to concatenate [while running 'ParDo(merge_dataframes)']: you are on a Windows file system, so you need to use the delimiter \ instead of /. Alternatively, you can build the path with os.path.join, and then you do not need to worry about the file system:

import glob
import os
# path1: directory containing the CSV files
all_files1 = glob.glob(os.path.join(path1, "*.csv"))
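Outside of Beam, the same idea can be sketched end to end with plain pandas. This is only a minimal illustration, assuming all CSV files share the same columns; the function name `concat_csv_dir`, the file names, and the temporary directory are hypothetical and not part of the original question.

```python
import glob
import os
import tempfile

import pandas as pd


def concat_csv_dir(directory):
    """Read every *.csv file in `directory` and stack them into one DataFrame."""
    # os.path.join inserts the separator that matches the current OS.
    pattern = os.path.join(directory, "*.csv")
    paths = sorted(glob.glob(pattern))
    if not paths:
        # pd.concat raises "No objects to concatenate" on an empty list,
        # so fail earlier with a message that names the pattern.
        raise FileNotFoundError(f"no files match {pattern}")
    frames = [pd.read_csv(path) for path in paths]
    return pd.concat(frames, ignore_index=True)


# Demo on two small files in a throwaway directory (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    for name, rows in [("a.csv", "id,value\n1,x\n"), ("b.csv", "id,value\n2,y\n")]:
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write(rows)
    merged = concat_csv_dir(tmp)
```

Checking for an empty match list before calling `pd.concat` makes a wrong path show up as a clear "no files match" message rather than as the less obvious concatenation error.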

For the second question, regarding the error ValueError: DataFrame constructor not properly called [while running 'ParDo(convert_to_dataFrame)']: you are passing the DataFrame constructor some other type of value (for example, a string representation of a dict) rather than a dict itself. That is why you get this error.

You could do this:

DataFrame(eval(data))
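Note that `eval` executes whatever code the string contains. If `data` is a string holding a dict literal, `ast.literal_eval` from the standard library is a safer way to parse it. The following is a minimal sketch under that assumption; the contents of `data` here are hypothetical.

```python
import ast

import pandas as pd

# Hypothetical input: a dict serialized as text, not a dict object.
data = "{'id': [1, 2], 'value': ['x', 'y']}"

# literal_eval only accepts Python literals and never runs arbitrary code.
parsed = ast.literal_eval(data)

# Now the constructor receives a real dict mapping column names to lists.
df = pd.DataFrame(parsed)
```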

Solution 1
0 2021-12-30 17:39:16