Concatenating multiple csv files in Apache Beam

Question

I am trying to read several csv files using fileio.MatchFiles, convert each of them into a pd.DataFrame, and then concatenate them into one csv file. To do this, I have created two DoFn classes: one converts each file into a DataFrame, and the other merges the DataFrames into a single csv. The whole snippet looks like this:

import logging

import apache_beam as beam
import pandas as pd

class convert_to_dataFrame(beam.DoFn):
    def process(self, element):
        return pd.DataFrame(element)

class merge_dataframes(beam.DoFn):
    def process(self, element):
        logging.info(element)
        logging.info(type(element))
        return pd.concat(element).reset_index(drop=True)

p = beam.Pipeline() 
concating = (p
             | beam.io.fileio.MatchFiles("C:/Users/firuz/Documents/task/mobilab_da_task/concats/**")
             | beam.io.fileio.ReadMatches()
             | beam.Reshuffle()
             | beam.ParDo(convert_to_dataFrame())
             | beam.combiners.ToList()
             | beam.ParDo(merge_dataframes())
             | beam.io.WriteToText('C:/Users/firuz/Documents/task/mobilab_da_task/output_tests/merged', file_name_suffix='.csv'))

p.run()

After running it, I receive a ValueError in ParDo(merge_dataframes). I presume that either ReadMatches doesn't pick up any files or ParDo(convert_to_dataFrame) returns None objects. Any ideas on this approach, or on other approaches to reading and merging files? The error output:

ValueError: No objects to concatenate [while running 'ParDo(merge_dataframes)']

Answer 1

To answer the first question, about the error ValueError: No objects to concatenate [while running 'ParDo(merge_dataframes)']: you are on a Windows file system, so the path delimiter should be \ rather than / . If you build the path with os.path.join instead, you do not need to worry about which filesystem you are on:

import glob
import os
# path1 is the directory that holds the csv files
all_files1 = glob.glob(os.path.join(path1, "*.csv"))
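
As a slightly fuller sketch of the same idea (the helper name list_csv_files is made up for illustration, and the early failure on an empty match is my own addition): pd.concat raises "No objects to concatenate" when it is handed an empty list, so it can help to fail early with a clearer message when the pattern matches nothing.

```python
import glob
import os


def list_csv_files(directory):
    """Return the sorted paths of the .csv files directly inside `directory`."""
    # os.path.join inserts the separator that is right for the current OS.
    pattern = os.path.join(directory, "*.csv")
    paths = sorted(glob.glob(pattern))
    if not paths:
        # Fail here rather than later inside pd.concat([]).
        raise FileNotFoundError("no csv files match %s" % pattern)
    return paths
```

If this raises, the problem is in the path or pattern rather than in the merge step.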

For the second question, about the error ValueError: DataFrame constructor not properly called, [while running 'ParDo(convert_to_dataFrame)']: the value you are passing to the DataFrame constructor is not a dict itself but some other type, such as the string form of a dict. That is why you get that error.

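You could turn the value into a dict before constructing the DataFrame, for example with eval (only on input you trust, since eval executes arbitrary code):

pd.DataFrame(eval(data))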
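
For the pipeline itself, here is a minimal sketch of one way to do the merge. It is not necessarily what the answer above had in mind. It reads each matched file's text with ReadableFile.read_utf8() instead of passing the file object to the DataFrame constructor, and it uses the standard csv module in place of pandas to keep the example small. The names merge_csv_texts and run are made up for illustration, and the sketch assumes every file starts with the same header row.

```python
import csv
import io


def merge_csv_texts(texts):
    """Concatenate csv documents, writing the shared header row only once."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = None
    for text in texts:
        rows = list(csv.reader(io.StringIO(text)))
        if not rows:
            continue  # an empty file contributes nothing
        if header is None:
            header = rows[0]
            writer.writerow(header)
        elif rows[0] != header:
            raise ValueError("header mismatch: %r != %r" % (rows[0], header))
        writer.writerows(rows[1:])
    # WriteToText appends its own newline after each element.
    return out.getvalue().rstrip("\n")


def run(pattern, output_prefix):
    # Imported here so merge_csv_texts can be used and tested without Beam.
    import apache_beam as beam
    from apache_beam.io import fileio

    with beam.Pipeline() as p:
        (
            p
            | fileio.MatchFiles(pattern)
            | fileio.ReadMatches()
            # Keep the path so the merge order is deterministic.
            | beam.Map(lambda f: (f.metadata.path, f.read_utf8()))
            | beam.combiners.ToList()
            | beam.Map(lambda pairs: merge_csv_texts(t for _, t in sorted(pairs)))
            | beam.io.WriteToText(output_prefix, file_name_suffix=".csv")
        )
```

It would be called as run(os.path.join(input_dir, "*.csv"), output_prefix), where input_dir and output_prefix stand in for your own paths. If you prefer to stay with pandas, the same shape should work with pd.read_csv on the file text and pd.concat on the collected list. Note that a DoFn's process method has to return an iterable (or yield), so returning a DataFrame directly makes Beam iterate over it, which produces its column labels rather than the frame.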

solution1
0 2021-12-30 17:39:16