I am trying to read several csv files using fileio.MatchFiles, convert them into pd.DataFrame objects, and later concatenate them into one csv file. To do this, I have created two DoFn classes (run with ParDo): one to convert the files into DataFrames and one to merge them into a single merged csv. The whole snippet looks like below:
import logging

import apache_beam as beam
import pandas as pd
from apache_beam.io import fileio

class convert_to_dataFrame(beam.DoFn):
    def process(self, element):
        return pd.DataFrame(element)

class merge_dataframes(beam.DoFn):
    def process(self, element):
        logging.info(element)
        logging.info(type(element))
        return pd.concat(element).reset_index(drop=True)

p = beam.Pipeline()
concating = (p
    | fileio.MatchFiles("C:/Users/firuz/Documents/task/mobilab_da_task/concats/**")
    | fileio.ReadMatches()
    | beam.Reshuffle()
    | beam.ParDo(convert_to_dataFrame())
    | beam.combiners.ToList()
    | beam.ParDo(merge_dataframes())
    | beam.io.WriteToText('C:/Users/firuz/Documents/task/mobilab_da_task/output_tests/merged', file_name_suffix='.csv'))
p.run()
After running it, I receive a ValueError on ParDo(merge_dataframes). I presume that either ReadMatches doesn't pick up any files or ParDo(convert_to_dataFrame) is returning None objects. Any ideas on this approach, or on other approaches to reading and merging files? The error output:
ValueError: No objects to concatenate [while running 'ParDo(merge_dataframes)']
To answer the first question, regarding the error ValueError: No objects to concatenate [while running 'ParDo(merge_dataframes)']: you are on a Windows file system, so you need to use the delimiter \ instead of /. Alternatively, you can use os.path.join, and then you do not need to worry about the filesystem:
import glob
import os

# path1 is the directory that contains the CSV files
all_files1 = glob.glob(os.path.join(path1, "*.csv"))
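As a small illustration of the os.path.join approach, here is a minimal sketch that builds the pattern and lists the matching files. The helper name list_csv_files and the temporary directory are hypothetical, used only to make the example self-contained; they are not part of the original pipeline.

```python
import glob
import os
import tempfile


def list_csv_files(directory):
    """Return the sorted paths of the .csv files directly inside `directory`."""
    # os.path.join inserts the separator that is correct for the current OS.
    pattern = os.path.join(directory, "*.csv")
    return sorted(glob.glob(pattern))


# Demo in a throwaway directory (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.csv", "b.csv", "notes.txt"):
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write("x,y\n1,2\n")
    found = list_csv_files(tmp)
    print([os.path.basename(path) for path in found])  # ['a.csv', 'b.csv']
```

If the returned list is empty, the pattern matched nothing, which is worth checking before passing anything on to pd.concat.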
For the second question, regarding the error ValueError: DataFrame constructor not properly called [while running 'ParDo(convert_to_dataFrame)']: you are passing the DataFrame constructor something other than a dict, for example a string representation of a dict, rather than a dict itself. That is why you get this error.
If data is such a string, you could do this:
DataFrame(eval(data))
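Note that eval executes arbitrary code, so for input that is a string holding a dict literal, ast.literal_eval from the standard library is a safer alternative. A minimal sketch, assuming data really is such a string; the helper name parse_dict_literal and the sample value are hypothetical:

```python
import ast


def parse_dict_literal(data):
    """Parse a string containing a Python dict literal into a real dict."""
    # literal_eval only accepts literals (dicts, lists, strings, numbers, ...),
    # so it cannot run arbitrary code the way eval can.
    parsed = ast.literal_eval(data)
    if not isinstance(parsed, dict):
        raise TypeError(f"expected a dict literal, got {type(parsed).__name__}")
    return parsed


sample = "{'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}"
record = parse_dict_literal(sample)
print(type(record).__name__)  # dict
print(record["col_1"])        # [3, 2, 1, 0]
```

The resulting dict can then be passed to the DataFrame constructor in place of eval(data).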