Concatenating multiple csv files in Apache Beam
I am trying to read several csv files using fileio.MatchFiles, convert them into pd.DataFrame, and then concatenate them into one csv file. To do this, I have created two ParDo classes: one to convert the files into DataFrames and one to merge them into a merged csv. The whole snippet looks like below:
class convert_to_dataFrame(beam.DoFn):
    def process(self, element):
        return pd.DataFrame(element)


class merge_dataframes(beam.DoFn):
    def process(self, element):
        logging.info(element)
        logging.info(type(element))
        return pd.concat(element).reset_index(drop=True)


p = beam.Pipeline()
concating = (p
             | beam.io.fileio.MatchFiles("C:/Users/firuz/Documents/task/mobilab_da_task/concats/**")
             | beam.io.fileio.ReadMatches()
             | beam.Reshuffle()
             | beam.ParDo(convert_to_dataFrame())
             | beam.combiners.ToList()
             | beam.ParDo(merge_dataframes())
             | beam.io.WriteToText('C:/Users/firuz/Documents/task/mobilab_da_task/output_tests/merged', file_name_suffix='.csv'))
p.run()
After running I receive a ValueError on ParDo(merge_dataframes). I presume that ReadMatches doesn't match any file, or that ParDo(convert_to_dataFrame) is returning None objects. Any ideas on this approach, or on other approaches to reading and merging files? The error output:
ValueError: No objects to concatenate [while running 'ParDo(merge_dataframes)']
To answer the first question, regarding the error ValueError: No objects to concatenate [while running 'ParDo(merge_dataframes)']: you are on a Windows file system, and you need to use the delimiter \ instead of /. You can use os.path.join instead, so you do not need to worry about the filesystem:
import glob
import os

all_files1 = glob.glob(os.path.join(path1, "*.csv"))
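A minimal sketch of that pattern end to end, using plain pandas rather than a Beam pipeline; the temporary directory and the two sample files are assumptions made so the snippet is self-contained:

```python
import glob
import os
import tempfile

import pandas as pd

# Create two small CSV files in a temporary directory for illustration.
tmpdir = tempfile.mkdtemp()
pd.DataFrame({"a": [1, 2]}).to_csv(os.path.join(tmpdir, "one.csv"), index=False)
pd.DataFrame({"a": [3, 4]}).to_csv(os.path.join(tmpdir, "two.csv"), index=False)

# os.path.join builds the glob pattern with the right separator for the OS.
all_files = sorted(glob.glob(os.path.join(tmpdir, "*.csv")))

# Read every matched file and concatenate into a single DataFrame.
merged = pd.concat(pd.read_csv(f) for f in all_files).reset_index(drop=True)
print(len(merged))  # 4 rows, two from each file
```

The same merged frame can then be written back out with merged.to_csv(...).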
For the second question, regarding the error ValueError: DataFrame constructor not properly called, [while running 'ParDo(convert_to_dataFrame)']: you are sending the string representation of a dict to the DataFrame constructor, not a dict itself. This is the reason why you get that error. You could do this:
DataFrame(eval(data))
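A runnable sketch of that fix, using ast.literal_eval as a safer stand-in for eval on literal data; the data string here is a made-up example of what such an element might look like:

```python
import ast

import pandas as pd

# The element arrives as the *string* form of a dict, not a dict itself.
data = "{'name': ['alice', 'bob'], 'score': [1, 2]}"

# pd.DataFrame(data) would raise "DataFrame constructor not properly called"
# because the constructor receives a str. Parse it back into a dict first;
# ast.literal_eval only evaluates Python literals, unlike eval.
df = pd.DataFrame(ast.literal_eval(data))
print(df.shape)  # (2, 2)
```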