awswrangler write parquet dataframes to a single file

I am creating a very big file that cannot fit in memory directly, so I have created a bunch of small files in S3 and am writing a script that can read these files and merge them. I am using awswrangler to do this.

My code is as follows:

    import logging

    import awswrangler as wr

    logger = logging.getLogger(__name__)

    try:
        # read the small parquet files chunk by chunk so they never all sit in memory at once
        dfs = wr.s3.read_parquet(path=input_folder, path_suffix=['.parquet'], chunked=True, use_threads=True)
        for df in dfs:
            # dataset=True with mode="append" adds a new file under target_path on every call
            path = wr.s3.to_parquet(df=df, dataset=True, path=target_path, mode="append")
            logger.info(path)
    except Exception as e:
        logger.error(e, exc_info=True)

The problem is that wr.s3.to_parquet creates a lot of files instead of writing to one file. I also can't remove chunked=True, because otherwise my program fails with an OOM error.

How do I make this write a single file in S3?

AWS Data Wrangler is writing multiple files because you have specified dataset=True. Removing this flag or switching it to False should do the trick, as long as you are specifying a full path.
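
For illustration, a minimal sketch of that call with dataset left at its default of False; the bucket and key below are placeholders:

    import awswrangler as wr
    import pandas as pd

    df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

    # without dataset=True, to_parquet writes exactly one object at the full key you pass
    wr.s3.to_parquet(
        df=df,
        path="s3://my-bucket/output/merged.parquet",  # placeholder: a full object key, not a folder
    )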

I don't believe this is possible. @Abdel Jaidi's suggestion won't work, because mode="append" requires dataset=True or it will throw an error. I believe that in this case, append has more to do with "appending" the data in Athena or Glue by adding new files to the same folder.

I also don't think this is even possible for parquet in general. As per this SO post, it's not possible in a local folder, let alone S3. On top of that, parquet is compressed, and I don't think it would be easy to add a line to a compressed file without loading it all into memory.

I think the only solution is to get a beefy EC2 instance that can handle this.

I'm facing a similar issue and I think I'm going to just loop over all the small files and create bigger ones. For example, you could append several dataframes together and then rewrite those, but you won't be able to get back to one parquet file unless you get a machine with enough RAM.
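
A rough sketch of that batching idea, assuming awswrangler and placeholder paths and batch size; it concatenates a handful of chunks at a time and writes each batch out as one larger parquet object:

    import awswrangler as wr
    import pandas as pd

    input_folder = "s3://my-bucket/small-files/"  # placeholder source prefix
    target_path = "s3://my-bucket/merged/"        # placeholder destination prefix
    batch_size = 10                               # how many chunks to combine per output file

    buffer = []
    part = 0
    # chunked=True yields one dataframe at a time, keeping memory usage bounded
    for df in wr.s3.read_parquet(path=input_folder, path_suffix=[".parquet"], chunked=True):
        buffer.append(df)
        if len(buffer) == batch_size:
            wr.s3.to_parquet(df=pd.concat(buffer, ignore_index=True),
                             path=f"{target_path}part-{part}.parquet")  # one object per batch
            buffer.clear()
            part += 1

    # flush whatever is left over
    if buffer:
        wr.s3.to_parquet(df=pd.concat(buffer, ignore_index=True),
                         path=f"{target_path}part-{part}.parquet")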
