
Changing the file name when writing a file back to an output folder in a Data Lake Storage account in Microsoft Azure

It would be very helpful if you could help me with my problem.

For my project requirement, I have to store a file under a specific name in a Data Lake Store in Microsoft Azure (a cloud-based platform). After performing transformations or actions on the data frame created from the loaded file in an HDInsight cluster, when I write the data frame out to a specific folder, it gets stored under a name like "part-00000-xxxx", i.e. in the Hadoop format.

But since I have a large number of files, I can't go into the created folder for every file and rename it to match my requirement each time.

So, can you please help me out with this?

NOTE: After storing the file we could copy it to another folder and give it whatever name we want while copying, but I don't want that solution. I want to give the file a specific name at the moment I write it back to my storage (Data Lake Store) after processing.

You could provide a subclass of the MultipleOutputFormat class to control the pattern of the file names, but that will need to be in Java, since you can't write OutputFormats with the streaming API.
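
As a minimal sketch of that idea (assuming the classic org.apache.hadoop.mapred API; the class name NamedOutputFormat and the fixed file name are hypothetical placeholders), overriding generateFileNameForKeyValue is where the output file name is decided:

```java
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Hypothetical subclass that replaces the default "part-00000"-style
// leaf name with a fixed name of your own choosing.
public class NamedOutputFormat extends MultipleTextOutputFormat<NullWritable, Text> {
    @Override
    protected String generateFileNameForKeyValue(NullWritable key, Text value, String name) {
        // "name" is the default leaf name generated by Hadoop; return the
        // file name you want instead (placeholder value for illustration).
        return "my-output.csv";
    }
}
```

How you wire this in depends on how the job writes its output; from Spark, for example, an RDD can be saved with saveAsHadoopFile and this output format class, but the exact call is up to your pipeline.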

Another option might be to use the Azure Storage client to merge and rename the output files once the job is over.
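
As a rename-after-the-job sketch (using the Hadoop FileSystem API rather than the Azure Storage SDK, and assuming the output folder is reachable through the cluster's configured adl:// connector; the paths and the target file name below are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RenamePartFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder output folder on the Data Lake Store.
        Path outputDir = new Path("adl://mystore.azuredatalakestore.net/output");
        FileSystem fs = outputDir.getFileSystem(conf);

        // Find the part file(s) the job wrote into the folder...
        FileStatus[] parts = fs.globStatus(new Path(outputDir, "part-*"));
        if (parts != null && parts.length == 1) {
            // ...and rename the single part file to the required name.
            fs.rename(parts[0].getPath(), new Path(outputDir, "my-output.csv"));
        }
    }
}
```

If the job produces more than one part file, you would first merge them (or write with a single partition) before renaming.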

