
How to load parquet files from hadoopish folder

If I save a data frame this way in Java, ...:

df.write().parquet("myTest.parquet");

..., then it gets saved in a hadoopish way (a folder with numerous files).

Is it possible to save the data frame as a single file? I tried collect(), but it does not help.

If it's impossible, then my question is: how should I change the Python code for reading Parquet files from the hadoopish folder created by df.write().parquet("myTest.parquet"):

load_df = sqlContext.read.parquet("myTest.parquet").where('field1="aaa"').select('field2', 'field3').coalesce(64)

Is it possible to save data frame as a single file?

Yes, but you should not, as you may put too much pressure on a single JVM, which can lead not only to performance degradation but also to JVM termination and hence the failure of the entire Spark application.

So, yes, it's possible, and you should use repartition(1) to have a single partition:

repartition(numPartitions: Int): Dataset[T] Returns a new Dataset that has exactly numPartitions partitions.
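
For example, a minimal PySpark sketch of this (the question writes from Java, but the idea is the same; the SparkSession setup and sample data below are illustrative, only the path "myTest.parquet" comes from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("single-file-parquet").getOrCreate()
df = spark.createDataFrame([(1, "aaa"), (2, "bbb")], ["field1", "field2"])

# repartition(1) shuffles all rows into a single partition, so the output
# directory holds one part-*.parquet file (plus _SUCCESS and metadata files).
df.repartition(1).write.mode("overwrite").parquet("myTest.parquet")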


how should I change the Python code for reading Parquet files from hadoopish folder

Loading the dataset from what you called a "hadoopish" folder means not being concerned with the internal structure at all, and treating it as a single file (that is actually a directory under the covers).

That's an internal representation of how the files are stored, and it does not affect the code that loads them.
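
As a sketch in PySpark (assuming a SparkSession named spark; the path and column names are taken from the question), you simply point the reader at the directory:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Point the reader at the output directory itself; Spark discovers the part
# files inside it, so the "hadoopish" layout never appears in your code.
load_df = spark.read.parquet("myTest.parquet") \
    .where('field1 = "aaa"') \
    .select('field2', 'field3')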

Spark writes your files into a directory; these files are numerous, as you say, and if the write operation succeeds, it saves another empty file called _SUCCESS.

I'm coming from Scala, but I do believe there's a similar way in Python.

Saving and reading your files in parquet or json or whatever format you want is straightforward:

df.write.parquet("path")
loaddf = spark.read.parquet("path")

I tried collect(), but it does not help.

Talking about collect, it is not good practice to use it in such operations, because it returns your data to the driver, so you lose the benefits of parallel computation, and it will cause an OutOfMemoryException if the data can't fit in memory.
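
A hedged sketch of the difference (the DataFrame and output path here are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000000).withColumnRenamed("id", "field1")

small_sample = df.limit(10).collect()       # fine: only a few rows reach the driver
df.write.mode("overwrite").parquet("out")   # fine: executors write in parallel
# rows = df.collect()                       # risky: the whole dataset lands on the driver and may OOM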

Is it possible to save data frame as a single file?

You really don't need to do that in most cases; if you do, use the repartition(1) method on your DataFrame before saving it.

Hope it helps, Best Regards
