Is CSV to Parquet using PySpark distributed?
I have the following code snippet on an AWS EMR master node to convert a CSV file to a Parquet file.
%pyspark
csv_path = "s3://<bucket>/file.csv"
p_path = "s3://<bucket>/file.parquet"
df = sqlContext.read.csv(csv_path, header=True, inferSchema=True)
df.write.parquet(p_path, mode='overwrite')
If I request more nodes, will this operation be faster? In other words, is the conversion to Parquet distributed across the Spark cluster? I can't tell yet, and I don't want to burn money on more nodes without knowing a little more about it.
Yes, it is distributed.
Will the operation be faster? That depends on many factors, but in the best case it should scale roughly linearly with the number of nodes, as long as the job stays equivalent to the one you posted (a single-stage job with no shuffles).
Another improvement is to disable schema inference and provide an explicit schema: with `inferSchema=True`, Spark makes an extra pass over the CSV just to determine the column types.