Is CSV to Parquet using PySpark distributed?
I have the following code snippet on an AWS EMR master node to convert a CSV file to a Parquet file.
%pyspark
csv_path = "s3://<bucket>/file.csv"
p_path = "s3://<bucket>/file.parquet"
df = sqlContext.read.csv(csv_path, header=True, inferSchema=True)
df.write.parquet(p_path, mode='overwrite')
If I request more nodes, will this operation be faster? In other words, is the conversion to Parquet distributed across the Spark cluster? I can't tell yet, and I don't want to burn money on more nodes without knowing a little more about it.
Yes, it is distributed.
Will the operation be faster? That depends on many factors, but in the best case it should scale roughly linearly with the number of nodes, as long as the job stays equivalent to the one you posted (a single-stage job with no shuffles).
Another improvement is to disable schema inference and provide an explicit schema: with `inferSchema=True`, Spark makes an extra pass over the CSV just to determine the column types.