
How can I increase big data performance?

I am new at this concept and still learning. I have a total of 10 TB of JSON files in AWS S3 and 4 instances (m3.xlarge) in AWS EC2 (1 master, 3 workers). I am currently using Spark with Python on Apache Zeppelin.

I am reading the files with the following command:

hcData=sqlContext.read.option("inferSchema","true").json(path)

In the Zeppelin interpreter settings:

master = yarn-client
spark.driver.memory = 10g
spark.executor.memory = 10g
spark.cores.max = 4

It takes approximately 1 minute to read 1 GB. What more can I do to read big data more efficiently?

  • Should I do more on the coding side?
  • Should I add more instances?
  • Should I use another notebook platform?

Thank you.

For performance issues, the best approach is to know where the bottleneck is, or at least to try to see where the performance problem could be.

Since 1 minute to read 1 GB is pretty slow, I would try the following steps.

  • Try to explicitly specify the schema instead of using inferSchema (see the sketch after this list)
  • Try to use Spark 2.0 instead of 1.6
  • Check the connection between S3 and EC2, in case there is some misconfiguration
  • Use a different file format, such as Parquet, instead of JSON
  • Increase the executor memory and decrease the driver memory
  • Use Scala instead of Python, although in this case it is the least likely issue.
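
A minimal sketch of the first suggestion, in PySpark as used in the question. The field names here are hypothetical; replace them with the actual structure of your JSON records:

from pyspark.sql.types import StructType, StructField, LongType, StringType

# Hypothetical schema -- adjust the fields to match your JSON records
hcSchema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True)
])

# Declaring the schema up front avoids a full pass over the data for inference
hcData = sqlContext.read.schema(hcSchema).json(path)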

I gave a talk on this topic back in October: Spark and Object Stores

Essentially: use Parquet/ORC, but tune the settings for efficient reads. Once it ships, grab a Spark 2.0.x build against Hadoop 2.8 for a lot of the speedup work we've done, especially when working with ORC & Parquet. We also add lots of metrics, though we are not yet pulling them all back into the Spark UI.

Schema inference can be slow if it has to work through the entire dataset (CSV inference does; I don't know about JSON). I'd recommend doing it once, grabbing the schema details and then explicitly declaring it as the schema the next time around.
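
A minimal sketch of that infer-once-then-reuse approach, assuming the same sqlContext and path as in the question; sample_path is a hypothetical path to one representative file:

# Infer the schema once, e.g. on a single representative file
inferred = sqlContext.read.option("inferSchema", "true").json(sample_path)
savedSchema = inferred.schema

# Optionally persist the schema as JSON text so later jobs can rebuild it
schema_json = savedSchema.json()

# Reuse the saved schema for the full dataset and skip inference entirely
hcData = sqlContext.read.schema(savedSchema).json(path)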

You can persist the data in Parquet format after the JSON read:

hcData = sqlContext.read.option("inferSchema","true").json(path)
hcData.write.parquet("hcDataFile.parquet")
hcDataDF = spark.read.parquet("hcDataFile.parquet")

// Create a temporary view in Spark 2.0 (or register a temp table with registerTempTable in Spark 1.6) and use SQL for the further logic

hcDataDF.createOrReplaceTempView("T_hcDataDF")

// This is a manual way of doing RDD checkpointing (not supported for DataFrames); it reduces the RDD lineage, which improves performance.
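
For the "use SQL for further logic" step, a hypothetical query against the temporary view might look like this (the query is a placeholder, not from the original post):

# Hypothetical example query -- replace with your actual logic
result = spark.sql("SELECT COUNT(*) AS row_count FROM T_hcDataDF")
result.show()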

For execution, use Dynamic Resource Allocation in your spark-submit command:

// Make sure the following are enabled in your cluster; otherwise you can pass these parameters to the spark-submit command as --conf options

•   spark.dynamicAllocation.enabled=true
•   spark.dynamicAllocation.initialExecutors=5 
•   spark.dynamicAllocation.minExecutors=5
•   spark.shuffle.service.enabled=true
•   yarn.nodemanager.aux-services=mapreduce_shuffle,spark_shuffle
•   yarn.nodemanager.aux-services.spark_shuffle.class=org.apache.spark.network.yarn.YarnShuffleService

// spark-submit command

 ./bin/spark-submit --class package.hcDataclass \
 --master yarn \
 --deploy-mode cluster \
 --driver-memory 1G \
 --executor-memory 5G \
 hcData*.jar

// For dynamic resource allocation we don't need to specify the number of executors.
// The job will automatically get resources based on the cluster's available capacity.
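
If the job is driven from PySpark rather than spark-submit, a minimal sketch of setting the Spark-side properties above could look like the following; the yarn.nodemanager.* settings still belong in yarn-site.xml on the cluster:

from pyspark.sql import SparkSession

# Spark-side dynamic allocation settings from the list above;
# the YARN shuffle-service settings must be configured in yarn-site.xml
spark = (SparkSession.builder
         .appName("hcData")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.initialExecutors", "5")
         .config("spark.dynamicAllocation.minExecutors", "5")
         .config("spark.shuffle.service.enabled", "true")
         .getOrCreate())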
