
How can I improve the reduceByKey part of my Spark app?

I have 64 Spark cores. I have over 80 million rows of data, amounting to 4.2 GB, in my Cassandra cluster. It currently takes 82 seconds to process this data, and I want to reduce this to 8 seconds. Any thoughts on this? Is it even possible? Thanks.

This is the part of my Spark app I want to improve:

axes = sqlContext.read.format("org.apache.spark.sql.cassandra")\
    .options(table="axes", keyspace=source, numPartitions="192").load()\
    .repartition(64*3)\
    .reduceByKey(lambda x,y:x+y,52)\
    .map(lambda x:(x.article,[Row(article=x.article,at=x.at,comments=x.comments,likes=x.likes,reads=x.reads,shares=x.shares)]))\
    .map(lambda x:(x[0],sorted(x[1],key=lambda y:y.at,reverse = False))) \
    .filter(lambda x:len(x[1])>=2) \
    .map(lambda x:x[1][-1])

Edit:

This is the code I am currently running; the one posted above was an experiment, sorry for the confusion. The question above relates to this code.

axes = sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="axes", keyspace=source).load().repartition(64*3) \
                    .map(lambda x:(x.article,[Row(article=x.article,at=x.at,comments=x.comments,likes=x.likes,reads=x.reads,shares=x.shares)])).reduceByKey(lambda x,y:x+y)\
                    .map(lambda x:(x[0],sorted(x[1],key=lambda y:y.at,reverse = False))) \
                    .filter(lambda x:len(x[1])>=2) \
                    .map(lambda x:x[1][-1])

Thanks

Issues:

(Why this code cannot work correctly, assuming an unmodified Spark distribution)

Step-by-step:

  1. These two lines should create a Spark DataFrame. So far so good:

     sqlContext.read.format("org.apache.spark.sql.cassandra") .options(table="axes", keyspace=source, numPartitions="192").load() 

    The only possible concern is numPartitions, which as far as I remember is not a recognized option (a quick way to check how many partitions you actually get is sketched right after this list).

  2. This is pretty much junk code. Shuffling data around without doing any actual work is unlikely to get you anywhere.

     .repartition(64*3) 
  3. At this point you switch to an RDD. Since Row is actually a subclass of tuple and reduceByKey works only on pairwise RDDs, each element has to be a tuple of size 2. I am not sure why you chose 52 partitions, though.

     .reduceByKey(lambda x,y:x+y,52) 
  4. Since reduceByKey always results in an RDD of tuples of size 2, the following part simply shouldn't work:

     .map(lambda x: (x.article,[Row(article=x.article,at=x.at,comments=x.comments,likes=x.likes,reads=x.reads,shares=x.shares)]))\

    In particular, x cannot have attributes like article or comments. Moreover, this piece of code

     [Row(article=x.article,at=x.at,comments=x.comments,likes=x.likes,reads=x.reads,shares=x.shares)] 

    creates a list of size 1 (see below).

    The following part

     Row(article=x.article, ...) 

    smells fishy for one more reason. If there are obsolete columns, they should be filtered out before the data is converted to an RDD, to avoid excessive traffic and reduce memory usage. If there are no obsolete columns, there is no reason to put more pressure on the Python GC by creating new objects.

  5. Since x[1] has only one element, sorting it doesn't make sense:

     .map(lambda x:(x[0],sorted(x[1],key=lambda y:y.at,reverse = False))) \
  6. And this filter should always return an empty RDD:

     .filter(lambda x:len(x[1])>=2) \
  7. And this doesn't perform any useful operation:

     .map(lambda x:x[1][-1]) 
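
Regarding the numPartitions concern from point 1, a quick sanity check is to inspect how many partitions the connector actually produces. This is a minimal sketch, assuming the same sqlContext, keyspace and table as in the question; if numPartitions were honored you would expect 192 here, otherwise the count is driven by the connector's own split settings:

df = sqlContext.read.format("org.apache.spark.sql.cassandra") \
    .options(table="axes", keyspace=source, numPartitions="192") \
    .load()

# If this does not print 192, the option is being silently ignored.
print(df.rdd.getNumPartitions())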

Summary:

If you use some version of this code, most likely the order shown in the question is mixed up and the map from point 4:

.map(lambda x: (x.article,[Row(....)]))

precedes reduceByKey:

.reduceByKey(lambda x,y:x+y,52)

If that's the case, you actually use .reduceByKey to perform a groupByKey, which is either equivalent to groupByKey with all its issues (Python) or less efficient (Scala). Moreover, it would make the reduction in the number of partitions highly suspicious.
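
To make the difference concrete, here is a small sketch of what that pattern amounts to, assuming axes is the raw DataFrame loaded from Cassandra (before any of the RDD transformations):

pairs = axes.rdd.map(lambda r: (r.article, [r]))

# Concatenating one-element lists in reduceByKey is groupByKey in disguise:
# the "combine" function does no real reduction, so there is no map-side benefit.
grouped_via_reduce = pairs.reduceByKey(lambda a, b: a + b)

# Logically the same grouping, with the usual groupByKey caveats.
grouped_via_group = axes.rdd.map(lambda r: (r.article, r)).groupByKey()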

If that's true, there is no good reason to move data out of the JVM (DataFrame -> RDD conversion) with the corresponding serialization and deserialization, and even if there were, it can easily be solved by an actual reduction with max, not group-by-key:

from operator import attrgetter

(sqlContext.read.format(...).options(...).load()
  .select(...)  # Only the columns you actually need
  .rdd          # keyBy and reduceByKey are RDD methods, so convert explicitly
  .keyBy(attrgetter("article"))
  .reduceByKey(lambda r1, r2: max(r1, r2, key=attrgetter("at"))))  # "at" is the ordering column from the question
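
Applied to the question's data, a minimal usage sketch could look like the following; it assumes axes is the DataFrame loaded from Cassandra and that the goal is the latest row per article (ordered by at), as the sort-and-take-last logic suggests:

from operator import attrgetter

latest_per_article = (axes
    .select("article", "at", "comments", "likes", "reads", "shares")  # only the needed columns
    .rdd
    .keyBy(attrgetter("article"))
    .reduceByKey(lambda r1, r2: max(r1, r2, key=attrgetter("at")))  # keep the newest row per key
    .values())

print(latest_per_article.take(5))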

