Spark data processing with grouping

Question

I need to group a set of CSV rows by a certain column and do some processing on each group.

    JavaRDD<String> lines = sc.textFile("somefile.csv");
    JavaPairRDD<String, String> pairRDD = lines.mapToPair(new SomeParser());
    List<String> keys = pairRDD.keys().distinct().collect();
    for (String key : keys)
    {
        List<String> rows = pairRDD.lookup(key);

        noOfVisits = rows.size();
        country = COMMA.split(rows.get(0))[6];
        accessDuration = getAccessDuration(rows, timeFormat);
        Map<String,Integer> counts = getCounts(rows);
        whitepapers = counts.get("whitepapers");
        tutorials = counts.get("tutorials");
        workshops = counts.get("workshops");
        casestudies = counts.get("casestudies");
        productPages = counts.get("productpages");
    }

    private static long dateParser(String dateString) throws ParseException {
        SimpleDateFormat format = new SimpleDateFormat("MMM dd yyyy HH:mma");
        Date date = format.parse(dateString);
        return date.getTime();
    }

dateParser is called for each row. The min and max for the group are then calculated to get the access duration. The other values are string matches.

pairRDD.lookup is extremely slow. Is there a better way to do this?

Answer 1

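I think you can simply use that column as the key and do a groupByKey. The question does not say what operation is applied to the rows. If it is a function that combines those rows in some way, you could even use reduceByKey.

Something like: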

    import org.apache.spark.SparkContext._  // implicit pair functions
    val pairs = lines.map(parser _)
    val grouped = pairs.groupByKey
    // here grouped is of the form: (key, Iterable[String])

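*Edit* After looking at the process, I think it would be more efficient to map each row to the data it contributes and then use aggregateByKey to reduce them all to a total. aggregateByKey takes two functions and a zero value: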

    def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
          combOp: (U, U) => U): RDD[(K, U)]

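The first function is the partition aggregator: it runs efficiently within each local partition and produces a locally aggregated partial result per partition. The combine operation then takes those partial aggregates and combines them to obtain the final result.

Like this: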
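
To make the two-step flow concrete, here is a minimal plain-Java sketch that imitates it without Spark. Everything in it is an illustrative assumption: the class name AggregateByKeySketch, the helper names, and the sample data are hypothetical, and partitions are modelled as ordinary lists. It shows only the semantics (seqOp folds values inside a partition, combOp merges the partials), not how Spark implements them.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical stand-in for aggregateByKey, using lists as "partitions".
class AggregateByKeySketch {

    // Small helper to build a (key, value) pair.
    static Map.Entry<String, Integer> kv(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Step 1: fold one partition's pairs into per-key partial aggregates with seqOp.
    static Map<String, Integer> aggregatePartition(
            List<Map.Entry<String, Integer>> partition,
            int zero,
            BiFunction<Integer, Integer, Integer> seqOp) {
        Map<String, Integer> partial = new HashMap<>();
        for (Map.Entry<String, Integer> pair : partition) {
            int acc = partial.getOrDefault(pair.getKey(), zero);
            partial.put(pair.getKey(), seqOp.apply(acc, pair.getValue()));
        }
        return partial;
    }

    // Step 2: merge the per-partition partial aggregates with combOp.
    static Map<String, Integer> combine(
            List<Map<String, Integer>> partials,
            BiFunction<Integer, Integer, Integer> combOp) {
        Map<String, Integer> result = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((key, value) -> result.merge(key, value, combOp));
        }
        return result;
    }

    // Sums the values per key across two sample partitions.
    static Map<String, Integer> demo() {
        List<Map.Entry<String, Integer>> p1 = Arrays.asList(kv("a", 1), kv("b", 2), kv("a", 3));
        List<Map.Entry<String, Integer>> p2 = Arrays.asList(kv("a", 4), kv("b", 5));
        List<Map<String, Integer>> partials = new ArrayList<>();
        partials.add(aggregatePartition(p1, 0, Integer::sum)); // partial for partition 1
        partials.add(aggregatePartition(p2, 0, Integer::sum)); // partial for partition 2
        return combine(partials, Integer::sum);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Only the small per-key partials cross the partition boundary in the second step, which is why this shape is cheaper than collecting every row of a group in one place.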

    val lines = sc.textFile("somefile.csv")
    // parse returns a key and a decomposed Record of the values tracked:
    // (key, Record("country", timestamp, "whitepaper", ...))
    val records = lines.map(parse(_))

    val totals = records.aggregateByKey(
      (0, Set.empty[String], Long.MaxValue, Long.MinValue, Map.empty[String, Int])
    )(
      // seqOp: the accumulator comes first, the record second
      { case ((count, countrySet, minTime, maxTime, counterMap), record) =>
          (count + 1, countrySet + record.country, math.min(minTime, record.timestamp), math.max(maxTime, record.timestamp), ...) },
      // combOp: add each field of the two accumulators
      (cumm1, cumm2) => ???
    )

This is the most efficient way to do key-based aggregation in Spark.
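
Since the question is written in Java, the accumulator could also be expressed as a small class. The following is a sketch under stated assumptions, not the answer's own code: the class name VisitStats, its fields, and the add/merge signatures are hypothetical, with fields chosen to mirror the question's variables (noOfVisits, the min/max timestamps behind accessDuration, and the per-category counts). It runs without Spark. In Spark's Java API, functions of this shape would be the seqFunc and combFunc arguments of JavaPairRDD.aggregateByKey(zeroValue, seqFunc, combFunc); the class is Serializable for that reason.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-key accumulator; field names follow the question's variables.
class VisitStats implements Serializable {
    int noOfVisits = 0;
    long minTime = Long.MAX_VALUE;
    long maxTime = Long.MIN_VALUE;
    final Map<String, Integer> counts = new HashMap<>();

    // seqOp role: fold one parsed row (timestamp in ms, page category) into this accumulator.
    VisitStats add(long timestamp, String category) {
        noOfVisits++;
        minTime = Math.min(minTime, timestamp);
        maxTime = Math.max(maxTime, timestamp);
        counts.merge(category, 1, Integer::sum);
        return this;
    }

    // combOp role: merge another partition's partial result into this one.
    VisitStats merge(VisitStats other) {
        noOfVisits += other.noOfVisits;
        minTime = Math.min(minTime, other.minTime);
        maxTime = Math.max(maxTime, other.maxTime);
        other.counts.forEach((category, n) -> counts.merge(category, n, Integer::sum));
        return this;
    }

    // Access duration in ms: span between the earliest and latest row of the group.
    long accessDuration() {
        return noOfVisits == 0 ? 0L : maxTime - minTime;
    }
}
```

For example, merging `new VisitStats().add(1000L, "whitepapers").add(4000L, "tutorials")` with `new VisitStats().add(500L, "whitepapers")` gives three visits, two whitepaper hits, and an access duration of 3500 ms. Mutating and returning the first argument is a deliberate choice here: Spark's documentation permits it for both functions to avoid allocating a new accumulator per row.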

Answer 1 (2, accepted 2014-10-28 15:30:40)