
How to define a Spark RDD transformation with a non-lambda function

I recently started working with Spark and Java. I am currently experimenting with RDD transformations and actions. At the moment I am reading data out of a CSV file that contains some DateTime fields, then applying a filter to keep only the rows that are younger than two days, and finally checking whether the resulting RDD is empty. I wrote a simple snippet that does what I want at a minimal level:

Function<List<String>, Boolean> filterPredicate = row -> new DateTime(row.get(1)).isAfter(dtThreshold);

sc.textFile(inputFilePath)
            .map(text -> Arrays.asList(text.split(",")))
            .filter(filterPredicate)
            .isEmpty();
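For context, a minimal self-contained sketch of this pipeline might look as follows. It assumes Spark's Java API and Joda-Time for DateTime; the input path, app name, and the two-day threshold are illustrative assumptions, not part of the original question.

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;
    import org.joda.time.DateTime;

    public class RecentRowsCheck {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("RecentRowsCheck").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            String inputFilePath = "data/input.csv";               // hypothetical input path
            final DateTime dtThreshold = DateTime.now().minusDays(2);

            // The threshold is captured by the lambda, so it must be effectively final
            Function<List<String>, Boolean> filterPredicate =
                    row -> new DateTime(row.get(1)).isAfter(dtThreshold);

            boolean empty = sc.textFile(inputFilePath)
                    .map(text -> Arrays.asList(text.split(",")))
                    .filter(filterPredicate)
                    .isEmpty();

            System.out.println("No rows younger than two days: " + empty);
            sc.close();
        }
    }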

In this simple case I have assumed that the DateTime values always live in a single fixed column. I now want to expand that to use multiple column indexes. But to do that I need to be able to define a predicate function that spans more than one line. That is the reason why I have separated the predicate function definition from the transformation code.

How am I supposed to define such a function?

Use the curly brace notation...

    Function<List<String>, Boolean> filterPredicate = row -> {
        // Parse the date column and compare it with the threshold
        boolean isDateAfter = new DateTime(row.get(1)).isAfter(dtThreshold);
        // Compare string content rather than references: != on Strings checks identity
        boolean hasName = !row.get(2).isEmpty();
        return isDateAfter && hasName;
    };
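If the goal is to apply the same date check to several columns, a sketch along these lines could work; the column indexes (1 and 3) and their meanings are assumptions for illustration:

    // Hypothetical layout: a created-at date in column 1 and an updated-at date in column 3
    final List<Integer> dateColumns = Arrays.asList(1, 3);

    Function<List<String>, Boolean> filterPredicate = row -> {
        // Every configured date column must be younger than the threshold
        for (int idx : dateColumns) {
            if (!new DateTime(row.get(idx)).isAfter(dtThreshold)) {
                return false;
            }
        }
        return true;
    };

Because the lambda body is a full block, any number of statements can go inside, as long as every captured variable (such as dateColumns and dtThreshold here) is effectively final.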
