
How to traverse/iterate a Dataset in Spark Java?

I am trying to traverse a Dataset to do some string-similarity calculations, such as Jaro-Winkler or cosine similarity. I convert my Dataset to a list of rows and then traverse it with a for statement, which is not an efficient Spark way to do it. So I am looking for a better approach in Spark.

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class Sample {

    public static void main(String[] args) {
        // SparkSession supersedes SQLContext/JavaSparkContext in Spark 2.x+
        SparkSession spark = SparkSession.builder()
                .appName("JavaTokenizerExample")
                .master("local[*]")
                .getOrCreate();

        List<Row> data = Arrays.asList(
                RowFactory.create("Mysore", "Mysuru"),
                RowFactory.create("Name", "FirstName"));
        StructType schema = new StructType(new StructField[] {
                new StructField("Word1", DataTypes.StringType, true, Metadata.empty()),
                new StructField("Word2", DataTypes.StringType, true, Metadata.empty()) });

        Dataset<Row> oldDF = spark.createDataFrame(data, schema);
        oldDF.show();
        // Pulls every row back to the driver -- this is the inefficient step
        List<Row> rowslist = oldDF.collectAsList();
    }
}

I have found many JavaRDD examples, but they are not clear to me. An example for Dataset would help me a lot.

You can use org.apache.spark.api.java.function.ForeachFunction as shown below.

oldDF.foreach((ForeachFunction<Row>) row -> System.out.println(row));
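Inside such a per-row function you still need the similarity metric itself. Below is a self-contained, plain-Java sketch of Jaro-Winkler (the `JaroWinkler` class and its method names are hypothetical, not part of any Spark API); it could be called from the ForeachFunction above, or wrapped in a UDF or `map` so the computation stays distributed:

```java
public class JaroWinkler {

    // Plain Jaro similarity: fraction of matching characters within a
    // sliding window, penalized by transpositions.
    public static double jaro(String s1, String s2) {
        if (s1.equals(s2)) return 1.0;
        int len1 = s1.length(), len2 = s2.length();
        if (len1 == 0 || len2 == 0) return 0.0;
        int window = Math.max(len1, len2) / 2 - 1;
        boolean[] m1 = new boolean[len1], m2 = new boolean[len2];
        int matches = 0;
        for (int i = 0; i < len1; i++) {
            int lo = Math.max(0, i - window), hi = Math.min(len2 - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!m2[j] && s1.charAt(i) == s2.charAt(j)) {
                    m1[i] = true;
                    m2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Count transpositions among the matched characters.
        int t = 0, k = 0;
        for (int i = 0; i < len1; i++) {
            if (!m1[i]) continue;
            while (!m2[k]) k++;
            if (s1.charAt(i) != s2.charAt(k)) t++;
            k++;
        }
        double m = matches;
        return (m / len1 + m / len2 + (m - t / 2.0) / m) / 3.0;
    }

    // Jaro-Winkler: boosts the Jaro score for strings sharing a
    // common prefix of up to 4 characters (standard scaling 0.1).
    public static double jaroWinkler(String s1, String s2) {
        double j = jaro(s1, s2);
        int prefix = 0;
        for (int i = 0; i < Math.min(4, Math.min(s1.length(), s2.length())); i++) {
            if (s1.charAt(i) == s2.charAt(i)) prefix++;
            else break;
        }
        return j + prefix * 0.1 * (1.0 - j);
    }
}
```

For the sample data above, `jaroWinkler("Mysore", "Mysuru")` comes out around 0.84 thanks to the shared "Mys" prefix. Libraries such as Apache Commons Text also ship a JaroWinklerSimilarity class if you would rather not hand-roll this.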

For old Java JDKs that don't support lambda expressions, you can use the following after importing:

import org.apache.spark.api.java.function.VoidFunction;

yourDataSet.toJavaRDD().foreach(new VoidFunction<Row>() {
        public void call(Row r) throws Exception {
            System.out.println(r.getAs("your column name here"));
        }
    });
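The question also mentions cosine similarity. One common way to apply it to plain strings is to compare character-bigram frequency vectors; here is a hedged, self-contained sketch (the `CosineSim` class and its helpers are made up for illustration) that could be wrapped in a UDF or called per row the same way as above:

```java
import java.util.HashMap;
import java.util.Map;

public class CosineSim {

    // Build a character-bigram frequency vector for a string.
    static Map<String, Integer> bigrams(String s) {
        Map<String, Integer> v = new HashMap<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            v.merge(s.substring(i, i + 2), 1, Integer::sum);
        }
        return v;
    }

    // Cosine similarity between the bigram vectors of two strings:
    // dot product divided by the product of the vector norms.
    public static double cosine(String a, String b) {
        Map<String, Integer> va = bigrams(a), vb = bigrams(b);
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : va.entrySet()) {
            dot += e.getValue() * vb.getOrDefault(e.getKey(), 0);
        }
        double na = 0.0, nb = 0.0;
        for (int c : va.values()) na += (double) c * c;
        for (int c : vb.values()) nb += (double) c * c;
        if (na == 0.0 || nb == 0.0) return 0.0;
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```

For "Mysore" vs "Mysuru", only the bigrams "My" and "ys" overlap out of five each, giving a similarity of 0.4. Keeping such helpers as pure functions makes them easy to register as a Spark UDF or call inside a `map`, which avoids collecting the whole Dataset to the driver.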
