
Cannot .collect() in Apache Spark Java, OutOfMemoryError: GC overhead limit exceeded

I'm completely new to Spark (running on macOS) and I've been trying to test out some .parquet files I have on my PC, which are around 120 MB in size, trying to run them locally with the master set to local[*].

Basically, the pipeline of operations I currently have is the following one:

        dataset
            .where(...)
            .toJavaRDD()
            .sortBy(...) // Tried .collect() here. Current 'debug' point.
            .groupBy(...) // Tried .collect() here.
            .flatMapValues(...)
            .mapToPair(...) // Tried .collect() here.
            .reduceByKey(...); // Tried .collect() here.

The first thing I would like to ask: how can I check the parquet file schema? I've read somewhere that it's possible with Hive, but I haven't found anything concrete. If you have any resources that could be useful to me, it would be really appreciated.

Secondly, as I don't really know all the parquet column names I need to access in groupBy() and such, I'm just trying to collect everything in the first sortBy() to see what comes out (some minor testing in order to get started with Spark and how everything works). But as the title says, I always get the error above. Is there anything I'm doing wrong? Should I not be using .collect() at all?

I tried to print at some points, but as far as I've read the output seems to go to some logs, and I don't really know whether they are stored locally on the computer or how I can access them so I can see the output.

When it comes to the Spark configuration, it's the most basic one:

    final SparkConf sparkConf = new SparkConf()
        .setMaster("local[*]")
        .setAppName("test_app_name");

    final JavaSparkContext jsc = new JavaSparkContext(sparkConf);

    // Get the singleton instance of SparkSession
    SparkSession sparkSession = SparkSession.builder()
            .config(jsc.getConf())
            .getOrCreate();

    Dataset<Row> dataset = sparkSession.read()
            .parquet(inputPath);

Doing a collect action in Spark is not a good idea if your data is huge. collect() essentially brings all the data to the driver, which will cause an out-of-memory error on the driver, unless you are absolutely sure that the data you are collecting is pretty small.
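If you only need to eyeball a handful of records on the driver, here is a small sketch of what I mean (my addition, not part of the original answer; it assumes the Dataset<Row> dataset variable from the question):

    // needs: import java.util.List; import org.apache.spark.sql.Row;
    // Pull back only a bounded number of rows instead of the whole dataset,
    // so at most 20 rows ever reach the driver.
    List<Row> firstRows = dataset.limit(20).collectAsList();
    firstRows.forEach(row -> System.out.println(row));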

To see the dataset/dataframe schema you need to do:

 dataset.printSchema();
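If you only want the column names (my addition, not from the original answer), the Java Dataset API also exposes columns(), which returns a plain String[]:

    // List just the column names, e.g. to decide what to pass to groupBy/sortBy.
    String[] cols = dataset.columns();
    System.out.println(java.util.Arrays.toString(cols));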

To print a few rows of the dataset you can use the following:

 dataset.show(10); // number of rows you want to see

Or

 dataset.takeAsList(10).forEach(row -> System.out.println(row)); // takes 10 rows and prints them

If you want to see random rows for sampling, you might use:

 df.select("name").sample(.2, true).show(10)

Or

 df.select("name").sample(.2, true).take(10).foreach(println)
