
Cannot .collect() in Apache Spark Java, OutOfMemoryError: GC overhead limit exceeded

I'm completely new to Spark (running on macOS) and I've been trying to test out some .parquet files I have on my PC, which are around 120 MB in size, trying to run them locally with the master set to local[*].

Basically, the pipeline of operations I currently have is the following one:

        dataset
            .where(...)
            .toJavaRDD()
            .sortBy(...) // Tried .collect() here. Current 'debug' point.
            .groupBy(...) // Tried .collect() here.
            .flatMapValues(...)
            .mapToPair(...) // Tried .collect() here.
            .reduceByKey(...); // Tried .collect() here.

The first thing I would like to ask: how can I check the parquet file schema? I've read somewhere that it's possible with Hive, but I haven't found anything concrete. If you have any resources that could be useful to me, it would be really appreciated.

Secondly, as I don't really know all the parquet column names I need to access in groupBy() and such, I'm just trying to collect everything in the first sortBy() to see what comes out (some minor testing in order to get started with Spark and how everything works). But as the title says, I always get the error above. Is there anything I'm doing wrong? Should I not be using .collect() at all?

I tried to print at some points, but as far as I've read the output seems to go to some logs, and I don't really know whether they are stored locally on the computer or how I can access them so I can see the output.

When it comes to the Spark configuration, it's the most basic one:

    final SparkConf sparkConf = new SparkConf()
        .setMaster("local[*]")
        .setAppName("test_app_name");

    final JavaSparkContext jsc = new JavaSparkContext(sparkConf);

    // Get the singleton instance of SparkSession
    SparkSession sparkSession = SparkSession.builder()
            .config(jsc.getConf())
            .getOrCreate();

    Dataset<Row> dataset = sparkSession.read()
            .parquet(inputPath);

Doing a collect action in Spark is not a good idea if your data is huge. collect() essentially brings all the data to the driver, which will cause an out-of-memory error on the driver, unless you are absolutely sure that the data you are collecting is pretty small.
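If you only need to eyeball a handful of records on the driver, here is a small sketch of what I mean (my addition, not part of the original answer; it assumes the Dataset<Row> dataset variable from the question):

    // needs: import java.util.List; import org.apache.spark.sql.Row;
    // Pull back only a bounded number of rows instead of the whole dataset,
    // so at most 20 rows ever reach the driver.
    List<Row> firstRows = dataset.limit(20).collectAsList();
    firstRows.forEach(row -> System.out.println(row));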

To see the dataset/dataframe schema you need to do:

 dataset.printSchema();
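If you only want the column names (my addition, not from the original answer), the Java Dataset API also exposes columns(), which returns a plain String[]:

    // List just the column names, e.g. to decide what to pass to groupBy/sortBy.
    String[] cols = dataset.columns();
    System.out.println(java.util.Arrays.toString(cols));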

To print a few rows of the dataset you can use the following:

 dataset.show(10); // number of rows you want to see

Or

 dataset.takeAsList(10).forEach(row -> System.out.println(row)); // takes 10 rows and prints them

If you want to see random rows for sampling, you might use:

 df.select("name").sample(.2, true).show(10)

Or

 df.select("name").sample(.2, true).take(10).foreach(println)
