How to convert an RDD to a list efficiently without using the collect function

We know that to convert an RDD to a list we should use collect(). But this function puts a lot of stress on the driver (it brings all the data from the different executors to the driver), which degrades performance or worse (the whole application may fail).

Is there any other way to convert an RDD into one of the java.util collections, without using collect() or collectAsMap() etc., that does not degrade performance?

Basically, in the current scenario where we deal with huge amounts of data in batch or stream processing, APIs like collect() and collectAsMap() have become completely useless in a real project with a real amount of data. We can use them in demo code, but that is all these APIs are good for. So why have an API that we cannot even use? (Or am I missing something?)

Is there a better way to achieve the same result through some other method, or can we implement collect() and collectAsMap() more efficiently than just calling

List<String> myList = rdd.collect(); (which hurts performance)

I searched Google but could not find anything effective. Please help if someone has a better approach.

Since you want to collect the data into a Java collection, the data has to be gathered on a single JVM, because Java collections are not distributed. There is no way to get all of the data into a collection without fetching the data. The interpretation of the problem space is wrong.
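If the driver only needs to walk over the elements rather than hold them all in one list, a related option is toLocalIterator(), which still sends every element through the driver but fetches one partition at a time. The following is a minimal sketch, assuming Spark's Java API is on the classpath and a local master; the class name LocalIteratorSketch and the helper sumOnDriver are made up for illustration.

```java
import java.util.Arrays;
import java.util.Iterator;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public final class LocalIteratorSketch {

    // Hypothetical helper: walks the RDD on the driver one partition at a time.
    // Driver memory is bounded by the largest partition rather than the whole
    // RDD, but every element still travels to the driver.
    static long sumOnDriver(JavaRDD<Integer> numbers) {
        long sum = 0;
        Iterator<Integer> it = numbers.toLocalIterator();
        while (it.hasNext()) {
            sum += it.next();
        }
        return sum;
    }

    public static void main(String[] args) {
        // Assumption: local master with two threads, for demonstration only.
        SparkConf conf = new SparkConf().setAppName("local-iterator-sketch").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5), 3);
            System.out.println(sumOnDriver(numbers)); // prints 15
        }
    }
}
```

This trades memory for time: partitions are fetched one after another, so it is typically slower than collect(), and it does not change the fact that all of the data ends up passing through a single JVM.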

Is there any other way to convert an RDD into one of the java.util collections, without using collect() or collectAsMap() etc., that does not degrade performance?

No, and there can't be. If there were such a way, collect would have been implemented using it in the first place.

Well, technically you could implement the List interface on top of RDD (or most of it?), but that would be a bad idea and quite pointless.

So why have an API that we cannot even use? (Or am I missing something?)

collect is intended for cases where the large RDDs are only inputs or intermediate results and the output is small enough to fit on the driver. If that's not your case, use foreach or other actions instead.
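As an illustration of that pattern, the sketch below shrinks the data on the executors first and only then brings the small result to the driver with collectAsMap(). It assumes Spark's Java API on the classpath and a local master; the class name SmallResultSketch and the helper countPerKey are hypothetical.

```java
import java.util.Arrays;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public final class SmallResultSketch {

    // Hypothetical helper: counts occurrences per key on the executors.
    // Only one (key, count) entry per distinct key is sent to the driver.
    static Map<String, Integer> countPerKey(JavaRDD<String> keys) {
        JavaPairRDD<String, Integer> ones = keys.mapToPair(k -> new Tuple2<>(k, 1));
        JavaPairRDD<String, Integer> counts = ones.reduceByKey((a, b) -> a + b);
        return counts.collectAsMap();
    }

    public static void main(String[] args) {
        // Assumption: local master with two threads, for demonstration only.
        SparkConf conf = new SparkConf().setAppName("small-result-sketch").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> levels =
                    sc.parallelize(Arrays.asList("INFO", "WARN", "INFO", "ERROR", "INFO"));
            System.out.println(countPerKey(levels));
        }
    }
}
```

This is only safe when the number of distinct keys is small; if the reduced result is itself large, writing it out from the executors is the better fit.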

collect and similar methods are not meant to be used in normal Spark code. They are useful for things like debugging, testing, and in some cases when working with small datasets.
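For the debugging case, take(n) is often enough, since it returns only the first n elements as a java.util.List instead of the whole RDD. A minimal sketch, assuming Spark's Java API on the classpath and a local master; the class name PeekSketch and the helper peek are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public final class PeekSketch {

    // Hypothetical helper: brings at most n elements to the driver for inspection.
    static <T> List<T> peek(JavaRDD<T> rdd, int n) {
        return rdd.take(n);
    }

    public static void main(String[] args) {
        // Assumption: local master with two threads, for demonstration only.
        SparkConf conf = new SparkConf().setAppName("peek-sketch").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            List<Integer> data = new ArrayList<>();
            for (int i = 1; i <= 1000; i++) {
                data.add(i);
            }
            JavaRDD<Integer> numbers = sc.parallelize(data, 4);
            System.out.println(peek(numbers, 3)); // prints [1, 2, 3]
        }
    }
}
```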

You need to keep your data inside the RDD and use RDD transformations and actions without ever taking the data out. Methods like collect, which pull your data out of Spark and onto your driver, defeat the purpose and undo any advantage Spark might be providing, since now you're processing all of your data on a single machine anyway.
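A small sketch of what keeping the data inside the RDD can look like: the filtering and mapping run on the executors, and the final action returns a single number to the driver. It assumes Spark's Java API on the classpath and a local master; the class name StayDistributedSketch, the helper totalErrorChars, and the sample log lines are invented for illustration.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public final class StayDistributedSketch {

    // Hypothetical helper: total length of all lines containing "ERROR".
    // filter and map are transformations that run on the executors;
    // fold is the action, and it returns one Integer to the driver.
    static long totalErrorChars(JavaRDD<String> lines) {
        JavaRDD<Integer> lengths = lines
                .filter(line -> line.contains("ERROR"))
                .map(String::length);
        return lengths.fold(0, Integer::sum);
    }

    public static void main(String[] args) {
        // Assumption: local master with two threads, for demonstration only.
        SparkConf conf = new SparkConf().setAppName("stay-distributed-sketch").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines =
                    sc.parallelize(Arrays.asList("ERROR disk full", "INFO ok", "ERROR x"));
            System.out.println(totalErrorChars(lines)); // prints 22
        }
    }
}
```

When the result of the processing is itself large, the same idea applies with an action such as saveAsTextFile or foreachPartition, so that each executor writes its own share instead of routing everything through the driver.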
