Getting Java heap out-of-memory error in PySpark

I am trying to read multiple CSV files, each around 100 MB, with the pandas package, convert each one into a Spark SQL DataFrame, and append it to a list. That list of Spark DataFrames is then combined into a single DataFrame.

I run Spark with the master set to local and the deploy mode set to client; my machine has 16 GB of RAM and 8 cores.

The Java heap error occurs while converting a pandas DataFrame to a PySpark DataFrame.

From what I found on the web, the most common suggestion is to increase the driver memory. I have raised it to 6 GB, but I still get the same error.
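A minimal sketch of how the driver memory is typically raised in this kind of local-mode setup (the exact configuration used here is not shown above, so the option names and values below are the standard Spark settings rather than my literal code):

from pyspark.sql import SparkSession

# Sketch only: spark.driver.memory must be picked up before the driver JVM
# starts, so this only takes effect when no SparkSession/SparkContext exists
# yet; with spark-submit, pass --driver-memory 6g on the command line instead.
spark = (
    SparkSession.builder
    .master("local[8]")                   # local master, 8 cores as described
    .config("spark.driver.memory", "6g")  # 6 GB driver heap as described
    .appName("csv_to_spark")              # hypothetical app name
    .getOrCreate()
)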

I have been stuck on this error for a couple of days. Can anyone suggest a solution or an idea?

My code:

def _load_and_normalize(self, glob_paths, renames=None, columns=[]):
    renames = renames or {}
    files = sorted(glob.glob(glob_paths))
    dfs = []
    for filename in files:
        logger.info(f'adding {basename(filename)}')
        df_list = self.read_csv(filename)
        for df_chunk in df_list:
            if len(columns) > 0:
                df_chunk = df_chunk[columns]
            df = self.spark_session.createDataFrame(df_chunk.astype('str'))
            df = self.add_procuredate(df, filename)
            dfs.append(df)
        logger.info(f'Added the procure date {basename(filename)} and append to list')
    df = reduce(DataFrame.unionByName, dfs).repartition(5000)
    logger.info(f' Combine the list of dataframes into single dataframe of size {df.count()}')
    for rename_columns in renames:
        df = df.withColumnRenamed(rename_columns, renames[rename_columns])
    df = self.uppercase_and_trim_all_columns(df)
    df = df.dropDuplicates()
    all_cols_except_procure = [col for col in df.schema.names if col != 'procure_date']
    df = df.dropDuplicates(all_cols_except_procure)
    df = self.get_normalized_address(df)
    df = self.get_normalized_address(df, col_name='orig_normalized_address',
                                        full_address_col='orig_address', city_col='orig_city', state_col='orig_state',
                                        zip_col='orig_zip')
    return df

Error log:

2022-12-29 17:55:06,163 - prepare_ncoa_sp - INFO - [542587] 0.43GB Starting
        Setting default log level to "WARN".
        To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
        22/12/29 17:55:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
        2022-12-29 17:55:11,712 - prepare_ncoa_sp - INFO - [542587] 0.55GB adding 20210203_agent_addresses_1_validated.csv
        2022-12-29 17:56:45,989 - prepare_ncoa_sp - INFO - [542587] 0.58GB adding 20210203_agent_addresses_2_validated.csv
        2022-12-29 17:57:51,402 - prepare_ncoa_sp - INFO - [542587] 0.60GB adding 20210203_agent_addresses_3_validated.csv
        2022-12-29 17:59:11,865 - prepare_ncoa_sp - INFO - [542587] 0.60GB adding 20210203_agent_addresses_4_validated.csv
        2022-12-29 18:00:33,017 - prepare_ncoa_sp - INFO - [542587] 0.61GB adding 20210203_agent_addresses_5_validated.csv
        2022-12-29 18:01:58,095 - prepare_ncoa_sp - INFO - [542587] 0.62GB adding 20210203_agent_addresses_6_validated.csv
        2022-12-29 18:03:08,589 - prepare_ncoa_sp - INFO - [542587] 0.63GB adding 20210203_agent_addresses_7_validated.csv
        2022-12-29 18:04:17,571 - prepare_ncoa_sp - INFO - [542587] 0.63GB adding 20210203_agent_addresses_8_validated.csv
        2022-12-29 18:05:32,933 - prepare_ncoa_sp - INFO - [542587] 0.67GB adding 20210306_agent_addresses_1_validated.csv
        2022-12-29 18:06:43,819 - prepare_ncoa_sp - INFO - [542587] 0.68GB adding 20210306_agent_addresses_2_validated.csv
        2022-12-29 18:08:04,511 - prepare_ncoa_sp - INFO - [542587] 0.68GB adding 20210306_agent_addresses_3_validated.csv
        Traceback (most recent call last):
        File "manage.py", line 21, in <module>
        main()
        File "manage.py", line 17, in main
        execute_from_command_line(sys.argv)
        File "/home/ubuntu/.local/share/virtualenvs/first_class-iV0cREgX/lib/python3.8/site-packages/django/core/management/__init__.py", line 419, in execute_from_command_line
        utility.execute()
        File "/home/ubuntu/.local/share/virtualenvs/first_class-iV0cREgX/lib/python3.8/site-packages/django/core/management/__init__.py", line 413, in execute
        self.fetch_command(subcommand).run_from_argv(self.argv)
        File "/home/ubuntu/.local/share/virtualenvs/first_class-iV0cREgX/lib/python3.8/site-packages/django/core/management/base.py", line 354, in run_from_argv
        self.execute(*args, **cmd_options)
        File "/home/ubuntu/.local/share/virtualenvs/first_class-iV0cREgX/lib/python3.8/site-packages/django/core/management/base.py", line 398, in execute
        output = self.handle(*args, **options)
        File "/home/ubuntu/backend/first_class/core/management/commands/prepare_ncoa_sp.py", line 24, in handle
        step.start()
        File "/home/ubuntu/backend/first_class/core/management/commands/prepare_ncoa_sp.py", line 34, in start
        self.prepare_agent_address_updates()
        File "/home/ubuntu/backend/first_class/core/management/commands/prepare_ncoa_sp.py", line 122, in prepare_agent_address_updates
        prepare_agent_data = self._load_and_normalize(file_glob, {
        File "/home/ubuntu/backend/first_class/core/management/commands/prepare_ncoa_sp.py", line 93, in _load_and_normalize
        df = self.spark_session.createDataFrame(df.astype('str'))
        File "/home/ubuntu/.local/share/virtualenvs/first_class-iV0cREgX/lib/python3.8/site-packages/pyspark/sql/session.py", line 891, in createDataFrame
        return super(SparkSession, self).createDataFrame(  # type: ignore[call-overload]
        File "/home/ubuntu/.local/share/virtualenvs/first_class-iV0cREgX/lib/python3.8/site-packages/pyspark/sql/pandas/conversion.py", line 437, in createDataFrame
        return self._create_dataframe(converted_data, schema, samplingRatio, verifySchema)
        File "/home/ubuntu/.local/share/virtualenvs/first_class-iV0cREgX/lib/python3.8/site-packages/pyspark/sql/session.py", line 936, in _create_dataframe
        rdd, struct = self._createFromLocal(map(prepare, data), schema)
        File "/home/ubuntu/.local/share/virtualenvs/first_class-iV0cREgX/lib/python3.8/site-packages/pyspark/sql/session.py", line 648, in _createFromLocal
        return self._sc.parallelize(internal_data), struct
        File "/home/ubuntu/.local/share/virtualenvs/first_class-iV0cREgX/lib/python3.8/site-packages/pyspark/context.py", line 674, in parallelize
        jrdd = self._serialize_to_jvm(c, serializer, reader_func, createRDDServer)
        File "/home/ubuntu/.local/share/virtualenvs/first_class-iV0cREgX/lib/python3.8/site-packages/pyspark/context.py", line 720, in _serialize_to_jvm
        return reader_func(tempFile.name)
        File "/home/ubuntu/.local/share/virtualenvs/first_class-iV0cREgX/lib/python3.8/site-packages/pyspark/context.py", line 668, in reader_func
        return self._jvm.PythonRDD.readRDDFromFile(self._jsc, temp_filename, numSlices)
        File "/home/ubuntu/.local/share/virtualenvs/first_class-iV0cREgX/lib/python3.8/site-packages/py4j/java_gateway.py", line 1321, in __call__
        return_value = get_return_value(
        File "/home/ubuntu/.local/share/virtualenvs/first_class-iV0cREgX/lib/python3.8/site-packages/pyspark/sql/utils.py", line 190, in deco
        return f(*a, **kw)
        File "/home/ubuntu/.local/share/virtualenvs/first_class-iV0cREgX/lib/python3.8/site-packages/py4j/protocol.py", line 326, in get_return_value
        raise Py4JJavaError(
        py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.readRDDFromFile.
        : java.lang.OutOfMemoryError: Java heap space
        at org.apache.spark.api.java.JavaRDD$.readRDDFromInputStream(JavaRDD.scala:252)
        at org.apache.spark.api.java.JavaRDD$.readRDDFromFile(JavaRDD.scala:239)
        at org.apache.spark.api.python.PythonRDD$.readRDDFromFile(PythonRDD.scala:274)
        at org.apache.spark.api.python.PythonRDD.readRDDFromFile(PythonRDD.scala)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
        at java.base/java.lang.Thread.run(Thread.java:829)

I'm going to start by saying: we don't have enough info to properly judge this code. There are a bunch of functions you haven't shared with us (self.read_csv(filename), self.add_procuredate(df, filename), self.uppercase_and_trim_all_columns(df), ...) and we don't know the data you're working with. Stack Overflow is not really a code-reviewing website, but it seems like you're struggling, so I'll give you some pointers.

This bit of code:

files = sorted(glob.glob(glob_paths))
dfs = []
for filename in files:
    logger.info(f'adding {basename(filename)}')
    df_list = self.read_csv(filename)
    for df_chunk in df_list:
        if len(columns) > 0:
            df_chunk = df_chunk[columns]
        df = self.spark_session.createDataFrame(df_chunk.astype('str'))
        df = self.add_procuredate(df, filename)
        dfs.append(df)
    logger.info(f'Added the procure date {basename(filename)} and append to list')
df = reduce(DataFrame.unionByName, dfs).repartition(5000)

is filled with things I believe can be done better. Let's have a look and understand why it can be improved so much.

  1. You're doing df_list = self.read_csv(filename). That does not seem to be the spark.read.csv(filename) function; why are you using something different? You're creating a list of dataframes, which scares me a little (there could be a legitimate reason, but it smells bad to me). Why not simply read all your data into a single dataframe with spark.read.csv(sorted(glob.glob(glob_paths)))? That would give you a single dataframe on which you can do whatever operations you want (a sketch of this approach follows the conclusion below).

  2. Then, for each file, you create a list of dataframes, df_list, loop over that list, and create a bunch of Spark DataFrames (using createDataFrame) per file (code block below). That sounds like a good way to fill up your Java heap space: creating unnecessary objects everywhere you go. Also, within this loop the only things you seem to do are selecting some columns (df_chunk = df_chunk[columns]) and adding something to the df (with add_procuredate); it sounds like you could perfectly well do both on the final DataFrame. You're also appending to a list of unknown length (dfs.append(df)), which is a classic performance hog.

for df_chunk in df_list:
    if len(columns) > 0:
        df_chunk = df_chunk[columns]
    df = self.spark_session.createDataFrame(df_chunk.astype('str'))
    df = self.add_procuredate(df, filename)
    dfs.append(df)
  3. Then you're creating a single DataFrame and repartitioning it: df = reduce(DataFrame.unionByName, dfs).repartition(5000). First of all, your code does not even get here (the Java stack trace shows the failure happens while reading RDDs, i.e. in the problematic first two points). But there are two things to note here:

    3.1. In the end, you're making a single DataFrame by unioning all of your smaller DataFrames. It really looks like there are many ways to do this better; of course, that will depend on many things: your data, your cluster, the other code you haven't shown us, ...

    3.2. You're doing .repartition(5000). Why? This triggers a shuffle: are you positive you need it?
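As a side note (my sketch, not something from your code): before forcing that shuffle, it is worth checking how many partitions you actually have, and if you only need fewer partitions, coalesce avoids the full shuffle that repartition always performs.

# Sketch: inspect the current partitioning before deciding on repartition(5000).
print(df.rdd.getNumPartitions())

# repartition(n) always shuffles all the data; coalesce(n) only merges existing
# partitions (no full shuffle) and can therefore only reduce the partition count.
df = df.coalesce(200)  # 200 is an arbitrary example value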

Conclusion: this code does not seem to have been written with the Spark mindset. There are too many things to discuss (and that is not the Stack Overflow way), so I'm just dumping them here. My main advice is: read everything into a single dataframe (don't create multiple dataframes unnecessarily) and do all of your operations on that single dataframe.
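To make that advice concrete, here is a minimal sketch of the single-dataframe approach. It assumes your CSVs share a schema and have headers, and that the procure date can be derived from the file path; add_procuredate, uppercase_and_trim_all_columns and the address normalization are your own functions, so they are only hinted at, not reimplemented:

import glob
from pyspark.sql import functions as F

# Read every file into ONE DataFrame: spark.read.csv accepts a list of paths.
files = sorted(glob.glob(glob_paths))
df = self.spark_session.read.csv(files, header=True)

# Column selection and renames, done once on the single DataFrame.
if columns:
    df = df.select(*columns)
for old_name, new_name in (renames or {}).items():
    df = df.withColumnRenamed(old_name, new_name)

# Derive the procure date from the source file instead of looping per file.
# input_file_name() gives the full path of the file each row came from;
# parsing the date out of it is left to your add_procuredate logic.
df = df.withColumn("source_file", F.input_file_name())

# Uppercasing, dedup and address normalization can then run once on this
# single DataFrame, exactly as in the second half of your method.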

PS: You'll probably have some questions or remarks after reading this. I suggest we don't start a huge string of comments underneath this answer; instead, chop your question up into minimal reproducible examples and ask one question at a time (after researching/googling extensively, of course).
