Need to process 200 million records using PySpark

Question

I am cleaning a dataset (approx. 200 million rows) using Python and PySpark, but I am getting this error:

ValueError: unichr() arg not in range(0x10000) (narrow Python build)

The CSV file is 21 GB. I added config('spark.driver.memory','8g'), and I am running on a MacBook Pro with 16 GB of RAM.

When I reduce the same file to 1 GB, processing succeeds.

Any recommendations?

I am using Apache Spark 2.4.

Expected results: the raw_data.csv file is processed.

Actual results:

base = base[:pos] + unichr(char) + base[pos:]
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Answer 1

This error could be related to an unescaped Unicode character in your char variable. Could you try using:

base = base[:pos] + char.decode('unicode-escape') + base[pos:]

Including your code would make it easier to help debug the issue.
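
For what it's worth, the message itself says the interpreter is a narrow Python 2 build, where unichr() only accepts code points below 0x10000. If char is an integer code point, as the unichr(char) call suggests, one possible workaround is to build the UTF-16 surrogate pair by hand when the code point is too large. This is only a minimal sketch: is_narrow_build, to_surrogate_pair and safe_unichr are hypothetical helper names (not part of Python or PySpark), and it has not been tested against the asker's data. It is written to run on both Python 2 and Python 3:

```python
import sys

try:
    unichr  # Python 2 built-in
except NameError:
    unichr = chr  # Python 3: chr() covers the full Unicode range


def is_narrow_build():
    """Return True on narrow Python 2 builds, where sys.maxunicode is 0xFFFF."""
    return sys.maxunicode == 0xFFFF


def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the supplementary planes")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def safe_unichr(code_point):
    """Hypothetical replacement for unichr() that also works on narrow builds."""
    if code_point > 0xFFFF and is_narrow_build():
        high, low = to_surrogate_pair(code_point)
        return unichr(high) + unichr(low)
    return unichr(code_point)


if __name__ == "__main__":
    print("narrow build:", is_narrow_build())
    print(repr(safe_unichr(0x1F600)))
```

If moving the job to Python 3 is an option, that may be the simpler route, since chr() there handles every valid code point and the narrow/wide build distinction no longer exists.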

Answer 1
0 2019-07-24 14:58:43