Need to process 200 million records using PySpark
I am cleaning data (approximately 200 million rows) using PySpark, but I am getting this error:

ValueError: unichr() arg not in range(0x10000) (narrow Python build)

The CSV file is 21 GB, I added config('spark.driver.memory', '8g'), and I am running on a MacBook Pro with 16 GB of RAM.
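For reference, a minimal sketch of the setup described above; the app name and the read options are assumptions, not taken from the original post:

```python
from pyspark.sql import SparkSession

# Placeholder app name; the driver-memory setting mirrors the one
# mentioned in the question.
spark = (SparkSession.builder
         .appName("clean-raw-data")
         .config("spark.driver.memory", "8g")
         .getOrCreate())

# Reading the 21 GB file; header/schema options are assumptions.
df = spark.read.csv("raw_data.csv", header=True, inferSchema=False)
```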
When I reduce the file to 1 GB and process it the same way, it executes successfully.
Any recommendations?
I am using Apache Spark 2.4.
Expected results: the raw_data.csv file is processed.

Actual results:

    base = base[:pos] + unichr(char) + base[pos:]
    ValueError: unichr() arg not in range(0x10000) (narrow Python build)
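This matches the narrow-build limit: on a narrow Python 2 build, unichr() only accepts code points below 0x10000. A quick check (a sketch, assuming Python 2):

```python
import sys

# On a narrow Python 2 build, sys.maxunicode is 0xFFFF, and unichr()
# rejects anything above it -- which matches the error above.
print(sys.maxunicode)   # 65535 on a narrow build
unichr(0x1F600)         # ValueError: unichr() arg not in range(0x10000)
```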
This error could be related to an unescaped Unicode character in your char variable. Could you try using:
base = base[:pos] + char.decode('unicode-escape') + base[pos:]
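If char is actually an integer code point rather than an escape sequence, another option on a narrow Python 2 build is to construct the character through a unicode-escape round trip. A minimal sketch, with the caveat that the helper name safe_unichr is mine, not from the original post:

```python
def safe_unichr(code):
    # unichr() on a narrow Python 2 build only handles code points
    # below 0x10000; larger ones are built via a unicode-escape round
    # trip, which yields the appropriate surrogate pair internally.
    if code < 0x10000:
        return unichr(code)
    return ('\\U%08x' % code).decode('unicode-escape')

base = base[:pos] + safe_unichr(char) + base[pos:]
```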
Including your code would make it easier to help debug the issue.