How to fix pyspark NLTK Error with OSError: [WinError 123]?

I got an unexpected error when transforming an RDD to a DataFrame:

import nltk
from nltk import pos_tag
my_rdd_of_lists = df_removed.select("removed").rdd.map(lambda x: nltk.pos_tag(x))
my_df = spark.createDataFrame(my_rdd_of_lists)

This error appears whenever I call an NLTK function on the RDD. When I replaced that line with a numpy method instead, it did not fail.
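One thing worth noting about the snippet above: the lambda passed to `.rdd.map` receives a whole Row, not a bare token list, so `pos_tag` should usually be applied to the unpacked column value. This is a sketch without Spark or NLTK, using a plain tuple as a stand-in Row and a dummy tagger (the real `nltk.pos_tag` also needs its tagger data downloaded):

```python
# Stand-in for a Spark Row with a single column "removed" holding a
# list of tokens; in the real job the lambda receives such a Row.
rows = [(["quick", "brown", "fox"],)]

def tag(tokens):
    # placeholder for nltk.pos_tag, which returns (token, tag) pairs
    return [(t, "NN") for t in tokens]

# unpack the column value (x[0]) before tagging, mirroring
# df_removed.select("removed").rdd.map(lambda x: pos_tag(x[0]))
tagged = [tag(x[0]) for x in rows]
print(tagged)
```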

Error code:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 14.0 failed 1 times, most recent failure: Lost task 0.0 in stage 14.0 (TID 323, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):

And

OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: 'C:\\C:\\Users\\Olga\\Desktop\\Spark\\spark-2.4.5-bin-hadoop2.7\\jars\\spark-core_2.11-2.4.5.jar'

So here is the part I don't know how to resolve. I thought it was a problem with environment variables, but everything there seems fine:
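The doubled drive letter in the traceback (`'C:\\C:\\...'`) typically means an absolute path was prepended to another absolute path somewhere, which Windows rejects with WinError 123. A minimal illustration of the effect, using a shortened, hypothetical jar path (not the real code that produced the error):

```python
import ntpath  # Windows path semantics, importable on any OS

prefix = "C:\\"
jar = "C:\\Users\\Olga\\Desktop\\Spark\\jars\\spark-core.jar"

# naive string concatenation doubles the drive letter
bad = prefix + jar

# ntpath.join keeps only the second path when it is itself absolute
good = ntpath.join(prefix, jar)

print(bad)   # starts with C:\C:\ -- invalid on Windows
print(good)  # the jar path, unchanged
```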

SPARK_HOME: C:\Users\Olga\Desktop\Spark\spark-2.4.5-bin-hadoop2.7

I've also printed my sys.path:

import sys
for i in sys.path:
    print(i) 

And got:

C:\Users\Olga\Desktop\Spark\spark-2.4.5-bin-hadoop2.7\python
C:\Users\Olga\AppData\Local\Temp\spark-22c0eb38-fcc0-4f1f-b8dd-af83e15d342c\userFiles-3195dcc7-0fc6-469f-9afc-7752510f2471
C:\Users\Olga\Desktop\Spark\spark-2.4.5-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip
C:\Users\Olga
C:\Users\Olga\Anaconda3\python37.zip
C:\Users\Olga\Anaconda3\DLLs
C:\Users\Olga\Anaconda3\lib
C:\Users\Olga\Anaconda3

C:\Users\Olga\Anaconda3\lib\site-packages
C:\Users\Olga\Anaconda3\lib\site-packages\win32
C:\Users\Olga\Anaconda3\lib\site-packages\win32\lib
C:\Users\Olga\Anaconda3\lib\site-packages\Pythonwin
C:\Users\Olga\Anaconda3\lib\site-packages\IPython\extensions
C:\Users\Olga\.ipython

Here everything also looks fine to me. Please help, I don't know what to do. Earlier parts of the code ran without any error. Should I install nltk in some other way to run it with Spark?

Hi Milva, to solve the OS error in your code you can just import the os module, so that all the permissions for running your program are given to the code:

import os

Hope this answer helps you.

It seems that it was some problem with packages.

I uninstalled nltk, pandas and numpy with pip, and then reinstalled them with conda.

After that I listed my packages and found one weird package that seemed to be broken, called "-umpy".

I could not even uninstall it, neither from the command prompt nor with Anaconda Navigator. So I just located it in the files on my computer and removed it manually. Then I installed nltk once again.

After that it started working correctly and the bug did not appear again.
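For reference, such "-umpy"-style entries are usually leftovers of an interrupted pip install: pip renames the directory it is replacing to start with a stray character (often "~"), and a failed run leaves that folder behind in site-packages. A sketch for finding such leftovers programmatically, so they can be deleted by hand as the answer describes (the helper name find_damaged is made up for this example):

```python
import os
import site

def find_damaged(packages_dir):
    # folder names starting with "~" or "-" are not valid package names
    # and usually indicate a broken/interrupted install
    return [name for name in os.listdir(packages_dir)
            if name.startswith("~") or name.startswith("-")]

for d in site.getsitepackages():
    if os.path.isdir(d):
        print(d, find_damaged(d))
```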
