When I run the following script on a Dataproc cluster:
import nltk
nltk.download('wordnet')
The nltk_data is downloaded only on the master node, not on the worker nodes. As a result, when I submit a PySpark job to Dataproc, it fails because the workers cannot read the data.
What solutions do you suggest? How can I download nltk_data on the worker nodes as well?
You can use initialization actions to run the download on all cluster nodes (master and workers): https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions
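As a sketch, an init action is just a shell script that Dataproc executes on every node at cluster creation. Something like the following (the script name, bucket path, and cluster name are placeholders you would replace with your own):

```shell
#!/bin/bash
# Hypothetical init action script, e.g. install-nltk.sh.
# Dataproc runs this on every node (master and workers) during cluster creation.
set -euo pipefail

# Install nltk and download wordnet on this node.
pip install nltk
python -c "import nltk; nltk.download('wordnet')"
```

Upload the script to a GCS bucket and pass it when creating the cluster, for example:

```shell
# Placeholder cluster and bucket names; substitute your own.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --initialization-actions=gs://my-bucket/install-nltk.sh
```

Note that init actions only run at cluster creation, so an existing cluster would need to be recreated (or the download run manually on each worker).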