When I run the following script on a Dataproc cluster:
import nltk
nltk.download('wordnet')
The nltk_data is downloaded only on the master node, not on the worker nodes. As a result, when I submit a PySpark job to Dataproc, it fails because the workers cannot read the data.
What solutions do you suggest? How can I download nltk_data on the worker nodes as well?
You can use initialization actions to run the download on all cluster nodes (master and workers): https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions
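As a sketch, an init action is just a shell script that Dataproc executes on every node at cluster creation. Something like the following (the script name, bucket path, and cluster name are placeholders you would replace with your own):

```shell
#!/bin/bash
# Hypothetical init action script, e.g. install-nltk.sh.
# Dataproc runs this on every node (master and workers) during cluster creation.
set -euo pipefail

# Install nltk and download wordnet on this node.
pip install nltk
python -c "import nltk; nltk.download('wordnet')"
```

Upload the script to a GCS bucket and pass it when creating the cluster, for example:

```shell
# Placeholder cluster and bucket names; substitute your own.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --initialization-actions=gs://my-bucket/install-nltk.sh
```

Note that init actions only run at cluster creation, so an existing cluster would need to be recreated (or the download run manually on each worker).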