NameError: name 'spark' is not defined — how to solve?
I have just installed pyspark 2.4.5 on my Ubuntu 18.04 laptop, and when I run the following code:
# this is part of the code
import os
from glob import glob

import pubmed_parser as pp
from pyspark.sql import SparkSession
from pyspark.sql import Row
medline_files_rdd = spark.sparkContext.parallelize(glob('/mnt/hgfs/ShareDir/data/*.gz'), numSlices=1000)
parse_results_rdd = medline_files_rdd.\
flatMap(lambda x: [Row(file_name=os.path.basename(x), **publication_dict)
for publication_dict in pp.parse_medline_xml(x)])
medline_df = parse_results_rdd.toDF()
# save to parquet
medline_df.write.parquet('raw_medline.parquet', mode='overwrite')
medline_df = spark.read.parquet('raw_medline.parquet')
I get this error:
medline_files_rdd = spark.sparkContext.parallelize(glob('/mnt/hgfs/ShareDir/data/*.gz'), numSlices=1000)
NameError: name 'spark' is not defined
I have seen similar questions on StackOverflow, but none of them solve my problem. Can anyone help me? Thanks a lot.
By the way, I am new to Spark. If I just want to use Spark in Python, is it enough to install pyspark with pip install pyspark? Is there anything else I should do? Should I install Hadoop or something else?
Just create the Spark session at the start:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()