
Apache Spark, NameError: name 'flatMap' is not defined

When I try

tokens = cleaned_book(flatMap(normalize_tokenize))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'flatMap' is not defined

where

cleaned_book.count()
65744

and

import re

def normalize_tokenize(line):
    return re.sub(r'\s+', ' ', line).strip().lower().split(' ')
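For reference, the function itself works as expected when tested outside Spark; a minimal check (the sample string is made up for illustration):

```python
import re

def normalize_tokenize(line):
    # Collapse runs of whitespace, trim, lowercase, then split on single spaces
    return re.sub(r'\s+', ' ', line).strip().lower().split(' ')

print(normalize_tokenize('  The  Quick   Brown Fox '))
# → ['the', 'quick', 'brown', 'fox']
```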

On the other hand,

sc.parallelize([3,4,5]).flatMap(lambda x: range(1,x)).collect()

works fine from the same PySpark shell:

[1, 2, 1, 2, 3, 1, 2, 3, 4]

Why do I get a NameError?

OK, here is a Scala example with a tokenizer that leads me to think you are looking at it wrongly.

def tokenize(f: RDD[String]) = {
    f.map(_.split(" "))
}

val dfsFilename = "/FileStore/tables/some.txt"
val readFileRDD = spark.sparkContext.textFile(dfsFilename)
val wcounts = tokenize(spark.sparkContext.textFile(dfsFilename)).flatMap(x => x).map(word=>(word, 1)).reduceByKey(_ + _)
wcounts.collect()

This works fine; you need the functional aspect, thus .flatMap, and in this sequence. I find the inline approach easier, but I note the comment also alludes to .flatMap.
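Applied to the PySpark question above: flatMap is a method on the RDD, not a free function, so the call should be cleaned_book.flatMap(normalize_tokenize) rather than cleaned_book(flatMap(normalize_tokenize)). A minimal sketch of what that flatMap does, in plain Python with no Spark dependency (the lines list is a made-up sample):

```python
import re
from itertools import chain

def normalize_tokenize(line):
    # Collapse whitespace, trim, lowercase, split into tokens
    return re.sub(r'\s+', ' ', line).strip().lower().split(' ')

# flatMap applies the function to every element and concatenates the
# per-element lists into one flat sequence, which is what
# cleaned_book.flatMap(normalize_tokenize) would yield in Spark.
lines = ['  Moby  Dick ', 'Call me   Ishmael']
tokens = list(chain.from_iterable(normalize_tokenize(l) for l in lines))
print(tokens)
# → ['moby', 'dick', 'call', 'me', 'ishmael']
```

The key point is the calling convention: in both the Scala answer and PySpark, flatMap is invoked as rdd.flatMap(f), never as a bare function name.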

