Apache Spark TF-IDF using Python
The Spark documentation says to use the HashingTF feature, but I'm unsure what its transform function expects as input: http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
I tried running the tutorial code:
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
sc = SparkContext()
# Load documents (one per line).
documents = sc.textFile("...").map(lambda line: line.split(" "))
hashingTF = HashingTF()
tf = hashingTF.transform(documents)
but I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/salloumm/spark-1.6.0-bin-hadoop2.6/python/pyspark/ml/pipeline.py", line 114, in transform
return self._transform(dataset)
File "/Users/salloumm/spark-1.6.0-bin-hadoop2.6/python/pyspark/ml/wrapper.py", line 148, in _transform
return DataFrame(self._java_obj.transform(dataset._jdf), dataset.sql_ctx)
AttributeError: 'list' object has no attribute '_jdf'
Based on the error you've shown, it is clear you are not following the tutorial or running the code included in the question. The error is the result of importing pyspark.ml.feature.HashingTF (the DataFrame-based API, whose transform expects a DataFrame) instead of pyspark.mllib.feature.HashingTF (the RDD-based API, whose transform accepts an RDD of token lists). Just clean your environment and make sure you use the correct imports.