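JSON to Spark RDD in Python
I'm very new to Spark and have been trying for a while to get Spark to understand my JSON input, but I haven't managed it. In short, I'm using Spark's ALS algorithm to produce recommendations. When I supply a CSV file as input, everything works fine. However, my input is actually JSON, like this: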
all_user_recipe_rating = [{'rating': 1, 'recipe_id': 8798, 'user_id': 2108}, {'rating': 4, 'recipe_id': 6985, 'user_id': 4236}, {'rating': 4, 'recipe_id': 13572, 'user_id': 2743}, {'rating': 4, 'recipe_id': 6312, 'user_id': 3156}, {'rating': 1, 'recipe_id': 12836, 'user_id': 768}, {'rating': 1, 'recipe_id': 9237, 'user_id': 1599}, {'rating': 2, 'recipe_id': 16946, 'user_id': 2687}, {'rating': 2, 'recipe_id': 20728, 'user_id': 58}, {'rating': 4, 'recipe_id': 12921, 'user_id': 2221}, {'rating': 2, 'recipe_id': 10693, 'user_id': 2114}, {'rating': 2, 'recipe_id': 18301, 'user_id': 4898}, {'rating': 2, 'recipe_id': 9967, 'user_id': 3010}, {'rating': 2, 'recipe_id': 16393, 'user_id': 4830}, {'rating': 4, 'recipe_id': 14838, 'user_id': 583}]
ratings_RDD = self.spark.parallelize(all_user_recipe_rating)
ratings = ratings_RDD.map(lambda row:
    (Rating(int(row['user_id']),
            int(row['recipe_id']),
            float(row['rating']))))
model = self.build_model(ratings)
This is what I came up with after looking at some examples, but this is what I get:
MatrixFactorizationModel: User factor is not cached. Prediction could be slow.
16/12/21 03:54:53 WARN MatrixFactorizationModel: Product factor does not have a partitioner. Prediction on individual records could be slow.
16/12/21 03:54:53 WARN MatrixFactorizationModel: Product factor is not cached. Prediction could be slow.
16/12/21 03:54:53 WARN MatrixFactorizationModelWrapper: User factor does not have a partitioner. Prediction on individual records could be slow.
and
File "/usr/local/spark/python/pyspark/mllib/recommendation.py", line 147, in <lambda>
user_product = user_product.map(lambda u_p: (int(u_p[0]), int(u_p[1])))
TypeError: int() argument must be a string or a number, not 'Rating'
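Can anyone help me? :) Thanks!
OK,
your error has a single cause.
The exception you are hitting comes from predictAll, a method of the MatrixFactorizationModel that ALS returns, not from loading the JSON.
The problem is that you are passing Rating objects to a function that expects an RDD<int, int>, that is, an RDD of (user, product) integer pairs.
I took your code and built what was needed: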
>>> from pyspark.mllib.recommendation import ALS, Rating
>>> ratings = ratings_RDD.map(lambda row:
... (Rating(int(row['user_id']),
... int(row['recipe_id']),
... float(row['rating']))))
>>> model = ALS.trainImplicit(ratings, 1, seed=10)
>>> to_predict = spark.parallelize([[2108, 16393], [583, 20728]])
>>> model.predictAll(to_predict).take(2)
[Rating(user=583, product=20728, rating=0.0741161997082127), Rating(user=2108, product=16393, rating=0.05669039815320609)]
Your JSON is fine. The problem occurs when you call predictAll: you are passing Rating objects instead of an RDD<int, int>.
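If the ratings arrive as JSON text rather than as a Python list, something along these lines might help prepare both inputs. This is only a sketch: raw_json is a made-up two-record sample in the shape of the question's data, the helper names to_rating_tuples and to_user_product_pairs are hypothetical, and the Spark calls are left as comments because they assume an existing SparkContext (called sc here).

```python
import json

# Hypothetical sample: two records shaped like the ones in the question, as JSON text.
raw_json = (
    '[{"rating": 1, "recipe_id": 8798, "user_id": 2108},'
    ' {"rating": 4, "recipe_id": 6985, "user_id": 4236}]'
)


def to_rating_tuples(records):
    """Build (user, product, rating) triples for training.

    pyspark.mllib's ALS accepts an RDD of such tuples as well as Rating objects.
    """
    return [
        (int(r["user_id"]), int(r["recipe_id"]), float(r["rating"]))
        for r in records
    ]


def to_user_product_pairs(records):
    """Build (user, product) integer pairs, the shape predictAll expects."""
    return [(int(r["user_id"]), int(r["recipe_id"])) for r in records]


records = json.loads(raw_json)            # JSON text -> list of dicts
training_rows = to_rating_tuples(records)
pairs = to_user_product_pairs(records)

# With a SparkContext `sc` (assumed, not created here), mirroring the calls above:
#   model = ALS.trainImplicit(sc.parallelize(training_rows), 1, seed=10)
#   predictions = model.predictAll(sc.parallelize(pairs))
```

The point is simply to keep the two shapes apart: triples (or Rating objects) go into training, and plain (user, product) pairs go into predictAll.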