
Updating a PySpark dataframe column with an RDD of tuples

I am working with data in which the user information is stored as strings. I would like to assign a unique integer value to each of those strings.

I was loosely following a Stack Overflow post on the topic. I am using the expression below to get an RDD of tuples:

user = data.map(lambda x:x[0]).distinct().zipWithUniqueId()

After that, I did:

data = data.map(lambda x: Rating(int(user.lookup(x[0])), int(x[1]), float(x[2]))) 

What I ultimately want to do is run an ALS model on it, but so far I keep getting this error message:

Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation.

I think the data type is somehow wrong, but I am not sure how to fix this.

The lookup approach suggested in the linked answer is simply invalid. Spark supports neither nested actions nor nested transformations, so you cannot call RDD.lookup inside a map. If the data is too large to be handled with a standard Python dict for lookups, you can simply join and reshape:

from operator import itemgetter
from pyspark.mllib.recommendation import Rating

data = sc.parallelize([("foo", 1, 2.0), ("bar", 2, 3.0)])

user = itemgetter(0)

def to_rating(record):
    """
    >>> to_rating((("foobar", 99, 5.0), 1000))
    Rating(user=1000, product=99, rating=5.0)
    """
    (_, item, rating), user = record
    return Rating(user, item, rating)

user_lookup = data.map(user).distinct().zipWithIndex()

ratings = (data
    .keyBy(user)  # Add user string as a key
    .join(user_lookup)  # Join with lookup
    .values()  # Drop keys
    .map(to_rating))  # Create Ratings

ratings.first()
## Rating(user=1, product=1, rating=2.0)
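
For the other case mentioned above, where the set of distinct users is small enough to fit in a standard Python dict, a broadcast lookup is one possible alternative to the join. The following is only a minimal sketch under that assumption: the helper names build_user_lookup, to_rating_tuple and ratings_with_broadcast are hypothetical and do not come from the original answer, and the records are assumed to be (user, item, rating) tuples as in the question.

```python
def build_user_lookup(user_names):
    """Map each distinct user string to a unique integer ID (hypothetical helper)."""
    return {name: idx for idx, name in enumerate(sorted(set(user_names)))}


def to_rating_tuple(record, lookup):
    """Replace the user string in a (user, item, rating) record with its integer ID."""
    user, item, rating = record
    return (lookup[user], int(item), float(rating))


def ratings_with_broadcast(sc, data):
    """Build an RDD of Rating objects using a broadcast dict instead of a join.

    Assumes the distinct users fit in driver memory; `sc` is a SparkContext
    and `data` is an RDD of (user, item, rating) tuples.
    """
    # Imported here so the pure-Python helpers above work without Spark installed.
    from pyspark.mllib.recommendation import Rating

    # Collect the distinct user strings to the driver and build a plain dict.
    users = data.map(lambda x: x[0]).distinct().collect()
    # Ship the dict to the executors once; this is a dict, not an RDD,
    # so referencing it inside map does not trigger the exception above.
    lookup = sc.broadcast(build_user_lookup(users))
    return data.map(lambda x: Rating(*to_rating_tuple(x, lookup.value)))
```

With the sc and data defined earlier, ratings_with_broadcast(sc, data) should yield the same kind of Rating records as the join-based version, although the specific integer assigned to each user may differ between the two approaches.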
