
Convert StringType to ArrayType in PySpark

I am trying to run the FPGrowth algorithm in PySpark on my dataset.

from pyspark.ml.fpm import FPGrowth

fpGrowth = FPGrowth(itemsCol="name", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)

I am getting the following error:

An error occurred while calling o2139.fit.
: java.lang.IllegalArgumentException: requirement failed: The input 
column must be ArrayType, but got StringType.
at scala.Predef$.require(Predef.scala:224)

My DataFrame df is in the form:

df.show(2)

+---+---------+--------------------+
| id|     name|               actor|
+---+---------+--------------------+
|  0|['ab,df']|                 tom|
|  1|['rs,ce']|                brad|
+---+---------+--------------------+
only showing top 2 rows

The FP algorithm works if my data in column "name" is in the form:

 name
[ab,df]
[rs,ce]

How do I convert the name column from StringType to ArrayType so that it takes this form?

I formed the DataFrame from my RDD:

rd2 = rd.map(lambda x: (x[1], x[0][0], [x[0][1]]))
rd3 = rd2.map(lambda p: Row(id=int(p[0]), name=str(p[2]), actor=str(p[1])))
df = spark.createDataFrame(rd3)

rd2.take(2):

[(0, 'tom', ['ab,df']), (1, 'brad', ['rs,ce'])]
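As a plain-Python illustration (no Spark required) of why the name column ends up as StringType: wrapping the list in `str()` stores its textual representation as a single string, while splitting the inner string yields the list of items FPGrowth expects.

```python
# One record, as shown by rd2.take(2)
record = (0, 'tom', ['ab,df'])

# What the original Row construction stores via str(p[2]):
name_as_string = str(record[2])
print(name_as_string)            # "['ab,df']" -- a single string, so Spark infers StringType

# What FPGrowth needs: an actual list of items
name_as_list = record[2][0].split(',')
print(name_as_list)              # ['ab', 'df'] -- a list, so Spark infers ArrayType
```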

Split on the comma for each row in the name column of your dataframe, e.g.:

from pyspark.sql.functions import pandas_udf

# 'list' is not a valid Spark SQL type name; the return type has to be
# 'array<string>'. Each stored value is the str() of a one-element list,
# e.g. "['ab,df']", so strip the brackets/quotes before splitting. Note
# that v is a pandas Series, so the string methods go through .str.
@pandas_udf('array<string>')
def split_comma(v):
    return v.str.strip("[]'").str.split(',')

df = df.withColumn('name', split_comma(df.name))
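A quick plain-Python check of the per-value string cleanup needed here (assuming the stored strings look like `"['ab,df']"`, as in the df.show() output above):

```python
# The stored value is the str() of a one-element list, e.g. "['ab,df']".
raw = "['ab,df']"
cleaned = raw.strip("[]'")   # strip the surrounding brackets and quotes
items = cleaned.split(',')   # split the remaining string on the comma
print(items)                 # ['ab', 'df']
```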

Or better, don't defer this. Set name directly to the list.

rd2 = rd.map(lambda x: (x[1], x[0][0], x[0][1].split(',')))
rd3 = rd2.map(lambda p: Row(id=int(p[0]), name=p[2], actor=str(p[1])))

Based on your previous question, it seems as though you are building rd2 incorrectly.

Try this:

rd2 = rd.map(lambda x: (x[1], x[0][0], x[0][1].split(",")))
rd3 = rd2.map(lambda p: Row(id=int(p[0]), name=p[2], actor=str(p[1])))

The change is that we call str.split(",") on x[0][1], so that it converts a string like 'a,b' to a list: ['a', 'b'].
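As a sketch of what the fixed mapping produces, here is a plain-Python stand-in for the RDD pipeline. The shape of rd's elements is a hypothetical reconstruction inferred from the rd2.take(2) output in the question:

```python
# Inferred from rd2.take(2): x[1] is the id, x[0][0] the actor,
# and x[0][1] the comma-joined item string.
rd = [(('tom', 'ab,df'), 0), (('brad', 'rs,ce'), 1)]

# The fixed mapping: split the comma-joined string instead of
# wrapping it in a one-element list.
rd2 = [(x[1], x[0][0], x[0][1].split(',')) for x in rd]
print(rd2)  # [(0, 'tom', ['ab', 'df']), (1, 'brad', ['rs', 'ce'])]
```

Each third element is now a real Python list, so Spark infers an ArrayType column and FPGrowth's input requirement is satisfied.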
