Is there a way to put multiple columns in the PySpark array function? (FP Growth preparation)

Question

I have a DataFrame with symptoms of a disease, and I want to run FP Growth on the entire DataFrame. FP Growth expects an array column as input, and it works with this code:

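import pyspark.sql.functions as F

# Pack the selected symptom columns into one array column named "features"
dfFPG = df.select(F.array(df["Gender"],
                          df["Polyuria"],
                          df["Polydipsia"],
                          df["Sudden weight loss"],
                          df["Weakness"],
                          df["Polyphagia"],
                          df["Genital rush"],
                          df["Visual blurring"],
                          df["Itching"]).alias("features"))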

from pyspark.ml.fpm import FPGrowth
fpGrowth = FPGrowth(itemsCol="features", minSupport=0.3, minConfidence=0.2)
model = fpGrowth.fit(dfFPG)

model.freqItemsets.show(20,truncate=False)

The real feature list is longer than this, and if I ever rename df I have to use find and replace. I know I can use F.col("Gender") instead of df["Gender"], but is there a way to put all the columns inside F.array() at once while excluding a few of them, such as df["Age"]? Or is there another efficient way to prepare categorical features for FP Growth that I'm not aware of?

Answer 1

You can get all the column names from df.columns, filter out the ones you want to exclude, and unpack the rest into F.array():

import pyspark.sql.functions as F

dfFPG = df.select(F.array(*[c for c in df.columns if c not in ['col1', 'col2']]).alias("features"))
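The same idea can be wrapped in a small helper so the exclusion list lives in one place. This is only a sketch: the names feature_columns and to_features are hypothetical (not part of PySpark), and the check for unknown column names is an added design choice, not something the original one-liner does.

```python
def feature_columns(all_columns, exclude=()):
    """Return the column names to keep, preserving their original order.

    Hypothetical helper: raises ValueError if an excluded name is not an
    existing column, so a typo in the exclusion list is noticed early.
    """
    excluded = set(exclude)
    unknown = excluded - set(all_columns)
    if unknown:
        raise ValueError(f"Unknown columns in exclude: {sorted(unknown)}")
    return [c for c in all_columns if c not in excluded]


def to_features(df, exclude=(), alias="features"):
    """Pack every non-excluded column of a Spark DataFrame into one array column."""
    # Imported here so feature_columns() can be used and tested without Spark.
    import pyspark.sql.functions as F

    return df.select(F.array(*feature_columns(df.columns, exclude)).alias(alias))


# Plain-Python check of the column selection (no Spark session needed):
columns = ["Age", "Gender", "Polyuria", "Itching"]
kept = feature_columns(columns, exclude=["Age"])
```

With the DataFrame from the question, dfFPG = to_features(df, exclude=["Age"]) would then replace the long F.array(...) call. One caveat, as far as I know: Spark's FPGrowth requires the items within each row's array to be unique, so if several columns share values such as "Yes" and "No", it may be necessary to prefix each value with its column name (for example with F.concat_ws) before building the array.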

Answer 1 accepted, 2021-02-02 08:50:10