Is there a way to put multiple columns in the pyspark array function? (FP Growth prep)

I have a DataFrame with symptoms of a disease, and I want to run FP Growth on the entire DataFrame. FP Growth wants an array as input, and it works with this code:

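import pyspark.sql.functions as F

# Collect the symptom columns into a single array column named "features"
dfFPG = df.select(F.array(df["Gender"],
                          df["Polyuria"],
                          df["Polydipsia"],
                          df["Sudden weight loss"],
                          df["Weakness"],
                          df["Polyphagia"],
                          df["Genital rush"],
                          df["Visual blurring"],
                          df["Itching"]).alias("features"))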

from pyspark.ml.fpm import FPGrowth
fpGrowth = FPGrowth(itemsCol="features", minSupport=0.3, minConfidence=0.2)
model = fpGrowth.fit(dfFPG)

model.freqItemsets.show(20,truncate=False)

The features list is longer, and if I have to change the name of df I have to use find and replace. I know I can use F.col("Gender") instead of df["Gender"], but is there a way to put all the columns inside F.array() at once and still exclude a few of them, like df["Age"]? Or is there any other efficient way to prepare categorical features for FP Growth that I'm not aware of?

You can get all the column names using df.columns and put them all into the array, filtering out the ones you want to exclude:

import pyspark.sql.functions as F

dfFPG = df.select(F.array(*[c for c in df.columns if c not in ['col1', 'col2']]).alias("features"))
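
If the same preparation is needed in several places, the list comprehension above could be wrapped in a small helper. The sketch below is only an illustration: the names feature_columns, to_items and tag_with_column are hypothetical, not part of PySpark. The optional tagging step is an assumption about the data: Spark's FPGrowth requires the items within one row to be unique, so if several columns share values such as "Yes"/"No", prefixing each value with its column name is one way to keep them distinct.

```python
def feature_columns(all_columns, exclude=()):
    """Keep every column name that is not in `exclude`, preserving order."""
    excluded = set(exclude)
    return [c for c in all_columns if c not in excluded]


def to_items(df, exclude=(), tag_with_column=False):
    """Return a one-column DataFrame ("features") usable as FPGrowth's itemsCol.

    Hypothetical helper. PySpark is imported lazily so that the name filter
    above can be used and checked without a Spark installation.
    """
    import pyspark.sql.functions as F

    names = feature_columns(df.columns, exclude)
    if tag_with_column:
        # "Polyuria=Yes" and "Itching=Yes" are distinct items, whereas two bare
        # "Yes" values would be duplicates within the same row.
        cols = [F.concat(F.lit(c + "="), F.col(c).cast("string")) for c in names]
    else:
        cols = [F.col(c) for c in names]
    return df.select(F.array(*cols).alias("features"))
```

With the DataFrame from the question, this would be called as dfFPG = to_items(df, exclude=["Age"]), or with tag_with_column=True if the columns share values.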
