
Appending column name to column value using Spark

I have data in a comma-separated file, which I have loaded into a Spark data frame. The data looks like:

  A B C
  1 2 3
  4 5 6
  7 8 9

I want to transform the above data frame using PySpark so that it looks like:

   A    B   C
  A_1  B_2  C_3
  A_4  B_5  C_6
  --------------

Then convert it to a list of lists using PySpark:

[[A_1, B_2, C_3], [A_4, B_5, C_6]]

And then run the FP-Growth algorithm on the above data set using PySpark.

The code that I have tried is below:

from pyspark.sql.functions import col, size
from pyspark.sql.functions import *
import pyspark.sql.functions as func
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark.ml.fpm import FPGrowth
from pyspark.sql import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
from pyspark.sql import SQLContext

# sc and spark are pre-defined in the Databricks notebook environment
sqlContext = SQLContext(sc)
df = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/tables/data.csv")

# column names of the loaded data frame
names = df.schema.names

Then I thought of doing something inside a for loop:

 for name in names:
      -----
      ------

After this I will be using fpgrowth:

df = spark.createDataFrame([
    (0, ["A_1", "B_2", "C_3"]),
    (1, ["A_4", "B_5", "C_6"])], ["id", "items"])

fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)

A number of concepts here for those who normally use Scala, showing how to do things with pyspark. Somewhat different, but certainly something to learn from, although for how many people is the big question. I certainly learnt a point about pyspark's zipWithIndex myself. Anyway.

The first part is to get the data into the desired format; there are probably too many imports, but I am leaving them as they are:

from functools import reduce
from pyspark.sql.functions import lower, col, lit, concat, split
from pyspark.sql.types import * 
from pyspark.sql import Row
from pyspark.sql import functions as f

source_df = spark.createDataFrame(
   [
    (1, 11, 111),
    (2, 22, 222)
   ],
   ["colA", "colB", "colC"]
                                 )

intermediate_df = reduce(
    lambda df, col_name: df.withColumn(col_name, concat(lit(col_name), lit("_"), col(col_name))),
    source_df.columns,
    source_df
)
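
As an aside, the reduce above is just a left fold over the column names; a sketch of the same idea written as a plain loop (my rewording, reusing source_df and the imports already in scope) looks like this:

# sketch only: same prefixing as the reduce above, written as a plain loop
loop_df = source_df
for col_name in source_df.columns:
    loop_df = loop_df.withColumn(col_name, concat(lit(col_name), lit("_"), col(col_name)))
# loop_df now holds the same rows as intermediate_df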

allCols = [x for x in intermediate_df.columns]
result_df = intermediate_df.select(f.concat_ws(',', *allCols).alias('CONCAT_COLS'))

result_df = result_df.select(split(col("CONCAT_COLS"), r",\s*").alias("ARRAY_COLS"))

# Add 0,1,2,3, ... with zipWithIndex; we add it at the back, but that does not matter, you can move it around.
# Get the new structure: the existing fields (one in this case, but done flexibly) plus the zipWithIndex value.
schema = StructType(result_df.schema.fields[:] + [StructField("index", LongType(), True)])

# Need this dict approach with pyspark, different to Scala.
rdd = result_df.rdd.zipWithIndex()
rdd1 = rdd.map(
               lambda row: tuple(row[0].asDict()[c] for c in schema.fieldNames()[:-1]) + (row[1],)
              )

final_result_df = spark.createDataFrame(rdd1, schema)
final_result_df.show(truncate=False)

returns:

 +---------------------------+-----+
 |ARRAY_COLS                 |index|
 +---------------------------+-----+
 |[colA_1, colB_11, colC_111]|0    |
 |[colA_2, colB_22, colC_222]|1    |
 +---------------------------+-----+

The second part is the old zipWithIndex with pyspark, for when you need 0, 1, ... ids. Painful compared to Scala.
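
If you would rather stay in the DataFrame API entirely, one possible alternative (not what the approach above does, just a sketch, at the cost of ordering through a single partition) is monotonically_increasing_id plus row_number over a window:

from pyspark.sql.functions import monotonically_increasing_id, row_number
from pyspark.sql.window import Window

# sketch: DataFrame-only way to get consecutive 0,1,2,... ids, avoiding zipWithIndex
alt_df = result_df.withColumn("mono_id", monotonically_increasing_id())
alt_df = alt_df.withColumn("index", row_number().over(Window.orderBy("mono_id")) - 1).drop("mono_id")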

In general this is easier to solve in Scala.

Not sure about performance; it is not a foldLeft, which is interesting. I think it is OK actually.
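
To tie this back to the question, a sketch of running FP-Growth on the result (assuming final_result_df from above; the array column just needs to be exposed under the expected name) and of getting the plain list of lists:

from pyspark.ml.fpm import FPGrowth

# sketch: feed the array column produced above into FP-Growth
fp_input = final_result_df.selectExpr("index as id", "ARRAY_COLS as items")
fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(fp_input)
model.freqItemsets.show(truncate=False)

# and the list of lists the question asked for, if it is still needed
items_list = [row["ARRAY_COLS"] for row in final_result_df.collect()]
# [['colA_1', 'colB_11', 'colC_111'], ['colA_2', 'colB_22', 'colC_222']]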
