Convert Rows to Columns in PYSPARK

I am working on a project that requires data to be transposed. In the past, I did this using SAS and SQL, which used to be super fast. I used the expr function with stack as outlined below (code section).

The problem I am facing is two-fold.

  1. The input data is about 200 GB (500 million rows by 70 columns) and is stored as parquet files.
  2. The step that transposes (df2) runs for about 4-5 hours and then terminates. I have changed the timeout settings and played around with the Spark session settings, but no luck so far.

What I did so far: The data is stored as parquet files in Azure Synapse Workspace. First, I assigned a ROWNUMBER to each row in the data frame. Then I split the data into two data frames.

  1. df1 has ROWNUMBER and all the necessary columns (minus the 25 diagnosis columns).
  2. df2 has ROWNUMBER and the 25 diagnosis columns.
  3. I then tried to create df3 by joining df1 and df2 on ROWNUMBER.

Step 2 is the killer; I was not able to get past it, as the session terminates after 4 hours.

I tried Spark SQL as well, but no luck there either. Further, I was advised not to use SQL in Spark as it would hurt performance.

I am also thinking of doing the transpose outside of PYSPARK (not sure how, or if it is even advisable to do so).

Code I wrote so far:

import sys
import pyspark.sql as t
import pyspark.sql.functions as f
from pyspark.sql.types import *

df_raw=spark.read.parquet("abfss:path/med_claims/*.parquet")
df_rn=df_raw.withColumn("ROWNUM", f.row_number().over(t.Window.orderBy(df_raw.MEMBER_ID, df_raw.SERVICE_FROM_DATE, df_raw.SERVICE_THRU_DATE)))

df1=df_rn.select(
                 df_rn.ROWNUM,
                 df_rn.MEMBER_ID,
                 df_rn.MEMBER_ID_DEPENDENT,
                 df_rn.SERVICE_FROM_DATE,
                 df_rn.SERVICE_THRU_DATE,
                 df_rn.SERVICE_PROCEDURE_CODE
                )

df2=df_rn.select(df_rn.ROWNUM,
             f.expr("stack(25, code1, code2, code3, code4, code5, \
                             code6, code7, code8, code9, code10, \
                             code11, code12, code13, code14, code15, \
                             code16, code17, code18, code19, code20, \
                             code21, code22, code23, code24, code25) as (TRANPOSED_DIAG)")) \
             .dropDuplicates() \
             .where(" (TRANPOSED_DIAG IS NOT NULL) OR (TRIM(TRANPOSED_DIAG) <> '') ")

df3=df1.join(df2, df1.ROWNUM == df2.ROWNUM, 'left') \
       .select(df1.ROWNUM,
             df1.MEMBER_ID,
             df1.MEMBER_ID_DEPENDENT,
             df1.SERVICE_FROM_DATE,
             df1.SERVICE_THRU_DATE,
             df1.SERVICE_PROCEDURE_CODE,
             df2.TRANPOSED_DIAG
            )

Input Data:

| MEMBER_ID | MEMBER_ID_DEPENDENT | PROVIDER_KEY | REVENUE_KEY | PLACE_OF_SERVICE_KEY | SERVICE_FROM_DATE | SERVICE_THRU_DATE | SERVICE_PROCEDURE_CODE | CODE1 | CODE2 | CODE3 | CODE4 | CODE5 | CODE6 | CODE7 | CODE8..CODE25 (all null in this sample) |
| A1 | A11 | AB05547 | 4.85148E+12 | 7.96651E+11 | 9/23/2019 0:00 | 9/23/2019 0:00 | 89240 | Z0000 | M25852 | M25851 | Z0000 | M25551 | null | null | null |
| A1 | A11 | AB92685 | 4.85148E+12 | 7.96651E+11 | 10/23/2020 0:00 | 10/23/2020 0:00 | 89240 | Z524 | Z524 | null | null | null | null | null | null |
| A2 | A12 | AB64081 | 4.8515E+12 | 7.96651E+11 | 6/19/2020 0:00 | 6/19/2020 0:00 | 76499 | Z9884 | R109 | K219 | K449 | Z9884 | null | null | null |
| A3 | A13 | AB64081 | 4.8515E+12 | 7.96651E+11 | 9/13/2019 0:00 | 9/13/2019 0:00 | 76499 | Z1231 | Z1231 | null | null | null | null | null | null |
| A4 | A14 | AB74417 | 4.8515E+12 | 7.96651E+11 | 9/30/2019 0:00 | 9/30/2019 0:00 | 76499 | N210 | N400 | E782 | E119 | I10 | Z87891 | N210 | null |

Expected Output:

| MEMBER_ID | MEMBER_ID_DEPENDENT | PROVIDER_KEY | REVENUE_KEY | PLACE_OF_SERVICE_KEY | SERVICE_FROM_DATE | SERVICE_THRU_DATE | SERVICE_PROCEDURE_CODE | TRANSPOSED_DIAGNOSIS |
| A1 | A11 | AB05547 | 4851484842551 | 796650504854 | 9/23/2019 0:00 | 9/23/2019 0:00 | 89240 | Z0000 |
| A1 | A11 | AB05548 | 4851484842551 | 796650504854 | 9/23/2019 0:00 | 9/23/2019 0:00 | 89241 | M25852 |
| A1 | A11 | AB05549 | 4851484842551 | 796650504854 | 9/23/2019 0:00 | 9/23/2019 0:00 | 89242 | M25851 |
| A1 | A11 | AB05550 | 4851484842551 | 796650504854 | 9/23/2019 0:00 | 9/23/2019 0:00 | 89243 | M25551 |
| A1 | A11 | AB92685 | 4851484842551 | 796650504854 | 10/23/2020 0:00 | 10/23/2020 0:00 | 89240 | Z524 |
| A2 | A12 | AB64081 | 4851504842551 | 796650504854 | 6/19/2020 0:00 | 6/19/2020 0:00 | 76499 | Z9884 |
| A2 | A12 | AB64082 | 4851504842551 | 796650504854 | 6/19/2020 0:00 | 6/19/2020 0:00 | 76500 | R109 |
| A2 | A12 | AB64083 | 4851504842551 | 796650504854 | 6/19/2020 0:00 | 6/19/2020 0:00 | 76501 | K219 |
| A2 | A12 | AB64084 | 4851504842551 | 796650504854 | 6/19/2020 0:00 | 6/19/2020 0:00 | 76502 | K449 |
| A3 | A13 | AB64081 | 4851504842551 | 796650504854 | 9/13/2019 0:00 | 9/13/2019 0:00 | 76499 | Z1231 |
| A4 | A14 | AB74417 | 4851504842551 | 796650504854 | 9/30/2019 0:00 | 9/30/2019 0:00 | 76499 | N210 |
| A4 | A14 | AB74418 | 4851504842551 | 796650504854 | 9/30/2019 0:00 | 9/30/2019 0:00 | 76500 | N400 |
| A4 | A14 | AB74419 | 4851504842551 | 796650504854 | 9/30/2019 0:00 | 9/30/2019 0:00 | 76501 | E782 |
| A4 | A14 | AB74420 | 4851504842551 | 796650504854 | 9/30/2019 0:00 | 9/30/2019 0:00 | 76502 | E119 |
| A4 | A14 | AB74421 | 4851504842551 | 796650504854 | 9/30/2019 0:00 | 9/30/2019 0:00 | 76503 | I10 |
| A4 | A14 | AB74422 | 4851504842551 | 796650504854 | 9/30/2019 0:00 | 9/30/2019 0:00 | 76504 | Z87891 |

This will be an expensive operation with any approach; however, you may consider the following approaches, which avoid another expensive join.

For simplification and code re-use, I've pulled the desired columns and the code-related columns into separate variables instead of hardcoding them.

Approach 1: Recommended

Continuing from df_raw's first load, you may try the following:

from pyspark.sql import functions as F
from pyspark.sql import Window

# extract service procedure code columns from `df_raw` by looking for the simple pattern 'CODE'.
# This filter can be easily modified for more complex code column names
service_procedure_cols = [col for col in df_raw.columns if 'CODE' in col and 'SERVICE' not in col]

# extract the desired column names in the dataframe
desired_cols = [col for col in df_raw.columns if 'CODE' not in col or 'SERVICE' in col]
# build the stack expression by counting the number of columns with `len` and concatenating the column names
code_column_stack_expression = "stack("+str(len(service_procedure_cols))+", "+",".join(service_procedure_cols)+") as (TRANSPOSED_DIAGNOSIS)"
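# for the sample schema (25 diagnosis columns named CODE1..CODE25), the expression above
# evaluates to a string of the form:
#   "stack(25, CODE1,CODE2,...,CODE25) as (TRANSPOSED_DIAGNOSIS)"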

df_step_1 = (
    # select the desired column names and unpivot the data
    df_raw.select(desired_cols + [ F.expr(code_column_stack_expression)])
    # filter out null and empty diagnosis values
          .where(F.col("TRANSPOSED_DIAGNOSIS").isNotNull() & (F.trim("TRANSPOSED_DIAGNOSIS") != '' ))
    # remove duplicates
          .dropDuplicates()
)

df_step_1.show(truncate=False)

Outputs:

+---------+-------------------+------------+-----------+--------------------+-----------------+-----------------+----------------------+--------------------+
|MEMBER_ID|MEMBER_ID_DEPENDENT|PROVIDER_KEY|REVENUE_KEY|PLACE_OF_SERVICE_KEY|SERVICE_FROM_DATE|SERVICE_THRU_DATE|SERVICE_PROCEDURE_CODE|TRANSPOSED_DIAGNOSIS|
+---------+-------------------+------------+-----------+--------------------+-----------------+-----------------+----------------------+--------------------+
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89240                 |Z0000               |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89240                 |M25852              |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89240                 |M25851              |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89240                 |Z0000               |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89240                 |M25551              |
|A1       |A11                |AB92685     |4.85148E+12|7.96651E+11         |10/23/2020 0:00  |10/23/2020 0:00  |89240                 |Z524                |
|A1       |A11                |AB92685     |4.85148E+12|7.96651E+11         |10/23/2020 0:00  |10/23/2020 0:00  |89240                 |Z524                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76499                 |Z9884               |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76499                 |R109                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76499                 |K219                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76499                 |K449                |
+---------+-------------------+------------+-----------+--------------------+-----------------+-----------------+----------------------+--------------------+

Next, regenerate the SERVICE_PROCEDURE_CODE values from the unpivoted rows:

df_step_2 = (
    # Replace the existing `SERVICE_PROCEDURE_CODE` column by casting it to an integer and
    # adding the generated row number (minus 1), partitioned by the desired columns and
    # ordered by the columns specified in the question
    df_step_1.withColumn(
        "SERVICE_PROCEDURE_CODE",
        F.col("SERVICE_PROCEDURE_CODE").cast("INT") + F.row_number().over(
            Window.partitionBy(desired_cols).orderBy("MEMBER_ID", "SERVICE_FROM_DATE", "SERVICE_THRU_DATE")
        ) - 1
    )
)
df_step_2.show(truncate=False)

Outputs:

+---------+-------------------+------------+-----------+--------------------+-----------------+-----------------+----------------------+--------------------+
|MEMBER_ID|MEMBER_ID_DEPENDENT|PROVIDER_KEY|REVENUE_KEY|PLACE_OF_SERVICE_KEY|SERVICE_FROM_DATE|SERVICE_THRU_DATE|SERVICE_PROCEDURE_CODE|TRANSPOSED_DIAGNOSIS|
+---------+-------------------+------------+-----------+--------------------+-----------------+-----------------+----------------------+--------------------+
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89240                 |Z0000               |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89241                 |M25852              |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89242                 |M25851              |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89243                 |Z0000               |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89244                 |M25551              |
|A1       |A11                |AB92685     |4.85148E+12|7.96651E+11         |10/23/2020 0:00  |10/23/2020 0:00  |89240                 |Z524                |
|A1       |A11                |AB92685     |4.85148E+12|7.96651E+11         |10/23/2020 0:00  |10/23/2020 0:00  |89241                 |Z524                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76499                 |Z9884               |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76500                 |R109                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76501                 |K219                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76502                 |K449                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76503                 |Z9884               |
|A3       |A13                |AB64081     |4.8515E+12 |7.96651E+11         |9/13/2019 0:00   |9/13/2019 0:00   |76499                 |Z1231               |
|A3       |A13                |AB64081     |4.8515E+12 |7.96651E+11         |9/13/2019 0:00   |9/13/2019 0:00   |76500                 |Z1231               |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76499                 |N210                |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76500                 |N400                |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76501                 |E782                |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76502                 |E119                |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76503                 |I10                 |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76504                 |Z87891              |
+---------+-------------------+------------+-----------+--------------------+-----------------+-----------------+----------------------+--------------------+
only showing top 20 rows
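
If the reshaped result needs to be persisted rather than just shown, a minimal sketch would be the following (the output container and path are placeholders, not from the question):

# write the reshaped data back to ADLS as parquet; replace the path with your own
(
    df_step_2
        .write
        .mode("overwrite")
        .parquet("abfss://<container>@<account>.dfs.core.windows.net/med_claims_transposed/")
)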

Approach 2: Uses the original code number to update the service code

This approach may also be easier for some to read, as it uses a loop to build a union of the desired dataset.

NB: This may cause overlaps in your service procedure codes.

Continuing from df_raw's first load, you may try the following:

from pyspark.sql import functions as F
from pyspark.sql import Window

# cache the original df
df_raw.cache()

# extract service procedure code columns from `df_raw` by looking for the simple pattern 'CODE'.
# This filter can be easily modified for more complex code column names
service_procedure_cols = [col for col in df_raw.columns if 'CODE' in col and 'SERVICE' not in col]

# extract the desired column names in the dataframe
desired_cols = [col for col in df_raw.columns if 'CODE' not in col or 'SERVICE' in col]

# use a temp variable `df_combined` to store the final dataframe
df_combined = None
# for each of the service procedure columns
for col in service_procedure_cols:
    # extract the code number
    col_num = int(col.replace("CODE",""))
    # combine the desired columns with this code column to get all desired columns for the diagnosis
    diagnosis_desired_columns = desired_cols + [col]
    # creating a temporary df
    interim_df = (
    # select all desired columns
        df_raw.select(*diagnosis_desired_columns)
    # update the service procedure code with the extracted code number
              .withColumn(
                  "SERVICE_PROCEDURE_CODE",
                  F.col("SERVICE_PROCEDURE_CODE").cast("INT")+col_num
              )
     # rename the code column
              .withColumnRenamed(col,"TRANSPOSED_DIAGNOSIS")
     # filter out null and empty diagnosis values
              .where(F.col("TRANSPOSED_DIAGNOSIS").isNotNull() & (F.trim("TRANSPOSED_DIAGNOSIS") !=''))
              .dropDuplicates()
    )
    # if the initial combined df variable is empty assign it `interim_df`
    # otherwise perform a union and store the result
    if df_combined is None:
        df_combined = interim_df 
    else:
        df_combined = df_combined.union(interim_df)

# only here for debugging purposes to show the results
df_combined.orderBy(desired_cols).show(truncate=False)

Outputs:

+---------+-------------------+------------+-----------+--------------------+-----------------+-----------------+----------------------+--------------------+
|MEMBER_ID|MEMBER_ID_DEPENDENT|PROVIDER_KEY|REVENUE_KEY|PLACE_OF_SERVICE_KEY|SERVICE_FROM_DATE|SERVICE_THRU_DATE|SERVICE_PROCEDURE_CODE|TRANSPOSED_DIAGNOSIS|
+---------+-------------------+------------+-----------+--------------------+-----------------+-----------------+----------------------+--------------------+
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89241                 |Z0000               |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89242                 |M25852              |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89243                 |M25851              |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89244                 |Z0000               |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89245                 |M25551              |
|A1       |A11                |AB92685     |4.85148E+12|7.96651E+11         |10/23/2020 0:00  |10/23/2020 0:00  |89241                 |Z524                |
|A1       |A11                |AB92685     |4.85148E+12|7.96651E+11         |10/23/2020 0:00  |10/23/2020 0:00  |89242                 |Z524                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76500                 |Z9884               |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76501                 |R109                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76502                 |K219                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76503                 |K449                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76504                 |Z9884               |
|A3       |A13                |AB64081     |4.8515E+12 |7.96651E+11         |9/13/2019 0:00   |9/13/2019 0:00   |76500                 |Z1231               |
|A3       |A13                |AB64081     |4.8515E+12 |7.96651E+11         |9/13/2019 0:00   |9/13/2019 0:00   |76501                 |Z1231               |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76500                 |N210                |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76501                 |N400                |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76502                 |E782                |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76503                 |E119                |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76504                 |I10                 |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76505                 |Z87891              |
+---------+-------------------+------------+-----------+--------------------+-----------------+-----------------+----------------------+--------------------+
only showing top 20 rows
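
As a small variant (a sketch, not part of the original answer), the `df_combined is None` bookkeeping can be replaced by collecting the per-code frames in a list and combining them with functools.reduce and unionByName:

from functools import reduce
from pyspark.sql import functions as F

# build one interim frame per diagnosis column, then union them all in one pass
interim_dfs = []
for col in service_procedure_cols:
    col_num = int(col.replace("CODE", ""))
    interim_dfs.append(
        df_raw.select(*(desired_cols + [col]))
              # shift the service procedure code by the extracted code number
              .withColumn("SERVICE_PROCEDURE_CODE",
                          F.col("SERVICE_PROCEDURE_CODE").cast("INT") + col_num)
              # rename the diagnosis column and drop null / empty values
              .withColumnRenamed(col, "TRANSPOSED_DIAGNOSIS")
              .where(F.col("TRANSPOSED_DIAGNOSIS").isNotNull() &
                     (F.trim("TRANSPOSED_DIAGNOSIS") != ""))
              .dropDuplicates()
    )

df_combined = reduce(lambda left, right: left.unionByName(right), interim_dfs)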

Merge the code columns into an array, filter out the null values, then explode it.

from pyspark.sql import functions as f

# collect the diagnosis code columns
codes = list(filter(lambda c: c.startswith('CODE'), df.columns))

# pack the codes into a single array column, drop the original columns,
# strip nulls from the array, then explode it into one row per diagnosis
df.withColumn('TRANSPOSED_DIAGNOSIS', f.array(*map(lambda c: f.col(c), codes))) \
  .drop(*codes) \
  .withColumn('TRANSPOSED_DIAGNOSIS', f.expr('filter(TRANSPOSED_DIAGNOSIS, x -> x is not null)')) \
  .withColumn('TRANSPOSED_DIAGNOSIS', f.explode('TRANSPOSED_DIAGNOSIS')) \
  .show(30, truncate=False)

+---------+-------------------+------------+-----------+--------------------+-----------------+-----------------+----------------------+--------------------+
|MEMBER_ID|MEMBER_ID_DEPENDENT|PROVIDER_KEY|REVENUE_KEY|PLACE_OF_SERVICE_KEY|SERVICE_FROM_DATE|SERVICE_THRU_DATE|SERVICE_PROCEDURE_CODE|TRANSPOSED_DIAGNOSIS|
+---------+-------------------+------------+-----------+--------------------+-----------------+-----------------+----------------------+--------------------+
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89240                 |Z0000               |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89240                 |M25852              |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89240                 |M25851              |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89240                 |Z0000               |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89240                 |M25551              |
|A1       |A11                |AB92685     |4.85148E+12|7.96651E+11         |10/23/2020 0:00  |10/23/2020 0:00  |89240                 |Z524                |
|A1       |A11                |AB92685     |4.85148E+12|7.96651E+11         |10/23/2020 0:00  |10/23/2020 0:00  |89240                 |Z524                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76499                 |Z9884               |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76499                 |R109                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76499                 |K219                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76499                 |K449                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76499                 |Z9884               |
|A3       |A13                |AB64081     |4.8515E+12 |7.96651E+11         |9/13/2019 0:00   |9/13/2019 0:00   |76499                 |Z1231               |
|A3       |A13                |AB64081     |4.8515E+12 |7.96651E+11         |9/13/2019 0:00   |9/13/2019 0:00   |76499                 |Z1231               |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76499                 |N210                |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76499                 |N400                |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76499                 |E782                |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76499                 |E119                |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76499                 |I10                 |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76499                 |Z87891              |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76499                 |N210                |
+---------+-------------------+------------+-----------+--------------------+-----------------+-----------------+----------------------+--------------------+
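
The question's where clause also drops empty strings, not only nulls. Assuming that behaviour is still wanted, the filter lambda can be extended; this is a sketch of the same chain with only the filter expression changed:

from pyspark.sql import functions as f

# same idea as above, but also dropping empty / whitespace-only codes,
# mirroring the TRIM(...) <> '' condition from the question
codes = [c for c in df.columns if c.startswith('CODE')]

df.withColumn('TRANSPOSED_DIAGNOSIS', f.array(*[f.col(c) for c in codes])) \
  .drop(*codes) \
  .withColumn('TRANSPOSED_DIAGNOSIS',
              f.expr("filter(TRANSPOSED_DIAGNOSIS, x -> x is not null and trim(x) != '')")) \
  .withColumn('TRANSPOSED_DIAGNOSIS', f.explode('TRANSPOSED_DIAGNOSIS')) \
  .show(30, truncate=False)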
