Convert Rows to Columns in PYSPARK

I am working on a project that requires data to be transposed. In the past, I did this using SAS and SQL, which used to be super fast. I used the expr function with stack as outlined below (code section).

The problem I am facing is two-fold.

  1. The input data is about 200 GB (500 million rows by 70 columns) and is stored as parquet files.
  2. The step that transposes (df2) runs for about 4-5 hours and then terminates. I have changed the timeout settings and played around with the Spark session settings, but no luck so far.

What I did so far: The data is stored as parquet files in Azure Synapse Workspace. First, I assigned a ROWNUMBER to each row in the data frame. Then I split the data into two data frames.

  1. df1 has ROWNUMBER and all the necessary columns (minus the 25 diagnosis columns).
  2. df2 has ROWNUMBER and the 25 diagnosis columns.
  3. I then tried to create df3 by joining df1 and df2 on ROWNUMBER.

Step 2 is the killer; I was not able to get past it, as the session terminates after 4 hours.

I tried Spark SQL as well, but no luck there either. Further, I was advised not to use SQL in Spark as it would hurt performance.

I am also thinking of doing the transpose outside of PYSPARK (not sure how, or if it is even advisable to do so).

Code I wrote so far:

import sys
import pyspark.sql as t
import pyspark.sql.functions as f
from pyspark.sql.types import *

df_raw=spark.read.parquet("abfss:path/med_claims/*.parquet")
df_rn=df_raw.withColumn("ROWNUM", f.row_number().over(t.Window.orderBy(df_raw.MEMBER_ID, df_raw.SERVICE_FROM_DATE, df_raw.SERVICE_THRU_DATE)))

df1=df_rn.select(
                 df_rn.ROWNUM,
                 df_rn.MEMBER_ID,
                 df_rn.MEMBER_ID_DEPENDENT,
                 df_rn.SERVICE_FROM_DATE,
                 df_rn.SERVICE_THRU_DATE,
                 df_rn.SERVICE_PROCEDURE_CODE
                )

df2=df_rn.select(df_rn.ROWNUM,
             f.expr("stack(25, code1, code2, code3, code4, code5, \
                             code6, code7, code8, code9, code10, \
                             code11, code12, code13, code14, code15, \
                             code16, code17, code18, code19, code20, \
                             code21, code22, code23, code24, code25) as (TRANPOSED_DIAG)")) \
             .dropDuplicates() \
             .where(" (TRANPOSED_DIAG IS NOT NULL) OR (TRIM(TRANPOSED_DIAG) <> '') ")

df3=df1.join(df2, df1.ROWNUM == df2.ROWNUM, 'left') \
       .select(df1.ROWNUM,
             df1.MEMBER_ID,
             df1.MEMBER_ID_DEPENDENT,
             df1.SERVICE_FROM_DATE,
             df1.SERVICE_THRU_DATE,
             df1.SERVICE_PROCEDURE_CODE,
             df2.TRANPOSED_DIAG
            )

Input Data:

| MEMBER_ID | MEMBER_ID_DEPENDENT | PROVIDER_KEY | REVENUE_KEY | PLACE_OF_SERVICE_KEY | SERVICE_FROM_DATE | SERVICE_THRU_DATE | SERVICE_PROCEDURE_CODE | CODE1 | CODE2 | CODE3 | CODE4 | CODE5 | CODE6 | CODE7 | CODE8..CODE25 (all null in this sample) |
| A1 | A11 | AB05547 | 4.85148E+12 | 7.96651E+11 | 9/23/2019 0:00 | 9/23/2019 0:00 | 89240 | Z0000 | M25852 | M25851 | Z0000 | M25551 | null | null | null |
| A1 | A11 | AB92685 | 4.85148E+12 | 7.96651E+11 | 10/23/2020 0:00 | 10/23/2020 0:00 | 89240 | Z524 | Z524 | null | null | null | null | null | null |
| A2 | A12 | AB64081 | 4.8515E+12 | 7.96651E+11 | 6/19/2020 0:00 | 6/19/2020 0:00 | 76499 | Z9884 | R109 | K219 | K449 | Z9884 | null | null | null |
| A3 | A13 | AB64081 | 4.8515E+12 | 7.96651E+11 | 9/13/2019 0:00 | 9/13/2019 0:00 | 76499 | Z1231 | Z1231 | null | null | null | null | null | null |
| A4 | A14 | AB74417 | 4.8515E+12 | 7.96651E+11 | 9/30/2019 0:00 | 9/30/2019 0:00 | 76499 | N210 | N400 | E782 | E119 | I10 | Z87891 | N210 | null |

Expected Output:

| MEMBER_ID | MEMBER_ID_DEPENDENT | PROVIDER_KEY | REVENUE_KEY | PLACE_OF_SERVICE_KEY | SERVICE_FROM_DATE | SERVICE_THRU_DATE | SERVICE_PROCEDURE_CODE | TRANSPOSED_DIAGNOSIS |
| A1 | A11 | AB05547 | 4851484842551 | 796650504854 | 9/23/2019 0:00 | 9/23/2019 0:00 | 89240 | Z0000 |
| A1 | A11 | AB05548 | 4851484842551 | 796650504854 | 9/23/2019 0:00 | 9/23/2019 0:00 | 89241 | M25852 |
| A1 | A11 | AB05549 | 4851484842551 | 796650504854 | 9/23/2019 0:00 | 9/23/2019 0:00 | 89242 | M25851 |
| A1 | A11 | AB05550 | 4851484842551 | 796650504854 | 9/23/2019 0:00 | 9/23/2019 0:00 | 89243 | M25551 |
| A1 | A11 | AB92685 | 4851484842551 | 796650504854 | 10/23/2020 0:00 | 10/23/2020 0:00 | 89240 | Z524 |
| A2 | A12 | AB64081 | 4851504842551 | 796650504854 | 6/19/2020 0:00 | 6/19/2020 0:00 | 76499 | Z9884 |
| A2 | A12 | AB64082 | 4851504842551 | 796650504854 | 6/19/2020 0:00 | 6/19/2020 0:00 | 76500 | R109 |
| A2 | A12 | AB64083 | 4851504842551 | 796650504854 | 6/19/2020 0:00 | 6/19/2020 0:00 | 76501 | K219 |
| A2 | A12 | AB64084 | 4851504842551 | 796650504854 | 6/19/2020 0:00 | 6/19/2020 0:00 | 76502 | K449 |
| A3 | A13 | AB64081 | 4851504842551 | 796650504854 | 9/13/2019 0:00 | 9/13/2019 0:00 | 76499 | Z1231 |
| A4 | A14 | AB74417 | 4851504842551 | 796650504854 | 9/30/2019 0:00 | 9/30/2019 0:00 | 76499 | N210 |
| A4 | A14 | AB74418 | 4851504842551 | 796650504854 | 9/30/2019 0:00 | 9/30/2019 0:00 | 76500 | N400 |
| A4 | A14 | AB74419 | 4851504842551 | 796650504854 | 9/30/2019 0:00 | 9/30/2019 0:00 | 76501 | E782 |
| A4 | A14 | AB74420 | 4851504842551 | 796650504854 | 9/30/2019 0:00 | 9/30/2019 0:00 | 76502 | E119 |
| A4 | A14 | AB74421 | 4851504842551 | 796650504854 | 9/30/2019 0:00 | 9/30/2019 0:00 | 76503 | I10 |
| A4 | A14 | AB74422 | 4851504842551 | 796650504854 | 9/30/2019 0:00 | 9/30/2019 0:00 | 76504 | Z87891 |

This will be an expensive operation with any approach; however, you may consider the following approaches, which avoid another expensive join.

For simplification and code re-use, I've pulled the desired columns and the code-related columns into separate variables instead of hardcoding them.

Approach 1: Recommended

Continuing from df_raw's first load, you may try the following:

from pyspark.sql import functions as F
from pyspark.sql import Window

# extract service procedure code columns from `df_raw` by looking for the simple pattern 'CODE'.
# This filter can be easily modified for more complex code column names
service_procedure_cols = [col for col in df_raw.columns if 'CODE' in col and 'SERVICE' not in col]

# extract the desired column names in the dataframe
desired_cols = [col for col in df_raw.columns if 'CODE' not in col or 'SERVICE' in col]
# build the stack expression by counting the number of columns with `len` and concatenating the column names
code_column_stack_expression = "stack("+str(len(service_procedure_cols))+", "+",".join(service_procedure_cols)+") as (TRANSPOSED_DIAGNOSIS)"
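# for the sample schema (25 diagnosis columns named CODE1..CODE25), the expression above
# evaluates to a string of the form:
#   "stack(25, CODE1,CODE2,...,CODE25) as (TRANSPOSED_DIAGNOSIS)"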

df_step_1 = (
    # select the desired column names and unpivot the data
    df_raw.select(desired_cols + [ F.expr(code_column_stack_expression)])
    # filter out null and empty diagnosis values
          .where(F.col("TRANSPOSED_DIAGNOSIS").isNotNull() & (F.trim("TRANSPOSED_DIAGNOSIS") != '' ))
    # remove duplicates
          .dropDuplicates()
)

df_step_1.show(truncate=False)

Outputs:

+---------+-------------------+------------+-----------+--------------------+-----------------+-----------------+----------------------+--------------------+
|MEMBER_ID|MEMBER_ID_DEPENDENT|PROVIDER_KEY|REVENUE_KEY|PLACE_OF_SERVICE_KEY|SERVICE_FROM_DATE|SERVICE_THRU_DATE|SERVICE_PROCEDURE_CODE|TRANSPOSED_DIAGNOSIS|
+---------+-------------------+------------+-----------+--------------------+-----------------+-----------------+----------------------+--------------------+
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89240                 |Z0000               |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89240                 |M25852              |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89240                 |M25851              |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89240                 |Z0000               |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89240                 |M25551              |
|A1       |A11                |AB92685     |4.85148E+12|7.96651E+11         |10/23/2020 0:00  |10/23/2020 0:00  |89240                 |Z524                |
|A1       |A11                |AB92685     |4.85148E+12|7.96651E+11         |10/23/2020 0:00  |10/23/2020 0:00  |89240                 |Z524                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76499                 |Z9884               |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76499                 |R109                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76499                 |K219                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76499                 |K449                |
+---------+-------------------+------------+-----------+--------------------+-----------------+-----------------+----------------------+--------------------+

Next, regenerate the SERVICE_PROCEDURE_CODE values from the unpivoted rows:

df_step_2 = (
    # Replace the existing `SERVICE_PROCEDURE_CODE` column by casting it to an integer and
    # adding the generated row number (minus 1), partitioned by the desired columns and
    # ordered by the columns specified in the question
    df_step_1.withColumn(
        "SERVICE_PROCEDURE_CODE",
        F.col("SERVICE_PROCEDURE_CODE").cast("INT") + F.row_number().over(
            Window.partitionBy(desired_cols).orderBy("MEMBER_ID", "SERVICE_FROM_DATE", "SERVICE_THRU_DATE")
        ) - 1
    )
)
df_step_2.show(truncate=False)

Outputs:

+---------+-------------------+------------+-----------+--------------------+-----------------+-----------------+----------------------+--------------------+
|MEMBER_ID|MEMBER_ID_DEPENDENT|PROVIDER_KEY|REVENUE_KEY|PLACE_OF_SERVICE_KEY|SERVICE_FROM_DATE|SERVICE_THRU_DATE|SERVICE_PROCEDURE_CODE|TRANSPOSED_DIAGNOSIS|
+---------+-------------------+------------+-----------+--------------------+-----------------+-----------------+----------------------+--------------------+
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89240                 |Z0000               |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89241                 |M25852              |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89242                 |M25851              |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89243                 |Z0000               |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89244                 |M25551              |
|A1       |A11                |AB92685     |4.85148E+12|7.96651E+11         |10/23/2020 0:00  |10/23/2020 0:00  |89240                 |Z524                |
|A1       |A11                |AB92685     |4.85148E+12|7.96651E+11         |10/23/2020 0:00  |10/23/2020 0:00  |89241                 |Z524                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76499                 |Z9884               |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76500                 |R109                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76501                 |K219                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76502                 |K449                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76503                 |Z9884               |
|A3       |A13                |AB64081     |4.8515E+12 |7.96651E+11         |9/13/2019 0:00   |9/13/2019 0:00   |76499                 |Z1231               |
|A3       |A13                |AB64081     |4.8515E+12 |7.96651E+11         |9/13/2019 0:00   |9/13/2019 0:00   |76500                 |Z1231               |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76499                 |N210                |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76500                 |N400                |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76501                 |E782                |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76502                 |E119                |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76503                 |I10                 |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76504                 |Z87891              |
+---------+-------------------+------------+-----------+--------------------+-----------------+-----------------+----------------------+--------------------+
only showing top 20 rows
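
If the reshaped result needs to be persisted rather than just shown, a minimal sketch would be the following (the output container and path are placeholders, not from the question):

# write the reshaped data back to ADLS as parquet; replace the path with your own
(
    df_step_2
        .write
        .mode("overwrite")
        .parquet("abfss://<container>@<account>.dfs.core.windows.net/med_claims_transposed/")
)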

Approach 2: Uses the original code number to update the service code

This approach may also be easier for some to read, as it uses a loop to build a union of the desired dataset.

NB: This may cause overlaps in your service procedure codes.

Continuing from df_raw's first load, you may try the following:

from pyspark.sql import functions as F
from pyspark.sql import Window

# cache the original df
df_raw.cache()

# extract service procedure code columns from `df_raw` by looking for the simple pattern 'CODE'.
# This filter can be easily modified for more complex code column names
service_procedure_cols = [col for col in df_raw.columns if 'CODE' in col and 'SERVICE' not in col]

# extract the desired column names in the dataframe
desired_cols = [col for col in df_raw.columns if 'CODE' not in col or 'SERVICE' in col]

# use a temp variable `df_combined` to store the final dataframe
df_combined = None
# for each of the service procedure columns
for col in service_procedure_cols:
    # extract the code number
    col_num = int(col.replace("CODE",""))
    # combine the desired columns with this code column to get all desired columns for the diagnosis
    diagnosis_desired_columns = desired_cols + [col]
    # creating a temporary df
    interim_df = (
    # select all desired columns
        df_raw.select(*diagnosis_desired_columns)
    # update the service procedure code with the extracted code number
              .withColumn(
                  "SERVICE_PROCEDURE_CODE",
                  F.col("SERVICE_PROCEDURE_CODE").cast("INT")+col_num
              )
     # rename the code column
              .withColumnRenamed(col,"TRANSPOSED_DIAGNOSIS")
     # filter out null and empty diagnosis values
              .where(F.col("TRANSPOSED_DIAGNOSIS").isNotNull() & (F.trim("TRANSPOSED_DIAGNOSIS") !=''))
              .dropDuplicates()
    )
    # if the initial combined df variable is empty assign it `interim_df`
    # otherwise perform a union and store the result
    if df_combined is None:
        df_combined = interim_df 
    else:
        df_combined = df_combined.union(interim_df)

# only here for debugging purposes to show the results
df_combined.orderBy(desired_cols).show(truncate=False)

Outputs:

+---------+-------------------+------------+-----------+--------------------+-----------------+-----------------+----------------------+--------------------+
|MEMBER_ID|MEMBER_ID_DEPENDENT|PROVIDER_KEY|REVENUE_KEY|PLACE_OF_SERVICE_KEY|SERVICE_FROM_DATE|SERVICE_THRU_DATE|SERVICE_PROCEDURE_CODE|TRANSPOSED_DIAGNOSIS|
+---------+-------------------+------------+-----------+--------------------+-----------------+-----------------+----------------------+--------------------+
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89241                 |Z0000               |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89242                 |M25852              |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89243                 |M25851              |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89244                 |Z0000               |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89245                 |M25551              |
|A1       |A11                |AB92685     |4.85148E+12|7.96651E+11         |10/23/2020 0:00  |10/23/2020 0:00  |89241                 |Z524                |
|A1       |A11                |AB92685     |4.85148E+12|7.96651E+11         |10/23/2020 0:00  |10/23/2020 0:00  |89242                 |Z524                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76500                 |Z9884               |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76501                 |R109                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76502                 |K219                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76503                 |K449                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76504                 |Z9884               |
|A3       |A13                |AB64081     |4.8515E+12 |7.96651E+11         |9/13/2019 0:00   |9/13/2019 0:00   |76500                 |Z1231               |
|A3       |A13                |AB64081     |4.8515E+12 |7.96651E+11         |9/13/2019 0:00   |9/13/2019 0:00   |76501                 |Z1231               |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76500                 |N210                |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76501                 |N400                |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76502                 |E782                |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76503                 |E119                |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76504                 |I10                 |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76505                 |Z87891              |
+---------+-------------------+------------+-----------+--------------------+-----------------+-----------------+----------------------+--------------------+
only showing top 20 rows
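
As a small variant (a sketch, not part of the original answer), the `df_combined is None` bookkeeping can be replaced by collecting the per-code frames in a list and combining them with functools.reduce and unionByName:

from functools import reduce
from pyspark.sql import functions as F

# build one interim frame per diagnosis column, then union them all in one pass
interim_dfs = []
for col in service_procedure_cols:
    col_num = int(col.replace("CODE", ""))
    interim_dfs.append(
        df_raw.select(*(desired_cols + [col]))
              # shift the service procedure code by the extracted code number
              .withColumn("SERVICE_PROCEDURE_CODE",
                          F.col("SERVICE_PROCEDURE_CODE").cast("INT") + col_num)
              # rename the diagnosis column and drop null / empty values
              .withColumnRenamed(col, "TRANSPOSED_DIAGNOSIS")
              .where(F.col("TRANSPOSED_DIAGNOSIS").isNotNull() &
                     (F.trim("TRANSPOSED_DIAGNOSIS") != ""))
              .dropDuplicates()
    )

df_combined = reduce(lambda left, right: left.unionByName(right), interim_dfs)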

Merge the code columns into an array, filter out the null values, then explode it.

from pyspark.sql import functions as f

# collect the diagnosis code columns
codes = list(filter(lambda c: c.startswith('CODE'), df.columns))

# pack the codes into a single array column, drop the original columns,
# strip nulls from the array, then explode it into one row per diagnosis
df.withColumn('TRANSPOSED_DIAGNOSIS', f.array(*map(lambda c: f.col(c), codes))) \
  .drop(*codes) \
  .withColumn('TRANSPOSED_DIAGNOSIS', f.expr('filter(TRANSPOSED_DIAGNOSIS, x -> x is not null)')) \
  .withColumn('TRANSPOSED_DIAGNOSIS', f.explode('TRANSPOSED_DIAGNOSIS')) \
  .show(30, truncate=False)

+---------+-------------------+------------+-----------+--------------------+-----------------+-----------------+----------------------+--------------------+
|MEMBER_ID|MEMBER_ID_DEPENDENT|PROVIDER_KEY|REVENUE_KEY|PLACE_OF_SERVICE_KEY|SERVICE_FROM_DATE|SERVICE_THRU_DATE|SERVICE_PROCEDURE_CODE|TRANSPOSED_DIAGNOSIS|
+---------+-------------------+------------+-----------+--------------------+-----------------+-----------------+----------------------+--------------------+
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89240                 |Z0000               |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89240                 |M25852              |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89240                 |M25851              |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89240                 |Z0000               |
|A1       |A11                |AB05547     |4.85148E+12|7.96651E+11         |9/23/2019 0:00   |9/23/2019 0:00   |89240                 |M25551              |
|A1       |A11                |AB92685     |4.85148E+12|7.96651E+11         |10/23/2020 0:00  |10/23/2020 0:00  |89240                 |Z524                |
|A1       |A11                |AB92685     |4.85148E+12|7.96651E+11         |10/23/2020 0:00  |10/23/2020 0:00  |89240                 |Z524                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76499                 |Z9884               |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76499                 |R109                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76499                 |K219                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76499                 |K449                |
|A2       |A12                |AB64081     |4.8515E+12 |7.96651E+11         |6/19/2020 0:00   |6/19/2020 0:00   |76499                 |Z9884               |
|A3       |A13                |AB64081     |4.8515E+12 |7.96651E+11         |9/13/2019 0:00   |9/13/2019 0:00   |76499                 |Z1231               |
|A3       |A13                |AB64081     |4.8515E+12 |7.96651E+11         |9/13/2019 0:00   |9/13/2019 0:00   |76499                 |Z1231               |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76499                 |N210                |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76499                 |N400                |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76499                 |E782                |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76499                 |E119                |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76499                 |I10                 |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76499                 |Z87891              |
|A4       |A14                |AB74417     |4.8515E+12 |7.96651E+11         |9/30/2019 0:00   |9/30/2019 0:00   |76499                 |N210                |
+---------+-------------------+------------+-----------+--------------------+-----------------+-----------------+----------------------+--------------------+
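
The question's where clause also drops empty strings, not only nulls. Assuming that behaviour is still wanted, the filter lambda can be extended; this is a sketch of the same chain with only the filter expression changed:

from pyspark.sql import functions as f

# same idea as above, but also dropping empty / whitespace-only codes,
# mirroring the TRIM(...) <> '' condition from the question
codes = [c for c in df.columns if c.startswith('CODE')]

df.withColumn('TRANSPOSED_DIAGNOSIS', f.array(*[f.col(c) for c in codes])) \
  .drop(*codes) \
  .withColumn('TRANSPOSED_DIAGNOSIS',
              f.expr("filter(TRANSPOSED_DIAGNOSIS, x -> x is not null and trim(x) != '')")) \
  .withColumn('TRANSPOSED_DIAGNOSIS', f.explode('TRANSPOSED_DIAGNOSIS')) \
  .show(30, truncate=False)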
