Split a delimited column into new columns in a PySpark DataFrame

Question

I need to split the values of a tilde-delimited (~) column into new columns dynamically. The input is a DataFrame and a list of column names. We are trying to solve this using Spark DataFrame functions. Please help.

Input:

|Raw_column_name|
|1~Ram~1000~US|
|2~john~2000~UK|
|3~Marry~7000~IND|

col_names = ['id', 'names', 'sal', 'country']

output:
id | names | sal | country
1 | Ram | 1000 | US
2 | john | 2000 | UK
3 | Marry | 7000 | IND

Answer 1

We can use split() and then use the elements of the resulting array to create the columns.

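from pyspark.sql import functions as func

# data_sdf is the input DataFrame holding the delimited column
data_sdf. \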
    withColumn('raw_col_split_arr', func.split('raw_column_name', '~')). \
    select(func.col('raw_col_split_arr').getItem(0).alias('id'),
           func.col('raw_col_split_arr').getItem(1).alias('name'),
           func.col('raw_col_split_arr').getItem(2).alias('sal'),
           func.col('raw_col_split_arr').getItem(3).alias('country')
           ). \
    show()

# +---+-----+----+-------+
# | id| name| sal|country|
# +---+-----+----+-------+
# |  1|  Ram|1000|     US|
# |  2| john|2000|     UK|
# |  3|Marry|7000|    IND|
# +---+-----+----+-------+

If the use case is extended to a dynamic list of columns, the select can be built from the list with enumerate():

col_names = ['id', 'names', 'sal', 'country']

data_sdf. \
    withColumn('raw_col_split_arr', func.split('raw_column_name', '~')). \
    select(*[func.col('raw_col_split_arr').getItem(i).alias(k) for i, k in enumerate(col_names)]). \
    show()

# +---+-----+----+-------+
# | id|names| sal|country|
# +---+-----+----+-------+
# |  1|  Ram|1000|     US|
# |  2| john|2000|     UK|
# |  3|Marry|7000|    IND|
# +---+-----+----+-------+
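
One thing to watch with this approach: the second argument of split() is treated as a regular expression, so a delimiter such as | or . would need escaping. Below is a minimal sketch of a reusable helper under that assumption. The names split_pattern and split_into_columns are made up for illustration, and it relies on Python's re.escape producing escapes that Spark's Java regex engine also accepts, which should hold for common punctuation delimiters.

```python
import re


def split_pattern(sep):
    """Return a regex pattern that matches the delimiter literally."""
    return re.escape(sep)


def split_into_columns(df, source_col, col_names, sep="~"):
    """Split `source_col` on `sep` and return one column per name in `col_names`.

    Hypothetical helper; `df` is assumed to be a PySpark DataFrame.
    """
    # Imported here so split_pattern() stays usable without a Spark install.
    from pyspark.sql import functions as func

    parts = func.split(func.col(source_col), split_pattern(sep))
    # The i-th array element becomes the i-th named column.
    return df.select(
        *[parts.getItem(i).alias(name) for i, name in enumerate(col_names)]
    )
```

It could then be called as split_into_columns(data_sdf, 'raw_column_name', col_names). Rows with fewer fields than names should get nulls in the trailing columns, since getItem() on a missing index returns null under the default (non-ANSI) settings.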

Answer 2

Another option is the from_csv() function. The only thing that needs to be defined is the schema:

from pyspark.sql.functions import from_csv, col

df = spark.createDataFrame([('1~Ram~1000~US',), ('2~john~2000~UK',), ('3~Marry~7000~IND',)], ["Raw_column_name"])
df.show()

schema = "id int, names string, sal string, country string"
options = {'sep': '~'}
df2 = (df
       .select(from_csv(col('Raw_column_name'), schema, options).alias('cols'))
       .select(col('cols.*'))
       )
df2.show()
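
Since the question asks for a dynamic column list, the schema string can also be generated from col_names instead of being written by hand. This is only a sketch: build_ddl_schema and parse_delimited are hypothetical names, every column defaults to string unless a type is supplied, and column names are assumed to be plain identifiers that need no backtick quoting. from_csv() requires Spark 3.0 or later.

```python
def build_ddl_schema(col_names, types=None):
    """Build a DDL schema string such as 'id int, names string'.

    `types` optionally maps a column name to a Spark SQL type;
    any column not listed there defaults to string.
    """
    types = types or {}
    return ", ".join(
        f"{name} {types.get(name, 'string')}" for name in col_names
    )


def parse_delimited(df, source_col, col_names, sep="~", types=None):
    """Parse `source_col` with from_csv() and expand the struct into columns.

    Hypothetical helper; `df` is assumed to be a PySpark DataFrame.
    """
    # Imported here so build_ddl_schema() can be used without a Spark install.
    from pyspark.sql.functions import col, from_csv

    schema = build_ddl_schema(col_names, types)
    parsed = from_csv(col(source_col), schema, {"sep": sep}).alias("cols")
    return df.select(parsed).select("cols.*")
```

For the example above, parse_delimited(df, 'Raw_column_name', col_names, types={'id': 'int'}) would use the same schema string as the hand-written one.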
