
Split delimited column into new columns in PySpark dataframe

I need to split the delimited (~) column values into new columns dynamically. The input is a dataframe and a list of column names. We are trying to solve this using Spark dataframe functions. Please help.

Input:

|Raw_column_name|
|1~Ram~1000~US|
|2~john~2000~UK|
|3~Marry~7000~IND|

col_names=[id,names,sal,country]

output:
id | names | sal | country
1 | Ram | 1000 | US
2 | john | 2000 | UK
3 | Marry | 7000 | IND

We can use split() and then use the elements of the resulting array to create columns.

from pyspark.sql import functions as func

# data_sdf is the input dataframe holding the '~'-delimited column
data_sdf. \
    withColumn('raw_col_split_arr', func.split('raw_column_name', '~')). \
    select(func.col('raw_col_split_arr').getItem(0).alias('id'),
           func.col('raw_col_split_arr').getItem(1).alias('name'),
           func.col('raw_col_split_arr').getItem(2).alias('sal'),
           func.col('raw_col_split_arr').getItem(3).alias('country')
           ). \
    show()

# +---+-----+----+-------+
# | id| name| sal|country|
# +---+-----+----+-------+
# |  1|  Ram|1000|     US|
# |  2| john|2000|     UK|
# |  3|Marry|7000|    IND|
# +---+-----+----+-------+

If the use case is extended to a dynamic list of columns, the select can be built from the list:

col_names = ['id', 'names', 'sal', 'country']

data_sdf. \
    withColumn('raw_col_split_arr', func.split('raw_column_name', '~')). \
    select(*[func.col('raw_col_split_arr').getItem(i).alias(k) for i, k in enumerate(col_names)]). \
    show()

# +---+-----+----+-------+
# | id|names| sal|country|
# +---+-----+----+-------+
# |  1|  Ram|1000|     US|
# |  2| john|2000|     UK|
# |  3|Marry|7000|    IND|
# +---+-----+----+-------+

Another option is the from_csv() function (available since Spark 3.0). The only thing that needs to be defined is the schema:

from pyspark.sql.functions import from_csv, col

df = spark.createDataFrame([('1~Ram~1000~US',), ('2~john~2000~UK',), ('3~Marry~7000~IND',)], ["Raw_column_name"])
df.show()

schema = "id int, names string, sal string, country string"
options = {'sep': '~'}
df2 = (df
       .select(from_csv(col('Raw_column_name'), schema, options).alias('cols'))
       .select(col('cols.*'))
       )
df2.show()
