
Convert a column value in a DataFrame to a list

I have the following source file. It contains the name "john", which I want to split into the list ['j','o','h','n']. The person file is shown below.

Source File:

id,name,class,start_data,end_date
1,john,xii,20170909,20210909

Code:

from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("PersonProcessing").getOrCreate()

    df = spark.read.csv('person.txt', header=True)
    nameList = [x['name'] for x in df.rdd.collect()]
    print(list(nameList))
    df.show()

if __name__ == '__main__':
    main()

Actual Output:

[u'john']

Desired Output:

['j','o','h','n']

If you want to do it in Python:

nameList = [c for x in df.rdd.collect() for c in x['name']]
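The same nested comprehension can be tried without a Spark session. The following is a minimal sketch, assuming the sample row from the question; the `SAMPLE` string, the `split_names` helper, and the use of the standard `csv` module in place of `spark.read.csv` are illustrative assumptions, not part of the original answer.

```python
import csv
import io

# Sample data copied from the question; io.StringIO stands in for person.txt.
SAMPLE = "id,name,class,start_data,end_date\n1,john,xii,20170909,20210909\n"


def split_names(rows):
    """Return one flat list holding every character of every 'name' value."""
    # Outer loop walks the rows, inner loop walks the characters of each name.
    return [ch for row in rows for ch in row["name"]]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
name_chars = split_names(rows)
print(name_chars)  # ['j', 'o', 'h', 'n']
```

With more than one row this produces a single flat list; `[list(row["name"]) for row in rows]` would keep one list per name instead.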

Or, if you want to do it in Spark:

from pyspark.sql import functions as F

df.withColumn('name', F.split(F.col('name'), '')).show()

Result:

+---+--------------+-----+----------+--------+
| id|          name|class|start_data|end_date|
+---+--------------+-----+----------+--------+
|  1|[j, o, h, n, ]|  xii|  20170909|20210909|
+---+--------------+-----+----------+--------+
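Note that the array in the result above ends with an empty string. If that value is collected to the driver, one way to drop the blank entry is a simple filter. This is a sketch in plain Python; the hard-coded `collected` list stands in for the collected column value and `drop_blanks` is a hypothetical helper name.

```python
# Stand-in for the array shown in the result table: [j, o, h, n, ].
collected = ['j', 'o', 'h', 'n', '']


def drop_blanks(chars):
    """Keep only non-empty strings, preserving their order."""
    return [c for c in chars if c != '']


cleaned = drop_blanks(collected)
print(cleaned)  # ['j', 'o', 'h', 'n']
```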

In plain Python, a string can be split into its characters directly:

nameList = [x for x in 'john']

.tolist() turns a pandas Series into a Python list, so first create a list from the data and then loop over it. This assumes df is a pandas DataFrame; a Spark DataFrame would first need to be converted, for example with df.toPandas().

namelist=df['name'].tolist()
for x in namelist:
    print(x)

If you are doing this in Spark with Scala (Spark 2.3.1 and Scala 2.11.8), the code below works. The split produces an extra record with a blank name, so that record is filtered out.

// Assumes a SparkSession named `spark` is in scope (as in spark-shell).
import org.apache.spark.sql.functions._
import spark.implicits._

val classDF = spark.sparkContext.parallelize(Seq((1, "John", "Xii", "20170909", "20210909")))
  .toDF("ID", "Name", "Class", "Start_Date", "End_Date")

classDF.withColumn("Name", explode((split(trim(col("Name")), ""))))
  .withColumn("Start_Date", to_date(col("Start_Date"), "yyyyMMdd"))
  .withColumn("End_Date", to_date(col("End_Date"), "yyyyMMdd"))
  .filter(col("Name") =!= "") // drop the blank record left over from the split
  .show()
