Get field values from a StructType in pyspark dataframe

I have to get the schema from a CSV file (the column names and data types). This is how far I have gotten:

from pyspark.sql import Row

l = [('Alice', 1)]
Person = Row('name', 'age')
rdd = sc.parallelize(l)  # sc is the SparkContext, e.g. spark.sparkContext
person = rdd.map(lambda r: Person(*r))
df2 = spark.createDataFrame(person)
print(df2.schema)
#StructType(List(StructField(name,StringType,true),StructField(age,LongType,true)))

I want to extract the values name and age along with StringType and LongType; however, I don't see any method on the struct type.

There's a toDDL method on struct type in Scala, but the same is not available in Python.
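Since toDDL is not exposed in PySpark here, one workaround is to assemble a DDL-like string yourself from the output of dtypes, which is a plain Python list of (name, type) tuples. A minimal sketch, where dtypes_list is a stand-in for what df2.dtypes would return:

```python
# Build a DDL-style schema string from the (column name, type string)
# pairs that df2.dtypes returns.
dtypes_list = [('name', 'string'), ('age', 'bigint')]

ddl = ', '.join('{} {}'.format(name, dtype) for name, dtype in dtypes_list)
print(ddl)
# name string, age bigint
```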

This is an extension of a question where I already got help, but I wanted to create a new thread - Get dataframe schema load to metadata table

Thanks for the reply, I am updating the full code:

import pyspark  # only run after findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql import Row

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.sql.catalogImplementation", "in-memory") \
    .getOrCreate()

l = [('Alice', 1)]
Person = Row('name', 'age')
rdd = spark.sparkContext.parallelize(l)  # use the session's context; `sc` is not defined here
person = rdd.map(lambda r: Person(*r))
df2 = spark.createDataFrame(person)
df3 = df2.dtypes  # list of (column name, type string) tuples
df1 = spark.createDataFrame(df3, ['colname', 'datatype'])
df1.show()
df1.createOrReplaceTempView("test")
spark.sql('''select * from test''').show()

Output

+-------+--------+
|colname|datatype|
+-------+--------+
|   name|  string|
|    age|  bigint|
+-------+--------+

+-------+--------+
|colname|datatype|
+-------+--------+
|   name|  string|
|    age|  bigint|
+-------+--------+

IIUC, you can loop over the values in df2.schema.fields and get the name and dataType:

print([(x.name, x.dataType) for x in df2.schema.fields])
#[('name', StringType), ('age', LongType)]

There is also dtypes:

print(df2.dtypes)
#[('name', 'string'), ('age', 'bigint')]

and you may also be interested in printSchema():

df2.printSchema()
#root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)
