Get field values from a StructType in pyspark dataframe

I have to get the schema from a CSV file (the column names and data types). This is how far I have gotten:

from pyspark.sql import Row

l = [('Alice', 1)]
Person = Row('name', 'age')
rdd = sc.parallelize(l)  # sc is the SparkContext, e.g. spark.sparkContext
person = rdd.map(lambda r: Person(*r))
df2 = spark.createDataFrame(person)
print(df2.schema)
#StructType(List(StructField(name,StringType,true),StructField(age,LongType,true)))

I want to extract the values name and age along with StringType and LongType; however, I don't see any method on the struct type.

There's a toDDL method on struct type in Scala, but the same is not available in Python.
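Since toDDL is not exposed in PySpark here, one workaround is to assemble a DDL-like string yourself from the output of dtypes, which is a plain Python list of (name, type) tuples. A minimal sketch, where dtypes_list is a stand-in for what df2.dtypes would return:

```python
# Build a DDL-style schema string from the (column name, type string)
# pairs that df2.dtypes returns.
dtypes_list = [('name', 'string'), ('age', 'bigint')]

ddl = ', '.join('{} {}'.format(name, dtype) for name, dtype in dtypes_list)
print(ddl)
# name string, age bigint
```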

This is an extension of a question where I already got help, but I wanted to create a new thread - Get dataframe schema load to metadata table

Thanks for the reply, I am updating the full code:

import pyspark  # only run after findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql import Row

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.sql.catalogImplementation", "in-memory") \
    .getOrCreate()

l = [('Alice', 1)]
Person = Row('name', 'age')
rdd = spark.sparkContext.parallelize(l)  # use the session's context; `sc` is not defined here
person = rdd.map(lambda r: Person(*r))
df2 = spark.createDataFrame(person)
df3 = df2.dtypes  # list of (column name, type string) tuples
df1 = spark.createDataFrame(df3, ['colname', 'datatype'])
df1.show()
df1.createOrReplaceTempView("test")
spark.sql('''select * from test''').show()

Output

+-------+--------+
|colname|datatype|
+-------+--------+
|   name|  string|
|    age|  bigint|
+-------+--------+

+-------+--------+
|colname|datatype|
+-------+--------+
|   name|  string|
|    age|  bigint|
+-------+--------+

IIUC, you can loop over the values in df2.schema.fields and get the name and dataType:

print([(x.name, x.dataType) for x in df2.schema.fields])
#[('name', StringType), ('age', LongType)]

There is also dtypes:

print(df2.dtypes)
#[('name', 'string'), ('age', 'bigint')]

and you may also be interested in printSchema():

df2.printSchema()
#root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)
