
Building a StructType from a dataframe in pyspark

I am new to Spark and Python and am facing the difficulty of building a schema from a metadata file that can be applied to my data file. Scenario: the metadata file for the data file (CSV format) contains the columns and their types, for example:

id,int,10,"","",id,"","",TRUE,"",0
created_at,timestamp,"","","",created_at,"","",FALSE,"",0

I have successfully converted this to a dataframe that looks like:

+--------------------+---------------+
|                name|           type|
+--------------------+---------------+
|                  id|  IntegerType()|
|          created_at|TimestampType()|
|          updated_at|   StringType()|

But when I try to convert this to a StructField format using this:

fields = schemaLoansNew.map(lambda l:([StructField(l.name, l.type, 'true')]))

OR

schemaList = schemaLoansNew.map(lambda l: ("StructField(" + l.name + "," + l.type + ",true)")).collect()

And then later convert it to StructType, using:

schemaFinal = StructType(schemaList)

I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/mapr/spark/spark-1.4.1/python/pyspark/sql/types.py", line 372, in __init__
    assert all(isinstance(f, DataType) for f in fields), "fields should be a list of DataType"
AssertionError: fields should be a list of DataType

I am stuck on this due to my lack of knowledge of DataFrames; can you please advise how to proceed? Once I have the schema ready I want to use createDataFrame to apply it to my data file. This process has to be done for many tables, so I do not want to hardcode the types but rather use the metadata file to build the schema and then apply it to the RDD.

Thanks in advance.

The fields argument has to be a list of DataType objects. This:

.map(lambda l:([StructField(l.name, l.type, 'true')]))

generates, after collect, a list of lists of tuples (Rows) of DataType (list[list[tuple[DataType]]]), not to mention that the nullable argument should be a boolean, not a string.

Your second attempt:

.map(lambda l: ("StructField(" + l.name + "," + l.type + ",true)"))

generates, after collect, a list of str objects.

The correct schema for the record you've shown should look more or less like this:

from pyspark.sql.types import *

StructType([
    StructField("id", IntegerType(), True),
    StructField("created_at", TimestampType(), True),
    StructField("updated_at", StringType(), True)
])
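
For completeness, a minimal sketch of applying that schema when reading the data file (this assumes Spark 2.x+ with a SparkSession named spark and a hypothetical path loans.csv; on Spark 1.x, sqlContext.createDataFrame over an RDD of parsed rows serves the same purpose):

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("created_at", TimestampType(), True),
    StructField("updated_at", StringType(), True)
])

df = spark.read.csv("loans.csv", schema=schema)
df.printSchema()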

Although using distributed data structures for a task like this is serious overkill, not to mention inefficient, you can try to adjust your first solution as follows:

StructType([
    StructField(name, eval(type), True) for (name, type) in df.rdd.collect()
])

but it is not particularly safe (eval). It could be easier to build a schema from JSON / a dictionary. Assuming you have a function which maps from a type description to a canonical type name:

def get_type_name(s: str) -> str:
    """
    >>> get_type_name("int")
    'integer'
    """
    _map = {
        'int': IntegerType().typeName(),
        'timestamp': TimestampType().typeName(),
        # ...
    } 
    return _map.get(s, StringType().typeName())

You can build a dictionary of the following shape:

schema_dict = {'fields': [
    {'metadata': {}, 'name': 'id', 'nullable': True, 'type': 'integer'},
    {'metadata': {}, 'name': 'created_at', 'nullable': True, 'type': 'timestamp'}
], 'type': 'struct'}

and feed it to StructType.fromJson:

StructType.fromJson(schema_dict)
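
Putting the pieces together, here is a sketch of building that dictionary programmatically from the collected metadata rows (it assumes the metadata dataframe df has name and type columns holding the raw type strings from the file, such as "int" or "timestamp"):

schema_dict = {
    'type': 'struct',
    'fields': [
        {'metadata': {},
         'name': row['name'],
         'nullable': True,
         'type': get_type_name(row['type'])}
        for row in df.rdd.collect()
    ]
}
schema = StructType.fromJson(schema_dict)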

The steps below can be followed to build a schema from DataType objects and apply it when reading a file:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

data_schema = [
    StructField("age", IntegerType(), True),
    StructField("name", StringType(), True)
]

final_struct = StructType(fields=data_schema)

df = spark.read.json('/home/abcde/Python-and-Spark-for-Big-Data-master/Spark_DataFrames/people.json', schema=final_struct)

df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)

As a Scala aside, to reorder a dataframe's columns to match another dataframe's column order:

val columns: Array[String] = df1.columns
val reorderedColumnNames: Array[String] = df2.columns // or do the reordering you want
val result: DataFrame = df1.select(reorderedColumnNames.head, reorderedColumnNames.tail: _*)
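
A PySpark equivalent of that Scala snippet might look like this (a sketch, assuming two dataframes df1 and df2 sharing the same column names):

reordered_columns = df2.columns  # or do the reordering you want
result = df1.select(*reordered_columns)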

