Build a Spark schema from a JSON schema

Question

I am trying to build a Spark schema that I want to supply explicitly when creating the DataFrame. I can generate the JSON schema as follows:

from pyspark.sql.types import StructType
# Save schema from the original DataFrame into json:
schema_json = df.schema.json()

which gives me

{"fields":[{"metadata":{},"name":"cloud_events_version","nullable":true,"type":"string"},{"metadata":{},"name":"data","nullable":true,"type":{"fields":[{"metadata":{},"name":"email","nullable":true,"type":"string"},{"metadata":{},"name":"member_role","nullable":true,"type":"string"},{"metadata":{},"name":"reg_source_product","nullable":true,"type":"string"},{"metadata":{},"name":"school_type","nullable":true,"type":"string"},{"metadata":{},"name":"year_in_college","nullable":true,"type":"long"}],"type":"struct"}},{"metadata":{},"name":"event_time","nullable":true,"type":"string"},{"metadata":{},"name":"event_type","nullable":true,"type":"string"},{"metadata":{},"name":"event_type_version","nullable":true,"type":"string"},{"metadata":{},"name":"event_validated_ts","nullable":true,"type":"string"},{"metadata":{},"name":"event_validation_status","nullable":true,"type":"string"},{"metadata":{},"name":"extensions","nullable":true,"type":{"fields":[{"metadata":{},"name":"client_common","nullable":true,"type":{"fields":[{"metadata":{},"name":"adobe_mcid","nullable":true,"type":"string"},{"metadata":{},"name":"adobe_sdid","nullable":true,"type":"string"},{"metadata":{},"name":"auth_state","nullable":true,"type":"string"},{"metadata":{},"name":"uuid","nullable":true,"type":"string"},{"metadata":{},"name":"client_experiments","nullable":true,"type":"string"},{"metadata":{},"name":"client_ip_address","nullable":true,"type":"string"},{"metadata":{},"name":"device_id","nullable":true,"type":"string"},{"metadata":{},"name":"page_name","nullable":true,"type":"string"},{"metadata":{},"name":"referral_url","nullable":true,"type":"string"},{"metadata":{},"name":"url","nullable":true,"type":"string"},{"metadata":{},"name":"user_agent","nullable":true,"type":"string"},{"metadata":{},"name":"uvn","nullable":true,"type":"string"}],"type":"struct"}}],"type":"struct"}},{"metadata":{},"name":"source","nullable":true,"type":"string"},{"metadata":{},"name":"validated_message","nullable":true,
"type":"string"},{"metadata":{},"name":"year","nullable":true,"type":"integer"},{"metadata":{},"name":"mon","nullable":true,"type":"integer"},{"metadata":{},"name":"day","nullable":true,"type":"integer"},{"metadata":{},"name":"hour","nullable":true,"type":"integer"}],"type":"struct"}

But for this I need to parse the DataFrame first, which takes a while and which I am trying to avoid.

One thing I can do is fetch the desired schema from a catalog that we have internally, which gives something like:

[{u'Name': u'cloud_events_version', u'Type': u'string'},
 {u'Name': u'event_type', u'Type': u'string'},
 {u'Name': u'event_time', u'Type': u'string'},
 {u'Name': u'data', u'Type': u'struct<school_type:string,reg_source_product:string,member_role:string,email:string,year_in_college:int>'},
 {u'Name': u'source', u'Type': u'string'},
 {u'Name': u'extensions', u'Type': u'struct<client_common:struct<auth_state:string,client_ip_address:string,client_experiments:string,uvn:string,device_id:string,adobe_sdid:string,url:string,page_name:string,user_agent:string,uuid:string,adobe_mcid:string,referral_url:string>>'},
 {u'Name': u'event_type_version', u'Type': u'string'},
 {u'Name': u'event_validation_status', u'Type': u'string'},
 {u'Name': u'event_validated_ts', u'Type': u'string'},
 {u'Name': u'validated_message', u'Type': u'string'}]

I am trying to write a recursive Python function that generates the JSON above. The logic is to loop through this list of dicts and, when the type is a simple one such as string, assign the name and type to this dictionary:

{"metadata" : {},"name" : columnName,"nullable" : True,"type" : columnType}

But when the type is a struct, it should create a list of dictionaries for all the elements of the struct, assign that list to type, and repeat this recursively until it finds no more structs.

All I can muster is:

def structRecursive(columnName,columnType):
    if "struct" not in columnType:
        ColumnDict = {"metadata" : {},"name" : columnName,"nullable" : True,"type" : columnType}
    else:
        structColumnList = []
        structColumnDict = {
            'metadata': {},
            'name': columnName,
            'nullable': True,
            'type': {'fields': structColumnList, 'type': 'struct'}
        }
        if columnType.count('struct<')==1:
            structCol = columnName
            structColList = columnType.encode('utf-8').replace('struct<',
                    '').replace('>', '').split(',')
            for item in structColList:
                fieldName = item.split(':')[0]
                dataType = item.split(':')[1]
                nodeDict = {}
                nodeDict['metadata'] = {}
                nodeDict['name'] = '{}'.format(fieldName)
                nodeDict['nullable'] = True
                nodeDict['type'] = '{}'.format(dataType)
                structColumnList.append(nodeDict)
        else:
            columnName = columnType.replace('struct<','',1).replace('>','').split(':')[0]
            columnType = columnType.split("{}:".format(columnName),1)[1].replace('>','',1)
        return structColumnDict

MainStructList = []
MainStructDict = {'fields': MainStructList, 'type': 'struct'}
for item in ListOfDict :
    columnName = item['Name'].encode('utf-8')
    columnType = item['Type'].encode('utf-8')
    MainStructList.append(structRecursive(columnName,columnType))

Of course this does not give the desired result. I would love to get some suggestions here.

Answer 1

If I understand your question correctly, you would like to parse the list of columns and convert it into a dictionary describing the schema, including the complex types. The hard part is parsing the string that represents a complex type. First, we need a function that extracts a struct entry from a column definition:

def extract_struct(text):
    stop = 7
    flag = 1
    for c in text[7:]:
        stop += 1
        if c == "<":
            flag += 1
        if c == ">":
            flag -= 1
        if flag == 0:
            return text[:stop], text[stop:]

This returns the extracted struct and the remaining text of the column definition. For example,

extract_struct("struct<a:int,b:double>,c:string")

will return

("struct<a:int,b:double>", ",c:string")

(the remainder keeps its leading comma, which parse drops when it moves on to the next entry).
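To see the bracket counting at work on a nested definition, here is a small sketch. It restates the function above with an explicit depth counter; the ValueError for unbalanced input is my own addition, not part of the original function.

```python
def extract_struct(text):
    """Return (leading struct definition, remaining text).

    Assumes text starts with 'struct<'.
    """
    depth = 1  # the '<' of the leading 'struct<' is already open
    for pos in range(len("struct<"), len(text)):
        if text[pos] == "<":
            depth += 1
        elif text[pos] == ">":
            depth -= 1
        if depth == 0:
            # pos is the '>' that closes the outermost struct
            return text[:pos + 1], text[pos + 1:]
    raise ValueError("unbalanced '<' in: " + text)


nested = "struct<a:struct<b:int,c:string>,d:double>,e:string"
struct_part, rest = extract_struct(nested)
print(struct_part)  # struct<a:struct<b:int,c:string>,d:double>
print(rest)         # ,e:string
```

The inner struct's closing bracket only brings the depth back to 1, so the scan continues until the outer struct is closed.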

Second, we need to walk through each column type and collect the definitions of the struct entries:

def parse(s, node):
    while s != '':
        # Strip column name
        col_name = s.partition(':')[0]
        s = s.partition(':')[2]

        # If column type is struct, parse it as well
        if s.startswith('struct'):
            col_type, s = extract_struct(s)
            node[col_name] = {}
            parse(col_type[7:-1], node[col_name])
        else:
            # Just add column definition
            col_type = s.partition(',')[0]
            node[col_name] = {
                "metadata": {},
                "name": col_name,
                "nullable": True,
                "type": col_type
            }

        # Go to next entry
        s = s.partition(',')[2]

If the column type is simple, the function above just adds a new column to the current node of the schema tree; otherwise it extracts the name and the struct and recursively goes through the struct's child entries. Now we only have to go through each column and parse it. Wrapping all of the above in one function:

def build(columns):
    def extract_struct(text):
        stop = 7
        flag = 1
        for c in text[7:]:
            stop += 1
            if c == '<':
                flag += 1
            if c == '>':
                flag -= 1
            if flag == 0:
                return text[:stop], text[stop:]

    def parse(s, node):
        while s != '':
            # Strip column name
            col_name = s.partition(':')[0]
            s = s.partition(':')[2]

            # If column type is struct, parse it as well
            if s.startswith('struct'):
                col_type, s = extract_struct(s)
                node[col_name] = {}
                parse(col_type[7:-1], node[col_name])
            else:
                # Just add column definition
                col_type = s.partition(',')[0]
                node[col_name] = {
                    "metadata": {},
                    "name": col_name,
                    "nullable": True,
                    "type": col_type
                }

            # Go to next entry
            s = s.partition(',')[2]

    schema = {}
    for column in columns:
        parse("{}:{}".format(column['Name'], column['Type']), schema)
    return schema

Now if you run it on your example list, you get the following dictionary (which is easy to convert into a list of columns, but the order doesn't really matter anyway):

{
  "cloud_events_version": {
    "nullable": true, 
    "type": "string", 
    "name": "cloud_events_version", 
    "metadata": {}
  }, 
  "event_type": {
    "nullable": true, 
    "type": "string", 
    "name": "event_type", 
    "metadata": {}
  }, 
  "event_time": {
    "nullable": true, 
    "type": "string", 
    "name": "event_time", 
    "metadata": {}
  }, 
  "event_validated_ts": {
    "nullable": true, 
    "type": "string", 
    "name": "event_validated_ts", 
    "metadata": {}
  }, 
  "event_type_version": {
    "nullable": true, 
    "type": "string", 
    "name": "event_type_version", 
    "metadata": {}
  }, 
  "source": {
    "nullable": true, 
    "type": "string", 
    "name": "source", 
    "metadata": {}
  }, 
  "extensions": {
    "client_common": {
      "adobe_sdid": {
        "nullable": true, 
        "type": "string", 
        "name": "adobe_sdid", 
        "metadata": {}
      }, 
      "auth_state": {
        "nullable": true, 
        "type": "string", 
        "name": "auth_state", 
        "metadata": {}
      }, 
      "client_ip_address": {
        "nullable": true, 
        "type": "string", 
        "name": "client_ip_address", 
        "metadata": {}
      }, 
      "url": {
        "nullable": true, 
        "type": "string", 
        "name": "url", 
        "metadata": {}
      }, 
      "client_experiments": {
        "nullable": true, 
        "type": "string", 
        "name": "client_experiments", 
        "metadata": {}
      }, 
      "referral_url": {
        "nullable": true, 
        "type": "string", 
        "name": "referral_url", 
        "metadata": {}
      }, 
      "page_name": {
        "nullable": true, 
        "type": "string", 
        "name": "page_name", 
        "metadata": {}
      }, 
      "user_agent": {
        "nullable": true, 
        "type": "string", 
        "name": "user_agent", 
        "metadata": {}
      }, 
      "uvn": {
        "nullable": true, 
        "type": "string", 
        "name": "uvn", 
        "metadata": {}
      },
      "uuid": {
        "nullable": true,
        "type": "string",
        "name": "uuid",
        "metadata": {}
        "metadata": {}
      }, 
      "adobe_mcid": {
        "nullable": true, 
        "type": "string", 
        "name": "adobe_mcid", 
        "metadata": {}
      }, 
      "device_id": {
        "nullable": true, 
        "type": "string", 
        "name": "device_id", 
        "metadata": {}
      }
    }
  }, 
  "validated_message": {
    "nullable": true, 
    "type": "string", 
    "name": "validated_message", 
    "metadata": {}
  }, 
  "event_validation_status": {
    "nullable": true, 
    "type": "string", 
    "name": "event_validation_status", 
    "metadata": {}
  }, 
  "data": {
    "school_type": {
      "nullable": true, 
      "type": "string", 
      "name": "school_type", 
      "metadata": {}
    }, 
    "reg_source_product": {
      "nullable": true, 
      "type": "string", 
      "name": "reg_source_product", 
      "metadata": {}
    }, 
    "member_role": {
      "nullable": true, 
      "type": "string", 
      "name": "member_role", 
      "metadata": {}
    }, 
    "email": {
      "nullable": true, 
      "type": "string", 
      "name": "email", 
      "metadata": {}
    }, 
    "year_in_college": {
      "nullable": true, 
      "type": "int", 
      "name": "year_in_college", 
      "metadata": {}
    }
  }
}
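The conversion into a list of columns mentioned above could look like the sketch below. The function name to_spark_struct and the small sample tree are made up for illustration. It assumes a leaf is any node whose "type" value is a string, and it relies on Python 3.7+ dictionaries keeping insertion order.

```python
def to_spark_struct(tree):
    """Turn the name-keyed tree into the {'fields': [...], 'type': 'struct'} layout."""
    fields = []
    for name, node in tree.items():
        if isinstance(node.get("type"), str):
            # Leaf: already a complete field definition
            fields.append(node)
        else:
            # Nested struct: wrap the converted children in a field entry
            fields.append({
                "metadata": {},
                "name": name,
                "nullable": True,
                "type": to_spark_struct(node),
            })
    return {"fields": fields, "type": "struct"}


# Hypothetical two-column tree in the shape shown above
tree = {
    "source": {"metadata": {}, "name": "source", "nullable": True, "type": "string"},
    "data": {
        "email": {"metadata": {}, "name": "email", "nullable": True, "type": "string"},
    },
}
schema = to_spark_struct(tree)
print([field["name"] for field in schema["fields"]])  # ['source', 'data']
```

A struct that happens to contain a child called "type" is still handled, because that child's value is a dictionary rather than a string.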

Finally, note that this only works for simple types and struct (not for array or map types), but it is fairly easy to extend to other complex types as well.
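As a rough sketch of such an extension, the variant below parses struct, array and map type strings directly into the layout that df.schema.json() produces. All names here (CATALOG_TO_SPARK, split_top_level, field, parse_type, build_schema) are my own. The type-name mapping is an assumption based on the question's JSON, where Spark writes "integer" and "long" while the catalog writes "int". Types such as varchar(n) are not handled.

```python
import json

# Assumed mapping from Hive-style catalog names to the names in Spark's JSON schema
CATALOG_TO_SPARK = {"int": "integer", "bigint": "long", "smallint": "short", "tinyint": "byte"}


def split_top_level(text, sep=","):
    """Split text on sep, ignoring separators nested inside <...> or (...)."""
    parts, depth, start = [], 0, 0
    for pos, ch in enumerate(text):
        if ch in "<(":
            depth += 1
        elif ch in ">)":
            depth -= 1
        elif ch == sep and depth == 0:
            parts.append(text[start:pos])
            start = pos + 1
    parts.append(text[start:])
    return parts


def field(name, json_type):
    """Build one nullable field entry."""
    return {"metadata": {}, "name": name, "nullable": True, "type": json_type}


def parse_type(type_str):
    """Convert one catalog type string into its JSON schema representation."""
    type_str = type_str.strip()
    if type_str.startswith("struct<"):
        fields = []
        for part in split_top_level(type_str[len("struct<"):-1]):
            name, _, field_type = part.partition(":")
            fields.append(field(name.strip(), parse_type(field_type)))
        return {"fields": fields, "type": "struct"}
    if type_str.startswith("array<"):
        element = parse_type(type_str[len("array<"):-1])
        return {"type": "array", "elementType": element, "containsNull": True}
    if type_str.startswith("map<"):
        key_type, value_type = split_top_level(type_str[len("map<"):-1])
        return {"type": "map", "keyType": parse_type(key_type),
                "valueType": parse_type(value_type), "valueContainsNull": True}
    # Simple type: translate the name if the catalog spells it differently
    return CATALOG_TO_SPARK.get(type_str, type_str)


def build_schema(columns):
    """columns: list of {'Name': ..., 'Type': ...} dicts as returned by the catalog."""
    return {"fields": [field(c["Name"], parse_type(c["Type"])) for c in columns],
            "type": "struct"}


columns = [
    {"Name": "source", "Type": "string"},
    {"Name": "data", "Type": "struct<email:string,year_in_college:int>"},
    {"Name": "tags", "Type": "array<string>"},
    {"Name": "attrs", "Type": "map<string,bigint>"},
]
print(json.dumps(build_schema(columns), indent=2))
```

Splitting only at nesting depth zero is what keeps the comma inside map<string,bigint> or decimal(10,2) from being mistaken for a field separator.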

Answer 2

Why not use the standard json-schema (https://json-schema.org/understanding-json-schema/about.html) and convert that json-schema into a Spark schema with https://github.com/zalando-incubator/spark-json-schema#quickstart

Answer 3

I was finally able to solve this:

def struct_definition(column_name, column_type):
    column_dict = {"metadata": {}, "name": column_name, "nullable": True, "type": column_type}
    return column_dict


def convert_to_json_array(struct_def):
    # Drop the 'struct<' prefix and the matching final '>' by slicing;
    # str.lstrip/rstrip remove sets of characters, not a prefix or suffix.
    inner = struct_def[len('struct<'):-1]
    main_struct_list = []
    depth, start = 0, 0
    # Split on commas at nesting depth 0 only, so nested structs stay whole
    for pos, ch in enumerate(inner + ','):
        if ch == '<':
            depth += 1
        elif ch == '>':
            depth -= 1
        elif ch == ',' and depth == 0:
            name, _, col_type = inner[start:pos].partition(':')
            main_struct_list.append({'Name': name, 'Type': col_type})
            start = pos + 1
    return main_struct_list


def to_json(input_list):
    main_struct_list = []
    for x in input_list:
        column_name = x['Name']
        column_type = x['Type']
        if column_type.startswith('struct'):
            main_struct_list.append(
                struct_definition(column_name,
                                  {'fields': to_json(convert_to_json_array(column_type)), 'type': 'struct'}))
        else:
            main_struct_list.append(struct_definition(column_name, column_type))
    return main_struct_list


if __name__ == '__main__':
    sample_list = [{u'Name': u'cloud_events_version', u'Type': u'string'},
                   {u'Name': u'event_type', u'Type': u'string'},
                   {u'Name': u'event_time', u'Type': u'string'},
                   {u'Name': u'data',
                    u'Type': u'struct<school_type:string,reg_source_product:string,member_role:string,email:string,year_in_college:int>'},
                   {u'Name': u'source', u'Type': u'string'},
                   {u'Name': u'extensions',
                    u'Type': u'struct<client_common:struct<auth_state:string,client_ip_address:string,client_experiments:string,uvn:string,device_id:string,adobe_sdid:string,url:string,page_name:string,user_agent:string,uuid:string,adobe_mcid:string,referral_url:string>>'},
                   {u'Name': u'event_type_version', u'Type': u'string'},
                   {u'Name': u'event_validation_status', u'Type': u'string'},
                   {u'Name': u'event_validated_ts', u'Type': u'string'},
                   {u'Name': u'validated_message', u'Type': u'string'}]
    main_struct_dict = {'fields': to_json(sample_list), 'type': 'struct'}
    print(main_struct_dict)
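
Once the dictionary has the same layout as the output of df.schema.json(), it can be handed to PySpark's StructType.fromJson, which takes the parsed dictionary. A minimal sketch follows, with a made-up two-column schema. The import is guarded only so the snippet still runs where PySpark is not installed. As far as I know the JSON has to use Spark's own type names ("integer", not "int"), so catalog names may need translating first.

```python
import json

# A small schema dict in the layout produced by df.schema.json()
schema_dict = {
    "type": "struct",
    "fields": [
        {"metadata": {}, "name": "source", "nullable": True, "type": "string"},
        {"metadata": {}, "name": "year", "nullable": True, "type": "integer"},
    ],
}
schema_json = json.dumps(schema_dict)

try:
    from pyspark.sql.types import StructType
except ImportError:
    StructType = None  # PySpark not installed; the conversion below is skipped

if StructType is not None:
    # fromJson expects the parsed dictionary, not the JSON string
    schema = StructType.fromJson(json.loads(schema_json))
    print(schema.simpleString())
```

The resulting StructType can then be passed wherever a schema is accepted, so Spark does not have to infer it from the data.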

Answer 1
1 2018-09-27 22:02:10

Answer 2
0 2019-03-05 03:09:14

Answer 3
-1 2018-09-28 05:46:54
