
Reading and accessing nested fields in JSON files using Spark

I have multiple JSON files that I want to combine into a Spark DataFrame. In testing with a subset, when I load the files, I get rows containing the raw JSON structures themselves rather than the parsed JSON fields. I am doing the following:

    df = spark.read.json('gutenberg/test')
    df.show()
    +--------------------+--------------------+--------------------+
    |                   1|                  10|                   5|
    +--------------------+--------------------+--------------------+
    |                null|[WrappedArray(),W...|                null|
    |                null|                null|[WrappedArray(Uni...|
    |[WrappedArray(Jef...|                null|                null|
    +--------------------+--------------------+--------------------+

When I check the schema of the DataFrame, the information appears to be there, but I am having trouble accessing it:

    df.printSchema()
    root
     |-- 1: struct (nullable = true)
     |    |-- author: array (nullable = true)
     |    |    |-- element: string (containsNull = true)
     |    |-- formaturi: array (nullable = true)
     |    |    |-- element: string (containsNull = true)
     |    |-- language: array (nullable = true)
     |    |    |-- element: string (containsNull = true)
     |    |-- rights: array (nullable = true)
     |    |    |-- element: string (containsNull = true)
     |    |-- subject: array (nullable = true)
     |    |    |-- element: string (containsNull = true)
     |    |-- title: array (nullable = true)
     |    |    |-- element: string (containsNull = true)
     |    |-- txt: string (nullable = true)
     |-- 10: struct (nullable = true)
     |    |-- author: array (nullable = true)
     |    |    |-- element: string (containsNull = true)
     |    |-- formaturi: array (nullable = true)
     |    |    |-- element: string (containsNull = true)
     |    |-- language: array (nullable = true)
     |    |    |-- element: string (containsNull = true)
     |    |-- rights: array (nullable = true)
     |    |    |-- element: string (containsNull = true)
     |    |-- subject: array (nullable = true)
     |    |    |-- element: string (containsNull = true)
     |    |-- title: array (nullable = true)
     |    |    |-- element: string (containsNull = true)
     |    |-- txt: string (nullable = true)
     |-- 5: struct (nullable = true)
     |    |-- author: array (nullable = true)
     |    |    |-- element: string (containsNull = true)
     |    |-- formaturi: array (nullable = true)
     |    |    |-- element: string (containsNull = true)
     |    |-- language: array (nullable = true)
     |    |    |-- element: string (containsNull = true)
     |    |-- rights: array (nullable = true)
     |    |    |-- element: string (containsNull = true)
     |    |-- subject: array (nullable = true)
     |    |    |-- element: string (containsNull = true)
     |    |-- title: array (nullable = true)
     |    |    |-- element: string (containsNull = true)
     |    |-- txt: string (nullable = true)

I keep getting errors when trying to access the information, so any help would be great.

Specifically, I am looking to create a new DataFrame whose columns are ('author', 'formaturi', 'language', 'rights', 'subject', 'title', 'txt').

I am using PySpark 2.2.

Since I do not know exactly what your JSON files look like, assuming they are newline-delimited JSON, this should work.

    def _construct_key(previous_key, separator, new_key):
        if previous_key:
            return "{}{}{}".format(previous_key, separator, new_key)
        else:
            return new_key

    def flatten(nested_dict, separator="_", root_keys_to_ignore=set()):
        """Flatten a nested dict into a single-level dict with joined keys."""
        assert isinstance(nested_dict, dict)
        assert isinstance(separator, str)
        flattened_dict = dict()

        def _flatten(object_, key):
            if isinstance(object_, dict):
                for object_key in object_:
                    if not (not key and object_key in root_keys_to_ignore):
                        _flatten(object_[object_key],
                                 _construct_key(key, separator, object_key))
            elif isinstance(object_, (list, set)):
                for index, item in enumerate(object_):
                    _flatten(item, _construct_key(key, separator, index))
            else:
                flattened_dict[key] = object_

        _flatten(nested_dict, None)
        return flattened_dict

    def flatten_row(row):
        # Row.asDict(recursive=True) converts nested Rows into plain dicts
        return flatten(row.asDict(recursive=True))

    df = spark.read.json('gutenberg/test',
                         primitivesAsString=True,
                         allowComments=True,
                         allowUnquotedFieldNames=True,
                         allowNumericLeadingZero=True,
                         allowBackslashEscapingAnyCharacter=True,
                         mode='DROPMALFORMED') \
             .rdd.map(flatten_row).toDF()
    df.show()
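To sanity-check the flattening step without Spark, here is a standalone, compact version of the same dict flattener run on hypothetical sample data; the key-joining behavior matches the helper above:

```python
def flatten_dict(nested, separator="_", parent=""):
    """Recursively flatten dicts and lists into a single-level dict."""
    flat = {}
    items = nested.items() if isinstance(nested, dict) else enumerate(nested)
    for key, value in items:
        new_key = "{}{}{}".format(parent, separator, key) if parent else str(key)
        if isinstance(value, (dict, list)):
            flat.update(flatten_dict(value, separator, new_key))
        else:
            flat[new_key] = value
    return flat

# hypothetical record shaped like one book struct: list elements get
# index-suffixed keys (author_0, title_0), scalars keep their key (txt)
sample = {"author": ["Jane Austen"], "title": ["Emma"], "txt": "..."}
print(flatten_dict(sample))
```

Because list elements become `author_0`, `author_1`, etc., books with different numbers of authors will produce different column sets, so the final `toDF()` schema depends on the union of keys across rows.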
