
Apache Spark: parsing JSON with records split across lines

As far as I know, Apache Spark requires a JSON file to have one record per line. I have a JSON file whose records are split across multiple lines, like this:

{"id": 123,
"name": "Aaron",
"city": {
    "id" : 1,
    "title": "Berlin"
}}
{"id": 125,
"name": "Bernard",
"city": {
    "id" : 2,
    "title": "Paris"
}}
{...many more lines
...}

How can I parse this with Spark? Do I need a preprocessor, or can I provide a custom splitter?

Spark splits its input on newlines to distinguish records. This means that with the standard JSON reader you need one complete record per line.
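For illustration, a minimal sketch of that standard reader (the file path is hypothetical); a file formatted like the one in the question would come back as corrupt records here:

import org.apache.spark.sql.SparkSession

// The standard JSON reader expects JSON Lines:
// one complete record per physical line.
val spark = SparkSession.builder()
  .appName("jsonl-example")
  .master("local[*]")
  .getOrCreate()

val df = spark.read.json("people.jsonl")  // hypothetical path
df.printSchema()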

You can convert the data using an approach like the one in this answer: https://stackoverflow.com/a/30452120/1547734

The basic idea is to read the input with wholeTextFiles, feed each file's contents to a JSON parser, and flatMap the results into individual records.

Of course, this assumes each file is small enough to fit in memory and be parsed as a whole. Otherwise you would need a more complicated solution.
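As a rough sketch of that idea, assuming Jackson is on the classpath (it ships with Spark) and a hypothetical input path, something like this could work:

import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConverters._

val spark = SparkSession.builder()
  .appName("split-json")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Each element is (file path, entire file contents).
val files = sc.wholeTextFiles("data/people/*.json")  // hypothetical path

// Jackson's MappingIterator can consume a stream of concatenated
// top-level JSON objects, which matches this file layout.
val records = files.flatMap { case (_, text) =>
  val mapper = new ObjectMapper()
  mapper.readerFor(classOf[JsonNode])
    .readValues[JsonNode](text)
    .asScala
    .map(_.toString)  // re-serialize each record onto a single line
}

// Every element is now a one-line JSON string, so the
// standard reader can take over.
import spark.implicits._
val df = spark.read.json(spark.createDataset(records))
df.show()

Note that each whole file is parsed by a single task, which is why the per-file memory assumption above matters.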
