Apache Spark parsing JSON with split records

As far as I know, Apache Spark requires a JSON file to have exactly one record per line. I have a JSON file whose records are split across multiple lines by field, like this:

{"id": 123,
"name": "Aaron",
"city": {
    "id" : 1,
    "title": "Berlin"
}}
{"id": 125,
"name": "Bernard",
"city": {
    "id" : 2,
    "title": "Paris"
}}
{...many more lines
...}

How can I parse it using Spark? Do I need a preprocessor, or can I provide a custom splitter?

Spark splits on newlines to distinguish records. This means that when using the standard JSON reader you need to have one record per line.
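For the standard reader, the sample data above would have to be reformatted so that each record sits on a single line (the JSON Lines layout):

{"id": 123, "name": "Aaron", "city": {"id": 1, "title": "Berlin"}}
{"id": 125, "name": "Bernard", "city": {"id": 2, "title": "Paris"}}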

You can convert the file by doing something like in this answer: https://stackoverflow.com/a/30452120/1547734

The basic idea is to read each file whole with wholeTextFiles, feed the contents to a JSON parser, and flatMap the results into individual records.
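Here is a minimal sketch of that approach in Scala. The input path is a made-up example, and I'm assuming Jackson (which Spark ships with) to parse the concatenated top-level objects; spark.read.json on a Dataset[String] requires Spark 2.2+:

import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import org.apache.spark.sql.SparkSession

object SplitJsonExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("split-json").getOrCreate()
    import spark.implicits._

    // Read each file whole as a (path, content) pair, so records may span lines.
    val files = spark.sparkContext.wholeTextFiles("/data/people/*.json") // hypothetical path

    // Parse each file's content as a stream of concatenated JSON objects and
    // flatMap them into one single-line JSON string per record.
    val records = files.flatMap { case (_, content) =>
      val mapper = new ObjectMapper()
      val it = mapper.readerFor(classOf[JsonNode]).readValues[JsonNode](content)
      val buf = scala.collection.mutable.ListBuffer[String]()
      while (it.hasNext) buf += it.next().toString // re-serialize onto one line
      buf.toList
    }

    // Each string now holds exactly one record, so the standard reader works.
    val df = spark.read.json(records.toDS())
    df.show()
  }
}

Creating the ObjectMapper inside flatMap keeps the sketch self-contained; with many files, mapPartitions would let you reuse one mapper per partition.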

Of course, this assumes each file is small enough to fit in memory and be parsed in one go. Otherwise you would need a more complicated solution.
