
How to convert my RDD of JSON strings to DataFrame

I created an RDD[String] in which each String element contains multiple JSON strings, but all these JSON strings share the same schema across the whole RDD.

For example:

The RDD[String], called rdd, contains the following entries:

String 1:

{"data":"abc", "field1":"def"}
{"data":"123", "field1":"degf"}
{"data":"87j", "field1":"hzc"}
{"data":"efs", "field1":"ssaf"}

String 2:

{"data":"fsg", "field1":"agas"}
{"data":"sgs", "field1":"agg"}
{"data":"sdg", "field1":"agads"}

My goal is to convert this RDD[String] into a DataFrame. If I just do it this way:

val df = rdd.toDF()

..., then it does not work correctly. Actually, df.count() gives me 2 instead of 7 for the above example, because the JSON strings are batched and are not recognized individually.

How can I create the DataFrame so that each row corresponds to a particular JSON string?

I can't check it right now, but I think this should work:

// split each string by newline character
val splitted: RDD[Array[String]] = rdd.map(_.split("\n"))

// flatten
val jsonRdd: RDD[String] = splitted.flatMap(identity)
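From here you can let Spark parse the JSON itself rather than splitting fields by hand. A minimal sketch, assuming the Spark 1.x `sqlContext` used elsewhere on this page and the `jsonRdd` produced above are in scope:

```scala
// DataFrameReader.json accepts an RDD[String] of one JSON object per element
// and infers the schema ("data", "field1") from the strings themselves.
val df = sqlContext.read.json(jsonRdd)

df.count() // should now be 7 for the example input, one row per JSON string
df.show()
```

This sidesteps the manual comma/colon parsing shown in the other answer, and keeps working if the JSON gains nested fields.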

Following the information you've provided in your question, the following could be your solution:

import sqlContext.implicits._
val str1 = "{\"data\":\"abc\", \"field1\":\"def\"}\n{\"data\":\"123\", \"field1\":\"degf\"}\n{\"data\":\"87j\", \"field1\":\"hzc\"}\n{\"data\":\"efs\", \"field1\":\"ssaf\"}"
val str2 = "{\"data\":\"fsg\", \"field1\":\"agas\"}\n{\"data\":\"sgs\", \"field1\":\"agg\"}\n{\"data\":\"sdg\", \"field1\":\"agads\"}"
val input = Seq(str1, str2)

// split each batched string into individual JSON lines, then pull out the two
// field values by splitting on "," and ":" and stripping non-word characters
val rddData = sc.parallelize(input).flatMap(_.split("\n"))
  .map(line => line.split(","))
  .map(array => (array(0).split(":")(1).trim.replaceAll("\\W", ""), array(1).split(":")(1).trim.replaceAll("\\W", "")))
rddData.toDF("data", "field1").show


Edited
You can leave out the field names and just use .toDF(), but that would give default column names (like _1, _2 or col_1, col_2). Instead, you can define a schema and build the DataFrame as below (you can add more fields):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val rddData = sc.parallelize(input).flatMap(_.split("\n"))
  .map(line => line.split(","))
  .map(array => Row.fromSeq(Seq(array(0).split(":")(1).trim.replaceAll("\\W", ""), array(1).split(":")(1).trim.replaceAll("\\W", ""))))

val schema = StructType(Array(StructField("data", StringType, true),
  StructField("field1", StringType, true)))

sqlContext.createDataFrame(rddData, schema).show

Or
You can create a Dataset directly, but you will need a case class (you can add more fields), as below:

val dataSet = sc.parallelize(input).flatMap(_.split("\n"))
  .map(line => line.split(","))
  .map(array => Dinasaurius(array(0).split(":")(1).trim.replaceAll("\\W", ""),
    array(1).split(":")(1).trim.replaceAll("\\W", ""))).toDS

dataSet.show

The case class for the above Dataset is:

case class Dinasaurius(data: String,
                       field1: String)
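For a typed Dataset you can also avoid the hand-rolled string parsing entirely: let Spark's JSON reader do the parsing and then map the result onto the case class. A sketch, assuming a Spark 2.x `SparkSession` named `spark` (the `input` Seq is the one from the answer above):

```scala
import spark.implicits._

// flatten the batched strings into one JSON object per element, as before
val jsonDs = spark.sparkContext.parallelize(input).flatMap(_.split("\n")).toDS

// parse the JSON and bind the inferred columns to the case class fields by name
val typedDs = spark.read.json(jsonDs).as[Dinasaurius]
typedDs.show()
```

The `.as[Dinasaurius]` call fails fast with an analysis error if the JSON columns don't match the case class fields, which is often preferable to silently mis-parsed strings.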

I hope I answered all your questions.
