[EN] How to convert my RDD of JSON strings to DataFrame

I created an RDD[String] in which each String element contains multiple JSON strings, but all these JSON strings share the same schema across the whole RDD.
For example, the RDD[String] called rdd contains the following entries:

String 1:
{"data":"abc", "field1":"def"}
{"data":"123", "field1":"degf"}
{"data":"87j", "field1":"hzc"}
{"data":"efs", "field1":"ssaf"}
String 2:
{"data":"fsg", "field1":"agas"}
{"data":"sgs", "field1":"agg"}
{"data":"sdg", "field1":"agads"}
My goal is to convert this RDD[String] into a DataFrame. If I just do it this way:

val df = rdd.toDF()

... then it does not work correctly. Actually, df.count() gives me 2 instead of 7 for the above example, because the JSON strings are batched together and are not recognized individually.
How can I create the DataFrame so that each row corresponds to a particular JSON string?
I can't check it right now, but I think this should work:
// split each string by newline character
val splitted: RDD[Array[String]] = rdd.map(_.split("\n"))
// flatten
val jsonRdd: RDD[String] = splitted.flatMap(identity)
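Once the strings are flattened to one JSON document per record, Spark's built-in JSON reader can parse them and infer the schema (a sketch, assuming the same sqlContext used elsewhere in this thread):

```scala
// parse each JSON string into a row; the schema ("data", "field1") is inferred
val df = sqlContext.read.json(jsonRdd)
df.count()  // 7 for the example above, one row per JSON string
```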
Following the information you've provided in your question, the following can be a solution:
import sqlContext.implicits._
val str1 = "{\"data\":\"abc\", \"field1\":\"def\"}\n{\"data\":\"123\", \"field1\":\"degf\"}\n{\"data\":\"87j\", \"field1\":\"hzc\"}\n{\"data\":\"efs\", \"field1\":\"ssaf\"}"
val str2 = "{\"data\":\"fsg\", \"field1\":\"agas\"}\n{\"data\":\"sgs\", \"field1\":\"agg\"}\n{\"data\":\"sdg\", \"field1\":\"agads\"}"
val input = Seq(str1, str2)
val rddData = sc.parallelize(input).flatMap(_.split("\n"))
.map(line => line.split(","))
.map(array => (array(0).split(":")(1).trim.replaceAll("\\W", ""), array(1).split(":")(1).trim.replaceAll("\\W", "")))
rddData.toDF("data", "field1").show
Edited
You can omit the field names and just use .toDF(), but that would give default column names for your data (like _1, _2 or col_1, col_2, etc.).
Instead, you can create a schema to build the DataFrame as below (you can add more fields):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val rddData = sc.parallelize(input).flatMap(_.split("\n"))
  .map(line => line.split(","))
  .map(array => Row.fromSeq(Seq(array(0).split(":")(1).trim.replaceAll("\\W", ""), array(1).split(":")(1).trim.replaceAll("\\W", ""))))

val schema = StructType(Array(
  StructField("data", StringType, true),
  StructField("field1", StringType, true)))

sqlContext.createDataFrame(rddData, schema).show
Or

You can just create a Dataset directly, but you will need a case class (you can add more fields) as below:
val dataSet = sc.parallelize(input).flatMap(_.split("\n"))
.map(line => line.split(","))
.map(array => Dinasaurius(array(0).split(":")(1).trim.replaceAll("\\W", ""),
array(1).split(":")(1).trim.replaceAll("\\W", ""))).toDS
dataSet.show
The case class for the above Dataset is:
case class Dinasaurius(data: String,
field1: String)
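As an alternative to manually splitting on "," and ":", the same case class can be combined with Spark's JSON reader, which handles quoting and escaping for you (a sketch assuming Spark 2.x, where spark is a SparkSession and read.json accepts a Dataset[String]):

```scala
import spark.implicits._

// one JSON string per record, parsed by Spark and mapped onto the case class
val dataSet = spark.read
  .json(input.flatMap(_.split("\n")).toDS())
  .as[Dinasaurius]
dataSet.show()
```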
I hope I answered all your questions.