
Extract information from a `org.apache.spark.sql.Row`

I have an Array[org.apache.spark.sql.Row] returned by sqc.sql(sqlcmd).collect():

Array([10479,6,10], [8975,149,640], ...)

I can get the individual values:

scala> pixels(0)(0)
res34: Any = 10479

but they are Any, not Int.

How do I extract them as Int?

The most obvious solution did not work:

scala> pixels(0).getInt(0)
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Int

PS. I can do pixels(0)(0).toString.toInt or pixels(0).getString(0).toInt, but they feel wrong...

Using getInt should work. Here is a contrived example as a proof of concept:

import org.apache.spark.sql._
sc.parallelize(Array(1,2,3)).map(Row(_)).collect()(0).getInt(0)

This returns 1.

However,

sc.parallelize(Array("1","2","3")).map(Row(_)).collect()(0).getInt(0)

fails. So it looks like the value is coming in as a String, and you will have to convert it to an Int manually:

sc.parallelize(Array("1","2","3")).map(Row(_)).collect()(0).getString(0).toInt

The documentation states that getInt:

Returns the value of column i as an int. This function will throw an exception if the value is at i is not an integer, or if it is null.

So it seems getInt will not try to cast the value for you.
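Since getInt does not convert, one option is to match on the runtime type of the Any value yourself. The following is only a minimal sketch in plain Scala; the names AnyToInt and toIntOption are made up for illustration and are not part of Spark.

```scala
import scala.util.Try

// Hypothetical helper (not a Spark API): converts a value of static type Any
// into an Option[Int] by inspecting its runtime type.
object AnyToInt {
  def toIntOption(value: Any): Option[Int] = value match {
    case i: Int    => Some(i)                    // already an Int (boxed or not)
    case s: String => Try(s.trim.toInt).toOption // parse it; None if not numeric
    case _         => None                       // null or any other type
  }
}
```

With the array from the question, AnyToInt.toIntOption(pixels(0)(0)) would give an Option[Int] instead of throwing when a value turns out not to be numeric.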

The Row class (also see https://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.package ) has methods getInt(i: Int), getDouble(i: Int), etc.

Also note that a SchemaRDD is an RDD[Row] plus a schema that tells you which column has which data type. If you call .collect(), you only get an Array[Row], which does not carry that information. So unless you know for sure what your data looks like, get the schema from the SchemaRDD first, then collect the rows, and then access each field using the correct type information.
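The schema-first approach described above could look roughly like this. It is a sketch under assumptions: it targets a newer Spark (1.3 or later), where SchemaRDD has become DataFrame and the type objects live in org.apache.spark.sql.types, and the names SchemaAwareAccess and intAt are hypothetical.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}
import scala.util.Try

// Hypothetical helper (not a Spark API): reads column i of a row as an Int,
// using the schema to decide which typed accessor is safe to call.
object SchemaAwareAccess {
  def intAt(row: Row, schema: StructType, i: Int): Option[Int] =
    if (row.isNullAt(i)) None                    // getInt would throw on null
    else schema.fields(i).dataType match {
      case IntegerType => Some(row.getInt(i))    // column really holds Ints
      case StringType  => Try(row.getString(i).trim.toInt).toOption
      case _           => None                   // other types: not handled here
    }
}
```

For the query in the question this would be used along the lines of: val result = sqc.sql(sqlcmd); val schema = result.schema; result.collect().map(r => SchemaAwareAccess.intAt(r, schema, 0)). The schema is read before collect(), and each row is then accessed according to its column type.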

The answer above is relevant. You don't need to use collect; instead, call getInt or getString on each row, or getAs when the data type is complex:

// The first selected column (hashtags) holds a sequence of strings, so read it with getAs.
val popularHashTags = sqlContext.sql("SELECT hashtags, usersMentioned, Url FROM tweets")
val hashTagsList = popularHashTags.flatMap(x => x.getAs[Seq[String]](0))
