
How to return multiple JSON objects using Spark SQL in Scala

I have a CSV file and a JSON file. count.csv has three columns (latitude, longitude, count). Here is a sample from the JSON:

{
  "type": "Feature",
  "properties": {
    "ID": "15280000000231",
    "TYPES": "Second Class",
    "N2C": "9",
    "NAME": "Century Road"
  },
  "geometry": {
    "type": "LineString",
    "coordinates": [
      [
        6.1395489,
        52.3107973
      ],
      [
        6.1401178,
        52.3088457
      ],
      [
        6.1401126,
        52.3088071
      ]
    ]
  }
}

Currently, my Scala code rounds and matches the latitudes and longitudes, filters the CSV file to the rows whose lon/lat match the route, and writes the lat/lon and count back out as CSV.

I want to return all the properties from the JSON (ID, TYPES, N2C and NAME) along with the matched lat/lon pairs: the original LineString, plus the counts from the CSV, plus the properties from the JSON, all written out as a JSON file instead of CSV.

This is what I have so far:

import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.functions._

case class ScoredLocation(latitude: Double, longitude: Double, count: Int)

object ScoreFilter {
  val Epsilon = 10000

  val spark = SparkSession.builder.getOrCreate()
  import spark.implicits._

  // Round coordinates to four decimal places for approximate matching
  val DoubleToRoundInt = udf(
    (coord: Double) => (coord * Epsilon).toInt
  )

  val schema = Encoders.product[ScoredLocation].schema
  val route_count = spark.read.schema(schema).format("csv").load("count.csv")
    .withColumn("lat_aprx", DoubleToRoundInt($"latitude"))
    .withColumn("lon_aprx", DoubleToRoundInt($"longitude"))

  val match_route = spark.read.format("json").load("matchroute.json")
    .select(explode($"geometry.coordinates"))
    .select($"col".getItem(0).alias("latitude"), $"col".getItem(1).alias("longitude"))
    .withColumn("lat_aprx", DoubleToRoundInt($"latitude"))
    .withColumn("lon_aprx", DoubleToRoundInt($"longitude"))

  route_count.show()
  match_route.show()

  // Keep only the CSV rows whose rounded coordinates appear on the route
  val result = route_count.join(match_route, Seq("lat_aprx", "lon_aprx"), "leftsemi")
    .select($"latitude", $"longitude", $"count")

  result.show()
  result.write.format("csv").save("result.csv")
}

Edit in response to the answer:

I get this error when using the solution:

Exception in thread "main" org.apache.spark.sql.AnalysisException:
cannot resolve '`ID`' given input columns: [count, latitude, longitude, lat_aprx, lon_aprx];;
'Project [latitude#3, longitude#4, score#5, 'ID, 'TYPES, 'N2C, 'NAME]
+- Project [lat_aprx#10, lon_aprx#16, latitude#3, longitude#4, score#5]
   +- Join LeftSemi, ((lat_aprx#10 = lat_aprx#55) && (lon_aprx#16 = lon_aprx#63))
      :- Project [latitude#3, longitude#4, score#5, lat_aprx#10, if (isnull(longitude#4)) null else UDF(longitude#4) AS lon_aprx#16]
      :  +- Project [latitude#3, longitude#4, count#5, if (isnull(latitude#3)) null else UDF(latitude#3) AS lat_aprx#10]
      :     +- Relation[latitude#3,longitude#4,count#5] csv
      +- Project [ID#38, TYPES#39, N2C#40, NAME#41, coords#48, lat_aprx#55, if (isnull(coords#48[1])) null else UDF(coords#48[1]) AS lon_aprx#63]
         +- Project [ID#38, TYPES#39, N2C#40, NAME#41, coords#48, if (isnull(coords#48[0])) null else UDF(coords#48[0]) AS lat_aprx#55]
            +- Project [properties#32.ID AS ID#38, properties#32.TYPES AS TYPES#39, properties#32.N2C AS N2C#40, properties#32.NAME AS NAME#41, coords#48]
               +- Generate explode(geometry#31.coordinates), true, false, [coords#48]
                  +- Relation[geometry#31,properties#32,type#33] json

Edit 2: I now get the JSON back with the counts added, but the remaining problem is returning the original GeoJSON, as a LineString type with the counts added, rather than the flat records shown below. It should look more like the original JSON above. I suppose I could post-process the output afterwards, but I would prefer to do it all in one Spark SQL job. Any ideas?

{  
   "lat":5.2509524,
   "lon":53.3926721,
   "count":1,
   "ID":"15280000814947",
   "TYPES":"Second Class",
   "N2C":"9"
}{  
   "lat":5.251464,
   "lon":53.3919782,
   "count":4,
   "ID":"15280000814947",
   "TYPES":"Second Class",
   "N2C":"9"
}{  
   "lat":5.251674,
   "lon":53.3916119,
   "count":4,
   "ID":"15280000814947",
   "TYPES":"Second Class",
   "N2C":"9"
}
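One possible way to get back to a Feature-shaped GeoJSON object in a single Spark job is to group the flat rows by the road's properties, collect the coordinates into an array, and serialize the whole thing with `to_json(struct(...))`. This is only a sketch: the `result` DataFrame below is a hypothetical stand-in for the joined output (with the ID, TYPES, N2C and NAME columns kept), and `collect_list` does not guarantee point order, so an explicit sequence index would be needed to reconstruct the LineString exactly.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the joined `result` DataFrame.
val result = Seq(
  (52.3107973, 6.1395489, 1, "15280000814947", "Second Class", "9", "Century Road"),
  (52.3088457, 6.1401178, 4, "15280000814947", "Second Class", "9", "Century Road")
).toDF("latitude", "longitude", "count", "ID", "TYPES", "N2C", "NAME")

// Regroup the flat rows into one Feature per road.
// Caveat: collect_list does not preserve point order; keep an explicit
// sequence index if the LineString must match the original exactly.
val features = result
  .groupBy($"ID", $"TYPES", $"N2C", $"NAME")
  .agg(
    collect_list(array($"latitude", $"longitude")).as("coordinates"),
    collect_list($"count").as("counts"))
  .select(to_json(struct(
    lit("Feature").as("type"),
    struct($"ID", $"TYPES", $"N2C", $"NAME", $"counts").as("properties"),
    struct(lit("LineString").as("type"), $"coordinates").as("geometry"))).as("value"))

features.show(false)
features.write.mode("overwrite").text("features.json")
```

`write.text` is used here because `features` is already a single string column of serialized JSON; a plain `write.format("json")` would wrap each line in another JSON object.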

When processing the match_route dataframe, make sure to select all the columns you actually want to keep. For example:

val match_route = spark.read.format("json").load("matchroute.json")
  .select($"properties.*", explode($"geometry.coordinates").as("coords"))
  .withColumn("latitude", $"coords".getItem(0))
  .withColumn("longitude", $"coords".getItem(1))
  .withColumn("lat_aprx", DoubleToRoundInt($"latitude"))
  .withColumn("lon_aprx", DoubleToRoundInt($"longitude"))
  .drop($"coords")

Make sure to add the relevant columns to the last select as well:

val result = route_count.join(match_route, Seq("lat_aprx", "lon_aprx"), "leftsemi")
  .select($"latitude", $"longitude", $"count", $"ID", $"TYPES", $"N2C", $"NAME")
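To write the joined output as a JSON file rather than CSV (the original goal), only the writer format needs to change; Spark then emits one JSON object per line, as seen in the Edit 2 sample. A minimal self-contained sketch, using a hypothetical one-row stand-in for `result`:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the joined result.
val result = Seq((53.3926721, 5.2509524, 1)).toDF("latitude", "longitude", "count")

// Swapping the writer format from "csv" to "json" yields newline-delimited
// JSON, one object per row.
result.write.mode("overwrite").format("json").save("result.json")
```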
