I am trying to compute the average of the ratings across all JSON objects in a file. I loaded the file and converted it to a DataFrame, but I get an error when I try to compute the average. Sample requests:
{
"country": "France",
"customerId": "France001",
"visited": [
{
"placeName": "US",
"rating": "2.3",
"famousRest": "N/A",
"placeId": "AVBS34"
},
{
"placeName": "US",
"rating": "3.3",
"famousRest": "SeriousPie",
"placeId": "VBSs34"
},
{
"placeName": "Canada",
"rating": "4.3",
"famousRest": "TimHortons",
"placeId": "AVBv4d"
}
]
}
So for this JSON, the US average rating is (2.3 + 3.3)/2 = 2.8.
{
"country": "Egypt",
"customerId": "Egypt009",
"visited": [
{
"placeName": "US",
"rating": "1.3",
"famousRest": "McDonald",
"placeId": "Dedcf3"
},
{
"placeName": "US",
"rating": "3.3",
"famousRest": "EagleNest",
"placeId": "CDfet3"
}
]
}
{
"country": "Canada",
"customerId": "Canada012",
"visited": [
{
"placeName": "UK",
"rating": "3.3",
"famousRest": "N/A",
"placeId": "XSdce2"
}
]
}
For the Egypt request, the US average is (1.3 + 3.3)/2 = 2.3. The Canada request has no US visits, so it does not contribute.
So overall, the average rating will be (2.8 + 2.3)/2 = 2.55 (only two requests have 'US' in their visited list).
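To make the intended arithmetic concrete, here is a minimal plain-Scala sketch (no Spark involved) of the "average of per-request averages" described above. The `Visit` and `Request` case classes and the `AverageOfAverages` object are hypothetical names introduced only for illustration; they are not part of the original code.

```scala
// Hypothetical in-memory model of the sample requests (illustration only).
final case class Visit(placeName: String, rating: String)
final case class Request(customerId: String, visited: Seq[Visit])

object AverageOfAverages {
  // Mean rating for one place within a single request, if that place was visited.
  def placeAverage(request: Request, place: String): Option[Double] = {
    val ratings = request.visited.filter(_.placeName == place).map(_.rating.toDouble)
    if (ratings.isEmpty) None else Some(ratings.sum / ratings.size)
  }

  // Mean of the per-request means, skipping requests that never visited the place.
  def overallAverage(requests: Seq[Request], place: String): Option[Double] = {
    val perRequest = requests.flatMap(placeAverage(_, place))
    if (perRequest.isEmpty) None else Some(perRequest.sum / perRequest.size)
  }

  // The three sample requests, reduced to the fields that matter here.
  val sample: Seq[Request] = Seq(
    Request("France001", Seq(Visit("US", "2.3"), Visit("US", "3.3"), Visit("Canada", "4.3"))),
    Request("Egypt009", Seq(Visit("US", "1.3"), Visit("US", "3.3"))),
    Request("Canada012", Seq(Visit("UK", "3.3")))
  )

  def main(args: Array[String]): Unit = {
    sample.foreach(r => println(s"${r.customerId}: ${placeAverage(r, "US")}"))
    println(s"overall: ${overallAverage(sample, "US")}")
  }
}
```

Note that an average of per-request averages only coincides with a plain average over all US ratings when every request has the same number of US visits, as happens to be the case in this sample.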
My schema:
root
 |-- country: string (nullable = true)
 |-- customerId: string (nullable = true)
 |-- visited: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- placeId: string (nullable = true)
 |    |    |-- placeName: string (nullable = true)
 |    |    |-- famousRest: string (nullable = true)
 |    |    |-- rating: string (nullable = true)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.jsonFile("temp.txt")
df.show()
When doing:
val app = df.select("strategies")
app.registerTempTable("app")
app.printSchema()
app.show()
app.foreach({
t => t.select("placeName", "rating").where(t("placeName") == "US")
}).show()
I am getting:
<console>:31: error: value select is not a member of org.apache.spark.sql.Row
       t => t.select("placeName", "rating").where(t("placeName") == "US")
              ^
Can someone tell me what I am doing wrong here?
Assuming app is a DataFrame (your code example isn't comprehensible: you create a df variable and then query an app variable), you shouldn't call foreach in order to select from it:
app.select("placeName", "rating").where(app("placeName") === "US")
foreach would call a function on each record (of type Row). That is useful mostly for invoking some side effect (e.g. printing to the console or sending to an external service). Mostly, you wouldn't use it for selecting from or transforming DataFrames.
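As a rough illustration of that difference, the sketch below contrasts a select/where transformation with a foreach side effect. It assumes a newer Spark version (2.x or later) with SparkSession, a local master, and a small hypothetical flat DataFrame standing in for app; none of these come from the original question.

```scala
import org.apache.spark.sql.{Row, SparkSession}

object ForeachVsSelect {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("foreach-vs-select")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical flat DataFrame standing in for `app`.
    val app = Seq(("US", "2.3"), ("US", "3.3"), ("Canada", "4.3"))
      .toDF("placeName", "rating")

    // Transformation: select/where return a new DataFrame that can be chained further.
    val usOnly = app.select("placeName", "rating").where(app("placeName") === "US")
    usOnly.show()

    // Side effect: foreach passes each Row to a function returning Unit.
    // It yields no DataFrame, so nothing can be chained after it.
    val printRow: Row => Unit = row =>
      println(s"${row.getAs[String]("placeName")} -> ${row.getAs[String]("rating")}")
    usOnly.foreach(printRow)

    spark.stop()
  }
}
```

The function is bound to a typed val before being passed to foreach so that the Scala function overload is selected unambiguously. On a real cluster the println output would appear in the executor logs rather than on the driver console.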
UPDATE:
As for the original question of how to calculate the average of US-only visits:
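// imports needed by the snippets in this answer
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.mean
import org.apache.spark.sql.types.DoubleType

// explode to make a record out of each "visited" Array item,
// taking only "placeName" and "rating" columns
val exploded: DataFrame = df.explode(df("visited")) {
  case Row(visits: Seq[Row]) =>
    visits.map(r => (r.getAs[String]("placeName"), r.getAs[String]("rating")))
}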
// make some order: rename columns named _1, _2 (since we used a tuple),
// and cast ratings to Double:
val ratings: DataFrame = exploded
  .withColumnRenamed("_1", "placeName")
  .withColumn("rating", exploded("_2").cast(DoubleType))
  .select("placeName", "rating")
ratings.printSchema()
ratings.show()
/* prints:
root
|-- placeName: string (nullable = true)
|-- rating: double (nullable = true)
+---------+------+
|placeName|rating|
+---------+------+
| US| 1.3|
| US| 3.3|
| UK| 3.3|
+---------+------+
*/
// now filter US only and get average rating:
val avg = ratings
  .filter(ratings("placeName") === "US")
  .select(mean("rating"))
avg.show()
/* prints:
+-----------+
|avg(rating)|
+-----------+
| 2.3|
+-----------+
*/
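For readers on newer Spark versions, where DataFrame.explode is deprecated in favour of functions.explode, the same idea can be sketched as follows. This is only a sketch under stated assumptions: it uses SparkSession (Spark 2.2 or later for reading JSON from a Dataset[String]), a local master, and inline JSON strings with the famousRest and placeId fields omitted for brevity. It also shows the "average of per-customer averages" variant that the question describes, alongside the plain average over all US visits.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, explode}

object UsRatingAverage {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("us-rating-average")
      .getOrCreate()
    import spark.implicits._

    // Two of the sample requests, one JSON object per string (fields trimmed for brevity).
    val json = Seq(
      """{"country":"France","customerId":"France001","visited":[{"placeName":"US","rating":"2.3"},{"placeName":"US","rating":"3.3"},{"placeName":"Canada","rating":"4.3"}]}""",
      """{"country":"Egypt","customerId":"Egypt009","visited":[{"placeName":"US","rating":"1.3"},{"placeName":"US","rating":"3.3"}]}"""
    ).toDS()
    val df = spark.read.json(json)

    // One row per visited place, keeping the customer it belongs to,
    // with the string rating cast to double.
    val ratings = df
      .select(col("customerId"), explode(col("visited")).as("visit"))
      .select(
        col("customerId"),
        col("visit.placeName").as("placeName"),
        col("visit.rating").cast("double").as("rating"))

    val us = ratings.filter(col("placeName") === "US")

    // Plain average over all US visits.
    us.agg(avg("rating").as("avgRating")).show()

    // Average of the per-customer averages, as described in the question.
    us.groupBy("customerId")
      .agg(avg("rating").as("customerAvg"))
      .agg(avg("customerAvg").as("avgOfCustomerAvgs"))
      .show()

    spark.stop()
  }
}
```

When the input is a file containing pretty-printed, multi-line JSON objects like the ones in the question, the reader would likely also need the multiLine option; here each object is kept on a single line to avoid that.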