I'm trying to combine rows in Spark.
The dataset has columns Year, ZipCode, HPI_with_2000_base, etc. I selected three zip codes and their HPI_with_2000_base values. What I want is to keep the rows for these three zip codes, but only for years from 2000 onward.
When I ran this query, it worked:
df6 = spark.sql("select ZipCode,Year, HPI_with_2000_base from df1 where ZipCode = 94122 or ZipCode = 10583 or ZipCode = 91411")
Resulting dataframe:
+-------+----+------------------+
|ZipCode|Year|HPI_with_2000_base|
+-------+----+------------------+
|  10583|1976|             16.66|
|  10583|1977|             16.81|
|  10583|1978|             18.37|
|  10583|1979|             23.06|
|  10583|1980|             24.37|
|  10583|1981|             30.82|
|  10583|1982|             32.46|
|  10583|1983|             35.25|
|  10583|1984|             42.15|
|  10583|1985|             48.94|
|  10583|1986|             57.22|
|  10583|1987|             66.24|
|  10583|1988|             76.98|
|  10583|1989|             77.28|
|  10583|1990|             74.44|
|  10583|1991|             69.85|
|  10583|1992|             70.86|
|  10583|1993|             70.98|
|  10583|1994|             71.39|
|  10583|1995|             71.27|
+-------+----+------------------+
only showing top 20 rows
However, when I added a condition on Year like this, it failed:
df6 = spark.sql("select ZipCode,Year, HPI_with_2000_base from df1 where ZipCode = 94122 or ZipCode = 10583 or ZipCode = 91411" or Year >= '2000'").show()
Can you advise what I should do to get the result I want? Thank you.
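If I understand the question correctly, you want to add the condition Year >= 2000 to the current SQL statement. The " after 91411 is misplaced: it closes the SQL string before the Year condition, so the rest of the line is no longer part of the query. The Year condition should also be joined with and rather than or, and you need to surround the ZipCode = ... or ZipCode = ... or ZipCode = ... part with parentheses (or use IN), because and binds more tightly than or and the Year filter would otherwise apply only to the last zip code. A working statement can look like this: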
val df6 = spark.sql("""select ZipCode, Year, HPI_with_2000_base from df1
where ZipCode IN(94122, 10583, 91411) and Year >= 2000""")
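To see why the parentheses (or IN) matter, here is a minimal sketch that runs the same WHERE clauses against a tiny in-memory table. It uses Python's built-in sqlite3 instead of Spark so it can run without a cluster; AND binds more tightly than OR in standard SQL, which is the only behavior the sketch relies on. The sample rows are made up for illustration and are not taken from the real dataset.

```python
import sqlite3

# Hypothetical sample rows (made up for illustration, not the real dataset).
rows = [
    (94122, 1999, 90.0),
    (94122, 2001, 110.0),
    (10583, 1995, 71.27),
    (10583, 2005, 150.0),
    (91411, 1998, 85.0),
    (91411, 2002, 120.0),
    (60614, 2003, 130.0),  # a zip code we do not want in the result
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE df1 (ZipCode INTEGER, Year INTEGER, HPI_with_2000_base REAL)"
)
conn.executemany("INSERT INTO df1 VALUES (?, ?, ?)", rows)

# Without parentheses, AND binds tighter than OR, so the Year filter
# is attached only to ZipCode = 91411; the other two zip codes keep all years.
unparenthesized = conn.execute(
    "SELECT ZipCode, Year FROM df1 "
    "WHERE ZipCode = 94122 OR ZipCode = 10583 OR ZipCode = 91411 AND Year >= 2000 "
    "ORDER BY ZipCode, Year"
).fetchall()

# With IN (...), the Year filter applies to all three zip codes.
filtered = conn.execute(
    "SELECT ZipCode, Year FROM df1 "
    "WHERE ZipCode IN (94122, 10583, 91411) AND Year >= 2000 "
    "ORDER BY ZipCode, Year"
).fetchall()

print("without parentheses:", unparenthesized)
print("with IN:", filtered)
```

The first query still returns pre-2000 rows for 94122 and 10583, while the second returns only rows from 2000 onward for the three zip codes.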