We are in the unfortunate situation of having to deal with messy incoming JSON data, and have found a difference in the way that Spark 2.0 (PySpark) handles quotes within a JSON key.
If we use the following as a sample file (sample.json):
{"event":"abc"}
{"event":"xyz","otherdata[\"this.is.ugly\"]":"value1"}
In Spark 1.6.2, we can run the following and get results:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
conf = SparkConf()
conf.setAppName('temp_quotes')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
data = sqlContext.read.json("sample.json")
data.printSchema()
Results are:
root
|-- event: string (nullable = true)
|-- otherdata["this.is.ugly"]: string (nullable = true)
And we can see data when we do a show:
data.show(2)
+-----+-------------------------+
|event|otherdata["this.is.ugly"]|
+-----+-------------------------+
|  abc|                     null|
|  xyz|                   value1|
+-----+-------------------------+
However, running the same code in Spark 2.0 shows the same schema:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
conf = SparkConf()
conf.setAppName('temp_quotes')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
data = sqlContext.read.json("sample.json")
data.printSchema()
root
|-- event: string (nullable = true)
|-- otherdata["this.is.ugly"]: string (nullable = true)
But the show fails:
data.show(2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 287, in show
print(self._jdf.showString(n, truncate))
File "/usr/lib/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Unable to resolve otherdata["this.is.ugly"] given [event, otherdata["this.is.ugly"]];'
Is this a bug or is there a parameter in Spark 2.0 that I'm missing?
I believe this is addressed in https://issues.apache.org/jira/browse/SPARK-16698 (dot in JSON keys). The fix is scheduled to be released in 2.0.1.
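Until a fixed release is available, one possible workaround is to rewrite the offending keys before Spark ever sees them. The following is only a minimal sketch under assumptions: the helper names `sanitize_key` and `sanitize_line` are hypothetical (they are not part of Spark), only top-level keys are rewritten, and replacing every run of non-alphanumeric characters with an underscore is just one arbitrary naming choice.

```python
import json
import re

# Hypothetical helpers for illustration; not part of Spark or the original post.
# Matches any run of characters other than ASCII letters, digits, or underscore.
_UNSAFE = re.compile(r'[^0-9A-Za-z_]+')


def sanitize_key(key):
    """Replace each run of unsafe characters with one underscore,
    then trim leading and trailing underscores."""
    return _UNSAFE.sub('_', key).strip('_')


def sanitize_line(line):
    """Parse one JSON object per line and re-serialize it with
    sanitized top-level keys. Nested objects are left untouched."""
    record = json.loads(line)
    cleaned = {sanitize_key(k): v for k, v in record.items()}
    return json.dumps(cleaned, sort_keys=True)
```

With the sample file above, the key otherdata["this.is.ugly"] would become otherdata_this_is_ugly. The cleaned lines could then presumably be loaded with something like sqlContext.read.json(sc.textFile("sample.json").map(sanitize_line)), since read.json also accepts an RDD of JSON strings; I have not verified this against 2.0.0. Note that two different messy keys can collapse to the same sanitized name, in which case one value silently overwrites the other.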