
Spark 2.0 reading JSON into a DataFrame with quotes in a key - different behaviour from Spark 1.6… bug?

We are in the unfortunate situation of having to deal with messy incoming JSON data, and we have found a difference in the way that Spark 2.0 (pyspark) handles quotes within a JSON key.

If we use the following as a sample file (sample.json):

{"event":"abc"}
{"event":"xyz","otherdata[\"this.is.ugly\"]":"value1"}

In Spark 1.6.2, we can run the following and get results:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf()
conf.setAppName('temp_quotes')

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

data = sqlContext.read.json("sample.json")
data.printSchema()

Results are:

root
 |-- event: string (nullable = true)
 |-- otherdata["this.is.ugly"]: string (nullable = true)

And we can see data when we do a show:

data.show(2)

+-----+-------------------------+
|event|otherdata["this.is.ugly"]|
+-----+-------------------------+
|  abc|                     null|
|  xyz|                   value1|
+-----+-------------------------+
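
As an aside, even in 1.6 referencing that column by name takes some care, because the analyzer treats unquoted dots as struct-field access. Wrapping the whole name in backticks makes it a single identifier (a sketch against the 1.6 session above):

# Backticks tell the analyzer to treat the entire name, dots and
# quotes included, as one column identifier.
data.select('`otherdata["this.is.ugly"]`').show()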

However, running the same code in Spark 2.0 reports the same schema:

root
 |-- event: string (nullable = true)
 |-- otherdata["this.is.ugly"]: string (nullable = true)

But the show fails:

data.show(2)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 287, in show
    print(self._jdf.showString(n, truncate))
  File "/usr/lib/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Unable to resolve otherdata["this.is.ugly"] given [event, otherdata["this.is.ugly"]];'

Is this a bug or is there a parameter in Spark 2.0 that I'm missing?

I believe this is addressed in https://issues.apache.org/jira/browse/SPARK-16698 (dot in JSON keys). The fix is scheduled to be released in 2.0.1.

(I don't have enough reputation to comment)
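
Until 2.0.1 is available, one possible workaround is to sanitize the keys before Spark infers the schema, by reading the file as text and rewriting each record. A minimal sketch, assuming line-delimited JSON with flat records (sanitize_keys is an illustrative helper, not part of any API):

import json
import re

def sanitize_keys(line):
    # Replace anything that is not alphanumeric or underscore so the
    # inferred column names contain no dots, quotes or brackets.
    record = json.loads(line)
    return json.dumps({re.sub(r'[^0-9A-Za-z_]', '_', key): value
                       for key, value in record.items()})

clean = sqlContext.read.json(sc.textFile("sample.json").map(sanitize_keys))
clean.printSchema()
clean.show(2)

This sidesteps the resolution problem entirely, since the inferred column names no longer contain the characters the analyzer trips over.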
