
Spark 2.0 reading json into a dataframe with quotes in a key - different behaviour than Spark 1.6... bug?

We are in the unfortunate situation of having to deal with messy incoming json data, and have found a difference in the way that Spark 2.0 (pyspark) handles quotes within a json key.

If we use the following as a sample file (sample.json):

{"event":"abc"}
{"event":"xyz","otherdata[\"this.is.ugly\"]":"value1"}

In Spark 1.6.2, we can run the following and get results:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf()
conf.setAppName('temp_quotes')

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

data = sqlContext.read.json("sample.json")
data.printSchema()

Results are:

root
 |-- event: string (nullable = true)
 |-- otherdata["this.is.ugly"]: string (nullable = true)

And we can see the data when we do a show:

data.show(2)

+-----+-------------------------+
|event|otherdata["this.is.ugly"]|
+-----+-------------------------+
|  abc|                     null|
|  xyz|                   value1|
+-----+-------------------------+

However, running the same code in Spark 2.0 shows the same schema:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf()
conf.setAppName('temp_quotes')

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

data = sqlContext.read.json("sample.json")
data.printSchema()

root
 |-- event: string (nullable = true)
 |-- otherdata["this.is.ugly"]: string (nullable = true)

But the show fails:

data.show(2)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 287, in show
    print(self._jdf.showString(n, truncate))
  File "/usr/lib/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Unable to resolve otherdata["this.is.ugly"] given [event, otherdata["this.is.ugly"]];'

Is this a bug, or is there a parameter in Spark 2.0 that I'm missing?

I believe this is addressed in https://issues.apache.org/jira/browse/SPARK-16698 (dot in JSON keys). The fix is scheduled to be released in 2.0.1.
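Until that release is available, a possible stopgap (a minimal, untested sketch; the helper name sanitize_keys and the choice of underscores as replacement characters are my own assumptions, not part of the fix) is to sanitize the keys yourself before Spark infers the schema, by reading the file as plain text and rewriting each JSON line:

import json
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf()
conf.setAppName('temp_quotes_workaround')

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

def sanitize_keys(line):
    # Hypothetical helper: replace the characters Spark 2.0.0 fails to
    # resolve in column names (dots, quotes, brackets) with underscores.
    obj = json.loads(line)
    return json.dumps({
        k.replace('.', '_').replace('"', '').replace('[', '_').replace(']', '_'): v
        for k, v in obj.items()
    })

# read.json also accepts an RDD of JSON strings, so the raw lines can be
# cleaned first and the schema inferred from the sanitized result.
cleaned = sc.textFile("sample.json").map(sanitize_keys)
data = sqlContext.read.json(cleaned)
data.show(2)

This loses the original key names, so it only helps if downstream code can live with the renamed columns until 2.0.1.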

(I don't have enough reputation to comment)
