
How to filter JSON data by multi-value column

With the help of Spark SQL, I'm trying to filter out all business items that belong to a specific category.

The data is loaded from a JSON file:

import os

businessJSON = os.path.join(targetDir, 'yelp_academic_dataset_business.json')
businessDF = sqlContext.read.json(businessJSON)

The schema of the file is as follows:

businessDF.printSchema()

root
  |-- business_id: string (nullable = true)
  |-- categories: array (nullable = true)
  |    |-- element: string (containsNull = true)
  ..
  |-- type: string (nullable = true)

I'm trying to extract all businesses in the restaurant category:

restaurants = businessDF[businessDF.categories.inSet("Restaurants")]

but it doesn't work because, as I understand it, the expected column type is a string, while in my case the column is an array. The exception says:

Py4JJavaError: An error occurred while calling o1589.filter.
: org.apache.spark.sql.AnalysisException: invalid cast from string to array<string>;

Can you please suggest another way to get what I want?

How about a UDF?

from pyspark.sql import Row
from pyspark.sql.functions import udf, lit
from pyspark.sql.types import BooleanType

# Return True when `val` is one of the elements of the array column `xs`
contains = udf(lambda xs, val: val in xs, BooleanType())
df = sqlContext.createDataFrame([Row(categories=["foo", "bar"])])

df.select(contains(df.categories, lit("foo"))).show()
## +----------------------------------+
## |PythonUDF#<lambda>(categories,foo)|
## +----------------------------------+
## |                              true|
## +----------------------------------+

df.select(contains(df.categories, lit("foobar"))).show()
## +-------------------------------------+
## |PythonUDF#<lambda>(categories,foobar)|
## +-------------------------------------+
## |                                false|
## +-------------------------------------+
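To tie this back to the question, here is a minimal sketch of applying the same idea as a row filter on the categories column. The helper names has_category, filter_with_udf and filter_with_builtin are hypothetical, introduced only for illustration. The predicate also guards against None, since the schema marks categories as nullable. The second variant assumes Spark 1.5 or later, where the built-in array_contains function is available and should make the Python UDF unnecessary.

```python
def has_category(categories, wanted):
    # Null-safe membership test: `categories` is nullable in the schema,
    # so a row may carry None instead of a list.
    return categories is not None and wanted in categories


def filter_with_udf(df, wanted, column="categories"):
    # pyspark is imported lazily so the plain-Python predicate above can be
    # exercised without a Spark installation.
    from pyspark.sql.functions import udf, col, lit
    from pyspark.sql.types import BooleanType

    contains = udf(has_category, BooleanType())
    return df.filter(contains(col(column), lit(wanted)))


def filter_with_builtin(df, wanted, column="categories"):
    # Assumption: Spark 1.5 or later, where array_contains exists.
    from pyspark.sql.functions import array_contains, col

    return df.filter(array_contains(col(column), wanted))
```

With the DataFrame from the question, either helper would be called as restaurants = filter_with_udf(businessDF, "Restaurants") or restaurants = filter_with_builtin(businessDF, "Restaurants"). Where the built-in is available it is probably the better choice, because it avoids passing every row through the Python interpreter.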
