Filter by array value in Spark DataFrame

I am using Apache Spark 1.5 DataFrames with Elasticsearch, and I am trying to filter by id on a column that contains a list (array) of ids.

For example, the Elasticsearch mapping for the column looks like this:

    {
        "people": {
            "properties": {
                "artist": {
                    "properties": {
                        "id": {
                            "index": "not_analyzed",
                            "type": "string"
                        },
                        "name": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            }
        }
    }

The example data looks like the following:

{
    "people": {
        "artist": [
            {
                "id": "153",
                "name": "Tom"
            },
            {
                "id": "15389",
                "name": "Cok"
            }
        ]
    }
},
{
    "people": {
        "artist": [
            {
                "id": "369",
                "name": "Carl"
            },
            {
                "id": "15389",
                "name": "Cok"
            },
            {
                "id": "698",
                "name": "Sol"
            }
        ]
    }
}

In Spark I tried this:

val peopleId  = 152
val dataFrame = sqlContext.read
     .format("org.elasticsearch.spark.sql")
     .load("index/type")

dataFrame.filter(dataFrame("people.artist.id").contains(peopleId))
    .select("people.artist.id")

This returns every id that contains 152, for example 1523 and 152978, rather than only ids equal to 152.

Then I tried:

dataFrame.filter(dataFrame("people.artist.id").equalTo(peopleId))
    .select("people.artist.id")

The result is empty. I understand why: people.artist.id is an array, not a single value.

Can anyone tell me how to filter when I have a list of ids?

In Spark 1.5+ you can use the array_contains function:

import org.apache.spark.sql.functions.array_contains
df.where(array_contains($"people.artist.id", "153"))
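
To try the array_contains filter without an Elasticsearch cluster, here is a minimal, self-contained sketch. It rests on several assumptions that are not part of the original post: it uses the SparkSession API from Spark 2.x or later rather than the 1.5 sqlContext, it builds the sample documents in memory, and the case classes Artist, People and Doc as well as the object name ArrayContainsSketch are hypothetical names chosen to mirror the mapping in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_contains, col}

// Hypothetical case classes mirroring the mapping from the question.
final case class Artist(id: String, name: String)
final case class People(artist: Seq[Artist])
final case class Doc(people: People)

object ArrayContainsSketch {
  // In-memory stand-in for the two Elasticsearch documents shown in the question.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      Doc(People(Seq(Artist("153", "Tom"), Artist("15389", "Cok")))),
      Doc(People(Seq(Artist("369", "Carl"), Artist("15389", "Cok"), Artist("698", "Sol"))))
    ).toDF()
  }

  // people.artist.id resolves to an array of strings (one id per artist);
  // array_contains keeps rows where some element equals `id` exactly.
  def filterById(df: DataFrame, id: String): DataFrame =
    df.where(array_contains(col("people.artist.id"), id))

  def main(args: Array[String]): Unit = {
    // Assumption: a local Spark 2.x+ session, not the 1.5 sqlContext from the question.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("array-contains-sketch")
      .getOrCreate()

    filterById(sampleData(spark), "153")
      .select(col("people.artist.id"))
      .show(false)

    spark.stop()
  }
}
```

With this sample data, only the first document should pass the filter for "153", because "15389" merely starts with that value and is not equal to it.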

If you use an earlier version, you can try a UDF like this:

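import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{lit, udf}
// True when any struct in the array has an "id" field equal to v
val containsId = udf(
  (rs: Seq[Row], v: String) => rs.map(_.getAs[String]("id")).exists(_ == v))
df.where(containsId($"people.artist", lit("153")))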
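
The predicate inside the UDF can be checked on its own, without Spark. The sketch below is a plain-Scala model of it, not Spark code: the case class Artist is a hypothetical stand-in for Row, and containsIdSubstring is included only to contrast exact element matching with the substring-style matching described in the question.

```scala
// Hypothetical stand-in for the Row structs held in people.artist.
final case class Artist(id: String, name: String)

object ContainsIdSketch {
  // Exact element match: the same test the UDF applies to each row's array.
  def containsId(artists: Seq[Artist], wanted: String): Boolean =
    artists.map(_.id).exists(_ == wanted)

  // Substring match, shown only for contrast with the exact match above.
  def containsIdSubstring(artists: Seq[Artist], wanted: String): Boolean =
    artists.map(_.id).exists(_.contains(wanted))

  def main(args: Array[String]): Unit = {
    val row = Seq(Artist("153", "Tom"), Artist("15389", "Cok"))
    println(containsId(row, "153"))          // true: "153" is an element
    println(containsId(row, "15"))           // false: no id equals "15"
    println(containsIdSubstring(row, "15"))  // true: "153" contains "15"
  }
}
```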
