Scala - ID lists of objects with duplicated values from a Spark dataset

I need to create ID lists for all objects that have identical parameters (same values and same quantity). I am looking for a solution that is more efficient than two nested loops and an if.
Object structure in the dataset:

case class MergedProduct(id: String,
                   products: List[Product])

case class Product(productUrl: String, productId: String)

Example of data in the dataset:

[  {
   "id": "ID1",
   "products": [
     {
       "product": {
         "productUrl": "SOMEURL",
         "productId": "1"
       }
     },
     {
       "product": {
         "productUrl": "SOMEOTHERURL",
         "productId": "1"
       }
     }
   ]
 },
 {
   "id": "ID2",
   "products": [
     {
       "product": {
         "productUrl": "SOMEURL",
         "productId": "1"
       }
     },
     {
       "product": {
         "productUrl": "SOMEOTHERURL",
         "productId": "1"
       }
     }
   ]
 },
 {
   "id": "ID3",
   "products": [
     {
       "product": {
         "productUrl": "DIFFERENTURL",
         "productId": "1"
       }
     },
     {
       "product": {
         "productUrl": "SOMEOTHERURL",
         "productId": "1"
       }
     }
   ]
 },
 {
   "id": "ID4",
   "products": [
     {
       "product": {
         "productUrl": "SOMEOTHERURL",
         "productId": "1"
       }
     },
     {
       "product": {
         "productUrl": "DIFFERENTURL",
         "productId": "1"
       }
     }
   ]
 },
 {
   "id": "ID5",
   "products": [
     {
       "product": {
         "productUrl": "NOTDUPLICATEDURL",
         "productId": "1"
       }
     },
     {
       "product": {
         "productUrl": "DIFFERENTURL",
         "productId": "1"
       }
     }
   ]
 }
]

In this example, 4 objects are duplicated, so I would like to get their IDs in the corresponding lists.
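Note that ID3 and ID4 hold the same products in a different order, so "identical" has to be checked independently of list order. A minimal sketch of one way to do that, assuming the flat Product case class from the question and a hypothetical helper canonicalKey (not part of the original code) that sorts the products by both fields, so that equal collections of products yield equal keys while repeated products are still counted:

```scala
object CanonicalKeyDemo {
  case class Product(productUrl: String, productId: String)

  // Hypothetical helper: an order-independent key that keeps repeated products,
  // so both the values and their quantity must match for two keys to be equal.
  def canonicalKey(products: List[Product]): List[(String, String)] =
    products.map(p => (p.productUrl, p.productId)).sorted

  // Product lists taken from ID3, ID4 and ID5 in the example data.
  val id3 = List(Product("DIFFERENTURL", "1"), Product("SOMEOTHERURL", "1"))
  val id4 = List(Product("SOMEOTHERURL", "1"), Product("DIFFERENTURL", "1"))
  val id5 = List(Product("NOTDUPLICATEDURL", "1"), Product("DIFFERENTURL", "1"))

  def main(args: Array[String]): Unit = {
    println(canonicalKey(id3) == canonicalKey(id4)) // same products, different order
    println(canonicalKey(id3) == canonicalKey(id5)) // different products
  }
}
```

With such a key, finding duplicates becomes a single grouping step instead of a pairwise comparison.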

The example output is a List[List[String]]: List(List("ID1", "ID2"), List("ID3", "ID4")). I am looking for something efficient and readable; the dataset in question has nearly 700 million objects.
I can remove the listed duplicates from the dataset (this does not affect the database), because the only goal is to log that they exist. So I was thinking about taking each MergedProduct one by one, searching for other MergedProducts with identical Products, getting their IDs, logging that they exist, then removing those MergedProduct IDs from the dataset and moving on to the next one until the whole dataset is checked. In that case, however, I would have to collect the dataset first as a list of MergedProducts and then do all the operations, which seems like a roundabout way.

After trying some options and looking for neat solutions, I think this is reasonably OK:

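private def getDuplicates(mergedProducts: List[MergedProduct]): List[List[String]] =
  mergedProducts
    // Sort by both fields so the order of products does not affect the grouping key.
    .groupBy(_.products.sortBy(p => (p.productUrl, p.productId)))
    // Keep only groups with more than one member and reduce them to their IDs.
    .collect { case (_, group) if group.size > 1 => group.map(_.id) }
    .toList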
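To check the idea against the example data from the question, here is a self-contained sketch of a variant of the function. It assumes the flat Product case class from the question and that "identical" means the same products regardless of order; the helper mp and the object name are made up for illustration. The result is sorted only to make the output order deterministic, since groupBy returns an unordered Map.

```scala
object DuplicateIdsDemo {
  case class Product(productUrl: String, productId: String)
  case class MergedProduct(id: String, products: List[Product])

  def getDuplicates(mergedProducts: List[MergedProduct]): List[List[String]] =
    mergedProducts
      // Order-independent key: products sorted by URL, then by ID.
      .groupBy(_.products.sortBy(p => (p.productUrl, p.productId)))
      .values
      .filter(_.size > 1)      // keep only groups that contain duplicates
      .map(_.map(_.id).sorted) // reduce each group to its sorted IDs
      .toList
      .sortBy(_.head)          // deterministic order of the groups

  // Hypothetical helper to build the sample rows; every productId is "1" as in the example.
  private def mp(id: String, urls: String*): MergedProduct =
    MergedProduct(id, urls.toList.map(url => Product(url, "1")))

  val sample: List[MergedProduct] = List(
    mp("ID1", "SOMEURL", "SOMEOTHERURL"),
    mp("ID2", "SOMEURL", "SOMEOTHERURL"),
    mp("ID3", "DIFFERENTURL", "SOMEOTHERURL"),
    mp("ID4", "SOMEOTHERURL", "DIFFERENTURL"),
    mp("ID5", "NOTDUPLICATEDURL", "DIFFERENTURL")
  )

  val result: List[List[String]] = getDuplicates(sample)

  def main(args: Array[String]): Unit =
    println(result)
}
```

This still works on an in-memory List, so for a dataset of this size the data would have to be collected first; the grouping key itself does not depend on that.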
