SPARQL: how to group this kind of data

I was worried that my situation would be hard to follow, so I made this illustration (click on the image for a higher-quality version).

[image]

I know that a user (whoever it is; we don't care) likes an item (i1).

We want to suggest other items:

i1 is similar to i2 according to a specific criterion (so there is a similarity value; let's call it s1).

i1 is also similar to the same i2 according to another criterion (so there is a similarity value; let's call it s2).

i1 is also similar to the same i2 according to a third criterion (so there is a similarity value; let's call it s3).

Now, i2 belongs to two classes, and each of them affects the similarity by a specific weight.

My problem

I want to calculate the final similarity between i1 and i2. I have done almost all of it, except for the weight of the specific class.

The difficulty is that this weight should not be applied once per criterion that led to the selection of i2. In other words, if i2 was selected 1000 times using 1000 criteria, and i2 belongs to a specific class, then the weight of that class should be applied just once, not 1000 times. If i2 belongs to two classes, the weights of those two classes should each be applied just once, regardless of how many criteria led to selecting i2.

Now

To make it easy to help me, I wrote this query (it is long, but it has to be to show the case). It selects only the required information, so you can simply add another layer of select on top of it.

prefix : <http://example.org/rs#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>


select  ?item ?similarityValue ?finalWeight where {
  values ?i1 {:i1}
  ?i1 ?similaryTo ?item .
  ?similaryTo :hasValue ?similarityValue .
  optional{
    ?item :hasContextValue ?weight .
  }
  bind (if(bound(?weight), ?weight, 1) as ?finalWeight)
}

In the result of that query, item i2 appears 6 times, as expected: there are three different similarities (because of the three criteria), and finalWeight, which is the weight, repeats for each criterion:

[image]
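One way to see where the six rows come from is to count the solutions per item. The following is a minimal sketch against the data given in this question; the alias ?rows is a name introduced here for illustration only.

```sparql
prefix : <http://example.org/rs#>

# Count the solutions produced for each item. Every similarity
# triple is paired with every :hasContextValue of the same item,
# so the row count is (number of similarities) x (number of weights).
select ?item (count(*) as ?rows) where {
  values ?i1 { :i1 }
  ?i1 ?similaryTo ?item .
  ?similaryTo :hasValue ?similarityValue .
  optional { ?item :hasContextValue ?weight }
}
group by ?item
```

With the sample data, :i2 has three similarities and two weights, which gives 3 x 2 = 6 rows, while :i3 has one similarity and no weight, which gives a single row.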

Finally

Here is the data

@prefix : <http://example.org/rs#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:i1 :similaryTo1 :i2 .
:similaryTo1 :hasValue 0.5 .
:i1 :similaryTo2 :i2 .
:similaryTo2 :hasValue 0.6 .
:i1 :similaryTo3 :i2 .
:similaryTo3 :hasValue 0.7 .
:i2 :hasContextValue 0.1 .
:i2 :hasContextValue 0.4 .
:i1 :similaryTo4 :i3 .
:similaryTo4 :hasValue 0.5 .

I hope you can help me; I would really appreciate it.

What I want to do

Imagine that there is no weight at all. Then my query would be:

prefix : <http://example.org/rs#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select  ?item ?similarityValue  where {
  values ?i1 {:i1}
  ?i1 ?similaryTo ?item .
  ?similaryTo :hasValue ?similarityValue .

}

and the result would be:

[image]

Then I aggregate over the items, summing the similarities, like this:

prefix : <http://example.org/rs#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select  ?item (SUM(?similarityValue) as ?sumSimilarities)  where {
  values ?i1 {:i1}
  ?i1 ?similaryTo ?item .
  ?similaryTo :hasValue ?similarityValue .
}
group by ?item

and the result is:

[image]

What I want is to multiply each row of this result by the sum of the two weights associated with ?item, which are (0.1 * 0.4) for i2 and (1) for i3.

Notice that some items don't have two weights: some have one, and some have none. Notice also that even for those that have two, the two values could be the same, so be careful if you use distinct here.

Lastly, I always say two just as an example; in real life, this number comes from the dynamic system.
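The goal described above can be sketched with two independently grouped subqueries that are joined on ?item. This is only a sketch under stated assumptions: the weights are summed (as the sentence says, although the parenthesised example writes a product), an item with no weight gets the neutral value 1, and the names ?weightSum_, ?weightSum and ?finalSimilarity are introduced here for illustration. SPARQL 1.1 has no built-in product aggregate, so multiplying the weights together would need a different approach.

```sparql
prefix : <http://example.org/rs#>

select ?item ?sumSimilarities ?weightSum
       ((?sumSimilarities * ?weightSum) as ?finalSimilarity)
where {
  # Sum the similarities per item, one value per criterion.
  { select ?item (sum(?similarityValue) as ?sumSimilarities) {
      values ?i1 { :i1 }
      ?i1 ?similaryTo ?item .
      ?similaryTo :hasValue ?similarityValue .
    }
    group by ?item
  }

  # Sum the weights per item in a separate subquery, so that they
  # are counted once per item rather than once per criterion.
  optional {
    { select ?item (sum(?weight) as ?weightSum_) {
        ?item :hasContextValue ?weight .
      }
      group by ?item
    }
  }

  # Assumption: an item without any weight is treated as weight 1.
  bind(coalesce(?weightSum_, 1) as ?weightSum)
}
```

On the sample data this should give 1.8 x (0.1 + 0.4) = 0.9 for :i2 and 0.5 x 1 = 0.5 for :i3. Because the weights are aggregated before the join, the number of criteria no longer influences how often a weight is applied.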

Update

After @Joshua Taylor's answer, I understood his sample data as:

[image]

Some data

First, some data that we can work with. The item :a has a bunch of similarity connections, each of which specifies an item and a reason. :a can be similar to an item for a few different reasons, and there can even be duplicated similarities with the same item and reason. I think that this matches your use case so far. (Sample data in the question could make this clearer, but I think this is along the lines of what you've got.) Then, each item has a contextual value, and each reason has an optional weight.

@prefix : <urn:ex:> .

:a :similarTo [ :item :b ; :reason :p ] ,
              [ :item :b ; :reason :p ] , # a duplicate
              [ :item :b ; :reason :q ] ,
              [ :item :b ; :reason :r ] ,
              [ :item :c ; :reason :p ] ,
              [ :item :c ; :reason :q ] ,
              [ :item :d ; :reason :r ] ,
              [ :item :d ; :reason :s ] .

:b :context 0.01 .
:b :context 0.02 .
:c :context 0.04 .
:d :context 0.05 .
:e :context 0.06 . # not used

:p :weight 0.1 .
:q :weight 0.3 .
:r :weight 0.5 .
# no weight for :s
:t :weight 0.9 . # not used

It sounds like what you want to do is to compute the sum of the context values for the similar items, including the context value for each occurrence, and also to sum the reason weights, but only for the distinct occurrences. If that's the correct understanding, then I think you want something like the following.

Getting weights for the reasons

The first step is to get the sum of the weights of the distinct reasons for each similar item.

prefix : <urn:ex:>

select * where {
  values ?i { :a }

  #-- get the sum of weights of distinct reasons
  #-- for each item that is similar to ?i.
  { select ?item (sum(?weight) as ?propertyWeight) {
      #-- get the distinct properties for each ?item
      #-- along with their weights.
      { select distinct ?item ?property ?weight {
          ?i :similarTo [ :item ?item ; :reason ?property ] .
          optional { ?property :weight ?weight_ }
          bind(if(bound(?weight_), ?weight_, 0.0) as ?weight)
        } }
    }
    group by ?item
  }
}
------------------------------
| i  | item | propertyWeight |
==============================
| :a | :b   | 0.9            |
| :a | :c   | 0.4            |
| :a | :d   | 0.5            |
------------------------------

Getting weights for the items

Now, you still need the sum of the values for each item, counting the weight for each occurrence. So we extend the query:

prefix : <urn:ex:>

select * where {
  values ?i { :a }

  #-- get the sum of weights of distinct reasons
  #-- for each item that is similar to ?i.
  { select ?item (sum(?weight) as ?propertyWeight) {
      #-- get the distinct properties for each ?item
      #-- along with their weights.
      { select distinct ?item ?property ?weight {
          ?i :similarTo [ :item ?item ; :reason ?property ] .
          optional { ?property :weight ?weight_ }
          bind(if(bound(?weight_), ?weight_, 0.0) as ?weight)
        } }
    }
    group by ?item
  }

  #-- get the sum of the context values
  #-- for each item.
  { select ?item (sum(?context_) as ?context) {
      ?item :context ?context_ .
    }
    group by ?item
  }
}
----------------------------------------
| i  | item | propertyWeight | context |
========================================
| :a | :b   | 0.9            | 0.03    |
| :a | :c   | 0.4            | 0.04    |
| :a | :d   | 0.5            | 0.05    |
----------------------------------------

Note that it's OK that we search for ?item :context ?context_ in the second subquery without even ensuring that ?item is one of the similar items. Since the results of the two subqueries are joined, we'll only get results for the values of ?item that were also returned by the first subquery.
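The same join also means that a similar item with no :context triple at all would drop out of the result. If that is not wanted, one possible variation is to make the second subquery optional. The sketch below assumes that a missing context should count as 0.0 and uses the hypothetical name ?contextSum; :a is written inline for brevity.

```sparql
prefix : <urn:ex:>

select ?item ?propertyWeight (coalesce(?contextSum, 0.0) as ?context)
where {
  # Sum of the weights of the distinct reasons per item.
  { select ?item (sum(coalesce(?weight, 0.0)) as ?propertyWeight) {
      { select distinct ?item ?property {
          :a :similarTo [ :item ?item ; :reason ?property ] .
        } }
      optional { ?property :weight ?weight }
    }
    group by ?item
  }

  # Left join: keep items even when they have no context value.
  optional {
    { select ?item (sum(?c) as ?contextSum) {
        ?item :context ?c .
      }
      group by ?item
    }
  }
}
```

With the sample data every similar item has a context value, so the rows are the same as in the table above; the difference only shows up for items that lack one.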

Putting them together

Now you can add, multiply, or do whatever else you want to combine the sum of the reason weights with the sum of the context values. For instance, if you're summing them:

prefix : <urn:ex:>

select ?i ?item ((?propertyWeight + ?context) as ?similarity) where {
  values ?i { :a }

  #-- get the sum of weights of distinct reasons
  #-- for each item that is similar to ?i.
  { select ?item (sum(?weight) as ?propertyWeight) {
      #-- get the distinct properties for each ?item
      #-- along with their weights.
      { select distinct ?item ?property ?weight {
          ?i :similarTo [ :item ?item ; :reason ?property ] .
          optional { ?property :weight ?weight_ }
          bind(if(bound(?weight_), ?weight_, 0.0) as ?weight)
        } }
    }
    group by ?item
  }

  #-- get the sum of the context values
  #-- for each item.
  { select ?item (sum(?context_) as ?context) {
      ?item :context ?context_ .
    }
    group by ?item
  }
}
--------------------------
| i  | item | similarity |
==========================
| :a | :b   | 0.93       |
| :a | :c   | 0.44       |
| :a | :d   | 0.55       |
--------------------------

Final cleanup

Looking at the final query, two things bugged me a bit. The first is that we retrieved the reason weight for each solution in the innermost subquery, whereas we only need to retrieve it once per property per item. That is, we can move the optional part out of the innermost subquery and into the subquery that encloses it. Second, we've got a bind that sets a variable that we only use in the aggregation. We can replace it by summing coalesce(?weight, 0.0), which uses ?weight if it's bound and 0.0 otherwise. After making those changes, we end up with:

prefix : <urn:ex:>

select ?i ?item ((?propertyWeight + ?context) as ?similarity) where {
  values ?i { :a }

  #-- get the sum of weights of distinct properties
  #-- using 0.0 as the weight for a property that doesn't
  #-- actually specify a weight.
  { select ?item (sum(coalesce(?weight,0.0)) as ?propertyWeight) {

      #-- get the distinct properties for each ?item.
      { select distinct ?item ?property {
          ?i :similarTo [ :item ?item ; :reason ?property ] .
        } }

      #-- then get each property's optional weight.
      optional { ?property :weight ?weight }
    }
    group by ?item
  }

  #-- get the sum of the context values
  #-- for each item.
  { select ?item (sum(?context_) as ?context) {
      ?item :context ?context_ .
    }
    group by ?item
  }
}

It's not a huge change, but I think it makes things a little bit cleaner and a little bit easier to understand.
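One scoping detail may be worth checking when adapting these queries to real data. In SPARQL 1.1, a variable used inside a subquery is only visible outside it if the subquery projects it. The innermost subquery uses ?i but does not project it, so values ?i { :a } in the outer query does not restrict it, and the pattern can match similarity links from any subject. With the sample data only :a has such links, so the results are unaffected. A sketch that threads ?i through the projections, under that reading:

```sparql
prefix : <urn:ex:>

select ?i ?item ((?propertyWeight + ?context) as ?similarity) where {
  { select ?i ?item (sum(coalesce(?weight, 0.0)) as ?propertyWeight) {
      # ?i is restricted here and projected outwards, so the
      # similarity links are taken from :a only.
      { select distinct ?i ?item ?property {
          values ?i { :a }
          ?i :similarTo [ :item ?item ; :reason ?property ] .
        } }
      optional { ?property :weight ?weight }
    }
    group by ?i ?item
  }

  # Sum of the context values per item, as before.
  { select ?item (sum(?context_) as ?context) {
      ?item :context ?context_ .
    }
    group by ?item
  }
}
```

On the sample data this should return the same three rows as before (0.93 for :b, 0.44 for :c, 0.55 for :d).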

Comments

It's almost become my mantra at this point, but these kinds of questions are much easier to answer if sample data is provided. In this case, most of the actual mechanics of how you're getting these values in the first place don't really matter. What matters is how you aggregate them afterward. That's why we can use really simple data like what I created from scratch at the beginning of this answer.

I think that the big take-away from this, though, is that one of the important techniques in using SPARQL (and other query languages, too, I expect) is having separate subqueries and joining their results. In this case, we ended up with a couple of subqueries, because we really needed to group in a couple of different ways. This could have been simpler if SPARQL provided a "distinct by" operator, so that we could say something like

sum(distinct by(?property) ?weight)

but that has the issue that if a distinct property could have more than one weight, which of those weights would you choose? So the solution really seems to be a couple of subqueries so that we can do a few different kinds of grouping. This is why I was asking about the actual formula you're trying to compute.
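If a property could carry more than one weight, the choice has to be made explicitly, for example in a further subquery that reduces the weights to one value per property. The sketch below picks the largest weight with max; that choice, the name ?chosenWeight, and the inline :a are assumptions made for illustration, not something the question specifies.

```sparql
prefix : <urn:ex:>

select ?item (sum(coalesce(?chosenWeight, 0.0)) as ?propertyWeight) where {
  # Distinct (item, reason) pairs, so duplicates count once.
  { select distinct ?item ?property {
      :a :similarTo [ :item ?item ; :reason ?property ] .
    } }

  # Reduce each property's weights to a single value.
  # Assumption: the largest weight wins; min, avg or sample
  # would be used in the same position.
  optional {
    { select ?property (max(?weight) as ?chosenWeight) {
        ?property :weight ?weight .
      }
      group by ?property
    }
  }
}
group by ?item
```

In the sample data each reason has at most one weight, so this should give the same sums as the first query in this answer (0.9, 0.4 and 0.5).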
