SPARQL: how to group this kind of data

Because I was worried that my situation would be hard to follow, I made this illustration of it.

[image]

A user (it doesn't matter who) likes an item, i1.

We want to suggest other items:

i1 is similar to i2 according to one criterion (so there is a similarity value; call it s1).

i1 is also similar to the same i2 according to a second criterion (so there is another similarity value; call it s2).

i1 is also similar to the same i2 according to a third criterion (so there is a third similarity value; call it s3).

In addition, i2 belongs to two classes, and each class affects the similarity by a specific weight.

My problem

I want to calculate the final similarity between i1 and i2. I have done almost all of it, except for applying the class weights.

The difficulty is that a class weight should not be applied once per criterion that led to the selection of i2. In other words, if i2 was selected 1000 times by 1000 criteria and i2 belongs to one class, then that class's weight should be applied just once, not 1000 times. If i2 belongs to two classes, the two weights should each be applied just once, regardless of how many criteria led to selecting i2.

Now

To make it easier to help, I wrote the query below. It is long, but it has to be in order to show the case. It selects only the required information, so you can simply add another layer of select above it.

prefix : <http://example.org/rs#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>


select  ?item ?similarityValue ?finalWeight where {
  values ?i1 {:i1}
  ?i1 ?similaryTo ?item .
  ?similaryTo :hasValue ?similarityValue .
  optional{
    ?item :hasContextValue ?weight .
  }
  bind (if(bound(?weight), ?weight, 1) as ?finalWeight)
}

In the result of that query, item i2 appears 6 times (as expected): once for each combination of the three similarity values (one per criterion) and the two weights. The finalWeight is therefore repeated for every criterion:

[image]

Finally

Here is the data

@prefix : <http://example.org/rs#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:i1 :similaryTo1 :i2 .
:similaryTo1 :hasValue 0.5 .
:i1 :similaryTo2 :i2 .
:similaryTo2 :hasValue 0.6 .
:i1 :similaryTo3 :i2 .
:similaryTo3 :hasValue 0.7 .
:i2 :hasContextValue 0.1 .
:i2 :hasContextValue 0.4 .
:i1 :similaryTo4 :i3 .
:similaryTo4 :hasValue 0.5 .

I hope you can help; I would really appreciate it.

So what I want to do

Imagine that there is no weight at all, so my query will be:

prefix : <http://example.org/rs#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select  ?item ?similarityValue  where {
  values ?i1 {:i1}
  ?i1 ?similaryTo ?item .
  ?similaryTo :hasValue ?similarityValue .

}

and the result will be: [image]

Then I make aggregation on the items with the sum of similarities like this:

prefix : <http://example.org/rs#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select  ?item (SUM(?similarityValue) as ?sumSimilarities)  where {
  values ?i1 {:i1}
  ?i1 ?similaryTo ?item .
  ?similaryTo :hasValue ?similarityValue .
}
group by ?item

and the result is: [image]

What I want is to multiply each row of this result by the sum of the two weights associated with ?item, which are (0.1 * 0.4) for i2 and (1) for i3.

Note that not every item has two weights: some have one and some have none. Even for items that have two, the two values could be the same, so be careful if you use distinct here.

Lastly, I always say two just as an example; in real life, this number comes from a dynamic system.

Update

After @Joshua Taylor's answer, I understood his sample data as:

[image]

Some data

First, some data that we can work with. The item :a has a bunch of similarity connections, each of which specifies an item and a reason. :a can be similar to an item for a few different reasons, and there can even be duplicated similarities with the same item and reason. I think that this matches your use case so far. (Sample data in the question could make this clearer, but I think this is along the lines of what you've got.) Then, each item has one or more context values, and each reason has an optional weight.

@prefix : <urn:ex:> .

:a :similarTo [ :item :b ; :reason :p ] ,
              [ :item :b ; :reason :p ] , # a duplicate
              [ :item :b ; :reason :q ] ,
              [ :item :b ; :reason :r ] ,
              [ :item :c ; :reason :p ] ,
              [ :item :c ; :reason :q ] ,
              [ :item :d ; :reason :r ] ,
              [ :item :d ; :reason :s ] .

:b :context 0.01 .
:b :context 0.02 .
:c :context 0.04 .
:d :context 0.05 .
:e :context 0.06 . # not used

:p :weight 0.1 .
:q :weight 0.3 .
:r :weight 0.5 .
# no weight for :s
:t :weight 0.9 . # not used

It sounds like what you want to do is to compute the sum of the context values for the similar items, including the context value for each occurrence, and also to sum the reason weights, but only for the distinct occurrences. If that's the correct understanding, then I think you want something like the following.
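To see why the distinction matters, it can help to list how often each (item, reason) pair occurs before aggregating anything. The following is a small sketch against the sample data above; the variable name ?occurrences is just for illustration.

```sparql
prefix : <urn:ex:>

#-- count how many similarity connections from :a
#-- share the same item and reason.
select ?item ?property (count(*) as ?occurrences) {
  :a :similarTo [ :item ?item ; :reason ?property ] .
}
group by ?item ?property
```

With that data, the pair (:b, :p) should be counted twice and every other pair once, so summing weights over the raw solutions would count the weight of :p twice for :b.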

Getting weights for the reasons

The first step is to be able to get the sum of weights for the distinct reasons for each similar item.

prefix : <urn:ex:>

select * where {
  values ?i { :a }

  #-- get the sum of weights of distinct reasons
  #-- for each item that is similar to ?i.
  { select ?item (sum(?weight) as ?propertyWeight) {
      #-- get the distinct properties for each ?item
      #-- along with their weights.
      { select distinct ?item ?property ?weight {
          ?i :similarTo [ :item ?item ; :reason ?property ] .
          optional { ?property :weight ?weight_ }
          bind(if(bound(?weight_), ?weight_, 0.0) as ?weight)
        } }
    }
    group by ?item
  }
}

------------------------------
| i  | item | propertyWeight |
==============================
| :a | :b   | 0.9            |
| :a | :c   | 0.4            |
| :a | :d   | 0.5            |
------------------------------

Getting weights for the items

Now, you still need the sum of the values for each item, counting the weight for each occurrence. So we extend the query:

prefix : <urn:ex:>

select * where {
  values ?i { :a }

  #-- get the sum of weights of distinct reasons
  #-- for each item that is similar to ?i.
  { select ?item (sum(?weight) as ?propertyWeight) {
      #-- get the distinct properties for each ?item
      #-- along with their weights.
      { select distinct ?item ?property ?weight {
          ?i :similarTo [ :item ?item ; :reason ?property ] .
          optional { ?property :weight ?weight_ }
          bind(if(bound(?weight_), ?weight_, 0.0) as ?weight)
        } }
    }
    group by ?item
  }

  #-- get the sum of the context values
  #-- for each item.
  { select ?item (sum(?context_) as ?context) {
      ?item :context ?context_ .
    }
    group by ?item
  }
}

----------------------------------------
| i  | item | propertyWeight | context |
========================================
| :a | :b   | 0.9            | 0.03    |
| :a | :c   | 0.4            | 0.04    |
| :a | :d   | 0.5            | 0.05    |
----------------------------------------

Note that it's OK that we search for ?item :context ?context_ . in the second subquery without even ensuring that ?item is one of the similar items. Since the results of the two subqueries are joined, we'll only get results for the values of ?item that were also returned by the first subquery.
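One way to convince yourself of this is to run the second subquery on its own. A sketch against the same sample data:

```sparql
prefix : <urn:ex:>

#-- the context subquery by itself, with no
#-- restriction to items that are similar to :a.
select ?item (sum(?context_) as ?context) {
  ?item :context ?context_ .
}
group by ?item
```

Run alone, this also returns a row for :e (with context 0.06), even though :e is not similar to :a. That row disappears in the combined query because the first subquery never binds ?item to :e.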

Putting them together

Now you can just add, or multiply, or do whatever else you want to do to combine the sum of the reason weights with the sum of the context values. For instance, if you're summing them:

prefix : <urn:ex:>

select ?i ?item ((?propertyWeight + ?context) as ?similarity) where {
  values ?i { :a }

  #-- get the sum of weights of distinct reasons
  #-- for each item that is similar to ?i.
  { select ?item (sum(?weight) as ?propertyWeight) {
      #-- get the distinct properties for each ?item
      #-- along with their weights.
      { select distinct ?item ?property ?weight {
          ?i :similarTo [ :item ?item ; :reason ?property ] .
          optional { ?property :weight ?weight_ }
          bind(if(bound(?weight_), ?weight_, 0.0) as ?weight)
        } }
    }
    group by ?item
  }

  #-- get the sum of the context values
  #-- for each item.
  { select ?item (sum(?context_) as ?context) {
      ?item :context ?context_ .
    }
    group by ?item
  }
}

--------------------------
| i  | item | similarity |
==========================
| :a | :b   | 0.93       |
| :a | :c   | 0.44       |
| :a | :d   | 0.55       |
--------------------------

Final cleanup

Looking at the final query, two things bugged me a bit. The first is that we retrieved the reason weight for each solution in the innermost subquery, whereas we only need to retrieve it once per property per item. That is, we can move the optional part out of the innermost subquery and into the subquery that encloses it. The second is that we've got a bind that sets a variable that we only use in the aggregation. We can replace it by summing coalesce(?weight,0.0), which uses ?weight if it's bound, and 0.0 otherwise. After making those changes, we end up with:

prefix : <urn:ex:>

select ?i ?item ((?propertyWeight + ?context) as ?similarity) where {
  values ?i { :a }

  #-- get the sum of weights of distinct properties
  #-- using 0.0 as the weight for a property that doesn't
  #-- actually specify a weight.
  { select ?item (sum(coalesce(?weight,0.0)) as ?propertyWeight) {

      #-- get the distinct properties for each ?item.
      { select distinct ?item ?property {
          ?i :similarTo [ :item ?item ; :reason ?property ] .
        } }

      #-- then get each property's optional weight.
      optional { ?property :weight ?weight }
    }
    group by ?item
  }

  #-- get the sum of the context values
  #-- for each item.
  { select ?item (sum(?context_) as ?context) {
      ?item :context ?context_ .
    }
    group by ?item
  }
}

It's not a huge change, but it makes things a little bit cleaner, I think, and a little bit easier to understand.
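The effect of coalesce can be checked in isolation. This is a minimal sketch using the sample data, where :p has a weight and :s does not; ?effectiveWeight is a name chosen for illustration.

```sparql
prefix : <urn:ex:>

#-- coalesce returns its first argument that evaluates
#-- without error, so an unbound ?weight falls back to 0.0.
select ?property (coalesce(?weight, 0.0) as ?effectiveWeight) {
  values ?property { :p :s }
  optional { ?property :weight ?weight }
}
```

This should give 0.1 for :p and 0.0 for :s, the same values that the earlier bind(if(bound(...))) expression produced.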

Comments

It's almost become my mantra at this point, but these kinds of questions are much easier to answer if there's sample data provided. In this case, most of the actual mechanics of how you're getting these values in the first place don't really matter. It's how you're aggregating them afterward that does. That's why we can use really simple data like what I created from scratch at the beginning of this answer.

I think that the big take-away from this, though, is that one of the important techniques in using SPARQL (and other query languages, too, I expect) is having separate subqueries and joining their results. In this case, we ended up with a couple of subqueries, because we really needed to group in a couple of different ways. This could have been simpler if SPARQL provided a "distinct by" operator, so that we could say something like

sum(distinct by(?property) ?weight)

but that has the issue that if a distinct property could have more than one weight, which of those weights would you choose? So the solution really seems to be a couple of subqueries so that we can do a few different kinds of grouping. This is why I was asking about the actual formula you're trying to compute.
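As an illustration of that pattern on the data from the question, here is a sketch that sums the similarity values in one subquery and the weights in another, then multiplies the two. It assumes the weights are meant to be summed (the question says "sum" but writes 0.1 * 0.4, and SPARQL 1.1 has no built-in product aggregate) and that an item with no weight gets a factor of 1. The names ?criterion, ?weightSum and ?finalSimilarity are made up for this example.

```sparql
prefix : <http://example.org/rs#>

select ?item ((?sumSimilarities * coalesce(?weightSum, 1)) as ?finalSimilarity) where {

  #-- sum the similarity values, one per criterion.
  { select ?item (sum(?similarityValue) as ?sumSimilarities) {
      values ?i1 { :i1 }
      ?i1 ?criterion ?item .
      ?criterion :hasValue ?similarityValue .
    }
    group by ?item
  }

  #-- sum the weights once per item; optional, because
  #-- some items have no weight at all.
  optional {
    select ?item (sum(?weight) as ?weightSum) {
      ?item :hasContextValue ?weight .
    }
    group by ?item
  }
}
```

Because the weights are aggregated once per item in their own subquery, they are applied once no matter how many criteria selected the item. With the question's data this should multiply the 1.8 total for :i2 by 0.5, and leave the 0.5 for :i3 unchanged.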

