How can I build a better SPARQL query, to get only the data I want from DBpedia? (was: “How to get rid of multiple rows with DBPEDIA SPARQL”)

I run the query below from the SPARQL Explorer at DBpedia. I want each president to appear only once, but because some of them have multiple birthplace entries, I get multiple rows.

SELECT DISTINCT ?person ?birthPlace  ?presidentStart ?presidentEnd 
WHERE {
      ?person dct:subject dbc:Presidents_of_the_United_States.
      ?person dbo:birthPlace ?birthPlace .

       OPTIONAL { ?person dbp:presidentEnd   ?presidentEnd }  .
       OPTIONAL { ?person dbp:presidentStart ?presidentStart }  . 

FILTER ( regex(?birthPlace,   "_")  OR
         regex(?birthPlace, ";_")
       ) . 
} 
GROUP BY ?person 
ORDER BY  ?presidentStart ?person 
LIMIT 100

I would like to have only the STATE where they are born.

person                  birthPlace                                         presidentStart  presidentEnd
:Abraham_Lincoln        :Hodgenville,_Kentucky                             -               -
:Barack_Obama           :Kapiolani_Medical_Center_for_Women_and_Children   -               -
:Bill_Clinton           :Hope,_Arkansas                                    -               -
:Dwight_D._Eisenhower   :Denison,_Texas                                    -               -
:George_W._Bush         :New_Haven,_Connecticut                            -               -
:George_Washington      :Westmoreland_County,_Virginia                     -               -
:George_Washington      :British_America                                   -               -
:George_Washington      :George_Washington_Birthplace_National_Monument    -               -
:James_A._Garfield      :Orange,_Ohio                                      -               -
:James_A._Garfield      :Moreland_Hills,_Ohio                              -               -
:Jimmy_Carter           :Plains,_Georgia

Because SPARQL is a pattern-matching language, the trick when your query result is too broad is to write a more specific pattern. In this case, your intent is not to get back every resource that appears as a dbo:birthPlace value, but only those resources that represent US states.

So we need to figure out how US states are distinguished from other locations in DBpedia.

Let's take Kentucky as an example. The resource representing Kentucky is http://dbpedia.org/resource/Kentucky . If we scroll down the page listing the properties of that resource, we find multiple entries for the rdf:type relation; the one that looks most suitable is yago:WikicatStatesOfTheUnitedStates ( http://dbpedia.org/class/yago/WikicatStatesOfTheUnitedStates ).
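
The same rdf:type listing can be retrieved with a query instead of scrolling the resource page. This is a minimal sketch, assuming the public DBpedia endpoint; the prefix is declared explicitly, although the endpoint predefines it:

```sparql
PREFIX dbr: <http://dbpedia.org/resource/>

# List every class that DBpedia assigns to the Kentucky resource.
SELECT ?type
WHERE {
  dbr:Kentucky a ?type .
}
ORDER BY ?type
```

Scanning the returned classes for one that is specific to US states is what leads to yago:WikicatStatesOfTheUnitedStates.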

If we modify your query to put that in as an extra restriction, and drop the weird regular expression, like so:

SELECT DISTINCT ?person ?birthPlace  ?presidentStart ?presidentEnd 
WHERE {
      ?person dct:subject dbc:Presidents_of_the_United_States.
      ?person dbo:birthPlace ?birthPlace .
      ?birthPlace a yago:WikicatStatesOfTheUnitedStates .

   OPTIONAL { ?person dbp:presidentEnd   ?presidentEnd }  .
   OPTIONAL { ?person dbp:presidentStart ?presidentStart }  .  
} 
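# GROUP BY ?person removed: standard SPARQL does not allow selecting
# non-aggregated variables that are not listed in the GROUP BY clause.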
ORDER BY  ?presidentStart ?person 
LIMIT 100

You should get what you need.

Unfortunately, if you try, you find that you don't. This is because DBpedia data is messy. The above query returns only three results, and worse, one of them is clearly incorrect:

person                 birthPlace   presidentStart  presidentEnd
dbr:Barack_Obama       dbr:Hawaii
dbr:George_Washington  dbr:Virginia
dbr:Theodore_Roosevelt dbr:New_York_City        

Two things are going on here. First, New York City is incorrectly classified as a state in DBpedia. Second, most presidents do not have a state recorded directly as their birthplace, only something more specific such as their home town.

Fortunately, we can amend the query slightly. DBpedia knows that Hodgenville, Kentucky, is located in Kentucky. How does it know? Have a look at the resource page for Hodgenville: http://dbpedia.org/resource/Hodgenville,_Kentucky . You'll see that it has a dbo:isPartOf relation with the resource representing the state of Kentucky.
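
The dbo:isPartOf link can also be checked with a small query rather than on the resource page. This is a sketch assuming the public DBpedia endpoint; ?container is just an illustrative variable name:

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>

# The IRI is written in full because the comma in the local name
# would need escaping inside a prefixed name.
SELECT ?container
WHERE {
  <http://dbpedia.org/resource/Hodgenville,_Kentucky> dbo:isPartOf ?container .
}
```

If the data is as described, the resource for Kentucky should be among the values returned.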

So, we need to rephrase our query again: we want the state for each president where their birthplace is part of that state. In SPARQL:

SELECT DISTINCT ?person ?birthState  ?presidentStart ?presidentEnd 
WHERE {
      ?person dct:subject dbc:Presidents_of_the_United_States.
      ?person dbo:birthPlace ?birthPlace .
      ?birthPlace dbo:isPartOf ?birthState .
      ?birthState a yago:WikicatStatesOfTheUnitedStates .

   OPTIONAL { ?person dbp:presidentEnd   ?presidentEnd }  .
   OPTIONAL { ?person dbp:presidentStart ?presidentStart }  .  
} 
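# GROUP BY ?person removed: standard SPARQL does not allow selecting
# non-aggregated variables that are not listed in the GROUP BY clause.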
ORDER BY  ?presidentStart ?person 
LIMIT 100

This should get you almost exactly the result you need.
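
The dbo:isPartOf pattern does not match a dbo:birthPlace value that is itself a state, as with dbr:Hawaii and dbr:Virginia in the earlier result. One possible variant, offered as an untested sketch, accepts either case with a UNION and excludes the misclassified dbr:New_York_City by hand:

```sparql
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX dbc:  <http://dbpedia.org/resource/Category:>
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX dbr:  <http://dbpedia.org/resource/>
PREFIX yago: <http://dbpedia.org/class/yago/>

SELECT DISTINCT ?person ?birthState
WHERE {
  ?person dct:subject dbc:Presidents_of_the_United_States .
  {
    # Case 1: the birthplace is recorded as the state itself.
    ?person dbo:birthPlace ?birthState .
  }
  UNION
  {
    # Case 2: the birthplace is a place that is part of the state.
    ?person dbo:birthPlace ?birthPlace .
    ?birthPlace dbo:isPartOf ?birthState .
  }
  ?birthState a yago:WikicatStatesOfTheUnitedStates .
  # New York City carries the state class in DBpedia, so drop it explicitly.
  FILTER (?birthState != dbr:New_York_City)
}
ORDER BY ?person
LIMIT 100
```

The hard-coded exclusion only covers the one misclassification noted above; other bad entries would need the same treatment.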

Update: as you noted, Donald Trump is missing from the list. This appears to be because DBpedia is behind the times and still classifies him as a "presidential candidate" rather than a president.

As for Grover Cleveland appearing four times, this is an interesting anomaly. Cleveland served two non-consecutive terms as president, from 1885 to 1889 and again from 1893 to 1897, so there are two start dates and two end dates. Because DBpedia does not explicitly model which start date belongs to which end date, you get one result for each combination of start and end dates, four in total. There may be a way to query around this (one option would be to group start and end dates together using a group_concat aggregate), but it's such an edge case that it might be simpler to handle it in post-processing.
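
The group_concat idea could look roughly like the following. It is a sketch rather than a verified query: it groups by ?person, picks one state per president with SAMPLE, and joins the start and end values into single strings, so each president should appear once. The aliases ?state, ?starts and ?ends are names chosen for illustration, and how the endpoint renders the concatenated values may vary.

```sparql
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX dbc:  <http://dbpedia.org/resource/Category:>
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX dbp:  <http://dbpedia.org/property/>
PREFIX yago: <http://dbpedia.org/class/yago/>

SELECT ?person
       (SAMPLE(?birthState) AS ?state)
       (GROUP_CONCAT(DISTINCT STR(?presidentStart); SEPARATOR=", ") AS ?starts)
       (GROUP_CONCAT(DISTINCT STR(?presidentEnd);   SEPARATOR=", ") AS ?ends)
WHERE {
  ?person dct:subject dbc:Presidents_of_the_United_States .
  ?person dbo:birthPlace ?birthPlace .
  ?birthPlace dbo:isPartOf ?birthState .
  ?birthState a yago:WikicatStatesOfTheUnitedStates .

  OPTIONAL { ?person dbp:presidentEnd   ?presidentEnd }
  OPTIONAL { ?person dbp:presidentStart ?presidentStart }
}
# Every selected value other than ?person is an aggregate,
# so grouping by ?person alone is valid here.
GROUP BY ?person
ORDER BY ?starts ?person
LIMIT 100
```

Unlike the earlier queries, every projected variable here is either the grouping key or an aggregate, which is what makes GROUP BY collapse the rows.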

Focusing on

I would like to have only the STATE where they are born

rather than on

How to get rid of multiple rows with DBPEDIA SPARQL

this could be a solution:

SELECT DISTINCT ?person ?birthState  ?presidentStart ?presidentEnd 
WHERE {
      ?person dct:subject dbc:Presidents_of_the_United_States.


       OPTIONAL { ?person dbp:presidentEnd   ?presidentEnd }  .
       OPTIONAL { ?person dbp:presidentStart ?presidentStart }  .
       OPTIONAL {?person dbo:birthPlace/dbp:subdivisionType/dbp:territory ?birthState } .

FILTER ( regex(str(?birthState),   "_")  ||
         regex(str(?birthState), ";_")
       ) . 
} 
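# GROUP BY ?person removed: standard SPARQL does not allow selecting
# non-aggregated variables that are not listed in the GROUP BY clause.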
ORDER BY  ?presidentStart ?person 
LIMIT 100
