How can I build a better SPARQL query, to get only the data I want from DBpedia? (was: "How to get rid of multiple rows with DBPEDIA SPARQL")

I run the query below from the SPARQL Explorer at DBpedia. I want each president to appear only once, but because some of them have multiple entries for birthplace, the query returns multiple rows.

SELECT DISTINCT ?person ?birthPlace  ?presidentStart ?presidentEnd 
WHERE {
      ?person dct:subject dbc:Presidents_of_the_United_States.
      ?person dbo:birthPlace ?birthPlace .

       OPTIONAL { ?person dbp:presidentEnd   ?presidentEnd }  .
       OPTIONAL { ?person dbp:presidentStart ?presidentStart }  . 

FILTER ( regex(?birthPlace,   "_")  OR
         regex(?birthPlace, ";_")
       ) . 
} 
GROUP BY ?person 
ORDER BY  ?presidentStart ?person 
LIMIT 100

I would like to have only the STATE where they were born.

person                 birthPlace                                       presidentStart  presidentEnd
:Abraham_Lincoln       :Hodgenville,_Kentucky                           -               -
:Barack_Obama          :Kapiolani_Medical_Center_for_Women_and_Children -               -
:Bill_Clinton          :Hope,_Arkansas                                  -               -
:Dwight_D._Eisenhower  :Denison,_Texas                                  -               -
:George_W._Bush        :New_Haven,_Connecticut                          -               -
:George_Washington     :Westmoreland_County,_Virginia                   -               -
:George_Washington     :British_America                                 -               -
:George_Washington     :George_Washington_Birthplace_National_Monument  -               -
:James_A._Garfield     :Orange,_Ohio                                    -               -
:James_A._Garfield     :Moreland_Hills,_Ohio                            -               -
:Jimmy_Carter          :Plains,_Georgia

As SPARQL is a pattern-matching language, the trick when your query result is too broad is to create a more specific pattern. In this case, your intent is not to get back every resource that appears as a dbo:birthPlace value, but only those resources that represent US states.

So we need to figure out how US states are distinguished from other locations in DBpedia.

Let's take Kentucky as an example. The resource representing Kentucky is http://dbpedia.org/resource/Kentucky. If we scroll down the page listing the properties of that resource, we find multiple entries for the rdf:type relation, but the one that jumps out at me as most suitable is yago:WikicatStatesOfTheUnitedStates (http://dbpedia.org/class/yago/WikicatStatesOfTheUnitedStates).
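The same inspection can be done with a query instead of scrolling through the resource page. This is a minimal sketch that lists every class dbr:Kentucky is declared an instance of; the prefix is written out so the query does not depend on the endpoint's predefined namespaces, and the classes it returns depend on whatever the DBpedia endpoint currently holds.

```sparql
PREFIX dbr: <http://dbpedia.org/resource/>

# List every rdf:type value recorded for the Kentucky resource.
SELECT DISTINCT ?type
WHERE {
  dbr:Kentucky a ?type .
}
ORDER BY ?type
```

Scanning the output for a class that looks specific to US states is how a candidate such as yago:WikicatStatesOfTheUnitedStates would be spotted.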

If we modify your query to add that as an extra restriction, and drop the weird regular expression, like so:

SELECT DISTINCT ?person ?birthPlace ?presidentStart ?presidentEnd
WHERE {
  ?person dct:subject dbc:Presidents_of_the_United_States .
  ?person dbo:birthPlace ?birthPlace .
  # Extra restriction: keep only birthplaces typed as US states.
  ?birthPlace a yago:WikicatStatesOfTheUnitedStates .

  OPTIONAL { ?person dbp:presidentEnd   ?presidentEnd }
  OPTIONAL { ?person dbp:presidentStart ?presidentStart }
}
ORDER BY ?presidentStart ?person
LIMIT 100

You should get what you need.

Unfortunately, if you try it, you find that you don't. This is because DBpedia data is messy. The above query returns only three results and, worse, one of them is clearly incorrect:

person                 birthPlace   presidentStart  presidentEnd
dbr:Barack_Obama       dbr:Hawaii
dbr:George_Washington  dbr:Virginia
dbr:Theodore_Roosevelt dbr:New_York_City        

There are two things going on here. First, New York City is incorrectly classified as a state in DBpedia. Second, most presidents do not have their state explicitly recorded as their birthplace, only more specific places such as their home town.
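One way to see how far the misclassification goes is to list everything that carries the state class. This is a sketch only: it assumes the yago namespace shown in the class URL above, and the members it returns will reflect the current state of the DBpedia data rather than the results quoted here.

```sparql
PREFIX yago: <http://dbpedia.org/class/yago/>

# Audit the class: any member that is not one of the 50 states
# (a city, for instance) is a misclassification to watch out for.
SELECT DISTINCT ?state
WHERE {
  ?state a yago:WikicatStatesOfTheUnitedStates .
}
ORDER BY ?state
```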

Fortunately, we can amend the query slightly. DBpedia knows that Hodgenville, Kentucky, is located in Kentucky. How does it know? Well, have a look at the resource page for Hodgenville: http://dbpedia.org/resource/Hodgenville,_Kentucky. You'll see that it has a dbo:isPartOf relation with the resource representing the state of Kentucky.
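The relation can also be checked directly with a query. In this sketch the resource is written as a full IRI because its name contains a comma, which would otherwise need escaping in a prefixed name; the values returned depend on the current DBpedia data.

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>

# List everything Hodgenville is recorded as being part of.
SELECT ?container
WHERE {
  <http://dbpedia.org/resource/Hodgenville,_Kentucky> dbo:isPartOf ?container .
}
```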

So we need to rephrase our query again: for each president, we want the state that their birthplace is part of. In SPARQL:

SELECT DISTINCT ?person ?birthState ?presidentStart ?presidentEnd
WHERE {
  ?person dct:subject dbc:Presidents_of_the_United_States .
  ?person dbo:birthPlace ?birthPlace .
  # Follow the birthplace up to the state that contains it.
  ?birthPlace dbo:isPartOf ?birthState .
  ?birthState a yago:WikicatStatesOfTheUnitedStates .

  OPTIONAL { ?person dbp:presidentEnd   ?presidentEnd }
  OPTIONAL { ?person dbp:presidentStart ?presidentStart }
}
ORDER BY ?presidentStart ?person
LIMIT 100

This should get you almost exactly the result you need.

Update: as you noted, Donald Trump is missing from the list. This looks to be because DBpedia is behind the times, and he's still classified as a "presidential candidate" rather than a president.
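To check how a particular person is categorised at the moment, a query along these lines lists their dct:subject values. It is a sketch that assumes the resource is named dbr:Donald_Trump; the categories returned will change as DBpedia is refreshed.

```sparql
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dct: <http://purl.org/dc/terms/>

# List the categories currently attached to the resource.
SELECT ?category
WHERE {
  dbr:Donald_Trump dct:subject ?category .
}
ORDER BY ?category
```

If dbc:Presidents_of_the_United_States is not among the results, the first triple pattern of the presidents query cannot match that person.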

As for Grover Cleveland appearing four times, this is an interesting anomaly. Cleveland served two non-consecutive terms as president, from 1885 to 1889 and again from 1893 to 1897, so there are two start dates and two end dates. Because DBpedia does not explicitly model which start date belongs to which end date, you simply get a result for each combination of start and end dates, four in total. There may be a way to query around this (one option would be to group start and end dates together using a group_concat aggregate), but it's such an edge case that it might be simpler to just handle it in post-processing.
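The group_concat option might look roughly like the sketch below. It is untested against the live endpoint; the output variable names (?birthStates, ?starts, ?ends) and the ", " separator are illustrative choices. Each president is collapsed to a single row, with multiple start and end dates joined into one string, so the pairing of start and end dates is still not recovered.

```sparql
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX dbp:  <http://dbpedia.org/property/>
PREFIX dbc:  <http://dbpedia.org/resource/Category:>
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX yago: <http://dbpedia.org/class/yago/>

# One row per president; repeated values are concatenated
# instead of multiplying the number of rows.
SELECT ?person
       (GROUP_CONCAT(DISTINCT STR(?birthState);     SEPARATOR=", ") AS ?birthStates)
       (GROUP_CONCAT(DISTINCT STR(?presidentStart); SEPARATOR=", ") AS ?starts)
       (GROUP_CONCAT(DISTINCT STR(?presidentEnd);   SEPARATOR=", ") AS ?ends)
WHERE {
  ?person dct:subject dbc:Presidents_of_the_United_States .
  ?person dbo:birthPlace ?birthPlace .
  ?birthPlace dbo:isPartOf ?birthState .
  ?birthState a yago:WikicatStatesOfTheUnitedStates .

  OPTIONAL { ?person dbp:presidentEnd   ?presidentEnd }
  OPTIONAL { ?person dbp:presidentStart ?presidentStart }
}
GROUP BY ?person
ORDER BY ?starts ?person
LIMIT 100
```

Because every projected value other than ?person is an aggregate, GROUP BY ?person is valid here. Note that ?starts is a string, so the ordering is lexical rather than chronological.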

Focusing on

"I would like to have only the STATE where they are born"

rather than on

"How to get rid of multiple rows with DBPEDIA SPARQL"

this could be a solution:

SELECT DISTINCT ?person ?birthState ?presidentStart ?presidentEnd
WHERE {
  ?person dct:subject dbc:Presidents_of_the_United_States .

  OPTIONAL { ?person dbp:presidentEnd   ?presidentEnd }
  OPTIONAL { ?person dbp:presidentStart ?presidentStart }
  OPTIONAL { ?person dbo:birthPlace/dbp:subdivisionType/dbp:territory ?birthState }

  # str() turns the IRI into a string for regex(); || is SPARQL's logical OR.
  FILTER ( regex(str(?birthState), "_") ||
           regex(str(?birthState), ";_")
         )
}
ORDER BY ?presidentStart ?person
LIMIT 100
