是否有可能使用 SPARQL 和 RDF4J 批处理选择查询？

Question

我正在使用存储在 graphDB Free 中并在我的本地开发人员机器上运行的相当大的数据集（大约 500Mio-Triples）。

我想使用 RDF4J 对数据集进行一些操作，并且必须或多或少地对整个数据集进行 SELECT。 为了做一个测试，我只是 SELECT 所需的元组。 前百万元组的代码运行良好，之后它变得非常慢，因为 graphDB 继续分配更多的 RAM。

是否有可能对非常大的数据集进行 SELECT-Query 并批量获取它们？

基本上我只想通过一些选定的三元组“迭代”，所以应该不需要使用来自 graphDB 的那么多 RAM。 我可以看到在查询完成之前我已经在 RDF4J 中获取数据，因为它仅在大约 1.4 个 Mio 读取元组时崩溃（HeapSpaceError）。 不幸的是，graphDB 并没有释放已经读取的元组的 memory。 我错过了什么吗？

非常感谢你的帮助。

附言。 我已经将 graphDB 的可用 heapSpace 设置为 20GB。

RDF4J (Java) 代码如下所示：

package ch.test;


import org.eclipse.rdf4j.query.*;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.http.HTTPRepository;

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class RDF2RDF {

    public static void main(String[] args) {
        System.out.println("Running RDF2RDF");

        HTTPRepository sourceRepo = new HTTPRepository("http://localhost:7200/repositories/datatraining");
        try {
            String path = new File("").getAbsolutePath();
            String sparqlCommand= Files.readString(Paths.get(path + "/src/main/resources/sparql/select.sparql"), StandardCharsets.ISO_8859_1);

            int chunkSize = 10000;
            int positionInChunk = 0;
            long loadedTuples = 0;

            RepositoryConnection sourceConnection = sourceRepo.getConnection();
            TupleQuery query = sourceConnection.prepareTupleQuery(sparqlCommand);

            try (TupleQueryResult result = query.evaluate()) {
                for (BindingSet solution:result) {
                    loadedTuples++;
                    positionInChunk++;

                    if (positionInChunk >= chunkSize) {
                        System.out.println("Got " + loadedTuples + " Tuples");
                        positionInChunk = 0;
                    }
                }
            }

        } catch (IOException err) {
            err.printStackTrace();
        }
    }
}

select.sparql：

PREFIX XXX_meta_schema: <http://schema.XXX.ch/meta/>
PREFIX XXX_post_schema: <http://schema.XXX.ch/post/>
PREFIX XXX_post_tech_schema: <http://schema.XXX.ch/post/tech/>

PREFIX XXX_geo_schema: <http://schema.XXX.ch/geo/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX XXX_raw_schema: <http://schema.XXX.ch/raw/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT * WHERE {

    BIND(<http://data.XXX.ch/raw/Table/XXX.csv> as ?table).

    ?row XXX_raw_schema:isDefinedBy ?table.

    ?cellStreetAdress XXX_raw_schema:isDefinedBy ?row;
        XXX_raw_schema:ofColumn <http://data.XXX.ch/raw/Column/Objektadresse>;
        rdf:value ?valueStreetAdress.

    ?cellOrt mobi_raw_schema:isDefinedBy ?row;
        XXX_raw_schema:ofColumn <http://XXX.mobi.ch/raw/Column/Ort>;
        rdf:value ?valueOrt.

    ?cellPlz mobi_raw_schema:isDefinedBy ?row;
        XXX_raw_schema:ofColumn <http://XXX.mobi.ch/raw/Column/PLZ>;
        rdf:value ?valuePLZ.

    BIND (URI(concat("http://data.XXX.ch/post/tech/Adress/", MD5(STR(?cellStreetAdress)))) as ?iri_tech_Adress).
}

我的解决方案：使用首先获取所有“行”的子选择语句。

PREFIX mobi_post_schema: <http://schema.mobi.ch/post/>
PREFIX mobi_post_tech_schema: <http://schema.mobi.ch/post/tech/>

PREFIX mobi_geo_schema: <http://schema.mobi.ch/geo/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mobi_raw_schema: <http://schema.mobi.ch/raw/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT * WHERE {

    {
        SELECT ?row WHERE
        {
            BIND(<http://data.mobi.ch/raw/Table/Gebaeudeobjekte_August2020_ARA_Post.csv> as ?table).

            ?row mobi_raw_schema:isDefinedBy ?table.
        }
    }


    ?cellStreetAdress mobi_raw_schema:isDefinedBy ?row;
        mobi_raw_schema:ofColumn <http://data.mobi.ch/raw/Column/Objektadresse>;
        rdf:value ?valueStreetAdress.

    ?cellOrt mobi_raw_schema:isDefinedBy ?row;
        mobi_raw_schema:ofColumn <http://data.mobi.ch/raw/Column/Ort>;
        rdf:value ?valueOrt.

    ?cellPlz mobi_raw_schema:isDefinedBy ?row;
        mobi_raw_schema:ofColumn <http://data.mobi.ch/raw/Column/PLZ>;
        rdf:value ?valuePLZ.

    BIND (URI(concat("http://data.mobi.ch/post/tech/Adress/", MD5(STR(?cellStreetAdress)))) as ?iri_tech_Adress).
}

Answer 1

我不立即知道为什么给出的查询对于 GraphDB Free 执行而言在内存方面会如此昂贵，但通常很大程度上取决于数据集的形状和大小。 当然，做一个基本上检索整个数据库的查询，一开始就不一定是明智之举。

话虽如此，您可以尝试几件事。 使用LIMIT和OFFSET作为分页机制是一种方法。

您可以尝试的另一种选择是将您的查询一分为二：一个查询检索您感兴趣的资源的所有标识符，然后您遍历这些标识符，并为每个查询执行单独的查询以获取该资源的详细信息（属性和关系）特定资源。

在您的示例中，您可以拆分?row ，因此您首先执行查询以获取给定表的所有行：

SELECT ?row WHERE {
    VALUES ?table { <http://data.XXX.ch/raw/Table/XXX.csv> }
    ?row XXX_raw_schema:isDefinedBy ?table.
}

然后迭代该结果，将?row的每个返回值注入到检索详细信息的查询中：

SELECT * WHERE {
    VALUES ?row { <http://data.XXX.ch/raw/Table/XXX.csv#row1> }

    ?cellStreetAdress XXX_raw_schema:isDefinedBy ?row;
        XXX_raw_schema:ofColumn <http://data.XXX.ch/raw/Column/Objektadresse>;
        rdf:value ?valueStreetAdress.

    ?cellOrt mobi_raw_schema:isDefinedBy ?row;
        XXX_raw_schema:ofColumn <http://XXX.mobi.ch/raw/Column/Ort>;
        rdf:value ?valueOrt.

    ?cellPlz mobi_raw_schema:isDefinedBy ?row;
        XXX_raw_schema:ofColumn <http://XXX.mobi.ch/raw/Column/PLZ>;
        rdf:value ?valuePLZ.

    BIND (URI(concat("http://data.XXX.ch/post/tech/Adress/", MD5(STR(?cellStreetAdress)))) as ?iri_tech_Adress).
}

在 Java 代码中，如下所示：


String sparqlCommand1 = // the query for all rows of the table

// query for details of each row. Value of row will be injected via the API
String sparqlCommand2 = "SELECT * WHERE { \n"
                    + "    ?cellStreetAdress XXX_raw_schema:isDefinedBy ?row;\n"
                    + "        XXX_raw_schema:ofColumn <http://data.XXX.ch/raw/Column/Objektadresse>;\n"
                    + "        rdf:value ?valueStreetAdress.\n"
                    + "    ?cellOrt mobi_raw_schema:isDefinedBy ?row;\n"
                    + "        XXX_raw_schema:ofColumn <http://XXX.mobi.ch/raw/Column/Ort>;\n"
                    + "        rdf:value ?valueOrt.\n"
                    + "    ?cellPlz mobi_raw_schema:isDefinedBy ?row;\n"
                    + "        XXX_raw_schema:ofColumn <http://XXX.mobi.ch/raw/Column/PLZ>;\n"
                    + "        rdf:value ?valuePLZ.\n"
                    + "    BIND (URI(concat(\"http://data.XXX.ch/post/tech/Adress/\", MD5(STR(?cellStreetAdress)))) as ?iri_tech_Adress).\n"
                    + "}";

try(RepositoryConnection sourceConnection = sourceRepo.getConnection()) {
     TupleQuery rowQuery = sourceConnection.prepareTupleQuery(sparqlCommand1);     
     TupleQuery detailsQuery = sourceConnection.prepareTupleQuery(sparqlCommand2);

     try (TupleQueryResult result = rowQuery.evaluate()) {
         for (BindingSet solution: result) {
                // inject the current row identifier
                detailsQuery.setBinding("row", solution.getValue("row"));

                // execute the details query for the row and do something with 
                // the result
                detailsQuery.evaluate().forEach(System.out::println);
         }
     }
}

当然，您正在以这种方式进行更多查询（N+1，其中 N 是行数），但每个单独的查询结果只是一小块，对于 GraphDB Free（以及您自己的应用程序）来说可能更容易管理.

是否有可能使用 SPARQL 和 RDF4J 批处理选择查询？

问题描述

1 个解决方案

解决方案1
1 已采纳 2021-01-06 21:51:08

是否有可能使用 SPARQL 和 RDF4J 批处理选择查询？

问题描述

1 个解决方案

解决方案1 1 已采纳 2021-01-06 21:51:08

解决方案1
1 已采纳 2021-01-06 21:51:08