DBpedia resource parsing in Java

By using DBpedia Spotlight, I get DBpedia URIs. For example:

http://dbpedia.org/resource/Part-of-speech_tagging

I need to request this URI in Java so that it returns some JSON/XML from which I can extract the necessary information.

For example, for the URI above, I need the value of dct:subject.

Below is a screenshot of the response I get in the browser.


I'm not exactly sure which values you are looking for, but you should be able to scrape what you want from the page source without any dependencies. The four Java methods supplied below should get you what you need (one of them is a support method).

Getting the Web Page HTML Source:

First we acquire the web page's HTML source by using the getWebPageSource() method. This method gets the entire HTML source of the web page located at the supplied link string. The source is returned as a List<String>, one element per line. Example usage would be:

String sourceLinkString = "http://dbpedia.org/resource/Part-of-speech_tagging";
List<String> pageSource = getWebPageSource(sourceLinkString);

When this code is run, the pageSource List variable will contain all the HTML source code for the web link string you provided, which in this case is "http://dbpedia.org/resource/Part-of-speech_tagging". If you like, you can loop through the list and display it in your console window with System.out.println(), like this:

for (int i = 0; i < pageSource.size(); i++) {
    System.out.println(pageSource.get(i));
}

Getting Related Links Using A Reference String:

Now that you have the web page source, you can locate and grab the data you want. The next method is getRelatedLinks(). It scans the page source for lines that contain a supplied reference string and, from each such line, retrieves the text found between a supplied start tag string and end tag string. In your case the reference string would be "rel=\"dct:subject\"", the start tag string would be "href=\"" and the end tag string would be "\">". So every page source line that contains "rel=\"dct:subject\"" is examined, and if the start tag string ( "href=\"" ) and the end tag string ( "\">" ) are found on the same source line, the text between them is retrieved. Example usage would be:

String sourceLinkString = "http://dbpedia.org/resource/Part-of-speech_tagging";
List<String> pageSource = getWebPageSource(sourceLinkString);
String[] relatedLinksTo = getRelatedLinks("rel=\"dct:subject\"", pageSource, "href=\"", "\">");

All links related to the reference string "rel=\"dct:subject\"" will now be held in the String array variable named relatedLinksTo. If you were to iterate through the array and display its contents in the console window:

// Display Related Links...
for (int i = 0; i < relatedLinksTo.length; i++) {
    System.out.println(relatedLinksTo[i]);
}

you will see:

http://dbpedia.org/resource/Category:Corpus_linguistics
http://dbpedia.org/resource/Category:Markov_models
http://dbpedia.org/resource/Category:Tasks_of_natural_language_processing
http://dbpedia.org/resource/Category:Word-sense_disambiguation

And if you just want the title that each link relates to instead of the entire link string, you would do it this way:

// Display Related Links Titles...
for (int i = 0; i < relatedLinksTo.length; i++) {
    String rLink = relatedLinksTo[i].substring(relatedLinksTo[i].lastIndexOf(":") + 1);
    System.out.println(rLink);
}

and what you will see in the console window is:

Corpus_linguistics
Markov_models
Tasks_of_natural_language_processing
Word-sense_disambiguation

The getRelatedLinks() method utilizes the support method named getBetween(), also supplied below.
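As a rough illustration of the between-the-tags extraction that getBetween() performs for this case, the sketch below applies a non-greedy regular expression to a single source line. The class name BetweenDemo, the helper between(), and the sample line are made up for illustration; the real DBpedia markup may be laid out differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BetweenDemo {

    // Collects every substring found between 'left' and 'right' in 'line'.
    static List<String> between(String line, String left, String right) {
        List<String> found = new ArrayList<>();
        // Pattern.quote() treats the delimiters as literal text; (.*?) is
        // non-greedy, so each match stops at the first occurrence of 'right'.
        Pattern p = Pattern.compile(Pattern.quote(left) + "(.*?)" + Pattern.quote(right));
        Matcher m = p.matcher(line);
        while (m.find()) {
            found.add(m.group(1).trim());
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical source line, shaped like the ones described above.
        String line = "<a rel=\"dct:subject\" "
                + "href=\"http://dbpedia.org/resource/Category:Markov_models\">Markov models</a>";
        // Only lines containing the reference string are examined.
        if (line.contains("rel=\"dct:subject\"")) {
            System.out.println(between(line, "href=\"", "\">"));
        }
    }
}
```

Because the match is line based, a link is only found when the reference string and both tag strings sit on the same source line, which is the same restriction the methods below have.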

Getting A Specific Link From A Related Link List:

You may not want the entire related link list but just the link or links for a specific title, such as Tasks_of_natural_language_processing. To get them, use the getFromRelatedLinksThatContain() method. Here is how you would achieve this:

String sourceLinkString = "http://dbpedia.org/resource/Part-of-speech_tagging";
List<String> pageSource = getWebPageSource(sourceLinkString);
String[] relatedLinksTo = getRelatedLinks("rel=\"dct:subject\"", pageSource, "href=\"", "\">");
String[] desiredLinks = getFromRelatedLinksThatContain(relatedLinksTo, "Tasks_of_natural_language_processing");

This method requires you to pass what was returned from the getRelatedLinks() method along with the title you want the link for ( Tasks_of_natural_language_processing ). The title must be text that actually appears in a link, and the match is case sensitive. If you were to now iterate through the desiredLinks array:

for (int i = 0; i < desiredLinks.length; i++) {
    System.out.println(desiredLinks[i]);
}

you will see the following link string displayed in the console window:

http://dbpedia.org/resource/Category:Tasks_of_natural_language_processing

The TESTED Methods:

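// Imports required by the methods below.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;

/**
 * Returns a List (ArrayList) containing the page source for the supplied web
 * page link, one element per source line.<br><br>
 *
 * @param webLink (String) The URL address of the web page to process.<br>
 *
 * @return (List of String) The page source for the supplied web page link,
 *         or null if the link is empty or the page could not be read.
 */
public List<String> getWebPageSource(String webLink) {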
    if (webLink.equals("")) {
        return null;
    }
    try {
        URL url = new URL(webLink);

        URLConnection yc = url.openConnection();
        // If the URL is an SSL/TLS endpoint (https), send a browser-like
        // User-Agent with the request...
        if (webLink.startsWith("https:")) {
            yc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
            yc.connect();
        }
        // ...and if it is not an SSL endpoint (plain http), the default
        // request headers are used.

        InputStream inputStream = yc.getInputStream();
        // getContentEncoding() reports the Content-Encoding header (for
        // example "gzip") and returns null when the header is absent.
        String encoding = yc.getContentEncoding();
        encoding = (encoding == null) ? "" : encoding.toLowerCase();
        InputStreamReader streamReader;
        switch (encoding) {
            case "gzip":
                // Compressed with GZip: wrap the stream before reading.
                inputStream = new GZIPInputStream(inputStream);
                streamReader = new InputStreamReader(inputStream, "UTF-8");
                break;
            case "utf-16":
                streamReader = new InputStreamReader(inputStream, "UTF-16");
                break;
            default:
                // Missing or unrecognized encoding: fall back to UTF-8 so
                // the reader is never left null.
                streamReader = new InputStreamReader(inputStream, "UTF-8");
                break;
        }

        List<String> sourceText;
        try (BufferedReader in = new BufferedReader(streamReader)) {
            String inputLine;
            sourceText = new ArrayList<>();
            while ((inputLine = in.readLine()) != null) {
                sourceText.add(inputLine);
            }
        }
        return sourceText;
    }
    catch (MalformedURLException ex) {
        // Do whatever you want with exception.
        ex.printStackTrace();
    }
    catch (IOException ex) {
        // Do whatever you want with exception.
        ex.printStackTrace();
    }
    return null;
}

/**
 * This method will retrieve all links which are contained between specifically 
 * supplied String Tags where the desired Links may reside between and are related 
 * to the supplied <b>Reference String</b>. A String Start Tag and a String End Tag 
 * would be required as well.<br><br>
 * 
 * So, if any Web Page Source line that contains the Reference String of:<pre>
 * 
 *     "rel=\"dct:subject\""</pre><br>
 * 
 * is looked at and if <i>on the same source line</i> the supplied Start Tag 
 * String (ie: "href=\"") and the supplied End Tag String (ie: "\">") are found then 
 * the text between those tags is retrieved.<br><br>
 * 
 * This method utilizes the support method named <b>getBetween()</b>.<br><br>
 * 
 * @param referenceString (String) The reference string to look for on any web 
 * page source line.<br>
 * 
 * @param pageSource (List Interface of String) The List which contains all the 
 * HTML Web Page Source.<br>
 * 
 * @param desiredLinkStartTag (String) The Start Tag String where the desired 
 * Link or links may reside after. This can be any string. Links are retrieved 
 * from between the Start Tag and the End Tag.<br>
 * 
 * @param desiredLinkEndTag (String) The End Tag String where the desired 
 * Link or links may reside before. This can be any string. Links are retrieved 
 * from between the Start Tag and the End Tag.<br>
 * 
 * @return (1D String Array) A String Array containing the Links Found.<br>
 * 
 * @see #getBetween(java.lang.String, java.lang.String, java.lang.String, boolean...) getBetween()
 */
public String[] getRelatedLinks(String referenceString, List<String> pageSource, 
        String desiredLinkStartTag, String desiredLinkEndTag) {
    List<String> links = new ArrayList<>();
    // getWebPageSource() returns null on failure; treat that as "no links".
    if (pageSource == null) {
        return new String[0];
    }
    for (int i = 0; i < pageSource.size(); i++) {
        if (pageSource.get(i).contains(referenceString)) {
            String[] lnks = getBetween(pageSource.get(i), desiredLinkStartTag, desiredLinkEndTag);
            // getBetween() returns null when nothing was found on the line.
            if (lnks != null) {
                links.addAll(Arrays.asList(lnks));
            }
        }
    }
    return links.toArray(new String[0]);
}

/**
 * Retrieves a specific Link from within the Related Links List generated by 
 * the <b>getRelatedLinks()</b> method.<br><br>
 * 
 * This method requires the use of the <b>getRelatedLinks()</b> method.
 * 
 * @param relatedArray (1D String Array) The array returned from the <b>getRelatedLinks()</b> 
 * method.<br>
 * 
 * @param desiredStringInLink (String - Letter Case Sensitive) The string title 
 * contained within the link to retrieve.<br>
 * 
 * @return (1D String Array) Containing any links found.<br>
 * 
 * @see #getRelatedLinks(java.lang.String, java.util.List, java.lang.String, java.lang.String) getRelatedLinks()
 * 
 */
public String[] getFromRelatedLinksThatContain(String[] relatedArray, String desiredStringInLink) {
    List<String> desiredLinks = new ArrayList<>();
    for (int i = 0; i < relatedArray.length; i++) {
        if (relatedArray[i].contains(desiredStringInLink)) {
            desiredLinks.add(relatedArray[i]);
        }
    }
    return desiredLinks.toArray(new String[0]);
}

/**
 * Retrieves any string data located between the supplied string leftString
 * parameter and the supplied string rightString parameter.<br><br>
 *
 * This method will return all instances of a substring located between the
 * supplied Left String and the supplied Right String which may be found
 * within the supplied Input String.<br>
 *
 * @param inputString (String) The string to look for substring(s) in.
 *
 * @param leftString  (String) What may be to the Left side of the substring
 *                    we want within the main input string. Sometimes the
 *                    substring you want may be contained at the very
 *                    beginning of a string and therefore there is no
 *                    Left-String available. In this case you would simply
 *                    pass an empty string ("") to this parameter, which
 *                    informs the method of this fact. null must not be
 *                    supplied; it will generate a NullPointerException.
 *
 * @param rightString (String) What may be to the Right side of the
 *                    substring we want within the main input string.
 *                    Sometimes the substring you want may be contained at
 *                    the very end of a string and therefore there is no
 *                    Right-String available. In this case you would simply
 *                    pass an empty string ("") to this parameter, which
 *                    informs the method of this fact. null must not be
 *                    supplied; it will generate a NullPointerException.
 *
 * @param options     (Optional - Boolean - 2 Parameters):<pre>
 *
 *      ignoreLetterCase    - Default is false. This option works against the
 *                            string supplied within the leftString parameter
 *                            and the string supplied within the rightString
 *                            parameter. If set to true then letter case is
 *                            ignored when searching for strings supplied in
 *                            these two parameters. If left at default false
 *                            then letter case is not ignored.
 *
 *      trimFound           - Default is true. By default this method will trim
 *                            off leading and trailing white-spaces from found
 *                            sub-string items. General sentences which obviously
 *                            contain spaces will almost always give you a white-
 *                            space within an extracted sub-string. By setting
 *                            this parameter to false, leading and trailing white-
 *                            spaces are not trimmed off before they are placed
 *                            into the returned Array.</pre>
 *
 * @return (1D String Array) Returns a Single Dimensional String Array
 *         containing all the sub-strings found within the supplied Input
 *         String which are between the supplied Left String and supplied
 *         Right String. You can shorten this method up a little by
 *         returning a List&lt;String&gt; ArrayList and removing the 'List
 *         to 1D Array' conversion code at the end of this method. This
 *         method initially stores its findings within a List object
 *         anyways.
 */
public static String[] getBetween(String inputString, String leftString, String rightString, boolean... options) {
    // Return nothing if nothing was supplied.
    if (inputString.equals("") || (leftString.equals("") && rightString.equals(""))) {
        return null;
    }

    // Prepare optional parameters if any supplied.
    // If none supplied then use Defaults...
    boolean ignoreCase = false; // Default.
    boolean trimFound = true;   // Default.
    if (options.length > 0) {
        if (options.length >= 1) {
            ignoreCase = options[0];
        }
        if (options.length >= 2) {
            trimFound = options[1];
        }
    }

    // Remove any ASCII control characters from the
    // supplied string (if they exist).
    String modString = inputString.replaceAll("\\p{Cntrl}", "");

    // Establish a List String Array Object to hold
    // our found substrings between the supplied Left
    // String and supplied Right String.
    List<String> list = new ArrayList<>();

    // Use Pattern Matching to locate our possible
    // substrings within the supplied Input String.
    String regEx = Pattern.quote(leftString)
            + (!rightString.equals("") ? "(.*?)" : "(.*)?")
            + Pattern.quote(rightString);
    if (ignoreCase) {
        regEx = "(?i)" + regEx;
    }
    Pattern pattern = Pattern.compile(regEx);
    Matcher matcher = pattern.matcher(modString);
    while (matcher.find()) {
        // Add the found substrings into the List.
        String found = matcher.group(1);
        if (trimFound) {
            found = found.trim();
        }
        list.add(found);
    }

    String[] res;
    // Convert the ArrayList to a 1D String Array.
    // If the List contains something then convert
    if (list.size() > 0) {
        res = new String[list.size()];
        res = list.toArray(res);
    } // Otherwise return Null.
    else {
        res = null;
    }
    // Return the String Array.
    return res;
}

Or ... use SPARQL, or request a structured format such as JSON and parse that with a suitable parser.

There isn't enough info in your question about what you're trying to achieve to suggest the best path to that goal. You might consider using the Jena or RDF4J/Sesame frameworks.

Or you might consider just asking the DBpedia endpoint for the thing you want, whether that's the complete description of <http://dbpedia.org/resource/Part-of-speech_tagging>, for example in JSON (as linked from the Formats menu seen in your screencap), or using a SPARQL query URI to request just the dct:subject values --

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dct: <http://purl.org/dc/terms/>
SELECT DISTINCT ?subject
  WHERE { dbr:Part-of-speech_tagging dct:subject ?subject }
LIMIT 100

-- which might be retrieved in various serializations, for example JSON.
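A minimal sketch of sending such a query from plain Java is shown below. It assumes the public endpoint lives at https://dbpedia.org/sparql and that it honours an Accept header of application/sparql-results+json; the class name SparqlSubjectQuery and its helper methods are hypothetical, and error handling is kept to a minimum.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SparqlSubjectQuery {

    // Assumed public endpoint; adjust if you use a mirror or a local store.
    static final String ENDPOINT = "https://dbpedia.org/sparql";

    // Builds a SELECT query for the dct:subject values of one resource URI.
    static String buildQuery(String resourceUri) {
        return "PREFIX dct: <http://purl.org/dc/terms/>\n"
             + "SELECT DISTINCT ?subject\n"
             + "WHERE { <" + resourceUri + "> dct:subject ?subject }\n"
             + "LIMIT 100";
    }

    // Puts the URL-encoded query into the 'query' parameter of a GET URL.
    static String buildRequestUrl(String endpoint, String query) {
        try {
            return endpoint + "?query=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available, so this should not happen.
            throw new IllegalStateException(ex);
        }
    }

    // Sends the request and returns the raw response body as one string.
    static String fetch(String requestUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(requestUrl).openConnection();
        // Ask for the SPARQL JSON results serialization.
        conn.setRequestProperty("Accept", "application/sparql-results+json");
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        String query = buildQuery("http://dbpedia.org/resource/Part-of-speech_tagging");
        System.out.println(fetch(buildRequestUrl(ENDPOINT, query)));
    }
}
```

In the SPARQL JSON results format each row appears under results.bindings, so the subject URIs would be read from the value field of each ?subject binding with whatever JSON parser you prefer.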
