
Using boilerpipe to extract non-English articles

I am trying to use the boilerpipe Java library to extract news articles from a set of websites. It works well for English text, but special characters, such as accented letters (história), are not extracted correctly. I think it is an encoding problem.

The boilerpipe FAQ says "If you extract non-English text you might need to change some parameters" and then refers to a paper. I found no solution in that paper.

My question: are there any parameters in boilerpipe that let me specify the encoding? Is there any workaround to get the text extracted correctly?

How I'm using the library (first attempt, based on the URL):

URL url = new URL(link);
String article = ArticleExtractor.INSTANCE.getText(url);

(second attempt, based on the HTML source code):

String article = ArticleExtractor.INSTANCE.getText(html_page_as_string);

You don't have to modify Boilerpipe's internal classes.

Just pass an InputSource object to the ArticleExtractor.INSTANCE.getText() method and force the encoding on that object. For example:

URL url = new URL("http://some-page-with-utf8-encodeing.tld");

InputSource is = new InputSource();
is.setEncoding("UTF-8");
is.setByteStream(url.openStream());

String text = ArticleExtractor.INSTANCE.getText(is);
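
To see why forcing the encoding matters, here is a minimal, self-contained sketch that does not use Boilerpipe at all. The class and method names (MojibakeDemo, roundTrip) are made up for illustration. It encodes the word from the question as UTF-8 and then decodes the bytes once with the right charset and once with windows-1252, which is the kind of mismatch that garbles accented letters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Boilerpipe.
final class MojibakeDemo {

    /** Encodes text with one charset and decodes the resulting bytes with another. */
    static String roundTrip(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        String word = "hist\u00f3ria"; // "história", written with an escape to avoid source-encoding issues

        // Same charset on both sides: the text survives unchanged.
        System.out.println(roundTrip(word, StandardCharsets.UTF_8, StandardCharsets.UTF_8));

        // UTF-8 bytes read as windows-1252: the two bytes of "ó" become "Ã³".
        System.out.println(roundTrip(word, StandardCharsets.UTF_8, Charset.forName("windows-1252")));
    }
}
```

If the console itself is not set to UTF-8, even the first line may look wrong when printed, so compare the strings in code rather than by eye when in doubt.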

Regards!

Boilerpipe's ArticleExtractor uses some algorithms that have been specifically tailored to English, such as measuring the number of words in average phrases. In any language that is more or less verbose than English (i.e., every other language) these algorithms will be less accurate.

Additionally, the library uses some English phrases to try to find the end of the article ("comments", "post a comment", "have your say", etc.), which will clearly not work in other languages.

This is not to say that the library will outright fail - just be aware that some modification is likely needed for good results in non-English languages.

Java:

import java.net.URL;

import org.xml.sax.InputSource;

import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class Boilerpipe {

    public static void main(String[] args) {
        try{
            URL url = new URL("http://www.azeri.ru/az/traditions/kuraj_pehlevanov/");

            InputSource is = new InputSource();
            is.setEncoding("UTF-8");
            is.setByteStream(url.openStream());

            String text = ArticleExtractor.INSTANCE.getText(is);
            System.out.println(text);
        }catch(Exception e){
            e.printStackTrace();
        }
    }

}

Eclipse: Run > Run Configurations > Common Tab. Set Encoding to Other(UTF-8), then click Run.


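Well, from what I see, when you use it like that, the library automatically chooses which encoding to use: it defaults to Cp1252 and switches only if the Content-Type header names a supported charset. From the HTMLFetcher source: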

public static HTMLDocument fetch(final URL url) throws IOException {
    final URLConnection conn = url.openConnection();
    final String ct = conn.getContentType();

    Charset cs = Charset.forName("Cp1252");
    if (ct != null) {
        Matcher m = PAT_CHARSET.matcher(ct);
        if(m.find()) {
            final String charset = m.group(1);
            try {
                cs = Charset.forName(charset);
            } catch (UnsupportedCharsetException e) {
                // keep default
            }
        }
    }

Try debugging their code a bit, starting with ArticleExtractor.getText(URL), and see if you can override the encoding.
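
The fallback logic in the excerpt above can be tried in isolation. The following is only a sketch: the class name ContentTypeCharset, the method resolve, and the regular expression are assumptions made for illustration, since the library's own PAT_CHARSET is not shown in the excerpt. It returns the charset named in a Content-Type value, or a caller-supplied default when the header is missing, has no charset parameter, or names a charset the JVM does not know.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Boilerpipe.
final class ContentTypeCharset {

    // Assumed pattern; Boilerpipe's real PAT_CHARSET may differ.
    private static final Pattern CHARSET =
            Pattern.compile("charset=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in the Content-Type value, or the fallback. */
    static Charset resolve(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback; // server sent no Content-Type header
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback; // header present but without a charset parameter
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return fallback; // malformed or unknown charset name: keep the default
        }
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("Cp1252");
        System.out.println(resolve("text/html; charset=UTF-8", cp1252));
        System.out.println(resolve("text/html", cp1252));
    }
}
```

A page that is served as UTF-8 but whose server sends only "text/html" would fall through to the default here, which matches the symptom described in the question.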

OK, I got a solution. As Andrei said, I had to change the class HTMLFetcher, which is in the package de.l3s.boilerpipe.sax. What I did was convert all the fetched text to UTF-8. At the end of the fetch function, I had to add two lines and change the last one:

final byte[] data = bos.toByteArray(); // stays the same
byte[] utf8 = new String(data, cs).getBytes("UTF-8"); // new line: decode with the detected charset, re-encode as UTF-8
cs = Charset.forName("UTF-8"); // new line: set the charset to UTF-8
return new HTMLDocument(utf8, cs); // edited line
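
The conversion step can be checked on its own, outside the library. This is a small sketch with made-up names (Transcode, toUtf8), assuming the source charset is already known. It decodes the bytes with that charset and re-encodes them as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Boilerpipe.
final class Transcode {

    /** Decodes data using the source charset and re-encodes the text as UTF-8. */
    static byte[] toUtf8(byte[] data, Charset source) {
        return new String(data, source).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "ó" is the single byte 0xF3 in ISO-8859-1 and the two bytes 0xC3 0xB3 in UTF-8.
        byte[] latin1 = {(byte) 0xF3};
        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.toString(utf8));
    }
}
```

Note that this only helps when the source charset was detected correctly. If the bytes are decoded with the wrong charset first, re-encoding them as UTF-8 preserves the damage.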

I had the same problem; cnr's solution works great. Just change the encoding from UTF-8 to ISO-8859-1. Thanks!

URL url = new URL("http://some-page-with-utf8-encodeing.tld");
InputSource is = new InputSource();
is.setEncoding("ISO-8859-1");
is.setByteStream(url.openStream());

String text = ArticleExtractor.INSTANCE.getText(is);
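
Since the right value differs from site to site (UTF-8 for some, ISO-8859-1 for others), the InputSource setup can be pulled into a small helper that takes the charset as a parameter. This is a sketch only; EncodedSource and its method of are hypothetical names, and the helper uses just the standard org.xml.sax.InputSource API.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import org.xml.sax.InputSource;

// Hypothetical helper, not part of Boilerpipe.
final class EncodedSource {

    /** Wraps a byte stream in an InputSource with the given encoding forced. */
    static InputSource of(InputStream bytes, Charset charset) {
        InputSource source = new InputSource(bytes);
        source.setEncoding(charset.name());
        return source;
    }

    public static void main(String[] args) {
        byte[] page = "<html><body>hist\u00f3ria</body></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        InputSource source = of(new ByteArrayInputStream(page), StandardCharsets.ISO_8859_1);
        System.out.println(source.getEncoding());
        // With Boilerpipe on the classpath, the result would be passed on as in the answers above:
        // String text = ArticleExtractor.INSTANCE.getText(source);
    }
}
```

When reading from url.openStream(), wrapping the stream in a try-with-resources block ensures it is closed after extraction.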

