I am trying to use boilerpipe java library, to extract news articles from a set of websites. It works great for texts in english, but for text with special characters, for example, words with accent marks (história), this special characters are not extracted correctly. I think it is an encoding problem.
In the boilerpipe faq, it says "If you extract non-English text you might need to change some parameters" and then refers to a paper . I found no solution in this paper.
My question is, are there any params when using boilerpipe where i can specify the encoding? Is there any way to go around and get the text correctly?
How i'm using the library: (first attempt based on the URL):
URL url = new URL(link);
String article = ArticleExtractor.INSTANCE.getText(url);
(second on the HTLM source code)
String article = ArticleExtractor.INSTANCE.getText(html_page_as_string);
You don't have to modify inner Boilerpipe
classes.
Just pass InputSource
object to the ArticleExtractor.INSTANCE.getText()
method and force encoding on that object. For example:
URL url = new URL("http://some-page-with-utf8-encodeing.tld");
InputSource is = new InputSource();
is.setEncoding("UTF-8");
is.setByteStream(url.openStream());
String text = ArticleExtractor.INSTANCE.getText(is);
Regards!
Boilerpipe's ArticleExtractor uses some algorithms that have been specifically tailored to English - measuring number of words in average phrases, etc. In any language that is more or less verbose than English (ie: every other language) these algorithms will be less accurate.
Additionally, the library uses some English phrases to try and find the end of the article (comments, post a comment, have your say, etc) which will clearly not work in other languages.
This is not to say that the library will outright fail - just be aware that some modification is likely needed for good results in non-English languages.
Java:
import java.net.URL;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
public class Boilerpipe {
public static void main(String[] args) {
try{
URL url = new URL("http://www.azeri.ru/az/traditions/kuraj_pehlevanov/");
InputSource is = new InputSource();
is.setEncoding("UTF-8");
is.setByteStream(url.openStream());
String text = ArticleExtractor.INSTANCE.getText(is);
System.out.println(text);
}catch(Exception e){
e.printStackTrace();
}
}
}
Eclipse: Run > Run Configurations > Common Tab. Set Encoding to Other(UTF-8), then click Run.
Well, from what I see, when you use it like that, the library will auto-chose what encoding to use. From the HTMLFetcher source:
public static HTMLDocument fetch(final URL url) throws IOException {
final URLConnection conn = url.openConnection();
final String ct = conn.getContentType();
Charset cs = Charset.forName("Cp1252");
if (ct != null) {
Matcher m = PAT_CHARSET.matcher(ct);
if(m.find()) {
final String charset = m.group(1);
try {
cs = Charset.forName(charset);
} catch (UnsupportedCharsetException e) {
// keep default
}
}
}
Try debugging their code a bit, starting with ArticleExtractor.getText(URL)
, and see if you can override the encoding
Ok, got a solution. As Andrei said, i had to change the class HTMLFecther, which is in the package de.l3s.boilerpipe.sax What i did was to convert all the text that was fetched, to UTF-8. At the end of the fetch function, i had to add two lines, and change the last one:
final byte[] data = bos.toByteArray(); //stays the same
byte[] utf8 = new String(data, cs.displayName()).getBytes("UTF-8"); //new one (convertion)
cs = Charset.forName("UTF-8"); //set the charset to UFT-8
return new HTMLDocument(utf8, cs); // edited line
I had the some problem; the cnr solution works great. Just change UTF-8 encoding to ISO-8859-1. Thank's
URL url = new URL("http://some-page-with-utf8-encodeing.tld");
InputSource is = new InputSource();
is.setEncoding("ISO-8859-1");
is.setByteStream(url.openStream());
String text = ArticleExtractor.INSTANCE.getText(is);
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.