

Using boilerpipe to extract non-English articles

I am trying to use the boilerpipe Java library to extract news articles from a set of websites. It works great for English text, but for text with special characters, for example words with accent marks (história), those characters are not extracted correctly. I think it is an encoding problem.

The boilerpipe FAQ says "If you extract non-English text you might need to change some parameters" and then refers to a paper. I found no solution in that paper.

My question is: are there any parameters I can set when using boilerpipe to specify the encoding? Is there any other way to get the text extracted correctly?

How I'm using the library (first attempt, based on the URL):

URL url = new URL(link);
String article = ArticleExtractor.INSTANCE.getText(url);

(second attempt, based on the HTML source code)

String article = ArticleExtractor.INSTANCE.getText(html_page_as_string);

You don't have to modify boilerpipe's internal classes.

Just pass an InputSource object to the ArticleExtractor.INSTANCE.getText() method and force the encoding on that object. For example:

URL url = new URL("http://some-page-with-utf8-encodeing.tld");

InputSource is = new InputSource();
is.setEncoding("UTF-8");
is.setByteStream(url.openStream());

String text = ArticleExtractor.INSTANCE.getText(is);
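To see why forcing the encoding matters, here is a small, self-contained sketch (the class name EncodingDemo is made up for illustration): decoding UTF-8 bytes with boilerpipe's Cp1252 fallback produces exactly the kind of garbled accents described in the question, while forcing UTF-8 keeps them intact.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        // "história" encoded as UTF-8 bytes, as a server would send them
        byte[] utf8Bytes = "história".getBytes(StandardCharsets.UTF_8);

        // Decoded with boilerpipe's Cp1252 fallback: mojibake
        String wrong = new String(utf8Bytes, Charset.forName("windows-1252"));
        // Decoded with the charset forced on the InputSource: correct
        String right = new String(utf8Bytes, StandardCharsets.UTF_8);

        System.out.println(wrong); // histÃ³ria
        System.out.println(right); // história
    }
}
```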

Regards!

Boilerpipe's ArticleExtractor uses some algorithms that have been specifically tailored to English, such as measuring the average number of words per phrase. In any language that is more or less verbose than English (i.e. every other language) these algorithms will be less accurate.

Additionally, the library uses some English phrases to try to find the end of the article ("comments", "post a comment", "have your say", etc.), which will clearly not work in other languages.

This is not to say that the library will outright fail; just be aware that some modification is likely needed for good results in non-English languages.

Java:

import java.net.URL;

import org.xml.sax.InputSource;

import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class Boilerpipe {

    public static void main(String[] args) {
        try{
            URL url = new URL("http://www.azeri.ru/az/traditions/kuraj_pehlevanov/");

            InputSource is = new InputSource();
            is.setEncoding("UTF-8");
            is.setByteStream(url.openStream());

            String text = ArticleExtractor.INSTANCE.getText(is);
            System.out.println(text);
        }catch(Exception e){
            e.printStackTrace();
        }
    }

}

Eclipse: Run > Run Configurations > Common tab. Set Encoding to Other (UTF-8), then click Run.


Well, from what I can see, when you use it like that the library will auto-choose which encoding to use. From the HTMLFetcher source:

public static HTMLDocument fetch(final URL url) throws IOException {
    final URLConnection conn = url.openConnection();
    final String ct = conn.getContentType();

    Charset cs = Charset.forName("Cp1252");
    if (ct != null) {
        Matcher m = PAT_CHARSET.matcher(ct);
        if(m.find()) {
            final String charset = m.group(1);
            try {
                cs = Charset.forName(charset);
            } catch (UnsupportedCharsetException e) {
                // keep default
            }
        }
    }
    // ... (rest of fetch() omitted)

Try debugging their code a bit, starting with ArticleExtractor.getText(URL), and see if you can override the encoding.
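The fallback logic quoted above can be reproduced in a self-contained sketch. Note that both the class name and the PAT_CHARSET regex here are my own approximations for illustration, not boilerpipe's actual code; only the Cp1252 default mirrors the snippet:

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetFromContentType {

    // Hypothetical approximation of boilerpipe's PAT_CHARSET;
    // the library's actual regex may differ.
    static final Pattern PAT_CHARSET =
            Pattern.compile("charset=\\s*[\"']?([^;\\s\"']+)", Pattern.CASE_INSENSITIVE);

    static Charset charsetOf(String contentType) {
        Charset cs = Charset.forName("Cp1252"); // same fallback as HTMLFetcher
        if (contentType != null) {
            Matcher m = PAT_CHARSET.matcher(contentType);
            if (m.find()) {
                try {
                    cs = Charset.forName(m.group(1));
                } catch (Exception e) {
                    // unsupported or malformed charset name: keep the default
                }
            }
        }
        return cs;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8")); // UTF-8
        System.out.println(charsetOf("text/html"));                // windows-1252
    }
}
```

The second call shows the root of the question's problem: when the server sends no charset in the Content-Type header, the fetcher silently falls back to Cp1252, which mangles UTF-8 accents.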

OK, I got a solution. As Andrei said, I had to change the class HTMLFetcher, which is in the package de.l3s.boilerpipe.sax. What I did was convert all the fetched text to UTF-8. At the end of the fetch function, I had to add two lines and change the last one:

final byte[] data = bos.toByteArray(); // stays the same
byte[] utf8 = new String(data, cs.displayName()).getBytes("UTF-8"); // new line (conversion)
cs = Charset.forName("UTF-8"); // set the charset to UTF-8
return new HTMLDocument(utf8, cs); // edited line
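The conversion in that patch can be checked in isolation. This sketch (the class name ReencodeDemo is made up) assumes the page was served as ISO-8859-1 and shows that decoding with the detected charset and re-encoding as UTF-8 preserves the accented characters:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReencodeDemo {
    public static void main(String[] args) throws Exception {
        // Assume the fetcher detected ISO-8859-1 for this page
        Charset cs = StandardCharsets.ISO_8859_1;
        byte[] data = "história".getBytes(cs); // the raw fetched bytes

        // The same conversion as the patched fetch(): decode with the
        // detected charset, then re-encode the characters as UTF-8
        byte[] utf8 = new String(data, cs.displayName()).getBytes("UTF-8");

        String text = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(text); // história
    }
}
```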

I had the same problem; the cnr solution works great. Just change the UTF-8 encoding to ISO-8859-1. Thanks!

URL url = new URL("http://some-page-with-utf8-encodeing.tld");
InputSource is = new InputSource();
is.setEncoding("ISO-8859-1");
is.setByteStream(url.openStream());

String text = ArticleExtractor.INSTANCE.getText(is);
