
Extract article's headline from HTML (using Boilerpipe)

Original 2016-10-21 08:27:09 3 1 java/ html/ html-content-extraction/ boilerpipe

Boilerpipe lets you extract just the article's text from a webpage, cleaning up all the HTML mess. However, how can I extract the article's headline? One way is to use the page's title, but it is sometimes incorrect and contains unneeded words (e.g. "title - sitename").
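To illustrate the "title - sitename" problem, here is a minimal sketch of one possible cleanup heuristic. It assumes the page's `<title>` text has already been obtained as a plain string, and that the headline is the longest segment once the title is split on common separators. `TitleCleaner` and `clean` are hypothetical names, not part of Boilerpipe, and the heuristic fails whenever the site name is longer than the headline.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Boilerpipe.
final class TitleCleaner {
    // A hyphen or pipe with whitespace on both sides, so that
    // hyphenated words such as "self-driving" are left intact.
    private static final Pattern SEPARATOR = Pattern.compile("\\s+[-|]\\s+");

    // Returns the longest segment of the title, assuming (heuristically)
    // that the headline is longer than the site name.
    static String clean(String pageTitle) {
        if (pageTitle == null) {
            return "";
        }
        String best = "";
        for (String part : SEPARATOR.split(pageTitle.trim())) {
            String candidate = part.trim();
            if (candidate.length() > best.length()) {
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(clean("Boilerpipe extracts article text - Example News"));
    }
}
```

Picking the longest segment rather than always the first one also covers sites that put their name before the headline, at the cost of being wrong for very short headlines.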

Another idea is to take the text between <h1> and </h1>, but I thought I would ask whether there are other solutions.
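The `<h1>` idea could be sketched roughly as follows, using only the Java standard library. This is an illustration under assumptions: the headline is taken to be the first `<h1>` on the page, nested tags are simply stripped, and HTML entities are not decoded. Regular expressions are fragile on real-world HTML, so a proper HTML parser would likely be more robust. `H1Extractor` and `firstH1` are hypothetical names.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Boilerpipe.
final class H1Extractor {
    // First <h1 ...>...</h1>, case-insensitive, content may span lines.
    private static final Pattern H1 = Pattern.compile(
            "<h1\\b[^>]*>(.*?)</h1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Returns the text of the first <h1>, or empty if there is none.
    static Optional<String> firstH1(String html) {
        Matcher m = H1.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        // Replace nested tags (<a>, <span>, ...) with spaces, then
        // collapse runs of whitespace into single spaces.
        String text = TAG.matcher(m.group(1)).replaceAll(" ")
                .replaceAll("\\s+", " ")
                .trim();
        return text.isEmpty() ? Optional.empty() : Optional.of(text);
    }

    public static void main(String[] args) {
        String html = "<html><body><h1 class=\"hl\">Big <em>story</em> today</h1></body></html>";
        System.out.println(firstH1(html).orElse("(no headline found)"));
    }
}
```

Note that some sites use `<h1>` for the site logo or name rather than the article headline, so the result may still need checking against the page title.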

1 answer

Are you writing a web crawler? I think the difficulty is that you need to know where the title sits within the whole HTML document. Most websites follow their own consistent pattern when writing their HTML, so you need to know that pattern before you write the crawler.
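One way to read this suggestion is as a per-site rule table: for each known host, register a pattern that locates the headline, and fall back to a generic rule elsewhere. The sketch below is only an assumption about how that might look. The class `SiteHeadlineRules`, the host `news.example.com`, and the `headline` CSS class are all made up for illustration, and the rules are regular expressions with one capture group rather than a real HTML parser.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical per-site rule table; not part of Boilerpipe.
final class SiteHeadlineRules {
    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;
    // Generic fallback: the first <h1> on the page.
    private static final Pattern FALLBACK =
            Pattern.compile("<h1\\b[^>]*>(.*?)</h1\\s*>", FLAGS);

    private final Map<String, Pattern> rulesByHost = new HashMap<>();

    // Registers a regex with exactly one capture group for the given host.
    void register(String host, String regexWithOneGroup) {
        rulesByHost.put(host.toLowerCase(Locale.ROOT),
                Pattern.compile(regexWithOneGroup, FLAGS));
    }

    // Applies the host's rule, or the fallback when the host is unknown.
    // URI.create throws IllegalArgumentException for malformed URLs.
    Optional<String> headline(String url, String html) {
        String host = URI.create(url).getHost();
        Pattern rule = (host == null)
                ? FALLBACK
                : rulesByHost.getOrDefault(host.toLowerCase(Locale.ROOT), FALLBACK);
        Matcher m = rule.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        // Strip nested tags and normalize whitespace.
        String text = m.group(1).replaceAll("<[^>]+>", " ")
                .replaceAll("\\s+", " ")
                .trim();
        return text.isEmpty() ? Optional.empty() : Optional.of(text);
    }

    public static void main(String[] args) {
        SiteHeadlineRules rules = new SiteHeadlineRules();
        rules.register("news.example.com", "<div class=\"headline\">(.*?)</div>");
        String html = "<h1>Example News</h1><div class=\"headline\">Real headline</div>";
        System.out.println(rules.headline("https://news.example.com/a/1", html).orElse(""));
    }
}
```

The trade-off is maintenance: each rule has to be written by hand and breaks when the site changes its markup, which is why the fallback rule is kept for hosts without a registered pattern.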

How to extract news content from a web page using Boilerpipe?

Using boilerpipe to extract non-english articles

Im trying using boilerpipe library for article extraction in java

Not able to parse new york times article using boilerpipe

What is the best way to detect and extract article content / comments from blog's article

Using boilerpipe on Android application

Using boilerpipe in Android

How to extract published-time and article-content from a news article using java?

How to extract headline titles followed by respective text from Wikipedia

How do I extract only number from dynamic headline/text?





solution1 0 2016-10-21 09:33:51
