
Extract an article's headline from HTML (using Boilerpipe)

Boilerpipe can extract just the article text from a web page, stripping away the surrounding HTML clutter. But how can I extract the article's headline? One option is to use the page's title, but it is sometimes incorrect and often contains extra words (e.g. "title - sitename").

Another idea is to take the text between <h1> and </h1>, but I thought I would ask whether there are other solutions.
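
For illustration, the two ideas above (the first <h1>, with the page title as a fallback) can be combined in a few lines. This is only a rough sketch in Python using the standard library's html.parser, not Boilerpipe's API; the names HeadlineParser and extract_headline are made up for this example, and the " - sitename" / " | sitename" separators are an assumption about how sites format their titles.

```python
from html.parser import HTMLParser
import re


class HeadlineParser(HTMLParser):
    """Collects the text of the first <h1> and of <title> (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.h1 = None        # text of the first <h1>, set when it closes
        self.title = None     # text of <title>, set when it closes
        self._target = None   # tag currently being captured, if any
        self._buf = []        # text fragments seen inside the captured tag

    def handle_starttag(self, tag, attrs):
        # Start capturing only if nothing is being captured and this tag is still needed.
        if self._target is None and (
            (tag == "h1" and self.h1 is None)
            or (tag == "title" and self.title is None)
        ):
            self._target = tag
            self._buf = []

    def handle_data(self, data):
        if self._target is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._target:
            # Collapse runs of whitespace left over from nested tags and line breaks.
            text = " ".join("".join(self._buf).split())
            if tag == "h1":
                self.h1 = text
            else:
                self.title = text
            self._target = None


def extract_headline(html):
    """Return the first non-empty <h1> text, else the cleaned <title>, else None."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    if parser.h1:
        return parser.h1
    if parser.title:
        # Assumption: the site name follows a " - " or " | " separator.
        return re.split(r"\s+[-|]\s+", parser.title)[0]
    return None
```

Keeping only the part before the first separator is a guess: it will cut off headlines that themselves contain " - ", and it does nothing for sites that put the site name first, so the <h1> route is tried first.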

Are you writing a web crawler? I think the difficulty is that you need to know where the headline sits within the full HTML. Most websites follow their own consistent HTML pattern, so that pattern should be known before the crawler is written.
