
What is the best way to detect and extract article content and comments from a blog article?

I have blog posts ( sample 1 , sample 2 ). What is the best way to parse the HTML and detect the author, title, date, article content, and comments (each separately)? All other content should be skipped.

You may not find everything you want, but I think Boilerpipe is worth a look.

Assuming your blog site has an RSS feed, you can use Java's SAX parser to whip through the XML:

http://download.oracle.com/javase/1.4.2/docs/api/javax/xml/parsers/SAXParser.html

Here's an example of someone parsing an RSS feed with a SAX parser:

http://javabeanz.wordpress.com/2007/07/25/rss-parser-sax/
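
For a rough idea of what the SAX approach looks like, here is a minimal sketch, assuming a plain RSS 2.0 feed whose items carry title, author, pubDate and description elements. The class names RssSaxSketch, Item and RssHandler are made up for illustration, and real feeds often differ (for example, the author may be in a namespaced element, and comments usually live in a separate feed, if they are exposed at all).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: pulls a few fields out of each <item> of an RSS 2.0 feed.
class RssSaxSketch {

    /** One <item> from the feed; the field names are illustrative. */
    static class Item {
        String title = "";
        String author = "";
        String pubDate = "";
        String description = "";
    }

    /** Collects the text of selected child elements of each <item>. */
    static class RssHandler extends DefaultHandler {
        final List<Item> items = new ArrayList<>();
        private Item current; // non-null only while inside an <item>
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (qName.equals("item")) {
                current = new Item();
            }
            text.setLength(0); // start collecting text for the new element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // SAX may deliver one element's text in several chunks, so append.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (current == null) {
                return; // skip channel-level elements such as the feed title
            }
            String value = text.toString().trim();
            switch (qName) {
                case "title": current.title = value; break;
                case "author": current.author = value; break;
                case "pubDate": current.pubDate = value; break;
                case "description": current.description = value; break;
                case "item": items.add(current); current = null; break;
                default: break; // everything else is ignored
            }
            text.setLength(0);
        }
    }

    /** Parses the feed XML held in a string and returns its items in document order. */
    static List<Item> parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        RssHandler handler = new RssHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.items;
    }

    public static void main(String[] args) throws Exception {
        // Tiny made-up feed, just to exercise the handler.
        String xml = "<rss version=\"2.0\"><channel><title>Blog</title>"
                + "<item><title>First post</title><author>alice</author>"
                + "<pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate>"
                + "<description>Hello world</description></item>"
                + "</channel></rss>";
        for (Item item : parse(xml)) {
            System.out.println(item.title + " | " + item.author + " | " + item.pubDate);
        }
    }
}
```

To read a live feed you would pass the parser an InputStream opened on the feed URL instead of a StringReader. Note that many feeds only carry a summary in description, so this may not give you the full article body.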
