
How to identify feeds in a web crawl?

I've run a web crawl and gathered a lot of HTML and XML pages. My goal is to extract all the RSS/Atom feeds from them. I noticed that many sites simply use "text/xml" as the Content-Type header, so I can't distinguish a feed from any other kind of XML. So I wrote this piece of code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public boolean isFeed(String content){
    Document doc = Jsoup.parse(content);
    // Atom feeds have a <feed> root element, RSS feeds have a <channel> element
    Elements feed = doc.getElementsByTag("feed");
    Elements channel = doc.getElementsByTag("channel");
    // getElementsByTag never returns null, so checking isEmpty() is sufficient
    return !feed.isEmpty() || !channel.isEmpty();
}

Is there anything missing here? Any problem with it?

Parse the document with a full-blown XML parser. If it doesn't parse, it's not Atom. Then take the document (root) element: if it's not <feed xmlns="http://www.w3.org/2005/Atom">, it's not Atom. Of course, use the appropriate APIs to read the tag name and namespace; don't compare strings.
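
As a minimal sketch of that check using the JDK's namespace-aware DocumentBuilder (the method name isAtomFeed and the catch-everything error handling are my own choices):

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public boolean isAtomFeed(String content) {
    try {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required so namespace URIs are reported
        Element root = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(content)))
                .getDocumentElement();
        // An Atom feed has a root element named "feed" in the Atom namespace
        return "feed".equals(root.getLocalName())
                && "http://www.w3.org/2005/Atom".equals(root.getNamespaceURI());
    } catch (Exception e) {
        // Not well-formed XML, so it can't be an Atom feed
        return false;
    }
}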

Take a similar approach to discover RSS, or use the Rome library to parse the document for you.
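
With Rome, a rough sketch could look like this (package names are from the newer com.rometools artifact; any parse failure is treated as "not a feed"):

import java.io.StringReader;
import com.rometools.rome.feed.synd.SyndFeed;
import com.rometools.rome.io.SyndFeedInput;

public boolean isFeed(String content) {
    try {
        // Rome understands RSS 0.9x/1.0/2.0 and Atom; build() throws if the
        // document is neither well-formed XML nor a recognized feed format
        SyndFeed feed = new SyndFeedInput().build(new StringReader(content));
        return feed != null;
    } catch (Exception e) {
        return false;
    }
}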
