
How to identify feeds in a web crawl?

I've run a web crawl and gathered a lot of HTML and XML pages. My goal is to extract all the RSS/Atom feeds from them. I noticed that many sites simply use "text/xml" as the Content-Type header, so I can't distinguish a feed from any other kind of XML. So I wrote this piece of code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public boolean isFeed(String content){
    Document doc = Jsoup.parse(content);
    // Atom feeds have a <feed> root element, RSS feeds have a <channel> element
    Elements feed = doc.getElementsByTag("feed");
    Elements channel = doc.getElementsByTag("channel");
    // getElementsByTag never returns null, so checking isEmpty() is sufficient
    return !feed.isEmpty() || !channel.isEmpty();
}

Is there anything missing here? Any problem with it?

Parse the document with a full-blown XML parser. If it doesn't parse, it's not Atom. Then take the document (root) element: if it's not <feed xmlns="http://www.w3.org/2005/Atom">, it's not Atom. Of course, use the appropriate APIs to read the tag name and namespace; don't compare strings.
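
As a minimal sketch of that check using the JDK's namespace-aware DocumentBuilder (the method name isAtomFeed and the catch-everything error handling are my own choices):

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public boolean isAtomFeed(String content) {
    try {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required so namespace URIs are reported
        Element root = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(content)))
                .getDocumentElement();
        // An Atom feed has a root element named "feed" in the Atom namespace
        return "feed".equals(root.getLocalName())
                && "http://www.w3.org/2005/Atom".equals(root.getNamespaceURI());
    } catch (Exception e) {
        // Not well-formed XML, so it can't be an Atom feed
        return false;
    }
}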

Take a similar approach to discover RSS, or use the Rome library to parse the document for you.
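
With Rome, a rough sketch could look like this (package names are from the newer com.rometools artifact; any parse failure is treated as "not a feed"):

import java.io.StringReader;
import com.rometools.rome.feed.synd.SyndFeed;
import com.rometools.rome.io.SyndFeedInput;

public boolean isFeed(String content) {
    try {
        // Rome understands RSS 0.9x/1.0/2.0 and Atom; build() throws if the
        // document is neither well-formed XML nor a recognized feed format
        SyndFeed feed = new SyndFeedInput().build(new StringReader(content));
        return feed != null;
    } catch (Exception e) {
        return false;
    }
}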
