
Aggregating from various sources

Original post 2010-09-10 13:55:46 6 2 rss/ yahoo-pipes/ aggregators

This may be a project well beyond my current skills, but I have about a full month to spend on it, so I think I can do it. What I want to build: gather news about a specific subject from various sources. Easy, right? Just fetch the RSS feeds and display them on a page. But I want something more advanced: duplicates removed, and customized presentation (that is, the ability to define and change the format in which the news headlines are displayed).

I've played a bit with Yahoo Pipes and some other tools and I am facing two big problems:

Some sources don't provide RSS feeds. How do I create one?
What's the best method to find and remove duplicates? I thought about comparing the headlines and treating two items as duplicates if they match by more than, say, 50%. Is that good practice, though?
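To make the headline comparison concrete, here is a minimal sketch in Python using the standard library's difflib. The function names and the 0.5 default (taken from the "say, 50%" figure above) are illustrative assumptions, not a recommendation for any particular threshold.

```python
from difflib import SequenceMatcher


def headline_similarity(a, b):
    """Return a ratio in [0, 1]; case and runs of whitespace are ignored."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def drop_near_duplicates(headlines, threshold=0.5):
    """Keep a headline only if it is not too similar to one already kept."""
    kept = []
    for headline in headlines:
        # Compare against every headline kept so far (quadratic in the worst case).
        if all(headline_similarity(headline, k) <= threshold for k in kept):
            kept.append(headline)
    return kept
```

Every headline is compared with every kept headline, so this is only practical for modest numbers of items, and a character-level ratio can both miss rewritten headlines and merge unrelated short ones.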

Please add any other things (problems, suggestions, whatever) I might not have considered.

2 answers

Duplication is a nasty issue. What I eventually ended up doing:

1. Strip out all HTML tags except for links. (I started out using regex and got burned, so I eventually moved to custom parsing to remove the tags.)
2. Strip out all whitespace
3. Normalize case (for example, lowercase everything)
4. Hash all that with MD5.

Here's why you leave the link in: A comment might be as simple as "Yes, this sucks". "Yes, this sucks" could be a common comment. BUT if the text "this sucks" is linked to different things, then it is not a duplicate comment.
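The four steps can be sketched roughly as follows in Python. This is not the code I actually used; the class and function names are made up for illustration, and it relies on the standard library's html.parser instead of a hand-written tag list.

```python
import hashlib
from html.parser import HTMLParser


class LinkKeepingStripper(HTMLParser):
    """Collect text and drop every tag except <a>, whose href is kept."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            self.parts.append(f"<a href={href}>")

    def handle_endtag(self, tag):
        if tag == "a":
            self.parts.append("</a>")

    def handle_data(self, data):
        self.parts.append(data)


def dedup_key(html_text):
    """Return an MD5 hex digest of the normalized content."""
    stripper = LinkKeepingStripper()
    stripper.feed(html_text)          # step 1: strip tags, keep links
    stripper.close()
    text = "".join(stripper.parts)
    text = "".join(text.split())      # step 2: remove all whitespace
    text = text.lower()               # step 3: normalize case
    return hashlib.md5(text.encode("utf-8")).hexdigest()  # step 4: hash
```

Two items with the same key are treated as duplicates. Because the link target is part of the hashed text, "this sucks" linked to two different URLs produces two different keys. Note that lowercasing also lowercases the URL, which follows the steps literally but could merge links that differ only by case.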

Additionally, you will find that HTML tag escaping is weird in RSS feeds. You would think that a stray < would be double-encoded (I think) as &amp;lt; but it is not. It is encoded as &lt; and so are the < characters of HTML tags, e.g. &lt;p&gt;.

I eventually copied all the known HTML tags as parsed by Mozilla Firefox and manually recognized those tags.
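A small Python sketch of the ambiguity: once the feed text is decoded, a stray < and the < of a real tag look the same, so the only way to tell them apart is to recognize known tag names. The KNOWN_TAGS set below is a tiny illustrative subset, not the list copied from Firefox, and a regex is used here only for brevity.

```python
import html
import re

raw = "1 < 2 and <p>hello</p>"

# Single encoding, as typically seen in feed content.
encoded_once = html.escape(raw, quote=False)
# Double encoding, which is what one might expect for the stray "<".
encoded_twice = html.escape(encoded_once, quote=False)

# Illustrative subset only; a real list would be much longer.
KNOWN_TAGS = {"p", "a", "b", "i", "br", "div", "span"}
TAG_CANDIDATE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^<>]*>")


def strip_known_tags(text):
    """Remove only tags whose name is in KNOWN_TAGS; leave everything else."""
    def replace(match):
        return "" if match.group(1).lower() in KNOWN_TAGS else match.group(0)
    return TAG_CANDIDATE.sub(replace, text)


decoded = html.unescape(encoded_once)   # back to the original text
cleaned = strip_known_tags(decoded)     # stray "<" survives, <p> is removed
```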

Creating an RSS feed from HTML is quite nasty, and I can only point you to services such as Spinn3r, which are fantastic at de-duplication and content extraction. These services typically use probability-based algorithms that are beyond me. I know of one provider that got away with regexing pages (they had to know whether a given page was MySpace-based or Blogger-based), but they did not perform admirably.

You might want to try to use the YQL module to scrape a webpage that doesn't provide RSS. Here's a sample of a YQL statement to scrape HTML.
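If you would rather do the scraping yourself than go through YQL, the same idea can be sketched in plain Python with the standard library. This is a swapped-in technique, not YQL. It assumes, purely for illustration, that the page puts each headline in a link inside an h2 element; the class and function names are hypothetical, and every real site will need its own selection rule.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import xml.etree.ElementTree as ET


class HeadlineLinkParser(HTMLParser):
    """Collect (title, href) pairs for <a> tags found inside <h2> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.items = []
        self._in_h2 = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
        elif tag == "a" and self._in_h2:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = " ".join("".join(self._text).split())
            if title:
                self.items.append((title, self._href))
            self._href = None
        elif tag == "h2":
            self._in_h2 = False


def html_to_rss(page_html, page_url, feed_title):
    """Build a minimal RSS 2.0 document from the headline links of a page."""
    parser = HeadlineLinkParser()
    parser.feed(page_html)
    parser.close()
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = page_url
    ET.SubElement(channel, "description").text = f"Headlines scraped from {page_url}"
    for title, href in parser.items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        # Resolve relative links against the page URL.
        ET.SubElement(item, "link").text = urljoin(page_url, href)
    return ET.tostring(rss, encoding="unicode")
```

Fetching the page (for example with urllib.request) is left out on purpose; the function only turns HTML you already have into a feed string.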

About duplicates, take a look at this pipe.

Customized presentation: if you want it truly customized you'll have to manipulate the pipe results yourself, e.g. get them as JSON and manipulate them with JavaScript, or process them server-side.
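For the server-side route, a minimal Python sketch could look like this. The JSON shape (a top-level "items" list whose entries have "title", "link" and "source" keys) is an assumption made for the example, not the documented Pipes output format, and render_headlines is a hypothetical helper.

```python
import json
from string import Template


def render_headlines(feed_json, line_format):
    """Format each item with a user-supplied template such as "$title ($link)"."""
    data = json.loads(feed_json)
    template = Template(line_format)
    lines = []
    for item in data.get("items", []):
        # safe_substitute leaves unknown placeholders untouched instead of raising.
        lines.append(template.safe_substitute(
            title=item.get("title", ""),
            link=item.get("link", ""),
            source=item.get("source", ""),
        ))
    return lines
```

Changing the presentation then means changing only the format string, for instance "$source: $title" for a compact list or an HTML snippet with $link in an href.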

Get information from various sources

How do I data mine various news sources?

Aggregating and displaying content from hundreds of RSS feeds

Is there a way to read logo images from different news sources with an standard approach?

How to view feeds from multiple sources in RSS android application

php grabbing rss feeds from too many sources

Aggregating feeds in Rails application

Aggregating RSS Items in Java

grouping / comparing the similar news stories together which are gathered from different sources

read xml with various namespaces with php


The technical post webpages of this site follow the CC BY-SA 4.0 license. If you need to reprint, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

Aggregating from various sources

Question

2 answers

solution1
1 2010-09-13 02:25:53

solution2
0 2010-09-10 17:02:14
