
Aggregating from various sources

This may be a project well beyond my current skills, but I have about a full month to spend on it, so I think I can do it. What I want to build is this: gather news about a specific subject from various sources. Easy, right? Just get the RSS feeds and display them on a page. Well, I want something more advanced: duplicates removed and a customizable presentation (that is, the ability to define and change the format in which the news headlines are displayed).

I've played a bit with Yahoo Pipes and some other tools and I am facing two big problems:

  1. Some sources don't provide RSS feeds. How do I create one?
  2. What's the best method to find and remove duplicates? I thought about comparing the headlines and checking whether they match by more than, say, 50%. Is that good practice, though?
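
To make the idea in point 2 concrete, here is a minimal sketch of the headline comparison using Python's standard `difflib`. The function names and the 0.5 threshold are only illustrative assumptions, and character-level similarity is just one possible way to measure a "match".

```python
from difflib import SequenceMatcher


def headline_similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1], ignoring case and extra spaces."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def drop_similar(headlines, threshold=0.5):
    """Keep a headline only if it is not too similar to one already kept."""
    kept = []
    for headline in headlines:
        if all(headline_similarity(headline, k) <= threshold for k in kept):
            kept.append(headline)
    return kept
```

Every headline is compared against all the ones already kept, so this is quadratic in the number of headlines. That is probably fine for a few hundred items per run, but not for much more.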

Please add any other things (problems, suggestions, whatever) I might not have considered.

Duplication is a nasty issue. What I eventually ended up doing:

  1. Strip out all HTML tags except for links. (I started out using regex and got burned, so I eventually moved to custom parsing to remove the tags.)
  2. Strip out all whitespace.
  3. Normalize case (for example, lowercase everything).
  4. Hash the result with MD5.

Here's why you leave the link in: a comment might be as simple as "Yes, this sucks", which could be a common comment. But if the text "this sucks" is linked to different things, then it is not a duplicate comment.
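
The four steps above could be sketched roughly like this in Python. This is not the code I actually used. It leans on the standard library's `html.parser` instead of a hand-written parser, and the class and function names are made up for illustration.

```python
import hashlib
from html.parser import HTMLParser


class LinkKeepingStripper(HTMLParser):
    """Collects text content, dropping every tag except <a href=...>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            self.parts.append(f'<a href="{href}">')

    def handle_endtag(self, tag):
        if tag == "a":
            self.parts.append("</a>")

    def handle_data(self, data):
        self.parts.append(data)


def fingerprint(html_text: str) -> str:
    """Strip tags (except links), strip whitespace, lowercase, then MD5."""
    parser = LinkKeepingStripper()
    parser.feed(html_text)
    parser.close()
    kept = "".join(parser.parts)
    normalized = "".join(kept.split()).lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()
```

Two items with the same fingerprint are treated as duplicates. Because the link target is part of the hashed text, "this sucks" pointing at two different URLs produces two different hashes.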

Additionally, you will find that HTML tag escaping is weird with RSS feeds. You would think that a stray < would be double-encoded (as &amp;lt;, I think), but it is not. It is encoded as &lt;, and so are HTML tags: <p> arrives as &lt;p&gt;. That means a literal < and the start of a tag look the same after decoding, so I eventually copied the list of all HTML tags known to Mozilla Firefox and recognized those tags manually.
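
A rough Python sketch of that known-tag approach, assuming the feed content has been entity-encoded exactly once. The `KNOWN_TAGS` set here is a tiny illustrative subset, not the full list I copied, and the scanner is deliberately simple.

```python
import html

# Illustrative subset only; a real list would cover every tag a browser knows.
KNOWN_TAGS = {"a", "b", "i", "p", "br", "em", "strong", "div", "span",
              "ul", "ol", "li", "img", "blockquote"}


def strip_known_tags(escaped: str) -> str:
    """Decode entities once, then remove only things that look like known tags."""
    text = html.unescape(escaped)
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "<":
            end = text.find(">", i + 1)
            if end != -1:
                inner = text[i + 1:end].lstrip("/")
                name = ""
                for c in inner:
                    if not c.isalnum():
                        break
                    name += c
                if name.lower() in KNOWN_TAGS:
                    i = end + 1  # skip the whole tag
                    continue
        out.append(ch)  # a stray "<" or an unknown tag is kept as text
        i += 1
    return "".join(out)
```

With this, `&lt;p&gt;1 &lt; 2&lt;/p&gt;` loses its paragraph tags but keeps the literal less-than sign in the middle.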

Creating an RSS feed from HTML is quite nasty, and I can only point you to services such as Spinn3r, which are fantastic at de-duplication and content extraction. These services typically use probability-based algorithms that are beyond me. I know of one provider that got away with regexing pages (they had to know whether a given page was MySpace-based or Blogger-based), but they did not perform admirably.

You might want to try to use the YQL module to scrape a webpage that doesn't provide RSS. Here's a sample of a YQL statement to scrape HTML.
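
If you would rather do the scraping yourself, here is a minimal sketch in plain Python (standard library only, not YQL) that turns a page into a bare-bones RSS 2.0 document. It assumes, purely for illustration, that each headline on the page looks like `<h2><a href="...">title</a></h2>`. Real sites will need their own extraction rules, and the names below are made up.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collects (title, link) pairs from <h2><a href="...">title</a></h2>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.items = []
        self._in_h2 = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
        elif tag == "a" and self._in_h2:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.items.append(("".join(self._text).strip(), self._href))
            self._href = None
        elif tag == "h2":
            self._in_h2 = False


def html_to_rss(page_html: str, feed_title: str, feed_link: str) -> str:
    """Build a minimal RSS 2.0 string from the headlines found in the page."""
    parser = HeadlineParser()
    parser.feed(page_html)
    parser.close()
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = f"Headlines scraped from {feed_link}"
    for title, link in parser.items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
    return ET.tostring(rss, encoding="unicode")
```

Fetching the page is left out on purpose. You would pass in the HTML you downloaded with whatever HTTP client you prefer.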

About duplicates, take a look at this pipe.

Customized presentation: if you want it truly customized, you'll have to manipulate the pipe results yourself, e.g. get them as JSON and manipulate them with JavaScript, or process them server-side.
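
For the server-side route, a minimal Python sketch could look like this. It assumes the results arrive as a JSON array of objects with `title`, `source` and `link` keys. That shape is hypothetical, so adapt the keys to whatever your pipe actually outputs.

```python
import json


def render_headlines(json_text: str, template: str = "{title} ({source}) - {link}"):
    """Render each item of a JSON array with a user-defined format string."""
    items = json.loads(json_text)
    return [template.format(**item) for item in items]
```

Changing the presentation is then just a matter of passing a different template, for example `"* {title}"` for a plain bulleted list.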
