
Web Scraping with Google App Engine

I am trying to scrape a website and republish the data as an RSS feed. How hard is this to set up with Google App Engine? What are the advantages and disadvantages of using GAE? Any recommendations and guidelines are greatly appreciated!

Google App Engine offers much more functionality (and complexity) than you need if all you really want to do is republish some structured data as RSS. Personally, I would use something like Yahoo Pipes for a task like this.
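To give a sense of how small the "republish as RSS" half of the task is, here is a minimal sketch using only Python's standard library. The function name `build_rss` and the item dictionary keys are hypothetical choices made for this illustration, and the code assumes a modern Python 3 interpreter rather than the older runtime discussed in this thread.

```python
import xml.etree.ElementTree as ET


def build_rss(channel_title, channel_link, channel_description, items):
    """Return an RSS 2.0 document as a string.

    `items` is a list of dicts with 'title', 'link', and 'description' keys
    (a hypothetical shape chosen for this sketch).
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = channel_description

    for item in items:
        node = ET.SubElement(channel, "item")
        for field in ("title", "link", "description"):
            # ElementTree escapes characters such as & and < on output.
            ET.SubElement(node, field).text = item[field]

    return ET.tostring(rss, encoding="unicode")
```

Whatever request handler you end up with would return this string with an XML content type such as `application/rss+xml`.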

That being said... if you want/need to get your feet wet with GAE, go for it!

Working with Google App Engine is pretty straightforward. I would recommend going through the Getting Started guide. It's short and simple and touches on the essential GAE topics. There are more pros and cons than I will list here.

Pros:
In general, App Engine is designed for high-traffic web applications that need to scale. Furthermore, it is designed from a programmer's perspective. Many of the scalability issues (database optimization, server administration, etc.) are dealt with by Google. Having said that, I find it to be a nice platform. It is still being actively developed by Google engineers, and task scheduling (a long-requested feature) is on the current roadmap.

Cons:
Perhaps the biggest downsides right now are, again, the lack of official scheduling support and the quota limits currently set for free accounts. However, you can't complain much if it's free. Currently it only supports Python as a programming language (although a new language [Java, I predict] is coming soon). Furthermore, Python 2.6 (and 3.0, for that matter) is not yet supported. In addition, Django 1.0 is not officially supported in App Engine (although you can package Django 1.0 with your application).

Harder than it would be in most other technologies.

GAE can sort of do scheduled batch stuff like this now, but it's really not intended for that type of thing. Pick pretty much any other language and platform for this particular task, and you'll make your life a lot easier.

You might also want to look into Yahoo! Query Language (YQL)

I think BeautifulSoup could run on GAE, so all your scraping needs are handled :D Also, GAE has a URL fetch service for retrieving pages. The only problem I think you might have is not having enough time to get the data (requests are limited to 30 seconds).
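As a rough illustration of the scraping step, here is a sketch that pulls link text and URLs out of a page. To stay dependency-free it uses the standard library's `html.parser` as a stand-in for BeautifulSoup and `urllib` as a stand-in for GAE's fetch service, so it is not the App Engine API itself. The names `LinkCollector`, `extract_links`, and `fetch` are hypothetical, and the explicit timeout is there because of the request time limit mentioned above.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html):
    """Return a list of (text, href) tuples found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def fetch(url, timeout=10):
    """Download a page, giving up well before a 30-second request limit."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

For example, `extract_links(fetch(some_url))` would give you the raw material for the feed items; a real scraper would target whatever elements the source site actually uses.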

I am working on a similar project, and I've decided that it's easier to prepare the data on another server and push it to GAE.
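One way the "prepare elsewhere, push to GAE" approach might look from the other server's side is sketched below. Everything specific here is an assumption made for illustration: the `/ingest` endpoint, the JSON payload shape, and the bearer token are hypothetical, and your App Engine application would need a handler that accepts them.

```python
import json
from urllib.request import Request, urlopen


def build_push_request(endpoint, items, token):
    """Build a JSON POST request carrying the scraped items.

    `endpoint` and `token` are hypothetical: they stand for a handler you
    would write in your own app and a shared secret it checks.
    """
    body = json.dumps({"items": items}).encode("utf-8")
    return Request(
        endpoint,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,
        },
    )


def push(endpoint, items, token, timeout=10):
    """Send the items and return the HTTP status code of the response."""
    request = build_push_request(endpoint, items, token)
    with urlopen(request, timeout=timeout) as response:
        return response.status
```

A cron job on the scraping server could then call something like `push("https://your-app.appspot.com/ingest", items, token)` after each scrape, leaving the App Engine side to store the items and serve the feed.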
