
Collect and manage data, and make it available through an API

Here is my problem: I have many known locations (I have no influence over them), each holding a lot of data. Each location offers new data at its own interval. Some give me differential updates, some just the whole dataset; some provide XML, for some I have to build a web scraper, some need authentication, etc. The collected data should be stored in a database, and I have to program an API that sends the requested data back as XML.

Many roads lead to Rome, but which one should I choose?

Which software would you suggest I use?

I am familiar with C++, C#, Java, PHP, MySQL and JS, but new stuff is still OK.

My idea is to use cron jobs + PHP (or a shell script) + curl to fetch the data. Then I need a module to parse the data and insert it into a database (MySQL). Client requests for data could be answered by a PHP script.
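For illustration only, here is a minimal sketch of the fetch-and-store part of that idea in PHP; the source URL, the `raw_items` table and the XML structure are assumptions, not anything the question specifies.

```php
<?php
// fetch.php - run by cron; fetches one source and stores the raw records.
// URL, credentials, table name and XML layout below are illustrative assumptions.

$ch = curl_init('https://example.com/feed.xml');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_TIMEOUT        => 120,
    CURLOPT_USERPWD        => 'user:secret',   // only for sources that need authentication
]);
$body = curl_exec($ch);
if ($body === false) {
    fwrite(STDERR, 'fetch failed: ' . curl_error($ch) . PHP_EOL);
    exit(1);
}
curl_close($ch);

$pdo = new PDO('mysql:host=localhost;dbname=collector;charset=utf8mb4', 'collector', 'secret',
               [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);

$stmt = $pdo->prepare('INSERT INTO raw_items (source, external_id, payload, fetched_at)
                       VALUES (?, ?, ?, NOW())');

// Assumes the feed is XML with <item><id>...</id>...</item> children.
$xml = new SimpleXMLElement($body);
foreach ($xml->item as $item) {
    $stmt->execute(['example-feed', (string)$item->id, $item->asXML()]);
}
```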

I think the input data volume is about 1-5GB/day.

There is no single correct answer, but can you give me some advice? It would be great if you could show me smarter ways to do this.

Thank you very much :-)

LAMP: Stick to PHP and MySQL (with occasional forays into Perl/Python): the availability of PHP libraries, storage solutions, scalability and API options, together with the size of its community, more than makes up for what other environments offer.

API: Ensure that the designed API queries (and the storage/database behind them) can meet all end-product needs (date ranges, tagging, special cases) before you get to writing any importers.
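As a rough sketch of what "design the queries first" can look like: a hypothetical endpoint that takes a date range and an optional tag and answers in XML. The table, column and parameter names are assumptions for illustration.

```php
<?php
// api.php?from=2012-01-01&to=2012-01-31&tag=weather   (parameter names are illustrative)
$pdo = new PDO('mysql:host=localhost;dbname=collector;charset=utf8mb4', 'api', 'secret',
               [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);

$from = $_GET['from'] ?? date('Y-m-d', strtotime('-1 day'));
$to   = $_GET['to']   ?? date('Y-m-d');
$tag  = $_GET['tag']  ?? null;

$sql  = 'SELECT id, source, tag, fetched_at FROM items
         WHERE fetched_at BETWEEN ? AND ?' . ($tag !== null ? ' AND tag = ?' : '');
$args = $tag !== null ? [$from, $to, $tag] : [$from, $to];

$stmt = $pdo->prepare($sql);
$stmt->execute($args);

header('Content-Type: application/xml; charset=utf-8');
$out = new SimpleXMLElement('<items/>');
foreach ($stmt as $row) {
    $item = $out->addChild('item');
    $item->addAttribute('id', (string)$row['id']);
    $item->addChild('source', htmlspecialchars($row['source']));
    $item->addChild('tag', htmlspecialchars($row['tag']));
    $item->addChild('fetched_at', $row['fetched_at']);
}
echo $out->asXML();
```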

PERFORMANCE: If you need lightning-fast queries over insanely large data sets, Sphinx Search can help. It offers more than just text search (tags, binary attributes, etc.), but make sure you spec the server with extra RAM.
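If you go that route, one convenient way to query Sphinx from PHP is its SphinxQL interface (a MySQL-protocol listener, by default on port 9306). The index name and the attribute below are placeholder assumptions.

```php
<?php
// Query a Sphinx index over SphinxQL (MySQL protocol, default port 9306).
// 'items_idx' and the 'tag_id' attribute are placeholder names.
$sphinx = new mysqli('127.0.0.1', '', '', '', 9306);

$res = $sphinx->query(
    "SELECT id FROM items_idx WHERE MATCH('temperature berlin') AND tag_id = 7 LIMIT 50"
);
$ids = [];
while ($row = $res->fetch_assoc()) {
    $ids[] = (int)$row['id'];
}
// Then fetch the full rows from MySQL using $ids.
```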

IMPORTER: Make it modular: for each different data source, write a pluggable importer that can be enabled/disabled by an admin and, of course, tested individually. Pick a language and library based on what fits the job best and most easily: a bash script is okay.
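One way to keep the importers pluggable, as a sketch: a small interface that every source-specific importer implements, so each one can be enabled, disabled and tested on its own. All names here are hypothetical.

```php
<?php
// Each data source gets its own class implementing this interface.
interface Importer
{
    /** Machine-readable name used for enabling/disabling and logging. */
    public function name(): string;

    /** Fetch raw data from the source and return it unmodified. */
    public function fetch(): string;

    /** Parse the raw data into rows ready for insertion. */
    public function parse(string $raw): array;
}

// The cron runner only iterates over the importers the admin has enabled.
function runEnabled(array $importers, array $enabled, PDO $pdo): void
{
    foreach ($importers as $importer) {
        if (!in_array($importer->name(), $enabled, true)) {
            continue;
        }
        $stmt = $pdo->prepare('INSERT INTO raw_items (source, payload) VALUES (?, ?)');
        foreach ($importer->parse($importer->fetch()) as $row) {
            $stmt->execute([$importer->name(), $row]);
        }
    }
}
```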

As for parsing libraries for PHP, there are many. One of the recently popular ones is simplehtmldom, and I found it to work quite well.
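A minimal simplehtmldom example for the scraper-type sources; the URL and the selectors are placeholders, and the snippet assumes the library's `simple_html_dom.php` file is available.

```php
<?php
// Requires simple_html_dom.php from the simplehtmldom project.
require_once 'simple_html_dom.php';

$html = file_get_html('https://example.com/listing.html');
if ($html === false) {
    exit(1);
}

// Walk each row of the (assumed) data table and read two cells.
foreach ($html->find('table#data tr') as $row) {
    $cells = $row->find('td');
    if (count($cells) >= 2) {
        echo $cells[0]->plaintext, ' => ', $cells[1]->plaintext, PHP_EOL;
    }
}
$html->clear();
```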

TRANSFORMER: Make the data transformation routines modular as well, so they can be written as the need arises. Don't have the importer alter the original data; just make it the quickest path into an indexed database. Transformation routines (or, later, plugins) should then be combined with the API query to produce whatever end result is needed.
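A sketch of what "modular transformation at query time" could look like: small callables applied to rows while the API builds its response, leaving the stored data untouched. The transformer names and row fields are illustrative.

```php
<?php
// Transformers never touch the stored rows; they only shape the API output.
$transformers = [
    // Convert temperatures that one source stores in Fahrenheit to Celsius.
    'fahrenheit_to_celsius' => function (array $row): array {
        if ($row['unit'] === 'F') {
            $row['value'] = ($row['value'] - 32) * 5 / 9;
            $row['unit']  = 'C';
        }
        return $row;
    },
    // Strip internal bookkeeping columns before they reach the client.
    'strip_internal' => function (array $row): array {
        unset($row['batch_id'], $row['fetched_at']);
        return $row;
    },
];

// Apply whichever transformers the current API query has enabled, in order.
function applyTransformers(array $row, array $transformers, array $enabled): array
{
    foreach ($enabled as $name) {
        $row = $transformers[$name]($row);
    }
    return $row;
}
```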

TIMING: There is nothing wrong with cron executions, as long as they don't become runaway processes or cause your input sources to start throttling or blocking you, so build in that awareness.
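One common way to keep cron runs from piling up, as a sketch (assuming the importer is a CLI PHP script): take a non-blocking lock at the top and simply exit if the previous run is still going. The schedule and paths are illustrative.

```php
<?php
// import_runner.php - scheduled e.g. as: */15 * * * * php /opt/collector/import_runner.php
// Skip this run entirely if the previous one is still active.
$lock = fopen('/tmp/collector-import.lock', 'c');
if ($lock === false || !flock($lock, LOCK_EX | LOCK_NB)) {
    exit(0); // previous run still active; don't stack up processes
}

// ... run the importers here, honouring each source's polling interval ...

flock($lock, LOCK_UN);
fclose($lock);
```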

VERSIONING: Design the database, imports, etc. so that errant data can be rolled back easily by an admin.
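A minimal way to make imports reversible (a schema assumption for illustration, not part of the answer above): tag every imported row with a batch id, so an admin can drop a bad batch with one statement.

```php
<?php
// Assumed tables: import_batches(id, source, started_at) and items(batch_id, payload).
$pdo = new PDO('mysql:host=localhost;dbname=collector;charset=utf8mb4', 'collector', 'secret',
               [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
$rows = ['<item id="1"/>', '<item id="2"/>'];   // whatever the importer produced

// Each run records a batch; every row it inserts carries that batch id.
$pdo->exec("INSERT INTO import_batches (source, started_at) VALUES ('example-feed', NOW())");
$batchId = (int)$pdo->lastInsertId();

$stmt = $pdo->prepare('INSERT INTO items (batch_id, payload) VALUES (?, ?)');
foreach ($rows as $payload) {
    $stmt->execute([$batchId, $payload]);
}

// Rolling back an errant import is then a single admin action:
// DELETE FROM items WHERE batch_id = <bad batch id>;
```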

VENDOR SOLUTION: Check out scraperwiki; they've made a business out of scraping tools and data storage.

Hope this helps. Out of curiosity, do you have any project details you can share? A colleague of mine is interested in exchanging notes.
