
How do I best store web-crawled data collected daily to look for changes?

I'm crawling a website daily to identify changes in what products are in stock.

How do I best store this data for comparison between previous dates?

The data looks like this:

{'name': productname, 'url': "URL to product", "status": "In stock or not", "variants": ['3', '7', '9']}

There are about 1000 products.

I need to store all this data once a day, so I can retrieve it and compare it with previous dates to see whether products have gone in or out of stock. I also need to see whether variants have been added or removed.

I'm lost as to how I should structure this. Should I use a database, several CSV files, or text files?

Any suggestions?

This isn't a particularly large amount of data, so pickle should be enough for this (and is the easiest option), unless you're particularly concerned about performance (you're not running Python on an embedded system, are you?).

All you need to do to see if there were any changes is to keep the data from the previous crawl, so you'll only ever need to store 1000 products. When you detect a change, you could log it to a file, or to a database if you plan to do many crawls or keep the system running for a long time.
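As a rough sketch of that idea, assuming each product is a dict shaped like the one in the question and that the URL uniquely identifies a product (the function names here are made up for illustration):

```python
import pickle
from pathlib import Path


def load_previous(path):
    """Return the previous crawl as {url: product}, or {} on the first run."""
    p = Path(path)
    if not p.exists():
        return {}
    with p.open("rb") as f:
        return pickle.load(f)


def save_current(path, products):
    """Overwrite the snapshot with today's crawl, keyed by URL (assumed unique)."""
    with Path(path).open("wb") as f:
        pickle.dump({p["url"]: p for p in products}, f)


def diff_crawls(previous, current):
    """Compare two {url: product} dicts and return a list of change descriptions."""
    changes = []
    for url, new in current.items():
        old = previous.get(url)
        if old is None:
            changes.append(f"{new['name']}: new product")
            continue
        if old["status"] != new["status"]:
            changes.append(f"{new['name']}: status {old['status']} -> {new['status']}")
        added = sorted(set(new["variants"]) - set(old["variants"]))
        removed = sorted(set(old["variants"]) - set(new["variants"]))
        if added:
            changes.append(f"{new['name']}: variants added {added}")
        if removed:
            changes.append(f"{new['name']}: variants removed {removed}")
    # Products that were in the previous crawl but are missing today
    for url in previous.keys() - current.keys():
        changes.append(f"{previous[url]['name']}: no longer listed")
    return changes
```

A daily run would then call `load_previous`, build today's `{url: product}` dict, pass both to `diff_crawls`, append the result to a log, and finish with `save_current`.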

Please note that this approach only records changes in the fields you chose to compare. If you later decide you want a changelog for some other field, you won't be able to reconstruct it from the stored data.

Also, it's probably worthwhile to convert the status value to a boolean, if it can only take two values.
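For instance, something along these lines could do the conversion. It assumes the site reports exactly "In stock" for available products; the `normalize` helper and the `in_stock` key are illustrative names, not anything from the question:

```python
def normalize(product):
    """Return a copy of the product with 'status' replaced by a boolean 'in_stock'."""
    cleaned = dict(product)
    status = cleaned.pop("status")
    # Assumption: any text other than "In stock" means the product is unavailable
    cleaned["in_stock"] = status.strip().lower() == "in stock"
    return cleaned
```

If the site ever adds a third state (say, a pre-order label), a boolean would no longer be enough, so it's worth checking the raw values first.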

In such situations I find it best to store data in text files so that you can read the file to check the data and edit it manually if necessary. Storing it in a database would be overkill.

You can store this in a single CSV file with name, url, status and variants as the fields. During each run you can read the CSV file, look for changes and update the file. Until you have the process debugged you can also save previous versions of the file so you can see the changes as they happen.
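A minimal sketch of the CSV round trip might look like this. Since `variants` is a list, it has to be flattened into one cell; joining with `|` is just one possible choice and assumes no variant contains that character. The function names are hypothetical:

```python
import csv

FIELDS = ["name", "url", "status", "variants"]


def write_products(path, products):
    """Write one row per product; variants are joined with '|' into a single cell."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for p in products:
            row = {key: p[key] for key in FIELDS}
            row["variants"] = "|".join(p["variants"])
            writer.writerow(row)


def read_products(path):
    """Read the file back into {url: product}, splitting variants into a list again."""
    products = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            row["variants"] = row["variants"].split("|") if row["variants"] else []
            products[row["url"]] = row
    return products
```

To keep older versions around while debugging, the file name could include the crawl date, so each run writes a new file instead of overwriting the last one.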
