
Web scraping with python and sqlite. How to store scraped data effectively?

I want to scrape some specific webpages on a regular basis (e.g. each hour), and I want to do this with Python. The scraped results should be inserted into an SQLite table. Each run will pick up new information, but it will also scrape 'old' information again, since the Python script runs every hour.

To be more precise, I want to scrape a sports-results page, where more and more match results get published on the same page as the tournament proceeds. So with each new scrape I only need the new results to be entered into the SQLite table, since the older ones were already scraped (and inserted into the table) an hour before, or even earlier.

I also don't want to insert the same result twice when it gets scraped a second time, so there should be some mechanism to check whether a result has already been scraped. Can this be done at the SQL level? That is, I scrape the whole page and issue an INSERT statement for each result, but only the INSERT statements for results not yet present in the database succeed. I'm thinking of something like a UNIQUE keyword.

Or am I thinking too much about performance, and should I solve this by doing a DROP TABLE each time before I start scraping and then just scrape everything from scratch again? I'm not talking about much data: about 100 records (= matches) per tournament and about 50 tournaments a year.

Basically, I'm just interested in some kind of best-practice approach.

What you want to do is an upsert (update the row, or insert it if it doesn't exist). To see how to do it in SQLite, check the question "SQLite UPSERT - ON DUPLICATE KEY UPDATE".
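As a rough illustration of the upsert idea, here is a minimal sketch using Python's built-in sqlite3 module. The `matches` table, its columns, and the `upsert_match` helper are assumptions made up for this example, not something from the question. It relies on the `ON CONFLICT ... DO UPDATE` clause, which needs SQLite 3.24.0 or newer.

```python
import sqlite3

# Hypothetical schema: a match is identified by tournament + both teams.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE matches (
        tournament TEXT NOT NULL,
        home       TEXT NOT NULL,
        away       TEXT NOT NULL,
        score      TEXT,
        UNIQUE (tournament, home, away)
    )
    """
)


def upsert_match(conn, tournament, home, away, score):
    """Insert a match, or update its score if the match is already stored."""
    conn.execute(
        """
        INSERT INTO matches (tournament, home, away, score)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (tournament, home, away)
        DO UPDATE SET score = excluded.score
        """,
        (tournament, home, away, score),
    )
    conn.commit()


upsert_match(conn, "Open 2024", "A", "B", "1:0")
upsert_match(conn, "Open 2024", "A", "B", "2:0")  # same match scraped again
rows = conn.execute("SELECT * FROM matches").fetchall()
```

After both calls the table still holds a single row for that match, carrying the most recently scraped score. If results never change once published, an update is unnecessary and a plain duplicate-skipping insert would do.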

It looks like you want to insert data only if it doesn't exist yet? Perhaps something like:

  1. Check if the entry exists.
  2. Insert the data if it doesn't.
  3. Update the entry if it does (if you want updates at all).

You could issue 2 separate SQL statements: a SELECT, then an INSERT or UPDATE.
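The two-statement approach might look roughly like the sketch below. The `results` table, its columns, and the `insert_if_missing` helper are hypothetical names chosen for illustration, and the update branch is left out for brevity.

```python
import sqlite3


def insert_if_missing(conn, tournament, home, away, score):
    """Return True if a new row was inserted, False if it already existed."""
    found = conn.execute(
        "SELECT 1 FROM results WHERE tournament = ? AND home = ? AND away = ?",
        (tournament, home, away),
    ).fetchone()
    if found is not None:
        return False  # already stored by an earlier run
    conn.execute(
        "INSERT INTO results (tournament, home, away, score) VALUES (?, ?, ?, ?)",
        (tournament, home, away, score),
    )
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (tournament TEXT, home TEXT, away TEXT, score TEXT)"
)
first = insert_if_missing(conn, "Open 2024", "A", "B", "1:0")
second = insert_if_missing(conn, "Open 2024", "A", "B", "1:0")
```

With a single hourly script and about 100 rows per tournament, the extra SELECT per row costs very little. The check and the insert are not atomic, though, so this only stays safe as long as one process writes to the table at a time.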

Or you could put a UNIQUE constraint on the table; I believe SQLite will then raise IntegrityError when you try to insert a duplicate:

import sqlite3

try:
  # your insert here
  pass
except sqlite3.IntegrityError:
  # the row is a duplicate, so skip it
  pass
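To make the pattern above concrete, here is a small self-contained sketch of an hourly run against a table with a UNIQUE constraint. The `results` schema, the `store_results` helper, and the sample rows are all assumptions for illustration; the real columns depend on what the scraped page provides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (tournament TEXT, home TEXT, away TEXT, score TEXT, "
    "UNIQUE (tournament, home, away))"
)


def store_results(conn, scraped):
    """Insert scraped rows and return how many of them were new."""
    new_rows = 0
    for row in scraped:
        try:
            conn.execute(
                "INSERT INTO results (tournament, home, away, score) "
                "VALUES (?, ?, ?, ?)",
                row,
            )
            new_rows += 1
        except sqlite3.IntegrityError:
            # UNIQUE constraint failed: an earlier run already stored this match.
            pass
    conn.commit()
    return new_rows


# Two simulated hourly scrapes of the same page (made-up data).
hour_1 = [("Open 2024", "A", "B", "1:0")]
hour_2 = [("Open 2024", "A", "B", "1:0"), ("Open 2024", "C", "D", "0:3")]
added_first = store_results(conn, hour_1)
added_second = store_results(conn, hour_2)
```

If the count of new rows is not needed, SQLite's `INSERT OR IGNORE INTO ...` does the same duplicate skipping purely at the SQL level, with no exception handling in Python.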
