Creating an API endpoint to fetch dynamic data based on time

Question

I have a scraper that periodically scrapes articles from news sites and stores them in a MySQL database. The oldest articles are scraped first, and the scraper then moves on to more recent ones.

For example, an article written on the 1st of Jan would be scraped first and given ID 1, and an article from the 2nd of Jan would get ID 2.

So recent articles have a higher ID than older articles.

There are multiple scrapers running at the same time.

Now I need an endpoint that I can query by article timestamp, with a limit of 10 articles per fetch.

The problem arises when, for example, 20 articles share the timestamp 1499241705. If I query the endpoint with 1499241705 and the check is >= 1499241705, I always get the same 10 articles back. Changing the condition to > means I skip articles 11-20. Adding another WHERE clause on id does not work either, because the scrapers run concurrently and articles are not always inserted in date order.

Is there a way to query this endpoint so that I always get consistent data, with the latest articles first and then the older ones?

EDIT:

   +-----------------------+
   |   id | unix_timestamp |
   +-----------------------+
   |    1 |   1000         |
   |    2 |   1001         |
   |    3 |   1002         |
   |    4 |   1003         |
   |   11 |   1000         |
   |   12 |   1001         |
   |   13 |   1002         |
   |   14 |   1003         |
   +-----------------------+

The last timestamp and ID are sent through the WHERE clause.

E.g.

$this->db->where('unix_timestamp <=', $timestamp);
$this->db->where('id <', $offset);
$this->db->order_by('unix_timestamp', 'DESC');
$this->db->order_by('id', 'DESC');

When querying with a timestamp of 1003, ids 14 and 4 are fetched. On the next call, id 4 becomes the offset, so id 13 is never fetched and only id 3 comes back. So data would be missing.
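The skipped row can be reproduced with a small sketch. This is only an illustration under assumptions: Python's built-in sqlite3 stands in for MySQL and the CodeIgniter query builder, the table name `articles`, the helper `fetch_naive`, and the page size of 2 are hypothetical, and a very large id is used as the "no offset yet" value for the first call.

```python
import sqlite3

# Sample rows from the table above, as (id, unix_timestamp).
ROWS = [(1, 1000), (2, 1001), (3, 1002), (4, 1003),
        (11, 1000), (12, 1001), (13, 1002), (14, 1003)]

# Assumption: an in-memory SQLite table stands in for the MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, unix_timestamp INTEGER)"
)
conn.executemany("INSERT INTO articles VALUES (?, ?)", ROWS)


def fetch_naive(ts, offset_id, limit=2):
    """Apply the two independent conditions from the question."""
    sql = """SELECT id FROM articles
             WHERE unix_timestamp <= ? AND id < ?
             ORDER BY unix_timestamp DESC, id DESC
             LIMIT ?"""
    return [row[0] for row in conn.execute(sql, (ts, offset_id, limit))]


# First call: no real offset yet, so pass an id larger than any stored id.
first = fetch_naive(1003, 10**9)
# Second call: the last id seen (4) becomes the offset.
second = fetch_naive(1003, first[-1])
```

Here `first` holds ids 14 and 4, and `second` holds ids 3 and 2. Id 13 has an older timestamp than id 4 but a larger id, so the condition `id < 4` filters it out for good.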

Answer 1

Two parts: timestamp and id.

WHERE   timestamp <= $ts_leftoff
  AND ( timestamp <  $ts_leftoff
            OR id <= $id_leftoff )
ORDER BY timestamp DESC, id DESC

So, assuming id is unique, it won't matter if lots of rows have the same timestamp; the order is fully deterministic.
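As a rough sketch of how this plays out on the sample table from the question, the code below pages through all rows with that WHERE clause. It rests on several assumptions: Python's sqlite3 stands in for MySQL, the names `articles`, `make_db`, `fetch_page`, and `fetch_all` are hypothetical, and each call fetches one row more than the page size so that the extra row becomes the next "left off" point.

```python
import sqlite3

# Sample rows from the question, as (id, unix_timestamp).
ROWS = [(1, 1000), (2, 1001), (3, 1002), (4, 1003),
        (11, 1000), (12, 1001), (13, 1002), (14, 1003)]


def make_db():
    """Build an in-memory table (assumption: SQLite in place of MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE articles (id INTEGER PRIMARY KEY, unix_timestamp INTEGER)"
    )
    conn.executemany("INSERT INTO articles VALUES (?, ?)", ROWS)
    return conn


def fetch_page(conn, ts_leftoff, id_leftoff, page_size):
    """Return one page plus the (timestamp, id) where the next page starts."""
    sql = """SELECT id, unix_timestamp FROM articles
             WHERE unix_timestamp <= ?
               AND (unix_timestamp < ? OR id <= ?)
             ORDER BY unix_timestamp DESC, id DESC
             LIMIT ?"""
    # Ask for one extra row: it is not shown, it only marks where we left off.
    rows = conn.execute(
        sql, (ts_leftoff, ts_leftoff, id_leftoff, page_size + 1)
    ).fetchall()
    page, extra = rows[:page_size], rows[page_size:]
    next_leftoff = (extra[0][1], extra[0][0]) if extra else None
    return page, next_leftoff


def fetch_all(conn, page_size=2):
    """Walk every page and collect the ids in the order they were served."""
    leftoff = (2**62, 2**62)  # sentinel above any real timestamp or id
    seen = []
    while leftoff is not None:
        page, leftoff = fetch_page(conn, leftoff[0], leftoff[1], page_size)
        seen.extend(row[0] for row in page)
    return seen
```

Calling `fetch_all(make_db())` yields the ids 14, 4, 13, 3, 12, 2, 11, 1: every row exactly once, newest first, even though several rows share a timestamp and the ids are not in date order.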

There is a more compact syntax for this, but unfortunately it is not well optimized:

WHERE (timestamp, id) <= ($ts_leftoff, $id_leftoff)

So, I advise against using it.
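The two forms are meant to select the same rows; only the optimization differs. A quick way to convince yourself is the sketch below, which relies on the assumption that Python tuple comparison (lexicographic) behaves like the SQL row comparison for integers. The function names are hypothetical.

```python
from itertools import product


def expanded(ts, row_id, ts_leftoff, id_leftoff):
    """The spelled-out condition recommended above."""
    return ts <= ts_leftoff and (ts < ts_leftoff or row_id <= id_leftoff)


def row_value(ts, row_id, ts_leftoff, id_leftoff):
    """The compact form: compare (timestamp, id) pairs lexicographically."""
    return (ts, row_id) <= (ts_leftoff, id_leftoff)


# Exhaustively compare both forms over a small grid of integer values.
mismatches = [
    combo for combo in product(range(4), repeat=4)
    if expanded(*combo) != row_value(*combo)
]
```

`mismatches` comes out empty, so choosing the longer form costs nothing in correctness.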

More on the concept of "left off": http://mysql.rjweb.org/doc.php/pagination

Solution 1
2 · Accepted · 2017-07-08 17:38:03