
SQL return rows in a "round-robin" order

I have a bunch of URLs stored in a table waiting to be scraped by a script. However, many of those URLs are from the same site. I would like to return those URLs in a "site-friendly" order (that is, try to avoid two URLs from the same site in a row) so I won't accidentally get blocked for making too many HTTP requests in a short time.

The database layout is something like this:

create table urls (
    site varchar,       -- holds e.g. www.example.com or stackoverflow.com
    url varchar unique
);

Example result:

SELECT url FROM urls ORDER BY mysterious_round_robin_function(site);

http://www.example.com/some/file
http://stackoverflow.com/questions/ask
http://use.perl.org/
http://www.example.com/some/other/file
http://stackoverflow.com/tags

I thought of something like "ORDER BY site <> @last_site DESC", but I have no idea how to go about writing something like that.

See this article in my blog for more detailed explanations on how it works:

With new PostgreSQL 8.4:

SELECT  *
FROM    (
        SELECT  site, url, ROW_NUMBER() OVER (PARTITION BY site ORDER BY url) AS rn
        FROM    urls
        ) q
ORDER BY
        rn, site
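To see the window-function approach in action without a PostgreSQL server, here is a small sketch using SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or newer; the table and sample data mirror the question's example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (site TEXT, url TEXT UNIQUE)")
conn.executemany(
    "INSERT INTO urls (site, url) VALUES (?, ?)",
    [
        ("www.example.com",   "http://www.example.com/some/file"),
        ("www.example.com",   "http://www.example.com/some/other/file"),
        ("stackoverflow.com", "http://stackoverflow.com/questions/ask"),
        ("stackoverflow.com", "http://stackoverflow.com/tags"),
        ("use.perl.org",      "http://use.perl.org/"),
    ],
)

# Number each site's URLs 1..n, then interleave sites by that number.
rows = conn.execute(
    """
    SELECT site, url
    FROM (
        SELECT site, url,
               ROW_NUMBER() OVER (PARTITION BY site ORDER BY url) AS rn
        FROM urls
    ) AS numbered
    ORDER BY rn, site
    """
).fetchall()

for site, url in rows:
    print(url)
```

No two consecutive rows share a site here, because each site contributes at most one URL per rn value.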

With older versions:

SELECT  site,
        (
        SELECT  url
        FROM    urls ui
        WHERE   ui.site = sites.site
        ORDER BY
                url
        OFFSET  total
        LIMIT   1
        ) AS url
FROM    ( 
        SELECT  site, generate_series(0, cnt - 1) AS total
        FROM    (
                SELECT  site, COUNT(*) AS cnt
                FROM    urls
                GROUP BY
                        site
                ) s
        ) sites
ORDER BY
        total, site

This works too, though it can be less efficient.
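What the pre-8.4 query computes can be sketched in plain Python: number each site's URLs 0..cnt-1 (the generate_series/OFFSET pairing), then sort by (number, site). The sample data mirrors the question's example:

```python
from collections import defaultdict

urls = [
    ("www.example.com",   "http://www.example.com/some/file"),
    ("www.example.com",   "http://www.example.com/some/other/file"),
    ("stackoverflow.com", "http://stackoverflow.com/questions/ask"),
    ("stackoverflow.com", "http://stackoverflow.com/tags"),
    ("use.perl.org",      "http://use.perl.org/"),
]

# Group URLs by site, as GROUP BY site does.
by_site = defaultdict(list)
for site, url in urls:
    by_site[site].append(url)

# generate_series(0, cnt - 1) produces an offset per site; the correlated
# subquery picks the URL at that offset in url order (OFFSET total LIMIT 1).
numbered = []
for site, site_urls in by_site.items():
    for total, url in enumerate(sorted(site_urls)):
        numbered.append((total, site, url))

# ORDER BY total, site
for total, site, url in sorted(numbered):
    print(url)
```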

I think you're overcomplicating this. Why not just use

ORDER BY NewID()

You are asking for round-robin, but I think a simple

SELECT site, url FROM urls ORDER BY RANDOM()

will do the trick. It should work even if URLs from the same site are clustered in the db.

If the URLs don't change very often, you can come up with a somewhat-complicated job that you could run periodically (nightly?) which would assign integers to each record based on the different sites present.

What you can do is write a routine that parses the domain out from a URL (you should be able to find a snippet that does this nearly anywhere).

Then, you create a temporary table that contains each unique domain, plus a number.

Then, for every record in your URLs table, you look up the domain in your temp table, assign that record the number stored there, and add a large number to that temp table's number.

Then for the rest of the day, sort by the number.


Here's an example with the five records you used in your question:

URLs:

Temp table:

example.com       1
stackoverflow.com 2
perl.org          3

Then for each URL, you look up the value in the temp table, and add 3 to it (because there are 3 distinct domains):

Iteration 1:

URLs:

http://www.example.com/some/file          1
http://www.example.com/some/other/file    NULL
https://stackoverflow.com/questions/ask   NULL
https://stackoverflow.com/tags            NULL
http://use.perl.org/                      NULL

Temp table:

example.com       4
stackoverflow.com 2
perl.org          3

Iteration 2:

URLs:

http://www.example.com/some/file          1
http://www.example.com/some/other/file    4
https://stackoverflow.com/questions/ask   NULL
https://stackoverflow.com/tags            NULL
http://use.perl.org/                      NULL

Temp table:

example.com       7
stackoverflow.com 2
perl.org          3

et cetera, until you get to:

http://www.example.com/some/file          1
http://www.example.com/some/other/file    4
https://stackoverflow.com/questions/ask   2
https://stackoverflow.com/tags            5
http://use.perl.org/                      3

For a lot of records, it's going to be slow. And it will be difficult to maintain with many inserts/deletions, but the result will be a flawless round-robin ordering.
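The batch job described above can be sketched as a small Python function. The function name and the initial numbering order are illustrative (the original answer leaves them unspecified); the key idea is that every domain's counter advances by the count of distinct domains:

```python
def assign_round_robin_numbers(urls):
    """urls: list of (domain, url) pairs; returns {url: sort_number}."""
    domains = sorted({domain for domain, _ in urls})
    step = len(domains)                    # count of distinct domains
    # The "temp table": each domain starts at a distinct number.
    counter = {d: i + 1 for i, d in enumerate(domains)}
    numbers = {}
    for domain, url in urls:
        numbers[url] = counter[domain]     # assign the stored number
        counter[domain] += step            # bump that domain's counter
    return numbers

urls = [
    ("example.com",       "http://www.example.com/some/file"),
    ("example.com",       "http://www.example.com/some/other/file"),
    ("stackoverflow.com", "https://stackoverflow.com/questions/ask"),
    ("stackoverflow.com", "https://stackoverflow.com/tags"),
    ("perl.org",          "http://use.perl.org/"),
]

numbers = assign_round_robin_numbers(urls)
for domain, url in sorted(urls, key=lambda pair: numbers[pair[1]]):
    print(url)
```

Sorting by the assigned number then yields the round-robin order for this example.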

There is a much simpler and faster solution...

  • add a sort_order column of type TEXT
  • add an ON INSERT trigger which sets sort_order to md5( url )
  • index on sort_order
  • grab the rows in (sort_order, primary key) order

It's very fast and indexed; rows will come in a repeatable, yet random order.
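A sketch of that idea in Python, with hashlib standing in for the database's md5(): sorting on the hash gives a repeatable pseudo-random order that tends to interleave sites without any per-site bookkeeping (the data is again the question's example):

```python
import hashlib

urls = [
    "http://www.example.com/some/file",
    "http://www.example.com/some/other/file",
    "https://stackoverflow.com/questions/ask",
    "https://stackoverflow.com/tags",
    "http://use.perl.org/",
]

def sort_order(url):
    # The trigger would store this hex digest once per row; an index
    # on the column then makes the ordered scan cheap.
    return hashlib.md5(url.encode("utf-8")).hexdigest()

for url in sorted(urls, key=sort_order):
    print(url)
```

Unlike the window-function approach, this does not guarantee that two URLs from the same site never appear in a row; it only makes long same-site runs unlikely.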
