SQL return rows in a “round-robin” order
I have a bunch of URLs stored in a table waiting to be scraped by a script. However, many of those URLs are from the same site. I would like to return those URLs in a "site-friendly" order (that is, try to avoid two consecutive URLs from the same site) so I won't accidentally get blocked for making too many HTTP requests in a short time.
The database layout is something like this:
create table urls (
    site varchar,  -- holds e.g. www.example.com or stackoverflow.com
    url varchar unique
);
Example result:
SELECT url FROM urls ORDER BY mysterious_round_robin_function(site);

http://www.example.com/some/file
http://stackoverflow.com/questions/ask
http://use.perl.org/
http://www.example.com/some/other/file
http://stackoverflow.com/tags
I thought of something like "ORDER BY site <> @last_site DESC", but I have no idea how to go about writing something like that.
See this article in my blog for more detailed explanations on how it works.

With the new PostgreSQL 8.4:
SELECT *
FROM (
    SELECT site, url, ROW_NUMBER() OVER (PARTITION BY site ORDER BY url) AS rn
    FROM urls
) q  -- PostgreSQL requires an alias on the subquery
ORDER BY
    rn, site
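To see the window-function approach in action, here is a minimal runnable sketch using SQLite via Python's stdlib `sqlite3` (SQLite 3.25+ supports `ROW_NUMBER()` too); the table and sample URLs mirror the question, and the subquery alias `numbered` is ours:

```python
import sqlite3

# In-memory database with the question's schema and sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (site TEXT, url TEXT UNIQUE)")
conn.executemany(
    "INSERT INTO urls (site, url) VALUES (?, ?)",
    [
        ("www.example.com", "http://www.example.com/some/file"),
        ("www.example.com", "http://www.example.com/some/other/file"),
        ("stackoverflow.com", "https://stackoverflow.com/questions/ask"),
        ("stackoverflow.com", "https://stackoverflow.com/tags"),
        ("use.perl.org", "http://use.perl.org/"),
    ],
)

# Number each site's URLs 1, 2, ... and interleave by that number:
# all the "first" URLs come out before any "second" URL.
rows = conn.execute(
    """
    SELECT site, url
    FROM (
        SELECT site, url,
               ROW_NUMBER() OVER (PARTITION BY site ORDER BY url) AS rn
        FROM urls
    ) numbered
    ORDER BY rn, site
    """
).fetchall()

for site, url in rows:
    print(site, url)
```

No two consecutive rows share a site, because a site's second URL can only appear after every site's first URL.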
With older versions:
SELECT site,
(
SELECT url
FROM urls ui
WHERE ui.site = sites.site
ORDER BY
url
OFFSET total
LIMIT 1
) AS url
FROM (
SELECT site, generate_series(0, cnt - 1) AS total
FROM (
SELECT site, COUNT(*) AS cnt
FROM urls
GROUP BY
site
) s
) sites
ORDER BY
total, site
, though it can be less efficient.
I think you're overcomplicating this. Why not just use

ORDER BY NewID()
You are asking for round-robin, but I think a simple

SELECT site, url FROM urls ORDER BY RANDOM()

will do the trick. It should work even if URLs from the same site are clustered in the db.
If the URLs don't change very often, you can set up a somewhat complicated job to run periodically (nightly?) that assigns integers to each record based on the distinct sites present.

What you can do is write a routine that parses the domain out of a URL (you should be able to find a snippet that does this nearly anywhere).

Then, you create a temporary table that contains each unique domain, plus a number.

Then, for every record in your URLs table, you look up the domain in your temp table, assign that record the number stored there, and add a large number to that temp table's number.

Then for the rest of the day, sort by the number.
Here's an example with the five records you used in your question:
URLs:
Temp table:
example.com 1
stackoverflow.com 2
perl.org 3
Then for each URL, you look up the value in the temp table, and add 3 to it (because there are 3 distinct sites):
URLs:
http://www.example.com/some/file 1
http://www.example.com/some/other/file NULL
https://stackoverflow.com/questions/ask NULL
https://stackoverflow.com/tags NULL
http://use.perl.org/ NULL
Temp table:
example.com 4
stackoverflow.com 2
perl.org 3
URLs:
http://www.example.com/some/file 1
http://www.example.com/some/other/file 4
https://stackoverflow.com/questions/ask NULL
https://stackoverflow.com/tags NULL
http://use.perl.org/ NULL
Temp table:
example.com 7
stackoverflow.com 2
perl.org 3
et cetera, until you get to:
http://www.example.com/some/file 1
http://www.example.com/some/other/file 4
https://stackoverflow.com/questions/ask 2
https://stackoverflow.com/tags 5
http://use.perl.org/ 3
For a lot of records, this is going to be slow. And it will be difficult to maintain with many inserts/deletions, but the result will be a flawless round-robin ordering.
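The numbering job above can be sketched in a few lines of Python; this uses the URL's full hostname (e.g. `www.example.com`) as the "site" rather than the registered domain `example.com` from the walkthrough, which keeps the example self-contained but yields the same numbers:

```python
from urllib.parse import urlparse

urls = [
    "http://www.example.com/some/file",
    "http://www.example.com/some/other/file",
    "https://stackoverflow.com/questions/ask",
    "https://stackoverflow.com/tags",
    "http://use.perl.org/",
]

# The "temp table": each distinct site, in first-seen order.
sites = []
for u in urls:
    host = urlparse(u).hostname
    if host not in sites:
        sites.append(host)

n = len(sites)
# Seed each site's counter with 1..N, like the walkthrough's temp table.
counter = {site: i + 1 for i, site in enumerate(sites)}

# Assign each URL its site's current counter, then bump that counter
# by N so the site's next URL sorts a full round later.
order = {}
for u in urls:
    host = urlparse(u).hostname
    order[u] = counter[host]
    counter[host] += n

ordered = sorted(urls, key=order.get)
for u in ordered:
    print(order[u], u)
```

Sorting by the assigned number reproduces the 1, 2, 3, 4, 5 ordering shown in the walkthrough, and the numbers can be persisted back to the table for the rest of the day.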
There is a much simpler and faster solution...
-> it's very fast and indexed
-> rows will come in a repeatable, yet random order