
Priority queue in SQL Server

I'm currently building a web crawler in C#. To queue the URLs that have yet to be crawled, I use SQL Server. It works pretty fast at first, but the table gets really large over time, which slows down my stored procedures.

CREATE TABLE PriorityQueue
(
ID int IDENTITY(0,1) PRIMARY KEY,
absolute_url varchar (400),
depth int,
priorty int,
domain_host varchar (255),
);

CREATE INDEX queueItem ON PriorityQueue(absolute_url);
CREATE INDEX queueHost ON PriorityQueue(domain_host);

This is the table I use for my queue. The priority ranges from 1 to 5, with 1 being the highest priority. As you can see, I also created indexes for the stored procedures below.

Procedure for adding new items to the queue:

DROP PROCEDURE IF EXISTS dbo.Enqueue
GO
CREATE PROCEDURE dbo.Enqueue(@absolute_url varchar(400), @depth int, @priorty int, @host varchar(255)) -- varchar(400) matches the column; varchar(255) would silently truncate longer URLs
AS
BEGIN
    INSERT INTO [WebshopCrawler].[dbo].[PriorityQueue] (absolute_url, depth, priorty, domain_host) VALUES (@absolute_url, @depth, @priorty, @host);
END
GO

Procedure for getting the item with the highest priority:

DROP PROCEDURE IF EXISTS dbo.Dequeue
GO
CREATE PROCEDURE dbo.Dequeue
AS
BEGIN
    SELECT top 1 absolute_url, depth, priorty
    FROM [WebshopCrawler].[dbo].[PriorityQueue]
    WHERE priorty = (SELECT MIN(priorty) FROM [WebshopCrawler].[dbo].[PriorityQueue])
END
GO

This one gets really slow with larger data.

Procedure to delete the dequeued item:

DROP PROCEDURE IF EXISTS dbo.RemoveFromQueue
GO
CREATE PROCEDURE dbo.RemoveFromQueue(@absolute_url varchar(400))
AS
BEGIN
    DELETE 
    FROM [WebshopCrawler].[dbo].[PriorityQueue]
    WHERE absolute_url = @absolute_url
END
GO

I tried lots of different indexes, but nothing seemed to make the procedures any faster. I hope someone has an idea on how to improve this.

Please read "Using tables as Queues". The important issues:

  • You must organize the table according to the dequeue strategy. A primary key on an IDENTITY column makes absolutely no sense here. Use a clustered index based on priority and dequeue order.
  • You must dequeue atomically in a single statement: use DELETE ... OUTPUT ...

So it should be something along these lines:

CREATE TABLE PriorityQueue
(
  priority int not null,
  enqueue_time datetime not null default GETUTCDATE(),
  absolute_url varchar (8000) not null,
  depth int not null,
  domain_host varchar (255) not null,
);

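-- Cluster the table in dequeue order. The question defines 1 as the highest
-- priority, so priority sorts ascending; enqueue_time gives FIFO order within a priority.
CREATE CLUSTERED INDEX PriorityQueueCdx on PriorityQueue(priority, enqueue_time);
GO

-- CREATE PROCEDURE must be the first statement in its batch, hence the GO above.
CREATE PROCEDURE dbo.Dequeue
AS
BEGIN
    -- Select the head of the queue and delete it in one atomic statement.
    -- READPAST skips rows that concurrent dequeuers have already locked.
    with cte as (
       SELECT top 1 absolute_url, depth, priority
       FROM [PriorityQueue] with (rowlock, readpast)
       ORDER BY priority, enqueue_time)
     DELETE FROM cte
         OUTPUT DELETED.*;
END
GO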
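
For completeness, here is a minimal sketch of a matching Enqueue procedure for the reorganized table, followed by a short usage example. It is not part of the answer above. The parameter names and the example.com URLs are assumptions for illustration, and the column is assumed to be spelled priority, as in the new table definition.

```sql
-- Hypothetical companion to dbo.Dequeue for the reorganized table.
DROP PROCEDURE IF EXISTS dbo.Enqueue
GO
CREATE PROCEDURE dbo.Enqueue(@absolute_url varchar(8000), @depth int, @priority int, @host varchar(255))
AS
BEGIN
    -- enqueue_time is omitted on purpose so the GETUTCDATE() default applies.
    INSERT INTO dbo.PriorityQueue (priority, absolute_url, depth, domain_host)
    VALUES (@priority, @absolute_url, @depth, @host);
END
GO

-- Usage: enqueue two URLs (placeholder values), then take one item off the queue.
EXEC dbo.Enqueue @absolute_url = 'https://example.com/a', @depth = 0, @priority = 3, @host = 'example.com';
EXEC dbo.Enqueue @absolute_url = 'https://example.com/b', @depth = 1, @priority = 1, @host = 'example.com';

-- Returns the row at the head of the queue and deletes it in the same statement.
EXEC dbo.Dequeue;
```

Because the dequeue deletes the row it returns, a separate RemoveFromQueue call should no longer be needed. The trade-off is that a crawler that crashes after dequeuing loses that URL unless it calls the procedure inside a transaction that it commits only after the page has been processed.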


 