I am using an UPDATE with a join to update some records. I am joining an indexed table to itself and updating the rows where a pattern is met.
This query worked fine for about a million records, but it does not scale to 14 million. I wrote it this way because the only other option I was aware of was a cursor, which would have been far worse.
Right now the query takes more than 12 hours to run. Any help finding a better way to do this would be greatly appreciated. I am running it from SQL Server Management Studio. For the query below, here is how the index on the AIS_Positions table was created:
CREATE INDEX SID ON AIS_Positions (Id)
UPDATE R1
SET
BOUNDARY = 'BERTH',
TRAVEL_MODE = 'HOTEL',
BerthStartFlag = 'YES',
BerthStartTime = R1.IntervalStart,
BerthEndTime = R2.IntervalEnd,
BerthStart_ID = R1.Id,
BerthEnd_ID = R2.Id
FROM
AIS_Positions R1
INNER JOIN
AIS_Positions R2 ON R1.MMSI = R2.MMSI
AND R1.ID < R2.ID
AND R1.IntervalSpeed <= 0.1
AND R2.IntervalSpeed <= 0.1
AND DATEDIFF(HOUR, R1.POSITIONTIME, R2.POSITIONTIME) BETWEEN 1 AND 72
AND (SELECT TOP 1 IntervalSpeed
FROM AIS_Positions
WHERE MMSI = R1.MMSI AND ID = R1.ID-1) > 0.1
AND (SELECT TOP 1 IntervalSpeed
FROM AIS_Positions
WHERE MMSI = R1.MMSI AND ID = R2.ID+1) > 0.1
AND (SELECT TOP 1 Boundary
FROM AIS_Positions
WHERE MMSI = R1.MMSI AND ID = R1.ID-1) IS NULL
This might be a good start:
/*
create nonclustered index [ix_ais_positions_mmsi_inc] on ais_positions
(mmsi)
include (id, intervalspeed, boundary, PositionTime, IntervalStart, IntervalEnd);
*/
update R1 set
boundary = 'berth',
travel_mode = 'hotel',
BerthStartFlag = 'yes',
BerthStartTime = R1.IntervalStart,
BerthEndTime = R2.IntervalEnd,
BerthStart_id = R1.Id,
BerthEnd_id = R2.Id
from ais_positions R1
inner join ais_positions R2
on R1.mmsi = R2.mmsi
and R1.id < R2.id
--How many matches does R1.id < R2.id yield? Is this updating the same row more than once?
and R1.IntervalSpeed <= 0.1
and R2.IntervalSpeed <= 0.1
--and datediff(hour, R1.positiontime, R2.positiontime) between 1 and 72
and datediff(hour, R1.positiontime, R2.positiontime) >= 1 and datediff(hour, R1.positiontime, R2.positiontime) <= 72
--and (select top 1 IntervalSpeed from ais_positions where mmsi = R1.mmsi and id = R1.id-1) > 0.1
and exists (select 1 from ais_positions i where i.mmsi = R1.mmsi and i.id = R1.id-1 and i.IntervalSpeed > 0.1 and i.Boundary is null)
--and (select top 1 IntervalSpeed from ais_positions where mmsi = R1.mmsi and id = R2.id+1) > 0.1
and exists (select 1 from ais_positions where mmsi = R1.mmsi and id = R2.id+1 and IntervalSpeed > 0.1)
--and (select top 1 Boundary from ais_positions where mmsi = R1.mmsi and id=R1.id-1) is null
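The question in the comment above (how many R2 rows each R1 row matches) can be checked before running the update. The following is only a diagnostic sketch that reuses the same join predicates; the column alias candidate_ends is made up for illustration. If it returns rows, the same R1 row is being matched by several R2 rows, and SQL Server does not guarantee which match supplies BerthEndTime and BerthEnd_ID.

```sql
-- Diagnostic sketch: list R1 rows that match more than one R2 row
-- under the same predicates as the update. Nothing is modified.
SELECT TOP (20)
    R1.Id,
    COUNT(*) AS candidate_ends   -- hypothetical alias: number of matching R2 rows
FROM AIS_Positions AS R1
INNER JOIN AIS_Positions AS R2
    ON  R1.MMSI = R2.MMSI
    AND R1.Id < R2.Id
    AND R1.IntervalSpeed <= 0.1
    AND R2.IntervalSpeed <= 0.1
    AND DATEDIFF(HOUR, R1.PositionTime, R2.PositionTime) BETWEEN 1 AND 72
WHERE EXISTS (SELECT 1
              FROM AIS_Positions AS P
              WHERE P.MMSI = R1.MMSI
                AND P.Id = R1.Id - 1
                AND P.IntervalSpeed > 0.1
                AND P.Boundary IS NULL)
  AND EXISTS (SELECT 1
              FROM AIS_Positions AS N
              WHERE N.MMSI = R1.MMSI
                AND N.Id = R2.Id + 1
                AND N.IntervalSpeed > 0.1)
GROUP BY R1.Id
HAVING COUNT(*) > 1
ORDER BY candidate_ends DESC;
```

On 14 million rows this may itself be slow, so it is probably worth restricting it to a few MMSI values first.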
Have you considered using temporary tables for the conditions in your subqueries? Your query may be running each subquery once for every row produced by the join above it. Maybe something like this:
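-- Rows that have a successor (ID + 1) with the same MMSI, with their speed
SELECT A1.ID, A1.IntervalSpeed AS topint1
INTO #Int_tabl_1
FROM AIS_Positions AS A1
INNER JOIN AIS_Positions AS A2
ON A1.MMSI = A2.MMSI AND A1.ID = A2.ID - 1;

-- Rows that have a predecessor (ID - 1) with the same MMSI, with their speed
SELECT A1.ID, A1.IntervalSpeed AS topint2
INTO #Int_tabl_2
FROM AIS_Positions AS A1
INNER JOIN AIS_Positions AS A2
ON A1.MMSI = A2.MMSI AND A1.ID = A2.ID + 1;

-- Rows that have a successor (ID + 1) with the same MMSI, with their Boundary
SELECT A1.ID, A1.Boundary
INTO #Bound_tbl
FROM AIS_Positions AS A1
INNER JOIN AIS_Positions AS A2
ON A1.MMSI = A2.MMSI AND A1.ID = A2.ID - 1;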
Then test against topint1 > 0.1, topint2 > 0.1, and Boundary IS NULL.
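As a rough sketch of how those tests could be wired into the update, the three temp tables can be joined in place of the correlated subqueries. This assumes Id is unique in AIS_Positions (the question does not say so) and that the temp tables were built in the same session; the aliases P, N and B and the index names are made up for illustration.

```sql
-- Optional: index the temp tables on the join column (hypothetical index names).
CREATE CLUSTERED INDEX IX_Int_tabl_1_ID ON #Int_tabl_1 (ID);
CREATE CLUSTERED INDEX IX_Int_tabl_2_ID ON #Int_tabl_2 (ID);
CREATE CLUSTERED INDEX IX_Bound_tbl_ID  ON #Bound_tbl (ID);

-- Sketch only: same update as in the question, with the three
-- subqueries replaced by joins to the temp tables.
UPDATE R1
SET
    BOUNDARY       = 'BERTH',
    TRAVEL_MODE    = 'HOTEL',
    BerthStartFlag = 'YES',
    BerthStartTime = R1.IntervalStart,
    BerthEndTime   = R2.IntervalEnd,
    BerthStart_ID  = R1.Id,
    BerthEnd_ID    = R2.Id
FROM AIS_Positions AS R1
INNER JOIN AIS_Positions AS R2
    ON  R1.MMSI = R2.MMSI
    AND R1.ID < R2.ID
    AND R1.IntervalSpeed <= 0.1
    AND R2.IntervalSpeed <= 0.1
    AND DATEDIFF(HOUR, R1.POSITIONTIME, R2.POSITIONTIME) BETWEEN 1 AND 72
-- Row before R1: must be moving and not yet assigned a boundary
INNER JOIN #Int_tabl_1 AS P ON P.ID = R1.ID - 1 AND P.topint1 > 0.1
INNER JOIN #Bound_tbl  AS B ON B.ID = R1.ID - 1 AND B.Boundary IS NULL
-- Row after R2: must be moving
INNER JOIN #Int_tabl_2 AS N ON N.ID = R2.ID + 1 AND N.topint2 > 0.1;
```

Whether this is actually faster than the EXISTS form depends on the execution plan, so it is worth comparing both on a subset of the data before running against all 14 million rows.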