
Update SQL Server rows while reading the same table

I have a database in SQL Server 2012 and want to update a table in it.

My table has three columns; the first column is of type nchar(24). It is filled with billions of rows. The other two columns are of the same type, but they are currently null (empty).

I need to read the data from the first column and do some calculations with this information. The result of my calculations is two strings, and these two strings are the data I want to insert into the two empty columns.

My question is: what is the fastest way to read the information from the first column of the table and update the second and third columns?

Should I read and update step by step? Read a few rows, do the calculations, and update those rows while reading the next few?

Since we are talking about billions of rows, performance is the only important thing here.

Let me know if you need any more information!

EDIT 1: My calculation can't be expressed in SQL. As the SQL Server is on the local machine, throughput is not something we have to worry about. One calculation takes about 0.02154 seconds, and I have a total of 2,809,475,760 rows, which is about 280 GB of data.

Normally, DML is best performed in bigger batches. Depending on your indexing structure, a small batch size (maybe 1000?!) can already deliver the best results, or you might need bigger batch sizes (up to the point where you write all rows of the table in one statement).
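A minimal T-SQL sketch of that batched-DML pattern (table and column names are hypothetical, and since your calculation runs client-side, the SET expressions here are only placeholders for the general shape):

    DECLARE @BatchSize int = 1000;   -- tuning knob: experiment with this

    WHILE 1 = 1
    BEGIN
        -- update one batch of not-yet-processed rows
        UPDATE TOP (@BatchSize) dbo.MyTable
        SET Col2 = N'placeholder2',  -- in reality: your computed strings
            Col3 = N'placeholder3'
        WHERE Col2 IS NULL;          -- null means "not processed yet"

        IF @@ROWCOUNT < @BatchSize BREAK;  -- last, partial batch done
    END;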

Bulk updates can be performed by bulk-inserting information about the updates you want to make, and then updating all rows in the batch in one statement. Alternative strategies exist.
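A sketch of that bulk-insert-then-update approach, using a hypothetical staging table and assuming the first column, Col1, uniquely identifies a row: the client bulk-inserts its computed strings (e.g. with SqlBulkCopy), then one joined UPDATE applies the whole batch:

    -- staging table for one batch of client-computed results
    CREATE TABLE dbo.UpdateStage
    (
        Col1 nchar(24) NOT NULL PRIMARY KEY,  -- key of the target row
        Col2 nchar(24) NOT NULL,              -- first computed string
        Col3 nchar(24) NOT NULL               -- second computed string
    );

    -- after each bulk insert, apply the batch in one statement
    UPDATE t
    SET    t.Col2 = s.Col2,
           t.Col3 = s.Col3
    FROM   dbo.MyTable     AS t
    JOIN   dbo.UpdateStage AS s ON s.Col1 = t.Col1;

    -- empty the staging table for the next batch
    TRUNCATE TABLE dbo.UpdateStage;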

Since you can't hold all the rows to be updated in memory at the same time, you probably need to look into MARS (Multiple Active Result Sets) to be able to perform a streaming read while occasionally writing at the same time. Alternatively, you can do it with two connections. Be careful not to deadlock across connections; SQL Server cannot detect that in principle, and only a timeout will resolve such a (distributed) deadlock. Running the reader under snapshot isolation is a good strategy here: snapshot isolation causes the reader to neither block nor be blocked.
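A minimal sketch of the snapshot-isolation setup (MyDb is a placeholder database name; MARS itself is a client-side setting, e.g. MultipleActiveResultSets=True in an ADO.NET connection string):

    -- one-time, database-level switch
    ALTER DATABASE MyDb SET ALLOW_SNAPSHOT_ISOLATION ON;

    -- on the reader connection
    SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
    BEGIN TRANSACTION;
    -- ... streaming SELECT over the table here; it neither blocks
    -- writers nor is blocked by them ...
    COMMIT;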

There is no simple how-to and no one-size-fits-all solution here.

With billions of rows, does performance matter that much? It doesn't seem to me that this has to be done within a second.

What is the expected throughput of the database and the network? If you're behind a POTS dial-up link, the case is massively different from being on 10 Gb fiber.

The computations? How expensive are they? Just c = a + b, or heavy processing of other text files?

These are just a couple of questions raised in response; there is a lot more involved that we are not aware of and would need to know to answer correctly.

Try a couple of things and measure it.

As a general rule: Writing to a database can be improved by batching instead of single updates.

Using an async pattern can free up some of the time for calculations instead of waiting.

EDIT in reply to comment: If a calculation takes ~20 ms, the biggest problem is I/O; multithreading won't bring you much. Read the records in sequence under snapshot isolation, so the read is not hampered by write locks, and update in batches. My guess is that the reader stays ahead of the writer without much trouble; reading in batches adds complexity without gaining much.

Find the sweet spot for the right batch size by experimenting.
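For example, one crude way to time a single batch while experimenting (a sketch; plug in whichever batch statement you are testing):

    DECLARE @t0 datetime2 = SYSUTCDATETIME();

    -- ... run one batch update here ...

    SELECT DATEDIFF(MILLISECOND, @t0, SYSUTCDATETIME()) AS batch_ms;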

LINQ is pretty efficient in my experience. I wouldn't worry too much about optimizing your code yet; in fact, prematurely optimizing your code is typically something you should avoid. Just get it to work first, then refactor as needed. As a side note, I once tested a stored procedure against a LINQ query, and LINQ won (to my amazement).
