Fun with Deadlocks in SQL Server 2005

by Nicholas Piasecki on February 16th, 2009

So we used to have a batch script that would run every evening to synchronize inventory between two different systems. And while this batch script was running, there’d be a high probability of having requests served by the Web site fail. The database that the Web site was accessing would toss an error like “Process Foo deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction” for some of the requests.

The error also occurred sporadically in other scenarios, but it most often happened when the batch script was running. This helped me deduce that any long running transaction occurring against the [orders] table in our database while the Web site was being accessed greatly increased the chances of the error occurring.

So how did I resolve the deadlocks such that I haven’t seen one since? Well, the answer was to put on my detective hat and hold a stakeout, waiting for the error to occur again after I had enabled some tracing features of SQL Server.

Capturing Data

By default, SQL Server does not log any particularly useful information about the deadlock. To tell it to do so, I executed the following command:

DBCC TRACEON(1204, 3605, -1)

This command tells SQL Server to send debugging information (identified by trace type 1204) directly to the error log (identified by parameter 3605). In my system, the error log was stored at C:Program FilesMicrosoft SQL ServerMSSQL.1MSSQLLOGERRORLOG.

Three days later, a deadlock occurred. When I checked the log, a wealth of cryptic information awaited:

	Deadlock encountered .... Printing deadlock information
	Wait-for graph
 
	Node:1
 
	KEY: 5:72057594071154688 (5200690f2156) CleanCnt:3 Mode:S Flags: 0x0
	       Grant List 2:
	         Owner:0x2FF3C660 Mode: S        Flg:0x0 Ref:1 Life:00000000 SPID:54 ECID:0 XactLockInfo: 0x2B5DCEDC
	         SPID: 54 ECID: 0 Statement Type: SELECT Line #: 1
	         Input Buf: Language Event: (@param0 nvarchar(7),@param1 int)SELECT TOP 5         COALESCE((           SELECT            SUM(oi.QtyShipped)           FROM dbo.orderItems AS oi           INNER JOIN dbo.[order] AS o ON o.ID = oi.OrderID           INNER JOIN dbo.productItem AS inner_pi
	       Requested By: 
	         ResType:LockOwner Stype:'OR'Xdes:0x03977770 Mode: X SPID:60 BatchID:0 ECID:0 TaskProxy:(0x3C718378) Value:0x96fe8a0 Cost:(0/1704)
 
	Node:2
 
	RID: 5:1:11345:3               CleanCnt:2 Mode:X Flags: 0x2
	       Grant List 3:
	         Owner:0x1549C0A0 Mode: X        Flg:0x0 Ref:0 Life:02000000 SPID:60 ECID:0 XactLockInfo: 0x03977794
	         SPID: 60 ECID: 0 Statement Type: UPDATE Line #: 1
	         Input Buf: Language Event: (@param0 int,@param1 int,@param2 int,@param3 int)UPDATE dbo.orderItems SET          Qty =  @param0 ,          Backorder =  @param1 ,          QtyShipped =  @param2         WHERE ID =  @param3
	       Requested By: 
	         ResType:LockOwner Stype:'OR'Xdes:0x2B5DCEB8 Mode: S SPID:54 BatchID:0 ECID:0 TaskProxy:(0x2D6D0378) Value:0x2a504c60 Cost:(0/0)
 
	      Victim Resource Owner:
	       ResType:LockOwner Stype:'OR'Xdes:0x2B5DCEB8 Mode: S SPID:54 BatchID:0 ECID:0 TaskProxy:(0x2D6D0378) Value:0x2a504c60 Cost:(0/0)

Analyzing Data

At first glance, this looked like a bunch of unhelpful gibberish. After grabbing a magnifying glass and some painkillers, though, it started to make some sense.

In Node:1, the Web site was trying to generate a list of similar products for a particular product page. In Node:2, an order was being marked as shipped by our internal order processing application. SQL Server killed Node:1, since it by default sees killing a SELECT statement as a less potentially damaging operation than killing an UPDATE. I figured this much out by recognizing the snippets of the SQL queries under Language Event. The log only contained about the first 100 characters of the offending SQL statements, but this was good enough.

Next, I used the trace output above to determine some more information about the locked resources.

Node:1 was the resource identified by KEY: 5:72057594071154688 (520060f2156), and the process SPID 54, which was running the similar products SELECT statement, had a shared lock on it. I figured out what this key referred to by referencing the system partitions table:

SELECT * FROM sys.partitions WHERE hobt_id = 72057594071154688

(I just stripped off the 5: part of the KEY and omitted the numbers in parentheses.) This command returned some gibberish, but the important values returned in it were the object_id (798625888) and the index_id (1). I used this information in another query:

SELECT * FROM sys.indexes WHERE object_id = 798625888 AND index_id = 1
SELECT OBJECT_NAME(798625888)

These queries told me that the KEY resource was actually the clustered index named PK_OrderItems on our orderitems table. Good–I figured out what the first locked resource was.

Now I had to figure out Node:2. It was the resource identified by RID: 5:1:11345:3, and the process SPID 60, which was running an UPDATE against the orderitems table, had an exclusive lock on it. I figured out what this RID referred to by using the DBCC tool:

DBCC TRACEON(3604) -- Enables output to be sent to the Query Analyzer window
DBCC PAGE(5,1,11345,0)

It returned a lot of information, but the real gem was m_objId (1531152500). I used that next:

SELECT OBJECT_NAME(1531152500)

This informed me that the RID was a particular page (for our purposes, a particular row) on the orders table.

Next, I looked at the queries each of these processes was running.

The similar products query was intensive (it has since be rewritten, but that’s besides the point). It tried to find popular products (based on the QtyShipped in the orderitems table) that match the given style. “Popular products” was defined as products that have had the most shipped in the past 7 days. It looked something like this:

	      SELECT TOP 5
	       COALESCE((
	         SELECT
	          SUM(oi.QtyShipped)
	         FROM dbo.orderItems AS oi
	         INNER JOIN dbo.[ORDER] AS o ON o.ID = oi.OrderID
	         INNER JOIN dbo.productItem AS inner_pi ON inner_pi.Upc = oi.Upc
	         WHERE inner_pi.Sku = p.Sku
	         AND o.ShipDate >= DATEADD(day, -7, GETDATE())
	       ), 0) AS ProductSalesVolume,
	       -- snipped various other columns
	      WHERE -- snipped some where conditions
	      ORDER BY ProductSalesVolume DESC

The update query, on the other hand, was rather simple:

	      UPDATE dbo.orderItems SET
	        Qty = #Quantity#,
	        Backorder = #QuantityBackordered#,
	        QtyShipped = #QuantityShipped#
	      WHERE ID = #Id#

Now let’s remember what the log data indicated:

  • RID: 5:1:11345:3 is a single row within the order table.
  • KEY: 5:72057594071154688 (5200690f2156) is the PK_OrderItems clustered index on the orderItems table.

SPID 54, the SELECT, had a shared lock on the PK_OrderItems index of the orderItems table. It would have liked to acquire a shared lock on a particular row of the order table. SPID 60, the UPDATE, had an exclusive lock on the desired row/page in the order table. It would have liked to acquire an exclusive lock on the PK_OrderItems index of the orderItems table.

Why was this so?

The simple UPDATE wasn’t actually touching the order table, which was confusing. What was happening was that the internal order processing application was saving the order and the order items in one transaction. Since the order table was updated during this transaction, the process that was running the transaction (SPID 60) had an exclusive lock on that row in the order table. This clue led me to the sequence events that occurred for the deadlock to happen:

Time SPID 54 Log SPID 60 Log
T0 BEGIN TRANSACTION; BEGIN TRANSACTION;
T1 [requests shared lock on PK_OrderItems index so it can perform the INNER JOIN in the upcoming SELECT] [requests exclusive lock on particular [order] row so it can perform an UPDATE on that row]
T2 [shared lock on PK_OrderItems index is granted] [exclusive lock on particular [order] row is granted]
T3 SELECT SUM(oi.QtyShipped) … UPDATE [order] SET Flag = ‘P’ WHERE ID = ‘Blah’
T4 (while continuing processing query from T3)
[requests shared lock on particular [order] row so it can read its ShipDate]
[requests exclusive lock on PK_OrderItems index so it can perform an UPDATE on a particular row]
T5 [request is blocked by SPID 60's exclusive lock on particular [order] row] [request is blocked by SPID 54's shared lock on PK_OrderItems index]
T6 [deadlock detected; this process is killed as the deadlock victim] [exclusive lock on PK_OrderItems index is granted]
T7 UPDATE [orderItems] SET QtyShipped = 3 WHERE ID = Blah;
T8 COMMIT TRANSACTION;

Resolution

One option to resolve this deadlock was to remove the inner SELECT that was looking at the QtyShipped when determining the most popular products for the similar products list, perhaps by caching this value and updating it periodically in a column called CachedSalesVolume that was updated infrequently. Or, it could have been broken out into a separate query that could be cached independently by the Web site.

Another option was to change the SELECT query to use the WITH (NOLOCK) query optimizer hint, which would tell SQL Server to run the query at the READ UNCOMMITTED isolation level. This would have enabled the query to read data that was locked by another user (in our case SPID 60) without requiring a shared lock on that data. This is referred to as a “dirty read” because you can read values that are not yet committed and are subject to change. Even more unfortunately, in the extremely rare case where a query that is using NOLOCK tries to read data that is actually being moved or changed by another user in that instant of time, a 605 error occurs, and the SELECT would still fail anyway.

The final and best option was to explore whether or not adding an additional index could alleviate the issue. The queries run by SPID 60 were always going to acquire an exclusive lock on the PK_OrderItems and a particular row in the order table. There really isn’t a way around this; it’s just the way the UPDATE process works. The only option, then, was to provide an index that the query used by SPID 54 could use instead of the PK_OrderItems index.

The SPID 54 query was using the PK_OrderItems index as part of a Key Lookup operation. Here’s why: when performing the first INNER JOIN, the query used the IX_OrderItems_OrderId index on the orderItems table to obtain a set of orderItems for a given order number. The next INNER JOIN needed to look at the UPC, and the final SELECT needed to look at the QtyShipped column. Unfortunately, the query was now only holding the OrderId and ID columns, so it needed to look up each order item and fetch its UPC value and its QtyShipped value. Looking up these additional columns for each row, which are not included in the index, was what caused the shared lock on the PK_OrderItems clustered index. By adding the UPC and QtyShipped columns as included columns in the IX_OrderItems_OrderId index, I could avoid the clustered index seek on the PK_OrderItems index entirely.

Here’s an abbreviated query plan before adding these included columns:

Before adding the included columns.

Before adding the included columns.

The disaster happened in row 13. The SELECT acquired a shared lock on PK_OrderItems because it needed to look up each order item in order to grab the UPC and QtyShipped fields.

Here is the abbreviated query plan after adding these included columns:

After adding the included columns.

After adding the included columns.

You can see that two steps have been removed from the query plan. Importantly, the Clustered Index Seek was now gone.

Now when the SELECT was scanning through the index, it already had the UPC and QtyShipped information it needed to complete its query. It no longer needed to lock and go rummaging through the row for these values. Since the lock was completely eliminated, the deadlock was eliminated, too.

So the moral of the story? If you have deadlocks, the first thing you should look at is your indexes. Try adding or removing indexes, or try adding included columns to indexes, to eliminate the shared lock/exclusive lock scenario from occurring.

Have fun making things work!

2 Comments
  1. Patrick permalink

    Good article. I learned a lot from it. BTW, it is better to turn off the traceon flag since it hurt performance.

  2. Keith permalink

    Thanks for taking the time to share this analysis and conclusion.

Leave a Reply

Note: XHTML is allowed. Your email address will never be published.

Subscribe to this comment feed via RSS