Friday, February 24, 2012

Checking to see if a record exists and if so update else insert

I've decided to post this as a sticky given how frequently this question is asked.

For those of you wishing to build a package that determines whether a source row exists in the destination and, if so, updates it, otherwise inserts it, this link is for you.

http://blogs.conchango.com/jamiethomson/archive/2006/09/12/SSIS_3A00_-Checking-if-a-row-exists-and-if-it-does_2C00_-has-it-changed.aspx

Thanks Jamie!

If you want to apply the same concept as Jamie's blog post above, but use the Konesans Checksum Transformation to quickly compare MANY fields, you can visit here:
http://www.ssistalk.com/2007/03/09/ssis-using-a-checksum-to-determine-if-a-row-has-changed/
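
As a rough T-SQL analogue of what the checksum buys you (a sketch only, with hypothetical table and column names; the Konesans component computes a similar value inside the data flow), comparing one computed value stands in for comparing every column individually:

    -- Hypothetical tables dbo.Source and dbo.Destination share a key (Id)
    -- plus many payload columns; one checksum comparison replaces
    -- comparing each column separately.
    SELECT s.Id
    FROM   dbo.Source AS s
    JOIN   dbo.Destination AS d ON d.Id = s.Id
    WHERE  BINARY_CHECKSUM(s.Col1, s.Col2, s.Col3)
        <> BINARY_CHECKSUM(d.Col1, d.Col2, d.Col3);
    -- Note: checksums can collide, so a matching checksum is not an
    -- absolute guarantee that the rows are identical.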

Phil|||This is exactly what I was looking for.

I appreciate that.
Fahad|||It's an awesome article - and I used this method in my 1st SSIS package.

However, if the number of records is high (over 1,000), the OLE DB Command update performs slowly. So in that case you may be better off loading the source data into a staging table and creating a stored procedure which updates, then inserts (and deletes), data in the destination table, as sketched below.
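
A minimal sketch of such a procedure, assuming hypothetical staging and destination tables (dbo.Staging, dbo.Destination) keyed on Id; the delete case would follow the same pattern:

    -- Sketch only: the data flow bulk-loads dbo.Staging, then one
    -- set-based pass applies the changes to dbo.Destination.
    CREATE PROCEDURE dbo.UpsertDestination
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Update rows that already exist in the destination
        UPDATE d
        SET    d.Name  = s.Name,
               d.Price = s.Price
        FROM   dbo.Destination AS d
        JOIN   dbo.Staging     AS s ON s.Id = d.Id;

        -- Insert rows that do not exist yet
        INSERT INTO dbo.Destination (Id, Name, Price)
        SELECT s.Id, s.Name, s.Price
        FROM   dbo.Staging AS s
        WHERE  NOT EXISTS (SELECT 1 FROM dbo.Destination AS d
                           WHERE d.Id = s.Id);
    END;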

Using the Lookup transform in the way described in the article is quite efficient (compared to other methods like Merge Join) - and it can be useful in a variety of situations|||TheViewMaster,
Yes, very valid points. Each developer will have to decide based on the pros and cons of each method.

Personally, I wouldn't go to the extreme that you did, but would rather dump just the updates to a staging table and then, outside of the data flow, use an Execute SQL Task to perform the batch update. I don't want to pay the cost of inserting the data twice (once into the staging table, and then again into the destination table, as your approach would).|||Can somebody point me to a simple example of this?

I have a flat file and I want to check whether each record in the file exists in the table; if it does I want to update, and if it does not I want to insert.|||

This is very helpful. Thanks!

I am completely new to SQL Server Integration Services. I went through your link and understand how you filter out the different records. Could you point out how you actually update your existing records?

Thanks!

|||I have cloned the example above and it does not seem to be picking up the new or changed records. Is there any way to troubleshoot this?|||I use the OLE DB Command to update the records. Thanks!|||

Hi Phil,

Great article, I loved it.

I have been looking for this for a long time.

One doubt remains in my mind: in my current assignment I have to transform about 40K rows from a database.

Will this method work for me, or is there a better way?

regards

Sudhir Kesharwani

|||

Sudhir Kesharwani wrote:

Hi Phil,

Great article, I loved it.

I have been looking for this for a long time.

One doubt remains in my mind: in my current assignment I have to transform about 40K rows from a database.

Will this method work for me, or is there a better way?

regards

Sudhir Kesharwani

You will likely need to send your changed records to a staging table and then perform a batch update using an Execute SQL Task on the control flow. Using the OLE DB Command transformation isn't very efficient for a high number of rows.
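
For example, the Execute SQL Task could run one set-based statement against a staging table of changed rows (a sketch of the pattern with hypothetical names, not a finished package):

    -- Sketch: dbo.UpdateStaging holds only the rows the Lookup flagged
    -- as changed; one statement applies them all at once.
    UPDATE d
    SET    d.Name  = s.Name,
           d.Price = s.Price
    FROM   dbo.Destination   AS d
    JOIN   dbo.UpdateStaging AS s ON s.Id = d.Id;

    -- Clear the staging table for the next run
    TRUNCATE TABLE dbo.UpdateStaging;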

|||Phil,
As we discussed about updating/inserting techniques a while ago (see http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=1211321&SiteID=1), you told me SSIS can perform this operation much faster than using plain SQL... I put the emphasis on the "much faster".

I have created a performance test package comparing the two methods:
1) stored proc with UPDATE and INSERT statements
2) data flow using Lookup/Conditional Split

I have a data flow task that creates a test data set with a variable number of rows (I used Jamie's script component source example) with IDs (used as the PK) always ranging from 1 to nbrRows. Each row consists of 5 int columns, 4 string columns (10, 20, 40 and 80 characters) and 1 datetime column, with values chosen randomly.

So what I do is call the package with nbrRows=1, 10, 100... The first time, 1 new row is added. The second time, id=1 already exists, so 1 row is updated and the other 9 are added. And so on... I then perform a second round of tests with the same numbers, the only difference being that the destination table is now full (therefore performing only updates).

What I've found out so far is that SSIS is faster only when inserting a certain range of rows (approx. 1,000 to 100,000). SSIS performed on average 30% better with 50,000 rows, while at 1 million rows SQL performed 5% better.

When the data was updated (second round), SQL was always faster.

Of course, these figures are based on my tests, which may not reflect actual usage. I run SQL/SSIS in parallel, the tests are done in batch, ... I thought about using DBCC DROPCLEANBUFFERS but it doesn't seem to change much.

Maybe my way of comparing is totally flawed... Have you ever tried comparing the two methods in terms of performance? How did you conclude the SSIS way is much faster?

Leonce

|||

Leonce,

First, when using the fast load option on the destination, there are two parameters which need to be adjusted based on the rows you are working with: Max Insert Commit Size and Rows Per Batch. What were these set to?

Also note that, when possible, using a SQL Server Destination over the OLE DB Destination is preferred.

SSIS should be no different from using BULK INSERT (it seems to be just a glorified wrapper around it). Performing updates via an OLE DB Command will always be slower than a batch update using SQL, simply because the OLE DB Command is executed once for every row in the data flow.
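
For reference, the fast-load path behaves roughly like a bulk load along these lines (file path, format, and table name are hypothetical):

    -- Sketch of what the destination's fast load approximates.
    BULK INSERT dbo.Destination
    FROM 'C:\loads\source.txt'
    WITH (FIELDTERMINATOR = ',',
          ROWTERMINATOR   = '\n',
          BATCHSIZE       = 10000,
          TABLOCK);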

Phil

|||

I left the default values for Commit Size and Rows Per Batch ('' and 0). You suggest I should adjust these values depending on the number of rows transferred? What would typical values be for a table with 100,000 rows? Does it depend on the row size?

I use the OLE DB Destination; I'll try the SQL Server Destination.

I forgot to mention that I use your method for updates (dump into a table, then update). Do you think SSIS can be faster than plain SQL for updates? I would tend to think it's impossible.

Leonce

|||

fleo wrote:

I left the default values for Commit Size and Rows Per Batch ('' and 0). You suggest I should adjust these values depending on the number of rows transferred? What would typical values be for a table with 100,000 rows? Does it depend on the row size?

MICS should be set to the size at which you want batches to be committed. So if you want to commit in 10,000-row batches, set MICS to 10,000. Rows Per Batch should be set to 100,000 in this case.

fleo wrote:

I forgot to mention that I use your method for updates (dump into a table, then update). Do you think SSIS can be faster than plain SQL for updates? I would tend to think it's impossible.

True. Using an Execute SQL Task will likely be slightly slower than straight SQL, but in the end they do the same thing, so they should be REAL close.

|||

Let us say that a package uses the Merge Join option to insert new rows, and it uses an OLE DB Destination with the fast load, table lock, and batch size (1000) options. The table lock option on the destination forces us to put a NOLOCK hint in the OLE DB Source query (on the destination table). Otherwise, they run into locking contention (one is requesting an exclusive table lock while the other is requesting a shared table lock). This works most of the time, but occasionally I see that SQL Server does not honour the hint (one scenario: when the target table does not have a clustered index on the join key column) and runs into locking contention. Have you seen this? Do you rely on the NOLOCK hint too? Thanks for your help in advance.
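
For clarity, the source-side hint being described looks like this (table and column names hypothetical):

    -- Reads the destination table without requesting shared locks, so it
    -- does not collide with the fast load's exclusive table lock.
    SELECT Id, Name, Price
    FROM   dbo.Destination WITH (NOLOCK);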
