Friday, February 24, 2012

Checking to see if a record exists before inserting - 3 million+ rows

I have multiple CSV files (processed via a Foreach Loop) which I do a lot of transform work on and then insert into a SQL Server database table.
Each CSV file usually contains about two days' worth of data (each row is date-stamped) - somewhere in the region of 60k records per day.
The destination table currently contains 3 million+ rows and will get bigger.
I need to make sure that before inserting into the destination table, the data doesn't already exist.

I've read the following article: http://www.sqlis.com/311.aspx
While the lookup method works, it takes ages and eats up memory, since it caches the 3m+ records before running - once for every CSV file. Obviously this will only get worse as the table grows.

To make things a little more efficient, what I'd like to do is first derive the date range covered by the current file - essentially storing max(date) and min(date) in variables. Then, in the lookup SQL, use those variables to reduce the amount of data that has to be brought into the transformation to check against before inserting into the destination table.
Lookup SQL, e.g. SELECT * FROM MyTable WHERE Date BETWEEN varMinDate AND varMaxDate.
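To be clear, something like this is the shape I'm picturing (RecordID is just a placeholder for whatever columns identify a duplicate, and how the variable values get injected depends on the approach, since a full-cache lookup query can't take parameters directly):

    -- Pseudo code: select only the columns needed for the existence check,
    -- restricted to the date range of the current CSV file. RecordID is a
    -- hypothetical key column; varMinDate/varMaxDate stand in for the SSIS
    -- variables (e.g. spliced in by building the statement with an expression).
    SELECT RecordID, [Date]
    FROM MyTable
    WHERE [Date] BETWEEN @varMinDate AND @varMaxDate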

Ideally I'd use an Aggregate transformation and then feed its output either into the lookup query or into variables, but I don't think you can do that, and I get the feeling I'm approaching this with the wrong mindset.

Any thoughts would be great!

David Wynne wrote:

Lookup SQL, e.g. SELECT * FROM MyTable WHERE Date BETWEEN varMinDate AND varMaxDate.

You aren't doing a select * against the lookup table, are you?

Of course not - just pseudo code to get across what I think I'd like to achieve.

Do you have the ability to push your data into a second table for comparison? You'd then be able to do an outer join based on whatever criteria and then you could pipe those results into your destination component and be far kinder on memory requirements.

I know there are some best practices with regard to using the lookup component to minimize impact but I don't recall those off the top of my head.


Charles Talleyrand wrote:

Do you have the ability to push your data into a second table for comparison? You'd then be able to do an outer join based on whatever criteria and then you could pipe those results into your destination component and be far kinder on memory requirements.

I know there are some best practices with regard to using the lookup component to minimize impact but I don't recall those off the top of my head.

Yep, the OP could use an Execute SQL Task to load a temporary table with the keys of the lookup table. Then, in the data flow, you can use the outer join mentioned above to capture new records. That is, unless you need to repopulate the lookup table with the incoming "new" records before the next lookup.
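As a rough sketch of that pattern (all names illustrative - StagingImport for the table the transformed CSV rows get bulk-loaded into, MyTable for the destination, RecordID and [Date] as the matching columns, SomeValue standing in for the remaining columns):

    -- Load the transformed CSV rows into StagingImport first, then insert
    -- only the rows with no match in the destination. The LEFT JOIN plus the
    -- IS NULL test keeps existing rows out without caching anything in SSIS.
    INSERT INTO MyTable (RecordID, [Date], SomeValue)
    SELECT s.RecordID, s.[Date], s.SomeValue
    FROM StagingImport AS s
    LEFT OUTER JOIN MyTable AS d
        ON d.RecordID = s.RecordID
       AND d.[Date] = s.[Date]
    WHERE d.RecordID IS NULL;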

David,

How many columns do you need in the lookup transformation to determine whether a record exists? Unless we are talking about too many (or very wide) columns, I don't see several million rows being a problem. Another option is to use partial cache with a decent amount of memory, so the chance that a row is not in the cache is low.

Remember: always provide a query that returns only the columns strictly required for the join.
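In other words, rather than SELECT * (which drags every column into the cache), something along these lines - column names illustrative:

    -- Cache only what the join actually needs; every extra column inflates
    -- the lookup cache for all 3m+ rows.
    SELECT RecordID, [Date]
    FROM MyTable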
