When you’re loading and transforming data in your Redshift data warehouse, you’ll often want to stage data in a temporary table rather than insert it right into the final table. I often use this approach when I’m processing new data as to not disrupt tables that analysts, dashboards or applications are making use of. Here’s why I like such an approach, and how it works.
Say I have a table named “CustomerStats” which is a large table containing aggregate statistics (number of orders, days active, lifetime value, etc.) for each of my customers. The table is used to power some dashboards and I want to update it several times a day. The stats in the table are calculated from several source tables residing in Redshift that are being fed new data throughout the day.
When you want to update CustomerStats you have a few options, including:
- Run an UPDATE on CustomerStats and join together all source tables needed to calculate the new values for each column.
- TRUNCATE CustomerStats and INSERT the results of a SELECT statement which joins together all source tables and calculates values for each column.
- Create a “staging” temporary table with the same structure as CustomerStats, INSERT the results of a SELECT statement which joins together all source tables and calculates values for each column into the staging table, TRUNCATE CustomerStats and then INSERT all rows from the staging table into CustomerStats.
While option 3 is a mouthful, it’s often the right choice. Why? Let’s look at what happens in each case.
Option 1 – UPDATE CustomerStats
The first option has two flaws.
- First, UPDATE operations in Redshift are actually a combination of a DELETE and INSERT operation. In other words, when you update the record for Customer 1, you’re deleting the old record for that customer and inserting a new one. DELETE operations have hidden overhead in that they create the need to VACUUM and ANALYZE the table afterwards.
- Second, this option puts direct strain on CustomerStats, potentially slowing down the dashboards and other resources that are accessing it.
Option 2 – TRUNCATE CustomerStats and INSERT
The second option fixes the inefficiently of the DELETE operation. TRUNCATE operations are far faster than deletes, and don’t require a VACCUM and ANALYZE afterwards. Here’s what it looks like.
TRUNCATE TABLE CustomerStats; INSERT INTO CustomerStats (CustomerId, TotalOrders, .....) SELECT CustomerId, Count(Distinct o.OrderId), ..... FROM Customer c INNER JOIN Orders o on o.CustomerId = c.CustomerId ...... WHERE .... GROUP BY CustomerID;
However, there are also two flaws in this approach.
- First, between the time of the TRUNCATE and when the INSERT operation completes to repopulate CustomerStats anyone querying the table is out of luck. It’s not out of the ordinary for the SELECT statement, which produces the results to be inserted, to take several minutes to run.
- Second, a tradeoff of a TRUNCATE vs. a DELETE is that a TRUNCATE immediately commits the transaction in which it is run in. That means if something goes wrong after the TRUNCATE you’ll be left with an empty CustomerStats table until you can fix the error and re-run the population. Not good.
Option 3 – Stage in a Temp Table
The third option solves several problems. Before I discuss those, here’s what it looks like. The first step makes use of a clever Redshift operation to create a temp table with the same structure as our permanent CustomerStats table. It also carries over the DISTKEY and SORTKEY attributes!
CREATE TEMP TABLE CustomerStats_temp (like CustomerStats); INSERT INTO CustomerStats_temp (CustomerId, TotalOrders, .....) SELECT CustomerId, Count(Distinct o.OrderId), ..... FROM Customer c INNER JOIN Orders o on o.CustomerId = c.CustomerId ...... WHERE .... GROUP BY CustomerID; TRUNCATE TABLE CustomerStats; INSERT INTO CustomerStats SELECT * FROM CustomerStats_temp; DROP TABLE CustomerStats_temp;
As you can see, we’re not running any UPDATE or DELETE statements. We’ve also solved the issue of handing an error in generating the new result set. If the INSERT into CustomerStats_temp fails, we can rollback and still have the data in CustomerStats available to our dashboards.
In addition, the only time that CustomerStats is empty is between when we truncate it and when we insert the data from CustomerStats_temp into it. That’s far less time than the INSERT statement above it takes, though there is a small window when no data is present.
If CustomerStats is absolutely massive or you just can’t live with any time where it’s empty, you’ll want to consider updating the data incrementally. Such an approach is more complex, so I tend to only use when necessary. Keep it simple whenever possible!
Don’t forget to sign up for the Data Liftoff mailing list to get more content and to stay up to date on the latest in data science and data engineering.