In a previous post, I wrote about using the COPY command to load data from an S3 bucket into a Redshift table. In this post, I’ll talk about the reverse – moving data from Redshift into S3 with the UNLOAD command.
Why Move Data from Redshift to S3?
It’s fairly obvious to most why you’d bring data from S3 into your Redshift cluster, but why do the reverse? In a modern data warehouse, you’re likely (hopefully!) taking an ELT rather than ETL approach in your processing.
In ELT, the data being loaded into your Redshift cluster is quite raw. However, after the final step (Transform), you’re left with datasets that are not only valuable in your warehouse, but to other systems that you don’t want querying Redshift directly.
For example, you might have a REST API that serves pre-crunched data to your website for some kind of customer facing dashboard. While Redshift is great for large operations like crunching that data, concurrency isn’t its strong-suit. The API is probably sitting on top of a database that’s better suited for such operations. The good news is that Redshift is already doing the hard work of transforming the data you need for the dashboard, so all you need to do it move it from Redshift to the application database serving the API. The UNLOAD command is quite efficient at getting data out of Redshift and dropping it into S3 so it can be loaded into your application database.
Another common use case is pulling data out of Redshift that will be used by your data science team or in a machine learning model that’s in production. Again, Redshift might have the data they want, but it’s probably not the right system for them to query directly.
If you’ve used the COPY command, you’ll feel right at home with UNLOAD. In my post about the COPY command, I provided the following example of moving the contents into a csv file named “file.csv” and inserting it into a table called “my_table”.
COPY my_schema.my_table from 's3://bucket-name/file.csv’ iam_role 'arn:aws:iam::482569874589:role/RedshiftLoaderRole’;
Also in that post, and in the example, I used an IAM role for authentication. I suggest doing the same for the UNLOAD command we’re about to use. There are instructions for getting it set up in that post.
Now, let’s do the reverse of the COPY example above. Let’s take all of the data from my_table and put it into our S3 bucket. Here we go!
UNLOAD ('select * from my_schema.my_table') to 's3://bucket-name/file_’ iam_role 'arn:aws:iam::482569874589:role/RedshiftLoaderRole’;
Of course you can unload only a subset of the data in the table. Just modify the SELECT statement on the first line.
One thing that might jump out is our file name is “file_”. That’s because by default UNLOAD writes data out in parallel and creates multiple files. In our example you might end up with something like this.
file_0000_part_00 file_0001_part_00 file_0002_part_00
If that’s not ideal for you, you can use the “parallel off” option and get a single file as long as the total file size is less than or equal to 6.2 GB. If it’s larger, you’ll get one file for every 6.2 GB of your data when “parallel off” is set. Here’s what our UNLOAD statement looks in that case.
UNLOAD ('select * from my_schema.my_table') to 's3://bucket-name/file_’ iam_role 'arn:aws:iam::482569874589:role/RedshiftLoaderRole’ parallel off;
There are a good number of advanced options for handing different delimiters, compressing your output files with GZIP and so on. If you’ like to learn more you can view the detailed documentation on the AWS doc site.
Don’t forget to sign up for the Data Liftoff mailing list to get more content and to stay up to date on the latest in data science and data engineering.