If you’re using Amazon Redshift, you’re likely loading in high volumes of data on a regular basis. The most efficient, and common, way to get data into Redshift is by putting it into an S3 bucket and using the COPY command to load it into a Redshift table.
Here’s an example COPY statement to load a CSV file named file.csv from the bucket-name S3 bucket into a table named my_table.
COPY my_schema.my_table from 's3://bucket-name/file.csv’ <authorization>;
Simple enough, right? What about that pesky <authorization> line? We need to provide Redshift with the necessary credentials to access the S3 bucket assuming the bucket isn’t public.
There are a few ways to mange such an operation securely in AWS, but for the case of using using COPY to ingest data from S3 into Redshift, I suggest using an IAM Role.
IAM (or Identity and Access Management) is an AWS service you can use to securely control access to AWS resources, including S3 and Redshift. You might already be using an IAM user to log into the AWS console or use the command line tools. Most organizations issue IAM user credentials to individual employees and applications. Each IAM user only has the permissions it needs and can be disabled or deleted without impacting the root account.
IAM roles are a bit different than IAM users but are used for similar reasons. Here’s how Amazon describes IAM roles:
An IAM role is an IAM identity that you can create in your account that has specific permissions. An IAM role is similar to an IAM user, in that it is an AWS identity with permission policies that determine what the identity can and cannot do in AWS. However, instead of being uniquely associated with one person, a role is intended to be assumable by anyone who needs it. Also, a role does not have standard long-term credentials such as a password or access keys associated with it. Instead, when you assume a role, it provides you with temporary security credentials for your role session.https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html
Creating an IAM Role
The first step is to create an IAM role and give it the permissions it needs to copy data from your S3 bucket and load it into a table in your Redshift cluster.
- Under the Services menu in the AWS console (or top nav bar) navigate to IAM.
- On the left hand nav menu, select Roles, and then click the Create role button.
- You’ll be presented with a list of AWS services to select from. Find and select Redshift.
- Under Select your use case, choose Redshift – Customizable.
- On the next page (Attach permission policies), search for and select AmazonS3ReadOnlyAccess, and click on Next.
- Give your role a name and click on Create role.
- Copy the Role ARN so we can use it in just a moment. You can find this later in the IAM console under the role properties as well. The ARN looks like this: arn:aws:iam::<aws-account-id>:role/<role-name>
Associating the Role with Your Cluster
Now that you have a role with read-only access to your S3 bucket, it’s time to associate it with your cluster. Doing so gives it access to load that data into a table.
- Go back to the AWS Services menu and go to Amazon Redshift.
- Choose Clusters in the left-hand nav menu and select the cluster you want to load data into.
- Under the Cluster Properties section, click on the See IAM roles link. It’s a little hard to find, but look towards the bottom of the Cluster Properties section or search using your browser.
- When the popup loads, you will be able to select your role in the Available roles drop down.
- After you select your role, press Apply changes
Note that your cluster will take a minute or two to apply the changes, but it will still be accessible during this time.
Let’s Copy Some Data!
Now that you have an IAM role with access to read from S3 and write to tables in our Redshift cluster, you can fill in that <authorization> line in the COPY Statement with your ARN info (step 7 in Creating an IAM Role above).
COPY my_schema.my_table from 's3://bucket-name/file.csv’ iam_role ‘<my-arn>’;
In the end it will look something like this assuming you named your role “RedshiftLoaderRole”.
COPY my_schema.my_table from 's3://bucket-name/file.csv’ iam_role 'arn:aws:iam::482569874589:role/RedshiftLoaderRole’;
Managing security in your data warehouse can be a challenge, so setting a good foundation is critical.
Don’t forget to sign up for the Data Liftoff mailing list to get more content and to stay up to date on the latest in data science and data engineering.