Loading Data from S3 into Snowflake

Loading data that’s been stored in an S3 bucket into a Snowflake data warehouse is an incredibly common task for a data engineer. In an ELT pattern, once data has been Extracted from a source, it’s typically stored in a cloud file store such as Amazon S3. In the Load step, the data is loaded from S3 into the data warehouse, which in this case is Snowflake.

This post walks through the two most common ways to load data into Snowflake.

  1. Using the COPY command
  2. Using Snowpipe

Which Method to Choose?

The method you choose should be based on a single question. Do you want to load data in bulk at specific intervals, or do you want to load smaller chunks of data continuously as soon as they arrive in S3?

For bulk loads, the COPY command is the proper solution. You can execute a COPY command to load a file, or set of files, from your S3 bucket any time you’d like. However, it must be executed by some external operation such as a script run on a schedule. If you need to load large volumes of data at specific intervals, COPY is the best choice. It’s great for smaller data volumes as well, but again only at specific intervals you choose.

If you wish to load data soon after it lands in the S3 bucket, Snowpipe is the best option. With Snowpipe, you can use Snowflake-managed compute resources to load files into your Snowflake warehouse within minutes of their landing in the S3 bucket. This is often referred to as continuous loading, or micro-batching, and works well for smaller files that arrive frequently. If you're working with a large volume of data, you'll need to break it down into many smaller files. Snowpipe is a bit more complex to get set up, but it's also quite self-sufficient once you do.

Bulk Loading with COPY

If you’ve chosen to load data in bulk from your S3 bucket, there are a few things you’ll need to configure before you can run the COPY command.

First, you’ll need to give Snowflake access to your S3 bucket. There are a few ways to do this, but I suggest configuring a Snowflake Storage Integration. Snowflake has excellent documentation on how to do this, and it’s best to ensure you follow the latest instructions. Also, if you encrypt the files in your S3 bucket, make sure you configure Snowflake to decrypt them so they can be loaded.
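
For reference, the core of a storage integration definition looks something like the sketch below. The integration name (s3_int, which also matches the name used in the stage example later in this post), the IAM role ARN, and the bucket path are placeholders to replace with your own values, and the IAM role must be granted read access to the bucket on the AWS side per the Snowflake instructions.

-- A minimal sketch of a storage integration; the role ARN and bucket are placeholders
CREATE STORAGE INTEGRATION s3_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::001234567890:role/my_snowflake_load_role'
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-bucket/');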

Next, create an external stage. An external stage is an object that points to an external storage location so Snowflake can access it; your S3 bucket will serve as that location. The instructions for creating an external stage are fairly straightforward, but I suggest first creating a FILE FORMAT in Snowflake, which you can reuse, for consistency, both for this stage and for future ones you create. A FILE FORMAT in Snowflake is a way to reference the format of a file that is to be loaded. There are a number of options, which you can read about in depth in the documentation, but as an example, here is the command to create one for pipe-delimited CSV files.

CREATE OR REPLACE FILE FORMAT csv_pipe_format type = 'csv' field_delimiter = '|';

When you create your stage per the official Snowflake docs, you can use the csv_pipe_format object as follows.

USE SCHEMA my_db.my_schema;

CREATE STAGE my_s3_stage
  storage_integration = s3_int
  url = 's3://my-bucket/'
  file_format = csv_pipe_format;

Now, it’s time to load some data! Thanks to the work you put in to create the Snowflake external stage, the COPY command itself is very simple. All the credentials and settings for your S3 bucket, along with the expected format of the files it contains, are configured in the stage itself. All you need to do is specify a file to load, and a table to load it into.

COPY INTO destination_table
  FROM @my_s3_stage/my_file.csv;

In this example, I’m assuming that the columns in the source file (my_file.csv) match up with the columns in the table they are to be loaded into (destination_table). If not, you can specify the column order in the COPY command, as in the example below. For more advanced options, see the Snowflake documentation.
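
As a simple illustration, the following sketch reorders columns during the load by selecting them positionally from the staged file. The column names (col_a, col_b, col_c) and the file's column order are made up for the example.

-- Hypothetical example: destination_table has columns col_a, col_b, col_c,
-- while the file stores them in the order col_b, col_a, col_c.
-- $1, $2, $3 refer to the file's columns by position.
COPY INTO destination_table (col_a, col_b, col_c)
  FROM (
    SELECT $2, $1, $3
    FROM @my_s3_stage/my_file.csv
  );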

You may also wish to load a set of files in the bucket. This is quite common as large datasets can be broken down into a number of files for performance reasons. By making use of the pattern parameter, the following example loads all CSV files in the bucket that begin with “my_file_”. For example, you might have files named “my_file_1.csv”, “my_file_2.csv” and so on in the bucket.

COPY INTO destination_table
  FROM @my_s3_stage
  pattern='my_file_.*.csv';

Though you can run the COPY command right from any SQL editor connected to your Snowflake instance (including the Snowflake web UI), you’ll likely want to automate bulk loads. One solution is to wrap the COPY command in a Python script and schedule it to run with a cron job or an Airflow Task.

The best way to connect to a Snowflake instance from Python is using the Snowflake Connector for Python, which can be installed via pip as follows.

pip install snowflake-connector-python

Next, you’ll need to make sure you have a Snowflake user account that has ‘USAGE’ permission on the stage you created earlier. You’ll need to fill in the username and password for that account when you instantiate the snowflake_conn object in the sample code below. For simplicity of the sample, I hardcode the username, password, and account name. PLEASE do not hardcode them in your own code though. Consider storing them in an external config file kept in a secure location, or using a secrets management tool such as Vault.
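
Granting that permission is done through a role; a rough sketch, assuming a hypothetical role named loader_role that has been granted to the connecting user, might look like the following. Note that the COPY command also needs USAGE on the stage's database and schema and INSERT on the destination table.

-- A sketch of the grants for a hypothetical loader_role assigned to the
-- connecting user; object names match the earlier examples.
GRANT USAGE ON DATABASE my_db TO ROLE loader_role;
GRANT USAGE ON SCHEMA my_db.my_schema TO ROLE loader_role;
GRANT USAGE ON STAGE my_db.my_schema.my_s3_stage TO ROLE loader_role;
GRANT INSERT ON TABLE my_db.my_schema.destination_table TO ROLE loader_role;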

The value for the account parameter of snowflake_conn should be formatted based on your cloud provider and the region where the account is located. For example, if your account is named acct_name and hosted in the US East (Ohio) region of AWS, the account value will be acct_name.us-east-2.aws (as used in the code sample). The Snowflake Connector docs have a page to help you determine the correct value based on your account’s hosting provider and region.

With that, here is the code to run the COPY command from earlier in a Python script. In addition to properly storing and retrieving your Snowflake account credentials, I suggest adding some error handling before you schedule this script to run via a cron job or other scheduler/workflow management system such as Airflow.

import snowflake.connector

# Connect to Snowflake. Don't hardcode credentials in real code; load them
# from a secure config file or a secrets management tool instead.
snowflake_conn = snowflake.connector.connect(
    user="my_user_name",
    password="my_password",
    account="acct_name.us-east-2.aws"
)

# Bulk load all matching files from the external stage into the table.
sql = """COPY INTO destination_table
  FROM @my_s3_stage
  pattern='my_file_.*.csv';"""

cur = snowflake_conn.cursor()
cur.execute(sql)
cur.close()
snowflake_conn.close()

Continuous Loading with Snowpipe

Setting up data loads via Snowpipe is a bit more involved than running bulk loads via COPY. Thankfully, Snowflake has a great walkthrough to get you going, and since I won’t be able to do a better job describing the steps, I highly suggest you take a look.
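
To give a sense of what you're building toward, the core of the setup is a pipe object that wraps a COPY command; a rough sketch reusing the stage from earlier might look like the following. With AUTO_INGEST enabled, new files trigger the load via S3 event notifications, which is the part the Snowflake walkthrough covers in detail.

-- A rough sketch of a pipe that loads files from the stage as they arrive.
-- AUTO_INGEST = TRUE relies on S3 event notifications being configured
-- per the Snowflake walkthrough.
CREATE PIPE my_s3_pipe
  AUTO_INGEST = TRUE
  AS
  COPY INTO destination_table
    FROM @my_s3_stage;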

However, I will offer a few suggestions and tips before you get started.

Snowpipe is great for continuous (aka “streaming”) data ingestion, but it’s not worth the complexity if you can get away with scheduling infrequent bulk loads via COPY. It’s not just configuring Snowpipe that makes continuous ingestion difficult; it’s the Extract step of getting the data into your S3 bucket in the first place. Unless the data source you’re extracting from is already well suited to streaming, you’ll need to build out a well-designed extraction from it into S3. In summary, bulk loads can sound a bit boring, but in some cases they really are the best option.

Snowpipe is best suited for loading a series of relatively small files from S3 rather than a few large ones. By “relatively small” I mean 100MB-200MB at the most. Depending on the volume of data in the source you’re extracting from, you’ll need to consider how you’ll break down the extracted data into files in the S3 bucket.

Because Snowpipe uses serverless compute resources, you’ll want to understand how you’ll be billed for its use. There are a lot of factors that go into how much you’ll be charged, including the number of files you’re loading, the size of the files, the frequency of loading, and even the complexity of any transformations or restructuring of the data you perform during the load. You’ll probably need to do some experimentation to truly estimate your cost, so I suggest starting with a reasonably sized but representative dataset and forecasting based on a few hours’ or even a day’s worth of Snowpipe loading.