Copying Data from Postgres to Redshift
I used to work for a small startup that wanted to know a lot about their users without breaking the budget. Redshift, a column-oriented fork of Postgres, integrates with lots of existing Business Intelligence tools. Best of all, pricing starts at 0.25 USD per hour.
We streamed user events from our application logs to Redshift. Then we wanted to know which users were doing what events.
I wrote postgres_to_redshift after failing to find any other tool for streaming SQL databases to Redshift.
Enter postgres_to_redshift
Key Features:
- Schema changes are propagated automatically
- Tables remain queryable during updates without any downtime
- Data types are translated meaningfully
- Runs easily in Cron or Heroku Scheduler
Usage
export POSTGRES_TO_REDSHIFT_SOURCE_URI='postgres://username:password@host:port/database-name'
export POSTGRES_TO_REDSHIFT_TARGET_URI='postgres://username:password@host:port/database-name'
export S3_DATABASE_EXPORT_ID='yourid'
export S3_DATABASE_EXPORT_KEY='yourkey'
export S3_DATABASE_EXPORT_BUCKET='some-bucket-to-use'
postgres_to_redshift
The environment variables will change to command line flags in the future.
How it Works
postgres_to_redshift takes a full dump of a Postgres database and loads it into Redshift. For small databases, we had no issue running this task hourly.
If you are curious, I’ve outlined the exact steps postgres_to_redshift takes for each table.
Step 1: Create target table in Redshift
postgres_to_redshift uses Postgres’ table definitions stored in information_schema.tables and column definitions stored in information_schema.columns.
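If you want to see what that lookup looks like, here is a minimal sketch using the pg gem; the exact query postgres_to_redshift issues may differ, and the films table is just the running example:

require 'pg'

source = PG.connect(ENV['POSTGRES_TO_REDSHIFT_SOURCE_URI'])

# Fetch the column definitions for one table from information_schema
columns = source.exec_params(<<-SQL, ['films'])
  SELECT column_name, data_type
  FROM information_schema.columns
  WHERE table_schema = 'public' AND table_name = $1
  ORDER BY ordinal_position
SQL

columns.each { |col| puts "#{col['column_name']} #{col['data_type']}" }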
Redshift’s column types differ slightly from Postgres. postgres_to_redshift performs the following mapping:
{
"bytea" => "CHARACTER VARYING(65535)",
"json" => "CHARACTER VARYING(65535)",
"oid" => "CHARACTER VARYING(65535)",
"text" => "CHARACTER VARYING(65535)",
"money" => "DECIMAL(19,2)",
}
Using CHARACTER VARYING(65535) has a side effect of truncating very long strings to Redshift’s 65K limit. I am unaware of a workaround at this time.
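As an illustration, a hypothetical helper could apply this mapping while assembling the Redshift DDL, falling back to the original Postgres type when no mapping exists. This is a sketch, not the gem’s actual code:

REDSHIFT_TYPES = {
  "bytea" => "CHARACTER VARYING(65535)",
  "json"  => "CHARACTER VARYING(65535)",
  "oid"   => "CHARACTER VARYING(65535)",
  "text"  => "CHARACTER VARYING(65535)",
  "money" => "DECIMAL(19,2)",
}

# Build a CREATE TABLE statement from information_schema rows,
# keeping the original Postgres type when no mapping applies.
def create_table_sql(table_name, columns)
  definitions = columns.map do |col|
    type = REDSHIFT_TYPES.fetch(col['data_type'], col['data_type'])
    %Q("#{col['column_name']}" #{type})
  end
  "CREATE TABLE IF NOT EXISTS public.#{table_name} (#{definitions.join(', ')})"
end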
Step 2: Copy data from Postgres to S3
Postgres provides a COPY statement for streaming data out of a table.
COPY (SELECT id, title FROM films) TO STDOUT WITH DELIMITER '|'
postgres_to_redshift compresses the output of COPY into a sequence of gzipped files. It then uploads the gzipped data to an S3 bucket.
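Here is a rough sketch of this step using the pg, zlib, and aws-sdk-s3 gems, simplified to a single output file; the file name and region below are assumptions, and the gem’s actual code differs:

require 'pg'
require 'zlib'
require 'aws-sdk-s3'

source = PG.connect(ENV['POSTGRES_TO_REDSHIFT_SOURCE_URI'])

# Stream the table out of Postgres and gzip it to a local file
Zlib::GzipWriter.open('films.psv.gz') do |gz|
  source.copy_data("COPY (SELECT id, title FROM films) TO STDOUT WITH DELIMITER '|'") do
    while (row = source.get_copy_data)
      gz.write(row)
    end
  end
end

# Upload the gzipped export to S3
s3 = Aws::S3::Resource.new(
  access_key_id:     ENV['S3_DATABASE_EXPORT_ID'],
  secret_access_key: ENV['S3_DATABASE_EXPORT_KEY'],
  region:            'us-east-1' # assumed region
)
bucket = s3.bucket(ENV['S3_DATABASE_EXPORT_BUCKET'])
bucket.object('export/films.psv.gz').upload_file('films.psv.gz')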
Step 3: Ingest data into Redshift and Swap Tables
Loading the data is the most complicated step. It is done in a single transaction to ensure that running queries get access to consistent data.
DROP TABLE IF EXISTS public.films_updating;
BEGIN;
ALTER TABLE public.films RENAME TO films_updating;
CREATE TABLE public.films (id INTEGER, title CHARACTER VARYING(65535));
COPY public.films
FROM 's3://#{ENV['S3_DATABASE_EXPORT_BUCKET']}/export/films.psv.gz'
CREDENTIALS 'aws_access_key_id=#{ENV['S3_DATABASE_EXPORT_ID']};aws_secret_access_key=#{ENV['S3_DATABASE_EXPORT_KEY']}'
GZIP TRUNCATECOLUMNS ESCAPE DELIMITER as '|';
COMMIT;
Because the table rename happens in a transaction, running queries proceed without error. Queries that begin after this transaction commits can access the newly imported data.
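From Ruby, the swap might be driven roughly like this, using the pg gem against the Redshift connection; the column types follow the films example in the SQL above, and the gem’s own implementation differs in the details:

require 'pg'

target = PG.connect(ENV['POSTGRES_TO_REDSHIFT_TARGET_URI'])

import_sql = <<-SQL
  COPY public.films
  FROM 's3://#{ENV['S3_DATABASE_EXPORT_BUCKET']}/export/films.psv.gz'
  CREDENTIALS 'aws_access_key_id=#{ENV['S3_DATABASE_EXPORT_ID']};aws_secret_access_key=#{ENV['S3_DATABASE_EXPORT_KEY']}'
  GZIP TRUNCATECOLUMNS ESCAPE DELIMITER as '|'
SQL

# Drop any leftover staging table outside the transaction,
# then rename, recreate, and reload atomically.
target.exec('DROP TABLE IF EXISTS public.films_updating')
target.transaction do |conn|
  conn.exec('ALTER TABLE public.films RENAME TO films_updating')
  conn.exec('CREATE TABLE public.films (id INTEGER, title CHARACTER VARYING(65535))')
  conn.exec(import_sql)
end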
Future Goals and Current Limitations
Better Large Database Support
While Redshift is very fast at bulk loading data, it is extremely slow at individual inserts. For this reason, postgres_to_redshift loads an entire table instead of streaming updates to Redshift. For huge databases, this may be prohibitively slow.
There also isn’t a progress indicator for large databases. That would be nice.
Streaming Replication
Postgres 9.4 added logical decoding, a C API for streaming data changes out of the write-ahead log. It could be possible to take advantage of this to write a library that ships chunked, incremental updates. The BDR (bidirectional replication) project also includes some mechanisms for logical (row-level) replication which may be useful as well. Currently, nobody is asking for this, so I am not working on it.
Port to a language with better concurrency support
Ruby was a great language for prototyping, but other languages like C, Go, or Rust may be better in the long run. As is, the tool is fast enough for our database size, but a Ruby implementation may not be fast enough to be generally useful.
Links
Thanks to the postgres_to_redshift contributors!