Copying Data from Postgres to Redshift
I used to work for a small startup that wanted to know a lot about their users without breaking the budget. Redshift, a column-oriented fork of Postgres, integrates with lots of existing Business Intelligence tools. Best of all, pricing starts at 0.25 USD per hour.
We streamed user events from our application logs to Redshift. Then we wanted to know which users were doing what events.
I wrote postgres_to_redshift after failing to find any other tool for streaming SQL databases to Redshift.
Enter postgres_to_redshift
Key Features:
- Schema changes are propagated automatically
- Tables remain queryable during updates without any downtime
- Data types are translated meaningfully
- Runs easily in Cron or Heroku Scheduler
Usage
export POSTGRES_TO_REDSHIFT_SOURCE_URI='postgres://username:password@host:port/database-name'
export POSTGRES_TO_REDSHIFT_TARGET_URI='postgres://username:password@host:port/database-name'
export S3_DATABASE_EXPORT_ID='yourid'
export S3_DATABASE_EXPORT_KEY='yourkey'
export S3_DATABASE_EXPORT_BUCKET='some-bucket-to-use'
postgres_to_redshift
The environment variables will change to command line flags in the future.
How it Works
postgres_to_redshift takes a full dump of a Postgres database and loads it into Redshift. For small databases, we had no issue running this task hourly.
If you are curious, I’ve outlined the exact steps postgres_to_redshift takes for each table.
Step 1: Create target table in Redshift
postgres_to_redshift uses Postgres’ table definitions stored in information_schema.tables and column definitions stored in information_schema.columns.
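If you want to see what that lookup looks like, here is a minimal sketch using the pg gem; the exact query postgres_to_redshift issues may differ, and the films table is just the running example:

require 'pg'

source = PG.connect(ENV['POSTGRES_TO_REDSHIFT_SOURCE_URI'])

# Fetch the column definitions for one table from information_schema
columns = source.exec_params(<<-SQL, ['films'])
  SELECT column_name, data_type
  FROM information_schema.columns
  WHERE table_schema = 'public' AND table_name = $1
  ORDER BY ordinal_position
SQL

columns.each { |col| puts "#{col['column_name']} #{col['data_type']}" }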
Redshift’s column types differ slightly from Postgres. postgres_to_redshift performs the following mapping:
{
"bytea" => "CHARACTER VARYING(65535)",
"json" => "CHARACTER VARYING(65535)",
"oid" => "CHARACTER VARYING(65535)",
"text" => "CHARACTER VARYING(65535)",
"money" => "DECIMAL(19,2)",
}
Using CHARACTER VARYING(65535) has a side effect of truncating very long strings to Redshift’s 65K limit. I am unaware of a workaround at this time.
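As an illustration, a hypothetical helper could apply this mapping while assembling the Redshift DDL, falling back to the original Postgres type when no mapping exists. This is a sketch, not the gem’s actual code:

REDSHIFT_TYPES = {
  "bytea" => "CHARACTER VARYING(65535)",
  "json"  => "CHARACTER VARYING(65535)",
  "oid"   => "CHARACTER VARYING(65535)",
  "text"  => "CHARACTER VARYING(65535)",
  "money" => "DECIMAL(19,2)",
}

# Build a CREATE TABLE statement from information_schema rows,
# keeping the original Postgres type when no mapping applies.
def create_table_sql(table_name, columns)
  definitions = columns.map do |col|
    type = REDSHIFT_TYPES.fetch(col['data_type'], col['data_type'])
    %Q("#{col['column_name']}" #{type})
  end
  "CREATE TABLE IF NOT EXISTS public.#{table_name} (#{definitions.join(', ')})"
end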
Step 2: Copy data from Postgres to S3
Postgres provides a COPY statement for streaming data out of a table.
COPY (SELECT id, title FROM films) TO STDOUT WITH DELIMITER '|'
postgres_to_redshift compresses the output of COPY into a sequence of gzipped files. It then uploads the gzipped data to an S3 bucket.
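Here is a rough sketch of this step using the pg, zlib, and aws-sdk-s3 gems, simplified to a single output file; the file name and region below are assumptions, and the gem’s actual code differs:

require 'pg'
require 'zlib'
require 'aws-sdk-s3'

source = PG.connect(ENV['POSTGRES_TO_REDSHIFT_SOURCE_URI'])

# Stream the table out of Postgres and gzip it to a local file
Zlib::GzipWriter.open('films.psv.gz') do |gz|
  source.copy_data("COPY (SELECT id, title FROM films) TO STDOUT WITH DELIMITER '|'") do
    while (row = source.get_copy_data)
      gz.write(row)
    end
  end
end

# Upload the gzipped export to S3
s3 = Aws::S3::Resource.new(
  access_key_id:     ENV['S3_DATABASE_EXPORT_ID'],
  secret_access_key: ENV['S3_DATABASE_EXPORT_KEY'],
  region:            'us-east-1' # assumed region
)
bucket = s3.bucket(ENV['S3_DATABASE_EXPORT_BUCKET'])
bucket.object('export/films.psv.gz').upload_file('films.psv.gz')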
Step 3: Ingest data into Redshift and Swap Tables
Loading the data is the most complicated step. It is done in a single transaction to ensure that running queries get access to consistent data.
DROP TABLE IF EXISTS public.films_updating;
BEGIN;
ALTER TABLE public.films RENAME TO films_updating;
CREATE TABLE public.films (id INTEGER, title CHARACTER VARYING(65535));
COPY public.films
FROM 's3://#{ENV['S3_DATABASE_EXPORT_BUCKET']}/export/films.psv.gz'
CREDENTIALS 'aws_access_key_id=#{ENV['S3_DATABASE_EXPORT_ID']};aws_secret_access_key=#{ENV['S3_DATABASE_EXPORT_KEY']}'
GZIP TRUNCATECOLUMNS ESCAPE DELIMITER as '|';
COMMIT;
Because the table rename happens in a transaction, running queries proceed without error. Queries that begin after this transaction commits can access the newly imported data.
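From Ruby, the swap might be driven roughly like this, using the pg gem against the Redshift connection; the column types follow the films example in the SQL above, and the gem’s own implementation differs in the details:

require 'pg'

target = PG.connect(ENV['POSTGRES_TO_REDSHIFT_TARGET_URI'])

import_sql = <<-SQL
  COPY public.films
  FROM 's3://#{ENV['S3_DATABASE_EXPORT_BUCKET']}/export/films.psv.gz'
  CREDENTIALS 'aws_access_key_id=#{ENV['S3_DATABASE_EXPORT_ID']};aws_secret_access_key=#{ENV['S3_DATABASE_EXPORT_KEY']}'
  GZIP TRUNCATECOLUMNS ESCAPE DELIMITER as '|'
SQL

# Drop any leftover staging table outside the transaction,
# then rename, recreate, and reload atomically.
target.exec('DROP TABLE IF EXISTS public.films_updating')
target.transaction do |conn|
  conn.exec('ALTER TABLE public.films RENAME TO films_updating')
  conn.exec('CREATE TABLE public.films (id INTEGER, title CHARACTER VARYING(65535))')
  conn.exec(import_sql)
end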
Future Goals and Current Limitations
Better Large Database Support
While Redshift is very fast at bulk loading data, it is extremely slow at individual inserts. For this reason, postgres_to_redshift loads an entire table instead of streaming updates to Redshift. For huge databases, this may be prohibitively slow.
There also isn’t a progress indicator for large databases. That would be nice.
Streaming Replication
Postgres 9.4 added logical decoding, a C API for streaming data changes out of the write-ahead log. It could be possible to take advantage of this to write a library that ships chunked, incremental updates. The BDR (bidirectional replication) project also includes some mechanisms for logical (row-level) replication which may be useful as well. Currently, nobody is asking for this, so I am not working on it.
Port to a language with better concurrency support
Ruby was a great language for prototyping, but other languages like C, Go, or Rust may be better in the long run. As is, the tool is fast enough for our database size, but a Ruby implementation may not be fast enough to be generally useful.
Links
Thanks to the postgres_to_redshift contributors!