Friday, July 31, 2020

How To Reduce the Disk Space Need for Amazon Redshift - Part 2

This post will conclude the, process of reducing the disk space need for Amazon Redshift. If you haven't already read Part 1 in this 2 part series, I strongly recommend that you go read it now, I will wait!

Right, so now that we know why we are doing what we are doing, let's get straight to the point. How to actually do it?

How To?

We will use an open source library called Spectrify. It basically helps us with the following
  • Export a Redshift table to S3 (CSV)
  • Convert exported CSVs to Parquet files in parallel
  • Create the Spectrum table on your Redshift cluster
It basically performs all the steps needed to get our table offloaded from Redshift and then setup as a spectrum table on the redshift cluster. 

Since the entire process is network and CPU intensive, its advisable that we do it from an Amazon EC2 instance - t2.xlarge is highly recommended. Assuming that you have an Amazon EC2 instance, heres the script gets the job done.

The script is designed in such a way that it installs all the necessary dependencies required for its execution.

It basically performs the following steps:
  • Ensure that we have psql installed. Thats needed so that we can execute the commands on Redshift database. Install it if required.
  • Create the schema if required. It needs to be done only once. We will host our spectrum tables in spectrum schema.
  • Check if spectrify is installed. Install it with all the necessary dependencies, if required.
  • Export the data in the Redshift tables to CSV files on S3.
  • Convert the CSV files to Parquet format.
  • Moving the files to appropriate move path, so that we can support incremental exports.
  • Create the spectrum table in Redshift. Don't need to create the table again and again. Only do it once.
  • Truncate the Redshift table if required.

Once the script finishes running, you should have your spectrum table ready to be queried and used like any other Redshift table. It will have the same structure and data as the original Redshift table!

What about Incremental Offload?

Totally possible! We can run this script again and again on any table in Redshift. It will keep appending data to the spectrum table without overwriting the earlier data. If a table was offloaded earlier, running it again on the same table, will offload only new rows that got added after the last run.

I hope more and more people are able to use of this awesome feature and reduces the disk space need for Amazon Redshift cluster!
Have some Fun!