At Makkajai, we use Amazon Redshift as our data warehouse. Redshift has good features, but it is a bit expensive when it comes to disk space, especially if you choose a DC type of cluster. For example, the dc2.large comes with only 160 GB of storage per node.
As with any data warehouse, the data keeps growing, and we constantly run into disk space issues.
One of our clusters recently ran out of disk space, and I started looking around for possible solutions to the problem.
Spend More Money
One very obvious alternative was to scale the Amazon Redshift cluster horizontally.
This means adding new Amazon Redshift nodes. While this solution is pretty easy, it comes at a cost: approximately an additional $140-$190 per month plus taxes, per node.
The data will also keep piling up, and we would have to keep adding new nodes to increase the cluster's capacity. Hence, this approach becomes more and more expensive over time.
Offload the data somewhere?
Theoretically, we could offload a lot of the data to cheaper storage (for example, Amazon S3). But the data would need to remain queryable from Amazon Redshift, just like any other table - that would be perfect.
I looked around to see if there was a way to get this done with the least amount of pain. Thankfully, we found Amazon Redshift Spectrum.
It is built specifically for this use case.
... Being able to query data stored in S3 means that you can scale your compute and your storage independently, with the full power of the Redshift query model and all of the reporting and business intelligence tools at your disposal. Your queries can reference any combination of data stored in Redshift tables and in S3. When you issue a query, Redshift rips it apart and generates a query plan that minimises the amount of S3 data that will be read, taking advantage of both column-oriented formats and data that is partitioned by date or another key.
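To make this concrete, here is a rough sketch of what offloading a table looks like with Spectrum (the exact steps follow in the next post). The schema, table, IAM role, and S3 path names below are placeholders for illustration, not our actual setup:

```sql
-- Register an external schema backed by the AWS Glue / Athena data catalog.
-- The IAM role must allow Redshift to read the catalog and the S3 bucket.
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define an external table whose data lives in S3 as Parquet files,
-- partitioned by date so queries can skip irrelevant data.
CREATE EXTERNAL TABLE spectrum_schema.events (
    user_id    VARCHAR(64),
    event_name VARCHAR(256),
    created_at TIMESTAMP
)
PARTITIONED BY (event_date DATE)
STORED AS PARQUET
LOCATION 's3://example-bucket/events/';

-- The external table can then be joined with regular Redshift tables.
-- Filtering on the partition column minimises the S3 data scanned.
SELECT u.name, COUNT(*) AS event_count
FROM spectrum_schema.events e
JOIN public.users u ON u.user_id = e.user_id
WHERE e.event_date >= '2018-01-01'
GROUP BY u.name;
```

One thing to note: for a partitioned external table, each partition has to be registered (for example with ALTER TABLE ... ADD PARTITION) before its data becomes visible to queries.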
But how much do we pay extra?
Spectrum pricing is based on the amount of data pulled from S3 during query processing and is charged at the rate of $5 per terabyte (you can save money by compressing your data and/or storing it in column-oriented form). You pay the usual charges to run your Redshift cluster and to store your data in S3, but there are no Spectrum charges when you are not running queries.
I think this paragraph summarises it pretty well. To put it in perspective: at $5 per terabyte, a query that scans 100 GB of data from S3 costs roughly $0.50.
What do we eventually get?
Using this approach, we were able to reduce the disk space needed for Amazon Redshift significantly. We were nearing 90-100% disk usage earlier; after offloading a few of the bigger tables, we are at 65-67%.
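If you want to keep an eye on this number yourself, one way is to query Redshift's stv_partitions system view. A quick sketch:

```sql
-- Cluster-wide disk usage from the stv_partitions system view.
-- The used and capacity columns are reported in 1 MB disk blocks.
SELECT
    SUM(used)     / 1024.0 AS used_gb,
    SUM(capacity) / 1024.0 AS capacity_gb,
    SUM(used) * 100.0 / SUM(capacity) AS pct_used
FROM stv_partitions;
```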
In the next post, I will document the exact steps needed to get this setup in place.