Sunday, December 29, 2019

How to Stream Data to Redshift via Firehose

We had a backend system that would write data to a PostgreSQL RDS instance. I wanted to copy this data to our Redshift cluster too, in a near-realtime manner. There are many ways to copy data from a PostgreSQL RDS instance to Redshift; however, most of them are not realtime solutions. Thankfully, Amazon Kinesis Data Firehose can be used to stream data to Redshift in near realtime.

In this post we will see the steps needed to stream data to Redshift via Firehose using a Kotlin application.

The Solution

This post assumes that you already have an Amazon AWS account.
  • Let's start by creating a new Kinesis Data Firehose delivery stream from the Kinesis console. We will be creating this delivery stream in the Oregon (us-west-2) region.
  • Enter the name of the stream, select Direct PUT or other sources as the source, and click Next

  • Next, this screen lets you transform the incoming data into a different format. For simplicity we are not going to do any transformation, so select Disabled for both options.

  • The next step is by far the most important one in the process, so pay attention :D
  • There are three important things to configure in this step. 
    • Choose Amazon Redshift and fill in the connection details for your Redshift cluster

    • Next, you need to provide details of the S3 bucket where the data will be held temporarily. Create an S3 bucket called test-delivery-streams in the same Oregon region; this will serve as our temporary S3 location.

    • And finally, the Amazon Redshift COPY command: here you can specify various options for the Redshift COPY command. Since we will be streaming data in JSON format, you need to put json 'auto' in the COPY options - optional section.

    • Right below it, the console will also show the actual COPY command that will be used. Please review it and make sure that the COPY options have been added successfully.
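As a rough sketch of what to look for, and assuming a hypothetical target table named test_table, the generated command should resemble the following (the manifest path and IAM role ARN are placeholders filled in by Firehose):

```sql
COPY test_table
FROM 's3://test-delivery-streams/<manifest>'
CREDENTIALS 'aws_iam_role=<role-arn>'
MANIFEST JSON 'auto';
```

The key thing to verify is that the JSON 'auto' option you entered appears at the end of the command.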
    • Click Next and move to the final step
  • In this step, the only thing we need to change is the IAM role in the Permissions section.

    • Click the button to create a new IAM role; a new window will pop up that creates an IAM role with access to the various AWS services needed to stream data to Redshift.
    • No need to change anything here; click Allow
    • After this the popup will close, and we can click Next.

  • The final step will show you all the information you have entered. Once you are happy with everything shown there, click on Create Delivery Stream.
  • This creates the delivery stream; however, one final step remains: creating the table in Redshift that will hold the streamed data.
  • This table needs to mimic the JSON data that we are going to ingest into the delivery stream. For simplicity we will create a table with only 2 columns. Here is the create table script that we will use:
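As a minimal sketch, assuming a hypothetical table named test_table whose two columns match the JSON fields we will stream (id and message):

```sql
-- Two columns mirroring the {"id": ..., "message": ...} JSON records
CREATE TABLE test_table (
    id      INTEGER,
    message VARCHAR(256)
);
```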

  • Finally, our delivery stream is ready to stream data to Redshift.
  • Here is sample Kotlin code to PUT data into the delivery stream, which should eventually end up in the Redshift table:
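A minimal sketch using the AWS SDK for Java (v1) Firehose client from Kotlin; the delivery stream name test-delivery-stream and the id/message JSON fields are assumptions here, so adjust them to match your setup:

```kotlin
import com.amazonaws.regions.Regions
import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClientBuilder
import com.amazonaws.services.kinesisfirehose.model.PutRecordRequest
import com.amazonaws.services.kinesisfirehose.model.Record
import java.nio.ByteBuffer

// Builds one newline-delimited JSON record matching the two Redshift columns.
fun toJsonLine(id: Int, message: String): String =
    """{"id": $id, "message": "$message"}""" + "\n"

fun main() {
    // Firehose client for the Oregon region, where the delivery stream was created.
    val firehose = AmazonKinesisFirehoseClientBuilder.standard()
        .withRegion(Regions.US_WEST_2)
        .build()

    val request = PutRecordRequest()
        .withDeliveryStreamName("test-delivery-stream") // assumed name; replace with yours
        .withRecord(Record().withData(ByteBuffer.wrap(toJsonLine(1, "hello firehose").toByteArray())))

    // PUT a single record; Firehose buffers it, delivers to S3, then COPYs into Redshift.
    val result = firehose.putRecord(request)
    println("Record sent, id: ${result.recordId}")
}
```

Note the trailing newline on each record: Firehose concatenates records as-is, so without it consecutive JSON objects would run together in the S3 files.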
  • After invoking this code, the delivery stream will push the data to our temporary S3 location.
  • You can view the files in that location and validate that the data has reached S3.

  • After about 5 minutes, you should be able to see the same data in the Redshift cluster too!

  • If you click on the Firehose delivery stream, it has a nifty Monitoring tab which shows metrics such as how much data has been written to S3 and how much has been copied to Redshift.

Have some Fun!