As you’ve probably experienced, cloud computing has become incredibly important. Today, we’ll show you how to easily feed data to the data warehouse service Amazon Redshift using CloverETL. We’ll create transformations that will read data from the source and upload them to Redshift. Luckily, this will work with any data source. With the help of CloverETL, you’ll able to deliver data to Redshift with very little effort and in matter of minutes. This is what we call rapid data integration.
Amazon Redshift - Theoretical Fundamentals
Let’s start with a little bit of theory – enough to explain the basics. Amazon Redshift allows you to upload data in many ways. However, its documentation recommends that you use the COPY command from S3. During our testing we confirmed that COPY from S3 is by far the fastest solution. So, we’ll use the COPY command with Amazon S3 bucket. CloverETL will get the data, prepare it, and upload it to Amazon S3. Afterwards, it’ll invoke the COPY command and move data to Redshift.
In order to move data to Redshift, you’ll need a two-step data integration process.
In the first step, connect to the source of the data – in this case, it’s MongoDB. To connect to MongoDB, the standard CloverETL connector will be used. Just fill in all necessary fields and you’re good to go.
The second step, uploading, will be a little trickier. Rather than putting the whole transformation into a single graph, it’s actually much better to create a subgraph that will take care of the upload to Redshift. Among its many advantages, subgraphs are reusable, making any data integration that involves uploads to Redshift fast and easy.
To make sure that the COPY command is executed only after all S3 uploads are finished, you’ll need to set two different phases in subgraph.
Split incoming data into several csv files. It's not necessary, but we suggest you do it to utilize parallel processing in the cloud. Then upload data to S3. CloverETL has native support of S3, so writing to the cloud is like writing to any other destination.
When the data is stored in the S3 bucket, the subgraph extracts the field names from metadata, prepares the COPY command, and executes it using the DBExecute component.
Example of copy command:
copy DBTableName (id, field1, field2, field3)
Now, just place the subgraph into the graph like any other component. Set your parameters (Amazon credentials – related to S3 and Redshift, table name), connect an edge to it, and you are good to go.
The beauty of this is that the Redshift subgraph is completely universal. It takes the credentials and the table name as the input parameter. Data comes on the input edge and field names are extracted automatically from metadata. You can use this subgraph to store any data to Redshift. Additionally, you can perform any other data operation you want before loading data into Redshift – like combing data from multiple sources, validating it, or filtering it. Whatever you need.
To better grasp the versatility of this subgraph, let’s take a look at the example below. It performs three Amazon Redshift uploads. The first one combines data from the file and SOAP web service. The second one filters data from MongoDB, and the third one just moves the data from the database.
As you can see, it’s pretty easy to use CloverETL to upload data to Amazon Redshift. If you have any data source, using this subgraph will help you upload it in just a few minutes.