Python has long been one of the premier choices for data engineers looking to organize, clean, and analyze data from various sources. It is relatively easy to learn with powerful libraries that enable anyone from data scientists to business analysts to work with datasets to derive value.
However, using Python for data parsing, organizing, and cleansing at enterprise scale often leads us into the trap of re-inventing the wheel when we try to script the intricacies of real-life data processes. Parsing files, handling exceptions, and scheduling scripts in cron are just a few of the tedious jobs you shouldn't be doing yourself in the 21st century. By combining your Python skills with a complete data integration platform like CloverDX, you can focus on deriving value from your data rather than wrangling the mundane tasks that surround your business logic.
CloverDX gives you the flexibility to connect to any data source, on-prem or in the cloud, in any format. You can use CloverDX to push your data through pipelines that parse, clean, and transform it into something you can use in analytic processes. CloverDX is designed to be the robust data backbone of an organization. Supporting Python-based analysis and data manipulation is just one of the many ways we can transform organizations’ data infrastructure.
In this example, we’re going to grab sales data from a CSV file, clean and validate it, push it through a Python script that uses the NumPy library for analysis, and finally write out to a target database for our Marketing team to use.
Setup
I've already written a Python script that takes incoming sales data, analyzes it, and turns it into a readable Excel report for our Marketing and Sales teams. I can use CloverDX as my ingestion engine to grab data from any source on the fly, without having to manually rewrite how my Python script pulls and parses data.
First, I'll use a JSON Extract component to pick apart a piece of JSON and restructure it so that I can effectively use it as an input into the rest of our data process. In this case, I’ll be restructuring into a flat CSV format.
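To make the reshaping concrete, here is a minimal Python sketch of what "restructuring JSON into a flat CSV format" means. It is not the JSON Extract mapping itself; the `orders` array and the field names are hypothetical, since the real input layout is not shown here.

```python
import csv
import io
import json

# Hypothetical input: the real JSON layout is an assumption for illustration.
SAMPLE = """
{"orders": [
  {"id": 1, "customer": {"name": "Acme", "region": "EU"}, "amount": 120.5},
  {"id": 2, "customer": {"name": "Globex", "region": "US"}, "amount": 80.0}
]}
"""


def flatten(obj, prefix=""):
    """Collapse nested dicts into one level, joining keys with underscores."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


def json_to_csv(text):
    """Turn the hypothetical 'orders' array into CSV text with a header row."""
    rows = [flatten(order) for order in json.loads(text)["orders"]]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


print(json_to_csv(SAMPLE), end="")
```

Each nested object becomes a set of prefixed columns, so every order ends up as exactly one flat row.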
My JSON Extract Mapping looks like this:
My Data Validation component will allow me to create the following rules:
Our next step is restructuring our data into standard-in (stdin) format so we can run it through our Python script.
We're going to use a Flat File Writer component to transform our data from a columnar format to a single discrete record for processing in Python.
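Conceptually, this step turns a set of field-based records into one block of delimited text that a script can read from stdin. A rough Python equivalent might look like the sketch below; the pipe delimiter and the `region` and `amount` fields are assumptions for illustration, not the actual Flat File Writer settings.

```python
import csv
import io


def records_to_payload(records, fields, delimiter="|"):
    """Serialize records as delimited lines in one string, ready to pipe to stdin."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    for record in records:
        writer.writerow([record[name] for name in fields])
    return buffer.getvalue()


# Hypothetical records standing in for validated sales data.
records = [
    {"region": "EU", "amount": 120.5},
    {"region": "US", "amount": 80.0},
]
payload = records_to_payload(records, ["region", "amount"])
print(payload, end="")
```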
From there, we're going to call my Python script from a Subgraph that I've built.
Subgraphs in CloverDX allow you to wrap business logic (like a piece of Python data analysis in our case) in a reusable format that you can share within your team or organization.
In this example, we're using an Execute Script component to call out to my Python script.
Within this component, we can run any script (including Python). We can either paste our script directly into the “Script” input box or we can provide the URL of our .py file in the “ScriptURL” box. We can also parameterize all of these inputs to dynamically call on different data sources and analysis processes from a single graph.
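The contract here is simple: the script receives text on stdin and returns text on stdout. The sketch below imitates that contract with Python's `subprocess` module. It is not how the Execute Script component is implemented, and the inline `SCRIPT` is a hypothetical stand-in for a real analysis script.

```python
import subprocess
import sys

# Stand-in for an analysis script; in practice this would live in its own .py file.
SCRIPT = "import sys\nfor line in sys.stdin:\n    print(line.strip().upper())\n"


def run_script(script_text, stdin_text):
    """Run Python source in a child process, feeding stdin_text to its stdin."""
    result = subprocess.run(
        [sys.executable, "-c", script_text],
        input=stdin_text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


print(run_script(SCRIPT, "eu|120.5\nus|80.0\n"), end="")
```

Because the script only sees stdin and stdout, the same script can be reused against any source the surrounding graph can read.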
Our script is going to be called on the incoming stdin input. The Python script looks like this:
The resulting analysis from our script goes to the output port of our subgraph and back into our original graph.
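As an illustration of the kind of script that fits this stdin-to-stdout pattern, here is a minimal sketch. It is not the script used in this example: the `region|amount` input layout is an assumption, and it uses the standard-library `statistics` module in place of NumPy so that it runs without extra dependencies.

```python
import io
import statistics
import sys
from collections import defaultdict


def analyze(lines, delimiter="|"):
    """Group amounts by region and return 'region|count|total|mean' lines."""
    amounts = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        region, amount = line.split(delimiter)
        amounts[region].append(float(amount))
    return [
        f"{region}{delimiter}{len(values)}{delimiter}{sum(values):.2f}"
        f"{delimiter}{statistics.mean(values):.2f}"
        for region, values in sorted(amounts.items())
    ]


def main(source, sink):
    """Read records from source and write one summary line per region to sink."""
    for row in analyze(source):
        sink.write(row + "\n")


# Demo with an in-memory stream; a real script would call main(sys.stdin, sys.stdout).
main(io.StringIO("EU|100\nUS|50\nEU|200\n"), sys.stdout)
```

Keeping the logic in `analyze` and the stream handling in `main` makes the analysis easy to test without a running graph.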
Now that we've set up our Subgraph, let's go back to the graph level to add some finishing touches.
Our subgraph output is going to be a single continuous string in the same stdin-style format. We’re going to run it through a Flat File Reader to normalize it into a columnar form that’s easier to digest as a business report.
My Flat File Reader configuration looks like this:
Finally, we can convert this output into a nice Excel report using our Excel Spreadsheet Writer Component.
We're going to set the "FileURL" property to our first output port (0) in a discrete format in order to incorporate it into our next component.
Our spreadsheet report output is going to look like this:
Now we can focus on the final part of our process: distributing our report via email: