An easy learning curve and a rapidly growing ecosystem of libraries make Python (along with R) a favored choice for data prep and analytics. Python is no doubt one of the drivers behind the modern self-service approach of data scientists and business analysts, who can now do much more data massaging on their own, without outside help.
However, I've learned the hard way that doing all your data massaging in Python eventually leads you into the trap of reinventing the wheel. Parsing files, handling exceptions, and scheduling scripts in cron are just a few of the tedious jobs you shouldn't be doing yourself in the 21st century. Combine your Python skills with a data integration platform like CloverETL and you can focus on writing Python business logic and analytic pieces while leaving out the boring (yet necessary) stuff. CloverETL can take care of data parsing and formatting, connecting to on-premise and cloud data sources, jobflow orchestration, automation, monitoring, scaling out, etc. CloverETL is designed to be the data backbone of an organization, and Python-based analysis and data manipulation can surely be part of it.
Let’s say I have some logic that I wrote in Python (let’s call it calculate_age.py; yes, amazingly, it calculates a person’s age from their date of birth!) and I want to use this logic inside CloverETL.
Normally I would have to use the Reformat component and write the logic in CTL or Java. But with the help of Jython (a third-party library for integrating Python in Java) combined with the provided PythonBridge class (see below), I can use Python directly within CloverETL!
You need to have the Jython library JAR linked to your CloverETL project. Right-click the project in Navigator and select “Properties”. Go to Java Build Path > Libraries and then click “Add JARs” or “Add External JARs” (depending on whether the JAR is in your project or elsewhere).
While on the Properties screen, check that you have a Java SE Development Kit (JDK) installed. A JDK is required for the PythonBridge class (see below). If it says “JRE System Library [jdk1.7.0_xx]”, you’re OK.
Python Development in Designer
If you want to create your Python scripts in CloverETL, we recommend installing PyDev, an open-source plugin for Eclipse. It is a Python IDE (integrated development environment) that supports Python editing with features such as code completion, refactoring, quick navigation, templates, code analysis, and many more.
Writing Python Script
We’re writing a simple Reformat transformation in Python instead of the default CTL or Java.
How does it work?
In the Reformat component we use the PythonBridge class, a custom piece of Java code that delegates Reformat’s transform() function to the Python script.
When writing your Python script, keep the following in mind:
- You must define a transform() function, the main function for Reformat
- For each incoming record, transform() is executed once, producing one output record
- Input and output records are available via the “input_record” and “output_record” variables
- For logging purposes you can use the “logger” object
This is my python/calculate_age.py script adapted to work as a Reformat transformation:
from datetime import date

def transform():
    # read fields using methods from clover_utils.py
    name = get_string_field("Name")
    surname = get_string_field("Surname")
    birth_date = get_date_field("BirthDate")
    country = get_string_field("Country")
    # start legacy process
    res = legacy_process_person(name, surname, birth_date, country)
    # set fields from the legacy process output: [age, from_usa]
    set_field("Age", res[0])
    set_field("FromUSA", res[1])

def legacy_process_person(name, surname, birth_date, country):
    diff = date.today() - birth_date
    return [
        # the age expression was lost in this listing; whole years from days is an assumed approximation
        diff.days // 365,
        country == "United States of America" ]
The code goes through three stages:
- Reading record values into variables (name, surname, etc.)
- Running the legacy_process_person() function, which returns a new set of values
- And finally writing them back as output (set_field() calls).
Of course, this is a truly basic example, but that’s all the magic!
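Deriving an age from a plain day difference can only approximate it, because years are not all the same length. As a minimal sketch of a calendar-based alternative, the following uses only the standard library. The function name age_in_years and its optional today parameter are hypothetical and not part of the sample project; the syntax is kept to what Python 2.7 accepts so that it should also suit Jython.

```python
from datetime import date


def age_in_years(birth_date, today=None):
    """Return the number of completed years between birth_date and today."""
    if today is None:
        today = date.today()
    years = today.year - birth_date.year
    # Subtract one year if this year's birthday has not happened yet.
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


# A fixed reference date keeps the results reproducible.
print(age_in_years(date(1990, 6, 15), today=date(2020, 6, 14)))  # 29
print(age_in_years(date(1990, 6, 15), today=date(2020, 6, 15)))  # 30
```

Because the function takes plain date objects and has no dependency on CloverETL, it can be tested on its own before being called from transform().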
Running a sample project
Download this Python/CloverETL integration project.
- Make sure you have Jython installed (both 2.x and 3.x will work with this example).
- Make sure you have a JDK as your default Java environment (Window > Preferences > Java > Installed JREs).
- Your CloverETL project should have the Jython JAR on its build path (right-click the project in Navigator > Properties > Java Build Path > Libraries: jython-standalone-x.y.z.jar should be there)
- Open graph/PythonIntegration.grf and Run it.
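When you run the graph, you may see the following message:
Failed to install '': java.nio.charset.UnsupportedCharsetException: cp0
It is a known bug in Jython for Python 2.7 and 3.4 (http://bugs.jython.org/issue2222). You can deal with it in one of the following ways: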
- Ignore the message (recommended)
- Set Default VM arguments to “-Dpython.console.encoding=UTF-8” (in Window > Preferences > Java > Installed JREs, select the current JDK and click Edit)
- Or use an earlier version of Python
Reformat(Python) Subgraph Explained
As you can see, I’ve wrapped the Reformat with PythonBridge into a reusable subgraph called Reformat(Python). This way I get not only a neat icon for the component but also a user-friendly interface for setting the PythonScriptURL and PythonScript parameters, which makes reusing the “Python-enabled Reformat” much more transparent: you just provide the Python script via the parameter!
More on PythonBridge
If you wonder what PythonBridge actually is: it’s a Java class that we created specifically for use with the Reformat component. If you’re familiar with Java, you can adapt it to your needs. Keep in mind that it’s not a standard part of CloverETL.
How does it work?
- First, it looks for graph/subgraph parameters PythonScriptURL or PythonScript
- If PythonScriptURL is set, it gets priority over PythonScript (inline script)
- For each record:
  - It creates variable bindings for input_record, output_record and logger
  - It executes the script’s transform() function
- If there’s an error in your Python script, it will fail the transformation and report the error
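The steps above can be emulated in plain Python to make the order of operations concrete. This is only an illustrative sketch and not the actual PythonBridge, which is a Java class that runs the script through Jython. The names run_bridge, read_local_file and ListLogger, as well as the use of dicts for records, are hypothetical stand-ins, and treating PythonScriptURL as a local file path is an assumption.

```python
class ListLogger(object):
    """Hypothetical stand-in for the logger object; keeps messages in a list."""

    def __init__(self):
        self.messages = []

    def info(self, message):
        self.messages.append(message)


def read_local_file(path):
    """Assumption: the script URL is treated as a local file path here."""
    with open(path) as handle:
        return handle.read()


def run_bridge(params, records, read_url=read_local_file):
    """Emulate the described flow for a list of dict-based records."""
    # PythonScriptURL takes priority over the inline PythonScript parameter.
    if params.get("PythonScriptURL"):
        source = read_url(params["PythonScriptURL"])
    else:
        source = params["PythonScript"]

    logger = ListLogger()
    namespace = {"logger": logger}
    # Run the script once so that transform() gets defined in the namespace.
    exec(source, namespace)

    outputs = []
    for input_record in records:
        output_record = {}
        # Rebind the record variables before each call to transform().
        namespace["input_record"] = input_record
        namespace["output_record"] = output_record
        # An exception raised by the script propagates and stops the run.
        namespace["transform"]()
        outputs.append(output_record)
    return outputs, logger


inline_script = """
def transform():
    output_record["Name"] = input_record["Name"].upper()
    logger.info("processed " + input_record["Name"])
"""

outputs, logger = run_bridge({"PythonScript": inline_script},
                             [{"Name": "Ada"}, {"Name": "Linus"}])
print(outputs)  # [{'Name': 'ADA'}, {'Name': 'LINUS'}]
```

In this sketch the script is executed once to define transform(), and only the record bindings change between calls. Whether the real class works the same way internally is not something this example claims.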
To use Python in other CloverETL components, you will need to adapt PythonBridge to match the interface of the particular component. We’ll cover this in a future blog post.
Python is a great tool for quickly implementing complex business logic or advanced analytics procedures. When you combine it with a platform like CloverETL that takes care of automating the pipeline and has built-in functionality for standard data manipulation, you can focus only on solving things that matter and streamline the rest.