Embedding Python Scripts into CloverETL Data Pipeline

An easy learning curve and a rapidly growing ecosystem of libraries make Python (along with R) a favored choice for data prep and analytics. Python is no doubt one of the drivers behind the modern self-service approach of data scientists and business analysts, who can now do much more data massaging on their own, without outside help.

However, I've learned the hard way that doing all your data massaging in Python eventually leads you into the trap of reinventing the wheel. Parsing files, handling exceptions, or scheduling scripts in cron are just a few tedious jobs you shouldn't be doing yourself in the 21st century. Combine your Python skills with a data integration platform like CloverETL and you can focus on writing Python business logic and analytic pieces while leaving the boring (yet necessary) stuff to the platform. CloverETL can take care of data parsing and formatting, connecting to on-premise and cloud data sources, jobflow orchestration, automation, monitoring, scaling out, etc. CloverETL is designed to be the data backbone of an organization, and Python-based analysis and data manipulation can surely be part of it.

Set-up

Let’s say I have some logic that I wrote in Python (let’s call it calculate_age.py - yes, amazingly, it calculates a person’s age from the date of birth!) and I want to use this logic inside CloverETL.

Normally I would have to use the Reformat component and write the logic in CTL or Java. But with the help of Jython – a 3rd party library for integrating Python in Java – combined with the provided PythonBridge class (see below), I can use Python directly within CloverETL!

Simple CloverETL graph to process data with Python scripts.

Configuring Jython
You need to have the Jython library JAR linked to your CloverETL project. Right-click the project in Navigator and select “Properties”. Go to Java Build Path > Libraries and then click “Add JARs” or “Add External JARs” (depending on whether you have the JAR in your project or elsewhere).

Python scripts integration with CloverETL. Jython set-up.

While on the Properties screen, check that you have the Java SE Development Kit (JDK) installed. The JDK is required for the PythonBridge class (see below). If the library is listed as “JRE System Library [jdk1.7.0_xx]”, you’re fine.

Adding Java SE into CloverETL for Python scripts integration.

Python Development in Designer
If you want to create your Python scripts in CloverETL, we recommend installing PyDev, an open-source plugin for Eclipse. It is a Python IDE (integrated development environment) that supports Python editing with features like code completion, refactoring, quick navigation, templates, code analysis and many more.

Python scripts integration with CloverETL

Writing Python Script

We’re writing a simple Reformat transformation in Python instead of the default CTL or Java.

Reformat takes one input record, processes it using the transform() function (this is what we’ll write in Python) and creates one output record (with a potentially different structure).

How does it work?
In the Reformat component we use the PythonBridge class – a custom piece of Java code that delegates Reformat’s transform() function to the Python script.

When writing your Python script, keep the following in mind:

  • You must define the transform() function – the main function for Reformat
  • For each incoming record, transform() is executed once, producing one output record
  • Input and output records are available via the “input_record” and “output_record” variables
  • For logging purposes you can use the “logger” object

This is my python/calculate_age.py script adapted to work as a Reformat transformation:


from datetime import date

def transform():
    logger.info("Python begin")

    # read fields using methods from clover_utils.py
    name = get_string_field("Name")
    surname = get_string_field("Surname")
    birth_date = get_date_field("BirthDate")
    country = get_string_field("Country")

    # run the legacy process
    res = legacy_process_person(name, surname, birth_date, country)

    # set output fields from the legacy process result
    set_field("Name", res[0])
    set_field("Surname", res[1])
    set_field("Age", res[2])
    set_field("FromUSA", res[3])
    logger.info("Python end")

def legacy_process_person(name, surname, birth_date, country):
    # approximate age in whole years (leap days are ignored)
    diff = date.today() - birth_date

    return [
        name,
        surname,
        diff.days // 365,
        country == "United States of America",
    ]

Notice the strange functions such as get_string_field(), get_date_field(), set_field(), etc. These are defined in python/clover_utils.py and are just simple data access functions working with the “input_record” and “output_record” objects.
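
To give a feel for what such data access helpers could look like, here is a minimal sketch. It is not the actual clover_utils.py from the sample project. It assumes the record objects expose getField(name) returning a field with getValue() and setValue(), and it uses the hypothetical stand-ins FakeField and FakeRecord so that it runs outside CloverETL.

```python
from datetime import date


class FakeField(object):
    """Hypothetical stand-in for a record field: holds one value."""

    def __init__(self, value=None):
        self._value = value

    def getValue(self):
        return self._value

    def setValue(self, value):
        self._value = value


class FakeRecord(object):
    """Hypothetical stand-in for input_record / output_record."""

    def __init__(self, **values):
        self._fields = dict((k, FakeField(v)) for k, v in values.items())

    def getField(self, name):
        # Create the field on first access so an output record can start empty.
        return self._fields.setdefault(name, FakeField())


# Inside CloverETL these two names would be bound by PythonBridge.
input_record = FakeRecord(Name="Ada", BirthDate=date(1990, 5, 17))
output_record = FakeRecord()


def get_string_field(name):
    """Read a field from the input record as text (None stays None)."""
    value = input_record.getField(name).getValue()
    return None if value is None else str(value)


def get_date_field(name):
    """Read a date field from the input record."""
    return input_record.getField(name).getValue()


def set_field(name, value):
    """Write a value to the named field of the output record."""
    output_record.getField(name).setValue(value)
```

Keeping the helpers this thin means the transform() script never touches the record objects directly, so the same script can be exercised with stand-in records before it is wired into a graph.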

The code goes through three stages:

  1. Reading record values into variables (name, surname, etc.)
  2. Running the legacy_process_person() function, which returns a new set of values
  3. Writing those values back as output (set_field() calls)

Of course, this is a truly basic example, but that’s all the magic!
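
One caveat about the age calculation: dividing the number of days by 365 ignores leap days, so the result can tick over a few days before the actual birthday. If that matters for your data, a calendar-based calculation is a possible replacement. The function name age_in_years and its optional today parameter are my own, not part of the sample project.

```python
from datetime import date


def age_in_years(birth_date, today=None):
    """Return the number of completed years between birth_date and today."""
    if today is None:
        today = date.today()
    years = today.year - birth_date.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years
```

For example, for someone born on 1980-01-01, on 2019-12-25 the days-based formula already gives 40, while age_in_years() still returns 39.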

Running a sample project

Download this Python/CloverETL integration project.

  • Make sure you have Jython installed (both 2.x and 3.x will work with this example).
  • Make sure you have the JDK as your default Java environment (Window > Preferences > Java > Installed JREs).
  • Your CloverETL project should have the Jython JAR on its build path
    (right-click on project in Navigator > Java Build Path > Libraries: jython-standalone-x.y.z.jar should be there)
  • Open graph/PythonIntegration.grf and Run it.
No need to worry if you encounter this error: Failed to install '': java.nio.charset.UnsupportedCharsetException: cp0
It is a known bug in Jython for Python 2.7 and 3.4 (http://bugs.jython.org/issue2222).

Possible solutions:
  • Ignore the message (recommended)
  • Set Default VM arguments to “-Dpython.console.encoding=UTF-8”: go to Window > Preferences > Java > Installed JREs, select the current JDK and click Edit
  • Or use an earlier version of Python

Reformat(Python) Subgraph Explained

As you can see, I’ve wrapped the Reformat with PythonBridge into a reusable subgraph called Reformat(Python). This way I get not only a neat icon for the component, but also a user-friendly interface for setting the PythonScriptURL and PythonScript parameters. That makes reusing the “Python-enabled Reformat” much more transparent: you just provide the Python script via the parameter!

Notice that the PythonScriptURL and PythonScript parameters in the Reformat(Python) subgraph are marked as unused in Outline. That’s not a bug. It’s the PythonBridge Java code that actually uses those parameters, and unfortunately Designer can’t see into the Java code, so it thinks they’re not used anywhere.

More on PythonBridge

If you wonder what PythonBridge actually is: it’s a Java class that we created specifically for use with the Reformat component. Keep in mind it’s not a standard part of CloverETL. If you’re familiar with Java, you can adapt it to your needs.

How does it work?

  • First, it looks for graph/subgraph parameters PythonScriptURL or PythonScript
  • If PythonScriptURL is set, it gets priority over PythonScript (inline script)
  • For each record:
    • It creates variable bindings for input_record, output_record and logger
    • It executes the script’s transform() function
  • If there’s an error in your Python script, it will fail the transformation and report the error
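
The flow above can be modeled in a few lines of plain Python. This is only an illustration of the described steps, not the PythonBridge class itself (which is Java). The names load_script and run_bridge are hypothetical, records are plain dicts, and the script URL is treated as a local file path for simplicity.

```python
import logging


def load_script(script_url=None, script_inline=None):
    """Pick the script source: the URL takes priority over the inline script."""
    if script_url:
        with open(script_url) as handle:
            return handle.read()
    if script_inline:
        return script_inline
    raise ValueError("Set PythonScriptURL or PythonScript")


def run_bridge(records, script_url=None, script_inline=None):
    """Execute the script's transform() once per input record."""
    source = load_script(script_url, script_inline)
    namespace = {"logger": logging.getLogger("python_bridge")}
    exec(source, namespace)  # defines transform() inside the namespace
    results = []
    for record in records:
        # Fresh bindings for every record.
        namespace["input_record"] = record
        namespace["output_record"] = {}
        namespace["transform"]()  # an exception here aborts the whole run
        results.append(namespace["output_record"])
    return results


SCRIPT = '''
def transform():
    output_record["Name"] = input_record["Name"].upper()
'''

print(run_bridge([{"Name": "ada"}, {"Name": "linus"}], script_inline=SCRIPT))
```

Because the script is executed once and only the bindings change per record, anything the script defines at module level (imports, helper functions) is set up a single time rather than for every record in this model.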

To use Python in other CloverETL components you will need to adapt the PythonBridge to match the interface of the particular component. We’ll cover this in some future blog post.

Conclusion

Python is a great tool for quickly implementing complex business logic or advanced analytics procedures. When you combine it with a platform like CloverETL that takes care of automating the pipeline and has built-in functionality for standard data manipulation, you can focus only on solving things that matter and streamline the rest.
