Tag Archives: connectivity

Versatility of the File URL Attribute Used in Readers of CloverETL

CloverETL allows users to read several different kinds of files. These files may have various formats, they can be located on a local or remote computer, they can be accessed through a proxy, and they can also be compressed into zip, gzip, or tar archives. Users can also read data from the Console, from an input Port, or from a selected Dictionary entry.

A File URL must be specified using the URL dialog.

File URL Dialog

  • Workspace view tab serves to specify files within the workspace independently of whether the workspace belongs to CloverETL Project or CloverETL Server Project.
  • Local files tab displays the file structure of a local computer.
  • Remote files tab displays the file structure of a remote computer. It allows the user to specify the protocol, username, password, port, server, proxy, username for proxy, password for proxy, identification of proxy server, and the port for the proxy. After specifying these properties, the file structure of the remote computer is displayed (except for http and https protocols).
  • Port tab displays the input fields that are string, byte, or cbyte data types and allows the user to select one of them and also select a processing type from a combo box
  • Dictionary tab displays declared dictionary entries and allows to select one of them and choose the processing type from a combo

Now I will present the list of supported values of the File URL attribute.

Local Files (without compression)

Examples:
/path/file1.txt – reads one file
/path/file1.txt;/path/file2.txt – reads two files in one directory (semicolon separates files that will be read one after another)
/path1/fileA.txt;/path2/fileB.txt – reads two files in two directories (semicolon separates files that will be read one after another)
/path?/file*.txt – reads files in directories, when both the directories and the files must match the specified pattern.
/path/* – reads all files in the specified directory

Local Files (with compression)

Examples:
zip:(/path/file.zip) – reads the first file added to the zip archive
zip:(/path/file.zip)#innerfolder/innerfile.txt – reads the innerfile.txt contained in the innerfolder which has been compressed into the specified zip archive.
zip:(/path/file??.zip)#innerfolder?/innerfile*.txt – reads files contained in the innerfolders which have been compressed into the specified zip archives (each of these files, innerfolders, archive files must match their respective pattern)
gzip:(/path/file.gz) – reads the file compressed in the gzip archive
gzip:(/path/file??.gz) – reads the files compressed into specified gzip archives (each of these archives must match specified pattern)
tar:(/path/file.tar) – reads the first file added to the tar archive
tar:(/path/file.tar)#innerfolder/innerfile.txt – reads the innerfile.txt contained in the innerfolder which is compressed in the specified tar archive.
tar:(/path/file??.tar)#innerfolder?/innerfile*.txt – reads files contained in the innerfolders which have been compressed in the specified tar archives (each of these files, innerfolders, archive files must match their respective pattern)
zip:((zip:/path/file*.zip)#innerfolder/innerfile.zip)#innermostfolder??/innermostfile*.txt –reads innermost files contained in the innermostfolders which have been compressed into the specified innerfile zip archive contained in the innerfolder which has been compressed into the specified external zip archives (each of these innermostfiles, innermostfolders, external zip archives must match their respective pattern) Remember that innerfile.zip and innerfolder may not contain wildcards.

Remote Files  (without compression)

Unlike locally stored files, files on remote computers are accessible using a set of supported protocols. Sometimes it is also necessary to use a proxy server.

The following protocols are supported for accessing a remote server: sftp, ftp, ftps, http, https.

Access without proxy:

The structure of all remote files that are accessible directly, without a proxy, is as follows:
protocol://username:password@serverpassword@server :port/(whole|relative)path/file

Here, the whole path should be used for the sftp protocol, the other four protocols use relative paths.

Examples:
sftp://johnsmith:mypassword@myserver/home/johnsmith/relativepath/filename.txt
ftp://johnsmith:mypassword@myserver/relativepath/filename.txt
ftps://johnsmith:mypassword@myserver/relativepath/filename.txt
http://johnsmith:mypassword@myserver/relativepath/filename.txt
https://johnsmith:mypassword@myserver/relativepath/filename.txt

In the patterns shown above, username, password, and port may be ommitted if possible, whereas the other parts of such File URL are required.

Example (with username, password, and port ommitted):

http://myserver/relativepath/filename.txt

Access through proxy:

The structure of all remote files that are accessible through a proxy is as follows:
protocol:(proxy:proxyuser:proxypassword@proxyserver:proxyport)//username:password@server:port/(whole|relative)path/file

or with SOCKS V4 or V5 proxy:
protocol:(proxysocks:proxyuser:proxypassword@proxyserver:proxyport)//username:password@server:port/(whole|relative)path/file

Example:
ftp:(proxy:proxyuser:proxypassword@proxyserver:proxyport)//johnsmith:mypassword@myserver/relativepath/filename.txt

Also in this case, proxyuser, proxypassword, and proxyport can be ommitted if possible; the other parts of this pattern are required.

With SOCKS V4 or V5 proxy an example follows:
ftp:(proxysocks:proxyuser:proxypassword@proxyserver:proxyport)//johnsmith:mypassword@myserver/relativepath/filename.txt

Remote Files (with compression)

Remote File URLs may also be combined with archiving protocols in a similar manner to local File URLs.

Example:
zip:(ftp://johnsmith:mypassword@myserver/relativepath/myarchive.zip)#innerfolder/filename.txt

Wildcards may also be used in a similar way:

Example:
zip:(ftp://johnsmith:mypassword@myserver/relativepath/myarchive*.zip)#innerfolder??/filename?.txt

Note:

Remember that http and https protocols do not support wildcards in top level files or archives.

Console Input

File URL for Console input will be: – (hyphen character)

User types the input into Console after the graph starts, types data separated by field delimiters, presses Enter to specify end of records, and finishes the input after the last record by pressing Ctrl+Z.

Input Port Reading

CloverETL also supports reading incoming data through the input port of some Readers. Metadata connected to the input port must contain at least one field of string, byte, or cbyte data type. The user selects the field from which data should be read and parsed according to the output metadata. Three processing types can be selected in CloverETL:

  • discrete (the default value)
  • stream
  • source

File URL pattern is the following:

port:$0.fieldname:discrete|stream|source

Discrete processing type:

When the processing type is discrete, each record is parsed separately, according to the output metadata.

Example:
port:$0.customer:discrete

Note:
The colon and the word discrete can be ommitted.

Example:
port:$0.customer

Stream processing type:

When processing type is stream, all records are concatenated and parsed according to the output metadata. If input metadata contains a null value, this null means eof and separates groups of records. All records before such a null are concatenated, but separately from all records after such a null, which are also concatenated into another data source.

Example:
port:$0.customer:stream

Source processing type:

Example:
port:$0.file:source

When processing type is source, values of the selected field ($0.file) are valid URLs. The Reader to which such input is connected takes the file accessible with this URL and reads the contents. Metadata on the output must match the structure of the files specified with the help of these URLs.

Dictionary Entry Reading

Dictionary tab allows the selection of one of the graph dictionary entries. The processing type in the combo box should also be specified.

File URL pattern is:
dict:myentry:discrete|source

Discrete processing type:

Example:
dict:customer:discrete

Reads contents of dictionary entry whose name is customer.

Source processing type:

Example:
dict:file:source

When processing type is source, the value of the selected dictionary entry (file) is a valid URL. The reader with this File URL takes the contents of the file accessible with the help of this dictionary entry and reads the file contents. Metadata on the component’s output must match the structure of the file specified with the help of this URL.

Connecting to JMS Queue on Glassfish v2 with CloverETL

JMS (Java Message Service) is a Java Enterprise Edition (EE) technology used to allow messages to be sent in a decentralized manner between web clients or end-users. In order to check for new messages or to send outgoing ones, a Java program must successfully connect to the JMS queue in order to proceed with the application. CloverETL graph can be driven or can send such messages. You can connect to any JMS broker from CloverETL using JMSReader or JMSWriter component. In this article you can find instruction, how to create JMS Connection and how to configure the graph to connect to JMS queue on Glassfish application server from CloverETL.

Let’s suppose we’ve defined JMS connection factory test on the Glassfish application server as follows:Jms Connection Factory settings:

and JMS queue jms/testQueue:

JMS Queue settings

Now we need to define JMS connection in CloverETL graph:

JMS Connection in CloverDesigner

“Connection factory JNDI name” and “Destination JNDI” refers to the objects we have just created on our application server.

Library appserv_rt.jar contains com.sun.enterprise.naming.SerialInitContextFactory class. When you add the library to the connection settings, you can choose the factory class from the combo box. But when you know the class name, you don’t need to do it, as the Glassfish server has this library on the class-path or you will add it to the project class-path (see below). Remove the library after connection edition; having it in connection definition and in the class-path can lead to unpredictable errors.

If the graph is running on CloverETL Server within the same Glassfish server instance, you don’t need any further settings as all needed jar files are contained in application server.

When you run the graph as a “client” (from CloverETL Designer, command line or other CloverETL Server) you need to set following jars on the jvm class path:

  • /lib/appserv-rt.jar
  • /lib/appserv-admin.jar
  • /lib/javaee.jar
  • /lib/j2ee.jar
  • /imq/lib/imqjmsra.jar

(You find them in your Glassfish home directory).

In CloverETL Designer you can do it by selecting the project. After right click go to Properties –> Java Build Path:

Project properties

You can add the libraries with “Add JARs…” button if you have copied the files to the Workspace or with “Add External JARs…” if the files are outside the Workspace.

You also need to set the -Dorg.omg.CORBA.ORBInitialHost property if you run the graph on the different host and the naming service port property, if necessary. The default naming service port in the app server is 3700. If the naming service is running on a different port, you’ll need to set it via the -Dorg.omg.CORBA.ORBInitialPort property. You can double-check the actual naming service port for a given server instance by looking in the server instace’s domain.xml for “orb-listener-1″. Alternatively, check the Configuration –> ORB –> IIOP Listeners –> orb-listener-1 in Glassfish Administration Console.

To set these properties, when running the graph from CloverETL Designer, you need to create new graph launch configuration:

  1. Go to Run –> Run Configurations
  2. Create new CloverETL graph configuration
  3. Switch to Arguments tab and set the properties:CloverETL graph run configuration

Writing to Infobright Database

In response to customer requests we created a new component that writes data into Infobright software, a very popular column-oriented relational database. The CloverETL’s just released version 2.9 offers this new InfobrightDataWriter component.

Infobright is a highly compressed column-oriented database, based on MySql engine. In this database data are stored column-by-column instead of more typically row-by-row. There are many advantages of column-orientation, including the ability to do more efficient data compression and allowing compression to be optimized for each particular data type. The higher efficiency can be achieved because each column stores a single data type as opposed to rows that typically contain several data type. However the main purpose of Infobright software is to deliver a scalable data warehouse database optimized for analytic queries.

Our new InfobrightDataWriter component allows to easily load data into Infobright database directly from CloverETL transformations. It can be used both with Infobright Community Edition (ICE) as well as with Infobright Enterprise Edition (IEE). With IEE it is possible to use fast data loading in binary format. In binary format particular rows aren’t separated by any special character and there are no values’ delimiters or qualifiers either.

The connector uses named pipe for data transfer. In Linux systems it is system pipe, on Windows you need native library placed on the Java library path. The library is part of  infobright-core-vX_X package that can be downloaded from Infobright site (Download Contributed Software).

For remote load it is necessary to start Infobright remote load agent on the server where Infobright is running (part of infobright-core-v3).

Example graph:

 

InfobrightDataWriter configuration:

The above graph loads data from selected input fields (“Clover fields” attribute) to database table test (“Database table” attribute) on remote server. It uses fast binary method of data loading (“Data format” attribute). Data loaded into database can be sent to output port. Input metadata has free format but the output one has to meet certain conditions:

  • comma as a field delimiter
  • system new line ('n' for Linux, 'rn' for Windows) as a record delimiter
  • date/time formatting: yyyy-MM-dd HH:mm:ss (can be date only, time only or both)

As we load data to the remote server, the Infobright agent must run on the server. To start the agent we need to call:

java -jar infobright-core-3.0-remote.jar -p 6666 -l all

The port the agent is listening to must be the same as the agent port set on the component (“Remote agent port” attribute).

Full description of the component can be found on CloverETL’s wiki pages (InfobrightDataWriter component).

DataDirect’s OracleDB JDBC Driver Speed Test

Purpose

Compare the speed of data loading into Oracle database (Oracle Database 11g Release 11.1.0.6.0 – Enterprise Edition) with Oracle corp. JDBC driver, DataDirect JDBC Oracle driver and direct data loading (OracleDataWriter component – sqlldr utility) in CloverETL.

 

Test description

 

Graph used for testing:

DDdbLoad.grf

The above graph loads data into database table that contains 3 number columns and 127 varchar columns.

Database table for storing data is truncated before each data loading – DBExecute components, each with query: TRUNCATE TABLE dd_test1 REUSE STORAGE

Phase 1: loading data with DDBulkLoad (DataDirect) object from csv file (loader.load(file))

Phase 3: loading data with DDBulkLoad (DataDirect) object from ResultSet (loader.load(resultSet)) – created ResultSet implementation, that reads data from DataRecord (read from the edge).

Phase 4: loading data with DBOutputTable with Oracle corp’s  JDBC driver:

Manifest-Version: 1.0
Specification-Title:    Oracle JDBC driver classes for use with JDK14
Sealed: true
Created-By: 1.4.2_08 (Sun Microsystems Inc.)
Implementation-Title:   ojdbc14.jar
Specification-Vendor:   Oracle Corporation
Specification-Version:  Oracle JDBC Driver version - "10.2.0.1.0XE"
Implementation-Version: Oracle JDBC Driver version - "10.2.0.1.0XE"
Implementation-Vendor:  Oracle Corporation
Implementation-Time:    Wed Jan 25 01:28:31 2006

Phase 7: loading data with DBOutputTable with DataDirect Oracle JDBC driver – enabled bulk load feature

Phase 9: loading data with OracleDataWriter component from csv file (sqlldr utility)

Phase 11: loading data with OracleDataWriter component from edge (sqlldr utility)

Test processing

Graph run 3 times for 10,000,000 records with default DataDirect settings.

Graph run 3 times for 1,000,000 records with default DataDirect settings.

Graph run 3 times for 1,000,000 records with following settings:

 

Test results

Results in seconds.

1,000,000 records with default DataDirect settings:

Phase 1: 178, 167, 132 – min: 132, max: 178, average: 159

Phase 3: 128, 166, 152 – min: 128, max: 166, average: 149

Phase 5: 228, 246, 290 – min: 228, max: 290, average: 255

Phase 7: 176, 170, 239 – min: 170, max: 239, average: 195

Phase 9: 44, 45, 56 – min: 44, max: 56, average: 48

Phase 11: 104, 95, 106 – min: 95, max: 104, average: 102

1,000,000 records with custom settings:

Phase 1: 163, 152, 142 – min: 142, max: 163, average: 152

Phase 3: 166, 133, 134 – min: 133, max: 166, average: 144

Phase 5: 278, 263, 260 – min: 260, max: 278, average: 267

Phase 7: 239, 172, 209 – min: 172, max: 239, average: 207

10,000,000 records with default DataDirect settings:

Phase 1: 1553, 1818, 1352 – min: 1352, max: 1818, average: 1574

Phase 3: 1475, 1299, 1298 – min: 1298, max: 1475, average: 1357

Phase 5: 3041, 2592, 2550 – min: 2550, max: 3041, average: 2728

Phase 7: 1824, 1623, 1722 – min: 1722, max: 1824, average: 1723

Phase 9: 404, 432, 472 – min: 404, max: 472, average: 436

Phase 11: 1096, 975, 1012 – min: 975, max: 1096, average: 1028

Summary

Loading data was slowest when DBOutputTable with original Oracle corp. driver was used. All loadings with DataDirect driver were faster than with Oracle corp. driver but the usage of DDBulkLoad object (DataDirect) plainly increases the speed of loading data in comparison with setting EnableBulkLoad=true and using DBOutputTable. The results for loading data from csv file and edge (result set) are very similar with slight predomination of ResultSet method. All three methods with DataDirect driver usage, get to more steady execution times with number of records to load.

The fastest way of loading data is unquestionably direct data loading with sqlldr utility. Even when inter-storing data in pipe, the sqlldr utility is about 50% faster than any other method, but is less convenient.
1,000,000 records
10,000,000 records

Processing Data from QuickBase

We have great news for users of on-line database QuickBase from Intuit . CloverETL became a next tool which can be used to manipulate the data in this database. Now you can work with data in QuickBase without restraints and with full power of CloverETL. We introduced several specialized components dedicated for this purpose.

QuickBaseQueryReader serves as a reader of records queried from a table. You only need to specify a table name which you want to read from and a query in QuickBase query language. It is that simple. The second reader, QuickBaseRecordReader, is even simpler. Again, you specify a table name and instead of the little bit confusing query, you just specify the particular record identifiers you require.

Data writing is provided also by a set of two components – QuickBaseRecordWriter and QuickBaseImportCSV. First one is a complex writer of incoming data records to a given table and second one is a bit simpler, however much faster bulk data loader to QuickBase.

So we have prepared the whole family of components fully supporting the data manipulation in QuickBase – do not hesitate and try it.

Upcoming 2.7 Release of CloverETL – Faster Sorting of Data and Improved Reading Data

As of today (Mar 31st), Clover Engine 2.7 branch has been created and the testing/QA process has started. Within approx 2 weeks, brand new version of CloverETL will be ready. It brings many small new features and bug fixes, but also several significant improvements – mostly in speed.

The aging ExtSort component is being replaced by new FastSort, which can bring up to 2.5 times the performance of old ExtSort. I am sure, there will be special post on this blog by FastSort’s developer Pavel Najvar, who will explan in detail where he found those hidden 250% of speed.

There are also speed improvements in our Universal Data Reader (reader of text data, delimited or fixed). We thoroughly profiled its code and were able to find 20-25% of additional speed. This puts us even farher in front of competition !