LRpt - Perl extension for comparing csv files and reporting their contents.


NAME

LRpt - Perl extension for comparing csv files and reporting their contents.


SYNOPSIS

  ###################################################################
  #
  # Simple report
  #
  ###################################################################
  
  lks.pl --keys=keys.txt selects.txt > sel_subs.txt
  
  lcsvdmp.pl --conn_file=conn_file.txt --path=data sel_subs.txt 
  
  lrptxml.pl --rldf=rldf.xml --selects=sel_subs.txt --keys_file=keys.txt 
             data > report.xml
  
  
  ###################################################################
  #
  # Diff report
  #
  ###################################################################
  
  lks.pl --keys=keys.txt selects.txt > sel_subs.txt
  
  lcsvdmp.pl --conn_file=conn_file.txt --path=data_state1 sel_subs.txt
  
  ... some actions on the database...
  
  lcsvdmp.pl --conn_file=conn_file.txt --path=data_state2 sel_subs.txt
  
  lcsvdiff.pl --all --keys_file=keys.txt data_state1 data_state2 
             > diffs.txt
  
  lrptxml.pl --diffs --rldf=rldf.xml --selects=sel_subs.txt 
             --keys_file=keys.txt diffs.txt > report_aft.xml
  
  
  ###################################################################
  #
  # Expected-actual diff report
  #
  ###################################################################
  
  lks.pl --keys=keys.txt selects.txt > sel_subs.txt
  
  lcsvdmp.pl --conn_file=conn_file.txt --path=data sel_subs.txt
  
  lcsveadiff.pl --keys_file=keys.txt --expectations=exp.xml
            --cmp_rules=cmp_rules.xml data > eadiffs.xml


INSTALLATION

You can get the most recent code from lreport project on source forge http://lreport.sourceforge.net. You are given 2 options - you can either download a ppm active state package or a CPAN package.


DESCRIPTION

LRpt (LReport) is a set of tools for database row sets comparison and reporting. The core logic operates on csv files, however, LReport also provides tools for converting select results to csv files.


HOW IS THIS MANUAL ORGANIZED

First part of this document is a reference manual.

If it's your first time with LReport, I recommend you to start with a a stroll through LReport section of this document. Then have a look at 2 additional examples.


REFERENCE MANUAL

In this and following chapters you will find a more systematic treatment of LReport. If you haven't read examples above, I recommend you at least to skim over them. They will help you to understand concepts explained below. When appropriate, you are given links to other documentation pages.

LREPORT SUITE

LReport consits of the following tools:

lks.pl
Tool for replacing where key place holders in select files. You can read more about it in man page for LRpt::KeySubst.

lcsvdmp.pl
Tool for dumping results of a group of selects to csv files. You can read more about it in man page for LRpt::CSVDumper.

lcsvdiff.pl
Tool for comparing 2 sets of csv files. You can read more about in in man page for LRpt::CSVDiff.

lcsveadiff.pl
Tool for comparing a group of csv files with expectations. You can read more about it on a man page for LRpt::CSVEADiff.

lrptxml.pl
Tool for creating and xml report from results generated by lcsvdmp.pl and lcsvdiff.pl. You can read more about it on a man page for LRpt::XMLReport.

COMPARISON TYPES

There are 2 types of possible comparisons:

Before-after comparison
Expected-actual comparison

BEFORE-AFTER COMPARISON

A set of selects is executed twice - before and after some transaction, which could potentialy change the data. For each select, rows returned before change are compared with rows returned by the same select after the change. When differences are found, they are reported.

The following differences can be reported:

missing row
The row, which existed before a transaction does not exist any more.

additional row
The row did not exist before a transaction and appeared after the change.

not equal
The row existed before a transaction, still exists after it but some fields of the row are modified.

In order to determine a direct relationship between before and after rows (i.e. to determine which after row should be compared with which before row) a row key is used. A row key is an equivalent of primary key in a database table. It has to be unique and may consist of many columns.

During comparison for particular select, LReport sorts after and before rows by a row key (after and before rows are sorted separately). Then it iterates through all keys found in both collections of rows (before and after rows). If for example before collections have 3 rows with keys (01,AB), (02,GT), (02,JT) and the after collection has rows with keys (01,AB), (03,KP), then LReport will iterate through a following list: (01,AB), (02,GT), (02,JT), (03,KP). So it's basically a sum of sets.

For each value of a row key, LReport finds an after and before row with this key value. If there is no such row in before collection, that means that a new row was created. So it is reported as additional row. If there is no such row in after collection, that means that a row was deleted. It is then reported as missing row. If both after and before row exist, they are compared field by field. If any difference is found, the row is reported as not equal. Otherwise rows are considered equal.

EXPECTED-ACTUAL COMPARISON

NOTE: This functionality is experimental.

This type of comparison may be used for test automation. Sometimes you can create a static set of expected data, which you can use for comparison with actual data. There are many cases, however, when you cannot create static files with expectations, because some column values depend on a date, environment and some circumstances, which are difficult to recreate. So instead, you can implement a script/program, which will predict what could be the output of a test and then generate expectations.

Those expecations may be then compared with results of selects done on actual data. Although it may seem that there is nothing special in expected-actual comparison and it is a slightly modified version of before-after comparison, this is not the case. There are some nuances, which forces an introduction of some new notions and concepts.

In general, expected-actual comparison uses the same core logic as before-after comparison. It parses the expectations and builds collections of expected rows and then compares them with collections of actual rows.

The first big difference is that not all columns are defined in expectations. It makes sense. We often deal with rows, which consist of 40 columns, but we are only interested in 10 of them. It would be infeasible to force a user to define expected values for all those 40 columns. So the user can decide not to specify them.

From the other hand, it would be good to have a protection mechanism, which would warn a user when a really important column has no expectation defined. This is where comparing rules come to the rescue.

Comparing rules are defined for a particular select. They define what to do if an expected value for a given column in not specified. You can make the application terminate, do nothing, warn, use default value. You can find details in COMPARING RULES chapter.

Another big difference is a direct consequence of the previous one. If you are allowed to not specify all columns values, it may happen that a value for column belonging to a row key is not defined. As a result a value of a row key for this row cannot be calculated. So the mechanism of before-after comparison cannot be used in this case. The solution is provided by unkeyed rows and column matching. It is explained below.

The following differences are reported:

missing row
The row was expected to exist but doesn't.

additional row
The row exists even though it is not expected.

not equal
The row exists as expected but some fields contain values different from expected.

unmatched
This type of difference is used for rows, for which row keys are not defined in expectations (due to not giving a value for one more columns belonging to a row key). Such rows are called unkeyed rows

If case of unkeyed row (missing values in columns which are parts of a row key), rows can not be compared on a row by row basis since row key is the only information allowing to determine a direct relationship between rows. Hence, the following solution was developed for such rows:

Each unkeyed row defined in expectations is tried to be matched with all its fields with any of actual rows. If there is no such match, the row is reported as unmatched.

The detailed rules are as follows:

  1. Expectations rows with row keys values defined (keyed rows) , have priority.
  2. All actual rows, which are matched with keyed expectation rows are excluded from further comparison.

  3. If all keyed expectation rows are processed, unkeyed expectation rows are processed
  4. From all unkeyed rows, rows having the biggest number of column values defined are taken first for processing
  5. For each unkeyed row, LReport tries to find a remaining actual row (not matched earlier by any other exectation row), whose columns' values match with all column values in expectation row.
  6. If such a row is found, the row is considered matched. If there is no such actual row, the expectation row is reported as unmatched. In addition, the result of comparison with the closest match is reported (see below).

Reporting of matching and unmatching has some additional specifics. Since row key is not defined, it is possible to define a set of identical expectation rows. LReport catches those identical rows and count them. When it reports unmatching rows, it groups identical rows and shows how many of them were not matched. If for example, expectations rows for a given select are

  1, ABC, ACT
  1, ABC, ACT
  1, ABC, ACT

and only 2 of them have been matched with actual rows, row (1, ABC, ACT) is reported to be unmatched once.

You can find details on configuring and using expected-actual comparison (with a step by step example) in manual for LRpt:CSVEADiff.

SELECTS

LReport does not operate on database tables but on select results. Single csv file contain results of one select statement. If the statement is select * from table then it is actually the whole table. But this is a specific example. In general, csv file contain results from one select.

SELECT NAMES

In order to manage select results, choose a proper row key, report format etc. each select should be given a name. In some simple cases it is not required to give select a name, but it is recommended to use a meaningful name whenever possible.

Select name is used for following purposes:

Creating csv file name
Results for each select are stored in a file select_name.ext, where ext is an csv file extension.

Choosing a row key
When a row key is defined, it contains information about names of selects to which it applies

Choosing a report layout
A report layout contains a name of a select, to which it applies

SELECT FILES

Selects are defined in files called select files. They use jar record format. Definition of each select ends with a line beginning with %% with white spaces after it.

Definition of each select consists of 2 fields: name and select. Their meaning is as follows:

name
Name of a select used for csv file creation, row key choosing and so on.

select
Full text of a select. The text can be divided on several lines, no continuation characters are needed on end of lines. The only restrictions you have to obey are:

* No line of a select text may start with string: or %%. Adding single space in front of them is enough to get rid of this limitation.

REPORTING AND FORMATING

LReport provides some facilities for generating nicely formatted reports of selects results and found differences. It does that by generating xml reports, which can be then transformed to some document format using XSLT.

LReport package in the source forge also contain XSLT stylesheet for converting xml report to an RTF document.

Reporting is a wide topic by itself so it is not described in details here. You can read more on reporting in LRpt::XMLReport.

ROW KEYS

The row key is more or less an equivalent of primary key in database theory. It defines a set of columns, which values uniquely identify each row.

Row key is crucial in before-after comparison, since it determines the direct link between before and after row.

In its code LReport treats row key value as a string. If a row key consists of one column, the key value for a row equals the value of the specified column from this row.

If a row key consists of several columns, then <LReport> creates key value by joining values from those columns, using # as a separator. So if values from columns are 3, Kaluski and <20050601>, then the key value will be 3#Kaluski#20050601. Using a column, which values may contain # as a row key part, may lead to unpredicted affects. It would be good to be able to define, what should be the row key value separator. But it is currently not implemented. The hash character # was chosen for the following reasons:

* It is low in characters order. So it not should disturb in string comparisons. The strings abc#1234 and abcd#1234 are compared correctly. abc#1234 is less then abcd#1234. If # would be high in characters order (having ascii code 123 for example), then the comparison above would consider abc#1234 to be greater then abcd#1234

* I found it very unlikely to use such a character in columns which are parts of unique keys.

Row keys values are treated as strings, so they are compared as strings. Therefore keys value 2 is greater then 1000. In order to achieve correct numerical comparison, column value has to be left padded with zeros. 0002 is lower then 1000.

There are several ways of defining a row key. We can define it in a file, give it at command line and use default key. Currently it is recommended to use only a file method. I am still working on the most convinient and expressive way of defining row keys in command line, so things may change. The default row key is a first column of a table/select results.

ROW KEY FILE

Row key file is in jar record format. Definition of each key ends with a line beginning with %% with white spaces after it.

Definition of each key consists of 2 fields: select_name and key_columns. Their meaning is as follows:

select_name
Name or names (comma separarated) of selects, which should use this key. This why giving names to all used selects is important.

key_columns
Comma separated list of columns, which constitute a row key. Column names should be given in the order, in which they appear in the key. In order to left pad with zeros a given column, containing numeric value, the field length has to be specified after the column's name. The field length has to be given after a colon after a column name (no spaces in between). See example below.

Example of a row key file:

  select_name: customer
  key_columns: customer_id:4
  %%
  select_name: service, service_history
  key_columns: customer_id:4, service_type
  %%

As you can see one row key may be defined for several selects. Note that values of customer_id column will be left padded with zeros to create 4 character string.

FORMAT OF CSV FILES

It's an ordinary csv file. LReport expects that a first line of each such a file contains columns' names. Default field separator is tab. You can read more on defaults and changing them in DEFAULTS AND ENVIRONMENT VARIABLES section.

NOTE: csv files used as input to lcsvdiff.pl have to be sorted by a key used in comparison.

COMPARING RULES

Comparing rules are used in expected-actual comparison. They are used to determine what to do when expectations for a given column are not specified. Currently, the following possibilities are available:

skip_and_warn
Column is important, but not so important to stop the processing. A warning is printed and the comparison goes on

die
Column is so important, that further comparison does not make any sense. LReport (actually lcsveadiff.pl) terminates.

use_default
If column value is not given a default value is used.

use_default_and_warn
If column value is not given a default value is used and a warning is logged

skip
[default] Column is not important. It is not a problem if expectations are not defined. The comparison goes on

You can read more about comparing rules in man page for LRpt::CSVEADiff.

SELECT TEMPLATES

A select template is an additional feature provided by LReport. It's usefull in cases when a user executes the same select or set of selects many times but for different values in a where clause.

EXAMPLE

We would like to run the selects from previous example. We want to run it not only for customer_id = 1234 but also for customer_id = 1334 and customer_id = 3476. And maybe some others in future.

We can use a select template here. We define the following file (let's call it sel_tmpl.txt):

  name: customer
  select: select * from customer where --customer_id--
  %%
  name: service
  select: select * from service where --customer_id--

Then we have create a where key file (wkey.txt):

  name: customer_id
  key: customer_id = 1234
  %%

When we now run lks.pl:

  lks.pl --keys=wkey.txt sel_tmpl.txt

Meaning of parameters is as follows:

--keys
A file containing a text, with which a where key placeholder should be replaced.

sel_tmpl.txt
A file containing selects templates

it will print on standard output following lines:

  name: customer
  select: select * from customer where customer_id = 1234
  %%
  name: service
  select: select * from service where customer_id = 1234
  %%

lks.pl will basically replace all entries --entry-- from sel_tmp.txt with a proper definition from wkey.txt. Should we run this to selects for customer_id = 1334, we have to modify wkey.txt only. Output generated by lks.pl can be fed as input to lcsvdmp.pl (in --select option).

You can read more on how lks.pl work in man page for LRpt::KeySubst.

Also have a look at SYNOPSIS.

WHERE KEY FILES

Where keys are defined in files called where key files. They use jar record format. Definition of each key ends with a line beginning with %% with white spaces after it.

Definition of each select consists of 2 fields: name and select. Their meaning is as follows:

name
Name of a key. Has to match with a placeholder in a select template. Where key named abc will replace a placeholder --abc-- in a select template

key
A text with, which a key placeholder in a select template should be replaced. The text can be divided on several lines, no continuation characters are needed on end of lines. The only restrictions you have to obey are:

* No line of a select text may start with string: or %%. Adding single space in front of them is enough to get rid of this limitation.

WORKING WITH DATABASES

Core logic of LReport operates on csv files. However, before we can operate on them, we have to create them first. A tool for creating csv files from select results is lcsvdmp.pl and its usage is explained in example 2.

DATABASE CONNECTION FILE

A database connection file contains instructions on how to connect to the database. It is used by lcsvdmp.pl. As you know each database driver (DBD) has a slightly different flavour. So the syntax of the connection string for Sybase is slightly different from that for Oracle, and is slightly different from Mysql and so on. In order to make lcsvdmp.pl flexible and be able to deal with any source for which DBD exists, I passed the responsibility of opening a connection on a user. Connection file should contain a perl code, which opens a connection and assigns a reference to datbase handle to a variable named $dbh. lcsvdmp.pl will load contents of connection file and eval it.

This an example of opening a connection to ODBC source. As you can see you can even include additional configuration code there:

  use Win32::ODBC;
  my $DBName     = "INVOICES";
  my $DBServer   = "MYSERVER";
  my $DBUser     = "pkaluski";
  my $DBPassword = "password";
  
  no strict;
  Win32::ODBC::ConfigDSN( ODBC_CONFIG_DSN, 
                          "Sybase ASE ODBC Driver",
                          "DSN=DBSybaseSource",
                          "Database=$DBName",
                          "InterfacesFileServerName=$DBServer");
  use strict;
  my $error = Win32::ODBC::Error();
  if( $error ){
      die $error;
  } 
  $dbh = DBI->connect( "DBI:ODBC:DBSybaseSource", $DBUser, $DBPassword,
                       {RaiseError => 1, AutoCommit => 1}
                     );

Do not forget about assigning a reference to opened connection to $dbh variable.

SUPPORTED DATABASES

Since LReport core logic operates on csv files, you can use LReport for any database for which you are able to create csv files for select results. If you use lcsvdmp.pl to do it, you need to have a DBI driver for a database you are working with. LReport uses the following DBI methods:

connect
prepare
execute
fetchall_arrayref

If the DBD for you database supports those methods, lcsvdmp.pl should be able to do its job.

LIMITATIONS AND WARNINGS

LReport's core logic works on csv files. This gives the tool significant flexibility, since it can work with virtually any database. However, such an approach does have some consequences, which you should be aware of.

No information on columns data types.
csv files carry no information on data type of each column. LReport treats all columns as texts. This may give surprising results for numeric data. You would expect that 1 and 1.0 are equal, but for LReport they are different texts

LReport does not support database schema comparison
LReport will help you to detect that a field's value has changes from 1 to 2. It also detect that a particular columns is new or does not exist any more. But it will not detect that column datatype has changed from int to char.

No support for null values
LReport does not see null values, since it is problematic how to put null values in csv file. No value (tabtab) in csv file is treated as an empty string. So basically LReport will not be able to find a difference between field containing a null value and a field containing an empty string.

DEFAULTS AND ENVIRONMENT VARIABLES

LReport has some hardcoded defaults. Each of those defaults can be overriden by environment variable. In some cases, those defaults can be further overriden by command line switches. So the general rule is this:

Check if the command switch specifies a parameter's value. If is does not then
Check if there is an environmental variable, which defines it. If there is no such variable, use hardcoded default.

There are following parameters, which are used by LReport tools. The list below is given in the format: meaning: hardcoded default -> Environment variable -> command line option

csv filename extension:
  txt -> LRPT_CSV_FILE_EXT -> --ext=ext
Field separator in csv files
  tab -> LRPT_CSV_FIELD_SEPARATOR -> --sep=separator
Location in which csv files should be created
  . (current directory) -> LRPT_CSV_FILE_PATH -> --path=path
Database connection file
  conn_file.txt -> LRPT_CONNECTION_FILE -> --conn_file=file.txt
Chunk size
  1000 -> LRPT_CHUNK_SIZE -> --chunk_size=size
Location of global keys file
  keys.txt -> LRPT_GLOBAL_KEYS_FILE --> No command line switch
Separator between different values in diff file
  --#> -> LRPT_DIFF_VALUE_SEPARATOR --> No command line switch

SUPPORTED PLATFORMS

The tool was developed and tested on Windows. I can't think of any reason why it will not work on Unix but it was not tested. Currently it assumes that a name separator in file paths is slash (/) so it probably won't work on VMS and other systems, which do not use slash as a name separator.


DEVELOPMENT RELATED INFORMATION

NOTE: This sections is under construction

If you are brave enough to try to modify the LReport code and the ugliness of the code did not scare you to death, read this chapter. It will provide you with some guidance on how all things are organized. I hope it will make your coding advanture less painfull.

OBJECT/PACKAGE MODEL

All LReport logic is encapsulated in packages. All *.pl tools are only wrappers calling one exported function. For example, lcsvdmp.pl looks as follows:

  use strict;
  use LRpt::CSVDumper;
  
  dump_selects( @ARGV );
  
Some packages are object oriented, some are not.

LRpt provides 4 main objects to perform expected job (see SYNOPSIS):

LRpt::CSVDumper
Dumps results of selects to a csv file. Wrapped by lcsvdmp.pl.

LRpt::CSVDiff
Compares two sets of csv files. Wrapped by lcsvdiff.pl

LRpt::CSVEADiff
Compares set of csv files with expectations. Wrapped by lcsveadiff.pl

LRpt::XMLReport
Converts csv and diff files to xml format. Wrapped by lrptxml.pl

LRpt::KeySubst
Substitutes where keys place holders in select templates.

Apart from those packages there is a bunch of other classes defined. Some of them may appear pretty usefull. Be careful though. I do not consider them public, so I may decide to change them in future. If you find some of them extremely useful, let me know so I will be more careful in my future desing decisions.

LRpt library consists of the following classes:

LRpt::CollDiff
Object for comparing two collections of rows

LRpt::CollEADiff
Object for comparing collection of rows with a set of expectations

LRpt::Collection
Container for data loaded from csv files.

LRpt::CollUnkeyed
Container for unkeyed rows.

LRpt::Config
Object for managing defaults and runtime parameters.

LRpt::JarReader
Object for reading jar records files.

LRpt::RKeysRdr
Object for reading row keys definition files.

Each of this modules have its POD, so if you would like to know more about them, you can go ahead and read it.

REGRESSION TESTS

LReport consists of a bunch of tools. Currently regression tests do not test particular modules separately. Instead there are some end-to-end test scenarios. For each testcase there is set of expectations and the results are compared with those expectations.

As of 10-09-2006 tests are divided into 2 groups: tests of simple reporting and tests of difference reporting. The 't' directory contains 2 directories, 1 for each group - DiffRpt and Report. Let's have a look at DiffRpt directory. It constains the following subdirectories/files:

  dbschema
  dbschema1
  dbschema2
  dbschema3
  t1
  t2
  t3
  t4
  t5
  driver.pl

driver.pl is test driver for diff reporting tests. It contains perl code running scenarious of four tests.

dbschema and dbschema1 are databases. They are based on csv files and are handled by DBD::CVS. They are used to simulate changes in the database. Instead of changing the contents of the database, selects for before state are done on dbschema database and selects for after state are done on dbschema1.

t1 to t4 are directories used by test cases. Let's have a look at t1. The test case code looks like this (it is in t\DiffRpt\driver.pl file):

CODE

On the beginning, t1 contains configuration and input files. It also contains expected directory. It contains a directory structure with files expected to be generated by tested tools. Once you run the test, it will create actual directory in t1. All output files generated by the test case will be placed in the actual directory and its subdirectories. Once the scenario is done, contents of actual and expected directories are compared. If they are the same, that means that the test case generated exactly what we expected. Otherwise, it behaves differently which means that we either introduced a new bug or modified the functionality. We should either remove the bug, or if changes were intentional - modify the test case.


BUGS

As of 15-01-2006 the tool have only one user, which is me. I used it in one environment. You can imagine that there are plenty of undetected bugs which will jump out, once you will start using it for your specific needs. I expect that regression tests may fail on Unix. This is due to end of line incompatibilities.

In case you find a bug, let me know.


TO DO

PERFORMANCE

Thanks to chunking, LReport can dump and compare huge files. Some performance tuning has been already done but there is still some space for improvement.

MANAGING COMPARING RULES

Currently comparing rules (for expected-actual comparison) have to be specified in command lines options. I expect that for intensive use of LReport, it would be convenient to have a common place where all this is defined. LReport tools should look there if nothing is given in the command line.

DOCUMENTATION

This is the first version of a documentation. Some parts of it are still under construction. Some links does not work. I am aware of that. My goal is to provide you with something you can play with. I will keep working on documentation refinements.


A STROLL THROUGH LREPORT

This section is a tutorial, which will guide you through functionalities of LReport. It will start with a fairly simple example, in which basic concepts are presented. And then by incresing examples' difficulty, more and more advanced topics will be presented.

After reading this you should have a good understanding of LReport. You should be able to use it and should feel comfortable when reading documentation.

This tutorial uses several example files. You can find a tarball with them on LReport web site.

VERY SIMPLE CASE

Imagine that you have to compare 2 simple csv files. They are as follows:

before1.txt
 last_name      salary  occupation
 Bush   20000   Clerk
 Kowalski       30000   Programmer
 Nowak  40000   Manager
 Wilson 10000   Clerk
after1.txt:
 last_name      salary  occupation
 Bush   20000   Assistant
 Kos    40000   Manager
 Nowak  40000   Manager
 Wilson 10000   Clerk

NOTE: Just on the very beginning let me tell you one important thing - input csv files MUST BE SORTED BY THE KEY (later you will find out what a key is). If they are not sorted, you may get weird results.

We compare csv files using lcsvdiff.pl

 lcsvdiff.pl before1.txt after1.txt

We will get the following output:

 SCHEMA: last_name       salary  occupation
 UPD( Bush ): occupation: Clerk ==#> Assistant
 INS( Kos )
 DEL( Kowalski )

Let's have a closer look at it, line by line. It says:

 SCHEMA: last_name       salary  occupation

csv file contains rows with 3 columns: last_name, salary and occupation.

 UPD( Bush ): occupation: Clerk ==#> Assistant

The value of column occupation in a row identified by Bush (we will talk about identification soon) has changed from Clerk to Assistant.

 INS( Kos )

Row identified by Kos (last_name column) is new (additional) in after1.txt file.

 DEL( Kowalski )

Row identified by Kowalski (last_name column) from before1.txt does not exist any more (is missing) in after1.txt file.

MEANING OF KEYS

Now we need to compare files, which contain also employees first names:

before2.txt
 first_name     last_name       salary  occupation
 James  Bush    20000   Clerk
 Adam   Kowalski        30000   Programmer
 Adam   Nowak   40000   Manager
 John   Wilson  10000   Clerk
after2.txt
 first_name     last_name       salary  occupation
 James  Bush    20000   Assistant
 James  Kos     40000   Manager
 Adam   Nowak   40000   Manager
 John   Wilson  10000   Clerk

Lets try lcsvdiff.pl again

 lcsvdiff.pl before2.txt after2.txt

This time we get the following output:

 Key is not unique. Some rows from file before2.txt will be lost!!! at C:/Perl/site/lib/LRpt/Collection.pm line 351, <GEN0> line 5.
 Key is not unique. Some rows from file after2.txt will be lost!!! at C:/Perl/site/lib/LRpt/Collection.pm line 351, <GEN1> line 5.
 SCHEMA: first_name      last_name       salary  occupation
 UPD( James ): last_name: Bush ==#> Kos
 UPD( James ): salary: 20000 ==#> 40000
 UPD( James ): occupation: Clerk ==#> Manager

We got 2 warnings and then different and weird result. What has happened?

It's time to talk about keys. When you compare rows (in csv files or database) then you need to define, which column/columns are used to uniquely identify each row. This is so called primary key. It allows you to determine, which rows should be compared with each other, which should be considered missing or additional. Otherwise you could only compare rows basing on their order in both files.

For after1.txt and before1.txt it worked fine, because lcsvdiff.pl assumed a default key - the first column. Values in these column were unique, so it worked well. In the second example, the first column contains first names, which are very likely to be repeated. We have Adam in 2 rows. For this case it would be much better to use columns last_name and first_name as a key. We tell it lcsvdiff.pl about a key a in a following manner:

 lcsvdiff.pl --key_cols=last_name,first_name before2.txt after2.txt

Now it worked fine giving the following output:

 SCHEMA: first_name      last_name       salary  occupation
 UPD( Bush#James ): occupation: Clerk ==#> Assistant
 INS( Kos#James )
 DEL( Kowalski#Adam )

Note the difference between comparison of before1.txt/after1.txt and before2.txt/after2.txt. Previously rows were identified by Bush, Kos and Kowalski. Now, they are identified by Bush#James, Kos#James and Kowalski#Adam. This is a concatenation of values from key columns - last_name and first_name. # is a concatenation character.

IMPORTANT NOTE - remember that input files for lcsvdiff.pl have to be sorted by the key. If they are not sorted, strange results may be returned.

NUMERICAL COLUMNS

Let's try with following files

before3.txt
 empl_id        first_name      last_name       salary  occupation
 1      John    Wilson  10000   Clerk
 9      James   Bush    20000   Clerk
 11     Adam    Kowalski        30000   Programmer
 201    Adam    Nowak   40000   Manager
after3.txt
 empl_id        first_name      last_name       salary  occupation
 1      John    Wilson  10000   Clerk
 9      James   Bush    20000   Assistant
 201    Adam    Nowak   40000   Manager
 5043   James   Kos     40000   Manager

Since empl_id is a unique key, we can again allow lcsvdiff.pl to use default key i.e. first column.

The output is as follows:

 SCHEMA: empl_id first_name      last_name       salary  occupation
 DEL( 11 )
 INS( 5043 )
 DEL( 9 )

What's wrong now? Rows are now identified by empl_id. However, their order in the output seem random. It would be good to have them ordered. The current order is because lcsvdiff.pl treats all keys as strings. Numerically 5043 is bigger then 9. But alphabeticaly it is lower. You have to tell lcsvdiff.pl that empl_id is numeric. lcsvdiff.pl will zero-pad it, so it compares correctly.

 lcsvdiff.pl --key_cols=empl_id:5 before3.txt after3.txt
 SCHEMA: empl_id first_name      last_name       salary  occupation
 UPD( 00009 ): occupation: Clerk ==#> Assistant
 DEL( 00011 )
 INS( 05043 )

--key_cols=empl_id:5 means that column empl_id is numeric so it should be zero-padded to a length of 5 characters.

SKIPPING IRRELEVANT INFORMATION

Sometimes you don't want to compare all values from rows. For example in real life rows often contain date and time of their last modification. Look at the file below (before4.txt):

 empl_id        first_name      last_name       salary  occupation      last_update
 1      John    Wilson  10000   Clerk   2006-01-12 10:15
 9      James   Bush    20000   Clerk   2006-05-22 15:35
 11     Adam    Kowalski        30000   Programmer      2006-03-07 12:10
 201    Adam    Nowak   40000   Manager 2006-11-12 14:23

Imagine that a record for John Wilson has changed twice - his salary changed from 10000 to 8000 but after a month it has changed back to 10000. So the employment data remaind unchanged. after4.txt will look something like this:

 empl_id        first_name      last_name       salary  occupation      last_update
 1      John    Wilson  10000   Clerk   2006-04-12 10:15
 9      James   Bush    20000   Assistant       2006-07-27 17:45
 201    Adam    Nowak   40000   Manager 2006-11-12 14:23
 5043   James   Kos     40000   Manager 2006-11-12 10:15

But lcsvdiff.pl shows:

 lcsvdiff.pl --key_cols=empl_id:5 before4.txt after4.txt
 SCHEMA: empl_id first_name      last_name       salary  occupation      last_update
 UPD( 00001 ): last_update: 2006-01-12 10:15 ==#> 2006-04-12 10:15
 UPD( 00009 ): occupation: Clerk ==#> Assistant
 UPD( 00009 ): last_update: 2006-05-22 15:35 ==#> 2006-07-27 17:45
 DEL( 00011 )
 INS( 05043 )

It reports that there is a difference for row 00001 (John Wilson). And from technical point of view - that's right. But from the point of view of business logic, nothing changed. We want to skip last_update in our comparisons:

 lcsvdiff.pl --key_cols=empl_id:5 --skip_cols=last_update before4.txt after4.txt
 SCHEMA: empl_id first_name      last_name       salary  occupation      last_update
 UPD( 00009 ): occupation: Clerk ==#> Assistant
 DEL( 00011 )
 INS( 05043 )

This is exactly what we expected.

INTENSIVE USE

Previous example was used to illustrate basic notions related to comparing csv files in general and lcsvdiff.pl in particular.

Now, we are going to focus on some aspects which may become relevant in real life usage, especially if you often compare csv files of different formats. Especialy testing is a job, when you are really likely to do this.

Your data may be distributed between several files, so in order to get a full picture of what has changed, you may have to compare several types of csv files.

Imagine that you have 2 directories - one with data before the change and the second with data after the change.

directory before
employees.txt:
 empl_id        first_name      last_name       salary  occupation      last_update
 1      John    Wilson  10000   Clerk   2006-01-12 10:15
 9      James   Bush    20000   Clerk   2006-05-22 15:35
 11     Adam    Kowalski        30000   Programmer      2006-03-07 12:10
 201    Adam    Nowak   40000   Manager 2006-11-12 14:23

payments.txt:

 empl_id        payment_number  amount  status payment_date     last_update
 1      1       100.00  PD      2006-05-23      2006-05-24 15:35
 1      2       5100.00 PD      2006-06-23      2006-06-27 15:35
 1      3       100.05  PD      2006-07-23      2006-10-22 15:35
 9      1       8930.00 PD      2006-05-21      2006-05-22 15:35
 9      2       100.00  PD      2006-05-22      2006-05-23 15:35
 11     1       9580.00 PD      2006-05-03      2006-05-22 15:35
 11     2       39840.00        PD      2006-06-23      2006-07-22 15:35
 11     3       4530.00 PD      2006-07-10      2006-08-22 15:35
 11     4       100.00  PD      2006-08-29      2006-12-22 15:35
 201    1       300.00  PD      2006-05-23      2006-05-25 15:35
 201    2       400.00  PD      2006-11-23      2006-11-25 15:35
 201    3       500.00  PD      2006-12-23      2006-12-24 15:35

directory after
employees.txt:
 empl_id        first_name      last_name       salary  occupation      last_update
 1      John    Wilson  10000   Manager 2006-11-12 10:15
 9      James   Bush    20000   Clerk   2006-05-22 15:35
 11     Adam    Kowalski        30000   Programmer      2006-03-07 12:10
 201    Adam    Nowak   40000   Manager 2006-11-12 14:23

payments.txt:

 empl_id        payment_number  amount  status payment_date     last_update
 1      1       100.00  PD      2006-05-23      2006-05-24 15:35
 1      2       5100.00 PD      2006-06-23      2006-06-27 15:35
 1      3       100.05  PD      2006-07-23      2006-10-22 15:35
 1      4       10000.05        PD      2006-11-23      2006-12-22 15:35
 9      1       8930.00 PD      2006-05-21      2006-05-22 15:35
 9      2       100.00  PD      2006-05-22      2006-05-23 15:35
 11     1       9580.00 PD      2006-05-03      2006-05-22 15:35
 11     2       39840.00        PD      2006-06-23      2006-07-22 15:35
 11     3       4530.00 PD      2006-07-10      2006-08-22 15:35
 11     4       100.00  PD      2006-08-29      2006-12-22 15:35
 201    1       300.00  PD      2006-05-23      2006-05-25 15:35
 201    2       400.00  CN      2006-11-23      2006-11-25 15:35
 201    3       500.00  PD      2006-12-23      2006-12-24 15:35

You have to compare employees.txt and payments.txt from before and after directories accordingly.

 lcsvdiff.pl before after
 Key is not unique. Some rows from file before/payments.txt will be lost!!! at C:/Perl/site/lib/LRpt/Collection.pm line 351, <GEN0> line 13.
 Key is not unique. Some rows from file after/payments.txt will be lost!!! at C:/Perl/site/lib/LRpt/Collection.pm line 351, <GEN1> line 14.
 lcsvdiff before/payments.txt after/payments.txt
 SCHEMA: empl_id payment_number  amount  status payment_date     last_update
 UPD( 1 ): payment_number: 3 ==#> 4
 UPD( 1 ): amount: 100.05 ==#> 10000.05
 UPD( 1 ): last_update: 2006-07-23 ==#> 2006-11-23
 DEL( 9 )
 lcsvdiff before/employees.txt after/employees.txt
 SCHEMA: empl_id first_name      last_name       salary  occupation      last_update
 UPD( 1 ): occupation: Clerk ==#> Manager
 UPD( 1 ): last_update: 2006-01-12 10:15 ==#> 2006-11-12 10:15
 DEL( 9 )

The outcome is weird. No wonder - we did not define keys. But how to define a key, which works for both? You can't. You have to define different key for each file. And you can do it by using special file, called key file. Below is an example (keys_file.txt):

 select_name: employees
 key_columns: empl_id:5
 %%
 select_name: payments
 key_columns: empl_id:5, payment_number:5
 %%

It says that for a file employees.txt a key should be empl_id (numeric, 5 digits) and for payments.txt empl_id and payment_date.

So if you call now lcsvdiff.pl:

 lcsvdiff.pl --keys_file=keys_file.txt before after

it will do the following:

Let's see:

 lcsvdiff.pl --keys_file=keys_file.txt before after
 lcsvdiff before/payments.txt after/payments.txt
 SCHEMA: empl_id payment_number  amount  status payment_date     last_update
 INS( 00001#00004 )
 UPD( 00201#00002 ): status payment_date: PD ==#> CN
 lcsvdiff before/employees.txt after/employees.txt
 SCHEMA: empl_id first_name      last_name       salary  occupation      last_update
 UPD( 00001 ): occupation: Clerk ==#> Manager
 UPD( 00001 ): last_update: 2006-01-12 10:15 ==#> 2006-11-12 10:15

You can go even futher. If there is a set of csv files you often compare, you can put a file with a definition of most frequently used keys in one location. You can then order lcsvdiff.pl to look there for the definition of the key, instead of specifying a key or a key file every time. Please refer to documentation for a description of LRPT_GLOBAL_KEYS_FILE environment variable.

If you have such a variable defined, calling lcsvdiff.pl will look like this:

 lcsvdiff.pl --global_keys before after

--global_keys switch tells lcsvdiff.pl to look for keys in a file specified in LRPT_GLOBAL_KEYS_FILE variable

WORKING WITH A DATABASE

In many cases data are not I csv files but in a database. How to compare them using lcsvdiff.pl?

lvcsdiff.pl works on csv files, in which a first row contains column names. Most of well known database servers provide tools for generating such files. However some of them require some time to understand them and some additional tricks if you want to dump results of a select rather then a whole table. lcsvdmp.pl is a simple tool, which does only one think - runs a select (or a group of selects) and dumps results to csv files.

So if employees and payments are actually tables, before we can use lcsvdiff.pl we have to dump to csv files result of 2 selects:

 select * from employees
 select * from payments

Input file for lcsvdmp.pl will look like this:

 name: employees
 select: select * from employees
 %%
 name: payments
 select: select * from payments
 %%

If we run it

 lcsvdmp.pl selects.txt

We will get this:

 Cannot open conn_file.txt : No such file or directory at C:/Perl/site/lib/LRpt/CSVDumper.pm line 201.

Apparently it failed. Of course it did. How would it know to which database to connect? As you can see in the message, it is looking for conn_file.txt, a so called connection file. Connection file should contain a perl code, which will open a connection and assign a database handler to a variable $dbh.

Confused? No problem. Read on. Perl uses DBI to access the database. DBI is a perl module - it is a common interface to database servers. In order to do anything via DBI, you have to open a connection with a server. The details of how you specify a connection parameters differ from vendor to vendor. Moreover it is possible that new drivers will be created. So instead of trying to be smart about currently available drivers, lcsvdmp.pl gives user a chance to write a code, which opens a connection and returns it. For ODBC, connection file will look like this:

  my $DBUser     = "pkaluski";
  my $DBPassword = "password";
  
  $dbh = DBI->connect( "DBI:ODBC:DBSybaseSource", $DBUser, $DBPassword,
                       {RaiseError => 1, AutoCommit => 1}
                     );

Since connection file is actually a perl code which will be evaled, you can add some more sophisticated logic there. Below is an example of code which connects to database via ODBC by specifying server, database and user:

  use Win32::ODBC;
  my $DBName     = "INVOICES";
  my $DBServer   = "MYSERVER";
  my $DBUser     = "pkaluski";
  my $DBPassword = "password";
  
  no strict;
  Win32::ODBC::ConfigDSN( ODBC_CONFIG_DSN, 
                          "Sybase ASE ODBC Driver",
                          "DSN=BolekSybase",
                          "Database=$DBName",
                          "InterfacesFileServerName=$DBServer");
  use strict;
  my $error = Win32::ODBC::Error();
  if( $error ){
      die $error;
  } 
  $dbh = DBI->connect( "DBI:ODBC:DBSybaseSource", $DBUser, $DBPassword,
                       {RaiseError => 1, AutoCommit => 1}
                     );

Remember that in this code an opened connection has to be assigned to variable named $dbh.

If you are not familiar with perl, just go to LReport website and get the proper connection file for well known servers.

You can tell lcsvdmp.pl which connection file to take by --conn_file option. If you don't specify it, lcsvdmp.pl will look in the current directory for conn_file.txt. If it won't find it, it will look for a file which name is given in LRPT_CONNECTION_FILE. If it won't work, lcsvdmp.pl will die, as it did in our case

In this case we will use a driver allowing to treat csv files as a database. We do it only for explanation purposes so you can repeat an example yourself without installing any particular database (you need to install DBD::CSV in order ro use it). The connection file in our case will look like this:

 $dbh = DBI->connect( "DBI:CSV:csv_sep_char=\t;f_dir=database",
                         { RaiseError => 1 }
                        );

NOTE: Database files are in the tutorial package in a directory database.

So let's create a connection file in the current directory and run the program again:

  lcsvdmp.pl selects.txt

Have a look on what it prints:

 select: select * from employees
 results: ./employees.txt
 %%
 name: payments
 select: select * from payments
 results: ./payments.txt
 %%

Note that it describes to what files it has dumped results of selects. As you can see it creates a file in the current directory, name is a select name plus extension txt.

If you run the same set of selects again and again, each run will overwrite results from previous run. It would be much better to have lcsvdmp.pl to create files in a specified directory. You can do this by using --path option.

If you do

  lcsvdmp.pl --path=before selects.txt

and after some actions

  lcsvdmp.pl --path=after selects.txt

in before directory you will have files employees.txt and payments.txt with data in a state before modification and in after - the data after the modification.

RUNNING THE SAME STATEMENTS FOR DIFFERENT WHERE CLAUSES

Now imagine that you want to check changes not only for one customer, but for several. You are given a group of 20 employees and you have to check how their data changed after a given operation. It would be difficult to check data for employees retrieved in one big select. It would be easier to select data for each employee separately.

So you have to run a select for each employee. You can do it by simply opening selects.txt file from previous stage, changing empl_id in all where clauses, run it, change empl_id, run and so on. This may be tedious for 20 employees. It would be much more convenient to give a list of empl_ids and proper selects will run automatically.

There are several ways to do it. LReport toolkit offers one of them. There is a tool lks.pl which replaces a specific pattern in a file with values given in other special file. Let's see how it works.

We have to create a selects file template like this:

 name: employees
 select: select * from employees where --empl_id--
 %%
 name: payments
 select: select * from payments where --empl_id--
 %%

The file with the detailed value for placeholder --empl_id-- will look like this (keys.txt):

 name: empl_id
 key: empl_id = 1
 %%

When you run

 lks.pl --keys=keys.txt selects_tmpl.txt

it will produce on a standard output the following text.

 name: employees
 select: select * from employees where empl_id = 1
 %%
 name: payments
 select: select * from payments where empl_id = 1
 %%

which can then be fed to lcsvdmp.pl as a select file.

It replaced all placeholders with a value empl_id = 1.

If now, you would like to run the same set of selects for employee with id 201, you don't have to modify selects file. You modify keys.txt so it looks like this:

  name: empl_id
  key: empl_id = 201
  %%

and run lcsvdmp.pl for the generated set of selects.

WHAT'S NEXT

You now know pretty much about LReport toolkit. You should be able to use it in comparison of various csv files. You can now look at additional examples. They will tell something about creating reports. For more details on defaults, tool's customization you can go directly to REFERENCE MANUAL section.

Good luck and have fun.


ADDITIONAL EXAMPLES

I created LReport to help me in my tests. When I worked as a tester of business applications, I have noticed that I am wasting much of my time checking what has changed in the database and documenting my observations. Although you cannot fully replace human beings in testing, you can at least support them in the most repetitive, boring and error prone tasks.

EXAMPLE 1

You want to check how a given transaction affects a database.

You are not going to observe the whole database. Just results of 2 selects:

select1: select * from customer where customer id = 1234
select2: select * from service where customer id = 1234

Say, results of those selects before the transaction are as follows:

  select1 (customer) :
     ---------------------------------------------
     | customer_id | name  | last_name | address  |
     ---------------------------------------------
     |        1234 | Jan   | Nowak     | Warszawa |
     --------------------------------------------- 
  
  select2 (service) :
     ---------------------------------------------------
     | customer_id | service_type | price | status      |
     ---------------------------------------------------
     |        1234 |         MAIL | 1.30  | ACTIVE      |
     ---------------------------------------------------
     |        1234 |        VOICE | 0.34  | ACTIVE      |
     ---------------------------------------------------

After the transaction, the same selects return the following results:

  select1:
     ---------------------------------------------
     | customer_id | name  | last_name | address  |
     ---------------------------------------------
     |        1234 | Jan   | Nowak     | Warszawa |
     --------------------------------------------- 
  
  select2:
     ---------------------------------------------------
     | customer_id | service_type | price | status      |
     ---------------------------------------------------
     |        1234 |         GPRS | 2.05  | ACTIVE      |
     ---------------------------------------------------
     |        1234 |        VOICE | 0.34  | DEACTIVATED |
     ---------------------------------------------------

You would like to see what are differences between select results before and after the transaction. After processing the data through the tool chain, you get the following diff report


  lcsvdiff.pl before/service.txt after/service.txt 
  INS( 1234#GPRS )
  DEL( 1234#MAIL )
  UPD( 1234#VOICE ): status: ACTIVE =#=> DEACTIVATED

Those entries mean the following:

INS( 1234#GPRS )
New row has been inserted with a key 1234#GPRS

DEL( 1234#MAIL )
The row with a key 1234#MAIL has been removed

UPD( 1234#VOICE ): status
In a row with a key (which is customer_id and service_type ) 1234#VOICE, value in a column status has changed from ACTIVE to DEACTIVATED

As you see, there is nothing about select1. This is fine because nothing has changed in customer table.

You can also generate an xml report about the current results of selects. The report will look like this:

  <report>
      <customer>
          <statement><![CDATA[select * from customer where customer_id = 1234]]>
          </statement>
          <header>
              <customer_id/>
              <name/>
              <last_name/>
              <address/>
          </header>
          <data>
              <equal>
                  <customer_id>1234</customer_id>
                  <name>Jan</name>
                  <last_name>Nowak</last_name>
                  <address>Warszawa</address>
              </equal>
          </data>
      </customer>
      <service>
          <statement><![CDATA[select * from service where customer_id = 1234]]>
          </statement>
          <header>
              <customer_id/>
              <service_type/>
              <price/>
              <status/>
          </header>
          <data>
              <additional>
                  <customer_id>1234</customer_id>
                  <service_type>GPRS</service_type>
                  <price>2.05</price>
                  <status>ACTIVE</status>
              </additional>
              <missing>
                  <customer_id>1234</customer_id>
                  <service_type>MAIL</service_type>
                  <price>1.30</price>
                  <status>ACTIVE</status>
              </missing>
              <different>
                  <customer_id>1234</customer_id>
                  <service_type>VOICE</service_type>
                  <price>0.34</price>
                  <status>
                      <old_value>ACTIVE</old_value>
                      <new_value>DEACTIVATED</new_value>
                  </status>
              <different>
          </data>
      </service>
  </report>

You can transform it to some other formats using XSLT. You can transform it to RTF. You can find an RTF document for this example at http:/xxxx.

CONCLUSION

I hope, you now know what can you expect from LReport. There is more LReport can offer you, for example expected-actual comparison. But for starters this example should be enough. If this is what you are looking for, read on. In the next example, I am going to be more specific on how does the LReport tool chain work.


HOW DOES IT WORK

Now I am going to show you another example. It will work on the same data as in the first example. However, now I will be more specific on how do you use LReport tools to achieve results you want.

EXAMPLE 2

Let's see what steps do we need to perform in order to get results presented in previous example.

You can split the whole processing on the following steps:

  1. Run selects and dump their results to csv files
  2. Run the transaction, which is about to modify a database
  3. Run selects again and dump their results to another set of csv files
  4. Compare csv files from before- and after-transaction state
  5. Generate XML report

Let's observe those steps one by one.

Step 1: Run selects and dump their results to csv files

It is important to understand that LReport operates on csv files. All comparisons and reporting is done on csv files as a source of data. I will elaborate in LIMITATIONS AND WARNINGS, what are potential benefits and flaws of such an approach. LReport makes the following assumptions about the format of csv files:

You can modify some of those defaults. You can read about it in section DEFAULTS AND ENVIRONMENT VARIABLES.

The goal of step 1 is to put the data into csv files so other tools from LReport tool chain can work with them. If your database server provides you with tools allowing to create such files, you don't have to use the tool provided by LReport. But let's assume we will use LReport to do it for us. The tool we have to use is lcsvdmp.pl which stands for ``LReport csv dump''. You call the lcsvdmp.pl in a following way:

  lcsvdmp.pl --conn_file=conn_file.txt --path=data_dir selects.txt

Meaning of parameters:

--path
This is a directory in which csv files will be placed. If this parameter is not given, files will be created in the current directory.

--conn_file
This may be difficult to understand on the beginning. This is the name of the database connection file, which contains instructions on how to connect to the database. lcsvdmp.pl uses DBI to connect to database. As you know each database driver (DBD) has a slightly different flavour. So the syntax of the connection string for Sybase is slightly different from that for Oracle, and is slightly different from Mysql and so on. In order to make lcsvdmp.pl flexible and be able to deal with any source for which DBD exists, I passed the responsibility of opening a connection on a user. Connection file should contain a perl code, which opens a connection and assigns a reference to datbase handle to a variable named $dbh. lcsvdmp.pl will load contents of connection file and eval it.

This an example of opening a connection to ODBC source. As you can see you can even include additional configuration code there:

  use Win32::ODBC;
  my $DBName     = "INVOICES";
  my $DBServer   = "MYSERVER";
  my $DBUser     = "pkaluski";
  my $DBPassword = "password";
  
  no strict;
  Win32::ODBC::ConfigDSN( ODBC_CONFIG_DSN, 
                          "Sybase ASE ODBC Driver",
                          "DSN=BolekSybase",
                          "Database=$DBName",
                          "InterfacesFileServerName=$DBServer");
  use strict;
  my $error = Win32::ODBC::Error();
  if( $error ){
      die $error;
  } 
  $dbh = DBI->connect( "DBI:ODBC:DBSybaseSource", $DBUser, $DBPassword,
                       {RaiseError => 1, AutoCommit => 1}
                     );

selects
Points to a file containing selects to be executed and dumped to csv files. For our example it has the following format:
  name: customer
  select: selects * from customer where customer_id = 1234
  %%
  name: service
  select: selects * from customer where customer_id = 1234
  %%

Each select should have a name. LReport for each of selects will create a file with a file name equal to select's name, plus extension. So results of select named 'customer' will go to a file customer.txt.

In our example we run the tool in the following way

  lcsvdmp.pl --conn_file=conn_file.txt --path=before selects.txt

Note that we have given a --path option with value before. It will make lcsvdmp.pl to put all csv files in before directory, which should exist before we run the tool.

Once we run it, in the before directory you should have 2 files - customer.txt and service.txt with results of according selects.

The generated files will look like this:

customer.txt
  customer_id   name    last_name       address
  1234  Jan     Nowak   Warszawa
service.txt
  customer_id   service_type    price   status
  1234  VOICE   0.34    ACTIVE
  1234  MAIL    1.30    ACTIVE

Fields are separated by tab characters.

It is time for the step 2.

Step 2: Run the transaction, which will modify the database

Just do it. Once you are done, you have to see what has changed.

Step 3: Run selects again and dump their results to csv files

We use lcsvdmp.pl again. We will run it as follows:

  lcsvdmp.pl --conn_file=conn_file.txt --path=after selects.txt

As you can see, the only difference is that we specify different destination directory for csv files. If we use before, new csvs will simply overwrite the previous ones and we will loose the information about the state before the transaction.

At this stage, we have information about database states before and after the transaction, stored in separate directories. It's time to compare them.

Step 4: Compare csv files from before- and after-transaction state

It's time for the next tool - lcsvdiff.pl.

The command line syntax is as follows

  lcsvdiff.pl --keys_file=keys.txt $before_dir $after_dir

Meaning of parameters:

--keys_file
The name of the file, which contains definitions of row keys for each csv file. Some explanation is needed here. In order to be able to compare 2 sets of rows, you have to define a kind of primary key, which will be a direct link between after and before rows to be compared. You can read more about rows keys in BEFORE-AFTER COMPARISON and ROW KEYS. The file for our example will look like this

  select_name: customer
  key_columns: customer_id:4
  %%
  select_name: service
  key_columns: customer_id:4, service_type
  %%

For customer select, the key is a field customer_id. Note the digit in a definition. It describes a length of the field and you should always specify it for numeric fields. Why? As you know lcsvdiff.pl operates on csv files. Those files, by their nature, carry no information about the field type, so they will treat all columns as text. It does not have severe consequences but there is one, which is good to be aware of. 2 is less then 1000 for numeric types. But is greater then 1000 for text comparison. When you specify a field lenght, lcsvdiff.pl will leftpad the value with zeros. So the comparison will be 0002 with 1000 which will give the correct result.

$before_dir
Directory containing csv files with data before the transaction

$after_dir
Directory containing csv files with data after the transaction

lcsvdiff.pl will work as follows:

  1. It will load customer.txt files from before and after directories.
  2. A row key for each of rows will be calculated
  3. In our case there is one row only

  4. It will find out that in both files there is a row with a key 1234 and values in all columns are the same. So there are no differences.
  5. It will load service.txt files from both locations.
  6. It will calculate a row key for each row.
  7. In case of service, there is a composite key, consisting of columns customer_id and service_type.

  8. It will find that a row with a key 1234#GPRS exists in after file but does not exist in before file.
  9. So it concludes that this row was inserted by the transaction. So it will generate an INS entry:
      INS( 1234#GPRS )

  10. Then it will find that a row with a key 1234#MAIL exists in before file but does exist in after file.
  11. So it concludes that this row was removed by the transaction. So it generates the DEL entry:
      DEL( 1234#MAIL )

  12. It will notice that in both files there is a row with a key (1234,VOICE) (presented as 1234#VOICE).
  13. It will find out that the value of column status is different. So it will generate an UPD entry like below
      UPD( 1234#VOICE ): status: ACTIVE =#=> DEACTIVATED

  14. Since there would be no more rows, the tool will finish.

The tool will generate all the output on the standard output.

If you were interested in finding the diffences, you are done at this point. If you would like to generate more readable, nicely formated report, you go to step 5:

Step 5: Generate XML report

We have to use lrptxml.pl tool. But I am not going to describe it here, to avoid bogging you down with too much details. You can read about reporting in manual page for lrptxml.pl.


SEE ALSO

Project is maintained on Source Forge (http://lreport.sourceforge.net). You can find links to documentation there.


AUTHOR

Piotr Kaluski, <pkaluski@piotrkaluski.com>


COPYRIGHT AND LICENSE

Copyright 2004-2006 by Piotr Kaluski

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

 LRpt - Perl extension for comparing csv files and reporting their contents.