LRpt - Perl extension for comparing csv files and reporting their contents. |
LRpt - Perl extension for comparing csv files and reporting their contents.
################################################################### # # Simple report # ################################################################### lks.pl --keys=keys.txt selects.txt > sel_subs.txt lcsvdmp.pl --conn_file=conn_file.txt --path=data sel_subs.txt lrptxml.pl --rldf=rldf.xml --selects=sel_subs.txt --keys_file=keys.txt data > report.xml ################################################################### # # Diff report # ################################################################### lks.pl --keys=keys.txt selects.txt > sel_subs.txt lcsvdmp.pl --conn_file=conn_file.txt --path=data_state1 sel_subs.txt ... some actions on the database... lcsvdmp.pl --conn_file=conn_file.txt --path=data_state2 sel_subs.txt lcsvdiff.pl --all --keys_file=keys.txt data_state1 data_state2 > diffs.txt lrptxml.pl --diffs --rldf=rldf.xml --selects=sel_subs.txt --keys_file=keys.txt diffs.txt > report_aft.xml ################################################################### # # Expected-actual diff report # ################################################################### lks.pl --keys=keys.txt selects.txt > sel_subs.txt lcsvdmp.pl --conn_file=conn_file.txt --path=data sel_subs.txt lcsveadiff.pl --keys_file=keys.txt --expectations=exp.xml --cmp_rules=cmp_rules.xml data > eadiffs.xml
You can get the most recent code from lreport project on source forge http://lreport.sourceforge.net. You are given 2 options - you can either download a ppm active state package or a CPAN package.
LRpt (LReport) is a set of tools for database row sets comparison and reporting. The core logic operates on csv files, however, LReport also provides tools for converting select results to csv files.
First part of this document is a reference manual.
If it's your first time with LReport, I recommend you to start with a a stroll through LReport section of this document. Then have a look at 2 additional examples.
In this and following chapters you will find a more systematic treatment of LReport. If you haven't read examples above, I recommend you at least to skim over them. They will help you to understand concepts explained below. When appropriate, you are given links to other documentation pages.
LReport consits of the following tools:
lks.pl
LRpt::KeySubst
.
lcsvdmp.pl
LRpt::CSVDumper
.
lcsvdiff.pl
LRpt::CSVDiff
.
lcsveadiff.pl
LRpt::CSVEADiff
.
lrptxml.pl
LRpt::XMLReport
.
There are 2 types of possible comparisons:
A set of selects is executed twice - before and after some transaction, which could potentialy change the data. For each select, rows returned before change are compared with rows returned by the same select after the change. When differences are found, they are reported.
The following differences can be reported:
In order to determine a direct relationship between before and after rows (i.e. to determine which after row should be compared with which before row) a row key is used. A row key is an equivalent of primary key in a database table. It has to be unique and may consist of many columns.
During comparison for particular select, LReport sorts after and before rows by a row key (after and before rows are sorted separately). Then it iterates through all keys found in both collections of rows (before and after rows). If for example before collections have 3 rows with keys (01,AB), (02,GT), (02,JT) and the after collection has rows with keys (01,AB), (03,KP), then LReport will iterate through a following list: (01,AB), (02,GT), (02,JT), (03,KP). So it's basically a sum of sets.
For each value of a row key, LReport finds an after and before row with this key value. If there is no such row in before collection, that means that a new row was created. So it is reported as additional row. If there is no such row in after collection, that means that a row was deleted. It is then reported as missing row. If both after and before row exist, they are compared field by field. If any difference is found, the row is reported as not equal. Otherwise rows are considered equal.
NOTE: This functionality is experimental.
This type of comparison may be used for test automation. Sometimes you can create a static set of expected data, which you can use for comparison with actual data. There are many cases, however, when you cannot create static files with expectations, because some column values depend on a date, environment and some circumstances, which are difficult to recreate. So instead, you can implement a script/program, which will predict what could be the output of a test and then generate expectations.
Those expecations may be then compared with results of selects done on actual data. Although it may seem that there is nothing special in expected-actual comparison and it is a slightly modified version of before-after comparison, this is not the case. There are some nuances, which forces an introduction of some new notions and concepts.
In general, expected-actual comparison uses the same core logic as before-after comparison. It parses the expectations and builds collections of expected rows and then compares them with collections of actual rows.
The first big difference is that not all columns are defined in expectations. It makes sense. We often deal with rows, which consist of 40 columns, but we are only interested in 10 of them. It would be infeasible to force a user to define expected values for all those 40 columns. So the user can decide not to specify them.
From the other hand, it would be good to have a protection mechanism, which would warn a user when a really important column has no expectation defined. This is where comparing rules come to the rescue.
Comparing rules are defined for a particular select. They define what to do if an expected value for a given column in not specified. You can make the application terminate, do nothing, warn, use default value. You can find details in COMPARING RULES chapter.
Another big difference is a direct consequence of the previous one. If you are allowed to not specify all columns values, it may happen that a value for column belonging to a row key is not defined. As a result a value of a row key for this row cannot be calculated. So the mechanism of before-after comparison cannot be used in this case. The solution is provided by unkeyed rows and column matching. It is explained below.
The following differences are reported:
If case of unkeyed row (missing values in columns which are parts of a row key), rows can not be compared on a row by row basis since row key is the only information allowing to determine a direct relationship between rows. Hence, the following solution was developed for such rows:
Each unkeyed row defined in expectations is tried to be matched with all its fields with any of actual rows. If there is no such match, the row is reported as unmatched.
The detailed rules are as follows:
Reporting of matching and unmatching has some additional specifics. Since row key is not defined, it is possible to define a set of identical expectation rows. LReport catches those identical rows and count them. When it reports unmatching rows, it groups identical rows and shows how many of them were not matched. If for example, expectations rows for a given select are
1, ABC, ACT 1, ABC, ACT 1, ABC, ACT
and only 2 of them have been matched with actual rows, row (1, ABC, ACT) is reported to be unmatched once.
You can find details on configuring and using expected-actual comparison
(with a step by step example)
in manual for LRpt:CSVEADiff
.
LReport does not operate on database tables but on select results. Single
csv file contain results of one select statement. If the statement
is select * from table
then it is actually the whole table. But this
is a specific example. In general, csv file contain results from one select.
In order to manage select results, choose a proper row key, report format etc. each select should be given a name. In some simple cases it is not required to give select a name, but it is recommended to use a meaningful name whenever possible.
Select name is used for following purposes:
Selects are defined in files called select files. They use jar record format. Definition of each select ends with a line beginning with %% with white spaces after it.
Definition of each select consists of 2 fields: name and select. Their meaning is as follows:
* No line of a select text may start with string: or %%. Adding single space in front of them is enough to get rid of this limitation.
LReport provides some facilities for generating nicely formatted reports of selects results and found differences. It does that by generating xml reports, which can be then transformed to some document format using XSLT.
LReport package in the source forge also contain XSLT stylesheet for converting xml report to an RTF document.
Reporting is a wide topic by itself so it is not described in details here.
You can read more on reporting in LRpt::XMLReport
.
The row key is more or less an equivalent of primary key in database theory. It defines a set of columns, which values uniquely identify each row.
Row key is crucial in before-after comparison, since it determines the direct link between before and after row.
In its code LReport treats row key value as a string. If a row key consists of one column, the key value for a row equals the value of the specified column from this row.
If a row key consists of several columns, then <LReport> creates key value by joining values from those columns, using # as a separator. So if values from columns are 3, Kaluski and <20050601>, then the key value will be 3#Kaluski#20050601. Using a column, which values may contain # as a row key part, may lead to unpredicted affects. It would be good to be able to define, what should be the row key value separator. But it is currently not implemented. The hash character # was chosen for the following reasons:
* It is low in characters order. So it not should disturb in string comparisons. The strings abc#1234 and abcd#1234 are compared correctly. abc#1234 is less then abcd#1234. If # would be high in characters order (having ascii code 123 for example), then the comparison above would consider abc#1234 to be greater then abcd#1234
* I found it very unlikely to use such a character in columns which are parts of unique keys.
Row keys values are treated as strings, so they are compared as strings. Therefore keys value 2 is greater then 1000. In order to achieve correct numerical comparison, column value has to be left padded with zeros. 0002 is lower then 1000.
There are several ways of defining a row key. We can define it in a file, give it at command line and use default key. Currently it is recommended to use only a file method. I am still working on the most convinient and expressive way of defining row keys in command line, so things may change. The default row key is a first column of a table/select results.
Row key file is in jar record format. Definition of each key ends with a line beginning with %% with white spaces after it.
Definition of each key consists of 2 fields: select_name and key_columns. Their meaning is as follows:
Example of a row key file:
select_name: customer key_columns: customer_id:4 %% select_name: service, service_history key_columns: customer_id:4, service_type %%
As you can see one row key may be defined for several selects. Note that values of customer_id column will be left padded with zeros to create 4 character string.
It's an ordinary csv file. LReport expects that a first line of each such a file contains columns' names. Default field separator is tab. You can read more on defaults and changing them in DEFAULTS AND ENVIRONMENT VARIABLES section.
NOTE: csv files used as input to lcsvdiff.pl have to be sorted by a key used in comparison.
Comparing rules are used in expected-actual comparison. They are used to determine what to do when expectations for a given column are not specified. Currently, the following possibilities are available:
You can read more about comparing rules in man page for
LRpt::CSVEADiff
.
A select template is an additional feature provided by LReport. It's usefull in cases when a user executes the same select or set of selects many times but for different values in a where clause.
We would like to run the selects from previous example. We want to run it not only for customer_id = 1234 but also for customer_id = 1334 and customer_id = 3476. And maybe some others in future.
We can use a select template here. We define the following file (let's call it sel_tmpl.txt):
name: customer select: select * from customer where --customer_id-- %% name: service select: select * from service where --customer_id--
Then we have create a where key file (wkey.txt):
name: customer_id key: customer_id = 1234 %%
When we now run lks.pl:
lks.pl --keys=wkey.txt sel_tmpl.txt
Meaning of parameters is as follows:
it will print on standard output following lines:
name: customer select: select * from customer where customer_id = 1234 %% name: service select: select * from service where customer_id = 1234 %%
lks.pl will basically replace all entries --entry--
from sel_tmp.txt
with a proper definition from wkey.txt. Should we run this to selects for
customer_id = 1334, we have to modify wkey.txt only. Output generated by
lks.pl can be fed as input to lcsvdmp.pl (in --select option).
You can read more on how lks.pl work in man page for
LRpt::KeySubst
.
Also have a look at SYNOPSIS.
Where keys are defined in files called where key files. They use jar record format. Definition of each key ends with a line beginning with %% with white spaces after it.
Definition of each select consists of 2 fields: name and select. Their meaning is as follows:
* No line of a select text may start with string: or %%. Adding single space in front of them is enough to get rid of this limitation.
Core logic of LReport operates on csv files. However, before we can operate on them, we have to create them first. A tool for creating csv files from select results is lcsvdmp.pl and its usage is explained in example 2.
A database connection file contains instructions on how to connect to the database. It is used by lcsvdmp.pl. As you know each database driver (DBD) has a slightly different flavour. So the syntax of the connection string for Sybase is slightly different from that for Oracle, and is slightly different from Mysql and so on. In order to make lcsvdmp.pl flexible and be able to deal with any source for which DBD exists, I passed the responsibility of opening a connection on a user. Connection file should contain a perl code, which opens a connection and assigns a reference to datbase handle to a variable named $dbh. lcsvdmp.pl will load contents of connection file and eval it.
This an example of opening a connection to ODBC source. As you can see you can even include additional configuration code there:
use Win32::ODBC; my $DBName = "INVOICES"; my $DBServer = "MYSERVER"; my $DBUser = "pkaluski"; my $DBPassword = "password"; no strict; Win32::ODBC::ConfigDSN( ODBC_CONFIG_DSN, "Sybase ASE ODBC Driver", "DSN=DBSybaseSource", "Database=$DBName", "InterfacesFileServerName=$DBServer"); use strict; my $error = Win32::ODBC::Error(); if( $error ){ die $error; } $dbh = DBI->connect( "DBI:ODBC:DBSybaseSource", $DBUser, $DBPassword, {RaiseError => 1, AutoCommit => 1} );
Do not forget about assigning a reference to opened connection to $dbh variable.
Since LReport core logic operates on csv files, you can use LReport for any database for which you are able to create csv files for select results. If you use lcsvdmp.pl to do it, you need to have a DBI driver for a database you are working with. LReport uses the following DBI methods:
connect
prepare
execute
fetchall_arrayref
If the DBD for you database supports those methods, lcsvdmp.pl should be able to do its job.
LReport's core logic works on csv files. This gives the tool significant flexibility, since it can work with virtually any database. However, such an approach does have some consequences, which you should be aware of.
LReport has some hardcoded defaults. Each of those defaults can be overriden by environment variable. In some cases, those defaults can be further overriden by command line switches. So the general rule is this:
There are following parameters, which are used by LReport tools. The list below is given in the format: meaning: hardcoded default -> Environment variable -> command line option
txt -> LRPT_CSV_FILE_EXT -> --ext=ext
tab -> LRPT_CSV_FIELD_SEPARATOR -> --sep=separator
. (current directory) -> LRPT_CSV_FILE_PATH -> --path=path
conn_file.txt -> LRPT_CONNECTION_FILE -> --conn_file=file.txt
1000 -> LRPT_CHUNK_SIZE -> --chunk_size=size
keys.txt -> LRPT_GLOBAL_KEYS_FILE --> No command line switch
--#> -> LRPT_DIFF_VALUE_SEPARATOR --> No command line switch
The tool was developed and tested on Windows. I can't think of any reason why it will not work on Unix but it was not tested. Currently it assumes that a name separator in file paths is slash (/) so it probably won't work on VMS and other systems, which do not use slash as a name separator.
NOTE: This sections is under construction
If you are brave enough to try to modify the LReport code and the ugliness of the code did not scare you to death, read this chapter. It will provide you with some guidance on how all things are organized. I hope it will make your coding advanture less painfull.
All LReport logic is encapsulated in packages. All *.pl tools are only wrappers calling one exported function. For example, lcsvdmp.pl looks as follows:
use strict; use LRpt::CSVDumper; dump_selects( @ARGV ); Some packages are object oriented, some are not.
LRpt provides 4 main objects to perform expected job (see SYNOPSIS):
LRpt::CSVDumper
LRpt::CSVDiff
LRpt::CSVEADiff
LRpt::XMLReport
LRpt::KeySubst
Apart from those packages there is a bunch of other classes defined. Some of them may appear pretty usefull. Be careful though. I do not consider them public, so I may decide to change them in future. If you find some of them extremely useful, let me know so I will be more careful in my future desing decisions.
LRpt library consists of the following classes:
LRpt::CollDiff
LRpt::CollEADiff
LRpt::Collection
LRpt::CollUnkeyed
LRpt::Config
LRpt::JarReader
LRpt::RKeysRdr
Each of this modules have its POD, so if you would like to know more about them, you can go ahead and read it.
LReport consists of a bunch of tools. Currently regression tests do not test particular modules separately. Instead there are some end-to-end test scenarios. For each testcase there is set of expectations and the results are compared with those expectations.
As of 10-09-2006 tests are divided into 2 groups: tests of simple reporting and tests of difference reporting. The 't' directory contains 2 directories, 1 for each group - DiffRpt and Report. Let's have a look at DiffRpt directory. It constains the following subdirectories/files:
dbschema dbschema1 dbschema2 dbschema3 t1 t2 t3 t4 t5 driver.pl
driver.pl is test driver for diff reporting tests. It contains perl code running scenarious of four tests.
dbschema and dbschema1 are databases. They are based on csv files and are handled by DBD::CVS. They are used to simulate changes in the database. Instead of changing the contents of the database, selects for before state are done on dbschema database and selects for after state are done on dbschema1.
t1 to t4 are directories used by test cases. Let's have a look at t1. The test case code looks like this (it is in t\DiffRpt\driver.pl file):
CODE
On the beginning, t1 contains configuration and input files. It also contains expected directory. It contains a directory structure with files expected to be generated by tested tools. Once you run the test, it will create actual directory in t1. All output files generated by the test case will be placed in the actual directory and its subdirectories. Once the scenario is done, contents of actual and expected directories are compared. If they are the same, that means that the test case generated exactly what we expected. Otherwise, it behaves differently which means that we either introduced a new bug or modified the functionality. We should either remove the bug, or if changes were intentional - modify the test case.
As of 15-01-2006 the tool have only one user, which is me. I used it in one environment. You can imagine that there are plenty of undetected bugs which will jump out, once you will start using it for your specific needs. I expect that regression tests may fail on Unix. This is due to end of line incompatibilities.
In case you find a bug, let me know.
Thanks to chunking, LReport can dump and compare huge files. Some performance tuning has been already done but there is still some space for improvement.
Currently comparing rules (for expected-actual comparison) have to be specified in command lines options. I expect that for intensive use of LReport, it would be convenient to have a common place where all this is defined. LReport tools should look there if nothing is given in the command line.
This is the first version of a documentation. Some parts of it are still under construction. Some links does not work. I am aware of that. My goal is to provide you with something you can play with. I will keep working on documentation refinements.
This section is a tutorial, which will guide you through functionalities of LReport. It will start with a fairly simple example, in which basic concepts are presented. And then by incresing examples' difficulty, more and more advanced topics will be presented.
After reading this you should have a good understanding of LReport. You should be able to use it and should feel comfortable when reading documentation.
This tutorial uses several example files. You can find a tarball with them on LReport web site.
Imagine that you have to compare 2 simple csv files. They are as follows:
last_name salary occupation Bush 20000 Clerk Kowalski 30000 Programmer Nowak 40000 Manager Wilson 10000 Clerk
last_name salary occupation Bush 20000 Assistant Kos 40000 Manager Nowak 40000 Manager Wilson 10000 Clerk
NOTE: Just on the very beginning let me tell you one important thing - input csv files MUST BE SORTED BY THE KEY (later you will find out what a key is). If they are not sorted, you may get weird results.
We compare csv files using lcsvdiff.pl
lcsvdiff.pl before1.txt after1.txt
We will get the following output:
SCHEMA: last_name salary occupation UPD( Bush ): occupation: Clerk ==#> Assistant INS( Kos ) DEL( Kowalski )
Let's have a closer look at it, line by line. It says:
SCHEMA: last_name salary occupation
csv file contains rows with 3 columns: last_name, salary and occupation.
UPD( Bush ): occupation: Clerk ==#> Assistant
The value of column occupation in a row identified by Bush (we will talk about identification soon) has changed from Clerk to Assistant.
INS( Kos )
Row identified by Kos (last_name column) is new (additional) in after1.txt file.
DEL( Kowalski )
Row identified by Kowalski (last_name column) from before1.txt does not exist any more (is missing) in after1.txt file.
Now we need to compare files, which contain also employees first names:
first_name last_name salary occupation James Bush 20000 Clerk Adam Kowalski 30000 Programmer Adam Nowak 40000 Manager John Wilson 10000 Clerk
first_name last_name salary occupation James Bush 20000 Assistant James Kos 40000 Manager Adam Nowak 40000 Manager John Wilson 10000 Clerk
Lets try lcsvdiff.pl again
lcsvdiff.pl before2.txt after2.txt
This time we get the following output:
Key is not unique. Some rows from file before2.txt will be lost!!! at C:/Perl/site/lib/LRpt/Collection.pm line 351, <GEN0> line 5. Key is not unique. Some rows from file after2.txt will be lost!!! at C:/Perl/site/lib/LRpt/Collection.pm line 351, <GEN1> line 5. SCHEMA: first_name last_name salary occupation UPD( James ): last_name: Bush ==#> Kos UPD( James ): salary: 20000 ==#> 40000 UPD( James ): occupation: Clerk ==#> Manager
We got 2 warnings and then different and weird result. What has happened?
It's time to talk about keys. When you compare rows (in csv files or database) then you need to define, which column/columns are used to uniquely identify each row. This is so called primary key. It allows you to determine, which rows should be compared with each other, which should be considered missing or additional. Otherwise you could only compare rows basing on their order in both files.
For after1.txt and before1.txt it worked fine, because lcsvdiff.pl assumed a default key - the first column. Values in these column were unique, so it worked well. In the second example, the first column contains first names, which are very likely to be repeated. We have Adam in 2 rows. For this case it would be much better to use columns last_name and first_name as a key. We tell it lcsvdiff.pl about a key a in a following manner:
lcsvdiff.pl --key_cols=last_name,first_name before2.txt after2.txt
Now it worked fine giving the following output:
SCHEMA: first_name last_name salary occupation UPD( Bush#James ): occupation: Clerk ==#> Assistant INS( Kos#James ) DEL( Kowalski#Adam )
Note the difference between comparison of before1.txt/after1.txt and before2.txt/after2.txt. Previously rows were identified by Bush, Kos and Kowalski. Now, they are identified by Bush#James, Kos#James and Kowalski#Adam. This is a concatenation of values from key columns - last_name and first_name. # is a concatenation character.
IMPORTANT NOTE - remember that input files for lcsvdiff.pl have to be sorted by the key. If they are not sorted, strange results may be returned.
Let's try with following files
empl_id first_name last_name salary occupation 1 John Wilson 10000 Clerk 9 James Bush 20000 Clerk 11 Adam Kowalski 30000 Programmer 201 Adam Nowak 40000 Manager
empl_id first_name last_name salary occupation 1 John Wilson 10000 Clerk 9 James Bush 20000 Assistant 201 Adam Nowak 40000 Manager 5043 James Kos 40000 Manager
Since empl_id is a unique key, we can again allow lcsvdiff.pl to use default key i.e. first column.
The output is as follows:
SCHEMA: empl_id first_name last_name salary occupation DEL( 11 ) INS( 5043 ) DEL( 9 )
What's wrong now? Rows are now identified by empl_id. However, their order in the output seem random. It would be good to have them ordered. The current order is because lcsvdiff.pl treats all keys as strings. Numerically 5043 is bigger then 9. But alphabeticaly it is lower. You have to tell lcsvdiff.pl that empl_id is numeric. lcsvdiff.pl will zero-pad it, so it compares correctly.
lcsvdiff.pl --key_cols=empl_id:5 before3.txt after3.txt SCHEMA: empl_id first_name last_name salary occupation UPD( 00009 ): occupation: Clerk ==#> Assistant DEL( 00011 ) INS( 05043 )
--key_cols=empl_id:5
means that column empl_id is numeric so it should be zero-padded to a length of 5 characters.
Sometimes you don't want to compare all values from rows. For example in real life rows often contain date and time of their last modification. Look at the file below (before4.txt):
empl_id first_name last_name salary occupation last_update 1 John Wilson 10000 Clerk 2006-01-12 10:15 9 James Bush 20000 Clerk 2006-05-22 15:35 11 Adam Kowalski 30000 Programmer 2006-03-07 12:10 201 Adam Nowak 40000 Manager 2006-11-12 14:23
Imagine that a record for John Wilson has changed twice - his salary changed from 10000 to 8000 but after a month it has changed back to 10000. So the employment data remaind unchanged. after4.txt will look something like this:
empl_id first_name last_name salary occupation last_update 1 John Wilson 10000 Clerk 2006-04-12 10:15 9 James Bush 20000 Assistant 2006-07-27 17:45 201 Adam Nowak 40000 Manager 2006-11-12 14:23 5043 James Kos 40000 Manager 2006-11-12 10:15
But lcsvdiff.pl shows:
lcsvdiff.pl --key_cols=empl_id:5 before4.txt after4.txt SCHEMA: empl_id first_name last_name salary occupation last_update UPD( 00001 ): last_update: 2006-01-12 10:15 ==#> 2006-04-12 10:15 UPD( 00009 ): occupation: Clerk ==#> Assistant UPD( 00009 ): last_update: 2006-05-22 15:35 ==#> 2006-07-27 17:45 DEL( 00011 ) INS( 05043 )
It reports that there is a difference for row 00001 (John Wilson). And from technical point of view - that's right. But from the point of view of business logic, nothing changed. We want to skip last_update in our comparisons:
lcsvdiff.pl --key_cols=empl_id:5 --skip_cols=last_update before4.txt after4.txt SCHEMA: empl_id first_name last_name salary occupation last_update UPD( 00009 ): occupation: Clerk ==#> Assistant DEL( 00011 ) INS( 05043 )
This is exactly what we expected.
Previous example was used to illustrate basic notions related to comparing csv files in general and lcsvdiff.pl in particular.
Now, we are going to focus on some aspects which may become relevant in real life usage, especially if you often compare csv files of different formats. Especialy testing is a job, when you are really likely to do this.
Your data may be distributed between several files, so in order to get a full picture of what has changed, you may have to compare several types of csv files.
Imagine that you have 2 directories - one with data before the change and the second with data after the change.
empl_id first_name last_name salary occupation last_update 1 John Wilson 10000 Clerk 2006-01-12 10:15 9 James Bush 20000 Clerk 2006-05-22 15:35 11 Adam Kowalski 30000 Programmer 2006-03-07 12:10 201 Adam Nowak 40000 Manager 2006-11-12 14:23
payments.txt:
empl_id payment_number amount status payment_date last_update 1 1 100.00 PD 2006-05-23 2006-05-24 15:35 1 2 5100.00 PD 2006-06-23 2006-06-27 15:35 1 3 100.05 PD 2006-07-23 2006-10-22 15:35 9 1 8930.00 PD 2006-05-21 2006-05-22 15:35 9 2 100.00 PD 2006-05-22 2006-05-23 15:35 11 1 9580.00 PD 2006-05-03 2006-05-22 15:35 11 2 39840.00 PD 2006-06-23 2006-07-22 15:35 11 3 4530.00 PD 2006-07-10 2006-08-22 15:35 11 4 100.00 PD 2006-08-29 2006-12-22 15:35 201 1 300.00 PD 2006-05-23 2006-05-25 15:35 201 2 400.00 PD 2006-11-23 2006-11-25 15:35 201 3 500.00 PD 2006-12-23 2006-12-24 15:35
empl_id first_name last_name salary occupation last_update 1 John Wilson 10000 Manager 2006-11-12 10:15 9 James Bush 20000 Clerk 2006-05-22 15:35 11 Adam Kowalski 30000 Programmer 2006-03-07 12:10 201 Adam Nowak 40000 Manager 2006-11-12 14:23
payments.txt:
empl_id payment_number amount status payment_date last_update 1 1 100.00 PD 2006-05-23 2006-05-24 15:35 1 2 5100.00 PD 2006-06-23 2006-06-27 15:35 1 3 100.05 PD 2006-07-23 2006-10-22 15:35 1 4 10000.05 PD 2006-11-23 2006-12-22 15:35 9 1 8930.00 PD 2006-05-21 2006-05-22 15:35 9 2 100.00 PD 2006-05-22 2006-05-23 15:35 11 1 9580.00 PD 2006-05-03 2006-05-22 15:35 11 2 39840.00 PD 2006-06-23 2006-07-22 15:35 11 3 4530.00 PD 2006-07-10 2006-08-22 15:35 11 4 100.00 PD 2006-08-29 2006-12-22 15:35 201 1 300.00 PD 2006-05-23 2006-05-25 15:35 201 2 400.00 CN 2006-11-23 2006-11-25 15:35 201 3 500.00 PD 2006-12-23 2006-12-24 15:35
You have to compare employees.txt and payments.txt from before and after directories accordingly.
lcsvdiff.pl before after Key is not unique. Some rows from file before/payments.txt will be lost!!! at C:/Perl/site/lib/LRpt/Collection.pm line 351, <GEN0> line 13. Key is not unique. Some rows from file after/payments.txt will be lost!!! at C:/Perl/site/lib/LRpt/Collection.pm line 351, <GEN1> line 14. lcsvdiff before/payments.txt after/payments.txt SCHEMA: empl_id payment_number amount status payment_date last_update UPD( 1 ): payment_number: 3 ==#> 4 UPD( 1 ): amount: 100.05 ==#> 10000.05 UPD( 1 ): last_update: 2006-07-23 ==#> 2006-11-23 DEL( 9 ) lcsvdiff before/employees.txt after/employees.txt SCHEMA: empl_id first_name last_name salary occupation last_update UPD( 1 ): occupation: Clerk ==#> Manager UPD( 1 ): last_update: 2006-01-12 10:15 ==#> 2006-11-12 10:15 DEL( 9 )
The outcome is weird. No wonder - we did not define keys. But how to define a key, which works for both? You can't. You have to define different key for each file. And you can do it by using special file, called key file. Below is an example (keys_file.txt):
select_name: employees key_columns: empl_id:5 %% select_name: payments key_columns: empl_id:5, payment_number:5 %%
It says that for a file employees.txt a key should be empl_id (numeric, 5 digits) and for payments.txt empl_id and payment_date.
So if you call now lcsvdiff.pl:
lcsvdiff.pl --keys_file=keys_file.txt before after
it will do the following:
Let's see:
lcsvdiff.pl --keys_file=keys_file.txt before after lcsvdiff before/payments.txt after/payments.txt SCHEMA: empl_id payment_number amount status payment_date last_update INS( 00001#00004 ) UPD( 00201#00002 ): status payment_date: PD ==#> CN lcsvdiff before/employees.txt after/employees.txt SCHEMA: empl_id first_name last_name salary occupation last_update UPD( 00001 ): occupation: Clerk ==#> Manager UPD( 00001 ): last_update: 2006-01-12 10:15 ==#> 2006-11-12 10:15
You can go even futher. If there is a set of csv files you often compare, you can put a file with a definition of most frequently used keys in one location. You can then order lcsvdiff.pl to look there for the definition of the key, instead of specifying a key or a key file every time. Please refer to documentation for a description of LRPT_GLOBAL_KEYS_FILE environment variable.
If you have such a variable defined, calling lcsvdiff.pl will look like this:
lcsvdiff.pl --global_keys before after
--global_keys
switch tells lcsvdiff.pl to look for keys in a file
specified in LRPT_GLOBAL_KEYS_FILE variable
In many cases data are not I csv files but in a database. How to compare them using lcsvdiff.pl?
lvcsdiff.pl works on csv files, in which a first row contains column names. Most of well known database servers provide tools for generating such files. However some of them require some time to understand them and some additional tricks if you want to dump results of a select rather then a whole table. lcsvdmp.pl is a simple tool, which does only one think - runs a select (or a group of selects) and dumps results to csv files.
So if employees and payments are actually tables, before we can use lcsvdiff.pl we have to dump to csv files result of 2 selects:
select * from employees select * from payments
Input file for lcsvdmp.pl will look like this:
name: employees select: select * from employees %% name: payments select: select * from payments %%
If we run it
lcsvdmp.pl selects.txt
We will get this:
Cannot open conn_file.txt : No such file or directory at C:/Perl/site/lib/LRpt/CSVDumper.pm line 201.
Apparently it failed. Of course it did. How would it know to which database to
connect? As you can see in the message, it is looking for conn_file.txt, a so
called connection file. Connection file should contain a perl code, which
will open a connection and assign a database handler to a variable $dbh
.
Confused? No problem. Read on. Perl uses DBI to access the database. DBI is a perl module - it is a common interface to database servers. In order to do anything via DBI, you have to open a connection with a server. The details of how you specify a connection parameters differ from vendor to vendor. Moreover it is possible that new drivers will be created. So instead of trying to be smart about currently available drivers, lcsvdmp.pl gives user a chance to write a code, which opens a connection and returns it. For ODBC, connection file will look like this:
my $DBUser = "pkaluski"; my $DBPassword = "password"; $dbh = DBI->connect( "DBI:ODBC:DBSybaseSource", $DBUser, $DBPassword, {RaiseError => 1, AutoCommit => 1} );
Since connection file is actually a perl code which will be evaled, you can add some more sophisticated logic there. Below is an example of code which connects to database via ODBC by specifying server, database and user:
use Win32::ODBC; my $DBName = "INVOICES"; my $DBServer = "MYSERVER"; my $DBUser = "pkaluski"; my $DBPassword = "password"; no strict; Win32::ODBC::ConfigDSN( ODBC_CONFIG_DSN, "Sybase ASE ODBC Driver", "DSN=BolekSybase", "Database=$DBName", "InterfacesFileServerName=$DBServer"); use strict; my $error = Win32::ODBC::Error(); if( $error ){ die $error; } $dbh = DBI->connect( "DBI:ODBC:DBSybaseSource", $DBUser, $DBPassword, {RaiseError => 1, AutoCommit => 1} );
Remember that in this code an opened connection has to be assigned to variable
named $dbh
.
If you are not familiar with perl, just go to LReport website and get the proper connection file for well known servers.
You can tell lcsvdmp.pl which connection file to take by --conn_file
option.
If you don't specify it, lcsvdmp.pl will look in the current directory
for conn_file.txt. If it won't find it, it will look for a file which name
is given in LRPT_CONNECTION_FILE. If it won't work, lcsvdmp.pl will die, as it did in our case
In this case we will use a driver allowing to treat csv files as a database. We do it only for explanation purposes so you can repeat an example yourself without installing any particular database (you need to install DBD::CSV in order ro use it). The connection file in our case will look like this:
$dbh = DBI->connect( "DBI:CSV:csv_sep_char=\t;f_dir=database", { RaiseError => 1 } );
NOTE: Database files are in the tutorial package in a directory database.
So let's create a connection file in the current directory and run the program again:
lcsvdmp.pl selects.txt
Have a look on what it prints:
select: select * from employees results: ./employees.txt %% name: payments select: select * from payments results: ./payments.txt %%
Note that it describes to what files it has dumped results of selects. As you can see it creates a file in the current directory, name is a select name plus extension txt.
If you run the same set of selects again and again, each run will overwrite
results from previous run. It would be much better to have lcsvdmp.pl
to create files in a specified directory.
You can do this by using --path
option.
If you do
lcsvdmp.pl --path=before selects.txt
and after some actions
lcsvdmp.pl --path=after selects.txt
in before directory you will have files employees.txt and payments.txt with data in a state before modification and in after - the data after the modification.
Now imagine that you want to check changes not only for one customer, but for several. You are given a group of 20 employees and you have to check how their data changed after a given operation. It would be difficult to check data for employees retrieved in one big select. It would be easier to select data for each employee separately.
So you have to run a select for each employee. You can do it by simply opening selects.txt file from previous stage, changing empl_id in all where clauses, run it, change empl_id, run and so on. This may be tedious for 20 employees. It would be much more convenient to give a list of empl_ids and proper selects will run automatically.
There are several ways to do it. LReport toolkit offers one of them. There is a tool lks.pl which replaces a specific pattern in a file with values given in other special file. Let's see how it works.
We have to create a selects file template like this:
name: employees select: select * from employees where --empl_id-- %% name: payments select: select * from payments where --empl_id-- %%
The file with the detailed value for placeholder --empl_id--
will look like this
(keys.txt):
name: empl_id key: empl_id = 1 %%
When you run
lks.pl --keys=keys.txt selects_tmpl.txt
it will produce on a standard output the following text.
name: employees select: select * from employees where empl_id = 1 %% name: payments select: select * from payments where empl_id = 1 %%
which can then be fed to lcsvdmp.pl as a select file.
It replaced all placeholders with a value empl_id = 1.
If now, you would like to run the same set of selects for employee with id 201, you don't have to modify selects file. You modify keys.txt so it looks like this:
name: empl_id key: empl_id = 201 %%
and run lcsvdmp.pl for the generated set of selects.
You now know pretty much about LReport toolkit. You should be able to use it in comparison of various csv files. You can now look at additional examples. They will tell something about creating reports. For more details on defaults, tool's customization you can go directly to REFERENCE MANUAL section.
Good luck and have fun.
I created LReport to help me in my tests. When I worked as a tester of business applications, I have noticed that I am wasting much of my time checking what has changed in the database and documenting my observations. Although you cannot fully replace human beings in testing, you can at least support them in the most repetitive, boring and error prone tasks.
You want to check how a given transaction affects a database.
You are not going to observe the whole database. Just results of 2 selects:
select1: select * from customer where customer id = 1234
select2: select * from service where customer id = 1234
Say, results of those selects before the transaction are as follows:
select1 (customer) : --------------------------------------------- | customer_id | name | last_name | address | --------------------------------------------- | 1234 | Jan | Nowak | Warszawa | --------------------------------------------- select2 (service) : --------------------------------------------------- | customer_id | service_type | price | status | --------------------------------------------------- | 1234 | MAIL | 1.30 | ACTIVE | --------------------------------------------------- | 1234 | VOICE | 0.34 | ACTIVE | ---------------------------------------------------
After the transaction, the same selects return the following results:
select1: --------------------------------------------- | customer_id | name | last_name | address | --------------------------------------------- | 1234 | Jan | Nowak | Warszawa | --------------------------------------------- select2: --------------------------------------------------- | customer_id | service_type | price | status | --------------------------------------------------- | 1234 | GPRS | 2.05 | ACTIVE | --------------------------------------------------- | 1234 | VOICE | 0.34 | DEACTIVATED | ---------------------------------------------------
You would like to see what are differences between select results before and after the transaction. After processing the data through the tool chain, you get the following diff report
lcsvdiff.pl before/service.txt after/service.txt INS( 1234#GPRS ) DEL( 1234#MAIL ) UPD( 1234#VOICE ): status: ACTIVE =#=> DEACTIVATED
Those entries mean the following:
INS( 1234#GPRS )
DEL( 1234#MAIL )
UPD( 1234#VOICE ): status
As you see, there is nothing about select1. This is fine because nothing has changed in customer table.
You can also generate an xml report about the current results of selects. The report will look like this:
<report> <customer> <statement><![CDATA[select * from customer where customer_id = 1234]]> </statement> <header> <customer_id/> <name/> <last_name/> <address/> </header> <data> <equal> <customer_id>1234</customer_id> <name>Jan</name> <last_name>Nowak</last_name> <address>Warszawa</address> </equal> </data> </customer> <service> <statement><![CDATA[select * from service where customer_id = 1234]]> </statement> <header> <customer_id/> <service_type/> <price/> <status/> </header> <data> <additional> <customer_id>1234</customer_id> <service_type>GPRS</service_type> <price>2.05</price> <status>ACTIVE</status> </additional> <missing> <customer_id>1234</customer_id> <service_type>MAIL</service_type> <price>1.30</price> <status>ACTIVE</status> </missing> <different> <customer_id>1234</customer_id> <service_type>VOICE</service_type> <price>0.34</price> <status> <old_value>ACTIVE</old_value> <new_value>DEACTIVATED</new_value> </status> <different> </data> </service> </report>
You can transform it to some other formats using XSLT. You can transform it to RTF. You can find an RTF document for this example at http:/xxxx.
I hope, you now know what can you expect from LReport. There is more LReport can offer you, for example expected-actual comparison. But for starters this example should be enough. If this is what you are looking for, read on. In the next example, I am going to be more specific on how does the LReport tool chain work.
Now I am going to show you another example. It will work on the same data as in the first example. However, now I will be more specific on how do you use LReport tools to achieve results you want.
Let's see what steps do we need to perform in order to get results presented in previous example.
You can split the whole processing on the following steps:
Let's observe those steps one by one.
It is important to understand that LReport operates on csv files. All comparisons and reporting is done on csv files as a source of data. I will elaborate in LIMITATIONS AND WARNINGS, what are potential benefits and flaws of such an approach. LReport makes the following assumptions about the format of csv files:
You can modify some of those defaults. You can read about it in section DEFAULTS AND ENVIRONMENT VARIABLES.
The goal of step 1 is to put the data into csv files so other tools from LReport tool chain can work with them. If your database server provides you with tools allowing to create such files, you don't have to use the tool provided by LReport. But let's assume we will use LReport to do it for us. The tool we have to use is lcsvdmp.pl which stands for ``LReport csv dump''. You call the lcsvdmp.pl in a following way:
lcsvdmp.pl --conn_file=conn_file.txt --path=data_dir selects.txt
Meaning of parameters:
This an example of opening a connection to ODBC source. As you can see you can even include additional configuration code there:
use Win32::ODBC; my $DBName = "INVOICES"; my $DBServer = "MYSERVER"; my $DBUser = "pkaluski"; my $DBPassword = "password"; no strict; Win32::ODBC::ConfigDSN( ODBC_CONFIG_DSN, "Sybase ASE ODBC Driver", "DSN=BolekSybase", "Database=$DBName", "InterfacesFileServerName=$DBServer"); use strict; my $error = Win32::ODBC::Error(); if( $error ){ die $error; } $dbh = DBI->connect( "DBI:ODBC:DBSybaseSource", $DBUser, $DBPassword, {RaiseError => 1, AutoCommit => 1} );
name: customer select: selects * from customer where customer_id = 1234 %% name: service select: selects * from customer where customer_id = 1234 %%
Each select should have a name. LReport for each of selects will create a file with a file name equal to select's name, plus extension. So results of select named 'customer' will go to a file customer.txt.
In our example we run the tool in the following way
lcsvdmp.pl --conn_file=conn_file.txt --path=before selects.txt
Note that we have given a --path option with value before. It will make lcsvdmp.pl to put all csv files in before directory, which should exist before we run the tool.
Once we run it, in the before directory you should have 2 files - customer.txt and service.txt with results of according selects.
The generated files will look like this:
customer_id name last_name address 1234 Jan Nowak Warszawa
customer_id service_type price status 1234 VOICE 0.34 ACTIVE 1234 MAIL 1.30 ACTIVE
Fields are separated by tab characters.
It is time for the step 2.
Just do it. Once you are done, you have to see what has changed.
We use lcsvdmp.pl again. We will run it as follows:
lcsvdmp.pl --conn_file=conn_file.txt --path=after selects.txt
As you can see, the only difference is that we specify different destination directory for csv files. If we use before, new csvs will simply overwrite the previous ones and we will loose the information about the state before the transaction.
At this stage, we have information about database states before and after the transaction, stored in separate directories. It's time to compare them.
It's time for the next tool - lcsvdiff.pl.
The command line syntax is as follows
lcsvdiff.pl --keys_file=keys.txt $before_dir $after_dir
Meaning of parameters:
select_name: customer key_columns: customer_id:4 %% select_name: service key_columns: customer_id:4, service_type %%
For customer select, the key is a field customer_id. Note the digit in a definition. It describes a length of the field and you should always specify it for numeric fields. Why? As you know lcsvdiff.pl operates on csv files. Those files, by their nature, carry no information about the field type, so they will treat all columns as text. It does not have severe consequences but there is one, which is good to be aware of. 2 is less then 1000 for numeric types. But is greater then 1000 for text comparison. When you specify a field lenght, lcsvdiff.pl will leftpad the value with zeros. So the comparison will be 0002 with 1000 which will give the correct result.
lcsvdiff.pl will work as follows:
INS( 1234#GPRS )
DEL( 1234#MAIL )
UPD( 1234#VOICE ): status: ACTIVE =#=> DEACTIVATED
The tool will generate all the output on the standard output.
If you were interested in finding the diffences, you are done at this point. If you would like to generate more readable, nicely formated report, you go to step 5:
We have to use lrptxml.pl tool. But I am not going to describe it here, to avoid bogging you down with too much details. You can read about reporting in manual page for lrptxml.pl.
Project is maintained on Source Forge (http://lreport.sourceforge.net). You can find links to documentation there.
Piotr Kaluski, <pkaluski@piotrkaluski.com>
Copyright 2004-2006 by Piotr Kaluski
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
LRpt - Perl extension for comparing csv files and reporting their contents. |