Globus Usage Statistics How To
The Globus Alliance collects a variety of usage statistics on the various components of the Globus Toolkit. Details on how and why the statistics are collected can be found at http://www.globus.org/toolkit/docs/4.0/Usage_Stats.html.
Usage packets sent by the toolkit components are stored in the Globus Usage Statistics database.
This is a PostgreSQL database, and can be accessed with the command psql -h <host> -p <port> -d <database> -U <username>.
Once connected to the database, the names of the packet tables can be listed with the command \dt, and the schema of a specific packet table can be displayed with \d [table name].
Once the packets have been analyzed, the usage reports can be made.
If there are any questions as to the codes used in making the packets (for example, Gram fault class names are stored as integers 0-11 rather than full strings, with each integer corresponding to a distinct name), the java classes in the org.globus.usage.packets package contain the codes used to make the packets sent to the database.
The code for generating existing reports is found in the org.globus.usage packages report, gramreport, gftpreport, and rftreport.
Running Usage Statistics Reports
Before reports are generated, the database java drivers must be in place. In particular, the reports require the org.postgresql.Driver java class to be on the classpath. This class is provided by the PostgreSQL JDBC (Java Database Connectivity) driver. Apache Ant and GNUplot must also be installed.
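A quick way to confirm that the driver is visible to Java before running any reports is sketched below. The class name DriverCheck and its method are hypothetical and are not part of the org.globus.usage packages; only the standard Class.forName call is used.

```java
// Hypothetical helper (not part of org.globus.usage): reports whether a
// class such as the PostgreSQL JDBC driver can be loaded from the classpath.
final class DriverCheck {

    /** Returns true if the named class can be loaded, false otherwise. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("org.postgresql.Driver available: "
                + isOnClasspath("org.postgresql.Driver"));
    }
}
```

If the check prints false, the PostgreSQL JDBC jar most likely still needs to be added to the classpath used by the report scripts.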
Each component has a script for generating its usage reports named <componentName>-reports.sh. They are used as follows:
bin/[componentName]-reports.sh [reportType] -n [number of time steps] -step [size of time step {month/day}] [date in yyyy-mm-dd format]
For example, bin/gftp-reports.sh buffer -n 2 -step month 2006-01-15 will generate a TCP buffer report for GFTP for the time period of January 15, 2006 to March 15, 2006. The data is computed in two month-sized pieces in this case, one for 2006-01-15 to 2006-02-15 and one for 2006-02-15 to 2006-03-15.
The output of these scripts is a set of .png graphics files and html files, all of which can be viewed from a normal web browser.
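The way a start date, a step size, and a number of steps expand into reporting intervals can be sketched as follows. This is only an illustration built on java.time; it is not the toolkit's own TimeStep class, and the name StepIntervals is hypothetical.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of org.globus.usage: expands a start date
// into consecutive intervals in the way the -n and -step options describe.
final class StepIntervals {

    /** Returns n intervals of one "month" or one "day" each, beginning at start. */
    static List<LocalDate[]> intervals(LocalDate start, String step, int n) {
        List<LocalDate[]> result = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            result.add(new LocalDate[] {advance(start, step, i), advance(start, step, i + 1)});
        }
        return result;
    }

    // Moves the start date forward by the given number of steps.
    private static LocalDate advance(LocalDate start, String step, int count) {
        switch (step) {
            case "month": return start.plusMonths(count);
            case "day":   return start.plusDays(count);
            default: throw new IllegalArgumentException("step must be month or day: " + step);
        }
    }

    public static void main(String[] args) {
        // Mirrors: -n 2 -step month 2006-01-15
        for (LocalDate[] interval : intervals(LocalDate.parse("2006-01-15"), "month", 2)) {
            System.out.println(interval[0] + " to " + interval[1]);
        }
    }
}
```

For the arguments in the example above, this prints the two intervals 2006-01-15 to 2006-02-15 and 2006-02-15 to 2006-03-15.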
How the Reports are Structured
Each component has several java command line programs associated with it. When these programs are run, they access the necessary fields of the database, compute the statistics, and output them in an xml format. A set of Java programs already exists to help access the database, manage the time periods, and format the output. See the "Writing New Globus Usage Statistics Reports" section for more details.
These java programs are run as follows:
java [reportname] -n [number of time steps to compute] -step [size of time steps {month/day}] [start-date in yyyy-mm-dd]
An example of actual use would be java ByteReport -n 5 -step day 2006-01-15 > GFTPByteReport.xml. This runs the GFTP ByteReport for 5 days beginning on January 15, 2006. The reports for January 15-20, 2006 are then redirected into an xml file. This file forms the "raw" data of the usage statistics reports.
All Globus Usage Stats xml files begin with a <report> element. As most usage reports are made as histograms (either time histograms or slotted histograms, see below for details and examples), the <report> element is typically composed of <histogram> elements. The report can contain other elements, but a separate xslt file will be needed to interpret other elements.
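A report file can be checked for this basic shape with the standard Java XML parser. The sketch below only relies on the <report> and <histogram> element names mentioned above; the class name ReportInspector is hypothetical, and the children of <histogram> are left out because their exact names are defined by the schema.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of org.globus.usage.
final class ReportInspector {

    /** Parses a report document and returns the number of histogram elements in it. */
    static int countHistograms(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            if (!"report".equals(doc.getDocumentElement().getTagName())) {
                throw new IllegalArgumentException("root element is not <report>");
            }
            return doc.getElementsByTagName("histogram").getLength();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse report", e);
        }
    }

    public static void main(String[] args) {
        // Skeleton input only: the contents of each histogram are omitted.
        System.out.println(countHistograms("<report><histogram/><histogram/></report>"));
    }
}
```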
Details on the Globus Usage Statistics XML Schema can be found here
Once these xml files are created, a series of xslt stylesheets is run using ant. The ant files are called reports.xml and reside in the same folder as the shell script for the component. Each ant target is called on one xml file. The output of the ant target is either a pair of .data and .gnuplot files or an .html file. An example call might look like:
ant gnuplot-histograms -f reports.xml -Din.xml=GFTPByteReport.xml
This calls the target gnuplot-histograms with the input of GFTPByteReport.xml. The xslt file will parse the input file and create GNUplot data and gnuplot files or html files as needed.
Once these files are created, gnuplot is run on the outputted gnuplot files using the command gnuplot [filename].gnuplot. This generates the .png graphics files.
Then ant clean -f reports.xml is run, which runs the clean target, erasing all temporary files used. This leaves the user with the required html, png, and xml files. The user can then view these files in a web browser or post them online for others to view as well.
The graphs generally take one of two forms for existing GRAM, GFTP, and RFT reports. Either the graph is a time histogram or a slotted histogram.
In a time histogram, the graph always shows the requested statistic on the y-axis and time on the x-axis. Each time step becomes one entry on the graph. These can be used to display trends over time. They are displayed as stacks, with each color representing one part of the total. An example can be found here: sample time histogram
In a slotted histogram, the graph always shows a count or number on the y-axis and 'bins' on the x-axis. For example, a slotted histogram of file size could show the number of files with sizes in the range of 0-100kb, 100kb-1mb, 1mb-2mb. Here is an example slotted histogram: sample slotted histogram
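The binning behind a slotted histogram can be sketched as follows, using the file size ranges from the example. The class name SlottedHistogram is hypothetical, the sketch assumes that 1 kb is 1024 bytes, and sizes of 2mb or more are simply ignored because the example names no bin for them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of slotted-histogram binning, not toolkit code.
final class SlottedHistogram {

    // Exclusive upper bounds in bytes (assumption: 1 kb = 1024 bytes).
    private static final long[] UPPER = {100L * 1024, 1024L * 1024, 2L * 1024 * 1024};
    private static final String[] LABELS = {"0-100kb", "100kb-1mb", "1mb-2mb"};

    /** Counts how many of the given file sizes fall into each bin. */
    static Map<String, Integer> count(long[] fileSizes) {
        Map<String, Integer> bins = new LinkedHashMap<>();
        for (String label : LABELS) {
            bins.put(label, 0);
        }
        for (long size : fileSizes) {
            for (int i = 0; i < UPPER.length; i++) {
                if (size < UPPER[i]) {
                    bins.merge(LABELS[i], 1, Integer::sum);
                    break; // each file is counted in the first bin it fits
                }
            }
        }
        return bins;
    }

    public static void main(String[] args) {
        long[] sizes = {512, 50_000, 200_000, 1_500_000, 5_000_000};
        System.out.println(count(sizes));
    }
}
```

Each bin then becomes one position on the x-axis, with its count plotted on the y-axis.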
Writing New Globus Usage Statistics Reports
Java Command Line Programs
If reports are being made for a new component or the data necessary for a new report isn't available in one of the xml data reports already generated by the usage statistics code, then a new java command line program must be written for it. If the necessary data is already generated as an xml file, then changes only need to be made to the .xsl files used to transform the xml into gnuplot files, so please refer to "Writing New XSL Files."
A number of utility classes exist and details on the Java API can be found here.
The Globus Usage Statistics java programs are all run in the following generic format: java [report name] -n [number of steps to be taken] -step [size of step {month/day}] [date in yyyy-mm-dd format]
A generic report.java file can be found here. It provides all the necessary code to parse the command line arguments. It also contains the generation of a sample histogram report.
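For orientation, argument parsing for the generic format above might look like the sketch below. It is not the code from the generic report.java file; the class name ReportArgs and its fields are hypothetical, and it assumes that all three values must be supplied.

```java
import java.time.LocalDate;

// Hypothetical sketch of parsing: -n <steps> -step <month|day> <yyyy-mm-dd>
final class ReportArgs {
    final int steps;
    final String stepSize;
    final LocalDate start;

    private ReportArgs(int steps, String stepSize, LocalDate start) {
        this.steps = steps;
        this.stepSize = stepSize;
        this.start = start;
    }

    /** Parses the arguments, rejecting missing or malformed values. */
    static ReportArgs parse(String[] args) {
        int steps = -1;
        String stepSize = null;
        LocalDate start = null;
        for (int i = 0; i < args.length; i++) {
            if ("-n".equals(args[i]) && i + 1 < args.length) {
                steps = Integer.parseInt(args[++i]);
            } else if ("-step".equals(args[i]) && i + 1 < args.length) {
                stepSize = args[++i];
            } else {
                start = LocalDate.parse(args[i]); // expects yyyy-mm-dd
            }
        }
        if (steps < 1 || start == null
                || !("month".equals(stepSize) || "day".equals(stepSize))) {
            throw new IllegalArgumentException(
                    "usage: -n <steps> -step <month|day> <yyyy-mm-dd>");
        }
        return new ReportArgs(steps, stepSize, start);
    }

    public static void main(String[] args) {
        ReportArgs parsed = parse(new String[] {"-n", "5", "-step", "day", "2006-01-15"});
        System.out.println(parsed.steps + " steps of one " + parsed.stepSize
                + " starting " + parsed.start);
    }
}
```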
Once these arguments are parsed, a TimeStep object should be created. This TimeStep advances the date by one step at a time beginning at the starting date.
After the TimeStep is created, a DatabaseRetriever is necessary. This manages the connection to the database, structures the sql queries, queries the database, and returns the results in the form of a ResultSet. ResultSet is part of the java.sql package. The default constructor uses the db.properties file included in the org.globus.usage package. It points to the globus usage statistics database and uses the general login. If a different login or a different database is needed, editing the db.properties file or providing a new properties file for the DatabaseRetriever solves the problem.
The DatabaseRetriever method retrieve structures select queries to the database in SQL. It automatically retrieves the data specified from the given packets during the current time step. The DatabaseRetriever can simply be used to return all the entries in the given fields, or it can use the aggregate functions inherent in SQL. A quick overview of SQL syntax and available aggregate functions can be found at http://www.1keydata.com/sql/sql.html
The retrieve method can be called in several ways, but these are the two most common:
retrieve( String packetName, String [] itemsToSelect, String [] conditions, Date startDate, Date endDate);
retrieve( String packetName, String [] itemsToSelect, Date startDate, Date endDate);
The first call selects each value from the fields named in itemsToSelect from packetName. It only does this for dates between startDate and endDate where the test conditions in String [] conditions are true. The second does the same call, but without any extra conditions.
Here are some examples of calls to the retrieve method:
TimeStep ts = new TimeStep(stepsize, numberofsteps, inputdate);
ResultSet rs;
DatabaseRetriever dbr = new DatabaseRetriever();
rs = dbr.retrieve("gram_packets", new String [] {"ip_address","fault_class"}, ts.getTime(), ts.stepTime());
The above returns the ip address and the fault class for all packets falling in this time period.
rs = dbr.retrieve("gram_packets",new String [] {"ip_address"}, new String [] {"fault_class=0"}, ts.getTime(), ts.stepTime());
The previous statement returns the ip address from gram packets between the two specified dates where the fault_class field equals 0 (this means there was no fault in the job, i.e. it was a success).
rs = dbr.retrieve("gftp_packets",new String [] {"sum(num_bytes)","sum(block_size)"}, ts.getTime(), ts.stepTime());
The above will return only two numbers to the result set: the sum of the number of bytes transferred by all gftp packets between these two dates and the sum of all the block sizes of these packets.
rs = dbr.retrieve("gram_packets",new String [] {"count(*)"},new String [] {"job_credential_endpoint_used"}, ts.getTime(), ts.stepTime());
This last call will count all packets where the boolean value job_credential_endpoint_used is true. The query will only return one number: the count of all packets where the condition is true.
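To make the relationship between these arguments and SQL concrete, the sketch below composes the kind of select statement such a call could correspond to. It does not show how DatabaseRetriever is actually implemented. The class name QuerySketch, the timestamp column name send_time, and the half-open date range are all assumptions made for illustration.

```java
// Hypothetical illustration only: shows how a table name, selected items,
// conditions, and a date range could map onto one SQL select statement.
final class QuerySketch {

    /**
     * Builds a select statement. The timeColumn argument is an assumed
     * column name; the real packet tables may name it differently.
     */
    static String select(String table, String[] items, String[] conditions,
                         String timeColumn, String start, String end) {
        StringBuilder sql = new StringBuilder("SELECT ")
                .append(String.join(", ", items))
                .append(" FROM ").append(table)
                .append(" WHERE ").append(timeColumn).append(" >= '").append(start).append("'")
                .append(" AND ").append(timeColumn).append(" < '").append(end).append("'");
        for (String condition : conditions) {
            sql.append(" AND ").append(condition);
        }
        // Plain concatenation is used for readability; code that handles
        // untrusted input should bind values with java.sql.PreparedStatement.
        return sql.toString();
    }

    public static void main(String[] args) {
        System.out.println(select("gram_packets",
                new String[] {"ip_address"},
                new String[] {"fault_class=0"},
                "send_time", "2006-01-15", "2006-01-16"));
    }
}
```

Aggregate expressions such as count(*) or sum(num_bytes) fit into the same pattern, since they are passed as ordinary entries of the item list.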
Once the data is retrieved, the result set information can be accessed using the methods documented at http://java.sun.com/j2se/1.4.2/docs/api/java/sql/ResultSet.html
If the output is going to be formatted into histogram elements, the java class HistogramParser is very useful. It is initialized as follows:
HistogramParser histogram = new HistogramParser(String graphTitle, String outputFileName, String axisName, int numberofSteps);
Every HistogramParser object will produce one histogram element for the output. It will contain the specified title, output file name, and axis name.
Every time a new entry element should be added to the histogram, call the method nextStep(String startDate, String endDate). This will track the item elements in that entry for the given dates.
To add item elements, call the method addData(String itemName, double value).
To display the histogram, call the method output(PrintStream ps). Typically this PrintStream will be System.out.
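The order of these calls can be illustrated with a small stand-in class. HistogramSketch below is hypothetical and is not the real HistogramParser: it only mirrors the nextStep, addData, and output method names described above, and the attribute names it writes (title, start, end, name, value) are assumptions rather than the Globus Usage Statistics XML schema.

```java
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration; NOT org.globus.usage's HistogramParser.
final class HistogramSketch {
    private final String title;
    private final List<String> lines = new ArrayList<>();
    private boolean entryOpen = false;

    HistogramSketch(String title) {
        this.title = title;
    }

    /** Starts a new entry for one time step. */
    void nextStep(String startDate, String endDate) {
        closeEntry();
        lines.add("  <entry start=\"" + startDate + "\" end=\"" + endDate + "\">");
        entryOpen = true;
    }

    /** Adds one item to the current entry. */
    void addData(String itemName, double value) {
        if (!entryOpen) {
            throw new IllegalStateException("call nextStep before addData");
        }
        lines.add("    <item name=\"" + itemName + "\" value=\"" + value + "\"/>");
    }

    /** Writes the histogram element; no XML escaping is done in this sketch. */
    void output(PrintStream ps) {
        closeEntry();
        ps.println("<histogram title=\"" + title + "\">");
        for (String line : lines) {
            ps.println(line);
        }
        ps.println("</histogram>");
    }

    private void closeEntry() {
        if (entryOpen) {
            lines.add("  </entry>");
            entryOpen = false;
        }
    }

    public static void main(String[] args) {
        HistogramSketch histogram = new HistogramSketch("GRAM faults");
        histogram.nextStep("2006-01-15", "2006-01-16");
        histogram.addData("no fault", 3);
        histogram.nextStep("2006-01-16", "2006-01-17");
        histogram.addData("no fault", 5);
        histogram.output(System.out);
    }
}
```

In a real report, the values passed to addData would come from the ResultSet returned for each time step.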
Here are several sample reports utilizing all of these classes:
GRAM ErrorReport.java
GFTP ByteReport.java
GFTP HostReport.java
Writing New XSL Files
To turn the generic xml format into something useful, ant is used along with xsl stylesheets to transform this information. A good tutorial on xslt can be found at http://www.w3schools.com/xsl/
Files already exist to create graphs from generic histogram elements. For stacked histograms, where items are displayed on top of each other in a colorful stack, these two files can be used:
gnuplot-histogram-data.xsl
gnuplot-histogram-instructions.xsl
These generate a .data and a .gnuplot file that will be used by GNUplot.
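The kind of transformation such a stylesheet performs can be tried out directly from Java with the standard javax.xml.transform API. The stylesheet embedded below is a simplified sketch and is not gnuplot-histogram-data.xsl; the class name XslDataSketch, the value attribute on item elements, and the input document are assumptions made for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch: turns histogram entries into whitespace-separated rows.
final class XslDataSketch {

    // One output row per entry: the entry position, then each item's value.
    static final String STYLESHEET =
          "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:template match=\"/\">"
        + "<xsl:for-each select=\"report/histogram/entry\">"
        + "<xsl:value-of select=\"position()\"/>"
        + "<xsl:for-each select=\"item\">"
        + "<xsl:text> </xsl:text><xsl:value-of select=\"@value\"/>"
        + "</xsl:for-each>"
        + "<xsl:text>&#10;</xsl:text>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the stylesheet to the given report and returns the text output. */
    static String toData(String reportXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(reportXml)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String report = "<report><histogram>"
                + "<entry><item value=\"0\"/><item value=\"4\"/></entry>"
                + "<entry><item value=\"3\"/><item value=\"0\"/></entry>"
                + "</histogram></report>";
        System.out.print(toData(report));
    }
}
```

For the sample input, the output consists of the two rows 1 0 4 and 2 3 0, which is the column layout GNUplot expects in a data file.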
To generate graphs where items are displayed side by side rather than one on top of the others, these files can be used:
gnuplot-bar-data.xsl
gnuplot-bar-instructions.xsl
These files create one graph for every histogram element in the xml file. Here is an example of an ant file that will utilize these two stylesheets:
reports.xml
Also here is an example of a .xsl file that will turn ErrorReport.xml into html tables:
html-instructions.xsl
GNUplot Instruction Files
The program GNUplot is used to format all of the Globus Usage Statistics graphs. It provides a wide variety of options for formatting and is relatively easy to use.
A good tutorial on using GNUplot can be found here http://www.duke.edu/~hpgavin/gnuplot.html
And here is a set of demonstration graphs for both GNUplot versions 4.0 and 4.1 http://www.gnuplot.info/screenshots/index.html#demos
These can be very helpful for finding new ways to display data.
Here is a link to the full GNUplot 4.0 Documentation:
http://www.gnuplot.info/docs/gnuplot.html
Basically GNUplot works as follows. The set command initializes all of the various formatting options for the graph, such as the output file, the ranges of the x and y axes, whether or not to autoscale, and a variety of other options discussed in full in the GNUplot documentation.
Once the options are set, a plot command should be made. This will make one graph and put it into the specified output file. The plot command works something like this:
plot 'file.data' using 1:2 title "column 2" with boxes fs pattern 0, 'file.data' using 1:3 title "column 3" with boxes fs pattern 1
This will graph column 1 of file.data on the x-axis and column 2 on the y-axis using boxes filled with pattern 0, and also column 1 on the x-axis against column 3 on the y-axis on the same graph. The key will contain two titles- column 2 and column 3. However, since these two items are both graphed against column 1 on the x-axis, they will overlap each other. This is why the value element of the item type exists in the Globus Usage Statistics XML schema.
If bars are desired side by side (rather than stacked on top of each other) the following trick can be used:
my_width=.1
set boxwidth my_width
plot 'file.data' using 1:2 title "column 2" with boxes fs pattern 0, 'file.data' using ($1+1*my_width):3 title "column 3" with boxes fs pattern 1, 'file.data' using ($1+2*my_width):4 title "column 4" with boxes fs pattern 2
The variable my_width is used to set how wide the boxes will be. Then each successive column is plotted one value of my_width further down the axis than the previous columns.
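Since the reports generate their gnuplot instruction files automatically, the same offset pattern can be produced for any number of columns. The sketch below builds such a plot command as a string; the class name BarPlotCommand is hypothetical and the code is not taken from the existing stylesheets.

```java
// Hypothetical helper: builds a side-by-side "plot" command in which each
// successive data column is shifted one more multiple of my_width along x.
final class BarPlotCommand {

    /** Returns a plot command for the given number of value columns. */
    static String sideBySide(String dataFile, int valueColumns) {
        StringBuilder cmd = new StringBuilder("plot ");
        for (int i = 0; i < valueColumns; i++) {
            int column = i + 2; // column 1 holds the x positions
            String x = (i == 0) ? "1" : "($1+" + i + "*my_width)";
            if (i > 0) {
                cmd.append(", ");
            }
            cmd.append("'").append(dataFile).append("' using ")
               .append(x).append(":").append(column)
               .append(" title \"column ").append(column).append("\"")
               .append(" with boxes fs pattern ").append(i);
        }
        return cmd.toString();
    }

    public static void main(String[] args) {
        System.out.println("my_width=.1");
        System.out.println("set boxwidth my_width");
        System.out.println(sideBySide("file.data", 3));
    }
}
```

With three value columns, the generated command is the same as the hand-written plot command shown above.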
The axis of the graph can also be specified. Here is an example:
plot 'file.data' using 1:2 axes x1y1 title "column 2" with boxes fs pattern 0, 'file.data' using ($1+my_width):3 axes x1y2 title "column 3" with boxes fs pattern 1
This will plot column 2's values against the left y axis, and then plot column 3's values against the right y-axis. This can be a good way to resolve issues where one column's values are significantly larger than the others. This could force the scale on one axis to be so large that some of the columns aren't viewable. Using one axis for large values and one for smaller values can fix this issue.
After a plot command is made, a graph is sent to the output file. If a new output file is then specified, a new plot command can be made from the same instruction file. Otherwise the second plot command will simply overwrite the previous output file.
GNUplot Data Files
The gnuplot data files for Globus usage statistics are structured like this sample file:
1 0 4
2 3 0
3 0 4.5


1 2 1
2 0 3
3 4 0
The leftmost column is used for the x-axis positions. Since most globus usage stats are plotted with time on the x-axis, each time step is represented by an integer on the x-axis. The second and third columns are the actual values of the statistics reports, which will be plotted on the y-axis versus time on the x-axis. There can be as many columns as necessary. The two blank lines between the third and fourth rows mark the start of a new index. The first three rows will be index 0, and the last three rows index 1. This allows for only part of a data file to be plotted at one time.
The command plot 'file.data' index 0 using 1:2 title "column2 index 0" with boxes fs pattern 0 will only plot the values in the first index.
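A data file in this layout can be written with a few lines of Java. The sketch below is hypothetical (the class name DataFileSketch is not part of the toolkit) and assumes that x positions are simply numbered from 1 within each index.

```java
// Hypothetical helper: formats blocks of rows as a gnuplot data file, with
// two blank lines between blocks so that each block becomes its own index.
final class DataFileSketch {

    /** blocks[b][row] holds the y values of one row in index b. */
    static String format(double[][][] blocks) {
        StringBuilder sb = new StringBuilder();
        for (int b = 0; b < blocks.length; b++) {
            if (b > 0) {
                sb.append("\n\n"); // two blank lines start the next index
            }
            for (int row = 0; row < blocks[b].length; row++) {
                sb.append(row + 1); // x position, numbered from 1
                for (double value : blocks[b][row]) {
                    sb.append(' ').append(number(value));
                }
                sb.append('\n');
            }
        }
        return sb.toString();
    }

    // Prints whole numbers without a trailing ".0".
    private static String number(double value) {
        return value == Math.rint(value) ? String.valueOf((long) value) : String.valueOf(value);
    }

    public static void main(String[] args) {
        double[][][] blocks = {
            {{0, 4}, {3, 0}, {0, 4.5}},  // index 0
            {{2, 1}, {0, 3}, {4, 0}}     // index 1
        };
        System.out.print(format(blocks));
    }
}
```

The values in main reproduce the sample file, so the first block can be plotted with index 0 and the second with index 1.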
Finalizing Reports
Once the reports are finished, they can be added to the appropriate Globus Usage Statistics Script for their component. When results are generated, they can then be posted online at the Globus Usage Statistics webpage, which at the moment is www.mcs.anl.gov/~gawor/stats/other