PROV-Template: A Quick Start

The aim of this blog post is to provide simple guidelines to generate provenance using the PROV-Template approach.

A quick reminder: a provenance template is a PROV document describing the provenance that is intended to be generated. A provenance template includes some variables that are placeholders for values. So, a provenance template can be seen as a declarative specification of the provenance intended to be generated by an application. A set of bindings contains associations between variables and values. The PROV-Template expansion algorithm, when provided with a template and a set of bindings, generates a provenance document in which variables have been replaced by values.

Therefore, three steps are involved in this methodology.

  1. Design a “provenance template” describing the structure of the provenance intended to be generated.
  2. Instrument the application, log values, and create “binding files” from these values.
  3. Produce provenance by expanding the template using binding files.
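
To give a feel for step 3, here is a toy Python sketch that treats a template as a list of PROV-N-like strings and replaces each placeholder in the var namespace with the value bound to it. It is a deliberately simplified substitution (one value per variable, no handling of generated identifiers), not the expansion algorithm implemented by ProvToolbox; the names expand, TEMPLATE and BINDINGS are made up for this illustration.

```python
import re

# Toy template: each statement mentions variables in the "var" namespace.
TEMPLATE = [
    "activity(var:operation)",
    "entity(var:produced)",
    "wasGeneratedBy(var:produced, var:operation, -)",
]

# Toy bindings: exactly one value per variable (a simplifying assumption).
BINDINGS = {"operation": "ex:op1", "produced": "ex:e3"}


def expand(template, bindings):
    """Replace every var:NAME placeholder with the value bound to NAME."""
    def substitute(match):
        name = match.group(1)
        # Leave unbound variables untouched so that they remain visible.
        return bindings.get(name, match.group(0))

    return [re.sub(r"var:(\w+)", substitute, statement) for statement in template]


if __name__ == "__main__":
    for line in expand(TEMPLATE, BINDINGS):
        print(line)
```

The same template can be expanded as many times as there are sets of bindings, which is what makes the approach attractive for repetitive computations.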

We consider a simple computation, which we would like to describe with provenance. The computation consisted of three calls of binary functions: the functions were composed in such a way that the results of two calls were used by the third one. To simplify, we assume that the operations were arithmetic +, -, and *, and that the values flowing in and out of these operations were integers. Note my use of the past tense: the aim of provenance is to describe past computation, as opposed to a future, hypothetical computation (or workflow).

(10+11)-(7*5)

As we have 3 binary functions, we design a template describing the invocation of a binary function.  It consists of an activity (denoted by variable operation), two used entities (denoted by variables consumed1 and consumed2), a generated entity (denoted by variable produced), and an agent (denoted by variable agent) responsible for the activity.  Graphically, the template can be represented as follows.

Template for the invocation of a binary function

Using the PROV-N notation, the template is expressed as follows.  We see that variables are declared in the namespace with prefix var. Each entity and activity is associated with a type, expressed by a variable, which can also be instantiated.

document
 prefix tmpl <http://openprovenance.org/tmpl#>
 prefix var <http://openprovenance.org/var#>
 prefix vargen <http://openprovenance.org/vargen#>

 bundle vargen:b
  activity(var:operation, [ prov:type='var:operation_type' ] )
  agent(var:agent)
  wasAssociatedWith(var:operation,var:agent,-)
  entity(var:consumed1,[prov:value='var:consumed_value1'])
  entity(var:consumed2,[prov:value='var:consumed_value2'])
  used(var:operation, var:consumed1, - )
  used(var:operation, var:consumed2, - )  
  entity(var:produced,[prov:type='var:produced_type',prov:value='var:produced_value'])
  wasGeneratedBy(var:produced, var:operation, - )
  wasDerivedFrom(var:produced, var:consumed1)
  wasDerivedFrom(var:produced, var:consumed2)
 endBundle
endDocument

To be able to generate provenance, one needs to define so-called “bindings files”, associating variables with values. The structure of a bindings file is fairly straightforward: with the most recent version of ProvToolbox, a bindings file can be expressed as a simple JSON structure. Such JSON structures are very easy to generate programmatically from multiple programming languages. However, in this blog post, we do not want to actually program anything in order to generate provenance.

Therefore, we are going to assume that the application already logs values of interest. We are further going to assume that the data can be easily converted to a tabular format, and specifically, that a CSV (comma separated values) representation can be constructed from those logs. The structure that we expect is illustrated in the following figure. In the first line of the file, we find variable names (exactly those found in the template) acting as column headers. In the second line, we find the type of the values found in the table.

Application log as a CSV file. First line contains variable names whereas second line contains the type of their values. Subsequent lines are the actual values.

Concretely, the CSV file uses commas as separators. The third, fourth, and fifth lines contain details of the invocations of the plus, times, and subtraction functions.

operation, operation_type, consumed1, consumed_value1, consumed2, consumed_value2, produced, produced_value, agent
prov:QUALIFIED_NAME, prov:QUALIFIED_NAME, prov:QUALIFIED_NAME, xsd:int, prov:QUALIFIED_NAME, xsd:int, prov:QUALIFIED_NAME, xsd:int, prov:QUALIFIED_NAME
ex:op1, ex:plus, ex:e1, 10, ex:e2, 11, ex:e3, 21, ex:Luc
ex:op2, ex:times, ex:e4,  5, ex:e5,  7, ex:e6, 35, ex:Luc
ex:op3, ex:subtraction, ex:e3, 21, ex:e6, 35, ex:e7, -14, ex:Luc
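
How an application might come to produce such a log is sketched below in Python, purely as an illustration: every call to a binary function appends one row with the identifiers and values involved. The class Logger, the function to_csv, and the scheme for numbering ex:e and ex:op identifiers are assumptions of this sketch; they are not part of ProvToolbox.

```python
import csv
import io
import operator

HEADER = ["operation", "operation_type", "consumed1", "consumed_value1",
          "consumed2", "consumed_value2", "produced", "produced_value", "agent"]
TYPES = ["prov:QUALIFIED_NAME", "prov:QUALIFIED_NAME", "prov:QUALIFIED_NAME",
         "xsd:int", "prov:QUALIFIED_NAME", "xsd:int", "prov:QUALIFIED_NAME",
         "xsd:int", "prov:QUALIFIED_NAME"]


class Logger:
    """Hands out unique identifiers and records one row per function call."""

    def __init__(self, agent):
        self.agent = agent
        self.rows = []
        self.entities = 0
        self.operations = 0

    def value(self, v):
        """Wrap a value as an (identifier, value) pair with a fresh identifier."""
        self.entities += 1
        return (f"ex:e{self.entities}", v)

    def call(self, kind, function, left, right):
        """Apply a binary function and log what it consumed and produced."""
        self.operations += 1
        produced = self.value(function(left[1], right[1]))
        self.rows.append([f"ex:op{self.operations}", f"ex:{kind}",
                          left[0], left[1], right[0], right[1],
                          produced[0], produced[1], self.agent])
        return produced


def to_csv(logger):
    """Serialise header, types and logged rows in the layout described above."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows([HEADER, TYPES] + logger.rows)
    return out.getvalue()


log = Logger("ex:Luc")
total = log.call("plus", operator.add, log.value(10), log.value(11))
product = log.call("times", operator.mul, log.value(5), log.value(7))
result = log.call("subtraction", operator.sub, total, product)
print(to_csv(log))
```

The important design point is that identifiers are allocated once and reused: the entity produced by the addition is the very same entity consumed by the subtraction.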

Each line can automatically be converted to a JSON file. For instance, the third line containing the details of the addition operation can be converted to the following JSON structure, which is essentially a dictionary associating each variable with its corresponding value, with an explicit representation of the typing information where appropriate.

{
 "var":
   {"operation": [{"@id": "ex:op1"}],
    "operation_type": [{"@id": "ex:plus"}],
    "consumed1": [{"@id": "ex:e1"}],
    "consumed_value1": [ {"@value": "10", "@type": "xsd:int"}],
    "consumed2": [{"@id": "ex:e2"}],
    "consumed_value2": [{"@value": "11", "@type": "xsd:int"}],
    "produced": [{"@id": "ex:e3"}],
    "produced_value": [ {"@value": "21", "@type": "xsd:int"} ],
    "agent": [{"@id": "ex:Luc"}]},
 "context": {"ex": "http://example.org/"}
}

We do not need to create this JSON structure ourselves. Instead, we provide an awk script that converts a given line into a bindings file.


function ltrim(s) { sub(/^[ \t\r\n]+/, "", s); return s }
function rtrim(s) { sub(/[ \t\r\n]+$/, "", s); return s }
function trim(s)  { return rtrim(ltrim(s)); }

BEGIN {
      printf("{\"var\":\n{")
      OFS=FS=","
}
NR==1 {                                # Process header
    for (i=1;i<=NF;i++)                
        head[i] = trim($i)                  
    next                               
}
NR==2 {                                # Process types
    for (i=1;i<=NF;i++)                
        type[i] = trim($i)             
    next                               
}
NR==line{
    first=1
    for (i=1;i<=NF;i++) {              # For each field
	if (first) {
	    first=0
	} else {
	    printf ","
	}
	if (type[i]=="prov:QUALIFIED_NAME") {
	    printf "\"%s\": [{\"@id\": \"%s\"}]",  trim(head[i]), trim($i)
	} else if (type[i]=="xsd:string") {
	    printf "\"%s\": [ \"%s\" ]",  trim(head[i]), trim($i)
	} else  {
	    printf "\"%s\": [ {\"@value\": \"%s\", \"@type\": \"%s\"} ]",trim(head[i]), trim($i), trim(type[i])
	}
    }
    printf "\n"                        
}
END {
    printf("},\n")
    printf("\"context\": {\"ex\": \"http://example.org/\"}\n")
    printf("}\n")    
}
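
For readers who would rather not use awk, the same conversion can be sketched in Python. The function below mirrors the three cases of the script (qualified names, plain strings, and typed values). Its name, bindings_for_line, and the shortened SAMPLE data are assumptions of this sketch, and the ex prefix is hard-coded just as it is in the awk script.

```python
import csv
import io
import json


def bindings_for_line(csv_text, line):
    """Build the bindings dictionary for the given 1-based line of the CSV.

    Line 1 holds variable names and line 2 their types, so data starts at line 3.
    """
    rows = [[cell.strip() for cell in row]
            for row in csv.reader(io.StringIO(csv_text))]
    names, types, values = rows[0], rows[1], rows[line - 1]
    var = {}
    for name, kind, value in zip(names, types, values):
        if kind == "prov:QUALIFIED_NAME":
            var[name] = [{"@id": value}]
        elif kind == "xsd:string":
            var[name] = [value]
        else:
            # Any other type is kept as an explicitly typed literal.
            var[name] = [{"@value": value, "@type": kind}]
    return {"var": var, "context": {"ex": "http://example.org/"}}


# Shortened sample with only some of the columns, for illustration.
SAMPLE = """operation, operation_type, consumed1, consumed_value1, produced
prov:QUALIFIED_NAME, prov:QUALIFIED_NAME, prov:QUALIFIED_NAME, xsd:int, prov:QUALIFIED_NAME
ex:op1, ex:plus, ex:e1, 10, ex:e3
"""

if __name__ == "__main__":
    print(json.dumps(bindings_for_line(SAMPLE, 3), indent=1))
```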

To facilitate the processing, we even provide a Makefile with a target do.csv that processes a line (variable LINE) of the CSV file to generate a bindings file. The bindings file is then used by the utility provconvert to expand the template file. The target workflow hard-codes the presence of three data lines in the CSV (lines 3 to 5), the generation of a bindings file for each line, and the expansion of the template with these bindings. All files are then merged in a single provenance file using the -merge option of provconvert.

LINE=4

do.csv:
	cat bindings.csv | awk -v line=$(LINE) -f src/main/resources/awk/tobindings.awk  > target/bindings$(LINE).json
	provconvert -bindver 3 -infile template_block.provn -bindings target/bindings$(LINE).json -outfile target/block$(LINE).provn


workflow:
	$(MAKE) LINE=3 do.csv
	$(MAKE) LINE=4 do.csv
	$(MAKE) LINE=5 do.csv
	printf "file, target/block3.provn, provn\nfile, target/block4.provn, provn\nfile, target/block5.provn, provn\n" | provconvert -merge - -flatten -outfile target/wfl.svg

The resulting provenance is displayed in the following figure.

Expanded provenance showing three activities, consumed and generated entities, and an agent.

 

Concluding  Remarks

Given a log file in CSV format, we have shown it is becoming easy to generate PROV-compliant provenance without having to write a single line of code: an awk script converts CSV data to JSON, used to expand a template expressed in a PROV-compliant format.

For the provenance to be meaningful, the application must be instrumented to log the relevant values. For instance, each entity/agent/activity is expected to have been given a unique identifier.

The template design phase is also critical. In our design, we decided that one template would describe the invocation of a single function. The same template was reused for all function calls. Alternatives are possible: multiple activities could be described in a single template, or different types of activities could be described in different templates. I will come back to this issue in another blog post in a few weeks.

What is in ProvToolbox 0.7.2?

1. Introduction

Yesterday, I released ProvToolbox 0.7.2, which includes the following novel features.

2. Novel Features

2.1. MacOS X Installer

Continuing our efforts of providing binary installers to facilitate installation of ProvToolbox, this release includes an installer for MacOS X.

Simply follow the link http://openprovenance.org/java/installer/provconvert-0.7.2.dmg to get access to the installation image.

Installation Disk

Click on the Installer. Note that you need to allow installation of programs from any source in your security preferences. Then simply follow the instructions. The installer will install all libraries and executables in /Applications/provconvert (default location, which can be overridden), as well as a symbolic link making the provconvert executable available in your execution path. An Uninstaller is also available as an executable jar file /Applications/provconvert/Uninstaller/uninstaller.jar.

provconvert Installer

Et voila! The executable can be invoked directly from the command line.

provconvert -version

which should return provconvert version 0.7.2 (2015-09-15 20:16).

2.2. Templates

As we continue to use templates in our applications, two further requirements have been implemented. It is now possible to expand a template and strip from the result any variable that has not been instantiated. For this, simply pass the option -allexpand to provconvert, in conjunction with the -bindings option (see Tutorial 4 (part 1) and Tutorial 4 (part 2) on template processing in ProvToolbox). Furthermore, an error code is returned when not all variables have been expanded.

2.3. Interoperability

As we are integrating ProvToolbox and ProvStore in the interoperability harness developed by the Software Sustainability Institute, we have fixed some minor issues to ensure interoperability between our software stacks.

2.4. provconvert artifact

The artifact toolbox has been renamed to provconvert, since we have plans for other artifacts out of ProvToolbox.

3. Conclusion

For all details about ProvToolbox, see the github.io page http://lucmoreau.github.io/ProvToolbox/.

What is in ProvToolbox 0.7.1?

1. Introduction

Yesterday, I released ProvToolbox 0.7.1. It is a minor release, fixing minor bugs of 0.7.0, and including a useful new feature.

2. Novel Features

2.1. Debian Package

To facilitate installation, a new binary release format is now supported: Debian packaging to support binary release on Ubuntu and other Debian-based Linux distributions. You just need to run the following commands.

wget https://repo1.maven.org/maven2/org/openprovenance/prov/toolbox/0.7.1/toolbox-0.7.1.deb
dpkg --install toolbox-0.7.1.deb

This is in addition to RPM support introduced in 0.6.2:

rpm -U https://repo1.maven.org/maven2/org/openprovenance/prov/toolbox/0.7.1/toolbox-0.7.1-rpm.rpm

2.2. Visualization

Modifications to the visualisation component prov-dot allow edge thickness, node size, and tooltips (on SVG) to be controlled. For this, the provenance graph nodes and edges need to be annotated with the reserved attributes dot:size and dot:tooltip. The following figure illustrates the kind of graphs that can now be generated.

A summarisation of the provenance challenge workflow. Nodes are to be understood as provenance types. Thickness of edges and size of nodes reflect their frequency in the summarised document.

2.3. Bug fixes

I also fixed some minor bugs in qualified namespaces in the prov-sql package, and updated the reserved namespace for ProvToolbox.

3. Conclusion

Tell me how you use ProvToolbox and/or provconvert, and for which purpose. Share details of your projects with me, and I will add them to https://github.com/lucmoreau/ProvToolbox/wiki/Projects-and-Applications-Using-ProvToolbox.

For all details about ProvToolbox, see the github.io page http://lucmoreau.github.io/ProvToolbox/.

ProvToolbox Tutorial 3: Merging PROV Documents

1. Introduction

It has become a requirement in several of our applications to merge PROV documents. The purpose of this tutorial is to explain how ProvToolbox allows documents to be merged, ensuring that descriptions are uniquely represented with all their attributes, merging bundles they may contain, and optionally “flattening” them.

This functionality is directly available from the command line using provconvert.

The tutorial is standalone and a zip archive can be downloaded from the following URL: http://search.maven.org/remotecontent?filepath=org/openprovenance/prov/ProvToolbox-Tutorial3/0.7.0/ProvToolbox-Tutorial3-0.7.0-src.zip. The tutorial can also be found on the ProvToolbox project on GitHub.

The tutorial assumes that provconvert has been installed and is available in the execution path.
The tutorial relies on a Makefile and can simply be run by calling:

make do.all

2. Examples of Merges

2.1 Merging two documents without bundles

Our first example consists of two documents. The first document “doc1” consists of the attribution of an entity e1 to an agent ag1. The entity has an attribute attr1.

doc1

The second document “doc2” describes the derivation of the same entity e1 from another entity e0. The description of the entity e1 again contains the attribute attr1, this time with the value val2.

doc2

By merging the two documents, we obtain a new document, in which the entity e1 is both attributed to ag1 and derived from e0. The description of e1 contains both values of the attribute attr1.

Merged documents doc1 and doc2

Merging the two documents is simply performed by calling provconvert with argument -merge as follows.

provconvert -merge doc1-2-listing.txt -outfile target/doc1-2.provn

The -merge option expects a path to a file (or - to indicate standard input) that lists the files that have to be merged. In our case, we have a file doc1-2-listing.txt with the following contents:

file, src/main/resources/doc1.provn, provn
file, src/main/resources/doc2.provn, provn

Each line consists of three elements separated by a comma:

  1. A tag indicating if we are dealing with a file on the file system or a URL
  2. The path to the file or a full http URL
  3. The PROV format expected to be read
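
To make the listing format concrete, the following Python sketch splits each line into its three elements. It is only an illustration of the file layout, under the assumption that paths and URLs contain no commas; the names MergeEntry and parse_listing are invented for this example and are not part of ProvToolbox.

```python
from typing import NamedTuple


class MergeEntry(NamedTuple):
    kind: str      # first element: tag saying whether this is a file or a URL
    location: str  # second element: path on the file system, or full URL
    fmt: str       # third element: PROV format expected to be read


def parse_listing(text):
    """Split each non-empty line into its three comma-separated elements."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, location, fmt = (part.strip() for part in line.split(","))
        entries.append(MergeEntry(kind, location, fmt))
    return entries


LISTING = """file, src/main/resources/doc1.provn, provn
file, src/main/resources/doc2.provn, provn
"""

for entry in parse_listing(LISTING):
    print(entry)
```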

For completeness, we show the details of the documents in PROV-N notation. First, “doc1”:

document

 prefix ex <http://example.org/#>

 entity(ex:e1,[ex:attr1="val1"])
 agent(ex:ag1)
 wasAttributedTo(ex:e1, ex:ag1)

endDocument

Then, “doc2”:

document

 prefix ex <http://example.org/#>

 entity(ex:e1,[ex:attr1="val2"])
 entity(ex:e0) 
 wasDerivedFrom(ex:e1, ex:e0)

endDocument

Finally, the merged document:

document
 prefix ex <http://example.org/#>

 entity(ex:e1,[ex:attr1 = "val1" %% xsd:string, ex:attr1 = "val2" %% xsd:string])
 entity(ex:e0)
 agent(ex:ag1)
 wasDerivedFrom(ex:e1, ex:e0)
 wasAttributedTo(ex:e1, ex:ag1)
endDocument

The merge operation follows key constraints of the PROV-CONSTRAINTS specification, such as key-object (constraint 22) and key-properties (constraint 23).
The reader who is familiar with the RDF representation of PROV will note that the merge operation is simply obtained by “concatenating” all the RDF files together.
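
The effect of the merge on element descriptions can be pictured with a small Python model: descriptions that share an identifier are combined, and their attribute-value pairs are unioned. This is only a sketch of the behaviour observed on the example above. The dictionary representation and the function merge_elements are assumptions of the sketch, not ProvToolbox code, and relations such as wasDerivedFrom are left out for brevity.

```python
def merge_elements(*documents):
    """Combine element descriptions by identifier, unioning their attributes.

    Each document maps an identifier to a list of (attribute, value) pairs.
    """
    merged = {}
    for document in documents:
        for identifier, attributes in document.items():
            target = merged.setdefault(identifier, [])
            for pair in attributes:
                if pair not in target:  # keep each attribute-value pair once
                    target.append(pair)
    return merged


doc1 = {"ex:e1": [("ex:attr1", "val1")], "ex:ag1": []}
doc2 = {"ex:e1": [("ex:attr1", "val2")], "ex:e0": []}

merged = merge_elements(doc1, doc2)
print(merged["ex:e1"])
```

Merging a document with itself leaves it unchanged in this model, which matches the intuition that each description is represented only once.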

The merge operation becomes interesting in the presence of bundles.

2.2 Merging two documents with distinct bundles

First, we consider two documents with distinct bundles.

We now examine a variant of “doc1”, which contains a bundle bun1. In the illustration, the bundle is represented by a rectangle, which contains a description of e2 generated by a2.

doc1 with bundle bun1

The second document is a variant of “doc2” with another bundle named bun2. It contains a description of e3 generated by a3.

doc2 with bundle bun2

After merging the two documents, we obtain a new document containing both bun1 and bun2.

doc1 with bundle bun1 merged with doc2 with bundle bun2

As we can see, as the two bundles have different names, they are kept distinct in the merged document.

Concretely, the first document, with bundle bun1, is as follows.

document
 prefix ex <http://example.org/#>

 entity(ex:e1,[ex:attr1="val1"])
 agent(ex:ag1)
 wasAttributedTo(ex:e1, ex:ag1)

 bundle ex:bun1
   entity(ex:e2)
   activity(ex:a2,-,-)
   wasGeneratedBy(ex:e2,ex:a2,-)
 endBundle

endDocument

The second document, with bundle bun2, is as follows.

document
 prefix ex <http://example.org/#>

 entity(ex:e1,[ex:attr1="val2"])
 entity(ex:e0) 
 wasDerivedFrom(ex:e1, ex:e0)

 bundle ex:bun2
   entity(ex:e3)
   activity(ex:a3,-,-)
   wasGeneratedBy(ex:e3,ex:a3,-)
 endBundle

endDocument

The merged document, with the two bundles, is as follows:

document
 prefix ex <http://example.org/#>

 entity(ex:e1,[ex:attr1 = "val1" %% xsd:string, ex:attr1 = "val2" %% xsd:string])
 entity(ex:e0)
 agent(ex:ag1)
 wasDerivedFrom(ex:e1, ex:e0)
 wasAttributedTo(ex:e1, ex:ag1)

 bundle ex:bun2
  entity(ex:e3)
  activity(ex:a3,-,-)
  wasGeneratedBy(ex:e3,ex:a3,-)
 endBundle

 bundle ex:bun1
  entity(ex:e2)
  activity(ex:a2,-,-)
  wasGeneratedBy(ex:e2,ex:a2,-)
 endBundle
endDocument

2.3 Merging and flattening two documents with distinct bundles

We can optionally use the -flatten option to “remove” bundles, and “pour” their content in the surrounding document.

provconvert -merge doc1b1-2b2-listing.txt -flatten -outfile target/doc1b1-2b2-flatten.provn

The resulting document no longer contains bundles.

Merge and flatten of doc1 with bundle bun1 and doc2 with bundle bun2

document
 prefix ex <http://example.org/#>

 entity(ex:e1,[ex:attr1 = "val1" %% xsd:string, ex:attr1 = "val2" %% xsd:string])
 entity(ex:e0)
 agent(ex:ag1)
 wasDerivedFrom(ex:e1, ex:e0)
 wasAttributedTo(ex:e1, ex:ag1)
 entity(ex:e3)
 activity(ex:a3,-,-)
 wasGeneratedBy(ex:e3,ex:a3,-)
 entity(ex:e2)
 activity(ex:a2,-,-)
 wasGeneratedBy(ex:e2,ex:a2,-)
endDocument
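
Flattening can be pictured as follows in Python. The sketch represents a document as top-level statements plus a dictionary of named bundles, and pours the content of every bundle into the top level. The representation and the function flatten are assumptions made for illustration; they do not reflect how ProvToolbox stores documents internally.

```python
def flatten(document):
    """Return a bundle-free document: top-level statements, then bundle contents."""
    statements = list(document["statements"])
    for bundle_statements in document["bundles"].values():
        for statement in bundle_statements:
            if statement not in statements:  # avoid duplicates after pouring
                statements.append(statement)
    return {"statements": statements, "bundles": {}}


merged = {
    "statements": ["entity(ex:e0)", "agent(ex:ag1)"],
    "bundles": {
        "ex:bun2": ["entity(ex:e3)", "activity(ex:a3,-,-)"],
        "ex:bun1": ["entity(ex:e2)", "activity(ex:a2,-,-)"],
    },
}

flat = flatten(merged)
print(flat["statements"])
```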

2.4 Merging two documents with the same bundle

Now, let us consider a variant of “doc2” with a bundle bun1, the same identifier as the bundle we had in the “doc1” variant. In the figure, we see that bundle bun1 contains a description of a2 generating e3.

A variant of doc2 with bundle named bun1

If we now merge doc1 with bundle bun1 and doc2 with bundle bun1, the merge procedure merges the descriptions contained in the two instances of bundle bun1. We obtain:

doc1 with bundle bun1 merged with doc2 with bundle bun1
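
As a rough model of this case, the Python sketch below unions the contents of bundles that carry the same identifier. The statement strings are my own rendering of the two variants described in this section, and the function merge_bundles is hypothetical; neither is taken from ProvToolbox or from the tutorial files.

```python
def merge_bundles(*documents):
    """Union bundle contents: bundles with the same identifier become one."""
    merged = {}
    for bundles in documents:
        for name, statements in bundles.items():
            target = merged.setdefault(name, [])
            for statement in statements:
                if statement not in target:  # shared descriptions appear once
                    target.append(statement)
    return merged


doc1_bundles = {"ex:bun1": ["entity(ex:e2)", "activity(ex:a2,-,-)",
                            "wasGeneratedBy(ex:e2,ex:a2,-)"]}
doc2_bundles = {"ex:bun1": ["entity(ex:e3)", "activity(ex:a2,-,-)",
                            "wasGeneratedBy(ex:e3,ex:a2,-)"]}

both = merge_bundles(doc1_bundles, doc2_bundles)
print(both["ex:bun1"])
```

In this model the activity a2, described in both instances of the bundle, ends up with a single description that generates both e2 and e3.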

2.5 Merging and flattening two documents with the same bundle

If in addition, we specify the -flatten option, merging and flattening operations result in the following document.

doc1 with bundle bun1 merged with doc2 with bundle bun1, after flattening

3. Conclusion

As our applications generate provenance incrementally, bundle by bundle, the ability to merge documents and collapse bundles has become critical. This functionality is implemented by ProvToolbox in the method IndexedDocument.merge(). This tutorial has shown that it is also directly available from the command line, using the provconvert utility.

What form of processing do you regularly perform on your provenance graphs? Which functionality would you like to see added to ProvToolbox? Tell us about this, or about any other issue related to ProvToolbox, on the GitHub issue tracker.

4. Appendix. Log Change

  • Original version submitted on 2015/07/27

ProvToolbox Tutorial 2: Reading, Converting and Saving PROV Documents

1. Introduction

Building on the first ProvToolbox tutorial, the aim of this second tutorial is to show how to read a PROV document using ProvToolbox and export it to some format.

We assume that installation instructions as described in the first Tutorial have been followed. Details about the Maven configuration can also be found there.

2. Download and Execution

The tutorial is standalone and a zip archive can be downloaded from the following URL: http://search.maven.org/remotecontent?filepath=org/openprovenance/prov/ProvToolbox-Tutorial2/0.7.0/ProvToolbox-Tutorial2-0.7.0-src.zip. The tutorial can also be found on the ProvToolbox project on GitHub.

After unzipping the archive, we can execute the tutorial by calling:

mvn clean install

Besides the verbose logging by the Maven build process, the tutorial itself displays the following text, including some PROV expressed according to PROV-XML.

*************************
* Converting document  
*************************

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<prov:document xmlns:prov="http://www.w3.org/ns/prov#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:provbook="http://www.provbook.org" xmlns:jim="http://www.cs.rpi.edu/~hendler/">
    <prov:entity prov:id="provbook:a-little-provenance-goes-a-long-way">
        <prov:value xsi:type="xsd:string">A little provenance goes a long way</prov:value>
    </prov:entity>
    <prov:agent prov:id="provbook:Paul">
        <foaf:name xsi:type="xsd:string">Paul Groth</foaf:name>
    </prov:agent>
    <prov:agent prov:id="provbook:Luc">
        <foaf:name xsi:type="xsd:string">Luc Moreau</foaf:name>
    </prov:agent>
    <prov:wasAttributedTo>
        <prov:entity prov:ref="provbook:a-little-provenance-goes-a-long-way"/>
        <prov:agent prov:ref="provbook:Paul"/>
    </prov:wasAttributedTo>
    <prov:wasAttributedTo>
        <prov:entity prov:ref="provbook:a-little-provenance-goes-a-long-way"/>
        <prov:agent prov:ref="provbook:Luc"/>
    </prov:wasAttributedTo>
    <prov:entity prov:id="jim:LittleSemanticsWeb.html"/>
    <prov:wasDerivedFrom>
        <prov:generatedEntity prov:ref="provbook:a-little-provenance-goes-a-long-way"/>
        <prov:usedEntity prov:ref="jim:LittleSemanticsWeb.html"/>
    </prov:wasDerivedFrom>
</prov:document>

*************************

3. Reading and writing PROV documents in Java

The following Java snippet is extracted from the file src/main/java/org/openprovenance/prov/tutorial/tutorial2/ReadWrite.java. In line 3, it shows how a document can be read, given its path filein on the file system. In line 4, we see how a PROV Document can be saved into a file fileout. The writeDocument procedure determines the PROV format that is required by looking at the extension. If a non-standard extension is used, then the format can be specified explicitly, as in line 5, by one of the values of the enumerated type ProvFormat.

    public void doConversions(String filein, String fileout) {
        InteropFramework intF=new InteropFramework();
        Document document=intF.readDocumentFromFile(filein);
        intF.writeDocument(fileout, document);     
        intF.writeDocument(System.out, ProvFormat.XML, document);
    }

   public static void main(String [] args) {
        if (args.length!=2) throw new UnsupportedOperationException("main to be called with two filenames");
        String filein=args[0];
        String fileout=args[1];
        
        ReadWrite tutorial=new ReadWrite(InteropFramework.newXMLProvFactory());
        tutorial.openingBanner();
        tutorial.doConversions(filein, fileout);
        tutorial.closingBanner();
    }

For completeness, line 13 shows how the tutorial class is initialized and line 15 takes care of invoking the conversion functionality.

The tutorial is called from the command line, passing src/main/resources/a-little.provn as the input file, and target/a-little.svg as the output file.  Therefore, the a-little.provn file is converted to SVG (by line 4) and to XML on standard output (by line 5).

The following table lists the formats that are supported by ProvToolbox.

Extension  Media type                   Direction
gv         text/vnd.graphviz            output
dot        text/vnd.graphviz            output
prov-asn   text/provenance-notation     input
prov-asn   text/provenance-notation     output
pn         text/provenance-notation     input
pn         text/provenance-notation     output
asn        text/provenance-notation     input
asn        text/provenance-notation     output
provn      text/provenance-notation     input
provn      text/provenance-notation     output
rdf        application/rdf+xml          input
rdf        application/rdf+xml          output
json       application/json             input
json       application/json             output
ttl        text/turtle                  input
ttl        text/turtle                  output
trig       application/trig             input
trig       application/trig             output
jpeg       image/jpeg                   output
jpg        image/jpeg                   output
provx      application/provenance+xml   input
provx      application/provenance+xml   output
xml        application/provenance+xml   input
xml        application/provenance+xml   output
png        image/png                    output
pdf        application/pdf              output
svg        image/svg+xml                output
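
The idea of choosing a format from a file extension can be illustrated with a short Python lookup over part of the table above. This is not how writeDocument is implemented; the dictionary FORMATS and the function media_type are hypothetical names used only to show the principle, including the fact that some formats can be written but not read.

```python
import os

# Subset of the table: extension -> (media type, directions supported).
FORMATS = {
    "provn": ("text/provenance-notation", {"input", "output"}),
    "json": ("application/json", {"input", "output"}),
    "ttl": ("text/turtle", {"input", "output"}),
    "xml": ("application/provenance+xml", {"input", "output"}),
    "svg": ("image/svg+xml", {"output"}),
    "pdf": ("application/pdf", {"output"}),
}


def media_type(filename, direction):
    """Return the media type implied by the extension, for reading or writing."""
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    if extension not in FORMATS:
        raise ValueError(f"unknown extension: {extension!r}")
    kind, directions = FORMATS[extension]
    if direction not in directions:
        raise ValueError(f"{extension} is not supported for {direction}")
    return kind


print(media_type("src/main/resources/a-little.provn", "input"))
print(media_type("target/a-little.svg", "output"))
```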

4. Conclusion

For further documentation on the classes and methods used, Javadoc for ProvToolbox can be found at http://openprovenance.org/java/site/latest/apidocs/. The Javadoc documentation also refers to PROV specifications where appropriate.

Suggestions for tutorials and also for ways of improving the programming experience offered by ProvToolbox are always welcome. Please raise issues on the GitHub issue tracker.

5. Appendix. Log Change

  1. Original version submitted on 2015/06/30
  2. Updated to 0.7.0 on 2015/07/27

What is in ProvToolbox 0.6.1?

I have just released ProvToolbox version 0.6.1, shortly after 0.6.0. There is one notable novelty: ProvToolbox is now promoted to Maven central. This means that no special configuration for artefact repositories is required; instead, the default maven repository is now used. Furthermore, the project complies with Sonatype rules for publication: signed artefacts, web site, documentation, etc. The tutorial was adapted to reflect this new configuration.

ProvToolbox Tutorial 1: Creating and Saving a PROV Document

1. Introduction

The aim of this tutorial is to show how to create a simple PROV Document in Java using ProvToolbox, and export it to some format.  The post also explains how to configure Maven to use the required ProvToolbox artifacts.

The tutorial is based on “A Little Provenance Goes a Long Way”, a quote that Paul and I asserted in our book “An introduction to PROV” (see www.provbook.org).

2. Download and Execution

The tutorial is standalone and a zip archive can be downloaded from the following URL: http://search.maven.org/remotecontent?filepath=org/openprovenance/prov/ProvToolbox-Tutorial1/0.7.0/ProvToolbox-Tutorial1-0.7.0-src.zip. The tutorial can also be found on the ProvToolbox project on GitHub.

After unzipping the archive, we obtain the following directory structure:

tutorial1

The directory contains a README file, a license file, a Maven pom.xml configuration file, and a source directory containing a single Java file.

To execute the tutorial, you need to have two pieces of software installed:

  1. the software project management tool Apache Maven  (see http://maven.apache.org/download.html),  and
  2. the graph visualization software GraphViz (see http://www.graphviz.org/).

You are then ready to execute the tutorial by calling

mvn clean install

Besides the verbose logging by the Maven build process, the tutorial itself displays the following text, including some PROV expressed according to the Provenance Notation.

*************************
* Converting document  
*************************
document
prefix xsd <http://www.w3.org/2001/XMLSchema>
prefix provbook <http://www.provbook.org>
prefix jim <http://www.cs.rpi.edu/~hendler/>
entity(provbook:a-little-provenance-goes-a-long-way,[prov:value = "A little provenance goes a long way" %% xsd:string])
agent(provbook:Paul,[prov:label = "Paul Groth"])
agent(provbook:Luc,[prov:label = "Luc Moreau"])
wasAttributedTo(provbook:a-little-provenance-goes-a-long-way, provbook:Paul)
wasAttributedTo(provbook:a-little-provenance-goes-a-long-way, provbook:Luc)
entity(jim:LittleSemanticsWeb.html)
wasDerivedFrom(provbook:a-little-provenance-goes-a-long-way, jim:LittleSemanticsWeb.html)
endDocument
*************************

Furthermore, a file target/little.svg appears in the target/ directory. It displays as follows.
little

3. Java Source Code

We now examine the Java class Little. The main method creates an instance of the Little class, constructs a PROV document, and writes it to a file, and also displays it on the System.out stream, in the Provenance notation (here "target/little.svg" was passed as argument when invoking the main method).

    public void doConversions(Document document, String file) {
        InteropFramework intF=new InteropFramework();
        intF.writeDocument(file, document);     
        intF.writeDocument(System.out, ProvFormat.PROVN, document);
    }

    public static void main(String [] args) {
        if (args.length!=1) throw new UnsupportedOperationException("main to be called with filename");
        String file=args[0];
        
        Little little=new Little(InteropFramework.newXMLProvFactory());
        little.openingBanner();
        Document document = little.makeDocument();
        little.doConversions(document,file);
        little.closingBanner();
    }

The InteropFramework class deals with conversion of Java representations to PROV serializations and vice-versa. It uses the writeDocument method to save a Document to a file or to an OutputStream.

The code snippet below displays the method makeDocument, which constructs a Document. It proceeds by creating two entities, quote and original, and two agents, paul and luc. It then creates three relations: two attributions (from quote to each of the agents) and one derivation from the quote to the original. A new Document is created, and all the assertions are added. The Document namespace is also set appropriately.

    public Document makeDocument() {     
        Entity quote = pFactory.newEntity(qn("a-little-provenance-goes-a-long-way"));
        quote.setValue(pFactory.newValue("A little provenance goes a long way",
                                         pFactory.getName().XSD_STRING));

        Entity original = pFactory.newEntity(ns.qualifiedName(JIM_PREFIX,"LittleSemanticsWeb.html",pFactory));

        Agent paul = pFactory.newAgent(qn("Paul"), "Paul Groth");
        Agent luc = pFactory.newAgent(qn("Luc"), "Luc Moreau");

        WasAttributedTo attr1 = pFactory.newWasAttributedTo(null,
                                                            quote.getId(),
                                                            paul.getId());
        WasAttributedTo attr2 = pFactory.newWasAttributedTo(null,
                                                            quote.getId(),
                                                            luc.getId());
        WasDerivedFrom wdf = pFactory.newWasDerivedFrom(quote.getId(),
                                                        original.getId());

        Document document = pFactory.newDocument();
        document.getStatementOrBundle()
                .addAll(Arrays.asList(new StatementOrBundle[] { quote, 
                                                                paul,
                                                                luc, 
                                                                attr1,
                                                                attr2, 
                                                                original,
                                                                wdf }));
        document.setNamespace(ns);
        return document;
    }
    public QualifiedName qn(String n) {
        return ns.qualifiedName(PROVBOOK_PREFIX, n, pFactory);
    }

As one can see from the code, the ProvFactory is used to create all statements. The constructor receives a ProvFactory from the main method. The constructor also creates a namespace object ns to manage all namespace/prefix declarations.

    public Little(ProvFactory pFactory) {
        this.pFactory = pFactory;
        ns=new Namespace();
        ns.addKnownNamespaces();
        ns.register(PROVBOOK_PREFIX, PROVBOOK_NS);
        ns.register(JIM_PREFIX, JIM_NS);
    }
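
To make the role of the namespace object more concrete, here is a minimal, self-contained sketch of a prefix registry. It is not ProvToolbox's Namespace class: PrefixRegistry, its methods, the prefix ex and the namespace http://example.org/book/ are all hypothetical, and only illustrate how a registered prefix lets a short name stand for a full URI.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a namespace manager; NOT the ProvToolbox Namespace class.
final class PrefixRegistry {
    private final Map<String, String> prefixToNamespace = new LinkedHashMap<>();

    // Associates a prefix with a namespace URI.
    void register(String prefix, String namespaceUri) {
        prefixToNamespace.put(prefix, namespaceUri);
    }

    // Expands prefix:localName into a full URI by concatenating
    // the registered namespace URI and the local name.
    String expand(String prefix, String localName) {
        String namespaceUri = prefixToNamespace.get(prefix);
        if (namespaceUri == null) {
            throw new IllegalArgumentException("Unregistered prefix: " + prefix);
        }
        return namespaceUri + localName;
    }

    public static void main(String[] args) {
        PrefixRegistry registry = new PrefixRegistry();
        registry.register("ex", "http://example.org/book/"); // assumed namespace
        System.out.println(registry.expand("ex", "Paul"));
    }
}
```

Keeping all prefix declarations in one object means that every qualified name created by the application is resolved against the same set of declarations, which is the purpose served by ns in the constructor above.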

4. Maven Configuration

The Maven configuration file is straightforward to define. First, one needs to specify the artifacts this code depends on. There are two of them. The prov-model artifact provides the classes necessary to manipulate the PROV Data Model in Java, irrespective of the serialization chosen for it. The prov-interop artifact provides the classes necessary to convert the PROV Data Model in Java to any PROV-compatible serialization and back.

  <dependencies>
    <dependency>
      <groupId>org.openprovenance.prov</groupId>
      <artifactId>prov-model</artifactId>
      <version>0.7.0</version>
    </dependency>
    <dependency>
      <groupId>org.openprovenance.prov</groupId>
      <artifactId>prov-interop</artifactId>
      <version>0.7.0</version>
    </dependency>
  </dependencies>

Since version 0.6.1, ProvToolbox has been deployed on Maven Central. Hence, there is no need to specify any repository for its artifacts.

For completeness, we show how the tutorial is started by Maven during the test phase. The main method of class Little is invoked by the exec-maven-plugin passing an explicit argument target/little.svg.

<plugin>
	<groupId>org.codehaus.mojo</groupId>
	<artifactId>exec-maven-plugin</artifactId>
	<version>1.3.2</version>
	<executions>
	  <execution>
	    <phase>test</phase>
	    <goals>
	      <goal>java</goal>
	    </goals>
	    <configuration>
	      <mainClass>org.openprovenance.prov.tutorial.tutorial1.Little</mainClass>
	      <arguments>
		<argument>target/little.svg</argument>
	      </arguments>
	    </configuration>
	  </execution>
	</executions>
</plugin>

5. Conclusion

For further documentation on the classes and methods used, Javadoc for ProvToolbox can be found at http://openprovenance.org/java/site/latest/apidocs/.  The Javadoc documentation also refers to PROV specifications where appropriate.

This is the first tutorial for ProvToolbox. Others will follow. We also welcome suggestions for tutorial topics, and also for ways of improving the programming experience offered by ProvToolbox. Post comments on this site or use GitHub issues.

 

6. Appendix. Change Log

  1. Original version submitted on 2014/08/01
  2. Updated with Maven Central configuration on 2014/08/08
  3. Updated to 0.6.2 on 2015/07/01
  4. Updated to 0.7.0 on 2015/07/27

 

What’s in ProvToolbox 0.6.0?

Back in December, I released ProvToolbox 0.5.0. On August 1st 2014, I released version 0.6.0. This post outlines the key changes in this version. These were driven by requirements from ProvTranslator, ProvValidator, Picaso, and a new provenance template management system (I will blog about Picaso and the template management system in the near future).

1. Novel features

1.1 Random Generator

ProvToolbox now includes Jamal Hussein’s PROV graph generator. For instance, the provconvert command

provconvert -generator 30:3:entity:1234 -outfile foo.jp

generates the following provenance graph, where 30 is the number of nodes generated, 3 is the maximum connectivity, entity is the type of the first node, 1234 is an optional seed.

[Figure: randomly generated provenance graph]
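
The structure of the -generator argument can be illustrated with a small parser. This is only a sketch of the four colon-separated fields described above; GeneratorSpec and its field names are hypothetical, and this is not how provconvert itself necessarily processes the argument.

```java
// Hypothetical holder for the fields of a -generator argument; not ProvToolbox code.
final class GeneratorSpec {
    final int nodes;            // number of nodes to generate
    final int maxConnectivity;  // maximum connectivity of a node
    final String firstNodeType; // type of the first node, e.g. "entity"
    final Long seed;            // optional seed; null when omitted

    private GeneratorSpec(int nodes, int maxConnectivity, String firstNodeType, Long seed) {
        this.nodes = nodes;
        this.maxConnectivity = maxConnectivity;
        this.firstNodeType = firstNodeType;
        this.seed = seed;
    }

    // Parses "nodes:connectivity:type[:seed]".
    static GeneratorSpec parse(String spec) {
        String[] parts = spec.split(":");
        if (parts.length < 3 || parts.length > 4) {
            throw new IllegalArgumentException("Expected nodes:connectivity:type[:seed], got " + spec);
        }
        Long seed = (parts.length == 4) ? Long.valueOf(parts[3]) : null;
        return new GeneratorSpec(Integer.parseInt(parts[0]),
                                 Integer.parseInt(parts[1]),
                                 parts[2],
                                 seed);
    }

    public static void main(String[] args) {
        GeneratorSpec spec = GeneratorSpec.parse("30:3:entity:1234");
        System.out.println(spec.nodes + " nodes, connectivity " + spec.maxConnectivity
                + ", first node " + spec.firstNodeType + ", seed " + spec.seed);
    }
}
```

Supplying the same seed is what one would expect to make a generated graph reproducible, which is useful when a randomly generated document is used as a test case.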

1.2 Templating System

prov-template is a templating system developed by Danius Michaelides, Trung Dong Huynh, and myself. Its specification is available at https://provenance.ecs.soton.ac.uk/prov-template/. ProvToolbox now contains a reference implementation of this specification. See the section Implementation for a description of how to invoke the templating system from the command line.

1.3 Tutorial

A tutorial for ProvToolbox is long overdue. ProvToolbox 0.6.0 now includes a small tutorial, explaining how to set up a Maven environment, write some Java code to create a provenance document, and serialize it to your favourite format. More short tutorials of this kind are in the pipeline. Watch this space!

2. Improvements

2.1 Better inter-operability

The key motto of ProvToolbox is to construct a Java representation of the Provenance Data Model, manipulate it, and save it.  Two key “generic” methods, readDocument and writeDocument, were introduced to perform read and write operations.

The method readDocument(InputStream, ProvFormat) reads a Document from an input stream, using the parser specified by the format argument. Likewise, writeDocument(OutputStream, ProvFormat, Document) writes a document to an output stream according to the specified format.

Furthermore, ProvToolbox provides support for content negotiation over PROV-related media types. This feature is heavily exploited by the services ProvTranslator and ProvValidator.
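
As an illustration of what content negotiation over PROV-related media types involves, here is a minimal sketch that picks a serialization format from an HTTP Accept header. It is not the ProvToolbox implementation: the class MediaTypeNegotiator, its Format enum and the exact table of media types are assumptions made for the example, and the sketch deliberately ignores q-values and wildcards.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical negotiator; NOT the ProvToolbox API.
final class MediaTypeNegotiator {
    enum Format { PROVN, XML, TURTLE, JSON }

    // Illustrative mapping from media types to serialization formats (an assumption).
    private static final Map<String, Format> SUPPORTED = new LinkedHashMap<>();
    static {
        SUPPORTED.put("text/provenance-notation", Format.PROVN);
        SUPPORTED.put("application/provenance+xml", Format.XML);
        SUPPORTED.put("text/turtle", Format.TURTLE);
        SUPPORTED.put("application/json", Format.JSON);
    }

    // Returns the format of the first supported media type listed in the header.
    // Simplification: q-values and wildcards such as */* are ignored.
    static Optional<Format> negotiate(String acceptHeader) {
        for (String item : acceptHeader.split(",")) {
            int semicolon = item.indexOf(';');
            String mediaType = (semicolon >= 0 ? item.substring(0, semicolon) : item)
                    .trim()
                    .toLowerCase(Locale.ROOT);
            Format format = SUPPORTED.get(mediaType);
            if (format != null) {
                return Optional.of(format);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(negotiate("text/html, text/provenance-notation;q=0.9"));
        System.out.println(negotiate("image/png"));
    }
}
```

A service would then pass the selected format to a write operation such as writeDocument, or answer with an error when no requested media type is supported.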

2.2 prov-sql

prov-sql is now being used in a template management system we are developing. It works in the sense that it has been tested in the context of that system, but it is in no way optimized. Indeed, there is plenty of room for improvement! It is now time for others to have a look at the mapping, experiment with it, and improve it. prov-sql uses a JPA ORM to map Java Beans to a SQL database. The automatically-generated documentation of the mapping between Java classes and SQL tables is available from prov-sql orm mapping page.

2.3 Bug fixes and documentation

A series of bugs have been fixed (see GitHub Issues). Thanks to those who submitted bug reports.

Dependencies of ProvToolbox have been revised: artifacts have been upgraded to more recent versions, and superfluous dependencies were removed.

2.4 GitHub IO page

And last, but not least, ProvToolbox now has its own GitHub IO page at http://lucmoreau.github.io/ProvToolbox/

3. Conclusion

Overall, it is a release that consolidates ProvToolbox, supporting better inter-operability across PROV representations, and supporting functionality in our various services.

What is in ProvToolbox 0.5.0?

Release 0.5.0 is the second Christmas release of ProvToolbox. A year ago, I released ProvToolbox 0.1.1. At the time, the Provenance Working Group had just released its candidate recommendations, and was in the implementation phase of PROV. Since then, PROV has become a recommendation. ProvToolbox has also changed dramatically, having been released no fewer than 9 times since last Christmas.

This blog post highlights the key new features found in ProvToolbox 0.5.0.

1. Artefact architecture

Benefiting from the stable nature of PROV, ProvToolbox underwent significant refactoring. PROV-DM is essentially specified by a set of interfaces. They are implemented by POJOs, offering a Java representation of the PROV model in memory.  This Java representation can be marshalled to various formats, using two different kinds of marshallers: POJO-based and external marshallers, which I now define.

POJO-based marshallers include PROV-XML and a very(!) preliminary mapping to SQL. The design is extensible and other serializations could be defined. For instance, Spring Data could be used for serializing to NoSQL databases (any taker?). POJO-based marshallers typically use Java annotations to specify marshalling: JAXB annotations are used for marshalling to XML and JPA annotations for mapping to SQL.

External marshallers take care of the conversion to rdf, json, and graphviz representations.  These marshallers only rely on accessors to access the properties of the objects represented in memory.

A significant contribution of this release is the refactoring of the Maven artefacts to minimise cross-dependencies. For instance, the converter to rdf, prov-rdf, only depends on the prov-model and is independent of all other artefacts.

The following figure summarizes the component architecture of ProvToolbox.

[Figure: Key Components of ProvToolbox]

2. Qualified Names

PROV uses qualified names to denote resources. Qualified names can be converted into URIs ensuring compatibility with the web architecture. PROV qualified names have a syntax that is more permissive than XML QNames.

PROV POJOs are now specified in terms of QualifiedNames. QualifiedName replaces javax.xml.namespace.QName, which has essentially been phased out of the 0.5.0 code, since instances of that class are expected to be compatible with XML QNames.

For instance, in PROV, one is concerned by the generation of entities by activities. This is modelled by the following interface, with getters and setters for entity and activity, identified by a QualifiedName.

public interface WasGeneratedBy extends  .... {
  void setEntity(QualifiedName entity);
  void setActivity(QualifiedName activity);
  QualifiedName getEntity();
  QualifiedName getActivity();
}

Full details about this interface can be found here.

3. Documentation

It was high time to provide some documentation for ProvToolbox. The focus has been on Javadoc providing good cross-references to the PROV specifications. It can be found at http://openprovenance.org/java/site/0_5_0/apidocs/.

4.  Miscellaneous Improvements

A series of improvements have been brought to ProvToolbox 0.5.0:

  • A new “visitor” interface for PROV statements has been defined. It makes it very easy to define functionality that is statement-specific (see StatementAction). A variant of this visitor also allows for values to be returned (see StatementActionValue).
  • In 0.4.0, I introduced the class Namespace to help manage prefix-namespace mappings. PROV-N allows for bundles to inherit prefixes from the enclosing document. To implement this mechanism, Namespace can now be chained, and prefixes can be looked up along that chain.
  • Extensive testing of the Key construct for PROV-Dictionary.
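
The value-returning visitor mentioned in the first bullet can be sketched in miniature as follows. This is a generic illustration of the pattern, not the ProvToolbox interfaces: every type and method name below (VisitorSketch, StatementVisitor, EntityStatement, ActivityStatement, Describer) is hypothetical, and only two kinds of statement are modelled.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical miniature of a value-returning visitor; names do NOT match ProvToolbox.
final class VisitorSketch {

    // One method per kind of statement; T is the type of the value returned.
    interface StatementVisitor<T> {
        T visitEntity(EntityStatement entity);
        T visitActivity(ActivityStatement activity);
    }

    interface Statement {
        <T> T accept(StatementVisitor<T> visitor);
    }

    static final class EntityStatement implements Statement {
        final String id;
        EntityStatement(String id) { this.id = id; }
        @Override public <T> T accept(StatementVisitor<T> visitor) { return visitor.visitEntity(this); }
    }

    static final class ActivityStatement implements Statement {
        final String id;
        ActivityStatement(String id) { this.id = id; }
        @Override public <T> T accept(StatementVisitor<T> visitor) { return visitor.visitActivity(this); }
    }

    // Statement-specific functionality lives in the visitor, not in the statement classes.
    static final class Describer implements StatementVisitor<String> {
        @Override public String visitEntity(EntityStatement entity) { return "entity(" + entity.id + ")"; }
        @Override public String visitActivity(ActivityStatement activity) { return "activity(" + activity.id + ")"; }
    }

    public static void main(String[] args) {
        List<Statement> statements =
                Arrays.asList(new EntityStatement("ex:e1"), new ActivityStatement("ex:a1"));
        Describer describer = new Describer();
        for (Statement statement : statements) {
            System.out.println(statement.accept(describer));
        }
    }
}
```

The benefit of this design is that new operations over statements can be added as new visitor classes, without touching the statement classes and without instanceof chains.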

5. Use cases: ProvValidator and ProvTranslator

I don’t develop ProvToolbox just for the sake of it. It is used as a core component of ProvValidator and ProvTranslator.

ProvValidator  implements the prov-constraints specification over prov-model. The validator is made available as a service at https://provenance.ecs.soton.ac.uk/validator/view/validator.html.

ProvTranslator is also an online service offering conversion of PROV into various representations. It is essentially a service wrapping up ProvToolbox. It is available from https://provenance.ecs.soton.ac.uk/validator/view/translator.html.

Conclusion

In the new year, the focus will be on bug fixing, tackling outstanding issues in the tracker, and refactoring code, with a view to releasing ProvToolbox 1.0.

Useful Pointers

GitHub repository: https://github.com/lucmoreau/ProvToolbox/

Javadoc: http://openprovenance.org/java/site/0_5_0/apidocs/

Maven repository: http://openprovenance.org/java/maven-releases/

What is in ProvToolbox 0.4.0?

ProvToolbox is an open source Java package to create, manipulate, save, and read PROV representations. PROV is the W3C standard for representing provenance on the Web. Since I have just released version 0.4.0 this weekend, I thought it would be good to explain recent changes and future directions for ProvToolbox.

History

First, some context. ProvToolbox was initially conceived during the lifetime of the W3C Provenance Working Group. It initially implemented the PROV data model, serialization to XML (and back) according to PROV-XML, mapping to RDF (and back) according to PROV-O (thanks to Mike Jewell for helping with the first version of the RDF to Java converter), serialization to PROV-N (and back), and serialization to JSON (and back) according to PROV-JSON (thanks to Trung Dong Huynh for helping with this converter). ProvToolbox was one of the implementations demonstrating implementability of the PROV specifications.  ProvToolbox’s design was inspired by the OPMToolbox, a similar toolkit for OPM, a predecessor of PROV.  In its original design, ProvToolbox adopted a schema-driven approach, in which schemas, grammars, and ontologies were automatically compiled into marshallers and unmarshallers. This was particularly convenient when PROV was being designed, and changed every other week.

Motivation

The purpose of ProvToolbox is to create PROV representations, manipulate them, and save and read them using standard serializations.  By sheltering programmers from the nitty-gritty details of serialization, it is hoped that they can focus on provenance-specific functionality, and improve the quality of their applications. ProvToolbox is known to be used in several applications and services, including the online PROV translator, the PROV validator, Amir’s CollabMap-based trust rating, and others. If your application uses ProvToolbox, please let me know.

Recent Changes

prov-model

The key change is the introduction of the prov-model artifact (a preliminary version of which was already released in 0.3.0). This artifact is the realization of the PROV conceptual model in Java. Classes and associations of the conceptual model are all formalized by Java interfaces, specifying their accessors and mutators, and other relevant methods.  Implementations of these interfaces can be found in the form of Java beans, for instance, in the prov-xml artifact, which takes care of marshalling to XML and unmarshalling from XML.  Another implementation of these interfaces can be found in prov-sql (discussed below). One can imagine further implementations using Spring Data, for instance.

static and refactored beans

While a schema driven approach was suitable when the PROV standard was being developed, now that PROV is frozen, it is better to define beans statically, and curate them manually.   For instance, attributes are now handled systematically, and expressed as org.openprovenance.model.Attribute. The outcome is beans that are more natural to the programmer.

Namespace Handling

In the toolbox, qualified names (known as QName in XML Schema) are used to represent URIs in a short form. Managing namespaces and associated prefixes is sometimes a pain. To facilitate the programmer’s task, a class Namespace, embedding all namespace-related processing, was introduced.  Please let me know if it covers your needs.

Unmarshalling from XML

Beans used to be generated by JAXB. However, for attributes such as prov:location, which were expected to be xsd:anySimpleType, the corresponding Java method expected each location attribute to be an Object. The accessor getLocation() used to have the following signature.

List<Object> getLocation()

However, round-trip conversion from Java to XML and back was not successful with QNames whose namespace had not been declared globally. To ensure compatibility with the PROV standard definition, and to shelter the programmer from these tedious serialization details, manual (un)marshallers were written (adapters in JAXB speak). An extensive series of tests was developed: more than 200 tests are now run for each serialization to check round-trip conversion. In particular, we ensure support for RDF 1.1 primitive data types.

prov-sql

Another novelty of this release is a very very preliminary ORM mapping for PROV.  It allows PROV representations to be saved to and retrieved from SQL databases. Currently, there is no support for PROV-Dictionary and other extensions. Lots of schema optimizations are possible (and required too!). Feedback welcome on the SQL Schema and mapping.  But before spending any more time on the sql schema, there is some further refactoring of beans that I would like to implement (see next section).

Where next?

I am reasonably satisfied with the definitions in prov-model. There is still one significant change that I would like to introduce, which is likely to break, once again, applications using ProvToolbox. The JAXB automatic bean generation introduced IDRef, a Java class whose sole purpose was to serialize the attribute prov:ref="e1" in the example below.

<prov:wasGeneratedBy prov:id="gen1">
 <prov:entity prov:ref="e1"/>
 <prov:activity prov:ref="a1"/>
</prov:wasGeneratedBy>

The following Java methods were defined accordingly.

 IDRef getEntity()
 IDRef getActivity()

Instead, I propose to specify beans with the following methods, so that programmers no longer have to manipulate IDRefs: these are not part of the PROV data model, and were only introduced for the purpose of serialization.

 QName getEntity()
 QName getActivity()

So far, ProvToolbox has used javax.xml.namespace.QName, but this class is supposed to represent XML QNames. XML QNames come with strong syntactic restrictions (though the Java class does not enforce them), but these have been  relaxed in PROV Qualified Names. Therefore, I will introduce a Qualified Name class for ProvToolbox, supporting the syntactic definitions set by PROV, but also allowing easy conversion to Turtle and XML QNames; further, it will also offer functions to convert to and from corresponding URIs.
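
To illustrate the difference between the two notions of name, here is a minimal sketch of a qualified name that converts to a URI and reports whether its local part would also be acceptable as an XML QName. It is not the class that ProvToolbox ends up providing: SimpleQualifiedName and its methods are hypothetical, the example namespace is made up, and the XML check is an ASCII-only approximation of the NCName production.

```java
import java.util.regex.Pattern;

// Hypothetical qualified name; NOT the ProvToolbox QualifiedName class.
final class SimpleQualifiedName {
    // ASCII-only approximation of an XML NCName: it may not start with a digit, '-' or '.'.
    private static final Pattern ASCII_NCNAME = Pattern.compile("[A-Za-z_][A-Za-z0-9_.\\-]*");

    final String namespaceUri;
    final String localPart;
    final String prefix;

    SimpleQualifiedName(String namespaceUri, String localPart, String prefix) {
        this.namespaceUri = namespaceUri;
        this.localPart = localPart;
        this.prefix = prefix;
    }

    // A qualified name denotes the URI obtained by concatenating namespace and local part.
    String toUri() {
        return namespaceUri + localPart;
    }

    // True when the local part could also be used in an XML QName (approximate check).
    boolean isXmlQNameCompatible() {
        return ASCII_NCNAME.matcher(localPart).matches();
    }

    @Override
    public String toString() {
        return prefix + ":" + localPart;
    }

    public static void main(String[] args) {
        SimpleQualifiedName plain = new SimpleQualifiedName("http://example.org/", "e1", "ex");
        SimpleQualifiedName relaxed = new SimpleQualifiedName("http://example.org/", "2014-report", "ex");
        System.out.println(plain + " -> " + plain.toUri() + ", XML-compatible: " + plain.isXmlQNameCompatible());
        System.out.println(relaxed + " -> " + relaxed.toUri() + ", XML-compatible: " + relaxed.isXmlQNameCompatible());
    }
}
```

A local part beginning with a digit, as in the second name, is the kind of case where a dedicated class is needed: such a name can still be mapped to a URI, but it cannot be written as an XML QName without some form of escaping.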

With this in place, I hope that bean interfaces will be frozen till release 1.0.0. The focus will then be on finalizing and refactoring the various  serializations.

Finally, I have already started documenting ProvToolbox, but much more is required!

Useful Pointers

GitHub repository: https://github.com/lucmoreau/ProvToolbox/

Javadoc: http://openprovenance.org/java/site/0_4_0/apidocs/

Maven repository: http://openprovenance.org/java/maven-releases/