ProvToolbox Tutorial 4: Templates for Provenance (part 1)

1. Introduction

In several of our applications, we felt the need to separate the logging of information from the construction and storage of provenance. For this, we introduced PROV-Template, a templating system for provenance that describes the shape of the provenance graphs to be generated, and we specified an algorithm that instantiates templates with specific values.

The purpose of this tutorial is to introduce PROV-Template and to show how templates can be instantiated using ProvToolbox. This functionality is directly available from the command line using provconvert.

The tutorial is standalone and a zip archive can be downloaded from the following URL: http://search.maven.org/remotecontent?filepath=org/openprovenance/prov/ProvToolbox-Tutorial4/0.7.0/ProvToolbox-Tutorial4-0.7.0-src.zip. The tutorial can also be found on the ProvToolbox project on GitHub.

The tutorial assumes that provconvert has been installed and is available in the execution path. (See http://lucmoreau.github.io/ProvToolbox/ for installation instructions.) The tutorial relies on a Makefile and can simply be run by calling:

make do.all

2. Example of Templates

2.1 A Template for Attribution of a Quote

Building on the blog post “A little provenance goes a long way”, imagine that we need to systematically provide attribution for quotes. As this is a repetitive task, we should consider the PROV-Template approach to generate provenance.

A provenance template is itself a PROV document in which some variables act as placeholders for values to be filled at expansion time. More precisely, a template is a bundle of PROV assertions: a bundle is the PROV mechanism by which provenance of provenance can be expressed.

The figure below contains a graphical illustration of a template for Quote Attribution. It contains the following variables:

  • var:author the identifier of the author (stated to be a prov:Person)
  • var:name the author’s name
  • var:quote the identifier of the quote
  • var:value the quote itself
  • vargen:bundleId the identifier of the bundle to be generated

The quote is attributed to the author agent. The variables var:author, var:name, var:quote, var:value are qualified names in a namespace reserved for PROV-Template variables, and are conventionally prefixed with the prefix var. Values are expected to be provided for these variables when instantiating a template. On the other hand, the variable vargen:bundleId, with prefix vargen, can have a value generated automatically at instantiation time.

Quote Attribution Template

Concretely, in the PROV-N notation, the template is expressed as follows.

document

  prefix var <http://openprovenance.org/var#>
  prefix vargen <http://openprovenance.org/vargen#>
  prefix tmpl <http://openprovenance.org/tmpl#>
  prefix foaf <http://xmlns.com/foaf/0.1/>
  
  bundle vargen:bundleId
    entity(var:quote, [prov:value='var:value'])
    entity(var:author, [prov:type='prov:Person', foaf:name='var:name'])
    wasAttributedTo(var:quote,var:author)
  endBundle

endDocument

2.2 Template Instantiation: A Little Provenance Goes a Long Way

Let us now look at how the template can be instantiated. An association between a variable and a value is referred to as a binding. Consider the following bindings for the four variables author, name, quote and value.

var:author http://orcid.org/0000-0002-3494-120X
var:name “Luc Moreau”
var:quote ex:quote1
var:value “A Little Provenance Goes a Long Way”

If we instantiate the template with these bindings, we obtain the following instantiated document. Note that vargen:bundleId was instantiated with a UUID value.

Template Instantiation for “A Little Provenance Goes a Long Way”

Expansion of a template with provconvert is straightforward. The template is provided with the -infile parameter, the bindings file with the -bindings parameter, and the file that receives the instantiated template with -outfile.

provconvert -infile template1.provn -bindings binding1.ttl -outfile doc1.provn

The input template and its instantiation can be expressed in any of the formats supported by ProvToolbox. We still have to express the set of bindings. We did not want to introduce a new, specific format (though we may do so in the future), so we decided to use PROV itself. The Turtle notation is fairly elegant in this case. Two families of properties are introduced in the tmpl namespace, namely value_i and 2dvalue_i_j, for binding variables in identifier and value positions, respectively.

@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix tmpl: <http://openprovenance.org/tmpl#> .
@prefix var: <http://openprovenance.org/var#> .
@prefix ex: <http://example.com/#> .

var:author a prov:Entity;
           tmpl:value_0 <http://orcid.org/0000-0002-3494-120X>.
var:name   a prov:Entity;
           tmpl:2dvalue_0_0 "Luc Moreau".
var:quote  a prov:Entity;
           tmpl:value_0 ex:quote1.
var:value  a prov:Entity;
           tmpl:2dvalue_0_0 "A Little Provenance Goes a Long Way".

Details about the syntax of bindings can be found in https://provenance.ecs.soton.ac.uk/prov-template/.

2.3 Template Instantiation: A Second Author

In some cases, we would like to express that there is a second author to a document. The attribution template does not need to be redefined. We simply need to provide relevant bindings for the second author.

For instance, Paul and Luc are the two authors of that quote. Conceptually, we want to provide the following bindings.

var:author http://orcid.org/0000-0002-3494-120X
           http://orcid.org/0000-0003-0183-6910
var:name   “Luc Moreau”
           “Paul Groth”

We see that each of var:author and var:name is given two values. This results in the following expanded provenance graph.

Template Instantiation with Two Authors

The contents of the bindings file are shown below. In lines 7-9, var:author is given two values, using the properties tmpl:value_0 and tmpl:value_1. In lines 10-12, var:name is given two values to occur in attribute position, with the properties tmpl:2dvalue_0_0 and tmpl:2dvalue_1_0.

@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix tmpl: <http://openprovenance.org/tmpl#> .
@prefix var: <http://openprovenance.org/var#> .
@prefix ex: <http://example.com/#> .

var:author a prov:Entity;
           tmpl:value_0 <http://orcid.org/0000-0002-3494-120X>;
           tmpl:value_1 <http://orcid.org/0000-0003-0183-6910>.
var:name   a prov:Entity;
           tmpl:2dvalue_0_0 "Luc Moreau";
           tmpl:2dvalue_1_0 "Paul Groth".
var:quote  a prov:Entity; 
           tmpl:value_0 ex:quote1.
var:value  a prov:Entity; 
           tmpl:2dvalue_0_0 "A Little Provenance Goes a Long Way".

Again, we refer the reader to the PROV-Template specification for details of the bindings syntax.

2.4 Template Instantiation: More Attributes

In general, PROV also allows a variable number of values to be provided for a given attribute. For instance, we may want the name and nickname to be provided as two possible values for the var:name variable. This would result in the following expanded graph.

Template Instantiation with Variable Number of Attributes

Again, the template remains unchanged, but the bindings are as follows. In lines 12-13, we see two possible names for Paul, respectively expressed with tmpl:2dvalue_1_0 and tmpl:2dvalue_1_1. This shows that template expansion can support a variable number of attributes for different statements instantiated from the same template statement.

@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix tmpl: <http://openprovenance.org/tmpl#> .
@prefix var: <http://openprovenance.org/var#> .
@prefix ex: <http://example.com/#> .

var:author a prov:Entity;
           tmpl:value_0 <http://orcid.org/0000-0002-3494-120X>;
           tmpl:value_1 <http://orcid.org/0000-0003-0183-6910>.
var:name   a prov:Entity;
           tmpl:2dvalue_0_0 "Luc Moreau";
           tmpl:2dvalue_1_0 "Paul Groth";
           tmpl:2dvalue_1_1 "pgroth".
var:quote  a prov:Entity;
           tmpl:value_0 ex:quote1.
var:value  a prov:Entity;
           tmpl:2dvalue_0_0 "A Little Provenance Goes a Long Way".
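
The indexing convention used in these bindings can be summarised with a small, self-contained Java sketch. It is not ProvToolbox code and it does not reproduce the PROV-Template expansion algorithm. It only illustrates, under the assumption suggested by the examples above that tmpl:value_i gives the i-th identifier and tmpl:2dvalue_i_j gives the j-th attribute value of the i-th expansion, how the authors and their names pair up. The class name, the method name and the PROV-N-like output strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration of how value_i and 2dvalue_i_j bindings line up (not ProvToolbox code). */
final class BindingIndexSketch {

    /**
     * authors.get(i)      plays the role of tmpl:value_i for var:author;
     * names.get(i).get(j) plays the role of tmpl:2dvalue_i_j for var:name.
     * Returns one PROV-N-like entity statement per author.
     */
    static List<String> expandAuthors(List<String> authors, List<List<String>> names) {
        List<String> statements = new ArrayList<>();
        for (int i = 0; i < authors.size(); i++) {
            StringBuilder attrs = new StringBuilder("prov:type='prov:Person'");
            // Every value with first index i becomes one foaf:name attribute of author i.
            for (String name : names.get(i)) {
                attrs.append(", foaf:name=\"").append(name).append('"');
            }
            statements.add("entity(<" + authors.get(i) + ">, [" + attrs + "])");
        }
        return statements;
    }

    public static void main(String[] args) {
        List<String> authors = List.of(
                "http://orcid.org/0000-0002-3494-120X",
                "http://orcid.org/0000-0003-0183-6910");
        List<List<String>> names = List.of(
                List.of("Luc Moreau"),             // 2dvalue_0_0
                List.of("Paul Groth", "pgroth"));  // 2dvalue_1_0 and 2dvalue_1_1
        expandAuthors(authors, names).forEach(System.out::println);
    }
}
```

The second author receives two foaf:name attributes while the first receives one, which is the "variable number of attributes" case of section 2.4.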

3. Conclusions

PROV-Template is easy to work with: it only requires provconvert to be installed. By decoupling the generation of provenance from the logging of values, we observed a number of benefits:

  • It allowed us to fine-tune the provenance, independently of the application.
  • It permitted us to keep the code to generate the provenance separate from the application itself.
  • It allowed us to adopt a more conceptual approach to provenance, thinking of “provenance schemas” rather than instances.

This is the first part of the tutorial on PROV-Template. In the second part of the tutorial, we will see how PROV-Template can support more sophisticated use cases.

Thanks to co-authors Dong and Danius. Heather has been using it in Smart Society’s SmartShare application.

ProvToolbox Tutorial 3: Merging PROV Documents

1. Introduction

It has become a requirement in several of our applications to merge PROV documents. The purpose of this tutorial is to explain how ProvToolbox allows documents to be merged, ensuring that descriptions are uniquely represented with all their attributes, merging bundles they may contain, and optionally “flattening” them.

This functionality is directly available from the command line using provconvert.

The tutorial is standalone and a zip archive can be downloaded from the following URL: http://search.maven.org/remotecontent?filepath=org/openprovenance/prov/ProvToolbox-Tutorial3/0.7.0/ProvToolbox-Tutorial3-0.7.0-src.zip. The tutorial can also be found on the ProvToolbox project on GitHub.

The tutorial assumes that provconvert has been installed and is available in the execution path.
The tutorial relies on a Makefile and can simply be run by calling:

make do.all

2. Examples of Merges

2.1 Merging two documents without bundles

Our first example consists of two documents. The first document “doc1” consists of the attribution of an entity e1 to an agent ag1. The entity has an attribute attr1.

doc1

The second document “doc2” describes the derivation of the same entity e1 from another entity e0. The description of the entity e1 contains an attribute attr2.

doc2

By merging the two documents, we obtain a new document, in which the entity e1 is both attributed to ag1 and derived from e0. The description of e1 contains the attributes attr1 and attr2.

Merged documents doc1 and doc2

Merging the two documents is simply performed by calling provconvert with argument -merge as follows.

provconvert -merge doc1-2-listing.txt -outfile target/doc1-2.provn

The -merge option expects a path to a file (or - to indicate standard input) that lists the files to be merged. In our case, we have a file doc1-2-listing.txt with the following contents:

file, src/main/resources/doc1.provn, provn
file, src/main/resources/doc2.provn, provn

Each line consists of three elements separated by a comma:

  1. A tag indicating if we are dealing with a file on the file system or a URL
  2. The path to the file or a full http URL
  3. The PROV format expected to be read
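
The three-field structure of a listing line can be sketched in Java as follows. This is a toy parser written for this tutorial, not the code used by provconvert. It assumes that fields never contain a comma and that surrounding spaces are not significant. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy parser for the three-field lines of a merge listing (not ProvToolbox code). */
final class MergeListingSketch {

    /** Returns one [tag, location, format] triple per non-empty line. */
    static List<List<String>> parse(String listing) {
        List<List<String>> entries = new ArrayList<>();
        for (String line : listing.split("\\R")) {
            if (line.isBlank()) {
                continue;                                // skip empty lines
            }
            String[] parts = line.split(",", 3);         // at most three fields
            if (parts.length != 3) {
                throw new IllegalArgumentException(
                        "expected three comma-separated fields: " + line);
            }
            entries.add(List.of(parts[0].trim(), parts[1].trim(), parts[2].trim()));
        }
        return entries;
    }

    public static void main(String[] args) {
        String listing = "file, src/main/resources/doc1.provn, provn\n"
                       + "file, src/main/resources/doc2.provn, provn\n";
        parse(listing).forEach(System.out::println);
    }
}
```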

For completeness, we show the details of the documents in PROV-N notation. First, “doc1”:

document

 prefix ex <http://example.org/#>

 entity(ex:e1,[ex:attr1="val1"])
 agent(ex:ag1)
 wasAttributedTo(ex:e1, ex:ag1)

endDocument

Then, “doc2”:

document

 prefix ex <http://example.org/#>

 entity(ex:e1,[ex:attr1="val2"])
 entity(ex:e0) 
 wasDerivedFrom(ex:e1, ex:e0)

endDocument

Finally, the merged document:

document
 prefix ex <http://example.org/#>

 entity(ex:e1,[ex:attr1 = "val1" %% xsd:string, ex:attr1 = "val2" %% xsd:string])
 entity(ex:e0)
 agent(ex:ag1)
 wasDerivedFrom(ex:e1, ex:e0)
 wasAttributedTo(ex:e1, ex:ag1)
endDocument

The merge operation follows key constraints of the PROV-CONSTRAINTS specification, such as key-object (constraint 22) and key-properties (constraint 23).
The reader who is familiar with the RDF representation of PROV will note that the merge operation is simply obtained by “concatenating” all the RDF files together.
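
To make the idea of fusing descriptions with the same identifier concrete, here is a minimal Java sketch. It is a toy model under simplifying assumptions, not the ProvToolbox implementation: a document is reduced to a map from entity identifier to a set of attribute-value strings, and relations and bundles are ignored. All names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of merging entity descriptions by identifier (not ProvToolbox code). */
final class EntityMergeSketch {

    /**
     * Each document maps an entity identifier to its attribute-value pairs,
     * written here as plain strings such as ex:attr1="val1".
     * Descriptions with the same identifier are fused; duplicate pairs are dropped.
     */
    static Map<String, Set<String>> merge(List<Map<String, Set<String>>> documents) {
        Map<String, Set<String>> merged = new LinkedHashMap<>();
        for (Map<String, Set<String>> doc : documents) {
            for (Map.Entry<String, Set<String>> e : doc.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new LinkedHashSet<>())
                      .addAll(e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> doc1 = Map.of("ex:e1", Set.of("ex:attr1=\"val1\""));
        Map<String, Set<String>> doc2 = new LinkedHashMap<>();
        doc2.put("ex:e1", Set.of("ex:attr1=\"val2\""));
        doc2.put("ex:e0", Set.of());
        // ex:e1 ends up with both attribute values, ex:e0 with none.
        System.out.println(merge(List.of(doc1, doc2)));
    }
}
```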

The merge operation becomes interesting in the presence of bundles.

2.2 Merging two documents with distinct bundles

First, we consider two documents with distinct bundles.

We now examine a variant of “doc1”, which contains a bundle bun1. In the illustration, the bundle is represented by a rectangle, which contains a description of e2 generated by a2.

doc1 with bundle bun1

The second document is a variant of “doc2” with another bundle named bun2. It contains a description of e3 generated by a3.

doc2 with bundle bun2

After merging the two documents, we obtain a new document containing both bun1 and bun2.

doc1 with bundle bun1 merged with doc2 with bundle bun2

As we can see, as the two bundles have different names, they are kept distinct in the merged document.

Concretely, the first document, with bundle bun1, is as follows.

document
 prefix ex <http://example.org/#>

 entity(ex:e1,[ex:attr1="val1"])
 agent(ex:ag1)
 wasAttributedTo(ex:e1, ex:ag1)

 bundle ex:bun1
   entity(ex:e2)
   activity(ex:a2,-,-)
   wasGeneratedBy(ex:e2,ex:a2,-)
 endBundle

endDocument

The second document, with bundle bun2, is as follows.

document
 prefix ex <http://example.org/#>

 entity(ex:e1,[ex:attr1="val2"])
 entity(ex:e0) 
 wasDerivedFrom(ex:e1, ex:e0)

 bundle ex:bun2
   entity(ex:e3)
   activity(ex:a3,-,-)
   wasGeneratedBy(ex:e3,ex:a3,-)
 endBundle

endDocument

The merged document with two bundles is as follows:

document
 prefix ex <http://example.org/#>

 entity(ex:e1,[ex:attr1 = "val1" %% xsd:string, ex:attr1 = "val2" %% xsd:string])
 entity(ex:e0)
 agent(ex:ag1)
 wasDerivedFrom(ex:e1, ex:e0)
 wasAttributedTo(ex:e1, ex:ag1)

 bundle ex:bun2
  entity(ex:e3)
  activity(ex:a3,-,-)
  wasGeneratedBy(ex:e3,ex:a3,-)
 endBundle

 bundle ex:bun1
  entity(ex:e2)
  activity(ex:a2,-,-)
  wasGeneratedBy(ex:e2,ex:a2,-)
 endBundle
endDocument

2.3 Merging and flattening two documents with distinct bundles

We can optionally use the -flatten option to “remove” bundles, and “pour” their content in the surrounding document.

provconvert -merge doc1b1-2b2-listing.txt -flatten -outfile target/doc1b1-2b2-flatten.provn

The resulting document no longer contains bundles.

Merge and flatten of doc1 with bundle bun1 and doc2 with bundle bun2

document
 prefix ex <http://example.org/#>

 entity(ex:e1,[ex:attr1 = "val1" %% xsd:string, ex:attr1 = "val2" %% xsd:string])
 entity(ex:e0)
 agent(ex:ag1)
 wasDerivedFrom(ex:e1, ex:e0)
 wasAttributedTo(ex:e1, ex:ag1)
 entity(ex:e3)
 activity(ex:a3,-,-)
 wasGeneratedBy(ex:e3,ex:a3,-)
 entity(ex:e2)
 activity(ex:a2,-,-)
 wasGeneratedBy(ex:e2,ex:a2,-)
endDocument

2.4 Merging two documents with the same bundle

Now, let us consider a variant of “doc2” with a bundle bun1, the same identifier as the bundle we had in the “doc1” variant. In the figure, we see that bundle bun1 contains a description of a2 generating e3.

A variant of doc2 with bundle named bun1

If we now merge doc1 with bundle bun1 and doc2 with bundle bun1, the merge procedure merges the descriptions contained in the two instances of bundle bun1. We obtain:

doc1 with bundle bun1 merged with doc2 with bundle bun1

2.5 Merging and flattening two documents with the same bundle

If in addition, we specify the -flatten option, merging and flattening operations result in the following document.

doc1 with bundle bun1 merged with doc2 with bundle bun1, after flattening

3. Conclusion

As our applications generate provenance incrementally, bundle by bundle, the ability to merge documents and collapse bundles has become critical. This functionality is implemented by ProvToolbox in the method IndexedDocument.merge(). This tutorial has shown that it is also directly available from the command line, using the provconvert utility.

What form of processing do you regularly perform on your provenance graphs? Which functionality would you like to see added to ProvToolbox? Tell us about this, or about any other issue related to ProvToolbox, on the GitHub issue tracker.

4. Appendix. Log Change

  • Original version submitted on 2015/07/27

What is in ProvToolbox 0.7.0?

1. Introduction

Today, I have released ProvToolbox 0.7.0. It has again been a consolidation phase, seeking to ensure better compliance, better inter-operability, better robustness, and better internal organization.

2. Novel Features

2.1 Error Codes

As provconvert is being used more frequently as part of more complex workflows, it became critical to return error codes to indicate when there is a problem. Here is an illustration of how this can be used. The first invocation of provconvert does not provide an argument to the option -infile, so the error code is non-zero. The second invocation runs without any problem, so the error code is zero.

> provconvert -infile
14:38:27,301 FATAL CommandLineArguments:297 - Parsing failed.  Reason: no argument for:infile
> echo $?
1

> provconvert -help
...
> echo $?
0

Error codes are defined in the interface org.openprovenance.prov.interop.ErrorCodes.

2.2 Document Comparison

To support the inter-operability harness, we needed a program capable of deciding whether two documents were serializations of the same PROV document. Such functionality already existed in ProvToolbox and was extensively used in our testing. It has now been exposed in provconvert.

An illustration of this functionality is as follows.

provconvert -infile file1.provn -compare file2.ttl -outcompare diff.txt

An error code (STATUS_COMPARE_DIFFERENT) is returned when the two files contain serializations of different PROV Documents.

2.3 Better Logging of Various Warnings

Log4j is the logging infrastructure used by ProvToolbox. We have refactored the code to ensure that some warnings, such as those generated by the PROV-N parser, are logged with Log4j. The intent is that all messages get logged through this infrastructure.

This naturally raises the question of what we should do when the PROV-N parser finds PROV-N statements that are not compatible with the grammar, recovers and continues parsing, or when a PROV Qualified Name is constructed with an incorrect syntax.

On the one hand, a permissive approach is good because it allows ProvToolbox to deal with a wide variety of inputs; on the other hand, there may be cases when we want to be strict. Is a strict mode a desirable feature? Your input would be welcome: let me know what your use cases are, and we will try to support them.

2.4 Syntax of Qualified Names in PROV-N

PROV-N has a production typedLiteral to encode all typed literals, consisting of a STRING_LITERAL for the external representation of the literal, and a datatype, expressed as a Qualified Name, for its type.

typedLiteral ::= STRING_LITERAL "%%" datatype

For instance, "1" %% xsd:integer represents the integer value 1. (In this case, PROV-N also supports the simpler convenience notation 1.)

A Qualified Name is expressed as follows (cf. Example 38).

  "ex:value" %% prov:QUALIFIED_NAME

A convenience notation is also permitted, in the form of 'ex:value'.

Before 0.7.0, ProvToolbox supported only the convenience notation for Qualified Names. It now supports both forms, in compliance with the specification.
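
The equivalence of the two notations can be illustrated with a short Java sketch. It is a toy reader based on two regular expressions, not the ProvToolbox PROV-N parser, and it assumes that the lexical form contains no escaped quotes. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy reader for the two PROV-N notations discussed above (not the ProvToolbox parser). */
final class TypedLiteralSketch {

    // "lexical form" %% datatype   (escaped quotes are not handled in this sketch)
    private static final Pattern TYPED =
            Pattern.compile("^\"([^\"]*)\"\\s*%%\\s*(\\S+)$");
    // 'prefix:local' convenience notation for Qualified Names
    private static final Pattern CONVENIENCE = Pattern.compile("^'([^']*)'$");

    /** Returns {lexical form, datatype}, or throws if the text matches neither notation. */
    static String[] parse(String text) {
        String trimmed = text.trim();
        Matcher typed = TYPED.matcher(trimmed);
        if (typed.matches()) {
            return new String[] { typed.group(1), typed.group(2) };
        }
        Matcher convenience = CONVENIENCE.matcher(trimmed);
        if (convenience.matches()) {
            // The single-quoted form is shorthand for a prov:QUALIFIED_NAME literal.
            return new String[] { convenience.group(1), "prov:QUALIFIED_NAME" };
        }
        throw new IllegalArgumentException("not a typed literal: " + text);
    }

    public static void main(String[] args) {
        String[] inputs = {
            "\"ex:value\" %% prov:QUALIFIED_NAME", "'ex:value'", "\"1\" %% xsd:integer" };
        for (String s : inputs) {
            String[] r = parse(s);
            System.out.println(r[0] + " : " + r[1]);
        }
    }
}
```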

2.5 Default xsd Namespace

PROV-DM, PROV-N, PROV-O all use http://www.w3.org/2001/XMLSchema# as the XML Schema Namespace URI. Since 0.7.0, this namespace URI has also become the default for XML Schema in ProvToolbox. (See NamespacePrefixMapper.html#XSD_NS.)

In previous versions, as ProvToolbox had a strong JAXB heritage, the default namespace URI for XML Schema was the “XML version” http://www.w3.org/2001/XMLSchema (note the lack of hash at the end). We moved away from this namespace URI since we want the programmer to manipulate the namespace URI used in the Recommendation. The “XML version” is only required when marshalling to or unmarshalling from XML.

2.6 Syntax of QualifiedName

PROV-DM defines a PROV Identifier as a Qualified Name, which is a name subject to namespace interpretation. It consists of a namespace, denoted by an optional prefix, and a local name.

PROV-N provides a concrete syntax for prov:QUALIFIED_NAME, further noting that a PROV-N qualified name QUALIFIED_NAME can be mapped to a valid IRI [RFC3987] by concatenating the namespace denoted by its prefix to the local name, whose \-escaped characters have been unescaped by dropping the character ‘\’ (backslash).
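
The mapping to an IRI can be sketched in a few lines of Java. This is an illustration of the rule as quoted above, not the ProvToolbox implementation: it assumes the prefix has already been resolved to its namespace, and it performs no syntactic validation. The class name, the method name and the example local name are hypothetical.

```java
/** Toy mapping from a PROV-N qualified name to an IRI (not ProvToolbox code). */
final class QualifiedNameIriSketch {

    /**
     * Concatenates the namespace with the local name, after dropping the
     * backslash in front of every escaped character.
     */
    static String toIri(String namespace, String escapedLocalName) {
        StringBuilder local = new StringBuilder();
        for (int i = 0; i < escapedLocalName.length(); i++) {
            char c = escapedLocalName.charAt(i);
            if (c == '\\' && i + 1 < escapedLocalName.length()) {
                i++;                                 // skip the backslash ...
                c = escapedLocalName.charAt(i);      // ... and keep the escaped character
            }
            local.append(c);
        }
        return namespace + local;
    }

    public static void main(String[] args) {
        // The local name a\=b is an assumed example containing one escaped character.
        System.out.println(toIri("http://example.org/#", "a\\=b"));
    }
}
```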

Before 0.7.0, ProvToolbox did not fully implement the syntax of prov:QUALIFIED_NAME, since it ignored escape characters and how they should be handled when forming a URI.

A consequence of this was that some URIs read from a ttl representation were not represented properly as Qualified Names in the toolbox, and were not converted back to their original form when exporting back to ttl.

All this is now addressed in ProvToolbox 0.7.0 with the possibility of forcing syntactic checks when creating Qualified Names. Full details are available from https://github.com/lucmoreau/ProvToolbox/wiki/Syntax-of-prov:QUALIFIED_NAME. A consequence of releasing ProvToolbox 0.7.0 with support for this syntax is that PROV-N documents previously generated may not be readable if they don’t already support this encoding.

2.7 Syntax of QName

PROV-XML mandates xsd:QName as the XSD datatype to be used for qualified names. However, the xsd:QName datatype is more restrictive than the QualifiedName defined in PROV-N, e.g. PROV-N allows local names to start with numbers, whereas xsd:QName does not. PROV-XML does not specify how to convert an arbitrary PROV Qualified Name into xsd:QName.

ProvToolbox now offers such a conversion function and also a method to check whether a xsd:QName is syntactically correct. Details of the encoding have been documented at https://github.com/lucmoreau/ProvToolbox/wiki/Mapping-PROV-Qualified-Names-to-xsd:QName.

Before ProvToolbox 0.7.0, the module that converts to PROV-XML simply ignored the required syntax of xsd:QName and generated xsd:QNames that were not syntactically valid. This was not acceptable.

A consequence of releasing ProvToolbox 0.7.0 with support for this encoding is that PROV-XML documents previously generated may not be readable if they don’t already support this encoding.

2.8 Internationalization Testing

Some testing of Unicode characters was introduced to ensure that multiple languages were supported in string representations, but also in qualified names.

2.9 Various Bug fixes

  • prov:InternationalizedString issue #133
  • incorrect prefix declaration in export issue #132
  • parsing relative uris with input stream in ttl issue #122
  • prov-n qualified names written as “ex:foo” %% prov:QUALIFIED_NAME issue #109
  • warning for prov:label non-string value issue #104
  • escaping of characters in Qualified Names and QNames issue #120
  • visualisation of prov:value issue #71
  • conversion to dot issue #67

3. Conclusion

I am keen to know who is using ProvToolbox and/or provconvert and for which purpose. Share details of your projects with me, and I will add them to https://github.com/lucmoreau/ProvToolbox/wiki/Projects-and-Applications-Using-ProvToolbox.

ProvToolbox will now be integrated in the inter-operability harness developed in collaboration with the Software Sustainability Institute. This test harness will allow us to check inter-operability of various software packages developed in Southampton, including ProvToolbox, ProvStore, ProvPy, ProvTranslator, ProvJS. If we identify inter-operability issues, we will seek to address them in due course.

For all details about ProvToolbox, see the github.io page http://lucmoreau.github.io/ProvToolbox/.

Thanks to Danius, Dong, and Heather for identifying issues or suggesting improvements and implementing them.

ProvToolbox Tutorial 2: Reading, Converting and Saving PROV Documents

1. Introduction

Building on the first ProvToolbox tutorial, the aim of this second tutorial is to show how to read a PROV document using ProvToolbox and export it to some format.

We assume that installation instructions as described in the first Tutorial have been followed. Details about the Maven configuration can also be found there.

2. Download and Execution

The tutorial is standalone and a zip archive can be downloaded from the following URL: http://search.maven.org/remotecontent?filepath=org/openprovenance/prov/ProvToolbox-Tutorial2/0.7.0/ProvToolbox-Tutorial2-0.7.0-src.zip. The tutorial can also be found on the ProvToolbox project on GitHub.

After unzipping the archive, we can execute the tutorial by calling:

mvn clean install

Besides the verbose logging by the Maven build process, the tutorial itself displays the following text, including some PROV expressed according to PROV-XML.

*************************
* Converting document  
*************************

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<prov:document xmlns:prov="http://www.w3.org/ns/prov#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:provbook="http://www.provbook.org" xmlns:jim="http://www.cs.rpi.edu/~hendler/">
    <prov:entity prov:id="provbook:a-little-provenance-goes-a-long-way">
        <prov:value xsi:type="xsd:string">A little provenance goes a long way</prov:value>
    </prov:entity>
    <prov:agent prov:id="provbook:Paul">
        <foaf:name xsi:type="xsd:string">Paul Groth</foaf:name>
    </prov:agent>
    <prov:agent prov:id="provbook:Luc">
        <foaf:name xsi:type="xsd:string">Luc Moreau</foaf:name>
    </prov:agent>
    <prov:wasAttributedTo>
        <prov:entity prov:ref="provbook:a-little-provenance-goes-a-long-way"/>
        <prov:agent prov:ref="provbook:Paul"/>
    </prov:wasAttributedTo>
    <prov:wasAttributedTo>
        <prov:entity prov:ref="provbook:a-little-provenance-goes-a-long-way"/>
        <prov:agent prov:ref="provbook:Luc"/>
    </prov:wasAttributedTo>
    <prov:entity prov:id="jim:LittleSemanticsWeb.html"/>
    <prov:wasDerivedFrom>
        <prov:generatedEntity prov:ref="provbook:a-little-provenance-goes-a-long-way"/>
        <prov:usedEntity prov:ref="jim:LittleSemanticsWeb.html"/>
    </prov:wasDerivedFrom>
</prov:document>

*************************

3. Reading and writing PROV documents in Java

The following Java snippet is extracted from the file src/main/java/org/openprovenance/prov/tutorial/tutorial2/ReadWrite.java. In line 3, it shows how a document can be read, given its path filein on the file system. In line 4, we see how a PROV Document can be saved into a file fileout. The writeDocument procedure determines the PROV format that is required by looking at the extension. If a non-standard extension is used, then the format can be specified explicitly, as in line 5, by one of the values of the enumerated type ProvFormat.

    public void doConversions(String filein, String fileout) {
        InteropFramework intF=new InteropFramework();
        Document document=intF.readDocumentFromFile(filein);
        intF.writeDocument(fileout, document);     
        intF.writeDocument(System.out, ProvFormat.XML, document);
    }

   public static void main(String [] args) {
        if (args.length!=2) throw new UnsupportedOperationException("main to be called with two filenames");
        String filein=args[0];
        String fileout=args[1];
        
        ReadWrite tutorial=new ReadWrite(InteropFramework.newXMLProvFactory());
        tutorial.openingBanner();
        tutorial.doConversions(filein, fileout);
        tutorial.closingBanner();
    }

For completeness, line 13 shows how the tutorial class is initialized and line 15 takes care of invoking the conversion functionality.

The tutorial is called from the command line, passing src/main/resources/a-little.provn as the input file, and target/a-little.svg as the output file.  Therefore, the a-little.provn file is converted to SVG (by line 4) and to XML on standard output (by line 5).

The following table lists the formats that are supported by ProvToolbox.

Extension  MIME type                    Direction
gv         text/vnd.graphviz            output
dot        text/vnd.graphviz            output
prov-asn   text/provenance-notation     input
prov-asn   text/provenance-notation     output
pn         text/provenance-notation     input
pn         text/provenance-notation     output
asn        text/provenance-notation     input
asn        text/provenance-notation     output
provn      text/provenance-notation     input
provn      text/provenance-notation     output
rdf        application/rdf+xml          input
rdf        application/rdf+xml          output
json       application/json             input
json       application/json             output
ttl        text/turtle                  input
ttl        text/turtle                  output
trig       application/trig             input
trig       application/trig             output
jpeg       image/jpeg                   output
jpg        image/jpeg                   output
provx      application/provenance+xml   input
provx      application/provenance+xml   output
xml        application/provenance+xml   input
xml        application/provenance+xml   output
png        image/png                    output
pdf        application/pdf              output
svg        image/svg+xml                output
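
The way a format can be derived from a filename extension can be sketched as follows. This self-contained Java example is only an illustration built from a few rows of the table above. It is not how ProvToolbox implements writeDocument, and the class name, the method name and the choice to lower-case the extension are assumptions.

```java
import java.util.Map;
import java.util.Optional;

/** Toy lookup from filename extension to MIME type (not ProvToolbox code). */
final class FormatByExtensionSketch {

    // A few rows copied from the table of supported formats.
    private static final Map<String, String> MIME_TYPES = Map.of(
            "provn", "text/provenance-notation",
            "ttl", "text/turtle",
            "json", "application/json",
            "xml", "application/provenance+xml",
            "svg", "image/svg+xml");

    /** Returns the MIME type for the extension of the given filename, if it is in the map. */
    static Optional<String> mimeTypeOf(String filename) {
        int dot = filename.lastIndexOf('.');
        if (dot < 0 || dot == filename.length() - 1) {
            return Optional.empty();                  // no extension
        }
        String extension = filename.substring(dot + 1).toLowerCase();
        return Optional.ofNullable(MIME_TYPES.get(extension));
    }

    public static void main(String[] args) {
        System.out.println(mimeTypeOf("target/a-little.svg"));
        System.out.println(mimeTypeOf("src/main/resources/a-little.provn"));
    }
}
```

When the extension is not a standard one, the lookup returns nothing, which corresponds to the case where the format has to be given explicitly through a ProvFormat value.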

4. Conclusion

For further documentation on the classes and methods used, Javadoc for ProvToolbox can be found from http://openprovenance.org/java/site/latest/apidocs/.  The Javadoc documentation also refers to PROV specifications where appropriate.

Suggestions for tutorials and also for ways of improving the programming experience offered by ProvToolbox are always welcome. Please raise issues on GitHub issue tracker.

5. Appendix. Log Change

  1. Original version submitted on 2015/06/30
  2. Updated to 0.7.0 on 2015/07/27

What is in ProvToolbox 0.6.2?

1. Introduction

Today, I have released ProvToolbox 0.6.2, some 11 months after the previous release. This has been a consolidation phase. ProvToolbox is used in various projects and applications, which have exercised its functionality, identified bugs, and raised requirements for new functionality to make it more useful. Concretely, ProvToolbox with its templating system is used in Picaso (contributors: Dong Huynh, Danius Michaelides), SmartSociety’s SmartShare application (contributor: Heather Packer), and eBook’s blockly-based workflow systems (contributor: Danius Michaelides). ProvToolbox is also used in ProvStore to support the conversion of PROV to various formats.

2. Novel Features

2.1 Document Merge and Flattening

It has become a critical requirement of several of our applications to merge PROV documents. If you think of the RDF representation of PROV, merging is a kind of concatenation of all triples. For other representations such as PROV-N, PROV-XML, and PROV-JSON, which are more statement oriented, merging documents fuses statements about the same resource: for instance, for an entity, all attributes are regrouped in a single statement. When documents contain bundles, these are also merged if they have the same identifier.

Furthermore, we have the option of stripping bundles from documents, as if we were pouring their contents in the document they occur in.

2.2 Standard inputs and outputs for provconvert

With provconvert, we can now use '-' as a filename to indicate that the input or output will use standard input or standard output. This allows provconvert to act much more like a Unix tool. However, because provconvert needs to know the format of its input or output (it would previously derive this from the filename extensions), we have introduced three extra options: -informat, -outformat and -bindformat. These take filename extensions or MIME types as their arguments.

Here we grab a provn document using the curl command line tool, convert it to xml, and show the output:

% curl -s http://www.provbook.org/provapi/documents/bk.provn | provconvert -infile - -informat provn -outfile - -outformat xml

The output as xml is:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<prov:document xmlns:prov="http://www.w3.org/ns/prov#"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:hendler="http://www.cs.rpi.edu/~hendler/"
xmlns:bk="http://www.provbook.org/is/#"
xmlns:dct="http://purl.org/dc/terms/"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:images="http://www.provbook.org/imgs/"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:provapi="http://www.provbook.org/provapi/documents/"
xmlns:provbook="http://www.provbook.org/">
    <prov:bundleContent prov:id="provbook:provenance">
        <prov:agent prov:id="provbook:Luc">
            <foaf:name xsi:type="xsd:string">Luc Moreau</foaf:name>
        </prov:agent>
        <prov:agent prov:id="provbook:Paul">
            <foaf:name xsi:type="xsd:string">Paul Groth</foaf:name>
        </prov:agent>
        <prov:entity prov:id="provbook:provenance">
... 

The format options also take precedence over the filename extensions that provconvert would otherwise use to derive formats, so we are now less restricted when we name files.
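Because provconvert can now read standard input and write standard output, it can also be driven from a script. The following Python sketch assumes that provconvert is installed and on the execution path; the helper names provconvert_command and convert are hypothetical, and only the options shown in the command above are used.

```python
import subprocess


def provconvert_command(informat, outformat):
    """Build the argument list for a stdin-to-stdout conversion."""
    return ["provconvert",
            "-infile", "-", "-informat", informat,
            "-outfile", "-", "-outformat", outformat]


def convert(data, informat, outformat):
    """Pipe data (bytes) through provconvert and return the converted bytes.

    Assumes provconvert is on the PATH; raises CalledProcessError on failure.
    """
    result = subprocess.run(provconvert_command(informat, outformat),
                            input=data, capture_output=True, check=True)
    return result.stdout
```

For instance, convert(provn_bytes, "provn", "xml") would mirror the curl pipeline shown earlier, with provn_bytes holding the downloaded document.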

2.3 Supported formats

The option -formats of provconvert provides a list of supported formats. The output lists one format per line, with each line giving a filename extension, its associated mime-type, and whether the entry is for input or for output.

% provconvert -formats 
gv      text/vnd.graphviz       output 
dot     text/vnd.graphviz       output 
trig    application/trig        input 
trig    application/trig        output 
provn   text/provenance-notation        input 
provn   text/provenance-notation        output 
...
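Since the listing is a simple three-column layout, it is easy to process in a script. The sketch below, in Python, parses text of the shape shown above; the function names are hypothetical, and the three-column layout is assumed from the excerpt rather than from a specification.

```python
def parse_formats(text):
    """Parse the -formats listing into (extension, mime_type, direction) tuples."""
    formats = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank lines and anything not in the three-column layout
        extension, mime_type, direction = fields
        formats.append((extension, mime_type, direction))
    return formats


def extensions_for(formats, direction):
    """Return the sorted extensions available for 'input' or for 'output'."""
    return sorted({ext for ext, _, d in formats if d == direction})
```

Applied to the six lines above, extensions_for(formats, "input") gives provn and trig, since gv and dot are listed for output only.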

2.4 RPM for provconvert

We now offer an RPM (Red Hat Package Manager) package for the binary release. Using the rpm command, one can now install provconvert with:

rpm -U https://repo1.maven.org/maven2/org/openprovenance/prov/toolbox/0.6.2/toolbox-0.6.2-rpm.rpm

2.5 Implementation of prov-template

prov-template is a specification introduced by Dong, Danius and myself that specifies a templating system for PROV. It allows for templates to be defined as PROV documents containing variables. Bindings consist of associations between variables and values. Templates can be expanded by replacing variables by the values specified in the bindings.

As prov-template is being used in several applications, we realised that parts of the specification had not been fully implemented, and that there were some bugs as well. The key changes include proper support for time in activities and in instantaneous events, and a correct implementation of “linked template variables”, which allow the template designer to control the cartesian products that arise when variables are bound to multiple values.
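To give an intuition of what controlling the cartesian product means, here is a small Python sketch. It is an illustration of the idea only, not the prov-template expansion algorithm: the function expand and its linked parameter are hypothetical, and the variable names are borrowed from the quote attribution template for convenience. Unlinked variables are combined in every possible way, whereas linked variables are iterated in lockstep.

```python
from itertools import product


def expand(bindings, linked=()):
    """Enumerate variable assignments for a dict mapping variables to value lists.

    linked is a sequence of groups (tuples of variable names); variables in the
    same group are iterated together rather than combined exhaustively.
    """
    groups = [tuple(group) for group in linked]
    grouped = {var for group in groups for var in group}
    groups += [(var,) for var in bindings if var not in grouped]

    rows_per_group = []
    for group in groups:
        if len({len(bindings[var]) for var in group}) != 1:
            raise ValueError("linked variables need the same number of values")
        # One row per position: the i-th value of every variable in the group.
        rows_per_group.append(list(zip(*(bindings[var] for var in group))))

    assignments = []
    for combination in product(*rows_per_group):
        assignment = {}
        for group, row in zip(groups, combination):
            assignment.update(zip(group, row))
        assignments.append(assignment)
    return assignments
```

With two authors and two names, expand(bindings) produces four assignments, including the unwanted pairing of one author with the other's name; passing linked=[("var:author", "var:name")] produces only the two intended ones.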

I am hoping to publish a tutorial on prov-template in the near future.

2.6 Bug Fixes

A few notable bug fixes are listed below.

  • prov-dot: conversion to dot (and subsequently svg, pdf, etc) escaping characters (issue 103)
  • prov-json: correct handling of bundle names and namespaces (issue 96)
  • prov-n: added newline at end of document (issue 112)
  • prov-template: transitive closure for linked variables (issue 113)

2.7 Tutorial

Two more tutorials have been produced. They are included in the release and I will publish blog posts about them shortly.

  • reading and converting PROV documents
  • merging documents

3. Where next?

ProvToolbox was designed to support interoperable conversion of PROV representations. In collaboration with the Software Sustainability Institute, we are developing a test harness that allows us to check inter-operability of various software packages developed in Southampton, including ProvToolbox, ProvStore, ProvPy, ProvTranslator, ProvJS. If we identify inter-operability issues, we will seek to address them in due course.

We have also identified a series of new requirements for prov-template, and possible ways of improving this templating system. We hope to produce the second iteration of this specification and deliver its reference implementation in ProvToolbox.

For all details about ProvToolbox, see the github.io page http://lucmoreau.github.io/ProvToolbox/.

Thanks to Danius, Dong, and Heather for identifying issues or suggesting improvements and implementing them.

Provenance of Publications: A PROV style for Latex

1. Provenance is still Challenging

Isn’t it frustrating that it is still hard to generate the provenance of our documents?

Isn’t it still challenging for our research community to demonstrate best practice?

While the provenance community has made substantial progress in terms of understanding and standardising provenance, it is an unfortunate reality that, due to the lack of easy tools, provenance still remains beyond the reach of the general public.

It is for me a great frustration that provenance of my papers cannot be generated automatically. Thus, I can’t demonstrate best practice to the community.

2. A Style for LaTeX

If we look at publications, we can see that they already contain a lot of provenance information in textual form, but this information is not made accessible in machine-processable format. Given that I use LaTeX for many of my publications, I have developed prov.sty – a LaTeX style that generates provenance information, on the basis of annotations that are inserted in the source of the document. In this blog post, I show the LaTeX annotations supported by prov.sty, and the type of provenance they generate.

3. LaTeX Annotations

I will use a running example taken from a recent paper “The Rationale of PROV”, which I co-authored with Paul, James, Tim and Simon.

Luc Moreau, Paul Groth, James Cheney, Timothy Lebo, and Simon Miles. The Rationale of PROV. Web Semantics: Science, Services and Agents on the World Wide Web, 2015, doi: 10.1016/j.websem.2015.04.001, available under CC BY license (http://creativecommons.org/licenses/by/4.0/).

3.1 Authors, Organizations, Title, …

Of course, a publication has a title, authors, and their affiliations.

Document Title, Authors and Organizations

The following LaTeX macros allow us to annotate

  • an author’s name with a URI using \provAuthor
  • an organization’s name with a URI using \provOrganization, and
  • a title using \provTitle

Lines 1-2 show how my name is marked up with \provAuthor and my ORCID URI. Likewise, lines 4-5 show how my institution’s name is marked up with \provOrganization and its web site’s URI. Finally, the title is marked up with \provTitle.

\provAuthor {Luc Moreau}
            {http://orcid.org/0000-0002-3494-120X}

\provOrganization {University of Southampton}
                  {http://www.soton.ac.uk/}

\provTitle {The Rationale of PROV}

As far as LaTeX is concerned, these annotations are macros which expand into their first argument, discarding the others, if any.

The resulting provenance is illustrated below. At the bottom, we see a yellow ellipse, with uniquely generated identifier 20892220-a071-4ef3-a799-3056447ec8a2; it has an attribute — the title. This entity is the publication entitled “The Rationale of PROV”. It is attributed to two agents, myself and the University of Southampton.

Authorship of Document

The following Turtle excerpt shows that the provided URIs are used in the description of the Person “Luc Moreau” and the Organization “University of Southampton”, both agents. Attribution of the document to the agent is by means of the property prov:wasAttributedTo.

<http://orcid.org/0000-0002-3494-120X> 
  a prov:Agent, prov:Person;
  foaf:name "Luc Moreau" . 

<http://www.soton.ac.uk/>
  a prov:Agent, prov:Organization;
  foaf:name "University of Southampton" . 

doc:20892220-a071-4ef3-a799-3056447ec8a2
  a prov:Entity ;
  schema:headline "The Rationale of PROV" ;
  prov:wasAttributedTo <http://orcid.org/0000-0002-3494-120X> ;
  prov:wasAttributedTo <http://www.soton.ac.uk/> .

Every time LaTeX typesets the document, a new identifier is generated in place of doc:20892220-a071-4ef3-a799-3056447ec8a2.
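The shape of the generated statements can be mimicked outside LaTeX. The following Python sketch is only an illustration of that shape: prov.sty itself is a LaTeX style, the function document_turtle is hypothetical, prefix declarations are omitted as in the excerpt above, and no escaping of special characters in names or titles is attempted. A fresh identifier is drawn on each call unless one is supplied, mirroring the behaviour described above.

```python
import uuid


def document_turtle(title, agents, doc_id=None):
    """Render Turtle in the style of the excerpt above.

    agents is a list of (uri, kind, name) triples, where kind is 'Person'
    or 'Organization'. A fresh UUID is used when doc_id is not given.
    """
    doc_id = doc_id or str(uuid.uuid4())
    blocks = []
    for uri, kind, name in agents:
        blocks.append(
            f"<{uri}>\n"
            f"  a prov:Agent, prov:{kind} ;\n"
            f'  foaf:name "{name}" .'
        )
    properties = ["a prov:Entity", f'schema:headline "{title}"']
    properties += [f"prov:wasAttributedTo <{uri}>" for uri, _, _ in agents]
    blocks.append(f"doc:{doc_id}\n  " + " ;\n  ".join(properties) + " .")
    return "\n\n".join(blocks)
```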

3.2 Projects, Funding agencies, …

Many publications include an acknowledgement section listing the projects and funding agencies that sponsored the work.

Acknowledgement to Projects and Funding Agencies

The following LaTeX macro, \provProject, allows us to annotate a project’s name with two URIs, one for the project and one for the funding agency.

\provProject
  {SOCIAM (EP/J017728/1)}
  {http://www.sociam.org/}
  {http://www.epsrc.ac.uk/}

The resulting provenance is illustrated below. At the bottom, we see the same entity 20892220-a071-4ef3-a799-3056447ec8a2 for the publication entitled “The Rationale of PROV”. It is attributed to the project, itself funded by the funding agency.

Project and Funding Agency

The following Turtle excerpt shows an attribution of the document to the project by means of the property prov:wasAttributedTo, and that the project was sponsored by the funding agency, encoded with the property prov:actedOnBehalfOf.

<http://www.epsrc.ac.uk/> 
  a prov:Agent.

<http://www.sociam.org/> 
  a prov:Agent;
  foaf:name "SOCIAM (EP/J017728/1)" ; 
  prov:actedOnBehalfOf <http://www.epsrc.ac.uk/> .

doc:20892220-a071-4ef3-a799-3056447ec8a2
  a prov:Entity ;
  prov:wasAttributedTo <http://www.sociam.org/> .

3.3 Bibliography

As far as the bibliography is concerned, very little work is required.

Bibliography

The usual LaTeX commands \bibliography and \bibliographystyle need to be preceded by \provBibliography, declaring that provenance needs to be generated for bibliographical entries.

\provBibliography
\bibliographystyle{elsarticle}
\bibliography{rationale}

For this to work, each bibliography entry needs to have a URI or DOI associated with it. We do this by adding a url or doi field to each BibTeX entry.

@TechReport{prov-dm:20130430,
  author = {Luc Moreau and Paolo {Missier (eds.)} …},
  title = {PROV-DM: The PROV Data Model},
  institution = {World Wide Web Consortium},
  year = {2013},
  type = {W3C Recommendation},
  number = {REC-prov-dm-20130430},
  month = oct,
  url = {http://www.w3.org/TR/2013/REC-prov-dm-20130430/}}

The resulting provenance is illustrated below. At the bottom, we see the same entity 20892220-a071-4ef3-a799-3056447ec8a2 for the publication entitled “The Rationale of PROV”. It was derived from the cited document.

The paper “The Rationale of PROV” cites the “PROV-DM Recommendation”

The following Turtle code shows a derivation from the document to the cited publication using prov:wasDerivedFrom.

doc:20892220-a071-4ef3-a799-3056447ec8a2 
  prov:wasDerivedFrom
  <http://www.w3.org/TR/2013/REC-prov-dm-20130430/> . 

3.4 Included Figures

The provenance of included figures can also be expressed.

A Figure Included from the PROV-O specification

The LaTeX macro \includegraphics can include a file (e.g. pdf, jpeg, etc). It can now generate the provenance of this inclusion: the current document is said to be derived from the included resource.

The included resource is a file on the file system, so a third party would typically not be able to access it directly. For this reason, the macro \provResource allows an online resource, a copy of the included file, to be declared.

\provResource{http://www.w3.org/TR/2013/REC-prov-o-20130430/diagrams/starting-points.svg}
\includegraphics{starting-points.png}

Thus, the provenance of this inclusion is modelled as follows: the current document was derived from the included resource, itself an alternate of the online resource. For a third party to be able to check that the online resource is a copy of the included one, prov.sty computes the md5 hash of the included file.

doc:20892220-a071-4ef3-a799-3056447ec8a2
  prov:wasDerivedFrom 
    inc:20892220-a071-4ef3-a799-3056447ec8a2-1. 

inc:20892220-a071-4ef3-a799-3056447ec8a2-1 
  a prov:Entity ;
  schema:contentLocation <starting-points.png> ;
  prov:alternateOf <http://www.w3.org/TR/2013/REC-prov-o-20130430/diagrams/starting-points.svg> ;
  crypto:md5 "1727ca12ed150ec814e3475859d7b362" . 

In this specific example, the online resource is an SVG file, whereas the included file is a PNG. Thus, the md5 hash cannot be used to check that they are identical.
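When the online resource and the included file do share the same format, the check a third party would perform is a straightforward hash comparison. Here is a minimal Python sketch using the standard hashlib module; the function names are hypothetical, and fetching the online resource is left out.

```python
import hashlib


def md5_of_bytes(data):
    """Return the hexadecimal md5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_hash(data, recorded_md5):
    """Check downloaded content against the crypto:md5 value in the provenance."""
    return md5_of_bytes(data) == recorded_md5.lower()
```

A third party would download the online resource, pass its bytes to matches_recorded_hash together with the value of crypto:md5 found in the provenance, and treat a mismatch as a sign that the two resources differ.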

4. Embedding Provenance

The macro \provEmbed allows for metadata about the provenance to be inserted in the PDF document, using the XMP metadata format. This command is expected to be called as the last macro before the end of the document.

\provLocation{http://eprints.soton.ac.uk/375233/7/provenance.ttl}
\provEmbed

XMP supports a subset of RDF/XML that does not appear to be expressive enough to embed PROV provenance directly. Instead, using the approach recommended by PROV-AQ, a pointer to the provenance is expressed, using the XMP format. The location itself is specified by LaTeX command \provLocation: http://eprints.soton.ac.uk/375233/7/provenance.ttl.

<rdf:RDF>
  <rdf:Description rdf:about=""
                   xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
    <xmpMM:DocumentID>uuid:3c59bdaa-dbf1-a740-963a-7c266e65f7b2</xmpMM:DocumentID>
    <xmpMM:InstanceID>uuid:085e7f83-2095-4342-8a5b-57b0d87f5715</xmpMM:InstanceID>
  </rdf:Description>
  <rdf:Description rdf:about=""
                   xmlns:prov="http://www.w3.org/ns/prov#">
    <prov:alternateOf rdf:resource="http://openprovenance.org/documents#20892220-a071-4ef3-a799-3056447ec8a2"/>
    <prov:has_anchor rdf:resource="http://openprovenance.org/documents#20892220-a071-4ef3-a799-3056447ec8a2"/>
    <prov:has_provenance rdf:resource="http://eprints.soton.ac.uk/375233/7/provenance.ttl"/>
  </rdf:Description>
</rdf:RDF>
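A consumer of the PDF can recover the pointer from such a packet with a few lines of code. The Python sketch below uses the standard xml.etree.ElementTree module and assumes the packet has already been extracted from the PDF as a string. Note that the sample adds an xmlns:rdf declaration, which the excerpt above does not show, so that it parses on its own; the function name provenance_pointers is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PROV = "http://www.w3.org/ns/prov#"

# Trimmed sample; the xmlns:rdf declaration is an assumption added for parsing.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
                   xmlns:prov="http://www.w3.org/ns/prov#">
    <prov:has_anchor rdf:resource="http://openprovenance.org/documents#20892220-a071-4ef3-a799-3056447ec8a2"/>
    <prov:has_provenance rdf:resource="http://eprints.soton.ac.uk/375233/7/provenance.ttl"/>
  </rdf:Description>
</rdf:RDF>"""


def provenance_pointers(xmp_rdf):
    """Return the has_anchor and has_provenance URIs found in an XMP RDF packet."""
    root = ET.fromstring(xmp_rdf)
    pointers = {}
    for name in ("has_anchor", "has_provenance"):
        element = root.find(f".//{{{PROV}}}{name}")
        if element is not None:
            pointers[name] = element.get(f"{{{RDF}}}resource")
    return pointers
```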

Using the LaTeX command \provBanner, it is also possible to generate a textual description of where the provenance is accessible.

A textual description of where the provenance is located

5. prov.sty: a github project

With this blog post, I have shown that it is possible to lower PROV’s barrier to adoption by adapting tools to generate provenance automatically. For those tools to be useful, they need to generate provenance systematically, for every created artifact. Over time, as similar tools get developed, their provenance should be linked up. For instance, the git2prov converter is capable of exporting PROV from Git. It should be possible for users to seamlessly navigate the provenance generated by both tools.

The LaTeX style prov.sty is still a proof of concept, but I feel that it is time to release it and have others use it. Improving usability, enhancing the quality of provenance, and strengthening LaTeX integration are all desirable.

prov.sty is available at https://github.com/prov-suite/prov-sty under the MIT Open Source license.

Pull requests are welcome. Let’s make it a community effort to develop prov.sty.

github project for prov.sty

For a more complete description of prov.sty, please see:

Moreau, Luc and Groth, Paul (2015) Provenance of Publications: A PROV style for LaTeX. In the Seventh USENIX Workshop on the Theory and Practice of Provenance (TAPP’15), USENIX. URI: http://eprints.soton.ac.uk/378019/

Provenance Reading List

I am regularly asked by students and researchers about a reading list on provenance. The following papers give them a good baseline about the kind of work we undertake in my group. This is not meant to be an extensive literature survey, but this should give them enough background to have discussions about projects related to provenance.

Introduction to PROV

Recommendations

Provenance Analytics

Semantics