What is in ProvToolbox 0.4.0?

ProvToolbox is an open source Java package to create, manipulate, save, and read PROV representations. PROV is the W3C standard for representing provenance on the Web. Since I have just released version 0.4.0 this weekend, I thought it would be good to explain recent changes and future directions for ProvToolbox.

History

First, some context. ProvToolbox was initially conceived during the lifetime of the W3C Provenance Working Group. It initially implemented the PROV data model, serialization to XML (and back) according to PROV-XML, mapping to RDF (and back) according to PROV-O (thanks to Mike Jewell for helping with the first version of the RDF-to-Java converter), serialization to PROV-N (and back), and serialization to JSON (and back) according to PROV-JSON (thanks to Trung Dong Huynh for helping with this converter). ProvToolbox was one of the implementations demonstrating the implementability of the PROV specifications. Its design was inspired by OPMToolbox, a similar toolkit for OPM, a predecessor of PROV. In its original design, ProvToolbox adopted a schema-driven approach, in which schemas, grammars, and ontologies were automatically compiled into marshallers and unmarshallers. This was particularly convenient while PROV was being designed and changed every other week.

Motivation

The purpose of ProvToolbox is to create PROV representations, manipulate them, and save and read them using standard serializations. By sheltering programmers from the nitty-gritty details of serialization, it is hoped that they can focus on provenance-specific functionality and improve the quality of their applications. ProvToolbox is known to be used in several applications and services, including the online PROV translator, the PROV validator, Amir’s CollabMap-based trust rating, and others. If your application uses ProvToolbox, please let me know.

Recent Changes

prov-model

The key change is the introduction of the prov-model artifact (a preliminary version of which was already released in 0.3.0). This artifact is the realization of the PROV conceptual model in Java. Classes and associations of the conceptual model are all formalized by Java interfaces, specifying their accessors and mutators, and other relevant methods. Implementations of these interfaces can be found in the form of Java beans, for instance in the prov-xml artifact, which takes care of marshalling to XML and unmarshalling from XML. Another implementation of these interfaces can be found in prov-sql (see the prov-sql section below). One can imagine further implementations using Spring Data, for instance.

Static and Refactored Beans

While a schema-driven approach was suitable when the PROV standard was being developed, now that PROV is frozen, it is better to define beans statically and curate them manually. For instance, attributes are now handled systematically, and expressed as org.openprovenance.model.Attribute. The outcome is beans that are more natural to the programmer.

Namespace Handling

In the toolbox, qualified names (known as QName in XML Schema) are used to represent URIs in a short form. Managing namespaces and associated prefixes is sometimes a pain. To facilitate the programmer’s task, a class Namespace, embedding all namespace-related processing, was introduced. Please let me know if it covers your needs.
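To make the idea of prefix management concrete, here is a minimal sketch of a prefix registry that expands qualified names to URIs and shortens URIs back to qualified names. The class PrefixRegistry and its methods register, expand, and shorten are hypothetical names chosen for illustration; this is not the API of the ProvToolbox Namespace class, and the ex prefix and http://example.org/ namespace are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical prefix registry for illustration; NOT the ProvToolbox Namespace class. */
class PrefixRegistry {
    // prefix -> namespace URI, kept in registration order
    private final Map<String, String> prefixToUri = new LinkedHashMap<>();

    /** Registers a prefix, rejecting an attempt to rebind it to a different namespace. */
    void register(String prefix, String namespaceUri) {
        String existing = prefixToUri.putIfAbsent(prefix, namespaceUri);
        if (existing != null && !existing.equals(namespaceUri)) {
            throw new IllegalArgumentException("Prefix already bound: " + prefix);
        }
    }

    /** Expands "prefix:local" into a full URI by concatenating namespace and local part. */
    String expand(String qualifiedName) {
        int colon = qualifiedName.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("No prefix in: " + qualifiedName);
        }
        String prefix = qualifiedName.substring(0, colon);
        String namespaceUri = prefixToUri.get(prefix);
        if (namespaceUri == null) {
            throw new IllegalArgumentException("Unknown prefix: " + prefix);
        }
        return namespaceUri + qualifiedName.substring(colon + 1);
    }

    /** Shortens a URI using the longest matching namespace; returns it unchanged if none matches. */
    String shorten(String uri) {
        String bestPrefix = null;
        String bestNamespace = "";
        for (Map.Entry<String, String> e : prefixToUri.entrySet()) {
            String ns = e.getValue();
            if (uri.startsWith(ns) && ns.length() > bestNamespace.length()) {
                bestPrefix = e.getKey();
                bestNamespace = ns;
            }
        }
        return bestPrefix == null ? uri : bestPrefix + ":" + uri.substring(bestNamespace.length());
    }

    public static void main(String[] args) {
        PrefixRegistry ns = new PrefixRegistry();
        ns.register("prov", "http://www.w3.org/ns/prov#");
        ns.register("ex", "http://example.org/"); // made-up namespace
        System.out.println(ns.expand("ex:e1"));                              // http://example.org/e1
        System.out.println(ns.shorten("http://www.w3.org/ns/prov#Entity")); // prov:Entity
    }
}
```

Keeping all prefix bookkeeping in one object means the rest of an application can pass qualified names around without caring where each prefix was declared.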

Unmarshalling from XML

Beans used to be generated by JAXB. However, for attributes such as prov:location, which were expected to be xsd:anySimpleType, the corresponding Java method expected each location attribute to be an Object. The accessor getLocation() used to have the following signature.

List<Object> getLocation()

However, round-trip conversion from Java to XML and back was not successful with QNames whose namespace had not been declared globally. To ensure compatibility with the PROV standard definition, and to shelter the programmer from these tedious serialization details, manual (un)marshallers were written (adapters, in JAXB speak). An extensive series of tests was developed: more than 200 tests are now run for each serialization to check round-trip conversion. In particular, we ensure support for RDF 1.1 primitive data types.

prov-sql

Another novelty of this release is a very preliminary object-relational mapping (ORM) for PROV. It allows PROV representations to be saved to and retrieved from SQL databases. Currently, there is no support for PROV-Dictionary and other extensions. Lots of schema optimizations are possible (and required too!). Feedback is welcome on the SQL schema and mapping. But before spending any more time on the SQL schema, there is some further refactoring of beans that I would like to implement (see next section).

Where next?

I am reasonably satisfied with the definitions in prov-model. There is still one significant change that I would like to introduce, which is likely to break applications using ProvToolbox once again. The JAXB automatic bean generation introduced IDRef, a Java class whose sole purpose was to serialize the attribute prov:ref="e1" in the example below.

<prov:wasGeneratedBy prov:id="gen1">
 <prov:entity prov:ref="e1"/>
 <prov:activity prov:ref="a1"/>
</prov:wasGeneratedBy>

The following Java methods were defined accordingly.

 IDRef getEntity()
 IDRef getActivity()

Instead, I propose to specify beans with the following methods, so that programmers no longer have to manipulate IDRefs, which are not part of the PROV data model and were only introduced for the purpose of serialization.

 QName getEntity()
 QName getActivity()

So far, ProvToolbox has used javax.xml.namespace.QName, but this class is supposed to represent XML QNames. XML QNames come with strong syntactic restrictions (though the Java class does not enforce them), but these have been relaxed in PROV Qualified Names. Therefore, I will introduce a Qualified Name class for ProvToolbox, supporting the syntactic definitions set by PROV, but also allowing easy conversion to Turtle and XML QNames; further, it will also offer functions to convert to and from the corresponding URIs.
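To illustrate the kind of conversions involved, here is a minimal sketch of a qualified name class. ProvQualifiedName and all of its methods are hypothetical and do not describe the class that ProvToolbox will ship. The sketch assumes that a URI is obtained by concatenating the namespace and the local part, and it uses an ASCII-only approximation of the XML NCName rule (the real rule admits many more characters), so a local part starting with a digit is treated as not convertible to an XML QName.

```java
import java.util.Optional;
import javax.xml.namespace.QName;

/** Hypothetical sketch for illustration only; not a ProvToolbox class. */
final class ProvQualifiedName {
    private final String namespaceUri;
    private final String localPart;
    private final String prefix;

    ProvQualifiedName(String namespaceUri, String localPart, String prefix) {
        this.namespaceUri = namespaceUri;
        this.localPart = localPart;
        this.prefix = prefix;
    }

    /** Assumption: the URI is the namespace concatenated with the local part. */
    String toUri() {
        return namespaceUri + localPart;
    }

    /** Rebuilds a qualified name from a URI, given the namespace it is known to belong to. */
    static ProvQualifiedName fromUri(String uri, String namespaceUri, String prefix) {
        if (!uri.startsWith(namespaceUri)) {
            throw new IllegalArgumentException(uri + " is not in namespace " + namespaceUri);
        }
        return new ProvQualifiedName(namespaceUri, uri.substring(namespaceUri.length()), prefix);
    }

    /** ASCII-only approximation of the XML NCName rule: no leading digit, limited punctuation. */
    boolean hasXmlSafeLocalPart() {
        return localPart.matches("[A-Za-z_][A-Za-z0-9._-]*");
    }

    /** Converts to an XML QName only when the local part passes the approximate check. */
    Optional<QName> toXmlQName() {
        return hasXmlSafeLocalPart()
                ? Optional.of(new QName(namespaceUri, localPart, prefix))
                : Optional.empty();
    }

    public static void main(String[] args) {
        String ex = "http://example.org/"; // made-up namespace
        ProvQualifiedName plain = new ProvQualifiedName(ex, "e1", "ex");
        ProvQualifiedName leadingDigit = new ProvQualifiedName(ex, "2014-report", "ex");
        System.out.println(plain.toUri());                           // http://example.org/e1
        System.out.println(plain.toXmlQName().isPresent());          // true
        System.out.println(leadingDigit.toXmlQName().isPresent());   // false
    }
}
```

Returning an Optional makes the lossy direction explicit: every name converts to a URI, but only some convert to an XML QName.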

With this in place, I hope that bean interfaces will be frozen until release 1.0.0. The focus will then be on finalizing and refactoring the various serializations.

Finally, I have already started documenting ProvToolbox, but much more is required!

Useful Pointers

GitHub repository: https://github.com/lucmoreau/ProvToolbox/

Javadoc: http://openprovenance.org/java/site/0_4_0/apidocs/

Maven repository: http://openprovenance.org/java/maven-releases/
