UML, Food Contamination Risk, and Black Holes: Provenance without Frontiers!

With some students and collaborators, we have published three papers at IPAW’18 (International Provenance and Annotation Workshop), which are being presented this week, during Provenance Week’18, hosted at King’s College London.

1. Provenance capture with UML2PROV

Developing provenance-enabled applications remains challenging.  A key difficulty is to ensure that the provenance generated by an application describes what the application actually does!  Prov-Templates already offers a declarative way of specifying the shape of the provenance to be generated, while bindings created at runtime by the application instantiate these templates into concrete provenance. UML2PROV, designed with Carlos Sáenz Adán and Beatriz Pérez Valle, goes a step further.

The ambitious vision of UML2PROV is to control the generation of provenance from UML specifications of a program, minimising the amount of manual intervention by the programmer. Three steps are involved:

  1. From UML specifications, UML2PROV creates Prov-Templates and a runtime library to generate bindings.
  2. The programmer’s intervention is limited to deploying the runtime library with the application.
  3. As the application runs, the runtime library creates bindings that can be used to derive provenance from the templates.

Carlos previously supported UML sequence and activity diagrams. The IPAW’18 paper extends this work by supporting class diagrams. This apparently simple extension required a significant re-design, as it allows the flow of data to be tracked more accurately in applications.  The system was also fully redesigned to minimise the programmer’s intervention: the original approach used reflection and proxies to intercept method calls, whereas it now relies on aspects.  Carlos is not the first to suggest that provenance aspects can be woven into code in order to generate provenance, but he is the first to do so systematically, in the context of a software engineering methodology such as UML.
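
To give a flavour of aspect-based capture, here is a minimal sketch, in Python rather than the AspectJ aspects that UML2PROV actually generates. It shows how intercepting a call can accumulate bindings at runtime; the decorator, the BINDINGS list and the template name are purely illustrative, not UML2PROV's API.

    import functools

    BINDINGS = []  # bindings accumulated at runtime, later fed to template expansion

    def record_provenance(template_name):
        """Illustrative stand-in for a generated aspect: intercept a call and
        record which template it instantiates, with what values."""
        def wrap(fn):
            @functools.wraps(fn)
            def inner(*args, **kwargs):
                result = fn(*args, **kwargs)
                BINDINGS.append({'template': template_name,
                                 'var': {'operation': fn.__name__,
                                         'consumed': list(args),
                                         'produced': result}})
                return result
            return inner
        return wrap

    @record_provenance('binary-operation')  # hypothetical template name
    def add(x, y):
        return x + y

    add(4, 5)  # BINDINGS now holds one set of bindings for this call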

Full details of the paper can be found at:


Carlos Saenz-Adan, Luc Moreau, Beatriz Pérez, Simon Miles, and Francisco J. García-Izquierdo. Automating provenance capture in software engineering with UML2PROV. In IPAW’2018: 7th International Provenance and Annotation Workshop, London, UK, July 2018.


2. Deriving Contamination Risk from Food Provenance

Food has always been a powerful illustration of why provenance matters. How many times, over the years, have I delivered talks mentioning “Provenance Wine” and “Provenance Whisky”?  However, the adage “provenance helps users place their trust in data or things” has proven challenging to demonstrate scientifically and rigorously.  I believe that the work by Belfrit Batlajery, Mark Weal, Adriane Chapman and myself is a significant step in that direction.

Belfrit’s approach is to use provenance-based descriptions of food supply chains, ideally from Farm to Fork, together with partial knowledge about contamination levels of bacteria therein. A belief propagation approach over factor graphs derived from the provenance graph allows him to infer levels of contamination across the network.  The point of this analysis is not so much to infer highly precise contamination levels, but instead to obtain an estimate that is accurate enough to determine where risk is highest.
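
As a flavour of the idea (and only that: the paper uses belief propagation over factor graphs, which handles general provenance topologies and observations at arbitrary nodes), here is a minimal sketch that propagates a contamination probability along a linear supply chain. All node names and probabilities are invented for the example.

    def propagate(chain, p_source, cond):
        """chain: node names ordered from farm to fork;
        p_source: prior probability that the first node is contaminated;
        cond: maps an edge (parent, child) to the pair
              (P(child contaminated | parent contaminated),
               P(child contaminated | parent clean))."""
        belief = {chain[0]: p_source}
        for parent, child in zip(chain, chain[1:]):
            p_given_dirty, p_given_clean = cond[(parent, child)]
            p = belief[parent]
            # marginalise over the parent's contamination state
            belief[child] = p * p_given_dirty + (1 - p) * p_given_clean
        return belief

    beliefs = propagate(
        ['farm', 'processor', 'retailer', 'fork'],
        p_source=0.10,
        cond={('farm', 'processor'): (0.80, 0.05),
              ('processor', 'retailer'): (0.90, 0.02),
              ('retailer', 'fork'): (0.95, 0.01)})
    print(beliefs)  # the highest-risk locations are those with the largest values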

This approach would allow an organisation to meet its due diligence requirements by logging food provenance, inferring contamination levels, and using these to identify the places where food sampling is most likely to reduce the level of risk across the supply chain, ultimately allowing consumers to have confidence in food products.

Full details of the paper can be found at:

Belfrit Victor Batlajery, Mark Weal, Adriane Chapman, and Luc Moreau. Belief propagation through provenance graphs. In IPAW’2018: 7th International Provenance and Annotation Workshop, London, UK, July 2018.


3. The search for Low-Mass X-ray Binaries

Our universe is amazing, full of complex systems that are hard to comprehend and visualise.  One of them, the low-mass X-ray binary (LMXB), is a binary star system in which one component is compact, either a black hole or a neutron star; amazingly, the other component transfers mass to the compact one. X-rays are produced by the matter being transferred between the components. If you want to see a visual representation of this, some amazing images can be found online.

Down to earth, Michael Johnson, in collaboration with Poshak Gandhi, Adriane Chapman and myself, has been developing data science techniques to help with the discovery of such LMXBs. The opportunity arises from the Large Synoptic Survey Telescope (LSST), which is still being designed. Michael has developed a provenance-enabled image processing pipeline.  From a provenance perspective, the challenge was to demonstrate the benefits of provenance to the astronomer. Michael was able to show that while provenance capture introduced a runtime overhead, the overhead was offset by significant compute-time savings obtained by querying provenance rather than recomputing results.  This is an important outcome, since it demonstrates that provenance should be routinely captured by data science toolkits because it brings benefits to their users.
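
The essence of the saving is memoisation through provenance: before re-running a pipeline step, query the provenance for a result previously generated by the same step with the same inputs. The sketch below illustrates the idea with a plain dictionary standing in for the provenance store; it is not the paper's pipeline API.

    def run_or_reuse(step_name, step_fn, inputs, store):
        key = (step_name, inputs)
        if key in store:              # provenance lookup: result already derived
            return store[key]         # reuse it, skipping the computation
        result = step_fn(*inputs)     # otherwise compute ...
        store[key] = result           # ... and record the derivation
        return result

    store = {}
    run_or_reuse('calibrate', lambda a, b: a + b, (1, 2), store)  # computes
    run_or_reuse('calibrate', lambda a, b: a + b, (1, 2), store)  # reuses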

Interestingly, Michael’s workflow uses the UML2PROV technique to generate its provenance.

Full details of the paper can be found at:

Michael A. C. Johnson, Luc Moreau, Adriane Chapman, Poshak Gandhi, and Carlos Saenz-Adan. Using the provenance from astronomical workflows to increase processing efficiency. In IPAW’2018: 7th International Provenance and Annotation Workshop, London, UK, July 2018.


A Community Repository for Provenance Templates

On several occasions, I have written about PROV-TEMPLATE, a declarative approach that enables designers and programmers to design and generate provenance compatible with the PROV standard of the World Wide Web Consortium. Designers specify the topology of the provenance to be generated by composing templates, which are provenance graphs containing variables, acting as placeholders for values. Programmers write programs that log values and package them up in sets of bindings, a data structure associating variables and values. An expansion algorithm generates instantiated provenance from templates and sets of bindings in any of the serialisation formats supported by PROV.
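
As a rough illustration of the expansion step, the sketch below substitutes var:-prefixed placeholders in a PROV-JSON-like structure with bound values. This is my own simplification: the real algorithm does more (it drops attributes with unbound variables, generates fresh identifiers for vargen: variables, and handles grouped bindings).

    def expand(node, bindings):
        """Recursively replace 'var:x' placeholders with bindings['x']."""
        if isinstance(node, dict):
            return {expand(k, bindings): expand(v, bindings)
                    for k, v in node.items()}
        if isinstance(node, list):
            return [expand(item, bindings) for item in node]
        if isinstance(node, str) and node.startswith('var:'):
            return bindings.get(node[4:], node)  # unbound: left as-is here
        return node

    template = {'activity': {'var:operation': {}},
                'used': {'_:u1': {'prov:activity': 'var:operation',
                                  'prov:entity': 'var:consumed1'}}}
    print(expand(template, {'operation': 'ex:op1', 'consumed1': 'ex:in1'}))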

To promote best practice, and to facilitate the implementation of provenance-enabled applications, I feel that the time has come to create a catalogue of provenance templates and examples of bindings to instantiate them. This catalogue should be open and available for anybody to use, for any purpose: academic, commercial or other.

I am pleased to announce the launch of a community repository for provenance templates at:

I am drafting a document specifying the governance of this repository. I am also writing up some guidelines on how to structure the repository and the process by which templates can be shared (it will obviously rely on GitHub for this!).

These are starting points, as I really want this repository to become a community resource; therefore, I welcome suggestions for the management of the repository.

Shortly, I will be talking about the governance of the repository. Watch this space …


Service for template-based provenance generation

Previously, I blogged about prov-template, an approach for generating provenance. It consists of a declarative definition of the shape of the provenance to be generated, referred to as a template, and sets of values used to instantiate templates into concrete provenance. Today, I am pleased to write about a new online service that expands such templates into provenance.

The service is available online. I will now illustrate its use through a few examples.

First, let’s consider an example of a template (also available in its PROV-N version). Visually, it looks like this.


A template for a binary operation

It shows an activity, two entities used as input, and an entity generated as output. There is an agent associated with the activity. The output is derived from the two inputs. This provenance description is contained in a provenance document, but what makes it a template is that the identities of nodes are variables, i.e. URIs in a reserved namespace with prefix var: (var:produced, var:operation, …). These variables are meant to be replaced by concrete values. Variables are also allowed in the value position of property-value pairs (cf. var:operation_type).

This template for instance could be used to describe the addition of two numbers resulting in their sum.
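
For readers who want to reproduce the template programmatically, here is a sketch using the prov Python package (mentioned elsewhere on this site). It is reconstructed from the figure’s description and assumes PROV-TEMPLATE’s conventional var namespace, http://openprovenance.org/var#.

    from prov.model import ProvDocument, Namespace

    VAR = Namespace('var', 'http://openprovenance.org/var#')

    doc = ProvDocument()
    doc.add_namespace(VAR)
    # the activity, its agent, the two inputs and the output, all as variables
    doc.activity(VAR['operation'],
                 other_attributes={'prov:type': VAR['operation_type']})
    doc.agent(VAR['agent'])
    doc.entity(VAR['consumed1'], {'prov:value': VAR['consumed_value1']})
    doc.entity(VAR['consumed2'], {'prov:value': VAR['consumed_value2']})
    doc.entity(VAR['produced'], {'prov:type': VAR['produced_type'],
                                 'prov:value': VAR['produced_value']})
    # the relations shown in the figure
    doc.used(VAR['operation'], VAR['consumed1'])
    doc.used(VAR['operation'], VAR['consumed2'])
    doc.wasGeneratedBy(VAR['produced'], VAR['operation'])
    doc.wasAssociatedWith(VAR['operation'], VAR['agent'])
    doc.wasDerivedFrom(VAR['produced'], VAR['consumed1'])
    doc.wasDerivedFrom(VAR['produced'], VAR['consumed2'])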

The template service looks as follows. Two input boxes respectively expect the URL of a template (you need to ensure that the template is accessible to the service) and the bindings between variables and values used to instantiate the template. For convenience, the template pull-down menu already provides the link to the template described above. Likewise, the example pull-down menu contains several examples of bindings. Let’s select the first one, and click on the SVG button to generate an SVG representation of the expanded provenance.


Template expansion service

The result is as follows: variable names have been instantiated with the identities of the activity, entities and agent, but also with the values of properties. Property-value pairs whose value was a variable not assigned by the bindings are simply removed from the expanded provenance (for instance, as the variable var:operation_type is unassigned, the type property was removed from the expansion).


Below, we find the expanded provenance for the fourth set of bindings. There, we see that two different outputs, output1 and output2, were provided, and that they have been given different numbers of attributes.


The bindings are expressed in a simple JSON structure. The first set of bindings is expressed as follows.

  "var": {
    "operation": [
        "@id": "ex:operation1"
    "agent": [
        "@id": "ex:ag1"
    "consumed1": [
        "@id": "ex:input_1"
    "consumed_value1": [
        "@value": "4",
        "@type": "xsd:int"
    "consumed2": [
        "@id": "ex:input_2"
    "consumed_value2": [
        "@value": "5",
        "@type": "xsd:int"
    "produced": [
        "@id": "ex:output"
    "produced_type": [
        "@id": "ex:Result"
    "produced_value": [
        "@value": "20",
        "@type": "xsd:int"
  "context": {
    "ex": ""
  "template": ""


Go ahead and experiment with templates and bindings using the service. For more details, please see previous posts.

Happy prov-templating …

Legacy sites at openprovenance.org

I recently wrote a blog post about the relaunch of openprovenance.org. Today, I am pleased to announce the availability of two websites providing a historical perspective on the work that took place in the provenance community.

The Provenance Challenge website is preserved in its original wiki look-and-feel, as it constituted a significant community effort that led to the PROV standardisation. At the time, the community decided that it needed to understand the different representations of provenance, their common aspects, and the reasons for their differences. The Provenance Challenge was born as a community activity aiming to understand and compare the various provenance solutions. Three consecutive provenance challenges took place. A significant artifact that resulted from the Provenance Challenge series is the Open Provenance Model.

The Open Provenance Model (OPM) website is likewise preserved. OPM was the first community data model for provenance, designed as a conceptual data model for exchanging provenance information. It contained key concepts such as Artifact (called Entity in PROV), Process (called Activity in PROV), and Agent. It also introduced the notions of usage, generation and derivation of artifacts.

Of course, all this is now superseded by PROV, the W3C set of Recommendations and Notes for provenance. These legacy sites are made available to the community for reference. We aim to persist those pages and URLs in the future. Feel free to link to them!

openprovenance.org Relaunched

It is my pleasure to announce the relaunch of openprovenance.org, the site for standards-based provenance solutions.

With our move to King’s College London, Dong and I have migrated the provenance services from Southampton to King’s. I am pleased to announce the launch of the following services at openprovenance.org:

  • ProvStore, the provenance repository that enables users to store, share, browse and manage provenance documents.
  • A translator capable of converting between different representations of PROV, including visual representations of PROV in SVG.
  • A validator service that checks provenance documents against the constraints defined in PROV-CONSTRAINTS. Such a service can detect logically inconsistent provenance, for example an activity that is said to have started after it ended, or something that was used before it was even created (see the sketch after this list).
  • A template expansion service that facilitates a declarative approach to provenance generation, in which the shape of provenance is defined by a provenance document containing variables, acting as placeholders for values. When provided with a set of bindings associating variables with values, the template expansion service generates a concrete provenance document.
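
To make the notion of inconsistency concrete, here is a minimal sketch, built with the Prov Python library mentioned below, of a document that a validator should reject because the entity is used before it is generated. All identifiers and times are illustrative.

    from datetime import datetime
    from prov.model import ProvDocument

    doc = ProvDocument()
    doc.add_namespace('ex', 'http://example.org/')
    doc.activity('ex:a')
    doc.entity('ex:e')
    doc.used('ex:a', 'ex:e', time=datetime(2018, 1, 1))            # used in January ...
    doc.wasGeneratedBy('ex:e', 'ex:a', time=datetime(2018, 6, 1))  # ... generated in June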

The Southampton services will be decommissioned shortly. If you have data in the old provenance store, we provide a procedure for you to download your provenance documents from the old store and to upload them to the new one. In the age of GDPR, you will have to sign up for the new provenance store and accept its terms and conditions.

While the services may look quite similar, under the bonnet there have been significant changes.

  • We have adopted a micro-service architecture, allowing our services to be composed in interesting ways. Services are deployed in Docker containers, facilitating their redeployment and enabling their configurability. We are also investigating other forms of licensing that would allow the services to be deployed elsewhere, giving the host full control over access, storage and management. (Contact us if this is of interest to you.)
  • We have adopted Keycloak for identity management and access control for our existing and future micro-services. This offers an off-the-shelf solution for managing identities and obtaining consent. A single registration for all our services will now be possible.

As before, the above services are powered by open source libraries. ProvToolbox is a Java toolkit for processing provenance, while Prov Python is a toolkit for processing provenance in Python.

The Photo of the Week #dontmesswithmydata

As a technologist, I have observed with strong interest the fallout of Carole Cadwalladr’s investigative journalism published by the Observer, the Guardian, Channel 4 and the New York Times.  Presumption of innocence is important, and I do hope that the official investigation will make responsibilities and failures explicit.

However, out of this tumultuous week for the Web and Social Media, I find the following photo extremely powerful.



Taken from The Guardian. Enforcement officers working for the Information Commissioner’s Office entering the premises of Cambridge Analytica.


Stealing data is no different from stealing money.  This year will see the GDPR (General Data Protection Regulation) come into force, but we should not forget that there already exist strong principles of data protection.  For convenience, I copy below the eight data protection principles:

  1. Personal data shall be processed fairly and lawfully and, in particular, shall not be processed unless at least one of the conditions in Schedule 2 is met, and in the case of sensitive personal data, at least one of the conditions in Schedule 3 is also met.
  2. Personal data shall be obtained only for one or more specified and lawful purposes, and shall not be further processed in any manner incompatible with that purpose or those purposes.
  3. Personal data shall be adequate, relevant and not excessive in relation to the purpose or purposes for which they are processed.
  4. Personal data shall be accurate and, where necessary, kept up to date.
  5. Personal data processed for any purpose or purposes shall not be kept for longer than is necessary for that purpose or those purposes.
  6. Personal data shall be processed in accordance with the rights of data subjects under this Act.
  7. Appropriate technical and organisational measures shall be taken against unauthorised or unlawful processing of personal data and against accidental loss or destruction of, or damage to, personal data.
  8. Personal data shall not be transferred to a country or territory outside the European Economic Area unless that country or territory ensures an adequate level of protection for the rights and freedoms of data subjects in relation to the processing of personal data.


If anybody had a doubt, the law has the power to enforce regulations. As an individual, I welcome this power. #dontmesswithmydata!





Provenance Reading List (v2)

I am regularly asked by students and researchers for a reading list on provenance. The following papers give a good baseline on the kind of work we undertake in my group. This is not meant to be an extensive literature survey, but it should provide enough background to have discussions about projects related to provenance.

This page updates a previous version of the reading list.

Introduction to Provenance and PROV


Provenance Analytics

Software Engineering and Provenance


Provenance and Accountability