Provenance in the Wild: Provenance at The Gazette

Today, following my post on Provenance in the 2014 National Climate Assessment, I continue to blog about applications making use of PROV. The Gazette has been the UK's official public record since 1665; it has a long and established history and has been at the heart of British public life for almost 350 years (see an overview of its history). Today's Gazette continues to publish business-critical information, and, thanks to its digital transformation, this information is now more accessible than ever before.

A quick reminder of what I mean by provenance:

Provenance is a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing.

The W3C PROV recommendations offer a conceptual model for provenance and its mapping to various technologies (such as RDF, XML, or a simple textual form), as well as the means to expose, find, and share provenance over the Web.

In a true open government approach, the tender for the Gazette Web site is available online, and requested the use of PROV (which was then a Candidate Recommendation). The purpose of provenance on The Gazette is to describe the capture, transformation, enrichment, and publishing process applied to all Notices published by The Gazette. Let us examine how this was actually deployed.

For instance, the notice available at https://www.thegazette.co.uk/notice/2152652 records a change in a partnership. In the right-hand column, we see a link to the Provenance Trail.

[Figure: notice 2152652, with the Provenance Trail link highlighted]

Following this link, we obtain the provenance information for this notice:

[Figure: provenance page for notice 2152652]

On this page, we find a graphical representation of the publication pipeline for this notice, and various links to machine-processable representations of the provenance (in RDF/XML and JSON). Uploading this provenance into the Southampton PROV Translator service, we obtain the following graphical representation, which shows a much more complex and detailed pipeline.

 

[Figure: detailed provenance graph for notice 2152652, rendered by the Southampton PROV Translator]

Provenance information is not just exposed in a browsable format, but also in a machine-processable one. Going back to the original https://www.thegazette.co.uk/notice/2152652 page and looking at the HTML source, we can find the following link element, stating the existence of a relation between the current document and the provenance page. The relation http://www.w3.org/ns/prov#has_provenance is defined in the W3C Provenance Working Group's Provenance Access and Query specification.

<link rel="http://www.w3.org/ns/prov#has_provenance"
      href="https://www.thegazette.co.uk/id/notice/2152652/provenance"
      title="PROVENANCE" />

Tools like Danius Michaelides' ProvExtract can pick up this link and feed it into the Southampton Provenance Tool suite. ProvExtract also extracts some metadata, expressed as RDFa embedded in the document.

 

[Figure: ProvExtract output for notice 2152652]
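
To give a flavour of this kind of link discovery, here is a minimal sketch of a client that locates the provenance URI in the page. It is not how ProvExtract is implemented; it simply uses the jsoup HTML parser, and the class name is mine.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ProvenanceLinkFinder {
    private static final String HAS_PROVENANCE = "http://www.w3.org/ns/prov#has_provenance";

    public static void main(String[] args) throws Exception {
        // Fetch the notice page and scan its <link> elements for the
        // prov:has_provenance relation defined by the PROV-AQ specification.
        Document page = Jsoup.connect("https://www.thegazette.co.uk/notice/2152652").get();
        for (Element link : page.select("link[rel]")) {
            if (HAS_PROVENANCE.equals(link.attr("rel"))) {
                System.out.println("Provenance resource: " + link.attr("abs:href"));
            }
        }
    }
}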

Unfortunately, a slight interoperability issue showed up here. The resource https://www.thegazette.co.uk/id/notice/2152652/provenance has only an HTML representation. In our ProvBook, Paul and I explain how content negotiation can be used to serve multiple representations, including standardized ones such as Turtle, PROV-XML, and PROV-N; a minimal sketch of such a request follows below. It is an improvement that may be considered by The Gazette in future releases. Overall, I think it is remarkable how The Gazette exposes provenance in both visual and machine-processable formats. Congratulations to The Gazette's team for this achievement.
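
Coming back to content negotiation: a client hoping for a Turtle serialization might issue a request along these lines. This is a minimal sketch using java.net.HttpURLConnection; against the current service, the response would still be HTML, whatever the Accept header says.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ProvenanceFetcher {
    public static void main(String[] args) throws Exception {
        // Ask the server for a Turtle representation of the provenance resource;
        // a content-negotiating server would honour the Accept header, otherwise
        // it falls back to whatever representation it serves (here, HTML).
        URL url = new URL("https://www.thegazette.co.uk/id/notice/2152652/provenance");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "text/turtle");
        System.out.println("Status: " + conn.getResponseCode());
        System.out.println("Content-Type: " + conn.getContentType());
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}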

I will finish this post with a few concluding remarks.

  1. While the National Climate Assessment 2014 report exposes provenance to end-users as text, The Gazette opted for a high-level visualization of the pipeline. It is interesting to observe how simplified The Gazette's graphical representation is, compared to the graphical rendering of the raw data displayed in this post. It shows that abstraction is an important processing step to apply to provenance to make it understandable; this was a recurrent topic of discussion at Provenance Week 2014.
  2. The Gazette also provides a signed version of its provenance (and other metadata). It is a powerful way of asserting the authorship of such provenance: in other words, it is a cryptographic form of provenance of provenance, which is non-repudiable (The Gazette cannot deny publishing such information) and unforgeable (nobody else can claim to have published this information); see the sketch after this list. Here, practitioners are ahead of standardisation and theory: there is no standard way of signing provenance, and there is no formal definition of a suitable normal form of provenance ready for signature.
  3. As we supply such PROV data to the tools we have developed in Southampton, we can obtain interesting visualizations. It really shows the benefits of standardisation, since PROV data produced by The Gazette team can be consumed by independently developed applications.
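
To illustrate point 2, here is a minimal sketch of signing and verifying a serialized provenance document with Java's java.security API. This is not The Gazette's actual signing mechanism, and it sidesteps the open question of a normal form: it simply signs the bytes of one particular serialization.

import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;
import java.util.Base64;

public class ProvenanceSigner {
    public static void main(String[] args) throws Exception {
        // The serialized provenance to sign; in practice this would be a
        // canonical form of the provenance document, which is precisely
        // what is not yet standardized.
        byte[] provenance = "document ... endDocument".getBytes(StandardCharsets.UTF_8);

        // Generate a key pair for the example; a real publisher would use a
        // long-lived key whose public part is known to consumers.
        KeyPair keys = KeyPairGenerator.getInstance("RSA").generateKeyPair();

        // Sign: only the holder of the private key can produce this signature
        // (non-repudiation, unforgeability).
        Signature signer = Signature.getInstance("SHA256withRSA");
        signer.initSign(keys.getPrivate());
        signer.update(provenance);
        byte[] signature = signer.sign();
        System.out.println(Base64.getEncoder().encodeToString(signature));

        // Verify: anyone with the public key can check that the provenance has
        // not been tampered with and was signed by the key holder.
        Signature verifier = Signature.getInstance("SHA256withRSA");
        verifier.initVerify(keys.getPublic());
        verifier.update(provenance);
        System.out.println("Signature valid: " + verifier.verify(signature));
    }
}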

What is in ProvToolbox 0.4.0?

ProvToolbox is an open source Java package to create, manipulate, save, and read PROV representations. PROV is the W3C standard for representing provenance on the Web. Since I have just released version 0.4.0 this weekend, I thought it would be good to explain recent changes and future directions for ProvToolbox.

History

First, some context. ProvToolbox was initially conceived during the lifetime of the W3C Provenance Working Group. It initially implemented the PROV data model; serialization to XML and back, according to PROV-XML; mapping to RDF and back, according to PROV-O (thanks to Mike Jewell for helping with the first version of the RDF-to-Java converter); serialization to PROV-N and back; and serialization to JSON and back, according to PROV-JSON (thanks to Trung Dong Huynh for helping with this converter). ProvToolbox was one of the implementations demonstrating the implementability of the PROV specifications. Its design was inspired by OPMToolbox, a similar toolkit for OPM, a predecessor of PROV. In its original design, ProvToolbox adopted a schema-driven approach, in which schemas, grammars, and ontologies were automatically compiled into marshallers and unmarshallers. This was particularly convenient when PROV was being designed and changed every other week.

Motivation

The purpose of ProvToolbox is to create PROV representations, manipulate them, and save and read them using standard serializations. By sheltering programmers from the nitty-gritty details of serialization, it is hoped that they can focus on provenance-specific functionality, and improve the quality of their applications. ProvToolbox is known to be used in several applications and services, including the online PROV translator, the PROV validator, Amir's CollabMap-based trust rating, and others. If your application uses ProvToolbox, please let me know.

Recent Changes

prov-model

The key change is the introduction of the prov-model artifact (a preliminary version of which was already released in 0.3.0). This artifact is the realization of the PROV conceptual model in Java. Classes and associations of the conceptual model are all formalized by Java interfaces, specifying their accessors, mutators, and other relevant methods. Implementations of these interfaces can be found in the form of Java beans, for instance in the prov-xml artifact, which takes care of marshalling to and unmarshalling from XML. Another implementation of these interfaces can be found in prov-sql (discussed below). One can imagine further implementations using Spring Data, for instance.
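
To give a flavour of programming against prov-model, the following sketch creates an entity through the factory supplied by prov-xml. Treat the package and method names as indicative rather than authoritative; they may differ slightly between releases.

import javax.xml.namespace.QName;
import org.openprovenance.prov.model.Entity;

public class ProvModelSketch {
    public static void main(String[] args) {
        // prov-xml provides one concrete implementation of the prov-model
        // interfaces; the factory and package names below are indicative.
        org.openprovenance.prov.xml.ProvFactory pFactory =
                new org.openprovenance.prov.xml.ProvFactory();

        // In 0.4.0, identifiers are still javax.xml.namespace.QNames
        // (see the discussion of Qualified Names further down).
        QName id = new QName("http://example.org/", "e1", "ex");

        // The returned object implements the prov-model Entity interface;
        // client code never needs to know which bean realizes it.
        Entity e1 = pFactory.newEntity(id);
        System.out.println(e1.getId());
    }
}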

Static and Refactored Beans

While a schema-driven approach was suitable when the PROV standard was being developed, now that PROV is frozen, it is better to define beans statically and curate them manually. For instance, attributes are now handled systematically, and expressed as org.openprovenance.model.Attribute. The outcome is beans that are more natural to the programmer.

Namespace Handling

In the toolbox, qualified names (known as QNames in XML Schema) are used to represent URIs in a short form. Managing namespaces and associated prefixes is sometimes a pain. To facilitate the programmer's task, a class Namespace, embedding all namespace-related processing, was introduced. Please let me know if it covers your needs.
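
For instance, prefix management might look as follows (again, an indicative sketch; package and method names may differ slightly):

import org.openprovenance.prov.model.Namespace;

public class NamespaceSketch {
    public static void main(String[] args) {
        // Gather all prefix/namespace bindings in one place, instead of
        // threading maps of prefixes through every call.
        Namespace ns = new Namespace();
        ns.register("prov", "http://www.w3.org/ns/prov#");
        ns.register("ex", "http://example.org/");

        // The same Namespace object can then be attached to documents and
        // handed to serializers, so that prefix declarations are emitted
        // consistently across serializations.
    }
}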

Unmarshalling from XML

Beans used to be generated by JAXB. However, for attributes such as prov:location, whose values are expected to be of type xsd:anySimpleType, the corresponding Java method expected each location attribute to be an Object. The accessor getLocation() used to have the following signature.

List<Object> getLocation()

However, round-trip conversion from Java to XML and back was not successful for QNames whose namespace had not been declared globally. To ensure compatibility with the PROV standard definition, and to shelter the programmer from these tedious serialization details, manual (un)marshallers were written (adapters, in JAXB speak). An extensive series of tests was developed: more than 200 tests are now run for each serialization to check round-trip conversion. In particular, we ensure support for the RDF 1.1 primitive datatypes.
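
For readers unfamiliar with JAXB adapters, their general shape is shown below. This is an illustration only, not the actual ProvToolbox adapter, which handles the full range of PROV attribute values.

import javax.xml.bind.annotation.adapters.XmlAdapter;
import javax.xml.namespace.QName;

// Illustrative only: the shape of a JAXB adapter, not the ProvToolbox code.
// A real adapter also consults the in-scope namespace declarations, so that
// QName values survive the round trip even when their prefix is declared locally.
public class AnySimpleTypeAdapter extends XmlAdapter<String, Object> {

    @Override
    public Object unmarshal(String lexicalForm) {
        // Decide, from context, whether the lexical form denotes a QName,
        // a number, a date, etc.; here we simply keep the string.
        return lexicalForm;
    }

    @Override
    public String marshal(Object value) {
        if (value instanceof QName) {
            QName q = (QName) value;
            // Emit prefix:local; the prefix must also be declared in scope.
            return q.getPrefix() + ":" + q.getLocalPart();
        }
        return value.toString();
    }
}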

prov-sql

Another novelty of this release is a very, very preliminary ORM mapping for PROV. It allows PROV representations to be saved to and retrieved from SQL databases. Currently, there is no support for PROV-Dictionary and other extensions. Many schema optimizations are possible (and required too!). Feedback is welcome on the SQL schema and mapping. But before spending any more time on the SQL schema, there is some further refactoring of beans that I would like to implement (see next section).
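
To fix ideas, an ORM mapping is of the following general shape. This is a hypothetical excerpt, not the actual prov-sql classes, which implement the prov-model interfaces.

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.Table;

// Hypothetical sketch of an ORM-mapped bean, not the actual prov-sql mapping.
@Entity
@Table(name = "ENTITY")
public class SqlEntity {

    @Id
    @GeneratedValue
    private long pk;          // surrogate key used by the database

    @Column(name = "QNAME")
    private String id;        // the PROV identifier, stored in string form

    @Column(name = "VALUE")
    private String value;     // prov:value, if any

    public long getPk() { return pk; }
    public String getId() { return id; }
    public String getValue() { return value; }
}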

Where next?

I am reasonably satisfied with the definitions in prov-model. There is still one significant change that I would like to introduce, which is likely to break, once again, applications using ProvToolbox. The JAXB automatic bean generation introduced IDRef, a Java class whose sole purpose was to serialize the attribute prov:ref="e1" in the example below.

<prov:wasGeneratedBy prov:id="gen1">
 <prov:entity prov:ref="e1"/>
 <prov:activity prov:ref="a1"/>
</prov:wasGeneratedBy>

The following Java methods were defined accordingly.

 IDRef getEntity()
 IDRef getActivity()

Instead, I propose to specify beans with the following methods, so that programmers no longer have to manipulate IDRefs, which are not part of the PROV data model and are only introduced for the purpose of serialization.

 QName getEntity()
 QName getActivity()

So far, ProvToolbox has used javax.xml.namespace.QName, but this class is supposed to represent XML QNames. XML QNames come with strong syntactic restrictions (though the Java class does not enforce them), and these have been relaxed in PROV Qualified Names. Therefore, I will introduce a Qualified Name class for ProvToolbox, supporting the syntactic definitions set by PROV, but also allowing easy conversion to Turtle and XML QNames; further, it will also offer functions to convert to and from the corresponding URIs.
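
To illustrate the intent, here is a hypothetical sketch of such a class; it is not the code that will be released.

import java.net.URI;
import javax.xml.namespace.QName;

// Hypothetical sketch, to illustrate the intent of the proposed class.
public class QualifiedName {
    private final String namespaceURI;
    private final String localPart;
    private final String prefix;

    public QualifiedName(String namespaceURI, String localPart, String prefix) {
        this.namespaceURI = namespaceURI;
        this.localPart = localPart;
        this.prefix = prefix;
    }

    /** The URI denoted by this qualified name. */
    public URI toURI() {
        return URI.create(namespaceURI + localPart);
    }

    /** Prefixed-name form for Turtle; a real implementation would apply the full set of local-name escapes. */
    public String toTurtle() {
        return prefix + ":" + localPart;
    }

    /** Conversion to an XML QName, valid only when the local part satisfies the NCName restrictions. */
    public QName toQName() {
        return new QName(namespaceURI, localPart, prefix);
    }
}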

With this in place, I hope that the bean interfaces will be frozen until release 1.0.0. The focus will then be on finalizing and refactoring the various serializations.

Finally, I have already started documenting ProvToolbox, but much more is required!

Useful Pointers

GitHub repository: https://github.com/lucmoreau/ProvToolbox/

Javadoc: http://openprovenance.org/java/site/0_4_0/apidocs/

Maven repository: http://openprovenance.org/java/maven-releases/

A Little Provenance Goes a Long Way

PROV is a rich vocabulary that was designed to tackle a variety of use cases.   The Provenance Working Group worked really hard to design PROV to facilitate its adoption.  In our book, Paul and I provide many recipes to design, deploy, and use provenance in the context of a complex data journalism scenario.

However, we argue that identifying a resource, exposing its authors with attribution, and expressing what it is derived from already goes a long way towards a provenance-enabled Web.

Echoing Jim Hendler‘s quote  A Little Semantics Goes a Long Way, Paul and I conclude the ProvBook with a quote of our own:

A little provenance goes a long way

How could we "eat our own dog food" and express the provenance of this quote?

Simple, with the following Turtle snippet:

@prefix prov: <http://www.w3.org/ns/prov#>.
@prefix provbook: <http://www.provbook.org/>.

provbook:a-little-provenance-goes-a-long-way a prov:Entity;
  prov:value "A little provenance goes a long way";
  prov:wasAttributedTo provbook:Paul ;
  prov:wasAttributedTo provbook:Luc ;
  prov:wasDerivedFrom <http://www.cs.rpi.edu/~hendler/LittleSemanticsWeb.html>.

We have identified the quote, with the URL http://www.provbook.org/a-little-provenance-goes-a-long-way. For convenience, we provided a copy of the quote itself (using the property prov:value). We identified Paul and myself as the authors. And finally, we gave credit to Jim, by indicating that our quote was inspired by his: this notion is called derivation, and is expressed with the property prov:wasDerivedFrom.

All these statements can be represented graphically. Yellow ellipses represent entities whereas orange pentagons represent agents in PROV; agents here are the authors of the quote.

[Figure: graphical representation of the quote's provenance]

Apply this motto in your own context, and publish simple provenance statements about your resources. Really, a little provenance goes a long way …

Cross-posted from http://blog.provbook.org/2013/10/11/a-little-provenance-goes-a-long-way/

Food Supply Chain and Provenance

On 4 June the Secretary of State announced that Professor Chris Elliott, Director of the Global Institute for Food Security at Queen’s University Belfast, was to lead an independent review into the integrity and assurance of food supply networks.

The aim of the review will be to “advise the Secretary of State for the Environment, Food and Rural Affairs and the Secretary of State for Health and also industry on issues which impact upon consumer confidence in the authenticity of food products, including any systemic failures in food supply networks and systems of oversight with implications for food safety and public health; and to make recommendations”.

The public, and all those involved in food supply (and the way that food supply is regulated) are invited to give us their views. Through this call for evidence, we are keen to hear about issues including those which affect consumer confidence.

https://www.gov.uk/government/consultations/food-supply-chain-review

Thank you for this opportunity to provide input into this review of the food supply chain. As an individual and a parent, I am concerned about the quality of food, the quality of ingredients, their origin, and the production processes. Because of allergies, and in general for health reasons, the correct, timely, and accurate labelling of food products is critically important. Furthermore, the locations where produce is grown, CO2 footprint, and ethical considerations are also matters of increasing interest. Price is also on our minds during the weekly shopping, and the last thing we would want is a regulatory burden that would either reduce the variety of foods available to us, or make food unaffordable.

I am writing this contribution in my capacity as Professor of Computer Science at the University of Southampton, and co-chair of the recent Provenance Working Group at the World Wide Web Consortium. As a computer scientist, I believe that the solution to this challenging problem has to rely on technology, and such a technology is now readily available in standardized form. On the Web, we use the term provenance to refer to a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing. Such a concept is equally applicable to food.

The problem of food traceability is analogous to the problem of the provenance of data on the Web: multiple organisations are typically involved in the creation of food or Web data; regulatory authorities and governments have to audit the supply chain of food and information on the Web; third parties (such as consumer groups or review sites) can comment on and analyse food and information items; and consumers wish to access information about them, rate it, and share it through social media.

The solution to the food supply chain problem includes making the provenance of food explicitly available, in a computer-processable format, over the Web infrastructure, including all details of ingredients, production, and delivery, so that it can be contributed to, and inspected by, all stakeholders: farmers, industry, regulators, consumer groups, and end consumers. Ubiquitous availability of provenance will allow novel, useful services to be developed, providing detailed analyses, reviews, or recommendations. All this processable evidence will also help identify suspicious steps in the food production chain, or gaps in the recorded information, which can then trigger further inspection. Ultimately, such a provenance-based online environment will promote transparency and accountability, and will allow consumer confidence to be restored.

I expand on these thoughts below, by answering some of the questions, and would be delighted to discuss them in person with the team undertaking the review.

1. What measures need to be taken by the UK food industry and government to increase consumers' trust in the integrity of the food supply systems?

Full provenance of food, covering the organizations, people, food products, locations, processes, and transportation involved, needs to be systematically made available online. Provenance must be in a computer-processable, standardized, and open format so that it can be browsed, mined, analyzed, and audited by all stakeholders of the food industry, thereby offering transparency for the whole industry. It is also crucial to recognize that no single authoritative source of provenance may exist for a given food product; instead, independent parties (e.g. regulatory authorities, organizations in the supply chain, consumers, consumer groups) should be able to contribute to such provenance evidence.

With this information infrastructure in place, scanning the barcode of a product on a supermarket shelf, or clicking a button, should give the consumer real-time, up-to-date information about its origin, quality and constituents.

Consumers may not be interested in all the tiny details of a product's provenance. We anticipate that novel services will emerge that describe products according to users' preferences; for example, allergens may be far more important to some than the locality of products. Consumers will decide which "food recommender service" they trust.

The regulatory framework should identify (and to some extent already does) the minimum provenance expected to be found.  Openness, transparency, data journalists, and the many eyes of the crowd will help identify missing or inconsistent provenance.

3. How can government, food businesses and regulators better identify new and emerging forms of food fraud?

Online provenance can become an invaluable source of information for detecting inconsistent food labels, suspicious patterns of processing and transportation, lack of inspection, and so on. Furthermore, given that every piece of provenance evidence is a claim that should be attributable to some organization or individual, it allows responsibility to be assigned. This is particularly useful for identifying who is responsible for a fraudulent situation.

In this context, provenance evidence becomes the foundation for establishing trust in food products. Provenance evidence itself should be non-forgeable. Mechanisms such as digital signatures, combined with provenance of provenance, allow provenance evidence to be attributed unambiguously.

5. Do consumers fully understand the way industry describes the composition and quality of the products on sale?

The provenance evidence advocated in this document can and should be as technical as required for the purposes of auditing, compliance checking, and so on. Since it should be extensive and detailed, it could not possibly be printed on a product's packaging, but it should be available online.

The information made available to consumers (including that on packaging) can be seen as a summary of the full provenance of a product, presented to consumers in a friendly and practical manner. Such consumer-targeted information should also be regarded as provenance evidence, likewise available in machine-processable format, which can be validated against the detailed provenance evidence.

11. How can large corporations relying on complex supply chains improve both information and evidence as to the traceability of food?

By leveraging a standard vocabulary for provenance (and a food-industry-specific terminology for all entities, agents, and activities), large corporations, their suppliers and consumers, and their auditors can create a Web of provenance evidence for food, which can be navigated by all stakeholders of the food industry.

12. Should there be legislative requirements for tamper proof labelling, and/or to advise competent authorities of mislabelling if it is discovered in the supply chain?

A cryptographic signature is a mechanism by which a label (or, more generally, any digital document) can be "signed" by a party, demonstrating the authenticity of the label (or document). A valid signature identifies the signer (authentication), in such a way that the signer cannot deny having signed the message (non-repudiation) and that the message has not been tampered with in transit (integrity). Cryptographic signatures are readily available and, together with other cryptographic techniques, are extensively used in e-commerce.

It is the expectation that all provenance evidence in this context, including labelling, would be cryptographically signed.

Some reporting facility should be made available, by which inconsistencies in the provenance of food can be flagged. Any such logged report also constitutes evidence, and its veracity should be established by the relevant organisation.

13. What additional information does the public need to be offered about food content and processing techniques? How can this information be conveyed in an easy-to-understand manner?

It is difficult to enumerate today all the information that could be useful to the public. Beyond the traditional allergens and ingredients, CO2 footprint, ethical considerations, and organic certifications may be of interest.

The regulatory framework should not restrict the information made available to the public. Instead, it should enable transparency and facilitate any relevant information being made available. This will empower organisations to develop services that leverage that information. As indicated in our response to question 5, presentation services can take care of conveying information in an easy-to-understand manner. By opening up the provenance of food products, an ecosystem of original solutions will emerge, as it has in many aspects of the Web.

Further points

We are not aiming to build another massive, centralised IT system, which would likely be bound to fail given the complexity of the supply chain. Lightweight, agile, Web-oriented techniques have proven very successful for this kind of application.

In this document, we are agnostic as to how data should be made available, whether open or restricted, free or paid for. It is likely that some aspects will have to be open (inspection reports, reviews, origins of produce), whereas others may remain confidential (e.g. which recipe was used for a given product). We see the provenance of food as an enabling platform for a variety of services to be developed, and potentially for competitive advantage to be built upon.