Reflection on the proof of concept

Part 8/8: A blog post series by Dong Huynh, Sophie Stalla-Bourdillon and Luc Moreau


For this EPSRC impact acceleration project conducted over a period of three months, we have implemented the Loan Decision scenario, instrumented the pipeline so that it produces provenance, categorised explanations according to their audience and their purpose, built an explanation-generation prototype, and wrapped the whole system in an online demonstrator.

This work aimed to demonstrate that provenance, defined as a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a decision, is a solid foundation for generating explanations.

We are delighted to release a report summarising all this work.

Some lessons can be drawn from this piece of work.


1. General considerations


Given the short duration of the project, there are inevitably some limitations. First, we discuss some general considerations.

  • First, we designed this prototype for one application scenario, for one machine learning pipeline, for one specific regulatory framework (GDPR), and for a subset of requirements from this framework. It is our intent to generalise the approach to other scenarios, regulations and requirements.
  • Second, the approach is predicated on finding specific mark-ups in the provenance, to be able to construct the relevant explanations. Besides the above generalisation, there is also a clear need to document such mark-ups, so that data controllers can adapt their systems to produce suitably annotated provenance. Data controllers need to understand that a failure to generate provenance with the right mark-ups will result in the system’s inability to construct some explanations.
  • Third, adequate tools need to be provided to assist data controllers in producing the right provenance information, and in checking that it addresses the data protection (or other) requirements they are under an obligation to meet.
  • Fourth, explanations can and should be refined to fully meet their purposes. Extensive requirement capturing and user studies will help validate these.
  • Fifth, it is our belief that explanations could be viewed as more than just one paragraph communicated to the data subject in a single request-response interaction. We envisage explanations potentially as part of a dialogue between the system and its targeted recipients. A mechanism to design such an explanation service would, therefore, be required.
  • Finally, some aspects of the decision-making pipeline are currently not explained. This is particularly the case for the machine learning algorithm itself, which remains a black box: the algorithm was used to create a model, and the model was used to classify some input data. Both the model creation and classification are modelled as activities in the provenance. If some libraries are able to generate further provenance, this, in turn, can be turned into explanations.


2.    Refining Explanations


  • We generate different explanations for automated and human decisions. Something to investigate is how meaningful the human involvement is. How much does the human add on top of the automated recommendation they receive? Can the meaningfulness be determined automatically? Which semantic mark-up in the provenance would help with this task?
  • We were able to demonstrate that some loan application characteristics (or elements of third-party data such as credit reference) were not used by the decision-making pipeline. This information, while certainly useful, is looking at “syntactic usage”: some data may have been passed to the pipeline, but may or may not have been effectively used to reach the decision. In other words, the data may or may not have had an influence on the final decision. However, such information can only be surfaced if we gain a better understanding of the black box.
  • Counter-factual explanations. We have demonstrated that it is possible to construct simple counter-factual explanations out of provenance. By simply considering alternate loan applications in a counter-factual world (e.g. loan for a different purpose, for a different amount, for a data subject with different profile), and applying the pipeline, we obtain counter-factual decisions. By marking the original loan application and associated decision, as well as alternate applications, we were able to construct an example of counter-factual explanation. This approach needs to be generalised and the nature of explanations that can be supported needs to be further studied.
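As a sketch of this construction, one can stub the pipeline and compare decisions across alternate applications. The `predict` rule below is a toy stand-in, not the project's actual model, and the field names are illustrative:

```python
# Sketch of the counter-factual construction described above. `predict`
# is a hypothetical stand-in for the trained pipeline: a toy rule that
# flags large loans as risky.
def predict(application):
    return 0.6 if application["amount"] > 10000 else 0.1

def decision(application):
    return "rejected" if predict(application) > 0.5 else "approved"

original = {"amount": 15000, "purpose": "car"}
alternates = [{"amount": 8000, "purpose": "car"},
              {"amount": 15000, "purpose": "home"}]

# Keep only the alternate applications whose decision differs from the
# original's; each one is a counter-factual world worth explaining.
counterfactuals = [(alt, decision(alt)) for alt in alternates
                   if decision(alt) != decision(original)]
```

Each surviving pair supports an explanation of the form "had the application been for this amount instead, the decision would have been different".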


3.  Where next?


This limited proof-of-concept exercise is only the start of a journey. With the new EPSRC-funded project PLEAD: Provenance-driven and Legally-grounded Explanations for Automated Decisions, we are about to embark on novel research to address some of the above concerns. More posts on this topic will follow.

PLEAD Project:

Full report for this project:




Explanations for Automated Decisions from Provenance – A Demonstrator

Part 7/8: A blog post series by Dong Huynh, Sophie Stalla-Bourdillon and Luc Moreau


In the previous post, we presented our approach to generating explanations from provenance and provided an illustration for one type of explanation. Following the same approach, we have implemented an explanation for most of the categories we identified in the loan decision scenario. The implementation is deployed online to demonstrate the explanations we generate from the provenance of a loan decision.




1. Simulate a Loan Application


The simulator supports the loan decision pipeline scenario, in which you can play the role of a borrower applying for a loan.



First, you can simulate a new loan application: filling in an application form whose data will be randomly picked from our dataset.




At this point, you can submit the application to get a loan decision.




The provenance of a decision is recorded, detailing the processes taking place to arrive at the decision. It can be accessed via the Provenance button.


2.  Checking out the explanations


Below the loan decision, you will find a list of questions that an applicant may ask about various aspects of the decision.




For instance, they may ask whether the decision was solely automated. The Automation tab will provide the answer, which is generated from the provenance data of the decision. Below each explanation, we provide the legal and business contexts that call for it.




Do try out the demonstrator and let us know what you think. At the bottom of each page, there is a yellow question mark; clicking on it will allow you to send us quick feedback or suggestions on the page you are viewing.


3.  Coming next


This is the seventh in our 8-part blog series on explanations for automated decisions from provenance. In the last post, we will summarise our short journey exploring this novel and exciting application of provenance and discuss where we want to go next.



Constructing explanations from provenance

Part 6/8: A blog post series by Dong Huynh, Sophie Stalla-Bourdillon and Luc Moreau

In this sixth blog post on provenance-based explanations, we look at the mechanism we use to construct explanations from provenance. First, we summarise the overall approach, and then, we look at the details.

1. Overview


Figure 1 illustrates our approach to generating explanations. We will discuss it, point by point.


Figure 1: Provenance Based Explanation Generation


In the introductory blog post, we provided the definition of provenance as “a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing in the world”. Our focus here is the provenance of an AI-based decision (see 1, Figure 1).

It is our view that such a provenance record is an excellent starting point from which to construct an explanation specific to a decision, helping a data subject understand it and take action in response to it, appropriate for the context (see 2). Thus, we are assuming that the AI-based application has been instrumented in order to log provenance (see 3). To this end, we have been developing various techniques by which code can be instrumented; some libraries have been instrumented, and provenance can even be reconstructed from logged data.

In this project, we instrumented a loan decision pipeline to record the full provenance of a loan decision in the scenario.

The result is a provenance graph (see 4) providing full details of the inputs and processing leading to a decision, including the data items, processes, and agents that have influenced it. In long-running applications, and in applications processing a vast amount of data, this provenance can potentially become very large.

This full record in itself is not conducive to constructing an explanation directly, since it may contain too many details that a data subject may find irrelevant, tedious, or overwhelming. Instead, the provenance needs to be processed (see 5) to produce relevant information nuggets in support of a specific explanation’s purpose: this may involve summarisation and analytics, to extract the essence of the provenance. The result is what we refer to as a provenance summary (see 6).

The provenance summary is the input to a generation component (see 7), converting relevant summary information into an explanation, which in our case would be textual. Alternatively, or additionally, the same extracted information could be represented in a graphical way to further aid its consumption.  The outcome is an explanation which could be targeted to the data subject (and would typically refer to the data subject and their data, using “you” or “your application”) or to the data controller (and then would typically use “the borrower” or “the borrower’s application” instead).

2.  Illustration


In this section, we illustrate the approach with the explanation Time Relevance from the loan scenario.

A question a data subject may want to ask is “How timely relevant is the data used for assessing my loan?” so they can check if the latest data are being used (since they may believe that their circumstances have recently changed in their favour).

This question can be addressed by extracting all data items that have affected a decision. In the case of a loan decision, these consist of three items: the loan application made by the borrower, and two credit references obtained from two distinct external credit referencing agencies. It is these that are of interest to the data subject, as they want to ensure that they are the most recent.

A suitable query over the provenance of the decision can be designed and executed to extract the following subgraph, out of which the relevant information can be found to construct the explanation. In the graph shown below, at the bottom, we find the decision (yellow ellipse), and three influencing entities from which it was derived: the loan application, a fico score and a credit history. The latter two are provided by external agencies (fico) and (credit_agency), represented as orange pentagons.
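As a hypothetical sketch of such a query, the provenance graph can be reduced to a map of derivation edges and generation times, from which the influencing entities are collected. The function and data structures below are illustrative only; the identifiers and timestamps mirror the example in this post:

```python
# Sketch of the "Time Relevance" query: collect the entities from which
# the decision was directly derived, with their generation timestamps.
derived_from = {
    "loan:decision/29": ["loan:applications/29",
                         "loan:fico_score/29",
                         "loan:credit_history/29"],
}
generated_at = {
    "loan:fico_score/29": "2019-06-15T20:46:39.921182",
    "loan:credit_history/29": "2019-06-20T12:43:31.114156",
}

def influencing_entities(decision):
    """Entities from which the decision was directly derived, paired with
    their recorded generation timestamps (None if not recorded)."""
    return [(entity, generated_at.get(entity))
            for entity in derived_from.get(decision, [])]
```

Running `influencing_entities("loan:decision/29")` yields the three data items of interest: the application and the two externally sourced credit references.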


Figure 2: Query Result for “Time Relevance” explanation


Out of the extracted provenance subgraph, one can then construct the following explanation.  It explicitly lists the external credit referencing agencies, the credit references they provided, and the time at which such credit references were obtained.

The external data sources were the borrower FICO score (fico_score/29) provided by the credit referencing agency (fico) at <2019-06-15T20:46:39.921182> and the borrower credit reference (credit_history/29) provided by the credit referencing agency (credit_agency) at <2019-06-20T12:43:31.114156>.
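In a simple form, the generation component (see 7 above) can be a template instantiated with the extracted summary. A minimal sketch, assuming the summary has already been reduced to a flat dictionary (the field names are illustrative, not the prototype's actual schema):

```python
# Sketch of the generation step: turning a provenance summary into the
# textual explanation shown above via a string template.
TEMPLATE = (
    "The external data sources were the borrower FICO score ({fico_id}) "
    "provided by the credit referencing agency (fico) at <{fico_time}> and "
    "the borrower credit reference ({history_id}) provided by the credit "
    "referencing agency (credit_agency) at <{history_time}>."
)

summary = {
    "fico_id": "fico_score/29",
    "fico_time": "2019-06-15T20:46:39.921182",
    "history_id": "credit_history/29",
    "history_time": "2019-06-20T12:43:31.114156",
}

explanation = TEMPLATE.format(**summary)
```

A template per explanation category keeps the wording under the data controller's control while the variable parts come from the provenance.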

The explanation can then be further enriched by providing contact details for these external agencies. Following our approach of “Explanations as Detective Controls”, the explanation allows a data subject to detect whether external information is timely; the explanation is actionable, as data subjects can use the contact details to approach these agencies, query the credit reference, and potentially have it fixed.

A standard such as PROV allows multiple organisations to share provenance information in an interoperable manner. In this simulated example, the provenance is created by the loan company, based on the processes it follows. If the loan company and credit referencing agencies inter-operate properly, we can envisage that they exchange not only credit references but also (some of) the provenance of these credit references. Credit reference agencies themselves rely on external organisations to compile these references. All that information can be transformed into actionable information for the data subject.

3.  Coming Next


In the next blog post, we present the demonstrator for the loan application and its associated explanations, computed on the fly using the provenance it generates.

Tracking Provenance in a Decision Pipeline

Part 5/8: A blog post series by Dong Huynh, Sophie Stalla-Bourdillon and Luc Moreau

In the first post of this series, we argued that the provenance of automated decisions is a valuable source of data from which explanations about those decisions can be generated. In order to demonstrate that capability, we first need to record the provenance of such decisions in our hypothetical loan scenario.

As the loan decision pipeline was implemented in Python, we use the PROV Python package to make provenance assertions that are compliant with the PROV recommendations by the World Wide Web Consortium. In brief, the PROV Data Model defines three basic concepts, Entity, Activity and Agent (see below), and various relations between them. For instance, an entity can be used by an activity to generate some new entity; the activity itself may be influenced in some ways by agents.

  • Entity: a thing, either physical, digital or conceptual, whose provenance we want to describe. Examples: a piece of information, a decision, a document, a plan, a dataset, a trained machine learning model.
  • Activity: something that occurs over a period of time and acts upon or with entities. Examples: actions such as planning, monitoring, approving, training, classifying.
  • Agent: something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent’s activity. Examples: a person, machine, service, system, organisation, collective.
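To give a flavour of how such assertions look in code, here is a minimal plain-Python stand-in for the three record types and a few of their relations. The real PROV Python package provides this vocabulary (entity, activity, agent, used, wasGeneratedBy, actedOnBehalfOf, ...) on its ProvDocument class; this sketch only mimics the shape of the assertions, reusing identifiers that appear later in this post:

```python
from dataclasses import dataclass, field

# A minimal, illustrative stand-in for the PROV data model: Entity,
# Activity and Agent records plus a few relations between them.
@dataclass
class ProvGraph:
    entities: set = field(default_factory=set)
    activities: set = field(default_factory=set)
    agents: set = field(default_factory=set)
    relations: list = field(default_factory=list)  # (kind, subject, object)

    def entity(self, eid):
        self.entities.add(eid)

    def activity(self, aid):
        self.activities.add(aid)

    def agent(self, gid):
        self.agents.add(gid)

    def used(self, activity, entity):
        self.relations.append(("used", activity, entity))

    def was_generated_by(self, entity, activity):
        self.relations.append(("wasGeneratedBy", entity, activity))

    def acted_on_behalf_of(self, delegate, responsible):
        self.relations.append(("actedOnBehalfOf", delegate, responsible))

# The pattern described above: an activity uses an entity to generate a
# new entity, and the acting agent works on behalf of another agent.
g = ProvGraph()
g.entity("loan:applications/35")
g.activity("ex:classify_loans/35")
g.agent("ex:machine/8e7425f366a0")
g.agent("loan:institution")
g.used("ex:classify_loans/35", "loan:applications/35")
g.entity("ex:recommendation/35")
g.was_generated_by("ex:recommendation/35", "ex:classify_loans/35")
g.acted_on_behalf_of("ex:machine/8e7425f366a0", "loan:institution")
```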

In the following sections, we present the recorded provenance of a loan decision step-by-step, from the loan application made by a borrower to its classification by the machine learning pipeline and the final decision, either automated or made by a loan officer.

1   Provenance of Input Data

The first input to the pipeline is the loan application (loan:applications/35) made by an applicant (see entity loan:applicants/35 in the Figure below). The application entity contains the data provided by the applicant in the application form (shown in the white box linked to the entity by a dotted line). In addition, the Loan Company obtains the applicant’s credit history (loan:credit_history/35) and credit score (loan:fico_score/35) from third-party organisations (loan:credit_agency and loan:fico, respectively). Each of the input entities is attributed to the corresponding agent via an attribution relation.


2   Classification of a Loan Application

The input data are then transformed into a set of loan features, modelled as an entity (py:loan_features/35), that is suitable for processing by the machine learning pipeline. The provenance of the loan features entity is explicitly asserted by the derivation relations linking it to the input entities from which it was produced.


A computer (ex:machine/8e7425f366a0) classifies the loan features using the pipeline (loan:pipeline/1) and produces a recommendation (ex:recommendation/35) on the loan application. It does so, however, on behalf of the Loan Company (loan:institution). The process of classification is modelled as an activity (ex:classify_loans/35), which has a start time and an end time; it uses the loan features as inputs and generates the recommendation as the output.


3 Making a Loan Decision

Let us consider the case in which the pipeline predicts that the probability of charge-off is very low (3.5%) and, hence, an automated approval decision is generated directly from the recommendation from the pipeline. The decision is attributed to the computer running the pipeline; however, the chain of responsibility is made clear: the computer produces the decision on behalf of the Loan Company.


In another case, where the probability of charge-off is borderline (25.4%), the automated recommendation is escalated to a review carried out by a loan officer (loan:staff/112). The loan decision is now attributed not to the computer but to the officer, who also acts on behalf of the Loan Company. Compared to the previous, automated case, the provenance in this case shows that the review activity takes into account the loan application, the credit history, and the credit score of the applicant in addition to the automated recommendation produced by the pipeline.


4 Discussion

For the sake of brevity, we present the provenance of a decision here in small, digestible snippets. The full provenance of a decision is recorded as a single directed graph allowing one to trace from the provenance back to the input data and to identify the responsibility for each of the activities found along the way.

Each of the entities is categorised by type with one or more prov:type attributes. Most types are application-specific, such as ln:LoanApplication, ln:FICOScore, and ln:CreditOfficer. In addition, we tag certain entities with types that will be useful for identifying relevant data in support of explanation generation: pl:Controlled, pl:HumanLedActivity, prov:SoftwareAgent, prov:Person and so on. In the next post, we discuss the technical approach to generate explanations from the recorded provenance.
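As a sketch of how such type tags support explanation generation, consider a hypothetical map from provenance elements to their prov:type attributes (the element names are illustrative, though the tags mirror those listed above):

```python
# Sketch: selecting provenance elements by their prov:type tags, as an
# explanation generator would to find the records it needs.
types = {
    "loan:applications/35": {"ln:LoanApplication", "pl:Controlled"},
    "loan:fico_score/35": {"ln:FICOScore", "pl:Controlled"},
    "loan:review/112": {"pl:HumanLedActivity"},
    "ex:machine/8e7425f366a0": {"prov:SoftwareAgent"},
}

def with_type(tag):
    """All provenance elements carrying the given prov:type tag."""
    return {elem for elem, tags in types.items() if tag in tags}
```

For example, `with_type("pl:HumanLedActivity")` surfaces the human review step that the Automation explanation needs to report on.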

Explanations for Automated Decisions – Examples

Part 4/8: A blog post series by Dong Huynh, Sophie Stalla-Bourdillon and Luc Moreau

In the previous blog post, we introduced the loan scenario, in which a borrower submits a loan application and receives a decision on the application from an automated decision pipeline. In a workshop with colleagues at the Information Commissioner’s Office, we brainstormed on various questions that a data subject may have about such a decision. We paid particular attention to the rights granted to data subjects by the GDPR in such circumstances and, hence, to questions whose answers would help the subjects exercise those rights. In addition, we also discussed questions and answers, or explanations, on aspects of the loan decision pipeline that would help data controllers meet their obligations or demonstrate compliance with them. These explanations would, thus, be particularly useful to establish data protection by design and by default.

The types of explanations thereby generated can be classified into two categories depending upon whose concerns they address, as illustrated below.

  • Individual Concerns: Automation, Data Inclusion, Data Exclusion, Data Source, Data Accuracy, Data Currency, Profile-related Fairness, Discrimination-related Fairness
  • Organisational Concerns: Performance, Responsibility, Process, Systemic Discrimination or Bias, Ongoing Monitoring

In this post, we expand upon two of the foregoing explanations. In each of them, we identify the target audience of the explanation, the corresponding questions they may have, the rationale for an organisation to provide such an explanation, and example explanations in the context of the loan decision scenario. The descriptions of all the categories will be available in our final report to be made available with the last post in this series.

1       Automation

Audience: Data subjects
Questions: Has the loan decision been reached solely via automated means?
Description: Whether a decision was made solely by automated means without any meaningful human involvement.
Rationale: This explanation helps determine whether GDPR Article 22 is applicable and thereby whether the prohibition applies: “The data subject shall have the right not to be subject to a decision based solely on automated processing…” It is therefore relevant for demonstrating compliance with Article 5(1)(a) (fairness principle) and Article 5(2) (principle of accountability).

This explanation should also help to determine when the best practice set out in Recital 71 is met, e.g. whether both child data and solely automated means have been used.

This explanation could also help determine whether the information provided to the data subject as per Articles 13, 14 and 15 is adequate.

Examples: The automated recommendation was reviewed by a credit officer (staff/112) whose decision was based on your application (applications/34), the automated recommendation (recommendation/34) itself, a credit reference (credit_history/34) and a fico score (fico_score/34).
The loan application was automatically approved based on a combination of the borrower loan application and third-party data: the borrower credit reference and the borrower FICO score.

2       Responsibility

Audience: Data controller, Regulator
Questions: Who was responsible for the final decision pipeline?

Who set the threshold value for automated decisions?

Who decided how the data was selected?

Who approved the pipeline for deployment?

Description: As part of their own governance and to support accountability, organisations must keep track of who did what and when in their internal processes.
Rationale: This explanation would help determine whether the data is processed fairly and transparently (Article 5(1)(a)) (although this would not lead to a complete fairness assessment). Ultimately, its implementation would be useful for accountability purposes.
Examples: Responsibilities for the AI pipeline were that data engineer (staff/259) selected file (loans_filtered.xz), that data engineer (staff/259) split file (loans_train.xz), that manager (staff/37) approved the company pipeline (pipeline/1) and that data engineer (staff/259) fit data for the company pipeline (1558649326/5011959424).

3       Coming next

This is the fourth post in a series of blog posts on Explanations for AI-based decisions. In the next post, we describe how the provenance of a loan decision is modelled and recorded to support the generation of explanations for such a decision. Future work will then involve determining the level of usefulness and meaningfulness reached by these explanations.

A Scenario of Automated Decision-Making

Part 3/8: A blog post series by Dong Huynh, Sophie Stalla-Bourdillon and Luc Moreau

Credit applications nowadays are typically assessed by automated systems and often approved or rejected within seconds, without human intervention. In this project, we imagine a loan scenario that allows us to simulate such an automated decision pipeline with the aim of exploring potential questions one may ask about its decisions.

1. The Loan Assessment Scenario


Loan Company is a credit institution that offers short-term unsecured loans to borrowers. In order to minimise loss from charge-off, i.e. when a loan is unlikely to be repaid by the borrower, the institution developed a machine-learning pipeline that predicts the probability of a charge-off from a loan application. Based on this probability, a recommendation is made on whether the application should be approved or rejected.

The pipeline was trained and tested on the company’s past loan performance data and was shown to perform reasonably well. It was approved for deployment to assess all incoming loan applications and is enabled to make automatic decisions in clear-cut cases without the attention of a loan officer:

  • If the probability of charge-off is higher than 50%, the loan application is automatically rejected.
  • If the probability is less than 25%, it is automatically approved.
  • A loan officer has to examine the remaining cases (i.e. where the probability is between 25% and 50%) and make the final decision.
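The triage rule above can be sketched as a simple function; this is a sketch of the scenario's thresholds, not the project's actual code:

```python
def decide(charge_off_probability: float) -> str:
    """Triage a loan application by its predicted probability of
    charge-off, following the three thresholds described above."""
    if charge_off_probability > 0.50:
        return "rejected"    # automatic rejection
    if charge_off_probability < 0.25:
        return "approved"    # automatic approval
    return "escalated"       # referred to a loan officer for review

# The two cases discussed later in the series: a low-risk application
# (3.5%) is approved automatically, a borderline one (25.4%) escalated.
decide(0.035)
decide(0.254)
```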


2. Building the Automated Decision Pipeline


To add some realism to the loan scenario, we use a real-world loan performance dataset originally published by LendingClub to build the decision pipeline. The dataset went through typical machine learning analysis, filtering, and transformations:

  • Data filtering and selection:
    1. Only loans that have finished, either as fully paid or charged off, are retained.
    2. Loan features (i.e. data fields) with significant missing data (i.e. over 30%) or that are not available before a loan is approved are removed.
  • Data preparation and transformation:
    1. Remove loan features that are clearly not useful as predictors for charge-off:
      1. Features whose values are all unique or have too many distinct values
      2. Features that are already included in another (duplication)
    2. Convert feature values to those suitable for machine learning
      1. Loan status (fully paid/charged off) to 0/1 labels
      2. Replace categorical features with dummy labels (0/1) for each of their categories
  • Split data into train and test sets according to the loan date: 90% and 10%
  • Create a machine learning pipeline with Scikit-learn to combine imputation and decision tree classification
  • Train the pipeline with the training dataset (90%)
  • Validate its accuracy with the test set (10%)

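The first filtering rule (dropping features with over 30% missing data) can be sketched in plain Python. This is an illustration of the rule only; the actual preparation presumably used pandas/Scikit-learn tooling, and the sample records below are invented:

```python
def drop_sparse_features(rows, threshold=0.30):
    """Remove features (columns) whose fraction of missing values (None)
    exceeds the threshold -- the 30% rule from the filtering step above.
    `rows` is a list of dicts sharing the same keys."""
    n = len(rows)
    keep = [key for key in rows[0]
            if sum(1 for row in rows if row[key] is None) / n <= threshold]
    return [{key: row[key] for key in keep} for row in rows]

# Toy records: "notes" is entirely missing and "purpose" is missing in
# one of three rows (~33%), so both fall foul of the 30% threshold.
rows = [
    {"amount": 1000, "purpose": "car",  "notes": None},
    {"amount": 2000, "purpose": None,   "notes": None},
    {"amount": 1500, "purpose": "home", "notes": None},
]
filtered = drop_sparse_features(rows)
```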


3. Some potential questions


In the loan scenario, in order to borrow money from Loan Company, borrowers would need to submit applications that contain their personal information such as their addresses, their income levels, whether they are homeowners, and so on. Loan Company does not just rely on those but, in a typical case, would also routinely obtain credit references on the borrowers to check their creditworthiness.

As a data subject whose personal data is being processed and profiled by Loan Company in a highly automated manner, a borrower has certain legal rights, e.g. the rights to be informed about the modalities of the processing and the logic involved, to access their data, to rectify inaccurate data and potentially (at least in instances in which the processing is purely automated) to request human intervention and to challenge a decision (we assume GDPR Recital 71 should inform the interpretation of Article 22). However, in order to be able to decide whether to exercise these rights, the borrower should be put in a situation in which they can obtain meaningful insights into the processing activities and, for instance, receive answers to the following questions:

  • Was the loan decision I received made by a human or an automated process?
  • What types of data were used to assess my loan application?
  • Where did those data come from?
  • Are they all accurate?

The explanations thereby provided can thus be conceived as detective controls or safeguards, giving data subjects the opportunity to check whether a remedy is applicable.

4. Coming next


In the following blog post, we identify the different types of explanations and argue why they are useful to both data subjects and data controllers.


AI-based Automated Decisions and Explanations: A Provenance Perspective

Part 1/8: A blog post series by Dong Huynh, Sophie Stalla-Bourdillon and Luc Moreau

1. Opportunities and Challenges with Automated Decisions


AI-based automated decisions are increasingly used as part of new services being deployed to the general public. This approach to building services presents significant potential benefits, such as increased speed of execution, increased accuracy, lower cost, and the ability to learn from a wide variety of situations. Of course, equally significant concerns have been raised and are now well documented, such as concerns about privacy, fairness, bias and ethics.

Several regulatory and legal frameworks have emerged across the world to address some of these concerns. Of interest to us in this blog post is the General Data Protection Regulation (GDPR), a framework that codifies some rights for data subjects (the users who have provided data in return for those services) and obligations on data controllers (the organisations that are providing these services).

A key challenge is that regulatory frameworks remain high-level and do not specify practical ways of becoming compliant: for instance, how to determine whether a decision is solely based on automated processing (Article 22 of the GDPR), how the ‘logic’ of the processing should be derived and expressed (Article 15 of the GDPR), what is actually required in terms of transparency/accountability obligations, or whether transparency necessarily leads to fit-for-purpose explanations (Article 12 of the GDPR).


2. Technology: Part of the Solution


Whilst the technology underpinning automated decision-making is the source of these concerns, we take the view that technology also has a place in helping to address them. We are not suggesting that the solution should be purely technological, but rather that technology must be part of the solution, in particular because compliance should also be performed speedily, accurately, and at low cost; otherwise, the benefits of technology in the first place will be greatly diminished.

As there is increased interest in tightened governance frameworks for automated decisions, including steps for generating explanations pertaining to decisions, our focus is on what we refer to as explainable computing, including not only explainable AI, but also explainable security, explainable workflows, and any form of computing activity requiring explanations.

3. Provenance-Based Explanations


Thus, “a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering” an automated decision is an incredibly valuable source of data from which to generate explanations. This is precisely the definition of W3C PROV provenance, a standardised form of knowledge graph providing an account of what a system performed. It includes references to people, data sets and organisations involved in decisions; attribution of data; and data derivation. It is suggested that provenance can assist with Information Accountability  and is presented as a fundamental principle in the US ACM statement on Algorithmic Transparency and Accountability.

To promote the suitability of provenance as a technological solution to aid in the construction of explanations, we designed and built a demonstrator that produces explanations related to a fictitious loan scenario. This demonstrator allows a user to impersonate an individual submitting a loan application, obtaining a decision, and being able to request explanations pertaining to the whole process. The demonstrator generates explanations on the fly.

This work is undertaken by a multi-disciplinary team, involving Sophie Stalla-Bourdillon providing a legal perspective, Dong Huynh and Luc Moreau for the computational and provenance aspects, and the Information Commissioner’s Office, the UK regulator for data protection.

4. A Series of Blog Posts


This blog post is the first of a series, which will progressively introduce the rationale, the computational and legal design for explanations, the underpinning technology, and the imaginary scenario of the demonstrator. Concretely, in the coming weeks, we will publish posts on:

  1. AI-based Automated Decisions and Explanations: A Provenance Perspective (This Post)
  2. Explanations as detective controls
  3. A Scenario of Automated Decision-Making
  4. Explanations for Automated Decisions – Examples
  5. Tracking Provenance in a Decision Pipeline
  6. Constructing explanations from provenance
  7. Explanations for Automated Decisions from Provenance – A Demonstrator
  8. Series conclusion


Research Associate

We are recruiting a post-doctoral researcher for the new project PLEAD: “Provenance-driven and Legally-grounded Explanations for Automated Decisions”, funded by EPSRC. We are looking for researchers who have a PhD in Computer Science, Artificial Intelligence, Machine Learning, Data Science, or a related area, an excellent publication record, and an appetite for interdisciplinary, collaborative and impactful work.
PLEAD brings together an interdisciplinary team of technologists, legal experts, commercial companies and public organisations to investigate how provenance can help explain the logic that underlies automated decision-making to the benefit of data subjects as well as help data controllers to demonstrate compliance with the law. Explanations that are provenance-driven and legally-grounded will allow data subjects to place their trust in automated decisions and will allow data controllers to ensure compliance with legal requirements placed on their organisations.
For more details about PLEAD, see
For the job advert, see
For any enquiry, please contact me.

Legacy sites at

I recently wrote a blog post about the relaunch of the site. Today, I am pleased to announce the availability of two websites providing a historical perspective on the work that took place in the provenance community.

The Provenance Challenge website is preserved in its original wiki look-and-feel, as it constituted a significant community effort that led to the PROV standardisation. At the time, the community decided that it needed to understand the different representations of provenance, their common aspects, and the reasons for their differences. The Provenance Challenge was born as a community activity aiming to understand and compare the various provenance solutions; three consecutive provenance challenges took place. A significant artifact that resulted from the Provenance Challenge series is the Open Provenance Model.

The Open Provenance Model (OPM) website is also preserved. OPM was the first community data model for provenance, designed as a conceptual data model for exchanging provenance information. It contained key concepts such as Artifact (called Entity in PROV), Process (called Activity in PROV), and Agent, and it introduced the notions of usage, generation and derivation of artifacts.
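To make the concept mapping concrete, here is a minimal, illustrative sketch of such a provenance graph in Python, using the PROV names for OPM's concepts. This is not the OPM or PROV libraries: the `ProvGraph` class, its method names, and the `ex:` identifiers are hypothetical, chosen only to illustrate the usage, generation and derivation relations described above.

```python
# Illustrative sketch: a tiny in-memory provenance graph using PROV
# terminology for OPM's concepts (Artifact -> Entity, Process -> Activity).
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Entity:          # OPM "Artifact"
    id: str

@dataclass(frozen=True)
class Activity:        # OPM "Process"
    id: str

@dataclass(frozen=True)
class Agent:
    id: str

@dataclass
class ProvGraph:
    relations: list = field(default_factory=list)

    def used(self, activity: Activity, entity: Entity):
        # An activity used an entity as input.
        self.relations.append(("used", activity.id, entity.id))

    def was_generated_by(self, entity: Entity, activity: Activity):
        # An entity was produced by an activity.
        self.relations.append(("wasGeneratedBy", entity.id, activity.id))

    def was_derived_from(self, derived: Entity, source: Entity):
        # An entity was derived from another entity.
        self.relations.append(("wasDerivedFrom", derived.id, source.id))

# Example: a loan decision derived from an application form.
g = ProvGraph()
application = Entity("ex:application")
decision = Entity("ex:decision")
scoring = Activity("ex:credit-scoring")
g.used(scoring, application)
g.was_generated_by(decision, scoring)
g.was_derived_from(decision, application)
print(g.relations)
```

In a real system these statements would be expressed with a PROV serialisation (such as PROV-N or PROV-JSON) rather than Python tuples, but the graph shape is the same.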

Of course, all this is now superseded by PROV, the W3C set of Recommendations and Notes for provenance. These legacy sites are made available to the community for reference. We aim to persist these pages and URLs in the future. Feel free to link to them!

Relaunched

It is my pleasure to announce the relaunch of the site for standard-based provenance solutions.

With our move to King’s College London, Dong and I have migrated the provenance services from Southampton to King’s. I am pleased to announce the launch of the following services:

  • ProvStore, the provenance repository that enables users to store, share, browse and manage provenance documents. ProvStore is available from
  • A translator capable of converting between different representations of PROV, including visual representations of PROV in SVG. The translator service can be found at
  • A validator service that checks provenance documents against the constraints defined in prov-constraints. Such a service can detect logically inconsistent provenance. An example of such inconsistency is when an activity is said to have started after it ended, or when something is being used before it was even created. The validator is hosted at
  • A template expansion service that facilitates a declarative approach to provenance generation, in which the shape of the provenance is defined by a provenance document containing variables acting as placeholders for values. When provided with a set of bindings associating variables with values, the service generates a concrete provenance document. The template expansion service lives at
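The declarative idea behind the template expansion service can be sketched in a few lines of Python. This is a simplified illustration, not the service's actual template language: the `expand` function, the `var:` placeholder convention, and the `ex:` identifiers are assumptions made for the example.

```python
# Illustrative sketch of template expansion: placeholders written as
# "var:name" are replaced by the values supplied in a bindings dictionary.
def expand(template: dict, bindings: dict) -> dict:
    def subst(value):
        if isinstance(value, str) and value.startswith("var:"):
            name = value[len("var:"):]
            return bindings[name]   # KeyError signals an unbound variable
        if isinstance(value, list):
            return [subst(v) for v in value]
        if isinstance(value, dict):
            # Substitute in both keys and values.
            return {subst(k): subst(v) for k, v in value.items()}
        return value

    return subst(template)

# A template describing the shape of the provenance: an activity that
# used one entity and generated another.
template = {
    "activity": "var:activity",
    "used": ["var:input"],
    "wasGeneratedBy": {"var:output": "var:activity"},
}

bindings = {
    "activity": "ex:credit-scoring",
    "input": "ex:application",
    "output": "ex:decision",
}

document = expand(template, bindings)
print(document)
```

The benefit of this approach is that the provenance shape is fixed and reviewable in one place, while the application only needs to supply bindings at run time.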

The Southampton services will be decommissioned shortly. If you have data in the old provenance store, we provide a procedure for you to download your provenance documents from the old store and to upload them to the new one. In the age of GDPR, you will have to sign up for the new provenance store and accept its terms and conditions.

While the services may look quite similar, under the bonnet there have been significant changes.

  • We have adopted a micro-service architecture for our services, allowing them to be composed in interesting ways. Services are deployed in Docker containers, facilitating their redeployment and enabling their configurability. We are also investigating other forms of licensing that would allow the services to be deployed elsewhere, allowing the host to have full control over access, storage and management. (Contact us if this is of interest to you.)
  • We have adopted Keycloak for identity management and access control for our existing and future micro-services. This offers an off-the-shelf solution for managing identities and obtaining consent. A single registration for all our services will now be possible.

As before, the above services are powered by open source libraries. ProvToolbox is a Java toolkit for processing provenance; likewise, Prov Python is a toolkit for processing provenance in Python.