Provenance and explainability of AI decisions: PhD opportunity

Are you interested in a PhD? I have a fully funded PhD scholarship, and I am seeking to supervise a student interested in provenance, explainability, and AI decisions. Contact me, and we can discuss a PhD topic. Below, I suggest examples of research directions: they are not meant to constrain or limit the research you would undertake, but are shared here as a starting point for a conversation.

First, what is provenance? Provenance is “a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering” a piece of data, a document, or an automated decision. This is precisely the definition of W3C PROV, a standardised form of knowledge graph providing an account of what a system performed. It includes references to the people, data sets and organisations involved in decisions; attribution of data; and data derivation. It captures not only how data is used and updated, but also how data flows through the system and the causal dependencies between those flows. Provenance is therefore an incredibly valuable source of data from which to generate explanations about decisions made by algorithmic systems. The US ACM statement on Algorithmic Transparency and Accountability suggested that provenance can assist with Information Accountability, a view we share.
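As a toy illustration (a minimal sketch with invented names, not the actual PROV serialisation), a provenance record can be viewed as a small knowledge graph of entities, activities and agents, linked by PROV relations such as used, wasGeneratedBy, wasDerivedFrom and wasAttributedTo, over which questions like “which data influenced this decision?” become graph traversals:

```python
# A toy provenance graph in plain Python (all names are illustrative,
# not taken from any real system). Edges are (relation, subject, object)
# triples using PROV relation names.
provenance = {
    "entities": ["raw_data", "clean_data", "decision"],
    "activities": ["cleaning", "deciding"],
    "agents": ["acme_corp"],
    "relations": [
        ("used", "cleaning", "raw_data"),
        ("wasGeneratedBy", "clean_data", "cleaning"),
        ("used", "deciding", "clean_data"),
        ("wasGeneratedBy", "decision", "deciding"),
        ("wasDerivedFrom", "decision", "raw_data"),
        ("wasAttributedTo", "decision", "acme_corp"),
    ],
}

def derived_from(graph, entity):
    """Return the entities an entity was (transitively) derived from."""
    direct = {obj for rel, subj, obj in graph["relations"]
              if rel == "wasDerivedFrom" and subj == entity}
    for d in list(direct):
        direct |= derived_from(graph, d)
    return direct

# Which data influenced the decision?
print(derived_from(provenance, "decision"))
```

A real system would use a PROV serialisation and a graph store rather than in-memory dictionaries, but the principle is the same: explanations are answers to queries over the recorded graph.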

So, the initial research question for a research project is: how can provenance be used to generate explanations about automated decisions that affect users? From there, multiple investigations are possible, depending on your personal interests. Here are a few possible starting points:

  1. Imagine a typical decision pipeline, in which a training dataset is selected (potentially according to corporate governance rules designed to avoid bias), the dataset is prepared, a model is trained with some machine learning algorithm, and the model is then deployed and applied to user data to make decisions or recommendations. How does the provenance of such a decision-making pipeline need to be marked up to assist with the creation of explanations? What constitutes an explanation? What is its purpose, i.e., what is it intended to explain to the user? How should it be structured? What NLG technique can be used to organise the explanation: for instance, can Rhetorical Structure Theory be applied in this context, to develop the structure of an explanation out of provenance? The work can involve algorithmic design and proof-of-concept building, but also user evaluation, in which users are presented with explanations and provide feedback on their suitability. Finally, an explanation could take multiple forms, from text to a multimedia presentation.
  2. When a system is instrumented to generate provenance, very large provenance data sets may be generated, consisting of 100 MB of data or more. I have developed a summarisation technique (see reading list) that can extract the essence of such large provenance data and generate a much more compact provenance graph, which we call a provenance summary. Provenance summaries could be a strong basis for generating explanations. However, some challenges need to be tackled for them to be useful. Summaries talk about categories of activities and entities, rather than individual instances. So how can this information be exploited to situate a decision made about a user in the context of decisions made about categories of users? Provenance graphs have a temporal semantics (as defined by the PROV-CONSTRAINTS recommendation); however, a temporal semantics for provenance summaries remains to be defined. Subsequently, it should be determined how it can be exploited to construct an explanation.
  3. Provenance is usually exploited in a relatively coarse-grained manner, in which whole algorithms or data transformations are described by a single semantic relation (a subtype of the derivation relation “was derived from”). As a result, in the above setting, whole pipelines may be documented with provenance, but individual algorithms remain black boxes. However, this does not have to be the case: algorithms (for which we have the source code) can also be instrumented, thereby exposing details of their execution. We have successfully manually instrumented a simple decision tree library. Can this be done for more complex algorithms? Is there a limit to what can be instrumented? How can the information be exploited to construct meaningful explanations of the behaviour of the algorithm? Can modern GPU processors also be used to construct and process very large provenance graphs?
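To make the first direction concrete, here is a deliberately naive sketch (all pipeline names are invented): the provenance of a decision pipeline, recorded as (activity, inputs, output) steps, is traversed backwards from an outcome and verbalised with a simple template. A serious treatment would use an NLG technique such as Rhetorical Structure Theory rather than string templates:

```python
# Hypothetical provenance of a decision pipeline: each record is
# (activity, inputs used, output generated).
steps = [
    ("selecting the training set", ["customer records"], "training set"),
    ("training the model",         ["training set"],     "credit model"),
    ("applying the model",         ["credit model", "your application"], "loan decision"),
]

def explain(outcome):
    """Walk generation links backwards from an outcome and verbalise each step."""
    sentences = []
    frontier = [outcome]
    while frontier:
        current = frontier.pop()
        for activity, inputs, output in steps:
            if output == current:
                sentences.append(
                    f"'{output}' was produced by {activity}, using "
                    + " and ".join(f"'{i}'" for i in inputs) + ".")
                frontier.extend(inputs)
    # Present the story in causal order, from first step to final decision.
    return " ".join(reversed(sentences))

print(explain("loan decision"))
```

Even this crude template exposes the research questions above: what should be included, how the narrative should be ordered and structured, and how users judge its legibility.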

Scholarship details

To be eligible for this scholarship, you will have to be a UK or an EU citizen. The scholarship includes registration fees (UK/EU fees) and a stipend for 3 years. There is also support for computing equipment and some travel funding to attend conferences.

Research Context

The successful applicant will join Prof Luc Moreau’s team at King’s College London, as part of the Cybersecurity Group. Two departmental hubs are related to this activity, namely the Trusted Autonomous Systems hub and the Security hub. The team is involved in three new projects at King’s: Provenance Analytics for Command and Control, funded by ONR-G; THuMP: Trust in Human-Machine Partnership, funded by EPSRC; and a third project funded by EPSRC, details to be announced.

A few pointers



The Photo of the Week #dontmesswithmydata

As a technologist, I have observed with strong interest the fallout of Carole Cadwalladr’s investigative journalism published by the Observer, the Guardian, Channel 4 and the New York Times. Presumption of innocence is important, and I do hope that the official investigation will make responsibilities and failures explicit.

However, out of this tumultuous week for the Web and Social Media, I find the following photo extremely powerful.



Taken from The Guardian. Enforcement officers working for the Information Commissioner’s Office entering the premises of Cambridge Analytica.


Stealing data is no different than stealing money.  This year, we will see the launch of the GDPR (General Data Protection Regulation), but we should not forget that there already exist strong principles of data protection.  For convenience, I copy below the eight data protection principles:

  1. Personal data shall be processed fairly and lawfully and, in particular, shall not be processed unless  at least one of the conditions in Schedule 2 is met, and in the case of sensitive personal data, at least one of the conditions in Schedule 3 is also met.
  2. Personal data shall be obtained only for one or more specified and lawful purposes, and shall not be further processed in any manner incompatible with that purpose or those purposes.
  3. Personal data shall be adequate, relevant and not excessive in relation to the purpose or purposes for which they are processed.
  4. Personal data shall be accurate and, where necessary, kept up to date.
  5. Personal data processed for any purpose or purposes shall not be kept for longer than is necessary for that purpose or those purposes.
  6. Personal data shall be processed in accordance with the rights of data subjects under this Act.
  7. Appropriate technical and organisational measures shall be taken against unauthorised or unlawful processing of personal data and against accidental loss or destruction of, or damage to, personal data.
  8. Personal data shall not be transferred to a country or territory outside the European Economic Area unless that country or territory ensures an adequate level of protection for the rights and freedoms of data subjects in relation to the processing of personal data.


If anybody had a doubt, the law has the power to enforce regulations. As an individual, I welcome this power. #dontmesswithmydata!





Provenance Reading List (v2)

I am regularly asked by students and researchers for a reading list on provenance. The following papers give a good baseline on the kind of work we undertake in my group. This is not meant to be an extensive literature survey, but it should give enough background to have discussions about projects related to provenance.

This page updates a previous version of the reading list available at

Introduction to Provenance and PROV


Provenance Analytics

Software engineering and provenance


Provenance and Accountability

Data privacy and accountability

My statement for today’s panel on privacy: I want to talk about data privacy in the context of the notion of accountability.

Imagine you browse the web, looking for shoes. For the weeks to follow, whenever you visit a web page, adverts of shoes will be presented to you.

Have you ever asked yourself why these adverts are shown to you, who has information about you, what information they have about you, and how they decided to serve this advert to you?

A system able to answer such why/who/what/how questions is accountable. Being accountable means being able to provide explanations or justifications for decisions and actions.

To provide accountability, one needs to be able to trace flows of data: tracing data across systems enables explanations to be provided about the transformations, operations, and decisions made about such data. Several names exist for this notion, including traceability and provenance. The provenance of a decision helps explain the factors that affected it, the data involved in it, and so on. The word is common for food: the provenance of food is a sign of its quality; likewise, the provenance of a piece of art enables its authenticity to be asserted. Over the last 15 years, I have been leading research activities around provenance of data, and led a standardisation activity for provenance on the web.

The European General Data Protection Regulation (GDPR), coming into force in 2018, has a component dubbed the “right to explanation”. There is still some uncertainty about what it entails, both legally and technically.

What has it got to do with privacy? Privacy and accountability have an interesting relation that I want to discuss.

Consider expense claims, a topic well understood by this audience. Imagine that Alice and Bob have a business meeting conducted over a meal. Bob has to make his expense claim public.  This may indirectly make the presence of Alice at the restaurant’s location public. Alice’s privacy is in tension with Bob’s accountability/transparency requirement.

So, there is a tension between privacy and accountability: a 100% private system gives you no accountability, and a 100% accountable system gives you no privacy.

Privacy is important, and so is accountability! These are values that we want to promote. Technically, legally, and as a society, we are still learning to understand these values and how they should be protected.





Principles for Algorithmic Transparency and Accountability: A Provenance Perspective

A few days ago, the ACM U.S. Public Policy Council (USACM) released a statement and a list of seven principles aimed at addressing potential harmful bias of algorithmic solutions. This effort was initiated by the USACM’s Algorithmic Accountability Working Group. Algorithmic solutions are now widely deployed to make decisions that affect our lives, e.g., recommendations for movies, targeted ads on the web, autonomous vehicles, suggested contacts or reading in social networks, etc. We have all come across systems making decisions that are targeted to us individually, and I am sure that many of us have wondered how a given recommendation was made to us, on the basis of which information and what kind of profile. Typically, no explanation is made available to us! Nor is there any means to track the origin of such decisions!

Interestingly, emerging regulatory frameworks, such as the EU General Data Protection Regulation, are introducing the “right to explanation” (see in particular Article 22 on Automated individual decision-making, including profiling). So, the regulatory framework is evolving, even though there is still no consensus on how to actually achieve this in practice.

Furthermore, algorithmic bias is a phenomenon that has been observed in various contexts (see for instance two recent articles in the New York Times and the Guardian). Given the pervasive nature of algorithmic solutions, the ACM U.S. Public Policy Council acknowledges that it is imperative to address “challenges associated with the design and technical aspects of algorithms and preventing bias from the onset”. On this basis, they propose seven principles, compatible with their code of ethics.

As a provenance researcher, I have always regarded the need to log flows of information and activities, and ascribe responsibility for these as crucial steps to making systems accountable. This view was echoed by Danny Weitzner and team in their seminal paper on Information Accountability.  I was therefore delighted to see that “Data provenance” was listed as an explicit principle of the USACM list of seven principles. So, instead of paraphrasing them, I take the liberty of copying them below.


Figure 1: ACM US Public Policy Council list of seven principles for Algorithmic Transparency and Accountability



However, I feel that provenance, as I understand it, encompasses several of these principles, something that I propose to investigate in the rest of this post. To illustrate this, I propose Figure 2, a block diagram outlining the high-level architecture of a transparent and accountable system. At the heart of such a system, we find its Business Logic, which provides its primary functionality (e.g. recommendations, analytics, etc.). In provenance-aware systems, applications log their activities and data flows, out of which a semantic representation is constructed, which I refer to as provenance. PROV is a standardised representation for provenance, recently published by the World Wide Web Consortium and seeing strong adoption in various walks of life. In this context, provenance is defined as “a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing”.

There is no point constructing such a semantic representation if it is not being exploited. Various capabilities can be built on top of such a provenance repository, including query interfaces, audit functionality, an explanation service, a redress mechanism and validation, which we discuss now in light of the seven principles.



Figure 2: The Role of Provenance in the Architecture of an Accountable System


The first principle (Awareness) identifies a variety of stakeholders: Owners, Designers and Builders, and Users; the second principle also mentions the role of Regulators, and we believe that potential third-party Auditors are also relevant in that context. While technology makes progress with algorithmic solutions, society is much slower to react, and there is indeed work required to increase awareness, and to establish what the rights of users are and what the obligations on owners should be, whether by means of regulation or self-regulation. The SmartSociety project recently published a Social Charter for Smart Platforms, which illustrates what such rights and obligations can be in “smart” platforms.

The second principle (Access and Redress) recommends mechanisms by which systems can be questioned and redress enabled for individuals. This principle points to the ability to query the system and its past actions, which is a typical provenance-based functionality. For those seeking redress, there is a need to be able to refer to the event that resulted in an unsatisfactory outcome; PROV-based provenance mandates that all outcome, data and activity instances are uniquely identified. Furthermore, we are of the view that such a redress mechanism, including the resolutions reached, should be inspectable in a similar fashion; thus, the provenance of redress requests and resolutions should also be inspectable.

The third principle (Accountability) is concerned with holding institutions responsible for the decisions made by their algorithmic systems. For this, one needs a non-repudiable account of what has happened, and suitable attribution of decisions to system components, their owners, and those legally responsible for the system’s actions. Again, such an account is exactly what PROV offers: we therefore see the third principle being implemented technically with queries over a provenance representation, and socially with suitable regulatory and enforcement mechanisms.

The fourth principle (Explanation) requires explanations to be produced about the unfolding of activities and decisions. There is emerging evidence that provenance can serve as a form of computer-based narrative, out of which textual explanations can be composed and presented to users. We recently conducted user studies about the perceived legibility of natural language explanations by casual users. We also used a similar technique to provide explanations about user ratings in a Ride Share application.

The fifth principle (Data Provenance) explicitly focuses on the data used to train so-called “machine-learning” algorithms. We believe that it is not just training data that is relevant, but any external data that the business logic and designers may rely upon. Public scrutiny of such data is expected to offer opportunities to correct potential bias and, in general, any concern that may affect decisions. To operationalise this principle, one needs access to a description of the data (potentially, the data itself), but also to how it is used in training algorithms, and how this potentially affects decisions. PROV-based provenance, queries and explanations are required here to allow such scrutiny. Some of our recent work focused on analytics techniques to assess the quality of data using provenance information; such a mechanism becomes useful to ensure some form of quality control in systems.

The sixth principle (Auditability) demands that models, algorithms, data, and decisions be recorded, so that they can be audited. All of these can easily be described in PROV, by means of “PROV entities”, which can be used or generated by “PROV activities”, under the supervision of responsible agents. Specific auditing functions (aimed at various stakeholders) can query the provenance to expose individual entities, but also their aggregate characteristics over longer periods of time. Techniques that we have developed, such as provenance summarisation, become really critical in this context, since they enable us to investigate the aggregate behaviour of applications, instead of individual circumstances.
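As a toy illustration of the summarisation idea (node names and types are invented, and the actual technique in the reading list is considerably more sophisticated), instance-level provenance can be collapsed into a type-level summary, replacing individual nodes by their categories and counting how often each typed edge occurs:

```python
from collections import Counter

# Instance-level provenance: (relation, subject, object) triples, plus a
# map from each instance to its category (all names invented).
triples = [
    ("used", "decide_1", "profile_a"),
    ("used", "decide_2", "profile_b"),
    ("used", "decide_3", "profile_c"),
    ("wasGeneratedBy", "rec_1", "decide_1"),
    ("wasGeneratedBy", "rec_2", "decide_2"),
    ("wasGeneratedBy", "rec_3", "decide_3"),
]
node_type = {
    "decide_1": "Decide", "decide_2": "Decide", "decide_3": "Decide",
    "profile_a": "Profile", "profile_b": "Profile", "profile_c": "Profile",
    "rec_1": "Recommendation", "rec_2": "Recommendation", "rec_3": "Recommendation",
}

def summarise(triples, node_type):
    """Collapse instances to their types, counting each typed edge."""
    return Counter((rel, node_type[s], node_type[o]) for rel, s, o in triples)

summary = summarise(triples, node_type)
print(summary[("used", "Decide", "Profile")])  # 3
```

The summary talks about categories ("Decide activities used Profile entities, 3 times") rather than individuals, which is exactly what makes aggregate auditing possible, and what makes situating an individual decision within the summary a research question.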

The seventh principle (Validation and Testing) recommends regular validation of models and testing for harmful outcomes. This suggests that checking whether some expected criteria have been met can be implemented by policy-based processing over provenance, detecting whether past executions comply with expectations described as policies. We have applied this technique to decide whether processing was performed in compliance with usage policies. If it is good practice to undertake validation and testing, it then also becomes necessary to document such practice, to be able to demonstrate that validation and testing take place.
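The kind of policy-based check described above can be sketched as follows (the policy, records and field names are all hypothetical): a policy is a predicate over provenance records, and validation flags the past executions that violate it:

```python
# Hypothetical provenance records of past processing activities, and a
# hypothetical policy: any activity that used personal data must have a
# recorded legal basis.
records = [
    {"activity": "profiling_1", "used": "personal_data", "agent": "acme", "legal_basis": "consent"},
    {"activity": "profiling_2", "used": "personal_data", "agent": "acme", "legal_basis": None},
    {"activity": "billing_1",   "used": "invoice_data",  "agent": "acme", "legal_basis": None},
]

def non_compliant(records):
    """Return the activities that processed personal data without a legal basis."""
    return [r["activity"] for r in records
            if r["used"] == "personal_data" and not r["legal_basis"]]

print(non_compliant(records))  # ['profiling_2']
```

Running such checks over the provenance repository both performs the validation and, because the checks themselves can be logged, documents that validation took place.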

So overall, the provenance research community has been investigating issues around capturing, storing, representing, querying and exploiting provenance information, all of which have a critical role to play in the principles of Algorithmic Transparency and Accountability. There is still much to research, however, including critical issues around (1) agreed domain-specific extensions of PROV to support transparency and accountability; (2) better integration of software engineering methodologies with provenance; (3) enforceable compliance with the architecture; (4) non-repudiation of provenance; (5) querying and auditing facilities; (6) compliance checks over provenance; (7) user-friendly explanation of complex algorithmic decisions; and (8) scalability of all the above.

In the spirit of Principle 1, I hope this blog post contributes to raising awareness of these issues. Feedback and comments welcome!