Provenance and explainability of AI decisions: PhD opportunity

Are you interested in a PhD? I have a fully funded PhD scholarship, and I am seeking to supervise a student interested in provenance, explainability, and AI decisions.  Contact me, and we can discuss a PhD topic. Below, I suggest examples of research directions: they are not meant to be constraining and limiting the research you would undertake, but they are shared here to serve as a starting point for a conversation.

First, what is provenance? Provenance is “a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering” a piece of data, a document, or an automated decision. Thus, provenance is an incredibly valuable source of data from which to generate explanations, about decisions made by algorithmic systems. This is precisely the definition of W3C PROV provenance (, a standardised form of knowledge graph providing an account of what a system performed. It includes references to people, data sets and organisations involved in decisions; attribution of data; and data derivation. It captures not only how data is used and updated but also how data flows in the system and their causal dependencies. The US ACM statement on Algorithmic Transparency and Accountability suggested that provenance can assist with Information Accountability.  We share this view, as discussed in

So, the initial research question for a research project is: how can provenance be used to generate explanations about automated decisions that affect users?  From there, there are multiple investigations, depending on your personal interest. Here are a few possible starting points:

  1. Imagine a typical decision pipeline, involving some machine learning technique being used to train a model, a training dataset that is being selected (potentially, according to corporate governance to avoid bias), some preparation of the dataset, training of the model according to some algorithm, and then the deployment of the model and its application to user data to make decisions or recommendations. How does provenance of such a decision-making pipeline need to be marked up to assist with the creation explanations? What constitutes an explanation? What is its purpose, i.e., what is it intended to explain to the user? How should it be structured? What NLG technique can be used to organise the explanation: for instance, can Rhetorical Structure Theory be applied in this context, to develop the structure of an explanation out of provenance. The work can involve algorithmic design and proof-of-concept building, but also user evaluation, in which users are presented with explanations, and provide feedback on their suitability. Finally, an explanation could have multiple forms, from texts to a multimedia presentation.
  2. When a system is instrumented to generate provenance, very often large provenance data sets may be generated. They can consist of 100 Mb of data, and possibly more. I have developed a summarisation technique (see reading list) that can extract the essence of such large provenance data, and generate a much more compact provenance graph, which we call a provenance summary. Provenance summaries could be a strong basis for generating explanations. However, some challenges need to be tackled for them to be useful. Summaries talk about categories of activities and entities, rather than individual instances. So how can this information be exploited to situate a decision made by a user, in the context of decisions made about categories of users? Provenance graphs have a temporal semantics (as defined by the PROV-CONSTRAINTS recommendation However, temporal semantics for provenance summaries needs to be defined. Subsequently, it should be determined how it can be exploited to construct an explanation.
  3. Provenance is usually exploited in a relatively coarse-grained manner, in which whole algorithms or data transformations are just described by a semantic relation (a subtype of the derivation relation “was derived from”). As a result, with the above discussion, whole pipelines may be documented with provenance, but individual algorithms remain black boxes.  However, it does not have to be the case: algorithms (for which we have the source code) can also be instrumented, thereby exposing details of their execution. We have successfully manually instrumented a simple decision tree library. Can this be done for more complex algorithms? Is there a limit to what can be instrumented? How can the information be exploited to construct meaningful explanations of the behaviour of the algorithm? Can modern GPU processors also be used to construct and process very large provenance graphs?

Scholarship details

To be eligible for this scholarship, you will have to be a UK or a EU citizen.  The scholarship includes registration fees (UK/EU fees) and a stipend for 3 years. There is also support for computing equipment and some travel funding to attend conferences.

Research Context

The successful applicant will join Prof Luc Moreau’s  team at King’s College London, as part of the Cybersecurity Group. Two departmental hubs are related to this activity, namely the Trusted Autonomous Systems hub and the Security hub (see The team is involved in three new projects at King’s (Provenance Analytics for Command and Control, funded by ONR-G,  THuMP: Trust in Human-Machine Partnership funded by EPSRC, and a third project funded by EPSRC, details to be announced).

A few pointers



The Photo of the Week #dontmesswithmydata

As a technologist, I have observed with a strong interest the fallout of Carole Cadwalladr’s investigative journalism published by the Observer, the Guardian, Channel 4 and New York Times.  Presumption of innocence is important, and I do hope that the official investigation will make responsibilities and failures explicit.

However, out of this tumultuous week for the Web and Social Media, I find the following photo extremely powerful.



Taken from The Guardian. Enforcement officers working for the Information Commissioner’s Office entering the premises of Cambridge Analytica.


Stealing data is no different than stealing money.  This year, we will see the launch of the GDPR (General Data Protection Regulation), but we should not forget that there already exist strong principles of data protection.  For convenience, I copy below the eight data protection principles:

  1. Personal data shall be processed fairly and lawfully and, in particular, shall not be processed unless  at least one of the conditions in Schedule 2 is met, and in the case of sensitive personal data, at least one of the conditions in Schedule 3 is also met.
  2. Personal data shall be obtained only for one or more specified and lawful purposes, and shall not be further processed in any manner incompatible with that purpose or those purposes.
  3. Personal data shall be adequate, relevant and not excessive in relation to the purpose or purposes for which they are processed.
  4. Personal data shall be accurate and, where necessary, kept up to date.
  5. Personal data processed for any purpose or purposes shall not be kept for longer than is necessary for that purpose or those purposes.
  6. Personal data shall be processed in accordance with the rights of data subjects under this Act.
  7. Appropriate technical and organisational measures shall be taken against unauthorised or unlawful processing of personal data and against accidental loss or destruction of, or damage to, personal data.
  8. Personal data shall not be transferred to a country or territory outside the European Economic Area unless that country or territory ensures an adequate level of protection for the rights and freedoms of data subjects in relation to the processing of personal data.


If anybody had a doubt, the law has power to enforce regulations. As an individual, I welcome this power #dontmesswithmydata!