Research Associate

We are recruiting a post-doctoral researcher for the new project PLEAD: “Provenance-driven and Legally-grounded Explanations for Automated Decisions”, funded by EPSRC. We are looking for researchers who have a PhD in Computer Science, Artificial Intelligence, Machine Learning, Data Science, or a related area, an excellent publication record, and an appetite for interdisciplinary, collaborative and impactful work.
PLEAD brings together an interdisciplinary team of technologists, legal experts, commercial companies and public organisations to investigate how provenance can help explain the logic that underlies automated decision-making, to the benefit of data subjects, and can help data controllers demonstrate compliance with the law. Explanations that are provenance-driven and legally grounded will allow data subjects to place their trust in automated decisions, and will allow data controllers to ensure compliance with the legal requirements placed on their organisations.
For more details about PLEAD, see https://plead-project.org/.
For the job advert, see https://my.corehr.com/
For any enquiry, please contact me.

Provenance and explainability of AI decisions: PhD opportunity

Are you interested in a PhD? I have a fully funded PhD scholarship, and I am seeking to supervise a student interested in provenance, explainability, and AI decisions. Contact me, and we can discuss a PhD topic. Below, I suggest some example research directions: they are not meant to constrain or limit the research you would undertake, but are shared here as a starting point for a conversation.

First, what is provenance? Provenance is “a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering” a piece of data, a document, or an automated decision. This is precisely the definition of W3C PROV provenance (https://www.w3.org/TR/prov-primer/), a standardised form of knowledge graph providing an account of what a system performed. It includes references to the people, data sets and organisations involved in decisions; attribution of data; and data derivation. It captures not only how data is used and updated, but also how data flows through the system and the causal dependencies among data items. Provenance is therefore an incredibly valuable source of data from which to generate explanations about decisions made by algorithmic systems. The US ACM statement on Algorithmic Transparency and Accountability suggested that provenance can assist with Information Accountability. We share this view, as discussed in https://lucmoreau.wordpress.com/2017/01/20/principles-for-algorithmic-transparency-and-accountability-a-provenance-perspective/
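
To make these ingredients concrete, here is a minimal sketch of what such a record could look like for a hypothetical automated loan decision, written with the open-source prov Python package; the scenario and every identifier in it are invented purely for illustration.

# A minimal, illustrative PROV record for a made-up automated loan decision,
# built with the `prov` Python package (pip install prov). All names are invented.
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace('ex', 'http://example.org/#')

# The people, software and data involved in the decision
application = doc.entity('ex:application42', {'ex:purpose': 'loan application'})
model = doc.entity('ex:credit_model_v3')
decision = doc.entity('ex:decision42', {'ex:outcome': 'referred for human review'})
scoring = doc.activity('ex:scoring_run42')
bank = doc.agent('ex:acme_bank')

# How they relate: usage, generation, association, attribution and derivation
doc.used(scoring, application)
doc.used(scoring, model)
doc.wasGeneratedBy(decision, scoring)
doc.wasAssociatedWith(scoring, bank)
doc.wasAttributedTo(decision, bank)
doc.wasDerivedFrom(decision, application)

print(doc.get_provn())   # a human-readable PROV-N rendering of the record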

So, the initial research question for a research project is: how can provenance be used to generate explanations about automated decisions that affect users? From there, multiple lines of investigation are possible, depending on your personal interests. Here are a few possible starting points:

  1. Imagine a typical decision pipeline involving some machine learning technique: a training dataset is selected (potentially according to corporate governance rules intended to avoid bias), the dataset is prepared, a model is trained according to some algorithm, and the model is then deployed and applied to user data to make decisions or recommendations. How does the provenance of such a decision-making pipeline need to be marked up to assist with the creation of explanations? What constitutes an explanation? What is its purpose, i.e., what is it intended to explain to the user? How should it be structured? What NLG technique can be used to organise the explanation: for instance, can Rhetorical Structure Theory be applied in this context to develop the structure of an explanation out of provenance? The work can involve algorithmic design and proof-of-concept building, but also user evaluation, in which users are presented with explanations and provide feedback on their suitability. Finally, an explanation could take multiple forms, from text to a multimedia presentation.
  2. When a system is instrumented to generate provenance, very large provenance data sets may be generated; they can easily amount to 100 MB of data, and possibly more. I have developed a summarisation technique (see reading list) that can extract the essence of such large provenance data and generate a much more compact provenance graph, which we call a provenance summary. Provenance summaries could be a strong basis for generating explanations. However, some challenges need to be tackled for them to be useful. Summaries talk about categories of activities and entities, rather than individual instances; so how can this information be exploited to situate a decision made about an individual user in the context of decisions made about categories of users? Provenance graphs have a temporal semantics (as defined by the PROV-CONSTRAINTS recommendation https://www.w3.org/TR/prov-constraints/); however, a temporal semantics for provenance summaries still needs to be defined, and it should then be determined how it can be exploited to construct an explanation.
  3. Provenance is usually exploited in a relatively coarse-grained manner, in which whole algorithms or data transformations are simply described by a semantic relation (a subtype of the derivation relation “was derived from”). As a result, in the above discussion, whole pipelines may be documented with provenance, but individual algorithms remain black boxes. However, this does not have to be the case: algorithms for which we have the source code can also be instrumented, thereby exposing details of their execution. We have successfully instrumented a simple decision tree library by hand (a minimal illustration of this idea appears in the sketch after this list). Can this be done for more complex algorithms? Is there a limit to what can be instrumented? How can the information be exploited to construct meaningful explanations of the behaviour of the algorithm? Can modern GPU processors also be used to construct and process very large provenance graphs?
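
As an illustration of the fine-grained instrumentation mentioned in point 3, here is a minimal sketch of a toy, hand-instrumented decision rule that records the provenance of each test it performs, using the prov Python package. It is not the decision tree library mentioned above; the rule, thresholds and identifiers are all invented.

# Illustrative only: a toy decision rule instrumented by hand to record the
# provenance of every step it takes. All names and thresholds are invented.
from prov.model import ProvDocument

def decide(applicant, doc):
    """Return a decision for `applicant`, recording every step taken in `doc`."""
    case = doc.entity('ex:case', {'ex:income': applicant['income'],
                                  'ex:age': applicant['age']})
    # Step 1: income threshold test, recorded as an activity in its own right
    test1 = doc.activity('ex:test_income')
    doc.used(test1, case)
    if applicant['income'] < 20000:
        decision = doc.entity('ex:decision', {'ex:outcome': 'reject'})
        doc.wasGeneratedBy(decision, test1)
        doc.wasDerivedFrom(decision, case)
        return decision
    # Step 2: age threshold test, only reached when step 1 passes
    test2 = doc.activity('ex:test_age')
    doc.used(test2, case)
    doc.wasInformedBy(test2, test1)     # step 2 depends on the outcome of step 1
    outcome = 'accept' if applicant['age'] >= 21 else 'reject'
    decision = doc.entity('ex:decision', {'ex:outcome': outcome})
    doc.wasGeneratedBy(decision, test2)
    doc.wasDerivedFrom(decision, case)
    return decision

doc = ProvDocument()
doc.add_namespace('ex', 'http://example.org/#')
decide({'income': 25000, 'age': 34}, doc)
print(doc.get_provn())   # the recorded trace exposes exactly which tests ran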

Scholarship details

To be eligible for this scholarship, you will have to be a UK or an EU citizen. The scholarship includes registration fees (UK/EU fees) and a stipend for 3 years. There is also support for computing equipment and some travel funding to attend conferences.

Research Context

The successful applicant will join Prof Luc Moreau’s team at King’s College London, as part of the Cybersecurity Group. Two departmental hubs are related to this activity, namely the Trusted Autonomous Systems hub and the Security hub (see https://www.kcl.ac.uk/nms/depts/informatics/research/research). The team is involved in three new projects at King’s: Provenance Analytics for Command and Control, funded by ONR-G; THuMP: Trust in Human-Machine Partnership, funded by EPSRC; and a third project funded by EPSRC, details to be announced.

A few pointers

 

UML, Food Contamination Risk, and Black Holes: Provenance without Frontier!

With some students and collaborators, we have published three papers at IPAW’18 (International Provenance and Annotation Workshop), which are being presented this week, during Provenance Week’18, hosted at King’s College London.

1. Provenance capture with UML2PROV

It remains challenging to develop provenance-enabled applications. A key difficulty is to ensure that the provenance generated by an application describes what the application actually does! Prov-Templates already offer a declarative way of specifying the shape of the provenance to be generated, while bindings created at runtime by the application instantiate these templates into concrete provenance. UML2PROV, designed with Carlos Sáenz Adán and Beatriz Pérez Valle, goes a step further.

The ambitious vision of UML2PROV is to control the generation of provenance from UML specifications of a program, minimising the amount of manual intervention by the programmer. Three steps are involved:

  1. From UML specifications, UML2PROV creates Prov-Templates and a runtime library to generate bindings.
  2. The programmer’s only intervention is to deploy the runtime library with the application.
  3. As the application runs, the runtime library creates bindings that can be used to derive provenance from the templates.

Carlos previously supported UML sequence and activity diagrams. The IPAW’18 paper extends this work by supporting class diagrams. This apparently simple extension required a significant re-design, as it allows the flow of data to be better tracked in applications. The system was also fully redesigned to minimise the programmer’s intervention: the original approach used reflection and proxies to intercept method calls, whereas it now relies on aspects. Carlos is not the first to suggest that provenance aspects can be woven into code in order to generate provenance, but he is the first to do so systematically, in the context of a software engineering methodology such as UML.
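
UML2PROV itself targets Java applications and relies on aspect-oriented programming. Purely to illustrate the underlying idea of step 3, intercepting calls in order to produce bindings, here is a much simpler Python analogue based on a decorator; it is not UML2PROV, and all names in it are invented.

# A minimal Python analogue of "weaving in" provenance capture: a decorator
# that records a set of bindings every time the wrapped function is called.
import functools

bindings = []   # accumulated variable/value pairs, one set per intercepted call

def record_bindings(template_name):
    """Wrap a function so that each call logs a set of bindings for `template_name`."""
    def wrap(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            bindings.append({
                'template': template_name,
                'var': {'operation': fn.__name__,
                        'inputs': list(args),
                        'output': result},
            })
            return result
        return wrapper
    return wrap

@record_bindings('binary_operation')
def add(x, y):
    return x + y

add(4, 5)
print(bindings)   # one set of bindings, ready to instantiate a template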

Full details of the paper can be found at:

 

Carlos Saenz-Adan, Luc Moreau, Beatriz Pérez, Simon Miles, and Francisco J. García-Izquierdo. Automating provenance capture in software engineering with uml2prov. In IPAW’2018: 7th International Provenance and Annotation Workshop, London, UK, July 2018. https://kclpure.kcl.ac.uk/portal/en/publications/automating-provenance-capture-in-software-engineering-with-uml2prov(aed0b80c-7d14-40d6-b3e6-3a71ee606213).html

 

2. Deriving Contamination Risk from Food Provenance

Food has always been a powerful illustration of why provenance matters. How many times have I delivered talks mentioning “Provenance Wine” and “Provenance Whisky” over the years? However, the adage “provenance helps users place their trust in data or things” has proven challenging to demonstrate scientifically and rigorously. I believe that the work by Belfrit Batlajery, Mark Weal, Adriane Chapman and myself is a significant step in that direction.

Belfrit’s approach is to use provenance-based descriptions of food supply chains, ideally from Farm to Fork, together with partial knowledge about the levels of bacterial contamination therein. A belief propagation approach over factor graphs derived from the provenance graph allows him to infer levels of contamination across the network. The point of this analysis is not so much to infer highly precise contamination levels, but instead to obtain an estimate that is accurate enough to determine where risk is highest.
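
To give a flavour of the idea, here is a deliberately simplified stand-in: instead of belief propagation over factor graphs, the sketch below merely propagates a contamination estimate along “was derived from” edges, taking the maximum over the sources of each item. The toy supply chain and its numbers are invented.

# Simplified stand-in (not the paper's factor-graph belief propagation):
# propagate contamination estimates along derivation edges of a tiny,
# invented supply chain, starting from partial measurements.
import functools

derived_from = {                       # item -> the items it was derived from
    'retail_pack': ['processed_batch'],
    'processed_batch': ['farm_milk_a', 'farm_milk_b'],
}
measured = {'farm_milk_a': 0.02, 'farm_milk_b': 0.35}    # partial knowledge only

@functools.lru_cache(maxsize=None)
def estimate(item):
    """Infer a contamination estimate for `item` from its provenance."""
    if item in measured:
        return measured[item]
    return max(estimate(source) for source in derived_from[item])

print(estimate('retail_pack'))   # 0.35: the risk is traced back to farm_milk_b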

This approach would allow an organisation to meet its due diligence requirements by logging food provenance, inferring contamination levels, and using these to identify the places where food sampling is most likely to reduce the level of risk across the supply chain, ultimately allowing consumers to have confidence in food products.

Full details of the paper can be found at:

Belfrit Victor Batlajery, Mark Weal, Adriane Chapman, and Luc Moreau. Belief propagation through provenance graphs. In IPAW’2018: 7th International Provenance and Annotation Workshop, London, UK, July 2018. https://kclpure.kcl.ac.uk/portal/en/publications/belief-propagation-through-provenance-graphs(c1b7a54d-4e9c-4a9f-8d7d-cce4a6b1e4ab).html

 

3. The search for Low Mass X-Ray Binaries

Our universe is amazing, full of complex systems that are hard to comprehend and visualise. One of them, a low-mass X-ray binary (LMXB), is a binary star system in which one of the components is compact, either a black hole or a neutron star; remarkably, the other component transfers mass to the compact one, and X-rays are produced by the matter being transferred between the components. If you want to see a visual representation of this, look at some amazing images: https://www.google.co.uk/search?q=low+mass+x+ray+binaries&source=lnms&tbm=isch&sa=X&ved=0ahUKEwj7gcqmvpHcAhVEblAKHc1gBywQ_AUICigB&biw=2003&bih=1120

Down to earth, Michael Johnson, in collaboration with Poshak Gandhi, Adriane Chapman and myself, has been developing data science techniques to help with the discovery of such LMXBs. The opportunity arises from the Large Synoptic Survey Telescope (LSST), which is still in its design phase. Michael has developed a provenance-enabled image processing pipeline. From a provenance perspective, the challenge was to demonstrate the benefits of provenance to the astronomer. Michael was able to show that, while provenance capture introduced a runtime overhead, this overhead was offset by significant compute time savings obtained by querying provenance rather than recomputing results. This is an important outcome, since it demonstrates that provenance should be routinely captured by data science toolkits because it brings benefits to their users.
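
The following sketch is not Michael’s pipeline, but it illustrates why querying provenance can beat recomputation: before running a step, look up whether provenance already records an output generated by the same operation from the same inputs. Everything in it is invented for illustration.

# Illustration of provenance-driven reuse: capture has a cost, but the record
# is repaid whenever the same step would otherwise be recomputed.
provenance = {}    # (operation name, inputs) -> previously generated output

def run_step(operation, inputs):
    key = (operation.__name__, inputs)
    if key in provenance:                # provenance query instead of recompute
        return provenance[key]
    output = operation(*inputs)          # the expensive processing step
    provenance[key] = output             # record what was generated, and from what
    return output

def calibrate(frame):
    return f'calibrated({frame})'        # stands in for a costly image operation

run_step(calibrate, ('frame_001',))      # computed and recorded
run_step(calibrate, ('frame_001',))      # answered from provenance, no recompute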

Interestingly, Michael’s workflow uses the UML2PROV technique to generate its provenance.

Full details of the paper can be found at:

Michael A. C. Johnson, Luc Moreau, Adriane Chapman, Poshak Gandhi, and Carlos Saenz-Adan. Using the provenance from astronomical workflows to increase processing efficiency. In IPAW’2018: 7th International Provenance and Annotation Workshop, London, UK, July 2018. https://kclpure.kcl.ac.uk/portal/en/publications/using-the-provenance-from-astronomical-workflows-to-increase-processing-efficiency(26f42342-f907-48ba-be7d-41a28ae1f501).html

A Community Repository for Provenance Templates

On several occasions, I have written about PROV-TEMPLATE, a declarative approach that enables designers and programmers to design and generate provenance compatible with the PROV standard of the World Wide Web Consortium. Designers specify the topology of the provenance to be generated by composing templates, which are provenance graphs containing variables, acting as placeholders for values. Programmers write programs that log values and package them up in sets of bindings, a data structure associating variables and values. An expansion algorithm generates instantiated provenance from templates and sets of bindings in any of the serialisation formats supported by PROV.

To promote best practice, and to facilitate the implementation of provenance-enabled applications, I feel that the time has come to create a catalogue of provenance templates and examples of bindings to instantiate these templates. This catalogue should be open and available to use by anybody for any purpose, academic, commercial or other.

I am pleased to announce the launch of a community repository for provenance templates at:

https://github.com/openprov/templates

I am drafting a document specifying the governance of this repository. I am also writing up some guidelines on how to structure the repository and the process by which templates can be shared (we will obviously be relying on GitHub for this!).

These are starting points, as I really want this repository to become a community resource; therefore, I welcome suggestions for the management of the repository.

Shortly, I will be talking about the governance of the repository. Watch this space …

 

Service for template-based provenance generation

Previously, I blogged about prov-template, an approach for generating provenance. It consists of a declarative definition of the shape of the provenance to be generated, referred to as a template, and sets of values used to instantiate templates into concrete provenance. Today, I am pleased to write about a new online service allowing such templates to be expanded into provenance.

The service is available from https://openprovenance.org/services/view/expander. I will now illustrate its use through a few examples.

First, let’s consider an example of a template (see https://openprovenance.org/templates/org/openprovenance/generic/binaryop/1.provn for its PROV-N version). Visually, it looks like this.

[Figure: A template for a binary operation]

It shows an activity, two entities used as inputs, and an entity generated as output. There is an agent associated with the activity. The output is derived from the two inputs. This provenance description is contained in a provenance document, but what makes it a template is that the identities of nodes are variables: URIs in a reserved namespace with prefix var: (var:produced, var:operation, …). These variables are meant to be replaced by concrete values. Variables are also allowed in the value position of property-value pairs (cf. var:operation_type).

This template for instance could be used to describe the addition of two numbers resulting in their sum.
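
Since the figure is not reproduced here, the sketch below rebuilds a template of the same shape with the prov Python package: a provenance document whose node identities live in the var: namespace. It approximates, rather than reproduces, the template published at the URL above, and the var: namespace URI is assumed.

# A template is an ordinary PROV document whose identifiers are var: placeholders.
# This is an approximation of the binary-operation template described above.
from prov.model import ProvDocument

tmpl = ProvDocument()
tmpl.add_namespace('var', 'http://openprovenance.org/var#')   # assumed URI

operation = tmpl.activity('var:operation')
agent = tmpl.agent('var:agent')
consumed1 = tmpl.entity('var:consumed1')
consumed2 = tmpl.entity('var:consumed2')
produced = tmpl.entity('var:produced')

tmpl.used(operation, consumed1)
tmpl.used(operation, consumed2)
tmpl.wasGeneratedBy(produced, operation)
tmpl.wasAssociatedWith(operation, agent)
tmpl.wasDerivedFrom(produced, consumed1)
tmpl.wasDerivedFrom(produced, consumed2)

print(tmpl.get_provn())   # the variables remain ordinary identifiers until expansion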

The template service looks as follows. Two input boxes respectively expect the URL of a template (you need to ensure that the template is accessible to the service) and the bindings between variables and values used to instantiate the template. For convenience, the template pull-down menu already provides the link to the template described above. Likewise, the example pull-down menu contains several examples of bindings. Let’s select the first one, and click on the SVG button to generate an SVG representation of the expanded provenance.

[Figure: Template expansion service]

The result is as follows: variable names have been instantiated with identities for the activity, entities and agent, as well as with property values. Property-value pairs whose value was a variable not assigned by the bindings are simply removed from the expanded provenance (for instance, as the variable var:operation_type is unassigned, the type property was removed from the expansion).

[Figure: Expanded provenance for the first set of bindings]

Below, we find the expanded provenance for the fourth set of bindings. There, we see that two different outputs, output1 and output2, were provided, and that they have been given different numbers of attributes.

[Figure: Expanded provenance for the fourth set of bindings]

The language used to express the bindings is a simple JSON structure. The first set of bindings is expressed as follows.

{
  "var": {
    "operation": [
      {
        "@id": "ex:operation1"
      }
    ],
    "agent": [
      {
        "@id": "ex:ag1"
      }
    ],
    "consumed1": [
      {
        "@id": "ex:input_1"
      }
    ],
    "consumed_value1": [
      {
        "@value": "4",
        "@type": "xsd:int"
      }
    ],
    "consumed2": [
      {
        "@id": "ex:input_2"
      }
    ],
    "consumed_value2": [
      {
        "@value": "5",
        "@type": "xsd:int"
      }
    ],
    "produced": [
      {
        "@id": "ex:output"
      }
    ],
    "produced_type": [
      {
        "@id": "ex:Result"
      }
    ],
    "produced_value": [
      {
        "@value": "20",
        "@type": "xsd:int"
      }
    ]
  },
  "context": {
    "ex": "http://example.org/#"
  },
  "template": "https://openprovenance.org/templates/org/openprovenance/generic/binaryop/1.provn"
}
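
To convey what expansion does with such a structure, here is an illustrative sketch only; it is not the actual expansion algorithm, which also handles attribute variables, vargen: identifiers and groups of bindings. The template is rendered as a plain list of edges, and every var: placeholder is replaced by the value bound to it.

# Sketch of the core idea of expansion: substitute var: placeholders with bound values.
template_edges = [
    ('used',              'var:operation', 'var:consumed1'),
    ('used',              'var:operation', 'var:consumed2'),
    ('wasGeneratedBy',    'var:produced',  'var:operation'),
    ('wasAssociatedWith', 'var:operation', 'var:agent'),
    ('wasDerivedFrom',    'var:produced',  'var:consumed1'),
    ('wasDerivedFrom',    'var:produced',  'var:consumed2'),
]

bindings = {                          # a subset of the JSON document shown above
    'var': {
        'operation': [{'@id': 'ex:operation1'}],
        'agent':     [{'@id': 'ex:ag1'}],
        'consumed1': [{'@id': 'ex:input_1'}],
        'consumed2': [{'@id': 'ex:input_2'}],
        'produced':  [{'@id': 'ex:output'}],
    }
}

def instantiate(term):
    """Replace a var: placeholder by its first bound value; keep other terms as-is."""
    if term.startswith('var:'):
        return bindings['var'][term[len('var:'):]][0]['@id']
    return term

expanded = [(rel, instantiate(a), instantiate(b)) for rel, a, b in template_edges]
print(expanded)    # [('used', 'ex:operation1', 'ex:input_1'), ...]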

 

Go ahead and experiment with templates and bindings using the service. For more details, please see previous posts.

Happy prov-templating …

Legacy sites at openprovenance.org

I recently wrote a blog post about the relaunch of openprovenance.org. Today, I am pleased to announce the availability of two websites providing a historical perspective on the work that took place in the provenance community.

The Provenance Challenge website is hosted at https://openprovenance.org/provenance-challenge/WebHome.html. It is kept in its original wiki look-and-feel, as it constituted a significant community effort that led to the PROV standardisation. At the time, the community decided that it needed to understand the different representations of provenance, their common aspects, and the reasons for their differences. The Provenance Challenge was born as a community activity aiming to understand and compare the various provenance solutions. Three consecutive provenance challenges took place. A significant artifact that resulted from the Provenance Challenge series is the Open Provenance Model.

The Open Provenance Model (OPM) website is hosted at https://openprovenance.org/opm/. OPM was the first community data model for provenance; it was designed as a conceptual data model for exchanging provenance information. It contained key concepts such as Artifact (called Entity in PROV), Process (called Activity in PROV), and Agent. It also introduced the notions of usage, generation and derivation of artifacts.

Of course, all of this is now superseded by PROV, the W3C set of Recommendations and Notes for provenance. These legacy sites are made available to the community for reference. We aim to preserve these pages and URLs in the future. Feel free to link to them!

 

 

openprovenance.org Relaunched

It is my pleasure to announce the relaunch of openprovenance.org, the site for standards-based provenance solutions.

With our move to King’s College London, Dong and I have migrated the provenance services from Southampton to King’s. I am pleased to announce the launch of the following services at openprovenance.org:

  • ProvStore, the provenance repository that enables users to store, share, browse and manage provenance documents. ProvStore is available from https://openprovenance.org/store/.
  • A translator capable of converting between different representations of PROV, including visual representations of PROV in SVG. The translator service can be found at https://openprovenance.org/services/view/translator.
  • A validator service that checks provenance documents against the constraints defined in prov-constraints. Such a service can detect logically inconsistent provenance: for example, when an activity is said to have started after it ended, or when something is used before it was even created (a minimal example of such a document appears in the sketch after this list). The validator is hosted at https://openprovenance.org/services/view/validator.
  • A template expansion service facilitates a declarative approach to provenance generation, in which the shape of provenance can be defined by a provenance document containing variables, acting as placeholders for values. When provided with a set of bindings associating variables to values, the template expansion service generates a provenance document. The template expansion service lives at https://openprovenance.org/services/view/expander.
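
As a small illustration of the kind of document the validator rejects, the sketch below uses the prov Python package to build provenance in which an entity is used before it is generated; the identifiers and timestamps are invented, and submitting the resulting document to the validator should flag it as invalid.

# A deliberately inconsistent document: the report is used five hours before
# it is generated, which violates the PROV-CONSTRAINTS temporal ordering.
from datetime import datetime
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace('ex', 'http://example.org/#')

report = doc.entity('ex:report')
writing = doc.activity('ex:writing')
reviewing = doc.activity('ex:reviewing')

# The report is generated at 14:00 ...
doc.wasGeneratedBy(report, writing, time=datetime(2018, 7, 9, 14, 0))
# ... yet claimed to have been used at 09:00 the same day: logically inconsistent.
doc.used(reviewing, report, time=datetime(2018, 7, 9, 9, 0))

print(doc.get_provn())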

The Southampton services will be decommissioned shortly. If you have data in the old provenance store, we provide a procedure for you to download your provenance documents from the old store, and to upload them at openprovenance.org. In the age of GDPR, you will have to sign up for the new provenance store and accept its terms and conditions.

While the look and feel of the services may seem quite similar, under the bonnet there have been significant changes.

  • We have adopted a micro-service architecture for our services, allowing them to be composed in interesting ways. Services are deployed in Docker containers, facilitating their redeployment and enabling their configurability. We are also investigating other forms of licensing that would allow the services to be deployed elsewhere, allowing the host to have full control over access, storage and management. (Contact us if this is of interest to you.)
  • We have adopted Keycloak for identity management and access control for our existing and future micro-services. This offers an off-the-shelf solution for managing identities and obtaining consent. A single registration for all our services will now be possible.

As before, the above services are powered by open source libraries, which can also be found via openprovenance.org. ProvToolbox is a Java toolkit for processing provenance; it is available from http://lucmoreau.github.io/ProvToolbox/. Likewise, Prov Python is a toolkit for processing provenance in Python and can be found at https://pypi.org/project/prov/.