ProvToolbox Tutorial 3: Merging PROV Documents

1. Introduction

It has become a requirement in several of our applications to merge PROV documents. The purpose of this tutorial is to explain how ProvToolbox allows documents to be merged, ensuring that descriptions are uniquely represented with all their attributes, merging bundles they may contain, and optionally “flattening” them.

This functionality is directly available from the command line using provconvert.

The tutorial is standalone and a zip archive can be downloaded from the following URL: http://search.maven.org/remotecontent?filepath=org/openprovenance/prov/ProvToolbox-Tutorial3/0.7.0/ProvToolbox-Tutorial3-0.7.0-src.zip. The tutorial can also be found on the ProvToolbox project on GitHub.

The tutorial assumes that provconvert has been installed and is available in the execution path.
The tutorial relies on a Makefile and can simply be run by calling:

make do.all

2. Examples of Merges

2.1 Merging two documents without bundles

Our first example consists of two documents. The first document “doc1” consists of the attribution of an entity e1 to an agent ag1. The entity has an attribute attr1.

doc1

The second document “doc2” describes the derivation of the same entity e1 from another entity e0. Its description of e1 again carries the attribute attr1, this time with the value val2.

doc2

By merging the two documents, we obtain a new document, in which the entity e1 is both attributed to ag1 and derived from e0. The description of e1 contains both values of the attribute attr1, val1 and val2.

Merged documents doc1 and doc2

Merging the two documents is simply performed by calling provconvert with argument -merge as follows.

provconvert -merge doc1-2-listing.txt -outfile target/doc1-2.provn

The -merge option expects a path to a file (or - to indicate standard input) that lists the files to be merged. In our case, we have a file doc1-2-listing.txt with the following contents:

file, src/main/resources/doc1.provn, provn
file, src/main/resources/doc2.provn, provn

Each line consists of three elements separated by a comma:

  1. A tag indicating if we are dealing with a file on the file system or a URL
  2. The path to the file or a full http URL
  3. The PROV format expected to be read
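To make the listing format concrete, here is a minimal Python sketch that reads such a listing into (tag, location, format) triples. It is only an illustration, not the parser used by provconvert: the names ListingEntry and parse_listing are hypothetical, and the sketch assumes that fields contain no embedded commas and that blank lines can be ignored.

```python
from typing import NamedTuple


class ListingEntry(NamedTuple):
    tag: str       # kind of source; the tutorial's examples only show "file"
    location: str  # path on the file system, or a full http URL
    fmt: str       # PROV format expected to be read, e.g. "provn"


def parse_listing(text):
    """Parse a merge listing: one 'tag, location, format' triple per line."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skipping blank lines is an assumption of this sketch
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 3:
            raise ValueError(f"expected 3 comma-separated fields: {line!r}")
        entries.append(ListingEntry(*parts))
    return entries


listing = """\
file, src/main/resources/doc1.provn, provn
file, src/main/resources/doc2.provn, provn
"""
entries = parse_listing(listing)
```

Here entries holds one triple per listed document, in the order in which they appear in the listing.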

For completeness, we show the details of the documents in PROV-N notation. First, “doc1”:

document

 prefix ex <http://example.org/#>

 entity(ex:e1,[ex:attr1="val1"])
 agent(ex:ag1)
 wasAttributedTo(ex:e1, ex:ag1)

endDocument

Then, “doc2”:

document

 prefix ex <http://example.org/#>

 entity(ex:e1,[ex:attr1="val2"])
 entity(ex:e0) 
 wasDerivedFrom(ex:e1, ex:e0)

endDocument

Finally, the merged document:

document
 prefix ex <http://example.org/#>

 entity(ex:e1,[ex:attr1 = "val1" %% xsd:string, ex:attr1 = "val2" %% xsd:string])
 entity(ex:e0)
 agent(ex:ag1)
 wasDerivedFrom(ex:e1, ex:e0)
 wasAttributedTo(ex:e1, ex:ag1)
endDocument

The merge operation follows key constraints of the PROV-CONSTRAINTS specification, such as key-object (constraint 22) and key-properties (constraint 23).
Readers familiar with the RDF representation of PROV will note that the merge operation amounts to “concatenating” all the RDF files together.
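The effect of these constraints on our example can be sketched in a few lines of Python. This is an illustrative model only, not ProvToolbox code: each document is reduced to a hypothetical dictionary from identifier to a set of (attribute, value) pairs, and relations such as wasAttributedTo are left out.

```python
def merge_descriptions(*docs):
    """Union the attributes of descriptions that share an identifier.

    Each document is modelled as {identifier: set of (attribute, value) pairs}.
    """
    merged = {}
    for doc in docs:
        for ident, attrs in doc.items():
            # One description per identifier, carrying all attributes seen so far.
            merged.setdefault(ident, set()).update(attrs)
    return merged


# Simplified stand-ins for doc1 and doc2 (elements only, no relations).
doc1 = {"ex:e1": {("ex:attr1", "val1")}, "ex:ag1": set()}
doc2 = {"ex:e1": {("ex:attr1", "val2")}, "ex:e0": set()}

merged = merge_descriptions(doc1, doc2)
```

In this model, ex:e1 appears once in merged and carries both values of ex:attr1, as in the merged listing above.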

The merge operation becomes interesting in the presence of bundles.

2.2 Merging two documents with distinct bundles

First, we consider two documents with distinct bundles.

We now examine a variant of “doc1”, which contains a bundle bun1. In the illustration, the bundle is represented by a rectangle, which contains a description of e2 generated by a2.

doc1 with bundle bun1

The second document is a variant of “doc2” with another bundle named bun2. It contains a description of e3 generated by a3.

doc2 with bundle bun2

After merging the two documents, we obtain a new document containing both bun1 and bun2.

doc1 with bundle bun1 merged with doc2 with bundle bun2

Since the two bundles have different names, they are kept distinct in the merged document.

Concretely, here is the first document, with bundle bun1:

document
 prefix ex <http://example.org/#>

 entity(ex:e1,[ex:attr1="val1"])
 agent(ex:ag1)
 wasAttributedTo(ex:e1, ex:ag1)

 bundle ex:bun1
   entity(ex:e2)
   activity(ex:a2,-,-)
   wasGeneratedBy(ex:e2,ex:a2,-)
 endBundle

endDocument

The second document, with bundle bun2:

document
 prefix ex <http://example.org/#>

 entity(ex:e1,[ex:attr1="val2"])
 entity(ex:e0) 
 wasDerivedFrom(ex:e1, ex:e0)

 bundle ex:bun2
   entity(ex:e3)
   activity(ex:a3,-,-)
   wasGeneratedBy(ex:e3,ex:a3,-)
 endBundle

endDocument

The merged document, with its two bundles, is as follows:

document
 prefix ex <http://example.org/#>

 entity(ex:e1,[ex:attr1 = "val1" %% xsd:string, ex:attr1 = "val2" %% xsd:string])
 entity(ex:e0)
 agent(ex:ag1)
 wasDerivedFrom(ex:e1, ex:e0)
 wasAttributedTo(ex:e1, ex:ag1)

 bundle ex:bun2
  entity(ex:e3)
  activity(ex:a3,-,-)
  wasGeneratedBy(ex:e3,ex:a3,-)
 endBundle

 bundle ex:bun1
  entity(ex:e2)
  activity(ex:a2,-,-)
  wasGeneratedBy(ex:e2,ex:a2,-)
 endBundle
endDocument

2.3 Merging and flattening two documents with distinct bundles

We can optionally use the -flatten option to “remove” bundles and “pour” their contents into the surrounding document.

provconvert -merge doc1b1-2b2-listing.txt -flatten -outfile target/doc1b1-2b2-flatten.provn

The resulting document no longer contains bundles.

Merge and flatten of doc1 with bundle bun1 and doc2 with bundle bun2

document
 prefix ex <http://example.org/#>

 entity(ex:e1,[ex:attr1 = "val1" %% xsd:string, ex:attr1 = "val2" %% xsd:string])
 entity(ex:e0)
 agent(ex:ag1)
 wasDerivedFrom(ex:e1, ex:e0)
 wasAttributedTo(ex:e1, ex:ag1)
 entity(ex:e3)
 activity(ex:a3,-,-)
 wasGeneratedBy(ex:e3,ex:a3,-)
 entity(ex:e2)
 activity(ex:a2,-,-)
 wasGeneratedBy(ex:e2,ex:a2,-)
endDocument
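The flattening step can be pictured with the following Python sketch. It is a simplified model and not the ProvToolbox implementation: the function flatten is hypothetical, statements are treated as opaque PROV-N strings, and the sketch therefore does not merge the attributes of descriptions that share an identifier.

```python
def flatten(top_level, bundles):
    """Pour the statements of every bundle into the top-level statements."""
    flat = set(top_level)
    for statements in bundles.values():
        flat |= statements  # bundle boundaries disappear, statements remain
    return flat


# Top-level statements of the merged document, copied from the listing above.
top_level = {
    'entity(ex:e1,[ex:attr1 = "val1" %% xsd:string, ex:attr1 = "val2" %% xsd:string])',
    "entity(ex:e0)",
    "agent(ex:ag1)",
    "wasDerivedFrom(ex:e1, ex:e0)",
    "wasAttributedTo(ex:e1, ex:ag1)",
}
bundles = {
    "ex:bun1": {"entity(ex:e2)", "activity(ex:a2,-,-)",
                "wasGeneratedBy(ex:e2,ex:a2,-)"},
    "ex:bun2": {"entity(ex:e3)", "activity(ex:a3,-,-)",
                "wasGeneratedBy(ex:e3,ex:a3,-)"},
}

flat = flatten(top_level, bundles)
```

The resulting set contains the same eleven statements as the flattened document, with no bundle structure left.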

2.4 Merging two documents with the same bundle

Now, let us consider a variant of “doc2” with a bundle bun1, the same identifier as the bundle we had in the “doc1” variant. In the figure, we see that bundle bun1 contains a description of a2 generating e3.

A variant of doc2 with bundle named bun1

If we now merge doc1 with bundle bun1 and doc2 with bundle bun1, the merge procedure merges the descriptions contained in the two instances of bundle bun1. We obtain:

doc1 with bundle bun1 merged with doc2 with bundle bun1
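The difference between merging distinct bundles and merging bundles with the same name can be sketched as follows. This is a hedged illustration rather than ProvToolbox code: merge_bundles is a hypothetical function, statements are opaque PROV-N strings, and the content of the doc2 variant's bundle bun1 is assumed from the figure description (a2 generating e3), since its listing is not shown in this tutorial.

```python
def merge_bundles(*docs):
    """Merge bundles by name: same-named bundles are unioned, others kept apart.

    Each document is modelled as {bundle name: set of PROV-N statement strings}.
    """
    merged = {}
    for bundles in docs:
        for name, statements in bundles.items():
            merged.setdefault(name, set()).update(statements)
    return merged


doc1_bun1 = {"ex:bun1": {"entity(ex:e2)", "activity(ex:a2,-,-)",
                         "wasGeneratedBy(ex:e2,ex:a2,-)"}}
doc2_bun2 = {"ex:bun2": {"entity(ex:e3)", "activity(ex:a3,-,-)",
                         "wasGeneratedBy(ex:e3,ex:a3,-)"}}
# Assumed content of the doc2 variant: bundle bun1 with a2 generating e3.
doc2_bun1 = {"ex:bun1": {"entity(ex:e3)", "activity(ex:a2,-,-)",
                         "wasGeneratedBy(ex:e3,ex:a2,-)"}}

distinct = merge_bundles(doc1_bun1, doc2_bun2)  # two bundles, kept separate
same = merge_bundles(doc1_bun1, doc2_bun1)      # one bundle, contents unioned
```

In the second case, the single bundle bun1 holds the statements of both instances, and the activity a2, which both instances describe, appears only once.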

2.5 Merging and flattening two documents with the same bundle

If, in addition, we specify the -flatten option, the merge and flatten operations result in the following document.

doc1 with bundle bun1 merged with doc2 with bundle bun1, after flattening

3. Conclusion

As our applications generate provenance incrementally, bundle by bundle, the ability to merge documents and collapse bundles has become critical. This functionality is implemented by ProvToolbox in the method IndexedDocument.merge(). This tutorial has shown that it is also directly available from the command line, using the provconvert utility.

What form of processing do you regularly perform on your provenance graphs? Which functionality would you like to see added to ProvToolbox? Tell us on the GitHub issue tracker, which is also the place to raise any other issue related to ProvToolbox.

4. Appendix. Change Log

  • Original version submitted on 2015/07/27