Provenance of Publications: A PROV style for LaTeX

1. Provenance is still Challenging

Isn’t it frustrating that it is still hard to generate the provenance of our documents?

Isn’t it still challenging for our research community to demonstrate best practice?

While the provenance community has made substantial progress in terms of understanding and standardising provenance, it is an unfortunate reality that, due to the lack of easy tools, provenance still remains beyond the reach of the general public.

It is for me a great frustration that provenance of my papers cannot be generated automatically. Thus, I can’t demonstrate best practice to the community.

2. A Style for LaTeX

If we look at publications, we can see that they already contain a lot of provenance information in textual form, but this information is not made accessible in a machine-processable format. Given that I use LaTeX for many of my publications, I have developed prov.sty – a LaTeX style that generates provenance information on the basis of annotations inserted in the source of the document. In this blog post, I show the LaTeX annotations supported by prov.sty, and the type of provenance they generate.

3. LaTeX Annotations

I will use a running example taken from a recent paper, “The Rationale of PROV”, which I co-authored with Paul, James, Tim and Simon.

Luc Moreau, Paul Groth, James Cheney, Timothy Lebo, and Simon Miles. The Rationale of PROV. Web Semantics: Science, Services and Agents on the World Wide Web, 2015, doi: 10.1016/j.websem.2015.04.001, available under CC BY license (http://creativecommons.org/licenses/by/4.0/).

3.1 Authors, Organizations, Title, …

Of course, a publication has a title, authors, and their affiliations.

Document Title, Authors and Organizations

The following LaTeX macros allow us to annotate

  • an author’s name with a URI using \provAuthor
  • an organization’s name with a URI using \provOrganization, and
  • a title using \provTitle

Lines 1-2 show how my name is marked up with \provAuthor and my ORCID URI. Likewise, lines 4-5 show how my institution’s name is marked up with \provOrganization and its web site’s URI. Finally, the title is marked up with \provTitle.

\provAuthor {Luc Moreau}
            {http://orcid.org/0000-0002-3494-120X}

\provOrganization {University of Southampton}
                  {http://www.soton.ac.uk/}

\provTitle {The Rationale of PROV}

As far as LaTeX is concerned, these annotations are macros which expand into their first argument, discarding the others, if any.
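To make that expansion behaviour concrete, here is a minimal sketch of fallback definitions with the same visible effect. These are hypothetical and are not the actual prov.sty code (which also has to record the provenance); they only illustrate "expand into the first argument, discard the rest", and could let a co-author without prov.sty still typeset the document.

```latex
% Hypothetical fallback definitions: NOT the actual prov.sty code.
% \providecommand defines a macro only if it does not already exist,
% so these lines have no effect once prov.sty has defined the macros.
\providecommand{\provAuthor}[2]{#1}        % #1 = name,  #2 = URI (discarded)
\providecommand{\provOrganization}[2]{#1}  % #1 = name,  #2 = URI (discarded)
\providecommand{\provTitle}[1]{#1}         % #1 = title
```

With these in the preamble, \provAuthor{Luc Moreau}{http://orcid.org/0000-0002-3494-120X} simply typesets "Luc Moreau".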

The resulting provenance is illustrated below. At the bottom, we see a yellow ellipse, with uniquely generated identifier 20892220-a071-4ef3-a799-3056447ec8a2; it has an attribute — the title. This entity is the publication entitled “The Rationale of PROV”. It is attributed to two agents, myself and the University of Southampton.

Authorship of Document

The following Turtle excerpt shows that the provided URIs are used in the description of the Person “Luc Moreau” and the Organization “University of Southampton”, both agents. Attribution of the document to the agent is by means of the property prov:wasAttributedTo.

<http://orcid.org/0000-0002-3494-120X> 
  a prov:Agent, prov:Person;
  foaf:name "Luc Moreau" . 

<http://www.soton.ac.uk/>
  a prov:Agent, prov:Organization;
  foaf:name "University of Southampton" . 

doc:20892220-a071-4ef3-a799-3056447ec8a2
  a prov:Entity ;
  schema:headline "The Rationale of PROV" ;
  prov:wasAttributedTo <http://orcid.org/0000-0002-3494-120X> ;
  prov:wasAttributedTo <http://www.soton.ac.uk/> .

Every time LaTeX typesets the document, a new identifier is generated in place of doc:20892220-a071-4ef3-a799-3056447ec8a2.

3.2 Projects, Funding agencies, …

Many publications include an acknowledgement section listing the projects and funding agencies that sponsored the work.

Acknowledgement to Projects and Funding Agencies

The following LaTeX macro, \provProject, allows us to annotate a project’s name with two URIs: one for the project and one for the funding agency.

\provProject
  {SOCIAM (EP/J017728/1)}
  {http://www.sociam.org/}
  {http://www.epsrc.ac.uk/}

The resulting provenance is illustrated below. At the bottom, we see the same entity 20892220-a071-4ef3-a799-3056447ec8a2 for the publication entitled “The Rationale of PROV”. It is attributed to the project, itself funded by the funding agency.

Project and Funding Agency

The following Turtle excerpt shows an attribution of the document to the project by means of the property prov:wasAttributedTo, and that the project was sponsored by the funding agency, encoded with the property prov:actedOnBehalfOf.

<http://www.epsrc.ac.uk/> 
  a prov:Agent.

<http://www.sociam.org/> 
  a prov:Agent;
  foaf:name "SOCIAM (EP/J017728/1)" ; 
  prov:actedOnBehalfOf <http://www.epsrc.ac.uk/> .

doc:20892220-a071-4ef3-a799-3056447ec8a2
  a prov:Entity ;
  prov:wasAttributedTo <http://www.sociam.org/> .

3.3 Bibliography

As far as the bibliography is concerned, very little work is required.

Bibliography

The usual LaTeX commands \bibliography and \bibliographystyle need to be preceded by \provBibliography, declaring that provenance needs to be generated for bibliographical entries.

\provBibliography
\bibliographystyle{elsarticle}
\bibliography{rationale}

For this to work, each bibliography entry needs to have a URI or DOI associated with it. We do this by adding a url or doi field to each BibTeX entry.

@TechReport{prov-dm:20130430,
  author = {Luc Moreau and Paolo {Missier (eds.)} …},
  title = {PROV-DM: The PROV Data Model},
  institution = {World Wide Web Consortium},
  year = {2013},
  type = {W3C Recommendation},
  number = {REC-prov-dm-20130430},
  month = oct,
  url = {http://www.w3.org/TR/2013/REC-prov-dm-20130430/}}

The resulting provenance is illustrated below. At the bottom, we see the same entity 20892220-a071-4ef3-a799-3056447ec8a2 for the publication entitled “The Rationale of PROV”. It was derived from the cited document.

The paper “The Rationale of PROV” cites the “PROV-DM Recommendation”

The following Turtle code shows a derivation from the document to the cited publication using prov:wasDerivedFrom.

doc:20892220-a071-4ef3-a799-3056447ec8a2 
  prov:wasDerivedFrom
  <http://www.w3.org/TR/2013/REC-prov-dm-20130430/> . 

3.4 Included Figures

The provenance of included figures can also be expressed.

A Figure Included from the PROV-O specification

The LaTeX macro \includegraphics can include a file (e.g. PDF or JPEG). It can now also generate the provenance of this inclusion: the current document is said to be derived from the included resource.

The included resource is a file on the file system, so a third party would typically not be able to access it directly. For this reason, the macro \provResource allows an online resource, a copy of the included file, to be declared.

\provResource{http://www.w3.org/TR/2013/REC-prov-o-20130430/diagrams/starting-points.svg}
\includegraphics{starting-points.png}

Thus, the provenance of this inclusion is modelled as follows: the current document was derived
from the included resource, itself an alternate of the online resource. For a third party to be able to check that the online
resource is a copy of the included one, prov.sty computes the md5 hash of the included file.

doc:20892220-a071-4ef3-a799-3056447ec8a2
  prov:wasDerivedFrom 
    inc:20892220-a071-4ef3-a799-3056447ec8a2-1. 

inc:20892220-a071-4ef3-a799-3056447ec8a2-1 
  a prov:Entity ;
  schema:contentLocation <starting-points.png> ;
  prov:alternateOf <http://www.w3.org/TR/2013/REC-prov-o-20130430/diagrams/starting-points.svg> ;
  crypto:md5 "1727ca12ed150ec814e3475859d7b362" . 

In this specific example, the online resource is an SVG file, whereas the included file is a PNG. Thus, the md5 hash cannot be used to check that they are identical.
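For readers who want to reproduce such a hash themselves, the following is a sketch of one way to compute it from within pdfLaTeX, using the pdfTeX primitive \pdfmdfivesum. This post does not show how prov.sty computes its hash, so treat this as an independent illustration rather than the style's own method; the letter case of the printed digest may also differ from the one in the Turtle above.

```latex
\documentclass{article}
\begin{document}
% \pdfmdfivesum is a pdfTeX primitive, so compile this with pdflatex.
% With the "file" keyword it expands to the MD5 digest of the file's
% contents; it expands to nothing if the file cannot be found.
MD5 of the included figure:
\texttt{\pdfmdfivesum file {starting-points.png}}
\end{document}
```

A third party can run the same check on a downloaded copy of the online resource and compare the two digests, provided both files are in the same format.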

4. Embedding Provenance

The macro \provEmbed allows for metadata about the provenance to be inserted in the PDF document, using the XMP metadata format. This command is expected to be called as the last macro before the end of the document.

\provLocation{http://eprints.soton.ac.uk/375233/7/provenance.ttl}
\provEmbed
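To show where these two commands sit relative to the earlier annotations, here is a minimal document skeleton. It is a sketch under assumptions: that prov.sty from the project repository is on the TeX search path and can be loaded with a plain \usepackage{prov}; the style may require options or additional packages not shown here.

```latex
\documentclass{article}
% Assumption: prov.sty is installed and loads without options.
\usepackage{prov}

\begin{document}

% Annotations from the earlier sections go in the body as usual.
\provTitle{The Rationale of PROV}

\provAuthor{Luc Moreau}{http://orcid.org/0000-0002-3494-120X}

% ... rest of the paper ...

% Last macros before \end{document}, as described above.
\provLocation{http://eprints.soton.ac.uk/375233/7/provenance.ttl}
\provEmbed

\end{document}
```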

XMP supports a subset of RDF/XML that does not appear to be expressive enough to embed PROV provenance directly. Instead, following the approach recommended by PROV-AQ, a pointer to the provenance is expressed in the XMP format. The location itself is specified by the LaTeX command \provLocation: http://eprints.soton.ac.uk/375233/7/provenance.ttl.

<rdf:RDF>
  <rdf:Description rdf:about=""
                   xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
    <xmpMM:DocumentID>uuid:3c59bdaa-dbf1-a740-963a-7c266e65f7b2</xmpMM:DocumentID>
    <xmpMM:InstanceID>uuid:085e7f83-2095-4342-8a5b-57b0d87f5715</xmpMM:InstanceID>
  </rdf:Description>
  <rdf:Description rdf:about=""
                   xmlns:prov="http://www.w3.org/ns/prov#">
    <prov:alternateOf rdf:resource="http://openprovenance.org/documents#20892220-a071-4ef3-a799-3056447ec8a2"/>
    <prov:has_anchor rdf:resource="http://openprovenance.org/documents#20892220-a071-4ef3-a799-3056447ec8a2"/>
    <prov:has_provenance rdf:resource="http://eprints.soton.ac.uk/375233/7/provenance.ttl"/>
  </rdf:Description>
</rdf:RDF>

Using the LaTeX command \provBanner, it is also possible to generate a textual description of where the provenance is accessible.

A textual description of where the provenance is located

5. prov.sty: a GitHub project

With this blog post, I have shown that it is possible to lower PROV’s barrier to adoption by adapting tools to generate provenance automatically. For those tools to be useful, they need to generate provenance systematically, for every created artifact. Over time, as similar tools are developed, their provenance should be linked up. For instance, the git2prov converter is capable of exporting PROV from Git. It should be possible for users to seamlessly navigate the provenance generated by both tools.

The LaTeX style prov.sty is still a proof of concept, but I feel that it is time to release it and have others use it. Improving usability, enhancing the quality of the provenance, and strengthening the LaTeX integration are all desirable.

prov.sty is available at https://github.com/prov-suite/prov-sty under the MIT Open Source license.

Pull requests are welcome: let’s make it a community effort to develop prov.sty.

github project for prov.sty

For a more complete description of prov.sty, please see:

Moreau, Luc and Groth, Paul (2015) Provenance of Publications: A PROV style for LaTeX. In the Seventh USENIX Workshop on the Theory and Practice of Provenance (TAPP’15), USENIX. URI: http://eprints.soton.ac.uk/378019/
