Git as Cryptographically Tamperproof File Archive using Chained RFC3161 Timestamps

Matthias Bühlmann
Published in The Startup · 14 min read · Feb 22, 2021

There are situations in which you either want, or are required, to store data in a way that lets you prove unequivocally when exactly that data was stored or altered.

In this article I will explain what RFC3161 timestamps are and what problem they solve, and will then go on to explain why Git™ provides the perfect architecture to be used together with such timestamps in order to create a tamperproof data repository that can provide the following properties:

  • Authenticity: trusted, non-refutable time when data was added or changed
  • Integrity: protection of the timestamped data as well as the timestamps themselves from tampering without detection
  • Timeliness: proof that the time of digital signatures on data was in fact the actual time
  • An evidentiary trail of authenticity for legal sufficiency

RFC3161

RFC3161 (and its extension RFC5816) is an internet standard that defines protocols and data formats for the issuance of trusted timestamps in a public key infrastructure (PKI).

In particular, it defines a Timestamping Authority (TSA): a trusted third-party server from which clients can request Timestamp Tokens (TST) for data that they want to be able to prove existed at a certain point in time and has not been changed since.

Graphic describing the process of requesting a timestamp token from a Timestamping Authority (TSA) (Original by Bart Van den Bosch, vector by Tsuruya, CC BY-SA 2.0 be)

A timestamp token is a CMS ContentInfo object, which contains the data hash plus the timestamp (as determined by the TSA) and is signed using the private key of the TSA certificate.

Since only the hash (also called the digest) of the data is sent to the TSA, the data itself remains secret. The client can nevertheless prove to anyone who trusts this particular TSA that the data existed at the time certified by the timestamp and that it has not been tampered with since.
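To make the "only the hash leaves your machine" point concrete, here is a minimal sketch that hand-assembles the DER bytes of an RFC3161 TimeStampReq for a SHA256 digest. The helper names (`der`, `build_timestamp_request`) are my own for illustration, and the request deliberately omits the optional policy and nonce fields, which a real client would normally want to set.

```python
import hashlib

def der(tag: int, content: bytes) -> bytes:
    """Encode one DER element; short-form lengths only (content < 128 bytes)."""
    assert len(content) < 128
    return bytes([tag, len(content)]) + content

# DER content bytes of the SHA-256 object identifier 2.16.840.1.101.3.4.2.1
SHA256_OID = bytes.fromhex("608648016503040201")

def build_timestamp_request(data: bytes) -> bytes:
    """Build a minimal TimeStampReq: version, messageImprint, certReq."""
    digest = hashlib.sha256(data).digest()
    algorithm = der(0x30, der(0x06, SHA256_OID) + der(0x05, b""))  # AlgorithmIdentifier
    imprint = der(0x30, algorithm + der(0x04, digest))             # MessageImprint
    version = der(0x02, b"\x01")                                   # INTEGER 1
    cert_req = der(0x01, b"\xff")                                  # certReq = TRUE
    return der(0x30, version + imprint + cert_req)

document = b"confidential business record"
request = build_timestamp_request(document)

# The request carries the 32-byte digest, never the document itself.
print(len(request), "bytes:", request.hex())
```

Such a request is what gets sent to the TSA over HTTP with the content type `application/timestamp-query`; the TSA answers with a response that contains the signed token.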

RFC3161 tokens therefore can be used to create an unequivocal proof that for example deadlines were met, that unpublished intellectual property already existed before a certain date, that some transactions took place before others, that a photograph has been taken before some event or that business records have been digitally archived at a specific time and have not been altered since.

RFC5816 is a minor extension to RFC3161. All it does is extend the way the token identifies the TSA certificate. In RFC3161 this is always done using the SHA1 hash of the DER-encoded TSA certificate. In RFC5816 that hash algorithm can be chosen freely. It’s important to note that this is NOT the hashing algorithm used for the timestamped hash (for that purpose, SHA1 should no longer be considered secure enough), but only the hash that identifies the TSA certificate. Since any second pre-image certificate would also need to be signed by a trusted CA, the weakened security of SHA1 does not open a viable attack vector there, and RFC5816 merely generalizes the specification so that it does not depend on any particular hashing algorithm.

Here are a couple of public TSA providers:

Example: PDF

There are defined standards (such as the PDF specification itself as well as restrictions and extensions defined in PAdES) as to how digital signatures (and timestamps) can be embedded into the PDF container format.

If a PDF document is digitally signed using a private key that has been issued by a trusted Certificate Authority (CA), then one can be sure that the document has indeed been signed by the entity indicated in the signature.

Adobe Reader will verify that the embedded signature matches the data of the PDF and display the signature properties:

It’s important to note that while the signature does include a timestamp and the document has provably not been altered after that signature was applied, one cannot trust that the indicated Signing Time is indeed the time when the signature was created. This is because the signing entity can freely choose the timestamp that is indicated in the signature. Adobe Reader points out this fact by saying “Signing time is from the clock on the signer’s computer”.

This is where RFC3161 tokens come into play. By calculating a secure hash of the document including the applied signature, sending that hash to a trusted TSA and then embedding the received token into the PDF, there is now an authentic timestamp that can be trusted (so long as one trusts the issuing TSA):

Now Adobe Reader indicates that there is a trusted timestamp embedded, and the certificate this timestamp was signed with can be further checked to decide whether the TSA that issued this timestamp should be trusted or not.

For How Long Does an RFC3161 Timestamp Remain Valid?

Generally speaking, an RFC3161 timestamp can be considered an authentic proof that the timestamped data existed at the indicated time, so long as the following conditions hold:

  • One trusts the TSA that issued the token (or any entity higher up in the trust chain of the TSA certificate)
  • The hash contained in the signed part of the token matches the hash of the timestamped data
  • The hashing algorithm used is still considered secure (for example, SHA1 should not be considered secure anymore, especially for timestamping purposes)
  • The private key of the TSA certificate (or any entity higher up in the trust chain) did not get compromised
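The second and third conditions can be checked purely locally. The following is a minimal sketch of such a check; the function name and the set of algorithms treated as weak are assumptions for illustration. The first and fourth conditions require certificate chain validation and revocation checking and are out of scope here.

```python
import hashlib

# Algorithms this sketch refuses for timestamping purposes (an assumption).
WEAK_ALGORITHMS = {"md5", "sha1"}

def imprint_is_acceptable(data: bytes, algorithm: str, imprint: bytes) -> bool:
    """True if the algorithm is not weak and the token's digest matches the data."""
    if algorithm.lower() in WEAK_ALGORITHMS:
        return False
    return hashlib.new(algorithm, data).digest() == imprint

record = b"archived business record"
good_imprint = hashlib.sha256(record).digest()

print(imprint_is_acceptable(record, "sha256", good_imprint))         # matching digest
print(imprint_is_acceptable(record + b"!", "sha256", good_imprint))  # altered data
```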

The last point is critical:

While a timestamp can be considered valid long beyond the expiration date of the TSA certificate that signed it, if that signing certificate gets compromised at any point, none of the tokens it signed can be trusted anymore. They all lose their validity, including tokens issued before the compromise (unless they have been additionally protected, see below).

This in turn means that one can only trust a timestamp token for as long as one can check the revocation status of the TSA certificate (if the certificate has been revoked, the revocation must carry the reasonCode extension, and the reason must be one of the four ‘acceptable’ reasons listed in chapter 4.1 of the RFC3161 specification). If the TSA ceases to exist or its CA otherwise stops publishing the certificate revocation status (in the form of CRLs or OCSP responses), one cannot trust the issued timestamp anymore.

So, for how long beyond the expiration date of a TSA certificate will the CA provide revocation status? That depends on the TSA.

For example, the four nationally recognized issuers of certification services in Switzerland (Swisscom AG, QuoVadis Trustlink Schweiz AG, SwissSign AG as well as Bundesamt für Informatik und Telekommunikation), which are also certified to act as Qualified Timestamping Authorities (QTSA), are required by Swiss law (Art.9 Abs.3 VZertES) to provide revocation status of these TSA certificates for at least 11 years beyond the expiration of each respective TSA certificate.

But this of course is a legislative solution to the problem, not a technological one. What happens after these 11 years? What if the TSA you want to use is not bound by similar laws? What if the government is overthrown tomorrow and with it all its laws? What good does it do to add a cryptographically secure, trusted timestamp to your data if you can’t rely on being able to prove its authenticity in a couple of years from now?

The technological answer to this problem is called Long Term Validation.

LTV: Long Term Validation

There are different flavors of LTV in different cryptographic standards, but the general idea is the following:

A digital PKI signature can only be considered valid for as long as the certificate that was used to create it is valid. If the signing certificate expires or gets revoked, the signature becomes invalid (since it’s possible that the certificate got compromised and now anyone in possession of the private key could forge the signature).

However, if one “LTV enables” the signature by retrieving a proof of validity for that signature while it is still considered valid, and then timestamps the signature together with that proof using a trusted timestamp, the signature can be considered authentic for as long as the timestamp can be trusted, even if the signing certificate expires or gets revoked at a later date.

By embedding LTV data — in particular the entire certificate chain plus a currently valid CRL or OCSP response for the signing certificate — into a PDF BEFORE timestamping it, the signature becomes “LTV enabled”:

It’s important to note that the signature’s lifetime however is only extended to the lifetime of the timestamp token. If the token loses its validity and the signature is past its expiration, then the signature won’t be valid anymore either.

So, How Does One LTV Enable a Timestamp?

Answer: By timestamping it.

If one has a document with a timestamp token that is currently considered valid, retrieves proof of the timestamp’s validity (in the form of currently valid CRLs or OCSPs for the TSA certificate and its trust chain) and then timestamps the document (or even just the older timestamp token itself) together with that LTV data using a new timestamp token, then that older timestamp’s lifetime gets extended to whatever the lifetime of this new timestamp is.

If both timestamp tokens are signed by the same TSA certificate, nothing is gained. However, TSAs change their signing certificate every now and then, and a different TSA can be used for the new timestamp. Once the old timestamp is timestamped with a new trusted timestamp that is signed by a different TSA certificate, it is protected from becoming invalid: even if the private key of the older timestamp’s TSA certificate were to leak, the older timestamp could still be trusted so long as the newer timestamp can be trusted (which, as you probably see, becomes a transitive property).

So the answer is to timestamp again (while also including LTV data for the older timestamp) while the older timestamp is still valid.
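The transitive property can be illustrated with a toy model. This is only a sketch under a simplifying assumption: every newer timestamp was issued, together with LTV data, while its predecessor could still be validated. The names `Stamp` and `trustworthy` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class Stamp:
    tsa_cert: str  # identifier of the TSA certificate that signed this token

def trustworthy(chain: List[Stamp], compromised: Set[str]) -> List[bool]:
    """chain is ordered oldest to newest; each stamp also covers its predecessor."""
    result = [False] * len(chain)
    protected_by_newer = False
    # Walk from the newest stamp backwards: a stamp can be trusted if its own
    # certificate is intact, or if a trustworthy newer stamp covers it.
    for i in range(len(chain) - 1, -1, -1):
        own_ok = chain[i].tsa_cert not in compromised
        result[i] = own_ok or protected_by_newer
        protected_by_newer = result[i]
    return result

# Old certificate A leaks, but a later stamp from certificate B covers both.
rescued = trustworthy([Stamp("A"), Stamp("A"), Stamp("B")], {"A"})
# Re-stamping with the same certificate gains nothing once it leaks.
same_cert = trustworthy([Stamp("A"), Stamp("A")], {"A"})
print(rescued, same_cert)
```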

This can become very cumbersome however if one has thousands of documents that would all need to be individually timestamped again now and then just to keep them validatable. Also, while such a document has provably existed already at the timestamped time and has not been tampered with since, it’s still possible to “lose” that timestamped document (so timestamping in that manner doesn’t really prevent the unnoticed removal of data if it happens in chunks of entire documents).

The solution to this logistics challenge is to use a Merkle tree.

Merkle Tree

A Merkle tree is a tree structure (not necessarily a binary tree as in the graphic below) where each node is labeled with a hash that is derived from the hashes of its child nodes.

It is possible to timestamp multiple documents with a single timestamp by creating a Merkle tree of the documents’ hashes.

Schematic of a classical merkle tree (Original illustration by David Göthberg, Sweden, released to public domain as a PNG here: https://commons.wikimedia.org/wiki/File:Hash_tree.png. Converted to SVG by User:Azaghal)

For example, if L1-L4 are different documents and one now retrieves a timestamp for the Top Hash, this timestamp can be used to prove the existence of all 4 documents at the time of the timestamp. It also means that no document can be removed from the archive unnoticed, because the Top Hash could then no longer be reproduced.

Some care must be given to how the hash values are generated in order to not allow for a second pre-image attack on the structure, but if this is done, an arbitrary number of documents can be securely timestamped with a single trusted timestamp. This also means that only a single timestamp must be re-stamped now and then to extend its lifetime, which is a lot less cumbersome than re-stamping every document in a huge archive individually.
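One common way to take that care is domain separation: leaves and inner nodes are hashed with different prefixes, so a concatenation of two child hashes can never be passed off as a document. The sketch below borrows the prefix scheme from Certificate Transparency (RFC 6962); it is not how Git hashes its objects, and the function names are my own.

```python
import hashlib
from typing import List

def leaf_hash(document: bytes) -> bytes:
    """Leaves are hashed with a 0x00 prefix."""
    return hashlib.sha256(b"\x00" + document).digest()

def node_hash(left: bytes, right: bytes) -> bytes:
    """Inner nodes are hashed with a 0x01 prefix."""
    return hashlib.sha256(b"\x01" + left + right).digest()

def merkle_root(documents: List[bytes]) -> bytes:
    """Top hash over the documents, splitting at the largest power of two < n."""
    n = len(documents)
    if n == 0:
        return hashlib.sha256(b"").digest()
    if n == 1:
        return leaf_hash(documents[0])
    k = 1
    while k * 2 < n:
        k *= 2
    return node_hash(merkle_root(documents[:k]), merkle_root(documents[k:]))

docs = [b"L1", b"L2", b"L3", b"L4"]
top_hash = merkle_root(docs)

# An attacker presenting two "documents" made of concatenated child hashes
# does not reproduce the top hash, thanks to the differing prefixes.
forged = [leaf_hash(b"L1") + leaf_hash(b"L2"), leaf_hash(b"L3") + leaf_hash(b"L4")]
print(top_hash.hex(), merkle_root(forged) == top_hash)
```

Only `top_hash` would need to be sent to the TSA, regardless of how many documents the archive holds.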

Leveraging Git’s Merkle Tree Design

Finally we get to the point of this article: Git

Understanding the significance of Merkle trees for timestamps makes it clear why Git is the perfect candidate for a timestamped, tamperproof data archive, because a Git repository is inherently designed as a Merkle-tree-like structure where version hashes depend on ALL data and ALL previous changes to that data.

Git was not originally designed as a full-fledged SCM system. The focus was rather on creating a revisioned filesystem in which all data that is put into it would be preserved bit-perfectly, and in which not a single bit could flip unnoticed.

In many ways you can just see git as a filesystem - it’s content-addressable, and it has a notion of versioning, but I really really designed it coming at the problem from the viewpoint of a _filesystem_ person (hey, kernels is what I do), and I actually have absolutely _zero_ interest in creating a traditional SCM system.
-- Linus Torvalds

That’s also why the command git fsck has that name, fsck being the UNIX system tool for filesystem checks. Git implements a revisioned filesystem in which the Merkle tree structure allows the detection of any bitflip in any revision of the data.

The Git history is stored in such a way that the ID/hash of a particular version (a commit in Git terms) depends upon the complete development history leading up to that commit. Once it is published, it is not possible to change the old versions without it being noticed. The structure is similar to a Merkle tree, but with additional data at the nodes as well as the leaves.

The property that not a single bit in the stored data (or the history of the changes to that data) can be changed without completely altering all subsequent commit hashes is exactly the property that makes it interesting for a timestamped data archive.
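This property can be reproduced in a few lines. The sketch below computes Git object IDs the way Git does (a hash over a `<type> <size>` header, a NUL byte and the object body) and then shows that tampering with a file in the first commit changes the ID of the commit that follows it. It is a simplified model: the tree holds only regular files in a flat directory, and the author identity and timestamp are made-up placeholder values.

```python
import hashlib
from typing import Dict, Optional

def git_object_id(kind: str, body: bytes, algo: str = "sha1") -> str:
    """Object ID = hash of '<kind> <size>', a NUL byte, then the object body."""
    header = f"{kind} {len(body)}".encode() + b"\x00"
    return hashlib.new(algo, header + body).hexdigest()

def tree_body(files: Dict[str, bytes]) -> bytes:
    """Serialize a flat directory: '<mode> <name>', NUL, raw blob ID per file."""
    body = b""
    for name in sorted(files):
        blob_id = git_object_id("blob", files[name])
        body += b"100644 " + name.encode() + b"\x00" + bytes.fromhex(blob_id)
    return body

def commit_id(files: Dict[str, bytes], parent: Optional[str], message: str) -> str:
    """A commit references its tree and its parent commit by their IDs."""
    lines = ["tree " + git_object_id("tree", tree_body(files))]
    if parent:
        lines.append("parent " + parent)
    ident = "A U Thor <author@example.com> 1613952000 +0000"  # placeholder identity
    lines += ["author " + ident, "committer " + ident, "", message, ""]
    return git_object_id("commit", "\n".join(lines).encode())

first = commit_id({"a.txt": b"v1\n"}, None, "first")
second = commit_id({"a.txt": b"v2\n"}, first, "second")

# Flip one character in the OLD revision and rebuild the history on top of it.
tampered_first = commit_id({"a.txt": b"V1\n"}, None, "first")
tampered_second = commit_id({"a.txt": b"v2\n"}, tampered_first, "second")
print(second, tampered_second)
```

The two printed IDs differ even though the newest file contents are identical, which is why timestamping the latest commit covers everything before it. Passing `algo="sha256"` to `git_object_id` hashes the same framing with SHA256 instead.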

Instead of (or in addition to) publishing the data by pushing it to a public repository, we can retrieve an RFC3161 token from a trusted TSA to timestamp a commit hash and then store that token in the Git repository.

If we additionally retrieve LTV data for the previous timestamp and add it to a commit before calculating the hash for the new timestamp, this iteratively extends the lifetime of timestamps arbitrarily into the future. As long as a new timestamped commit is added to the repository while the last timestamp can still be validated (which is usually the case for many years after it was issued), the lifetimes of all older timestamps get extended to the lifetime of the new timestamp token.

By using more than one trusted TSA, it’s also possible to protect the repository against the eventuality of the newest timestamps becoming invalid (in the unlikely case that the TSA’s private key should leak and the certificate get revoked).

The timestamp token can be directly embedded in commit messages using PEM encoding. LTV data could also be embedded in the commit message, but since this data usually does not change that often, it is better to commit the LTV data as revisioned files into the repository and reference them from the commit.

The following schematic visualizes the Merkle tree structure of the Git repository and shows how the timestamping data can be embedded:

On the left side, the Git commit objects are listed, which represent changes to the repository (the one on the bottom is the oldest, the one on the top is the newest). On the horizontal bars all additional files that are added together with that commit are shown. Dark blue are the actual user data (a tree corresponds to a folder, a blob to a file) that should be revisioned and timestamped, light blue are the LTV metadata files.

Each commit/revision references its entire history, by referencing the hash of its parent commit, as well as the current state of the repository by referencing the hash of its tree object.

Thus, by calculating a secure hash that depends on this parent-commit hash as well as the tree hash and then retrieving a trusted RFC3161 token for this hash, the entire state of the repository and its history can be securely timestamped. The timestamp is then attached to the commit as a trailer.
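As a rough sketch of these two steps, the code below derives a digest from a parent-commit hash and a tree hash and wraps an opaque DER token in PEM form so that it can live in a commit message. Everything specific here is hypothetical: the preimage layout, the PEM label and the placeholder token bytes are assumptions for illustration and not the format used by any particular implementation.

```python
import base64
import hashlib

def digest_to_timestamp(parent_hex: str, tree_hex: str) -> bytes:
    """SHA-256 over a labeled encoding of both IDs (layout is hypothetical)."""
    preimage = f"parent:{parent_hex},tree:{tree_hex}".encode()
    return hashlib.sha256(preimage).digest()

def pem_wrap(der_bytes: bytes, label: str) -> str:
    """Base64-encode binary data in 64-character lines between BEGIN/END markers."""
    b64 = base64.b64encode(der_bytes).decode("ascii")
    lines = [b64[i:i + 64] for i in range(0, len(b64), 64)]
    return "\n".join([f"-----BEGIN {label}-----", *lines, f"-----END {label}-----"])

parent = "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"  # example commit ID
tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"    # ID of the empty tree
digest = digest_to_timestamp(parent, tree)

# In a real run the token would be the TSA's response for `digest`;
# here a placeholder byte string stands in for it.
placeholder_token = bytes(range(200))
pem = pem_wrap(placeholder_token, "RFC3161 TOKEN")
print(digest.hex())
print(pem)
```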

As you can see, in this design timestamps are added with separate commits (rather than being embedded right into the commits that contain changes to user data). This is for forward compatibility: the entire commit object is always timestamped, no matter what future versions of Git might add to the commit object. Also, in this graphic two different TSAs are used, so each timestamp-commit contains two tokens from different TSAs.

The following schematic shows a simplified version of the Merkle tree as well as how the LTV data is added:

Each user-commit is followed by a timestamp-commit that adds LTV data for the new and previous timestamps and both the user-commit as well as the LTV-data get timestamped.

So, How to Do This in Practice?

I’m providing a free open-source implementation, called GitTrustedTimestamps, which implements the design discussed in this article as a post-commit hook, which uses OpenSSL for the cryptographic functions.

The implementation consciously doesn’t require any custom binaries and can be installed into any Git repository (SHA1 as well as SHA256) by copying the bash scripts into the .git/hooks folder.

It allows you to configure multiple TSAs, automatically downloads and adds LTV data, and provides a validation script to check the validity of the timestamps contained in the repository.

It is available here https://github.com/MrMabulous/GitTrustedTimestamps

It must be noted that Git as of the current version (2.30.0) still uses SHA1 as the default hashing algorithm. If Git is to be used for its tamperproof properties, one should really not rely on SHA1 anymore. Luckily, Git has been transitioning to SHA256 for quite a while already, and a Git repository can be set to use SHA256 instead of SHA1 by initializing it using
git init --object-format=sha256

While the feature is still labeled ‘experimental’ in the Git documentation and is not enabled by default, this mostly concerns interoperability with SHA1-based repositories, clients, servers and tools built on top of Git that might not yet support the new hashing function (for example, such a repository cannot yet be pushed to GitHub, but it can be pushed to self-hosted Git servers). If the repository is used standalone, with SHA256 enabled, the implementation is rock solid.

TL;DR

This article shows how RFC3161 tokens can be used together with Git to create a tamperproof file repository that automatically creates a cryptographically secure evidence record of all changes to the files contained.

The Merkle tree structure of Git makes it a perfectly suited filesystem for timestamping using RFC3161 timestamps.

An archive created in such a way makes it possible to prove when data was added or changed and that this data has not been tampered with unnoticed.

By the addition of LTV data, the validatability of created timestamps can be arbitrarily extended into the future and be protected from becoming invalid due to TSA compromise or cessation of revocation status provision.

You can add such timestamps automatically through a post-commit hook using https://github.com/MrMabulous/GitTrustedTimestamps

Matthias Bühlmann

Software Engineer, Entrepreneur, Inventor and Philosopher