Open Context Data in GitHub

We’ve recently completed exporting the majority of the data from Open Context to GitHub. For most data in Open Context, we link directly into the GitHub repository where the version history of the XML representation can be seen. Here’s an example: A coin from Domuztepe (the GitHub link follows the thumbnails).

GitHub is mainly a code repository for software projects. However, it’s seeing more use for other applications that need robust version control and transparent development processes. GitHub now serves “open government” applications, including a project that actively tracks changes in US legal code. GitHub also serves many “open science” purposes: mainly source code for scientific analysis software but, increasingly, datasets as well. In fact, we have already found some archaeological data, together with analytic methods, published in GitHub by Thomas Dye.

GitHub has some fascinating potential for sharing archaeological data. GitHub provides robust version control. Changes are tracked and documented so they can be reviewed, and accepted or rejected, by collaborators. This provides more transparency into data manipulations. We have already benefited from this feature, because there was a problem with the initial XML dump that we used to populate the repository in GitHub. Some of our documents did not have proper UTF-8 character encoding (needed to properly represent non-Latin characters). We fixed the output problem and we’re committing updated, better data to the repository.
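For readers who want to check a local copy of XML files for the kind of encoding problem described above, here is a minimal sketch in Python. It is not the script we used to fix our export; the function name find_invalid_utf8 is hypothetical, and the sketch assumes every file is meant to be UTF-8. It only tests whether each file's bytes decode as UTF-8 and does not check that the XML itself is well-formed.

```python
from pathlib import Path


def find_invalid_utf8(paths):
    """Return (path, byte_offset) for each file that is not valid UTF-8.

    Hypothetical helper for illustration; assumes every file should be UTF-8.
    """
    problems = []
    for path in paths:
        data = Path(path).read_bytes()
        try:
            data.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the offset of the first byte that failed to decode
            problems.append((str(path), err.start))
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_utf8.py file1.xml file2.xml ...
    for path, offset in find_invalid_utf8(sys.argv[1:]):
        print(f"{path}: invalid UTF-8 at byte {offset}")
```

Running a check like this before each commit would keep badly encoded documents out of the version history in the first place.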

To us, one of GitHub’s greatest advantages is that it allows datasets to be easily “forked” (i.e. duplicated and taken in a new direction). This gives people the freedom to take a dataset, work with it independently, and transform it to meet their needs. The provenance and history of each fork are retained. We’ve made data portability a priority in Open Context, with lots of emphasis on web services and machine-readable data. GitHub advances these goals by providing a fantastic community and collaborative space to work with data in new ways.


Please note that Open Context does not rely upon GitHub for long-term archiving and data preservation. For that, Open Context works with the California Digital Library’s Merritt repository. Open Context uses GitHub mainly to encourage collaboration and transparency, not for data preservation.
