With summer wrapping up and a new fellowship about to begin, it’s time to share some updates about Open Context. Warning! Much of this post is pretty geeky. So if you don’t enjoy geeking out on the nitty-gritty of archaeological informatics issues, you’re welcome to move on to something else!
I’m busy working with John Ward on completely rebuilding Open Context from scratch. Open Context is now over 7 years old, and has already gone through two significant revisions. Our current effort to rebuild Open Context marks the most radical rebuild yet. We’re moving away from PHP and MySQL toward Python and Postgres.
Open Context and Python
Both Python and Postgres give us more options for enhanced geospatial data management. Python also has a large array of powerful open source packages for Natural Language Processing, Linked Data, and other scientific computing. These ready-made tools will give us more options to further enhance Open Context.
I’m finding Python (with the Django framework) pretty straightforward, and it makes it easier to write code that is more intelligible and, hopefully, easier to maintain. So, despite having to overcome some initial learning-curve obstacles, I think we’re making rapid progress. Here’s the new code repository for Open Context’s first iteration in Python: https://github.com/ekansa/open-context-py
Consolidating Some Experience
Since the last major revision of Open Context in 2009-2010, we’ve increasingly emphasized participation in “Linked Open Data”. For Open Context, this mainly means we annotate certain data by referencing stable Web identifiers (URIs) to concepts curated by experts at other institutions. However, it took some time to learn how and where we should use Linked Data to enhance the data we publish (we’re still learning!). Until now, we took tentative and incremental steps in working with Linked Data; we didn’t invest a huge amount of effort in reorganizing Open Context’s schemas (ways of organizing data) and software to better manage Linked Data.
Thus, Open Context’s current PHP code base reflects a pretty organic and not-so-systematic approach to implementing Linked Data. Some other features we’ve added over the years, especially with regard to the faceted search, are also not so systematic. This has led to a lot of sprawl in the current version of Open Context. Just like sprawl isn’t great for communities, it’s also not great for software. The bloated code in Open Context makes it harder to maintain. It needs a very comprehensive overhaul, which is exactly what we’re doing.
What’s Next? Lots of GeoJSON(LD)…
We’ve published over 50 datasets, many of them very large, from research projects across the globe. We’ve now got a better understanding of some of the key issues and requirements for managing this kind of scale and diversity of archaeological data. At the same time, the archaeological “information ecosystem” has also grown. Our community has made great strides in sharing more and more interoperable data.
One exciting recent development centers on the uptake of GeoJSON. GeoJSON is a simple and easy-to-use format for sharing geospatial data. It is widely used and widely supported by Web mapping software and desktop GIS software. Sean Gillies, one of the architects of GeoJSON (while he was with Pleiades), organized an ad hoc, bottom-up push to combine GeoJSON with JSON-LD, a new W3C standard for expressing Linked Data. The goal of JSON-LD is to combine the ease of use of JSON with the semantic precision of Linked Data. Merging GeoJSON with JSON-LD can therefore be a powerful, but low-barrier-to-entry (meaning you don’t have to be a maladjusted nerd to participate) way of sharing archaeological data.
So, we’re deprecating Open Context’s current XML format that was based on ArchaeoML. David Schloen of the OCHRE project designed ArchaeoML, but the OCHRE project has also outgrown that schema. I’ve already migrated all of Open Context’s data into a new organizational schema that retains the powerful modeling features of ArchaeoML, but uses GeoJSON-LD (not ArchaeoML-XML) as the main way of representing and sharing these data. By design GeoJSON-LD is good for sharing linked data, and because GeoJSON-LD is backward compatible with all sorts of tools that support GeoJSON, Open Context’s new data will be much easier to consume immediately without writing custom software. For example, it does a good job of mapping sample Open Context GeoJSON-LD records (see examples). The software for doing all of this is much more compact and simple than the versions of Open Context we’re replacing.
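To make the backward-compatibility point concrete, here is a minimal sketch of what a GeoJSON-LD record could look like when built in Python. Everything specific in it is an assumption for illustration: the context mapping, the example.org URI, the label, and the coordinates are made up, and this is not Open Context’s actual output format. The idea is simply that the JSON-LD part lives under an “@context” key, which a tool that only understands GeoJSON can ignore.

```python
import json

# Hypothetical JSON-LD context: maps plain JSON keys to Linked Data URIs.
# Not Open Context's real context document.
CONTEXT = {
    "id": "@id",
    "label": "http://www.w3.org/2000/01/rdf-schema#label",
}


def make_record(uri, label, lon, lat):
    """Build one GeoJSON Feature that also carries a JSON-LD context."""
    return {
        "@context": CONTEXT,
        "type": "Feature",
        "id": uri,
        # GeoJSON coordinate order is [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"label": label},
    }


def plain_geojson_view(record):
    """Return what a GeoJSON-only tool effectively sees: no '@' keys."""
    return {key: value for key, value in record.items()
            if not key.startswith("@")}


# Illustrative values only.
record = make_record("http://example.org/subjects/1", "Example locus",
                     35.44, 30.33)
text = json.dumps(record, indent=2)
```

Because the record is ordinary JSON, `json.loads(text)` round-trips it with nothing beyond the standard library, and `plain_geojson_view(record)` is still a well-formed GeoJSON Feature.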
The DINAA project is a key driver in motivating changes to Open Context’s approach to modeling archaeological data. So far, the DINAA project has published about 340,000 archaeological site file records generously contributed by state site file managers. We’ve annotated these site file records with a controlled vocabulary of archaeological time periods (in draft stage) to facilitate searches across state boundaries. We’ve also experimented with new ways of indexing numeric date ranges. However, Open Context’s current index allows only one numeric date range per site record. This oversimplifies important aspects of reality, in that sites typically have gaps or hiatuses in occupation. While episodic occupation is described by the controlled vocabulary of periods, it also needs to be described with numeric date ranges. The next revision of Open Context will do this.
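A small sketch shows why one date range per site is a problem. The site and query years below are hypothetical (negative numbers stand for years BCE), and the functions are an illustration of the idea, not the indexing code Open Context uses. Collapsing two occupation episodes into one span makes the site match a query that falls entirely inside the hiatus.

```python
def overlaps(a, b):
    """True if two (start, stop) year ranges share at least one year."""
    return a[0] <= b[1] and b[0] <= a[1]


def site_matches(occupations, query):
    """Match a query range against every occupation episode of a site."""
    return any(overlaps(span, query) for span in occupations)


def single_span(occupations):
    """Collapse all episodes into one range, as a one-range index would."""
    starts = [start for start, _ in occupations]
    stops = [stop for _, stop in occupations]
    return (min(starts), max(stops))


# Hypothetical site with two occupation episodes and a long hiatus.
site = [(-3000, -1000), (500, 1200)]
query = (-500, 0)  # falls entirely inside the hiatus

by_episode = site_matches(site, query)           # False: correct
collapsed = overlaps(single_span(site), query)   # True: a false positive
```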
In addition to better meeting needs for DINAA, the new GeoJSON-LD approach will support some “event-like” data modeling that can have other useful applications. An “event” is basically an abstract entity that takes place at some time and at some place (even if time and place are only vaguely described). The most sophisticated, elaborate and comprehensive event model used in archaeology is the CIDOC-CRM. We’re referencing some of the CIDOC-CRM for our event modeling. However, for better and for worse, we’re less interested in semantic perfection than pragmatic usability, so our event modeling takes most of its cues from a simple optional extension to GeoJSON-LD discussed here. Adding some event modeling will be useful for representing where an object was found and where it may have been made (such as this Late Roman coin found at Petra but minted in London).
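Here is a rough sketch of the event idea for an object with a find spot and a separate production place. The key names (“event” and “when”, with “start” and “stop”) are assumptions chosen for illustration, not the vocabulary of any published extension, and the coordinates are approximate. The years are placeholders rather than real dates for any coin. Each event is its own Feature, so a GeoJSON-only map would still plot both places.

```python
def make_event(event_type, lon, lat, when=None):
    """One event: something that happened at a place and (maybe) a time.

    `when` is optional because time may be only vaguely known.
    """
    feature = {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"event": event_type},  # hypothetical key
    }
    if when is not None:
        feature["when"] = {"start": when[0], "stop": when[1]}  # hypothetical
    return feature


def places_for(collection, event_type):
    """Coordinates of every event of one type in a FeatureCollection."""
    return [f["geometry"]["coordinates"]
            for f in collection["features"]
            if f["properties"]["event"] == event_type]


coin = {
    "type": "FeatureCollection",
    "features": [
        # Approximate location of Petra; discovery time left unstated.
        make_event("found", 35.44, 30.33),
        # Approximate location of London; years are placeholders only.
        make_event("produced", -0.13, 51.51, when=(300, 400)),
    ],
}

found_at = places_for(coin, "found")
made_at = places_for(coin, "produced")
```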
All of this may sound complicated, but it really isn’t that bad. We’re actually simplifying things and cleaning up our act. We’ll get more capability with less software by choosing somewhat better models and abstractions.
Open Context’s current implementation of faceted search needs some attention. It’s hard to use because there are too many facets without clear organization, and we do not make numeric fields easy to query. Fortunately, John Ward, an experienced developer with expertise in enterprise search, is leading revisions on this critical bit of Open Context. Again, the focus is on writing much smaller, easier-to-maintain code and taking better advantage of mature open source software (Apache Solr). In addition to a tuned-up search interface, we’ll also have a GeoJSON-LD search API. We’re also sticking with the old-but-super-useful Atom feed API as an option for getting search results from Open Context. Atom might not be as hip as GeoJSON, but it still has great utility in sharing lists of search results.
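As a sketch of what leaning on Solr can look like, the snippet below builds the query string for a faceted request that also buckets a numeric field into ranges. The parameter names (facet, facet.field, facet.range and its start, end, and gap settings) are standard Solr request parameters; the field names “project”, “item_type”, and “date_start” are hypothetical and do not describe Open Context’s actual Solr schema.

```python
from urllib.parse import urlencode


def facet_query_params(text, facet_fields, range_field, start, end, gap):
    """Build a Solr query string asking for field and numeric range facets."""
    params = [
        ("q", text),
        ("wt", "json"),     # ask Solr for a JSON response
        ("rows", 0),        # only facet counts, no documents
        ("facet", "true"),
    ]
    # One facet.field parameter per field we want counts for.
    params += [("facet.field", field) for field in facet_fields]
    # Range faceting groups a numeric field into fixed-width buckets.
    params += [
        ("facet.range", range_field),
        ("facet.range.start", start),
        ("facet.range.end", end),
        ("facet.range.gap", gap),
    ]
    return urlencode(params)


# Hypothetical field names; years are just example bucket boundaries.
query_string = facet_query_params(
    "pottery", ["project", "item_type"], "date_start", -3000, 2000, 500
)
```

The resulting string would be appended to a Solr select URL. Keeping the facet list short and explicit, as here, is one way to avoid presenting users with too many unorganized facets.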
Better User Interface Design
We’ll be using GeoJSON-LD as a common representation format for all Open Context data. In addition to making publicly-available, machine-readable data, we’ll be reading GeoJSON-LD data ourselves to show records in our own interface. The open source Bootstrap libraries provide the layout, typography, styling, and various interactive features (drop-down lists, tabs, accordion boxes). The new (to us) grid layout will be very mobile friendly and will re-size well. Here’s an early draft example:
Showing main description
Showing links to other loci with the main description collapsed
Showing main media with descriptions and links to other loci collapsed
When will it be done?
We’re making great progress, but we still have a tremendous amount of work to do. In all likelihood, any guess I make about when this will be finished will be wildly wrong, except that it’ll need months more of concerted effort. I’ll post more as we get ready to deploy the next upgrade to Open Context.