Decoding Data: A View from the Trenches

This has been a busy data month for me, as I prepare zooarchaeological datasets for publication for a major data sharing project supported by the Encyclopedia of Life Computable Data Challenge award. The majority of my time has been spent decoding datasets, so I’ve had many quiet hours to mull over data publishing workflows. I’ve come up for air today to share my thoughts on what I see as some of the important issues in data decoding.

  • Decoding should happen ASAP. Opening a spreadsheet of 25,000 specimens all in code makes my blood pressure rise. What if the coding sheet is lost? That’s a lot of data down the drain. Even if the coding sheet isn’t lost, decoding is not a trivial task. Though much of it is a straightforward one-to-one pairing of code to term, there are often complicated rules on how to do the decoding. Though an individual with little knowledge of the field could do much of the initial decoding, one quickly arrives at a point where specialist knowledge is needed to make judgment calls about what the data mean. Furthermore, there are almost certainly going to be typos or misused codes that only the original analyst can correct. Decoding should be done by the original analyst whenever possible. If not, it should be done (or at least supervised) by someone with specialist knowledge.
  • Decoding is expensive. In fact, it is one of the biggest costs in the data publishing process. I’ve decoded five very large datasets over the past few weeks and they required about five to ten times more work than datasets that authors submitted already decoded. The size of the dataset doesn’t matter: whether you have 800 records or 100,000 records, data decoding takes time. For example, one of the datasets I edited for the EOL project had over 125,000 specimens. It was decoded by the author before submission. Editing and preparing this dataset for publication in Open Context took about four hours. In comparison, another dataset of 15,000 specimens was in full code and took over 30 hours to translate and finalize for publication. This is something critical for those in the business of data dissemination to consider when estimating the cost of data management. Datasets need to be decoded to be useful, but decoding takes time. Should data authors be required to do that work as part of “good practice” for data management?
  • Coding sheet formats matter. Ask for coding sheets in a machine-readable format so you can easily automate some of the decoding. Though PDFs are pretty, they’re not great for decoding.
  • Decoding often has complicated (and sometimes implicit!) rules. Keep all the original codes until you are sure you have finished decoding. Otherwise, you may find you need a code from one field to interpret another field. For example, one researcher used four different codes that all translated to “mandible.” It turns out each code was associated with a certain set of measurements on the mandible. If you decode the elements first (as you would) and make all the mandibles just “mandible,” then you reach the measurements section and realize you still need that original code distinction.
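To make the last two points concrete, here is a minimal sketch in Python of decoding from a machine-readable coding sheet while keeping the original codes. Everything named here is an assumption for illustration: the codes (M1 to M4, FE), the field name "element", and the helper functions are made up and do not come from any of the datasets discussed above.

```python
import csv
import io

# Hypothetical coding sheet: four made-up codes that all translate to "mandible".
CODING_SHEET = """code,term
M1,mandible
M2,mandible
M3,mandible
M4,mandible
FE,femur
"""


def load_lookup(text):
    """Read a code-to-term look-up table from CSV text."""
    return {row["code"]: row["term"] for row in csv.DictReader(io.StringIO(text))}


def decode(records, lookup, field="element"):
    """Add a decoded column beside the coded one and collect unknown codes."""
    decoded, unknown = [], []
    for rec in records:
        code = rec[field].strip()
        out = dict(rec)  # copy the record so the original code is kept
        out[field + "_decoded"] = lookup.get(code)  # None marks a code needing review
        if code not in lookup:
            unknown.append(rec[field])
        decoded.append(out)
    return decoded, unknown


lookup = load_lookup(CODING_SHEET)
records = [
    {"id": "1", "element": "M2"},
    {"id": "2", "element": "FE"},
    {"id": "3", "element": "MX"},  # a typo or misused code
]
decoded, unknown = decode(records, lookup)
```

Because the decoded term goes into a new column, the M1 to M4 distinction is still available when you reach the measurements. Codes that are not on the sheet are collected for the original analyst to review rather than guessed at.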

Because of all of this complexity, it is hard to fully automate decoding, even if you are lucky enough to have machine-readable “look-up” tables that relate specific codes to their meanings. In practice, codes may be inconsistently applied, or applied according to some tacit set of rules that makes them hard to understand. Mistakes happen when unpacking complicated coding schemes. It really helps to use tools like Google Refine / Open Refine that record and track all edits and changes and allow mistakes to be rolled back.
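The idea of tracking edits can be sketched in a few lines of Python. This is only an illustration of the principle (remember the old value before you overwrite it), not a description of how Refine works internally. The EditLog class and the field names are hypothetical.

```python
_MISSING = object()  # marks a field that did not exist before an edit


class EditLog:
    """Apply cell edits to a list of record dicts while remembering old values."""

    def __init__(self, records):
        self.records = records
        self.history = []  # entries of (row index, field, old value, new value)

    def set_value(self, row, field, new_value):
        """Change one cell and record what it held before."""
        old_value = self.records[row].get(field, _MISSING)
        self.history.append((row, field, old_value, new_value))
        self.records[row][field] = new_value

    def undo(self):
        """Reverse the most recent edit; return False if there is nothing to undo."""
        if not self.history:
            return False
        row, field, old_value, _new_value = self.history.pop()
        if old_value is _MISSING:
            del self.records[row][field]
        else:
            self.records[row][field] = old_value
        return True


records = [{"id": "1", "element": "M2"}]
log = EditLog(records)
log.set_value(0, "element", "mandible")
log.set_value(0, "side", "left")
log.undo()  # removes the "side" field again
log.undo()  # restores the original code "M2"
```

The history list doubles as a record of what was changed, which is useful when cross-checking the decoded data with its author.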

Finally, the issues around decoding help illustrate that treating data seriously has challenges and requires effort. One really needs to cross-check and validate the results of decoding efforts with data authors. That adds effort and expense to the whole data sharing process. It’s another illustration of why, in many cases, data sharing requires levels of effort and professionalism similar to those of other, more conventional forms of publication.

Decoding is necessary to use and understand data. Why not do it at the dissemination stage, when it only has to be done once and can be done in collaboration with the data author? Why make future researchers struggle through often complicated and incompletely documented coding systems?

Support for our research in data publishing also comes from the ACLS and the NEH. Any views, findings, conclusions, or recommendations expressed in this post do not necessarily reflect those of the funding organizations.

Posted in Data publications, Editorial Workflow, Projects.

New Publication on Open Access and Open Data

Eric Kansa’s hot-off-the-press paper Openness and Archaeology’s Information Ecosystem provides a timely discussion of how Open Access and Open Data models can help researchers move past some of the dysfunctions of conventional scholarly publishing. Rather than threatening quality and peer-review, these models can unlock new opportunities for finding, preserving and analyzing information that advance the discipline. The paper is published in an Open Archaeology-themed special issue of World Archaeology (ironically, a closed-access journal). For those who can’t get past the pay-wall, Eric has archived a preprint. Abstract:

The rise of the World Wide Web represents one of the most significant transitions in communications since the printing press or even since the origins of writing. To Open Access and Open Data advocates, the Web offers great opportunity for expanding the accessibility, scale, diversity, and quality of archaeological communications. Nevertheless, Open Access and Open Data face steep adoption barriers. Critics wrongfully see Open Access as a threat to peer review. Others see data transparency as naively technocratic, and lacking in an appreciation of archaeology’s social and professional incentive structure. However, as argued in this paper, the Open Access and Open Data movements do not gloss over sustainability, quality and professional incentive concerns. Rather, these reform movements offer much needed and trenchant critiques of the Academy’s many dysfunctions. These dysfunctions, ranging from the expectations of tenure and review committees to the structure of the academic publishing industry, go largely unknown and unremarked by most archaeologists. At a time of cutting fiscal austerity, Open Access and Open Data offer desperately needed ways to expand research opportunities, reduce costs and expand the equity and effectiveness of archaeological communication.

Posted in Publication.

Digital Humanities Conference in Berkeley

The 2012 Pacific Neighborhood Consortium (PNC) Annual Conference and Joint Meetings will take place at the School of Information at UC Berkeley from December 7th to December 9th, 2012. The conference is hosted by the Electronic Cultural Atlas Initiative (ECAI) and the School of Information at UC Berkeley. The main theme is New Horizons: Information Technology Connecting Culture, Community, Time, and Place. The program is packed with presentations on various digital heritage topics. Eric Kansa of the AAI will present Sunday morning on Applying Linked Open Data: Refining a Model of ‘Data Sharing as Publication’.

Posted in Events, News.

Cyberinfrastructure in Near Eastern Archaeology

At the 2012 ASOR meeting in Chicago last month, the AAI co-organized (with Chuck Jones, ISAW) and presented in the second of a 3-year session Topics in Cyberinfrastructure, Digital Humanities, and Near Eastern Archaeology I. This year’s theme was From Data to Knowledge: Organization, Publication, and Research Outcomes. Presentations and demonstrations took place in two back-to-back sessions. Topics in Cyberinfrastructure follows a loose format, with short papers followed by a long discussion period in order to allow maximum time for exchange of ideas among presenters and audience members. We see this as critical in an emergent and quickly developing field and we’re delighted that ASOR is offering more opportunities for these types of “discussion-heavy” sessions. The AAI’s presentation discussed how editorial oversight and the application of linked open data can vastly improve understanding and reuse of data shared on the Web. Other participants presented current projects (The Ur Digitization Project, The Diyala Project, The Oriental Institute’s Integrated Database Project), approaches such as 3-D visualization and text analysis/visualization, and the theory of archaeological knowledge creation. View the full program. The third and final Topics in Cyberinfrastructure session, with a tentative theme around training for a digital future, will take place in November 2013 at the ASOR meeting in Baltimore.

Posted in Events, News, Projects.

The Red Sea Is Arabian, Erythraean, …

Place Name Clustering in Pleiades and TAVO

Since September of last year, I have been working with Eric Kansa on the Gazetteer of the Ancient Near East project of the Alexandria Archive Institute (with NEH funding). Our goal is to export the cornucopia of information contained in the index of the Tübinger Atlas des Vorderen Orients (TAVO) (Tübingen Atlas of the Near and Middle East) into Pleiades, a Community-Built Gazetteer and Graph of Ancient Places. While digitizing, proofreading, systematizing and entering the name, coordinates, map and language facts into a database, we ran into some practical obstacles. With the help of Tom Elliott and Sean Gillies (New York University’s Institute for the Study of the Ancient World), a lot of the obstacles were cleared but some defy an easy solution. That is where an editor comes in.

Place names can have many variants, not only for chronological reasons (e.g., Byzantium, Constantinople and Istanbul) or cultural/linguistic reasons (e.g., Genève, Genf, Ginevra and Geneva) but also because of differences in the exact place or area covered. The Red Sea, a long and narrow body of water between Africa and Arabia, has been an important trade route since ancient times. Consequently, it is mentioned in texts from many different periods and cultures. At times, parts of the Red Sea were named separately, e.g., the Gulf of Aqaba between the Sinai peninsula and Saudi Arabia, and the Gulf of Suez between the major part of Egypt and the Sinai peninsula.

In the two figures at the end of this article, the relationships between the various Red-Sea-related toponyms are sketched both for the TAVO Index and Pleiades. In the TAVO diagrams, the arrows indicate referrals while in the Pleiades diagrams, they explain which (combination of) place name(s) serve(s) as the main entry (“Place”) and which are the toponyms associated with said main entry (“Names”). In other words, TAVO emphasizes relationships, mutual or not, while Pleiades organizes its information along the lines of one concept (“Place”) and usually-multiple labels (“Names”). The distinction between the whole and its parts is not always maintained, e.g., TAVO’s Rotes Meer (Red Sea) refers to Baḥr al-Qulzum (Gulf of Suez; named after a town at the northern end). The same name can cover different areas, e.g., Pleiades’ Erythr(ae)um Mare is both the Red Sea and the Persian Gulf… and even a “large section of the Indian Ocean, including the Persian Gulf and Red Sea.”
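The difference between the two models can be sketched as two small data structures in Python. This is only an illustration and not the actual schema of either resource. The place keys ("red-sea", "persian-gulf") are hypothetical, the Latin name is written out as Erythraeum Mare, and the empty referral list for Baḥr al-Qulzum is an assumption made so the example has a one-way referral to find.

```python
from collections import defaultdict

# TAVO-style index: each entry lists the entries it refers to (directed, not always mutual).
tavo_referrals = {
    "Rotes Meer": ["Baḥr al-Qulzum"],
    "Baḥr al-Qulzum": [],  # assumption: no referral back
}

# Pleiades-style gazetteer: one Place (hypothetical keys) with a set of Names.
pleiades_places = {
    "red-sea": {"Erythraeum Mare"},
    "persian-gulf": {"Erythraeum Mare"},
}


def one_way_referrals(referrals):
    """Return (source, target) pairs where the target does not refer back."""
    return [
        (src, dst)
        for src, targets in referrals.items()
        for dst in targets
        if src not in referrals.get(dst, [])
    ]


def shared_names(places):
    """Return names attached to more than one place, with those places sorted."""
    by_name = defaultdict(set)
    for place, names in places.items():
        for name in names:
            by_name[name].add(place)
    return {name: sorted(found) for name, found in by_name.items() if len(found) > 1}
```

Calling one_way_referrals(tavo_referrals) surfaces referrals that are not mutual, and shared_names(pleiades_places) surfaces names that cover more than one area. These are the two kinds of ambiguity described above.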

The confusing connections between place names ultimately reflect the extent (or lack) of accurate knowledge of the textual sources as well as the changing economic and political lenses through which the Red Sea region was viewed. The sea was variously seen as anything from far-off and almost mythical to important and oft-navigated. In a way, the place names accrete in clusters that are unstable and can change composition depending on the time period and on which observer is viewing them. With a nod to physics, we encounter an observer effect: “the act of observation will make [changes] on the phenomenon being observed. This is often the result of instruments that, by necessity, alter the state of what they measure in some manner.” After all, our measuring instrument is the human brain, which is nurtured and to an extent determined by its social and cultural environment.

Posted in Projects.

Workshop Develops Infrastructure for Archaeology in Australia

AAI’s Technology Director Eric Kansa is currently in Sydney attending the Stocktaking Workshop of the Federated Archaeological Information Management System project (FAIMS). FAIMS launched in the summer of 2012 with a major grant from the National eResearch Collaboration Tools and Resources program (NeCTAR), an Australian government program to build new infrastructure for Australian researchers (working in Australia or abroad). By the end of the project (Dec. 2013), FAIMS aims to have developed a portal containing a suite of compatible tools for archaeologists to facilitate the entire research process—from data collection to visualization, archiving, and publishing. The project is also prioritizing interoperability to promote the discovery and use of current and future online resources.

FAIMS is led by the University of New South Wales in collaboration with participants from 41 organizations from Australia and abroad. The Stocktaking Workshop is taking place this week in Sydney at the University of New South Wales School of Humanities. Eric will present the keynote address on August 16, Capacity Building in (Digital) Archaeology. A series of plenary sessions will explore such topics as Mobile Applications, Online Repositories, various aspects of archaeological data sharing and visualization, and sustainability.

Posted in Events, News, Projects.

NSF Support for Linking State Site Files

We are delighted to announce the success of our grant proposal to the National Science Foundation to create interoperability models for archaeological site databases in the eastern United States (NSF #1216810 & #1217240). Our core team consists of researchers from the Department of Anthropology and Archaeological Research Laboratory at the University of Tennessee, the Alexandria Archive Institute, and the Anthropology and Informatics programs at Indiana University. Open Context will be used as the primary platform for data dissemination for this project.

Our aims are to work with the databases held by State Historic Preservation offices and allied federal and tribal agencies in Eastern North America, with the goal of developing protocols for their linkage across state lines for research and management purposes. Data from some 15 to 20 states (more than a half million sites) will be integrated and linked to promote extension and reuse by government personnel in state and federal agencies, and domestic and international researchers. The interoperability models we develop will be designed to:

  • facilitate and enhance resource management and protection far beyond local levels
  • make protocols and, where appropriate, primary data readily available through open source formats, platforms, and services
  • allow for interoperability among multiple disparate datasets and data systems
  • be sustainable, flexible, adaptable, and capable of growth in a number of directions
  • create frameworks for future “Linked Data” applications in North American archaeology

This project is designed to involve datasets from numerous organizations, and testers from the professional archaeological community. It will generate data products in the form of maps, tables, and analyses useful for primary research, cultural resources management, higher education, and public outreach. Data products will be abstracted and cleaned of sensitive information pursuant to all applicable state and federal requirements.

Posted in Events, News, Projects.

NEH Support for Publishing Archaeology to the Web of Data

We interrupt our vacation for a short blog post.

We are very pleased to report that the National Endowment for the Humanities just awarded a Digital Humanities Implementation grant to the Alexandria Archive Institute in support of our efforts to develop data publishing services with Open Context. In collaboration with a team of archaeologists working in the Mediterranean region, our project will further develop workflows to publish archaeological datasets as “Linked (Open) Data”, so that they can be used and integrated with many diverse data sources available on the Web to address important research topics.

We are grateful to the NEH for their generous support and to our colleagues who bring energy, ideas, and talent to this collaborative effort. We’ll make further announcements about this and other projects in the upcoming weeks. Until then, it’s back to our previously scheduled vacation.



Posted in Events, Projects.

Kenan Tepe’s Massive Dataset

We are very pleased to announce the online publication of the second installment of the Upper Tigris Archaeological Research Project’s excavation data, images, and documentation.

The Upper Tigris Archaeological Research Project (UTARP), under the direction of Bradley J. Parker (University of Utah), was active in the Upper Tigris River Region of southeastern Turkey between 1998 and 2011. The online data publication in Open Context aims to be a complete accounting of all of the excavation records as well as all of the records of the subsequent analyses from the Upper Tigris Archaeological Research Project’s excavations at the site of Kenan Tepe in southeastern Turkey (

This data publication includes records and analyses from Areas A, B, C, D, E, G, H and I, with findings from Ubaid, Late Chalcolithic, Early Bronze Age, Middle Bronze Age, and Iron Age contexts. With the publication of these data we can now say that approximately 95% of the data from Kenan Tepe are published. The database includes more than 30,000 images, 1900 journal entries and 43,000 records of contexts and finds. These data complement other datasets from several other sites in the Near East also published in Open Context.

A final installment of the last 5% of the data, which includes issues that remain to be resolved and an accounting of analyses that are still underway, will be forthcoming later this fall. All the UTARP data are available free-of-charge, under liberal Creative Commons Attribution licensing conditions, at:

Bulk downloads of the data can be obtained through Open Context’s GitHub repository. Additional options for downloading data tables as comma-separated values (CSV) are forthcoming.

Posted in Data publications, News, Publication.

Archaeology Icon Salon

A collaborative design event producing archaeology-related symbols for the public domain

On May 23, the AAI will host the first-ever Archaeology Icon Salon—a facilitated design session where event participants generate icons and symbols that visually convey concepts frequently needed in archaeology. The event will inform developments to Open Context by using visual design strategies to help make Web-based data publication more accessible. The anticipated result is a set of icons that effectively communicate the content of Open Context, which will be made available publicly so others can adopt the icons to facilitate access to archaeological content online.

Archaeology Icon Salon participants include designers, archaeologists, and anyone interested in the concept. Participants will brainstorm and sketch ideas and concepts during the event, and refine them from their home or design studios while continuing the collaboration process through social media. All designs can be submitted to The Noun Project, an open source resource, which curates submissions based on technical and stylistic guidelines and makes high quality icons available publicly.

The Archaeology Icon Salon will be hosted at Mission Social, a workspace in downtown San Francisco. Participation is free and drinks and appetizers will be provided. Folks farther afield are welcome to take part via Skype (though, sadly, you’ll have to provide your own snacks). We encourage on-site participants to bring art supplies.

Where: Mission Social, 972 Mission Street, 5th Floor, San Francisco
When: Wednesday May 23 @ 7pm
Please RSVP to let us know you plan to attend.

Posted in Events.
