

Comments on OSTP Open Data Policy

Today, Open Context’s Eric Kansa spoke (via phone) at the meeting on Public Access to Federally-Supported Research and Development Data and Publications: Data, hosted by the National Research Council of the National Academies. The meeting, taking place May 16-17, is hearing invited and public comments on the White House OSTP memo on expanding access to data resulting from federally-funded research, with the aim of informing agencies as they develop policies in response to the memo.

The NRC has posted a video of the meeting online. In addition, you can read the AAI’s comments in a document that includes responses from various individuals/organizations, some of whom also spoke at the meeting. In the meantime, here are Eric’s comments:

My name is Eric Kansa, and I manage and direct Open Context, an open access, open licensed data publication venue for archaeology and related fields. I’ve also participated in text-mining in the digital humanities. Text-mining really shows that the boundaries between text and data are increasingly blurred, and that texts (publications) increasingly share many of the open intellectual property requirements critical to the re-usability of data.

While we focus on editorial and peer-review services for data contributions, we work closely with colleagues at the California Digital Library (University of California), an institution that provides us with essential digital repository and persistent identifier services. With Open Context, we are grateful for grant support from the National Endowment for the Humanities, particularly the Office of Digital Humanities, the National Science Foundation (see current work), and private foundations. We’re one example of how the lines between the humanities and sciences are increasingly blurred, and that’s a good thing.

Because we receive support from multiple federal agencies, I think coordination across agencies is vital. Research suffers when stove-piped into artificial silos. Similarly, other agencies also support and even mandate research, especially to enforce laws on historic preservation and environmental protection. Data practices relating to compliance-oriented research also need to be harmonized with those of agencies that mainly support academically oriented research.

Based on over 10 years’ experience promoting greater data openness and professionalism in archaeology, I think it is critical for policy-making to promote dynamism and innovation in the management of data. Data needs are diverse and ever evolving. We need to encourage that dynamism by welcoming new entrants with new ideas and approaches to data management, preservation, dissemination, and reuse.

There’s often a tacit assumption that data are a “residue” of research, and a researcher’s primary responsibility with respect to data centers mainly on preservation. I think that is limiting, and in some circumstances, data can and should be valued as a primary outcome of research. To borrow a phrase from my colleagues at the California Digital Library, data can also be a “first class citizen” of scholarly production. Data can also play a central role in new modes of scholarly communications, with approaches like “data sharing as publication”, or exhibition, or even data sharing as a kind of open-source release cycle. The point is, data can play many and expanding roles in researcher communications. Policy should not assume that data should only play the role of a secondary, supplemental outcome to research.

The need to foster dynamism also needs to inform thinking about financial sustainability. Public policy needs to recognize that the sustainability of particular organizations and practices in the research endeavor is only a means to an end in promoting the public good. The sustainability of particular interests should not be an end in itself. “Resiliency” may be a better term, since it may better capture our obligations for data and knowledge stewardship without lock-in to a particular set of institutions or practices.

In other words, notions of data “openness” need to expand beyond technical and licensing concerns to include the organizations and people participating in the research community’s information ecosystem, especially the next generation of students, who will have their own needs and priorities with respect to data. True resiliency will require real funding, an issue where the OSTP policy memo falls short. I urge agencies to work with the research community, libraries, and others to honestly understand funding requirements. We need this to make a clear case to the American public about investing in unlocking the richness of research data.

Earlier this week, the NRC sponsored a related meeting to hear comments on the other part of the OSTP memo, relating to public access to publications resulting from federally-funded research. The AAI submitted comments for this meeting, as well, which you can read here in a PDF containing all mail-in responses.

Posted in Events, News, Policy.


Lessons in Data Reuse, Integration, and Publication

On April 17, members of the Central and Western Anatolian Neolithic Working Group met at Kiel University to participate in the International Open Workshop: Socio-Environmental Dynamics over the Last 12,000 Years: The Creation of Landscapes III. Working group participants presented their hot-off-the-press analyses of various aspects of integrated faunal datasets from over one dozen Anatolian archaeological sites spanning the Epipaleolithic through the Chalcolithic (a range of 10,000+ years). Several more sites will add data to the project in the coming months to ensure that the resulting collaborative publications are as comprehensive as possible.

These presentations took place in the session Into New Landscapes: Subsistence Adaptation and Social Change during the Neolithic Expansion in Central and Western Anatolia. The session, which was chaired by Benjamin Arbuckle (Department of Anthropology, Baylor University) and Cheryl Makarewicz (Institute of Pre- and Protohistoric Archaeology, CAU Kiel), included a panel of presentations followed by an open discussion.

A bit of background: Over the past five months, with enabling funding from the Encyclopedia of Life (EOL), we have worked with participants in this project to prepare their datasets for publication. Each participant contributed a dataset to be edited and published in Open Context, and integrated with the other datasets. Rather than ask all participants to analyze the entire corpus of datasets, we asked each participant to address a specific topic. Each topic (“sheep and goat age data”, “cattle biometrics”) required access to only a smaller set of relevant data, and participants presented their analyses of these data at the Kiel conference.

The research community has very little experience with this kind of collaborative data integration. Archaeology rarely sees projects that go beyond conventional publication outcomes, to also emphasize the publication of high-quality, reusable structured data. After months of preparing datasets for shared analysis and publication, I was really looking forward to seeing the research outcomes unfold.

As an added bonus, our colleagues from the DIPIR project joined us there to document the data publishing and collaborative data reuse processes. We felt very fortunate that the DIPIR team members could apply highly rigorous methods to observing and studying how researchers grappled with integrating multiple datasets. We’re looking forward to learning from the DIPIR team as they synthesize their observations on how researchers collaborate with shared data.

In the meantime, we’d like to share some initial impressions and lessons on data reuse that emerged from this work:

Full data access can improve practice. We can learn a lot by looking at how others record data. Some may see sharing our databases and spreadsheets as opening ourselves up to criticism, but such openness can greatly improve the consistency with which we record data, and therefore facilitate meaningful data integration. In this one-day workshop alone, we identified a few key areas where zooarchaeologists can improve their consistency in data recording.

An example of this from the workshop: Although all zooarchaeologists record age data based on the fusion stage of skeletal elements, some elaborate on their notations where others don’t. For example, an unfused calcaneus of a sheep might come from a newborn lamb or from a sheep up to about two years of age (when the calcaneus fuses). One researcher might put a note in a “Comments” field indicating that the bone is from a neonate. Another researcher, dealing with the same specimen, might leave the notation simply as “unfused.” Thus, two recording systems can lead to very different interpretations, one that recognizes the newborn lambs in the assemblage, and one that lumps them with the other “sub-adult” sheep. Such differences in aggregate can lead to vastly different interpretations of an assemblage. These recording discrepancies become apparent when data authors begin looking “under the hood” at each other’s datasets. Recognizing these discrepancies and their possible effects on interpretation can inform better practice in data recording, and thus work toward improving future comparability and integration of published datasets.
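To make this concrete, here is a minimal sketch (in Python, with entirely hypothetical column names and values, not the project’s actual recording systems) of how a data editor might map the two recording conventions described above onto a shared, explicit age class before integrating the datasets:

```python
import pandas as pd

# Dataset A notes neonates in a free-text "Comments" field;
# Dataset B records fusion state only. Both are invented examples.
site_a = pd.DataFrame({
    "element": ["calcaneus", "calcaneus"],
    "fusion": ["unfused", "unfused"],
    "comments": ["neonate", ""],
})
site_b = pd.DataFrame({
    "element": ["calcaneus"],
    "fusion": ["unfused"],
    "comments": [""],
})

def age_class(row):
    """Derive a shared, explicit age class from heterogeneous notations."""
    if "neonate" in str(row["comments"]).lower():
        return "neonate"
    if row["fusion"] == "unfused":
        # An unfused calcaneus alone only tells us the animal died
        # before fusion (roughly under two years in sheep).
        return "sub-adult (unspecified)"
    return "adult"

combined = pd.concat([site_a.assign(site="A"), site_b.assign(site="B")])
combined["age_class"] = combined.apply(age_class, axis=1)
print(combined[["site", "element", "age_class"]])
```

The point of the sketch is simply that the harmonized age class has to be made explicit somewhere; otherwise the neonates in one dataset quietly disappear into the “sub-adult” category when the datasets are merged.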

While data preservation is a good motivation for better data management, we think a professional expectation for data will help motivate researchers to create better data in the first place. The discussions provoked by this study help us to better understand what “better data” may mean in zooarchaeology.

Documenting data in anticipation of reuse. I think we can all agree that datasets must contain certain critical information or they will not be useful to future researchers. But here’s the catch: Information deemed “critical” for one project is not the same for another project. Sure, there may be a baseline of key information that applies to all projects (location, date, author, etc.), but there is a much larger amount of discipline-specific or even project-specific information that needs to be documented to enable reuse. To complicate things, the absence of this documentation may only be noticed upon reuse. That is, the project may appear well-documented until an expert attempts to reuse the dataset.

An example: Some datasets in this study contained a large number of mollusks. From the perspective of a data re-user wanting to integrate multiple datasets, this poses a big question: Does an absence of mollusks at the other sites mean that the ancient inhabitants did not exploit marine resources? Or is their absence simply a result of the mollusks having not been included in the analysis (either not collected or perhaps set aside for analysis by another specialist)? Understanding this absence of data is critical for any reuse of the dataset.
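One practical way to keep that distinction from being lost is to record “not analyzed” explicitly rather than leaving a silent gap. Here is a minimal, purely hypothetical sketch (the taxa and counts are invented) of the difference:

```python
import numpy as np
import pandas as pd

# Taxon counts from two hypothetical sites. NaN means "not analyzed /
# not collected", which is different from 0 ("analyzed, none found").
counts = pd.DataFrame(
    {"sheep/goat": [412, 958], "mollusc": [137, np.nan]},
    index=["Site A", "Site B"],
)

# A careless counts.fillna(0) before integration would silently turn
# "not analyzed" into "absent"; documenting the NaN (and why it is
# there) is part of making the dataset reusable.
print(counts)
```

However the convention is expressed, the key is that the documentation states whether mollusks were outside the scope of the analysis, so a re-user does not mistake a recording decision for an ancient absence.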

This highlights the important role of data editors and reviewers, who can work with data authors to identify and gather this key information at the time the dataset is disseminated (rather than having questions come up years later upon reuse). Furthermore, not just anybody can review the dataset. Knowing if a dataset is documented sufficiently requires in-depth knowledge of the subject matter, and the ability to project potential applications of the data to anticipate questions that might arise with future use.

The benefits of peer-review via data reuse. Data publication is still in its infancy. There is a lot of exploration taking place as to what “data publication” means and how it should be carried out. If it mimics conventional publication, peer-review of datasets would occur before their publication. However, our data reuse studies are showing that, in fact, the most comprehensive peer-review of data occurs upon its reuse. It is only at the time of reuse that a dataset is tested and scrutinized to the point where key data documentation questions emerge. This may be an issue only in today’s nascent data-sharing world. Perhaps future data authors, accustomed to full and expected data dissemination, will practice exhaustive documentation from the get-go. But what do we do now? How does post-publication peer-review, which appears to be so critical to documenting datasets properly, fit with models of data publication?

This work is supported by a Computable Data Challenge Grant from the Encyclopedia of Life, as well as by funds from the National Endowment for the Humanities and the American Council of Learned Societies. Any views, findings, conclusions, or recommendations expressed in this post do not necessarily reflect those of the funding organizations.

Posted in Data publications, Editorial Workflow, Events, Projects.


New data publications in Open Context highlight early globalization

What does a fragment of a Canton blue and white porcelain plate from the early 19th century in Alaska have in common with a stoneware jar from the mid-18th century Northern Mariana Islands? Give up? Both were published in Open Context this week!

The two projects these objects come from also share a common theme: documenting early globalization in the greater (much greater!) Pacific region. The Asian Stoneware Jars project, authored by Peter Grave of the University of New England (Australia), presents data on the likely provenance and production dynamics of large stoneware jars, many found in dozens of shipwrecks in the Pacific and Indian Oceans. Using a variety of analytical techniques to detect trace elements, Grave and his team identified that the stoneware vessels originated in at least seventeen discrete production zones ranging from southern China to Burma, providing insights on the transport of goods around the globe during the 14th-17th centuries.

The Mikt’sqaq Angayuk Finds project (authored by Amy Margaris, Fanny Ballantine-Himberg, Mark Rusk, and Patrick Saltonstall, in collaboration with the Alutiiq Museum) catalogs finds from an historic Alutiiq settlement of the early 19th century on Kodiak Island. The site was a springtime encampment occupied only briefly by a small number of individuals, likely Alutiiqs conscripted into service to provision the residents of Russia’s first colonial capital in Alaska (St. Paul Harbor, now the City of Kodiak). Ceramics of Russian, British, and Chinese origin, together with a variety of artifacts of local manufacture, reveal a settlement that saw the interface of two cultures and participation in an increasingly global economy.

Both publications currently carry a three star rating as they await external peer review. The star ratings are part of a new system Open Context uses to help users understand the editorial status of the publication (ranging from one star for demonstration projects to five stars for peer reviewed projects). Open Context’s Publishing page has more details on how the star ratings work.

Posted in Data publications.


Decoding Data: A View from the Trenches

This has been a busy data month for me, as I prepare zooarchaeological datasets for publication for a major data sharing project supported by the Encyclopedia of Life Computable Data Challenge award. The majority of my time has been spent decoding datasets, so I’ve had many quiet hours to mull over data publishing workflows. I’ve come up for air today to share my thoughts on what I see as some of the important issues in data decoding.

  • Decoding should happen ASAP. Opening a spreadsheet of 25,000 specimens all in code makes my blood pressure rise. What if the coding sheet is lost? That’s a lot of data down the drain. Even if the coding sheet isn’t lost, decoding is not a trivial task. Though much of it is a straightforward one-to-one pairing of code to term, there are often complicated rules on how to do the decoding. Though an individual with little knowledge of the field could do much of the initial decoding, one quickly arrives at a point where specialist knowledge is needed to make judgment calls about what the data mean. Furthermore, there are almost certainly going to be typos or misused codes that only the original analyst can correct. Decoding should be done by the original analyst whenever possible. If not, it should be done (or at least supervised) by someone with specialist knowledge.
  • Decoding is expensive. In fact, it is one of the biggest costs in the data publishing process. I’ve decoded five very large datasets over the past few weeks, and they required about five to ten times more work than datasets that authors submitted already decoded. The size of the dataset doesn’t matter: whether you have 800 records or 100,000 records, data decoding takes time. For example, one of the datasets I edited for the EOL project had over 125,000 specimens. It was decoded by the author before submission. Editing and preparing this dataset for publication in Open Context took about four hours. In comparison, another dataset of 15,000 specimens was in full code and took over 30 hours to translate and finalize for publication. This is something critical for those in the business of data dissemination to consider when estimating the cost of data management. Datasets need to be decoded to be useful, but decoding takes time. Should data authors be required to do that work as part of “good practice” for data management?
  • Coding sheet formats matter. Ask for coding sheets in a machine-readable format so you can easily automate some of the decoding. Though PDFs are pretty, they’re not great for decoding.
  • Decoding often has complicated (and sometimes implicit!) rules. Keep all the original codes until you are sure you have finished decoding. Otherwise, you may find you need a code from one field to interpret another field. For example, one researcher used four different codes that all translated to “mandible.” It turns out each code was associated with a certain set of measurements on the mandible. If you decode the elements first (as you would) and make all the mandibles just “mandible,” then you reach the measurements section and realize you still need that original code distinction. (A minimal decoding sketch follows this list.)
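As a rough illustration of the last two points, here is a minimal sketch (in Python, with invented codes and column names, not any of the actual project coding sheets) of decoding against a machine-readable coding sheet while keeping the original code column:

```python
import pandas as pd

# Hypothetical coded records and coding sheet. Note that two codes
# both translate to "mandible", as in the example above.
specimens = pd.DataFrame({"element_code": [101, 102, 103, 105]})
coding_sheet = pd.DataFrame({
    "element_code": [101, 102, 103, 104],
    "element": ["mandible", "mandible", "calcaneus", "femur"],
})

# Decode with a join, but keep the original element_code column:
# later fields (e.g., which measurements were taken) may still
# depend on the code even after the label has been translated.
decoded = specimens.merge(coding_sheet, on="element_code", how="left")

# Flag codes missing from the coding sheet for follow-up with the
# data author -- typos and misused codes surface exactly here.
unmatched = decoded[decoded["element"].isna()]
print(decoded)
print(f"{len(unmatched)} specimen(s) carry codes not in the coding sheet")
```

Even with a clean lookup like this, the judgment calls described above (misused codes, tacit rules) still require the original analyst, or another specialist, to review the result.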

Because of all of this complexity, in practice it is hard to totally automate decoding, even if you are lucky enough to have machine-readable “look-up” tables that relate specific codes to their meanings. In practice, codes may be inconsistently applied or applied according to some tacit set of rules that makes them hard to understand. Mistakes happen when unpacking complicated coding schemes. It really helps to use tools like Google Refine / OpenRefine that record and track all edits and changes and allow for the rollback of mistakes.

Finally, the issues around decoding help illustrate that treating data seriously has challenges and requires effort. One really needs to cross-check and validate the results of decoding efforts with data authors. That adds effort and expense to the whole data sharing process. It’s another illustration of why, in many cases, data sharing requires similar levels of effort and professionalism as other, more conventional forms of publication.

Decoding is necessary to use and understand data. Why not do it at the dissemination stage, when it only has to be done once and can be done in collaboration with the data author? Why make future researchers struggle through often complicated and incompletely documented coding systems?

Support for our research in data publishing also comes from the ACLS and the NEH. Any views, findings, conclusions, or recommendations expressed in this post do not necessarily reflect those of the funding organizations.

Posted in Data publications, Editorial Workflow, Projects.


New Publication on Open Access and Open Data

Eric Kansa’s hot-off-the-press paper Openness and Archaeology’s Information Ecosystem provides a timely discussion of how Open Access and Open Data models can help researchers move past some of the dysfunctions of conventional scholarly publishing. Rather than threatening quality and peer-review, these models can unlock new opportunities for finding, preserving and analyzing information that advance the discipline. The paper is published in an Open Archaeology-themed special issue of World Archaeology (ironically, a closed-access journal). For those who can’t get past the pay-wall, Eric has archived a preprint. Abstract:

The rise of the World Wide Web represents one of the most significant transitions in communications since the printing press or even since the origins of writing. To Open Access and Open Data advocates, the Web offers great opportunity for expanding the accessibility, scale, diversity, and quality of archaeological communications. Nevertheless, Open Access and Open Data face steep adoption barriers. Critics wrongfully see Open Access as a threat to peer review. Others see data transparency as naively technocratic, and lacking in an appreciation of archaeology’s social and professional incentive structure. However, as argued in this paper, the Open Access and Open Data movements do not gloss over sustainability, quality and professional incentive concerns. Rather, these reform movements offer much needed and trenchant critiques of the Academy’s many dysfunctions. These dysfunctions, ranging from the expectations of tenure and review committees to the structure of the academic publishing industry, go largely unknown and unremarked by most archaeologists. At a time of cutting fiscal austerity, Open Access and Open Data offer desperately needed ways to expand research opportunities, reduce costs and expand the equity and effectiveness of archaeological communication.

Posted in Publication.


Digital Humanities Conference in Berkeley

The 2012 Pacific Neighborhood Consortium (PNC) Annual Conference and Joint Meetings will take place at the School of Information at UC Berkeley from December 7th to December 9th, 2012. The conference is hosted by the Electronic Cultural Atlas Initiative (ECAI) and the School of Information at UC Berkeley. The main theme is New Horizons: Information Technology Connecting Culture, Community, Time, and Place. The program is packed with presentations on various digital heritage topics. Eric Kansa of the AAI will present Sunday morning on Applying Linked Open Data: Refining a Model of ‘Data Sharing as Publication’.

Posted in Events, News.



Cyberinfrastructure in Near Eastern Archaeology

At the 2012 ASOR meeting in Chicago last month, the AAI co-organized (with Chuck Jones, ISAW) and presented in the second year of the three-year session series Topics in Cyberinfrastructure, Digital Humanities, and Near Eastern Archaeology. This year’s theme was From Data to Knowledge: Organization, Publication, and Research Outcomes. Presentations and demonstrations took place in two back-to-back sessions. Topics in Cyberinfrastructure follows a loose format, with short papers followed by a long discussion period in order to allow maximum time for exchange of ideas among presenters and audience members. We see this as critical in an emergent and quickly developing field, and we’re delighted that ASOR is offering more opportunities for these types of “discussion-heavy” sessions. The AAI’s presentation discussed how editorial oversight and the application of linked open data can vastly improve understanding and reuse of data shared on the Web. Other participants presented current projects (The Ur Digitization Project, The Diyala Project, The Oriental Institute’s Integrated Database Project), approaches such as 3-D visualization and text analysis/visualization, and the theory of archaeological knowledge creation. View the full program. The third and final Topics in Cyberinfrastructure session, with a tentative theme around training for a digital future, will take place in November 2013 at the ASOR meeting in Baltimore.

Posted in Events, News, Projects.



The Red Sea Is Arabian, Erythraean, …

Place Name Clustering in Pleiades and TAVO

Since September of last year, I have been working with Eric Kansa on the Gazetteer of the Ancient Near East project of the Alexandria Archive Institute (with NEH funding). Our goal is to export the cornucopia of information contained in the index of the Tübinger Atlas des Vorderen Orients (TAVO) (Tübingen Atlas of the Near and Middle East) into Pleiades, a Community-Built Gazetteer and Graph of Ancient Places. While digitizing, proofreading, systematizing and entering the name, coordinates, map and language facts into a database, we ran into some practical obstacles. With the help of Tom Elliott and Sean Gillies (New York University’s Institute for the Study of the Ancient World), a lot of the obstacles were cleared, but some defy an easy solution. That is where an editor comes in.

Place names can have many variants, not only for chronological (e.g., Byzantium, Constantinople, and Istanbul) or cultural/linguistic reasons (e.g., Genève, Genf, Ginevra, and Geneva), but also due to differences in the exact place or area covered. The Red Sea, a long and narrow body of water between Africa and Arabia, has been an important trade route since ancient times. Consequently, it is mentioned in texts from many different periods and cultures. At times, parts of the Red Sea were named separately, e.g., the Gulf of Aqaba between the Sinai peninsula and Saudi Arabia, and the Gulf of Suez between the major part of Egypt and the Sinai peninsula.

In the two figures at the end of this article (click to enlarge), the relationships between the various Red-Sea-related toponyms are sketched both for the TAVO Index and Pleiades. In the TAVO diagrams, the arrows indicate referrals while in the Pleiades diagrams, they explain which (combination of) place name(s) serve(s) as the main entry (“Place”) and which are the toponyms associated with said main entry (“Names”). In other words, TAVO emphasizes relationships, mutual or not, while Pleiades organizes its information along the lines of one concept (“Place”) and usually-multiple labels (“Names”). The distinction between the whole and its parts is not always maintained, e.g., TAVO’s Rotes Meer (Red Sea) refers to Baḥr al-Qulzum (Gulf of Suez; named after a town at the northern end). The same name can cover different areas, e.g., Pleiades’ Erythr(ae)um Mare is both the Red Sea and the Persian Gulf… and even a “large section of the Indian Ocean, including the Persian Gulf and Red Sea.”
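As a purely schematic illustration (these structures are not the actual TAVO index format or the Pleiades data model, and the toponyms are taken only from the examples above), the contrast between the two approaches might be sketched like this:

```python
from dataclasses import dataclass, field

# TAVO-style: index entries that refer to one another, so the
# information lives in the referral relationships themselves.
tavo_entries = {
    "Rotes Meer": {"refers_to": ["Bahr al-Qulzum"]},  # Red Sea -> Gulf of Suez
    "Bahr al-Qulzum": {"refers_to": []},
}

# Pleiades-style: one conceptual "Place" carrying multiple "Names".
@dataclass
class Place:
    title: str
    names: list = field(default_factory=list)

red_sea = Place(title="Red Sea", names=["Erythraeum Mare", "Rotes Meer"])
print(tavo_entries)
print(red_sea)
```

The editorial work, then, is deciding when a TAVO referral should become another Name on an existing Pleiades Place and when it should become a separate Place (for instance, a gulf that is part of the larger sea).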

The confusing connections between place names ultimately reflect the extent (or lack) of accurate geographic knowledge in the textual sources, as well as the changing economic and political lenses through which the Red Sea region was viewed. The sea was variously seen as anything from far-off and almost mythical to important and oft-navigated. In a way, the place names accrete together in clusters which are unstable and can change composition depending on the time period and which observer is viewing them. With a nod to physics, we encounter an observer effect: “the act of observation will make [changes] on the phenomenon being observed. This is often the result of instruments that, by necessity, alter the state of what they measure in some manner.” After all, our measuring instrument is the human brain, which is nurtured and to an extent determined by its social and cultural environment.

[Figures: diagrams of the relationships among Red-Sea-related toponyms in the TAVO Index and in Pleiades (click to enlarge)]

Posted in Projects.

Tagged with , , , , , .


Workshop Develops Infrastructure for Archaeology in Australia

AAI’s Technology Director Eric Kansa is currently in Sydney attending the Stocktaking Workshop of the Federated Archaeological Information Management System project (FAIMS). FAIMS launched in the summer of 2012 with a major grant from the National eResearch Collaboration Tools and Resources program (NeCTAR), an Australian government program to build new infrastructure for Australian researchers (working in Australia or abroad). By the end of the project (Dec. 2013), FAIMS aims to have developed a portal containing a suite of compatible tools for archaeologists to facilitate the entire research process—from data collection to visualization, archiving, and publishing. The project is also prioritizing interoperability to promote the discovery and use of current and future online resources.

FAIMS is led by the University of New South Wales in collaboration with participants from 41 organizations from Australia and abroad. The Stocktaking Workshop is taking place this week in Sydney at the University of New South Wales School of Humanities. Eric will present the keynote address on August 16, Capacity Building in (Digital) Archaeology. A series of plenary sessions will explore such topics as Mobile Applications, Online Repositories, various aspects of archaeological data sharing and visualization, and sustainability.

Posted in Events, News, Projects.


NSF Support for Linking State Site Files

We are delighted to announce the success of our grant proposal to the National Science Foundation to create interoperability models for archaeological site databases in the eastern United States (NSF #1216810 & #1217240). Our core team consists of researchers from the Department of Anthropology and Archaeological Research Laboratory at the University of Tennessee, the Alexandria Archive Institute, and the Anthropology and Informatics programs at Indiana University. Open Context will be used as the primary platform for data dissemination for this project.

Our aim is to work with the databases held by State Historic Preservation offices and allied federal and tribal agencies in Eastern North America, developing protocols for their linkage across state lines for research and management purposes. Data from some 15 to 20 states (more than half a million sites) will be integrated and linked to promote extension and reuse by government personnel in state and federal agencies and by domestic and international researchers. The interoperability models we develop will be designed to:

  • facilitate and enhance resource management and protection far beyond local levels
  • make protocols and, where appropriate, primary data readily available through open source formats, platforms, and services
  • allow for interoperability among multiple disparate datasets and data systems
  • be sustainable, flexible, adaptable, and capable of growth in a number of directions
  • create frameworks for future “Linked Data” applications in North American archaeology

This project is designed to involve datasets from numerous organizations, and testers from the professional archaeological community. It will generate data products in the form of maps, tables, and analyses useful for primary research, cultural resources management, higher education, and public outreach. Data products will be abstracted and cleaned of sensitive information pursuant to all applicable state and federal requirements.

Posted in Events, News, Projects.