Sharing Paleodata (Part 2): Dryad

June 26, 2013 PLOS Blogs Open Access Open Data Open Science Technology Zoology

As promised, today I begin a series on repositories used for paleontological raw data. I will be focusing on repositories to which data is submitted before publication, so that mention of it can appear in the manuscript. If you didn’t read the comments section of my last post, Mark Uhen and colleagues published an article about this exact topic a few months ago:

Uhen, MD, et al. 2013. From card catalogs to computers: Databases in vertebrate paleontology. Journal of Vertebrate Paleontology 33: 13-28.

I didn’t know about this article when I began planning this series, but now that I’ve read it, I think my series will complement that piece nicely. As luck would have it, I wound up gathering a lot of the same types of information about each database. To avoid immediate redundancy, I’m starting with a repository Uhen et al. did not report on, Dryad.

DRYAD (http://datadryad.org/)

Full disclosure: The statements about Dryad in the “Nitty Gritty” section were checked for accuracy by Laura Wendell, Dryad Project Manager. Thanks! My impressions are my own. Also, I haven’t submitted any of my data to Dryad yet, but I have downloaded data from it.

Impressions: This repository is the most important omission from Uhen et al. (2013), especially considering that Evolution, Systematic Biology, and Journal of Paleontology are among the integrated, sponsoring journals.

There are several advantages to using Dryad. First, so many people are using it, for so many different journals. Being part of a large data clearinghouse makes your data more discoverable. Also, Dryad is big on quality control; every file is checked for usability and sensitive information before posting. You don’t have to worry about viruses or bad files. All my downloads have worked smoothly regardless of filetype. Additionally, they take care of making the data go live when the paper is published online, and you can append the file if you missed something or made an error the first time, while keeping the original file too.

Dryad hosts an impressive range of filetypes. Nexus and Excel files are among the most common, but images (2D, CT, video) and other analysis files (BEAST, Structure, etc.) are also present. Because it’s so versatile, Dryad can be used to complement other open data. For example, you can only accession sequences and annotations in GenBank, not alignments or nexus files. GenBank gets more traffic than Dryad, so the sequences should still go there, but the analysis files can go up on Dryad. It takes so much effort to get a Structure file working sometimes; why limit its use to yourself? Another thing to consider: though GenBank numbers are cited by users down the road, the original papers that produced them aren’t always. When Dryad files are cited, both the dataset and the original paper tend to appear in the references section (see below). You can boost your citations this way.

The website is easy for users to navigate; you can search by keyword, author, paper title, and journal (or anything, really). The search function works quickly but isn’t especially smart (e.g., a search for ‘paleontology’ won’t bring up articles with ‘palaeontology’). Clicking on an article brings up all the data files, example citations for the article and data package, and usually a link to the original article. My colleagues who have used Dryad as submitters assure me that the submission process is smooth (it’s streamlined into the manuscript process for some journals), but I can’t speak to that.
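One way to work around the spelling issue is to run the search once per spelling variant. Here is a minimal sketch of that idea; the function name `spelling_variants` is my own invention, it only handles the paleo/palaeo pair, and it does not talk to Dryad at all (you would paste each variant into the search box yourself).

```python
import re


def spelling_variants(query):
    """Return the query in both 'paleo' and 'palaeo' spellings.

    Hypothetical helper: covers only this one American/British pair.
    """
    query = query.lower()
    variants = {query}
    # British -> American spelling
    variants.add(re.sub(r"palaeo", "paleo", query))
    # American -> British spelling
    variants.add(re.sub(r"paleo", "palaeo", query))
    return sorted(variants)


if __name__ == "__main__":
    for term in spelling_variants("paleontology"):
        print(term)
```

Searching for each term it prints should catch articles that use either spelling.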

Negatives? Hard to think of any. I wish the keywords on the article pages linked to other articles with those keywords, but that’s a small quibble. In the near future, you’ll have to pay to use Dryad – I think the price is reasonable for permanently archiving your data, but some may balk.

Bottom line: Grant writers should feel comfortable listing Dryad in their Data Management Plan because it’s safe, easy to use, and your reviewers will (should) have heard of it; reviewers should feel comfortable asking authors to post data to Dryad when appropriate, for the same reasons.

DRYAD: THE NITTY GRITTY

What it is: International repository for the data associated with peer-reviewed scientific publications. All types of data are welcome, but life sciences and medical data are most common.

What it is, in their words: “Dryad is a nonprofit repository for data underlying the international scientific and medical literature.”

Who runs it: Dryad, a nonprofit. The members (organizations, not individuals) elect a board of directors annually. Currently these are mostly people who work in academic publishing or information technology, or do government, academic or industry research.

Who funds it: Initially funded by the NSF. Moving forward, Dryad will stay sustainable through a combination of membership fees and submission fees (to cover operating costs like maintaining and curating the data) and grants from private and government sources (to cover research and development costs). Dryad also hopes that funding agencies, especially those that require a data management plan, will be amenable to grantees using grant funds to pay the submission fee (more here).

Who uses it: Researchers who submit or use data; many life sciences journals that require data to be posted on Dryad.

Cost to submit: Submitters pay an $80 fee per data package on submission (starting 1 Sept 2013), unless the journal contracts with Dryad to pay the fee, or the submitter is from an economically disadvantaged country. Additional fees apply to users who submit through non-integrated journals ($10) or have very large datasets (10+ GB; $15). Fees list here.
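If you want to budget for this in a grant, the fee rules above can be sketched as a small calculation. This is only my reading of the numbers quoted here, not an official Dryad calculator: I am assuming the large-dataset surcharge is a flat $15 once a package goes over 10 GB, and that a waiver (journal pays, or submitter from an economically disadvantaged country) covers the whole fee including surcharges. The function and parameter names are made up for illustration.

```python
BASE_FEE = 80                  # per data package, from 1 Sept 2013
NON_INTEGRATED_SURCHARGE = 10  # submitting via a non-integrated journal
LARGE_DATA_SURCHARGE = 15      # assumed flat surcharge for large packages
LARGE_DATA_THRESHOLD_GB = 10   # assumed: surcharge applies above 10 GB


def submission_fee(size_gb, integrated_journal,
                   journal_pays=False, fee_waiver_country=False):
    """Estimate the submitter's cost in US dollars for one data package."""
    # Assumption: a waiver covers the base fee and any surcharges.
    if journal_pays or fee_waiver_country:
        return 0
    fee = BASE_FEE
    if not integrated_journal:
        fee += NON_INTEGRATED_SURCHARGE
    if size_gb > LARGE_DATA_THRESHOLD_GB:
        fee += LARGE_DATA_SURCHARGE
    return fee


if __name__ == "__main__":
    # A 25 GB CT dataset submitted through a non-integrated journal.
    print(submission_fee(25, integrated_journal=False))
```

Check the fees list before putting a number in a budget, since the actual terms may differ from these assumptions.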

Cost to access: Free to individuals and to institutions. So… free.

Data and file types supported: Most; they don’t restrict based on filetype. Files should be data, but data includes text, tables, spreadsheets, images, CT data, finite element models, video, photos, nexus files, software code, compressed file archives. Non-data files may also be submitted, if they are “integral to the publication and can be released in the public domain”.

File sizes allowed: 10GB limit per project; more is ok, but you’ll pay a small fee starting 1 September 2013.

Copyright status: All files are CC0.

Data available during peer review? Yes, depending on the journal.

Allowed to post data from previous pubs? “Data that was originally collected for another publication may be submitted as long as it is referenced by the current publication.” In other words, you need to be citing the data in a new manuscript; you’re not supposed to create projects for projects you’re completely done with. However, it’s clear from my searches that some users are uploading data from previous publications. I’m not naming names, because I think uploading older data is a good idea. Also, it seems like some journals (e.g., Systematic Biology) may be posting all their old data files to Dryad. If so, I heartily applaud the effort and suggest more journals do the same.

Accession numbers provided? A unique DOI is provided. Dryad registers it for you after checking the data. Because you list this in the publication itself, the data submission process begins early (though this varies by journal).

Data goes live when: Varies by journal, but generally, when the publication goes live online at the publisher’s site. For some journals, you can embargo data for up to a year after the paper is published. Dryad does all this for you.

Data is backed up? Yes (CLOCKSS)

Stats provided? Page views and file downloads are visible on each page.

How to cite your data in your manuscript? Dryad doesn’t have a requirement; this varies by journal anyway. Dryad recommends an in-text citation, something like:

Data available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.j2m25n82 (but where ‘j2m25n82’ is replaced with a code unique to your project).

How to cite someone else’s data you download and use in a new manuscript? Dryad doesn’t have a requirement, but recommends a data citation in-text (see above) AND in the references section. This is in addition to citing the original article, and would look something like:

Valentine JW, Jablonski D, Krug AZ, and Berke SK. 2013. The sampling and estimation of marine paleodiversity patterns: implications of a Pliocene model. Paleobiology 39(1): 1-20.

Valentine JW, Jablonski D, Krug AZ, and Berke SK. 2013. Data from: The sampling and estimation of marine paleodiversity patterns: implications of a Pliocene model. Dryad Digital Repository. http://dx.doi.org/10.5061/dryad.42577
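If you cite a lot of data packages, the two recommended formats above are easy to generate from the DOI. The following is a rough sketch that mirrors the example wording in this post; the function names are hypothetical, and your journal's house style may want something different.

```python
DOI_RESOLVER = "http://dx.doi.org/"


def dryad_url(doi):
    """Turn a DOI such as '10.5061/dryad.42577' into a resolvable URL."""
    return DOI_RESOLVER + doi


def in_text_citation(doi):
    """In-text data availability statement, following the wording above."""
    return ("Data available from the Dryad Digital Repository: "
            + dryad_url(doi))


def reference_entry(authors, year, title, doi):
    """Reference-list entry for the data package (not the article)."""
    return (f"{authors}. {year}. Data from: {title}. "
            f"Dryad Digital Repository. {dryad_url(doi)}")


if __name__ == "__main__":
    print(reference_entry(
        "Valentine JW, Jablonski D, Krug AZ, and Berke SK",
        2013,
        "The sampling and estimation of marine paleodiversity patterns: "
        "implications of a Pliocene model",
        "10.5061/dryad.42577",
    ))
```

Remember that the data citation goes alongside the citation of the original article, not in place of it.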

Can update after publication? Yes (but the original version is not editable).

Benefits in a nutshell:

Dryad checks each submission to make sure all files work, scans for viruses, checks for copyright status and sensitive data.
Already linked with several journals. It’s even integrated with some journals, so data submission to Dryad is part of the manuscript submission process.
Can link data packages directly to/from GenBank, TreeBASE, or other specialized repositories.
Everyone else is using it (multi-disciplinary across the life sciences).

Three recent paleo papers using it:
Ksepka DT, Balanoff AM, Bell MA, Houseman MD (2013) Fossil grebes from the Truckee Formation (Miocene) of Nevada and a new phylogenetic analysis of Podicipediformes (Aves). Palaeontology, online in advance of print. Dryad data package here.

Olori JC (2013) Ontogenetic sequence reconstruction and sequence polymorphism in extinct taxa: an example using early tetrapods (Tetrapoda: Lepospondyli). Paleobiology 39(3): 400-428. Dryad data package here.

Wood HM, Matzke NJ, Gillespie RG, Griswold CE (2012) Treating fossils as terminal taxa in divergence time estimation reveals ancient vicariance patterns in the palpimanoid spiders. Systematic Biology 62(2): 264-284. Dryad data package here.