Explore the first Open Science Indicators dataset—and share your thoughts

December 12, 2022 PLOS Open Code Open Data Open Science Open Science Indicators Preprints

Written by Lauren Cadwallader, Lindsay Morton, and Iain Hrynaszkiewicz

Open Science is on the rise. We can infer as much from the proliferation of Open Access publishing options, the steady upward trend in bioRxiv postings, and the periodic rollout of new national, institutional, or funder policies.

But what do we actually know about the day-to-day realities of Open Science practice? What are the norms? How do they vary across different research subject areas and regions? Are Open Science practices shifting over time? Where might the next opportunity lie and where do barriers to adoption persist?

To even begin exploring these questions and others like them we need to establish a shared understanding of how we define and measure Open Science practices. We also need to understand the current state of adoption in order to track progress over time. That’s where the Open Science Indicators project comes in. PLOS conceptualized a framework for measuring Open Science practices according to the FAIR principles, and partnered with DataSeer to develop a set of numerical “indicators” linked to specific Open Science characteristics and behaviors observable in published research articles. Our very first dataset, now available for download at Figshare, focuses on three Open Science practices: data sharing, code sharing, and preprint posting.

How can this dataset be used?

Open Science Indicators are a tool with broad potential applications to many different situations and questions in research communications. Indicators can complement and support the aims of the UNESCO Open Science Monitoring Framework Working Group and meet the needs of organizations that wish to better understand Open Science practices. They can also be used to assess the impact of policy changes, like those set forth in the recent OSTP memo, across the literature—or, in the future, to parse by research discipline or subject, institution, region, or time period. They can tell us which infrastructure is used most often, and by whom.

At PLOS in particular, we hope that a better understanding of how Open Science tools and practices are applied today can help us to identify barriers, understand community norms, better support best practices, and track changes over time.

Importantly, our intention is not for these indicators to be used as a tool for ranking journals, authors, or institutions. As with any quantitative assessment of research characteristics, responsible use calls for context and a multiplicity of measures (see, for example, The Metric Tide and the Leiden Manifesto). Instead, we feel that these indicators are best used as a tool for improvement.

This is just the beginning

In the future we plan to expand on this dataset with new data points, additional publication years, and fresh indicators related to other aspects of Open Science practice. We appreciate your feedback to help inform future iterations. Let us know what you think of the data fields collected, our Open Science Indicator definitions, the Open Practices identified and how we have measured them in this first, early sharing of the results. We also welcome feedback on your needs for this kind of information and uses for the data. Please leave a comment below, or email your thoughts to community [at] plos.org.

What Indicators would be most useful to you?

Initial observations

In the initial dataset we primarily examine data-sharing and code-sharing behaviors, both in PLOS articles and in the wider scientific literature. This dataset also includes insights into preprint posting.

The data cover approximately 61,000 PLOS research articles published in the 3.5 years between January 2019 and June 2022, as well as a comparator set of 6,000 publicly available research articles from PubMed Central (roughly 10% of the size of the PLOS sample).

It’s important to remember that this dataset only measures machine-detectable traits. If, for example, the authors of an article have shared a dataset without labeling it as such, those data may not be counted as “shared.” The accuracy rates for data sharing range from 81% for comparator articles to 85% for PLOS articles. For code sharing, accuracy rates range from 94% for comparators to 97% for PLOS. Preprint accuracy rates are 96% for comparators and 94% for PLOS. Our goal is to reach an accuracy rate of at least 85% for all indicators and content sources. For Open Science Indicators to function at scale, it is essential to automate the process and compare this work with that of other researchers. We are working with DataSeer to improve these accuracy rates, which will be reported with each data release.
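
The accuracy figures above can be checked against the 85% goal with a short script. This is a minimal sketch: the percentages are the ones reported in this post, while the dictionary layout and the names ACCURACY, TARGET, and below_target are illustrative assumptions and not part of the dataset.

```python
# Accuracy rates (percent) as reported in this post, keyed by (indicator, source).
# The structure and names here are illustrative, not part of the dataset.
ACCURACY = {
    ("data sharing", "PLOS"): 85,
    ("data sharing", "comparator"): 81,
    ("code sharing", "PLOS"): 97,
    ("code sharing", "comparator"): 94,
    ("preprint posting", "PLOS"): 94,
    ("preprint posting", "comparator"): 96,
}

TARGET = 85  # stated goal: at least 85% for every indicator and content source


def below_target(rates, target):
    """Return the (indicator, source) pairs whose accuracy is under the target."""
    return sorted(key for key, rate in rates.items() if rate < target)


if __name__ == "__main__":
    for indicator, source in below_target(ACCURACY, TARGET):
        rate = ACCURACY[(indicator, source)]
        print(f"{indicator} ({source}): {rate}% is below the {TARGET}% goal")
```

With the figures reported here, only data sharing in comparator articles (81%) falls short of the goal.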

Composite chart of Open Science Indicators first dataset

Data repository use

While there are many ways to share data, depositing it in a purpose-built data repository is considered best practice. Data repositories offer benefits like improved discoverability and metadata, stable unique identifiers, and archival measures to maintain the integrity of the record over time.

The Open Science Indicators dataset offers two different views into data-sharing methods:

Confirmed data repository: Data deposited in a “known repository” based on a controlled list of ~130 repositories. This number may be somewhat conservative.
Available online: Data available at a recognizable URL. This is a less conservative number that will include less frequently used repositories, including some institutional repositories, as well as other online methods for sharing data (e.g. open notebooks and lab websites).
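
To make the two views concrete, here is a minimal sketch of how both rates could be computed from article-level records. The field names in_known_repository and data_url_found, and the four sample records, are hypothetical and are not taken from the published dataset; check the dataset's documentation for the actual column names. The sketch also assumes that any article counted under "confirmed data repository" is counted under "available online" as well, and it uses all articles analyzed as the denominator, as the footnote to this post describes.

```python
def sharing_rates(articles):
    """Return (confirmed_repository_rate, available_online_rate) as percentages.

    Each article is a dict with two hypothetical boolean fields:
      in_known_repository -- data deposited in a repository on the controlled list
      data_url_found      -- data found at any recognizable URL
    The denominator is every article analyzed, not only those that share data.
    """
    total = len(articles)
    if total == 0:
        raise ValueError("no articles to analyze")
    confirmed = sum(1 for a in articles if a["in_known_repository"])
    # Assumption: a confirmed repository deposit also counts as available online.
    online = sum(
        1 for a in articles if a["in_known_repository"] or a["data_url_found"]
    )
    return 100 * confirmed / total, 100 * online / total


if __name__ == "__main__":
    # Made-up records for illustration only.
    sample = [
        {"in_known_repository": True, "data_url_found": True},
        {"in_known_repository": False, "data_url_found": True},
        {"in_known_repository": False, "data_url_found": False},
        {"in_known_repository": False, "data_url_found": False},
    ]
    confirmed_rate, online_rate = sharing_rates(sample)
    print(f"confirmed repository: {confirmed_rate:.0f}%")
    print(f"available online: {online_rate:.0f}%")
```

Under these assumptions the "available online" rate can never be lower than the "confirmed data repository" rate, which matches the description of the first view as the more conservative one.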

By either measure, PLOS articles are more likely to link to an associated public dataset than comparable articles published elsewhere. When viewed over time, a generally positive trend in confirmed repository usage is evident for both PLOS and comparator articles.*

Composite chart of data indicators for Open Science Indicators first dataset

Code sharing

Code-sharing rates among PLOS and comparator articles are generally similar to one another. Overall, code sharing – in any form – is less common than data sharing, due perhaps in part to reduced relevance (most research generates datasets, but only some research produces code).

Chart of PLOS and comparator articles with publicly available code

In addition to code-sharing rates, the dataset captures whether code was generated as part of the research, lending new insight into adoption rates and potential future adoption. We aim to explore these data more fully in a future post.

Preprint posting

The data indicate that PLOS articles are more likely to have an associated preprint than comparable articles published elsewhere. Overall, 21% of PLOS articles have an associated preprint, versus 19% of comparator articles.

What is next for Open Science Indicators?

This summary explores the three Open Science Indicators we described in our September announcement—but there are many other ways the dataset can be analyzed to understand Open Science practices. For example, you can delve into data- and code-sharing methodologies, differentiating between sharing as Supporting Information (SI) or in a repository. We are eager to hear from you on what aspects of the data are most useful, and what additional data points you’d like to see. At the same time, we’ll take a deeper look at Open Science practices within different research communities.
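
As one illustration of the kind of analysis mentioned above, the sketch below tallies how data were shared, separating Supporting Information from repository deposits. The source and data_location fields and their values are hypothetical stand-ins for whatever the dataset actually records, and the rows are invented for the example.

```python
from collections import Counter


def tally_by_method(rows):
    """Count articles per (source, sharing method) pair.

    Each row is a dict with hypothetical fields:
      source        -- "PLOS" or "comparator"
      data_location -- "repository", "supporting_information", or "none"
    """
    return Counter((row["source"], row["data_location"]) for row in rows)


if __name__ == "__main__":
    # Invented rows for illustration only.
    rows = [
        {"source": "PLOS", "data_location": "repository"},
        {"source": "PLOS", "data_location": "supporting_information"},
        {"source": "PLOS", "data_location": "supporting_information"},
        {"source": "comparator", "data_location": "none"},
    ]
    for (source, method), count in sorted(tally_by_method(rows).items()):
        print(f"{source:<12}{method:<26}{count}")
```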

*All data are current through June 30, 2022 (end of H1); all rates are calculated as a percentage of all articles analyzed.