Around the world tomorrow, groups from all sectors will be celebrating Open Data Day, an annual event that highlights the benefits of open data and encourages the adoption of open data policies in government, business and civil society. As publishers, we see data availability as crucial for the validation and foundation of new research and key to our mission to help researchers advance the scientific record. In the spirit of Open Data Day, we’ve decided to defer to researchers and data enthusiasts to answer the questions: why is data important and how can we make it better?
We got our answers from Sudhakaran Prabakaran, one of a number of dedicated volunteers at Cambridge University known as the Data Champions who are helping advise members of the research community on managing their data. Read his thoughts below.
Who are the Data Champions? What kinds of data questions do you help researchers navigate?
I think it’s a fantastic forum. [We have] a lot of discussions and people exchange ideas and not necessarily just in the sciences but also in every other field. [Data management] can be kind of confusing, even simple things like can I put my [datasets] in Dropbox? Can I share them in Google Drive? You’re talking about even labeling stuff in desktop computers. There is no clarity because this landscape is fast-moving and people are not trained to catch up with that kind of speed at which things change.
What are you working on right now? How does open data play a role in it?
Our lab thrives on open data. We train machine learning algorithms that look at specific regions of the human genome and try to identify the most important mutations, and then identify drugs to target them. People have already published and analyzed most of the datasets I work with, and they’ve extracted what they want. I’m kind of looking at things that they don’t want. I’m looking at non-coding regions, just kind of digging deeper into the datasets.
Why do you think open data is important? What do you think the future open data landscape looks like?
I don’t think open data is enough…it’s the analysis also. For example, we train a lot of machine learning algorithms and in the process we fail many, many times, so we know the pitfalls and we know what to avoid. If you share that process with other people, that will enable them to overcome those pitfalls and get there. It is very difficult, and that process can be shared with people.
I think future young people are going to be brought up in an environment where they can just click something and get access to the code and get access to the data themselves. And then the issues of reproducibility would be mitigated if you can share what you’ve done and the dataset is there for other people to work with.
What advice would you give to authors and researchers to encourage them to share their data?
I think we have encountered these scenarios even as a Data Champion in my own department. I think if you have incentives, as in [getting] your DOI and authorships for the dataset even before publication, then it’s easy to share. It’s your data. [Someone] can probably do a different kind of analysis and publish it, but they have to cite this data and you will benefit from that.
And it’s in the best interest of the authors to share it ahead of time because of reproducibility.
What can you do to encourage good data management?
You can practice the open data lifestyle by sharing your research data in an open repository and making it available when you submit your manuscript. If you’re reviewing a submission, knowing how to evaluate the associated datasets can be tricky, which is why we’ve worked with the Data Champions to cover everything you need to know in this Reviewer’s Quick Guide to Assessing Datasets. If the manuscript you’re reviewing doesn’t have an associated dataset, request it!
About the Data Champions Programme
The Data Champions Programme is a network of volunteers who advise members of the research community on proper handling of research data. In this, they promote good research data management (RDM) and support Findable, Accessible, Interoperable, and Re-usable (FAIR) research principles. It is run by the Research Data Management Facility at the University of Cambridge.