Peter Kraker on Google Dataset Search, the open science movement, and his #DontLeaveItToGoogle campaign.
You initiated the #DontLeaveItToGoogle campaign after Google brought out a search engine for scientific datasets. What was the reason to start such an initiative?
Research data is an important scientific output and there are many benefits to research data sharing, including data reuse and aggregation. But discovery is a big problem, even bigger than in literature. In our research, we found that up to 85% of datasets are not reused. In many ways, we cannot cash the cheques written by the open science movement, when we do not enable adequate discovery of the things we make open.
Therefore, we need better tools for data discovery. But I do not believe that Google Dataset Search is the right answer. It represents a proprietary and closed system on top of our own data. This is a system that benefits massively from researchers’ labour, but one in which researchers will have no say. Google is capitalizing on a movement that they have contributed nothing to. Therefore, we need an open alternative. However, at the moment it seems to me that funders, research administrators and infrastructures are content to leave it to Google. This is highly problematic, especially since we have discussed the problems of lock-in effects and other negative outcomes of proprietary infrastructure for years now.
#DontLeaveItToGoogle is therefore an effort to make people aware of the problems related to Google Dataset Search and to protest the inertia when it comes to funding an open alternative. It’s the responsibility of the public and private funders to take the initiative and provide an open alternative. We do have the European Open Science Cloud, which is to be released very soon, and data is the main focus of this cloud, yet I don’t see how discovery will work in these systems of federated infrastructures. I think the public and the private funders need to step up here and bootstrap such an open alternative.
Where do you see the problem with a search engine for data run by Google?
I fear that Dataset Search will go the way of Google Scholar. When Google Scholar came about 15 years ago, it was a ground-breaking literature search engine. The scientific literature, however, has doubled in the meantime, and Google has not made enough investment to keep up with this growth. As a result, Google Scholar is of very limited use today and does a bad job at helping researchers to find relevant papers for their information needs. Now, this lack of innovation would not be a problem, if other tools could build on top of Google Scholar. But unfortunately they can’t, because the Google Scholar index is not reusable. Innovators in this market have to first build their own index, which is not helped by the fact that Google has many special arrangements with content providers that the rest of the world does not have.
And we are all poorer for it – discovery is in many ways the departure point of research, and the results of this step decide whether research is reused or duplicated, whether new collaborations are formed or these opportunities are missed. Discovery is therefore important for efficiency, effectiveness and quality of research. Now I do fear that the same could happen for datasets, if we do not put an open alternative out there.
Do you think that public funders are not responsive enough to the demands of the research community?
I don’t think that the funders don’t recognize that infrastructure is generally needed, but I’m not always sure that they have the right focus. In my opinion, there is a lot of money for backend infrastructure, for creating data stores, high performance computing and that kind of thing, but there is very little money for front ends and services.
It’s interesting to consider why this is so. The reason might lie to some extent in the history of how scientific applications were created: the interface was in many cases an afterthought. One thing that we’ve seen with the web, especially starting in 2008-2009, is that there were innovative services that also had a very good interface. This drew a lot of users into digital science, and I think that this kind of mindset now also needs to be applied to public and non-profit infrastructures – to really think from the user’s perspective, and to make sure that we take into account how things work, because that’s the essence of design.
How can the focus be changed?
At the moment it’s mostly about awareness – it’s necessary to really bring this idea forward and also to make sure that, for example, in a conference on open infrastructures, design aspects and user interface aspects are actually taken into account. At many of these events, these topics are not represented at all. There are always a lot of conversations about technology and interoperability and all these important things, but almost nobody talks about the interface.
What are truly open research infrastructures and why is it important to have them?
Truly open research infrastructures are those that can be reused. By that I mean the software (open source), the content and the data. They are community-driven and community-owned. In such an ecosystem, innovation thrives, because we can all build on top of each other’s work. There are also no lock-in effects that we see with closed offerings – if an organisation does not work out in the way the community expects it to, the community can take it somewhere else. Therefore, truly open infrastructures are the strongest drivers of innovation in scholarly infrastructures today.
Why is it difficult for non-profit research infrastructure providers to compete in the market?
Funding for non-profit infrastructures is scarce. The VC route, meaning taking on venture capital, is not possible for non-profits and therefore, it is difficult to establish sustainable business models. In addition, commercial players such as Elsevier suck all of the money out of the market by offering expensive bundle deals. This makes it difficult for libraries to support open alternatives, even though they usually have much smaller asks than their commercial competitors. There are also free commercial services, but there users pay with their data – a highly problematic business model as we have seen in the Facebook scandal.
You founded Open Knowledge Maps a few years ago, a non-profit tool that helps to visualize scientific literature. Where are the difficulties of such initiatives?
Well, first of all, let me say that there are many positive things about a non-profit initiative. I love working with a dedicated team of mostly volunteers, who put thousands of hours into Open Knowledge Maps in their free time. An enthusiastic community has formed around Open Knowledge Maps. We have had half a million users in the 2.5 years of our existence. It’s great to hear the many stories of people from around the world who are now able to get an overview of research topics much faster than before and discover new relationships and findings that were previously hidden from them.
But funding is indeed the sticking point here. So far, we have got by on a tiny budget that would usually be barely enough for a single person. What I said earlier about scarce funding for non-profit organisations is especially true for open source services and frontends – but they are the way researchers engage with open science. By leaving this market to proprietary and closed solutions, we are limiting innovation in how researchers – and the rest of the world – interact with scientific knowledge.
The things we can do on a pure volunteer basis are limited. Therefore my call would be to invest in interfaces and services – this is how we make the open science revolution a reality.
Comment
Hi Peter, very good commentary! I couldn’t agree more that research data discoverability is the crucial missing bit for making data more reusable. It is the “F” in FAIR data. I also take your commentary to mean that you believe that Google Datasets is actually a good initiative by itself, but that it is not open and community-driven enough.
Therefore I am wondering if you are aware of the Mendeley Data search initiative, where over 35 data repositories are indexed (number is steadily growing). While this initiative is also not an open source application, it does deliver a couple of really important open aspects that might be of interest to you and the broader research community.
1) Any researcher can use the public open API to query the search engine at scale. This allows researchers to build their own services and tools on top of the engine, or create their own discovery services.
2) Any data repository can use the ‘push API’ to add datasets. The search engine offers a service that deep-indexes the data. This means that the index metadata is enriched with deeper information retrieved from the data files themselves.
This initiative was launched about a year ago and is steadily gaining traction from additional data and institutional repositories/collections that see the benefit of these services, as well as from a steady flow of researchers who are using the front-end.
I look forward to your ideas on how we can engage and improve further.
Thanks for your comment.
Regarding FAIR, I do see findability as a prerequisite of discoverability, but to me findability is a characteristic of the dataset itself (and its metadata), whereas discoverability is provided by the tools around it, if that makes sense. That’s why I believe we need to talk about these as two distinct, albeit interconnected, characteristics.
I do believe that we need dataset discovery tools, but Google Dataset Search is again applying a closed, proprietary model to what should be a completely open infrastructure. Therefore, I actually do not believe that it is a good initiative.
I do agree with you that having an API is better than not having an API, but otherwise Mendeley Data Search is also far from an open infrastructure. Without the source code and the index in the open (API does not equal full data dumps), there is no way that the effort could be taken somewhere else. And being able to push one’s datasets does not mean that it’s a community-driven or even community-owned initiative. Mendeley Data Search is firmly in the hands of Elsevier, who are using it as a tool to market their own data store, with the goal of upselling universities premium services on top of their researchers’ data.
Elsevier is now trying to co-opt the open science movement, even though it not only failed to contribute to it but actively opposed it. Therefore, I believe we need to build a truly open contender to these efforts to keep control of our data, its governance and its evaluation, and ensure constant innovation in this space.