


A Digital Library Guru Discusses New Rules on Sharing Scientific Data

January 28, 2011, 5:34 pm

Last week, a significant change went into effect at the National Science Foundation: The agency will now require researchers to submit data-management plans with their grant proposals.

Open government advocates hailed the move as the latest in a series of steps that are expanding public access to work done with taxpayer money. The policy will not go so far as to mandate public sharing of all data, which in this context could mean anything from glacier images to scientific papers to computer code. But it will “require people to essentially justify why they choose not to be open,” says Beth Noveck, a professor at New York Law School who until recently directed the White House Open Government Initiative.

You can find lots of detailed information about the change at the NSF and the Association of Research Libraries sites. We sat down with a leading data guru, Sayeed Choudhury, to get his take on what the move means for science. Mr. Choudhury, associate dean of university libraries at Johns Hopkins University, heads a project called the Data Conservancy. That effort has an NSF grant to help develop part of the foundation’s ambitious DataNet project, which seeks to build an international, large-scale data-curation network.

Q. What’s your opinion about the NSF’s change?

A. Generally speaking, there is quite a bit to be said for allowing not only other scientists, but the general public to have access to the results of federally funded research. We’ve seen some of that with the NIH PubMed Central (a free archive of life-sciences journals). There have been a couple of cases when we’ve opened up data to what one of the professors here at Hopkins likes to call “Internet scientists.” If you look at Galaxy Zoo, basically what astronomers did is open up access to images from NASA’s Hubble Space Telescope archive. And what they found is that people are much better, quite frankly, at classifying galaxies by looking at images than machines are right now. This may seem like a cute little thing, but it’s not. This is really helpful to professional astronomers for their research. It’s really taken a life of its own, in that the framework people are using, they’re now using for other kinds of science projects. So it really is not only “it’s good for taxpayers.” It actually gets much broader participation in science activities than I think you’d otherwise get. (For more on the rise of crowd science, see this story The Chronicle published last May.)

Q. How big a deal is this?

A. The way a lot of sharing happens now is like sending e-mail to each other. It’s point to point. I may read a paper, I may discover that somebody’s doing this kind of research, or I may know people and contact them. And I think there’s a lot of, “OK, here you go. Here are my files. If you have questions, I can explain it to you.” That’s fine. But I think what we are starting to see is much more distributed. It’s a little bit more like peer-to-peer networks. To me, the ultimate value of preservation of data is that I don’t need to go back to the original producer to figure out how to use it. It becomes much more systematic, rather than idiosyncratic. If that’s the case, then you build this network—it becomes part of the social fabric, rather than this point-to-point e-mail and telephone kind of exchange. I think that’s what’s potentially significant about this. But the devil’s in the details, right? We’ve got a lot of work in order to make it work like that.

Q. What is motivating these changes?

A. I do think the taxpayer issue is an important one. That’s probably the most explicit reason. There are some implicit ones as well, including the idea that if you can actually share data, preserve it, use it in responsible and meaningful ways, then you can get better science out of it. … Some publishers have a policy right now for providing free access to a lot of their journals in least-developed countries. And there’s at least some noise that they were about to change this, or some of them may have changed this. A lot of the counterarguments that have come up are that this is a really bad idea. These countries don’t have a lot of resources. And by getting access to publications, they’re able to get better science, they’re able to deal with public health issues, and so on. And I don’t think it’s any different with data.

The other aspect of this is there’s also the possibility of spurring on reuse outside of the academic or the scientific world. There could be companies that produce services around data, things of that nature that they may not be doing right now. If you think about the weather data, for example, that the National Weather Service produces. But other people use it and repackage it, the Weather Channel and people like that. So there are, in fact, for-profit uses that could come up if you release data into the public. People may be very interested in having visualization tools, for example.

Q. Practically, do you have any sense of what will change? What will we start to see—public repositories of this stuff?

A. We are thinking about where these data will reside. My impression is that there will be a combination of both centralized and decentralized approaches. What I don’t think we want is many, many data sets linked to many, many Web sites. The Web sites may go away. They may not be maintained. They may be personal Web sites rather than institutional Web sites. The data need to be curated … Documents, including even the publications within PubMed Central, are designed to be read by people. Data are born to be processed by machines. And that has very profound implications in terms of how they’re managed and accessed and preserved over time. So that’s a very practical, substantive question that has to be put out there. If we invest a lot of funding in producing new data, we have to invest some amount of funding in actually making sure that the data are preserved and can be used. So beyond that, let’s fast-forward to a world where, in fact, that is happening, and scientists know that in fact they can put their data somewhere, and it’ll be taken care of. Then people start to think about how they can do things in different ways.

We have a researcher within the Data Conservancy, Patricia Romero Lankao, who looks at climate-change research, particularly the social impacts of climate change. She’s been thinking about a whole new type of research that would be possible if you actually were able to bring together data from these different places and run different kinds of analyses. From a science perspective, you start to get people saying, “Well, OK, what if this kind of environment existed? What kinds of questions might I ask that I don’t ask today, because it’s just not practical?”

Q. How do scientists feel about the new requirement?

A. I think it varies. I think you’re going to get reactions all the way from, “I have enough to do, and I have enough documentation to produce,” all the way to, “This is good. This is, in fact, what science is about.” The most common experience that we’ve had so far is they come in the room, and there’s this sense of, “I don’t really know what this is—can you help me?” And we go through the template we’ve gotten, we go through the interview process, and I hope get them to a comfort level where they realize, “OK, I get it now. I understand why this is useful. I understand why it’s important.”

This entry was posted in Libraries, Research.


11 Responses to A Digital Library Guru Discusses New Rules on Sharing Scientific Data

johnlaudun - January 31, 2011 at 9:42 am

If we were, like some non-American systems, to recognize all forms of scholarship as a form of science (I am thinking here of the humanities) and then to think of making all kinds of data available for all kinds of uses, we would have truly accomplished something. Obviously agreeing upon a set of standards for storage and access will be critical, but once we get that worked out, I can’t wait to see what people, inside the academy or out, begin to do with data.

mjw13 - January 31, 2011 at 1:16 pm

It is precisely because there are commercial applications for National Weather Service data that they’re not freely available to the public. Academics have access to more of their data than the public through a special dispensation.

Libraries have seen this in the past with govt. products like STAT USA (now deceased) and FBIS. Once collected into databases with a mandate to recoup costs, access was restricted to either a librarian logging in with a secret password or a commercial subscription.

crunchycon - June 1, 2012 at 10:47 am

Whether there is a desire to be public or not, one should understand that when it is a matter of public record, it can be made public.

leah_shopkow - June 1, 2012 at 10:48 am

I’m a little puzzled by the tenor of some of the comments. When information is public, it is public for all legal purposes. It is not illegal to mine data. It is not illegal to send a letter to someone you don’t know. What others “should” or “should not” do doesn’t come into it. I’m not devaluing someone’s feelings of outrage about having what they consider private violated or questions about morality (although having struggled with our IRB, I would note that the mining of publicly available information is not unethical as IRBs would define it). The real issue isn’t that the information shouldn’t be public, but is rather the manipulative implication of the study; once notified that they are under personal scrutiny, people have difficult choices about how they plan to proceed, and they may change their conduct. But scrutiny, even at the individual level, was the intention of the laws.

dashwood - June 1, 2012 at 10:54 am

To mjw13: Where this information is made available varies by state. In some states this is available in the courthouse or other governmental facility within each county, but most other states make these data available on a computer tape or through some other electronic means. Many states charge a fee for access to these data; in some cases the data can be quite expensive to obtain. Usually these data sets include information on whether an individual voted as well as date of birth, race, gender, party registration, and other basic information. In most cases these data include information about whether an individual voted in the last several elections, so one can trace a person’s turnout history.

mcdonaldj - June 1, 2012 at 11:03 am

I received one of these and found it very odd.  It looked like junk mail or some sort of scam. The problem isn’t that this is somehow protected information — clearly it is a matter of public record. Rather, I simply fail to see what the research might be. Their web page does not make this clear.  An email query to the researchers produced a response that also failed to provide any transparency on the issue. So what folks are probably feeling — I certainly am — is that they’re being manipulated, and toward what end is unclear.

fairday - June 1, 2012 at 11:08 am

The influence of corporations in our lives is undeniable 24/7 year in year out. They really can make you do things even when you don’t know they are doing so. They also control the politicians that “push you around.” It is even worse now that corporations are people too, as ruled by Citizens United and the Republican nominee for POTUS.

jsibelius - June 1, 2012 at 11:48 am

It’s one thing to have the information available for anyone who cares to expend the effort to look it up.  It’s another to have unsolicited information handed to you.  This is particularly important when you and the neighbors keep a detached existence and suddenly knowing (without asking) that they contributed to a cause you hate with the “very fiber of your being” can cause some nasty problems that extend far beyond the effect of next year’s contribution.

jsibelius - June 1, 2012 at 11:50 am

And another question:  is this money donated to a specific candidate – who just happens to be a member of said political party – or is this money donated to the political party general campaign fund?  Because if it’s the former, there can be some very big distinctions there as well.

mtyler - June 1, 2012 at 11:57 am

I’m surprised that the Harvard Institutional Review Board didn’t put a stop to this kind of human research. If I remember my CITI course correctly, this is a violation of federal law.

22286504 - June 1, 2012 at 12:49 pm

Disclosure of campaign contributions has been the law for many years at the national level and in a majority of the states for decades, so there’s really no expectation of privacy as the previous post suggests. There may be a desire for privacy, but not an expectation of one. So it really isn’t correct to suggest that one expects contributions to be private while lawn signs or bumper stickers involve no such expectation. And in any case, why should the low-income person whose only way to influence an election is by posting a sign or sporting a bumper sticker have their politics disclosed while the wealthy person who can give a lot of money remains anonymous? As to voting, a majority of states allow you to register either by party or as an independent. But if you register as an independent there is a price to pay: you cannot vote in the party primary and thus cannot influence the selection of general election candidates. Your party registration is a matter of public record.

Most states have public records laws that extend to other activities, such as signing petitions submitted to get candidates or ballot propositions on the ballot. The Supreme Court held two years ago that such public records laws apply to petition signatures (which often also include a residential address). So when you sign petitions directed to government officials or actions, your involvement becomes a matter of public record.

But is this public record stuff a “good thing”? When the Supreme Court considered this in the 1970s, following a revision in the federal campaign finance laws, it concluded that there were three reasons to uphold public disclosure: (1) to inform citizens about who the backers were of various candidates and parties so that citizens could take that into account when deciding who to vote for (this seems a lot more pertinent today when corporations are pouring money into campaigns); (2) to help enforce campaign finance laws that may restrict the sources or amounts of contributions, by allowing citizens to check to make sure enforcement was occurring; and (3) to help prevent or identify corruption in government by allowing citizens and enforcement agencies to see whether there are connections between who contributes to campaigns and who benefits from government actions (contracts, legislation, etc.), and then to pursue questions about whether there were quid pro quo arrangements when these contributions are made.

Others have commented already that in the internet age a whole lot of information becomes public, whether you want it to be or not. So who is it that really expects that in the political wars, disclosure of campaign contributions won’t occur? I can’t even set foot out of my house without having that disclosed if a Google maps truck happens to be driving by and scanning my neighborhood. (This really happened to me and I now have the dubious privilege of being on Google maps.) When I get involved in funding the political wars between candidates and parties, do I really expect that people won’t find that out?