Since the Biology of Genomes meeting in early May, a tempest has been brewing. It is only in this last week that this tempest has gathered enough strength that it could no longer be contained by those who have chosen to stir it up. The esteemed Daniel MacArthur blogged and tweeted from the conference. This apparently caught the attention of the conference organizers and GenomeWeb. As journalists, the folks at GenomeWeb are required to follow CSHL's media rules, which require that journalists get the permission of a speaker before publishing any information from her talk. GenomeWeb saw a double standard when comparing what Daniel was allowed to do and what they were allowed to do. They then contacted CSHL. The initial write-up of the gathering storm in Science Insider characterized this contact as complaining. GenomeWeb characterized it as asking CSHL for clarification of their policy (in a comment on a response posted by Daniel in his blog, Genetic Future). Of course this attempt at what is, in effect, censorship has only served to bring more attention to Daniel's blog (the so-called Streisand effect), and has resulted in a number of responses from other bloggers like Anthony Fejes, DrugMonkey, and even GenomeWeb's Daily Scan, comments (some quite passionate) on the Science Insider story, Daniel's response, and FriendFeed, as well as a couple of well-reasoned pieces on where the policy should head from here by Ed Yong and Andrew Maynard. Daniel himself provides a nice summary of it all in a follow-up post. With all that sound and fury, there is not much to add on the subject other than to say I suppose I am lucky that the 500 or so emails I had to pore through each night after the meeting ended at 10:30 or 11 p.m. prevented me from posting any commentary during the meeting (well, the emails plus the fact that I knew Daniel would do a better job than me).
Taking a step back, there is a larger double standard at play here than the distinction between professional journalists and peddlers of new media. Many of the conclusions around whether CSHL is right in restricting any type of journalist focus on the type of conference and the expectations that type of conference creates in the minds of the presenters. At a private, invitation-only conference, no publishing. At a breaking results conference like Biology of Genomes, get permission. At an open conference, anything goes. So then one might ask: why aren't all conferences open? The whole notion that presenting something at a conference carries some understanding that others will respect the unpublished work is a bit ridiculous (this point has been made by others, along with the fact that Biology of Genomes is over-subscribed every year; getting people in the door is not a problem). But I am not even going to debate that point. The more interesting question is: why aren't all data and research released rapidly and made freely available? Since the Bermuda Principles were agreed to in 1996, all genome sequencing centers have submitted their data, from raw sequence data to finished sequence to assemblies to annotation, to public repositories as quickly after generation as possible. These principles were reinforced by the Fort Lauderdale agreement in 2003, which added a provision that protected the production centers' right to first publication. But as we have seen recently, that provision of the Fort Lauderdale agreement is not always enforced. As sequencing has moved into medical applications, the sequencing centers have taken great pains to release human sequence data in a responsible manner, but still rapidly. What's more, they now also release the detected variants fully annotated and correlated with phenotypic information in protected access databases available to any researcher.
As data that require more and more analysis and significant human curation are made rapidly available well before publication, the production centers become ever more vulnerable to getting "scooped" on their hard-won findings.
As Church and Hillier properly conclude in the above-referenced article:
Sequence data are now easier to produce, but decisions about timelines for data release, publication, and ownership and standards for assembly comparison and quality assessment, as well as the tools for managing and displaying these data, need considerable attention in order to best serve the entire community. (Emphasis mine)
This conclusion begets many questions. If the rapid release described in the Bermuda Principles still holds true, why does it only apply to large-scale sequencing centers? Many researchers are generating more sequence in a month than the Human Genome Project was able to produce in a year. As they continue to be allowed to perform pre-publication (as opposed to post-generation) data submission, why are they not being held to the same standard as the large-scale sequencing centers?
Stepping back further, does dumping all of those data, literally terabytes and terabytes, into public nucleotide repositories like the SRA and ERA as soon as they are generated still make sense? Who has the bandwidth to download and use it all? Mainly only those centers that are submitting it. For human data, a single instrument run contains enough data to identify an individual. Should there not be at least some provisions in place to allow data generators to properly assess and quality control their data?
The human reference has been published (with a recent update to GRCh37). The blueprint exists. Thus, many of the reasons underlying the conclusions of the Bermuda Principles are no longer applicable. So should those open access principles be applied more widely to other areas of biology and science at large or should they no longer apply to sequence data from a genome for which a reference exists? It is time to rethink the current policies and begin to apply them to all sequence generators. And people are doing just that. The double standard must end.