Murphy’s law for the digital age: anything that can go wrong, will go wrong during a live demonstration. For Ben Marwick, that happened in front of a roomful of landscape-archaeology trainees in Berlin. The topic: computational reproducibility using Docker.
Docker is a tool for creating ‘containers’: standardized computational environments that can be shared and reused. Containers guarantee that analyses always run on the same underlying infrastructure, promoting reproducibility, and they insulate researchers from the hassles of installing and updating research software. Nevertheless, Docker can be hard to use.
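A container is defined by a short text recipe, a Dockerfile, that names a base image and the steps needed to build the environment on top of it. A minimal sketch (the base image, package and paths here are illustrative, not taken from any project in this article):

```dockerfile
# Pin a versioned base image so every rebuild produces the same environment
FROM rocker/verse:4.3.1

# Install analysis dependencies at fixed versions
RUN R -e "install.packages('remotes'); remotes::install_version('ggplot2', version = '3.4.4')"

# Copy the analysis code into the image
COPY analysis/ /home/rstudio/analysis/
```

Anyone with Docker installed can rebuild this image and get the same software stack, down to the package versions.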
Marwick, an archaeologist at the University of Washington in Seattle, had become adept at porting Docker configuration files (‘Dockerfiles’) from one project to the next, making small tweaks to get them working. Colleagues in Germany invited him to teach their students to do the same. But because every student had a slightly different set of hardware and software installed, each needed a customized setup. The demo “was a complete disaster”, Marwick says.
Today, a growing collection of services allows researchers to sidestep such confusion. Using these services, which include Binder, Code Ocean, Colaboratory, Gigantum and Nextjournal, researchers can run code in the cloud without needing to install any software. They can lock down their computational environments, move those environments from laptops to high-performance computing clusters and share them with colleagues. Educators can create and share course materials with students, and journals can improve the reproducibility of results in published articles. It has never been easier to understand, evaluate, adopt and adapt the computational methods on which modern science depends.
William Coon, a sleep scientist at Harvard Medical School in Boston, Massachusetts, spent weeks writing and debugging an algorithm, only to discover that a colleague’s containerized code could have spared him the effort. “I could have just gotten up and running, using all of the debugging work that he had already done, at the click of a button,” he says.
Installing scientific software often means navigating and troubleshooting a byzantine web of computational ‘dependencies’: the code libraries and tools on which each software module relies. Some must be compiled from source code or configured just so, and an installation that should take a few minutes can devolve into a frustrating online odyssey through sites such as Stack Overflow and GitHub. “One of the hardest parts of reproducibility is getting your computer set up in exactly the same way as somebody else’s computer is set up. That is just really difficult,” says Kirstie Whitaker, a neuroscientist at the Alan Turing Institute in London.
Easier evaluation
Docker reduces that process to a single command. “Docker really offers reduced friction for that stage of the cycle of reproducing somebody else’s work, in which you have to build the software from source and combine it with other external libraries,” says Lorena Barba, a mechanical and aerospace engineer at George Washington University in Washington DC. “It helps with that part, making it less error-prone, making it less costly in researcher time.”
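In practice, that single command looks something like the following (the image name and tag are hypothetical, and a local Docker installation is assumed):

```shell
# Build an image from the Dockerfile in the current directory
docker build -t myanalysis:1.0 .

# Run the containerized analysis; every run gets an identical environment
docker run --rm myanalysis:1.0
```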
Barba’s group does most of its work in Docker containers. But hers is a computationally savvy research group; others might find the process intimidating. A text-based ‘command-line’ application, Docker has many options, and crafting a working Dockerfile can be an exercise in frustration.
That’s where the cloud-based services come in. Binder is an open-source project that lets users test-drive computational notebooks: documents such as Jupyter or R Markdown notebooks, which blend code, figures and text. Colaboratory (free), Code Ocean, Gigantum and Nextjournal (the latter three have free and paid tiers) also let users write code in the cloud and, in some cases, bundle it with the data to be processed. These platforms likewise allow users to modify the code and apply it to other data sets, and they provide version-control features for tracking changes.
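With Binder, for example, declaring the environment can be as simple as committing a dependency file to the notebook’s repository; the file contents below are illustrative:

```text
# requirements.txt – pinned Python packages that Binder installs at launch
numpy==1.24.4
pandas==2.0.3
matplotlib==3.7.2
```

Pointing mybinder.org at the repository then builds a container with exactly those versions and opens the notebook in the browser.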
Such tools make it much easier for researchers to evaluate their colleagues’ work. “With Binder, you have taken that barrier [of software installation] away,” says Karthik Ram, a computational ecologist at the University of California, Berkeley. “If I can click that button, be dropped into a notebook where everything is installed, the environment is exactly the way you meant it to be, then you’ve made my life easier to go have a look and give you feedback.”
How to specify required dependencies, and where to find them, varies with the platform. On Code Ocean and Gigantum, it’s a point-and-click operation, whereas Binder requires a list of dependencies in a GitHub repository. Whitaker’s advice: codify your computing environment as early as possible in a project, and stick with it. “If you try and do it at the end, then you are basically doing archaeology on your code, and it’s really, really hard,” she says. Ram built a tool called Holepunch for projects that use the statistical programming language R; it distils the process of setting up Binder into four simple commands. (See examples of our code running on all five platforms at go.nature.com/2ps9se1.)
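The Holepunch workflow looks roughly like the following sketch; the function names and arguments shown are assumptions about the package’s interface, not verified against it:

```r
library(holepunch)

# Write a DESCRIPTION file listing the project's package dependencies
write_compendium_description(package = "My analysis",
                             description = "Code for a reproducible analysis")

# Generate a Dockerfile that Binder can build
write_dockerfile(maintainer = "your_name")

# Add a 'launch binder' badge to the README
generate_badge()

# Trigger a build on mybinder.org
build_binder()
```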
The simplest way to try Binder is at mybinder.org, a free, albeit computationally limited, service. For greater power and security, researchers can build private ‘BinderHubs’ instead. The Alan Turing Institute has two, including one called Hut23 (a reference to Hut 23 at the Second World War code-breaking facility at Bletchley Park, UK), which offers greater computational resources and the ability to work with data sets that cannot be publicly shared, Whitaker says. The Pangeo community, which promotes open, reproducible and scalable geoscience, built a dedicated BinderHub so that researchers can explore climate-modelling and satellite data sets that can run to tens of terabytes, says Joe Hamman, a computational hydroclimatologist at the National Center for Atmospheric Research in Boulder, Colorado. (Whitaker’s team has published a tutorial on building a BinderHub at go.nature.com/349jscv.)
Languages and clouds
Google’s Colaboratory is essentially a cross between a Jupyter notebook and Google Docs: users can share, comment on and jointly edit notebooks, which are stored on Google Drive. Users execute their code in the Google cloud (only the Python language is officially supported) on a standard central processing unit (CPU), a graphics processing unit (GPU) or a tensor processing unit (TPU), a specialized chip optimized for Google’s TensorFlow deep-learning software. “You can open up your notebook or somebody else’s notebook from GitHub, start experimenting with it and then save your copy on Google Drive and work on it later,” says Jake VanderPlas, a member of the Colaboratory team at Google in Seattle.
Nextjournal supports notebooks written in Python, R, Julia, Bash and Clojure, with more languages in development. According to Martin Kavalar, chief executive of Nextjournal, which is based in Berlin, the company has signed up nearly 3,000 users since it launched the platform on 8 May.
Gigantum, a beta version of which launched last year, features a browser-based client that users can install on their own systems or access remotely, for cloud-based coding and execution in the Jupyter and RStudio environments. Coon, who uses Gigantum to run machine-learning algorithms in the Amazon cloud, says the service makes it easy for collaborators to hit the ground running. “[They] can check out my Gigantum notebooks and use this cloud-compute infrastructure to do the training and learning,” he explains.
Then there’s Code Ocean, which supports both notebooks and standard scripts in Python, R, Julia, MATLAB and C, among other languages. Several journals now use Code Ocean for peer review and to promote computational reproducibility, including titles from Taylor & Francis, De Gruyter and SPIE. In 2018, Nature Biotechnology, Nature Machine Intelligence and Nature Methods launched a pilot programme to use Code Ocean for peer review; Nature, Nature Protocols and BMC Bioinformatics subsequently joined the trial. More than 95 papers have now been involved in the trial, according to Erika Pastrana, editorial director of Nature Research’s applied-science and chemistry journals, and more than 20 of those have been published.
Felicity Allen, a computer scientist at the Wellcome Sanger Institute in Hinxton, UK, co-authored one study in that trial, which assessed the types of mutation that can arise from CRISPR-based gene editing (F. Allen et al. Nature Biotechnol. 37, 64–72; 2019). She estimates that it took a week to get the Code Ocean environment working. “The reviewers seemed to really like it,” Allen says. “And I think it was really nice that it made an example that somebody could just press ‘go’ on and it would run.”
Although some worry about the long-term viability of commercial container-computing services, researchers do have options. Simon Adar, chief executive of Code Ocean, notes that Code Ocean ‘compute capsules’ are archived by the CLOCKSS project, which preserves digital copies of online scholarly literature. And Code Ocean, Gigantum and Nextjournal allow Dockerfiles to be exported for use on other platforms. All of which means that researchers can be confident that their code will remain usable, whichever platform they choose.
Benjamin Haibe-Kains, a computational pharmacogenomics researcher at the Princess Margaret Cancer Centre in Toronto, Canada, adopted Code Ocean to respond quickly to critiques of an analysis he published in Nature (B. Haibe-Kains et al. Nature 504, 389–393; 2013). For him, Code Ocean provides a way to ensure that his code can be used and evaluated by his team, peer reviewers and the broader scientific community. “It’s not so much that an analysis must be right or wrong,” he says. “Nothing is ever fully right in this world. But if you’re very transparent about it, you can always communicate effectively in the face of criticism. You have nothing to hide; everything is there.”