DNA is converted into a binary code. getty
At the 2021 SNIA Storage Developers Conference (SDC) there was an insightful session on DNA for digital storage. This article borrows information from speakers at the SDC session as well as some discussion with Iridia, a DNA storage startup that wasn’t represented in the SDC session. I also had a chance to interview Steffen Hellmold, formerly with Western Digital and now Vice President of Business Development at Twist Bioscience, an early proponent of DNA storage.
As the amount of data being generated and processed grows, so does the demand for storing data. Over 60% of retained data is stored in colder storage technologies (mostly HDDs and magnetic tape, with some optical disc storage as well). That is because this data is less frequently accessed and thus is stored on less expensive storage media (HDD storage costs about $30/TB and magnetic tape media costs about $7/TB) rather than on higher performing, but more expensive, SSDs (now dominating as storage for active data), which cost about $150/TB. Note that the costs of all these storage media are decreasing with time, and any competing technology, such as DNA storage, must be competitive with them when it is commercially introduced in order to achieve broad adoption.
Karin Strauss from Microsoft Research and Luis Ceze from the University of Washington gave a good overview of DNA storage at the SDC. The data is stored in synthetic DNA strands and can potentially provide the highest density storage and a shelf life of over 1,000 years as shown below.
Potential storage density of DNA compared to other storage technology 2021 SDC, Image from Karin Strauss
The chart below shows what an end-to-end DNA storage system would look like. In addition to reading data by SBS (discussed below) DNA can also be read through nanopores (the example given was ONT MinION).
End to end DNA storage system 2021 SDC, Image from Karin Strauss
They pointed out that synthesis and sequencing are currently batch processes and that this may match their use in archiving. They also said that emerging technologies, such as nanopore devices, can provide closer to real-time latencies (versus hours). They then talked about how DNA could be used for some types of compute applications, in particular to provide matching capabilities for searching images, with magnetic nanoparticles used to sort possible matches magnetically. DNA could thus be used as a type of molecular computing.
At the SDC, Illumina said that their NovaSeq 6000 is the first sequencer to exceed $1B in annual revenue. They say that the cost of sequencing a whole human genome has dropped to $600 today. The company uses ultra-high-throughput fluorescence microscopy to measure individual base pairs across billions of fragments of DNA placed in nanowells built into glass wafers. The read process is called Sequencing by Synthesis (SBS).
In SBS, DNA libraries flow through flowcells in the glass wafers and attach to the flowcell surfaces in the nanowells. In these nanowells the DNA strands are amplified (many copies are made) to increase the signal intensity during fluorescent imaging. Illumina’s NovaSeq 6000 can read 6 trillion bases in 44 hours, and the company says this can be 2X faster with 2X longer read lengths (increased batch sizes lower the cost per GB). The slide below, from their talk, suggests that if the cost per whole human genome gets to $100 then the cost of reading data would get down to $80/TB.
Human genome cost declines and potential data write costs 2021 SDC, IMAGE BY TOM COUGHLIN
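To put the NovaSeq throughput figure in storage terms, here is a rough back-of-the-envelope sketch in Python. It assumes two bits of data per DNA base and ignores the redundant reads and error-correction overhead that a real system would need, so the effective data rate would be lower. The function name and parameters are illustrative and do not come from Illumina.

```python
# Rough conversion of sequencer throughput into raw storage read rate.
# Assumption (not from Illumina): each base carries 2 bits, with no
# allowance for redundant coverage or error-correction overhead.

BASES_PER_RUN = 6e12   # 6 trillion bases per run, as quoted in the talk
RUN_HOURS = 44         # run time quoted in the talk


def read_throughput(bases: float, hours: float, bits_per_base: float = 2.0):
    """Return (total_bytes, bytes_per_second) for one sequencing run."""
    total_bytes = bases * bits_per_base / 8
    seconds = hours * 3600
    return total_bytes, total_bytes / seconds


total_bytes, rate = read_throughput(BASES_PER_RUN, RUN_HOURS)
print(f"Raw data per run: {total_bytes / 1e12:.2f} TB")   # 1.50 TB
print(f"Raw read rate:    {rate / 1e6:.1f} MB/s")         # about 9.5 MB/s
```

Under these assumptions a full run reads about 1.5 TB of raw data at roughly 9.5 MB/s, which shows why sequencing today fits batch-oriented archive workloads better than active storage.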
Marthe Colotte, from Imagene, described a hermetically sealed device for long-term storage of dehydrated DNA (DNAshell), shown below.
Imagene DNAshell Technology 2021 SDC, IMAGE BY TOM COUGHLIN
The sealed capsules can be stored in a library system that could hold 250,000 of these capsules in a 3 cubic meter footprint at room temperature. Dehydrated DNA can last for centuries. Reading the stored data involves opening the capsules and rehydrating some of the DNA, which can then be read back using DNA sequencing. In 2019, to celebrate the 30-year anniversary of the United Nations Convention on the Rights of the Child, UNICEF Norway created capsules with the document written on DNA by Twist Bioscience and encapsulated by Imagene.
Andres Fernandez from Twist Bioscience also spoke at the SDC. He said that DNA archives can last more than a thousand years with little cost for storage (at room temperature), although access time for data on the DNA can get down to about 24 hours. He described Twist’s CMOS-based DNA synthesis process. The figure below shows a high-density chip-based DNA synthesis system.
Twist DNA Synthesis System 2021 SDC, IMAGE BY TOM COUGHLIN
Costs can be reduced by reusing chips and performing multiple runs per day, as well as by reusing reagents and increasing the volume per reaction.
Although not presenting at the 2021 SDC, DNA startup Iridia, Inc. plans to use a disruptive biochemical approach to encode and read data in the form of synthetic DNA. Thus far, the dominant view has been that each DNA base should encode two bits (for example, A=00, C=01, T=10, G=11). Iridia’s proprietary topoisomerase-based chemistry enables the sequential stitching together of “cassettes”, each composed of multiple DNA bases as a unit. This unique capability permits both the writing of multiple bits simultaneously (where a cassette has multiple data-encoding bases) and the assembly of chains of single bits (0s and 1s), each composed of multiple DNA bases, as illustrated.
The potential for the approach is that such a chemistry would permit the selection of the most efficient synthesis and read formats such that the cost to write, store and read data in an integrated fashion could be radically reduced and eventually compete with tape and HDDs. Of note, writing and reading data as “0s” and “1s” composed of DNA makes it possible to leverage error correction coding (ECC) of the type tape, HDDs and SSDs use today. The figure below shows these approaches using Topo Cassettes for DNA data encoding.
Iridia Topo Cassette Encoding and Readout Image from Iridia
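For contrast with the cassette approach, the conventional two-bits-per-base scheme mentioned above can be sketched in a few lines of Python. This is a minimal illustration using the example mapping A=00, C=01, T=10, G=11. It is not any vendor's actual codec, and it leaves out the addressing, sequence constraints and error correction that practical DNA storage systems add. The function names are hypothetical.

```python
# Minimal two-bits-per-base codec using the mapping quoted in the article.
# Illustrative only: no addressing, no sequence constraints, no ECC.

BITS_TO_BASE = {0b00: "A", 0b01: "C", 0b10: "T", 0b11: "G"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn each byte into four bases, most significant bit pair first."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases)


def decode(strand: str) -> bytes:
    """Reverse encode(); the strand length must be a multiple of four."""
    if len(strand) % 4 != 0:
        raise ValueError("strand length must be a multiple of 4 bases")
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BASE_TO_BITS[base]
        out.append(byte)
    return bytes(out)


strand = encode(b"DNA")
print(strand)
print(decode(strand))
```

Each byte becomes four bases, so a strand of a given length carries twice as many bits as it would with one bit per base. Cassette schemes give up some of that density in return for chemistry that may be cheaper to write and read.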
DNA storage is focused on deep archive applications. According to Steffen Hellmold, a couple of examples of possible DNA storage workloads are storing video surveillance data and large data sets used for AI training. DNA may be stored in a liquid or dehydrated form, and in either case latency (the time to get data once requested) will generally be a few hours. The DNA Data Storage Alliance (now with 25 members and founded by Twist Bioscience Corp., Illumina, Inc., Western Digital and Microsoft), in a June white paper (An Introduction to DNA Data Storage), projects that synthesis (write) costs for DNA storage may get down to $1/TB by 2030.
Steffen said that DNA storage could become commercial in five years’ time. Another important way DNA storage differs from other data storage technologies is that a great many copies can be created at one time, which may have value in some applications and may also be exploited as replication to improve the odds of data recovery over an extended period of time.
The white paper points out that the cost of sequencing a human genome has fallen faster than the rate of Moore’s law cost reductions for silicon devices (see figure above). If a bit of information is recorded per DNA base, then the cost of information storage at a $1,000 per human genome sequencing price point (approximately where we are today) comes out to $1,300/GB, or $1,300,000/TB. Reducing this cost of writing data to $1/TB by 2030 is impressive, but based upon the magnetic tape roadmap, the costs for writing and reading magnetic tape by 2030 will be considerably less, and the estimated DNA storage costs don’t include the costs to read the data.
DNA holds great promise for very dense data storage. Although the current costs for writing and reading data from DNA are much higher than those of conventional storage technologies, they will come down over time. If DNA costs decline faster than those of other storage technologies, DNA could become a competitor, particularly for archival storage.
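The per-GB figure above depends on the genome size and on how many bits each base carries, so it is worth seeing the arithmetic. The sketch below assumes a human genome of roughly 3.1 billion bases, which is my own round-number assumption rather than a figure from the white paper, and compares one and two bits per base. The function name is illustrative.

```python
# Back-of-the-envelope cost per GB implied by a per-genome sequencing price.
# Assumption (not from the white paper): ~3.1 billion bases per human genome,
# counted once, with no allowance for redundant coverage.

GENOME_BASES = 3.1e9


def cost_per_gb(price_per_genome: float, bits_per_base: float,
                genome_bases: float = GENOME_BASES) -> float:
    """Dollars per gigabyte (1e9 bytes) at a given per-genome price."""
    gigabytes = genome_bases * bits_per_base / 8 / 1e9
    return price_per_genome / gigabytes


for bits in (1, 2):
    print(f"{bits} bit(s) per base: ${cost_per_gb(1000, bits):,.0f}/GB")
```

With these assumptions, $1,000 per genome works out to about $1,290/GB at two bits per base and about twice that at one bit per base, so the quoted $1,300/GB lines up most closely with the two-bits-per-base case.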