Reporter and Curator: Dr. Sudipta Saha, Ph.D.
As digital information continues to accumulate, higher-density and longer-term storage solutions are needed. DNA has many potential advantages as a medium for immutable, high-latency information storage. For example, DNA storage is extremely dense: at its theoretical maximum, DNA can encode two bits per nucleotide (nt), or 455 exabytes per gram of single-stranded DNA (ssDNA). Unlike most digital storage media, DNA storage is not restricted to a planar layer, and it often remains readable despite degradation under non-ideal conditions over millennia. Finally, DNA's essential biological role provides access to natural reading and writing enzymes and ensures that DNA will remain a readable standard for the foreseeable future.
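As a rough sanity check of that density figure, the short calculation below works through the arithmetic; the average nucleotide mass of ~330 g/mol is an assumed round value for ssDNA, not a number taken from the original report.

```python
# Back-of-the-envelope check of the ~455 exabytes/gram figure.
# Assumption (not from the original text): average ssDNA nucleotide mass ~330 g/mol.
AVOGADRO = 6.022e23          # molecules per mole
AVG_NT_MASS_G_PER_MOL = 330  # approximate mass of one ssDNA nucleotide

nt_per_gram = AVOGADRO / AVG_NT_MASS_G_PER_MOL   # ~1.8e21 nucleotides per gram
bits_per_gram = 2 * nt_per_gram                  # two bits per nucleotide (theoretical max)
exabytes_per_gram = bits_per_gram / 8 / 1e18     # 1 exabyte = 1e18 bytes

print(f"{exabytes_per_gram:.0f} EB/g")           # ~456 EB/g, consistent with the ~455 EB/g cited
```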
Storing messages in DNA was first demonstrated in 1988, and the largest project to date had encoded 7,920 bits. The small scale of previous work stems from the difficulty of writing and reading long, perfect DNA sequences, which has limited broader applications. The authors developed a strategy to encode arbitrary digital information using a novel encoding scheme that exploits next-generation DNA synthesis and sequencing technologies. An HTML-coded draft of a book containing 53,426 words, 11 JPG images, and 1 JavaScript program was converted into a 5.27-megabit bitstream. These bits were then encoded onto 54,898 159-nt oligonucleotides (oligos), each comprising a 96-bit data block (96 nt), a 19-bit address specifying the location of the data block in the bitstream (19 nt), and flanking 22-nt common sequences for amplification and sequencing. The oligo library was synthesized on inkjet-printed, high-fidelity DNA microchips. To read the encoded book, the library was amplified by limited-cycle PCR and then sequenced on a single lane of an Illumina HiSeq. Overlapping paired-end 100-nt reads were joined to reduce the effect of sequencing error. Then, using only reads of the expected 115-nt length with perfect barcode sequences, a consensus was generated at each base of each data block at an average of ~3,000-fold coverage. All data blocks were recovered with a total of 10 bit errors out of 5.27 million, predominantly located within homopolymer runs at the ends of oligos where there was only single-sequence coverage.
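The sketch below illustrates the dimensions of this addressing scheme, splitting a bitstream into 96-bit data blocks, each tagged with a 19-bit address giving its position in the stream; the packing details and helper names are illustrative assumptions, not the authors' exact format.

```python
# Illustrative sketch: split a bitstream into 96-bit data blocks, each tagged with a
# 19-bit binary address. Only the block dimensions mirror the paper; the exact bit
# layout and padding rule here are assumptions for illustration.

DATA_BITS = 96
ADDRESS_BITS = 19

def to_addressed_blocks(bitstream: str):
    """bitstream: a string of '0'/'1' characters; returns a list of (address, data) pairs."""
    blocks = []
    for address, start in enumerate(range(0, len(bitstream), DATA_BITS)):
        data = bitstream[start:start + DATA_BITS].ljust(DATA_BITS, "0")  # pad final block
        addr = format(address, f"0{ADDRESS_BITS}b")                      # 19-bit address
        blocks.append((addr, data))
    return blocks

# Scale check: 5,270,208 bits / 96 bits per block = 54,898 blocks,
# consistent with the 54,898 oligos reported.
```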
This method has at least five advantages over past DNA storage approaches. First, one bit per base (A or C for zero, G or T for one) was encoded instead of two, which allowed each message to be encoded in many different ways and thus avoid sequences that are difficult to read or write, such as those with extreme GC content, repeats, or secondary structure. Second, splitting the bitstream into addressed data blocks eliminated the need for long DNA constructs, which are difficult to assemble at this scale. Third, rather than cloning and sequence-verifying constructs, many copies of each individual oligo were synthesized, stored, and sequenced; because errors in synthesis and sequencing are rarely coincident, each molecular copy corrects errors in the other copies. Fourth, a purely in vitro approach avoided the cloning and stability issues of in vivo approaches. Finally, next-generation technologies in both DNA synthesis and sequencing were leveraged to encode and decode large amounts of information at ~100,000-fold lower cost than first-generation encodings.
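As a minimal illustration of the first advantage, the sketch below encodes one bit per base and uses the free choice between two candidate bases per bit to break up homopolymer runs; the selection heuristic shown is an assumption for illustration only, not the authors' actual sequence-design procedure (which also considers GC content, repeats, and secondary structure).

```python
# Illustrative one-bit-per-base encoding: 0 -> A or C, 1 -> G or T.
# Because each bit has two candidate bases, the sequence can be chosen to avoid
# problem motifs. Here a simple heuristic (never repeat the previous base) is
# used to prevent homopolymer runs; this rule is an assumption for illustration.

ZERO_BASES = ("A", "C")
ONE_BASES = ("G", "T")

def encode_bits(bits: str) -> str:
    seq, prev = [], None
    for b in bits:
        candidates = ZERO_BASES if b == "0" else ONE_BASES
        base = candidates[0] if candidates[0] != prev else candidates[1]
        seq.append(base)
        prev = base
    return "".join(seq)

def decode_seq(seq: str) -> str:
    return "".join("0" if base in ZERO_BASES else "1" for base in seq)

bits = "0011010011"
dna = encode_bits(bits)
assert decode_seq(dna) == bits
print(dna)  # "ACGTAGACGT": no two identical bases in a row
```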
The density (5.5 petabits/mm³ at 100× synthetic coverage) and scale (5.27 megabits) of this work compare favorably to other experimental storage technologies while using only commercially available materials and instruments. DNA is particularly well suited to immutable, high-latency, sequential-access applications such as archival storage. Density, stability, and energy efficiency are all potential advantages of DNA storage, while the costs and times for writing and reading are currently impractical for all but century-scale archives. However, the costs of DNA synthesis and sequencing have been dropping at exponential rates of 5- and 12-fold per year, respectively, much faster than electronic media at 1.6-fold per year. Hand-held, single-molecule DNA sequencers are becoming available and would vastly simplify reading DNA-encoded information. The general approach of using addressed data blocks, combined with library synthesis and consensus sequencing, should be compatible with future DNA sequencing and synthesis technologies. Reciprocally, large-scale uses of DNA, such as information storage, could accelerate the development of synthesis and sequencing technologies. Future work could use compression, redundant encodings, parity checks, and error correction to improve density, error rate, and safety. Other polymers or DNA modifications could also be considered to maximize reading, writing, and storage capabilities.
Source Reference:
Church GM, Gao Y, Kosuri S. Next-Generation Digital Information Storage in DNA. Science. 2012;337(6102):1628.