Fq_delta: efficient storage of processed versions of fastq files

Next Generation Sequencing (NGS) is a relatively new technology for studying biological processes and the genetic properties of organisms. NGS generates much more data than older technologies, up to terabytes for a single experiment. And the technology is still improving: on average, the amount of data being made available doubles every five months. This data is usually stored in fastq files.

Although this offers exciting new possibilities for researchers, it also poses a challenge: how can researchers store such vast amounts of data? During my graduation internship at the Viroscience Lab of the Erasmus Medical Centre in Rotterdam, I was asked to look into this problem.

One part of my work revolved around deduplication of data. During research, the data may be processed in several steps before the actual analysis. However, those steps are not set in stone. Both for practical reasons and to improve reproducibility, researchers at the Viroscience Lab would like to store a version of the data after each step. Under normal circumstances, every stored version would be a complete file in its own right, so the required storage capacity would grow with each step.

I developed fq_delta: a Python module to store differences between versions of fastq files. It uses the google-diff-match-patch library, which in turn implements Myers' diff algorithm. Where traditional diff tools work on lines or blocks, this library is able to detect changes at the character level. For the type of data and types of changes involved in NGS, this is much more efficient than, for instance, diff or rdiff. Tests have indicated that a processed version stored this way requires less than three percent of the original file size.

Fq_delta creates delta files by cycling through both original and processed files.
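
To give an impression of why a character-level delta works so well for this kind of data, here is a minimal sketch. It is not fq_delta's actual code or file format: it uses difflib from Python's standard library as a stand-in for google-diff-match-patch (difflib does not implement Myers' algorithm), and the function names make_delta and apply_delta as well as the example read are made up for illustration.

```python
import difflib


def make_delta(original, processed):
    """Describe `processed` as a list of character-level edits on `original`.

    Hypothetical delta format, for illustration only:
      ("copy", n)      keep the next n characters of the original
      ("skip", n)      drop the next n characters of the original
      ("insert", text) add text that is not present in the original
    """
    matcher = difflib.SequenceMatcher(None, original, processed, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i2 - i1))
            continue
        if i2 > i1:  # characters removed from the original
            delta.append(("skip", i2 - i1))
        if j2 > j1:  # characters that only exist in the processed version
            delta.append(("insert", processed[j1:j2]))
    return delta


def apply_delta(original, delta):
    """Rebuild the processed text from the original plus the delta."""
    parts = []
    pos = 0
    for op, arg in delta:
        if op == "copy":
            parts.append(original[pos:pos + arg])
            pos += arg
        elif op == "skip":
            pos += arg
        else:  # "insert"
            parts.append(arg)
    return "".join(parts)


# One made-up fastq record, before and after trimming 13 bases from the end.
sequence = "ACGTACGTAAGGCCTT" + "AGATCGGAAGAGC"
quality = "I" * 16 + "HHHHHHHGGGGFF"
original = "@read1\n" + sequence + "\n+\n" + quality + "\n"
processed = "@read1\n" + sequence[:16] + "\n+\n" + quality[:16] + "\n"

delta = make_delta(original, processed)
restored = apply_delta(original, delta)
print(delta)
print(restored == processed)
```

A line-based diff would have to store the trimmed sequence and quality lines in full, because both lines changed. The character-level delta only needs to record how much of the original to keep and how much to drop, which is why trimming and filtering steps leave such small deltas.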

The module offers a class that behaves like a file object, so fq_delta can be used in conjunction with a module like Biopython. The module also installs two command-line scripts that allow the user to stream data to and from fastq processing tools like cutadapt or the FASTX-Toolkit.

I’ve written a technical note on fq_delta which is, at the time of writing, under review for publication. The module itself is available for download at https://github.com/averaart/fq_delta. The rest of my graduation internship involved a review of several dedicated and general-purpose compression applications, and advice on the future of the faculty’s ICT infrastructure.
