Fq_delta: efficient storage of processed versions of fastq files

Next Generation Sequencing (NGS) is a new technology to research biological processes and genetic properties of organisms. NGS generates much more data than older technology – up to terabytes for one experiment. And the technology is still improving: on average every five months, twice as much data is being made available. This data is usually stored in fastq files.

Although this offers exciting new possibilities for researchers, it also poses a challenge: how can researchers store such vast amounts of data? During my graduation-internship at the Viroscience Lab of the Erasmus Medical Centre in Rotterdam, I was asked to look into this problem.

One part of my work revolved around deduplication of data. During research the data may be processed in several steps before the actual analysis. However, those steps are not set in stone. Both for practical reasons and to improve reproducibility, researchers at the Viroscience Lab would like to store a version of the data after each step. In normal circumstances this would require more and more storage capacity.

I developed fq_delta: a python module to store differences between versions of fastq files. It uses the google-diff-match-patch library, which in turn implements Myer’s diff algorithm. Where traditional diff algorithms are either line or block based, Myer’s diff algorithm is able to detect changes at the character level. For the type of data and types of changes involved in NGS, this is much more efficient than for instance diff or rdiff. Tests have indicated that processed versions require less than three percent of the original file size.

Fq_delta creates delta files by cycling through both original and processed files.

The module offers a class that acts like a file-like object, so fq_delta can be used in conjunction with a module like Biopython. The module also installs to commandline scripts that allow the user to stream data to and from fastq processing tools like cutadapt or the FASTX-Toolkit.

I’ve written a technical note on fq_delta which is, at the time of this writing, under review for publication. The module itself is available for download at https://github.com/averaart/fq_delta. The rest of my graduation internship involved a review of several dedicated and general-purpose compression applications, and an advice regarding the future of the faculty’s ICT infrastructure.

Showing correlation in Open Data

During de course INFLAB Maarten van den Hoek and I were challenged to explore potential applications of open data made available by the city of Rotterdam. After some experimentation we decided to utilise Python, MongoDB and the lessons we had learned so far in the Data Mining course to develop web-application that calculates the correlation between different types of objects and their attributes in a given area. For instance: what is the correlation between small trees and red playground equipment? Given the right data, this could be expanded with locations of events. For example: what is the correlation between the presence (or absence) of traffic signs and accidents. In some cases, the tool can even be used to check the quality of the different datasets. If, for instance, the type of soil in which a tree is planted is labeled differently from the traffic sign that stands right next to it, at least one of the data points is probably wrong.

Then we took it a step further: we divided the area in smaller sections and calculated correlations for each subsection. Then the deviations from the global correlation were projected on a map. This way both a pattern and the exceptions to that pattern can be examined. You could discover that one particular subsection of an area has a significantly more desirable correlation between two variables, for instance playgrounds and traffic accidents, than it’s surrounding area. This discovery could then lead to an investigation of the subsection, to see what can be learned from it to improve other areas.

The web-application is available for download at https://github.com/averaart/INFLAB01-Object-Heatmap/tree/NewInterface.

INFPRJ07/08: Planning Parent-Teacher Conferences

For this project, we were asked to develop a web-application that would enable a school to automate the proces of planning their parent-teacher conferences. We were given a set of rules we needed to try a adhere to, like “Parents shouldn’t need to wait more than half an hour in total between different conferences.” And there were, of course, practical limitations we needed to account for, such as the number of available tables, the number of time-slots each evening and the annoying fact that teachers simply can’t be at to tables at once.

The assignment stated that we should use the Google App Engine platform, but we were free to choose the language. Because we already were familiar with Java, and the language Go was still in development, we chose to use Python.

We ended up building a system that allowed parents and guardians to pick what subjects needed to be discussed per child, and to order the available days according to preference as well as indicate if they had a preference to be planned early or late in the evening.

During the first part of this project, I focussed on the ingest of the available data into GAE’s datastore and on generating plausible parent-preferences to work on. The second part of the project I worked on the algorithm that generated the actual planning. The parents were first sorted by the number of conferences they requested. The parents with the most conferences were planned first, as they would be harder and harder to place as the schedule filled up. This would take into account their preference for day and time. All parents were planned with one empty time-slot between them. After a while all requested conferences would be planned, but there would be teachers planned at two or more tables at the same time, as shown in this PDF. The algorithm would then start reordering the planned conferences of each parent, until it found the order that resulted in the lowest number of conflicts. Ultimately, this would result in a planning that adhered to all the practical restrictions and most preferences, as shown in this PDF.