Hackathon session #1: Test Cases #1 and #2

On March 18-20, 2013, ALLBIO will organize three parallel meetings at Hotel Eden in Amsterdam. One meeting will be an EMBRACE-style technology workshop on computing interoperability issues in the life sciences. The other two will be hackathons for two of the ALLBIO test cases. The idea is to have breakfast, lunch, dinner, and two short sessions all together to generate synergy. We have ensured ample access to Lisa, the Amsterdam supercomputer at SARA. Lisa does not have the fastest chips, but she has very many of them. We will get an account about a month in advance so that we can start installing things if needed.

Location
Eden hotel
Amstel 144
1017 AE Amsterdam
T: +31 (0)20 5307830
F: +31 (0)20 5307819

And if you want to pre-enjoy the luxury of this place, look at:

  www.edenhotelamsterdam.com
  www.edenrembrandtsquare.com
  www.floamsterdam.com

Hackathons


Coordination
Gert Vriend
Gregoire Rossier

The three-day schedule will be (very roughly; hackathons tend to live a life of their own…):

  Monday 18
  13-14 lunch
  14-14:30 Introduction (Gert, Greg; joint with other hackathon)
  14:30-15 Problem specific introduction (Endre)
  15-18 Getting on with it
  18-19 Time to dress up for dinner
  19-24 Get together in the local Italian restaurant with the other two meetings
  
  Tuesday 19
  9-10 Recap and planning
  10-13 Hacking
  13-14 Lunch
  14-16 If needed, help from the EMBRACEionists
  16-18 Hacking
  18-19 Time to dress up for dinner
  19-24 Get together in the local non-Italian restaurant with the other two meetings
  
  Wednesday 20
  9-10 Recap and planning
  10-13 Hacking
  13-14 Lunch
  14-16 Recap and writing final report

Hackathon / Test case #1: FAIRE-seq data analyser

Location: Board 1, Eden hotel

Participants
Endre Barta (TC leader)
Amin Omidbakhshfard (TC proposer)
Matti Kankainen
Nooshin Omranian
Attila Horvath

Project description
Our test case is about developing a pipeline for FAIRE-seq analysis. I have previously developed a pipeline for ChIP-seq analysis (see the reference below), which we now use successfully at the University of Debrecen for hundreds of samples. During the Milano meeting we discussed that, in general, we can use the same tools for FAIRE-seq as for ChIP-seq analysis. During the hackathon we will try different programs, for example for peak calling and de novo motif finding, using real data. Amin promised to send me the link to download the data in advance so that I can try it out; this is needed for writing the description.
In the meantime we also discussed this with Eija and concluded that the best solution would be to integrate the pipeline to some extent into Chipster, or at least to use tools already integrated in Chipster for the analysis. I will contact Eija later this week and we will discuss it further. As for the hardware requirements, as a minimum, a good internet connection would be enough to connect to the Chipster server at CSC and to our server (for command-line work) at the University of Debrecen. Having a small Linux server available on site (in case we will be working in a hotel) would be a benefit, but we would then need to install basic NGS tools and/or the Chipster server (it is possible to use the Chipster virtual machine).
I have a PhD student, Gergely Nagy, who is working on developing such pipelines. He will also work on the data and would also come to Amsterdam if possible.

Hackathon / Test case #2: Identification of large structural variations

Location: Board 2, Eden hotel

Participants
Yael Maoz (TC leader)
Jozefus Schippers (TC proposer)
Laurent Falquet
Dimitrios Vlachakis
Sophia Kossida
Shumaila Sayyab
Ilari Scheinin (is already in Amsterdam and might drop by; Chipster expert)

Project description

Overview
Although many tools have been designed for structural variation analysis, there are no guidelines to help bench scientists choose the best tool for their particular data set. In this test case, a researcher is eager to identify large structural variants in multiple accessions of Arabidopsis. What is needed is a tool that can identify the type and quality of the sequencing reads, assemble them with regard to a reference genome, identify SNPs, short indels and large indels, and determine whether any annotated genes are associated with the large indels. Lastly, the tool should have a user-friendly interface with which bench scientists can easily examine the structural variants in their accession(s) of interest that need to be validated. The long-term goal would be to scale up this process to analyse multiple genomes at a time (for example, multiple time points, treatments, or related individuals).
Rough Sketch of the Plan:
First we need to identify which SV analysis tools perform best and provide the most information, or come closest to the outcome the researcher is looking for. Then we can work on incorporating the additional features requested (such as gene annotation within indels) and on making it all into a user-friendly interface. To that end, we need test data. It is likely we will go with the Arabidopsis data (here and here), as it seems to be fairly well validated already, including the large indels. At the first hackathon we will test multiple programs and see which performs best according to our needs. Another hackathon will deal with incorporating additional features and with allowing multiple-genome analysis in a single run (such as comparing 6 genomes to each other and to a reference at once). Lastly, we can look at incorporating all of this into a user-friendly, click-and-go type of program and at setting up an online repository for the data.
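
The gene-annotation feature mentioned above (finding annotated genes associated with large indels) comes down to an interval-overlap check. The following is a minimal sketch only, not the method of any of the tools to be tested. The `Interval` type, the function name and the example coordinates are hypothetical, and coordinates are assumed to be 1-based and inclusive.

```python
from collections import namedtuple

# Hypothetical record type: chromosome, 1-based inclusive start/end, and a label.
Interval = namedtuple("Interval", ["chrom", "start", "end", "name"])


def genes_overlapping_indels(indels, genes):
    """Return {indel name: [gene names]} for genes that overlap each indel."""
    hits = {}
    for indel in indels:
        hits[indel.name] = [
            gene.name
            for gene in genes
            if gene.chrom == indel.chrom
            and gene.start <= indel.end
            and indel.start <= gene.end
        ]
    return hits


# Made-up coordinates, for illustration only.
example_indels = [
    Interval("Chr1", 1000, 5000, "DEL_1"),
    Interval("Chr2", 200, 900, "DEL_2"),
]
example_genes = [
    Interval("Chr1", 4500, 6000, "geneA"),
    Interval("Chr1", 7000, 8000, "geneB"),
    Interval("Chr2", 1000, 2000, "geneC"),
]

example_hits = genes_overlapping_indels(example_indels, example_genes)
```

The all-against-all scan is fine for a small validated set; for genome-wide call sets a sorted sweep or an interval tree would probably be the better choice.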

To Do List
1) Check quality of genomes and apply quality control as needed using fastqc
2) Align reads using BWA
3) Quality check alignments
4) Run SV programs
5) Format output as needed for visualization
6) Test accuracy of output against data from http://genome.cshlp.org/content/22/3/508.full (Supp. Table 7)
7) Write up SWOT analysis
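
Steps 1 to 3 of the list above could be scripted roughly as follows. This is a sketch under assumptions: the file names are placeholders, `bwa mem` is only one of BWA's alignment modes (the plan does not fix which one to use), and `samtools sort -o` follows the samtools 1.x option syntax. The commands are only assembled here, not executed, so they can be reviewed before being put into a job script.

```python
import shlex


def build_pipeline(reference, reads_1, reads_2, sample, threads=4):
    """Return a list of (argv, stdout_file) pairs for QC, alignment and alignment checks.

    stdout_file is None when the tool writes its own output files.
    """
    sam = sample + ".sam"
    bam = sample + ".sorted.bam"
    return [
        # Step 1: read quality reports (the output directory must already exist).
        (["fastqc", "-o", "qc", reads_1, reads_2], None),
        # Step 2: index the reference, then align the paired reads.
        (["bwa", "index", reference], None),
        (["bwa", "mem", "-t", str(threads), reference, reads_1, reads_2], sam),
        # Sort and index the alignments (samtools 1.x option syntax).
        (["samtools", "sort", "-o", bam, sam], None),
        (["samtools", "index", bam], None),
        # Step 3: summary statistics of the alignments.
        (["samtools", "flagstat", bam], sample + ".flagstat.txt"),
    ]


def as_shell_lines(steps):
    """Render the steps as shell command lines, with redirects where needed."""
    lines = []
    for argv, stdout_file in steps:
        line = " ".join(shlex.quote(arg) for arg in argv)
        if stdout_file is not None:
            line += " > " + shlex.quote(stdout_file)
        lines.append(line)
    return lines


# Placeholder file names, for illustration only.
pipeline_lines = as_shell_lines(
    build_pipeline("Col0.fa", "Ler_1.fastq", "Ler_2.fastq", "Ler")
)
```

Printing `pipeline_lines` gives one shell command per line, which can be pasted into a job script or run by hand.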

Technical Description
Using the genomic comparison previously performed on the Arabidopsis accessions Columbia and Landsberg as a test dataset, we will run and evaluate 12 structural variation analysis programs, aligning Landsberg (Ler) to the reference genome Columbia (Col); see the Data section below.
We will need the following SV programs installed (with parallelizing capabilities when possible):

Delly
InGap (Java)
BreakDancer
GenomeSTRIP
Tigra-SV
GASV
Hydra
SVMerge
SVDetect
CNVSeq
CNVnator
SVM2
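
One possible way to score the calls of each program against the validated set (To Do item 6) is reciprocal overlap: a call counts as a match when it and a validated variant each cover at least a given fraction of the other. The sketch below is an illustration, not an agreed evaluation protocol. The 50% threshold, the tuple layout and the example coordinates are assumptions and are not taken from the validation data.

```python
def reciprocal_overlap(a, b):
    """Smallest fraction of either interval covered by the other (0.0 if disjoint).

    Intervals are (chrom, start, end) tuples with 1-based inclusive coordinates.
    """
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1]) + 1
    if shared <= 0:
        return 0.0
    return min(shared / (a[2] - a[1] + 1), shared / (b[2] - b[1] + 1))


def score_calls(calls, truth, min_overlap=0.5):
    """Return (precision, recall) of calls against a validated truth set."""
    matched_calls = sum(
        1 for c in calls if any(reciprocal_overlap(c, t) >= min_overlap for t in truth)
    )
    matched_truth = sum(
        1 for t in truth if any(reciprocal_overlap(c, t) >= min_overlap for c in calls)
    )
    precision = matched_calls / len(calls) if calls else 0.0
    recall = matched_truth / len(truth) if truth else 0.0
    return precision, recall


# Made-up intervals, for illustration only.
truth_set = [("Chr1", 1000, 2000), ("Chr1", 5000, 9000), ("Chr3", 100, 600)]
call_set = [
    ("Chr1", 1100, 2100),
    ("Chr1", 5000, 5500),
    ("Chr2", 100, 600),
    ("Chr3", 100, 600),
]

precision, recall = score_calls(call_set, truth_set)
```

In this toy example two of the four calls match a validated variant, and two of the three validated variants are recovered. Running the same function on the output of every program gives numbers that can be compared directly.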

Additionally we will need BWA, samtools, fastqc, picard, perl, python, java and R. We estimate that we will need 512 GB of storage, as the tools will generate lots of intermediate files. We would like to run these as jobs (qsub or bsub) on a cluster. It would be great if the setup allowed us to copy the files to a temporary directory while running a job, so that each job can independently use the same data.
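
The copy-to-temporary-directory pattern described above could be generated per job along the following lines. This is a sketch assuming a PBS/Torque-style scheduler (jobs submitted with qsub) that provides a job-private scratch directory in `$TMPDIR`. Whether the cluster actually does so is an assumption to be checked with its administrators, and the paths and the placeholder command are hypothetical.

```python
import shlex


def make_job_script(name, inputs, command, results_dir, walltime="24:00:00"):
    """Return the text of a job script that stages inputs to per-job scratch space."""
    lines = [
        "#!/bin/bash",
        "#PBS -N " + name,
        "#PBS -l walltime=" + walltime,
        "set -e",
        # Assumption: the scheduler sets TMPDIR to a job-private scratch directory.
        'cd "$TMPDIR"',
    ]
    # Each job gets its own copy of the shared input data.
    for path in inputs:
        lines.append("cp " + shlex.quote(path) + " .")
    # The command is expected to write its output into ./results.
    lines.append(command)
    # Copy results back before the scratch directory is cleaned up.
    lines.append("cp -r results " + shlex.quote(results_dir))
    return "\n".join(lines) + "\n"


# Hypothetical paths and command, for illustration only.
job_script = make_job_script(
    name="sv_test",
    inputs=["/data/Ler.sorted.bam", "/data/Col0.fa"],
    command="mkdir results && echo 'run the SV program here' > results/log.txt",
    results_dir="/data/out/sv_test",
)
```

The returned text would be written to a file and submitted with qsub, one script per SV program, so that the programs do not interfere with each other's intermediate files.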

Data:
4 Ler genomes of Arabidopsis thaliana, obtained with Illumina PE and MP sequencing (200 bp to 5 kbp libraries): http://www.1001genomes.org/projects/MPISchneeberger2011/index.html
We will use the Col-0 line as the reference.

Background
We have recently published two articles on gene regulatory networks: one on the regulatory network guiding the insulin-producing cells, and one on the regulatory region analysis of Wnt5 genes, which revealed the conservation of a regulatory module with putative implication in pancreas development.
1) Kapasa M, Vlachakis D, Kostadima M, Sotiropoulou G, Kossida S, Towards the elucidation of the regulatory network guiding the insulin producing cells, Genomics. 2012, 1(2), 35-47
2) Kapasa M, Arhondakis S, Kossida S, Phylogenetic and regulatory region analysis of Wnt5 genes reveals conservation of a regulatory module with putative implication in pancreas development, Biol Direct. 2010 4;5:49

The EMBRACE meeting
  Monday 18
  12-14 Lunch
  14-16 Introduction seminars of the new club members
  16-18 Discussion on the outcome of the previous two meetings
  18-19 Time to dress up for dinner
  19-24 Get together in the local Italian restaurant

  Tuesday 19
  10-13 Seminars
  13-14 Lunch
  14-16 Helping out at the hackathons
  16-18 Seminars
  18-19 Time to dress up for dinner
  19-24 Get together in the local non-Italian restaurant

  Wednesday 20
  09-11 Helping out at the hackathons
  11-13 Final discussions (including one on the future of this series)
  13-... Lunch

Participants
Tanya Goldberg
Guy Yadav
Jan Krüger
Madis Rumming
Jose Maria Fernandez
Matus Kalas
Kristoffer Rapacki
Kalle Happonen
Jon Ison
Pawel Sztromwasser
Dmitry Repchevski
Gert Vriend
Tim te Beek

Seminars (putting them in some order will be the work of the first ten minutes of the meeting):

Jon Ison: EDAM ontology: update and plans
Jon Ison: BioMedBridges service registry (http://wwwdev.ebi.ac.uk/fgpt/toolsui/)
Gert Vriend: The question domain, or in other words, why are we doing this?
Guy Yadav: Web service sustainability (survivability)
Madis Rumming: CeBiTec history, present, and future
Jan Krüger: Technological issues related to CeBiTec move to the Cloud
Kalle Happonen: Elixir and the Finnish cloud-based delivery model for compute and capacity
Dmitry Repchevski: INB Semantic Web Registry (WSDL 2, WSDL2 RDF etc)
Kristoffer Rapacki: Danish ELIXIR Node
Matus Kalas: BioXSD take-off
Pawel Sztromwasser: Bergen experience with Amazon and the Finnish Cloud