====Hackathon session #1: Test Cases #1 and #2====

On March 18-20, 2013, ALLBIO will organize three meetings in parallel in Hotel Eden in Amsterdam. One meeting will be an EMBRACE-style technology workshop on computing interoperability issues for the life sciences. The other two will be hackathons for two of the ALLBIO test cases. The idea is to have breakfast, lunch, dinner, and two short sessions all together to generate synergy.

We have ensured ample access to Lisa, the Amsterdam supercomputer at SARA. Lisa does not have the fastest chips, but she has very many of them. We will get an account about a month in advance so that we can start installing things if needed.

**Location**\\
Eden hotel\\
Amstel 144\\
1017 AE Amsterdam\\
T: +31 (0)20 5307830\\
F: +31 (0)20 5307819

And if you want to pre-enjoy the luxury of this place, look at:\\
www.edenhotelamsterdam.com\\
www.edenrembrandtsquare.com\\
www.floamsterdam.com

==Hackathons==

**Coordination**\\
Gert Vriend\\
Gregoire Rossier

The three-day **schedule** will be (very roughly, hackathons tend to live their own life...):

**Monday 18**\\
13-14 Lunch\\
14-14:30 Introduction (Gert, Greg; joint with other hackathon)\\
14:30-15 Problem-specific introduction (Endre)\\
15-18 Getting on with it\\
18-19 Time to dress up for dinner\\
19-24 Get-together in the local Italian restaurant with the other two meetings

**Tuesday 19**\\
9-10 Recap and planning\\
10-13 Hacking\\
13-14 Lunch\\
14-16 If needed, help from the EMBRACEionists\\
16-18 Hacking\\
18-19 Time to dress up for dinner\\
19-24 Get-together in the local non-Italian restaurant with the other two meetings

**Wednesday 20**\\
9-10 Recap and planning\\
10-13 Hacking\\
13-14 Lunch\\
14-16 Recap and writing the final report

==Hackathon / Test case #1: FAIRE-seq data analyser==

Location: Board 1, Eden hotel

**Participants**\\
Endre Barta (TC leader)\\
Amin Omidbakhshfard (TC proposer)\\
Matti Kankainen\\
Nooshin Omranian\\
Attila Horvath

**Project description**\\
Our test case is about developing a pipeline for FAIRE-seq analysis.
I previously developed a pipeline for ChIP-seq analysis (see the reference below), which we now use successfully at the University of Debrecen for hundreds of samples. During the Milano meeting we discussed that, in general, we can use the same tools for FAIRE-seq as for ChIP-seq analysis. During the hackathon we would try different programs, for example for peak calling and for de novo motif finding, on real data. Amin promised to send me the download link for the data in advance so that I can try it out; this is needed for writing the description.

In the meantime we also discussed this with Eija and concluded that the best solution would be to integrate the pipeline to some extent into Chipster, or at least to use tools already integrated in Chipster for the analysis. I will contact Eija later this week to discuss it further. As for the hardware requirements, a good internet connection is the minimum: it would be enough to connect to the Chipster server at CSC and to our server (for the command-line work) at the University of Debrecen. Having a small Linux server available on site (in case we will be working in a hotel) would be a benefit, but we would then have to install basic NGS tools and/or the Chipster server (it is possible to use the Chipster virtual machine).

I have a PhD student, Gergely Nagy, who is working on developing such pipelines. He will also work on the data and would come to Amsterdam as well if possible.

==Hackathon / Test case #2: Identification of large structural variations==

Location: Board 2, Eden hotel

**Participants**\\
Yael Maoz (TC leader)\\
Jozefus Schippers (TC proposer)\\
Laurent Falquet\\
Dimitrios Vlachakis\\
Sophia Kossida\\
Shumaila Sayyab\\
Ilari Scheinin (is already in Amsterdam, might drop by; Chipster expert)

**Project description**

//Overview//\\
Although many tools have been designed for structural variation analysis, there are no guidelines to help bench scientists choose the best tool for their particular data set. In this test case, a researcher is eager to identify large structural variants in multiple accessions of Arabidopsis. What is needed is a tool that can identify the type and quality of the sequencing reads, assemble them with regard to a reference genome, identify SNPs, short indels and large indels, and determine whether any annotated genes are associated with the large indels. Lastly, the tool should have a user-friendly interface with which bench scientists can easily examine the structural variants in their accession(s) of interest that need to be validated. The long-term goal would be to scale up this process to analyse multiple genomes at a time (for example, multiple time points, treatments, or related individuals).

//Rough sketch of the plan//\\
First we need to identify which tools for analyzing structural variants (SVs) do the best analysis and provide the most information, or come closest to the outcome the researcher is looking for. Then we can work on incorporating the additional features requested (such as the gene annotation within indels) and on making it all available through a user-friendly interface. To that end, we need test data. It is likely we will go with the Arabidopsis data (here and here), as it seems to be fairly well validated already, including the large indels. At the first hackathon we will test multiple programs and see which performs best according to our needs.
Another hackathon will deal with incorporating additional features and with allowing multi-genome analysis in a single run (such as comparing 6 genomes to each other and to a reference at once). Lastly, we can look at incorporating all of this into a user-friendly, click-and-go type of program and at setting up an online repository for the data.

//To Do List//\\
1) Check quality of genomes and apply quality control as needed using fastqc\\
2) Align reads using BWA\\
3) Quality check alignments\\
4) Run SV programs\\
5) Format output as needed for visualization\\
6) Test accuracy of output against data from http://genome.cshlp.org/content/22/3/508.full (Supp. Table 7)\\
7) Write up SWOT analysis

//Technical Description//\\
Using the genomic comparison previously performed on the Arabidopsis accessions Columbia and Landsberg as a test dataset, we will run and evaluate 12 structural variation analysis programs, aligning Landsberg (Ler) to the reference genome Columbia (Col); see the Data section below.

We will need the following SV programs installed (with parallelizing capabilities when possible):\\
Delly\\
InGap (Java)\\
BreakDancer\\
GenomeSTRIP\\
Tigra-SV\\
GASV\\
Hydra\\
SVMerge\\
SVDetect\\
CNVSeq\\
CNVnator\\
SVM2

Additionally we will need BWA, samtools, fastqc, picard, perl, python, java and R. We estimate that we will need 512 GB of disk space, as the tools will generate lots of intermediary files. We would like to run these as jobs (qsub or bsub) on a cluster.
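
Step 6 of the To Do list (testing the output against the validated calls) could be approached along the lines sketched below. This is only a minimal sketch under our own assumptions: each SV is reduced to an interval with chromosome, start, end and type; a call counts as correct when it shares at least 50% reciprocal overlap with a validated SV of the same type; and the coordinates in the example are made up for illustration, not taken from Supp. Table 7. The names ''SV'', ''reciprocal_overlap'' and ''score_calls'' are hypothetical.

```python
from typing import Dict, List, NamedTuple


class SV(NamedTuple):
    """One structural variant as a half-open interval (assumed representation)."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    kind: str   # e.g. "DEL" or "INS"


def reciprocal_overlap(a: SV, b: SV) -> float:
    """Return the smaller of the two overlap fractions, or 0.0 if there is no match."""
    if a.chrom != b.chrom or a.kind != b.kind:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def score_calls(calls: List[SV], truth: List[SV], min_ro: float = 0.5) -> Dict[str, float]:
    """Greedily match each call to at most one unused validated SV and count hits."""
    used = set()  # indices of validated SVs that already have a matching call
    tp = 0
    for call in calls:
        for i, known in enumerate(truth):
            if i not in used and reciprocal_overlap(call, known) >= min_ro:
                used.add(i)
                tp += 1
                break
    fp = len(calls) - tp
    fn = len(truth) - len(used)
    precision = tp / len(calls) if calls else 0.0
    recall = tp / len(truth) if truth else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


# Made-up example: two validated deletions, two calls from one SV program.
validated = [SV("Chr1", 1000, 2000, "DEL"), SV("Chr2", 500, 900, "DEL")]
program_calls = [SV("Chr1", 1100, 2100, "DEL"), SV("Chr3", 10, 50, "DEL")]
print(score_calls(program_calls, validated))
```

Running the same scoring on the output of each SV program would give precision and recall figures that can be compared directly. The 50% threshold is an assumption and may need to be relaxed for large indels with imprecise breakpoints.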
It would be great if the setup allowed us to copy the files to a temporary directory while a job is running, so that each job can use the same data independently.

//Data//\\
Four Ler genomes of Arabidopsis thaliana that were obtained with Illumina paired-end (PE) and mate-pair (MP) sequencing (200 bp to 5 kbp libraries): http://www.1001genomes.org/projects/MPISchneeberger2011/index.html \\
We will use the Col-0 line for the reference.

//Background//\\
We have recently published two articles on gene regulatory networks: one on the regulatory network guiding the insulin-producing cells, and one on the regulatory region analysis of Wnt5 genes, which revealed the conservation of a regulatory module with a putative implication in pancreas development.\\
1) Kapasa M, Vlachakis D, Kostadima M, Sotiropoulou G, Kossida S, Towards the elucidation of the regulatory network guiding the insulin producing cells, Genomics. 2012, 1(2), 35-47\\
2) Kapasa M, Arhondakis S, Kossida S, Phylogenetic and regulatory region analysis of Wnt5 genes reveals conservation of a regulatory module with putative implication in pancreas development, Biol Direct. 2010 4;5:49

==The EMBRACE meeting==

**Monday 18**\\
12-14 Lunch\\
14-16 Introduction seminars of the new club members\\
16-18 Discussion on the outcome of the previous two meetings\\
18-19 Time to dress up for dinner\\
19-24 Get-together in the local Italian restaurant

**Tuesday 19**\\
10-13 Seminars\\
13-14 Lunch\\
14-16 Helping out at the hackathons\\
16-18 Seminars\\
18-19 Time to dress up for dinner\\
19-24 Get-together in the local non-Italian restaurant

**Wednesday 20**\\
09-11 Helping out at the hackathons\\
11-13 Final discussions (including one on the future of this series)\\
13-... Lunch

**Participants**\\
Tanya Goldberg\\
Guy Yadav\\
Jan Krüger\\
Madis Rumming\\
Jose Maria Fernandez\\
Matus Kalas\\
Kristoffer Rapacki\\
Kalle Happonen\\
Jon Ison\\
Pawel Sztromwasser\\
Dmitry Repchevski\\
Gert Vriend\\
Tim te Beek

Seminars (putting them in some order will be the work of the first ten minutes of the meeting):\\
Jon Ison: EDAM ontology: update and plans\\
Jon Ison: BioMedBridges service registry (http://wwwdev.ebi.ac.uk/fgpt/toolsui/)\\
Gert Vriend: The question domain, or in other words, why are we doing this?\\
Guy Yadav: Web service sustainability (survivability)\\
Madis Rumming: CeBiTec history, present, and future\\
Jan Krüger: Technological issues related to the CeBiTec move to the Cloud\\
Kalle Happonen: ELIXIR and the Finnish cloud-based delivery model for compute and capacity\\
Dmitry Repchevski: INB Semantic Web Registry (WSDL 2, WSDL2 RDF, etc.)\\
Kristoffer Rapacki: Danish ELIXIR Node\\
Matus Kalas: BioXSD take-off\\
Pawel Sztromwasser: Bergen experience with Amazon and the Finnish Cloud