Differences

This shows you the differences between two versions of the page.

public:loadedtestcases:tc7 [2012/09/12 15:21] – [Paul Bowyer, Jane Mabey Gilsenan - UoM] andreas
public:loadedtestcases:tc7 [2019/02/12 09:04] (current) – external edit 127.0.0.1
Line 24:
 Setting up a pipeline to process un-annotated genomes is fraught with difficulties. In theory, given a set of related data files, a typical pipeline ties in different programs to detect repeats, to predict genes and to combine the data to provide the best gene model (see flowchart). These predicted genes can then be used to detect homologues in public databases and be embellished further. In practice, however, each file provided for a project can describe data attributed to different assembly components (sometimes with no assembly description). In addition, each piece of software produces its own variation of a standard format that cannot be used by the next program in the chain.
 For example, for one of our test cases (Aspergillus fumigatus CEA10), assembled sequences are provided by chromosome but other data (e.g., RNA-seq, DIP and SNP) are provided by supercontig. There is no assembly file describing the relationships between the components. Initially, this is not a problem, but it does cause difficulties further down the line when trying to improve gene models, to visualise all data within a genome viewer for further analyses and to upload data into a related database such as CADRE.
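To illustrate why the missing assembly file matters: an AGP file records where each supercontig sits on a chromosome, and with it supercontig-based features can be lifted onto chromosome coordinates. The following is a minimal Python sketch under stated assumptions, not the pipeline's actual code. The identifiers `Chr1`, `sc_1` and `sc_2`, the example rows and the functions `parse_agp` and `lift` are all hypothetical; the column layout follows the AGP specification, and only features lying wholly inside one placed component are handled.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Placement(NamedTuple):
    """One non-gap AGP row: where a component sits on its parent object."""
    chrom: str
    chrom_start: int
    chrom_end: int
    comp_id: str
    comp_start: int
    comp_end: int
    strand: str


# Hypothetical example data (tab-separated AGP rows, 1-based inclusive).
AGP_TEXT = (
    "Chr1\t1\t1000\t1\tW\tsc_1\t1\t1000\t+\n"
    "Chr1\t1001\t1100\t2\tN\t100\tscaffold\tyes\tpaired-ends\n"
    "Chr1\t1101\t1600\t3\tW\tsc_2\t1\t500\t-\n"
)


def parse_agp(text: str) -> Dict[str, List[Placement]]:
    """Index non-gap AGP rows by component (supercontig) identifier."""
    placements: Dict[str, List[Placement]] = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        if cols[4] in ("N", "U"):
            continue  # gap rows carry no component coordinates
        row = Placement(cols[0], int(cols[1]), int(cols[2]),
                        cols[5], int(cols[6]), int(cols[7]), cols[8])
        placements.setdefault(row.comp_id, []).append(row)
    return placements


def lift(placements: Dict[str, List[Placement]], comp_id: str,
         start: int, end: int) -> Optional[Tuple[str, int, int]]:
    """Map a supercontig interval to chromosome coordinates.

    Returns None when the supercontig is unplaced or the interval is not
    contained in a single placed component. Only "-" is treated as
    reverse orientation (an assumption of this sketch).
    """
    for p in placements.get(comp_id, []):
        if p.comp_start <= start and end <= p.comp_end:
            if p.strand == "-":
                new_start = p.chrom_start + (p.comp_end - end)
                new_end = p.chrom_start + (p.comp_end - start)
            else:
                new_start = p.chrom_start + (start - p.comp_start)
                new_end = p.chrom_start + (end - p.comp_start)
            return p.chrom, new_start, new_end
    return None
```

With such a table in hand, RNA-seq or SNP features reported on `sc_2` could be placed on `Chr1` before loading them into a genome viewer; without it, the relationship has to be re-derived from a reference genome.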
 We used GeneMark-ES and Augustus for ab initio gene prediction; both required the use of different in-house programs to convert the output before it could be passed on to EVidenceModeler (EVM). The output from EVM again required conversion before being used with BLAST to find the best match within an Aspergillus database. Development of an in-house program was required to merge the resultant data into GFF3 format for the next step in the process.
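The final merge step described above amounts to writing each gene model, together with its best database match, as a nine-column GFF3 line. The sketch below shows one way this could look in Python; it is an illustration only, not the in-house program. The helper names `escape_attr` and `gff3_line`, the record values and the use of a `Note` attribute for the BLAST hit are assumptions; the column order and the percent-escaping of reserved characters follow the GFF3 specification.

```python
from typing import Mapping


def escape_attr(value: str) -> str:
    """Percent-encode characters that are reserved in GFF3 column 9."""
    # "%" is handled first so that later escapes are not encoded twice.
    for ch in "%;=&,\t\n":
        value = value.replace(ch, "%{:02X}".format(ord(ch)))
    return value


def gff3_line(seqid: str, source: str, ftype: str, start: int, end: int,
              strand: str, attrs: Mapping[str, object],
              score: str = ".", phase: str = ".") -> str:
    """Format one feature as a tab-separated GFF3 line."""
    attr_text = ";".join(
        "{}={}".format(key, escape_attr(str(val)))
        for key, val in attrs.items()
    )
    return "\t".join([seqid, source, ftype, str(start), str(end),
                      str(score), strand, str(phase), attr_text])


# Hypothetical merged record: an EVM gene model plus a BLAST annotation.
example = gff3_line(
    "Chr1", "EVM", "gene", 100, 900, "+",
    {"ID": "gene0001", "Note": "best hit; 98% identity"},
)
```

Keeping the escaping in one helper means that free-text annotations from BLAST cannot break the attribute column, which is one of the ways tool-specific GFF3 variants end up unreadable by the next program in the chain.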
-The Ensembl suite uses the GFF3, AGP and FASTA file formats for uploading data into an Ensembl database system such as CADRE. However, as pointed out earlier, we do not always have consistent assembly information, so we may need to re-create such files from a reference genome. A new database is initiated within CADRE with FASTA sequence files and, where appropriate, assembly data (i.e., AGP files). EBI software (GffDoc.pl) can then be used to upload predicted genes from GFF3 files. All in all, this has become a rather bloated and time-consuming process.
+The Ensembl suite uses the GFF3, AGP and FASTA file formats for uploading data into an Ensembl database system such as CADRE. However, as pointed out earlier, we do not always have consistent assembly information, so we may need to re-create such files from a reference genome. A new database is initiated within CADRE with FASTA sequence files and, where appropriate, assembly data (i.e., AGP files). EBI software (GffDoc.pl) can then be used to upload predicted genes from GFF3 files. All in all, this has become a rather bloated and time-consuming process. {{ :public:flow.jpg?300 |}}
  
 ==Desired final state of the Test Case==
Line 34:
 ==Discussion==
 The test case will have an impact on the substantial worldwide fungal genomics community.
+
+LF: a typical pipeline to build (an ideal test case!?), though probably with a risk of failure in the long run.
+Invite people from Taverna or other such tools.
public/loadedtestcases/tc7.1347463316.txt.gz · Last modified: 2019/02/12 09:04 (external edit)