Gennotate: a platform for sharing data representations, predictors, and machine learning algorithms for a broad range of gene structure prediction tasks

Over the past two decades, a variety of knowledge-based, statistical, and machine learning methods have been developed for many genome annotation tasks. These methods differ in the data sets used to train the predictive models, the data representations (e.g., sequence features) used to encode the inputs and outputs (class labels) of the models, the algorithms used to build the predictors, and the validation data sets and performance metrics used to assess the predictors' effectiveness.

The development of improved prediction methods relies on understanding the strengths and weaknesses of existing methods. Unfortunately, direct comparisons of existing methods are not straightforward without access to implementations of the algorithms and to the precise data sets and data representations used. The comparison is further complicated because some prediction servers periodically update their predictors using newly available data, newer computational methods, or new data representations. This makes it difficult to determine whether reported or measured changes in predictive accuracy stem from improved methods, improved data representations, or better data sets.

To address these challenges, we built an open source software suite, Gennotate, as a platform for sharing data representations, predictors, and machine learning algorithms for a broad range of gene structure prediction tasks. Gennotate has two main components: 1) the model builder, an application for building and evaluating predictors and serializing the resulting models in a binary format (model files); and 2) the predictor, an application for applying a model to test data (e.g., sequences to be annotated). The model builder is an extension of Weka, a widely used machine learning workbench that supports many standard machine learning algorithms and provides tools for data pre-processing, classification, regression, clustering, validation, and visualization. We believe Gennotate will have a significant impact on the development of machine-learning-based gene annotation tools, since it attempts to standardize not only the input/output format of the predictors but also the predictors themselves, as serialized Java objects (model files) that can be executed, updated, and combined using the Gennotate platform.
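The build-then-predict workflow described above can be illustrated with a minimal sketch. It is not Gennotate's or Weka's actual code: the class ModelFileSketch, the toy GcThresholdModel, and the methods train, save, and load are hypothetical names introduced here for illustration, and the GC-content threshold stands in for a real classifier. The sketch uses only standard Java serialization to show how a "model builder" step could write a trained model to a binary model file and how a separate "predictor" step could reload it and annotate new sequences.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration only; not part of Gennotate or Weka.
public class ModelFileSketch {

    // Toy stand-in for a trained predictor: labels a sequence as positive
    // when its GC content reaches a learned threshold.
    public static class GcThresholdModel implements Serializable {
        private static final long serialVersionUID = 1L;
        public final double threshold;

        public GcThresholdModel(double threshold) {
            this.threshold = threshold;
        }

        public boolean predict(String sequence) {
            return gcContent(sequence) >= threshold;
        }
    }

    // Data representation: fraction of G and C characters in the sequence.
    public static double gcContent(String sequence) {
        if (sequence.isEmpty()) {
            return 0.0;
        }
        int gc = 0;
        for (char c : sequence.toUpperCase().toCharArray()) {
            if (c == 'G' || c == 'C') {
                gc++;
            }
        }
        return (double) gc / sequence.length();
    }

    // "Model builder" step: the threshold is the midpoint between the mean
    // GC content of the positive and the negative training sequences.
    public static GcThresholdModel train(String[] sequences, boolean[] labels) {
        double posSum = 0.0, negSum = 0.0;
        int posCount = 0, negCount = 0;
        for (int i = 0; i < sequences.length; i++) {
            if (labels[i]) {
                posSum += gcContent(sequences[i]);
                posCount++;
            } else {
                negSum += gcContent(sequences[i]);
                negCount++;
            }
        }
        if (posCount == 0 || negCount == 0) {
            throw new IllegalArgumentException("need both positive and negative examples");
        }
        return new GcThresholdModel((posSum / posCount + negSum / negCount) / 2.0);
    }

    // Serialize the trained model to a binary model file.
    public static void save(GcThresholdModel model, Path file) {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(model);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // "Predictor" step: reload a model file so it can be applied to new data.
    public static GcThresholdModel load(Path file) {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (GcThresholdModel) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Path modelFile = Paths.get(System.getProperty("java.io.tmpdir"), "sketch.model");
        GcThresholdModel model = train(
                new String[] {"GGCC", "AATT"}, new boolean[] {true, false});
        save(model, modelFile);
        GcThresholdModel reloaded = load(modelFile);
        System.out.println("GCGC predicted positive: " + reloaded.predict("GCGC"));
    }
}
```

Because the model file carries the fitted predictor itself rather than only its outputs, two groups working from the same file apply exactly the same predictor, which is the kind of reproducible comparison the text argues for.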

Tools and software:
The Gennotate development kit is freely available at:

Abbas, M., Mohie-Eldin, M., & EL-Manzalawy, Y. (2015). Assessing the effects of data selection and representation on the development of reliable E. coli sigma 70 promoter region predictors. PLoS One, 10(3):e0119721. doi: 10.1371/journal.pone.0119721