Mapping the functional gene-scape of the oceans under conditions of global change

Context: This proposal builds on our work that identified functional differences between polar and non-polar microbes at the genomic level, by constructing metagenome-assembled genomes (MAGs) from large-scale metagenomic sequencing by the Mock lab (Duncan, 2020). Within these data, we have recently identified components of photosynthetic metabolic pathways whose variation correlates with satellite observations of chlorophyll concentration, suggesting a direct link between gene function and ocean regions.

Objective 1: The student will functionally annotate extensive public omics datasets, including those in (Sunagawa, 2015) and (Zhang, 2020), as well as data from the ongoing polar MOSAiC cruise when these become available.

All data will be annotated in a format comparable to Tara Oceans data using the EBI MGnify pipeline or similar tools. In this way the student will gain experience with bioinformatics tools and with handling large sequencing datasets. The student will then perform a bioinformatics comparison of the gene functions identified in the Arctic and Antarctic communities, and compare these with the gene functions of MAGs already available from the same data.

Objective 2: The student will apply non-negative matrix factorisation (NMF) and related machine learning approaches to the annotated data to identify groups of co-occurring functions that characterise variation between surface ocean regions. NMF has been used to investigate functional patterns in metagenomics data in marine contexts (Jiang, 2012). To investigate the biological and metabolic processes underlying the groupings, the student will apply clustering and visualisation methods, such as network representations, to identify the key functions within each group that characterise its variation. Associations between groups and environmental metadata, including satellite data, will then be explored. This will provide experience in applying unsupervised machine learning approaches to mixed data sets, and in working with experts to draw environmental insights from the results.

Objective 3: Based on the associations identified between the reduced-dimension models and environmental metadata, the student will develop models to predict gene functions from environmental conditions. Models have been used successfully to predict taxonomic structure (Bracher, 2017), as well as community gene function in phytoplankton using taxonomy as an intermediate step (Larsen, 2015). We aim to predict gene function without reference to taxonomy. The student will approach this using Bayesian networks, in particular by learning the network structure from the metagenomic data. This will enable them to infer gene functions from the independent environmental variables.

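As an illustration of the factorisation step described under Objective 2, the following is a minimal sketch only, not the project's pipeline. It assumes a non-negative samples-by-functions abundance table and a single environmental variable; the data are synthetic, and the names `nmf`, `temperature`, `true_W` and `true_H` are hypothetical. The factorisation uses the standard multiplicative update rules for the Frobenius-norm objective, written with NumPy alone.

```python
import numpy as np


def nmf(V, k, n_iter=500, eps=1e-9, seed=0):
    """Factorise a non-negative matrix V (samples x functions) as W @ H.

    Uses multiplicative updates for the Frobenius-norm objective.
    W holds the weight of each functional group in each sample;
    H holds the contribution of each gene function to each group.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Synthetic stand-in for an annotated abundance table (assumption:
# real input would come from the functional annotation in Objective 1).
rng = np.random.default_rng(42)
n_samples, n_functions, k = 30, 12, 2

# Hypothetical environmental variable, e.g. sea surface temperature.
temperature = np.linspace(-1.5, 28.0, n_samples)

# Two hidden groups of co-occurring functions: one more abundant in
# cold samples, the other in warm samples.
true_W = np.column_stack([
    temperature.max() - temperature,
    temperature - temperature.min(),
])
true_H = np.zeros((k, n_functions))
true_H[0, :6] = 1.0
true_H[1, 6:] = 1.0
V = true_W @ true_H + 0.1 * rng.random((n_samples, n_functions))

W, H = nmf(V, k)

# Relative reconstruction error of the rank-k model.
rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)

# Indices of the three highest-weighted functions in each group.
top_functions = np.argsort(H, axis=1)[:, ::-1][:, :3]

# Association between each group's sample weights and the environment.
corr = [float(np.corrcoef(W[:, j], temperature)[0, 1]) for j in range(k)]

print("relative error:", rel_error)
print("top functions per group:", top_functions.tolist())
print("correlation with temperature:", corr)
```

In this sketch the rows of `H` play the role of the groups of co-occurring functions, and the columns of `W` are the per-sample group weights that would be compared with environmental metadata. On real data the number of groups `k` would need to be chosen, for example by comparing reconstruction error across several values.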
Security vs challenge: As this project builds on strong foundations for all three objectives, there is little reason to expect that the work will not succeed. The challenge is to develop the skills in bioinformatics and machine learning needed to identify environmentally meaningful results. With the acquired skill set, there is potential to identify synergies through working in a highly integrative and multidisciplinary environment.

Grant reference
Natural Environment Research Council
Total awarded: £0 GBP
Start date: 30 Sep 2021
End date: 29 Jun 2025
Duration: 3 years 8 months 29 days