All data will be annotated in a format comparable to Tara Oceans data using the EBI MGnify pipeline or similar tools. In this way, the student will gain experience with bioinformatics tools and with handling large sequencing datasets. The student will then compare the gene functions identified in the Arctic and Antarctic communities with each other and with the gene functions of MAGs already available from the same data.
Objective 2: The student will apply non-negative matrix factorisation (NMF) and related machine learning approaches to the annotated data to identify groups of co-occurring functions that characterise variation between surface ocean regions. NMF has been used to investigate functional patterns in metagenomic data in marine contexts (Jiang, 2012). To investigate the biological and metabolic processes underlying the groupings, the student will apply clustering and visualisation methods, such as network representations, to identify the key functions within each group that characterise its variation. Associations between groups and environmental metadata, including satellite data, will then be explored. This will provide experience in applying unsupervised machine learning approaches to mixed datasets and in working with experts to draw environmental insights from the results.
Objective 3: Based on the associations identified between the reduced-dimension models and environmental metadata, the student will develop models to predict gene functions from environmental conditions. Models have been successfully used to predict taxonomic structure (Bracher, 2017) as well as community gene function in phytoplankton using taxonomy as an intermediate step (Larsen, 2015). We aim to predict gene function without reference to taxonomy. The student will approach this using Bayesian networks, in particular by learning the network structure from the metagenomic data. This will enable the student to infer gene functions from the independent environmental variables.
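As an illustration of the factorisation step in Objective 2, the following is a minimal sketch of NMF using plain multiplicative updates on a small, made-up abundance table. The table `V`, the choice of two components, and the helper names (`nmf`, `matmul`, `frobenius_error`) are assumptions made for illustration only and are not the project's actual data or pipeline. In practice an established implementation, such as the `NMF` class in scikit-learn, would more likely be used.

```python
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols_B = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_B] for row in A]


def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def nmf(V, k, n_iter=500, seed=0, eps=1e-9):
    """Approximate non-negative V (samples x functions) as W @ H.

    Uses multiplicative updates for the squared Frobenius loss.
    W (samples x k) holds the weight of each group in each sample;
    H (k x functions) holds the contribution of each function to each group.
    """
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num = matmul(Wt, V)
        den = matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num = matmul(V, Ht)
        den = matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return W, H


def frobenius_error(V, W, H):
    """Frobenius norm of the reconstruction residual V - W @ H."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2
               for row_v, row_wh in zip(V, WH)
               for v, wh in zip(row_v, row_wh)) ** 0.5


# Hypothetical toy table: 4 samples (rows) x 6 gene functions (columns).
V = [
    [5.0, 4.0, 6.0, 0.0, 0.0, 0.0],
    [4.0, 5.0, 5.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 3.0, 4.0, 5.0],
    [0.0, 0.0, 0.0, 4.0, 3.0, 6.0],
]
W, H = nmf(V, k=2)
error = frobenius_error(V, W, H)
```

Each row of `H` can be read as one candidate group of co-occurring functions, and each row of `W` as the weight of those groups in a sample. The sample weights are the quantities that would then be compared with environmental metadata.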
Security vs challenge: As this project builds on strong foundations for all three objectives, there is little reason to expect that the work will not be successful. The challenge lies in developing the skills in bioinformatics and machine learning needed to identify environmentally meaningful results. With the acquired skill set, there is potential to identify synergies through working in a highly integrative and multidisciplinary environment.