Tutorial for SARS-CoV-2 genome data submission to ENA

About this tutorial

The research community has devoted considerable effort to studying the SARS-CoV-2 virus and COVID-19. Fast and open access to different data types (societal, molecular, epidemiological, among others) has been key to the swift development and deployment of, for example, preventative measures, tests, vaccines, and treatments for COVID-19. The pandemic has thus further highlighted how important it is to make data open and FAIR (Findable, Accessible, Interoperable, Reusable) in order to facilitate research.

Thanks to global efforts, many SARS-CoV-2 genome sequences have been made openly available in international databases, such as the Global Initiative on Sharing All Influenza Data (GISAID) and the European Nucleotide Archive (ENA). The ENA is part of the International Nucleotide Sequence Database Collaboration (INSDC), and also indexes data from its INSDC partners, the National Center for Biotechnology Information (NCBI) and the DNA Data Bank of Japan (DDBJ).

Both GISAID and ENA are valuable resources, each with distinct advantages for researchers. For example, as of February 2022, GISAID contained almost 8 million SARS-CoV-2 sequences from around the world, whereas ENA contained around 800,000. The data in GISAID thus give a more reliable picture of the global situation. However, GISAID only accepts the consensus sequences of assembled genomes, whilst ENA accepts both consensus sequences and ‘raw’ sequence data. Further, although the data in GISAID is considered open, access is restricted to individuals with verified accounts, whilst there are no restrictions on who can access the data in ENA. This means that data from ENA is simpler to share (e.g. between members of your group), and access to it is less likely to be disrupted during a project.

The aim of this tutorial is to assist researchers in submitting SARS-CoV-2 sequence data to ENA. This should ultimately increase the availability of open data, including ‘raw’ sequence data, which would not only facilitate greater reproducibility but also provide more opportunities to reuse the data to address new scientific questions.

Learning outcomes

By the end of this tutorial you will:

  • Understand the terminology used by ENA (and other similar databases).

  • Know how to properly describe and format SARS-CoV-2 data for submission to ENA.

  • Know how to complete a submission to ENA.

  • Know where to get help for future submissions (whether for SARS-CoV-2, or something else) to ENA.

Prerequisites

No specific knowledge is needed before starting this tutorial.

Overview

This tutorial is divided into tabs to make it easier to navigate. If you are unfamiliar with ENA, we recommend reading the Terminology and Metadata tab before starting the tutorial.

ENA supports multiple submission routes, and this tutorial describes two of them in full. Some preparatory steps are common to both routes; these are described in the Preparations for Submissions tab. The Select Submission Route tab explains how to determine which route is likely to work best for you. The Submission Route 1 and Submission Route 2 tabs then explain how to complete a submission to ENA using each route.

Information about where to get further guidance is given in the Get Help tab. For answers to frequently asked questions (FAQs) regarding submissions, please see the FAQs tab.

References used for this tutorial

Multiple sources of information were used to build this tutorial. Links to the reference material are listed below: