
Introduction to Genome Browsers

Anticipated workshop duration when delivered to a group of participants is 4 hours.
Note that not all the exercises are expected to be completed during the workshop.

For queries relating to this workshop, contact Melbourne Bioinformatics at:
bioinformatics-training@unimelb.edu.au.

Overview

This tutorial will introduce you to the genome browser format and illustrate how some freely available genome browsers can be used to interrogate a variety of data types, such as gene expression, genomic variation, methylation and many more.

Topic

Skill level

This workshop is designed for participants with no previous experience of using Genome Browsers and no programming experience.

Description

Learn how to make the most of Genome Browsers!

By focusing on gene expression, this hands-on tutorial will provide beginners with an introduction to both the UCSC Genome Browser and IGV (Integrative Genomics Viewer). Tools and public datasets will be used to illustrate how the expression of transcript variants can be investigated in different tissues and cell types, including human RNAseq data from GTEx and mouse cell type RNAseq data from Tabula Muris, as viewed within the UCSC Genome Browser. A subset of single cell RNAseq data from the Allen Brain Atlas Celltax study will also be downloaded from SRA and visualised in IGV. The data and genes used in this workshop are taken from the neuroscience field, but the analysis approaches and tools illustrated can be applied to many research areas.


This tutorial is in three parts:

Section 1: Introduction to the general features of genome browsers.
Section 2: Hands on tutorial of the UCSC Genome Browser.
Section 3: Hands on tutorials of the Integrative Genomics Viewer.

This tutorial was developed for use as part of a series of workshops for neuroscience researchers, hence the data and example genes are drawn from the neuroscience field and the focus is on analysis and visualisation of expression data. However, the skills taught in this tutorial are applicable to all areas of research.

Data: GTEx and Tabula Muris data as represented in the UCSC Genome Browser, and Celltax single cell expression atlas data downloaded from SRA.

Tools: UCSC Genome Browser, Integrative Genomics Viewer (IGV).

Workshop instructions
Click here for a printer friendly PDF version of this workshop.


Learning Objectives

At the end of this introductory workshop, you will:


Requirements and preparation

Attendees are required to provide their own laptop computers.

If delivered as a workshop, participants should install the software and data files below prior to the workshop. Allow sufficient time to liaise with your own IT support should you encounter any problems installing software. Unless stated otherwise, recommended browsers are Firefox or Chrome.

Preparing you and your laptop prior to starting this workshop

Required software:

  1. Download and install IGV (Free).
  2. Ensure that Chrome or Firefox is installed and up to date.
  3. Create a user account in the UCSC genome browser.

Required Data

No additional data needs to be downloaded prior to this workshop. Required data will be downloaded as part of the tutorial exercises.


Mode of Delivery

This workshop will be run using freely available Web interfaces and free software using graphical user interfaces. See above.


Author(s) and review date

Written by: Victoria Perreau | Melbourne Bioinformatics, University of Melbourne.

Created: October 2020
Reviewed and revised: October 2021


Background

Genome Browser background

Genome browsers are invaluable for viewing and interpreting the many different types of data that can be anchored to genomic positions. These include variation, transcription, the many types of regulatory data such as methylation and transcription factor binding, and disease associations. The larger genome browsers serve as data archives for valuable public datasets, facilitating visualisation and analysis of different data types. It is also possible to load your own data into some of the public genome browsers.

By enabling viewing of one type of data in the context of another, genome browsers can reveal important information about gene regulation in both normal development and disease, and assist hypothesis development relating to genotype-phenotype relationships.

All researchers are therefore encouraged to become familiar with the use of some of the main browsers such as:

They are designed for use by researchers without programming experience, and the developers often provide extensive tutorials and case studies demonstrating the many ways in which data can be loaded and interpreted to assist in developing and supporting your research hypothesis.

Many large genomic projects also incorporate genome browsers into their web portals to enable users to easily search and view the data. These include:

BDNF and TrkB signalling

This tutorial uses a well known and important signalling pathway in the central nervous system (CNS) to illustrate some of the genome browser tools and their utility.

TrkB-schema-eng{: align=left }

Brain Derived Neurotrophic factor (BDNF) protein is an important neurotrophin responsible for regulating many aspects of growth and development in different cells within the CNS. TrkB is an important receptor that binds extracellular BDNF and propagates the intracellular signalling response via a tyrosine kinase. This TrkB receptor protein is encoded by the NTRK2 gene.

The NTRK2 gene expresses a number of different transcript variants in different cell types. The most well studied of these is the full length TrkB receptor, referred to as TrkB, which is mainly expressed in neuronal cell types. The other transcript variants all express the same exons encoding the extracellular domain of the receptor (shown in green in the figure here) but have truncated intracellular domains, which do not include the tyrosine kinase domain and thus activate different signalling pathways upon binding to BDNF. None of these truncated protein products have been well studied, but the most highly expressed receptor variant is known as TrkB-T1, and is known to be highly expressed in astrocytes.

Since the transcript variants are differently expressed in different cell types within the CNS, the NTRK2 gene is a very useful example for exploring cell type specific transcript expression in available public data.


**Major CNS cell types:**

1209 Glial Cells of the CNS-02{: align=left }









Section 1: Introduction to Genome Browsers

Genome browsers rely on a common reference genome for each species in order to map data from different sources to the correct location. A consortium has agreed on a common numbering for each position on the genome for each species. However, this position will vary based on the version of the genome, as error correction and updates can change the numbering. Therefore it is very important to know which version of the genome your data of interest is aligned to.
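To make the point about genome versions concrete, here is a minimal Python sketch. It is not part of the workshop, which requires no programming. It assumes a hypothetical `GenomicPosition` record that carries its assembly name alongside the coordinate, so that positions from different builds are never compared by accident. The class, the function, and the coordinate values are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomicPosition:
    """A position that always carries the genome build it refers to."""
    assembly: str    # e.g. "hg38" or "hg19"
    chromosome: str  # e.g. "chr9"
    position: int


def distance(a: GenomicPosition, b: GenomicPosition) -> int:
    """Distance in bases between two positions on the same build and chromosome."""
    if a.assembly != b.assembly:
        raise ValueError(
            f"Cannot compare {a.assembly} with {b.assembly}: "
            "convert one position to the other build first."
        )
    if a.chromosome != b.chromosome:
        raise ValueError("Positions are on different chromosomes.")
    return abs(a.position - b.position)


# Placeholder coordinates, not real annotations.
variant = GenomicPosition("hg38", "chr9", 1_000_500)
probe = GenomicPosition("hg38", "chr9", 1_000_000)
print(distance(variant, probe))
```

Passing one `hg38` position and one `hg19` position to `distance` raises an error instead of silently returning a meaningless number, which is the habit worth adopting whenever coordinates are recorded.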

The sequence of the human reference genome was accumulated over many years from sequence data from many different sources and does not represent the sequence of one single person. Instead it is a composite of fragments of the genome from many different people. Also, unlike the genome of an individual person, which is diploid, the reference genome is haploid; that is, there is only one copy of each chromosome. It therefore does not reflect the variation in the population, or even the most common variants in the human genome. Exploring variation within the human genome is very important and is facilitated by genome browsers, but is not covered in this workshop.

!!! info "Genome Build version number - further reading"
    * The Genome reference consortium
    * What does the nomenclature mean?

For further info on Human Genome version updates I recommend you look at the updates and [blog pages on the UCSC genome browser](https://genome.ucsc.edu/goldenPath/newsarch.html#2019).

Section 2: The UCSC Genome Browser Interface

In this section we will become familiar with the web interface of the UCSC genome browser and explore some of the tools and public datasets available.

Weekly maintenance of the browser is at 5-6 pm on Thursdays, Pacific time, which is equivalent to about 11 am-12 pm on Fridays AEST (the exact hour shifts with daylight saving). During this time the browser may be down for a few minutes. To ensure uninterrupted browser services for your research during UCSC server maintenance and power outages, bookmark one of the mirror sites that replicates the UCSC genome browser.
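If you want to check the maintenance window for your own location, the conversion can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: it uses fixed UTC offsets for PST (UTC-8) and AEST (UTC+10) and therefore deliberately ignores daylight saving, and the date is just an example Thursday.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets: these ignore daylight saving by design.
PST = timezone(timedelta(hours=-8), "PST")
AEST = timezone(timedelta(hours=10), "AEST")


def to_aest(pacific_time: datetime) -> datetime:
    """Convert a timezone-aware Pacific time to AEST."""
    return pacific_time.astimezone(AEST)


# Thursday 14 January 2021 is used purely as an example date.
start = datetime(2021, 1, 14, 17, 0, tzinfo=PST)
end = start + timedelta(hours=1)
for label, moment in (("starts", start), ("ends", end)):
    print(label, to_aest(moment).strftime("%A %H:%M %Z"))
```

For a daylight-saving-aware answer, the standard library `zoneinfo` module with named zones such as `America/Los_Angeles` and `Australia/Melbourne` is the usual choice, provided a time zone database is installed on your computer.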

Accessing the tools: Many of the tools that we will explore can be reached via several different routes within the browser interface. Many tools can be accessed from the pull-down lists on the top toolbar; others can be accessed from within the browser window. In the following instructions a series of blue boxes is used to indicate successive levels of the pull-down menu, starting from the top toolbar. For example, the notation below indicates that you should select 'Genome Browser' from the top toolbar and then click on 'Reset all user settings'.

Toolbar Genome Browser Reset all user settings

Accessing help and training: The UCSC genome browser is supported by a rich training resource which has new material added regularly to the YouTube channel. To access training and develop your skills further go to: Toolbar Help Training

Getting started

  1. Open the Browser interface:

  2. First reset the browser, so that we all see the same screen:
    Toolbar Genome Browser Reset all user settings

  3. Select and open the human genome hg38 at the default position. There are a few different ways to do this.

    You should see a view of the browser similar to the image below, opening at a position on the X chromosome of Human genome version GRCh38 showing the gene model for the ACE2 gene. Some of the default tracks may have been updated since this screen shot was made. ace2_ucsc

  4. Familiarise yourself with the main areas of the interface and locate:

  5. Customise your view by using the 'Configure' tool to change the font size to 12. Use either method below to open the Configure tool.

  6. Practice navigating around the genome view.

### Understanding the gene models

genemodel

NTRK2

First we are going to familiarise ourselves with the gene model representation of the different transcripts of NTRK2.

  1. Navigate to the NTRK2 gene position in GRCh38 and view the gene models

    ntrk_transcripts

    !!! question "Which strand is the gene encoded on / transcribed from? (+ or - strand)"

BDNF

Now we look at the gene model for BDNF in the same genome. There are some differences that enable us to demonstrate some more tools.

  1. Navigate to the BDNF gene position in GRCh38 and view the gene models

bdnf_ucsc

bdnf_multiregionview

It is now a lot easier to view a number of interesting features in the BDNF transcript models:

You may find that using the multi-region tool facilitates visualisation and interpretation of gene expression data later in the workshop.

### Blat tool exercise

The Blat tool is a sequence similarity tool similar to BLAST. It can quickly identify region(s) of homology between a genome and a sequence of interest. Due to the presence of orthologs and paralogs, a target sequence may have similarity to more than one region in the genome. In this exercise you will use Blat to map the sequences of two different expression probes to their target regions and determine which gene transcripts the probes are likely to detect in an expression study.

Microarray expression data is not commonly used now, but some of the data generated from large, well orchestrated studies still provide valuable information to researchers. Microarray probes, like in situ hybridisation probes, target a small region of the RNA and do not measure the whole RNA transcript. If you are measuring gene expression it is important to know exactly which region of the gene you are detecting. In this exercise we will employ the Blat tool to determine which region of the NTRK2 gene the microarray probes in the following study are detecting.

The study was the Human Brain gene expression atlas generated by the Allen Institute. Below are the sequences of two hybridisation probes that were used in a microarray to detect expression of the gene NTRK2. These two probes result in very different hybridisation and expression patterns across different regions of the brain. As we observed in the exercise above, NTRK2 has a number of different transcript variants. The question we have is whether these probes are detecting different or multiple transcripts of NTRK2, and if so, which ones?

NTRK2 Probe A_23_P216779 sequence:
TTCTATACTCTAATCAGCACTGAATTCAGAGGGTTTGACTTTTTCATCTATAACACAGTG

Z score of expression level in Human brain (blue = low expression, red = high expression) hbrain_A_23_P216779

NTRK2 Probe A_24_P343559 sequence:
AAGCTGCTCTCCTTCACTCTGACAGTATTAACATCAAAGACTCCGAGAAGCTCTCGAGGG

Z score of expression level in Human brain (blue = low expression, red = high expression) hbrain_A_23_P343559

The images above are of one of the six donors included in the atlas, and typical of the expression pattern for NTRK2. These images are taken from the NTRK2 gene page of Human Brain Atlas.
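The UCSC Blat input box accepts several sequences at once when each one is preceded by a `>` header line, so both probes can be submitted together. The sketch below is optional and not required for the exercise. It formats the two probe sequences given above as FASTA text for pasting into Blat, and also prints each probe's length and reverse complement, which can help when thinking about which strand a hit lies on. The function names are illustrative, not part of any UCSC tool.

```python
# Probe sequences copied from the text above.
PROBES = {
    "A_23_P216779": "TTCTATACTCTAATCAGCACTGAATTCAGAGGGTTTGACTTTTTCATCTATAACACAGTG",
    "A_24_P343559": "AAGCTGCTCTCCTTCACTCTGACAGTATTAACATCAAAGACTCCGAGAAGCTCTCGAGGG",
}

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def to_fasta(records: dict[str, str]) -> str:
    """Format named sequences as FASTA text (one header line per sequence)."""
    return "\n".join(f">{name}\n{seq}" for name, seq in records.items())


print(to_fasta(PROBES))
for name, seq in PROBES.items():
    print(name, len(seq), "bases; reverse complement:", reverse_complement(seq))
```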

Most obvious in the images above is the high level of expression signal using Probe A_23_P216779 and the low level for A_24_P343559 in the corpus callosum (CC), which is a region of white matter in the brain with relatively few neurons and a relatively high proportion of myelinating oligodendrocytes. This expression profile is reversed in the cortical regions, e.g. frontal lobe (FL) and parietal lobe (PL), which have a relatively high density of neuronal cells.

  1. Use the Blat tool to find regions of homology

  2. Use the 'highlight' tool to keep track of a region of interest in the Genome view. It is easy to lose track of a region you are investigating when navigating around the genome in a browser, so we are going to highlight each region of probe homology within the NTRK2 gene, using a different colour for each probe. Highlight is also useful if you have lots of different tracks loaded and you want to check that a feature on one track lines up with another.

    !!! question "Do the probes detect coding regions of the NTRK2 gene?"

    !!! question "Are the probes likely to detect different transcripts?"

  3. Use 'Multiregion view' to make it easier to compare coding regions of different transcripts

I have created a 'public session' of the Blat NTRK2 exercise. You can view it from the link in the sessions:
Toolbar My data Public session, then search for "hg38_NTRK2_blat_probes".

### Gene expression data

coverage_plot

  1. Human tissue specific expression data from the GTEx project is available in the UCSC Genome Browser

    !!! question "Can you locate an exon in the MYRF gene that is present in transcripts expressed in the brain but not in the pancreas?"

    !!! question "Does this alternative splicing event result in a frame shift of the coding sequence?"

    !!! question "How many amino acids are there in the protein products for each MYRF transcript?"

  2. The FACS derived data from the Tabula Muris cell type data can be visualised as a coverage plot

    !!! question "Which cell type has the highest level of expression of Ntrk2 in this dataset?"

    !!! question "Which cell type(s) express the long and short transcripts of Ntrk2?"

  3. Mouse CNS cell type expression data can also be validated using an independent single cell dataset of mouse cortex from the Linnarsson lab.
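For the frame shift question in step 1 above, the underlying arithmetic is simple: codons are three bases long, so including or skipping an exon that lies entirely within the coding sequence preserves the downstream reading frame only if the exon's length is a multiple of three. The sketch below illustrates that check. It uses made-up exon coordinates rather than the real MYRF annotation, and assumes start and end are both included in the exon.

```python
def exon_length(start: int, end: int) -> int:
    """Length of an exon given inclusive start and end coordinates."""
    return end - start + 1


def skipping_shifts_frame(start: int, end: int) -> bool:
    """True if skipping this coding exon changes the downstream reading frame."""
    return exon_length(start, end) % 3 != 0


# Hypothetical exons, NOT taken from MYRF.
examples = {"exon_a": (1001, 1081), "exon_b": (2001, 2100)}
for name, (start, end) in examples.items():
    length = exon_length(start, end)
    verdict = "shifts the frame" if skipping_shifts_frame(start, end) else "keeps the frame"
    print(f"{name}: {length} bases, skipping it {verdict}")
```

In the browser you can apply the same reasoning by reading the exon's start and end positions from the gene model and checking whether its length divides evenly by three.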

Section 3: IGV

In this section we will download a BAM file of gene expression data from SRA, and view it in the Integrative Genomics Viewer (IGV). BAM files must first be sorted and indexed before they can be loaded into genome viewers, and IGV has tools to do this without having to use the command line.
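This workshop uses IGV's built-in tools for sorting and indexing. For readers who later want to script the same two steps, the sketch below shows one possible approach, assuming the separate `samtools` program is installed (it is not required for this workshop). The file name `example.bam` is a placeholder, and the helper function names are illustrative.

```python
import shutil
import subprocess
from pathlib import Path


def sort_and_index_commands(bam: str) -> list[list[str]]:
    """Build the samtools commands that sort a BAM file and then index it."""
    source = Path(bam)
    sorted_bam = source.with_name(source.stem + ".sorted.bam")
    return [
        ["samtools", "sort", "-o", str(sorted_bam), str(source)],
        ["samtools", "index", str(sorted_bam)],
    ]


def run(bam: str) -> None:
    """Run the commands if samtools is available, otherwise explain why not."""
    if shutil.which("samtools") is None:
        print("samtools not found; use IGV's built-in tools instead.")
        return
    for command in sort_and_index_commands(bam):
        subprocess.run(command, check=True)


# "example.bam" is a placeholder file name; here we only print the commands.
for command in sort_and_index_commands("example.bam"):
    print(" ".join(command))
```

Either route produces a sorted BAM file together with its index file, and the two should be kept in the same folder so that the viewer can find the index.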

The expression data we are using for this exercise is from the mouse Celltax single cell expression atlas published by the Allen Institute for Brain Science. The Celltax vignette has an expression browser that displays gene level expression as a heat map for any gene of interest. The readsets (fastq files) and aligned data (BAM files) for 1809 runs on single cells are also available for download from SRA.

The SRA study ID for this study is SRP061902, and if you wish to identify particular cell types of interest, individual runs from this study are easily selected by viewing the samples in the 'RunSelector'. For this exercise I have already identified a few samples that we will download in order to illustrate navigating in IGV by looking at the expression of NTRK2 in the same cell types we have discussed in earlier exercises.

For each cell type we will download a BAM file containing only the reads from the chromosome of interest.

For each SRA run in the table below, open the link to the run to download the data. Not many raw data sets in SRA have aligned data available for download, but this data set does.

| Cell type | SRA run | Vignette Cell ID |
|-----------|------------|----------|
| astrocyte | SRR2138962 | D1319_V |
| astrocyte | SRR2139935 | A1643_VL |
| neuron | SRR2139989 | S467_V4 |
| neuron | SRR2140047 | S1282_V |

  1. Download BAM files from SRA

  2. Use IGV tools to SORT and INDEX the BAM files. Store the sorted BAM files and index files in the same folder.

  3. View the BAM files in IGV

igv_settings

You may need to change this back to a smaller range in the future if you are working with large datasets and/or small amounts of memory on your computer.

junction_plot

  4. Export images

sashimi_plot

  5. Download and install the Gencode gene model annotation track

Additional reading

IGV
https://rockefelleruniversity.github.io/IGV_course/presentations/singlepage/IGV.html