A census of tandemly repeated polymorphic loci in genomic regions through the comparative integration of human genome assemblies

To facilitate studies associating human diseases with Polymorphic Tandem Repeats (PTR), which include both Variable Number Tandem Repeats (VNTR) and Short Tandem Repeats (STR), we have compiled a catalogue of PTR in genic regions (exons, introns, UTR and adjacent regions) of the human genome (hg38). We apply four different algorithms (TRF, mreps, TandemSwan, TRStalker) to uncover 55,223,485 unique TR, of which 373,173 are determined to be PTR by comparison with five assembled human genomes (at different stages of assembly). While previous catalogues focus mainly on Short TR (STR) of short total length (due to the limitations of the sequencing technology used), we report PTR of any size. The genomes used are:

 

Data

The catalogue consists of a series of tables containing the core information about the census VNTRs as well as supplementary information.
All the tables are tab-delimited files that start with a header line (beginning with #) and do not contain spaces within the fields. Although for convenience the coordinates are often reported as the first column of supplementary tables, the row number should be used as the join key: the n-th row of every table refers to the same TR.
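
As a minimal sketch of how the tables might be loaded and combined, the Python snippet below reads a tab-delimited table whose header line starts with # and merges tables by row position. The column names in the sample data and the helper names read_table and join_by_row are assumptions made for illustration, not names taken from the catalogue files.

```python
import csv
import io


def read_table(handle):
    """Read a tab-delimited table whose first line is a header starting with '#'."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    header[0] = header[0].lstrip("#").strip()  # drop the leading '#'
    return [dict(zip(header, row)) for row in reader if row]


def join_by_row(*tables):
    """Merge tables row by row: the row number acts as the join key."""
    if len({len(table) for table in tables}) > 1:
        raise ValueError("tables must have the same number of rows")
    merged = []
    for rows in zip(*tables):
        record = {}
        for row in rows:
            record.update(row)
        merged.append(record)
    return merged


# Hypothetical two-row tables, held in memory instead of on disk.
algo_text = "#coords\talgo\nchr1:99-110+\tTRF,mreps\nchr2:10-29-\tTRF\n"
flank_text = "#coords\tupstream_align_length\nchr1:99-110+\t250\nchr2:10-29-\t231\n"

algo = read_table(io.StringIO(algo_text))
flank = read_table(io.StringIO(flank_text))
joined = join_by_row(algo, flank)
print(joined[1])
```

Joining by position rather than by the coordinate string avoids any dependence on how the coordinates are formatted, at the cost of requiring that all tables keep the same row order.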

Core - file description

The core information partially follows the BED file format and consists of a tab-delimited file containing the following columns:
  1. Chromosome
  2. 0-based starting coordinate
  3. Ending coordinate (inclusive)
  4. Comma-separated list of UCSC identifiers overlapping the TR
  5. Tandem repeat sequence
  6. Strand
  7. Motif length (as predicted by the algorithm)
  8. Number of copies (as predicted by the algorithm)
  9. Filter score (TH stands for medium-quality TRs while HQ stands for high-quality TRs)
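
Because the end coordinate is inclusive, the core file differs from standard BED, where the end is exclusive. The sketch below parses one core line and shows the conversion. The sample line, including its UCSC identifiers, is invented, and treating the motif length as an integer and the number of copies as a float is an assumption.

```python
from typing import List, NamedTuple


class CoreRecord(NamedTuple):
    chrom: str
    start: int           # 0-based starting coordinate
    end: int             # ending coordinate, inclusive
    ucsc_ids: List[str]
    sequence: str
    strand: str
    motif_length: int    # assumed to be an integer
    copies: float        # assumed to allow fractional copy numbers
    quality: str         # "TH" or "HQ"


def parse_core_line(line):
    """Split one tab-delimited core line into typed fields."""
    f = line.rstrip("\n").split("\t")
    ids = f[3].split(",") if f[3] else []
    return CoreRecord(f[0], int(f[1]), int(f[2]), ids, f[4], f[5],
                      int(f[6]), float(f[7]), f[8])


def tr_length(rec):
    """Length in bp; the +1 accounts for the inclusive end coordinate."""
    return rec.end - rec.start + 1


def to_bed_interval(rec):
    """Convert to a standard BED interval (0-based start, exclusive end)."""
    return rec.chrom, rec.start, rec.end + 1


# Invented example line, not taken from the catalogue.
line = "chr1\t99\t110\tuc_demo1.1,uc_demo2.1\tACGACGACGACG\t+\t3\t4.0\tHQ\n"
rec = parse_core_line(line)
print(tr_length(rec), to_bed_interval(rec))
```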

Annotations - file description

We used the bedtools suite (v2.22.0) for the annotation, setting 1bp as the minimum overlap required to mark a TR as overlapping. The coordinates of the functional annotation have been downloaded from the UCSC website (Primary table: UCSC RefSeq - refGene). We downloaded the coordinates of the SNP, in/del, insertions, deletions and MNP from the dbSNP v. 146 table on the UCSC website. miRNA coordinates have been downloaded from miRBase.
The annotation information consists of a tab-delimited file containing the following columns:
  1. Coordinates in the form chr:init_pos-end_pos[+-]
  2. 5’ upstream: the TR intersects a 1000bp interval before the beginning of a gene [Y/N]
  3. 5’ UTR: the TR intersects a 5’ UTR of a gene [Y/N]
  4. Coding: the TR intersects a coding region [Y/N]
  5. 3’ UTR: the TR intersects a 3’ UTR of a gene [Y/N]
  6. 3’ downstream: the TR intersects a 1000bp interval beyond the end of a gene [Y/N]
  7. Exon: the TR intersects an exon [Y/N]
  8. Intron: the TR intersects an intron [Y/N]
  9. pseudogene: the TR intersects a pseudogene [Y/N]
  10. lncRNA: the TR intersects a long non-coding RNA [Y/N]
  11. long-intergenic-RNA: the TR intersects a long intergenic RNA [Y/N]
  12. miRNA: the TR intersects a miRNA [Y/N]
  13. SNP: the TR intersects a SNP [Y/N]
  14. insertions: the TR intersects an insertion [Y/N]
  15. deletions: the TR intersects a deletion [Y/N]
  16. in/del: the TR intersects an in/del [Y/N]
  17. MNP: the TR intersects an MNP [Y/N]
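
A row of the annotation table can be turned into a coordinate tuple plus a set of boolean flags, as sketched below. The regular expression follows the chr:init_pos-end_pos[+-] form given above. The short flag names in FLAG_NAMES and the sample row are assumptions chosen for the example, not the header names used in the actual file.

```python
import re

COORD_RE = re.compile(
    r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)(?P<strand>[+-])$"
)

# Hypothetical short names for columns 2 to 17, in the order listed above.
FLAG_NAMES = [
    "upstream", "utr5", "coding", "utr3", "downstream", "exon", "intron",
    "pseudogene", "lncrna", "lincrna", "mirna", "snp", "insertion",
    "deletion", "indel", "mnp",
]


def parse_coords(text):
    """Split 'chr:init_pos-end_pos[+-]' into (chrom, start, end, strand)."""
    match = COORD_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognised coordinate string: {text!r}")
    return match["chrom"], int(match["start"]), int(match["end"]), match["strand"]


def parse_flags(fields):
    """Convert a sequence of Y/N fields into booleans."""
    flags = []
    for value in fields:
        if value not in ("Y", "N"):
            raise ValueError(f"expected Y or N, got {value!r}")
        flags.append(value == "Y")
    return flags


# Invented annotation row: coordinates followed by 16 Y/N flags.
row = ["chr1:99-110+"] + list("NNYNNYNNNNNYNNNN")
coords = parse_coords(row[0])
annotation = dict(zip(FLAG_NAMES, parse_flags(row[1:])))
hits = [name for name, hit in annotation.items() if hit]
print(coords, hits)
```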

Algorithm - file description

The algorithm file is a tab-delimited file containing the following two columns:
  1. Coordinates in the form chr:init_pos-end_pos[+-]
  2. Algo: a comma-separated list of the algorithms that identified the VNTR
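
The Algo column can be summarised by splitting each comma-separated list, for example to count how many TRs each algorithm identified. The sketch below assumes the algorithm labels are spelled as in the introduction (TRF, mreps, TandemSwan, TRStalker); the exact labels used in the file may differ, and the sample values are invented.

```python
from collections import Counter


def count_algorithms(algo_fields):
    """Count how many TRs each algorithm identified."""
    counts = Counter()
    for field in algo_fields:
        counts.update(name.strip() for name in field.split(",") if name.strip())
    return counts


# Invented values of the Algo column for three TRs.
algo_column = ["TRF,mreps", "TRF", "TandemSwan,TRF,TRStalker"]
counts = count_algorithms(algo_column)
multi = sum(1 for field in algo_column if "," in field)
print(counts["TRF"], multi)
```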

Flanking - file description

The flanking file reports the sequence and alignment score of the upstream and downstream flanking regions in a tab-delimited format containing the following columns:
  1. Coordinates in the form chr:init_pos-end_pos[+-]
  2. upstream_seq: 250bp long upstream flanking region
  3. upstream_align_score: alignment score of the upstream flanking region expressed as percentage
  4. upstream_align_length: number of characters of the upstream flanking region aligned to the reference hg38
  5. downstream_seq: 250bp long downstream flanking region
  6. downstream_align_score: alignment score of the downstream flanking region expressed as percentage
  7. downstream_align_length: number of characters of the downstream flanking region aligned to the reference hg38
We accept an alignment if the aligned lengths of the 5’ and 3’ flanking regions sum to at least 450. Since each flanking region is 250bp long, each aligned length must therefore be at least 200.
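
One reading of the acceptance rule above, in which the threshold of 450 applies to the sum of the two aligned lengths, can be sketched as follows. The function and constant names are hypothetical, and the snippet is an illustration of the arithmetic rather than the pipeline's own code.

```python
FLANK_LENGTH = 250       # length of each extracted flanking region, in bp
MIN_TOTAL_ALIGNED = 450  # minimum combined aligned length of the two flanks


def flanks_accepted(upstream_len, downstream_len, min_total=MIN_TOTAL_ALIGNED):
    """Return True when the two aligned lengths together reach the threshold."""
    return upstream_len + downstream_len >= min_total


# A flank can align over at most FLANK_LENGTH characters, so in an accepted
# pair the shorter aligned length can never fall below this value.
min_single_flank = MIN_TOTAL_ALIGNED - FLANK_LENGTH

print(min_single_flank)           # 200
print(flanks_accepted(250, 200))  # True
print(flanks_accepted(230, 215))  # False: 445 is below 450
```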

Polymorphism - file description

This table contains all the details of the measurements that led to the assessment of the polymorphism. When the same measure has been repeated for all target genomes (e.g. the alignment scores), only the most accurate is reported.
  1. Coordinates in the form chr:init_pos-end_pos[+-]
  2. Number of copies (as predicted by the algorithm)
  3. Motif length (as predicted by the algorithm)
  4. Polymorphic: number of polymorphic instances of the TR in the target genomes
  5. Non_polymorphic: number of non-polymorphic instances of the TR in the target genomes
  6. Unaligned: number of instances in the target genomes discarded for ambiguous or poor alignment
  7. Not_found: number of instances discarded because flanking regions have not been aligned.
  8. Variability: number of variants found
  9. first_align_score: score of the first alignment expressed as percentage (minimum threshold 90%)
  10. first_align_length: ratio between the number of bases aligned and the length of the shortest sequence. A value of 1 means that the entire sequence has been aligned. Intuitively, the higher this value, the more accurate the alignment
  11. second_align_score: score of the second alignment expressed as percentage (minimum threshold 90%)
  12. second_align_length: ratio between the number of bases aligned and the length of the residual part of the longest sequence. A value of 1 means that the entire sequence has been aligned. Intuitively, the higher this value, the more accurate the alignment
  13. hg38_length: TR length on hg38 in bp
  14. max_TR_length: maximum TR length on a target genome (either huref, YH2_0, CHM1.1, BGIAF or ASM77258v3)
  15. longer: holds + (respectively -) when the TR is expanded (respectively contracted) on the target genome
  16. CN_by_length: CN estimate by sequence length obtained as the number of bp of the longest sequence used in the second alignment divided by the motif length
  17. CN_by_align: CN estimate by alignment, measured as the overlap between the first and the second alignment of the TR sequence. The measure is taken on both the reference and the target genome and the maximum is returned. This measure is expected to be more accurate than the estimate based on the sequence length.
We evaluate the polymorphism when both the first and the second alignment have a score greater than or equal to 90%. Otherwise we declare the TR unaligned.
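
The 90% rule and the CN_by_length estimate described above can be written as a short sketch. The function names and the status labels "evaluated" and "unaligned" are hypothetical, and scores are assumed to be given as percentages between 0 and 100.

```python
MIN_ALIGN_SCORE = 90.0  # minimum alignment score, as a percentage


def alignment_status(first_score, second_score, threshold=MIN_ALIGN_SCORE):
    """Polymorphism is evaluated only if both alignments reach the threshold."""
    if first_score >= threshold and second_score >= threshold:
        return "evaluated"
    return "unaligned"


def cn_by_length(longest_length_bp, motif_length):
    """Copy number estimated as sequence length divided by motif length."""
    if motif_length <= 0:
        raise ValueError("motif length must be positive")
    return longest_length_bp / motif_length


print(alignment_status(95.0, 90.0))  # evaluated
print(alignment_status(95.0, 89.9))  # unaligned
print(cn_by_length(36, 3))           # 12.0
```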

Index - file description

This file contains a subset of the UCSC track kgXref that allows converting UCSC gene names to other common formats.
  1. Known Gene ID (UCSC)
  2. mRNA ID
  3. Gene Symbol
  4. RefSeq ID

 

Data visualization

A convenient visualization of the census data through the UCSC genome browser is available here (NOTICE: the link can take up to one minute to display data)

 

Source code

The software pipeline consists of three steps: alignment, polymorphism detection and finishing. Some preliminary steps must be accomplished before running it. In particular, the pipeline does not provide the algorithms for finding tandem repeats along the reference genome. Neither the genomic sequences nor mappings are provided. The software is provided "as is" and without any warranties. Please refer to the README file before running the pipeline. A copy of the pipeline can be downloaded from here.

 

Consortium

Joint work of IIT-CNR, ITB-CNR and Università del Piemonte Orientale within project RepeataASL

 

Citation

Please consider citing the following paper if you found this resource useful:
Styled by: Geraci Filippo, Maintained by: Loredana M. Genovese