Retrocopies were identified with the LAST program (Kiełbasa et al. 2011) by translated alignment of protein sequences to the hard-masked reference genome sequence. All sequences, for all species, were downloaded from Ensembl 73 (Flicek et al. 2013) and Ensembl Plants 30[[REF for correct ensemble version]]. Species names and genome assembly versions are listed in supplementary table S1. Genes containing a reverse transcriptase domain were excluded from the set. We used the following LAST parameters:
Multiple alignment hits to the same genomic locus were clustered using BedTools (Quinlan and Hall 2010). Alignments on the same strand were joined into a cluster if they overlapped by at least 150 bp. Clusters that overlapped a known protein-coding gene were excluded from further analysis. All remaining clusters were considered potential retrocopies. For each of these clusters we selected the optimal alignment, i.e. the one with the highest score, together with any suboptimal alignments whose score was at least 98% of the optimal score. We considered a cluster a retrocopy if the following criteria were fulfilled:
The final step was the annotation of the retrocopy genomic coordinates and of the parental gene identity and coverage, based on the best alignment in the cluster that showed signs of retroposition. When more than one alignment was equally good, the final alignment was chosen at random. If a newly annotated retrocopy overlapped known pseudogenes from the Ensembl annotations by at least 50%, its status was set to “KNOWN_PSEUDOGENE”; otherwise, it was set to “NOVEL”.
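The clustering, alignment selection and status assignment described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this work, which relied on BedTools, and it does not cover the retroposition criteria. The `Alignment` record, the function names, the single-linkage clustering against the running extent of a cluster, and the reading of "50% overlap" as a fraction of the retrocopy length are all assumptions made for illustration; only the 150 bp, 98% and 50% thresholds come from the text.

```python
from dataclasses import dataclass
from itertools import groupby

MIN_OVERLAP_BP = 150           # minimum overlap to join alignments (from the text)
SUBOPTIMAL_FRACTION = 0.98     # suboptimal score threshold (from the text)
MIN_PSEUDOGENE_OVERLAP = 0.5   # overlap fraction for KNOWN_PSEUDOGENE (from the text)


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record for one protein-to-genome alignment."""
    chrom: str
    start: int    # 0-based, half-open coordinates (assumption)
    end: int
    strand: str
    score: float
    parent: str   # identifier of the aligned parental protein


def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def cluster_alignments(alignments):
    """Group same-strand alignments that overlap the current cluster by >= 150 bp."""
    ordered = sorted(alignments, key=lambda a: (a.chrom, a.strand, a.start, a.end))
    clusters = []
    for _, group in groupby(ordered, key=lambda a: (a.chrom, a.strand)):
        current, cur_start, cur_end = [], 0, 0
        for aln in group:
            if current and overlap_bp(cur_start, cur_end, aln.start, aln.end) >= MIN_OVERLAP_BP:
                current.append(aln)
                cur_end = max(cur_end, aln.end)
            else:
                if current:
                    clusters.append(current)
                current, cur_start, cur_end = [aln], aln.start, aln.end
        if current:
            clusters.append(current)
    return clusters


def select_alignments(cluster):
    """Return the optimal alignment first, then suboptimal ones within 98% of its score."""
    best = max(a.score for a in cluster)
    kept = [a for a in cluster if a.score >= SUBOPTIMAL_FRACTION * best]
    return sorted(kept, key=lambda a: -a.score)


def retrocopy_status(retro_start, retro_end, pseudogenes):
    """Classify a retrocopy given pseudogene intervals on the same chromosome."""
    length = retro_end - retro_start
    for ps_start, ps_end in pseudogenes:
        shared = overlap_bp(retro_start, retro_end, ps_start, ps_end)
        if shared >= MIN_PSEUDOGENE_OVERLAP * length:
            return "KNOWN_PSEUDOGENE"
    return "NOVEL"
```

In this sketch, alignments on opposite strands never share a cluster, and a retrocopy is compared against each pseudogene interval separately rather than against their union; either choice could reasonably be made differently.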
To identify retrocopies that are annotated as known protein-coding genes, a modified approach was applied. For a given species, the collection of all products of “protein-coding” genes, as defined by the Ensembl annotations, was self-aligned using LAST. Alignments between alternative products of the same gene were removed. We then considered as potential retrocopies all genes whose entire coding sequence is contained within a single exon. If a gene encoded more than one protein, the longest transcript was considered. For each potential retrocopy, we searched for alignments to protein sequences of genes that show no reverse transcriptase activity (based on Ensembl protein descriptions) and whose coding sequence is at least 150 bp long, requiring alignment coverage and identity of at least 50%. We additionally required the parental gene whose protein sequence was included in the alignment to consist of at least 3 exons. All retrocopies that met these requirements received the “KNOWN_PROTEIN_CODING” status.
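The filter for protein-coding retrocopies can be summarised as a single predicate over a candidate/parent pair. The sketch below is illustrative only: the `GeneModel` and `ProteinHit` records and their field names are hypothetical, coverage and identity are assumed to be fractions between 0 and 1, and the reverse transcriptase check is reduced to a precomputed flag rather than a parse of Ensembl protein descriptions.

```python
from dataclasses import dataclass

MIN_CDS_BP = 150        # minimum parental coding sequence length (from the text)
MIN_COVERAGE = 0.5      # minimum alignment coverage (from the text)
MIN_IDENTITY = 0.5      # minimum alignment identity (from the text)
MIN_PARENT_EXONS = 3    # minimum number of exons in the parental gene (from the text)


@dataclass(frozen=True)
class GeneModel:
    """Hypothetical summary of a gene's longest protein-coding transcript."""
    gene_id: str
    cds_length_bp: int
    exon_count: int
    cds_in_single_exon: bool
    has_rt_description: bool   # True if annotated with reverse transcriptase activity


@dataclass(frozen=True)
class ProteinHit:
    """Hypothetical record for one alignment from the protein self-alignment."""
    candidate: GeneModel
    parent: GeneModel
    coverage: float
    identity: float


def is_known_protein_coding_retrocopy(hit):
    """Apply the thresholds described in the text to one candidate/parent pair."""
    candidate, parent = hit.candidate, hit.parent
    return (
        candidate.gene_id != parent.gene_id      # products of the same gene are ignored
        and candidate.cds_in_single_exon         # whole CDS lies within one exon
        and not parent.has_rt_description        # parent shows no RT activity
        and parent.cds_length_bp >= MIN_CDS_BP
        and parent.exon_count >= MIN_PARENT_EXONS
        and hit.coverage >= MIN_COVERAGE
        and hit.identity >= MIN_IDENTITY
    )
```

A candidate with at least one passing hit would receive the “KNOWN_PROTEIN_CODING” status under this reading.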
All retrocopies were manually curated before the final release of the database. In particular, we manually screened fewer than 100 parental genes that gave rise to a large number of retrocopies. We excluded genes that originated from transposons and genes whose evolutionary conservation patterns seemed untrustworthy. Compared with the first release of the RetrogeneDB database (Kabza et al. 2014), we also excluded X human retrocopies (accession numbers here) that were located on chromosome patches. This decision was motivated by the fact that retrocopies located on genomic patches are usually identical or extremely similar to their counterparts on the actual chromosomes or scaffolds. As a result, it is impossible to obtain unique mappings of reads (RNA-Seq, ChIP-Seq etc.) to those regions.
References: