Gorilla gorilla NC

Martín-Peciña, María, Ruiz-Ruano, Francisco J, Camacho, Juan Pedro M & Dodsworth, Steven, 2019, Phylogenetic signal of genomic repeat abundances can be distorted by random homoplasy: a case study from hominid primates, Zoological Journal of the Linnean Society 185 (3), pp. 543-554 : 545-547

publication ID

https://doi.org/10.1093/zoolinnean/zly077

persistent identifier

https://treatment.plazi.org/id/03D387D3-FFC4-E072-CAA6-9CBEFD5CFCFA

treatment provided by

Plazi

scientific name

Gorilla gorilla NC

Gorilla gorilla NC_001645.1

Pongo pygmaeus NC_001646.1

Macaca mulatta NC_005943.1

NCBI, National Center for Biotechnology Information.

PREPARATION OF READ DATA FOR REPEAT ANALYSES

The SRA files were unpacked into FASTQ format using the FASTQ-DUMP tool from the SRA Toolkit. Low-quality reads in the FASTQ files were discarded using Trimmomatic (Bolger et al., 2014), removing adapters and retaining only read pairs in which every nucleotide had a Phred quality score (Q) > 30, with the options ‘ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:30 MINLEN:[100/101]’.
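The stated retention criterion (every nucleotide in both mates with Q > 30) can be illustrated with a minimal Python sketch. It assumes Phred+33 quality encoding and hypothetical function names; it shows the criterion only and does not reproduce how Trimmomatic applies its trimming steps.

```python
def min_phred(quality_line, offset=33):
    """Return the lowest Phred score in a FASTQ quality string.

    Assumption: Phred+33 encoding (ASCII code minus 33).
    """
    return min(ord(ch) - offset for ch in quality_line)


def pair_passes(qual_fwd, qual_rev, threshold=30):
    """True when every base of both mates has Q strictly above the threshold."""
    return min_phred(qual_fwd) > threshold and min_phred(qual_rev) > threshold


# 'I' encodes Q40, '@' encodes Q31 and '?' encodes Q30 under Phred+33.
print(pair_passes("IIII", "II@I"))  # all bases above Q30
print(pair_passes("IIII", "II?I"))  # one base at exactly Q30
```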

All samples were assumed to have a genome size of ~3.5 Gbp, based on data available in the Animal Genome Size Database (http://www.genomesize.com/ last accessed 12 November 2016), which showed only slight variation in genome size between the species used in this study (3.47–3.85 Gbp). This approximation is considered appropriate for this type of study (Dodsworth et al., 2016a). Each accession was then sampled for ~0.6% of the genome by randomly subsampling each Illumina dataset. This resulted in 200 000 reads per sample from all Hominidae accessions, randomly selected with SeqTK and then converted into FASTA format.
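The relationship between genome proportion and read number can be checked with a short calculation. The sketch below assumes a read length of 100 nt (the MINLEN setting allows 100 or 101); the function names are illustrative.

```python
GENOME_SIZE_BP = 3_500_000_000  # ~3.5 Gbp, as assumed in the text
READ_LENGTH_BP = 100            # assumption: reads of about 100 nt


def reads_for_fraction(genome_bp, fraction, read_len):
    """Number of reads needed to cover a given fraction of the genome."""
    return round(genome_bp * fraction / read_len)


def fraction_for_reads(genome_bp, n_reads, read_len):
    """Fraction of the genome represented by a given number of reads."""
    return n_reads * read_len / genome_bp


print(reads_for_fraction(GENOME_SIZE_BP, 0.006, READ_LENGTH_BP))
print(round(100 * fraction_for_reads(GENOME_SIZE_BP, 200_000, READ_LENGTH_BP), 2))
```

Under these assumptions, 0.6% of the genome corresponds to about 210 000 reads, and the 200 000 reads actually sampled represent about 0.57% of the genome.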

Selected reads from each sample were labelled with a unique five-character prefix, making a total combined dataset of 1 200 000 reads for datasets of one individual per species, 2 200 000 reads for datasets of two individuals per species and 3 200 000 reads for the global dataset including all individual samples. Specifically, we prepared three different datasets of one individual (library or sample) per species plus M. mulatta as an outgroup (six operational taxonomic units [OTUs] per dataset), three different datasets of two biological individuals per species plus M. mulatta as an outgroup (11 OTUs per dataset) and one dataset grouping together all libraries representing three biological individuals per species, making a total of 16 OTUs for phylogenetic analysis, as shown in Table 3.
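The dataset sizes above follow from five hominid species plus one outgroup sample at 200 000 reads each. The following sketch reproduces that arithmetic and shows one way a five-character prefix could be added to FASTA headers; the prefix "GgorA" and the function names are hypothetical, not taken from the study.

```python
READS_PER_SAMPLE = 200_000
HOMINID_SPECIES = 5  # human, chimpanzee, bonobo, gorilla, orangutan


def otu_count(individuals_per_species, outgroups=1):
    """OTUs in a dataset: hominid samples plus the M. mulatta outgroup."""
    return HOMINID_SPECIES * individuals_per_species + outgroups


def total_reads(individuals_per_species):
    """Total reads in the combined dataset."""
    return otu_count(individuals_per_species) * READS_PER_SAMPLE


def label_reads(fasta_lines, prefix):
    """Prepend a five-character sample prefix to every FASTA header."""
    if len(prefix) != 5:
        raise ValueError("prefix must be exactly five characters")
    for line in fasta_lines:
        yield ">" + prefix + line[1:] if line.startswith(">") else line


for k in (1, 2, 3):
    print(k, otu_count(k), total_reads(k))
print(list(label_reads([">read1", "ACGT"], "GgorA")))
```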

REPEATEXPLORER CLUSTERING OF SAMPLES

Clustering of Illumina reads was performed using the RepeatExplorer (RE) pipeline, implemented in a GALAXY server environment running locally at the University of Granada. RepeatExplorer clustering was used to identify genomic repeat clusters within each dataset, with default settings (minimum overlap = 55, cluster size threshold for detailed analysis = 0.01%, and the ‘all reads are paired’ option selected). For additional details about the clustering algorithm see Novák et al. (2010, 2013). For further identification of repeat clusters, we used a custom repeat database of all primate repetitive DNA annotations included in RepBase (Bao et al., 2015; http://www.girinst.org/repbase/ last accessed 20 November 2016). Following Dodsworth et al. (2016a), we used the 1000 most abundant repeat clusters, because they represented a sufficient proportion of the genome for phylogenetic analyses. Read counts per cluster and sample information obtained from RE can be found in figshare under the accession https://figshare.com/s/c2ccda047dd502890dcb.

PHYLOGENETIC ANALYSIS OF CLUSTERS

The 1000 most abundant clusters of each dataset were used to create the data matrices for phylogenetic inference. TNT software was chosen for phylogenetic analyses under the maximum parsimony principle (Goloboff & Mattoni, 2006; Goloboff et al., 2008). Cluster abundances were used as input (continuous characters). To make the cluster abundance values suitable as input for the TNT software, we divided all abundances by a factor calculated by dividing the abundance of the most abundant cluster by 65, so that all data would fall within the range 0–65 (with up to three decimals), as needed for analysis of continuous characters with TNT. Further transformations (e.g. cube root) were checked but provided no improvement on this scaling by a constant factor. Implicit enumeration (branch and bound) tree searches were used for datasets in this study owing to the small number of taxa in each dataset. Resampling was performed using 10 000 replicates, and symmetrical resampling was done by a modification of the standard bootstrap (Goloboff et al., 2003). FigTree v.1.4.3 (http://tree.bio.ed.ac.uk/) was used for graphical view and representation of phylogenetic trees.
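The scaling step can be sketched as follows. This is a minimal illustration, assuming that "the most abundant cluster" refers to the single largest value in the whole matrix; the function name and the toy matrix are hypothetical.

```python
def scale_for_tnt(matrix, upper=65.0, decimals=3):
    """Rescale cluster abundances so that the largest value equals `upper`.

    matrix: dict mapping each taxon to a list of cluster abundances.
    The divisor is (largest abundance / upper), as described in the text.
    """
    largest = max(max(row) for row in matrix.values())
    if largest <= 0:
        raise ValueError("matrix needs at least one positive abundance")
    factor = largest / upper
    return {
        taxon: [round(value / factor, decimals) for value in row]
        for taxon, row in matrix.items()
    }


toy = {"A": [1300, 650, 0], "B": [130, 65, 13]}  # toy read counts
print(scale_for_tnt(toy))
```

With the toy matrix the divisor is 1300 / 65 = 20, so every scaled value lies between 0 and 65.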

FILTERING OF DISTURBING CLUSTERS

After the first RE clustering, we found some clusters for satellite DNA and an endogenous retrovirus (ERV) that were abundant in chimpanzee, bonobo and gorilla but were absent in human and orangutan libraries. We identified these clusters by means of a Python script (https://github.com/mmarpe/phyl_rep_hominidae/blob/master/sel_clusters.py) that helped us to locate those clusters that had <25 reads in Homo and Pongo but that were abundant in the rest of the hominid species. The identity of these clusters was confirmed by the RepeatExplorer annotation and further characterized by means of sequence homology search using BLASTn (Altschul et al., 1990) and CENSOR (Kohany et al., 2006) tools.
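The selection criterion can be illustrated with a short sketch. It is not the authors' sel_clusters.py script: the input layout, the function name and the cut-off used for "abundant" (100 reads here) are assumptions made for illustration, and the counts are toy values.

```python
def select_clusters(counts, low_in, high_in, max_low=25, min_high=100):
    """Return clusters that are near-absent in some taxa but abundant in others.

    counts: dict mapping cluster name to a dict of taxon -> read count.
    low_in: taxa that must have fewer than `max_low` reads.
    high_in: taxa that must have at least `min_high` reads
             (hypothetical cut-off for "abundant").
    """
    selected = []
    for cluster, per_taxon in counts.items():
        low_ok = all(per_taxon.get(t, 0) < max_low for t in low_in)
        high_ok = all(per_taxon.get(t, 0) >= min_high for t in high_in)
        if low_ok and high_ok:
            selected.append(cluster)
    return selected


toy_counts = {  # toy values, not data from the study
    "CL3": {"Homo": 2, "Pongo": 0, "Pan": 5400, "Gorilla": 3100},
    "CL7": {"Homo": 900, "Pongo": 850, "Pan": 910, "Gorilla": 880},
    "CL9": {"Homo": 10, "Pongo": 30, "Pan": 700, "Gorilla": 650},
}
print(select_clusters(toy_counts, ["Homo", "Pongo"], ["Pan", "Gorilla"]))
```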

To test the effect of these clusters on the phylogenies built with the abundance of repeats, we performed two sets of phylogenetic analyses, one using unfiltered libraries and the other using libraries from which reads matching these particular clusters had been filtered out. Filtering was performed with DeconSeq software against the CL3 satellite consensus sequence (X74280.1 and X74281.1 GenBank accessions; Royle et al., 1994) and against CERV1_INT, the internal sequence for the endogenous retrovirus (Skaletsky et al., 2004) included in RepBase.

COMBINATIONS OF ONE OR TWO INDIVIDUALS PER SPECIES

Using a custom script written in Python (https://github.com/mmarpe/phyl_rep_hominidae/blob/master/sample_mix.py), we phylogenetically analysed all possible combinations of one or two individuals per taxon (243 phylogenetic trees each), with abundances obtained from a global RE run of all libraries involved in this study after the above filtering of clusters. The 1000 most abundant clusters of each combination were analysed by MP using TNT software, as described previously. Starting from the data for the 1000 most abundant clusters obtained from the RE run of all three individuals per species (all samples included in this paper) after filtering, the script constructs every possible cluster abundance dataset with either two individuals per species or a single individual per species, without repeating samples. It then generates the tree derived from each dataset using the same TNT parameters described above and, finally, converts the tree files from .nex format to .pdf format using FigTree to make them easier to visualize.
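The figure of 243 combinations follows from choosing one (or two) of three individuals independently for each of five species: 3^5 = 243 in both cases. The sketch below enumerates these combinations; it is not the authors' sample_mix.py script, and the sample labels are hypothetical.

```python
from itertools import combinations, product


def sample_combinations(samples_by_species, per_species):
    """Yield every dataset with `per_species` individuals chosen per species."""
    choices = [
        list(combinations(samples, per_species))
        for samples in samples_by_species.values()
    ]
    for pick in product(*choices):
        # Flatten the per-species tuples into one list of sample labels.
        yield [sample for group in pick for sample in group]


# Hypothetical labels: three individuals for each of five hominid species.
species = ["Homo", "PanT", "PanP", "Gori", "Pong"]
samples = {sp: [f"{sp}{i}" for i in (1, 2, 3)] for sp in species}
outgroup = "Mmul1"  # hypothetical label for the M. mulatta library

one = [combo + [outgroup] for combo in sample_combinations(samples, 1)]
two = [combo + [outgroup] for combo in sample_combinations(samples, 2)]
print(len(one), len(one[0]))
print(len(two), len(two[0]))
```

Each one-individual dataset has six OTUs and each two-individual dataset has 11 OTUs, matching the dataset sizes given earlier.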

The 243 trees produced from each set of combinations were grouped together in a file and, using Consense v.3.695 included in the PHYLIP package (Felsenstein, 1989, 2005), we obtained one consensus tree for the combinations of two individuals per species and another for the combinations of one individual per species. Each consensus tree consists of the groups that occur as often as possible in the data, through implementation of the majority rule (extended) method (Margush & McMorris, 1981).
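The core of a majority-rule consensus is counting how often each group of taxa appears across the input trees. The sketch below is a simplified illustration under stated assumptions: trees are represented as sets of clades (each clade a frozenset of taxon names), and only groups present in more than half of the trees are kept. The extended method in Consense additionally adds compatible lower-frequency groups, which is not implemented here.

```python
from collections import Counter


def majority_rule_clades(trees, threshold=0.5):
    """Return clades found in more than `threshold` of the trees.

    trees: list of sets, each set holding the clades (frozensets of
    taxon names) of one tree. The result maps clade -> frequency.
    """
    counts = Counter(clade for tree in trees for clade in tree)
    n_trees = len(trees)
    return {
        clade: count / n_trees
        for clade, count in counts.items()
        if count / n_trees > threshold
    }


# Toy input: three trees over three taxa (not results from the study).
hp = frozenset({"Homo", "Pan"})
pg = frozenset({"Pan", "Gorilla"})
hpg = frozenset({"Homo", "Pan", "Gorilla"})
toy_trees = [{hp, hpg}, {hp, hpg}, {pg, hpg}]
for clade, freq in majority_rule_clades(toy_trees).items():
    print(sorted(clade), round(freq, 2))
```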

Kingdom

Animalia

Phylum

Chordata

Class

Mammalia

Order

Primates

Family

Hominidae

Genus

Gorilla
