Contents

1 Introduction

SureTypeSCR is an R package for QC and rapid genotyping of single cell SNP array data. It encapsulates previously developed library SureTypeSC (paper). The core consists of a two layered machine learning method that assigns a quality score to each SNP in an array. On top of that, SureTypeSCR implements various QC strategies to examine the data using packages from the tidyverse collection.

This tutorial gives basic introduction to the data analysis with SureTypeSCR. The data used in this tutorial is publicly available (paper and data) and subset of it (sperm data amplified by MDA) was transformed into an R data package (johnsonspermdata) for purpose of this demonstration.

2 Installation

Both, SureTypeSCR and johnsonspermdata can be installed from github repository using devtools. The dependencies will be installed automatically.

library(devtools)

##install  SureTypeSCR  from  github
#devtools::install_github("Meiomap/SureTypeSCR",force=TRUE)

##install  data  package  with  sperm  data (compiled  from  GSE19247)
#devtools::install_github("Meiomap/johnsonspermdata")

When loaded for the first time, SureTypeSCR will create a python virtual environment and install all required python libraries via python package installer (pip). Besides the actual data in GTC format, the R data package johnsonspermdata contains samplesheet and data frame with metadata that will be used for loading and plotting, respectively.

library(SureTypeSCR)
library(johnsonspermdata)
# library(tidyverse) library(magrittr) library(ggrepel) load metadata and
# samplesheet location
data(metadata, samplesheet)

3 Data loading and basic operations

The manifest file *.bpm and cluster file *.egt) that are required for feature extraction and creation of the basic data frame are deployed with the package. Note that these files are array specific and differ accross Illumina products. The program lists the sample files as being analyzed and reports dimensions of the final data frame to the output (before and after applying internal indexing).

manifest = system.file("files/HumanCytoSNP-12v2_H.bpm", package = "SureTypeSCR")
cluster = system.file("files/HumanCytoSNP-12v2_H.egt", package = "SureTypeSCR")
df = scbasic(samplesheet = samplesheet, bpm = manifest, egt = cluster)

3.1 Filtering

After loading the basic dataframe df we filter out markers that do not contain genotypes (so called intensity-only SNPs) and view the structure of the data frame.

# filtering out intensity -only SNPs
df %<>%
    dplyr::filter(Chr != 0 & !str_detect(Name, "cnv"))
head(df)
##        Name Chr  Position individual b_allele_freq gtype log_r_ratio
## 1 rs1000002   3 183635768  gsm477532     0.9859948    BB   -1.009453
## 2 rs1000002   3 183635768  gsm477533     0.8003091    AB   -4.800915
## 3 rs1000002   3 183635768  gsm477534     0.8919672    BB   -4.125801
## 4 rs1000002   3 183635768  gsm477535     0.3622532    AB   -4.449686
## 5 rs1000002   3 183635768  gsm477536     0.6464212    AB   -3.152554
## 6 rs1000002   3 183635768  gsm477537     0.2595262    AB   -3.872426
##         score           x x_raw          y y_raw
## 1 0.869653046 0.019764755  1031 0.40287372  8306
## 2 0.009361755 0.008056554   504 0.02349852   937
## 3 0.014443527 0.008048215   752 0.04152295  1591
## 4 0.022765663 0.024449598   498 0.01598536   512
## 5 0.165830925 0.039333437   729 0.06226622  1067
## 6 0.040316373 0.040040240   645 0.01812202   610

3.2 Call rates

Call rates can be calculated per group basis - here per individual (sample) and genotype and functions from tidyverse can be used to reshape the results for better readability.

df %>%
    group_by(individual, gtype) %>%
    callrate() %>%
    pivot_wider(values_from = Callrate, names_from = gtype)
## # A tibble: 23 x 5
## # Groups:   individual [23]
##    individual    AA    AB    BB    NC
##    <chr>      <dbl> <dbl> <dbl> <dbl>
##  1 gsm477532  0.223 0.134 0.228 0.415
##  2 gsm477533  0.235 0.100 0.251 0.415
##  3 gsm477534  0.218 0.141 0.227 0.415
##  4 gsm477535  0.184 0.227 0.174 0.415
##  5 gsm477536  0.175 0.224 0.187 0.415
##  6 gsm477537  0.220 0.170 0.196 0.415
##  7 gsm477538  0.209 0.187 0.190 0.415
##  8 gsm477539  0.167 0.211 0.207 0.415
##  9 gsm477540  0.161 0.214 0.210 0.415
## 10 gsm477541  0.181 0.204 0.200 0.415
## # ... with 13 more rows

4 Principal component analysis

Principal component analysis determines relationships between the individuals by reducing n-dimensional representation of SNP array data (n is equal to total number of SNPs that call in every sample) into few (here two) most important dimensions that maximize the variation (explain most of the variability). The user has an option whether to calculate the PCA per chromosome basis (parameter by_chrom), which can reveal potential aneuploidies or just to reveal relationship between samples across all SNPs in one PCA.

df %>%
    plot_pca(features = "gtype", metadata = metadata, by_chrom = FALSE)
## [1] "PCA from  174231  features"

5 MA transformation

MA transformation displays intensities as logarithmic average (A) and logarithmic difference (M). Roughly speaking, M~0 corresponds to heterozygous genotypes, M<0 corresponds to homozygous BB and M>0 corresponds to homozygous AA. As explained in Vogel et al. (2019), MA transformation minimizes the intra-group variability and allows to visualise the genotyping clusters (AA, BB and AB) from the whole dataset on one plot.

df %>%
    calculate_ma() %>%
    plot_ma(n = 0.1)

6 Single cell genotype classification

The classification of the single cell genotypes is performed via pre-trained classifier (Random Forest) that is distributed within the package. A second layer of the algorithm (Gaussian Discriminant Analysis) is executed directly on the data and per sample basis.

clf = system.file("files/rf.clf", package = "SureTypeSCR")
df_model = df %>%
    calculate_ma() %>%
    group_by(individual) %>%
    nest() %>%
    mutate(model = map(data, function(df) suretype_model(df, individual, clf)))
## Processing sample gsm477532
## Processing sample gsm477533
## Processing sample gsm477534
## Processing sample gsm477535
## Processing sample gsm477536
## Processing sample gsm477537
## Processing sample gsm477538
## Processing sample gsm477539
## Processing sample gsm477540
## Processing sample gsm477541
## Processing sample gsm477542
## Processing sample gsm477543
## Processing sample gsm477544
## Processing sample gsm477551
## Processing sample gsm477552
## Processing sample gsm477553
## Processing sample gsm477554
## Processing sample gsm477555
## Processing sample gsm477556
## Processing sample gsm477557
## Processing sample gsm477558
## Processing sample gsm477559
## Processing sample gsm477560
head(df_model)
## # A tibble: 6 x 3
## # Groups:   individual [6]
##   individual data                    model                 
##   <chr>      <list>                  <list>                
## 1 gsm477532  <tibble [298,412 x 15]> <tibble [298,412 x 2]>
## 2 gsm477533  <tibble [298,412 x 15]> <tibble [298,412 x 2]>
## 3 gsm477534  <tibble [298,412 x 15]> <tibble [298,412 x 2]>
## 4 gsm477535  <tibble [298,412 x 15]> <tibble [298,412 x 2]>
## 5 gsm477536  <tibble [298,412 x 15]> <tibble [298,412 x 2]>
## 6 gsm477537  <tibble [298,412 x 15]> <tibble [298,412 x 2]>

If we visualise the MA plot again - this time for the filtered data to explore the effect of the RF-GDA, we can see that the algorithm has effectively filtered out the heterozygous genotypes (m ~ 0). The heterozygous genotypes are likely artefacts and products of erroneous whole genome amplification (WGA) as sperm cells are haploid and there were no aneuploidies reported on these samples (Johnson et al., 2010).

df_model_unnested = df_model %>%
    unnest(c(data, model))
head(df_model_unnested)
## # A tibble: 6 x 18
## # Groups:   individual [1]
##   individual Name       Chr    Position b_allele_freq gtype log_r_ratio   score
##   <chr>      <chr>      <chr>     <dbl>         <dbl> <chr>       <dbl>   <dbl>
## 1 gsm477532  rs1000002  3     183635768       0.986   BB         -1.01  0.870  
## 2 gsm477532  rs1000003  3      98342907       0.00834 AA         -0.367 0.932  
## 3 gsm477532  rs10000030 4     103374154       0.335   AB         -2.87  0      
## 4 gsm477532  rs10000037 4      38924330       0       AA          0.363 0.918  
## 5 gsm477532  rs10000041 4     165621955     NaN       NC        NaN     0      
## 6 gsm477532  rs1000007  2     237752054       0.821   BB         -3.27  0.00924
## # ... with 10 more variables: x <dbl>, x_raw <int>, y <dbl>, y_raw <int>,
## #   m <dbl>, a <dbl>, m_raw <dbl>, a_raw <dbl>, rfgda_score <dbl>,
## #   rf_score <dbl>
df_model_unnested %>%
    set_threshold("rfgda_score", threshold = 0.5) %>%
    plot_ma(n = 0.1)

6.1 Conditional processing

An optional parameter of function suretype_model(.) is .sclist that controls samples that will be processed using the RF-GDA algorithm (by default, all samples in the dataset are analysed). The following example demonstrates use case when we only want to analyze subset of samples as single cells - this is particularly useful in case there are also bulk DNA samples (i.e. parental genotypes) in the dataset and we wish to filter those samples out. First we list all sample names in the dataset:

df %>%
    select(individual) %>%
    distinct()
##    individual
## 1   gsm477532
## 2   gsm477533
## 3   gsm477534
## 4   gsm477535
## 5   gsm477536
## 6   gsm477537
## 7   gsm477538
## 8   gsm477539
## 9   gsm477540
## 10  gsm477541
## 11  gsm477542
## 12  gsm477543
## 13  gsm477544
## 14  gsm477551
## 15  gsm477552
## 16  gsm477553
## 17  gsm477554
## 18  gsm477555
## 19  gsm477556
## 20  gsm477557
## 21  gsm477558
## 22  gsm477559
## 23  gsm477560

And now we only choose sample gsm477532 and gsm477533 for classification.

sample_subset = list("gsm477532", "gsm477533")

df_model = df %>%
    calculate_ma() %>%
    group_by(individual) %>%
    nest() %>%
    mutate(model = map(data, function(df) suretype_model(df, individual, clf, .sclist = sample_subset)))
## Processing sample gsm477532
## Processing sample gsm477533

Mean and standard deviation of the score per sample clearly shows that only the selected samples were process while all the other samples were assigned score 1 for all genotypes.

df_model %>%
    unnest(c(data, model)) %>%
    group_by(individual) %>%
    summarize(mean = mean(rfgda_score, na.rm = TRUE), sd = sd(rfgda_score, na.rm = TRUE))
## # A tibble: 23 x 3
##    individual  mean    sd
##    <chr>      <dbl> <dbl>
##  1 gsm477532  0.505 0.389
##  2 gsm477533  0.575 0.368
##  3 gsm477534  1     0    
##  4 gsm477535  1     0    
##  5 gsm477536  1     0    
##  6 gsm477537  1     0    
##  7 gsm477538  1     0    
##  8 gsm477539  1     0    
##  9 gsm477540  1     0    
## 10 gsm477541  1     0    
## # ... with 13 more rows

7 Converting idat to gtc

In some scenarios, the data is present in idat format - format that only stores the raw intensities and some other metadata and requires preprocessing, normalization and genotyping. SureTypeSCR is equipped with simple wrapper that facilitates conversion from IDAT to GTC. Illumina’s IAAP-cli is required for running the conversion and the user is guided to download this software. We demonstrate the conversion on publicly available sperm data from the GEO database. Following entities are required (on the top of SureTypeSCR dependencies) to run this demo:

7.1 Initialization and data retrieval

We recommend to first correctly setup path to third-party IAAP-cli. Assuming SureTypeSCR is installed, we invoke IAAP-cli wrapper configuration script by running following code:

library(SureTypeSCR)

configure_iaap()

The program will guide you to Illumina’s web site where you can download IAAP-cli archive corresponding to your platform. After the archive has been download, press enter and type in the path of the downloaded archive, or, alternatively, a pop up window will appear (if GUI available) that you use for navigating to the downloaded archive. The function will then unpack the files into package’s home folder and return the path of the binary file.

After IAAP-cli is correctly set up, We initialize the required packages, load metadata related to GSE19247 and select only MDA amplified sperm data

library(GEOquery)
gse <- getGEO("GSE19247", GSEMatrix = TRUE)

samplelist_sperm = as.data.frame(gse$`GSE19247-GPL6985_series_matrix.txt.gz`) %>%
    filter((str_detect(cell.type.ch1, "sperm")) & str_detect(cell.amplication.ch1, 
        "MDA")) %>%
    rownames()
samplelist_sperm
##  [1] "GSM477532" "GSM477533" "GSM477534" "GSM477535" "GSM477536" "GSM477537"
##  [7] "GSM477538" "GSM477539" "GSM477540" "GSM477541" "GSM477542" "GSM477543"
## [13] "GSM477544" "GSM477551" "GSM477552" "GSM477553" "GSM477554" "GSM477555"
## [19] "GSM477556" "GSM477557" "GSM477558" "GSM477559" "GSM477560"

We then use function getGEO_and_folder(.) to download the samples defined in samplelist_sperm. The idat files originating from the same array are downloaded into a common folder and the name of the folder corresponds to the array ID. Having idat files grouped by the array ID is the required by the conversion software (IAAP-cli).

metadf_sperm = map_if(samplelist_sperm, function(x) !is.null(x), function(x) SureTypeSCR::getGEO_and_folder_in(x, 
    download = TRUE))
metadf_sperm[[1]]
##                                     fname  sampleid    arrayid position
## 1 GSM477532_4299791150_R02C01_Grn.idat.gz GSM477532 4299791150   R02C01
## 2 GSM477532_4299791150_R02C01_Red.idat.gz GSM477532 4299791150   R02C01
##       channel
## 1 Grn.idat.gz
## 2 Red.idat.gz
##                                                                                                           url
## 1 https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM477nnn/GSM477532/suppl//GSM477532_4299791150_R02C01_Grn.idat.gz
## 2 https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM477nnn/GSM477532/suppl//GSM477532_4299791150_R02C01_Red.idat.gz
##                      outfile
## 1 4299791150_R02C01_Grn.idat
## 2 4299791150_R02C01_Red.idat

Every element of metadf_sperm stores record about a single sperm sample - we collapse all records into single data frame by running the following code.

metadf_sperm_merged = Reduce(function(...) merge(..., all = T), metadf_sperm)
head(metadf_sperm_merged)
##                                     fname  sampleid    arrayid position
## 1 GSM477532_4299791150_R02C01_Grn.idat.gz GSM477532 4299791150   R02C01
## 2 GSM477532_4299791150_R02C01_Red.idat.gz GSM477532 4299791150   R02C01
## 3 GSM477533_4299791150_R01C02_Grn.idat.gz GSM477533 4299791150   R01C02
## 4 GSM477533_4299791150_R01C02_Red.idat.gz GSM477533 4299791150   R01C02
## 5 GSM477534_4299791150_R02C02_Grn.idat.gz GSM477534 4299791150   R02C02
## 6 GSM477534_4299791150_R02C02_Red.idat.gz GSM477534 4299791150   R02C02
##       channel
## 1 Grn.idat.gz
## 2 Red.idat.gz
## 3 Grn.idat.gz
## 4 Red.idat.gz
## 5 Grn.idat.gz
## 6 Red.idat.gz
##                                                                                                           url
## 1 https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM477nnn/GSM477532/suppl//GSM477532_4299791150_R02C01_Grn.idat.gz
## 2 https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM477nnn/GSM477532/suppl//GSM477532_4299791150_R02C01_Red.idat.gz
## 3 https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM477nnn/GSM477533/suppl//GSM477533_4299791150_R01C02_Grn.idat.gz
## 4 https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM477nnn/GSM477533/suppl//GSM477533_4299791150_R01C02_Red.idat.gz
## 5 https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM477nnn/GSM477534/suppl//GSM477534_4299791150_R02C02_Grn.idat.gz
## 6 https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM477nnn/GSM477534/suppl//GSM477534_4299791150_R02C02_Red.idat.gz
##                      outfile
## 1 4299791150_R02C01_Grn.idat
## 2 4299791150_R02C01_Red.idat
## 3 4299791150_R01C02_Grn.idat
## 4 4299791150_R01C02_Red.idat
## 5 4299791150_R02C02_Grn.idat
## 6 4299791150_R02C02_Red.idat

Now we query the function idat_to_gtc (which is essentially the wrapper for IAAAP-cli) over all arrayid’s. At this stage manifest and cluster file are required (deployed with the package for this type of array - Human CytoSNP array).

In the final stage we create sample sheet that links the sample names with their array ID and position on the array:

write_samplesheet("samplesheet.csv", metadf_sperm_merged, "GTC", manifest)
head(read.csv("samplesheet.csv"))
##           X.Header.                       X X.1 X.2 X.3
## 1 INVESTIGATOR NAME           Investigator1            
## 2      PROJECT NAME                   Test1            
## 3   EXPERIMENT NAME     SingleCellReference            
## 4              DATE              2021-05-29            
## 5       [Manifests]                                    
## 6                 A HumanCytoSNP-12v2_H.bpm

Now we can test whether we can load the GTC files listed in the sample sheet using function scbasic(.)

df = scbasic(manifest, cluster, "samplesheet.csv")
head(df)
##          Name Chr  Position individual b_allele_freq gtype log_r_ratio score
## 1 cnvi0111185  13 100631779  gsm477532           NaN    NC         NaN     0
## 2 cnvi0111185  13 100631779  gsm477533           NaN    NC         NaN     0
## 3 cnvi0111185  13 100631779  gsm477534           NaN    NC         NaN     0
## 4 cnvi0111185  13 100631779  gsm477535           NaN    NC         NaN     0
## 5 cnvi0111185  13 100631779  gsm477536           NaN    NC         NaN     0
## 6 cnvi0111185  13 100631779  gsm477537           NaN    NC         NaN     0
##     x x_raw   y y_raw
## 1 NaN     0 NaN     0
## 2 NaN     0 NaN     0
## 3 NaN     0 NaN     0
## 4 NaN     0 NaN     0
## 5 NaN     0 NaN     0
## 6 NaN     0 NaN     0

From now on we can proceed with the analysis of df as shown in the main tutorial.

8 Further information

For detailed description of available functions and parameters please see the reference manual (vignette). You can invoke help for a particular function by typing ?function_name.