In this short tutorial, I will show how to preprocess mass spectrometry imaging (MSI) data using MALDIquant (Gibb and Strimmer 2012) and MALDIquantForeign.
First load the required packages:
library(MALDIquant)
##
## This is MALDIquant version 1.20
## Quantitative Analysis of Mass Spectrometry Data
## See '?MALDIquant' for more information about this package.
library(MALDIquantForeign)
library(irlba)
## Loading required package: Matrix
library(viridis)
## Loading required package: viridisLite
For this tutorial, I will use a MALDI-MSI dataset EY_210930pm_EDI-27jul21_p56-Male-642-12-6-T_50um_2SE
collected from a mouse brain in positive ion mode.
The dataset is publicly available on METASPACE2020 (thanks to Eylan Yutuc for sharing it online).
We call the function importImzMl from MALDIquantForeign to load the imzML file. The dataset is in centroided mode, so we set the option centroided = TRUE. This operation usually takes a few minutes, depending on the dataset size.
peaks <- importImzMl('EY_210930pm_EDI-27jul21_p56-Male-642-12-6-T_50um_2SE.imzML', centroided = TRUE, verbose = FALSE)
Some details about the dataset:
# Total number of pixels
print(length(peaks))
## [1] 35000
# Spatial dimensions (same metadata for all pixels)
print(peaks[[1]]@metaData$imaging$size)
## x y
## 250 140
NOTE: The following code expects the spectra to be acquired row-wise, left-to-right. The small function below reorders the elements of the peaks list in case they were acquired with a different pattern:
orderPixels <- function(peaks) {
  require(MALDIquant)
  # Pixel coordinates (x, y) of each spectrum
  coords <- coordinates(peaks)
  ord <- c()
  # Scan the image rows (y values) in order, sorting each row by x
  for (y in sort(unique(coords[, 2]))) {
    curr.y <- which(coords[, 2] == y)
    ord <- c(ord, curr.y[order(coords[curr.y, 1])])
  }
  return(ord)
}
px.ord <- orderPixels(peaks)
In this case, the order remains unchanged, since the spectra were already acquired in the expected order:
px.ord[1:10]
## [1] 1 2 3 4 5 6 7 8 9 10
max(diff(px.ord)) # Consecutive differences are all 1: the order is identical to the original
## [1] 1
# To reorder the peaks
peaks <- peaks[px.ord]
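As an aside, the same ordering could have been computed directly with a single call to order(), sorting by y first and then by x (an equivalent alternative to the loop above, not strictly needed here):
# Equivalent alternative to orderPixels(): sort pixels by y (row), then by x within each row
coords <- coordinates(peaks)
px.ord.alt <- order(coords[, 2], coords[, 1])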
Let’s have a look at the distributions of the number of detected peaks per pixel and of the pixel mean intensities (in log-space):
n.peaks <- unlist(lapply(peaks, function(x) length(mass(x))))
hist(n.peaks)
mu.peaks <- unlist(lapply(peaks, function(x) mean(intensity(x))))
hist(log1p(mu.peaks))
We can check that we have a set of MS peaks (class MassPeaks):
print(class(peaks[[1]]))
## [1] "MassPeaks"
## attr(,"package")
## [1] "MALDIquant"
and plot the peaks from one pixel:
plot(peaks[[3000]])
A quick way to check the spatial properties of an MSI dataset is to plot the total ion count (TIC) image, in which each pixel value is the sum of its peak intensities:
tic <- unlist(lapply(peaks, function(x) sum(intensity(x))))
tic <- matrix(tic, peaks[[1]]@metaData$imaging$size)
image(tic, col=viridis(64))
title('TIC')
As you may notice, it is not very informative, although the section boundaries are barely visible.
Things are a bit clearer if we calculate the TIC of the log-transformed intensities:
tic.log <- unlist(lapply(peaks, function(x) sum(log1p(intensity(x)))))
tic.log <- matrix(tic.log, peaks[[1]]@metaData$imaging$size)
image(tic.log, col=viridis(64))
title('TIC (log)')
After quickly checking some details of the dataset, we can start processing it. The first step is matching the peaks across all pixels. In this way, we assign the peak intensities to a set of common masses, based on the similarity of their original values.
This can be done by calling the function binPeaks from MALDIquant. We save the new masses in the same list of MassPeaks (NOTE: this procedure will overwrite the original masses, so if you want to keep them, you have to save the results in a different list).
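For example (a minimal sketch; peaks.raw is just a placeholder name), a copy of the unbinned peaks can be kept before binning:
# Keep a copy of the peaks with their original masses (placeholder object name)
peaks.raw <- peaks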
We will use the ‘strict’ method, and a tolerance of 20 ppm:
# binPeaks tolerance is expressed as a relative deviation (delta mass / mass); ours is in ppm:
tol.ppm <- 20
tol.maldiquant <- tol.ppm / 1e6
# The new common masses will overwrite the original ones. The intensities remain unmodified
peaks <- binPeaks(peaks, method = 'strict', tolerance = tol.maldiquant)
To perform downstream analysis we need the so-called intensity matrix (or feature matrix), which contains the intensities of all peaks assigned to the common masses. This matrix has shape npixels x nfeatures and is generated using the function intensityMatrix from MALDIquant.
NOTE: depending on the dataset, this matrix can be very big, so a large amount of RAM may be necessary.
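As a rough sanity check before building it (a sketch: the dense matrix stores one 8-byte double per pixel/mass pair), we can estimate its memory footprint from the number of unique binned masses:
# Rough estimate of the dense intensity matrix size, in GB (8 bytes per entry)
n.common <- length(unique(unlist(lapply(peaks, mass))))
print(length(peaks) * n.common * 8 / 1024^3)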
# This can occupy a lot of RAM
X <- intensityMatrix(peaks)
NOTE: MALDIquant sets the unassigned values to NA. You may need to convert them to 0 for downstream analysis (we do this below).
In this case, we have a whopping number of common masses equal to:
# most of the common masses represent noise!
ncol(X)
## [1] 181591
Most of these features are noise, as we can see from their assignment frequency:
# Unfortunately my computer doesn't have enough RAM to run apply
nz <- array(0, ncol(X))
for (i in 1:ncol(X)) nz[i] <- sum(!is.na(X[, i]))
summary(nz / nrow(X))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000286 0.0001429 0.0003714 0.0075767 0.0014857 0.9943143
plot(seq(0, 1, 0.01), sapply(seq(0, 1, 0.01), function(x) sum(nz / nrow(X) > x) / length(nz)),
xlab = 'Freq. threshold',
ylab = 'Fraction of kept masses', type = 'b')
As we can see, less than 10% of the common masses are detected in more than 1% of the pixels.
We set a threshold of 5% to remove the rare features (this choice is quite arbitrary, and here it is also driven by memory constraints):
keep.mass <- which(nz > nrow(X) * 0.05)
X <- X[, keep.mass]
dim(X)
## [1] 35000 4871
We now have 4,871 features, which looks far more reasonable than the original 181,591.
Where are my masses?
The common masses are saved as the column names of the intensity matrix:
common.masses <- as.numeric(colnames(X))
print(common.masses[1:20])
## [1] 400.3575 401.0743 401.1333 401.3608 401.8943 401.9524 401.9695 402.0778
## [9] 402.1366 402.1569 402.1643 402.1809 402.2389 402.3114 402.3477 402.3640
## [17] 402.3729 402.3862 402.9474 402.9645
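As a usage example (the m/z value below is an arbitrary illustrative choice, not a peak known to be in this dataset), we can find the feature column closest to a mass of interest and plot its single-ion image:
# Find the column closest to a query m/z (arbitrary illustrative value)
mz.query <- 798.54
idx <- which.min(abs(common.masses - mz.query))
print(common.masses[idx])
# Plot the corresponding single-ion image (NA values are simply not drawn)
ion <- matrix(X[, idx], peaks[[1]]@metaData$imaging$size)
image(ion, col = viridis(64))
title(paste('m/z', round(common.masses[idx], 4)))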
NOTE: let’s convert the NA values to 0.
X[is.na(X)] <- 0
We can now normalize the intensities to account for pixel-to-pixel variation. To do so, we use TIC scaling. We also apply a log-transformation to reduce the skewness of the intensity distributions.
scaling.factor <- apply(X, 1, sum)
X <- X / scaling.factor
# Be sure that the empty pixels are set to 0
X[scaling.factor == 0, ] <- 0
X <- log1p(X)
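As a quick sanity check (a sketch: expm1 inverts the log1p transform, and only a few pixels are checked to limit memory use), the TIC of each non-empty pixel should now be 1:
# After TIC scaling, each non-empty pixel should sum to 1 (expm1 undoes log1p)
check.idx <- head(which(scaling.factor > 0), 100)
summary(rowSums(expm1(X[check.idx, ])))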
A quick way to check the global molecular heterogeneity of the sample is to plot the first 3 principal components (PCs) as RGB channels. We use irlba to quickly extract them:
results <- irlba::prcomp_irlba(X, center = TRUE, scale. = TRUE, n = 3)
# First rescale the scores in [0, 1]
pc <- apply(results$x, 2, function(x) (x - min(x)) / (max(x) - min(x)))
# Create the RGB
im <- rgb(pc[, 1], pc[, 2], pc[, 3])
im <- matrix(im, peaks[[1]]@metaData$imaging$size)
# Plot the image as a raster (transpose before)
plot(as.raster(t(im)), interpolate = FALSE)
The image shows some degree of molecular heterogeneity within the tissue, as well as a clear difference between the tissue and the off-tissue region.
Now that we have the preprocessed intensity matrix, we can perform downstream analysis, such as supervised classification, clustering, etc.
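For example, a minimal sketch of an unsupervised downstream step is a k-means segmentation of the pixels (the number of clusters, 6, is an arbitrary choice for illustration):
# k-means segmentation of the preprocessed intensity matrix (k chosen arbitrarily)
set.seed(1)
km <- kmeans(X, centers = 6, iter.max = 50)
seg <- matrix(km$cluster, peaks[[1]]@metaData$imaging$size)
image(seg, col = viridis(6))
title('k-means segmentation')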
In this tutorial, we have used a basic filtering approach to reduce the number of common masses. More sophisticated methods based on spatial information are available, such as SPUTNIK (Inglese et al. 2018).
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] viridis_0.6.2 viridisLite_0.4.0 irlba_2.3.3
## [4] Matrix_1.3-4 MALDIquantForeign_0.12 MALDIquant_1.20
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.1 xfun_0.25 bslib_0.3.0
## [4] purrr_0.3.4 lattice_0.20-44 colorspace_2.0-2
## [7] vctrs_0.3.8 generics_0.1.0 htmltools_0.5.2
## [10] yaml_2.2.1 base64enc_0.1-3 utf8_1.2.2
## [13] XML_3.99-0.7 rlang_0.4.11 jquerylib_0.1.4
## [16] pillar_1.6.3 glue_1.4.2 DBI_1.1.1
## [19] lifecycle_1.0.1 stringr_1.4.0 munsell_0.5.0
## [22] gtable_0.3.0 evaluate_0.14 knitr_1.33
## [25] fastmap_1.1.0 parallel_4.1.1 fansi_0.5.0
## [28] highr_0.9 scales_1.1.1 jsonlite_1.7.2
## [31] gridExtra_2.3 ggplot2_3.3.5 digest_0.6.28
## [34] stringi_1.7.5 dplyr_1.0.7 readBrukerFlexData_1.8.5
## [37] grid_4.1.1 tools_4.1.1 magrittr_2.0.1
## [40] sass_0.4.0 readMzXmlData_2.8.1 tibble_3.1.5
## [43] crayon_1.4.1 pkgconfig_2.0.3 ellipsis_0.3.2
## [46] assertthat_0.2.1 rmarkdown_2.10 R6_2.5.1
## [49] compiler_4.1.1