In this short tutorial, I will show how to preprocess mass spectrometry imaging (MSI) data using MALDIquant (Gibb and Strimmer 2012) and MALDIquantForeign.
First load the required packages:
library(MALDIquant)
##
## This is MALDIquant version 1.20
## Quantitative Analysis of Mass Spectrometry Data
## See '?MALDIquant' for more information about this package.
library(MALDIquantForeign)
library(irlba)
## Loading required package: Matrix
library(viridis)
## Loading required package: viridisLite
For this tutorial, I will use a MALDI-MSI dataset EY_210930pm_EDI-27jul21_p56-Male-642-12-6-T_50um_2SE
collected from a mouse brain in positive ion mode.
The dataset is publicly available on METASPACE2020 (thanks to Eylan Yutuc for sharing it online).
We call the function importImzMl from MALDIquantForeign to load the imzML file. The dataset is in centroided mode, so we set the option centroided = TRUE. This operation usually takes a few minutes, depending on the dataset size.
peaks <- importImzMl('EY_210930pm_EDI-27jul21_p56-Male-642-12-6-T_50um_2SE.imzML', centroided = TRUE, verbose = FALSE)
Some details about the dataset:
# Total number of pixels
print(length(peaks))
## [1] 35000
# Spatial dimensions (same metadata for all pixels)
print(peaks[[1]]@metaData$imaging$size)
## x y
## 250 140
NOTE: The following code expects the spectra to be acquired row-wise, left-to-right. The small function below reorders the elements of the peaks list in case they were acquired with a different pattern:
orderPixels <- function(peaks) {
  require(MALDIquant)
  # Pixel coordinates (x, y) of each spectrum
  coords <- coordinates(peaks)
  ord <- c()
  # Scan the image rows (y values) in order, sorting each row by x
  for (y in sort(unique(coords[, 2]))) {
    curr.y <- which(coords[, 2] == y)
    ord <- c(ord, curr.y[order(coords[curr.y, 1])])
  }
  return(ord)
}
px.ord <- orderPixels(peaks)
In this case, the order remains unchanged, since the spectra were already acquired in the expected order:
px.ord[1:10]
## [1] 1 2 3 4 5 6 7 8 9 10
max(diff(px.ord)) # Consecutive differences are all 1: the order is identical to the original
## [1] 1
# To reorder the peaks
peaks <- peaks[px.ord]
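As an aside, the same ordering could have been computed directly with a single call to order(), sorting by y first and then by x (an equivalent alternative to the loop above, not strictly needed here):
# Equivalent alternative to orderPixels(): sort pixels by y (row), then by x within each row
coords <- coordinates(peaks)
px.ord.alt <- order(coords[, 2], coords[, 1])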
Let’s have a look at the distributions of the number of detected peaks per pixel and of the pixel mean intensities (in log-space):
n.peaks <- unlist(lapply(peaks, function(x) length(mass(x))))
hist(n.peaks)
mu.peaks <- unlist(lapply(peaks, function(x) mean(intensity(x))))
hist(log1p(mu.peaks))
We can check that we have a set of MS peaks (class MassPeaks):
print(class(peaks[[1]]))
## [1] "MassPeaks"
## attr(,"package")
## [1] "MALDIquant"
and plot the peaks from one pixel:
plot(peaks[[3000]])
A quick way to check the spatial properties of an MSI dataset is to plot the total ion count (TIC) image, in which each pixel value is the sum of its peak intensities:
tic <- unlist(lapply(peaks, function(x) sum(intensity(x))))
tic <- matrix(tic, peaks[[1]]@metaData$imaging$size)
image(tic, col=viridis(64))
title('TIC')
As you may notice, it is not very informative, although the section boundaries are barely visible.
Things are a bit clearer if we calculate the TIC of the log-transformed intensities:
tic.log <- unlist(lapply(peaks, function(x) sum(log1p(intensity(x)))))
tic.log <- matrix(tic.log, peaks[[1]]@metaData$imaging$size)
image(tic.log, col=viridis(64))
title('TIC (log)')
After quickly checking some details of the dataset, we can start processing it. The first step is matching the peaks across all pixels. In this way, we assign the peak intensities to a set of common masses, based on the similarity of their original values.
This can be done by calling the function binPeaks from MALDIquant. We save the new masses in the same list of MassPeaks (NOTE: this procedure will overwrite the original masses, so if you want to keep them, you have to save the results in a different list).
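For example (a minimal sketch; peaks.raw is just a placeholder name), a copy of the unbinned peaks can be kept before binning:
# Keep a copy of the peaks with their original masses (placeholder object name)
peaks.raw <- peaks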
We will use the ‘strict’ method, and a tolerance of 20 ppm:
# binPeaks tolerance is expressed as a relative deviation (delta mass / mass); ours is in ppm:
tol.ppm <- 20
tol.maldiquant <- tol.ppm / 1e6
# The new common masses will overwrite the original ones. The intensities remain unmodified
peaks <- binPeaks(peaks, method = 'strict', tolerance = tol.maldiquant)
To perform downstream analysis we need the so-called intensity matrix (or feature matrix), which contains the intensities of all peaks assigned to the common masses. This matrix has shape npixels x nfeatures and is generated using the function intensityMatrix from MALDIquant.
NOTE: depending on the dataset, this matrix can be very big, so a large amount of RAM may be necessary.
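As a rough sanity check before building it (a sketch: the dense matrix stores one 8-byte double per pixel/mass pair), we can estimate its memory footprint from the number of unique binned masses:
# Rough estimate of the dense intensity matrix size, in GB (8 bytes per entry)
n.common <- length(unique(unlist(lapply(peaks, mass))))
print(length(peaks) * n.common * 8 / 1024^3)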
# This can occupy a lot of RAM
X <- intensityMatrix(peaks)
NOTE: MALDIquant sets the unassigned values to NA. You may need to convert them to 0 for downstream analysis (we do this below).
In this case, we have a whopping number of common masses equal to:
# most of the common masses represent noise!
ncol(X)
## [1] 181591
Most of these features are noise, as we can see from their assignment frequency:
# Unfortunately my computer doesn't have enough RAM to run apply
nz <- array(0, ncol(X))
for (i in 1:ncol(X)) nz[i] <- sum(!is.na(X[, i]))
summary(nz / nrow(X))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000286 0.0001429 0.0003714 0.0075767 0.0014857 0.9943143
plot(seq(0, 1, 0.01), sapply(seq(0, 1, 0.01), function(x) sum(nz / nrow(X) > x) / length(nz)),
xlab = 'Freq. threshold',
ylab = 'Fraction of kept masses', type = 'b')
As we can see, less than 10% of the common masses are detected in more than 1% of the pixels.
We set a threshold of 5% to remove the rare features (this choice is quite arbitrary, and here it is also driven by memory constraints):
keep.mass <- which(nz > nrow(X) * 0.05)
X <- X[, keep.mass]
dim(X)
## [1] 35000 4871
We now have 4,871 features, which looks far more reasonable than the original 181,591.
Where are my masses?
The common masses are saved as the column names of the intensity matrix:
common.masses <- as.numeric(colnames(X))
print(common.masses[1:20])
## [1] 400.3575 401.0743 401.1333 401.3608 401.8943 401.9524 401.9695 402.0778
## [9] 402.1366 402.1569 402.1643 402.1809 402.2389 402.3114 402.3477 402.3640
## [17] 402.3729 402.3862 402.9474 402.9645
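As a usage example (the m/z value below is an arbitrary illustrative choice, not a peak known to be in this dataset), we can find the feature column closest to a mass of interest and plot its single-ion image:
# Find the column closest to a query m/z (arbitrary illustrative value)
mz.query <- 798.54
idx <- which.min(abs(common.masses - mz.query))
print(common.masses[idx])
# Plot the corresponding single-ion image (NA values are simply not drawn)
ion <- matrix(X[, idx], peaks[[1]]@metaData$imaging$size)
image(ion, col = viridis(64))
title(paste('m/z', round(common.masses[idx], 4)))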
NOTE: let’s convert the NA values to 0.
X[is.na(X)] <- 0
We can now normalize the intensities to account for pixel-to-pixel variation. To do so, we use TIC scaling. We also apply a log-transformation to reduce the skewness of the intensity distributions.
scaling.factor <- apply(X, 1, sum)
X <- X / scaling.factor
# Be sure that the empty pixels are set to 0
X[scaling.factor == 0, ] <- 0
X <- log1p(X)
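As a quick sanity check (a sketch: expm1 inverts the log1p transform, and only a few pixels are checked to limit memory use), the TIC of each non-empty pixel should now be 1:
# After TIC scaling, each non-empty pixel should sum to 1 (expm1 undoes log1p)
check.idx <- head(which(scaling.factor > 0), 100)
summary(rowSums(expm1(X[check.idx, ])))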
A quick way to check the global molecular heterogeneity of the sample is to plot the first 3 principal components (PCs) as RGB channels. We use irlba to quickly extract them:
results <- irlba::prcomp_irlba(X, center = TRUE, scale. = TRUE, n = 3)
# First rescale the scores in [0, 1]
pc <- apply(results$x, 2, function(x) (x - min(x)) / (max(x) - min(x)))
# Create the RGB
im <- rgb(pc[, 1], pc[, 2], pc[, 3])
im <- matrix(im, peaks[[1]]@metaData$imaging$size)
# Plot the image as a raster (transpose before)
plot(as.raster(t(im)), interpolate = FALSE)
The image shows some degree of molecular heterogeneity within the tissue, as well as a clear difference between the tissue and the off-tissue region.
Now that we have the preprocessed intensity matrix, we can perform downstream analysis, such as supervised classification, clustering, etc.
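For example, a minimal sketch of an unsupervised downstream step is a k-means segmentation of the pixels (the number of clusters, 6, is an arbitrary choice for illustration):
# k-means segmentation of the preprocessed intensity matrix (k chosen arbitrarily)
set.seed(1)
km <- kmeans(X, centers = 6, iter.max = 50)
seg <- matrix(km$cluster, peaks[[1]]@metaData$imaging$size)
image(seg, col = viridis(6))
title('k-means segmentation')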
In this tutorial, we have used a basic filtering approach to reduce the number of common masses. More sophisticated methods based on spatial information are available, such as SPUTNIK (Inglese et al. 2018).
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] viridis_0.6.2 viridisLite_0.4.0 irlba_2.3.3
## [4] Matrix_1.3-4 MALDIquantForeign_0.12 MALDIquant_1.20
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.1 xfun_0.25 bslib_0.3.0
## [4] purrr_0.3.4 lattice_0.20-44 colorspace_2.0-2
## [7] vctrs_0.3.8 generics_0.1.0 htmltools_0.5.2
## [10] yaml_2.2.1 base64enc_0.1-3 utf8_1.2.2
## [13] XML_3.99-0.7 rlang_0.4.11 jquerylib_0.1.4
## [16] pillar_1.6.3 glue_1.4.2 DBI_1.1.1
## [19] lifecycle_1.0.1 stringr_1.4.0 munsell_0.5.0
## [22] gtable_0.3.0 evaluate_0.14 knitr_1.33
## [25] fastmap_1.1.0 parallel_4.1.1 fansi_0.5.0
## [28] highr_0.9 scales_1.1.1 jsonlite_1.7.2
## [31] gridExtra_2.3 ggplot2_3.3.5 digest_0.6.28
## [34] stringi_1.7.5 dplyr_1.0.7 readBrukerFlexData_1.8.5
## [37] grid_4.1.1 tools_4.1.1 magrittr_2.0.1
## [40] sass_0.4.0 readMzXmlData_2.8.1 tibble_3.1.5
## [43] crayon_1.4.1 pkgconfig_2.0.3 ellipsis_0.3.2
## [46] assertthat_0.2.1 rmarkdown_2.10 R6_2.5.1
## [49] compiler_4.1.1