Illumina Microarray Analysis with R – a newbie tutorial

If you stumbled here because of this title, you probably would have read all the other gazillion posts on microarray analysis using Bioconductor and packages like lumi, limma and beadarray – these posts are usually the top hits on google. Also, the reasons why you continued your search and ending up here, despite having read those,  can include:

  • after going through the tons of tutorial pdfs, you still can’t import your data into R
  • you don’t have any or all of these files: matrix.txt, TIFFs, genepix, control probe profile, targets.txt, idats…
  • your collaborator only gave you a freaking “sample probe profile.csv”, nothing else!
  • you didn’t know R existed before you started this frustrating journey

In case you are wondering, all those 4 bullets applied to me. When I first fired up R, I tried to open my csv file in the file->open menu. Yes, I kid you not. Anyway, fast forward 1 week, I finally have, in my hands, lists of transcripts/genes that changed due to different treatments. And thus I present to you my super simplified but working workflow.

Step 1. Firstly, this “tutorial” is for illumina HT12 chip. The “sample probe profile.csv” probably come from something called Beadsummary (whatever that is). It probably had column headings like:

PROBE_ID SYMBOL R2r.AVG_Signal R2r.Detection Pval R2r.NARRAYS R2r.ARRAY_STDEV R2r.BEAD_STDERR R2r.Avg_NBEADS

R2r is my sample name. Scrolling down the file, you will probably hit a region where this is no SYMBOL, only PROBE_ID. If you go across to the right of the file, you will see a column of UNIGENE. You can copy the list of UNIGENE with no SYMBOL, paste them into bioDBnet. and choose UNIGENE as input and Gene ID as output. Export the file as excel, and paste the list back into the SYMBOL column. Next, I delete the entire PROBE_ID column, such that SYMBOL is your first column. Now, save the file as txt, e.g. “Sample Probe Profile.txt”.

Step 2, install R, if you haven’t. I then installed R studio, which is “easier” to manage than R. Install the required packages:

source(“http://bioconductor.org/biocLite.R”)
biocLite(c(“lumi”, “limma”, “illuminaHumanv4.db”)

Load the libraries:

library(lumi)
library(limma)
library(illuminaHumanv4.db)

In RStudio, under the Files frame, go to the folder where you keep your microarray data, click More -> Set As Working Directory

Load your “Sample Probe Profile.txt” into R.

dataFile = “Sample Probe Profile.txt”
array.lumi = lumiR(dataFile, sep = NULL, detectionTh = 0.01, na.rm = TRUE, convertNuID = TRUE, lib.mapping = NULL, dec = ‘.’, parseColumnName = FALSE, checkDupId = TRUE, QC = TRUE, columnNameGrepPattern = list(exprs=’AVG_Signal’, se.exprs=’BEAD_STDERR’, detection=’Detection Pval’, beadNum=’Avg_NBEADS’), inputAnnotation=TRUE, annotationColumn=c(‘ACCESSION’, ‘SYMBOL’, ‘PROBE_SEQUENCE’, ‘PROBE_START’, ‘CHROMOSOME’, ‘PROBE_CHR_ORIENTATION’, ‘PROBE_COORDINATES’, ‘DEFINITION’), verbose = TRUE)

Step 3. Now we use the lumi package to do the log2 conversion and normalization. Then we run QC and see if any samples are outliers. If they are, you have to go back to the excel file and delete them.

dataT <- lumiT(array.lumi)
dataN <- lumiN(dataT)
plot(dataN, what=”boxplot”)
plot(dataN, what=”cv”)
plot(dataN, what=”density”)
plot(dataN, what=”pair”)
plot(dataN, what=”MA”)
plot(dataN, what=”sampleRelation”)

If everything looks ok, great, you have your sample successfully loaded into R.

Step 4. Create a targets.txt file. Open up excel, and make something like this, where FileName is basically the name of the sample that you ran on the chip. In my case, I have 2 variables, a condition and a treatment variable. Yours could just be a single variable. Save the file as targets.txt.

FileName Condition Treatment
R2r R R
R2i R I
R7r R R
R7i R I
NR1r NR R
NR1i NR I
NR2r NR R
NR2i NR I
NR3r NR R
NR3i NR I

eset <- dataN
targets = readTargets(“targets.txt”)
TS <- paste(targets$Condition, targets$Treatment, sep=”.”)
TS <- factor(TS, levels=c(“R.R”,”R.I”,”NR.R”,”NR.I”))
design <- model.matrix(~0+TS)
colnames(design) <- levels(TS)
fit <- lmFit(eset, design)

At this point, I need to teach you how to change stuff. Replace the Condition with your own heading. And if you don’t have the Treatment variable, remove targets$Treatment; or if you have more, then add it in. Next, you need to define the variables you want to compare. These variables must be the same as those you use in your targets.txt file. E.g. for me, I want to compare R.R, R.I, NR.R and NR.I. Note you separate the variables with a fullstop (.). Next, you design the contrast matrix like this.

cont.matrix <- makeContrasts(RIvsRR=R.I-R.R, NRIvsNRR=NR.I-NR.R, RvsNR=(R.I-R.R)-(NR.I-NR.R), levels=design)

Here, you define what you want to compare. For me, I want to find differences between R.I and R.R, hence I write it as RIvsRR=R.I-R.R. Its pretty intuitive like math formula. You can even use brackets like RvsNR=(R.I-R.R)-(NR.I-NR.R). After the contrast matrix, you continue with these codes, which ultimately will yield a Venn Diagram depicting overlaps in genes.

fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)
results <- decideTests(fit2)
vennDiagram(results)

Step 5. The final step. Is the extraction of data so you can continue to process them for whatever you want, i.e. IPA, GSEA, heatmap, GO…. To export the genes with p<0.05. The number correspond to the order in which I setup my contrast matrix. Hence 1 = RIvsRR; 2=NRIvsNRR; and 3=RvsNR.

coef1 = topTable(fit2, coef=1, number=1000000)
coef1f=coef1[coef1$P.Value<0.05,]
eset1 <- eset[coef1f$SYMBOL,]
write.table(eset1, “eset1.txt”, sep=”\t”)

coef2 = topTable(fit2, coef=2, number=1000000)
coef2f=coef2[coef2$P.Value<0.05,]
eset2 <- eset[coef2f$SYMBOL,]
write.table(eset2, “eset2.txt”, sep=”\t”)

coef3 = topTable(fit2, coef=3, number=1000000)
coef3f=coef3[coef3$P.Value<0.05,]
eset3 <- eset[coef3f$SYMBOL,]
write.table(eset3, “eset3.txt”, sep=”\t”)

Next to extract Venn Diagram genes. To get all the genes from contrast 1, e.g. all significant genes in RIvsRR

Iv = which(results[,1]!=0)
write.table(Iv, “RIvsRR.txt”, sep=”\t”)

To get intersect between contrast 1 and contrast 3; e.g. genes that changed in both RIvsRR and RvsNR.

Iv = which(results[,1]!=0 & results[,3]!=0)
write.table(Iv, “RIvsRR intersect RvsNR.txt”, sep=”\t”)

To get genes changes only in contrast 3; e.g. unique genes that changed in RvsNR.

Iv = which(results[,1]==0 & results[,2]==0 & results[,3]!=0)
write.table(Iv, “RvsNR only.txt”, sep=”\t”)

THAT’S THE END OF THE TUTORIAL!!! HAPPY analyzing your microarray!!!

April 2016

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s