Gene expression microarrays generate very large amounts of transcriptomic data. These datasets may cover thousands of genes from hundreds of individuals and can be used to identify genes associated with a particular disease or to assess gene expression profiles in response to a given therapy. Transcriptomic datasets are usually uploaded to a public repository, such as the Gene Expression Omnibus (GEO), where others can review the data and draw their own conclusions. In effect, these datasets don’t just shape the hypotheses of the papers in which they were first published; they influence every other scientist who uses them to guide their own research.
In 2015, Miriam Lohr and her group from Dortmund University in Germany set out to quantify the frequency of mislabeled samples in 45 publicly available transcriptomic datasets containing data from cancer patients. They did this using sex-specific identifiers, that is, genes expressed from either the X or the Y chromosome. They analyzed the expression patterns of these genes to infer whether each sample came from a male or a female patient, then compared that result with the sex recorded for the patient. Of the 4913 patients they evaluated, 1.1% were “misclassified” and 3.0% were “unconfident,” meaning that the sex could not be confirmed from the transcriptomic data. In 18 of the 45 datasets (40%), they detected at least one “misclassified” sample. To demonstrate the effect mislabeled samples could have on actual study results, Lohr et al assessed which genes in the cohorts had prognostic value. They found that when mislabeling errors were incorporated, 12% to 53% of the genes significantly associated with patient survival were no longer significant, while another 9% to 39% of genes appeared as newly significant [1].
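The kind of check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used by Lohr et al: the marker genes (`XIST` for the X chromosome, `RPS4Y1` for the Y chromosome), the `MIN_GAP` cutoff, and the sample values are all hypothetical choices made for the example.

```python
from dataclasses import dataclass

# Hypothetical marker genes and cutoff, chosen for illustration only.
FEMALE_MARKER = "XIST"    # X-linked transcript, assumed high in female samples
MALE_MARKER = "RPS4Y1"    # Y-linked transcript, assumed high in male samples
MIN_GAP = 2.0             # assumed minimum log2 difference for a confident call


@dataclass
class Sample:
    sample_id: str
    annotated_sex: str    # "F" or "M", as recorded in the dataset metadata
    log2_expr: dict       # gene symbol -> log2 expression value


def predict_sex(log2_expr):
    """Return "F", "M", or None when the two markers do not separate clearly."""
    gap = log2_expr[FEMALE_MARKER] - log2_expr[MALE_MARKER]
    if gap >= MIN_GAP:
        return "F"
    if gap <= -MIN_GAP:
        return "M"
    return None


def check_sample(sample):
    """Label a sample as "consistent", "misclassified", or "unconfident"."""
    predicted = predict_sex(sample.log2_expr)
    if predicted is None:
        return "unconfident"
    return "consistent" if predicted == sample.annotated_sex else "misclassified"


# Made-up expression values, not taken from any real dataset.
samples = [
    Sample("S1", "F", {"XIST": 10.5, "RPS4Y1": 3.1}),
    Sample("S2", "M", {"XIST": 3.4, "RPS4Y1": 11.2}),
    Sample("S3", "M", {"XIST": 10.1, "RPS4Y1": 3.0}),  # annotation disagrees with expression
    Sample("S4", "F", {"XIST": 6.0, "RPS4Y1": 5.5}),   # ambiguous signal
]

results = {s.sample_id: check_sample(s) for s in samples}
for sample_id, status in results.items():
    print(sample_id, status)
```

In this toy data, S3 is flagged as “misclassified” and S4 as “unconfident.” A real analysis would need cutoffs calibrated to the platform and normalization used in each dataset.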
A similar study was performed in 2016 by Lilah Toker and colleagues. They used a comparable methodology, applying sex-specific genes to identify mislabeled samples in 70 transcriptomic datasets, which included both cancer-related and non-cancer-related studies. This group confirmed Lohr’s initial findings: they discovered mislabeled samples in 46% of the datasets analyzed, with an average mismatch rate of 2%. Though the source of error was usually difficult to determine, the most common source appeared to be samples that had been physically mixed up, rather than mistakes in recording the participants’ sex [2].
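Dataset-level figures like the ones quoted above can be derived from per-sample flags. The sketch below uses invented toy data and assumes that the “average mismatch rate” is the mean of the per-dataset rates; the source does not say whether the published figure was computed that way or pooled across all samples. The function name `summarize` and the dataset names are hypothetical.

```python
def summarize(datasets):
    """Summarize mismatch flags.

    `datasets` maps a dataset name to a list of booleans, one per sample,
    where True means the expression-based sex disagrees with the annotation.
    """
    rates = {name: sum(flags) / len(flags) for name, flags in datasets.items()}
    affected = sum(1 for rate in rates.values() if rate > 0)
    return {
        "fraction_affected": affected / len(rates),              # datasets with >= 1 mismatch
        "mean_mismatch_rate": sum(rates.values()) / len(rates),  # mean of per-dataset rates
        "per_dataset": rates,
    }


# Invented flags for four imaginary datasets.
toy = {
    "dataset_A": [False] * 50,
    "dataset_B": [False] * 48 + [True] * 2,
    "dataset_C": [False] * 19 + [True],
    "dataset_D": [False] * 40,
}

summary = summarize(toy)
print(summary["fraction_affected"], summary["mean_mismatch_rate"])
```

Here two of the four toy datasets contain at least one mismatch, even though the mean mismatch rate is only a few percent, which mirrors how a low per-sample error rate can still touch a large share of datasets.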
The main point of both studies was to shed light on how pervasive mislabeling can be in transcriptomic datasets. Mislabeled samples are a serious concern, as they can lead any number of research groups who use them to erroneous conclusions. The authors also note that while sex-specific identifiers can be used to correct mismatches, samples swapped between patients of the same sex can’t be identified with these methods, so the true error rate is likely higher than what was reported. Altogether, the importance of appropriate labeling can’t be overstated. Every precaution should be taken to ensure that samples are labeled correctly, including making sure your labels are suited to their environment. Barcoded labels, radio-frequency identification (RFID) labels, and laboratory information management systems (LIMS) can also help reduce errors during the processing of high-throughput data generated from transcriptomic analyses.
1. Lohr M, Hellwig B, Edlund K, et al. Identification of sample annotation errors in gene expression datasets. Arch Toxicol. 2015;89:2265-2272.
2. Toker L, Feng M, Pavlidis P. Whose sample is it anyway? Widespread misannotation of samples in transcriptomics studies. F1000Research. 2016;5:1-13.