A Data Analysis Pipeline for Identification of Untargeted GC-EI-MS Spectra

DEBORAH F. MCGLYNN, Lindsay Yee, Lewis Geer, Yuri Mirokhin, Dmitrii Tchekhovskoi, Coty Jen, Allen Goldstein, Anthony J. Kearsley, Stephen E. Stein, National Institute of Standards and Technology

     Abstract Number: 186
     Working Group: Instrumentation and Methods

Abstract
Despite the thousands of compounds represented in mass spectral (MS) libraries, a large fraction of spectra in complex mixtures cannot be identified. Environmentally sampled compounds pose a special challenge, as they are commonly subjected to complex chemical reactions such as oxidation or pyrolysis. While matching chromatographic retention index (RI) data improves the confidence of any spectral identification, this information is not always available and typically requires significant manual effort to incorporate in the analysis. In this work, we present a search method that considers MS similarity, retention indices, and molecular mass when scoring library matches. When retention indices are not available, AI-estimated values are provided. The new method was applied to an MS dataset containing 4833 trimethylsilyl (TMS)-derivatized spectra of particulate organic compounds emitted by wildland fires, collected by the University of California at Berkeley. The dataset was searched against the NIST 2023 EI-MS library using the identity search method with retention index penalization. The RI tolerance window was ±25 between the unknown and the library spectra. Based on NIST23 library matching, this analysis led to 181 new identifications in the dataset. In addition, 105 identities from previous work were confirmed, while 34 previous identifications were changed. Molecular masses were then estimated for mass spectra with high signal-to-noise ratios. This allowed the library-based 'hybrid' search method to be used to identify compounds similar to, but not present in, libraries. These new methods increase reproducibility, reduce analysis time, and increase identification frequency in these complex mixtures.
Because the number of unidentified spectra remains large (4530 spectra), this work concludes by identifying which spectra are most likely to be identified through additions to libraries or further analysis, and which are most likely to remain unidentifiable due to possible contamination or low signal strength.
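
To illustrate the idea of retention index penalization described above, the following minimal Python sketch combines a spectral similarity score with an RI penalty. It is not the scoring used in this work or in NIST software: the cosine similarity, the 0 to 999 scaling, the function names, and the `rate` and `max_penalty` parameters are all assumptions chosen for illustration. Only the ±25 window is taken from the abstract.

```python
import math


def cosine_match(unknown, library):
    """Cosine similarity of two spectra given as {m/z: intensity} dicts.

    Scaled to 0-999 (an assumed convention for this sketch).
    """
    shared = set(unknown) & set(library)
    dot = sum(unknown[mz] * library[mz] for mz in shared)
    norm_u = math.sqrt(sum(v * v for v in unknown.values()))
    norm_l = math.sqrt(sum(v * v for v in library.values()))
    if norm_u == 0 or norm_l == 0:
        return 0.0
    return 999.0 * dot / (norm_u * norm_l)


def ri_penalty(ri_unknown, ri_library, window=25.0, rate=2.0, max_penalty=100.0):
    """Penalty for an RI mismatch outside the tolerance window.

    `rate` (score points per RI unit beyond the window) and `max_penalty`
    are hypothetical values, not taken from the source.
    """
    if ri_unknown is None or ri_library is None:
        return 0.0  # no RI available: no penalty in this sketch
    excess = abs(ri_unknown - ri_library) - window
    if excess <= 0:
        return 0.0
    return min(max_penalty, rate * excess)


def penalized_score(unknown, library, ri_unknown, ri_library):
    """Spectral match score reduced by the RI penalty, floored at zero."""
    raw = cosine_match(unknown, library)
    return max(0.0, raw - ri_penalty(ri_unknown, ri_library))


# Illustrative spectrum with made-up intensities.
spectrum = {73: 999.0, 147: 300.0, 217: 150.0}
within_window = penalized_score(spectrum, spectrum, 1500.0, 1520.0)
outside_window = penalized_score(spectrum, spectrum, 1500.0, 1535.0)
```

In this sketch a library candidate whose RI falls within the window keeps its full spectral score, while a candidate outside the window is ranked lower in proportion to the excess deviation, up to a cap.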