Bhavik Vyasᵃ, Lenka Halámkováᵇ and Igor K. Lednev*ᵃ
ᵃ Department of Chemistry, University at Albany, State University of New York, Albany, NY 12222, USA. E-mail: ilednev@albany.edu
ᵇ Department of Environmental Toxicology, Texas Tech University, Lubbock, TX 79409, USA
First published on 27th August 2024
Modern criminal investigations rely heavily on trace bodily fluid evidence as a rich source of DNA. DNA profiling of such evidence can identify an individual if a matching DNA profile is available. Alternatively, phenotypic profiling based on the analysis of body fluid traces can significantly narrow down the pool of suspects in a criminal investigation. Urine stains are frequently encountered at crime scenes. Raman spectroscopy offers great potential as a universal confirmatory method for the identification of all main body fluids, including urine. In this proof-of-concept study, Raman spectroscopy combined with advanced statistics was used for race differentiation based on the analysis of urine stains. Specifically, a Random Forest (RF) model was built that differentiated donors of Caucasian (CA) and African American (AA) descent with 90% accuracy based on Raman spectra of dried urine samples. Raman spectra were collected from samples of 28 donors varying in age and sex. This novel technology offers great potential as a universal forensic tool for phenotypic profiling of a potential suspect immediately at the scene of a crime, providing invaluable information for a criminal investigation.
The analysis of body fluid traces at a crime scene is of paramount importance in forensic investigations.2 Identifying the type of body fluid associated with a specific stain can provide crucial contextual information, helping investigators determine the stain's relevance to the case. Body fluid traces are particularly significant as they can serve as a source of DNA evidence and make a link to a person of interest.3 Alongside fingerprints, DNA is one of the few pieces of physical evidence capable of conclusively identifying an individual.4
Along with DNA profiling, it is important to determine the type of bodily fluid, so that the prosecutor can demonstrate its relevance to the crime. Current forensic methods for body fluid identification are primarily based on enzymatic effects or serology.5 These methods can be time-consuming and destructive, and they sometimes give false positive results.6 These limitations are a serious drawback when the samples collected at crime scenes are low in quality and/or quantity. When little evidence is available, identifying the body fluid may consume all of it, leaving nothing for DNA analysis. Forensic investigations aim to prioritize critical testing and get the best possible outcome from the evidence to identify an individual. Ideally, the body fluid trace should be preserved after identification for future tests. Recent literature has noted that forensic laboratories are dealing with a considerable backlog of DNA evidence because all collected stains are subjected to DNA analysis without prior body fluid identification.7 Our laboratory3,5 and others8–13 have been developing emerging technologies for identifying body fluid traces. SupreMEtric LLC is commercializing a universal, non-destructive test for the confirmatory identification of all main body fluids using Raman spectroscopy (https://www.supremetric.com/).
A corresponding match is required to utilize the results of a DNA test in a criminal investigation. Alternatively, characteristics such as sex and race could be used to create a profile for a person of interest. Developing a suspect profile immediately after the crime scene is discovered could be invaluable for the investigative leads and narrowing down the pool of potential suspects. Our laboratory and others combine vibrational spectroscopy and machine learning for phenotypic profiling based on the analysis of body fluids. Specifically, Raman spectroscopy showed promising results for determining sex based on the analysis of bloodstains14 and saliva,15 race based on bloodstain16 and semen,17 and the age group of the donor based on bloodstains.18 In addition, ATR FTIR of bloodstains was used to determine the sex, race,19 and chronological age of the donor.20
One body fluid of interest for forensic analysis is urine, as it is vital evidence often found at crime scenes of sexual assault cases.3 At other times, urine is commonly discovered on the victims of kidnapping and confinement,21 in drug-related crimes, and on corrections officers after prisoners have thrown urine bombs.22 Forensic investigations aim to identify individuals involved in a crime, whether as the culprit or victim, by employing techniques such as DNA profiling from body fluid traces and fingerprint analysis from crime scenes. Studies have reported that DNA can be extracted from dry urine traces.23,24 However, the challenge in analyzing urine samples lies in the composition, as they are predominantly water and contain minimal cellular components.25 Thus, urine provides little DNA, which can be insufficient for profiling and makes identifying an individual challenging.26 Given these complexities, there is a pressing need for novel methods in forensic science to analyze urine traces accurately and non-destructively.
Urine is a primarily transparent, amber-colored, sterile liquid generated by the kidneys while filtering blood. Humans produce an average of 0.6–2.6 L of urine per day. The generation of urine depends upon the water balance in the body.25 The major components of urine are metabolic byproducts like urea, creatinine, ammonia, creatine, inorganic ions (Na+, K+, Cl−), hippuric acid, citric acid, etc.25 Urine composition can vary depending on the donors’ diet, physical activity, and environment. Various factors affect the concentration of creatinine.27 Creatinine is formed in the body by spontaneous irreversible dehydration of creatine and creatine phosphate from muscle metabolites. The rate of creatinine formation decreases with age. Approx. 2% of the body's creatine is converted to creatinine every 24 hours. Recent studies have shown that the amount of creatinine formed differs among different races because of genetic and biological factors.28 In the context of phenotype profiling based on urine trace evidence, Takakura et al. recently reported a method using Fourier transform infrared (FTIR) spectroscopy combined with multivariate statistics to determine the donor's sex from urine traces.29
The advancement of analytical techniques has significantly enhanced forensic trace evidence analysis, particularly in the context of body fluid identification. Emerging methods, such as advanced liquid chromatography mass spectrometry (LC-MS), X-ray diffraction, next-generation RNA sequencing, nanotechnologies, and lab-on-chip devices, are increasingly employed as presumptive and confirmatory tests.30 Vibrational spectroscopy techniques like Raman spectroscopy and infrared spectroscopy (IR) are gaining popularity in forensic evidence analysis.31–34 These techniques require minimal sample preparation, are highly specific, and offer great sensitivity with non-destructive and rapid analysis.5,35 Specifically, Raman spectroscopy has shown great potential in analyzing various types of forensic trace evidence, providing detailed information about the molecular composition of samples in a non-destructive manner. Examples include paint, hair, ink, fibers, fingerprints, gunshot residues, and body fluid analysis.35–39 Raman spectroscopy also allows for rapid analysis at a crime scene with the help of handheld spectrometers, which are commercially available.40,41 Handheld Raman instruments, such as the TruNarc™ handheld narcotic analyzer from Thermo Fisher Scientific (https://www.thermofisher.com/order/catalog/product/TRUNARC), are designed to be compact and lightweight. These features enhance their portability, allowing forensic experts to easily transport them to crime scenes or other locations requiring on-site forensic analysis. A handheld Raman spectrometer allows for real-time analysis and decision-making, which can be critical in forensic investigations. The addition of advanced chemometrics to spectroscopic techniques makes it a powerful and universal tool for trace evidence analysis.42–46 Chemometrics utilizes the multivariate property of spectral data and can uncover latent relationships of variables. 
Based on these relationships, it can draw a comprehensive output and, specifically, can provide classifications and identify significant components based on spectral features.47,48
In this proof-of-concept study, we utilize Raman spectroscopy combined with advanced statistical analysis to introduce a novel technique for determining the racial background of donors from dry urine traces. This approach has the potential to enhance the efficiency of suspect identification and streamline forensic investigations.
RF is a robust classification method known for its resilience against outliers and non-normal distributions (e.g., zero-truncated data and extreme value distributions). It can handle large numbers of variables54 even if they are highly correlated55 and can estimate the importance of each predictor (i.e., representative wavenumbers in spectra). RF is a classification and regression machine learning method that constructs an ensemble of numerous de-correlated decision trees.52,54 Each decision tree is trained on a different subset of the data, known as a bootstrap sample, drawn with or without replacement from the original training dataset. The data points left out of a given tree's sample form its out-of-bag (OOB) portion, which serves as a cross-validation dataset. The OOB error rate is the metric used to evaluate the cross-validation performance (misclassification rate) of the Random Forest model. To calculate it, each data point in the training set is predicted using only the trees that were not trained on that data point; the prediction is compared with the actual class, and the error rate is the proportion of misclassified data points. Because every prediction comes from trees that never saw the data point, the OOB error rate estimates the model's performance on unseen data. Note that the OOB error rate is available only for bootstrap-based ensembles such as Random Forest and is not a standard evaluation metric for other models.54
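The in-bag/out-of-bag split described above can be illustrated with a minimal sketch. It assumes sampling with replacement (the text allows either variant) and uses only the training-set size of 316 spectra reported later in the paper; the function name `bootstrap_split` and the seed are hypothetical choices made for illustration, not part of the authors' workflow.

```python
import random

def bootstrap_split(n_samples, rng):
    """Draw one bootstrap sample (with replacement) and return (in_bag, oob) index lists."""
    # In-bag: n_samples draws with replacement, so some indices repeat.
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    # Out-of-bag: every index that was never drawn for this tree.
    oob = [i for i in range(n_samples) if i not in drawn]
    return in_bag, oob

rng = random.Random(0)   # fixed seed so the sketch is reproducible
n_spectra = 316          # training-set size reported in the text
in_bag, oob = bootstrap_split(n_spectra, rng)
oob_fraction = len(oob) / n_spectra
print(f"in-bag draws: {len(in_bag)}, OOB spectra: {len(oob)} ({oob_fraction:.0%})")
```

With replacement, roughly one third of the spectra end up out-of-bag for any single tree, which is why every spectrum is left out of enough trees to receive an OOB prediction once a few hundred trees have been grown.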
We can tune several parameters to optimize the RF model for the intended classification. The first is node size, the minimum number of observations in a terminal node, which determines how deep each tree is grown. The second is the number of variables randomly sampled as candidates at each node, where the data are separated into two subsets so as to maximize their homogeneity with respect to the classes (races). The size of this candidate subset is set by the researcher, is the same for all trees, and is referred to as “m-try”. To determine the optimal number of trees, we built multiple Random Forest models with different numbers of trees (n-tree values) and recorded the OOB error rate. We then selected the number of trees at which the OOB error rate stabilized at its minimum. If there are many features in the training dataset, many trees may be necessary to encompass the variance. Hence, large datasets such as spectral datasets need considerable computing power and time to determine an optimal number of trees. We can use the Gini index feature selection technique for dimensionality reduction, to select the most critical spectral features for discriminating between classes, and to eliminate spectral noise.56 Gini importance measures the average gain of purity (homogeneity) produced by splits on a given variable: the more critical the variable, the more its splits separate mixed nodes into pure single-class nodes. A complementary, permutation-based importance is computed from the OOB data by observing how much the prediction error changes when the values of one variable are permuted while all others remain unchanged; permuting a vital variable leads to a relatively large increase in prediction error. The prediction error on the out-of-bag portion of the data for each tree was recorded as an OOB error rate for classification.
The RF model adopts the Gini index to determine the best-split selection based on spectral features. The OOB sample is used to estimate the prediction error and then to evaluate variable importance. RF assesses the relative importance of the features during the classification process by identifying variables that contribute the most to the analysis.
In addition to predicting outcomes in classification, RF can be applied to the training datasets to select essential variables. RF essentially tries to build homogeneous groups of samples, and RF reveals the features that most strongly influence the formation of these groups. RF allows for estimating the importance of elements used for classification, shedding light on the biological basis of the classification results.
In the tree-building process, a set of randomly chosen variables are considered candidates for each tree split, and the variables that yield the best separation are chosen. For subsequent nodes (partitions) in the tree, another optimal binary division is performed until a leaf (or terminal node) is created, representing a class. This process is repeated to construct other trees with another bootstrap portion from the original dataset. Many bootstrap samples and feature subsets are drawn from the original dataset. Each classification tree is fitted to a bootstrap sample (referred to as the “in-bag” spectra) using the subset features. The spectra not sampled (referred to as the “out-of-bag” spectra OOB) are left for testing, and the model makes the predictions based on these spectra. The predictions for all spectra in the OOB portion are made by traversing down the tree, and final predictions are made by averaging over the forecast of all decision trees. During the Random Forest (RF) construction, the OOB error rate is calculated to estimate predictive performance. The final prediction of the RF is a combination of predictions from all the trees in the ensemble. Each tree predicts a class for spectra, and the entire forest generates the percentage of votes for each class by aggregating results across all trees. Combining trees and their predictions is known as “bagging” and ensures that the trees are de-correlated with each other.
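The vote aggregation described above can be sketched as follows. The per-tree predictions are invented for illustration (200 trees, matching the forest size used later in the paper), and `forest_vote` is a hypothetical helper name rather than a function from any RF package.

```python
from collections import Counter

def forest_vote(tree_predictions):
    """Aggregate per-tree class predictions into vote fractions and a majority call."""
    counts = Counter(tree_predictions)
    n_trees = len(tree_predictions)
    fractions = {label: count / n_trees for label, count in counts.items()}
    # The class with the largest share of votes is the forest's prediction.
    # Note: on an exact tie, max() returns the first class encountered.
    predicted = max(fractions, key=fractions.get)
    return predicted, fractions

# Hypothetical votes for one spectrum from a forest of 200 trees.
votes = ["CA"] * 130 + ["AA"] * 70
label, fractions = forest_vote(votes)
print(label, fractions)
```

The vote fraction for a class is what the text calls the percentage of votes; for a two-class problem it can be read as the classification probability for that spectrum.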
The “bagging” technique aggregates high-variance trees to enhance prediction accuracy.56 Although the RF model randomizes the variable selection at each tree split, a forest grown on noisy, high-dimensional spectra can still overfit; using the Gini index for feature selection mitigates this risk by excluding noisy spectral variables.
In the context of a Random Forest model applied to a Raman spectral dataset, the mean decrease Gini (or Gini index) is a measure used to assess the importance of each spectral variable (or feature) in predicting the target variable. The Gini index measures impurity or the extent of class mixing within a decision tree node. The mean decrease Gini provides insight into which spectral variables contribute more to the predictive power of the Random Forest model for the Raman spectral dataset. Variables with higher mean decrease Gini values are typically considered more relevant or influential in distinguishing between different classes or categories within the dataset.57
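To make the impurity measure concrete, the sketch below computes the Gini impurity of a node and the decrease in impurity produced by one binary split. The node contents are toy labels chosen for illustration, and the function names are hypothetical; averaging such decreases over all splits on a variable and all trees would give that variable's mean decrease Gini.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its two children."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted

# Toy node with four spectra per class.
parent = ["CA"] * 4 + ["AA"] * 4
# A split that separates the classes completely removes all impurity.
perfect = gini_decrease(parent, ["CA"] * 4, ["AA"] * 4)
# A split that leaves both children equally mixed removes none.
useless = gini_decrease(parent, ["CA", "CA", "AA", "AA"], ["CA", "CA", "AA", "AA"])
print(perfect, useless)
```

A perfectly mixed two-class node has impurity 0.5, so 0.5 is the largest decrease a single split can achieve in this binary setting.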
Fig. 1 Average Raman spectra of both races: American of African descent (AA, green) and Caucasian American (CA, blue).
There is only a small difference between the preprocessed average Raman spectra of AA and CA samples, as evident in Fig. 1. The difference between the mean spectra of the AA and CA datasets is shown in Fig. 2, along with one standard deviation for each class. The difference spectrum lies within one standard deviation of each class, indicating that the difference is most probably statistically insignificant. Therefore, individual bands in the Raman spectra cannot be used to identify the donor's race class, and a statistical analysis of the entire Raman spectrum is needed to classify individual spectra.58
Fig. 2 Difference mean Raman spectrum (red) and in-class standard deviations of urine spectra of Caucasian American (blue) and American of African descent (green) classes.
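The comparison behind Fig. 2 can be sketched as a band-by-band check of whether the difference between the class means falls within one standard deviation of both classes. The three-band spectra below are invented toy values, not data from the study, and `band_summary` is a hypothetical helper; real spectra would contain many more wavenumbers.

```python
from statistics import mean, stdev

def band_summary(class_a, class_b):
    """For each band, return (mean difference, whether it lies within one SD of both classes).

    class_a and class_b are lists of spectra; each spectrum is a list of
    intensities on the same wavenumber axis.
    """
    summary = []
    # zip(*spectra) regroups the data band by band.
    for band_a, band_b in zip(zip(*class_a), zip(*class_b)):
        diff = mean(band_a) - mean(band_b)
        within = abs(diff) <= min(stdev(band_a), stdev(band_b))
        summary.append((diff, within))
    return summary

# Toy spectra with three bands each (illustrative values only).
class_a = [[1.0, 2.0, 3.0], [1.2, 2.1, 2.8], [0.8, 1.9, 3.2]]
class_b = [[1.1, 2.0, 5.0], [0.9, 2.2, 5.2], [1.0, 1.8, 4.8]]
summary = band_summary(class_a, class_b)
for band, (diff, within) in enumerate(summary):
    print(f"band {band}: difference {diff:+.2f}, within one SD: {within}")
```

In the toy data only the third band separates the classes; when every band is flagged as within one standard deviation, as reported for the urine spectra, no single band can serve as a classifier on its own.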
We used 18 samples (9 CA and 9 AA) to create a training dataset for the RF model and set aside ten randomly chosen samples as the test dataset. The randomized selection of test samples ensures that each sample has an equal opportunity to be included in the test set, mitigating potential limitations and biases. After the training data from the 18 donors were processed and the final RF model was built, the spectral data from the remaining 10 samples (the external validation, or test, dataset) were analyzed, and their classes were predicted to determine the RF model's performance.
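A donor-level split of this kind can be sketched as below. The donor identifiers are hypothetical, and drawing five test donors per class is an assumption: the text only says that ten samples were chosen at random, although the donor-level counts in Table 1 are consistent with five test donors of each race. Splitting by donor rather than by spectrum keeps all spectra of a test donor out of training.

```python
import random

def split_donors(donors_by_class, n_test_per_class, seed=0):
    """Randomly hold out n_test_per_class donors from each class as the test set."""
    rng = random.Random(seed)
    train, test = [], []
    for label, donors in sorted(donors_by_class.items()):
        shuffled = donors[:]          # copy so the input lists stay untouched
        rng.shuffle(shuffled)
        test += [(d, label) for d in shuffled[:n_test_per_class]]
        train += [(d, label) for d in shuffled[n_test_per_class:]]
    return train, test

# Hypothetical donor IDs: 14 donors per class, 28 in total.
donors_by_class = {
    "CA": [f"CA{i:02d}" for i in range(1, 15)],
    "AA": [f"AA{i:02d}" for i in range(1, 15)],
}
train, test = split_donors(donors_by_class, n_test_per_class=5)
print(len(train), "training donors;", len(test), "test donors")
```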
The Gini index (or mean decrease Gini) produced by RF was applied for dimensionality reduction. Specifically, the Gini index selected the most “important spectral features” to build a simpler model. Furthermore, we reran Random Forest, dropping 65% of the least informative features from the model suggested by the Gini index. This step aimed to reduce the Raman spectral region, minimize the inclusion of features that may not significantly contribute to the model's predictive power, and prevent overfitting the model based on data noise. Using the predictors chosen by the Gini index, a new RF model was trained on the entire training dataset comprising 316 spectra. The Mean Gini index helped to select 500 features (wavenumber regions) from the training dataset to build the final model and eliminate noise. Fig. 3A shows the mean decrease in the Gini coefficient as a measure of how each variable contributes to the homogeneity of the nodes in the resulting Random Forest.
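The feature-dropping step can be sketched as a simple ranking by importance. The importance values below are invented for a toy set of 20 features, and `select_top_features` is a hypothetical helper; only the 65% drop fraction comes from the text.

```python
def select_top_features(importances, drop_fraction=0.65):
    """Return the indices of the features kept after dropping the least important ones."""
    n_features = len(importances)
    n_keep = max(1, round(n_features * (1.0 - drop_fraction)))
    # Rank feature indices from most to least important.
    ranked = sorted(range(n_features), key=lambda i: importances[i], reverse=True)
    # Report the kept indices in their original (wavenumber) order.
    return sorted(ranked[:n_keep])

# Toy mean-decrease-Gini values for 20 features (illustrative only).
importances = [0.10, 0.90, 0.05, 0.40, 0.30, 0.02, 0.70, 0.01, 0.60, 0.20,
               0.03, 0.08, 0.50, 0.04, 0.06, 0.07, 0.09, 0.11, 0.12, 0.13]
kept = select_top_features(importances)
print(kept)
```

The retained indices would then define the reduced spectral region on which the final model is retrained.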
The mean decrease Gini values identified the Raman bands of urine best suited for classifying the two racial groups (Fig. 3A). The corresponding Raman shifts are 549 cm−1, 780 cm−1, 1013 cm−1, and 1610 cm−1, which can be assigned mostly to creatine, urea, and creatinine.25,26 The literature supports this assignment, as the creatinine formation rate differs among races.28 The band at 546 cm−1 was identified as the most significant based on the mean decrease Gini. However, the mean and standard deviation values of this band overlap substantially between the classes (Fig. 3B). Consequently, a machine learning approach is needed that can leverage all features selected by the Gini index, along with their combinations, to exploit the complexity inherent in hyperspectral datasets and discern donors’ race accurately.
In the next step, we explored how changes in n-tree and m-try affected the OOB error rate. The value of m-try = 12 was obtained using the automatic tuning function in R software (Liaw and Wiener 2001). We also plotted RF models with varying numbers of trees against the corresponding OOB error rate. Fig. 4 shows that after about 200 trees, the error stabilizes and reaches the minimum. Consequently, the final RF model was based on 200 classification trees with twelve variables at each split (m-try). Regarding the node size parameter (the minimum number of observations required to create a terminal node), we kept the default settings of the package (node size = 1), which results in deeper trees with more final nodes.
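One way to turn "the error stabilizes" into a rule is sketched below: pick the smallest forest size from which every larger forest stays within a tolerance of the lowest observed OOB error. Both the tolerance and the error curve are assumptions made for illustration; the numbers are not the data behind Fig. 4, and the authors may have judged stabilization visually.

```python
def smallest_stable_ntree(curve, tolerance=0.005):
    """Return the smallest n_trees after which the OOB error stays near its minimum.

    curve is a list of (n_trees, oob_error) pairs sorted by n_trees.
    Returns None if the error never settles within the tolerance.
    """
    best = min(err for _, err in curve)
    for idx, (n_trees, _) in enumerate(curve):
        # Stable means: this point and all later points are close to the best error.
        if all(err <= best + tolerance for _, err in curve[idx:]):
            return n_trees
    return None

# Invented OOB error curve for illustration only.
curve = [(10, 0.120), (50, 0.060), (100, 0.035), (150, 0.028),
         (200, 0.022), (300, 0.023), (400, 0.022), (500, 0.022)]
chosen = smallest_stable_ntree(curve)
print("chosen number of trees:", chosen)
```

Choosing the smallest stable forest keeps computation down without giving up accuracy, since adding trees beyond that point only repeats the same error level.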
The OOB error rate is calculated during model training and approximates the model's error.59 This error rate, obtained for each spectrum from the trees that were not trained on it, was used to assess the final RF model's performance. Once trained, the final RF model was applied to the external test dataset of 175 spectra from ten different donors to evaluate its performance on completely new and unseen data. The performance on the test dataset was summarized by a confusion matrix, which provides a detailed breakdown of the model's predictions compared with the actual classes in the test dataset.
The OOB error rate of the final RF model was estimated as 2%, corresponding to seven misclassified spectra (Table 1), which indicates a high predictive performance of the constructed RF model. All donors in the training dataset were correctly classified. The mean decrease Gini index for the feature selection helped improve the performance and prediction ability of the Random Forest model by selecting the spectral region that contributes most to the discrimination of two races while eliminating noise and spurious data points. The OOB error rate provided a robust estimation of error, validating the model's effectiveness.
Type of Random Forest | Classification |
---|---|
Number of trees | 200 |
Number of variables at each split | 12 |
Node size | 1 |
OOB estimate of error rate | 2% |

Confusion matrix (cross validation) | CA (actual) | AA (actual) |
---|---|---|
CA (predicted) | 152 | 3 |
AA (predicted) | 4 | 157 |

Confusion matrix (external validation) | CA (actual) | AA (actual) |
---|---|---|
CA (predicted) | 84 | 13 |
AA (predicted) | 3 | 75 |

External validation (donor level) | CA (actual) | AA (actual) |
---|---|---|
CA (predicted) | 5 | 1 |
AA (predicted) | 0 | 4 |
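The error and accuracy figures quoted in the text can be recomputed from the counts in the confusion matrices above. The sketch below uses only those counts; the variable and function names are chosen for illustration.

```python
def accuracy(matrix):
    """Fraction of cases on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Rows: predicted CA, predicted AA; columns: actual CA, actual AA.
oob_matrix = [[152, 3], [4, 157]]       # cross-validation (OOB), 316 spectra
spectral_matrix = [[84, 13], [3, 75]]   # external validation, 175 spectra
donor_matrix = [[5, 1], [0, 4]]         # external validation, 10 donors

oob_error = 1.0 - accuracy(oob_matrix)
print(f"OOB error rate: {oob_error:.1%}")
print(f"Spectral-level accuracy: {accuracy(spectral_matrix):.1%}")
print(f"Donor-level accuracy: {accuracy(donor_matrix):.0%}")
```

Seven of 316 training spectra are off the diagonal, which matches the reported OOB error of about 2%; 159 of 175 test spectra and nine of ten test donors are on the diagonal, matching the reported 91% and 90%.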
The OOB method provided very stable prediction results with high prediction accuracy and low error for the cross-validation. However, the cross-validation is performed solely on the training dataset. To determine the model's true performance, we conducted external validation using ten samples that were withheld from the training process. In Random Forests (RF), predictions are made by aggregating votes from all decision trees in the ensemble. Each spectrum is classified by majority vote: the class receiving the highest number of votes is the prediction. Thus, each unknown spectrum is ultimately assigned to the race, CA or AA, with the higher classification probability; that is, the default 50% threshold was applied to the model's spectral-level predictions. A total of 175 spectra from the 10 external validation donors were introduced into the final RF model. The classification predictions for these spectra are shown in Fig. 5 as the probability (per spectrum) of assignment to class CA. This analysis discriminated CA and AA races with an accuracy of 91% at the spectral level (Table 1). Urine stains display inherent heterogeneity with a non-uniform distribution of components throughout the stain. This heterogeneity results from evaporation, diffusion, and crystallization processes, which affect the concentration and spatial organization of urine constituents. As a result, distinct variations in the Raman spectra are observed across different regions of the stain.
Spectral misclassification is therefore not unusual: because of the low-concentration regions of biomarkers in urine stains mentioned above, not all Raman spectra collected from a stain reflect the characteristic Raman signature of the donor's race. Hence, it is vital to perform sample-level classification based on the spectral-level predictions. With the default 50% threshold applied at the donor level, external validation reached 90% accuracy for predicting the class of unknown donors. Nine of ten donors were classified correctly, with most of their spectra assigned to the actual class, thus confirming the method's reliability for this proof-of-concept study of race differentiation based on human urine traces analyzed by Raman spectroscopy.
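The two-stage decision described above, a 50% threshold per spectrum followed by a 50% threshold on the share of spectra per donor, can be sketched as follows. The probabilities are invented for a single hypothetical donor, `classify_donor` is an illustrative helper, and sending exact ties to AA is an arbitrary choice that the text does not specify.

```python
def classify_donor(ca_probabilities, threshold=0.5):
    """Assign a donor to CA or AA from the per-spectrum probabilities of class CA."""
    # Stage 1: call each spectrum using the spectral-level threshold.
    spectral_calls = ["CA" if p > threshold else "AA" for p in ca_probabilities]
    # Stage 2: call the donor from the share of spectra assigned to CA.
    ca_share = spectral_calls.count("CA") / len(spectral_calls)
    donor_call = "CA" if ca_share > threshold else "AA"
    return donor_call, ca_share

# Hypothetical CA probabilities for seven spectra from one stain.
probabilities = [0.92, 0.81, 0.44, 0.77, 0.63, 0.38, 0.88]
donor_call, ca_share = classify_donor(probabilities)
print(donor_call, f"{ca_share:.0%} of spectra assigned to CA")
```

Here two of the seven spectra are misclassified at the spectral level, yet the donor is still assigned correctly, which is how donor-level accuracy can tolerate the stain heterogeneity discussed above.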
Fig. 5 External validation of the RF model. The estimated classification probability for each spectrum is shown. All urine spectra were scored with the likelihood attributed to the CA race.
For this proof-of-concept study, we used 18 samples, each with 18 spectra from different spots, totaling approximately 316 spectra. The high-dimensional nature of Raman spectroscopy, with numerous features per spectrum, compensated for the limited sample size in the calibration dataset. Random Forests are effective with small datasets,54 and bootstrap cross-validation with out-of-bag error estimation further supports our model's reliability by training on multiple data subsets and validating with the remaining data.60 Our binary Random Forest model achieved 90% accuracy on external validation samples, which were completely unknown to the model, demonstrating its robustness. Urine serves as an ultrafiltrate of blood, containing cellular components indicative of genetic and hereditary traits. The results obtained affirm that Raman spectroscopy's selectivity enables the capture of distinct genetic markers present within urine stains found at the crime scene.
This journal is © The Royal Society of Chemistry 2024