A Data-Driven Model for Classification of Traditional Chinese Medicine Materials
This paper proposes a data-driven classification model for traditional Chinese medicinal herbs based on mid-infrared spectral data. Addressing the limitations of traditional identification methods when the herbs' appearance is damaged or incomplete, this study employs machine learning techniques to achieve accurate classification of medicinal herb types through spectral data preprocessing and feature extraction. Firstly, the Savitzky-Golay convolution smoothing method and Standard Normal Variate (SNV) transformation were used to denoise the spectral data. Then, Principal Component Analysis (PCA) was employed to reduce the dimensionality of the high-dimensional spectral data and extract the key features. Finally, a Gaussian Mixture Model (GMM) was applied to cluster the reduced data, categorizing the medicinal herbs into six classes. The results show that this method produces accurate and stable classifications. The constructed model is applicable not only to the classification and origin identification of medicinal herbs but also serves as an important reference for the classification and origin identification of other plant species.
Introduction
Authentic medicinal materials refer to herbs that have been carefully selected through long-term clinical application in traditional Chinese medicine (TCM). These herbs, grown in specific regions, exhibit superior quality and efficacy compared to the same species produced in other areas, and are known for their stable quality and high reputation. The authenticity of Chinese medicinal materials is primarily determined by their place of origin, which is crucial for assessing the quality of the herbs. Although many medicinal herbs have similar appearances, their chemical composition, properties, toxicity, dosage, pharmacological effects, and functions can vary significantly. In traditional Chinese medicine, the identification of herbs has long relied on methods such as organoleptic identification, origin identification, microscopic identification, and physicochemical identification. However, these traditional methods face difficulties when the herb's appearance is damaged or incomplete. Infrared spectroscopy, by contrast, allows the authenticity of medicinal herbs to be verified accurately and adulterated or easily confused species to be distinguished based on their microscopic characteristics.
This study employs a data-driven approach to classify medicinal herbs based on mid-infrared spectral data. Unlike traditional experience-based analysis methods, this approach leverages large-scale data analysis [1]–[4] and machine learning models [5]–[8] to extract valuable information from the data, achieving accurate classification of the herbs. Using the absorbance of herbs measured at a large number of wavelengths, a machine learning-based classifier was established and the mid-infrared data and spectra were analyzed to study the features of, and differences between, various types of medicinal herbs, enabling effective classification. Because raw spectral data may exhibit baseline shifts and overlaps, the Savitzky-Golay convolution smoothing method and Standard Normal Variate (SNV) transformation were applied to preprocess the spectral data. Principal Component Analysis (PCA) was then employed to reduce the dimensionality of the raw spectral data, extracting 12 principal components. A Gaussian Mixture Model (GMM) was used to cluster the medicinal herbs into six categories, followed by an analysis of the characteristics and differences of each category.
This research demonstrates that the preprocessing, feature extraction and selection algorithms for spectral data can effectively improve the accuracy of spectral analysis. In this study, the combination of Savitzky-Golay preprocessing, PCA for feature extraction, and Gaussian mixture clustering yielded the most stable classification accuracy.
Data Processing and Analysis
In the mid-infrared spectral data of Chinese medicinal materials, the “No” column gives the identification number of each herb, while the first row of the remaining columns lists the wavenumbers. The subsequent rows give the absorbance of each herb at the corresponding wavenumber under spectral irradiation. Because these absorbance values are instrument-corrected, they may contain negative values.
First, spectral preprocessing techniques were employed to denoise and baseline-correct the raw data. Subsequently, Principal Component Analysis (PCA) was utilized to extract representative features from the high-dimensional spectral data. This series of steps is based on a data-driven analytical framework, extracting the most classification-relevant features by analyzing a large amount of spectral data, thereby establishing a mathematical model to study the characteristics and differences of various types of medicinal herbs based on their mid-infrared spectral data.
Data Preprocessing
During the spectral collection of medicinal herbs, various factors such as particle size, high-frequency random noise, sample background, scattered light, instrument response speed, and external environment can interfere with the process. As a result, the obtained spectral data not only contains substantial information about the sample itself but also includes components unrelated to the tested sample, leading to baseline shifts and overlaps in the spectra. These factors can severely affect the stability and accuracy of the established model. Therefore, to mitigate or eliminate these irrelevant non-target factors and obtain a stable, reliable, and accurate calibration model, it is essential to denoise the spectral data. The Savitzky-Golay convolution smoothing method and Standard Normal Variate (SNV) transformation can be employed for spectral data preprocessing.
Data Analysis
Based on the mid-infrared spectral data of several medicinal herbs, this study investigates the characteristics of different types of herbs and the differences among them. After preprocessing the spectral data, Principal Component Analysis (PCA) is applied to reduce the high-dimensional data to an appropriate dimension. Then, based on PCA, features corresponding to the principal components with a cumulative contribution rate exceeding 80% are selected. The Gaussian Mixture Model (GMM) clustering method is used to cluster the selected principal components. Finally, through graphical representation of the spectral data of different types of herbs, the features and differences among the various herbs are analyzed based on kurtosis, peak shape, and peak intensity.
Model Establishment and Analysis
Spectral Denoising
Spectral data were preprocessed using the Savitzky-Golay convolution smoothing method [9], [10] and the Standard Normal Variate (SNV) transformation.

Savitzky-Golay (SG) Convolution Smoothing: The Savitzky-Golay convolution is a polynomial regression algorithm implemented in a local moving window. Each noisy spectral value is replaced by the value of a polynomial fitted, within the moving window, to best approximate the true signal, so that the noisy values are corrected to more reasonable signal values; the procedure is essentially a weighted averaging method. Let the width of the filter window be n (n odd), with measurement points x = −(n − 1)/2, ..., −1, 0, 1, ..., (n − 1)/2. A quadratic polynomial is used to fit the data within the window:

y = a0 + a1x + a2x²  (1)

where a0, a1, and a2 are the coefficients of the quadratic polynomial and y is the fitted value. Writing (1) at each of the n measurement points yields n similar equations, whose least-squares solution reduces to a system of three linear equations in a0, a1, and a2.

Standard Normal Variate (SNV) Transformation: The SNV transformation is mainly used to eliminate the effects of particle size, surface scattering, and path length variations on diffuse reflectance spectra. The transformed spectrum is calculated using the following formula:

xi,k,SNV = (xi,k − x̄i) / sqrt( Σk (xi,k − x̄i)² / (m − 1) )  (2)
In (2), xi,k is the absorbance of the i-th sample at the k-th wavelength point; x̄i is the average value of the i-th sample spectrum; k = 1, 2, ..., m, where m is the number of wavelength points; i = 1, 2, ..., n, where n is the number of calibration samples; and xi,SNV is the transformed spectrum (see Table I for symbol definitions).
| Symbols | Definition |
|---|---|
| x̄i | Average value of the i-th sample spectrum |
| m | Number of wavelength points |
| n | Number of calibration samples |
| λm | Eigenvalue |
| am | Eigenvector |
The denoising algorithm used in this process is the SG smoothing method, which helps obtain approximately denoised spectral data of traditional Chinese medicine, making the identification model more accurate in expressing the spectral characteristics of herbal medicine.
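The SG-then-SNV pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: `scipy.signal.savgol_filter` is an assumed implementation of Eq. (1), the 11-point window width is a hypothetical choice (the paper specifies only the quadratic polynomial order), and the toy matrix stands in for the herb spectra.

```python
import numpy as np
from scipy.signal import savgol_filter

def preprocess(spectra, window=11, polyorder=2):
    """Savitzky-Golay smoothing followed by SNV; each row is one spectrum."""
    # SG: least-squares quadratic fit in a moving window (Eq. 1)
    smoothed = savgol_filter(spectra, window_length=window,
                             polyorder=polyorder, axis=1)
    # SNV: centre each spectrum on its own mean and scale by its own
    # standard deviation (Eq. 2)
    mean = smoothed.mean(axis=1, keepdims=True)
    std = smoothed.std(axis=1, ddof=1, keepdims=True)
    return (smoothed - mean) / std

# Toy stand-in for the herb data: 3 samples x 100 wavenumber points.
rng = np.random.default_rng(0)
X = np.sin(np.linspace(0, 6, 100)) + 0.05 * rng.standard_normal((3, 100))
Xp = preprocess(X)
print(Xp.shape)  # (3, 100); every row now has zero mean and unit variance
```

Applying SNV after smoothing, as here, keeps the high-frequency noise out of the per-spectrum mean and standard deviation used for scaling.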
The spectral data of traditional Chinese medicine were processed using the spectral denoising model, with the results shown in Fig. 1. The absorbance of several raw spectra is shown in Fig. 1a, and the effect of SG smoothing is illustrated in Fig. 1c. Comparing the figures, the raw spectra become noticeably smoother after SG smoothing. The spectra after SNV preprocessing retain the overall profile of the original data, and when the absorbance axis is magnified, the differences between samples become more visually apparent. As can be seen from Fig. 1, the spectra of the traditional Chinese medicine samples overlap considerably.
After denoising the spectra, due to the large amount of data, we need to use Principal Component Analysis (PCA) to select feature values for dimensionality reduction.
Feature Selection Based on PCA
Because the spectral dataset is large, the original spectral data, which have thousands of dimensions, must be replaced by a small number of principal components. We therefore first use Principal Component Analysis (PCA) to reduce the dimensionality and select feature values for the medicinal herbs, and then cluster the selected feature values using the Gaussian Mixture Model (GMM).
Principal Component Analysis [11]–[14] is one of the most common feature extraction methods for high-dimensional data. PCA is a multivariate statistical method that replaces a large set of correlated variables in the high-dimensional sample space with a small number of important new variables obtained as linear combinations of the original variables.
The matrix of original samples can be expressed as X = (X1, X2, ..., Xp), where p is the dimensionality of the original spectral data.
To make the spectral dataset more suitable for PCA, the data are first normalized. The covariance matrix R of the standardized samples is then calculated, along with its eigenvalues λm and eigenvectors am.
The first through m-th (m ≤ p) principal components, corresponding to the eigenvalues whose cumulative contribution reaches 80%, are computed as:

Fi = a1iX1 + a2iX2 + ... + apiXp, i = 1, 2, ..., m
where a1i, a2i, ..., api are the components of the eigenvectors of the covariance matrix R of X, and X1, X2, ..., Xp are the standardized values of the original variables. The larger the absolute value of an eigenvector component, the more of the original information or energy it carries and the greater its influence on that principal component, indicating higher importance.
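The standardize, covariance, eigendecomposition, and 80%-rule steps above can be sketched directly in NumPy. This is an illustrative reconstruction, not the paper's code; the function name `pca_80` and the toy data (nine nearly identical variables plus one independent one, so the first component alone clears the threshold) are assumptions.

```python
import numpy as np

def pca_80(X, threshold=0.80):
    """Keep the leading principal components whose cumulative
    contribution rate first reaches `threshold`."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardization
    R = np.cov(Z, rowvar=False)                       # covariance matrix R
    eigvals, eigvecs = np.linalg.eigh(R)              # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                 # re-sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = eigvals / eigvals.sum()                   # contribution rates
    k = int(np.searchsorted(np.cumsum(ratio), threshold) + 1)
    scores = Z @ eigvecs[:, :k]                       # Fi = a1i X1 + ... + api Xp
    return scores, ratio[:k]

# Toy data: nine nearly identical variables plus one independent variable.
rng = np.random.default_rng(1)
base = rng.standard_normal((50, 1))
X = np.hstack([base + 0.01 * rng.standard_normal((50, 9)),
               rng.standard_normal((50, 1))])
scores, ratio = pca_80(X)
print(len(ratio))  # 1
```

Because the data are standardized first, R is in effect a correlation matrix, which keeps high-absorbance bands from dominating the components purely by scale.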
Since there is no strong evidence that any specific spectral band has superior discriminative power, principal component analysis is performed on the entire spectral range of 652 cm−1 to 3999 cm−1 to obtain the most representative spectral information. The results, sorted by the contribution of the principal components, are shown in Fig. 2.
Observing Fig. 2, we find that after dimensionality reduction, Principal Component 1 of the original data accounts for 82.37% of the information, while the first three principal components together account for 98.67%. After SNV transformation, the first three principal components account for 84.38% of the information; for the SG-smoothed data, the first three principal components account for 98.67%. From the principal component analysis of each processed spectral dataset, the principal components corresponding to the eigenvalues whose cumulative contribution rate exceeds 80% are retained; the number of retained principal components is shown in Table II.
| | Original data | SNV transformation | SG smoothing |
|---|---|---|---|
| Principal components | 3 | 4 | 3 |
| Contribution rate | 98.67% | 90.47% | 98.67% |
Classification Based on Gaussian Mixture Clustering
The idea of the Gaussian Mixture Model (GMM) algorithm [15], [16] is to view the dataset as a mixture model composed of multiple Gaussian distributions. First, when performing clustering, a parameter k needs to be pre-specified, which is the desired number of classes (clusters) to divide the dataset into. Next, the algorithm randomly initializes the parameters of k Gaussian distributions, including the mean vector, covariance matrix, and mixture weights for each distribution. The setting of these initial parameter values significantly affects the final result of the algorithm.
After initialization, the algorithm enters an iterative optimization process, typically implemented with the Expectation-Maximization (EM) algorithm. In each iteration, the expectation step (E-step) uses the current Gaussian distribution parameters to estimate the probability that each data object belongs to each of the k Gaussian distributions. The maximization step (M-step) then uses these probabilities to recalculate and update the parameters of each Gaussian distribution, including the mean, covariance matrix, and mixture weights.
Through the above process, data objects are classified into the cluster with the highest probability of belonging. As the iterations proceed, the parameters of each cluster gradually stabilize. Objects within the same cluster have a higher similarity because they are closer to the probability density center (i.e., the center of the Gaussian distribution) in the parameter space, while objects in different clusters have lower similarity due to their association with different Gaussian distributions. Ultimately, the clustering result is a model defined by k Gaussian distribution parameters, where each distribution represents a cluster, with its center and shape described by its mean and covariance. In this way, the similarity or dissimilarity between different clusters can be quantitatively described through these parameters. The process of the Gaussian mixture algorithm is illustrated in Fig. 3.
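The E-step/M-step loop described above can be written out directly. The sketch below is a bare-bones NumPy illustration (the function name, toy data, fixed iteration count, and explicit initial means are all assumptions); production implementations add smarter seeding and convergence checks.

```python
import numpy as np

def em_gmm(X, k, mu0, n_iter=100):
    """Minimal EM for a Gaussian mixture with full covariances.
    `mu0` supplies the initial means (k x d)."""
    n, d = X.shape
    mu = mu0.astype(float).copy()
    cov = np.stack([np.eye(d)] * k)          # initial covariances
    w = np.full(k, 1.0 / k)                  # initial mixture weights
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = np.empty((n, k))
        for j in range(k):
            diff = X - mu[j]
            inv = np.linalg.inv(cov[j])
            quad = np.einsum('ni,ij,nj->n', diff, inv, diff)
            norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov[j]))
            dens[:, j] = w[j] * np.exp(-0.5 * quad) / norm
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights, means, covariances from responsibilities
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - mu[j]
            cov[j] = (r[:, j, None] * diff).T @ diff / nk[j]
            cov[j] += 1e-6 * np.eye(d)       # tiny ridge keeps cov invertible
    return r.argmax(axis=1)                  # hard assignment per object

# Two well-separated 2-D blobs; seed one initial mean in each region.
rng = np.random.default_rng(1)
X = np.vstack([rng.standard_normal((60, 2)),
               rng.standard_normal((60, 2)) + 8.0])
labels = em_gmm(X, k=2, mu0=X[[0, 60]])
```

The final `argmax` over responsibilities is exactly the "assign each object to the cluster with the highest probability of belonging" step in the text.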
The robustness of the Gaussian Mixture Model implementation rests on the following three points: (1) the model is run multiple times under different initial conditions to avoid local optima; (2) K-means++ is used to initialize the cluster centers of the Gaussian mixture clustering model, ensuring that the model starts from a better point and thereby improving the clustering result; this avoids the “random initialization problem” of traditional K-means and reduces the risk of falling into local optima; and (3) the covariance matrix is regularized, which increases its stability and prevents it from becoming singular during computation.
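These three safeguards map directly onto the parameters of an off-the-shelf GMM. Below is a hedged sketch using scikit-learn (an assumed toolchain; the paper does not name its software), with toy blobs standing in for the PCA score matrix:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Six well-separated 3-D blobs standing in for the PCA scores.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0], [10, 0, 0], [0, 10, 0],
                    [0, 0, 10], [10, 10, 0], [10, 0, 10]], dtype=float)
scores = np.vstack([c + 0.3 * rng.standard_normal((40, 3)) for c in centers])

gmm = GaussianMixture(
    n_components=6,        # six herb categories
    n_init=10,             # several restarts to escape poor local optima
    init_params="kmeans",  # sklearn's KMeans seeds with k-means++ by default
    reg_covar=1e-6,        # small ridge keeps covariance matrices non-singular
    random_state=0,
)
labels = gmm.fit_predict(scores)
print(len(np.unique(labels)))  # 6
```

`n_init`, the K-means-based initialization, and `reg_covar` correspond one-to-one to points (1), (2), and (3) above.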
After PCA dimensionality reduction and Gaussian mixture clustering of the original spectra and the SG-smoothed spectra, the distribution of the different types of Chinese herbal medicines exhibits a certain degree of separability. Because the optimal number of cluster centers cannot be known in advance, clustering tests were run with the number of cluster centers set to 4, 6, and 8, and the results were visualized, as shown in Fig. 4. The figure shows that clustering with 6 centers performs better than with 4 or 8. We therefore set the number of cluster centers to 6 and divided the Chinese herbal medicines into six categories.
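The choice among 4, 6, and 8 centers was made here by visual inspection of Fig. 4. A common quantitative complement, not used in the paper, is to compare the Bayesian Information Criterion (BIC) of models fitted with each candidate count; the data below are a hypothetical stand-in for the PCA scores.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical 2-D scores with six well-separated groups.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [8, 0], [0, 8], [8, 8], [4, 14], [-6, 4]], float)
scores = np.vstack([c + 0.4 * rng.standard_normal((50, 2)) for c in centers])

# Fit candidate models and compare BIC (lower is better).
bic = {k: GaussianMixture(n_components=k, n_init=5, random_state=0)
          .fit(scores).bic(scores)
       for k in (4, 6, 8)}
best_k = min(bic, key=bic.get)
print(best_k)  # 6
```

BIC penalizes the extra parameters of larger k, so an 8-component fit of six true groups scores worse despite its higher likelihood, agreeing with the visual choice.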
Fig. 5 compares the clustering features of the original data and the first-order derivative of the SG-smoothed spectra. In the clustering feature plot, the overall absorbance of the fourth class of Chinese herbal medicines is significantly higher than that of the other classes, making its features the most prominent. For the sixth class, after the highest absorption peak appears at around 1600 cm−1, the curve levels off. Classes two, three, and five follow roughly similar trends but differ somewhat in peak position, peak width, and absorbance magnitude, which may be because these types of Chinese herbal medicines have similar growing environments.
Except for the fourth and sixth classes, the remaining classes of herbal medicines share the following characteristics (Table III): the highest absorption peak appears at a wavenumber of about 1100 cm−1; beyond 1700 cm−1, the absorbance increases ever more slowly with wavenumber, without significant “peak-valley” changes; an absorption valley forms at about 2800 cm−1; and a weak reflection peak appears at about 2900 cm−1.
| | Category 1 | Category 2 | Category 3 | Category 4 | Category 5 | Category 6 |
|---|---|---|---|---|---|---|
| Total peaks | 1–2 | 1 | 2–3 | 1 | 2–3 | 1–2 |
| Highest peak (cm−1) | 1700–2000 | 1100–1300 | 1100–1300 | 652–700 | 1100–1300 | 1200–1400 |
| Peak intensity | Moderately strong | Moderately weak | Moderately strong | Moderately weak | Strong | Weak |
| Peak shape | Sharp | Blunt | Sharp | Blunt | Sharp | Blunt |
These common features provide a basis for distinguishing these classes of herbal medicines from the fourth and sixth classes while also highlighting the similarities among them. This information could be valuable for further analysis and classification of Chinese herbal medicines based on their spectral characteristics.
Conclusion
This study successfully classified Chinese herbal medicines by analyzing their mid-infrared spectral data using a data-driven classification approach and constructing machine learning-based classification models. Firstly, Savitzky-Golay convolution smoothing and the Standard Normal Variate (SNV) transformation were applied to preprocess the spectral data, effectively reducing noise and baseline shifts. Subsequently, Principal Component Analysis (PCA) was used to reduce the dimensionality of the high-dimensional spectral data, extracting representative principal components. Finally, a Gaussian Mixture Model (GMM) was employed to cluster the herbal medicines into six categories. The results demonstrate that combining preprocessed spectral data with PCA and GMM clustering can significantly improve the accuracy and stability of Chinese herbal medicine classification.
Cluster analysis revealed that setting six clusters yielded the best classification results. Different types of herbal medicines exhibited significant differences in their spectral profiles, with the fourth and sixth classes showing particularly distinct characteristics. This research provides a scientific basis for quality control and origin identification of Chinese herbal medicines, proving the effectiveness and feasibility of data-driven methods in herbal classification.
However, some limitations exist in the spectral feature extraction process. For example, more sophisticated feature extraction algorithms, such as manifold learning and compressed sensing techniques, could be used to further enhance the model’s performance and applicability. Furthermore, the machine learning-based model for categorizing and identifying the origin of Chinese herbal medicines established in this study is not only applicable to herbal medicines but also has important reference value for the category and origin identification of other plants, demonstrating strong potential for broader application.
References
1. Fan J, Han F, Liu H. Challenges of big data analysis. Natl Sci Rev. 2014;1(2):293–314.
2. Hosen MZ. Effect of Ramadan on purchasing behavior: a panel data analysis. Int Rev Econ. 2024;71:325–41.
3. Hughes G, Dobbins C. The utilization of data analysis techniques in predicting student performance in massive open online courses (MOOCs). Res Pract Technol Enhanc Learn. 2015;10:10.
4. Zamawe FC. The implication of using NVivo software in qualitative data analysis: evidence-based reflections. Malawi Med J. 2015;27(1):13–5.
5. Mahesh B. Machine learning algorithms—A review. Int J Sci Res (IJSR). 2020;9(1):381–6.
6. Alzubi J, Nayyar A, Kumar A. Machine learning from theory to algorithms: an overview. J Phys Conf Ser. 2018;1142:012012.
7. Mehyadin AE, Abdulazeez AM, Hasan DA, Saeed JN. Birds sound classification based on machine learning algorithms. Asian J Res Comput Sci. 2021;9(4):1–11.
8. Sushma P. An automated method based on machine learning and link analysis. Glob Sci-Tech. 2018;10(3):153–60.
9. Massaoudi M, Refaat SS, Abu-Rub H, Chihi I, Oueslati FS. PLS-CNN-BiLSTM: an end-to-end algorithm-based Savitzky-Golay smoothing and evolution strategy for load forecasting. Energies. 2020;13(20):1–29.
10. Rajagopalan S, Robb R. Image smoothing with Savitzky-Golay filters. Proceedings of Medical Imaging 2003: Visualization, Image-Guided Procedures, and Display; 2003. pp. 1–9.
11. Abdi H, Williams LJ. Principal component analysis. Wiley Interdiscip Rev. 2010;2(4):433–59.
12. Yang J, Zhang D, Frangi AF, Yang J-Y. Two-dimensional PCA: a new approach to appearance-based face representation and recognition. IEEE Trans Pattern Anal Mach Intell. 2004;26(1):131–7.
13. Gedon D, Ribeiro AH, Wahlström N, Schön TB. Invertible kernel PCA with random Fourier features. IEEE Signal Process Lett. 2023;30:563–7.
14. Martinez AM, Kak AC. PCA versus LDA. IEEE Trans Pattern Anal Mach Intell. 2001;23(2):228–33.
15. Liu J, Cai D, He X. Gaussian mixture model with local consistency. Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence; 2010. pp. 512–7.
16. Wang Y, Chen W, Zhang J, Dong T, Shan G, Chi X. Efficient volume exploration using the Gaussian mixture model. IEEE Trans Vis Comput Graph. 2011;17(11):1560–73.