Table 2 Data processing, advantages, limitations and suitable application of classification methods

From: Recent application of Raman spectroscopy in tumor diagnosis: from conventional methods to artificial intelligence fusion

Methods

Data processing

Advantages

Limitations

Suitable application

Peak intensity analysis

Peak intensity: I_normal vs I_cancer

Peak intensity ratio (R = I_12xx/I_16xx): R_normal vs R_cancer

Straightforward

Simple

Low accuracy

Accuracy decreases as data size increases

Obvious characteristic peaks and large differences

Small-scale sample data (10 ~ 100)

Multivariate statistical analysis

-

High interpretability

Easy to implement

Limited accuracy on larger datasets

Medium-scale sample data (100 ~ 1000)

PCA

Reduces the original Raman spectra to PCs while preserving the features that contribute most to the difference in the Raman spectra

Reduces data dimensionality to PCs

Retains important data information

Removes background noise

Relatively low classification accuracy

Unsupervised method

Exploratory study

Data-reduction algorithm

PLS

Builds a regression model between the independent and dependent variables of the Raman spectra

Reduces data dimensionality to key factors

Better selects characteristic variables

Relatively low classification accuracy

Supervised method

Data-reduction algorithm

KCA

Clusters Raman spectra by repeatedly assigning each spectrum to its nearest seed and updating each seed to the mean of its assigned spectra

Simple algorithm principle

Fast processing speed

K value is difficult to determine

May converge to a local optimum rather than the global optimum

Unsupervised clustering technique

Exploratory study

Samples with large differences between groups

LDA

Projects the Raman spectra into the vector space with the maximum between-class distance and the minimum within-class distance

Most common classification method

High accuracy

Overfits when data are insufficient

Powerful supervised technique for classification

Integrates with PCA method

QDA

Estimates a separate covariance matrix for each class of Raman spectra

Variant of LDA

High accuracy

Cannot be used for dimensionality reduction

Supervised technique for classification

Sample analysis

GA

Feature extraction of Raman spectra, as a stage prior to classification

Feature selection

Strong robustness

Low computation speed

Complex programming process

General optimization technique

Feature extraction of data

Classical machine learning

-

Higher accuracy

Easy to implement

Poor training efficiency on large-scale data

Large-scale sample data (1000 ~ 10,000)

SVM

Seeks to determine the optimal hyperplane that maximizes the distance between the hyperplane and the nearest Raman spectra data sample in a high-dimensional space

Less prone to overfitting

Avoids local optima and the “curse of dimensionality”

Poor training efficiency when processing large-scale data

Nonlinear, multi-dimensional problems

Small sample learning problems

BT

Changes the weight of Raman spectra data, learns multiple classifiers, and combines these classifiers linearly to improve the performance of classification

Ensemble learning method

No need to do feature normalization

Sensitive to abnormal data

Prone to overfitting

Low dimensional data

Trees of limited depth

RF

Uses multiple trees to train and predict Raman spectra data

Ensemble learning method

Low risk of overfitting

Relatively slow learning speed

Limited samples

KNN

Uses proximity of a single Raman spectral data point to classify or predict groupings

High precision

Insensitive to outliers

Relatively high time complexity

High space complexity

Small-size samples

Low-dimensional data

Deep learning

-

Higher accuracy

Good portability

Large amounts of computation

Complex model design

Larger-scale sample data (1000+, 10,000+, …)

CNN

Takes Raman spectra or Raman figures as input data; Raman figures are preferred

Extracts features from input data directly and classifies the observed objects

Directly extracts features from input data

Classifies the observed objects

Simple architecture

Ease of use

Depends on quality and features of the data

Most of the modeling tasks (classification and regression)

RNN

Raman spectra as input data

Mines wavenumber and intensity information in the Raman spectra data

Strong ability to learn nonlinear time-series behavior

Stores more long-term sequence information

Mines temporal and semantic information in the data

Risk of exploding and vanishing gradients

Sequence data

Nonlinear time-series behavior

Classification and prediction

  1. BT Boosted tree, CNN Convolutional neural network, GA Genetic algorithm, KCA k-means cluster analysis, KNN k-nearest neighbors, LDA Linear discriminant analysis, PCA Principal component analysis, PLS Partial least squares, QDA Quadratic discriminant analysis, RF Random forest, RNN Recurrent neural network, SVM Support vector machine
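
The first row of the table, peak intensity analysis, compares the ratio of two band intensities between normal and cancer spectra. The following is a minimal sketch of that idea, not the method of any study the table summarizes. Several things here are assumptions made for illustration: the band windows (the table only gives the placeholders 12xx and 16xx), the toy spectra, the helper names `band_peak`, `peak_ratio` and `classify`, and the midpoint threshold rule.

```python
from statistics import mean

# Hypothetical band windows in cm^-1. The table only says "12xx" and "16xx",
# so these limits are illustrative assumptions, not values from the source.
BAND_12XX = (1200.0, 1300.0)
BAND_16XX = (1600.0, 1700.0)


def band_peak(wavenumbers, intensities, band):
    """Return the maximum intensity inside the wavenumber window `band`."""
    lo, hi = band
    window = [i for w, i in zip(wavenumbers, intensities) if lo <= w <= hi]
    if not window:
        raise ValueError(f"no data points between {lo} and {hi} cm^-1")
    return max(window)


def peak_ratio(wavenumbers, intensities):
    """Peak intensity ratio R = I(12xx band) / I(16xx band) for one spectrum."""
    return (band_peak(wavenumbers, intensities, BAND_12XX)
            / band_peak(wavenumbers, intensities, BAND_16XX))


# Toy data invented for this example (not measured spectra).
wavenumbers = [1150.0, 1250.0, 1350.0, 1450.0, 1550.0, 1650.0, 1750.0]
normal_spectra = [
    [0.10, 0.80, 0.20, 0.30, 0.20, 1.00, 0.10],
    [0.12, 0.84, 0.22, 0.31, 0.18, 1.05, 0.11],
]
cancer_spectra = [
    [0.10, 0.50, 0.20, 0.30, 0.20, 1.00, 0.10],
    [0.11, 0.55, 0.21, 0.29, 0.22, 1.10, 0.12],
]

# Ratio per spectrum, then one mean ratio per group.
r_normal = [peak_ratio(wavenumbers, s) for s in normal_spectra]
r_cancer = [peak_ratio(wavenumbers, s) for s in cancer_spectra]

# Assumed decision rule: threshold halfway between the two group means.
threshold = (mean(r_normal) + mean(r_cancer)) / 2
normal_is_higher = mean(r_normal) > mean(r_cancer)


def classify(intensities):
    """Label a spectrum by which side of the threshold its ratio falls on."""
    r = peak_ratio(wavenumbers, intensities)
    return "normal" if (r > threshold) == normal_is_higher else "cancer"


print("mean R (normal):", mean(r_normal))
print("mean R (cancer):", mean(r_cancer))
print("new spectrum:", classify([0.10, 0.78, 0.20, 0.30, 0.20, 1.00, 0.10]))
```

A single threshold on one ratio is consistent with the table's assessment of this method: it is straightforward, but it only works when the characteristic peaks differ clearly between groups, which is why the table recommends it for small sample sets.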