Abstract
Cognitive load theory suggests that overloading working memory may negatively affect human performance in cognitively demanding tasks. Evaluating cognitive load is difficult; it is often assessed through expert feedback and evaluation. Cognitive load classification based on Functional Near-InfraRed Spectroscopy (fNIRS) has become a key research area in recent years owing to the technique's resistance to artefacts, cost-effectiveness, and portability. To make fNIRS more practical in various applications, it is necessary to develop robust algorithms that can automatically classify fNIRS signals with less reliance on hand-crafted features. Many analytical tools used in the cognitive sciences have adopted Deep Learning (DL) modalities to uncover information relevant to mental workload classification. This review investigates research questions on the design and overall effectiveness of DL as well as its key characteristics. We identified 45 studies published between 2011 and 2023 that specifically proposed Machine Learning (ML) models for classifying cognitive load using data obtained from fNIRS devices. These studies were analyzed in terms of feature selection methods, model inputs, and DL architectures. Most existing cognitive load studies are based on ML algorithms that follow signal filtering and hand-crafted feature extraction. It is observed that hybrid DL architectures integrating convolutional and LSTM operators performed significantly better than other models. However, DL models, especially hybrid models, have not been extensively investigated for the classification of cognitive load captured by fNIRS devices. The current trends and challenges are highlighted to provide directions for the development of DL models pertaining to fNIRS research.
1. Introduction
Cognitive load theory (CLT) has been considered one of the most important learning theories in the fields of experimental psychology (Kirschner, Ayres, & Chandler, 2011), educational psychology (Sweller, 2016), developmental psychology (Sepp, Howard, Tindall-Ford, Agostinho, & Paas, 2019), and medical education (Skulmowski & Xu, 2021). CLT posits that the capacity of the human mind is limited when dealing with novel information (Castro-Alonso et al., 2021, Curum and Khedo, 2021). The theory derives instructional implications and learning procedures from the structure of human cognition. Generally, this cognitive architecture assumes that all novel information is initially processed by working memory, which has limited capacity and duration. The information is then stored in long-term memory, which has no practical capacity limit. These working memory limitations are greatly reduced, however, when information is retrieved from previously organized long-term memory (Buchner, Buntins, & Kerres, 2021). The extent to which mental workload degrades performance depends on a person's experience in a particular domain. An increase in cognitive load compromises performance through a decline in motivation and increases in reaction time, fatigue, and error rates. Modern research in the behavioral sciences emphasizes that the influence of cognitive load must be considered during teaching and learning so that effective knowledge acquisition can take place (Heitmann et al., 2022, Tugtekin and Odabasi, 2022).
Measurement of cognitive load plays a vital role in enhancing skills across a variety of tasks, e.g. in aviation (Wilson, Nair, Scielzo, & Larson, 2021; R. Zhu, Wang, Ma, & You, 2022), semi-autonomous cars (H. Zhang et al., 2022, Zhang et al., 2022), defense training (Buckley et al., 2022), aerospace (Magnusdottir, Johannsdottir, Majumdar, & Gudnason, 2022), e-learning (R. Liu et al., 2022, Liu et al., 2022), virtual reality-based trainers (Zhao et al., 2022), and assembly operations (Fournier et al., 2022). In the last few decades, several non-invasive modalities have been exploited to measure cognitive load by acquiring signals from the human body. Changes in cognitive load can be detected via various physiological parameters, e.g. the Electroencephalogram (EEG) (Farkish, Bosaghzadeh, Amiri, & Ebrahimpour, 2022), Electrocardiogram (ECG) (Lagomarsino, Lorenzini, De Momi, & Ajoudani, 2022), eye tracking (Yan et al., 2022), Functional Near-InfraRed Spectroscopy (fNIRS) (Agbangla, Audiffren, Pylouster, & Albinet, 2022), skin conductance level (Saha, Jindal, Shakti, Tewary, & Sardana, 2022), and Positron Emission Tomography (PET) (Canário, Jorge, Martins, Santana, & Castelo-Branco, 2022). Each physiological parameter reflects a different biological process. However, bulkiness, high cost, and sensitivity to disturbances limit the usefulness of these devices in ubiquitous computing. As an example, while eye tracking is widely used and unobtrusive, it provides only an indirect measure of brain activity (Anderson et al., 2011). Neuroimaging studies based on functional Magnetic Resonance Imaging (fMRI) and PET have generated insights into pathological changes in blood oxygenation and metabolic functions (Catana, Drzezga, Heiss, & Rosen, 2012). Besides being expensive, fMRI and PET require a subject to be immobilized in a tightly restrained environment (Fujikawa et al., 2022, Harauzov et al., 2022). In addition, both modalities expose the subject to hazardous materials and loud noise. EEG electrodes are prone to internal and external artifacts, such as heartbeat, movement, and other electromagnetic interference, and these disturbances make it challenging to differentiate signal from noise (H. Wang et al., 2022, Wang et al., 2022). Skin temperature, eye tracking, and skin conductance level are also widely used as non-intrusive measures of workload; however, findings suggest an insignificant correlation between sensor data and subjective workload measures (Cosme et al., 2022, Žagar et al., 2022).
fNIRS has the potential to overcome the above-mentioned issues and is useful and usable in a wide range of applications (Klein, Debener, Witt, & Kranczioch, 2022). As a powerful and non-invasive modality, fNIRS functions as a safe tool for investigating hemodynamic responses in superficial cortical regions. fNIRS uses optical fiber-based light sources that emit near-infrared light within a spectral window of 600 to 1000 nm, together with detectors that record changes in optical density (Li et al., 2022). Changes in neural activity result in changes in blood oxygenation levels. Based on the principles of the modified Beer-Lambert law (Baker et al., 2014), fNIRS measures cognitive load by monitoring concentration variations of oxygenated hemoglobin (HbO2) and deoxygenated hemoglobin (dHb) in the cortical microcirculation, as shown in Fig. 1. The main advantages of fNIRS include high spatial resolution, safety, tolerance to movement, portability, and the ability to be integrated with EEG, PET, or ECG (Krampe, 2022; Y. Liu et al., 2022, Liu et al., 2022).
Fig. 1. Multi-channel data acquisition for generating cortical activation maps.
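To make the modified Beer-Lambert conversion described above concrete, the following minimal Python sketch recovers HbO2 and dHb concentration changes from optical-density changes at two wavelengths. The extinction coefficients, source-detector distance, and differential pathlength factor (DPF) used here are illustrative placeholders rather than calibrated constants from any specific device.

```python
import numpy as np

# Illustrative extinction coefficients for [HbO2, dHb] at two wavelengths
# (placeholder values; real analyses use tabulated coefficients per wavelength).
EXT = np.array([[1.49, 3.84],   # ~760 nm
                [2.53, 1.80]])  # ~850 nm

def mbll(delta_od, distance_cm=3.0, dpf=6.0):
    """Modified Beer-Lambert law: convert optical-density changes at two
    wavelengths (shape: 2 x n_samples) into HbO2 and dHb concentration changes."""
    effective_path = distance_cm * dpf                     # mean photon path length
    conc = np.linalg.solve(EXT, delta_od / effective_path)
    return conc[0], conc[1]                                # (delta_HbO2, delta_dHb)

# Ten seconds of synthetic dual-wavelength ΔOD sampled at 10 Hz for one channel.
rng = np.random.default_rng(0)
delta_hbo2, delta_dhb = mbll(0.01 * rng.standard_normal((2, 100)))
```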
Although fMRI provides high-resolution and in-depth information on blood oxygenation levels, the far less expensive fNIRS targets cortical regions of interest. fNIRS is also tolerant of motion artefacts, which makes it a better candidate for detecting brain activity in cognitive load-related tasks (Zhuang et al., 2022). For these reasons, our review focuses only on fNIRS-based data collection campaigns that capture hemodynamic changes in the prefrontal cortex using off-the-shelf equipment. fNIRS signals are inherently complex, non-linear, and high-dimensional, which makes it difficult to identify abnormalities by visual inspection. These properties make fNIRS data well suited to analysis using Deep Learning (DL) and Machine Learning (ML) models.
DL/ML models can learn features hierarchically through complex mapping functions applied directly to data. They are the leading Artificial Intelligence (AI) tools in several domains, such as image processing (Suganyadevi, Seethalakshmi, & Balasamy, 2022), pattern recognition (Bai et al., 2021), image segmentation (Picon et al., 2022), speech analysis (Bhangale & Kothandaraman, 2022) and physiological data processing (Patlar Akbulut, 2022). Signals recorded from fNIRS devices usually contain mixed artifacts and noise. Traditional approaches require decomposing fNIRS signals through frequency or wavelet transforms for noise removal. DL models, specifically Artificial Neural Networks (ANNs) and Convolutional Neural Networks (CNNs), often require minimal pre-processing effort because they generate machine-learned features for classification and pattern recognition (Wani et al., 2022). The success of AI across various engineering fields promises the development of model-free approaches with robust performance. In this review, we therefore focus on the implementation, validation, and development of wearable fNIRS sensors for logging and tracking cognitive load during memory-demanding tasks.
Although several reviews on the assessment of cognitive load using physiological sensors exist, to the best of our knowledge, no research paper covers in depth the applications of DL/ML models for analyzing fNIRS-based cognitive load. Previous survey and review articles within the research domain of cognitive load and physiological signals are thoroughly discussed in Section 2 of this review. These papers have predominantly focused on conventional ML, DL, and statistical techniques, placing particular emphasis on handcrafted feature engineering methods for the analysis of fNIRS data. The focus of these articles has been on applications related to neurological disorders, stress, and emotional responses utilizing fNIRS technology. It is noteworthy that different cognitive tasks elicit specific cortical activations in various brain regions, necessitating customized hyperparameters for ML and DL algorithms tailored to each specific task. While existing reviews cover a broad spectrum of AI applications in fNIRS data analysis, there remains a challenge in highlighting and comprehending advancements specifically in ML and DL techniques for analyzing cognitive load data obtained from fNIRS measurements. Recognizing this research gap, we survey the cognitive load and fNIRS AI literature with the explicit goal of highlighting progress made in employing ML/DL methods for cognitive load recognition.
The main contributions of this review are as follows:
Over the past few years, numerous researchers have undertaken reviews and surveys on cognitive load, with the aim of understanding current trends in monitoring cognitive load. The findings of these reviews have highlighted the complex nature of cognitive load assessment, revealing that it can be evaluated through various means, including both subjective and physiological measures. While subjective measures, such as questionnaires, have traditionally been a common means of gathering insights into cognitive load, the meta-analysis conducted by R. A. Block et al. (Block, Hancock, & Zakay, 2010) indicated certain limitations associated with this approach. Their analysis, encompassing data from 117 experiments, revealed that relying solely on subjective measures can introduce biases and be influenced by individual differences in cognitive ability. Most reviews within this field consistently highlight the significance of employing physiological measures to gain valuable insights into cognitive performance during task execution. These measures include, but are not limited to, ECG, EEG, eye tracking, fNIRS, and skin conductance level. They provide a direct and objective means of assessing the intricate aspects of cognitive function associated with task performance.
The development of deep learning techniques has had a significant impact on the direction of neurology research. The current popularity of deep architectures brings the need to review and analyze existing studies on deep learning in the physiological signal domain. Several studies have been conducted to discuss and investigate the role of DL models in analyzing physiological data. For instance, Y. Roy et al. (Roy et al., 2019) emphasize the role of EEG in clinical applications such as sleep disorder diagnosis, epilepsy monitoring, and brain–computer interfacing. They highlight the increasing adoption of DL to address challenges like automating time-consuming tasks and improving generalization across subjects. The review identifies major trends, including DL's prevalence in EEG classification across various domains. Notably, studies varied widely in data quantity, architecture choices, and the use of raw EEG data. The review suggests a need for targeted investigations into optimal data amounts for DL in EEG processing. Recommendations are provided to enhance result reproducibility, including clear architecture and data descriptions, use of existing datasets, and code sharing. E. Banuelos-Lozoya et al. (Banuelos-Lozoya, Gonzalez-Serna, Gonzalez-Franco, Fragoso-Diaz, & Castro-Sanchez, 2021) highlight research in the context of Quality of Experience/User Experience (QoE/UX) evaluation, focusing on recognizing cognitive states from various physiological data sources. The study found that while cognitive states such as mental workload, stress, and attention have been analyzed, there is still a need to understand their relationship with specific elements that contribute to the overall user experience. The main findings emphasized general physiological and behavioral responses to stimuli rather than individual components of interfaces or interactions. Y. Zhou et al. (Y. Zhou et al., 2021, Zhou et al., 2021) provide a comprehensive review of EEG-based cognitive workload recognition using machine learning. The article covers the steps of classical machine learning, including data acquisition, preprocessing, feature extraction and selection, classification, and evaluation. Additionally, it explores widely used deep learning models for workload recognition. The review by Adil et al. (Saleem et al., 2023) centers on driver drowsiness detection and emphasizes the complexity of driving, where reduced cognitive performance due to drowsiness can lead to accidents. The study reviews recent techniques and technologies for detecting driver drowsiness, emphasizing the use of physiological signals, particularly EEG and ECG sensors, along with GSR and thermal cameras. This review identifies challenges such as the lack of customized deep learning architectures, limited multimodal approaches due to complexity and real-time constraints, and difficulties in comparing performance across heterogeneous hardware sensors. The authors suggest the need for novel solutions, including IoT and mobile devices, non-invasive sensors, transfer learning, and customized deep learning architectures to enhance the robustness, reliability, resilience, and real-time capabilities of driver drowsiness detection systems.
Similarly, numerous other reviews and surveys on DL/ML focus on specific fields or applications. These encompass in-depth explorations of deep learning methodologies applied to various domains, such as eye tracking, ECG, EEG, and fNIRS, and to specific tasks like stress, emotion recognition, sleep disorders, cognitive load, anemia, and multimedia learning. These comprehensive review papers have primarily focused on the diverse applications of ML/DL in analyzing various physiological signals. Despite the wealth of literature exploring the application of ML/DL techniques for cognitive load analysis using physiological measures, a notable gap exists in the systematic examination of the use of these techniques specifically for fNIRS signals. To the best of our knowledge, no in-depth literature review comprehensively covers the application of ML/DL techniques in the context of cognitive load analysis using fNIRS signals. While existing reviews delve into the applications of ML/DL for cognitive load assessment using EEG and other physiological signals, there is a lack of literature addressing the unique characteristics and challenges posed by fNIRS signals in this domain. It is worth mentioning that a review conducted by C. Eastmond et al. (Eastmond, Subedi, De, & Intes, 2022) has provided a broader analysis of the progress made in the application of DL techniques for analyzing fNIRS signals. However, that study did not explore the specific intricacies related to cognitive load assessment using fNIRS. Secondly, it is noteworthy that these reviews have examined studies that analyze physiological signals by either utilizing publicly available datasets or repurposing data from prior studies. However, these reviews do not bring attention to the possible challenges and issues linked to the initial data collection processes used for subsequent ML and DL analyses. Therefore, to address the existing gap in the literature, our review aims to highlight the significant advancements made in the application of ML and DL methodologies for the recognition of cognitive workload using fNIRS signals. This involves an examination of all studies published within this specific domain, providing information on the development of techniques, methodologies, and findings.
This review covers studies on cerebral activities during cognitively demanding tasks according to the guidelines provided by the Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) (Page et al., 2021) protocol. We formulate a comprehensive search strategy with the aim of answering specific research questions. To maintain the focus on neuroergonomics studies related to ML and DL, we first identify the keywords for our preliminary search. Therefore, we included the common terms of cognitive load together with fNIRS in the final search string presented in Table 1.
Table 1. Search strings used for each topic.
Topic | Search terms |
Cognitive load | “cognitive load” OR “dual task” OR “cogniti*” OR “working memory” OR “attention” OR “load” OR “mental load” OR “overload” OR “mental effort” OR “germane load” OR “germane” OR “intrinsic load” OR “intrinsic cognitive load” OR “extraneous cognitive load” |
Artificial intelligence | “deep learning” OR “machine learning” OR “artificial intelligence” |
Functional Near Infrared Spectroscopy | “fNIRS” OR “functional near infrared spectroscopy” |
We particularly limit the publications to those from 2011 onward in well-established sources, namely the ACM Digital Library, Web of Science, PubMed, IEEE Xplore, Scopus, Google Scholar, and EuropePMC. We use the search keywords in these electronic databases, and titles and abstracts are then initially screened based on the following inclusion and exclusion criteria.
The objective of this review is to explore ML/DL-based techniques to decode brain activities from fNIRS signals. The studies included in this article should meet the following criteria:
The following criteria have been considered to determine whether an article needs to be excluded:
The selection process has been conducted in two main steps. The first step involves the removal of all duplicates, while the second applies the inclusion and exclusion criteria specified earlier. Articles that have no information on feature analysis, comparisons, study designs and outcomes have also been excluded. Fig. 2 summarizes the precise steps involved in the identification, screening, and eligibility processes.
Fig. 2. A flow diagram of the literature search according to the PRISMA (Page et al., 2021) guidelines.
A total of 1428 studies have been retrieved through the keyword search, and almost 280 duplicate studies have been removed. Then, 410 studies meeting the exclusion criteria have been removed, while information regarding cognitive tasks, model designs, and outcomes has been extracted from studies meeting the full inclusion criteria. Over 50 % of the articles included in this review have been published in the last three years. In addition, the major results of all 45 articles on cognitive load with fNIRS and ML/DL are summarized in subsequent sections.
A cognitive activity is typically evaluated based on its performance outcome. Although the main aim of our research is to investigate physiological measures of cognitive load, researchers have also used subjective measures for its analysis. Subjective measures require the participants to rate different aspects of the learning process using a multi-item scale. In particular, NASA's Task Load Index (TLX) (Hart & Staveland, 1988) is considered a gold standard for measuring workload in human-system evaluation. The NASA-TLX measure calculates a global index score based on mental demand, physical demand, temporal demand, performance, frustration, and effort. These scores are converted to the range of 0–100 (Nasirizad Moghadam et al., 2021) for task evaluation purposes. However, when many cognitive processes interact with one another, learners may not be able to identify different forms of cognitive load. The usefulness of subjective measures has also been questioned because of a lack of correspondence between events in the external world and the simulated cognitive environment. Therefore, it is important to improve the credibility of subjective measures so that the external world as well as internal sensations and feelings can be correlated in cognitive load measurement.
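As an illustration of the weighted scoring just described, the short sketch below computes a NASA-TLX global index from six subscale ratings on a 0–100 scale and pairwise-comparison weights that sum to 15; the specific ratings and weights shown are hypothetical.

```python
SCALES = ["mental", "physical", "temporal", "performance", "effort", "frustration"]

def nasa_tlx(ratings, weights=None):
    """Global NASA-TLX index: weighted mean of six subscale ratings (0-100).
    If no weights are given, the unweighted 'raw TLX' average is returned."""
    if weights is None:
        return sum(ratings[s] for s in SCALES) / len(SCALES)
    return sum(ratings[s] * weights[s] for s in SCALES) / sum(weights.values())

ratings = {"mental": 70, "physical": 20, "temporal": 55,
           "performance": 40, "effort": 65, "frustration": 35}
weights = {"mental": 5, "physical": 0, "temporal": 3,
           "performance": 2, "effort": 4, "frustration": 1}   # 15 pairwise comparisons
print(nasa_tlx(ratings, weights))   # global workload index in the 0-100 range
```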
In contrast, physiological measures, especially fNIRS, provide uninterrupted evaluation, offering a more objective workload assessment. fNIRS-based systems have been widely used to study neural changes in simulated cognitive environments. Concentration changes in HbO2 and dHb are proportional to changes in cerebral blood volume, providing a useful measure of neural activity. Some studies have implemented subjective surveys and categorised fNIRS signals using DL and ML classification techniques (Asgher et al., 2020, Keles et al., 2021). The main reason to rely only on physiological signals is that surveys interrupt the underlying operation flow, lengthen the time of operations, and are only available post-task (T. Zhou et al., 2020), leading to intra- and inter-subject variability, inconsistency, disruption, and inadequacy pertaining to measurements for the scenarios discussed in this article.
The foremost step in developing an fNIRS-based system is the selection of the brain regions from which the signals are acquired. The signals are generally acquired from the pre-frontal cortex or the motor cortices. The motor cortices mostly respond to the movement of body parts, e.g., legs, arms, fingers, and hands. In comparison, most of the studies included in this survey indicate that signals from the pre-frontal cortex are highly correlated with cognitive tasks. In addition, the signals acquired from the pre-frontal cortex are less sensitive to motion artifacts and high-frequency influences (Gemignani & Gervain, 2021). Fig. 3 depicts the distribution of studies in this review based on cognitive tasks. Cognitively demanding activities that contribute to changes in HbO2 and dHb over the pre-frontal region of the brain can be categorized into four groups: mental arithmetic (16 %), n-back task (24 %), Stroop task (5 %) and simulation-based tasks (55 %). A description of the general protocols for these tasks is as follows:
Fig. 3. Task-based distribution of studies.
Arithmetic tasks involve performing mathematical calculations without the help of paper, a calculator, or a computer. Arithmetic tasks usually consist of presenting a sequence of numbers to participants for performing addition, subtraction, multiplication, or division within a predefined duration. Mathematical equations of different complexity levels require simultaneous mental processing and information storage, which induces both low and high levels of mental workload in addressing complex experimental scenarios.
Introduced by Kirchner (Kirchner, 1958) in 1958, n-back tasks have been the most extensively used paradigm in neuroscience for understanding the neural basis of working memory. As visual-spatial tasks, researchers in neuroimaging have leveraged n-back tasks to induce different levels of memory load. The task presents participants with a visual or auditory series of random numbers, pictures, or digits. Participants need to remember them and then, when prompted, determine whether the current item matches the stimulus presented N items before. Cognitive load can be modified by varying the value of N. In the 0-back task, participants are required to identify a single pre-specified digit, letter, or image. In the 1-back task, a target occurs when the new item is identical to the one immediately preceding it. Similarly, for a 2-back, 3-back, …, or n-back task, a target occurs when the new item is identical to the item presented 2, 3, …, or n trials back. Fig. 4 shows a schematic of 1-, 2-, and 3-back tasks. Varying the value of N systematically increases the processing load, which results in changes in reaction time and accuracy (Lamichhane, Westbrook, Cole, & Braver, 2020).
Fig. 4. Schematic of 1-, 2- and 3-back tasks.
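A minimal sketch of the n-back matching rule described above is shown below; the letter sequence is hypothetical, and the same logic applies to digits or pictures.

```python
def n_back_targets(stimuli, n):
    """Mark each stimulus as a target if it matches the item presented n trials back."""
    return [i >= n and stimuli[i] == stimuli[i - n] for i in range(len(stimuli))]

stimuli = list("ABAABCBCB")          # hypothetical letter stream for one block
print(list(zip(stimuli, n_back_targets(stimuli, n=2))))
# Raising n increases how many items must be held and continuously updated
# in working memory, and thus the induced cognitive load.
```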
The Stroop task (Stroop, 1935) was developed in 1935 to study the effect of cognitive inhibition. Since then, many variants of the Stroop task have been proposed. Some of them have been used in clinical neuropsychology to study neurological disorders (Fischer-Jbali et al., 2022, Lewis et al., 2022). A traditional Stroop task, as shown in Fig. 5, entails the presentation of four colour words displayed in red, green, blue, or yellow ink. As an example, the word green could be displayed in green, yellow, red, or blue. The Stroop effect has been extensively used in neurological studies, often with an opportunity to earn reward points for accurate and fast responses. In the Stroop test, participants are instructed to identify the font colour while ignoring the word. For incongruent items, this results in delayed colour identification, slower response times, and increased cognitive workload.
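For illustration, the sketch below generates congruent and incongruent colour-word trials of the kind described above; the 50/50 trial mix and block length are arbitrary choices rather than a protocol taken from the reviewed studies.

```python
import random

COLOURS = ["red", "green", "blue", "yellow"]

def stroop_trial(congruent):
    """Return one (word, font_colour) pair; incongruent trials mismatch the two."""
    word = random.choice(COLOURS)
    colour = word if congruent else random.choice([c for c in COLOURS if c != word])
    return word, colour

# A hypothetical block: naming the font colour of incongruent items is slower
# and drives up cognitive workload, which is the effect exploited in fNIRS studies.
block = [stroop_trial(random.random() < 0.5) for _ in range(12)]
print(block)
```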
Studies of the human brain in simulator-based environments offer the safest way to expose participants to simulated dangers without risking life or property (Frederiksen et al., 2020). Technologies such as driving/flying simulators (Asadi et al., 2023, Asadi et al., 2019), virtual reality (Kooijman et al., 2022, Kooijman et al., 2023), and cognitively demanding games can be used to create simulations in which surroundings from a real environment are integrated into a virtual system. These simulations, as shown in Fig. 6, have a high level of connectivity with different types of commercial joysticks or customized controllers. Furthermore, distractions during simulation, such as visibility, turbulence, mental state, or pre-programmable handling qualities, add cognitive load to participants. In simulator-based studies, flying/driving tasks constitute the majority of neuro-ergonomics applications (e.g., aircraft control systems, driving a car, or flying a plane in complex simulated scenarios) (Mejia-Puig and Chandrasekera, 2022, Reddy et al., 2022). Human attention is then monitored and assessed during complex cognitive tasks (e.g., surgery simulation, video lectures, identification of hazards in a lab environment). Nonetheless, unrealistic scenarios that cannot be easily replicated have a detrimental impact on cognitive and performance outcomes.
Fig. 6. Cognitive load simulation environment.
AI, which includes ML and DL, leverages computational algorithms with learning capabilities to recognize patterns from data. Sometimes, it is difficult to interpret exact information from data samples (Mehta & Shukla, 2022). In this respect, DL and ML offer the underlying algorithms to learn from data without being specifically programmed to do so. AI-based models have suffered from lengthy computation times and the problem of vanishing gradients (Khademi, Ebrahimi, & Kordy, 2022), causing researchers to use statistical and other methods for data analysis. However, recent advancements in AI and the availability of graphics processing units (GPUs) enable neuroscientists to decode and classify fNIRS signals in unprecedented detail.
In neuroimaging, ML/DL models take fNIRS signals as training data to learn and predict the associated class labels. In the training phase, the ML/DL algorithm tunes the model parameters so that the trained model generalizes to produce the desired outcome when presented with an unseen data sample. Fig. 7 depicts the general flow of DL/ML model implementation. In the first step, the raw fNIRS signals are captured. These signals typically contain noise caused by changes in heart rate, blood pressure, etc. In the pre-processing phase, signal artifacts and other outliers in the data set are removed. Most of the studies presented in this review adopt bandpass and Butterworth filters as well as other methods for this purpose. The input representation and its corresponding features are determined during an optional feature extraction step. Feature selection improves classification performance by reducing data dimensionality and computational complexity. It is generally used with ML algorithms, and sometimes with DL algorithms, to increase the robustness of the model. A few papers use feature extraction together with DL algorithms, but most studies apply raw fNIRS signals as the model input. Most of the studies reported in the literature used summary-based statistical features (e.g., mean, variance, maxima, minima, slope, skewness, kurtosis, and normalization) or parameterization techniques (e.g., Wigner-Ville Distribution, continuous wavelet transform and Hough transform) to extract useful features from the data. A well-trained model can provide predictions pertaining to different levels of mental workload. To further assess model generalizability, most studies utilise either n-fold Cross-Validation (CV) or the leave-one-out method.
Fig. 7. The overall steps of fNIRS analysis using ML and DL include signal acquisition, pre-processing, feature extraction and classification.
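The pipeline in Fig. 7 can be sketched end to end as follows. This is a minimal illustration on synthetic data: the 0.01–0.2 Hz Butterworth band, the 10 Hz sampling rate, the channel count, and the choice of logistic regression as the classifier are all assumptions made for demonstration rather than settings taken from any reviewed study.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt
from scipy.stats import kurtosis, skew
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

FS = 10.0  # assumed sampling rate (Hz)

def bandpass(x, low=0.01, high=0.2, order=2):
    """Butterworth band-pass filter to suppress drift and cardiac/respiratory noise."""
    sos = butter(order, [low, high], btype="band", fs=FS, output="sos")
    return sosfiltfilt(sos, x, axis=-1)

def summary_features(epoch):
    """Hand-crafted per-channel statistics: mean, variance, slope, skewness, kurtosis."""
    t = np.arange(epoch.shape[-1])
    slope = np.polyfit(t, epoch.T, 1)[0]
    return np.concatenate([epoch.mean(-1), epoch.var(-1), slope,
                           skew(epoch, axis=-1), kurtosis(epoch, axis=-1)])

# Synthetic stand-in: 60 epochs x 8 channels x 30 s, with binary workload labels.
rng = np.random.default_rng(1)
epochs, labels = rng.standard_normal((60, 8, 300)), rng.integers(0, 2, 60)

X = np.array([summary_features(bandpass(e)) for e in epochs])
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(cross_val_score(clf, X, labels, cv=10).mean())  # 10-fold CV accuracy
```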
Classification tasks vary widely, and they can be categorized into three main groups: (a) supervised learning, (b) unsupervised learning, and (c) reinforcement learning. In supervised learning, labels (target outputs) are generally determined by humans, and a supervised algorithm maps the input features to a desired output (label). Supervised learning algorithms need external assistance in the form of handcrafted labeled data for the training and test phases. The algorithm learns patterns from the training data, and the model is validated on test data for classification and prediction purposes. Classification methods such as CNNs (Albawi, Mohammed, & Al-Zawi, 2017), ANNs (Abiodun et al., 2018), Support Vector Machines (SVMs) (Vapnik, 1999), Decision trees (Kotsiantis, 2013), Random forests (Breiman, 2001), Naive Bayes (Fix & Hodges, 1951), Logistic Regression (DeMaris, 1995) and Linear Regression (Su, Yan, & Tsai, 2012) are common supervised learning algorithms. In contrast, unsupervised learning algorithms use unlabeled data for inference. These algorithms learn features from the raw data and develop a predictive model to categorise the input data into different clusters, often with dimensionality reduction. Examples of unsupervised learning algorithms are K-Means clustering (Hartigan & Wong, 1979), Principal Component Analysis (PCA) (Maćkiewicz & Ratajczak, 1993), and Independent Component Analysis (ICA) (Stone, 2002). Reinforcement learning (RL) works on the principle of sequential decision making. It uses learning agents to interact with a dynamic environment, maximizing the reward received when a task is successfully achieved. The main factors contributing to RL are the environment model, policy, raw signals, and reward function. Traditional RL models can only solve problems with a low-dimensional state space. However, the recent introduction of deep neural networks (DNNs) as reinforcement agents gives models the ability to learn from multi-dimensional inputs (Ibarz et al., 2021). Over time, DNNs combined with RL have provided the power to solve problems in high-dimensional spaces, leading to various new RL research domains such as robotics (Bhagat, Banerjee, Ho Tse, & Ren, 2019) and autonomous driving (Kiran et al., 2021). Among the three learning paradigms, supervised learning is the one mainly used to predict and classify cognitive load from fNIRS signals.
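To contrast the unsupervised route mentioned above with the supervised pipeline sketched earlier, the following fragment applies PCA for dimensionality reduction followed by K-Means clustering to an unlabeled feature matrix; the synthetic feature matrix and the choice of two clusters are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic stand-in for per-epoch fNIRS feature vectors; labels are discarded
# to mimic the unsupervised setting.
X, _ = make_classification(n_samples=100, n_features=30, n_informative=6, random_state=0)

X_reduced = PCA(n_components=5).fit_transform(X)                    # dimensionality reduction
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)
print(clusters[:10])   # candidate low/high-load groupings discovered without labels
```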
This section discusses the trends in the formulation of ML/DL models applied to fNIRS data. A comprehensive summary of DL design, architecture and experimental paradigms is presented in Table 2.
Table 2. Description of data collected from the included studies.
Authors | Brain area | Environment / task | Feature selection or extraction method | Derived features | Validation strategy | Model architecture | Accuracy | Other metrics | Participants
Prefrontal cortex and dorsolateral prefrontal cortex (DLPFC) | ISAE (French Aeronautical University in Toulouse, France) flight simulator | ANOVA | Mean, kurtosis and skewness | 5-fold CV | SVM | 62 % | Specificity = 58 %, Sensitivity = 72 % | 19 (13 males and 6 females) |
Prefrontal cortex | Watching a 27-min video lecture on astronomy and answering 10 questions | N/A | Mean values of HbO2 and dHb | Leave one out | Random Forest and GLMNET | Random forest = 66 %, GLMNET = 63 % | Random forest (sensitivity = 0.63 ± 0.066, specificity = 0.66 ± 0.0420, and Cohen’s kappa coefficient = 0.26); GLMNET (sensitivity = 0.62 ± 0.067, specificity = 0.64 ± 0.042, and Cohen’s kappa coefficient = 0.22) | 18 (8 males and 10 females) |
Left and right brain hemispheres | Iowa Gambling Task (IGT) | Pearson coefficient | Mean, variance and standard deviation | 10-fold CV | Multiple regression, decision trees, ANN, SVM and random forest | Best accuracy achieved by SVM with radial basis function | SVM RMSE = 3.37 to 7.84, SVM R-squared = 0.29 – 0.96 | 30 (5 males and 25 females) |
Prefrontal cortex | Civil engineering lab (identification of hazards) | Fisher criterion | Mean | 10-fold CV | LDA | 70 % | N/A | 48 (35 males and 13 females) |
Prefrontal cortex | VR-based questions about presented content | ANOVA | N/A | 2-fold CV | Random Forest | 83.9 % | Sensitivity = 0.73 ± 0.071, Specificity = 0.71 ± 0.044, Cohen’s kappa coefficient = 0.41 | 40 (21 males and 19 females) |
Prefrontal cortex | n-back task | Shapiro–Wilk test | N/A | Leave one out and 10-fold CV | CNN based model | 94.52 % | N/A | 26 | |
Prefrontal cortex | n-back tasks | Wigner-Ville Distribution | N/A | N/A | ResNet50 | 98 % | N/A | 10 (6 males and 4 females) | |
N/A | Driving in a simulation-based environment | N/A | N/A | N/A | Random forests | 98.24 % | PPV = 97.02 %, TPR = 97.17 %, TNR = 98.71 %, F1-score = 97.10, NPV = 98.77, FPR = 1.29 | 17 (5 males and females) |
Prefrontal cortex | Mental arithmetic | N/A | Mean-Variance, Mean-Peak, Mean Slope, Peak Slope, Peak and Variance | 10-fold CV | SVM | 94 % | N/A | 20 (10 males and 10 females) | |
(E. Q. Wu et al., 2021) | Medial prefrontal cortex, left and right dorsolateral prefrontal cortex, left and right ventrolateral prefrontal cortex, and left and right temporal cortex | Physical flight simulator (cognitive states during simulated failure of the aircraft) | Hough Transform features | N/A | 5-fold CV | Scalable gamma non-negative matrix network (SGNMN) | 92 % | N/A | 40 pilots
Prefrontal cortex | Laparoscopic trainer box (simulated surgery) | Wilcoxon signed-rank test | Mean, skewness and kurtosis | 5-fold CV | SVM | 90 % | N/A | 11 surgeons and 17 medical students | |
Prefrontal cortex | Mental arithmetic and idle state tasks | N/A | N/A | Leave-one-subject out CV | CNN-based model | 71.20 % ± 8.74 % | N/A | 18 (10 males and 8 females) | |
Prefrontal cortex and the right parietal areas | Thumb abduction tasks | t-test | N/A | N/A | SVM | 90 % | N/A | 7 (male) | |
Prefrontal cortex | Simple arithmetic (SA) and 1-back and 2-back tasks | Wilcoxon signed-rank test and PCA | N/A | Nested CV | SVM | 77 % | N/A | 22 (18 males and 7 females) | |
Prefrontal cortex | Logic and arithmetic task with four difficulty levels | t-test | Normalization, signal mean, maxima, variance, minima, slope, skewness, kurtosis and signal peak | 10-fold CV | CNN and LSTM | CNN = 87.45 %, LSTM = 89.3 % | N/A | 7 (2 males and 5 females) |
N/A | Real car different driving task along with digit recalling n-back task | PCA | N/A | 5-fold CV | Random forests | 96.08 % | N/A | 5 (4 males and 1 female) | |
Prefrontal cortex | Stroop task experiment | PCA | N/A | N/A | SVM, AdaBoost, Deep Belief Network and Convolutional Neural Network | SVM = 64.74 % ± 1.57 %, AdaBoost = 71.13 % ± 2.96 %, DBN = 84.26 % ± 2.58 %, CNN = 72.77 % ± 1.92 % | N/A | 16 (8 males and 8 females) |
Frontal cortex | n-back task | Continuous wavelet transform features | N/A | 5-fold CV | BPNN, LDA and SVM | N/A | AUC BPNN = 0.7672, AUC SVM = 0.9404, AUC LDA = 0.8902 | 9 (8 males and 1 female) |
(L. M. Wang et al., 2022, Wang et al., 2022) | Frontal cortex | Verbal fluency test | N/A | N/A | N/A | CNN (VGG-16 based) | 100 % | TPR = 100, FNR = 100 | 13 (6 males and 7 females)
(Naseer, Qureshi, Noori, & Hong, 2016) | Prefrontal cortex | Mental arithmetic task vs rest signals | N/A | Mean, peak, slope, variance, kurtosis, and skewness, with feature normalization between 0 and 1 | 10-fold CV | LDA, QDA, k-NN, Naive Bayes, SVM and ANN | LDA = 71.6 ± 1.1 %, QDA = 90.1 ± 1.3 %, k-NN = 69.8 ± 0.5 %, Naive Bayes = 89.8 ± 1.4 %, SVM = 89.5 ± 1.0 %, ANN = 91.4 ± 0.3 % | LDA (Precision = 72.8 ± 6.2, Recall = 73.5 ± 9.2); QDA (Precision = 90.0 ± 4.4, Recall = 91.2 ± 5.5); k-NN (Precision = 69.1 ± 1.3, Recall = 70.4 ± 2.6); Naive Bayes (Precision = 91.5 ± 5.1, Recall = 88.5 ± 5.0); SVM (Precision = 89.1 ± 4.2, Recall = 91.8 ± 5.5); ANN (Precision = 90.1 ± 2.7, Recall = 91.5 ± 4.4) | 7
Frontal cortex | n-back task | N/A | N/A | 10-fold CV | CNN-BiGRU-SLA | 77.53 % | Precision = 77.41, Recall = 77.65, F1-score = 77.42 | 22 |
Frontal area, motor part, parietal area, and occipital area | n-back task | ANOVA | Mean, minimum, maximum, standard deviation, slope and skewness | N/A | SVM | 73.40 ± 0.076 % | N/A | 26 (9 males and 17 females) | |
(Q. Zhu, Shi, & Du, 2021) | Prefrontal cortex | Sternberg test | N/A | Mean, peak, standard deviation, kurtosis and skewness | 10-fold CV | SVM (Gaussian radial basis function) | 70.02 ± 4.41 % | N/A | 15 (14 males and 1 female) |
Prefrontal cortex | Fixed-base, full-cab Volkswagen New Beetle, verbal, and n-back task | ANOVA | Multilayer perceptron features | 10-fold CV | ESN | 80.61 % | Precision = 79.08, Recall = 81.67, F1-Score = 80.38 | 18 |
(Varandas, Lima, Bermúdez i Badia, Silva, & Gamboa, 2022) | Dorsolateral prefrontal cortex | Corsi-Block task | N/A | Maximum, minimum, polarity, mean, variance, standard deviation, kurtosis and skewness | 10-fold CV | Random Forest | 70.91 ± 13.67 % | Precision = 72.86 ± 15.32, Recall = 69.09 ± 14.77, F1-Score = 70.27 ± 14.30, AUC-ROC = 72.50 ± 17.26 | 10 (6 males and 4 females)
Prefrontal cortex | n-back task | Deep contribution ratios | N/A | 10-fold CV | SVM | 80.6 % | Sensitivity = 78.1 %, Specificity = 85.5 %, AUC = 85.1 % | 25 (21 males and 4 females) |
(Saikia, Kuanar, Borthakur, Vinti, & Tendhar, 2021) | Prefrontal cortex | n-back task | N/A | Gradient value, mean, variance, number of peaks, kurtosis, skewness, maximum and minimum value | N/A | k-NN | 75 % | N/A | 12 |
Prefrontal cortex | n-back task | ANOVA | Mean, variance, skewness and kurtosis | 10-fold CV | Linear regression | 63.7 % | N/A | 10 (6 males and 4 females) | |
N/A | n-back task | N/A | N/A | 5-fold CV | CNN | 71.63 % | N/A | 27 | |
N/A | n-back task | N/A | N/A | N/A | CNN and RNN based model | 98.3 % | N/A | N/A | |
(Izzetoglu, Jiao, & Park, 2021) | Left and right hemispheres | Driving simulator | N/A | Normalization | N/A | Logistic regression | 97.5 % | N/A | 10 (4 males and 6 females) |
N/A | Mental arithmetic | N/A | N/A | N/A | LSTM-FCN | 97.1 % | N/A | 8 | |
Prefrontal cortex | Thales Airbus 320 Simulator | N/A | Mean, standard deviation and slope | N/A | LDA | 91 % | N/A | 8 | |
(Benerradi, A. Maior, Marinescu, Clos, & L. Wilson, 2019) | Prefrontal cortex | Customized task (game-based task targeting colored balls with a joystick to induce different levels of workload) | N/A | Normalization | N/A | Logistic regression, SVM and CNN | LR = 50.99 %, SVM = 53.90 %, CNN = 49.53 % | N/A | 11 (6 males and 5 females)
Prefrontal cortex | Stroop tasks | N/A | N/A | N/A | DBN and CNN | DBN = 84.26 ± 9.10 %, CNN = 65.42 ± 1.58 % | N/A | 16 (8 males and 8 females) |
Prefrontal lobe | Verbal memory retrieval and visuospatial memory retrieval | N/A | N/A | 20-fold CV | k-NN and SVM | SVM = 100 %, k-NN = 100 % | Positive Predictive Value (PPV) = 1, Negative Predictive Value (NPV) = 1 | 20 (13 males and 7 females) |
N/A | Flight simulator with complex scenarios | ANOVA | N/A | 10-fold CV | SVM | 77.8 % | Sensitivity = 79.4 %Specificity = 76 %. | 9 (8 males and 1 female) | |
Cerebral prefrontal cortex | Visualization of product videos | N/A | N/A | 8-fold CV | CNN | 86.2 % to 86.3 % | N/A | 8 (4 males and 4 females) | |
N/A | Mental arithmetic and mental singing | GLM | Kalman filter-based features | 10-fold CV | Kalman filter and adaptive Gaussian Mixture model | 97.89 % | N/A | 8 (3 males and 5 females) |
(Bak, Yeu, & Jeong, 2022) | Ventrolateral prefrontal cortex, medial prefrontal cortex, and orbitofrontal cortex | Buying behavior-related task | t-test | Mean, variance, kurtosis, skewness, slope and area | 10-fold CV | SVM | 94 % | AUC = 0.97 | 33 (12 males and 21 females)
(Touhid, Anam, Alam, Foysal, & Shaiham, 2023) | N/A | Mental arithmetic | Haar wavelet-based features | Mean, root mean square value and variance | 8-fold CV | Gentle Boost | 95.54 % | N/A | N/A
N/A | n-back task | t-test | N/A | N/A | Random forests | 96.7 % | AUC = 96.7, Precision = 97.0, Recall = 97.0, F1-Score = 97.0 | 68 |
N/A | n-back task | N/A | N/A | N/A | Generalized Linear Mixed-Effects Model Tree | N/A | RMSE = 5.6 × 10⁻⁴, MSE = 3.2 × 10⁻⁷ | 26 (9 males and 17 females) |
N/A | Rule Learning Task | ANOVA | N/A | N/A | Logistic regression | N/A | F1-score = 0.76 | 22 (5 males, 13 females and 4 others) | |
(Y. Zhang et al., 2023) | N/A | Mental arithmetic and mental singing | N/A | Mean, slope and normalization | N/A | CGAN-rIRN | 92.19 % | N/A | 8 (2 males and 6 females) |
Most articles in neuroscience employ fNIRS data sets that are not publicly available. The performance measures, e.g. a simple accuracy measure or other metrics such as the mean squared error (MSE), root mean squared error (RMSE), F1-score, true positive rate, or false positive rate, cannot be generalised since each study has different test subjects, data procurement protocols and different cognitively demanding tasks. Studies on fNIRS indices of mental workload can be grouped into three categories, as illustrated in Fig. 8: (1) ML-based fNIRS analysis; (2) DL-based fNIRS analysis; and (3) hybrid AI-based models for fNIRS analysis.
Fig. 8. Taxonomy of AI-based models applied on cognitive load fNIRS data.
ML, which is a subset of AI, is capable of processing physiological data and imitating the human ability to recognize patterns. This section covers ML approaches to analyzing fNIRS data. A total of 25 studies are reviewed, which apply ML to objectively evaluate mental workload. A summary of ML-based algorithms in the literature is as follows. Fig. 9 displays the distribution of studies utilizing ML classifiers for the analysis of fNIRS data. SVM emerges as a prominent choice within the fNIRS research community, followed by Random Forests, which are recognized for their efficacy in handling high-dimensional data. LDA and k-NN are also noted in the distribution as applied methods in fNIRS-based machine learning studies.
Fig. 9. Distribution of ML studies employed for the classification of fNIRS data.
According to our investigation, SVM has been widely used in fNIRS signal analysis because of its ease of implementation and high accuracy. The idea of SVM is based on the structural risk minimization principle. It is mainly used for pattern recognition and regression analysis. When classifying data samples in a high-dimensional space, it tries to find the optimal hyperplane with the largest margin between classes. These hyperplanes are learned so that different categories of input data points are separated. Several researchers, including Gateau et al. (Gateau et al., 2015), Asgher et al. (Asgher et al., 2019), Keles et al. (Keles et al., 2021), Derosiere et al. (Derosiere et al., 2014), Dong et al. (Dong & Jeong, 2018), Abibullaev et al. (Abibullaev & An, 2012), and Kurihara et al. (Kurihara et al., 2020), have used SVM to classify mental workload from fNIRS signals.
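A minimal sketch of this use of SVM is given below. The synthetic feature matrix stands in for hand-crafted fNIRS features, and the RBF kernel with default C and gamma is an illustrative choice rather than the configuration reported by the cited studies.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for per-epoch feature vectors with binary workload labels.
X, y = make_classification(n_samples=120, n_features=40, n_informative=10, random_state=0)

# Standardisation matters because the margin depends on feature scale; the RBF
# kernel lets the separating hyperplane become non-linear in the input space.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
print(cross_val_score(svm, X, y, cv=5).mean())
```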
Khanam et al. (Khanam et al., 2022) applied the ANOVA test to the conventional mean, minimum, maximum, standard deviation (SD), slope, and skewness features from all 36 channels. The ANOVA analysis indicated that only two channels, in the frontal and motor areas, showed statistically significant differences among different levels of workload. SVM was trained on the features obtained from the two significant channels and achieved an accuracy rate of 71.48 %.
Zhu et al. (Q. Zhu et al., 2021) employed conventional feature extraction methods to explore the relationship between fNIRS signals and cognitive load based on a Sternberg experiment (Sternberg, 1969). The experimental results highlight the fact that the significant features for predicting cognitive load using SVM varied across participants because each person processes information differently. So, instead of generalized models, personalized models are required to predict cognitive load from fNIRS signals. A further pipeline to filter, clean, and model fNIRS data has also been presented in this study.
To reduce the number of false positives, Lim et al. (Lim et al., 2020) introduced a feature extraction method named the deep contribution ratio, which uses the k-means clustering method together with the Euclidean distance to identify activated and non-activated channels. Experimental results showed that the deep contribution ratio achieved better accuracy (80 %) in comparison with that obtained from conventional slope-based features (59.8 %).
Asgher et al. (Asgher et al., 2019) processed fNIRS data using a proposed Fixed-Value Modified Beer-Lambert Law (FV-MBLL) and the conventional MBLL. The results highlighted the fact that a combination of mean and peak values yielded better results in mental arithmetic tasks when the data samples were processed with either FV-MBLL or conventional MBLL. Low classification scores could also be improved through oversampling, by balancing the number of samples across cognitive tasks.
Durantin et al. (Durantin et al., 2016) optimized the Kalman filter to remove noise and other artifacts from the fNIRS signals. To estimate a pilot’s mental state in a simulated flight environment, SVM was trained on fNIRS signals filtered with a Kalman filter, IIR filters, and a Moving Average Convergence Divergence (MACD) filter (Durantin, Scannella, Gateau, Delorme, & Dehais, 2014). The experimental results show that the prediction accuracy on Kalman-filtered data was 77.8 %, higher than that obtained with data filtered by the IIR and MACD filters.
The studies presented so far do not compare SVMs with other ML techniques. The study conducted by Kornev et al. (Kornev et al., 2022) not only used the SVM radial basis function for classification but also compared the results with multiple regression, artificial neural networks, random forests, and classification and regression trees (CART). Although this study did not report the average accuracy of each algorithm, it demonstrated the high performance of SVM in terms of the Root Mean Square Error (RMSE) and the coefficient of determination (R²).
Despite the promising results offered by SVM for fNIRS signal analysis, most of the studies that used SVM for classification relied on moderately sized and balanced datasets. The training time also increases as the number of samples increases. Secondly, it is difficult to find an appropriate kernel function when the non-linearity in the data increases, so it is always recommended to use an appropriate noise removal technique before using SVM for fNIRS signal classification.
The k-Nearest Neighbors (k-NN) algorithm is a widely used ML method for classification and regression tasks. It is based on the principle of close instances, which means that it relies on the similarity between a new data point and the existing data points to classify or predict its label or value. It stores all the training samples, and each input instance is represented as a vector. In k-NN, the “k” refers to the number of nearest neighbors considered when classifying a new data point. The algorithm works by calculating the distance between the new data point and all the existing data points in the dataset. The k nearest neighbors are then selected as the data points with the closest distances to the new data point. The classification or prediction of the new data point is based on the labels or values of these k nearest neighbors. The distance metric used in k-NN can vary depending on the type of data and the problem at hand. The most commonly used distance metrics are the Euclidean distance (Durtschi, Mahat, Mashal, & Chrysler, 2021), Manhattan distance (Ehsani & Drabløs, 2020), and Minkowski distance (Iswanto, Tulus, & Sihombing, 2021). The choice of distance metric can have a significant impact on the performance of the algorithm (Shalika & Kumar, 2021). k-NN also requires less training than other algorithms. It is suitable for data in which the relation between input and output is too complex to be expressed by linear models. To classify five different levels of workload during n-back tasks, Saikia et al. (Saikia et al., 2021) evaluated the training time and accuracy of Fine k-NN, Medium k-NN, Coarse k-NN, Cosine k-NN, Cubic k-NN, and Weighted k-NN. In the classification task, both Fine k-NN and Weighted k-NN achieved 75 % accuracy, while Weighted k-NN took a shorter training time (4.93 s) than Fine k-NN (5.59 s).
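The effect of the distance metric discussed above can be explored with the short sketch below; the synthetic data, k = 5, and the metrics compared are illustrative choices rather than settings from the cited study.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for per-epoch fNIRS feature vectors with binary labels.
X, y = make_classification(n_samples=150, n_features=20, n_informative=8, random_state=0)

# p=2 gives the Euclidean distance, p=1 the Manhattan distance, and other
# p values the general Minkowski distance.
for label, p in [("euclidean", 2), ("manhattan", 1), ("minkowski p=3", 3)]:
    knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=p)
    print(label, cross_val_score(knn, X, y, cv=5).mean())
```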
Although k-NN requires less training time than other algorithms, it requires more computation during the classification process when determining the results. A study conducted by Naseer et al. (Naseer et al., 2016) reported that the classification results produced by the k-NN classifier were less accurate than those of other ML algorithms.
LDA is a well-known dimensionality reduction and feature extraction technique. It identifies a linear combination of features that projects vectors belonging to different classes onto a lower-dimensional feature space in such a way that the feature vectors of each class are separated from those of the other classes. The technique is simple to implement and has low computational requirements. Some researchers, such as Zhou et al. (X. Zhou et al., 2021, Zhou et al., 2021) and Çakır et al. (Çakır et al., 2016), used LDA to classify different levels of mental workload. The main limitation of LDA is its linear nature, which prevents it from generating competitive results on non-linear fNIRS signals.
Çakır et al. (Çakır et al., 2016) evaluated three levels of mental workload in 8 pilots. The results showed that when LDA was trained on the data of only a single pilot, the model could be generalised to evaluate the mental workload of the remaining pilots. The proposed model also had high accuracy in predicting low levels of workload but low accuracy in predicting high levels of workload due to frequent head movements. Zhou et al. (X. Zhou et al., 2021, Zhou et al., 2021) studied hazard perception tasks in a lab environment and indicated that LDA could achieve an accuracy rate of 70 % when the model was trained on features obtained from the left prefrontal cortex. The Fisher criterion was used to select the top five optimal features from the data, and the results indicated that the left prefrontal cortex was more involved in hazard perception tasks than the other regions of the brain.
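The exact feature-selection implementations of these studies are not reported, but the two-class Fisher criterion they refer to can be sketched as follows, ranking features by between-class separation before fitting LDA; the synthetic data and the choice of five retained features are illustrative and only loosely mirror the description above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=100, n_features=30, n_informative=6, random_state=0)

def fisher_scores(X, y):
    """Two-class Fisher criterion per feature: (mu0 - mu1)^2 / (var0 + var1)."""
    x0, x1 = X[y == 0], X[y == 1]
    return (x0.mean(0) - x1.mean(0)) ** 2 / (x0.var(0) + x1.var(0) + 1e-12)

top5 = np.argsort(fisher_scores(X, y))[-5:]     # keep the five most separable features
print(cross_val_score(LinearDiscriminantAnalysis(), X[:, top5], y, cv=10).mean())
```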
Random forest is a tree-based ensemble learning method. It builds a classifier by constructing a number of randomized decision trees (Khan, Asadi, Hoang, Lim, & Nahavandi, 2023). Each decision tree in the ensemble casts a vote for the predicted class, and the final prediction is the class label that receives the most votes. The ensemble nature of the model helps random forests deal with high-dimensional data and complex feature spaces, making them a strong candidate for handling non-linear fNIRS signals. One of the main advantages of random forests over individual decision trees is that they are less likely to overfit the data. Overfitting occurs when a model is too complex and captures noise or irrelevant patterns in the training data, resulting in poor performance on new, unseen data. By combining multiple decision trees, random forests can reduce the variance of the model and prevent overfitting (Balyan et al., 2022). The random selection of features for each tree also helps to reduce the correlation between the trees and increase their diversity, leading to better overall performance. The study of (Z. Khan et al., 2020, Khan et al., 2020) also found that, compared to other ML algorithms such as SVM and k-NN, it is easier to determine hyperparameters in random forests. Example studies that use random forests for fNIRS signal analysis include Oku et al. (Oku & Sato, 2021), Lamb et al. (Lamb et al., 2022), M. Hasan et al. (Hasan et al., 2023), Le et al. (Le et al., 2022), Le et al. (Le et al., 2018), and Varandas et al. (Varandas et al., 2022).
Varandas et al. (Varandas et al., 2022) used the Corsi-Block task (Milner, 1971) and Lamb et al. (Lamb et al., 2022) used a virtual reality-based environment to induce cognitive load. Both studies reported more than 70 % accuracy when using random forests to classify different levels of mental workload. Le et al. (Le et al., 2018) used an auditory n-back task to classify different levels of mental workload while driving a car at around 40 km/h. The experimental results (Le et al., 2018) show that the random forests performed better when the data from all channels were used for classification, and that channel position did not have any significant effect on accuracy. In another study, Le et al. (Le et al., 2022) analysed senior drivers’ mental state and indicated that significant changes were observed while driving a car in a relaxed environment, trail driving, and parking in a bay. The results indicated that random forests performed better in terms of accuracy, true positive rate, and F1-score compared with the Naive Bayes, Discriminant Analysis, SVM, Decision Tree, and k-NN methods.
Despite the ability of random forests to handle high-dimensional and non-linear fNIRS data by using a large number of decision trees, they still have some limitations. Depending on the nature and complexity of the data, a large number of trees may be required to overcome the problem of large variance. Random forests may also produce spurious results if their parameters are not optimally selected. Therefore, it is always advisable to use cross-validation to optimize the parameters of the random forest model (Sundararajan et al., 2021).
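The cross-validated parameter tuning recommended above can be sketched as follows; the parameter grid and synthetic data are placeholders chosen purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=40, n_informative=8, random_state=0)

# Search over forest size and tree depth with 5-fold cross-validation so that
# the variance of the ensemble is controlled without overfitting.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300, 500], "max_depth": [None, 5, 10]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```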
In addition to widely utilized ML classifiers such as SVM, k-NN, LDA, and random forests, logistic regression and Gentle Boost have also been used in fNIRS cognitive load analysis. While these methods may not be as prevalent, recent studies have shown their effectiveness in enhancing the understanding of cognitive load dynamics. For example, A. Howell-Munson et al. (Howell-Munson et al., 2023) incorporated behavioral data, including reaction time and task difficulty, in conjunction with fNIRS to comprehensively analyze cognitive load. Their approach employed logistic regression, demonstrating superior results compared to other classifiers. Similarly, the study conducted by T. I. Touhid et al. (Touhid et al., 2023) delved into the comparative analysis of Gentle Boost algorithms alongside established classifiers such as LDA, SVM, and random forests. The experimental findings reveal that Gentle Boost, particularly when utilizing Haar wavelet-based features, exhibited superior performance in comparison with the other methods. This suggests that the unique features of Gentle Boost, combined with signal processing techniques like the Haar wavelet transformation, contribute to a more enhanced understanding of cognitive load dynamics as captured by fNIRS data.
Beyond widely recognized ML classifiers, there are studies where researchers proposed their own ML-based classification methods for analyzing cognitive load dynamics in fNIRS data. For instance, Y. Zhang et al. (Y. Zhang et al., 2022, Zhang et al., 2022) introduced a novel classification method incorporating Kalman filtering and an adaptive Gaussian Mixture Model (GMM), aimed at identifying intricate patterns within fNIRS signals. Their results demonstrated a significant improvement in classification accuracy, reporting an 87 % improvement compared to conventional classifiers such as GMM, SVM, and LDA. This suggests that integrating Kalman filtering with an adaptive GMM provides a robust framework for extracting meaningful information from fNIRS data and enhances the efficacy of cognitive load analysis. Similarly, S. Cakar et al. (Cakar & Yavuz, 2023) proposed the Generalized Linear Mixed-Effects Model Tree, which combines Linear Mixed Models (LMM) with ML-based models specifically designed for the analysis of repeated-measures fNIRS data. By leveraging the strengths of LMM and ML approaches, this study aimed to address the complexities associated with repeated measures in fNIRS experiments.
The exact working mechanism of the brain is yet to be fully understood. Several studies investigate cognitive tasks based on fNIRS responses in critical areas of brain activation. Derosiere et al. (Derosiere et al., 2014) analysed oxyhemoglobin (HbO2) features from the right parietal area of the brain and found that they are more sensitive for the classification of cognitive load compared with those from other parts of the brain. Meanwhile, the study conducted by Keles et al. (Keles et al., 2021) on students and surgeons during simulated surgery tasks suggested that neural activation in the left pre-frontal cortex, near the dorsolateral and ventrolateral areas, is considerably higher than in other regions. The relationship between HbO2 features and prefrontal cortex regions was also evaluated by Izzetoglu et al. (Izzetoglu et al., 2021) in simulated driving tasks. During slow-driving tasks, a high level of negative correlation was observed between HbO2 features and right pre-frontal cortex activations. A logistic regression model trained on these features yielded an accuracy of 97.5 %.
6.2. Deep learning trends in fNIRS analysis
Different from ML, the architecture of a DNN contains many hidden layers. Multilayered networks have a finite number of non-linear elements (i.e., activation functions and neurons), which makes them more flexible and robust than conventional ML algorithms. The first and last layers are the input and output layers, while those in between are the hidden layers. Depending on the number of neurons and hidden layers, these models can easily contain thousands or even millions of trainable parameters. DL models are prone to overfitting when dealing with smaller data sets; hence, they are better suited to massive data sets (J. Wang et al., 2021). On the other hand, DL can automatically learn useful features from data with less handcrafting effort. We have identified 11 studies on DL for the classification of fNIRS signals. Nearly half of these studies used CNN models, while four studies leveraged Deep Belief Networks (DBN), Long Short-Term Memory (LSTM), ANNs, and Echo State Networks (ESN). According to our taxonomy, the use of algorithms other than CNN and LSTM on fNIRS signals is less prevalent. A summary of studies that utilized DL algorithms for classification is as follows:
CNNs are designed specifically to take images as input. Numerous CNN variants have been proposed, showing excellent results in computer vision (Balasundaram et al., 2023), Natural Language Processing (NLP) (Ahmed & Wang, 2023), image segmentation (M. A. Khan et al., 2020, Khan et al., 2020), remote sensing (Boulila, Ghandorh, Khan, Ahmed, & Ahmad, 2021), and signal processing (Ghandorh et al., 2021). In fNIRS signal classification, the input formulation strategies, feature extraction, and feature selection methods vary significantly as a function of the architecture. The layers of a DL model hierarchically extract features from the data samples. The performance of any CNN architecture depends on the number of convolutional layers, pooling layers, and fully connected layers. Convolutional layers give the model the ability to learn complex features from the data; pooling layers not only improve the performance of the model but also reduce the dimensionality of feature maps; and fully connected layers map the learned features to the output. During training, a CNN continuously optimizes its weights and other parameters, which takes time; once the model is trained, classification itself is fast. A minimal sketch of such an architecture is given below.
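The sketch assumes 36 fNIRS channels, 10 s windows sampled at 10 Hz, and binary workload labels; these values, and the layer sizes, are illustrative assumptions rather than the architecture of any reviewed study.

```python
# Minimal sketch of a 1D CNN for fNIRS windows (assumed shapes, not a
# reproduction of any published model).
import torch
import torch.nn as nn

class FNIRSCNN(nn.Module):
    def __init__(self, n_channels=36, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=5, padding=2),  # learn local temporal filters
            nn.ReLU(),
            nn.MaxPool1d(2),                                      # reduce temporal dimension
            nn.Conv1d(32, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                              # collapse to one value per filter
        )
        self.classifier = nn.Linear(64, n_classes)                # map features to class scores

    def forward(self, x):                # x: (batch, channels, time)
        z = self.features(x).squeeze(-1)
        return self.classifier(z)

model = FNIRSCNN()
dummy = torch.randn(8, 36, 100)          # 8 windows of 10 s at an assumed 10 Hz
print(model(dummy).shape)                # -> torch.Size([8, 2])
```

Here the adaptive pooling layer simply stands in for the flattening step that precedes the fully connected layer in many published architectures.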
Khalil et al. (Khalil et al., 2022) proposed a 6-layer CNN to classify four levels of n-back tasks. First, the data of a few participants were used to train a CNN model; the same pre-trained model was then used to extract features from the data, and transfer learning was employed to re-train the model. Although this work does not provide a comparison with other ML/DL methods, it compares the training time of the proposed method with that of conventional training. The results suggest that their method reduces the training time.
Wang et al. (Wang et al., 2022, Wang et al., 2022) used the VGG-16 model to study hemodynamic changes in the brain. Instead of using conventional features, the authors converted the fNIRS signals of 52 channels into images, which were then used to train the CNN model. It was reported that the proposed feature extraction approach achieved 100 % accuracy. This work does not provide a comparison with other ML/DL models, but it evaluates the model in terms of accuracy, True Positive Rate (TPR), and False Positive Rate (FPR).
Liu et al. (R. Liu et al., 2021) evaluated the performance of autoencoders for analyzing fNIRS data. This study demonstrates the significance of features extracted from an Echo State Network (ESN) by training models both on hand-crafted features and on features obtained from convolutional autoencoders. The experimental results show that the features extracted from ESN autoencoders yield better results, with an accuracy of 80.61 %.
Benerradi et al. (Benerradi et al., 2019) used a 7-layer CNN to classify mental workload at two and three levels. The classification results were also compared with those from SVM and logistic regression. In the three-class task, the SVM outperformed the other models, but in the two-class task, the CNN achieved the highest accuracy. One reason for the low accuracy could be the small dataset (9 participants) and the window size of only 9 s. In addition, their model architecture has only two convolutional layers, which limits its capability to extract features from the data and contributes to the lower performance of the CNN on the three-class classification task.
Kwon et al. (Kwon & Im, 2021) adopted a CNN model to classify fNIRS signals recorded during mental arithmetic tasks and idle states. An evolving normalization-activation layer (H. Liu, Brock, Simonyan, & Le, 2020) was used in the architecture instead of the traditional normalization layer, and the dropout probability was set to 0.5. Without using any feature extraction method, the proposed CNN architecture outperformed EEGNet and other ML classifiers.
Qing et al. (Qing et al., 2021) utilized the CNN input layer as a decoding data matrix to process conventional features from fNIRS signal lengths of 15 s, 30 s, and 60 s; the method achieved 86.3 % accuracy. Zaman et al. (Zaman & Islam, 2021) used the Wigner-Ville distribution to transform fNIRS signals of different window sizes into 2-D images and evaluated the results using ResNet50 (He, Zhang, Ren, & Sun, 2016). The proposed feature extraction method improved the accuracy from 89 % to 98 %. Similarly, Ho et al. (Ho, Gwak, Park, Khare, et al., 2019) compared the performance of a 9-layer CNN with a 5-layer DBN; PCA was applied to the dataset to reduce its dimensionality. The results indicated that both models performed better when trained with total hemoglobin (HbT) features, whereas lower accuracy was observed when using oxy-hemoglobin (HbO) and deoxy-hemoglobin (HbR) features. Despite delivering outstanding classification accuracy, CNNs come with disadvantages. A CNN requires a large amount of data for training, but the reviewed studies (Cascianelli et al., 2018) used a limited number of test subjects; therefore, more test subjects need to be recruited to increase the dataset size when incorporating CNNs. A CNN may yield high accuracy on a small dataset, but this can be a symptom of overfitting (Ma et al., 2020). fNIRS signals are highly time-dependent, with changes occurring over a range of temporal scales, whereas CNNs are designed to capture local features of the data without explicitly modeling temporal dynamics. This mismatch between the inherent nature of fNIRS signals and the limited capability of CNNs to capture temporal dependencies can limit the performance of CNN models on fNIRS datasets. Moreover, due to the large number of parameters involved, it is difficult to express the logic and actual mechanisms underlying the reasoning process of the classification procedure.
To address time-series classification beyond CNN-based models, LSTM and RNN models have been proposed. LSTM models are commonly used in neuroergonomic studies because plain RNNs suffer from the vanishing gradient problem. LSTM models possess an input gate, a forget gate, and an output gate, which give the model the capability to handle sequential data, making them more suitable for fNIRS signals than other models. These models can exploit past (and, in bidirectional variants, future) context in the sequence, which is not possible with a CNN and other feed-forward models. Asgher et al. (Asgher et al., 2020) used a model with 4 LSTM layers and 4 dense layers to classify mental workload at four different levels. The model was trained on mean and slope features extracted from the hemodynamic response during a mental arithmetic task. The classification results were compared with those of SVM, k-NN, an ANN with a 3-layer topology, and a CNN with 2 convolutional layers, 1 max-pooling layer, and 4 dense layers. The developed LSTM outperformed the other models, achieving an accuracy of 89.01 %, followed by the CNN with 87.45 %. The CNN used in their work contains very few layers compared with well-known CNN architectures such as VGG-16 or ResNet; a deeper CNN could have been used in this study to yield better results. Although the LSTM model outperformed the CNN in their work, given the lack of studies focusing on transforming time-series data into classification tasks, CNN models may still perform competitively against other DL methods in such settings.
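As a hedged illustration of this gated sequence modelling (not a reproduction of the architecture in Asgher et al.), a compact LSTM classifier over windows of fNIRS features might look as follows; all sizes are assumed for demonstration.

```python
# Sketch only: an LSTM consuming fNIRS windows as a time sequence so that
# the gates can model temporal dependencies a plain CNN ignores.
import torch
import torch.nn as nn

class FNIRSLSTM(nn.Module):
    def __init__(self, n_features=36, hidden=64, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                  # x: (batch, time, features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])    # classify from the last time step

model = FNIRSLSTM()
print(model(torch.randn(8, 100, 36)).shape)   # -> torch.Size([8, 4])
```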
6.3. Hybrid model trends in fNIRS analysis
Generally, ML methods are reliable when used for analysing smaller data sets or hand-crafted features. DL techniques, by contrast, tend to function as black boxes and perform feature extraction more efficiently through trainable parameters (E. Q. Wu et al., 2021). An increase in performance cannot be achieved solely by improving the mathematical model of an ML method or by increasing the number of neurons or hidden layers in DL models. Combining the two approaches to analyse the information in a data set leads to a hybrid model. We identified four studies on hybrid models for the classification of fNIRS signals. Most hybrid models in this review combine the convolutional operator of CNN layers with an RNN, LSTM, or GRU (Lu et al., 2020, Saadati et al., 2021, Wang et al., 2021). The main purpose of the CNN is to extract features, while the RNN, LSTM, or GRU handles temporal dependencies in the data. Combining both makes a good fit for extracting features from fNIRS signals while, at the same time, leveraging present and past data samples to learn the nature of workload patterns (a minimal sketch of such an architecture follows this paragraph). These hybrid models can classify the mental workload of subjects in the presence of noisy data and improve model performance by 10 % to 15 %, as indicated in the literature. Additionally, we identified a study that utilizes GANs for the analysis. Gt et al. [5] proposed a GAN-based network to classify fNIRS signals, specifically using a Convolutional Generative Adversarial Network (CGAN) to generate synthetic fNIRS signals. They also proposed a revised Inception Net (rIRN) to classify the fNIRS signals. The model was trained on real and synthetic features of size 160x10. The quality of the generated signals was evaluated through Maximum Mean Discrepancy (MMD), Structural Similarity Index Measure (SSIM), and Peak Signal-to-Noise Ratio (PSNR). The experiments revealed that enlarging the dataset up to two times increased the accuracy of the model, while further increases in dataset size decreased the accuracy. They also compared the performance of rIRN with IRN and CNNs of different depths, noting that each model had a similar effect on accuracy, but rIRN yielded the highest accuracy. For the distribution of the dataset, neither k-fold nor LOOCV cross-validation was applied.
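The following sketch, with channel counts and layer sizes assumed for illustration rather than taken from any cited study, shows the convolution-then-recurrence pattern: convolutional layers extract local features per time step, after which an LSTM models their temporal dependencies.

```python
# Minimal hybrid CNN+LSTM sketch in the spirit of the reviewed hybrid models
# (not a reproduction of any of them); shapes are illustrative assumptions.
import torch
import torch.nn as nn

class HybridCNNLSTM(nn.Module):
    def __init__(self, n_channels=36, n_classes=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=5, padding=2),  # local feature extraction
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(32, 64, batch_first=True)              # temporal dependencies
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):                      # x: (batch, channels, time)
        z = self.conv(x)                       # (batch, 32, time/2)
        z = z.transpose(1, 2)                  # (batch, time/2, 32) for the LSTM
        out, _ = self.lstm(z)
        return self.head(out[:, -1, :])

print(HybridCNNLSTM()(torch.randn(8, 36, 100)).shape)   # -> torch.Size([8, 2])
```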
Most of the studies presented so far have tested their collected data sets using various classification algorithms. It is inappropriate to single out the best algorithm by comparing accuracy metrics or feature extraction methods alone, as each study has its own architecture design, input processing method, and unique feature selection technique. Identifying the best classification algorithm is challenging because researchers typically evaluate the validity of an algorithm by applying more than one ML or DL method to their data samples and selecting the most suitable one. Nevertheless, the analysis provided in this review helps reveal future research directions for ML/DL-based algorithms in cognitive load analysis.
This paper presents a comprehensive overview of research methodologies employing ML and DL approaches for the classification of cognitive load. We identified 45 experimental studies that utilized fNIRS signals to discern varying levels of cognitive load. As a preliminary step in our systematic review, we identified the cognitive tasks used in each of the sampled studies. A spectrum of cognitive tasks was observed, with some studies incorporating traditional paradigms such as n-back tasks, Stroop tasks, and mental arithmetic tasks. Additionally, a noteworthy aspect of the investigated literature revealed a divergence, with certain studies devising unique tasks related to activities such as flying, driving, and game-based scenarios. A consistent finding across all these studies is the observed correlation between increased cognitive load and heightened cortical activation in the brain. This aligns with the conceptual framework of CLT, substantiating the premise that cognitive load escalates proportionally with the demands imposed by the task at hand. The results underscore the robustness of fNIRS signals as indicators of cognitive load.
One prevalent issue encountered in the application of ML methods is the inherent challenge associated with data requirements, often necessitating larger datasets than traditional methods to attain comparable performance levels. DL, in turn, has become a paradigm shift that simplifies the fNIRS signal processing pipeline and turns it into an end-to-end task. This shift holds significant promise, simplifying the intricacies associated with data processing and analysis. The integration of deep learning techniques has the potential to transform cognitive load classification, not only by mitigating the challenges posed by fNIRS data but also by offering a more efficient approach to signal processing.
To move beyond the competition among various methodologies, to provide a comprehensive framework for directing future endeavors in automated cognitive load inference, and to address certain peculiarities associated with fNIRS data (summarized in Fig. 10), it becomes imperative to elucidate a distinct approach to the construction of cognitive load inference pipelines. These considerations imply a specific set of guidelines and methodologies that should be incorporated into the design and implementation of AI-based algorithms for cognitive load inference.
Fig. 10. Challenges associated with cognitive load analysis with fNIRS data.
The classification of various brain activities relies heavily on specific features extracted from hemodynamic signals. Currently, many studies utilize ML techniques to classify varying levels of mental workload effectively using fNIRS data. By isolating features that closely align with the characteristics of a particular class and significantly differ from those of other classes, the classification process becomes more effective at capturing distinctions in hemodynamic signals (L. Wu, Liu, Ward, Wang, & Chen, 2023). However, the substantial dimensionality of fNIRS data presents a significant challenge due to the numerous fNIRS channels, introducing the well-known machine learning issue of the curse of dimensionality. In the fNIRS domain, researchers often lack comprehensive knowledge about relevant features, leading to the inclusion of numerous candidate features to better represent the domain. Hemodynamic signals, such as HbO2, dHb, and HbT, provide a wide array of choices for feature selection due to their capacity to encompass pertinent information regarding brain activities (Z. Wang, Fang, & Zhang, 2023). Different combinations of such features provide the necessary discriminatory information for classification. Feature selection also depends on the individual activity; the mean, peak, variance, skewness, kurtosis, and slope values of HbO2, dHb, and HbT have frequently been used in fNIRS studies. In the initial stages of fNIRS research, investigators typically computed the concentration changes of hemoglobin oxygenation throughout the task period (Murata, Sakatani, Katayama, & Fukaya, 2002), presenting time-series data of cerebral oxygenation changes for visual inspection. However, such approaches are susceptible to error, especially as noise and interference levels increase. To address this, various statistical analysis methods have been applied to enhance the accuracy and reliability of feature extraction from fNIRS signals.
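For illustration, the snippet below computes the statistical features listed above (mean, peak, variance, skewness, kurtosis, and slope) for every channel of a single window; the window layout and sampling parameters are assumptions.

```python
# Illustrative feature computation; assumed window layout: samples x channels.
import numpy as np
from scipy import stats

def window_features(window: np.ndarray) -> np.ndarray:
    """window: (n_samples, n_channels) of HbO2, dHb, or HbT concentration changes."""
    t = np.arange(window.shape[0])
    slope = np.polyfit(t, window, deg=1)[0]          # per-channel linear trend
    feats = [
        window.mean(axis=0),
        window.max(axis=0),                          # peak value
        window.var(axis=0),
        stats.skew(window, axis=0),
        stats.kurtosis(window, axis=0),
        slope,
    ]
    return np.concatenate(feats)                     # 6 features per channel

demo = np.random.randn(100, 36)                      # e.g. a 10 s window over 36 channels
print(window_features(demo).shape)                   # -> (216,)
```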
In the literature, various methods have been proposed to extract cortical activities from fNIRS data, primarily utilizing changes in HbO2. Commonly employed statistical techniques in fNIRS studies include the Wilcoxon signed-rank test, the Shapiro-Wilk test, the t-test, and ANOVA (Bak et al., 2022, Durantin et al., 2016, Keles et al., 2021, Khalil et al., 2022). These methods compare differences between conditions with respect to the condition variance. To avoid assumptions about the exact shape or timing of the changes in HbO2 and dHb in response to stimuli, these approaches often take average values during the task period. Statistical tests on features extracted from fNIRS signals typically report a p-value, which indicates the level of significance. However, it is essential to recognize a potential issue associated with interpreting p-values. A p-value of 0.05, for instance, implies that there is a 5 % chance of obtaining the observed result if the null hypothesis were true. In simpler terms, if 100 statistical tests are conducted and the null hypothesis is true for all of them, it is expected that, by chance, 5 of them will be deemed significant at the p < 0.05 level. This phenomenon underscores the importance of cautious interpretation of p-values, as the probability of obtaining significant results by chance increases with the number of statistical tests performed.
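A hedged example of such channel-wise testing, together with a simple Bonferroni correction for the multiple-comparisons issue just described, is sketched below; the data arrays are placeholders and the correction is one common choice, not the procedure of any particular study.

```python
# Paired t-tests of mean HbO2 between a task and a rest condition for every
# channel, followed by a Bonferroni-corrected significance threshold.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
task = rng.normal(0.2, 1.0, size=(20, 36))   # 20 subjects x 36 channels (task-period means)
rest = rng.normal(0.0, 1.0, size=(20, 36))   # matching rest-condition means

t_vals, p_vals = stats.ttest_rel(task, rest, axis=0)   # one paired test per channel
alpha = 0.05
significant = p_vals < alpha / p_vals.size             # Bonferroni-corrected threshold
print(f"{significant.sum()} of {p_vals.size} channels significant after correction")
```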
Additionally, the GLM is a popular and adaptable analytical technique for examining fNIRS signals at the individual and group levels (Y. Zhang et al., 2022, Zhang et al., 2022). Because of its adaptability to both quantitative and qualitative independent variables, it is well-suited to capturing the complex dynamics of cognitive processes. In fNIRS studies, the GLM plays an important role in analyzing the functional timeline of the data, aligning it with the actual hemodynamic response observed in the brain. The functional timeline in GLM analyses involves tracking variations in the HbO2 and dHb signals over time. The method involves multiple regression analyses, in which the GLM is expressed as a linear combination of regressors to predict or explain a related variable. In fNIRS studies, these regressors are carefully selected to represent various experimental conditions or cognitive states, allowing a comprehensive examination of the underlying neural processes. In addition to the conventional feature extraction methods mentioned earlier, researchers in brain activity classification have explored alternative approaches, incorporating frequency-domain features, for example wavelet-based features, the Haar wavelet, and the Wigner-Ville distribution, to reveal distinct patterns in hemodynamic signals. Frequency-domain features are commonly applied in signal processing to analyze time-series data by decomposing it into different frequency components. In fNIRS studies, frequency-domain analysis has been employed to extract features that capture temporal variations in hemodynamic signals. This approach allows the identification of specific frequency components associated with different cognitive processes, contributing valuable information for classification tasks. Furthermore, some researchers have proposed their own unsupervised feature extraction methods, introducing novel techniques to capture unique aspects of brain activity. These methods often aim to identify patterns or features that may not be apparent through traditional approaches, enhancing the richness of information available for classification.
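The regression at the core of the GLM approach described above can be sketched compactly as follows; the boxcar regressor and data are placeholder assumptions, and real analyses typically convolve the regressor with a hemodynamic response function before fitting.

```python
# Compact GLM sketch: each channel's HbO2 time course is modelled as a linear
# combination of task regressors; the fitted betas summarise condition-related
# activation. Data are synthetic placeholders.
import numpy as np

n_samples, n_channels = 600, 36
rng = np.random.default_rng(2)
hbo = rng.normal(size=(n_samples, n_channels))        # placeholder HbO2 time series

task_on = np.zeros(n_samples)
task_on[100:200] = 1.0                                # boxcar regressor for one condition
                                                      # (usually convolved with an HRF)
design = np.column_stack([task_on, np.ones(n_samples)])   # regressor + intercept

betas, *_ = np.linalg.lstsq(design, hbo, rcond=None)  # least-squares GLM fit per channel
print(betas.shape)                                    # -> (2, 36): one beta pair per channel
```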
Feature extraction methods, as described earlier, have found extensive application in cognitive load studies. Notably, these methods play a crucial role in the analysis of hemodynamic signals in fNIRS data. While traditional statistical techniques like the t-test and ANOVA have been prevalent in extracting features, advancements in ML have introduced DL methods that often bypass the need for explicit feature extraction due to their deep neural or convolutional architectures. Fig. 11 presents a comparative view of feature extraction strategies within both ML and DL frameworks for fNIRS studies. This figure not only illustrates the utilization of statistical feature extraction methods but also highlights the studies that opt for raw fNIRS data. Interestingly, the rise of DL methods has not eliminated the use of feature extraction in certain studies. Despite the inherent capability of deep neural networks to automatically learn hierarchical representations, there are instances where researchers have incorporated feature extraction methods into DL frameworks. This integration aims to enhance the interpretability of the model or to extract specific information from hemodynamic signals that may not be captured effectively by the neural network alone. It is noteworthy that each feature extraction method, whether traditional or novel, has its own set of advantages and limitations, and the selection of a particular method depends on the study's objectives and the characteristics of the dataset under consideration. Traditional statistical techniques like the t-test and ANOVA are known for their simplicity and ease of interpretation. They provide insights into the average values and variance of features during specific experimental conditions, aiding in the understanding of differences in brain activities. On the other hand, frequency-domain methods, including wavelet features, the Haar wavelet, and the Wigner-Ville distribution, offer a more comprehensive evaluation of the temporal and frequency characteristics of hemodynamic signals. These methods, applied in signal processing, allow the decomposition of time-series data into different frequency components. In fNIRS studies, frequency-domain analysis becomes particularly valuable as it enables the identification of specific frequency components associated with different cognitive processes. The coexistence of both traditional and novel feature extraction methods highlights the versatility and adaptability required in the field of brain activity classification. Researchers continue to explore and refine these techniques to address the challenges posed by the high dimensionality of fNIRS data, ensuring that the extracted features are not only relevant but also contribute meaningfully to the accurate classification of mental workload and other cognitive states.
Fig. 11. A Comparative view of feature extraction strategies in ML and DL fNIRS studies.
The evaluation of the classification performance of fNIRS data is typically conducted in an offline manner, utilizing pre-recorded datasets. Within the existing body of literature, a predominant trend emerges wherein researchers commonly employ either k-fold cross-validation (k-fold CV) or Leave-One-Out Cross-Validation (LOOCV) to measure the effectiveness of their models. In k-fold CV, the dataset is partitioned into k subsets or folds. The model is trained on k-1 of these folds and evaluated on the remaining one. This process is repeated k times, with each fold serving as the test set exactly once. The results are then averaged to provide a comprehensive performance metric that accounts for variations in the training and testing data. LOOCV, on the other hand, involves leaving out a single data point as the test set while training the model on the remaining dataset. This process is repeated for each data point in the dataset, ensuring that each instance serves as a test set exactly once. The final performance metric is derived by averaging the results across all iterations. LOOCV is particularly useful when dealing with smaller datasets, as it maximizes the use of available data for both training and testing. Examining the distribution of studies utilizing different cross-validation methods, Fig. 12 illustrates the prevalence of specific strategies within the research community. Notably, 10-fold cross-validation has been widely accepted and is frequently employed by researchers, followed by 5-fold, 8-fold, LOOCV, and 20-fold CV, each demonstrating varying degrees of adoption. Despite the popularity of specific cross-validation approaches, a noteworthy finding from the analysis is that 42 % of the studies do not explicitly mention the validation method employed.
Fig. 12. Distribution of studies using different CV Methods.
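The two validation schemes described above can be contrasted in a few lines; the classifier and data below are placeholders chosen only to demonstrate 10-fold CV against LOOCV.

```python
# Sketch of k-fold CV versus LOOCV with a generic classifier and synthetic data.
import numpy as np
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X, y = rng.normal(size=(60, 20)), rng.integers(0, 2, size=60)

kfold_acc = cross_val_score(SVC(), X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
loocv_acc = cross_val_score(SVC(), X, y, cv=LeaveOneOut())   # one fold per sample
print(f"10-fold: {kfold_acc.mean():.2f}, LOOCV: {loocv_acc.mean():.2f}")
```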
The fNIRS data exhibit inherent subject dependence and session dependence, characterized by substantial inter-subject and inter-session variabilities (Huang et al., 2021). Consequently, when a model is trained and tested on the same subjects or sessions, the performance results may significantly differ from those obtained when testing on new subjects or sessions that were not encountered during the training phase. To tackle the challenges posed by subject and session dependence, various evaluation techniques have been devised, including within-subject, subject-specific, subject-dependent, cross-subject, and subject-independent approaches. Within-subject (or subject-specific) methods involve training and testing on the same subject, focusing on individual variations. Cross-subject methods, on the other hand, involve training on one set of subjects and testing on a different set, aiming to generalize across individuals. Subject-independent methods are designed to create models that can be trained on one set of subjects and applied seamlessly to a completely new set, thus addressing the challenge of generalization.
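A hedged sketch of a subject-independent evaluation is given below: folds are formed by subject identifier so that no subject contributes trials to both the training and the test set; the data, grouping, and classifier are placeholder assumptions.

```python
# Subject-independent evaluation sketch: split folds by subject, not by trial.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, LeaveOneGroupOut

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 20))                # placeholder trial features
y = rng.integers(0, 2, size=200)              # placeholder workload labels
subjects = np.repeat(np.arange(10), 20)       # 10 subjects, 20 trials each

scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         groups=subjects, cv=LeaveOneGroupOut())
print(scores.mean())                          # accuracy on held-out, unseen subjects
```

LeaveOneGroupOut plays the role of a leave-one-subject-out split here; GroupKFold would be the analogous choice when the number of subjects is large.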
Despite the existence of these methodological advancements, a notable gap exists in the current literature on cognitive load and fNIRS classification. There is a lack of specific implementation of within-subject, subject-specific, subject-dependent and cross-subject methods in studies within this domain. The presence of significant inter-subject variability poses a significant challenge in the classification of cognitive load using fNIRS data. In the majority of studies, ML/DL models for cognitive load are commonly trained and tested using k-fold or LOOCV methodologies. This training approach is favored for its ability to yield higher classification accuracy (Y. Zhou et al., 2021, Zhou et al., 2021). However, a notable drawback is its limited generalization ability across different subjects. Despite the prevalence of k-fold and LOOCV methods in training models for cognitive load classification, there has been a lack of comparative analyses between these cross-validation techniques and subject-specific methods within the fNIRS community. Conversely, such evaluations have been undertaken in related domains, such as EEG and other physiological signal domains. To address this challenge effectively, future studies should prioritize adopting subject-specific methods that explicitly consider the individual characteristics of each subject in the training and testing phases.
The lack of explainability in fNIRS analysis poses a substantial hurdle in cognitive load research. While fNIRS is a valuable tool for capturing neural activity and understanding cognitive processes, studies frequently struggle to offer transparent explanations for their findings and the underlying mechanisms behind them. One common approach in fNIRS analysis involves employing traditional ML and DL techniques, treating AI as a black box without delving into the interpretability of the results. The prevalent utilization of traditional ML and DL methods without sufficient explainability limits our understanding of the cognitive load phenomena captured by fNIRS. While these approaches can yield accurate predictions or classifications based on fNIRS data, they often lack the ability to provide meaningful insights into the neural processes and features driving those predictions. Researchers in the field of neurology have used models based on CNNs, LSTMs, GANs, and autoencoders to analyze fNIRS data. However, a noticeable gap exists in the literature, as there is a lack of studies specifically dedicated to investigating the generalizability and interpretability of DL models in the cognitive load domain using fNIRS data.
To address this gap, it is essential to leverage layer-wise model explanation techniques in the analysis of fNIRS signals. These techniques offer valuable insights into the inner workings of deep learning models and provide a deeper understanding of the specific brain regions, functional connections, and neural patterns associated with cognitive processes. Several layer-wise model explanation techniques, such as Local Interpretable Model-agnostic Explanations (LIME) (Ribeiro, Singh, & Guestrin, 2016), Gradient-weighted Class Activation Mapping (GradCAM) (M. Han & Kim, 2019), and Layer-wise Relevance Propagation (LRP) (Bach et al., 2015), can be utilized in the analysis of fNIRS data. By applying these layer-wise model explanation techniques to fNIRS data, researchers can gain valuable insights into the underlying neural mechanisms of cognitive processes. These techniques enable the identification of specific brain regions, functional connections, and neural patterns that contribute to cognitive load, attention, memory, or other cognitive states. Additionally, these explanations can provide interpretable evidence for the predictions made by deep learning models, enhancing the understanding and trustworthiness of the results. Furthermore, combining these layer-wise model explanation techniques with traditional statistical analyses can lead to a comprehensive understanding of fNIRS. By integrating the strengths of both approaches, researchers can validate and interpret the findings in a more robust manner. This knowledge allows researchers to focus on the most informative regions or wavelengths in the brain, enabling a more targeted and interpretable investigation of cognitive load. Moreover, the development of hybrid models that combine traditional ML/DL approaches with XAI techniques holds promise for bridging the gap between accuracy and interpretability in fNIRS research. These models can retain the predictive power of ML/DL algorithms while providing transparent explanations for their outcomes.
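As one simple, hedged illustration of this direction, the sketch below uses plain input-gradient saliency (a lightweight stand-in for the LIME, GradCAM, or LRP procedures cited above) to attribute a classifier's decision to input channels and time points; the model and data are placeholders.

```python
# Input-gradient saliency sketch: the gradient of the winning class score with
# respect to the input window indicates which channels and time points most
# influenced the decision. Model and window are synthetic placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(                      # placeholder fNIRS classifier
    nn.Flatten(), nn.Linear(36 * 100, 64), nn.ReLU(), nn.Linear(64, 2)
)
window = torch.randn(1, 36, 100, requires_grad=True)   # one fNIRS window

logits = model(window)
score = logits[0, logits.argmax()]          # score of the predicted class
score.backward()                            # populate window.grad

channel_relevance = window.grad.abs().mean(dim=2).squeeze()  # mean |gradient| per channel
print(channel_relevance.shape)              # -> torch.Size([36])
```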
Various classifiers have been utilized in conjunction with ML algorithms to address the task of classification or labeling and to train systems in quantifying different levels of cognitive workloads. The classification of cognitive load using fNIRS data has been explored through diverse machine learning algorithms, including SVM, k-NN, LDA, and Random Forests. SVM, known for its simplicity and high accuracy, is extensively employed in fNIRS signal analysis, demonstrating its effectiveness in mental workload classification. The k-NN classifier is notable for its shorter training time, albeit with increased computational demands during classification. LDA's simplicity and low computational requirements are acknowledged, but its linear nature poses challenges in handling non-linear fNIRS signals. Random Forests are praised for their capacity to handle high-dimensional and non-linear data, with studies reporting success in mental workload classification.
Deep learning models, with a specific focus on CNNs and LSTM networks, are also discussed. CNNs, originally designed for image inputs, are explored for their ability to transform fNIRS signals into images and classify hemodynamic changes in the brain. The advantages of CNNs, such as high accuracy, are contrasted with their limitations, including data size requirements and potential overfitting. LSTM, addressing temporal dynamics, is highlighted for outperforming other machine learning methods in certain studies.
In the past few years, newer DL architectures such as GhostNet (K. Han et al., 2020), Densenet (Y. Zhu & Newsam, 2017), and Capsule Net (Sabour, Frosst, & Hinton, 2017) have gained attention for their improved robustness, optimization, and better generalization capabilities compared to earlier models. These architectures have shown success in various computer vision tasks, but their potential in the context of cognitive load classification using fNIRS signals remains largely unexplored. Furthermore, the recent rise of transformer-based models, originally designed for natural language processing tasks, introduces a new dimension to DL. Transformers, with their attention mechanisms, have demonstrated superior generative AI capabilities compared to traditional architectures like GANs. The attention mechanisms in transformers allow them to capture complex relationships in data, making them potentially advantageous for tasks involving intricate patterns, such as those found in cognitive load studies. It is imperative to evaluate these modern DL architectures and transformer-based models specifically in the field of cognitive load classification using fNIRS technology. Their enhanced capabilities in handling complex relationships and capturing patterns may lead to improved accuracy and interpretability in understanding neural activity associated with cognitive processes. As the field of cognitive load research continues to evolve, embracing these newer DL architectures and transformer-based models can contribute to a more comprehensive understanding of the brain's response to cognitive tasks, offering novel insights into the intricacies of cognitive load classification with fNIRS data.
The main limitation of this article is that it is focused on the theme of AI and cognitive load in relation to fNIRS only, whereas areas such as motor imagery, stress, and emotion recognition have been excluded. The main reason for excluding these areas is that they are either very broad or have been reviewed previously. Future work to improve the interpretation of AI models and to develop clinically applicable metrics will be necessary to translate AI models into daily use.
In this review, we explore the feasibility of using fNIRS indices to quantify mental workload during various cognitively demanding tasks. The availability of open-source libraries has made it possible for the scientific community to design DL architectures with relative ease. In DL studies, the trend of researchers using their own data sets has increased. Moreover, fNIRS signals are highly affected by age, gender, demographics, and the size of the data set (Huang et al., 2021). The studies presented so far consider a limited number of participants and an unequal gender distribution, as shown in Table 2. Besides that, the analysis performed on these data sets is based on the general interpretation of fNIRS signals; hence, it is difficult to compare model performance across the various metrics used in the published studies.
Despite astonishing developments in AI, research on fNIRS is still in an early development phase. The relationships between different brain regions across different cognitively demanding tasks still need further investigation. A few studies suggest that neural activations are higher in the left pre-frontal regions during cognitively demanding tasks, while others suggest that features from the right pre-frontal regions are best suited for DL analysis (Derosiere et al., 2014, Keles et al., 2021, Kornev et al., 2022). The list of challenges mentioned in Section 7 is valid not only in the field of neurology but also in other health domains. AI has become an increasingly popular topic of research in recent years, especially in relation to cognitive load. The majority of the cognitive load articles reviewed in this study focused on emerging AI technology and were published within the past three years. Almost all studies that compare DL with ML, or that use raw data instead of handcrafted features, reported a small but meaningful improvement. We observed that there is scope for improvement in modelling and designing DL models because almost all of the studies use their own dataset to benchmark AI models. Reluctance to share data or model architectures limits the scope of this work to small-scale projects.
A wide variety of both ML and DL models for analyzing fNIRS signals have been proposed so far, which makes it difficult to identify the best-performing models due to the lack of comparisons provided in publications. Delayed responses in fNIRS signals make it difficult to synchronize them with online analysis; the studies presented so far mostly emphasize feature selection and classification on an offline basis. The next big leap in fNIRS research could be automation using DL models, and AI is likely to advance the neurosciences in the near future. Research institutions should provide demographic-rich (age, gender, race) fNIRS data in a standardized format without compromising the privacy of participants. Advancements in portable, wearable fNIRS sensors will effectively reduce measurement errors. The availability of data will also help researchers design optimized model architectures that can be deployed to mobile devices using tools such as TensorFlow Lite. This would enable neuroscientists to develop real-time applications using inexpensive and portable fNIRS devices.
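As an illustration of the deployment path mentioned above, the sketch below converts a hypothetical Keras classifier to TensorFlow Lite with the standard converter API; the model is a placeholder, not one from any reviewed study.

```python
# Exporting a placeholder fNIRS classifier to TensorFlow Lite for mobile use.
import tensorflow as tf

model = tf.keras.Sequential([                 # hypothetical stand-in model
    tf.keras.layers.Input(shape=(36, 100)),   # assumed channels x samples window
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # optional post-training optimization
tflite_model = converter.convert()

with open("fnirs_workload.tflite", "wb") as f:         # hypothetical output file name
    f.write(tflite_model)
```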
9. Conclusion
fNIRS is an important tool that can be used to classify cognitive load in human performance tasks. This study reviewed ML/DL methods used in the assessment of cognitive load by following the PRISMA protocol. In this paper, we reviewed studies that applied DL-based classification methods to fNIRS signals collected from participants during n-back tasks, Stroop tasks, and simulated game-based tasks. The model architectures in the reviewed studies vary significantly depending on the input formulation and the task under consideration. These architectural differences can have a significant impact on a model's performance and the overall effectiveness of the AI system. This article has pointed out the key strengths of ML/DL algorithms and surveyed the major achievements and limitations of state-of-the-art ML/DL approaches for fNIRS signals. By analyzing 45 articles that utilized ML/DL models to classify cognitive load based on fNIRS data, it was concluded that more than 70 % of the studies applied CNNs, either directly or as part of a hybrid architecture, to fNIRS signals. It was inferred that most researchers have adopted feature extraction techniques to leverage the full potential of ML/DL models, while some also utilized convolutional layers to extract local features from the data. Feature extraction methods ensure that the input is readily usable for model training. A few studies also indicate that whether features are extracted from the left or right prefrontal cortex can affect a model's accuracy. AI models can be trained using various methods, but the efficiency of a model depends on the quality of preprocessing applied to the fNIRS signals. Although DL algorithms are computationally expensive, they outperform ML algorithms while demanding less pre-processing. We highlighted that future investigation of DL models in the domain of cognitive load should aim not only at improving the accuracy of models but also at inspecting practical aspects such as robustness, explanation, and optimization.
We found that hybrid models generally achieve better performance than traditional models and have more potential to accurately classify different levels of mental workload. Hybrid models incorporating convolutional layers with recurrent layers are able to outperform conventional methods. We recommend an in-depth investigation of hybrid models, particularly the number and arrangement of convolutional, fully connected, and recurrent layers. As cognitive studies focus merely on the objective and system paradigms, no classification technique can be declared the best option for general use. Several challenges have been identified in the literature, including model interpretability and feature engineering. We expect that AI has the potential to meet these challenges by transferring the latest advances in DL technologies to the massive multi-modal data of fNIRS signals.
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
- Anderson et al., 2011
- Asadi et al., 2019
- Bak et al., 2022: Bak, S., Yeu, M., & Jeong, J. (2022). Forecasting unplanned purchase behavior under buy-one get-one-free promotions using functional near-infrared spectroscopy. Computational Intelligence and Neuroscience, 2022.
- Balasundaram et al., 2023
- Benerradi et al., 2019
- Berivanlou et al., 2016
- Çakır et al., 2016
- Dong and Jeong, 2018
- Durtschi et al., 2021
- Fujikawa et al., 2022
- Howell-Munson et al., 2023
- Izzetoglu et al., 2021: Izzetoglu, M., Jiao, X., & Park, S. (2021). Understanding driving behavior using fNIRS and machine learning. Paper presented at the International Conference on Transportation and Development 2021.
- Khan et al., 2023
- Liu et al., 2020: Advances in Neural Information Processing Systems, 33 (2020), pp. 13539-13550.
- Liu et al., 2022: A characterization of brain area activation in orienteers with different map-recognition memory ability task levels, based on fNIRS evidence. Brain Sciences, 12(11) (2022), 1561.
- Lu et al., 2020
- Naseer et al., 2016: Naseer, N., Qureshi, N. K., Noori, F. M., & Hong, K.-S. (2016). Analysis of different classification techniques for two-class functional near-infrared spectroscopy-based brain-computer interface. Computational Intelligence and Neuroscience, 2016.
- Reddy et al., 2022
- Ribeiro et al., 2016: Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "Why should I trust you?" Explaining the predictions of any classifier. Paper presented at the Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
- Saikia et al., 2021: Saikia, M. J., Kuanar, S., Borthakur, D., Vinti, M., & Tendhar, T. (2021). A machine learning approach to classify working memory load from optical neuroimaging data. Paper presented at Optical Techniques in Neurosurgery, Neurophotonics, and Optogenetics.
- Touhid et al., 2023: Touhid, T. I., Anam, M., Alam, M. R., Foysal, M., & Shaiham, S. (2023). Study on accuracy improvement of mental arithmetic task classification using different classifiers with DWT feature extraction method. Paper presented at the 2023 International Conference on Electrical, Computer and Communication Engineering (ECCE).
- Wang et al., 2021: Wang, L., Huang, Z., Zhou, Z., McKeon, D., Blaney, G., Hughes, M. C., & Jacob, R. J. (2021). Taming fNIRS-based BCI input for better calibration and broader use. Paper presented at the 34th Annual ACM Symposium on User Interface Software and Technology.
- Wilson et al., 2021: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 5(1) (2021), pp. 1-35.
- Wu et al., 2021: Scalable gamma-driven multilayer network for brain workload detection through functional near-infrared spectroscopy. IEEE Transactions on Cybernetics (2021).
- Zaman and Islam, 2021: Graph. Signal Process, 13 (2021), pp. 1-13.
- Zhao et al., 2022: International Journal of Human-Computer Interaction (2022), pp. 1-16.
- Zhou et al., 2020: ACM Transactions on Human-Robot Interaction (THRI), 9(2) (2020), pp. 1-26.