Neural Processing Letters (2024) 56:177 https://doi.org/10.1007/s11063-024-11526-y Multi-step Transfer Learning in Natural Language Processing for the Health Domain Thokozile Manaka1 · Terence Van Zyl2 · Deepak Kar3 · Alisha Wade4 Accepted: 8 January 2024 / Published online: 20 May 2024 © The Author(s) 2024 Abstract The restricted access to data in healthcare facilities due to patient privacy and confidentiality policies has led to the application of general natural language processing (NLP) techniques advancing relatively slowly in the health domain. Additionally, because clinical data is unique to various institutions and laboratories, there are not enough standards and conventions for data annotation. In placeswithout robust death registration systems, the cause of death (COD) is determined through a verbal autopsy (VA) report. A non-clinician field agent completes a VA report using a set of standardized questions as guide to identify the symptoms of a COD. The narrative text of the VA report is used as a case study to examine the difficulties of applying NLP techniques to the healthcare domain. This paper presents a framework that leverages knowledge across multiple domains via two domain adaptation techniques: feature extraction and fine-tuning. These techniques aim to improve VA text representations for COD classification tasks in the health domain. The framework is motivated by multi-step learning, where a final learning task is realized via a sequence of intermediate learning tasks. The framework builds upon the strengths of the Bidirectional Encoder Representations from Transformers (BERT) andEmbeddings fromLanguageModels (ELMo)models pretrained on the general English and biomedical domains. These models are employed to extract features from the VA narratives. Our results demonstrate improved performance when initializing the learning of BERT embeddings with ELMo embeddings. The benefit of incorporating B Thokozile Manaka thokozilemanaka@gmail.com Terence Van Zyl tvanzyl@uj.ac.za Deepak Kar deepak.kar@wits.ac.za Alisha Wade Alisha.Wade@wits.ac.za 1 School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, Gauteng, South Africa 2 Institute for Intelligent Systems, University of Johannesburg, Johannesburg, Gauteng, South Africa 3 School of Physics, University of the Witwatersrand, Johannesburg, Gauteng, South Africa 4 MRC/Wits Rural Public Health and Health Transitions Research Unit, School of Public Health, University of the Witwatersrand, Johannesburg, Gauteng, South Africa 123 http://crossmark.crossref.org/dialog/?doi=10.1007/s11063-024-11526-y&domain=pdf 177 Page 2 of 26 T. Manaka et al. character-level information for learning word embeddings in the English domain, coupled with word-level information for learning word embeddings in the biomedical domain, is also evident. Keywords Transfer learning · Verbal autopsy · Natural language processing · Text classification · Feature extraction · Fine tuning 1 Introduction Most underdeveloped and developing countries lack robust death registration systems, and more than half of the 60 million annual deaths go unrecorded because they occur outside medical facilities [1, 2]. A verbal autopsy (VA) is a tool that can offer information about a cause of death (COD) in these places. Two parts make up a VA report: structured data and unstructured data. The structured data is made up of quantitative features like age and binary features, which are “yes” and “no” responses to disease symptoms. An open-ended narrative text outlining events leading up to death makes up the unstructured part. The adoption of natural language processing (NLP) techniques in the automation of coding of textual data has advanced NLP applications in the English domain. Still, these advance- ments have seen slower progression in the medical domain [3]. This is caused by restricted access to health information due to patient privacy and confidentiality policies. Addition- ally, clinical data lacks annotation conventions and standards, as it varies across different institutions and laboratories [4]. In transfer learning, knowledge from domains, languages and tasks where data are abun- dant can be used in domains where data are limited via adaptation techniques of feature extraction and fine-tuning [5–8]. Kim [9] showed that language modelling has been widely adopted as a source task for transfer learning and has helped advance NLP techniques. Lan- guage models possess knowledge about how language is structured and represented, and several NLP tasks share common knowledge about linguistic representation. This shared knowledge can inform each other on semantics and syntax of language [10]. This study presentsMulti-Step Transfer Learning, a framework that improves the text clas- sification task in the health domain. The model builds upon NLP transfer learning techniques of ELMo (Embeddings from Language Models) and BERT (Bidirectional Encoder Repre- sentations from Transformers). VA embeddings learned from ELMo trained in the English domain are used to initialize the learning of VA embeddings BERT trained in the biomedical domain. The resultant embeddings are used for the downstream task of COD classification. This work is structured in the following way: A review of earlier works on the automation of COD from VA reports is presented in Sect. 2. Techniques that handle a class imbalance in NLP applications are also discussed here. Section 3 gives a comprehensive description of the data and introduces the experimental design of the proposed technique. This section also outlines the Multi-Step Transfer Learning technique’s parameter settings and performance evaluation measures. Section 4 discusses the experimental findings and limitations of the study, while Sect. 5 concludes the study and outlines the planned future research directions. 2 Background Clinical natural language processing (NLP) is rapidly advancing in healthcare and medical research. It involves applying NLP techniques to clinical and biomedical texts, such as elec- 123 Multi-Step Transfer Learning... Page 3 of 26 177 tronic health records (EHRs), medical literature, and other healthcare documents to extract meaningful information to improve healthcare outcomes. Medical images, like radiology and pathology images, along with their reports, also play a crucial role in clinical diagnosis and treatment [11, 12]. However, creating medical reports, typically paragraphs detailing normal and abnormal findings, can be time-consuming and error-prone for both experienced and inexperienced radiologists [13]. Liu et al. [14] show that existing medical report-generation techniques often rely on supervised approaches, requiring paired image-report data, which can be resource-intensive in the medical field. To address this, they proposed an unsupervised Knowledge Graph Auto- Encoder model that utilizes independent sets of images and reports during training. This model establishes a shared latent space through a knowledge graph, connecting visual and textual domains. A “patient instruction” (PI) is a set of important directions given to both caregivers and patients when they leave hospital. Liu et al. [15] proposed a novel task of automatic PI generation, built a PI dataset, and presented a deep-learning approach named Re3Writer, which imitates physicians working patterns to automatically generate a PI at the point of discharge from the hospital. The field of question-answering (QA) has been transformed by recent advancements in large language models, but evaluating LLMs in the medical field is challenging due to a lack of standardized datasets [16]. Existing medical datasets [17–19] for LLM evaluation often have limitations of size that hinder thorough assessments. Many are sourced from potentially biased online forums customer service feedback surveys and lack diversity, especially in non-English languages due to resource inequality in NLP [20–22]. Overall, the insufficiency of well-curated evaluation datasets has impeded the evaluation of LLMs in the medical domain. In response to this, Liu et al. [23] introduced CMExam, a dataset derived from the Chinese National Medical Licensing Examination, serving as a benchmark for LLM performance in medical question-answering tasks, including answer prediction and reasoning. Liu et al. [24] emphasizes that the dependence of the majority of neural networks on supervised learning means their effectiveness is impacted by the accessibility and quality of labeled data. This poses a particular challenge for rare conditions such as emerging pan- demics. The Medical multi-modal large language model (Med-MLLM) was introduced as a solution for learning radiograph representations from unlabeled data. Experiments of it on COVID-19 datasets showed it’s adaptability to rare diseases, and it’s efficiency in handling both visual (X-rays, CT scans) and textual (medical reports) information [24]. With transfer learning [25], information learned from domains with large datasets can be applied to tasks, languages, or domains with smaller datasets. There are two steps to transfer learning: pretraining, where general-purpose language representations are learned, and adaptation, where the learned features are applied to a new task or domain. Clinical text representation models based on transfer learning have been developed in the health domain to boost NLP tasks of mortality prediction and hospital readmission. These models include ClinicalBERT [26], which models hospital readmission from clinical notes, MeDAL [27], a hugemedical text sample compiled for abbreviation extraction formedical domainpretraining and the Publicly Available Clinical BERT Embeddings [28]. There is also the BERT model applied on clinical notes and discharge summaries, BioBERT [29] and BioELMo [30], the ELMo model applied to biomedical literature and SciBERT [31], the BERT model which is trained on scientific literature. The datasets used to train thesemodels include records from clinics and hospitals,MIMIC- II and MIMIC-III (Medical Information Mart for Intensive Care) [32]. Other datasets are 123 177 Page 4 of 26 T. Manaka et al. from PubMed,1 a biomedical literature database that provides abstracts of published articles and PubMed Central (PMC),2 a full-text repository provides the full text of the database’s publications. The target tasks include disease prediction, diagnosis, prediction of hospital readmission, and prediction of mortality. These tasks are related to patient information from hospital care, such as vital signs, clinical features, medications, and investigations, as docu- mented by clinical care providers The differences and similarities between the source and target tasks have been identified as critical properties influencing the performance of domain adaptation approaches involv- ing feature extraction and fine-tuning. While both techniques have been found to perform similarly in the English domain, their performance changes when the training objectives and target tasks are either relatively similar or significantly different [33]. In the health domain, the same holds true, where both BioELMo and BioBERT representations have demonstrated effectiveness in biomedical tasks such as named entity recognition (NER) and natural lan- guage inference (NLI). In these tasks, BioELMo has exhibited superior performance as a feature extractor compared to BioBERT [34]. Jin et al. [34] attribute this to BioELMo’s efficacy in encoding entity types and biomedical relationship details, such as correlations between symptoms and diseases. In the general domain, ELMo has proven to be a superior feature extractor for cases involving similar source and target tasks, while BERT (fine-tuning) excels when dealing with distinct source and target tasks [33]. 2.1 Multi-source Domain Adaptation Zhao et al. [35] indicated that transferring models directly between different domains causes severe performance decline due to domain shift. Domain shift [36, 37] refers to the situation where there are differences in the joint probability distributions of observed data and labels between two domains. Domain adaptation is a paradigm that aims tomitigate the effects of domain shifts between the source and target domains. One way it does this is by aligning the source and target domains. Multi-source domain adaptation (MSDA) is a powerful extension of this concept, and it leverages knowledge from multiple sources with diverse distributions to an unlabeled target domain. Having multiple source domains available for training in real-life applications is not unexpected. Thus, in this case, utilizing only one source domain for training would seem inefficient. The typical approach is to treat all the sources as a single source, disregarding their differences. Another alternative is to train a classifier for each source and then combine these classifiers [38]. The study also demonstrated that applying risk minimization princi- ples allows for assigning weights to base models, enabling the combination of multiple base models to enhance the performance accuracy in a new domain [39]. Reimer et al. [40] showed that this task poses a challenge due to the substantial domain shift that occurs not only between the target and source domains but also among the various source domains. These variations can potentially interfere with each other during the learning process. Sun et al. [39] also showed that while using abundant training data can benefit systems, conflicting properties in data sources can lower a single model’s performance. The authors suggested that while training separate systems would be ideal, data sources often have shared characteristics despite their differences. A single system hides differences, while separate 1 https://pubmed.ncbi.nlm.nih.gov/. 2 https://www.ncbi.nlm.nih.gov/pmc/. 123 https://pubmed.ncbi.nlm.nih.gov/ https://www.ncbi.nlm.nih.gov/pmc/ Multi-Step Transfer Learning... Page 5 of 26 177 systems ignore similarities. Given that the source and target domains differ and contain domain-specific and common features, the authors demonstrated that it is possible to establish mappings from the original feature space to a latent feature space shared between the domains. Sun et al. [39] surveyed earlier MSDA methods that mainly focused on shallow models, and themodels were grouped into those that learned a latent feature space for various domains [41], and those that combined pre-learned classifiers [42]. The latest deep learning MSDA can be categorized into two groups depending on the techniques employed for alignment: latent space transformation and intermediate domain generation. Latent space transformation techniques align the target and source feature fea- tures to make them appear similar to discriminators. The primary goal of these methods is to confuse the discriminator, preventing it from accurately determining whether the fea- tures originated from different sources or were sampled from the same distribution [35]. Other latent space transformation techniques directly quantify the differences between latent spaces, representing features across domains. They accomplish this by optimizing specific discrepancy losses, such as maximum mean discrepancy and the Renyi-divergence [43]. Intermediate domain generation techniques aim to overcome the limitation of feature- level alignment, particularly in computer vision. This has been demonstrated in their ability to align only high-level information, which may not be adequate for precise predictions like pixel-wise semantic segmentation [44]. Zhao et al. [35] illustrated that this challenge can be overcome by generating an intermediate adapted domain using GANs and achieving pixel-level alignment. Existing deep learning architectures for MSDA have predominantly concentrated on scenarios involving a single-target domain. Li et al. [45] also demonstrated that existing CNN-basedmethodswere primarily designed for single-task applications. The tasks of image segmentation [46] and landmark localization [47] are significant in diagnosing and treating knee-related illnesses. Given the intricate nature of the 3D knee MRI analysis problem, encompassing both image segmentation and landmark localization tasks, which play critical roles in diagnosing and treating knee diseases, these techniques were found to be insufficient. The authors designed a Spatial DependenceMulti-Task Transformer, which incorporates spa- tial encoding into the features and introduces amulti-head attentionmechanism that combines tasks. This attention mechanism comprises two types of attention heads: inter-task attention heads, which manage spatial interdependence between tasks, and intra-task attention heads, which handle correlations within individual tasks. Wan and Jiang et al. [48] presented TransCrispr. This hybrid deep neural network com- prises four components: Embedding, CNN, Transformer, and aMultilayer Perceptron (MLP) with Fully Connected layers for the prediction of CRISPR/Cas9 single guide RNA cleav- age efficiency. MSDA has also been shown to enhance the COD classification task in VA reports [49]. In their study, Manaka et al. [49] demonstrated that better performance could be achieved by combining the English and biomedical domains for learning representations from the VA corpus. They incorporated character-level information for learning VA embed- dings in the English domain and word-level information for learning VA embeddings in the biomedical domain, resulting in improved results compared to using embeddings from these domains separately. 2.2 ELMo ELMo is a context-dependent language model that generates word embeddings by consider- ing the word’s context in both directions by using a shallow bidirectional LSTM architecture. 123 177 Page 6 of 26 T. Manaka et al. ELMo tends to learn more generic linguistic features such as syntax, semantics and some contextual information, but it might not capture fine-grained contextual details as effectively as BERT. ELMo makes use of character-level information in addition to word-level information. It uses character embeddings to represent each character in a word and employs a CNN to process the sequences of character embeddings. The CNN operates over a fixed-sized sliding window across the character embeddings. This window captures local patterns and interactions between characters. These character embeddings are combined to form word representations, which are then used as inputs for the bidirectional language model [50]. Character-level information has been shown to improve text classification models com- pared to word-level information and has been used for some time in developing word embeddings. ELMo is one such model that uses character-level information. ELMo, with its variants like BioELMo, therefore, can capture syntactic, morphological, and orthographic information at the character level, enhancing the model’s generalization on both frequent and unseen words. Additionally, they offer the benefit of representing out-of-vocabulary words and misspelled terms [51]. They can also learn long-term context dependency, critical for VA reports where each case typically comprises three to five related sentences. 2.3 BERT The core make-up of a BERT model is based on the Transformer architecture. The vanilla Transformer [52] is a sequence-to-sequence model and consists of an encoder and a decoder, each of which is a stack of L identical blocks. Each encoder block comprises a multi- head self-attention module and a position-wise feed-forward network (FFN). For building a deeper model, a residual connection [53] is employed around each module, followed by a layer normalization [54] module. Compared to the encoder blocks, decoder blocks insert cross-attentionmodules between the multi-head self-attention modules and the position-wise FFNs. Furthermore, the self-attention modules in the decoder are adapted to prevent each position from attending to subsequent positions. In contrast to ELMo, BERT’s training objective is a masked language modeling, which involves randomly replacingwords in a phrase with a particular token and using a transformer to create a prediction for the token, taking into account the unmasked words surrounding it [6]. The other pretraining task it uses is next-sentence prediction, which can be thought of as a form of sentence modeling. For BERT, the Transformer architecture strictly utilizes the workpiece tokenization method instead of ELMo, which combines character and word tokenizations. The BERT model family consists of BERT Experts, which are made up of eight models that all feature the BERT-base architecture but offer a selection of different pretraining domains to better align with the target task. The significant benefits achieved by BERT led several modern representation models to adopt the Transformer architecture as their main building element. Compared to ELMo, which uses a shallow bidirectional architecture, BERT uses a deep bidirectional architecture. Combining character-level and word-level information word embeddings has been proven to boost the performance of various text classification tasks. Character-level CNNs were employed on a dataset of news articles and a dataset of online reviews to show that the character-level model outperformed the word-level one on the classification task [55]. In a similar study, character-level and word-level information embeddings were merged with padding and a Long Short-Term Memory (LSTM) language model to produce better per- plexity scores than comparable word-based models [56]. An LSTMwas utilized in the health 123 Multi-Step Transfer Learning... Page 7 of 26 177 domain to learn character embeddings that were later coupled with pretrained word embed- dings to retrieve information about cancer, results of which successfully showed that coupling character and word-based techniques for COD is effective on the medical domain [57]. Recent studies have shown that character-level information can improve text classification models, especially in cases with numerous spelling errors and variants, such as the VA text [58]. Yan et al. [58] introduced two CNN based methods, namely embedding concatenation andmodel combination to combineword and character embeddings [58].With thesemethods, the authors demonstrated that information about characters can overall improve the COD classification task for VAs and datasets that are relatively smaller in size. They further showed that an added benefit to character-basedmodels is their smaller vocabulary size, which causes the input representations to have small variations. This trait is especially useful for very small datasets like VA narratives. A BERT-ELMO-based deep learning neural network architecture that utilised a bidirec- tional LSTM (BiLSTM) as the primary building block, together with a conditional random fields layer was used for the name entity recognition (NER) task by Affi and Latiri [59]. The authors initialized word vectors using pretrained ELMo and BERT embeddings and fed the output into a BiLSTM network. They reported improved results compared to existing state-of-the-art (SOTA) systems on Conference on Natural Language Learning 2003 shared task (CoNLL-2003) (95.56% F1-Score). Building upon a similar concept, characterBERT, a BERT version that consults the characters of words to represent them using a character CNN module, was introduced for the NER task in the health domain [51]. While the self-attention mechanism of the Transformer has proven to be effective for many language models, it does come with some challenges. The self-attention used by the Transformer is known to be complex [60], resulting in the attention module becoming a bottleneck when dealing with long sequences. The second challenge pertains to structural priors. Compared to CNNs and RNNs, which come with predefined biases for spatial or temporal patterns in data, the self-attentionmechanism used in the Transformer lacks specific structural biases and assumptions about data. Even the order information needs to be learned from the training data. Consequently, the Transformer’s design is more flexible and is better equipped to handle diverse tasks effectively. However, this flexibility comes at the cost of potential overfitting on smaller datasets [60]. Several works have set out to mitigate this challenge by improving attention. This includes techniques that utilize a prior distribution for attention. These techniques investigate supple- menting or substituting the standard attention mechanism with prior attention distributions. Combining these two attention distributions typically entails computing a weighted total of the scores linked to the prior and the generated attention and applying a softmax function [60]. Text data has been shown to favour locality strongly, and this property can be encoded as prior attention. Lin et al. [60] shows that utilising a Gaussian distribution over positions is the most straightforward approach. This means the resulting attention distribution could be multiplied by a Gaussian curve’s density and adjusted to maintain proper proportions. This adjustment is akin to introducing a bias term to the initial attention scores, where larger values suggest a greater inherent likelihood that the i-th input should prioritize attending to the j-th input. Gaussian Transformers by Guo et al. [61] and Yang et al. [62] explored this approach. BERT’s combination of bidirectional context, self-attention mechanisms, positional encodings, deep architecture, and pretraining all contribute to its ability to handle long-range dependencies in language, making it effective at capturing relationships between words or tokens that are distant from each other in a text sequence [6]. The integration of CNNs with 123 177 Page 8 of 26 T. Manaka et al. Table 1 A verbal autopsy narrative Narrative The deceased started illness on the left leg where she was scratched on the left leg on her toe. The toe became swollen and rotten. She suffered with blurred. The chest pain and headache still worse. She was taken to a special doctor. The treatment was tablets. Bandage was use on her leg. Illness became worse where she was taken to matikwana hospital. Admitted for almost a month. Treatment was water drip, tablets and bandage. She was told that she had sugar diabetes. Urine was red and her toe became swollen and rotten. She was taken to mapulaneng hospital, where her leg was amputated. Then one month after she complained about diarrhea for 3 weeks and it did not stop. She started coughing for one months. She became difficult breathing for 2 days and she died at home. She had trouble seeing for about 2 years and did not stop until death. Diagnosis: Death due to uncontrollable hyperglycaemia Transformer architectures to enhance performance is a topic that has been explored in various works, including those in the biomedical domain. CNNs effectively capture the text data’s unique characteristics, local patterns, and features. Both CNNs and Transformers have their strengths, and combining them can lead to improved performance by capturing different types of features and patterns. In Chinese text, spacing between words is not as clear as in English, making boundaries between terms less distinct. Constructing a Chinese entity involves various symbols, charac- ters, and abbreviated forms. Moreover, the structure of Chinese grammar is intricate, leading to instances where a single term can signify distinct types of entities within different contexts [63, 64]. To extract fine-grained semantic features of Chinese characters for the Chinese clinical namedentity recognition (NER) task,Wanget al. [63] combined adynamic fusion transformer layer with the Robustly Optimized Bidirectional Encoder Representation from Transformers Pretraining Approach Whole Word Masking (RoBERTa-wwm) and 1-dimensional CNNs. Kong et al. [64] introduced an innovative approach that integrates multi-level CNN layers with an attention mechanism for the Chinese clinical NER task. Through this approach, they demonstrated the development of a data augmentation technique without relying on external information while also utilizing multi-modal character embeddings to delve into a wider range of semantic details. 2.4 Verbal Autopsy A verbal autopsy (VA) report is a research tool that enables a better understanding of COD. Unlike clinical notes such as hospital discharge summaries or biomedical text from medical literature, by the nature of its collection and compilation, narrative text from VA reports does not possess clinical or biomedical knowledge. VA reports are performed by lay interviewers and later coded by physicians for COD.Many errors are made while translating the local lan- guages and converting handwritten documents to electronic mediums. Numerous accounts frequently contain grammar and spelling mistakes, inconsistent pronouns, sentence frag- ments, improper punctuation, transcription problems, and the frequent usage of terminology in the local vernacular [58]. Table 1 shows a sample of a VA narrative. The language and dialects in VA data might not be well-represented in standard pretrained language models, and these language models might not understand the characteristics and 123 Multi-Step Transfer Learning... Page 9 of 26 177 vocabulary of the specific language used in the VA data. VA narratives also contain domain- specific terms related to cultural practices, local beliefs, and regional medical conditions that may not be covered by standard health domain pretrained models. The data is also narrated by individuals who are not medical experts. As a result, the language used may not conform to the formal medical terminology used in health domain pretrained models. To mitigate these challenges, we propose a hybrid transfer learning framework that offers cross-linguistic adaptation; The ELMo language model pretrained in the English domain serves as an intermediate bridge between the language used in theVA corpus and the biomedi- cal domain ofBERT.TheELMoembeddings capture linguistic characteristics that are present in the VA data and not covered by traditional health domain pretrained models. Our framework also offers a domain-specific initialization; By using the ELMo-initialized BERT model, we are effectively initializing the model’s parameters to understand both the linguistic characteristics of theVAcorpus and the biomedical terminology from the pretrained BERT model. This dual initialization enhances the model’s ability to handle the unique VA language. This framework also recognizes the multilingual nature of the VA corpus and incorporates knowledge from multiple domains. The ELMo-initialized BERT model is designed to learn from both English and biomedical contexts, enabling it to better capture the features of the VA corpus language. We relied on expertly annotated VA data. A pediatrician with expertise in type-1 diabetes coded the data by examining VAs to identify features indicative of diabetes or uncontrolled hyperglycemia. A colleague of the pediatrician, experienced in adult internal medicine, dia- betes, and endocrinology, reviewed cases where the reviewing physician was uncertain, and a consensus was reached. This contribution helped guide the hybrid model’s initialization, aligning the model with the biomedical knowledge present in the VA language. 2.5 Data Class Imbalance Due to the high dimensionality of the numerical vectors generated from text, current sampling techniques, such as the synthetic minority oversampling technique (SMOTE) and its variants, like SMOTE-Tomek Links, do not perform well with text data. Although BERT is capable of handling imbalanced classes without the need for additional data augmentation, it has been shown that the model does not generalize well when the training and testing datasets are different, such as news sources whose subjects change over time. To address this challenge, Madabushi et al. [65] suggest adding cost weighting to BERT. Wei et al. [66] showed that data augmentation strategies for NLP, such as synonym sub- stitution and random insertion, deletion and swapping of words from a sentence with a predetermined frequency, did not produce significant gains when using pretrainedmodels. To test this assertion,Madabushi et al. [65] used various data augmentation techniques including synonym replacement, random deletion of words in a sentence, and the random oversampling of cases from the minority class, for the sentence classification task using the BERT language model. According to the authors, except for oversampling, BERT without data augmentation approaches outperformed BERT with those techniques. In contrast to synonym insertion and random word deletion, which inject noise into the data, oversampling does not, according to the authors. In the case of natural language data, this type of noise may modify a sentence’s meaning. The cost-sensitive classification was then presented as a more reliable technique for weighing samples of imbalanced data. Cost-sensitive classification 123 177 Page 10 of 26 T. Manaka et al. Cost-sensitive learning solves the problem of class imbalance by changing the cost function of the model such that making incorrect classifications of training samples from the minority class are more costly. If xi is a single prediction and j a class, the cross entropy (CE) loss for the class is given by CE = − 1 N ∑ i ∑ jε{0,1} yi j logpi j (1) where xi is a member of a set of training examples X and is related to a label yi , which is a member of the set {0, 1}. The predicted probabilities of the classes is pi , and it is a member of [0, 1]. One can adjust the cross-entropy loss to take into account an array weights, the i th member of which gives the weight of the i th class to be Weighted CE = − 1 N ∑ i αi ∑ j∈{0,1} yi j logpi j (2) where αi , a member of a set [0, 1] is set by the inverse class frequency. Although the weighted cross entropy loss can offer some relief, the improvement is not as significant, according to Xiaoya [67]. A dice loss has been proposed as a solution to address the class imbalance due to the limitations of the cross entropy loss in cases with uneven label distributions. Dice Loss/Sorensen–Dice Coefficient The dice loss is based on a statistic that gauges the similarity between two samples or an overlap between two sets called the Sorensen–Dice coefficient (DSC) [68]. For a single example xi , the dice coefficient is given as DSC(xi ) = 2pi1yi1 pi1 + yi1 (3) The nominator and denominator of the above equation are smoothed by adding a γ factor, resulting in the following equation: DSC(xi ) = 2pi1yi1 + γ pi1 + yi1 + γ (4) Changing the denominator to the square form for faster convergence gives the dice loss DL = 1 N ∑ i [1 − 2pi1yi1 + γ p2i1 + y2i1 + γ ] (5) or DL = 1 − 2 ∑ i pi yi + γ∑ i pi 2 + ∑ i yi 2 + γ (6) The DSC gets its maximum value of 1 when two sets, A and B, perfectly overlap. If the two sets do not intersect in any way, DSC starts to fall and eventually reaches zero. As a result, the DSC’s range is 0 to 1, with bigger being better. From this, we can utilize 1-DSC as the dice loss to optimize overlapping between two sets. Although the dice loss views false positives and false negatives as equally important, using the dice loss alone has been shown to be insufficient as it cannot address the prevailing influence of easy-negative examples on the training [67]. Xiaoya et al. [67] demonstrate that while easy negative examples can be easily pushed to a probability of 0, the model fails to 123 Multi-Step Transfer Learning... Page 11 of 26 177 differentiate between positive and hard-negative examples. The authors suggested a weight- adjustment technique that assigns each training example a weight proportional to (1 − p), that changes as training continues and makes the model sensitive to hard negative cases. This makes Eq. 5 be DSC(xi ) = 2(1 − pi1)α pi1yi1 + γ (1 − pi1)α pi1 + yi1 + γ (7) where (1 − pi1)α represents the weight assigned to each case, which shifts as training goes on, pushing the weight of easy examples. Word-level information based language models, particularly those trained on specific- domain corpora like English, are prone to miss important information from VA text because of the nature of its nature [58].We assert that this is also true for clinical and biomedical dataset text dictionaries and word distributions, and that SOTA text representation algorithms trained on these datasets will not perform as well on the VA data. We present Multi-Step transfer learning to mitigate this challenge by including character-level word representations from the ELMo language model and word-level embeddings from the BERT language model. In the initial step of our framework, the languagemodeling pretraining objective primarily concerns the English language. This objective is achieved through unsupervised learning using the ELMo model trained on a combination of English Wikipedia and monolingual news crawl data. In the subsequent step, we leverage the VA embeddings acquired from the English domain to initialize the learning of VA text representations in the biomedical domain. This initialization is done byusing an additional embedding layer before the embedding layers of the BERT model trained on PubMed abstracts. The resulting embeddings, which combine knowledge from both the English and biomedical domains, are then utilized for the final task of classifying the COD due to uncontrolled hyperglycemia. This framework hypothesizes that the ELMo language model, when used to learn VA embeddings in theEnglish domain can reduce the distribution divergence between theEnglish language and the language in the VA corpora. It also assumes that biomedical and clinical knowledge fromdata transcribed bymedical professionals can be used to improve the learning of VA representations via the shared mappings of the BERT model trained in the biomedical domain. The empirical evaluation of the framework involved its implementation on three open-source text classification datasets of English, biomedical and VA domains. This paper’s contributions are the following: 1. A Multi-Step Transfer Learning approach that makes the most use of the domain adap- tation processes to improve the COD classification task of VA text. 2. A VA text representation framework with proven transferability to other medical con- ditions in cardiovascular, pulmonary, gastroenterology, neurology, orthopedics and radiology categories, which are leading CODs globally. 3. An empirical evaluation of the Multi-Step Transfer Learning model on the publicly available VA dataset collected by the Population Health Metrics Research Consortium (PHMRC). We believe that combining knowledge from the two representation learning approaches will result in VA narrative representations better suited for the target task of COD classifi- cation. This is significant because improved VA text representations will accurately convey information about uncontrolled hyperglycemia as a mortality factor, which, when identified and diagnosed in a timely manner, can prevent further complications of type-1 diabetes and death. 123 177 Page 12 of 26 T. Manaka et al. 3 Methods This section presents the experimental setup of the Multi-Step Transfer Learning framework applied to a VA dataset. There are three parts to the experiment; The first part focuses on the selection of the best hyper-parameters for the BERT model which formed the second step of the transfer learning framework. The best approach for dealing with class imbalances in text classification is studied in the second part and the third part focuses on the validation of the framework on publicly available datasets of Population HealthMetrics Research Consortium (PHMRC) VA Corpus, IMDb movie reviews and a clinical dataset of medical transcriptions. 3.1 Algorithms BERT [6], BERT Experts-PubMed [69], BERT Experts-Wikibooks [69], ELMo [50] and BioELMo [34]. 3.2 Datasets English Language Corpus For ELMo, we used the EnglishWikipedia and the monolingual news crawl data fromWMT 2008–2012.3 For BERT we used the expert version pretrained on combined Wikipedia and BooksCorpus. General Medical Corpus Avocabulary drawn from a database of 15,000 clinical research articles fromPubMedCentral (PMC)4 that cover a wide spectrum of medical areas was used as the medical domain corpus. Agincourt Verbal Autopsy Corpus The verbal autopsy (VA) dataset used in this study is from Agincourt, ethics clearance num- ber:M110138. It is a population health and demographic surveillance system operating in rural South Africa and aids research on causes and impacts of social transitions and popula- tions. The data consists of 8698 VA records collected from 1992 to 2015. The data were examined for indicators of uncontrolled hyperglycemia by a doctor with paediatric training and experience managing type-1 diabetes in high-income, low-income, andmiddle-income countries aswell as paediatric training. In 3708 cases, uncontrolled hyper- glycemia symptoms were present; and 77 cases were identified as deaths due to uncontrolled hyperglycemia. The data includes answers to both open and closed-ended questions and free text describing circumstances leading up to a death. We utilized the free text for this study. Population HealthMetrics Research Consortium (PHMRC) Verbal Autopsy Corpus We validated our framework on the VA dataset collected by the Population Health Metrics Research Consortium (PHMRC). This data was gathered to make it possible to create and test methods for measuring cause-specific mortality in areas where there is limited and inaccurate COD coding [70]. It comprises of 11,979 VA records covering three age groups; neonate, child, and adult. The VA gathered data on potential risk factors, demographics, and other relevant information. The data is compiled in Tanzania, India, the Philippines and Mexico. Of these cases, 7580 are adult cases, and only 6896 of these had the narrative text feature. IMDb Movie Reviews 3 WMT is a collection of datasets used in shared tasks of the Third Conference on Machine Translation. 4 PubMed is an online medical publication repository and contains published medical research across a very wide spectrum of clinical subjects. https://www.ncbi.nlm.nih.gov/pmc/. 123 https://www.ncbi.nlm.nih.gov/pmc/ Multi-Step Transfer Learning... Page 13 of 26 177 We also validated our framework on two classification datasets in the English and clinical domains. For the English dataset, we utilized the informalmovie reviews from InternetMovie Database (IMDb) [71] dataset provided by Keras. This dataset comprises 50,000 reviews, equally split between 25,000 negative and 25,000 positive cases. We selected this dataset due to its use as a benchmark for the Paragraph Vector [72] on sentiment analysis and information retrieval tasks. We therefore used it evaluate the Multi-Step Transfer Learning framework for the text classification task of sentiment analysis. We chose this dataset as it is larger than the VA dataset and it boasts an even class distribution. Medical Transcriptions The dataset consists of 2324 transcribed medical transcription sample reports across 21 categories ofmedical conditions including cardiovascular, pulmonary, gastroenterology, neu- rology, orthopedics and radiology [73].Although themedical transcriptions are almost similar to VA narratives and have an imbalanced category distribution, this dataset is fairly smaller than the VA dataset. The Multi-Step Transfer Learning framework was evaluated on the multi-class text classification using this dataset. 3.3 Experiments Similar to the work byManaka et al. [49], the initial step of the Multi-Step Transfer Learning framework involves an exploratory search for models across a number of domains to identify the one that best represents the VA corpus. Three sets of ELMo languagemodels were trained in three different domains, with the objective of identifying the optimal text representations for the VA language modeling task. Two BERT Experts language models were initialized with random weights for training. The set of embeddings from the ELMo model with the lowest perplexity scores was then transferred to the BERT models. 3.3.1 ELMo The tensorflow implementation of ELMo from github repository 5 was cloned to train and evaluate the ELMo embeddings in the English, medical and public health domains. Input data was prepared by randomly splitting the training data into many training files, each containing pre-tokenized and white space-separated text, one sentence per line. The three ELMo models were trained using the same hyperparameters as the original ELMo model: one on a vocabulary derived from the English Wikipedia and monolingual news crawl data fromWMT 2008–2012, and one on vocabulary derived from the VA corpus. The third ELMo language model was trained on a vocabulary from 10M PubMed abstracts. The datasets were preprocessed by removing punctuation as well as lower casing the text. When creating a vocabulary using VA data, we did a comparison of when stop words were removed and when they were not and used a vocabulary that gave a less perplexity score. Following the paper implementation [74], the language models were trained via the multi- task learning of next word prediction and natural language inference. The trained language model embeddings were used as feature extractors to initialize random word vectors of VA language corpora when using the pretrained BERT models in the second transfer learning step. All ELMo layers were combined into a single vector in order to be used in this target task. The three language models were evaluated on the perplexity score, results of which are depicted in Table 2. 5 https://github.com/allenai/bilm-tf. 123 https://github.com/allenai/bilm-tf 177 Page 14 of 26 T. Manaka et al. 3.3.2 BERT We built a basic fine-tuned model, which involved creating a preprocessing model, utilizing a pretrained model from BERT Experts, implementing an embedding layer with ELMo embeddings for initializing the BERT embeddings, incorporating a fully connected layer, and adding a dropout layer for COD classification. To facilitate comparison, we employed two BERT models from TensorFlow Hub,6 which were pretrained on medical and English domains, specifically Wikibooks and PubMed abstracts. In addition to masked language modeling, the other training object of BERT is next sen- tence prediction, which can be thought of as a form of sentence modeling and it incorporates both tasks via multitask learning. A map with three key values was created by the BERT models: pooled output which represented each input sequence as a whole, i.e. the embedding for all VA data, sequence output which represented each input token in context, i.e. the con- textual embedding for every token in the VA corpus, and encoder outputs which represented intermediate activations in the transformer blocks. For extracted BERT embeddings we used the 768-element pooled output array. Devlin et al. [6] observed that BERT models were sensitive to the choice of hyperparam- eters for smaller datasets compared to larger ones. The authors recommended the following hyperparameters for fine-tuning the model: a batch size of 16 or 32, learning rates (Adam) of: 5e−5, 3e−5 and 2e−5 when the number of training epochs ranges between 2 and 4. We con- ducted experiments using various combinations of these hyperparameters to identify those giving the optimal performance. With these combinations, we also compared the model per- formances using both the dice loss and the weighted binary cross-entropy loss functions. The corresponding results are given in Table 3. We conducted experiments using various combinations of these hyperparameters to iden- tify the configurations yielding optimal performance. Within these configurations, we also compared model performance using both the dice loss and the weighted binary cross-entropy loss functions. The corresponding results are presented in Table 3. The overall framework of the Multi-Step Transfer Learning framework is the following: 1. Pretraining: ELMo Pretraining for Spelling Variations and Language: The ELMo lan- guage model pretrained in the English domain (English Wikipedia and the monolingual news crawl data)was used in the initial step ofMulti-StepTransfer Learning. The assump- tion is that ELMo will learn the generic linguistic features of a VA text report like syntax, semantics and context. By leveraging character-level information, ELMo embeddings can potentially help in handling spelling errors, variations in language, and even rare or out-of-vocabulary words in VA texts. The domain adaptation technique from this task to the one in the next step is feature extraction, as the source task (language modeling) and the target task (language modeling) are similar. This learning paradigm is classi- fied as cross-domain learning because it entails learning embeddings in one domain and transferring them to another. 2. Intermediate Training: BERT Pretraining for Medical Information: ELMo embeddings from the first step were used to initialize the embedding layers of the BERTmodel trained in the biomedical domain (PubMed abstracts). This intermediate training step aimed to find the optimal embeddings for the final CODclassification task. This initialization helps BERT start from a more informative point enabling it to achieve better performance. 6 https://www.tensorflow.org/hub. 123 https://www.tensorflow.org/hub Multi-Step Transfer Learning... Page 15 of 26 177 BERT’s biomedical domain embeddings will help capture medical terminology, con- cepts, and domain-specific information present in VA texts. This is crucial for accurately extracting medical information and understanding the context of symptoms and COD. 3. Fine-tuning: After initializing the embedding layers, the entire BERT model was fine- tuned on the target COD classification task using the labeled VA data. The assumption is that the BERT language models will deeply understand and learn the relationships betweenwords and contexts, resulting in a highly contextualized embeddings.This setting where the source task and target task differ is called cross-task learning and the domain adaptation technique of fine-tuning was used in this step. 4. Text Classification: To ensure comparability, all models were trained and evaluated using the same split. For the task of sentence classification of COD due to uncontrolled hyper- glycemia, a fully connected layer was added above the BERT self-attention layers. For the classification task, the cost-sensitive classification [65] technique was used to enhance the weight of mislabeling a VA case by altering the fully connected layer’s cost function during training by multiplying each example’s loss by a factor. The computed class weights ratio is 0.50494305 : 51.07608696. To ensure comparability, all models were trained and evaluated using the same split and for the sentence classification task, a fully connected layer was added above the BERT self-attention layers. In addition to Multi-Step Transfer Learning, which uses ELMo embeddings to initialize the learning of BERT embeddings, we also experimented with the concatenation of ELMo and BERT embeddings. We used the resultant embeddings for the classification of COD due to uncontrolled hyperglycaemia from VA reports with a feed-forward neural network. Our comparison included evaluating these embeddings as features alongside the binary fea- tures extracted from a VA report and when both binary and VA text features were used in combination. This would ultimately provide insights into the effectiveness of VA narrative embeddings in the classification task of COD due to uncontrolled hyperglycaemia from VA reports (Fig. 1). 3.3.3 Validation To assess the framework, we conducted tests comparing BERT’s performance when employ- ing both feature extraction and fine-tuning domain adaptation methods. The testing involved evaluating the BERTmodel embeddings in isolation and when combined with ELMo embed- dings. In the case of using BERT and ELMo embeddings together, the first scenario, Multi-Step Transfer Learning utilized fine-tuning adaptation, while the second scenario involved feature extraction, where ELMo embeddings were concatenated with BERT embed- dings. While only accuracy was employed for multi-class classification, the performance of the framework for sentiment analysis, similar to binary classification was assessed using recall, precision, F1-score, the area under theROCcurve (AUC–ROC), and accuracy. Formulti-class classification, we also studied the effect of reducing the dimension of the text by extracting clinical domain entities. We used the ScispaCy7 package to detect medical entities in the medical transcriptions. 7 https://allenai.github.io/scispacy/. 123 https://allenai.github.io/scispacy/ 177 Page 16 of 26 T. Manaka et al. Fig. 1 Multi-step transfer learning framework 4 Results and Discussion The ELMomodel trained onWikibooks and BookCorpus vocabulary exhibits better perplex- ity scores on the training and testing datasets than the ELMo models pretrained on PubMed abstracts and the VA vocabulary (Table 2). This is due to the sizes of the vocabulary sets from these datasets. This also means that the distribution divergence gap of features between the VA corpus and the English domain is less than that between the VA and the health domain. We believe that ELMo was able to extract a lot of linguistic knowledge, including spelling, syntax, and grammar in the English domain. Generally, the dice loss function has better results across all the metrics than the weighted binary cross entropy loss function (Table 3). These findings are consistent with the results of a study by Xiaoya et al. [67], which showed 123 Multi-Step Transfer Learning... Page 17 of 26 177 Table 2 Evaluation of ELMo language models Technique Vocabulary Tokens Train perplexity Test perplexity ELMo English Wikipedia 5.5B 43.23 31.32 ELMo Agincourt Verbal Autopsy 982 495 71.55 50.01 BioELMo PubMed Abstracts 2.46B 47.44 33.01 ELMo PHMRC Verbal Autopsy 475 005 67.22 51.42 ELMo IMDb Reviews 11.7M 52.47 44.56 BioELMo Medical transcriptions 909 830 72.12 52.32 Table 3 BERT Fine-Tuning Hyperparameter Search Epochs L-Rate Recall Precision F1-Score AUC-ROC Dice Loss 2 2e−5 0.7843 0.7312 0.7568 0.8514 3e−5 0.8000 0.7656 0.7824 0.8433 5e−5 0.7254 0.7789 0.7512 0.8087 3 2e−5 0.8000 0.7811 0.7904 0.8293 3e−5 0.7541 0.8475 0.7981 0.8881 5e−5 0.7461 0.7255 0.7356 0.8591 4 2e−5 0.7111 0.7000 0.7056 0.8532 3e−5 0.7661 0.6286 0.6906 0.8497 5e−5 0.7100 0.6375 0.6718 0.8245 Weighted Cross-Entropy 2 2e−5 0.5822 0.5900 0.5861 0.6472 3e−5 0.5344 0.6415 0.5830 0.6166 5e−5 0.6300 0.5574 0.5914 0.6881 3 2e−5 0.6500 0.6100 0.6294 0.7105 3e−5 0.5333 0.4621 0.4952 0.5629 5e−5 0.4700 0.5248 0.4959 0.5568 4 2e−5 0.6344 0.6000 0.6167 0.6562 3e−5 0.5866 0.5223 0.5525 0.6178 5e−5 0.5167 0.5469 0.5314 0.6381 that when the imbalance between classes is extreme, the weighted binary cross entropy loss is unable to alleviate the imbalance in datasets. The authors show how the impact of easy negative examples causes this as the only thing that weighting the classes does is balance the labels such that the training and test times are equal. Madabushi et al. [65] showed that although BERT can handle imbalanced datasets without the requirement for further data augmentation, evaluation findings of the weighted cross entropy loss function demonstrate that it fails to generalize when the train and test sets differ. Our results show that this is the case with VA reports as well which consist of differing narrations. 123 177 Page 18 of 26 T. Manaka et al. Table 4 Multi-step transfer learning on sets of BERT and ELMo Embeddings of the Agincourt VA Dataset BERT ELMo Recall Precision F1-Score AUC-ROC Wiki Books None 0.6141 0.5594 0.5855 0.7001 Verbal Autopsy 0.6581 0.6147 0.6357 0.7101 Wikipedia 0.7687 0.7787 0.7734 0.8507 PubMed Abstracts 0.7581 0.6991 0.7324 0.8111 PubMed Abstracts None 0.7141 0.6011 0.6528 0.7399 Verbal Autopsy 0.8065 0.6011 0.6888 0.7981 Wikipedia 0.8171 0.8644 0.8401 0.9144 PubMed Abstracts 0.7496 0.6987 0.7233 0.8149 Table 5 Multi-step transfer learning on sets of BERT and ELMo Embeddings of the PHMRC VA Dataset BERT ELMo Recall Precision F1-Score AUC-ROC Wiki Books None 0.7044 0.6741 0.6889 0.7447 Verbal Autopsy 0.6743 0.5561 0.6095 0.7253 Wikipedia 0.7848 0.7548 0.7695 0.8465 PubMed Abstracts 0.7341 0.6791 0.7056 0.8112 PubMed Abstracts None 0.6946 0.6681 0.6811 0.7422 Verbal Autopsy 0.7215 0.7046 0.7129 0.7956 Wikipedia 0.8363 0.7941 0.8147 0.9017 PubMed Abstracts 0.7561 0.7148 0.7349 0.8764 Table 4 gives results of a comparison of sets of VA ELMo embeddings trained on different domain vocabularies. BERT pretrained on Wikibooks and BERT pretrained on PubMed abstracts perform best with VA corpus embeddings pretrained on English Wikipedia in all settings. This can be explained by the fact that PubMed abstracts and the VA corpus contain numerous mentions of words in English Wikipedia and Books Corpus and that both domains form subsets of the English domain. The benefits of adding ELMo embeddings are evident from the performance scores that are greater than for a BERT model without ELMo embeddings, except for the setting where ELMo embeddings are trained on VA corpus vocabulary, where the model performs worse than the setting without ELMo embeddings. We believe this to be due to the VA corpus vocabulary being smaller in size, containing misspelled words and grammatical errors. We argue that this limits linguistic knowledge and that the total capacity of the ELMomodel was not leveraged because the model is data-intensive. Other authors have reported higher F1-scores (95.57%) when using a combination of off-the-shelf ELMo and BERT embeddings as an initial step to a combination of a bidirec- tional LSTM and conditional random fields (BiLSTM-CRF) module on CoNLL-2003 and OntoNotes 5.0 datasets for English named entity recognition (NER) task [59]. Nonetheless, our results are comparablewith those of Boukkouri et al. [51] who reported F1-scores ranging from 70 to 89% on CharacterBERT, a variant of BERT that drops the word piece system alto- gether in favor of a character-CNN on a series of NER tasks on the medical corpus. However, with F1-scores around 0.8633–0.8907, we are convinced character information can improve the classification of cause of death (COD) from VA reports. 123 Multi-Step Transfer Learning... Page 19 of 26 177 Table 6 Multi-step transfer learning framework on IMDb reviews for BERTpretrained on English andmedical domains BERT Adaptation F1-Score AUC-ROC Accurary Wiki Books Fine tuning 0.9007 0.9576 0.8895 Feature extraction 0.8356 0.8181 0.8518 PubMed Abstracts Fine tuning 0.8382 0.9571 0.8862 Feature extraction 0.6695 0.7005 0.7062 Table 7 Multi-step transfer learning on medical transcriptions detected and undetected for clinical entities accuracy scores Model Adaptation Detected Undetected BERT Fine tuning 0.6872 0.7629 Feature extraction 0.5841 0.6577 Multi-Step (BERT) Fine tuning 0.7148 0.8244 Feature extraction 0.6824 0.7479 BERT-PubMed Fine tuning 0.7525 0.8336 Feature extraction 0.5844 0.5952 Multi-Step (BERT-PubMed) Fine tuning 0.8421 0.8946 Feature extraction 0.7669 0.7769 Danso et al. [75] showed that there is no corpus similar to VA and that the only dataset available to provide a gold standard for the evaluation of computational approaches to VA analysis, the PHMRC corpus was not suitable for linguistic research. Evaluation of our approach on the PHMRC VA dataset gave similar results, (shown in Table 5) to those of the Agincourt VA corpus in terms of performance across the different embeddings combinations. The authors show that the preprocessing steps involved in its annotation like the removal of syntax rules and linguistic information, removal of words that do not occur frequently, and only taking into account medically relevant concept-terms have resulted in the loss of important information. We found these to not have an affected our approach as some of these steps were used in preprocessing the Agincourt VA data for the ELMo models. Evaluation of our framework on the IMDb movie reviews dataset (English domain) and the medical transcriptions corpus (medical domain), where the former is larger than the VA corpus and the latter is smaller, shows that the approach is capable of handling datasets of different sizes from multiple domains (Tables 6 and 7). The Multi-Step transfer learning framework can also generalize to both balanced and unbalanced datasets as themovie reviews are equally balanced while the medical transcriptions dataset is not. Works by Beltagy et al. [31], See et al. [76], Jin et al. [34] and Boukkouri et al. [51] have shown that the general English-domain word-piece vocabularies are not suitable for specialized domain applications like clinical and biomedical domains. Evaluation of the Multi-Step transfer learning framework results are in agreement with this (Tables 6 and 7), where BERT trained in the English domain gets higher F1-scores than BERT pretrained on PubMed abstracts on the IMDb reviews. The PubMed pre-trained BERT model also outper- forms the general English domain pre-trained BERT model on the medical transcriptions. This is because even though BERT can achieve the right balance between the flexibility 123 177 Page 20 of 26 T. Manaka et al. Fig. 2 Receiver Operating Characteristic (ROC) curves and the Area Under the ROC Curve (AUC-ROC) generated by the Neural Network Classifier using Binary Features from a VA Report of characters and the utility of full words, employing predefined word piece vocabularies from the general domain is not always appropriate, especially when building specialized domain models. Further evaluation shows that generally, the fine-tuning adaptation tech- nique achieves better results than the feature extraction adaption across all settings (Tables 6 and 7). These results are in line with Peters et al. [33] in the English domain and Jin et al. [34] in the biomedical domain where both works compared the two transfer learning adaptation techniques. In comparison to our framework’s sensitivity (recall) of 0.8171 and 0.8363 on the Agin- court and PHMRC datasets for adult deaths, Jeblee et al. [77] reported a mean sensitivity of 0.7700 and in another work where they combined word2vec embeddings and key phrases they achieved a recall score of 0.7780 [78]. Theseworks however usedword frequency counts as features and both methods don’t take word order and context into account. On more recent works that incorporate context and character information, Yan et al. [58] achieved recall scores of 0.6990 while Manaka et al. [79] gave 0.6000. Considering the improvement added by character information on improving COD classification, Manaka et al. [49] added features from multiple domains and reported a score of 0.8755. These findings suggest that our pro- posed transfer learning methodology can adapt to VA datasets across various demographics as the works compared against used VA datasets collected in other developing countries, including Ghana, India and Tanzania. Figures 2, 3, and 4 illustrate the receiver operating characteristic (ROC) curves for the neural network classifier across three different sets of VA features. These features are a concatenation of VA embeddings learnedwith character-level information using ELMo in the English domain, and those learnedwithword-level information usingBERT in the biomedical domain. Both sets of embeddings were extracted through the feature extraction domain adaptation techniques. In all three scenarios, the ROC curves demonstrate an upward rise towards the upper-left corner, signifying the accurate prediction of positive andnegative cases. Notably, when comparing the individual text and binary features with the combined text and binary features setting, the latter exhibits the highest AUC-ROC score (93%). This highlights the significance of text features in the classification of COD by uncontrolled hyperglycemia. 123 Multi-Step Transfer Learning... Page 21 of 26 177 Fig. 3 Receiver Operating Characteristic (ROC) curves and the Area Under the ROC Curve (AUC-ROC) for the Neural Network classifier applied to Text Features from a VA Report Fig. 4 Receiver Operating Characteristic (ROC) curves and the Area Under the ROC Curve (AUC-ROC) for the Neural Network classifier using both Binary and Text Features from a VA Report 5 Limitations of the Study This research is restricted to the text classification task. More research can be conducted on named entity recognition (NER) and relation extraction tasks, both critical in NLP applica- tions in the health domain. Furthermore, because the classification framework significantly impacts results, additional experimentation could reveal whether or not a similar behaviour occurs for other subsets of the English domain. It would also be interesting to look into the same architecture using different character and word embedding models. Due to limited computational resources, the data were split into smaller batches, and embeddings had to be computed one batch at a time. Thismay have impacted the computation of the embeddings because the experiments were carried out on Google Colaboratory, which allocated different GPUs for each batch run. 123 177 Page 22 of 26 T. Manaka et al. 6 Conclusion We have demonstrated through experimentation that the Multi-Step transfer learning frame- work can enhance the representations of text from VA reports. Consequently, it leads to an improved COD classification due to uncontrolled hyperglycemia derived from VA reports. As part of our future work, we intend to explore additional NLP techniques that incorporate Transformer and CNN architectures. This incorporation within a similar hybrid framework will allow us to assess their performance. Furthermore, we plan to investigate the impact of VA narrative embeddings from this framework when combined with the binary features of VA reports. Exploring the potential application of this framework to other CODs mentioned in the VA reports is also of interest.Wewill additionally explore using the ChatGPT language model for COD classification from VA reports. Acknowledgements We thank the MRC/Wits-Agincourt Unit for providing us with the dataset and for assis- tance in understanding its history. We are thankful to the United Nations’ Organization of Women in Science for the Developing World (OWSD) for the support granted to carry out this study. Author Contributions The authors contributed equally to this work. We confirm that all named authors have read, reviewed, and approved the manuscript and that no other individuals who meet the requirements for authorship but are not listed have contributed to the work. We also confirm that we all approved of the order in which the authors are listed in the manuscript. Funding Open access funding provided by University of the Witwatersrand. The Organization for Women in Science for the Developing World (OWSD) funded this research. Availability of Data and Materials Due to patient privacy and confidentiality policies, the Agincourt dataset analysed during this study is not publicly available. Still, it may be obtained with Data Use Agreements with the MRC/Wits-Agincourt Unit. Researchers interested in access to the data may contact Dalby Dawn at Dawn.Dalby@wits.ac.za.The PHMRC verbal autopsy dataset used to validate this study is publicly available at https://osf.io/xuk5q/, the IMDB dataset also used in the validation of the study is publicly available at https://datasets.imdbws.com/ while the medical transcriptions dataset is publicly available at https://www. mtsamples.com/. The three data sets are available under the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication). Code Availability All code for data cleaning and analysis associated with the current submission is available upon request from the corresponding author at [email address masked for blind review]. Declarations Conflict of interest We wish to reaffirm that no known conflicts of interest related to this publication or substantial financial support might have impacted the research’s findings. Ethics Approval We further confirm that any aspect of the work covered in this manuscript that has involved either experimental animals or human patients has been conducted with the ethical approval of all relevant bodies and that such approvals are acknowledged within the manuscript (ethics clearance number: M110138). Consent to Participate Not applicable. Consent for Publication Not applicable. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory 123 https://osf.io/xuk5q/ https://datasets.imdbws.com/ https://www.mtsamples.com/ https://www.mtsamples.com/ Multi-Step Transfer Learning... Page 23 of 26 177 regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. References 1. United Nations (2013) Department of economic and social affairs, population division, united nations. World Population Prospects: The 2012 revision 2. World Health Organisation (2007) Verbal autopsy standards: ascertaining and attributing cause of death, Geneva. Switzerland, World Health Organisation 3. Hirschman L, Chapman WW, D’Avolio LW, Savova GK, Uzuner O (2011) Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions. J Am Med Inform Assoc 18(5):450–453 4. Ohno-Machado L, Nadkarni P, Chapman W (2011) Natural language processing: an introduction. J Am Med Inform Assoc 18:544–51 5. Pan SJ, Yang Q (2010) A survey on transfer learning. IEEE Trans Knowl Data Eng 22(10):1345–1359 6. Devlin J, Chang MW, Lee K, Toutanova K (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 7. Kooverjee N, James S, Van Zyl T (2022) Investigating transfer learning in graph neural networks. Elec- tronics 11(8):1202 8. Bhana N, van Zyl TL (2022) Knowledge graph fusion for language model fine-tuning. In: 2022 9th international conference on soft computing and machine intelligence (ISCMI) 9. KimY (2014)Convolutional neural networks for sentence classification. In: Proceedings of the conference on empirical methods in natural language processing, pp 1746–1751 10. Ramachandran P, Liu PJ, Le QV (2016) Unsupervised pretraining for sequence to sequence learning. arXiv:1611.02683 11. Delrue, L., Gosselin, R., Ilsen, B., Landeghem,A.V., deMey, J., Duyck, P.: Difficulties in the interpretation of chest radiography. Comparative Interpretation of CT and Standard Radiography of the Chest, 27–49 (2011) 12. Goergen SK, Pool FJ, Turner TJ, Grimm JE, Appleyard MN, Crock C, Fahey MC, Fay MF, Ferris NJ, Liew SM, Perry RD, Revell A, Russell GM, Wang SC, Wriedt C (2013) Evidence-based guideline for the written radiology report: methods, recommendations and implementation challenges. J Med Imaging Radiat Oncol 57(1):1–7 13. Brady A, Laoide R, Mccarthy P, Mcdermott R (2012) Discrepancy and error in radiology: concepts, causes and consequences. Ulster Med J 81:3–9 14. Liu F, You C, Wu X, Ge S, Sun X (2021) Auto-encoding knowledge graph for unsupervised medical report generation. CoRR abs/2111.04318 15. Liu F, Yang B, You C, Wu X, Ge S, Liu Z, Sun X, Yang Y, Clifton D (2022) Retrieve, reason, and refine: generating accurate and faithful patient instructions. NeurIPS 35:18864–18877 16. Li J, Wang X, Wu X, Zhang Z, Xu X, Fu J, Tiwari P, Wan X, Wang B (2023) Huatuo-26m, a large-scale chinese medical qa dataset. CoRR abs/2305.01526 17. Hendrycks D, Burns C, Basart S, Zou A, Mazeika M, Song D, Steinhardt J (2020) Measuring massive multitask language understanding. CoRR abs/2009.03300 18. Abacha AB, Shivade C, Demner-Fushman D (2019) Overview of the mediqa 2019 shared task on textual inference, question entailment and question answering. In: Proceedings of the 18th BioNLP Workshop and Shared Task, pp 370–379 19. Zhou P, Wang Z, Chong D, Guo Z, Hua Y, Su Z, Teng Z, Wu J, Yang J (2022) Mets-cov: A dataset of medical entity and targeted sentiment on covid-19 related tweets. NeurIPS 35:21916–21932 20. Nori H, King N,McKinney SM, Carignan D, Horvitz E (2023) Capabilities of gpt-4 on medical challenge problems. CoRR abs/2303.13375 21. Fang C, Ling J, Zhou J, Wang Y, Liu X, Jiang Y, Wu Y, Chen Y, Zhu Z, Ma J, Yan Z (2023) How does chatgpt4 preform on non-english national medical licensing examination? an evaluation in chinese language. medRxiv 35 22. ZengQ,Garay L, Zhou P, ChongD,HuaY,Wu J, PanY, ZhouH,Voigt R, Yang J (2022)Greenplm: Cross- lingual transfer of monolingual pre-trained language models at almost no cost. The 32nd International Joint Conference on Artificial Intelligence 23. Liu J, Zhou P, Hua Y, Chong D, Tian Z, Liu A, Wang H, You C, Guo Z, Zhu L, Li M (2023) Bench- marking large language models on cmexam - a comprehensive chinese medical exam dataset. CoRR abs/2306.03030 123 http://creativecommons.org/licenses/by/4.0/ http://arxiv.org/abs/1810.04805 http://arxiv.org/abs/1611.02683 177 Page 24 of 26 T. Manaka et al. 24. Liu F, Zhu T, Wu X, Yang B, You C, Wang C, Lu L, Liu Z, Zheng Y, Sun X, Yang Y, Clifton L, Clifton DA (2023) A medical multimodal large language model for future pandemics. npj Digit. Med 6:226 25. Baxter J (2000) A model of inductive bias learning. J Artific Intell Res 12:149–198 26. Huang Z, Zweig G, Dmoulin B (2014) Cache based recurrent neural network language model inference for first pass speech recognition. IEEE ICASSP, pp 6354–6358 27. Wen Z, Lu X, Reddy S (2020) Medal: Medical abbreviation disambiguation dataset for natural language understanding pretraining. Proceedings of the 3rd clinical natural language processing workshop, pp 130–135 28. Alsentzer E, Murphy JR, Boag W, Weng WH, Jin D, Naumann T, McDermott MBA (2019) Publicly available clinical bert embeddings. arXiv:1904.03323 29. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J (2020) Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4):1234–1240 30. Qiao J, Bhuwan D,William C, Xinghua L (2019) Probing biomedical embeddings from language models. In: Proceedings of the 3rd workshop on evaluating vector space representations for NLP, pp 82–89 31. Beltagy I, Cohan A, Lo K (2019) Scibert: pretrained contextualized embeddings for scientific text. arXiv:1903.10676 32. Johnson AEW, Pollard TJ, Shen L, Lehman LH, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, Mark RG (2016) Mimic-III, a freely accessible critical care database. Sci Data3 33. Peters M, Ruder S, Smith N (2019) To tune or not to tune? adapting pretrained representations to diverse tasks. arXiv:1903.05987 34. Jin Q, Dhingra B, Cohen W, Lu X (2019) Probing biomedical embeddings from language models. arXiv:1904.02181 35. Zhao S, Li B, Reed C, Xu P, Keutzer K (2020) Multi-source domain adaptation in the deep learning era: a systematic survey. arXiv:2002.12169 36. Torralba A, Efros AA (2011) Unbiased look at dataset bias. In CVPR 37. Zhao S, Zhao X, Ding G, Keutzer K (2018) Emotiongan: Un-supervised domain adaptation for learning discrete probability distributions of image emotions. In ACM MM 38. III HD (2007) Frustratingly easy domain adaptation. Association for Computational Linguistic (ACL), pp 256–263 39. Sun S, Shi H, Wu Y (2015) A survey of multi-source domain adaptation. Inf Fusion 24:84–92 40. RiemerM, Cases I, Ajemian R, LiuM, Rish I, TuY, Tesauro G (2019) Learning to learn without forgetting by maximizing transfer and minimizing interference. In ICLR 41. SunQ, ChattopadhyayR, Panchanathan S, Ye J (2011) A two-stageweighting framework formulti-source domain adaptation. Adv Neural Inform Process Syst 24:505–513 42. Schweikert G, Rätsch G, Widmer C, Schölkopf B (2009) An empirical analysis of domain adaptation algorithms for genomic sequence analysis. Adv Neural Inform Process Syst 21:1433–1440 43. Guo H, Pasunuru R, Bansal M (2020) Multi-source domain adaptation for text classification via distancenet-bandits. In AAAI 44. Zhao S, Li B, Yue X, Gu Y, Xu P, Hu R, Chai H, Keutzer K (2019) Multi-source domain adaptation for semantic segmentation. NeurIPS 45. Li X, Lv S, Li M, Jiang Y, Qin Y, Luo H, Yin S (2023) SDMT: spatial dependence multi-task transformer network for 3d kneeMRI segmentation and landmark localization. IEEE TransMed Imaging 42(8):2274– 2285. https://doi.org/10.1109/TMI.2023.3247543 46. Li X, Jiang Y, Li M, Yin S (2020) Lightweight attention convolutional neural network for retinal vessel image segmentation. IEEE Trans Ind Inf 17(3):1958–1967 47. Hu K, Wu W, Li W, Simic M, Zomaya A, Wang Z (2022) Adversarial evolving neural network for longitudinal knee osteoarthritis prediction. IEEE Trans Med Imaging 41(11):3207–3217 48. Wan Y, Jiang Z (2023) Transcrispr: transformer based hybrid model for predicting CRISPR/cas9 single guide RNA cleavage efficiency. IEEE Trans Med Imaging 20(2):1518–1528 49. Manaka T, Van Zyl TL,KarD (2022) Improving cause-of-death classification from verbal autopsy reports. arXiv:2210.17161 50. Peters M, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. NAACL 51. Boukkouri HE, Ferret O, Lavergne T, Noji H, Zweigenbaum P, Tsujii J (2020) Characterbert: Reconciling elmo and bert for word-level open-vocabulary representations from characters 52. Vaswani A, Shazeer N, Parmar N, Uszkoreita J, Jones L, Gomez AN (2017) Attention is all you need. NIPS, pp 6000–6010 53. He K, Zhang X, Ren S, Jian S (2016) Deep residual learning for image recognition. AI Open 3:770–778. https://doi.org/10.1109/CVPR.2016.90 54. Ba LJ, Kiros JR, Hinton GE (2016) Layer normalization. arXiv:1607.06450 123 http://arxiv.org/abs/1904.03323 http://arxiv.org/abs/1903.10676 http://arxiv.org/abs/1903.05987 http://arxiv.org/abs/1904.02181 http://arxiv.org/abs/2002.12169 https://doi.org/10.1109/TMI.2023.3247543 http://arxiv.org/abs/2210.17161 https://doi.org/10.1109/CVPR.2016.90 http://arxiv.org/abs/1607.06450 Multi-Step Transfer Learning... Page 25 of 26 177 55. Zhang X, Zhao J, LeCun Y (2015) Character-level convolutional networks for text classification. Adv Neural Inf Process Syst, pp 649–657 56. Verwimp L, Pelemans J, hamme HV,Wambacq P (2017) Character-word lstm language models. Proceed- ings of the 15th conference of the European chapter of the association for computational linguistics vol 1, pp 417–427 57. Si Y, Roberts K (2018) A frame-based nlp system for cancer-related information extraction. AMIA Ann Symp Proc, pp 1524–1533 58. Yan Z, Jeblee S, Hirst G (2019) Can character embeddings improve cause-of-death classification for verbal autopsy narratives? BioNLP@ACL 59. Affi M, Latiri C (2021) Be-blc: Bert-elmo-based deep neural network architecture for English named entity recognition task. Proc Comput Sci 192 60. Lin T, Wang Y, Liu X, Qiu X (2022) A survey of transformers. AI Open 3:111–132. https://doi.org/10. 1016/j.aiopen.2022.10.001 61. Guo M, Zhang Y, Liu T (2019) Gaussian transformer: a lightweight approach for natural language infer- ence. In: Proceedings of AAAI, pp 6489–6496. https://doi.org/10.1609/aaai.v33i01.33016489. 62. Yang B, Tu Z, Wong DF, Meng F, Chao LS, Zhang T (2018) Modeling localness for self-attention networks. In: Proceedings of EMNLP. Brussels, Belgium, pp 4449–4458. https://doi.org/10.1109/CVPR. 2016.90. 63. Wang W, Li X, Ren H, Gao D, Fang A (2023) Chinese clinical named entity recognition from electronic medical records based onmultisemantic features by using robustly optimized bidirectional encoder repre- sentation from transformers pretraining approachwholewordmasking and convolutional neural networks: model development and validation. JMIR Med Inform 11(e44597) 64. Kong J, Zhang L, Jiang M, Liu T (2021) Incorporating multi-level CNN and attention mechanism for Chinese clinical named entity recognition. J Biomed Inform 116:103737. https://doi.org/10.1016/j.jbi. 2021.103737 65. Madabushi HT, Kochkina E, Castelle M (2020) Cost-sensitive BERT for generalisable sentence classifi- cation with imbalanced data. arXiv:2003.11563 66. Wei JW, Zou K (2019) Eda: Easy data augmentation techniques for boosting performance on text classi- fication tasks. arXiv:arXiv:1901.11196 67. Xiaoya L, Xiaofei S, YuxianM, Junjun L, FeiW, Jiwei L (2020) Dice loss for data-imbalanced NLP tasks. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 465–476 68. Sorensen TA (1948) A method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyses of the vegetation on Danish commons. Kong Dan Vidensk Selsk Biol Skr 5:1–34 69. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado G.S, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jozefowicz R, Jia Y, Kaiser L, Kudlur M, Levenberg J, Mané D, Schuster M, Monga R, Moore S, Murray D, Olah C, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viégas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, Zheng X (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org. https://www.tensorflow.org/ 70. Flaxman AD, Harman L, Joseph J, Brown J, Murray CJ (2018) A de-identified database of 11,979 verbal autopsy open-ended responses. Gates Open Res 2:18 71. Maas AL, Daly RE, Pham PT, Huang D, Ng AY, Potts C (2011) Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics 72. Le QV, Mikolov T (2014) Distributed representations of sentences and documents. In Proceedings of the 31st international conference on machine learning (ICML 2014), pp 1188–1196 73. Mtsamples (2022) Transcribed medical transcription sample reports and examples. Great collection of transcription samples. https://www.mtsamples.com/ 74. PetersME, NeumannM, IyyerM, GardnerM, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. arXiv:1802.05365 75. Danso S, Johnson O, Ten Asbroek A, Soromekun S, Edmond K, Hurt C, Hurt L, Zandoh C, Tawiah C, Fenty J, Etego SA, Aygei SO, Kirkwood B (2013) A semantically annotated verbal autopsy corpus for automatic analysis of cause of death. ICAME J Int Comput Arch Modern Mediev English 37:37–69 76. See A, Liu PJ, Manning CD (2017) Get to the point: Summarization with pointer-generator networks. arXiv:1704.04368 77. Jeblee S, Gomes M, Jha P, Rudzicz F, Hirst G (2019) Automatically determining cause of death from verbal autopsy narratives. BMC Med Inf Decis Mak 19(127) 78. Jeblee S, Gomes M, Hirst G (2018) Multi-task learning for interpretable cause of death classification using key phrase predictions. In Proceedings of the BioNLP 2018 Workshop vol 34, no 19, pp 12–27 123 https://doi.org/10.1016/j.aiopen.2022.10.001 https://doi.org/10.1016/j.aiopen.2022.10.001 https://doi.org/10.1609/aaai.v33i01.33016489. https://doi.org/10.1109/CVPR.2016.90. https://doi.org/10.1109/CVPR.2016.90. https://doi.org/10.1016/j.jbi.2021.103737 https://doi.org/10.1016/j.jbi.2021.103737 http://arxiv.org/abs/2003.11563 http://arxiv.org/abs/1901.11196 https://www.tensorflow.org/ https://www.mtsamples.com/ http://arxiv.org/abs/1802.05365 http://arxiv.org/abs/1704.04368 177 Page 26 of 26 T. Manaka et al. 79. Manaka T, Van Zyl TL,Wade AN, Kar D (2022) Using machine learning to fuse verbal autopsy narratives and binary features in the analysis of deaths from hyperglycaemia. arXiv:2204.12169 Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. 123 http://arxiv.org/abs/2204.12169 Multi-step Transfer Learning in Natural Language Processing for the Health Domain Abstract 1 Introduction 2 Background 2.1 Multi-source Domain Adaptation 2.2 ELMo 2.3 BERT 2.4 Verbal Autopsy 2.5 Data Class Imbalance 3 Methods 3.1 Algorithms 3.2 Datasets 3.3 Experiments 3.3.1 ELMo 3.3.2 BERT 3.3.3 Validation 4 Results and Discussion 5 Limitations of the Study 6 Conclusion Acknowledgements References