Received 19 February 2024, accepted 31 March 2024, date of publication 5 April 2024, date of current version 17 April 2024.
Digital Object Identifier 10.1109/ACCESS.2024.3385425

A Comparative Performance Analysis of Popular Deep Learning Models and Segment Anything Model (SAM) for River Water Segmentation in Close-Range Remote Sensing Imagery

ARMIN MOGHIMI 1, MARIO WELZEL 1, TURGAY CELIK 2,3,4, AND TORSTEN SCHLURMANN 1
1 Ludwig-Franzius-Institute for Hydraulic, Estuarine and Coastal Engineering, Leibniz University Hannover, 30167 Hanover, Germany
2 School of Electrical and Information Engineering, University of the Witwatersrand Johannesburg, Johannesburg 2050, South Africa
3 Wits Institute of Data Science, University of the Witwatersrand Johannesburg, Johannesburg 2050, South Africa
4 Faculty of Engineering and Science, University of Agder, 4630 Kristiansand, Norway
Corresponding author: Armin Moghimi (moghimi@lufi.uni-hannover.de)
This work was supported in part by the joint research project ‘‘Zukunftslabor Wasser’’ funded by the Lower-Saxon Ministry of Research and Culture under Grant FKZ: 11-76251-1873/2022 (ZN3994), and in part by the Open Access Fund of Leibniz Universität Hannover.

ABSTRACT Accurate segmentation of river water in close-range Remote Sensing (RS) images is vital for efficient environmental monitoring and management. However, this task poses significant difficulties due to the dynamic nature of water, which exhibits varying colors and textures reflecting the sky and surrounding structures along the riverbanks. This study addresses these complexities by evaluating and comparing several well-known deep-learning (DL) techniques on four river scene datasets. To achieve this, we fine-tuned the recently introduced ‘‘Segment Anything Model’’ (SAM) along with popular DL segmentation models such as U-Net, DeepLabV3+, LinkNet, PSPNet, and PAN, all using ResNet50 pre-trained on ImageNet as a backbone. Experimental results highlight the diverse performances of these models in river water segmentation. Notably, fine-tuned SAM demonstrates superior performance, followed by U-Net(ResNet50), despite their higher computational costs. In contrast, PSPNet(ResNet50), while less effective, proves to be the most efficient in terms of execution time. In addition to these findings, we introduce a novel river water segmentation dataset, LuFI-RiverSnap.v1 (https://github.com/ArminMoghimi/RiverSnap), characterized by a more diverse range of scenes and more accurate masks than existing datasets. To facilitate reproducible research in remote sensing and computer vision, we also release the implementation of the fine-tuned SAM model at the same repository. The findings from this research, coupled with the presented dataset and the accuracy achieved by fine-tuned SAM segmentation, can support tracking river changes, understanding river water level trends, and exploring river ecosystem dynamics. They can also provide valuable insights for practitioners and researchers seeking models tailored to specific image characteristics, with practical uses in disaster risk reduction, such as rapid assessment of inundation during floods or automatic extraction of gauge data in watersheds.

INDEX TERMS Deep learning, segment anything model (SAM), river water segmentation, U-Net, DeepLabV3+, LinkNet, PSPNet, PAN, RiverSnap.

I. INTRODUCTION
Rivers, as vital components of Earth’s hydrological cycle, play a fundamental role in sustaining ecosystems and human societies [1].
Accurate delineation and characterization of river boundaries and features are imperative for a multitude of applications, including flood risk assessment, water resource management, riverine hydrological parameter estimation, and ecological conservation [2]. In contemporary river research, where high-resolution images with exceptional clarity and detail are essential, e.g., to validate numerical models or to acquire accurate data for early warning systems, the segmentation of river water in close-range Remote Sensing (RS) images captured by low-cost sensors becomes particularly significant [3], [4].

(The associate editor coordinating the review of this manuscript and approving it for publication was Derek Abbott.)
2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/

FIGURE 1. Examples of different segmentation models in river water segmentation. (a) The image of Norway’s Glomma river (source link) and its segmentation resulting from (b) image thresholding, (c) a hybrid algorithm (k-means + active contour model), and (d) a Deep Learning (DL) model.

Indeed, close-range RS images captured by low-cost cameras (e.g., smartphone or surveillance cameras) have been proven to facilitate the detection of subtle variations in river water properties and the surrounding terrain [3], [5], [6]. This presents an opportunity, rarely utilized and not yet systematically investigated, to extract nuanced insights into hydrological parameters and related processes (e.g., water level, water turbidity, floating debris) from close-range RS images [3], [6]. Segmenting the water body from the background in close-range images is a key step in determining riverine parameters. It forms the basis for further analysis, directly impacting the accuracy of the parameters subsequently derived from the images. To this end, researchers have introduced a variety of image segmentation algorithms, which can be classified into conventional image analysis methods and advanced Deep Learning (DL) models [7], [8]. While traditional methods (e.g., thresholding, region-based, and hybrid algorithms) are commonly used for image segmentation, they encounter limitations when applied to water bodies such as wetlands, lakes, and river scenes due to the complex nature of water (e.g., inhomogeneous appearance and color variations) and the reflections of surrounding structures (e.g., vegetation, rocks, and buildings) and the sky on the water surface [9], [10], [11] (see Fig. 1(b) and (c)). This is mainly because most of these techniques depend on low-level image features for object segmentation, which may not inherently capture complex spatial relationships, such as those found in river scenes in close-range images [8], [12]. In contrast, DL models, known for their automatic extraction of high-level semantic features from various data types, have recently offered more robust solutions in this specific aspect [5], [11], [12].
Furthermore, since DL models are typically trained on diverse datasets, they exhibit superior capabilities in handling sky reflections and structures on the water body, as depicted in Fig. 1(d). While DL techniques have mainly been applied to segment water in spaceborne/airborne RS images [13], [14], [15], they have high potential for accurately extracting hydrological features such as water bodies and water levels from close-range imagery captured by low-cost cameras, such as surveillance and smartphone cameras [12]. For example, to predict floods stemming from river overflow, Lopez-Fuentes et al. [16] identified rising river water levels by applying three DL techniques (the fully convolutional network (FCN) [17], DenseNet [18], and the image-to-image translation network of [19]) to water segmentation in surveillance camera images. Their study showed that the DenseNet model exhibited superior performance in flood extent detection from close-range RS images. Pan et al. [20] also compared different image-based methods for water-body segmentation in surveillance camera images to create a real-time water level detection service. The study revealed that employing a convolutional neural network (CNN) at the core of the service significantly enhanced accuracy, yielding highly accurate results (about 9 mm error) compared to reference measurements. De Vitry et al. [21] employed U-Net [22] to detect floodwater and introduced the Static Observer Flooding Index (SOFI) to obtain water level fluctuations. In a similar approach, Akiyama et al. [23] examined the potential of SegNet [24] for extracting river water from the background of close-range RS images. Vandaele et al. [25] employed both the UPerNet [26] and DeepLabv3 [27] networks with a ResNet50 architecture [28] as a backbone for river water segmentation in the context of water level detection. Muhadi et al. [12] successfully utilized the DeepLabv3+ [29] and SegNet networks to detect water bodies, evaluate water levels, and track fluctuations in surveillance images. Eltner et al. [11] also integrated advanced DL water segmentation models (SegNet and FCN) with photogrammetric techniques to achieve precise water stage measurements from images taken by a Raspberry Pi camera.

The aforementioned studies have yielded valuable insights into identifying water bodies in image backgrounds and their application to extracting hydrological parameters, such as water level. However, most of these studies relied on a limited set of segmentation models, leaving their adaptability across diverse environmental contexts and image datasets uncertain. Furthermore, their applicability was often confined to specific scenes (e.g., the datasets used in [11]), making the trained CNNs less suitable for broader applications in different environments [3]. To enhance real-time water level monitoring and address these limitations, Eltner et al. [30] introduced UPerNet [26] with a ResNeXt-50 backbone as a well-generalized CNN model, chosen from a range of DL models for water segmentation in various geographical contexts. Furthermore, Wagner et al. [3] conducted a wide-ranging study evaluating 32 DL segmentation models with online/offline augmentation, introducing the high-quality RIWA.v1 dataset [6], [31] for water detection. While existing studies have identified several effective DL models for river water segmentation, limited datasets often constrained their evaluations.
A more comprehensive analysis across multiple datasets with diverse conditions is essential to better assess the capabilities and generalizability of each model. While the RIWA dataset [31] provides valuable support for this research area, there remains a critical need for additional accurately labeled datasets to further enhance model training and evaluation. Moreover, DL models such as U-Net [22], DeepLabV3+ [29], and the Pyramid Scene Parsing Network (PSPNet) [32] have been widely employed in river water segmentation and its applications. For instance, a recent comparative analysis [3] revealed that U-Net with a ResNeXt50 backbone performed best for river water segmentation among 32 DL models. However, while this study sheds light on the capabilities of established DL architectures, it also raises questions about the exploration of alternative methodologies and newer model designs. In particular, the potential of efficient model architectures such as the Pyramid Attention Network (PAN) [33] and LinkNet [34], as well as recently published DL models such as Meta AI’s ‘‘Segment Anything Model’’ (SAM) [35], has not yet been explored and compared. Incorporating such advanced models could also unveil new avenues for research in river water segmentation. In this study, we aim to address these challenges, making the following contributions:

1. For the first time, to the best of our knowledge, we fine-tuned the SAM segmentation model for water/river water segmentation in close-range RS images. This fine-tuning aims to enhance the segmentation of water bodies, particularly for wetlands, lakes, and rivers. The fine-tuned SAM model’s code has been made accessible at https://github.com/ArminMoghimi/RiverSnap.
2. We extensively surveyed six state-of-the-art DL models (U-Net [22], DeepLabV3+ [29], LinkNet [34], PSPNet [32], PAN [33], and the recently published SAM [35]), evaluating their strengths and limitations. This assessment aims to establish their suitability for water body segmentation from close-range RS images at rivers.
3. We collected close-range images of rivers using various platforms, such as smartphones, cameras, surveillance cameras, and low-altitude drones, to introduce a new dataset named LuFI-RiverSnap.v1. It comprises over 800 images with precise annotations, along with a few suitable images from the Kaggle WaterNet [36], Elbersdorf/Wesenitz [11], and RIWA.v1 [31] datasets. To support and advance the development of river water segmentation tasks, we have released the LuFI-RiverSnap.v1 dataset at https://github.com/ArminMoghimi/RiverSnap.
4. We performed a thorough experimental analysis of segmentation methods on the three considered benchmark datasets and the newly introduced LuFI-RiverSnap.v1 dataset. This comprehensive evaluation sets a research baseline, providing valuable segmentation model recommendations for river water analysis and suggesting future study directions.

FIGURE 2. The ResNet50 backbone, adopted from the PyTorch implementation and [38].
FIGURE 3. Architecture of U-Net(ResNet50), adopted from [39].

II. MATERIALS AND METHODS
A. DL SEGMENTATION MODELS
This section briefly overviews the unique architectural features of the six baseline DL models considered in the context of water body segmentation tasks. For all models except SAM, we employed ResNet50 [28] (see Fig. 2), pre-trained on the ImageNet dataset [37], as the backbone architecture.

1) U-Net
U-Net, introduced by Ronneberger et al. [22], is widely used for object segmentation.
Its U-shaped architecture includes an encoder for downsampling and a decoder for upsampling and feature fusion [22]. Skip connections aid in integrating features from different resolutions [39]. In this study, ResNet50 serves as the encoder, while a custom decoder performs upsampling through convolutional layers (see Fig. 3).

2) DeepLabV3+
DeepLabV3+ is an advanced variant of the DeepLab series, featuring Atrous Spatial Pyramid Pooling (ASPP) and an encoder-decoder structure [29]. ResNet50 is likewise integrated into the encoder pathway in this study. The encoder extracts high-level features and combines them into a comprehensive representation. Upsampling with transposed convolutions enhances spatial resolution, enriched by skip connections (see Fig. 4).

FIGURE 4. Architecture of DeepLabV3+(ResNet50), adopted from [29].
FIGURE 5. Architecture of LinkNet(ResNet50), adopted from [34].

3) LinkNet
LinkNet features a U-shaped architecture, maintaining the crucial encoder-decoder structure for hierarchical feature extraction and seamless upsampling [34]. In the depicted architecture (Fig. 5), ResNet50 is again utilized in the encoder to extract high-level features through a downsampling strategy. Encoder features are directly linked to the corresponding decoder outputs via skip connections using the ‘‘sum’’ operator.

4) PSPNet
PSPNet is a model structured on the encoder-decoder paradigm, incorporating a distinctive pyramid pooling module (PPM) for aggregating contextual information across multiple scales within the feature hierarchy [32]. As shown in Fig. 6, the initial step uses ResNet50 to extract the feature maps. Following this, the PPM is employed to generate representations of different sub-regions. These features are then upsampled and concatenated, enabling the model to incorporate both local and global contextual information. The final representation undergoes convolutional processing to produce river water predictions.

FIGURE 6. Architecture of PSPNet(ResNet50), adopted from [32].
FIGURE 7. Architecture of PAN(ResNet50), adopted from [33].

5) PAN
PAN is an encoder-decoder architecture designed to enhance global contextual information in semantic segmentation [33]. This is achieved by integrating the Feature Pyramid Attention (FPA) and Global Attention Upsample (GAU) modules. In our implementation, ResNet50 was likewise used to extract dense features, followed by FPA and GAU for accurate pixel predictions and localization details (see Fig. 7).

6) SAM
SAM is an innovative promptable encoder-decoder model developed by the Meta AI team for precise image segmentation [35]. Its training involved the massive SA-1B dataset, encompassing over 1 billion masks extracted from 11 million images, giving it the generality to segment unseen objects and images [35], [40]. This prowess extends the model’s applicability beyond the confines of image segmentation, allowing it to be effectively employed in various scenarios, including but not limited to object tracking [40]. As depicted in Fig. 8(a), SAM comprises three key components [35]. First, an image encoder, denoted by Enc_I, uses a Vision Transformer (ViT) backbone, such as ViT-B (91M parameters), ViT-L (308M), or ViT-H (636M), to process 1024 × 1024 images I and generate image embedding features F_I [35].
The flexible prompt encoder, denoted by Enc_P, then handles both sparse prompts P_s (e.g., points, boxes, text) and dense prompts (masks) P_l, translating them into tokens T_p and T_L, respectively [35]. Finally, the outputs of the encoders pass to a lightweight mask decoder Dec_L [35] for label prediction.

FIGURE 8. Architecture of (a) SAM, adopted from [35], and (b) the fine-tuning of SAM for river water segmentation in this study.

7) FINE-TUNING OF SAM
In this study, SAM was fine-tuned for river water segmentation, as depicted in Fig. 8(b). To this end, we first froze the SAM encoder Enc_I with the lighter ViT-B (91M parameters) backbone to optimize computational efficiency and conserve resources. This configuration allowed us to use the pre-trained parameters of the SAM encoder for extracting robust image embedding features F_I from river scene images. It also facilitates the use of larger batch sizes, thereby enhancing the overall efficiency of the model training pipeline. Given the absence of a prompt in our fine-tuning process, the learnable mask tokens T_L were automatically obtained through element-wise addition of a learned embedding to each location of the image embedding F_I [35]. Subsequently, the extracted features F_I, alongside the learnable mask tokens T_L, were directly fed into the trainable mask decoder Dec_L to predict the low-resolution mask L̂ as follows:

L̂ = Dec_L(F_I, T_L)    (1)

The low-resolution mask L̂ is then upsampled to the input size and compared with the ground-truth mask using a chosen loss function during training. Our fine-tuning is conducted without any prompt encoder, enabling a fair assessment of SAM alongside the other models under consideration and reducing human-machine interaction.

B. RIVER WATER SEGMENTATION USING DL MODELS
The river water segmentation workflow using DL models is depicted in Fig. 9. Supported by a river water dataset with distinct subsets for training, validation, and testing, this methodology forms the foundation of our approach. Let D = {(I_i, L_i)}_{i=0}^{N-1} denote our river dataset, where I_i ∈ R^{w×h×3} is an RGB image of the river scene and L_i ∈ {0, 1}^{w×h} is its corresponding annotated river mask for sample i, i.e., ‘‘0’’ represents the background and ‘‘1’’ denotes the river. The dataset D comprises three mutually exclusive subsets: the training set D_train, the validation set D_val, and the test set D_test, where D = D_train ∪ D_val ∪ D_test. Initially, all subsets are subjected to a preprocessing step, denoted as P, which includes normalization (N) and resizing (R) operations:

D^P = {(R(N(I_i)), R(L_i))}_{i=0}^{N-1} = {(I_i^P, L_i^P)}_{i=0}^{N-1}    (2)

where N(·) ensures consistent pixel values and R standardizes the input dimensions to a 512 × 512 pixel format, resulting in I_i^P ∈ R^{512×512×3} and L_i^P ∈ {0, 1}^{512×512} as the preprocessed RGB image and the corresponding annotated river mask for sample i, respectively. Subsequently, the preprocessed training and validation samples are fed into a chosen deep learning segmentation model f_M, where M is one of the segmentation models (SAM, U-Net(ResNet50), PSPNet(ResNet50), PAN(ResNet50), LinkNet(ResNet50), DeepLabV3+(ResNet50)) with parameters θ_M. During the training phase, the model f_M is constructed and refined using the dataset D^P_train, guiding it to learn river water patterns and handle reflection structures.
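To make the fine-tuning configuration of Fig. 8(b) concrete, the sketch below shows one plausible PyTorch realization, assuming the publicly available segment_anything package and a downloaded ViT-B checkpoint file. The single-image training step, the checkpoint filename, and the use of the prompt encoder's empty-prompt output as the stand-in for the mask tokens T_L are our illustrative assumptions, not the authors' released implementation (see the paper's repository for that).

```python
# Minimal sketch of prompt-free SAM fine-tuning (Fig. 8(b)): freeze the ViT-B
# image encoder Enc_I and train only the lightweight mask decoder Dec_L.
import torch
import torch.nn.functional as F
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # assumed file
sam.image_encoder.requires_grad_(False)                # freeze Enc_I
optimizer = torch.optim.Adam(sam.mask_decoder.parameters(), lr=1e-5)
loss_fn = torch.nn.BCEWithLogitsLoss()                 # BCE-with-logits, Eq. (3) below

def training_step(image, mask):
    """image: (1, 3, 1024, 1024), already resized/normalized for SAM;
    mask: (1, 1, H, W) binary ground truth. One sample per step for simplicity."""
    with torch.no_grad():
        f_i = sam.image_encoder(image)                 # frozen embedding F_I
    # No prompts: empty sparse tokens plus SAM's learned "no-mask" dense
    # embedding, playing the role of the mask tokens T_L in Eq. (1).
    sparse, dense = sam.prompt_encoder(points=None, boxes=None, masks=None)
    low_res_logits, _ = sam.mask_decoder(
        image_embeddings=f_i,
        image_pe=sam.prompt_encoder.get_dense_pe(),
        sparse_prompt_embeddings=sparse,
        dense_prompt_embeddings=dense,
        multimask_output=False,
    )                                                  # Eq. (1): predicted low-res mask
    logits = F.interpolate(low_res_logits, size=mask.shape[-2:],
                           mode="bilinear", align_corners=False)
    loss = loss_fn(logits, mask.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the mask decoder (a few million parameters) receives gradients while the ViT-B encoder stays frozen, memory use remains modest, which is what permits the larger batch sizes mentioned above.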
To ensure the model generalizes well to new data and avoids overfitting, the dataset D^P_val is also utilized during training. Given the preprocessed input image I_i^P, the model computes the predicted binary river mask L̂_i^P = f_M(I_i^P; θ_M) through a forward pass in each training loop. To quantify the difference between the predicted water mask L̂_i^P and its ground-truth mask L_i^P in each training loop, the loss is computed using binary cross-entropy (BCE) with sigmoid (logits) loss:

L_BCE(L̂_i^P, L_i^P) = −[ L_i^P log σ(L̂_i^P) + (1 − L_i^P) log(1 − σ(L̂_i^P)) ]    (3)

where σ is the logistic sigmoid function. To update the model parameters θ_M, the training loss L_train is minimized using the Adam optimizer [41] over each epoch. Moreover, the validation loss L_val is computed over D^P_val using the trained model f_M and its updated parameters θ_M to monitor the model’s performance at the end of each epoch. Algorithm 1 outlines the specific steps involved in the training phase of river water segmentation. After fine-tuning the model, evaluation is performed on the test samples. Here, the trained model f_M with its optimized parameters θ_M is employed to predict the test subset D^P_test. These predictions are then compared with the ground-truth masks L^P_test using the metrics detailed in subsection II-D to evaluate model performance. The specifics of the testing process are summarized in Algorithm 2.

C. DATA
We employed three established benchmark datasets (Kaggle WaterNet [36], Elbersdorf/Wesenitz [11], and RIWA.v1 [31]) along with our newly proposed LuFI-RiverSnap.v1 dataset to enrich the comprehensiveness of our analysis across the baseline deep learning networks. The numbers of training, validation, and test images in these datasets are shown in Table 1.

FIGURE 9. Workflow of river water segmentation using Deep Learning (DL) models.

1) KAGGLE WaterNet [36]
This widely adopted public dataset, available in two versions, contains over 1000 annotated images (https://www.kaggle.com/datasets/gvclsu/water-segmentation-dataset). It integrates water images from the ADE20K dataset, encompassing diverse water body scenes such as water streams, river segments, and gauges from various locations. In this study, we selected 748 accurately labeled river scene images from the second version. This dataset was divided into training (70%), validation (∼16%), and test (∼14%) sets, forming the basis of our workflow. Given the moderate range of diversity in river water color and reflection, most of the considered models are expected to perform proficiently on this dataset.

2) ELBERSDORF/WESENITZ [11]
This dataset comprises 3,407 accurately annotated images captured by a Raspberry Pi RGB camera positioned 4 m above the Wesenitz river at the Elbersdorf station between March 30, 2017, and April 30, 2018 (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ONOZRW). We randomly selected 1001 images from this dataset, allocating approximately 50% for training, ∼20% for validation, and ∼30% for testing. Given that these data originate from a single location, optimal model performance is anticipated on this dataset.

3) RIWA.v1 [31]
This dataset stands as a robust and versatile resource for river scene analysis and segmentation, encompassing 1128 images captured using smartphones, Unoccupied Aerial Vehicles (UAVs), and DSLRs (https://www.kaggle.com/datasets/franzwagner/river-water-segmentation-dataset/versions/1). It incorporates diverse river scenes, including images from Elbersdorf/Wesenitz [11] and Kaggle WaterNet [36].
The dataset’s diversity, with rivers of varying colors, presents a challenge for models in segmenting water from the background.

4) LuFI-RiverSnap.v1
This dataset includes close-range river scene images obtained from various devices, such as UAVs, surveillance cameras, smartphones, and handheld cameras, with sizes up to 4624 × 3468 pixels (https://github.com/ArminMoghimi/RiverSnap). Several social media images, typically volunteered geographic information (VGI) [42], have also been incorporated into the dataset to provide more diverse river landscapes from various locations and sources. The images mainly include river scenes from several cities in Germany (Hannover, Hamburg, Bremen, Berlin, and Dresden), Italy (Venice), Iran (Ahvaz), the USA, and Australia. To further enhance the dataset’s diversity and accuracy, a small subset of images from Elbersdorf/Wesenitz [11], RIWA.v1 [31], and Kaggle WaterNet [36] has been added. This comprehensive dataset includes 1092 images, all accurately annotated, establishing it as a valuable resource for advancing research and development in river scene analysis and segmentation. The dataset comprises challenging cases for water segmentation, such as rivers with significant reflection, shadows, various colors, and flooded areas. Fig. 10 provides an insightful overview of the LuFI-RiverSnap.v1 dataset, showcasing instances of rivers with different colors.

FIGURE 10. LuFI-RiverSnap dataset: illustrative examples of rivers featuring diverse water colors.

D. EVALUATION CRITERIA
To evaluate the performance of the considered DL models in river water segmentation, their predictions were compared with the corresponding ground truths of the test samples using several metrics extracted from the confusion matrix. These metrics comprise the Overall Accuracy (OA), Precision, Recall, F1-score (F_S), Intersection over Union (IoU), and Kappa Coefficient (κ), measured by:

OA = (TP + TN) / (TP + FP + FN + TN) × 100%    (4)
Precision = TP / (TP + FP) × 100%    (5)
Recall = TP / (TP + FN) × 100%    (6)
F_S = 2TP / (2TP + FP + FN) × 100%    (7)
IoU = TP / (TP + FP + FN) × 100%    (8)
κ = (P_0 − P_e) / (1 − P_e),  subject to  P_e = [(TP + FN)(TP + FP) + (FN + TN)(FP + TN)] / τ²,  P_0 = (TP + TN) / τ,  τ = TP + FP + FN + TN    (9)

In the above equations, TP (True Positive) refers to pixels that are accurately segmented as part of the river category; FP (False Positive) refers to pixels that are incorrectly classified as part of the river category when they actually belong to a different category; TN (True Negative) refers to pixels that are accurately assigned as background (non-river) pixels; and FN (False Negative) refers to pixels that are incorrectly classified as non-river when they actually belong to the river category. To interpret the performance of the models quantitatively, we introduced a categorization scheme using predefined thresholds for IoU. This scheme assigns a qualitative label to the model’s performance based on the observed metric values for each test sample.
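For reference, the sketch below gives one straightforward NumPy implementation of these per-image metrics from a binary prediction and its ground-truth mask; the function name and the epsilon guard against division by zero are our own additions. Its IoU output maps directly onto the qualitative ratings described next.

```python
# Minimal NumPy sketch of the per-image metrics in Eqs. (4)-(9).
import numpy as np

def segmentation_metrics(pred, gt, eps=1e-12):
    """pred, gt: boolean (H, W) arrays, True marking river pixels;
    eps guards divisions when a class is absent from an image."""
    tp = int(np.count_nonzero(pred & gt))     # river predicted as river
    fp = int(np.count_nonzero(pred & ~gt))    # background predicted as river
    fn = int(np.count_nonzero(~pred & gt))    # river predicted as background
    tn = int(np.count_nonzero(~pred & ~gt))   # background predicted as background
    tau = tp + fp + fn + tn                   # total number of pixels
    p0 = (tp + tn) / tau                      # observed agreement
    pe = ((tp + fn) * (tp + fp) + (fn + tn) * (fp + tn)) / tau**2
    return {
        "OA": 100.0 * (tp + tn) / tau,                    # Eq. (4)
        "Precision": 100.0 * tp / (tp + fp + eps),        # Eq. (5)
        "Recall": 100.0 * tp / (tp + fn + eps),           # Eq. (6)
        "F1": 100.0 * 2 * tp / (2 * tp + fp + fn + eps),  # Eq. (7)
        "IoU": 100.0 * tp / (tp + fp + fn + eps),         # Eq. (8)
        "Kappa": (p0 - pe) / (1 - pe + eps),              # Eq. (9)
    }
```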
Accordingly, the model’s performance is labeled ‘‘Excellent’’ when IoU > 0.90, ‘‘Good’’ when 0.85 ≤ IoU ≤ 0.90, ‘‘Fair’’ when 0.75 ≤ IoU < 0.85, ‘‘Poor’’ when 0.65 ≤ IoU < 0.75, and ‘‘Unacceptable’’ otherwise. This approach provides a concise and insightful way of interpreting model performance across key metrics, facilitating a comparative analysis that identifies the strengths of the DL models.

III. EXPERIMENTS
A. EXPERIMENTAL SETUP
We implemented the river water segmentation models in Python, utilizing PyTorch’s core library within Google Colab. GPU-accelerated computations in Colab were performed with the following resources: an NVIDIA Tesla T4 (15 GB) GPU, an Intel Xeon CPU with two cores running at 2.30 GHz, and 32 GB of RAM. We employed the segmentation_models_pytorch library (https://github.com/qubvel/segmentation_models.pytorch) to implement all DL models, excluding SAM, which was fine-tuned based on its original code (https://github.com/facebookresearch/segment-anything). All DL models were trained on the datasets without augmentation, using default parameters, for 50 epochs with a batch size of 32 and a learning rate of 1 × 10⁻⁵. The model parameters from the epoch with the lowest validation loss were saved as the best-performing ones. The computational time per epoch and the number of trainable parameters (n_param) for each model are reported in Table 2.

Algorithm 1: DL Model Training for River Water Segmentation
Data: D_train = {(I_i, L_i)}_{i=0}^{N_train−1}, D_val = {(I_i, L_i)}_{i=0}^{N_val−1}, learning rate α, batch size B, number of epochs N_epochs
Result: model f_M, parameters θ_M
1. Initialization: initialize the model parameters θ_M and set the best validation loss L_valBest ← ∞.
2. Data preprocessing: preprocess every training and validation sample with Eq. (2) to obtain D^P_train and D^P_val.
3. Model training: for epoch ← 1 to N_epochs:
   a. For each mini-batch (I^P, L^P) ∈ D^P_train: compute L̂^P ← f_M(I^P; θ_M), evaluate the BCE-with-logits loss (Eq. (3)), and update θ_M with the Adam optimizer (learning rate α).
   b. Compute and print the average training loss L_train for the epoch.
   c. Evaluate on the validation set: for each mini-batch in D^P_val, compute the loss with the current θ_M; average to obtain and print the validation loss L_val.
   d. If L_val < L_valBest, save the model checkpoint and set L_valBest ← L_val.
4. Return the model f_M and its parameters θ_M.
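For the CNN baselines, Algorithm 1 under the setup just described can be realized compactly with the segmentation_models_pytorch package. The sketch below is a minimal, hedged illustration using the stated hyperparameters (Adam, learning rate 1 × 10⁻⁵, batch size 32, 50 epochs, BCE-with-logits); the random-tensor data loaders are placeholders for the preprocessed subsets of Eq. (2).

```python
# Minimal sketch of Algorithm 1 for the CNN baselines, assuming the
# segmentation_models_pytorch package; the data loaders are placeholders.
import torch
import segmentation_models_pytorch as smp
from torch.utils.data import DataLoader, TensorDataset

model = smp.Unet(encoder_name="resnet50",      # ImageNet-pretrained backbone
                 encoder_weights="imagenet",
                 in_channels=3, classes=1)     # one logit per pixel (river vs. background)
# The other baselines swap in smp.DeepLabV3Plus, smp.Linknet, smp.PSPNet, or smp.PAN.

loss_fn = torch.nn.BCEWithLogitsLoss()         # Eq. (3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

def dummy_loader(n):                           # stand-in for D^P_train / D^P_val
    images = torch.randn(n, 3, 512, 512)
    masks = torch.randint(0, 2, (n, 1, 512, 512))
    return DataLoader(TensorDataset(images, masks), batch_size=32)

train_loader, val_loader = dummy_loader(8), dummy_loader(4)

def run_epoch(loader, train):
    """One pass over a loader; returns the mean loss (L_train or L_val)."""
    model.train(train)
    total = 0.0
    for images, masks in loader:
        with torch.set_grad_enabled(train):
            loss = loss_fn(model(images), masks.float())
        if train:
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        total += loss.item() * images.size(0)
    return total / len(loader.dataset)

best_val = float("inf")
for epoch in range(50):                        # 50 epochs, as in the setup above
    train_loss = run_epoch(train_loader, train=True)
    val_loss = run_epoch(val_loader, train=False)
    if val_loss < best_val:                    # keep the checkpoint with the lowest L_val
        best_val = val_loss
        torch.save(model.state_dict(), "best_model.pt")
```

Swapping the single model-construction line is the only change needed to reproduce the other baseline configurations under this sketch.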
Algorithm 2: DL Model Testing for River Water Segmentation
Inputs: D_test = {(I_i, L_i)}_{i=0}^{N_test−1}, trained model f_M, model parameters θ_M.
Outputs: predictions on the test dataset L̂_test, average evaluation metrics.
1. Preprocess all test samples with Eq. (2) to obtain D^P_test.
2. For i ← 0 to N_test − 1: compute L̂_i^P ← f_M(I_i^P; θ_M), add it to L̂_test, and evaluate the metrics of subsection II-D against L_i^P, storing the result.
3. Return L̂_test and the average of the evaluated metrics.

TABLE 1. Numbers of training, validation, and test samples of the datasets used in this study.
TABLE 2. The number of parameters and average training time per epoch for each DL configuration.
TABLE 3. Comparison results of DL segmentation models on the Kaggle WaterNet dataset. (Red: the best, blue: second best.)

As evident from Table 2, the SAM configuration featured a parameter count roughly three times larger (>60M) than that of the other models, which explains its considerable computational expense during dataset training. PSPNet(ResNet50) and PAN(ResNet50) were among the lighter and more efficient networks, with the fewest parameters and the lowest computational costs. DeepLabV3+(ResNet50), U-Net(ResNet50), and LinkNet(ResNet50) exhibited roughly the same training time, although DeepLabV3+(ResNet50) was lighter than the other two. To clarify why we fine-tuned SAM with the ViT-B (91M) backbone, a comparative analysis is provided in Appendix A.

It is important to highlight that, due to limitations in computational resources, 512 × 512 input dimensions were used for all model training processes. It is worth noting that the SAM model resamples the 512 × 512 images to 1024 × 1024, as its ViT backbone requires inputs of this size. We evaluated the impact of this resampling process on SAM’s performance in Appendix B. We also plotted the training and validation loss curves to monitor overfitting and determine suitable training epochs for each DL network, as depicted in Fig. 11.

B. EXPERIMENTAL RESULTS
The following subsections present and discuss the visual and quantitative results of the DL models for each considered dataset.

1) RESULTS OBTAINED ON THE KAGGLE WaterNet DATASET
Fig. 12 shows examples of segmentation results obtained by applying the considered DL models to the Kaggle WaterNet test images. SAM outperformed the other models in river water segmentation (see Fig. 12(g)), while PSPNet(ResNet50) showed inferior performance in cases 1, 5, and 9 (see Fig. 12(c)). DeepLabV3+(ResNet50), PAN(ResNet50), and LinkNet(ResNet50) consistently produced robust segmentation results in most instances, except for case 12 (refer to column 12 of Fig. 12(d), (e), and (f)). In these cases, the models incorrectly identified areas around the river as water. However, PAN(ResNet50) demonstrated better accuracy than DeepLabV3+(ResNet50) and LinkNet(ResNet50) in these cases.
The U-Net(ResNet50) model displayed moderate segmentation performance compared with the other models, exhibiting good performance in some cases, such as test images 6, 7, 8, and 11 (see Fig. 12(b)), and poorer performance in others. The quantitative results reported in Table 3 closely align with the observations from Fig. 12. The SAM model emerged as the top performer among the tested DL models, as evidenced by the highest metric values. Following closely behind SAM, DeepLabV3+(ResNet50) exhibited the second-best performance, with only a 1.5% degradation in κ and IoU. LinkNet(ResNet50) also demonstrated robust results, almost identical to those of DeepLabV3+(ResNet50). On the other hand, PSPNet(ResNet50) yielded poorer results, displaying the lowest metric values among the models assessed. Moreover, U-Net(ResNet50) and PAN(ResNet50) showed similar performance in river water segmentation.

The box plots in Fig. 13(a) also confirm these findings. SAM consistently showed higher median values than the other models, indicating more consistent performance with fewer fluctuations. SAM’s performance distribution was also more concentrated around the median of the metrics, suggesting less susceptibility to outliers and more predictable results. Conversely, PSPNet(ResNet50) had the lowest median values, indicating comparatively less stable performance. As depicted in Fig. 13(b), SAM achieved excellent performance on approximately 77% of the test data, with a failure rate of around 6%. Conversely, DeepLabV3+, PAN, LinkNet, and U-Net achieved excellent performance on roughly 69% of the test data, with varying rates of ‘‘Fair’’ and ‘‘Good’’ performance levels. Notably, while both LinkNet and PAN achieved the same number of ‘‘Excellent’’ ratings in water segmentation, PAN had a higher failure rate. Moreover, Fig. 13(b) also confirms PSPNet’s inferior performance: it achieved a success rate of 56%, which was 21% lower than SAM and 10% lower than the other models in river segmentation.

2) RESULTS OBTAINED ON THE ELBERSDORF/WESENITZ DATASET
It is evident from Fig. 14 that all the models qualitatively produced almost identical segmentation results, effectively distinguishing water from the background in the Elbersdorf/Wesenitz dataset. Notably, the considered state-of-the-art models exhibited slightly greater robustness than SAM. SAM’s results included limited FN and FP errors attributed to reflected shadows and floating vegetation in the shallow river water. These findings are reinforced by the comparative results of the DL models presented in Table 4.

Table 4 shows that the metrics for all models are around 0.99, signifying their robust capability in detecting river water within the Elbersdorf/Wesenitz dataset. This high level of performance is further verified by Fig. 13(b), illustrating that all methods achieved an ‘‘Excellent’’ rating over all 301 test images of this dataset. Upon closer inspection, it becomes apparent that the SAM model exhibited slightly weaker performance compared with the other models, as previously noted in the visual analysis of Fig. 14.

FIGURE 11. Training and validation loss on the considered datasets when (a) U-Net(ResNet50), (b) PSPNet(ResNet50), (c) DeepLabV3+(ResNet50), (d) PAN(ResNet50), (e) LinkNet(ResNet50), and (f) SAM were used as DL models for river water segmentation.
The box plots in Fig. 13(a) provide clearer insights into the differences in model performance. As depicted, U-Net(ResNet50) exhibited superior performance, with results densely clustered around its median. In contrast, SAM displayed a lower median than the other models, signifying its comparatively weaker performance in river segmentation in this case.

FIGURE 12. Some examples of river water segmentation results on the Kaggle WaterNet dataset. (a) Images and segmentation results generated by (b) U-Net(ResNet50), (c) PSPNet(ResNet50), (d) DeepLabV3+(ResNet50), (e) PAN(ResNet50), (f) LinkNet(ResNet50), and (g) SAM. Green: False Positive (FP) detections; pink: False Negative (FN) detections; blue: correct detection of river water.
TABLE 4. Comparison results of DL segmentation models on the Elbersdorf/Wesenitz dataset. (Red: the best, blue: second best.)

3) RESULTS OBTAINED ON THE RIWA.v1 DATASET
In contrast to the previous dataset, the considered models displayed varying performances in segmenting river water in the RIWA.v1 dataset, as depicted in Fig. 15. In several instances (cases 4, 5, 6, and 12), SAM demonstrated superior visual segmentation of river water compared with the other models. However, the common models exhibited both false positive (FP) and false negative (FN) errors in identifying the river zone in cases 2 and 5 (refer to columns 2 and 5 of Fig. 15(a)-(f)). Interestingly, in certain instances, particularly cases 8-13, many models performed comparably well or even surpassed SAM in segmenting river water (refer to columns 8-13 of Fig. 15(a)-(g)).

Table 5 provides a comprehensive assessment of the considered segmentation models on all test samples of the RIWA.v1 dataset. Upon scrutinizing the results in Table 5, it is evident that the U-Net(ResNet50) model demonstrated the best overall performance in terms of all metrics except precision, where SAM achieved a better result.

TABLE 5. Comparison results of DL segmentation models on the RIWA.v1 dataset. (Red: the best, blue: second best.)

DeepLabV3+(ResNet50) ranked as the second-best model in river water segmentation, with a slight difference (approximately 0.5% on average) compared with U-Net(ResNet50), closely followed by LinkNet(ResNet50). Although SAM and PAN(ResNet50) produced similar segmentation results, SAM visually appeared to have superior performance. Conversely, PSPNet(ResNet50) emerged as the least effective model for segmenting river water in the RIWA.v1 dataset, with its IoU and κ metrics almost 4.5% lower than those of the best-performing model. These results are strongly supported by the box plots presented in Fig. 13(a), where U-Net(ResNet50) exhibited higher median values, shorter boxes, and fewer outliers than the other models, indicating its general robustness in terms of κ, IoU, and F_S. Moreover, in 64 test samples, the performance of the U-Net(ResNet50) model was classified as ‘‘Excellent,’’ confirming the findings of Table 5 and Fig. 15, as illustrated in Fig. 13(b). In contrast, PSPNet(ResNet50) displayed the worst performance, as evidenced by lower median values, elongated boxes, and numerous outliers (see Fig. 13(a)).

FIGURE 13. (a) Box plots depicting the segmentation performance of the DL models on the considered datasets, evaluated with the κ, IoU, and F_S metrics. Each box represents the median, with edges indicating the 25th and 75th percentiles. (b) Success levels of the DL models on the considered datasets using the predefined IoU thresholds.
Furthermore, it had the lowest rate of successful segmentation compared with the others, with its performance classified as ‘‘Excellent’’ in less than 50% of the test samples (approximately 52 samples) of the RIWA.v1 dataset (see Fig. 13(b)). Although the SAM model visually appeared to perform well in the cases presented in Fig. 15, its performance was classified as ‘‘Excellent’’ in only 54 test samples, just 2 more than the worst model (PSPNet(ResNet50)) and 10 fewer than the best model (U-Net(ResNet50)). However, in 25 test samples, the performance of SAM was classified as ‘‘Good,’’ the highest count at this success level among all considered models.

FIGURE 14. Some examples of river water segmentation results on the Elbersdorf/Wesenitz dataset. (a) Images and segmentation results generated by (b) U-Net(ResNet50), (c) PSPNet(ResNet50), (d) DeepLabV3+(ResNet50), (e) PAN(ResNet50), (f) LinkNet(ResNet50), and (g) SAM. Green: False Positive (FP) detections; pink: False Negative (FN) detections; blue: correct detection of river water.

4) RESULTS OBTAINED ON THE LuFI-RiverSnap.v1 DATASET
As expected, the diverse characteristics of the LuFI-RiverSnap.v1 dataset led to varying performances among the considered segmentation models, as illustrated by the examples showcased in Fig. 16. In this case, a visual comparison of the segmentation results displayed in Fig. 16 verifies the exceptional capability of SAM to delineate river water bodies very close to the ground truth. This heightened accuracy is particularly evident in cases 4, 11, and 12, where SAM significantly outperformed the other models (refer to columns 4, 11, and 12 in Fig. 16(a)-(g)). The segmentation results obtained from U-Net(ResNet50), LinkNet(ResNet50), DeepLabV3+(ResNet50), and PAN(ResNet50) were closely matched, with PAN(ResNet50) and DeepLabV3+(ResNet50) exhibiting slightly superior visual performance. The distinctions among these models become more apparent in cases 1, 3, and 5, where U-Net(ResNet50), PAN(ResNet50), and LinkNet(ResNet50), respectively, demonstrated comparable segmentation results (see columns 1, 3, and 5 in Fig. 16(a)-(f)). Conversely, in line with the previous datasets, PSPNet(ResNet50) exhibited poor performance, consistently failing to accurately detect the river water area in most instances (see Fig. 16(a) and (c)).

To further the analysis, Table 6 presents the quantitative segmentation results of the considered models on the LuFI-RiverSnap.v1 dataset. As depicted in Table 6, the SAM model demonstrated performance superior to the other models in most metrics, except for OA and Recall, where PAN(ResNet50) and U-Net(ResNet50) achieved slightly better results. DeepLabV3+(ResNet50) was the second-best model for river water segmentation, closely followed by PAN(ResNet50) and U-Net(ResNet50). Notably, LinkNet(ResNet50) also delivered a commendable performance, with metric values only slightly below those of the PAN(ResNet50) and U-Net(ResNet50) methods. On the other hand, PSPNet(ResNet50) had the worst quantitative results in terms of all metrics.
These quantitative findings align seamlessly with our visual analyses, as highlighted in Fig. 16.

FIGURE 15. Some examples of river water segmentation results on the RIWA.v1 dataset. (a) Images and segmentation results generated by (b) U-Net(ResNet50), (c) PSPNet(ResNet50), (d) DeepLabV3+(ResNet50), (e) PAN(ResNet50), (f) LinkNet(ResNet50), and (g) SAM. Green: False Positive (FP) detections; pink: False Negative (FN) detections; blue: correct detection of river water.
TABLE 6. Comparison results of DL segmentation models on the LuFI-RiverSnap.v1 dataset. (Red: the best, blue: second best.)

These findings are reinforced by the box plots presented in Fig. 13(a) and the success levels depicted in Fig. 13(b). For instance, as shown in Fig. 13(a), SAM displayed outstanding performance, with higher median values, shorter boxes, and fewer outliers than the other models. This remarkable result was corroborated by SAM’s results in Fig. 13(b), where it achieved an ‘‘Excellent’’ rating for approximately 78% of the test data (∼183 test images), with a minimal failure rate of roughly 5%. Following SAM, PAN(ResNet50) and U-Net(ResNet50) showcased commendable performance, both earning an ‘‘Excellent’’ rating for 66% of the LuFI-RiverSnap.v1 test data. Notably, PAN(ResNet50) exhibited greater robustness than U-Net(ResNet50), evidenced by a failure rate only half that of U-Net(ResNet50), along with higher median values, shorter boxes, and fewer outliers (see Fig. 13(a) and (b)). In contrast, PSPNet(ResNet50)’s performance was rated ‘‘Excellent’’ for only 113 (less than 50%) of the LuFI-RiverSnap.v1 test samples, as indicated in Fig. 13(b). It also achieved lower median values, larger boxes, and fewer outliers compared with the other models, explaining its comparatively poor performance in this case.

C. EXPERIMENTAL ANALYSIS AND DISCUSSION
Using the evaluation framework from the previous subsection, we thoroughly evaluated the six DL models across four datasets for river water segmentation. This assessment aimed to uncover performance trends, highlighting each model’s strengths and weaknesses across the different metrics. This section summarizes the key findings of these evaluations.

In our assessment, SAM demonstrated visually and quantitatively superior water segmentation results on the Kaggle WaterNet and LuFI-RiverSnap.v1 datasets compared with the other models (see Table 3 and Table 6). This exceptional performance can be attributed to its robust ViT encoder [43], which significantly enhances its capabilities. However, SAM, while comparable, did not perform as effectively as most of the considered models in some cases of the RIWA.v1 dataset. This can be attributed to the limitations of SAM discussed in [44], primarily incorrect predictions, broken masks, or significant errors in challenging river scene cases. Additionally, while SAM showed comparable performance in segmenting the Elbersdorf/Wesenitz data, it did not perform as well as the other models in segmenting rivers with clear water and distinct structures. This stems from SAM’s original design, which was developed to segment anything accurately rather than specific object categories [35].

FIGURE 16. Some examples of river water segmentation results on the LuFI-RiverSnap.v1 dataset.
(a) Images and segmentation results generated by (b) U-Net(ResNet50), (c) PSPNet(ResNet50), (d) DeepLabV3+(ResNet50), (e) PAN(ResNet50), (f) LinkNet(ResNet50), and (g) SAM. Green: False Positive (FP) detections; pink: False Negative (FN) detections; blue: correct detection of river water.
TABLE 7. Qualitative assessment and summary of the performance of the considered DL models for river water segmentation in terms of accuracy over the entire image (overall) and per test image (image-wise), visual quality, simplicity and speed (computing time), and generality.

U-Net(ResNet50) was the best model for river water segmentation on RIWA.v1, which aligns with the finding of [3], where this model also performed best on this dataset. Furthermore, all models demonstrated promising performance on the Elbersdorf/Wesenitz dataset. This can be attributed to the fact that these data comprise images captured by a sensor with specific characteristics in a single river scene over time, resulting in limited variability. This characteristic makes the data well suited for adaptation by most deep learning methods, as observed in the comparable results achieved by the SegNet and FCN models presented in [11]. PSPNet(ResNet50) yielded the poorest results in river water segmentation in all cases, primarily due to its comparatively lightweight architecture relative to the other models tested in this study.

To synthesize the findings presented in the previous section and the observations outlined above, a comprehensive summary of the performance analysis of the evaluated DL models is presented in Table 7. SAM performed exceptionally well in all aspects but ranked lowest among the DL models in terms of simplicity and computation time. Hence, it proves to be a suitable choice for river segmentation tasks in which both accuracy and stability are of interest. U-Net(ResNet50) exhibited a well-balanced performance (moderate to high scores) across the various criteria, making it an adaptable choice for general river segmentation tasks. PAN(ResNet50), DeepLabV3+(ResNet50), and LinkNet(ResNet50) provided a reasonable balance between qualitative/quantitative accuracy and speed, making them well suited for segmenting river waters where a trade-off between these factors is acceptable. Although the PSPNet(ResNet50) model was the fastest and the easiest to use, it was inferior in the other categories, making it less suitable for more demanding water segmentation applications.

The prospects of this work revolve around the application of DL segmentation models not only to river water body detection but also to a broader range of tasks, including general water body detection, wetland or lake scene assessment, urban flood extent assessment, and water level detection from close-range RS images. Although the comparative analysis we presented yielded promising results, it is important to acknowledge certain limitations. For instance, the decision to resize images to 512 × 512 pixels was driven by our resource constraints, enabling us to conduct the DL model comparisons. However, future research can delve into the impact of image size on model performance and explore alternative approaches, especially under different resource scenarios.
One such avenue could involve splitting inputs into overlapping or non-overlapping patches, as discussed in [3], but our computational resources prevented us from pursuing this path. Moreover, comparison with methods based on multi-scale fusion backbones [45], [46], which are known to excel at segmenting large instances, could provide further insights into the performance of DL segmentation models for water segmentation. Additionally, data augmentation could be a valuable tool in tasks involving river water segmentation under diverse environmental conditions, lighting variations, or other challenges, as demonstrated in [3]. However, our focus on the intrinsic capabilities of the DL models for river water segmentation, coupled with limitations in computation units, led us to forego data augmentation in this comparison. While we are confident that the identified limitations have not hindered the study’s primary objectives, future research could benefit from additional controls.

IV. CONCLUSION
This study presented an exhaustive comparison of SAM and of U-Net, PSPNet, DeepLabV3+, PAN, and LinkNet, each backboned with a pre-trained ResNet50, on the river water segmentation task. The experiments were conducted on three benchmark river water segmentation datasets and the newly introduced LuFI-RiverSnap.v1 dataset. These experiments provide rich information on the efficacy and adaptability of the models and valuable insights for future advancements in river water segmentation research and applications. The experimental results show that the considered DL models can satisfactorily segment river water from close-range RS images with high variations in water color, illumination, and sky and structural reflections on the water surface. SAM and U-Net(ResNet50) were, on average, more accurate than the other tested models in river water segmentation; however, both were slower in computing. PAN(ResNet50), DeepLabV3+(ResNet50), and LinkNet(ResNet50) also achieved a good trade-off between accuracy and computing time. Moreover, PSPNet(ResNet50) was the worst model among those tested, although it was the most efficient in terms of computation time.

FIGURE 17. Success levels of the fine-tuned MobileSAM (TinyViT), SAM (ViT-B), and SAM (ViT-L) models on the LuFI-RiverSnap.v1 dataset using the predefined IoU thresholds.

The LuFI-RiverSnap.v1 dataset provided in this study can fulfill the requirement for river water segmentation from close-range RS images, covering rivers with various colors and structural reflections. This dataset can support analyses such as trend analysis of river water levels, short-term monitoring of river overflows, and flood extent detection from close-range images. While the dataset is currently undergoing refinement, future versions aim to include a more diverse range of river scene images from around the world. Additionally, the SAM fine-tuning process discussed in this study is not limited to river water segmentation but can be applied to, and/or integrated with, other DL models for diverse object segmentation tasks. Furthermore, as we discarded the prompt encoder from the SAM segmentation for a fair comparison, one could explore advanced promptable segmentation DL models to achieve better results in the task of river water segmentation.

APPENDIX A
To validate the effectiveness of employing the ViT-B (91M) backbone in SAM fine-tuning, we compared its performance with a larger ViT-L (315M) backbone as well as MobileSAM with a TinyViT (5.3M) backbone [47] on the LuFI-RiverSnap.v1 dataset.
The number of parameters and the average training time per epoch are reported in Table 8, while the comparative quantitative and qualitative performance results are presented in Table 9, Fig. 17, and Fig. 18. As can be seen from Table 9 and Fig. 18, SAM (ViT-L) outperformed SAM (ViT-B) and MobileSAM (TinyViT) in both quantitative and qualitative measures. Specifically, SAM (ViT-L) demonstrated superior performance in 192 test cases, showcasing its capability for accurate river water segmentation (see Fig. 17). However, SAM (ViT-L) required significantly more computational time per epoch (five times longer than SAM (ViT-B) and twenty times longer than MobileSAM (TinyViT)), making it less practical for river water segmentation (see Table 8).

FIGURE 18. Some examples of river water segmentation results on the LuFI-RiverSnap.v1 dataset. (a) Images and segmentation results generated by (b) MobileSAM (TinyViT), (c) SAM (ViT-B), and (d) SAM (ViT-L).
FIGURE 19. (a) Example test images from the LuFI-RiverSnap.v1 dataset, predicted by (b) fine-tuned SAM on 512 × 512 inputs, and (c) fine-tuned SAM on the original input size.
TABLE 8. The number of parameters and average training time per epoch for MobileSAM (TinyViT), SAM (ViT-B), and SAM (ViT-L).
TABLE 9. Comparison results of the fine-tuned MobileSAM (TinyViT), SAM (ViT-B), and SAM (ViT-L) models on the LuFI-RiverSnap.v1 dataset. (Red: the best, blue: second best.)

On the other hand, MobileSAM (TinyViT) emerged as the most time-efficient method for river water segmentation (see Table 8). However, this efficiency came at a cost, as MobileSAM (TinyViT) exhibited weaker performance, both quantitatively and qualitatively, compared with SAM (ViT-L) and SAM (ViT-B) (see Table 9 and Fig. 18). Moreover, although MobileSAM (TinyViT) achieved a higher success rate in segmenting river water than SAM (ViT-B), it produced failures in the same proportion of cases (see Fig. 17). Conversely, the quantitative and qualitative results of SAM (ViT-B) closely aligned with those of SAM (ViT-L) but were achieved in significantly less time (see Table 8, Table 9, and Fig. 18). Therefore, the trade-off between accuracy and time became most apparent when fine-tuning SAM with ViT-B, making it the preferred choice in our comparative study.

APPENDIX B
In our workflow, we employed inputs of size 512 × 512, which were subsequently upsampled to 1024 × 1024 to meet SAM’s requirements. To improve the understanding of this issue, SAM was also trained using inputs at their original size from the LuFI-RiverSnap.v1 dataset (see Fig. 19). The results illustrated in Fig. 19 clearly showcase more accurate segmentation of river water, particularly in challenging cases, when SAM was fine-tuned using the original image size rather than the 512 × 512 inputs. Importantly, this improvement is not unique to the SAM model; the performance of the other tested models would be expected to benefit similarly from training on the original input size.

REFERENCES
[1] K. E. Limburg, D. P. Swaney, and D. L. Strayer, ‘‘River ecosystems,’’ in Encyclopedia of Biodiversity. Amsterdam, The Netherlands: Elsevier, 2001, pp. 213–231. [Online]. Available: https://www.sciencedirect.com/science/article/pii/B0122268652002388
[3] F. Wagner, A. Eltner, and H.-G. Maas, ‘‘River water segmentation in surveillance camera images: A comparative study of offline and online augmentation using 32 CNNs,’’ Int. J. Appl. Earth Observ. Geoinf., vol. 119, May 2023, Art. no. 103305. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1569843223001279
[4] M. Elias, C. Kehl, and D. Schneider, ‘‘Photogrammetric water level determination using smartphone technology,’’ Photogrammetric Rec., vol. 34, no. 166, pp. 198–223, Jun. 2019.
[5] M. Elias and H.-G. Maas, ‘‘Measuring water levels by handheld smartphones—A contribution to exploit crowdsourcing in the spatio-temporal densification of water gauging networks,’’ Int. Hydrographic Rev., vol. 27, pp. 9–22, May 2022.
[6] X. Blanch, F. Wagner, R. Hedel, J. Grundmann, and A. Eltner, ‘‘Towards automatic real-time water level estimation using surveillance cameras,’’ in Proc. EGU Gen. Assem. Conf. Abstr., 2022, Paper no. EGU22-3225.
[7] Y. Guo, Y. Liu, T. Georgiou, and M. S. Lew, ‘‘A review of semantic segmentation using deep neural networks,’’ Int. J. Multimedia Inf. Retr., vol. 7, no. 2, pp. 87–93, 2018.
[8] I. Ahmed, M. Ahmad, F. A. Khan, and M. Asif, ‘‘Comparison of deep-learning-based segmentation models: Using top view person images,’’ IEEE Access, vol. 8, pp. 136361–136373, 2020.
[9] J. Yu, Y. Lin, Y. Zhu, W. Xu, D. Hou, P. Huang, and G. Zhang, ‘‘Segmentation of river scenes based on water surface reflection mechanism,’’ Appl. Sci., vol. 10, no. 7, p. 2471, Apr. 2020.
[10] A. Rankin and L. Matthies, ‘‘Daytime water detection based on color variation,’’ in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2010, pp. 215–221.
[11] A. Eltner, P. O. Bressan, W. N. Gonçalves, T. Akiyama, and J. M. Junior, ‘‘Using deep learning for automatic water stage measurements,’’ Water Resour. Res., vol. 57, no. 3, Mar. 2021, Art. no. e2020WR027608, doi: 10.1029/2020WR027608.
[12] N. A. Muhadi, A. F. Abdullah, S. K. Bejo, M. R. Mahadi, and A. Mijic, ‘‘Deep learning semantic segmentation for water level estimation using surveillance camera,’’ Appl. Sci., vol. 11, no. 20, p. 9691, Oct. 2021.
[13] Y. Chen, R. Fan, X. Yang, J. Wang, and A. Latif, ‘‘Extraction of urban water bodies from high-resolution remote-sensing imagery using deep learning,’’ Water, vol. 10, no. 5, p. 585, May 2018.
[14] M. Li, P. Wu, B. Wang, H. Park, H. Yang, and Y. Wu, ‘‘A deep learning method of water body extraction from high resolution remote sensing images with multisensors,’’ IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 14, pp. 3120–3132, 2021.
[15] W. Feng, H. Sui, W. Huang, C. Xu, and K. An, ‘‘Water body extraction from very high-resolution remote sensing imagery using deep U-Net and a superpixel-based conditional random field model,’’ IEEE Geosci. Remote Sens. Lett., vol. 16, no. 4, pp. 618–622, Apr. 2019.
[16] L. Lopez-Fuentes, C. Rossi, and H. Skinnemoen, ‘‘River segmentation for flood monitoring,’’ in Proc. IEEE Int. Conf. Big Data (Big Data), Dec. 2017, pp. 3746–3749.
[17] E. Shelhamer, J. Long, and T. Darrell, ‘‘Fully convolutional networks for semantic segmentation,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 4, pp. 640–651, Apr. 2017.
[18] S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio, ‘‘The one hundred layers tiramisu: Fully convolutional DenseNets for semantic segmentation,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jul. 2017, pp. 1175–1183.
[19] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, ‘‘Image-to-image translation with conditional adversarial networks,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 1125–1134.
[20] J. Pan, Y. Yin, J. Xiong, W. Luo, G. Gui, and H. Sari, ‘‘Deep learning-based unmanned surveillance systems for observing water levels,’’ IEEE Access, vol. 6, pp. 73561–73571, 2018.
[21] M. M. de Vitry, S. Kramer, J. D. Wegner, and J. P. Leitão, ‘‘Scalable flood level trend monitoring with surveillance cameras using a deep convolutional neural network,’’ Hydrol. Earth Syst. Sci., vol. 23, no. 11, pp. 4621–4634, Nov. 2019.
[22] O. Ronneberger, P. Fischer, and T. Brox, ‘‘U-Net: Convolutional networks for biomedical image segmentation,’’ in Proc. 18th Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. (MICCAI). Springer, Oct. 2015, pp. 234–241.
[23] T. S. Akiyama, J. M. Junior, W. N. Gonçalves, P. O. Bressan, A. Eltner, F. Binder, and T. Singer, ‘‘Deep learning applied to water segmentation,’’ Int. Arch. Photogramm., Remote Sens. Spatial Inf. Sci., vol. 43, pp. 1189–1193, Aug. 2020.
[24] V. Badrinarayanan, A. Kendall, and R. Cipolla, ‘‘SegNet: A deep convolutional encoder–decoder architecture for image segmentation,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, Dec. 2017.
[25] R. Vandaele, S. L. Dance, and V. Ojha, ‘‘Deep learning for automated river-level monitoring through river-camera images: An approach based on water segmentation and transfer learning,’’ Hydrol. Earth Syst. Sci., vol. 25, no. 8, pp. 4435–4453, Aug. 2021.
[26] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, ‘‘Unified perceptual parsing for scene understanding,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 418–434.
[27] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, ‘‘Rethinking atrous convolution for semantic image segmentation,’’ 2017, arXiv:1706.05587.
[28] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for image recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[29] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, ‘‘Encoder-decoder with atrous separable convolution for semantic image segmentation,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 801–818.
[30] A. Eltner, P. Zamboni, R. Hedel, J. Grundmann, and X. Blanch, ‘‘Image-based methods for real-time water level estimation,’’ in Proc. EGU Gen. Assem., Vienna, Austria, Apr. 2023, Paper EGU23-6745, doi: 10.5194/egusphere-egu23-6745.
[31] X. Blanch, F. Wagner, and A. Eltner. (2022). RIWA Dataset. [Online]. Available: https://www.kaggle.com/dsv/4289421
[32] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, ‘‘Pyramid scene parsing network,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6230–6239.
[33] H. Li, P. Xiong, J. An, and L. Wang, ‘‘Pyramid attention network for semantic segmentation,’’ 2018, arXiv:1805.10180.
[34] A. Chaurasia and E. Culurciello, ‘‘LinkNet: Exploiting encoder representations for efficient semantic segmentation,’’ in Proc. IEEE Vis. Commun. Image Process. (VCIP), Dec. 2017, pp. 1–4.
[35] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick, ‘‘Segment anything,’’ 2023, arXiv:2304.02643.
[36] Y. Liang, N. Jafari, X. Luo, Q. Chen, Y. Cao, and X. Li, ‘‘WaterNet: An adaptive matching pipeline for segmenting water with volatile appearance,’’ Comput. Vis. Media, vol. 6, no. 1, pp. 65–78, Mar. 2020.
[37] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, ‘‘ImageNet large scale visual recognition challenge,’’ Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015.
[38] J. Laitala and L. Ruotsalainen, ‘‘Computer vision based planogram compliance evaluation,’’ Appl. Sci., vol. 13, no. 18, p. 10145, Sep. 2023.
[39] P. Zhang, Y. Ban, and A. Nascetti, ‘‘Learning U-Net without forgetting for near real-time wildfire monitoring by the fusion of SAR and optical time series,’’ Remote Sens. Environ., vol. 261, Aug. 2021, Art. no. 112467.
[40] F. Rajič, L. Ke, Y.-W. Tai, C.-K. Tang, M. Danelljan, and F. Yu, ‘‘Segment anything meets point tracking,’’ 2023, arXiv:2307.01197.
[41] D. P. Kingma and J. Ba, ‘‘Adam: A method for stochastic optimization,’’ 2014, arXiv:1412.6980.
[42] Y. Feng and M. Sester, ‘‘Extraction of pluvial flood relevant volunteered geographic information (VGI) by deep learning from user generated texts and photos,’’ ISPRS Int. J. Geo-Inf., vol. 7, no. 2, p. 39, Jan. 2018.
[43] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, ‘‘An image is worth 16 × 16 words: Transformers for image recognition at scale,’’ 2020, arXiv:2010.11929.
[44] L. Ke, M. Ye, M. Danelljan, Y. Liu, Y.-W. Tai, C.-K. Tang, and F. Yu, ‘‘Segment anything in high quality,’’ 2023, arXiv:2306.01567.
[45] Y. Zhang, J. Chu, L. Leng, and J. Miao, ‘‘Mask-refined R-CNN: A network for refining object details in instance segmentation,’’ Sensors, vol. 20, no. 4, p. 1010, Feb. 2020.
[46] W. Lin, J. Chu, L. Leng, J. Miao, and L. Wang, ‘‘Feature disentanglement in one-stage object detection,’’ Pattern Recognit., vol. 145, Jan. 2024, Art. no. 109878.
[47] C. Zhang, D. Han, Y. Qiao, J. U. Kim, S.-H. Bae, S. Lee, and C. S. Hong, ‘‘Faster segment anything: Towards lightweight SAM for mobile applications,’’ 2023, arXiv:2306.14289.

ARMIN MOGHIMI received the M.S. and Ph.D. degrees in civil photogrammetry and remote sensing engineering from the K. N. Toosi University of Technology, Tehran, Iran, in 2015 and 2022, respectively. Since 2023, he has been contributing as a Postdoctoral Research Associate with the Ludwig-Franzius-Institute for Hydraulic, Estuarine and Coastal Engineering, Faculty of Civil Engineering and Geodesy, Leibniz University Hannover, specializing in estuary and coastal engineering. His research interests include deep learning, change detection, computer vision, remote sensing techniques, image registration, machine learning, SAR image processing, and LiDAR data processing.

MARIO WELZEL received the Ph.D. degree from the Ludwig-Franzius-Institute for Hydraulic, Estuarine and Coastal Engineering, Leibniz University Hannover, Germany, in 2021. Since then, he has been actively engaged as a Postdoctoral Researcher, a Senior Researcher, and a Project Lead at the Ludwig-Franzius-Institute.
His research interests include marine renewable energy, wave-current-structure interaction, wave-current-structure-soil interaction, flood forecasting, and flood management, including numerical and physical models. In addition to his involvement in hydraulic projects, he is currently focused on extracting hydrological parameters from smartphone and drone images using artificial intelligence (AI) to advance classic research areas, such as hydraulic engineering, flood forecasting, flood management, and the digitalization of the water sector.

TURGAY CELIK received the Ph.D. degree from the University of Warwick, Coventry, U.K., in 2011. He is currently a Professor of digital transformation and the Director of the Wits Institute of Data Science, University of the Witwatersrand Johannesburg, South Africa. His research interests include computer vision, (explainable) artificial intelligence, (health) data science, data-driven optimal control, and remote sensing. He is an Associate Editor of BMC Medical Informatics and Decision Making, IET ELL, IEEE ACCESS, IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING (IEEE JSTARS), and SIVP (Springer).

TORSTEN SCHLURMANN received the Ph.D. degree in coastal and maritime engineering from Bergische Universität Wuppertal, in 1999. He finalized his habilitation thesis to obtain the Venia Legendi in Hydraulic Engineering and Water Resources, in 2005. Following the devastating tsunami in the Indian Ocean in December 2004, he accepted a full-time appointment with the Institute for Environment and Human Security (EHS), Bonn, United Nations University (UNU), Tokyo, and joined the United Nations. He continued his academic career as a Postdoctoral Scientist (Oberingenieur, C2) with Bergische Universität Wuppertal. As the Head of the Section for Coastal Risks, he played a leading role in the design and implementation of a tsunami early warning system (TEWS) in the Indian Ocean and was one of the responsible lead PIs of the two BMBF projects GITEWS (FKZ: 03TSU01) and Last-Mile-Evacuation (FKZ: 03G0666A-E). He was soon appointed as a W3-Professor for hydraulic and coastal engineering with Leibniz University Hannover, in 2007, but remained, ex officio, a formal advisor to the Director of the Institute for Environment and Human Security (UNU, Bonn) in several overarching UN committees and working groups on the development and operation of TEWS, until 2010. He has been a Professor and the Managing Director of the Ludwig-Franzius-Institute for Hydraulic, Estuarine and Coastal Engineering, since 2007. He is currently the Managing Director of the Coastal Research Centre, a joint central institution of Leibniz Universität Hannover and Technische Universität Braunschweig that operates the recently extended, world-renowned large wave-current flume (GWK+). His main research areas concentrate on the modeling of estuarine and coastal processes, strategies and measures in coastal protection, port construction and maintenance, the development of maritime technologies, and risk management. His main research interests include offshore wind and marine renewable energies, the implications of climate-driven processes and impacts from sea level rise in coastal and estuarine environments, the development of Nature-based Solutions (NbS), and the quantification of ecosystem services in coastal protection.