Performing Observing System Experiments (OSE) that combine Neural Networks and Data Assimilation to enhance the impact of Argo floats in the Copernicus Mediterranean biogeochemical model (MedBFM)
Ocean Sciences, Modeling
Research area
Biogeochemical-Argo (BGC-Argo) floats provide key insights into vertical biogeochemical dynamics but are limited by data sparsity, and data availability has declined in recent years.
This project aims to investigate the ability of the Copernicus Mediterranean biogeochemical model (MedBFM) to reproduce key BGC features of the Mediterranean Sea by integrating Neural Networks (NN) and Data Assimilation (DA).
BGC-Argo float profiles are assimilated alongside NN-reconstructed BGC profiles.
The impact of these new DA setups is assessed through Observing System Experiments (OSE), demonstrating improved skill metrics, reduced model biases, and enhanced representation of key processes such as nitracline dynamics, primary production, and oxygen vertical distribution.
Project goals
The project focuses on improving the Data Assimilation (DA) framework used in the Mediterranean biogeochemical forecasting system of Copernicus Marine. In particular, the work aims to assess the impact of DA on the OGSTM-BFM transport model by integrating Neural Network (NN)-reconstructed biogeochemical profiles.
The research activities include building a quality-controlled BGC-Argo dataset used for data assimilation and model validation, and performing a two-year data denial experiment to quantify the contribution of BGC-Argo observations to the forecasting system. The work also analyzes uncertainties associated with the DA of NN-reconstructed profiles and statistically evaluates their impact on the representation of biogeochemical dynamics.
Finally, the project investigates the synergies between different observing systems, including satellite, in situ, and NN-reconstructed observations, in order to maximize their complementary contribution to biogeochemical forecasting.
Computational approach
The project presents significant computational challenges due to the need to build and process large datasets, run high-resolution numerical simulations, and integrate advanced data assimilation (DA) techniques with neural network (NN) reconstructions. Managing large volumes of observations and model outputs from multiple years of simulations requires efficient data handling strategies. The assimilation of BGC-Argo profiles and NN-derived observations generates extensive outputs, which demand substantial storage resources.
The high computational cost of running Observing System Experiments (OSEs) with different DA setups adds further complexity. Multi-year simulations at high spatial resolution require substantial computing power, making it essential to leverage high-performance computing (HPC) resources effectively. Efficient parallelization and workflow optimization are crucial to ensure the feasibility of these experiments.
Moreover, the successful integration of NN-reconstructed profiles into the DA system introduces additional technological challenges. First, there is the initial computational effort required to manage and prepare training and test datasets. Second, ensuring compatibility and stability between machine learning outputs and variational assimilation schemes requires careful tuning and validation.
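As a rough illustration of why this tuning matters, the sketch below solves a one-variable variational analysis in closed form. It is not the MedBFM assimilation scheme: the function name, the error values, and the choice to give the NN-reconstructed value a larger error than the float measurement are all assumptions made only for the example.

```python
def scalar_analysis(background, obs, sigma_b, sigma_o):
    """Minimiser of the scalar variational cost function

        J(x) = (x - background)^2 / (2 sigma_b^2) + (obs - x)^2 / (2 sigma_o^2)

    where sigma_b and sigma_o are the background and observation error
    standard deviations.
    """
    # Weight given to the observation: close to 1 when the observation is
    # trusted far more than the background, close to 0 in the opposite case.
    gain = sigma_b**2 / (sigma_b**2 + sigma_o**2)
    return background + gain * (obs - background)


# Hypothetical numbers, in arbitrary concentration units.
model_value = 2.0
observed_value = 3.0

# Assumption: the NN-reconstructed value carries a larger error than the
# in situ float measurement, so it pulls the analysis less strongly.
analysis_float = scalar_analysis(model_value, observed_value, sigma_b=1.0, sigma_o=0.5)
analysis_nn = scalar_analysis(model_value, observed_value, sigma_b=1.0, sigma_o=1.0)

print(f"analysis with float observation: {analysis_float:.2f}")
print(f"analysis with NN reconstruction: {analysis_nn:.2f}")
```

In this toy setting the observation error acts as the tuning knob: if it is set too small for reconstructed profiles, reconstruction errors propagate directly into the analysis, and if it is set too large, the added profiles have little effect.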
Example of the enhanced dataset assimilated in a 2-year simulation (2018-2019) of the Mediterranean Sea (16 sub-basins): seasonal availability of nitrate and reconstructed nitrate profiles. Light gray (autumn and spring), cyan (winter), and yellow (summer) bars represent the availability of in situ nitrate data (used in the DAfl run). Gray (autumn and spring), light blue (winter), and orange (summer) striped bars indicate the availability of reconstructed nitrate data (used in the DAnn run).
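A seasonal availability count of this kind could be tallied along the following lines. This is a minimal sketch, not the project's processing code: the month-to-season mapping, the sub-basin labels, the source labels, and the sample records are all hypothetical, since the source does not specify them.

```python
from collections import Counter
from datetime import date

# Assumed month-to-season mapping; the source does not define the seasons.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def seasonal_availability(profiles):
    """Count profiles per (sub_basin, season, source).

    `profiles` is an iterable of (sub_basin, day, source) tuples, where
    `source` distinguishes in situ from reconstructed profiles.
    """
    counts = Counter()
    for sub_basin, day, source in profiles:
        counts[(sub_basin, SEASON_BY_MONTH[day.month], source)] += 1
    return counts


# Hypothetical records; sub-basin labels are placeholders.
profiles = [
    ("basin_a", date(2018, 1, 15), "in_situ"),
    ("basin_a", date(2018, 1, 20), "reconstructed"),
    ("basin_a", date(2018, 2, 3), "reconstructed"),
    ("basin_b", date(2019, 7, 9), "in_situ"),
]

counts = seasonal_availability(profiles)
for key in sorted(counts):
    print(key, counts[key])
```

Keying the counter on the full (sub-basin, season, source) triple keeps the in situ and reconstructed tallies separate, which matches the paired plain and striped bars described in the caption.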
Key results
The project contributed to the development of an enhanced Data Assimilation (DA) framework aimed at improving the representation of key biogeochemical processes in the Mediterranean Sea and strengthening predictive capabilities. The results highlight that the performance of MedBFM products strongly depends on both the quality and the quantity of the observational datasets assimilated into the system. Moreover, the analysis confirms the need for a homogeneous and sufficiently dense BGC-Argo dataset for reliable model evaluation and validation. The implementation of dedicated quality control (QC) procedures allowed the creation of robust and consistent datasets. These quality-controlled datasets can therefore be effectively used to increase the information content provided by BGC-Argo observations and improve their suitability for data assimilation applications. Finally, the data denial experiment highlighted the positive impact of BGC-Argo profiles on the forecasting system.
Resource usage
The project presented significant computational challenges due to the need to build and process large datasets, perform high-resolution numerical simulations, and integrate advanced Data Assimilation (DA) techniques with Neural Network (NN)-reconstructed profiles. Managing multiple years of observations and model outputs required efficient data handling and storage strategies, as assimilating BGC-Argo and NN-derived observations generates extensive datasets. Long-term simulations at high spatial resolution required substantial computing power, making effective use of TeRABIT high-performance computing (HPC) resources essential.
What's next
Since the positive impact of assimilating BGC vertical profiles has been demonstrated through comparisons of different DA setups in two-year simulations, we plan to perform a 20-year simulation to assess the cumulative effect of assimilating vertical profiles over space and time. Currently, we have a BGC-Argo time series covering 13 years (2013–2026). The 20-year run (1999–2024) will allow us to evaluate whether corrections applied from the beginning of the Argo era could influence the simulation results. Furthermore, we aim to investigate the presence of potential long-term trends, such as deoxygenation, and their implications for Mediterranean biogeochemical dynamics.
Normalized RMSE (RMSE / RMSE_max) for each biogeochemical variable (y-axis) across model setups (x-axis, with version tags and simulation years in parentheses). Darker colors indicate higher RMSE. Chlorophyll-a RMSE decreases across model versions, while nitrate and oxygen RMSE peak in the penultimate column, showing that changing the simulation year (with the same model) can degrade the skill statistics because of lower observational density. From v11 onward, NN-reconstructed nitrate was assimilated; earlier versions assimilated only BGC-Argo observations.
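The normalization named in the caption can be sketched as follows. The sketch assumes that RMSE_max is the largest RMSE of a given variable across the model setups, which the caption does not state explicitly; the function names, setup tags, and values are hypothetical.

```python
import math


def rmse(model, obs):
    """Root-mean-square error between matched model and observed values."""
    if len(model) != len(obs) or not model:
        raise ValueError("model and obs must be non-empty and of equal length")
    squared_errors = [(m - o) ** 2 for m, o in zip(model, obs)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def normalize_by_max(rmse_by_setup):
    """Divide each setup's RMSE by the largest RMSE found for that variable.

    Assumption: RMSE_max is taken per variable across setups, so each
    variable's worst setup maps to 1.0.
    """
    rmse_max = max(rmse_by_setup.values())
    return {setup: value / rmse_max for setup, value in rmse_by_setup.items()}


# Hypothetical RMSE values for one variable across three setups.
nitrate_rmse = {"setup_1": 1.0, "setup_2": 2.0, "setup_3": 1.5}
print(normalize_by_max(nitrate_rmse))
```

Scaling each variable by its own maximum puts variables with different units on a common 0 to 1 scale, at the cost of hiding absolute error magnitudes.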
Carolina Amadio
Istituto Nazionale di Oceanografia e di Geofisica Sperimentale
Carolina Amadio holds a PhD in biogeochemical oceanography from Ca’ Foscari University of Venice and CMCC, focusing on numerical modeling of benthic-pelagic coupling in coastal areas (Science and Management of Climate Change program). She was a postdoctoral researcher at OGS (Trieste), working on biogeochemical modeling of marine ecosystems, particularly in the Mediterranean Sea, its marginal seas, and coastal areas. Since February 2024, she has held a research position at OGS within the TeRABIT project. Her work focuses on Observing System Experiments (OSE), integrating Neural Networks and Data Assimilation to improve the impact of Argo floats in the Copernicus Mediterranean biogeochemical model (MedBFM). She contributes to European projects such as the Copernicus Marine Program and NECCTON, running multi-year simulations to assess the impact of different model configurations on Mediterranean biogeochemical dynamics.

