
AI and Deep Learning

The rapid rise of deep learning is mobilizing a growing share of the scientific community and attracting a significant influx of research funding, both private and public. AI in general, and deep learning in particular, are seen as a potential source of innovative answers to a multitude of problems, particularly in sustainable development (Nishant et al. 2020). This vision considerably widens the field of possibilities and underlines the crucial importance of research into technological advances capable of transforming not only sustainable development but also the environment, public health and ecology. Advances in this field are directly linked, on the one hand, to the generation of very large quantities of data and the rapid development of GPU computing and, on the other, to the continuous improvement of deep learning algorithms and the integration of more sophisticated artificial intelligence techniques. This synergy between massive data availability, increased computing power and algorithmic advances has led to significant breakthroughs in speech recognition, computer vision, machine translation and autonomous driving. It has also facilitated the development of innovative applications that are transforming a wide range of sectors, from healthcare (Nat. Med. 2024), where it enables the early detection of certain diseases, to the environment, where it helps optimize energy management systems to reduce their carbon footprint. This increasingly open science has given rise to a number of innovative algorithmic approaches for automatically discovering complex abstractions hidden in data. As a modeling research unit, UMMISCO has actively contributed to the search for new approaches to the many methodological hurdles raised by the construction of increasingly complex models. This has resulted in the funding of several national and international projects covering a wide range of fields, from health and the environment to biodiversity and the social sciences (AIME, DeepECG4U, DeepIntegromics, eCOL+, MetaPlantCode, etc.).

Scientific objectives and background

The objectives of this theme are numerous and are set in the highly competitive context of developing a methodological and theoretical framework covering different areas and sub-areas of AI. Within this theme, UMMISCO aims to advance methodological research in AI while tackling numerous obstacles linked to the very nature of deep networks, learning tasks and final applications. The main ones include:

  • Data quality and bias: already mentioned in theme 1, this issue is addressed in close connection with that theme (data generation) and with theme 3 (data collection). Planned actions concern the quality, standardization and size of the data sets used, as well as model calibration and generalization (Bayet et al. 2022).
  • Data annotation and semi-supervised learning: among the key issues in supervised learning, the quality of annotated data occupies an important place, as it can explain the plateaus encountered by deep learning, due in particular to the natural disagreement between expert annotators. The theme will focus (in connection with theme 4) on “putting the human back in the loop”, but other deep learning approaches such as semi-supervised learning or continual learning also offer interesting prospects (Chen 2020).
  • Architectural design: finding the best architecture for a given task or type of data remains a major challenge. The best architectures are often identified experimentally, which requires extremely costly computing resources. Although methods for optimizing architectures are beginning to appear (Miikkulainen 2024), a theoretical framework and a clearer mapping of the field are still needed, particularly for the development of more frugal AI systems.
  • AI acceptability: this obstacle brings together several important aspects of AI, such as interpretability, safety, trust and robustness. A good example is models that will have a direct impact on patient care. Are we ready today to trust a model responsible for assessing whether a patient should undergo heart surgery? How can these deep models be evaluated in clinical studies similar to those used to assess drug efficacy? This obstacle also includes regulatory aspects (Al Mouatamid et al. 2023) as well as “human sciences” aspects (epistemology, cognitive sciences).

UMMISCO 3 has contributed to numerous projects in which AI has been at the heart of methodological and application developments. Some of the most noteworthy, which will continue under UMMISCO 4, include:

  • AIME (Artificial Intelligence for Marine Ecosystems): this project focuses on quantifying and modeling changes in biodiversity in various marine ecosystems. An international scientific team with extensive experience in AI, ecology and marine biology has been assembled to develop techniques for automatically generating advanced indicators of the health of these ecosystems, and to create innovative models capable of estimating changes in biodiversity. For example, classification and object detection applied to images of coral reefs are used to generate indicators of coral bleaching (Younes et al. 2024), while natural language processing techniques applied to legal documents are used to generate indicators of legal protection of the oceans (Al Mouatamid et al. 2023).

  • DeepIntegromics (deep and integrative learning from omics data): this strategic project addresses many of the AI modeling objectives seen in § 2.2.2. Conducted in collaboration with SU clinical teams, it aims to exploit deep learning to identify patient phenotypes from clinical and omics data, and focuses on the challenge of learning from raw metagenomic data. A key innovation of the project is the adoption of a cascade of classifiers, a sequential approach in which machine learning models feed one another and increasingly expensive data are brought in only when needed to refine predictions or classifications. This method enables more accurate and interpretable analysis while reducing costs. This strategy has proved its worth, for example, in improving the management of patients suffering from cardiometabolic diseases.
  • DeepECG4U (translational deep learning for electrocardiogram analysis): this is one of a number of translational projects (in the sense that they are intended for routine use by doctors) aimed at developing robust, interpretable deep models capable of identifying patients at risk of arrhythmias such as torsades de pointes, which can lead to sudden death. This program has generated numerous national and international collaborations in the North (France, Italy, USA, Netherlands, etc.) and South (Senegal, Albania), with the aim of validating deep models in clinical studies.

  • eCOL+: the aim of this project is to use the power of AI to annotate the unique and very large collections of the Muséum National d'Histoire Naturelle (MNHN). This project is unique in terms of the size of the data processed (> 2 PB), its diversity and that of the approaches developed. Launched in 2021 for a duration of 8 years, the project brings together a wide range of disciplines, including paleontology, botany, imaging, data analysis and modeling.

  • MetaPlantCode: funded by the European Biodiversa+ program and starting in 2024, this project aims to standardize protocols for processing plant biodiversity. UMMISCO leads a work package on AI approaches, in particular deep learning, to help classify environmental DNA sequences.
  • I-Maroc: in this aforementioned project, UMMISCO is developing an embedded AI that uses a video stream to periodically summarize road traffic through various parameters (number of vehicles, speed, inter-vehicle time, etc.). The aim is to develop a sustainable, low-cost counting station that will enable studies to be carried out in cities not equipped with fixed sensors.
  • NAWRAS: this is a multi-disciplinary project in which natural language processing techniques (Al Mouatamid et al. 2023), together with recent advances in large language models and their fine-tuning, are used to automatically extract information from collections of legal texts in order to build legal indicators that help to understand the contribution of different national laws to ocean preservation. The final product of the project is a publicly accessible dashboard enabling comparisons between several countries (currently over 30). The project is coordinated by UMMISCO (Computer Science Department of the Semlalia Faculty of Science, Cadi Ayyad University and IRD/LEMAR).
  • ESPERANTO: this is an H2020 project in which the UMMISCO Central Africa center is a partner. It focuses on the automatic processing of African languages, including speech processing. The scientific challenges addressed are mainly linked to the linguistic characteristics of African languages, which differ from those of the languages most commonly used in NLP and automatic speech processing (tones, alphabet, agglutination, etc.). In addition, these languages are low-resource, which makes it necessary, on the one hand, to propose new learning algorithms and, on the other, to collect and disseminate (labeled) datasets for the scientific community (Kenfack et al. 2023). The center is collaborating with linguists as domain experts, which brings the question of model explainability into the project; prototype-based explainability models are being explored.
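
The cascade of classifiers described for DeepIntegromics can be illustrated with a minimal sketch. It is not the project's implementation: the stage names, costs, toy models and the confidence threshold below are all hypothetical, chosen only to show how a more expensive data tier is acquired solely when the cheaper stage is not confident enough.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]


@dataclass
class Stage:
    name: str                            # hypothetical label for the data tier
    cost: float                          # illustrative acquisition cost
    acquire: Callable[[str], Features]   # fetches this tier's data for a patient
    predict: Callable[[Features], float]  # P(positive) from all data gathered so far


def run_cascade(patient_id: str, stages: List[Stage],
                confidence: float = 0.9) -> Tuple[float, float, List[str]]:
    """Run stages from cheapest to most expensive; stop once confident.

    Returns (probability, total cost, names of the stages used).
    """
    features: Features = {}
    total_cost = 0.0
    used: List[str] = []
    prob = 0.5  # uninformative prior if no stage is provided
    for stage in stages:
        features.update(stage.acquire(patient_id))  # pay for this tier only now
        total_cost += stage.cost
        used.append(stage.name)
        prob = stage.predict(features)
        # Confidence is the probability of the most likely class.
        if max(prob, 1.0 - prob) >= confidence:
            break
    return prob, total_cost, used


# Toy data and models, purely illustrative (assumed values, not real results).
CLINICAL = {"p1": {"bmi": 34.0}, "p2": {"bmi": 27.0}}
OMICS = {"p1": {"marker": 0.1}, "p2": {"marker": 0.9}}


def clinical_model(f: Features) -> float:
    # Confident only for high BMI values in this toy example.
    return 0.95 if f["bmi"] >= 32 else 0.6


def omics_model(f: Features) -> float:
    return 0.97 if f["marker"] > 0.5 else 0.05


STAGES = [
    Stage("clinical", 1.0, lambda pid: CLINICAL[pid], clinical_model),
    Stage("omics", 50.0, lambda pid: OMICS[pid], omics_model),
]

if __name__ == "__main__":
    for pid in ("p1", "p2"):
        print(pid, run_cascade(pid, STAGES))
```

In this toy setting, the first patient is resolved from clinical data alone, while the second also requires the costlier omics tier. The list of stages actually used for each patient gives a simple trace of which data justified the decision.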

The scientific leadership of this theme will aim to encourage and facilitate exchanges between researchers from different disciplines and geographical centers. The theme will fund transdisciplinary or inter-center projects through an annual call for projects, in coordination with the other themes. It will participate in the dissemination of knowledge via training courses (Master's, PDI) and seminars with the unit's various partners, in particular around deep learning. Finally, the theme will set up a series of seminars to enable its members to exchange views on advances in the field of AI, which are proceeding at a frenetic pace. The theme has a large number of members, including PhD students, interns, post-docs and young researchers from all centers. This increases the unit's dynamism, but also calls for dedicated scientific coordination. For example, since 2022, weekly seminars have been organized for researchers, doctoral students and post-doctoral fellows, with around thirty events per year. This dynamic approach to methodological issues and scientific watch will continue in the unit's next term. It complements the internal meetings held for each project in conjunction with our partners.