AI and Deep Learning
The rapid rise of deep learning is prompting a growing mobilization of the scientific community and a significant influx of both private and public research funding. AI in general, and deep learning in particular, are seen as a potential reservoir of innovative answers to a multitude of problems, particularly in the field of sustainable development (Nishant et al. 2020). This vision considerably widens the field of possibilities and underlines the crucial importance of research into technological advances capable not only of transforming sustainable development but also of providing innovative answers in the fields of the environment, public health and ecology. Advances in this field are directly linked, on the one hand, to the generation of very large quantities of data and the rapid development of GPU computing and, on the other, to the continuous improvement of deep learning algorithms and the integration of more sophisticated artificial intelligence techniques. This synergy between massive data availability, increased computing power and algorithmic advances has led to significant breakthroughs in fields such as speech recognition, computer vision, machine translation and autonomous driving. It has also facilitated the development of innovative applications that are transforming a wide range of sectors, from healthcare (Nat. Med. 2024), where it enables the early detection of certain diseases, to the environment, through energy management systems optimized to reduce their carbon footprint. This increasingly open science has given rise to a number of innovative algorithmic approaches for automatically discovering complex abstractions hidden in data. As a modeling research unit, UMMISCO has naturally contributed to the search for new approaches to the many methodological hurdles raised by the construction of increasingly complex models.
This has resulted in the funding of several national and international projects covering a wide range of fields, from health and the environment to biodiversity and the social sciences (AIME, DeepECG4U, DeepIntegromics, eCOL+, MetaPlantCode, etc.).
Scientific objectives and background
The objectives of this theme are numerous and are set in the highly competitive context of developing a methodological and theoretical framework covering different areas and sub-areas of AI. Among the major objectives are the following:
- Finding representations. The use of deep learning offers new ways of automatically finding complex, abstract representations of unstructured, multimodal data. This translates into the automatic search for embeddings (dense representations of high-dimensional data, such as sentences, images or even whole genetic sequences) in a reduced-dimensional vector space. These embeddings capture the semantics of and relationships between elements, enabling learning models to perform classification, recommendation or prediction tasks with increased accuracy. For example, to better characterize a patient and define his or her digital twin, it is possible to convert metagenomic data from high-throughput sequencing (Queyrell et al. 2021) or biosignals such as electrocardiograms (Prifti et al. 2021) into vectors. These vectors can then be manipulated to better represent the system under study, on the expectation that a better representation translates directly into more useful models.
- The development of interpretable and robust methods. The effectiveness of deep learning is closely linked to data quality, which underlines the crucial importance of data selection, cleaning and preparation. Beyond data quality, however, there is also the challenge of model interpretability. To ensure the ethical and effective use of deep learning, it is essential to develop models that are not only efficient but also understandable, capable of providing clear explanations of their decisions and of how they work. This builds user trust and facilitates the models' continuous improvement and adaptation to new contexts or data. Finally, it has been repeatedly demonstrated that these models can quickly become biased (Goyal & Bengio 2022) and that it is important to be able to make sense of the inferences that deep models produce (Chakraborty et al. 2017). Model interpretability is therefore a major scientific issue, to which UMMISCO aims to make an active contribution via this theme.
- Discriminative and generative AI. The acceleration in the development of large language models (LLMs) has clearly demonstrated the potential for renewal in the field of natural language processing. There are many possible applications in both discriminative and generative AI, opening up new modeling perspectives. UMMISCO is very interested in these approaches and is already implementing them in various fields, such as health (analysis of DNA, ECGs, biological texts, etc.) and the construction of 3D programs and images for virtual environments (as part of the SIMPLE project). These applications all represent fascinating research problems, each posing its own challenges in terms of modeling, data analysis and interpretation. They require a thorough understanding of the theoretical underpinnings of LLMs, as well as advanced technical skills for their effective implementation.
- Frugal AI. One of the drawbacks of the deep networks at the heart of new AI approaches is their size, and the computing power and phenomenal amount of energy they require to operate. Highly sensitive to this issue, UMMISCO wishes to develop within this theme a research program focused on frugal AI, which can, for example, be embedded in small devices such as the sensors developed in theme 3.
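To illustrate the first objective above (finding representations), the sketch below builds a toy embedding table and retrieves the most similar item by cosine similarity. The items and vectors are invented for illustration; in practice such embeddings are learned by a deep network rather than set by hand.

```python
import math

# Toy embedding table: each item (e.g. a microbial marker or an ECG beat
# type) is mapped to a dense vector. These vectors are hand-set here,
# purely for illustration; a real system would learn them.
embeddings = {
    "marker_A": [0.9, 0.1, 0.0],
    "marker_B": [0.8, 0.2, 0.1],   # deliberately close to marker_A
    "marker_C": [0.0, 0.1, 0.95],  # far from the other two
}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(item):
    """Most similar other item in embedding space."""
    return max((k for k in embeddings if k != item),
               key=lambda k: cosine(embeddings[item], embeddings[k]))
```

Here `nearest("marker_A")` returns `"marker_B"`: proximity in the embedding space stands in for semantic similarity, which is exactly what downstream classification or recommendation tasks exploit.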
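For the interpretability objective, here is a minimal sketch of feature attribution: sensitivities of a hypothetical logistic risk score are estimated by finite differences, a crude analogue of the gradient-based saliency used to explain deep models. The model and its weights are assumptions made for this sketch, not any actual UMMISCO model.

```python
import math

# Hypothetical risk model: a logistic score over three input features.
# The weights are invented; a real model would be a trained deep network.
WEIGHTS = [2.0, 0.1, -1.5]

def risk(x):
    """Sigmoid of a weighted sum: a stand-in 'probability of risk'."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x))
    return 1.0 / (1.0 + math.exp(-z))

def sensitivities(x, eps=1e-5):
    """Estimate d(risk)/d(x_i) for each feature by finite differences,
    a crude analogue of gradient-based saliency."""
    base = risk(x)
    grads = []
    for i in range(len(x)):
        xp = list(x)
        xp[i] += eps
        grads.append((risk(xp) - base) / eps)
    return grads
```

On the input `[0.5, 0.5, 0.5]`, the largest-magnitude sensitivity falls on the first feature, matching its dominant weight: the attribution correctly points at what drives the score.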
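On the frugal AI side, one standard lever is post-training weight quantization. The sketch below shows the simplest symmetric per-tensor variant on a hand-picked weight list; it is illustrative only, not a description of any deployed system.

```python
# Post-training weight quantization, a classic lever of frugal AI:
# store weights as 8-bit integers plus one float scale instead of
# 32-bit floats, roughly a 4x memory cut on a small device.

def quantize(weights):
    """Map floats to the int8 range [-127, 127] with a shared scale."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]

weights = [0.5, -1.27, 0.03]
codes, scale = quantize(weights)
restored = dequantize(codes, scale)
```

Round-to-nearest guarantees the reconstruction error stays within half a quantization step (`scale / 2`), which is the trade-off accepted in exchange for the smaller memory and energy footprint.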
Within the framework of this theme, UMMISCO aims to advance methodological research in AI while tackling numerous obstacles, some of which are mentioned here. They relate to the very nature of deep networks, to learning tasks and to final applications. Among the main ones:
- Data quality and bias: already raised in theme 1, this issue is addressed here in close connection with theme 1 (data generation) and theme 3 (data collection). Planned actions concern the quality, standardization and size of the data sets used, as well as model calibration and generalization (Bayet et al. 2022).
- Data annotation and semi-supervised learning: among the key issues in supervised learning, the quality of annotated data occupies an important place, as it can explain the plateaus encountered by deep learning, due in particular to the natural disagreement between expert annotators. The theme will focus (in connection with theme 4) on “putting the human back in the loop”, but other deep learning approaches, such as semi-supervised learning and continual learning, also offer interesting prospects (Chen 2020).
- Architectural design: finding the best architectures for a given task or type of data remains a major challenge. The best architectures are often identified experimentally, which requires extremely costly computing resources. Although methods for optimizing architectures are beginning to appear (Miikkulainen 2024), a theoretical framework and a better mapping of the field are still needed, particularly for the development of more frugal AI systems.
- AI acceptability: this challenge concentrates several important aspects of AI, such as interpretability, safety, trust and robustness. A good example is models that will have a direct impact on patient care. Are we ready today to trust a model responsible for assessing whether a patient should undergo heart surgery? How can these deep models be evaluated in clinical studies similar to those used to assess drug efficacy?
This challenge also covers regulatory aspects (Al Mouatamid et al. 2023) as well as human-science aspects (epistemology, cognitive sciences).
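The semi-supervised route mentioned among these challenges can be illustrated with pseudo-labeling, where confident predictions on unlabeled data are recycled as extra training labels. The nearest-centroid "model" and the confidence margin below are toy assumptions standing in for a deep network and a calibrated confidence score.

```python
# Pseudo-labeling, a simple semi-supervised scheme: a model trained on
# the small labeled set labels the unlabeled pool, and only confident
# predictions are kept as additional training data.

def centroid(points):
    """Mean vector of a list of points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pseudo_label(labeled, unlabeled, margin=0.5):
    """Label an unlabeled point only when the gap between the two nearest
    class centroids exceeds `margin` (a crude confidence gate)."""
    cents = {c: centroid(pts) for c, pts in labeled.items()}
    accepted = []
    for x in unlabeled:
        ranked = sorted((dist2(x, m), c) for c, m in cents.items())
        if ranked[1][0] - ranked[0][0] > margin:  # confident enough
            accepted.append((x, ranked[0][1]))
    return accepted
```

With two well-separated classes, a point near one centroid is accepted with that class's label, while a point equidistant from both is rejected; the ambiguous cases are exactly those that should go back to a human annotator.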
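The experimental identification of architectures described above is often, at its simplest, a random search over configurations. The sketch below samples (depth, width) pairs against a made-up scoring function; `evaluate` is a hypothetical proxy, since a real search would train and validate each candidate, which is precisely what makes the process so costly.

```python
import random

# Random search over a tiny architecture space (depth, width).
# The evaluation function is an invented proxy peaking at (4, 128);
# a real search would train each candidate network.

def evaluate(depth, width):
    """Hypothetical validation score for a (depth, width) configuration."""
    return -(depth - 4) ** 2 - (width - 128) ** 2 / 100.0

def random_search(trials=50, seed=0):
    """Sample configurations at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best_score, best_cfg = None, None
    for _ in range(trials):
        cfg = (rng.randint(1, 8), rng.choice([32, 64, 128, 256]))
        score = evaluate(*cfg)
        if best_score is None or score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg
```

Each of the 50 trials here would be a full training run in practice, which is why more sample-efficient search strategies and a theoretical framework for the field matter, especially for frugal AI.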
UMMISCO 3 has contributed to numerous projects in which AI has been at the heart of methodological and application developments. Some of the most noteworthy, which will continue under UMMISCO 4, include:
- AIME (Artificial Intelligence for Marine Ecosystems): this project focuses on quantifying and modeling changes in biodiversity in various marine ecosystems. An international scientific team with extensive experience in AI, ecology and marine biology has been assembled to develop techniques for automatically generating advanced indicators of the health of these ecosystems, and to create innovative models capable of estimating changes in biodiversity. For example, classification and object detection applied to images of coral reefs are used to generate indicators of coral bleaching (Younes et al. 2024), while natural language processing techniques applied to legal documents are used to generate indicators of the legal protection of the oceans (Al Mouatamid et al. 2023).
- DeepIntegromics (deep and integrative learning from omics data): this is a strategic project that addresses many of the AI modeling objectives presented in § 2.2.2. Conducted in collaboration with SU clinical teams, it aims to exploit deep learning to identify patient phenotypes from clinical and omics data, and focuses on the challenge of learning from raw metagenomic data. A key innovation of the project is the adoption of a cascade of classifiers, a sequential approach in which machine learning models feed each other with increasingly expensive data, used only when necessary to refine predictions or classifications. This method enables more accurate and interpretable analysis, reducing costs by resorting to expensive data only when it is essential to improve prediction. This strategy has proved its worth, for example, in improving the management of patients suffering from cardiometabolic diseases.
- DeepECG4U (translational deep learning for electrocardiogram analysis): this is one of several translational projects (in the sense that they are intended for routine use by physicians) aimed at developing robust, interpretable deep models capable of identifying patients at risk of arrhythmias such as torsade de pointes, which can lead to sudden death. This program has generated numerous national and international collaborations in the North (France, Italy, USA, Netherlands, etc.) and the South (Senegal, Albania), with the aim of validating deep models in clinical studies.
- eCOL+: the aim of this project is to use the power of AI to annotate the unique and very large collections of the Muséum National d'Histoire Naturelle (MNHN). The project is unique in the size of the data processed (over 2 PB), in its diversity and in that of the approaches developed. Launched in 2021 for a duration of 8 years, it brings together a wide range of disciplines, including paleontology, botany, imaging, data analysis and modeling.
- MetaPlantCode: funded by the European Biodiversa+ program and starting in 2024, this project aims to standardize protocols for processing plant biodiversity data. UMMISCO leads a work package on AI approaches, in particular deep learning, to help classify environmental DNA sequences.
- I-Maroc: in this project, UMMISCO is developing an embedded AI that uses a video stream to periodically summarize road traffic along various parameters (number of vehicles, speed, inter-vehicle time, etc.). The aim is to develop a sustainable, low-cost counting station enabling studies in cities not equipped with fixed sensors.
- NAWRAS: this is a multi-disciplinary project in which natural language processing techniques (Al Mouatamid et al. 2023) and recent advances in large language models and their fine-tuning are used to automatically extract information from collections of legal texts, in order to build legal indicators that better capture the contribution of different national laws to ocean preservation. The final product of the project is a publicly accessible dashboard enabling comparisons between several countries (currently over 30). The project is coordinated by UMMISCO (IT department of the Semlalia Faculty of Science, Cadi Ayyad University, and IRD/LEMAR).
- ESPERANTO: this is an H2020 project in which the UMMISCO Central Africa center is a partner. It focuses on the automatic processing of African languages, including speech processing. The scientific challenges addressed are mainly linked to the linguistic characteristics of African languages, which differ from those of the languages most commonly used in NLP and automatic speech processing (tones, alphabets, agglutination, etc.). In addition, these are low-resource languages, which makes it necessary, on the one hand, to propose new learning algorithms and, on the other, to collect and disseminate (labeled) datasets to make them available to the scientific community (Kenfack et al. 2023). The center is collaborating with linguists as domain experts, which brings the question of model explainability into the project; prototype-based explainability methods are being explored.
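The cascade strategy described for the DeepIntegromics project can be sketched as a confidence-gated sequence of models: a cheap model screens every patient, and only uncertain cases trigger the expensive data and the second model. Both "models", the thresholds and the field names below are hypothetical, not the project's actual pipeline.

```python
# Illustrative classifier cascade: decide from cheap data when confident,
# and pay for expensive data only in the uncertain zone.

def cheap_model(clinical):
    """Stand-in risk probability from routine clinical variables."""
    return clinical["score"]

def expensive_model(omics):
    """Stand-in risk probability from costly omics data."""
    return omics["score"]

def cascade(patient, low=0.2, high=0.8):
    """Return a decision and which data tier it required."""
    p = cheap_model(patient["clinical"])
    if p < low:
        return "negative", "clinical only"
    if p > high:
        return "positive", "clinical only"
    # Uncertain zone: acquire the expensive data to refine the call.
    p2 = expensive_model(patient["omics"])
    return ("positive" if p2 > 0.5 else "negative"), "clinical + omics"
```

Clear-cut cases never incur the cost of the second tier, which is the source of the savings the project reports; only the ambiguous middle band pays for the additional data.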
The scientific leadership of this theme will aim to encourage and facilitate exchanges between researchers from different disciplines and geographical centers. The theme will fund transdisciplinary or inter-center projects through an annual call for projects, in coordination with the other themes. It will contribute to the dissemination of knowledge through training courses (Master's, PDI) and seminars with the unit's various partners, in particular around deep learning. Finally, the theme will set up a series of seminars enabling its members to exchange views on advances in AI, which are proceeding at a frenetic pace. The theme has a large number of members, including PhD students, interns, post-docs and young researchers from all centers, which increases the unit's dynamism but also calls for dedicated scientific coordination. For example, since 2022, weekly seminars have been organized for researchers, doctoral students and post-doctoral fellows, with around thirty events per year. This ongoing attention to methodological issues and to monitoring the state of the art will continue in the new edition of the unit, complementing the internal meetings held for each project with our partners.