OpenLLM France

Developing open-source, transparent AI with a French twist

Building digital commons for the French language to put AI building blocks in the hands of researchers, engineers and educators working on local use cases

GitHub

About us

OpenLLM France is a research project funded by BPI France for a period of two years.

Born from the OpenLLM France community – a group of academic and industry stakeholders interested in truly open-source generative models – the consortium is composed of nine official partners.

These nine partners are supported by 12 associate partners.

Our philosophy

Our goal is to create digital commons and share expertise to develop ethical and open AI applications in France that are proficient in the French language.

French Data Collection and Processing

Developing French corpora helps minimize biases from English, which is heavily overrepresented in open training datasets.

Open Redistribution of Training Data

We republish our datasets in the format used for training in order to ensure the auditability of our data and models.

Sharing of Model Weights Under Open Licenses

We share final and intermediate pretraining checkpoints to facilitate research and continual pretraining.

Open Publication of Training and Processing Code

Sharing code for training and data processing promotes interpretability and helps others get started on model training.

Our research topics

We are researchers, developers, and practitioners working at the intersection of several fields.

Multilinguality

From education to healthcare, AI in French-speaking countries requires strong French language skills, often overlooked by English-centric models. We provide resources tailored to French while advancing research on training French-language, bilingual, and multilingual models.

Clean data

Our work is guided by a commitment to data transparency and respect for intellectual property, in full compliance with European directives. While this approach may impact model performance, we believe that the long-term benefits of openly sharing training data far outweigh the trade-offs by fostering future research and development.

Multimodality

Education

An important goal of our project is to improve the use of AI in education. This involves working with educators to develop models that support both teachers and learners in real-world scenarios and, above all, collaborating with experts to raise awareness of the risks associated with AI and to promote best practices.

Explore our resources

The Luciole family is our brand-new lineup of pre-trained language models. Just like Lucie 7B, the Luciole models were trained on approximately 30% French data.

Check out Luciole 1B, 8B, and 23B, as well as the training data, on Hugging Face. Our code for data processing and model training can be found on our GitHub repository.

Luciole Collection

Luciole Code

Model Sizes

1B for edge use cases, 8B Mamba hybrid for better management of long contexts, and 23B for increased performance and reasoning.

Billion tokens

Carefully selected to strike a balance between quality and diversity, while retaining our commitment to openness and transparency.

Languages

A multilingual approach, with a particular focus on French and the major European languages, ensuring cultural and linguistic representation.

LUCIE 7B

Lucie-7B is our first foundation model trained from scratch. It was the first large French-focused foundation model, with French making up more than 30% of its training data.

To learn more about the Lucie family of models and their training data, check out our spaces on Hugging Face and GitHub.

Lucie Code

Lucie Collection

Our commitments to energy-efficient generative AI

As part of our commitment to sustainable development, we conduct an environmental life-cycle analysis of our models based on the AFNOR methodology from the General Reference for Frugal AI. This assessment covers all stages of the process, from training to inference.

Want to learn more?

Contact us at contact@openllm-france.fr.