Optimising AI for sensitive industries
1. Data-centric AI webinar #8
Optimising AI for sensitive industries
Dr. Pierre-Carl Langlais
Pleias
2. About pleias
Pleias is a Paris-based startup that’s on a mission to solve the key AI scalability
challenges for sensitive industries — data quality, lack of efficiency, compliance and
security risks.
We provide clients with vertical AI solutions at a fraction of traditional AI costs
thanks to our powerful yet frugal foundation models.
Members of the AI alliance and CurrentAI, we believe in the necessity of open,
copyright-free and factual data for AI.
That’s why we’ve released Common Corpus - the largest fully open corpus for
pre-training: 2 trillion tokens with document-level licensing, provenance and
language information.
3. Open data
Factual data
Semantic data
4. Open data
5. Training data issues…
Language models come with a large number of data issues:
● The largest official source is web archives, with no possible way to filter out problematic (or poisoned?) content at scale.
● In practice, big labs seem to routinely use shadow libraries and other sources of pirated content. This practice is at the center of the Meta trial.
● “We don’t talk about the data”: despite its centrality in training, labs never communicate about the datasets.
6. …are deployer liabilities.
Under current legislation, deployers of models are fully liable.
● You have no guarantee the model won’t output copyrighted content.
● You can’t be completely sure the alignment really fits your regulations and expected norms (the “DeepSeek problem”).
● You don’t know whether the model is really able to process internal data, which may be widely different from the internet data used for training: half of crawled archives are less than 300 words.
7. Tragedy of the commons
While closed labs are protected by obfuscation, open research efforts have been much more precarious. Datasets and models are routinely removed, and sometimes this even leads to trials.
The issue is especially acute in Europe due to the absence of fair use: the text & data mining exception only covers the mining itself, not releasability. Right now, most “open everything” LLMs rely on HuggingFace being hosted in the US.
8. Common corpus
Common Corpus is the largest collection of fully open and releasable texts.
It includes 2 trillion tokens from 500 million distinct documents, all with provenance and licensing attribution at a granular level.
Common Corpus has grown to become a reference training dataset and will be used by 7 different (non-Pleias) LLMs in Europe and beyond.
9. Common corpus
Common Corpus is an aggregation of multiple dispersed sources. While it benefitted from large-scale initiatives (“collections as data”), some of the sourced content is frequently hard to find.
A major continuous focus is the data transformation and augmentation of Common Corpus for training and indexation:
● PDF processing
● OCR correction
● GDPR/toxicity filtering
10. Fully open and auditable LLMs…
● Compliant & auditable: only trained on open data under permissive licenses, in compliance with the European AI Act.
● Extensive multilingual support for the main European languages.
● Efficient: outperforming much bigger models, whilst being able to run (fast) on consumer-grade GPUs and in CPU-only environments.
● Safe: extremely low level of toxicity and problematic content.
11. Factual data
12. What gen AI really is about: RAG
RAG has emerged as the leading use case of generative AI in sensitive industries.
● Despite flattering benchmarks, LLMs hallucinate a lot. A recent estimate showed that Grok was incorrect 90% of the time.
● Relevant data is not AI-ready. It includes a large number of unstructured sources, especially PDFs.
● To fix these shortcomings, RAG applications have become very complex to manage, with multiple workflows.
13. Designing models for factuality
A very important development for putting RAG into production in sensitive industries: better grounding. In January, Anthropic first unveiled a dedicated “citation mode”: the main statements are connected not only to a source id but to an actual literal quote in the original text.
We have been working on a similar feature for several months and improved on the Anthropic version by shortening citations.
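As a purely illustrative sketch (not the actual markup used by Anthropic or Pleias), literal grounding can be checked by verifying that each cited span is a verbatim substring of the retrieved source; the <ref name="...">...</ref> syntax below is an assumed Wikipedia-style citation format.

```python
import re

def verify_citations(answer: str, sources: dict[str, str]) -> list[dict]:
    """Check that every cited span is a literal quote from its source.

    The <ref name="...">...</ref> syntax is only an illustration of a
    Wikipedia-style citation format, not the exact markup used in
    Anthropic's citation mode or in Pleias models.
    """
    checks = []
    for source_id, quote in re.findall(r'<ref name="([^"]+)">(.*?)</ref>', answer, re.DOTALL):
        source_text = sources.get(source_id, "")
        checks.append({
            "source": source_id,
            "quote": quote.strip(),
            "verbatim": quote.strip() in source_text,  # literal grounding check
        })
    return checks

# Toy example: the first citation is verbatim, the second is not.
sources = {"doc1": "Tosca is an opera in three acts by Giacomo Puccini."}
answer = ('Tosca is an opera <ref name="doc1">opera in three acts</ref> '
          'written in 1900 <ref name="doc1">composed in 1900</ref>.')
print(verify_citations(answer, sources))
```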
15. Designing models for factuality
Pico is a continuous pretrain from our foundation model Pleias-350m, specialized on 45B tokens of prepared RAG data. Despite its very small size (GPT2-medium), even the smallest variant attains an accuracy level superior to LLMs many times its size, thanks to the following features:
● Systematic source citation with attribution, using the Wikipedia syntax.
● Reasoning steps (“inference time”) to internalize reranking and query reformulation.
● Reinforcement learning with citation exactness as reward (GRPO), sketched below.
● Leveraging the model’s internal metrics to track accuracy at inference time.
Our internal inference system leverages attention scores to track the impact of reasoning traces on RAG accuracy.
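A minimal sketch of how citation exactness could feed a GRPO-style reward, reusing the verification helper from the earlier sketch; the scoring rule (share of verbatim citations, group-normalized advantage) is an assumption, not the published Pleias training recipe.

```python
def citation_reward(completion: str, sources: dict[str, str]) -> float:
    """Reward in [0, 1]: the share of citations that quote their source verbatim."""
    checks = verify_citations(completion, sources)  # helper from the earlier sketch
    if not checks:
        return 0.0  # answers with no citations earn no reward
    return sum(c["verbatim"] for c in checks) / len(checks)

def grpo_advantages(completions: list[str], sources: dict[str, str]) -> list[float]:
    """Group-relative advantages: normalize each reward against the statistics of
    a group of completions sampled for the same prompt, as GRPO does."""
    rewards = [citation_reward(c, sources) for c in completions]
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]
```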
16. Integrating RAG into local infrastructure
A further critical issue for the regulated sector: interacting with sensitive data locally.
Until now, capable local LLMs required significant GPU investments due to the intensive use of the context window. RAG specialization makes it possible to get usable results with even a 350m model (1,500 tokens/s on a T4, 20 seconds on CPU).
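A rough sketch of running such a small RAG-specialized model locally on CPU with Hugging Face transformers; the model identifier and the prompt template are placeholders (the actual Pleias checkpoint name and special tokens are not given in the slides).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model id: substitute the actual Pleias RAG checkpoint you deploy.
model_id = "pleias/rag-350m"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # ~350M params fits easily in CPU RAM

# Pack the retrieved sources and the query into the prompt; the exact
# prompt template is model-specific and assumed here for illustration.
prompt = (
    "<|query|>When did Tosca premiere?\n"
    "<|source_1|>Tosca premiered at the Teatro Costanzi in Rome on 14 January 1900.\n"
    "<|answer|>"
)
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```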
17. RAG beyond models: building a factual asset
Common Corpus also aims to become a resource for RAG. We are currently deploying selected and preprocessed versions for the following use cases:
● Government norms and regulations
● Scientific articles in medicine
● Healthcare guidelines
● Banking standards
18. Semantic data
19. Semantic data
The semantic web is a major component of the data infrastructure for regulated industries. RDF/XML standards ensure the following functions:
● Operationalize standards of data representation and completeness.
● Ensure interoperability.
● Enforce transparency and findability across the web of data.
20. Semantic data
Due to the limitations of embedding-based indexation, there is a new wave of interest in knowledge graph representations. Yet even frontier models struggle with knowledge graph generation:
● Lack of format familiarity, as RDF is likely not part of the training data.
● Inability to encode large and complex standards through prompting techniques alone.
● Constant lack of data adherence.
21. Training a model on Wikidata
Thanks to our partnership with Wikimedia Foundation Enterprise, we have the opportunity to train the first language model on Wikidata. The first official version will be released in 1-2 weeks.
For now, the current prototype includes about 20% of all Wikidata items, with a simplified version of RDF to save on inference (and avoid repetition loops…).
Wikidata statements:
The Man Without Qualities | instance of | unfinished novel
The Man Without Qualities | instance of | literary work
The Man Without Qualities | author | Robert Musil
The Man Without Qualities | title | {'text': 'Der Mann ohne Eigenschaften', 'language': 'de'}
Simplified serialization:
<|subject|>The Man Without Qualities
<|property|>instance of<|object|>unfinished novel<|object|>literary work
<|property|>author<|object|>Robert Musil
<|property|>title<|object|>{'text': 'Der Mann ohne Eigenschaften', 'language': 'de'}
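A minimal sketch of how this simplified serialization could be produced from plain triples; the special tokens follow the slide, while the function itself and the grouping rule (emit each property once, merging its objects) are assumptions.

```python
def serialize(subject: str, triples: list[tuple[str, str]]) -> str:
    """Serialize (property, object) pairs into the special-token format above,
    merging objects that share a property so the property appears only once."""
    grouped: dict[str, list[str]] = {}
    for prop, obj in triples:
        grouped.setdefault(prop, []).append(obj)
    lines = [f"<|subject|>{subject}"]
    for prop, objects in grouped.items():
        lines.append(f"<|property|>{prop}" + "".join(f"<|object|>{o}" for o in objects))
    return "\n".join(lines)

triples = [
    ("instance of", "unfinished novel"),
    ("instance of", "literary work"),
    ("author", "Robert Musil"),
    ("title", "{'text': 'Der Mann ohne Eigenschaften', 'language': 'de'}"),
]
print(serialize("The Man Without Qualities", triples))
```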
22. Training a model on Wikidata
Training relies on a synthetic pipeline that ensures the model memorizes not only all Wikidata properties but also their names and functions. Our dataset has three components:
● The original Wikidata statements as output (what we want to predict).
● Synthetic/imaginary texts generated from the Wikidata statements and “seed” texts, as input.
● Reasoning traces, as some form of intermediary draft and open reflection from the model itself.
[Pipeline diagram: filtered Wikidata statements + a seed text taken at random from Common Corpus → synthetic/imaginary text generated by an LLM (“backtranslation”) → synthetic reasoning trace/analysis (“backreasoning”)]
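Putting the three components together, one training record might look like the sketch below; the field names are illustrative and only the structure (backtranslated text as input, backreasoning as intermediate draft, original statements as target) is taken from the slide.

```python
# Illustrative field names; the actual dataset schema is not shown in the slides.
record = {
    "input_text": "Synthetic passage generated by an LLM from the filtered "
                  "Wikidata statements and a random Common Corpus seed text.",
    "reasoning_trace": "Back-reasoned analysis: which entities and properties "
                       "the passage supports, drafted before the final output.",
    "target_statements": (
        "<|subject|>The Man Without Qualities\n"
        "<|property|>instance of<|object|>unfinished novel<|object|>literary work\n"
        "<|property|>author<|object|>Robert Musil"
    ),
}
```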
23. Training a model on Wikidata
Source text:
Tosca is an opera in three acts by Giacomo Puccini to an Italian libretto by Luigi Illica and Giuseppe Giacosa. It premiered at the Teatro Costanzi in Rome on 14 January 1900. The work, based on Victorien Sardou's 1887 French-language dramatic play, La Tosca, is a melodramatic piece set in Rome in June 1800, with the Kingdom of Naples's control of Rome threatened by Napoleon's invasion of Italy. It contains depictions of torture, murder, and suicide, as well as some of Puccini's best-known lyrical arias.
Reasoning:
Oh, this is clearly about an opera - that's a P136 genre right there! And we have the full title with those Italian words, that's definitely going to be a title property. The composer and librettist are explicitly mentioned - those are usually important attributes for opera.
Wait, there's this interesting bit about it being based on another play - that's a specific relationship we should capture. And it's set in Rome, that's a location property for sure.
The language aspect is interesting - it's in Italian but it's set in Naples... that's going to need some careful handling with language tags. And those depictions - torture, murder, suicide, there's a whole list of what's shown in the opera.
Aha! The fact that it's a "melodramatic piece" gives us the form of the work too.
24. Training a model on Wikidata
Mapping notes:
1. Core identity:
- Map "opera" to P31 (instance of)
- Use the full title with language tag 'it' for P1476
- Map "melodramatic" to P7937 (form of creative work)
2. Creative roles:
- Direct mapping of composer (P86) to Giacomo Puccini
- Direct mapping of librettist (P87) to both Luigi Illica and Giuseppe Giacosa
3. Location and setting:
- Map "Teatro Costanzi" to P4647 (location of first performance)
Structured data:
<|subject|>Tosca
<|property|>instance of<|object|>opera
<|property|>title<|object|>{'text': 'Tosca', 'language': 'it'}
<|property|>composer<|object|>Giacomo Puccini
<|property|>librettist<|object|>Luigi Illica<|object|>Giuseppe Giacosa
<|property|>genre<|object|>opera
<|property|>location of first performance<|object|>Teatro Costanzi
<|property|>narrative location<|object|>Rome
25. A replicable approach
● Banking & finance (ISO 20022, FIBO)
● Telecommunications (GSMA)
● Government [various]
● Science (OpenAlex/Wikidata)
● Patents (ICE)
26. Language models are knowledge graphs?
Why is a very small model working so well for semantic data? Maybe because language models are already building graph relationships across knowledge entities in the text (the attention graph).
When checking which words are the most attended to while the model is settling on the “German” language value, attention backtracks not only to the original text but, even more strikingly, to the preparatory draft (and to the property id, P407).
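A rough sketch of this kind of attention inspection with Hugging Face transformers; the model id is a placeholder, the toy prompt stands in for the source text plus preparatory draft, and the aggregation choice (mean over heads of the last layer, attention of the final position) is one simple option among many rather than the Pleias internal method.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pleias/wikidata-prototype"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Toy context: source text, preparatory draft, then the partial structured output.
text = ("Source: Der Mann ohne Eigenschaften is a novel by Robert Musil.\n"
        "Draft: the title carries a language tag, P407 German.\n"
        "<|property|>language of work<|object|>German")
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# Attention of the final position (the token being generated) over the context,
# averaged across the heads of the last layer.
last_layer = out.attentions[-1][0]          # (heads, seq, seq)
scores = last_layer[:, -1, :].mean(dim=0)   # (seq,)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, score in sorted(zip(tokens, scores.tolist()), key=lambda x: -x[1])[:10]:
    print(f"{score:.3f}  {tok}")
```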
28. Conclusion