Optimising AI for sensitive industries
1. Data-centric AI webinar #8
Optimising AI for sensitive industries
Dr. Pierre-Carl Langlais
Pleias
2. About pleias
Pleias is a Paris-based startup that’s on a mission to solve the key AI scalability
challenges for sensitive industries — data quality, lack of efficiency, compliance and
security risks.
We provide clients with vertical AI solutions at a fraction of traditional AI costs
thanks to our powerful yet frugal foundation models.
Members of the AI alliance and CurrentAI, we believe in the necessity of open,
copyright-free and factual data for AI.
That’s why we’ve released Common Corpus - the largest fully open corpus for
pre-training: 2 trillion tokens with document-level licensing, provenance and
language information.
3. Open data
Factual data
Semantic data
4. Open data
5. Training data issues…
Language models come with a large number of data issues:
● The largest official source is web archives, with no possible way to filter out problematic (or poisoned?) content at scale.
● In practice, big labs seem to routinely use shadow libraries and other sources of pirated content. This practice is at the center of the Meta trial.
● “We don’t talk about the data”: despite its centrality in training, labs never communicate about the datasets.
6. …are deployer liabilities.
Under current legislation, deployers of models are fully liable.
● You have no guarantee the model won’t output copyrighted content.
● You can’t be completely sure the alignment really fits your regulations and expected norms (the “DeepSeek problem”).
● You don’t know whether the model is really able to process internal data, which may be widely different from the internet data used for training: half of crawled archives are less than 300 words.
7. Tragedy of the commons
While closed labs are protected by obfuscation, open research efforts have been much more precarious. Datasets and models are routinely removed, and sometimes this even leads to trials.
The issue is especially acute in Europe due to the absence of fair use: the text & data mining exception only covers the mining itself, not releasability. Right now, most “open everything” LLMs rely on HuggingFace being hosted in the US.
8. Common corpus
Common Corpus is the largest collection of fully open and releasable texts.
It includes 2 trillion tokens from 500 million distinct documents, all with provenance and licensing attribution at a granular level.
Common Corpus has grown to become a reference training dataset and will be used by 7 different (non-Pleias) LLMs in Europe and beyond.
9. Common corpus
Common Corpus is an aggregation of multiple dispersed sources. While it benefitted from large-scale initiatives (“collections as data”), some of the sourced content is frequently hard to find.
A major continuous focus is the data transformation and augmentation of Common Corpus for training and indexation:
● PDF processing
● OCR correction
● GDPR/toxicity filtering
10. Fully open and auditable LLMs…
● Compliant & auditable: only trained on open data under permissive licenses, in compliance with the European AI Act.
● Extensive multilingual support for the main European languages.
● Efficient: outperforming much bigger models, whilst being able to run (fast) on consumer-grade GPUs and in CPU-only environments.
● Safe: extremely low level of toxicity and problematic content.
11. Factual data
12. What gen AI really is about: RAG
RAG has emerged as the leading use case of generative AI in sensitive industries.
● Despite flattering benchmarks, LLMs hallucinate a lot. A recent estimate showed that Grok was incorrect 90% of the time.
● Relevant data is not AI-ready. It includes a large number of unstructured sources, especially PDFs.
● To fix these shortcomings, RAG applications have become very complex to manage, with multiple workflows.
13. Designing models for factuality
A very important development for putting RAG into production in sensitive industries: better grounding. In January, Anthropic first unveiled a dedicated “citation mode”: the main statements are connected not only to a source id but to an actual literal quote in the original text.
We have been working on a similar feature for several months and improved on the Anthropic version by shortening citations.
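As a purely illustrative sketch (not the actual markup used by Anthropic or Pleias), literal grounding can be checked by verifying that each cited span is a verbatim substring of the retrieved source; the <ref name="...">...</ref> syntax below is an assumed Wikipedia-style citation format.

```python
import re

def verify_citations(answer: str, sources: dict[str, str]) -> list[dict]:
    """Check that every cited span is a literal quote from its source.

    The <ref name="...">...</ref> syntax is only an illustration of a
    Wikipedia-style citation format, not the exact markup used in
    Anthropic's citation mode or in Pleias models.
    """
    checks = []
    for source_id, quote in re.findall(r'<ref name="([^"]+)">(.*?)</ref>', answer, re.DOTALL):
        source_text = sources.get(source_id, "")
        checks.append({
            "source": source_id,
            "quote": quote.strip(),
            "verbatim": quote.strip() in source_text,  # literal grounding check
        })
    return checks

# Toy example: the first citation is verbatim, the second is not.
sources = {"doc1": "Tosca is an opera in three acts by Giacomo Puccini."}
answer = ('Tosca is an opera <ref name="doc1">opera in three acts</ref> '
          'written in 1900 <ref name="doc1">composed in 1900</ref>.')
print(verify_citations(answer, sources))
```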
15. Designing models for factuality
Pico is a continuous pretrain from our foundation model Pleias-350m, specialized on 45B tokens of prepared RAG data. Despite its very small size (GPT2-medium), even the smallest variant attains an accuracy level superior to LLMs many times its size, thanks to the following features:
● Systematic source citation with attribution, using the Wikipedia syntax.
● Reasoning steps (“inference time”) to internalize reranking and query reformulation.
● Reinforcement learning with citation exactness as reward (GRPO), sketched below.
● Leveraging the model’s internal metrics to track accuracy at inference time.
Our internal inference system leverages attention scores to track the impact of reasoning traces on RAG accuracy.
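A minimal sketch of how citation exactness could feed a GRPO-style reward, reusing the verification helper from the earlier sketch; the scoring rule (share of verbatim citations, group-normalized advantage) is an assumption, not the published Pleias training recipe.

```python
def citation_reward(completion: str, sources: dict[str, str]) -> float:
    """Reward in [0, 1]: the share of citations that quote their source verbatim."""
    checks = verify_citations(completion, sources)  # helper from the earlier sketch
    if not checks:
        return 0.0  # answers with no citations earn no reward
    return sum(c["verbatim"] for c in checks) / len(checks)

def grpo_advantages(completions: list[str], sources: dict[str, str]) -> list[float]:
    """Group-relative advantages: normalize each reward against the statistics of
    a group of completions sampled for the same prompt, as GRPO does."""
    rewards = [citation_reward(c, sources) for c in completions]
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]
```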
16. Integrating RAG into local infrastructure
A further critical issue for the regulated sector: interacting with sensitive data locally.
Until now, capable local LLMs required significant GPU investments due to the intensive use of the context window. RAG specialization makes it possible to get usable results with even a 350m model (1,500 tokens/s on a T4, 20 seconds on CPU).
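A rough sketch of running such a small RAG-specialized model locally on CPU with Hugging Face transformers; the model identifier and the prompt template are placeholders (the actual Pleias checkpoint name and special tokens are not given in the slides).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model id: substitute the actual Pleias RAG checkpoint you deploy.
model_id = "pleias/rag-350m"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # ~350M params fits easily in CPU RAM

# Pack the retrieved sources and the query into the prompt; the exact
# prompt template is model-specific and assumed here for illustration.
prompt = (
    "<|query|>When did Tosca premiere?\n"
    "<|source_1|>Tosca premiered at the Teatro Costanzi in Rome on 14 January 1900.\n"
    "<|answer|>"
)
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```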
17. RAG beyond models: building a factual asset
Common Corpus also aims to become a resource for RAG. We are currently deploying selected and preprocessed versions for the following use cases:
● Government norms and regulations
● Scientific articles in medicine
● Healthcare guidelines
● Banking standards
18. Semantic data
19. Semantic data
The semantic web is a major component of the data infrastructure for regulated industries. RDF/XML standards ensure the following functions:
● Operationalize standards of data representation and completeness.
● Ensure interoperability.
● Enforce transparency and findability across the web of data.
20. Semantic data
Due to the limitations of embedding-based indexation, there is a new wave of interest in knowledge graph representations. Yet even frontier models struggle with knowledge graph generation:
● Lack of format familiarity, as RDF is likely not part of the training data.
● Inability to encode large and complex standards through prompting techniques alone.
● Constant lack of data adherence.
21. Training a model on Wikidata
Thanks to our partnership with Wikimedia Foundation Enterprise, we have the opportunity to train the first language model on Wikidata. The first official version will be released in 1-2 weeks.
For now, the current prototype includes about 20% of all Wikidata items, with a simplified version of RDF to save on inference (and avoid repetition loops…).
Wikidata statements:
The Man Without Qualities | instance of | unfinished novel
The Man Without Qualities | instance of | literary work
The Man Without Qualities | author | Robert Musil
The Man Without Qualities | title | {'text': 'Der Mann ohne Eigenschaften', 'language': 'de'}
Simplified serialization:
<|subject|>The Man Without Qualities
<|property|>instance of<|object|>unfinished novel<|object|>literary work
<|property|>author<|object|>Robert Musil
<|property|>title<|object|>{'text': 'Der Mann ohne Eigenschaften', 'language': 'de'}
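A minimal sketch of how this simplified serialization could be produced from plain triples; the special tokens follow the slide, while the function itself and the grouping rule (emit each property once, merging its objects) are assumptions.

```python
def serialize(subject: str, triples: list[tuple[str, str]]) -> str:
    """Serialize (property, object) pairs into the special-token format above,
    merging objects that share a property so the property appears only once."""
    grouped: dict[str, list[str]] = {}
    for prop, obj in triples:
        grouped.setdefault(prop, []).append(obj)
    lines = [f"<|subject|>{subject}"]
    for prop, objects in grouped.items():
        lines.append(f"<|property|>{prop}" + "".join(f"<|object|>{o}" for o in objects))
    return "\n".join(lines)

triples = [
    ("instance of", "unfinished novel"),
    ("instance of", "literary work"),
    ("author", "Robert Musil"),
    ("title", "{'text': 'Der Mann ohne Eigenschaften', 'language': 'de'}"),
]
print(serialize("The Man Without Qualities", triples))
```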
22. Training a model on Wikidata
Training relies on a synthetic pipeline that ensures the model memorizes not only all Wikidata properties but also their names and functions. Our dataset has three components:
● The original Wikidata statements as output (what we want to predict).
● Synthetic/imaginary texts generated from the Wikidata statements and “seed” texts, as input.
● Reasoning traces, as some form of intermediary draft and open reflection from the model itself.
[Pipeline diagram: filtered Wikidata statements + a seed text taken at random from Common Corpus → synthetic/imaginary text generated by an LLM (“backtranslation”) → synthetic reasoning trace/analysis (“backreasoning”)]
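Putting the three components together, one training record might look like the sketch below; the field names are illustrative and only the structure (backtranslated text as input, backreasoning as intermediate draft, original statements as target) is taken from the slide.

```python
# Illustrative field names; the actual dataset schema is not shown in the slides.
record = {
    "input_text": "Synthetic passage generated by an LLM from the filtered "
                  "Wikidata statements and a random Common Corpus seed text.",
    "reasoning_trace": "Back-reasoned analysis: which entities and properties "
                       "the passage supports, drafted before the final output.",
    "target_statements": (
        "<|subject|>The Man Without Qualities\n"
        "<|property|>instance of<|object|>unfinished novel<|object|>literary work\n"
        "<|property|>author<|object|>Robert Musil"
    ),
}
```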
23. Training a model on Wikidata
Source text:
Tosca is an opera in three acts by Giacomo Puccini to an Italian libretto by Luigi Illica and Giuseppe Giacosa. It premiered at the Teatro Costanzi in Rome on 14 January 1900. The work, based on Victorien Sardou's 1887 French-language dramatic play, La Tosca, is a melodramatic piece set in Rome in June 1800, with the Kingdom of Naples's control of Rome threatened by Napoleon's invasion of Italy. It contains depictions of torture, murder, and suicide, as well as some of Puccini's best-known lyrical arias.
Reasoning:
Oh, this is clearly about an opera - that's a P136 genre right there! And we have the full title with those Italian words, that's definitely going to be a title property. The composer and librettist are explicitly mentioned - those are usually important attributes for opera.
Wait, there's this interesting bit about it being based on another play - that's a specific relationship we should capture. And it's set in Rome, that's a location property for sure.
The language aspect is interesting - it's in Italian but it's set in Naples... that's going to need some careful handling with language tags. And those depictions - torture, murder, suicide, there's a whole list of what's shown in the opera.
Aha! The fact that it's a "melodramatic piece" gives us the form of the work too.
24. Training a model on Wikidata
Mapping notes:
1. Core identity:
- Map "opera" to P31 (instance of)
- Use the full title with language tag 'it' for P1476
- Map "melodramatic" to P7937 (form of creative work)
2. Creative roles:
- Direct mapping of composer (P86) to Giacomo Puccini
- Direct mapping of librettist (P87) to both Luigi Illica and Giuseppe Giacosa
3. Location and setting:
- Map "Teatro Costanzi" to P4647 (location of first performance)
Structured data:
<|subject|>Tosca
<|property|>instance of<|object|>opera
<|property|>title<|object|>{'text': 'Tosca', 'language': 'it'}
<|property|>composer<|object|>Giacomo Puccini
<|property|>librettist<|object|>Luigi Illica<|object|>Giuseppe Giacosa
<|property|>genre<|object|>opera
<|property|>location of first performance<|object|>Teatro Costanzi
<|property|>narrative location<|object|>Rome
25. A replicable approach
● Banking & finance (ISO 20022, FIBO)
● Telecommunications (GSMA)
● Government [various]
● Science (OpenAlex/Wikidata)
● Patents (ICE)
26. Language models are knowledge graphs?
Why is a very small model working so well for semantic data? Maybe because language models are already building graph relationships across knowledge entities in the text (the attention graph).
When checking which words are the most attended to while the model is settling on the “German” language value, attention backtracks not only to the original text but, even more strikingly, to the preparatory draft (and to the property id, P407).
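A rough sketch of this kind of attention inspection with Hugging Face transformers; the model id is a placeholder, the toy prompt stands in for the source text plus preparatory draft, and the aggregation choice (mean over heads of the last layer, attention of the final position) is one simple option among many rather than the Pleias internal method.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pleias/wikidata-prototype"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Toy context: source text, preparatory draft, then the partial structured output.
text = ("Source: Der Mann ohne Eigenschaften is a novel by Robert Musil.\n"
        "Draft: the title carries a language tag, P407 German.\n"
        "<|property|>language of work<|object|>German")
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# Attention of the final position (the token being generated) over the context,
# averaged across the heads of the last layer.
last_layer = out.attentions[-1][0]          # (heads, seq, seq)
scores = last_layer[:, -1, :].mean(dim=0)   # (seq,)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, score in sorted(zip(tokens, scores.tolist()), key=lambda x: -x[1])[:10]:
    print(f"{score:.3f}  {tok}")
```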
28. Conclusion