Latest Twitter Threads by @Dorialexander on Thread Reader App

Nov 10 • 10 tweets • 4 min read

Breaking: we release SYNTH, a fully synthetic generalist dataset for pretraining, and two new SOTA reasoning models trained exclusively on it. Despite having seen only 200 billion tokens, Baguettotron is currently best-in-class in its size range.

SYNTH is a radical departure from the classic pre-training recipe: what if we trained for reasoning and focused on the assimilation of the knowledge and skills that matter? At its core it’s an upsampling of Wikipedia's 50,000 “vital” articles huggingface.co/datasets/PleIA…

Apr 17 • 5 tweets • 2 min read

Ah, Meta released the weights of Byte Latent Transformer! Both 7b and 1b (currently under weak gating).

This was one of the most interesting papers of last year. Seemingly the first effective attempt at dropping token representations in favor of direct byte processing.

https://x.com/Dorialexander/status/1867665269058842885

Feb 18 • 4 tweets • 1 min read

I'm very happy to announce a strategic partnership between @Wikimedia enterprise and @pleiasfr for open, ethical and trustworthy AI innovation. enterprise.wikimedia.com/blog/pleias-an…

The Wikimedia projects are not only a key provider of verifiable information but a fundamental infrastructure for the entire web, from texts to images to semantic data. I've been part of this movement for nearly two decades as a contributor and later an admin of the French community.

Feb 4 • 6 tweets • 1 min read

Since I’m in a somewhat unique position, a short thread on why automating Wikipedia is not at all within reach.

Key immediate issue: source relevancy. Not every piece of information is worth including. This already plagued NotebookLM: you end up with a mix of credible sources and marketing slop, not really knowing which is which. An immediate fix would be to restrict to academic sources, but then you discover that relevancy is highly context dependent.

Feb 3 • 5 tweets • 2 min read

So some background on this thing, since it will mostly be understood as a reply to DeepSeek: it's an EU project that has been in the pipeline for a year, from submission to project answering and evaluation (basically like an expanded Horizon 2020).

https://twitter.com/nathanbenaich/status/1886414128878674358

It has been allocated to a consortium of various companies, research centers and public structures. I don't have the details but there's likely a big roadmap with a lot of work packages, sub-work packages, Gantt diagrams, metrics to hit, bla bla bla

Jan 28 • 6 tweets • 2 min read

I feel this should be a much bigger story: DeepSeek has trained on Nvidia H800s but is running inference on the new homegrown Chinese chips made by Huawei, the 910C.

The 910Cs are an alternative to the H100 and have just been released. Chip independence is basically a national focus at this point in China: it’s extremely hard to reconstruct one of the most complex industrial chains in the world, but they have strong incentives to do so.

Jan 25 • 6 tweets • 1 min read

So, the DeepSeek situation summarized:
*They are not a small engineering team but one of the leading frontier labs (100+ full-time researchers).
*They are not newcomers. They started in 2023 by retraining a Llama, then slowly rose to the top. All documented in their 16 (!) papers.
*5M is for the GPU cost of one pretraining run. Any inference about total cost is just a bad reading of their article. This is a credible figure: @databricks trained a 36B active parameters MoE for 3K GPUs/$10M last year.

Dec 13, 2024 • 14 tweets • 4 min read

So do patches scale better than tokens? Are tokenizers dead? I rarely do a paper thread but this Meta paper is intriguing enough.

First off, even though they don’t reference it, the approach reminds me a lot of image transformers/SigLIP (maybe more so than MambaByte). The idea is to train an encoder and a decoder to manage the transformation of text into "patches", so yeah, encoder-decoders are back, baby.

Jul 19, 2024 • 8 tweets • 4 min read

Breaking: since it is release season, announcing our first suite of specialized language models for document processing tasks (OCR correction, text segmentation, bibliographic extraction) and a new major multimodal dataset we used to train them, Finance Commons.

LLM research is currently focused on quality data. We went in the opposite direction and deliberately trained models on bad data. Far from degrading the models, it made them more resilient to the text sources commonly used in production. huggingface.co/blog/Pclanglai…

Feb 12, 2024 • 6 tweets • 3 min read

Announcing the release of marginalia, a small Python application to perform corpus analysis and retrieve structured annotations with open LLMs like Mistral Open-Hermes-2.5. github.com/Pleias/margina…

marginalia works especially well for bibliographies. The Google Colab demo transforms a very old list (Benjamin Franklin's favorite books from 1744) into well-structured data. colab.research.google.com/drive/1xKjK2mD…

Dec 31, 2023 • 10 tweets • 4 min read

Happy new year! And happy public domain day with a major new entry: the original design of Mickey Mouse!

For the occasion I’m releasing Mickey-1928, a model on @HuggingFace that can generate pictures of Mickey, Minnie and Pete from 1928. huggingface.co/Pclanglais/Mic…

Mickey-1928 is fine-tuned on 96 stills from the three Mickey cartoons that are now in the public domain: Plane Crazy, Steamboat Willie and Gallopin’ Gaucho. I have released the training data, which is obviously in the public domain. huggingface.co/datasets/Pclan…

Mar 23, 2023 • 4 tweets • 1 min read

The anti-open-source turn in AI is intensifying: Facebook is shutting down all the projects and applications derived from its Llama model (which had been tentatively opened two weeks ago and was establishing itself as an open alternative to GPT-3).

https://twitter.com/theshawwn/status/1638925249709240322

Yesterday I attended a presentation of Google's AI projects, and the same trend is taking shape there: engineers and researchers remain rather pro-open source, but things are increasingly getting stuck at the management level (with the idea of keeping a competitive advantage over OpenAI).

Feb 7, 2023 • 28 tweets • 6 min read

ChatGPT, how does it work? An attempt at a technical dissection of OpenAI's bot and, beyond that, of the great revolution in AI text generation scoms.hypotheses.org/1059 First, an essential clarification: there are at least two different models in ChatGPT, a large language model (GPT-3 or 3.5) and a conversational model.

Aug 16, 2022 • 42 tweets • 10 min read

So it's time for a (rather long) thread on AI generative art, copyright, and intellectual property. As a disclaimer, I am not a lawyer, but I do have reasonable expertise in the interaction between culture and copyright (for instance: policyreview.info/articles/analy… or link.springer.com/referenceworke…)

Aug 15, 2022 • 6 tweets • 1 min read

The whole debate on art and AI incidentally makes me realize that the decisive moment when we move from 20th-century visual art to that of the 21st is when video games replace the press as the frame of reference (so roughly around 1995-2005: that's convenient). Fundamentally, there is a relative continuity between the two favored forms of the 20th century, comics in the strip/ligne claire style on one side and abstract art on the other: a move away from naturalism and a relative simplification that transfers very well to the pages of a newspaper or magazine.

Aug 15, 2022 • 6 tweets • 2 min read

First attempt at a video generated with #stablediffusion: the transformation of a Paris street over the 20th century.

(An attempt openly inspired by the evolution of this American living room between 1940 and 2040)

https://twitter.com/Ted_Underwood/status/1557574592071352320

Aug 14, 2022 • 7 tweets • 4 min read

A small additional demonstration on visual generation with #stablediffusion: the tool can also be used for image editing. Generated images actually have two components: the generation text and the "seed", that is, the starting point used by the image generator. By changing the text but keeping the same seed, you can edit an image that stays (roughly) stable.
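
The role of the seed can be illustrated with a toy sketch. This is not Stable Diffusion's API: `initial_noise` and `generate` are hypothetical stand-ins, and the only point shown is that a fixed seed reproduces the same starting noise whatever the prompt, which is why changing the text alone tends to give a related image.

```python
import random


def initial_noise(seed: int, size: int = 8) -> list[float]:
    """Draw the starting noise from a generator initialised with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def generate(prompt: str, seed: int) -> dict:
    """Hypothetical stand-in for an image generator.

    A real model would denoise the starting noise under the guidance
    of the prompt; here we only record both inputs.
    """
    return {"prompt": prompt, "start": initial_noise(seed)}


original = generate("a street in Paris, 1900", seed=42)
edited = generate("a street in Paris, 1950", seed=42)    # new text, same seed
reseeded = generate("a street in Paris, 1950", seed=43)  # same text, new seed

# Same seed: identical starting point, so only the prompt differs.
same_start = original["start"] == edited["start"]
# Different seed: the starting point itself changes.
different_start = edited["start"] != reseeded["start"]
```

In this sketch `same_start` and `different_start` are both true: the prompt never touches the starting noise, only the seed does.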

Aug 14, 2022 • 6 tweets • 2 min read

Unsurprisingly, the debate on AI, artistic creation and copyright is exploding. And I think there is a bias: on social networks we only see the most successful images, without suspecting all the work behind them.

https://twitter.com/arvalis/status/1558623545374023680

I often see comparisons with the birth of photography, and they are very apt: a good generated image has to be framed, and you have to work on the color, the composition and the contrast, as in a photo studio, except that the prompt text takes the place of the staging.

Sep 6, 2019 • 12 tweets • 5 min read

Wikipedia and the other Wikimedia Foundation sites are currently the target of a large-scale cyberattack.

https://twitter.com/strynwm/status/1170043499045150720

The attack is apparently confined to Europe (to the wiki servers located in Amsterdam).

https://twitter.com/wikimediaitalia/status/1170042749166542849?s=21

Aug 9, 2018 • 17 tweets • 4 min read

Alright, a short thread on one problematic aspect (among others) of the @DisinfoEU study spark.adobe.com/page/Sa85zpU5C… : the automated identification of the political orientation of Twitter accounts. In the study, this information is reconstructed from a subdivision of the network into coherent sets, or clusters: cluster 1 corresponds to LR, cluster 2 to the RN, etc.

Mar 27, 2018 • 8 tweets • 2 min read

The leading French scientific institution @CNRS is cancelling its subscription to Springer journals: all access will be closed in a few days

Elsevier should be next on the list. The $173 million national licence is due to be renewed in 2019. The current agreement was disclosed in 2014 by @MaliciaRogue and me in the French media nouvelobs.com/rue89/rue89-no…
