SYNTH is a radical departure from the classic pre-training recipe: what if we trained for reasoning and focused on assimilating the knowledge and skills that matter? At its core, it's an upsampling of Wikipedia's 50,000 "vital" articles huggingface.co/datasets/PleIA…
This was one of the most interesting papers of last year. Seemingly the first effective attempt at dropping token representations in favor of direct byte processing https://x.com/Dorialexander/status/1867665269058842885
The Wikimedia projects are not only a key provider of verifiable information but also fundamental infrastructure for the entire web, from texts to images to semantic data. I've been part of this movement for nearly two decades, as a contributor and later an admin of the French community.
https://twitter.com/nathanbenaich/status/1886414128878674358
It has been allocated to a consortium of various companies, research centers, and public institutions. I don't have the details, but there's likely a big roadmap with a lot of work packages, sub-work packages, Gantt charts, metrics to hit, bla bla bla
The 910Cs are an alternative to the H100 and have just been released. Chip independence is basically a national focus in China at this point: it's extremely hard to reconstruct one of the most complex industrial chains in the world, but they have strong incentives to do it.
First off, even though they don't reference it, the approach reminds me a lot of image transformers/SigLIP (maybe more so than MambaByte). What they do is train an encoder and a decoder to manage the transformation of text into "patches". So yeah, encoder-decoders are back, baby.
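The vision-transformer analogy can be made concrete with a minimal sketch. This is purely illustrative and not the paper's actual method: here bytes are cut into fixed-size patches that an encoder would then embed, whereas the paper learns dynamic patch boundaries.

```python
def to_patches(text: str, patch_size: int = 4) -> list[bytes]:
    """Split a UTF-8 byte stream into fixed-size patches,
    analogous to image patches in a vision transformer.
    (Hypothetical sketch: the real approach learns where
    patch boundaries fall instead of using a fixed size.)"""
    data = text.encode("utf-8")
    return [data[i:i + patch_size] for i in range(0, len(data), patch_size)]

# each patch is a short byte string the encoder would embed
patches = to_patches("hello world")
```

The point of the analogy: just as an image patch replaces per-pixel attention, a byte patch replaces per-token (or per-byte) attention with a coarser learned unit.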

LLM research is currently focused on quality data. We went in the opposite direction and deliberately trained models on bad data. Far from degrading the models, it made them more resilient to the text sources commonly used in production. huggingface.co/blog/Pclanglai…
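A minimal sketch of the idea of deliberately corrupting training text. This helper is hypothetical (the actual corruption pipeline is not described here); it mimics common OCR-style substitutions at a configurable rate.

```python
import random

def add_ocr_noise(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Corrupt a fraction of characters with OCR-like substitutions,
    a simple stand-in for the degraded sources used in training.
    (Hypothetical helper, not the actual pipeline.)"""
    rng = random.Random(seed)  # seeded for reproducible corpora
    subs = {"e": "c", "o": "0", "l": "1", "a": "à"}
    out = []
    for ch in text:
        if ch in subs and rng.random() < rate:
            out.append(subs[ch])
        else:
            out.append(ch)
    return "".join(out)
```

Training on a mix of clean text and such noisy variants is one plausible way to make a model robust to the messy inputs it will actually see in production.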
Marginalia works especially well for bibliographies. The Google Colab demo transforms a very old list (Benjamin Franklin's favorite books from 1744) into well-structured data. colab.research.google.com/drive/1xKjK2mD…
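To illustrate the kind of structured record such extraction produces, here is a naive sketch assuming a simple "Author, Title (Year)" pattern. The regex, the function name, and the sample entry are all hypothetical: Marginalia handles far messier historical input than this.

```python
import re

def parse_entry(line: str) -> dict:
    """Naive regex parse of an 'Author, Title (Year)' entry into fields.
    (Illustrative stand-in only: Marginalia uses a model, not a regex.)"""
    m = re.match(r"(?P<author>[^,]+),\s*(?P<title>.+?)\s*\((?P<year>\d{4})\)", line)
    return m.groupdict() if m else {"raw": line}

record = parse_entry("Benjamin Franklin, Poor Richard's Almanack (1744)")
```

The output is a dict with `author`, `title`, and `year` fields when the pattern matches, and the raw line otherwise, so nothing is silently dropped.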



Mickey-1928 is fine-tuned on 96 stills from the three Mickey cartoons that are now in the public domain: Plane Crazy, Steamboat Willie, and Gallopin' Gaucho. I have released the training data, which is obviously in the public domain. huggingface.co/datasets/Pclan…
https://twitter.com/theshawwn/status/1638925249709240322
I attended a presentation of Google's AI projects yesterday, and that's also the pattern taking shape: engineers and researchers remain rather pro-open-source, but it increasingly gets stuck at the management level (with the idea of keeping a competitive advantage over OpenAI).
https://twitter.com/Ted_Underwood/status/1557574592071352320
https://twitter.com/arvalis/status/1558623545374023680
I often see comparisons with the birth of photography, and it's very apt: a good generated image needs framing; you have to work on the color, the composition, the contrast, just like in a photo studio, except that the prompt text takes the place of the staging.
https://twitter.com/strynwm/status/1170043499045150720
The attack is apparently confined to Europe (to the Wikipedia servers located in Amsterdam).
https://twitter.com/wikimediaitalia/status/1170042749166542849?s=21
Elsevier should be next on the list. The $173 million national licence is due to be renewed in 2019. The current agreement was disclosed in 2014 by @MaliciaRogue and me in the French media nouvelobs.com/rue89/rue89-no…