people who are baffled by DeepSeek have been and still are sleeping on Qwen, InternLM, ByteDance and Tencent
here's a couple of fan-favorite models from them 🪭
Qwen2VL: the best open-source vision language model
here's a demo for the 7B; use the 72B for longer context length and better performance huggingface.co/spaces/Ganymed…
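if you'd rather run it locally, here's a minimal sketch with 🤗 transformers (image path and prompt are placeholders):

```python
# rough sketch: running Qwen2-VL 7B with transformers; the image file is a placeholder
import torch
from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

image = Image.open("example.jpg")  # placeholder image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```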
Jan 9 • 4 tweets • 2 min read
ByteDance just dropped SA2VA: a new family of vision LMs combining Qwen2VL/InternVL and SAM2 with MIT license 💗
The models are capable of vision-language understanding and visual referring tasks (referring segmentation) on both images and videos ⏯️
take a look 🧶
The models come in 1B, 4B and 8B sizes; they use InternVL2.5 as the base architecture and Qwen2, Qwen2.5 or InternLM2 for the language model part (depending on the checkpoint)
NVIDIA solved physics and open-sourced it? Can we just build our own autonomous robots now? 🤯
They released Cosmos: new family of open world foundation models (WFMs) 🌌
Unwrapping the release and why it's so revolutionary 🧶
They have released ten models in total:
- four autoregressive video2world models
- four diffusion-based video2world and text2world models
- prompt upsampler
- content safety model
however cool your LLM is, without being agentic it can only go so far
enter smolagents: a new agent library by @huggingface to make the LLM write code, do analysis and automate boring stuff!
smolagents is a barebones library to unlock both native and traditional tool calling for language models
LLMs can already write code and do reasoning, so why bother writing all the tool-calling glue yourself?
The CodeAgent class is here for that! see it in action below
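here's a minimal sketch, roughly the quickstart as I understand the library (the question is just an example):

```python
# rough sketch of a CodeAgent: the model writes and executes Python to answer the query
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=HfApiModel())
agent.run("How many seconds would it take a leopard at full speed to cross Pont des Arts?")
```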
Oct 25, 2024 • 4 tweets • 2 min read
Microsoft released a groundbreaking model that can be used for web automation, with MIT license 🔥👏
OmniParser is a state-of-the-art UI parsing/understanding model that outperforms GPT4V in parsing. 👏
Model:
An interesting highlight for me was the model's performance on Mind2Web (a benchmark for web navigation), which unlocks agentic behavior for RPA agents.
no need for hefty web automation pipelines that break when the website/app design changes! Amazing work. huggingface.co/microsoft/Omni…
Oct 10, 2024 • 6 tweets • 3 min read
this is the BEST vision language model I have ever tried!
Aria is a new model by @rhymes_ai_: a 25.3B multimodal model that can take image/video inputs 🤩
They released the model with an Apache-2.0 license and fine-tuning scripts as well 👏
I tested it extensively, keep reading to learn more 🧶
The model is open-sourced here: huggingface.co/rhymes-ai/Aria
The authors have released fine-tuning examples on RefCOCO, NextQA and NLVR, as well as inference examples: github.com/rhymes-ai/Aria
Try the demo here: rhymes.ai
It's super nice that you can get started with this model using @huggingface transformers 🤗
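roughly how loading looks, a sketch based on how the model card showed it at the time (double-check against the repo):

```python
# rough sketch: loading Aria with transformers + remote code
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria", device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True
)
processor = AutoProcessor.from_pretrained("rhymes-ai/Aria", trust_remote_code=True)
```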
Oct 7, 2024 • 6 tweets • 2 min read
I'm bullish on this foundation OCR model called GOT 📝 @eccvconf
This model can transcribe anything and it's Apache-2.0!
Keep reading to learn more 🧶
This model can take in screenshots of tables, LaTeX, music sheets, charts, literally anything, and output them as properly formatted text!
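usage is something like this, sketched from my memory of the ucaslcl/GOT-OCR2_0 model card (the image path is a placeholder):

```python
# rough sketch: plain OCR vs. formatted output with GOT
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ucaslcl/GOT-OCR2_0", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "ucaslcl/GOT-OCR2_0", trust_remote_code=True, use_safetensors=True,
    pad_token_id=tokenizer.eos_token_id
).eval().cuda()

plain_text = model.chat(tokenizer, "table_screenshot.png", ocr_type="ocr")    # plain transcription
formatted = model.chat(tokenizer, "table_screenshot.png", ocr_type="format")  # structured, formatted output
```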
I was giving a consultation on multimodal RAG to some external folks, and wanted to post it here too
stop using OCR + LLMs. if you want to retrieve, use ColPali, if you want RAG from docs, use vision language models
my tweet might come off a bit sensational, but I'm mostly talking about documents that aren't just plain text. leaving these here: a PDF parser will parse them incorrectly and slowly
it's more convoluted to build a demo for SAM with text-to-box prompting
one needs to find the first frame where the object of interest appears, then give it to SAM2 as a mask and let it propagate github.com/facebookresear…
one technical problem I had: for each text prompt, one needs to tune the threshold of the text -> box prediction for a given frame to make sure the right mask is produced
of course that's not a real blocker, but Gradio also doesn't have a component to pick frames like in Meta's demo
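for reference, the text -> box step looks roughly like this (using OWLv2 through the zero-shot detection pipeline as an example detector; frame path and prompt are placeholders):

```python
# rough sketch of the text -> box step; the resulting box would then be handed
# to SAM2 as the prompt on the first frame where the object appears
from transformers import pipeline

detector = pipeline("zero-shot-object-detection", model="google/owlv2-base-patch16-ensemble")
results = detector("frame_000.jpg", candidate_labels=["a red car"])  # placeholder frame + prompt

threshold = 0.3  # this is exactly the value that needs per-prompt tuning
boxes = [r["box"] for r in results if r["score"] > threshold]
```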
Jul 10, 2024 • 7 tweets • 3 min read
Forget any document retrievers, use ColPali 💥💥
Document retrieval is typically done through OCR + layout detection, but it's overkill and doesn't work well! 🤓
ColPali uses a vision language model, which is better in doc understanding 📑
keep reading ⤋
ColPali: (MIT license!)
Blog post:
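here's roughly how retrieval looks with colpali-engine, sketched from my understanding of the library README (checkpoint, pages and query are placeholders):

```python
# rough sketch: embed document pages as images, embed the query, score with late interaction
import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor

model = ColPali.from_pretrained("vidore/colpali-v1.2", torch_dtype=torch.bfloat16, device_map="auto").eval()
processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.2")

images = [Image.open("page_1.png"), Image.open("page_2.png")]  # placeholder page screenshots
queries = ["What was the revenue in 2023?"]                    # placeholder query

batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)  # higher = better match
```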
DocOwl 1.5 is the state-of-the-art document understanding model by Alibaba with Apache 2.0 license 😍📝
time to dive in and learn more 🧶
This model consists of a ViT-based visual encoder that takes in crops of the image along with the original image itself
The encoder outputs then go through a convolution-based model; after that, they are merged with the text and fed to the LLM
Jan 11, 2024 • 5 tweets • 3 min read
SigLIP just got merged to 🤗transformers and it's super easy to use!
To celebrate this, I have created a repository on various SigLIP based projects 🥳
But what is it and how does it work?
SigLIP is a vision-text pre-training technique based on contrastive learning. It jointly trains an image encoder and a text encoder such that the dot product of the embeddings is highest for matching image-text pairs
The image below is taken from CLIP, where this contrastive pre-training takes place with softmax, but SigLIP replaces softmax with sigmoid. 📎
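zero-shot classification with SigLIP in transformers looks like this (image and labels are placeholders):

```python
# rough sketch: SigLIP zero-shot classification; note the sigmoid instead of softmax
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("google/siglip-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

image = Image.open("cat.jpg")  # placeholder image
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

probs = torch.sigmoid(outputs.logits_per_image)  # independent probability per label
```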
Highlights✨
🖼️📝 Authors used a medium-sized B/16 ViT for the image encoder and a B-sized transformer for the text encoder
😍 More performant than CLIP on zero-shot
🗣️ Authors trained a multilingual model too!
⚡️ Super efficient: the sigmoid loss enables batch sizes of up to 1M items, but the authors chose 32k since performance saturates (see below)
Sep 29, 2023 • 11 tweets • 5 min read
There are many known “foundation models” for chat models, but what about computer vision? 🧐
In this thread, we'll talk about a few of them 👇
🖼️ Segment Anything Model
🦉 OWLViT
💬 BLIP-2
🐕 IDEFICS
🧩 CLIP
🦖 Grounding DINO
Let’s go! ✨
How does this work for vision? 👀
Foundation models for vision are powerful models that can be used out of the box or fine-tuned (on your use case, if needed) to solve many problems!
All the models in this thread are open-source, can be found on the Hugging Face Hub, and can be used with 🤗 transformers in only a few lines of code ♥️
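for example, zero-shot image classification with CLIP is literally this (image path and labels are placeholders):

```python
# rough sketch: CLIP zero-shot classification through the transformers pipeline
from transformers import pipeline

classifier = pipeline("zero-shot-image-classification", model="openai/clip-vit-base-patch32")
classifier("street.jpg", candidate_labels=["a dog", "a bicycle", "a traffic light"])
```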
Jun 1, 2023 • 4 tweets • 2 min read
AWS 🤝 Hugging Face for LLM Deployment 🤩
Hugging Face & AWS now offer LLM Containers to deploy any open-source LLM of your choice 🫡
See how @_philschmid deploys OpenAssistant pythia 12B model in this blog post 👉 huggingface.co/blog/sagemaker…
details in 🧶
There's a new open-source LLM every day, but how to deploy them is a question 🧐
📖 Having your LLM under your control enables your company to work in private, govern data and reduce costs
📖 Hugging Face has put tons of know-how about deploying these models into the LLM container!
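deployment boils down to something like this, a sketch based on my understanding of the blog post (checkpoint name, GPU count and instance type are placeholders to check against the post):

```python
# rough sketch: deploying an open-source LLM with the Hugging Face LLM container on SageMaker
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()
llm_image = get_huggingface_llm_image_uri("huggingface")  # resolves the LLM container image

model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env={
        "HF_MODEL_ID": "OpenAssistant/pythia-12b-sft-v8-7k-steps",  # placeholder checkpoint
        "SM_NUM_GPUS": "4",                                         # placeholder GPU count
    },
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.12xlarge")  # placeholder instance
predictor.predict({"inputs": "What is Hugging Face?"})
```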
Mar 31, 2023 • 5 tweets • 2 min read
We had many great submissions for our Keras DreamBooth sprint, wanted to share a few of them in this 🧶
Some folks fine-tuned on video games, Hogwarts Legacy and Return to Monkey Island (which I really like 😍)
Nov 3, 2022 • 4 tweets • 2 min read
PyCon SE started with the keynote of @julsimon 🤗
nothing says remote-first more than meeting your colleague for the first time at a conference 😏
Oct 3, 2022 • 8 tweets • 4 min read
a 🧵 on how to make the best of model repositories on 🤗Hub
this is how a model repository looks 👀
there's a model card, which is great for documenting & communicating your model 📖 if your model comes from a library with Hub integration, you get a widget that works out of the box 📦
also on bottom left you have links to Spaces that use your model ✨
Sep 20, 2022 • 6 tweets • 3 min read
New release of @huggingface transformers includes a new pipeline called Document Question Answering ❓📄
This is a pipeline you can use to extract information from PDFs! Let's take a closer look 👀
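in practice it's just a couple of lines (document image and question are placeholders; for models that need word boxes, the pipeline relies on an OCR engine like Tesseract under the hood):

```python
# rough sketch: document question answering with a LayoutLM-based checkpoint
from transformers import pipeline

dqa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")
dqa(image="invoice.png", question="What is the invoice total?")  # placeholder document + question
```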
There are various use cases you can implement with different types of models 😏
📄 Extract information from invoices
📄 Answer questions from contracts
📄 Extract information from any page! (including screenshots)
Aug 11, 2022 • 5 tweets • 2 min read
At @huggingface we're looking for ways to enable reproducibility, safety and transparency in open-source machine learning 🧑🏻🔬👩🏻💻 We're working on ways to improve the workflows of folks using sklearn models, and for this we've developed skops 📦💙
for the 0.1 release, the library includes two main utils (quick sketch after the list) 👇🏼
🃏 card: an API that lets you programmatically create interactive model cards
🛠 hub_utils: tools you need to host your model on @huggingface Hub 🤗
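here's a tiny sketch of the card API, based on the skops docs as I understand them (the exact 0.1 interface may differ; the model and section contents are placeholders):

```python
# rough sketch: programmatically building a model card for a sklearn model with skops
from sklearn.linear_model import LogisticRegression
from skops import card

clf = LogisticRegression().fit([[0.0], [1.0]], [0, 1])  # toy placeholder model

model_card = card.Card(clf)
model_card.add(**{"Model description": "A toy logistic regression, documented programmatically."})
model_card.save("README.md")  # this card renders on the Hub repo page
```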