Create an agent using the LLM of your choice (OpenAssistant, StarCoder, OpenAI, ...) and start talking to transformers and diffusers
It responds to complex queries and offers a chat mode. Create images from your words, have the agent read website summaries out loud, or let it read through a PDF
How does it work in practice?
It's straightforward prompt-building:
• Tell the agent what it is meant to do
• Give it tools
• Show examples
• Give it a task
The agent uses chain-of-thought reasoning to work out its task, then outputs Python code that calls the tools.
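Here's a minimal sketch of those steps in code, following the documented HfAgent API (the StarCoder endpoint URL and prompts come from the docs; any supported LLM works):

```python
from transformers import HfAgent

# Point the agent at a Hub-hosted LLM inference endpoint.
agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder")

# Run mode: the agent builds the prompt, picks tools, and writes + executes Python code.
picture = agent.run("Draw me a picture of rivers and lakes.")

# Chat mode: keeps state across turns.
agent.chat("Transform the picture so that there is a rock in it.")
```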
It comes with built-in tools:
• Document QA
• Speech-to-text and Text-to-speech
• Text {classification, summarization, translation, download, QA}
• Image {generation, transforms, captioning, segmentation, upscaling, QA}
• Text to video
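Each built-in tool can also be loaded and used on its own; a quick sketch, assuming the text-to-speech tool name from the docs:

```python
from transformers import load_tool

# Built-in tools are usable outside the agent too.
tts = load_tool("text-to-speech")
audio = tts("Transformers Agents can speak!")
```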
It is EXTENSIBLE by design.
Tools are elementary: a name, a description, a function.
Designing a tool and pushing it to the Hub can be done in a few lines of code.
The toolkit of the agent serves as a base: extend it with your tools, or with other community-contributed tools:
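As a sketch of how small a tool is, here's a toy tool following the custom-tool guide (the tool itself and the Hub repo name are illustrative):

```python
from huggingface_hub import list_models
from transformers import Tool


class ModelDownloadsTool(Tool):
    # A tool is just a name, a description, and a function (__call__).
    name = "model_download_counter"
    description = (
        "Returns the name of the most downloaded model for a given task "
        "on the Hugging Face Hub. It takes the task name as input."
    )
    inputs = ["text"]
    outputs = ["text"]

    def __call__(self, task: str):
        # Pick the most-downloaded checkpoint for the task.
        model = next(iter(list_models(filter=task, sort="downloads", direction=-1)))
        return model.id


# Share it with the community in one line ("my-username" is a placeholder).
ModelDownloadsTool().push_to_hub("my-username/hf-model-downloads")
```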
SAM, the groundbreaking segmentation model from @Meta, is now available in 🤗 Transformers!
What does this mean?
1. One line of code to load it, one line to run it
2. Efficient batching support to generate multiple masks
3. Pipeline support for easier usage
More details: 🧵
You can first read more about the model, and learn how to use it on our documentation page: huggingface.co/docs/transform…
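A minimal load-and-run sketch, with the checkpoint and point prompt taken from the documented example:

```python
import requests
import torch
from PIL import Image
from transformers import SamModel, SamProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
# One line to load the model, one line for its processor.
model = SamModel.from_pretrained("facebook/sam-vit-huge").to(device)
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")

img_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
input_points = [[[450, 600]]]  # a 2D point prompt on the image

inputs = processor(raw_image, input_points=input_points, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

# Post-process the predicted masks back to the original image size.
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)
```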
Let's check all the features we support below!
Automatic mask generation pipeline!
With one line of code, automatically predict the segmentation masks of a given image (similar to the examples above)
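A sketch of that one-liner (the image path is a placeholder; device=0 assumes a GPU):

```python
from transformers import pipeline

# Build the automatic mask generator once...
generator = pipeline("mask-generation", model="facebook/sam-vit-huge", device=0)
# ...then one line predicts masks over the whole image.
outputs = generator("path/to/image.png", points_per_batch=64)
masks = outputs["masks"]
```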
It's been an exciting year for 🤗Transformers. We tripled the number of weekly active users over 2022, with over 1M users most weeks now and 300k daily pip installs on average🤯
We doubled the number of architectures (89 to 167🤯) with new models in audio🔊, text📚, vision🖼️, multiple modalities, or even time series📈 and protein folding🧬
Here are a few highlights in the most used of those new models👇
Swin Transformer is a vision model from @MSFTResearch added back in January, which can be used as a backbone for a variety of tasks such as image classification, object detection, or semantic segmentation.
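A minimal image-classification sketch with a Swin checkpoint from the Hub (the image path is a placeholder):

```python
from transformers import pipeline

classifier = pipeline("image-classification", model="microsoft/swin-tiny-patch4-window7-224")
print(classifier("path/to/image.jpg")[0])  # top label and score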
[THREAD] Following the public release of Spaces, here is a showcase of a few we like. Let’s start with this surprising Draw-to-Search demo by @osanseviero, powered by CLIP. huggingface.co/spaces/osansev…
What a time to be alive! You can finally decode your doctor's prescription with this cool OCR demo by @NielsRogge - using the Microsoft TrOCR encoder-decoder model. huggingface.co/spaces/nielsr/…
Part 1 of the course focused on text classification; Part 2 will focus on all the other common NLP tasks. @mervenoyann has made videos to introduce you to each of them!
Let's start with Token Classification (giving a label to some/each word in a sentence):
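A quick sketch with a community NER checkpoint (the model name and example sentence are illustrative):

```python
from transformers import pipeline

# Label each entity word in the sentence (NER is a token-classification task).
token_classifier = pipeline("token-classification", model="dslim/bert-base-NER")
print(token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn."))
```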
Then there is Question Answering: finding the answer to a question in some context.
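A sketch with a SQuAD-tuned checkpoint (the model name and inputs are illustrative):

```python
from transformers import pipeline

# Extractive QA: the answer is a span of the provided context.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
print(qa(question="Where do I work?", context="My name is Sylvain and I work at Hugging Face."))
```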
Next is Causal Language Modeling: guessing the next word in a sentence. This is how GPT-2 and its descendants were pretrained.
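A sketch with GPT-2 itself (the prompt and generation length are illustrative):

```python
from transformers import pipeline

# Causal LM: predict the next tokens given a prefix, GPT-2's pretraining objective.
generator = pipeline("text-generation", model="gpt2")
print(generator("In this course, we will teach you how to", max_new_tokens=20)[0]["generated_text"])
```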