Jina AI
Apr 8 · 5 tweets · 3 min read
Introducing jina-reranker-m0: our new multilingual, multimodal reranker model for ranking visual documents across multiple languages. It accepts a query alongside a collection of visually rich document images, including pages with text, figures, tables, infographics, and varied layouts, spanning multiple domains and more than 29 languages.
Unlike jina-reranker-v2-base-multilingual, jina-reranker-m0 moves from the classic cross-encoder architecture to a decoder-only vision-language model. It reuses Qwen2-VL's pretrained vision encoder and projector, fine-tunes the LLM part with LoRA, and post-trains an MLP to generate ranking logits that measure query-document relevance. The result is a discriminative model optimized for ranking tasks.
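To make the reranking setup concrete, here is a minimal sketch of assembling a query plus mixed text/image candidates for a reranker like jina-reranker-m0. The field names and payload shape follow common rerank-API conventions and are illustrative assumptions, not the official schema; check the API docs for the exact format.

```python
# Sketch of a rerank request payload; field names are assumptions
# based on typical rerank-API conventions, not the official schema.
def build_rerank_payload(query, documents, top_n=3):
    """Assemble a query plus mixed text/image candidates for reranking."""
    return {
        "model": "jina-reranker-m0",
        "query": query,
        # each candidate is either text or an image URL; mixing both in one
        # list is exactly the unified-modality case the model targets
        "documents": documents,
        "top_n": top_n,
    }

payload = build_rerank_payload(
    "quarterly revenue figures",
    [
        {"text": "Revenue grew 12% quarter over quarter."},
        {"image": "https://example.com/financial-table.png"},  # hypothetical URL
    ],
)
```

The reranker would score each candidate against the query and return them sorted by relevance, regardless of modality.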
This architecture also effectively closes the modality gap that plagued earlier models like jina-clip-v1 and jina-clip-v2. Previously, images would cluster near other images and text near other text in the representation space, creating a disconnect: when your candidate documents contained both images and text, retrieving images using text queries was unreliable. With jina-reranker-m0, you can rank images and documents together without worrying about this gap, for a truly unified multimodal search experience.
jina-reranker-m0 is not only SOTA on the ViDoRe, MBEIR, and Winoground visual retrieval benchmarks, but also on text-only benchmarks such as BEIR, MIRACL, MLDR, MKQA, and CodeIR. Yes, jina-reranker-m0 is heavily optimized for code search.
Check out the post to learn more about the benchmarks, the Hugging Face model page, and the API. jina-reranker-m0 is our first attempt to unify textual and visual modalities in a decoder-only model, and it opens up many possibilities that weren't achievable with encoder-only rerankers. Try m0 and let us know what you think. jina.ai/news/jina-rera…

More from @JinaAI_

Jan 15
Dear Readers, you'll ❤️ this: Introducing ReaderLM-v2, a 1.5B small language model for HTML-to-Markdown conversion and HTML-to-JSON extraction with exceptional quality. Thanks to a new training paradigm and higher-quality training data, ReaderLM-v2 is a significant leap forward from its predecessor, particularly in handling long contexts and markdown syntax. While the first generation approached HTML-to-markdown conversion as a "selective-copy" task, v2 treats it as a true translation process. This enables the model to masterfully leverage markdown syntax, excelling at generating complex elements like code fences, nested lists, tables, and LaTeX equations. You can use ReaderLM-v2 today via the Reader API, Hugging Face, AWS SageMaker, and more.
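As a rough illustration of invoking an HTML-to-markdown model like ReaderLM-v2 through a chat-style interface, the sketch below builds the message list you would feed to a tokenizer's chat template. The instruction wording and message format are illustrative assumptions, not the model's official prompt template.

```python
# Hypothetical prompt construction for an HTML-to-markdown model;
# the instruction text is an assumption, not the official template.
def build_messages(html: str) -> list:
    instruction = (
        "Extract the main content from the given HTML "
        "and convert it to Markdown format."
    )
    return [{"role": "user", "content": f"{instruction}\n\n{html}"}]

messages = build_messages("<h1>Hello</h1><p>World</p>")
# pass `messages` to tokenizer.apply_chat_template(...) then model.generate(...)
```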
Here's an example comparing HTML-to-markdown results on the HackerNews front page across ReaderLM v2, v1, Claude 3.5 Sonnet, and Gemini 2.0 Flash; it shows ReaderLM v2's distinctive style and performance. ReaderLM v2 excels at preserving comprehensive information from the raw HTML, including original HackerNews links, while smartly structuring the content with markdown syntax. The model uses nested lists to organize local elements (points, timestamps, and comments) while maintaining consistent global formatting through a proper heading hierarchy (h1 and h2 tags).
A major issue with ReaderLM v1 was degeneration, particularly repetition and looping after generating long sequences. ReaderLM-v2 greatly alleviates this by adding a contrastive loss during training: its performance remains consistent regardless of context length or the number of tokens already generated. We tested ReaderLM-v2 by converting our legal page to markdown, a page approximately 20x longer than the HackerNews front page, with a large table near the end. Despite the challenge, ReaderLM-v2 generated the complete table in markdown while maintaining a consistent document structure throughout, preserving both heading hierarchy and list formatting even after the table. This level of performance was unattainable with the previous-generation reader-lm-1.5b, which would degenerate after long sequences.
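Degeneration of the kind described above usually shows up as a short token sequence repeating at the tail of the output. A minimal check (our own sketch, not Jina's evaluation code) that flags such looping:

```python
# Detect a repeating tail: returns True if the last tokens cycle with
# some short period, which is the signature of degeneration/looping.
def tail_is_looping(tokens, max_period=8, min_repeats=3):
    for period in range(1, max_period + 1):
        window = tokens[-period * min_repeats:]
        if len(window) < period * min_repeats:
            continue  # output too short to contain min_repeats cycles
        chunks = [tuple(window[i:i + period]) for i in range(0, len(window), period)]
        if len(set(chunks)) == 1:  # all cycles identical -> looping
            return True
    return False

tail_is_looping(list("abcabcabcabc"))       # period-3 loop -> True
tail_is_looping(list("the quick brown fox"))  # no short loop -> False
```

A check like this can gate generation at inference time, while the contrastive loss mentioned above addresses the problem during training.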
Sep 11, 2024
Announcing reader-lm-0.5b and reader-lm-1.5b, two Small Language Models (SLMs) inspired by Jina Reader, and specifically trained to generate clean markdown directly from noisy raw HTML. Both models are multilingual and support a context length of up to 256K tokens. Despite their compact size, these models achieve state-of-the-art performance on this HTML2Markdown task, outperforming larger LLM counterparts while being only 1/50th of their size.jina.ai/news/reader-lm…
Using LLMs for data cleaning might seem excessive given their poor cost-efficiency and speed. But what if we consider an SLM, one with <1B parameters that can run efficiently on the edge? Unfortunately, according to scaling laws, fewer parameters generally mean reduced reasoning and summarizing capabilities; an SLM might struggle to generate any meaningful content at all if its parameter count is too small. 😢 But let's take a closer look at the HTML-to-Markdown task:

- First, the task we’re considering isn’t as creative or complex as typical LLM tasks. In the case of converting HTML to markdown, the model primarily needs to selectively copy from the input to the output (i.e., skipping over HTML markup, sidebars, headers, footers), with minimal effort spent on generating new content (mostly inserting markdown syntax). This contrasts sharply with the broader tasks LLMs handle, such as generating poems or writing code, where the output involves much more creativity and is not a direct copy-paste from the input. This observation suggests that an SLM might work, as the task seems simpler than more general text generation.
- Second, we need to prioritize long-context support. Modern HTML often contains much more noise than simple markup: inline CSS and scripts can easily balloon the code to hundreds of thousands of tokens. For an SLM to be practical in this scenario, the context length must be sufficiently large; token limits like 8K or 16K may not be useful at all.
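To see how much of that noise is scripts and styles rather than content, here is a rough sketch (not Jina Reader's actual pipeline) that strips `<script>`/`<style>` blocks and remaining tags, so the raw and cleaned lengths can be compared as a proxy for token counts:

```python
import re

# Crude HTML noise stripper: remove script/style blocks, then all tags.
# Illustrative only; real pipelines use a proper HTML parser.
def strip_noise(html: str) -> str:
    html = re.sub(r"<(script|style)\b.*?</\1>", "", html, flags=re.S | re.I)
    return re.sub(r"<[^>]+>", " ", html)

page = (
    "<html><style>body{color:red}</style>"
    "<p>Hello</p><script>var x=1;</script></html>"
)
clean = strip_noise(page)
# the cleaned text retains "Hello" but drops the CSS and JavaScript
```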

What we need seems to be a shallow-but-wide SLM: "shallow" in the sense that the task is primarily "copy-paste" and therefore needs fewer transformer blocks, and "wide" in the sense that it requires long-context support to be practical, so the attention mechanism needs to be carefully designed. Previous research has shown that context length and reasoning ability are closely intertwined, and for an SLM it's extremely challenging to optimize both dimensions while keeping the parameter size small.
To quantitatively evaluate Reader-LM's performance, we compared it to several LLMs: GPT-4o, Gemini-1.5-Flash, Gemini-1.5-Pro, LLaMA-3.1-70B, and Qwen2-7B-Instruct. The models were assessed using ROUGE-L, Token Error Rate, and Word Error Rate.
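Word Error Rate, one of the metrics above, is the word-level Levenshtein distance normalized by reference length. A standard textbook implementation (our sketch, not Jina's exact evaluation script):

```python
# Word Error Rate: word-level edit distance / reference length.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution/match
    return d[-1][-1] / max(len(ref), 1)

word_error_rate("convert html to markdown", "convert html into markdown")
# one substitution over four reference words -> 0.25
```

Token Error Rate is the same computation over model tokens instead of whitespace-split words.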

Reader-LM-1.5B consistently performs well across all dimensions, particularly excelling in structure preservation and markdown syntax usage. While it doesn't always outperform the Jina Reader API, its performance is competitive with larger models like Gemini 1.5 Pro, making it a highly efficient alternative to larger LLMs. Reader-LM-0.5B, though smaller, still offers solid performance, particularly in structure preservation.
May 14, 2024
Grounding is absolutely essential for GenAI applications. Today we added new search grounding to the Reader. Now you can simply write a query as https://s.jina.ai/When+will+the+next+SpaceX+launch+be and it will return the top-5 search results from the web, each with LLM-friendly text and a URL pointing to the source. With this, devs can easily incorporate the latest world knowledge into their LLMs, one step closer to improving the factuality of LLMs and making responses more trustworthy and helpful. 🧵
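Building such a grounding URL from a natural-language question is just URL-encoding the query onto the s.jina.ai host, as in the example above; the helper function itself is our own sketch:

```python
from urllib.parse import quote_plus

# Turn a natural-language question into an s.jina.ai grounding URL.
# The host comes from the post; the helper is illustrative.
def grounding_url(query: str) -> str:
    return "https://s.jina.ai/" + quote_plus(query)

grounding_url("When will the next SpaceX launch be")
# -> "https://s.jina.ai/When+will+the+next+SpaceX+launch+be"
```

`quote_plus` turns spaces into `+` and percent-encodes anything else unsafe, so arbitrary questions produce valid URLs.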
Not familiar with search grounding? Allow me to explain. We all know LLMs can make things up and harm user trust: they may say things that are not factual (aka hallucinate), especially regarding topics they didn't learn about during training. This could be either new information created since training or niche knowledge that was "marginalized" during training.

Here is an example of niche knowledge being "marginalized" during training: we asked GPT-3.5-turbo "When was Jina AI founded?" and received an incorrect answer (yeah, we ain't that famous 🤷). However, when using Reader for search grounding, the same LLM provided the correct answer, precise to the exact date: Feb 1st, 2020. (Now you know.)
Here is another example, this time of new information created since training. We asked GPT-3.5-turbo "When will the next SpaceX launch be?" (today is May 14th, 2024), and the model responded with outdated information from 2021.

In summary, when it comes to questions like "What's the weather today?" or "Who won the Oscar for Best Actress this year?", the model will either respond with "I don't know" or give you outdated information. That's where search grounding becomes useful.
