Introducing jina-reranker-m0: our new multilingual multimodal reranker for ranking visual documents. It accepts a query alongside a collection of visually rich document images, including pages with text, figures, tables, infographics, and various layouts, across multiple domains and over 29 languages.
Unlike jina-reranker-v2-base-multilingual, jina-reranker-m0 moves from the classic cross-encoder architecture to a decoder-only vision language model. It keeps Qwen2-VL's pretrained vision encoder and projector, finetunes the LLM part with LoRA, and post-trains an MLP head to generate ranking logits that measure query-document relevance. The result is a discriminative model optimized for ranking tasks.
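Below is a minimal sketch of calling jina-reranker-m0 through Jina's hosted rerank endpoint with a mix of text and image documents. The multimodal payload layout (in particular the "image" field name) is an assumption here; check the official API docs for the authoritative schema.

```python
# A minimal sketch, assuming Jina's hosted rerank endpoint and an API key in
# the JINA_API_KEY environment variable. The way image documents are passed
# (the "image" field) is an assumption -- verify against the official API docs.
import os
import requests

resp = requests.post(
    "https://api.jina.ai/v1/rerank",
    headers={"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"},
    json={
        "model": "jina-reranker-m0",
        "query": "What was the Q3 revenue growth?",
        "documents": [
            {"text": "Revenue grew 12% quarter over quarter, driven by ..."},
            {"image": "https://example.com/q3-earnings-slide.png"},  # assumed field name
        ],
        "top_n": 2,
    },
    timeout=30,
)
for hit in resp.json()["results"]:
    print(hit["index"], hit["relevance_score"])
```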
Jan 15 • 6 tweets • 3 min read
Dear Readers, you'll ❤️ this: Introducing ReaderLM-v2, a 1.5B small language model for HTML-to-Markdown conversion and HTML-to-JSON extraction with exceptional quality. Thanks to a new training paradigm and higher-quality training data, ReaderLM-v2 is a significant leap forward from its predecessor, particularly in handling long context and Markdown syntax. While the first generation approached HTML-to-Markdown conversion as a "selective-copy" task, v2 treats it as a true translation process. This enables the model to masterfully leverage Markdown syntax, excelling at generating complex elements like code fences, nested lists, tables, and LaTeX equations. You can use ReaderLM-v2 today via the Reader API, Hugging Face, AWS SageMaker, etc.
Here's an example of HTML-to-Markdown results on the HackerNews front page across ReaderLM v2, ReaderLM v1, Claude 3.5 Sonnet, and Gemini 2.0 Flash; it shows ReaderLM v2's distinctive style and performance. ReaderLM v2 excels at preserving comprehensive information from the raw HTML, including the original HackerNews links, while smartly structuring the content using Markdown syntax. The model uses nested lists to organize local elements (points, timestamps, and comments) while maintaining consistent global formatting through a proper heading hierarchy (h1 and h2 tags).
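Here's a minimal sketch of running ReaderLM-v2 locally with Hugging Face transformers. The plain chat-style instruction in the prompt is an assumption; consult the model card for the exact prompt format the model was trained on.

```python
# A minimal sketch, assuming the jinaai/ReaderLM-v2 checkpoint on Hugging Face
# and a plain chat-style instruction; the exact prompt format the model expects
# may differ -- see the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jinaai/ReaderLM-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

html = "<html><body><h1>Hello</h1><p>World <b>readers</b>!</p></body></html>"
messages = [{"role": "user",
             "content": f"Convert the following HTML to Markdown:\n\n{html}"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512)
# Strip the prompt tokens and print only the generated Markdown.
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```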
Sep 11, 2024 • 5 tweets • 4 min read
Announcing reader-lm-0.5b and reader-lm-1.5b, two Small Language Models (SLMs) inspired by Jina Reader and specifically trained to generate clean Markdown directly from noisy raw HTML. Both models are multilingual and support a context length of up to 256K tokens. Despite their compact size, these models achieve state-of-the-art performance on this HTML-to-Markdown task, outperforming larger LLM counterparts while being only 1/50th of their size. jina.ai/news/reader-lm…
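For context, here's how the hosted Jina Reader that inspired these models is used: prefix any URL with https://r.jina.ai/ and you get back LLM-friendly Markdown. A minimal sketch:

```python
# A minimal sketch: the hosted Jina Reader returns clean, LLM-friendly Markdown
# for any URL you prefix with https://r.jina.ai/
import requests

markdown = requests.get("https://r.jina.ai/https://news.ycombinator.com", timeout=30).text
print(markdown[:500])  # first few hundred characters of the converted page
```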
Using LLMs for data cleaning might seem excessive given their cost and slowness. But what if we consider an SLM, one with <1B parameters that can run efficiently on the edge? Unfortunately, according to scaling laws, fewer parameters generally mean reduced reasoning and summarizing capabilities, so an SLM might even struggle to generate any meaningful content if its parameter count is too small. 😢 But let's take a closer look at this HTML-to-Markdown task:
- First, the task we’re considering isn’t as creative or complex as typical LLM tasks. In the case of converting HTML to markdown, the model primarily needs to selectively copy from the input to the output (i.e., skipping over HTML markup, sidebars, headers, footers), with minimal effort spent on generating new content (mostly inserting markdown syntax). This contrasts sharply with the broader tasks LLMs handle, such as generating poems or writing code, where the output involves much more creativity and is not a direct copy-paste from the input. This observation suggests that an SLM might work, as the task seems simpler than more general text generation.
- Second, we need to prioritize long-context support. Modern HTML often contains much more noise than simple markup: inline CSS and scripts can easily balloon the code to hundreds of thousands of tokens. For an SLM to be practical in this scenario, its context length must be sufficiently large; token limits like 8K or 16K may not be useful at all.
What we need seems to be a shallow-but-wide SLM: "shallow" in the sense that the task is primarily "copy-paste" and therefore needs fewer transformer blocks, and "wide" in the sense that it needs long-context support to be practical, so the attention mechanism has to be carefully designed. Previous research has shown that context length and reasoning ability are closely intertwined; for an SLM, it's extremely challenging to optimize both dimensions while keeping the parameter count small.
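To make the long-context point concrete, here's a rough sketch that counts how many tokens a real page costs once all its markup, inline CSS, and scripts are included. The tokenizer id is assumed to be the public jinaai/reader-lm-1.5b checkpoint.

```python
# A rough sketch, assuming the public jinaai/reader-lm-1.5b tokenizer: raw HTML
# with inline CSS and scripts often weighs in at tens or hundreds of thousands
# of tokens, which is why an 8K or 16K context window isn't enough.
import requests
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jinaai/reader-lm-1.5b")
html = requests.get("https://news.ycombinator.com", timeout=30).text
print(f"raw HTML tokens: {len(tokenizer(html)['input_ids'])}")
```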
May 14, 2024 • 8 tweets • 6 min read
Grounding is absolutely essential for GenAI applications. Today we added search grounding to Reader: now you can simply write a query as https://s.jina.ai/When+will+the+next+SpaceX+launch+be and it will return the top-5 search results from the web, each with LLM-friendly text and a URL pointing to the source. With this, devs can easily incorporate the latest world knowledge into their LLMs, which is one step closer to improving the factuality of LLMs and making responses more trustworthy and helpful. 🧵
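A minimal sketch of that flow, exactly as described above: the query goes straight into the s.jina.ai URL path and the response comes back as LLM-friendly text with source URLs.

```python
# A minimal sketch of the search-grounding call described above: the query is
# encoded into the s.jina.ai path and the top results come back as
# LLM-friendly text, each with its source URL.
import requests

query = "When will the next SpaceX launch be"
resp = requests.get("https://s.jina.ai/" + query.replace(" ", "+"), timeout=30)
print(resp.text)
```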
Not familiar with search grounding? Allow me to explain a bit. We all know LLMs can make things up and harm user trust: they may say things that aren't factual (aka hallucinate), especially on topics they didn't learn about during training. That could be new information created after training, or niche knowledge that was "marginalized" during training.
Here's an example of niche knowledge being "marginalized" during training: when we asked GPT-3.5-turbo "When was Jina AI founded?", we got an incorrect answer (yeah, we ain't that famous 🤷). However, when using Reader for search grounding, the same LLM gave the correct answer, precise to the exact date: Feb. 1st, 2020. (Now you know 📷)