Jina AI
Apr 8 · 5 tweets · 3 min read
Introducing jina-reranker-m0: our new multilingual, multimodal reranker model for ranking visual documents across multiple languages. It accepts a query alongside a collection of visually rich document images, including pages with text, figures, tables, infographics, and varied layouts, spanning multiple domains and more than 29 languages.
Unlike jina-reranker-v2-base-multilingual, jina-reranker-m0 moves from the classic cross-encoder architecture to a decoder-only vision-language model. It reuses Qwen2-VL's pretrained vision encoder and projector, fine-tunes the LLM part with LoRA, and post-trains an MLP to generate ranking logits that measure query-document relevance. The result is a discriminative model optimized for ranking tasks.
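To make the reranking setup concrete, here is a minimal sketch of assembling a query plus mixed text/image candidates for a reranker like jina-reranker-m0. The field names and payload shape follow common rerank-API conventions and are illustrative assumptions, not the official schema; check the API docs for the exact format.

```python
# Sketch of a rerank request payload; field names are assumptions
# based on typical rerank-API conventions, not the official schema.
def build_rerank_payload(query, documents, top_n=3):
    """Assemble a query plus mixed text/image candidates for reranking."""
    return {
        "model": "jina-reranker-m0",
        "query": query,
        # each candidate is either text or an image URL; mixing both in one
        # list is exactly the unified-modality case the model targets
        "documents": documents,
        "top_n": top_n,
    }

payload = build_rerank_payload(
    "quarterly revenue figures",
    [
        {"text": "Revenue grew 12% quarter over quarter."},
        {"image": "https://example.com/financial-table.png"},  # hypothetical URL
    ],
)
```

The reranker would score each candidate against the query and return them sorted by relevance, regardless of modality.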
This architecture also effectively closes the modality gap that plagued earlier models like jina-clip-v1 and jina-clip-v2. Previously, images would cluster near other images and text near other text in the representation space, creating a disconnect: when your candidate documents contained both images and text, retrieving images using text queries was unreliable. With jina-reranker-m0, you can rank images and documents together without worrying about this gap, for a truly unified multimodal search experience.
jina-reranker-m0 is not only SOTA on the ViDoRe, MBEIR, and Winoground visual retrieval benchmarks, but also on text-only benchmarks such as BEIR, MIRACL, MLDR, MKQA, and CodeIR. Yes, jina-reranker-m0 is heavily optimized for code search.
Check out the post to learn more about the benchmarks, the Hugging Face model page, and the API. jina-reranker-m0 is our first attempt to unify textual and visual modalities in a decoder-only model, and it opens up many possibilities that weren't achievable with encoder-only rerankers. Try m0 and let us know what you think. jina.ai/news/jina-rera…

More from @JinaAI_

Jan 15
Dear Readers, you'll ❤️ this: Introducing ReaderLM-v2, a 1.5B small language model for HTML-to-Markdown conversion and HTML-to-JSON extraction with exceptional quality. Thanks to a new training paradigm and higher-quality training data, ReaderLM-v2 is a significant leap forward from its predecessor, particularly in handling long contexts and markdown syntax. While the first generation approached HTML-to-markdown conversion as a "selective-copy" task, v2 treats it as a true translation process. This enables the model to masterfully leverage markdown syntax, excelling at generating complex elements like code fences, nested lists, tables, and LaTeX equations. You can use ReaderLM-v2 today via the Reader API, Hugging Face, AWS SageMaker, and more.
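As a rough illustration of invoking an HTML-to-markdown model like ReaderLM-v2 through a chat-style interface, the sketch below builds the message list you would feed to a tokenizer's chat template. The instruction wording and message format are illustrative assumptions, not the model's official prompt template.

```python
# Hypothetical prompt construction for an HTML-to-markdown model;
# the instruction text is an assumption, not the official template.
def build_messages(html: str) -> list:
    instruction = (
        "Extract the main content from the given HTML "
        "and convert it to Markdown format."
    )
    return [{"role": "user", "content": f"{instruction}\n\n{html}"}]

messages = build_messages("<h1>Hello</h1><p>World</p>")
# pass `messages` to tokenizer.apply_chat_template(...) then model.generate(...)
```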
Here's an example comparing HTML-to-markdown results on the HackerNews front page across ReaderLM v2, v1, Claude 3.5 Sonnet, and Gemini 2.0 Flash; it shows ReaderLM v2's distinctive style and performance. ReaderLM v2 excels at preserving comprehensive information from the raw HTML, including original HackerNews links, while smartly structuring the content with markdown syntax. The model uses nested lists to organize local elements (points, timestamps, and comments) while maintaining consistent global formatting through a proper heading hierarchy (h1 and h2 tags).
A major issue with ReaderLM v1 was degeneration, particularly repetition and looping after generating long sequences. ReaderLM-v2 greatly alleviates this by adding a contrastive loss during training: its performance remains consistent regardless of context length or the number of tokens already generated. We tested ReaderLM-v2 by converting our legal page to markdown, a page approximately 20x longer than the HackerNews front page, with a large table near the end. Despite the challenge, ReaderLM-v2 generated the complete table in markdown while maintaining a consistent document structure throughout, preserving both heading hierarchy and list formatting even after the table. This level of performance was unattainable with the previous-generation reader-lm-1.5b, which would degenerate after long sequences.
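Degeneration of the kind described above usually shows up as a short token sequence repeating at the tail of the output. A minimal check (our own sketch, not Jina's evaluation code) that flags such looping:

```python
# Detect a repeating tail: returns True if the last tokens cycle with
# some short period, which is the signature of degeneration/looping.
def tail_is_looping(tokens, max_period=8, min_repeats=3):
    for period in range(1, max_period + 1):
        window = tokens[-period * min_repeats:]
        if len(window) < period * min_repeats:
            continue  # output too short to contain min_repeats cycles
        chunks = [tuple(window[i:i + period]) for i in range(0, len(window), period)]
        if len(set(chunks)) == 1:  # all cycles identical -> looping
            return True
    return False

tail_is_looping(list("abcabcabcabc"))       # period-3 loop -> True
tail_is_looping(list("the quick brown fox"))  # no short loop -> False
```

A check like this can gate generation at inference time, while the contrastive loss mentioned above addresses the problem during training.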
Sep 11, 2024
Announcing reader-lm-0.5b and reader-lm-1.5b, two Small Language Models (SLMs) inspired by Jina Reader, and specifically trained to generate clean markdown directly from noisy raw HTML. Both models are multilingual and support a context length of up to 256K tokens. Despite their compact size, these models achieve state-of-the-art performance on this HTML2Markdown task, outperforming larger LLM counterparts while being only 1/50th of their size.jina.ai/news/reader-lm…
Using LLMs for data cleaning might seem excessive given their poor cost-efficiency and speed. But what if we consider an SLM, one with <1B parameters that can run efficiently on the edge? Unfortunately, according to scaling laws, fewer parameters generally mean reduced reasoning and summarizing capabilities; an SLM might struggle to generate any meaningful content at all if its parameter count is too small. 😢 But let's take a closer look at the HTML-to-Markdown task:

- First, the task we’re considering isn’t as creative or complex as typical LLM tasks. In the case of converting HTML to markdown, the model primarily needs to selectively copy from the input to the output (i.e., skipping over HTML markup, sidebars, headers, footers), with minimal effort spent on generating new content (mostly inserting markdown syntax). This contrasts sharply with the broader tasks LLMs handle, such as generating poems or writing code, where the output involves much more creativity and is not a direct copy-paste from the input. This observation suggests that an SLM might work, as the task seems simpler than more general text generation.
- Second, we need to prioritize long-context support. Modern HTML often contains much more noise than simple markup: inline CSS and scripts can easily balloon the code to hundreds of thousands of tokens. For an SLM to be practical in this scenario, the context length must be sufficiently large; token limits like 8K or 16K may not be useful at all.
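To see how much of that noise is scripts and styles rather than content, here is a rough sketch (not Jina Reader's actual pipeline) that strips `<script>`/`<style>` blocks and remaining tags, so the raw and cleaned lengths can be compared as a proxy for token counts:

```python
import re

# Crude HTML noise stripper: remove script/style blocks, then all tags.
# Illustrative only; real pipelines use a proper HTML parser.
def strip_noise(html: str) -> str:
    html = re.sub(r"<(script|style)\b.*?</\1>", "", html, flags=re.S | re.I)
    return re.sub(r"<[^>]+>", " ", html)

page = (
    "<html><style>body{color:red}</style>"
    "<p>Hello</p><script>var x=1;</script></html>"
)
clean = strip_noise(page)
# the cleaned text retains "Hello" but drops the CSS and JavaScript
```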

What we need seems to be a shallow-but-wide SLM: "shallow" in the sense that the task is primarily "copy-paste" and therefore needs fewer transformer blocks, and "wide" in the sense that it requires long-context support to be practical, so the attention mechanism needs to be carefully designed. Previous research has shown that context length and reasoning ability are closely intertwined, and for an SLM it's extremely challenging to optimize both dimensions while keeping the parameter size small.
To quantitatively evaluate Reader-LM's performance, we compared it to several LLMs: GPT-4o, Gemini-1.5-Flash, Gemini-1.5-Pro, LLaMA-3.1-70B, and Qwen2-7B-Instruct. The models were assessed using ROUGE-L, Token Error Rate, and Word Error Rate.
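Word Error Rate, one of the metrics above, is the word-level Levenshtein distance normalized by reference length. A standard textbook implementation (our sketch, not Jina's exact evaluation script):

```python
# Word Error Rate: word-level edit distance / reference length.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution/match
    return d[-1][-1] / max(len(ref), 1)

word_error_rate("convert html to markdown", "convert html into markdown")
# one substitution over four reference words -> 0.25
```

Token Error Rate is the same computation over model tokens instead of whitespace-split words.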

Reader-LM-1.5B consistently performs well across all dimensions, particularly excelling in structure preservation and markdown syntax usage. While it doesn't always outperform the Jina Reader API, its performance is competitive with larger models like Gemini 1.5 Pro, making it a highly efficient alternative to larger LLMs. Reader-LM-0.5B, though smaller, still offers solid performance, particularly in structure preservation.
May 14, 2024
Grounding is absolutely essential for GenAI applications. Today we added new search grounding to the Reader. Now you can simply write a query as https://s.jina.ai/When+will+the+next+SpaceX+launch+be and it will return the top-5 search results from the web, each with LLM-friendly text and a URL pointing to the source. With this, devs can easily incorporate the latest world knowledge into their LLMs, one step closer to improving the factuality of LLMs and making responses more trustworthy and helpful. 🧵
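Building such a grounding URL from a natural-language question is just URL-encoding the query onto the s.jina.ai host, as in the example above; the helper function itself is our own sketch:

```python
from urllib.parse import quote_plus

# Turn a natural-language question into an s.jina.ai grounding URL.
# The host comes from the post; the helper is illustrative.
def grounding_url(query: str) -> str:
    return "https://s.jina.ai/" + quote_plus(query)

grounding_url("When will the next SpaceX launch be")
# -> "https://s.jina.ai/When+will+the+next+SpaceX+launch+be"
```

`quote_plus` turns spaces into `+` and percent-encodes anything else unsafe, so arbitrary questions produce valid URLs.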
Not familiar with search grounding? Allow me to explain. We all know LLMs can make things up and harm user trust: they may say things that are not factual (aka hallucinate), especially regarding topics they didn't learn about during training. This could be either new information created since training or niche knowledge that was "marginalized" during training.

Here is an example of niche knowledge being "marginalized" during training: we asked GPT-3.5-turbo "When was Jina AI founded?" and received an incorrect answer (yeah, we ain't that famous 🤷). However, when using Reader for search grounding, the same LLM provided the correct answer, precise to the exact date: Feb 1st, 2020. (Now you know.)
Here is another example, this time of new information created since training. We asked GPT-3.5-turbo "When will the next SpaceX launch be?" (today is May 14th, 2024), and the model responded with outdated information from 2021.

In summary, when it comes to questions like "What's the weather today?" or "Who won the Oscar for Best Actress this year?", the model will either respond with "I don't know" or give you outdated information. That's where search grounding becomes useful.
