More from @_akhaliq

AK

@_akhaliq

Jan 20

DeepSeek-R1 Coder

it's like Cursor in the browser

now available in ai-gradio

pip install --upgrade "ai-gradio[deepseek]"

import gradio as gr
import ai_gradio

# Load the DeepSeek reasoner model from the ai-gradio registry with the coder option enabled
gr.load(
    name='deepseek:deepseek-reasoner',
    src=ai_gradio.registry,
    coder=True
).launch()

github: github.com/AK391/ai-gradio

quick start colab: colab.research.google.com/drive/1Ilz8LRQ…


AK

@_akhaliq

Jul 23, 2024

Apple presents SlowFast-LLaVA

A Strong Training-Free Baseline for Video Large Language Models

We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language model (LLM) that can jointly capture the detailed spatial semantics and long-range temporal context without exceeding the token budget of commonly used LLMs. This is realized by using a two-stream SlowFast design of inputs for Video LLMs to aggregate features from sampled video frames in an effective way. Specifically, the Slow pathway extracts features at a low frame rate while keeping as many spatial details as possible (e.g., with 24x24 tokens), and the Fast pathway operates on a high frame rate but uses a larger spatial pooling stride (e.g., downsampling 6x) to focus on the motion cues. As a result, this design
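The token-budget trade-off described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the frame count (48), the Slow sampling interval (every 6th frame), the use of average pooling, and the scalar stand-ins for feature vectors are all assumptions made for illustration. Only the 24x24 token grid and the 6x downsampling come from the text.

```python
def avg_pool_2d(grid, stride):
    """Average-pool a square 2D grid (list of lists) with non-overlapping stride x stride windows."""
    size = len(grid)
    assert size % stride == 0, "grid side must be divisible by the stride"
    pooled = []
    for r in range(0, size, stride):
        row = []
        for c in range(0, size, stride):
            window = [grid[r + i][c + j] for i in range(stride) for j in range(stride)]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


def slowfast_token_count(num_frames, slow_every, grid=24, fast_stride=6):
    """Return (slow_tokens, fast_tokens) for a hypothetical two-stream input.

    Slow pathway: every `slow_every`-th frame, full grid x grid tokens.
    Fast pathway: every frame, tokens pooled by `fast_stride` per side.
    """
    slow_frames = len(range(0, num_frames, slow_every))
    slow_tokens = slow_frames * grid * grid
    fast_side = grid // fast_stride
    fast_tokens = num_frames * fast_side * fast_side
    return slow_tokens, fast_tokens


# Hypothetical example: 48 sampled frames, Slow pathway keeps every 6th frame.
slow, fast = slowfast_token_count(num_frames=48, slow_every=6)
print(slow, fast, slow + fast)  # 4608 768 5376

# One frame's 24x24 token grid (scalars stand in for feature vectors) pooled 6x per side.
frame = [[1.0] * 24 for _ in range(24)]
fast_frame = avg_pool_2d(frame, stride=6)
print(len(fast_frame), len(fast_frame[0]))  # 4 4
```

Under these assumed numbers the Fast pathway covers six times as many frames as the Slow pathway while adding far fewer tokens, which is the intuition behind staying within an LLM's token budget.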


AK

@_akhaliq

Jul 22, 2024

EVLM

An Efficient Vision-Language Model for Visual Understanding

In the field of multi-modal language models, the majority of methods are built on an architecture similar to LLaVA. These models use a single-layer ViT feature as a visual prompt, directly feeding it into the language models alongside textual tokens. However, when dealing with long sequences of visual signals or inputs such as videos, the self-attention mechanism of language models can lead to significant computational overhead. Additionally, using single-layer ViT features makes it challenging for large language models to perceive visual signals fully. This paper proposes an efficient multi-modal language model to minimize computational costs while enabling the model to perceive visual signals as comprehensively as possible. Our method


AK

@_akhaliq

Apr 24, 2024

Transformers Can Represent n-gram Language Models

Plenty of existing work has analyzed the abilities of the transformer architecture by describing its representational capacity with formal models of computation. However, the focus so far has been on analyzing the architecture in terms of language acceptance. We contend that this is an ill-suited problem in the study of language models (LMs), which are definitionally probability distributions over strings. In this paper, we focus on the relationship between transformer LMs and n-gram LMs, a simple and historically relevant class of language models. We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any n-gram LM, giving us a concrete lower bound on their probabilistic representational capacity. This
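For readers unfamiliar with the model class being compared against, here is a minimal sketch of an n-gram LM as a probability distribution over strings. It shows only the n-gram side, estimated by maximum likelihood from a toy corpus. It is not the paper's transformer construction, and the function names, boundary symbols, and corpus are hypothetical.

```python
from collections import Counter, defaultdict

BOS, EOS = "<s>", "</s>"  # hypothetical boundary symbols


def train_ngram(corpus, n=2):
    """Count next-token occurrences for every (n-1)-token context."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = [BOS] * (n - 1) + list(sentence) + [EOS]
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            counts[context][tokens[i]] += 1
    return counts


def next_token_prob(counts, context, token):
    """Maximum-likelihood P(token | context); 0.0 for unseen contexts."""
    seen = counts.get(tuple(context))
    if not seen:
        return 0.0
    return seen[token] / sum(seen.values())


def string_prob(counts, sentence, n=2):
    """Probability of a whole string, including the end-of-string symbol."""
    tokens = [BOS] * (n - 1) + list(sentence) + [EOS]
    prob = 1.0
    for i in range(n - 1, len(tokens)):
        prob *= next_token_prob(counts, tokens[i - n + 1:i], tokens[i])
    return prob


# Toy corpus: two equally likely strings, "a b" and "a c".
model = train_ngram([["a", "b"], ["a", "c"]], n=2)
print(string_prob(model, ["a", "b"]))  # 0.5
print(string_prob(model, ["a", "c"]))  # 0.5
print(string_prob(model, ["a"]))       # 0.0
```

Each next-token probability depends only on the previous n-1 tokens, which is the property that defines an n-gram LM.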


AK

@_akhaliq

Apr 23, 2024

OpenAI presents The Instruction Hierarchy

Training LLMs to Prioritize Privileged Instructions

Today's LLMs are susceptible to prompt injections, jailbreaks, and other attacks that allow adversaries to overwrite a model's original instructions with their own malicious prompts.

In this work, we argue that one of the primary vulnerabilities underlying these attacks is that LLMs often consider system prompts (e.g., text from an application developer) to be the same priority as text from untrusted users and third parties. To address this, we propose an instruction hierarchy that explicitly defines how models should behave when instructions of different priorities conflict. We then propose a data generation method to demonstrate this hierarchical instruction following behavior, which teaches LLMs to selectively ignore lower-


AK

@_akhaliq

Apr 23, 2024

JP Morgan presents FlowMind

Automatic Workflow Generation with LLMs

The rapidly evolving field of Robotic Process Automation (RPA) has made significant strides in automating repetitive processes, yet its effectiveness diminishes in scenarios requiring spontaneous or unpredictable tasks demanded by users. This paper introduces a novel approach, FlowMind, leveraging the capabilities of Large Language Models (LLMs) such as Generative Pretrained Transformer (GPT), to address this limitation and create an automatic workflow generation system. In FlowMind, we propose a generic prompt recipe for a lecture that helps ground LLM reasoning with reliable Application Programming Interfaces (APIs). With this, FlowMind not only mitigates the common issue of hallucinations in LLMs, but also eliminates direct

