Introducing Act-One, a new way to generate expressive character performances inside Gen-3 Alpha using a single driving video and character image. No motion capture or rigging required.
Learn more about Act-One below.
(1/7)
Act-One allows you to faithfully capture the essence of an actor's performance and transpose it to your generation. Where traditional pipelines for facial animation involve complex, multi-step workflows, Act-One works with a single driving video that can be shot on something as simple as a cell phone.
(2/7)
Without the need for motion-capture or character rigging, Act-One is able to translate the performance from a single input video across countless different character designs and in many different styles.
(3/7)
One of the model's strengths is producing cinematic and realistic outputs across a wide range of camera angles and focal lengths, allowing you to generate emotional performances with previously impossible character depth and opening new avenues for creative expression.
(4/7)
A single video of an actor is used to animate a generated character.
(5/7)
With Act-One, eye-lines, micro expressions, pacing and delivery are all faithfully represented in the final generated output.
(6/7)
Access to Act-One will begin gradually rolling out to users today and will soon be available to everyone.
Today we're sharing our first research work exploring diffusion for language models: Autoregressive-to-Diffusion Vision Language Models
We develop a state-of-the-art diffusion vision language model, Autoregressive-to-Diffusion (A2D), by adapting an existing autoregressive vision language model for parallel diffusion decoding. Our approach makes it easy to unlock the speed-quality trade-off of diffusion language models without training from scratch, by leveraging existing pre-trained autoregressive models.
Standard vision-language models (VLMs) reason about images and videos through language, powering a wide variety of applications from image captioning to visual question answering.
Autoregressive VLMs generate tokens sequentially, which prevents parallelization and limits inference throughput. Diffusion decoders are emerging as a promising alternative to autoregressive decoders in VLMs by enabling parallel token generation for faster inference.
We trained a state-of-the-art diffusion VLM, A2D-VL 7B, for parallel generation by finetuning an existing autoregressive VLM on the diffusion language modeling task, using the masked diffusion framework, which "noises" tokens by masking them and "de-noises" them by predicting the original tokens.
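To make the masked diffusion objective concrete, here is a minimal sketch of how such noising and de-noising can be set up for discrete tokens, assuming a generic transformer denoiser. The names `denoiser`, `noise_tokens`, and `MASK_ID` are illustrative assumptions, not part of the A2D release.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical id reserved for the [MASK] token

def noise_tokens(tokens: torch.Tensor, noise_level: float):
    """Forward ("noising") process: independently replace each token
    with [MASK] with probability `noise_level`."""
    mask = torch.rand(tokens.shape, device=tokens.device) < noise_level
    noisy = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
    return noisy, mask

def denoising_loss(denoiser, tokens: torch.Tensor, noise_level: float):
    """Reverse ("de-noising") objective: predict the original tokens at
    the masked positions from the partially masked sequence."""
    noisy, mask = noise_tokens(tokens, noise_level)
    logits = denoiser(noisy)  # (batch, seq_len, vocab_size)
    return F.cross_entropy(logits[mask], tokens[mask])
```

At inference, the same denoiser can fill in many masked positions per step, which is what enables parallel decoding.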
We develop novel adaptation techniques that gradually increase task difficulty during finetuning, annealing both the block size and the noise level to smoothly transition from sequential to parallel decoding while preserving the base model's capabilities.
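As an illustration of the idea only (the specific schedule and ranges here are assumptions, not values from the paper), such a curriculum could linearly grow the decoding block size and the masking noise level over the course of finetuning:

```python
def annealed_curriculum(step: int, total_steps: int,
                        start_block: int = 1, final_block: int = 8,
                        start_noise: float = 0.1, final_noise: float = 1.0):
    """Illustrative curriculum: small blocks and light masking early in
    finetuning (close to sequential decoding), larger blocks and heavier
    masking later (fully parallel decoding)."""
    t = min(step / total_steps, 1.0)
    block_size = round(start_block + t * (final_block - start_block))
    noise_level = start_noise + t * (final_noise - start_noise)
    return block_size, noise_level
```

Early in training the task looks almost autoregressive; by the end, the model is asked to reconstruct large, heavily masked blocks in parallel.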
Runway Aleph is a new way to edit, transform and generate video. Its ability to perform a wide range of generalized tasks means it can reimagine ordinary footage in endless new ways, allowing you to turn images and videos you already have into anything you want.
See below for a quick breakdown on how Aleph can effortlessly remove the subject from these scenes, just by asking it to.
To remove the subject, just ask Aleph to “remove the man”.
Aleph can retain complex scenes and fine details without the need for tedious masking.
Today we are releasing Frames, our most advanced base model for image generation, offering unprecedented stylistic control and visual fidelity. Learn more below.
(1/10)
With Frames, you can begin to define worlds that represent your own artistic points of view. Styles, compositions, subject matter and more. Anything you can imagine, you can begin to bring to life with Frames.
Today we’re sharing an early video keyframing prototype that treats creative exploration as a search over the space of latent artistic possibilities, one that lets you navigate this vast space with both precise control and serendipitous nonlinear discovery.
(1/8)
Graph Structure: A Window into Latent Space
The Graph structure is the foundation of the prototype. Images are represented as nodes, serving as waypoints in the model's latent space. Nodes can be connected to one another to create an edge: a video that transitions from the first frame to the last frame across latent space and time.
(2/8)
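As a rough sketch of the structure described above (illustrative only, not the prototype's actual implementation), nodes hold images and edges hold the videos that transition between them:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ImageNode:
    """A waypoint in latent space, represented by a generated image."""
    image_path: str

@dataclass
class VideoEdge:
    """A transition between two nodes: a video whose first and last
    frames are the two endpoint images."""
    start: ImageNode
    end: ImageNode
    video_path: Optional[str] = None  # set once the transition is generated

@dataclass
class KeyframeGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)

    def add_node(self, image_path: str) -> ImageNode:
        node = ImageNode(image_path)
        self.nodes.append(node)
        return node

    def connect(self, start: ImageNode, end: ImageNode) -> VideoEdge:
        edge = VideoEdge(start, end)
        self.edges.append(edge)
        return edge
```

Each call to `connect` defines a transition to be generated, so the graph doubles as a map of the latent space explored so far.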
Balancing Control and Serendipity
Precise controls help limit the vast space of possibilities, but at the same time, variation and unpredictability can result in "happy accidents": possibilities that we would not have considered given precise control. To balance this tradeoff, we provide two affordances for users to manipulate images in a "relational" manner that allows unpredictability in consistent dimensions.
Introducing Frames: An image generation model offering unprecedented stylistic control.
Frames is our newest foundation model for image generation, marking a big step forward in stylistic control and visual fidelity. With Frames, you can begin to architect worlds that represent very specific points of view and aesthetic characteristics.
See below for examples.
World 1089: Mise-en-scène
(1/11)
Frames allows you to precisely design the look, feel and atmosphere of the world you want to create.
Expand Video is a new feature that transforms videos into new aspect ratios by generating new areas around your input video. It has begun gradually rolling out and will soon be available to everyone.
See below for more examples and results.
(1/6)
Use Expand Video to help shape your story. Seamlessly extend your frame beyond its original boundaries while maintaining visual consistency to create stories with new compositions.
(2/6)
Expand into the unexpected with text prompts or guiding images.