Discover and read the best of Twitter Threads about #CVPR2023

Most recent threads (5)

Do we need RGB to train neural networks? We skip decoding JPEG to RGB, feed the encoded JPEG directly to a ViT, and speed up training/evaluation by up to 39.2%/17.9% with no accuracy loss!

Check out our poster on Thu-PM-165 in #CVPR2023! (work w/ @jcjohnss)

bit.ly/3qRwToV
JPEG slices images into patches. ViT works on patches. This makes it a perfect match for training from JPEG.
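A rough sketch of the idea (not the authors' code; the tensor layout and embedding width are assumptions): JPEG stores an image as 8x8 blocks of DCT coefficients, and those blocks can be flattened and linearly projected into patch tokens exactly the way a standard ViT embeds RGB patches.

```python
import torch
import torch.nn as nn

# Hypothetical input: luma-channel DCT coefficients from a JPEG parser,
# one 8x8 coefficient block per spatial location (layout is an assumption).
B, H_blocks, W_blocks = 4, 28, 28            # a 224x224 image -> 28x28 blocks of 8x8
dct = torch.randn(B, H_blocks, W_blocks, 8, 8)

# Flatten each 8x8 block into a 64-d vector, like a ViT flattens an RGB patch,
# then project to the transformer embedding dimension.
tokens = dct.flatten(3)                      # (B, 28, 28, 64)
tokens = tokens.flatten(1, 2)                # (B, 784, 64): one token per block
embed = nn.Linear(64, 768)                   # patch-embedding layer (assumed width)
x = embed(tokens)                            # (B, 784, 768), ready for a ViT encoder
```

Since no inverse DCT or color conversion happens, the decode work in the input pipeline largely disappears.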
Data augmentation is vital for training a good-performing model. We augment the JPEG representation directly to speed up training, instead of converting to RGB, augmenting, and converting back.
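As one concrete example of compressed-domain augmentation (my illustration, not necessarily the augmentations used in the paper): a horizontal flip can be applied directly to DCT coefficients by reversing the block order along the width and negating coefficients with an odd horizontal frequency index. This reuses the tensor layout assumed in the sketch above.

```python
def hflip_dct(dct):
    """Horizontally flip an image given its 8x8 DCT blocks.

    dct: (B, H_blocks, W_blocks, 8, 8), last dim = horizontal frequency
    (the layout is an assumption carried over from the sketch above).
    """
    flipped = dct.flip(dims=[2])                    # reverse blocks left-to-right
    sign = torch.tensor([1.0, -1.0] * 4)            # +1 for even, -1 for odd frequencies
    return flipped * sign                           # negate odd horizontal frequencies

aug = hflip_dct(dct)                                # same shape, horizontally flipped
```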
Excited to share our #CVPR2023 paper on synthesizing new views along a camera trajectory from a **single image**!

How?
💡 The good old epipolar constraints in a pose-guided diffusion model!

Paper: arxiv.org/abs/2303.17598
Project: poseguided-diffusion.github.io
How does it work?

We train a diffusion model conditioned on 1) the relative camera pose and 2) the source image via a cross-attention layer.

BUT we don’t need to attend to every location!

Fantastic correspondences and where to find them?

Epipolar geometry!
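A rough reconstruction of how such an epipolar attention mask could be built (not the released code; the pose convention, variable names, and pixel threshold are assumptions): the relative pose and intrinsics give a fundamental matrix, each target pixel maps to an epipolar line in the source view, and cross-attention is only allowed for source locations near that line.

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def epipolar_attention_mask(K, R, t, tgt_pts, src_pts, thresh=2.0):
    """Boolean (N, M) mask: True where source pixel j lies near the epipolar
    line of target pixel i. K: 3x3 intrinsics (shared by both views here),
    (R, t): relative pose target -> source, points in pixel coordinates;
    `thresh` (in pixels) is an assumed hyperparameter."""
    F = np.linalg.inv(K).T @ skew(t) @ R @ np.linalg.inv(K)      # fundamental matrix
    tgt_h = np.concatenate([tgt_pts, np.ones((len(tgt_pts), 1))], axis=1)
    src_h = np.concatenate([src_pts, np.ones((len(src_pts), 1))], axis=1)
    lines = tgt_h @ F.T                                           # epipolar lines l = F x
    num = np.abs(lines @ src_h.T)                                 # |l . x'| for all pairs
    den = np.linalg.norm(lines[:, :2], axis=1, keepdims=True)     # line normalization
    return (num / den) < thresh
```

Inside the model, a mask like this would typically be applied to the cross-attention logits (e.g., by setting masked entries to -inf) so each target location only attends to source features along its epipolar line.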
How well does it work?

Surprisingly, this simple method achieves both higher per-frame quality AND better temporal consistency!
Introducing Objaverse, a massive open dataset of text-paired 3D objects!

Nearly 1 million annotated 3D objects to pave the way for building incredible large-scale 3D generative models: 🧵👇

🤗 Hugging Face: huggingface.co/datasets/allen…
📝ArXiv: arxiv.org/abs/2212.08051

#CVPR2023
In the past, ShapeNet has enabled remarkable progress and benchmarking across 3D computer vision!

But, it lacks visual diversity, realism…
and scale! 🪨

Objaverse is more than an order of magnitude larger and has ~400x more categories! 📈
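For illustration only, a minimal way to pull just the dataset's metadata from the Hugging Face Hub; the repository id (filling in the truncated link above) and the file patterns are assumptions and may need adjusting to the repo's actual layout.

```python
from huggingface_hub import snapshot_download

# Assumed repo id and metadata file patterns; the 3D assets themselves are
# large, so only JSON metadata is fetched in this sketch.
local_dir = snapshot_download(
    repo_id="allenai/objaverse",            # assumption: the truncated link's target
    repo_type="dataset",
    allow_patterns=["*.json", "*.json.gz"],
)
print(local_dir)
```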
Introducing Vid2Seq, a new visual language model for dense video captioning. To appear at #CVPR2023.

Work done at @Google w/ @NagraniArsha, P.H. Seo, @antoine77340, @jponttuset, I. Laptev, J. Sivic, @CordeliaSchmid.

Page: antoyang.github.io/vid2seq.html
Paper: arxiv.org/abs/2302.14115

🧵/5
Most video captioning systems can only describe a single event in short videos. But natural videos may contain numerous events. So we focus on the dense video captioning task, which requires temporally localizing and captioning all events in untrimmed minutes-long videos 🎞️.

2/5
Avoiding any task-specific design, the Vid2Seq model predicts all event captions and boundaries by simply generating a single sequence of tokens, given visual and speech inputs. Special time tokens interleave the text sentences to temporally ground them in the video ⌛️.

3/5 [figure: the Vid2Seq model]
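To make the single-sequence-with-time-tokens idea concrete, here is a hedged sketch of how such a target sequence could be assembled: timestamps are quantized into a fixed number of time tokens added to the vocabulary, and each event contributes its start token, end token, and caption. The bin count and token format are assumptions, not the paper's exact implementation.

```python
NUM_TIME_BINS = 100   # assumed number of quantized time tokens

def time_token(t_sec, duration_sec):
    """Map an absolute timestamp to a discrete time token <Tk>."""
    k = min(int(t_sec / duration_sec * NUM_TIME_BINS), NUM_TIME_BINS - 1)
    return f"<T{k}>"

def build_target_sequence(events, duration_sec):
    """events: list of (start_sec, end_sec, caption) tuples.
    Returns one string interleaving time tokens and captions in temporal order."""
    parts = []
    for start, end, caption in sorted(events):
        parts += [time_token(start, duration_sec), time_token(end, duration_sec), caption]
    return " ".join(parts)

# Example: a 120-second video with two events
print(build_target_sequence(
    [(3.0, 12.5, "a man opens the fridge"), (40.0, 55.0, "he pours a glass of milk")],
    duration_sec=120.0,
))
# -> "<T2> <T10> a man opens the fridge <T33> <T45> he pours a glass of milk"
```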
🎉CutLER is accepted to #CVPR2023!

tl;dr: CutLER is an unsupervised object detector that surpasses the previous SOTA by 2.7x on 11 datasets across various domains, e.g., natural images, paintings, and sketches.

Code and demos are released!
Developed during my internship at FAIR-Meta AI @MetaAI

1/n
Demo results on 11 benchmarks spanning a variety of domains, including video frames, paintings, clip art, complex scenes, etc.

3/n