Discover and read the best of Twitter Threads about #CVPR2022

Most recent (10)

Attending an academic conference soon?
Going by reports and polls on Twitter, you probably have a 10-20% chance of getting #COVID19 there.
Here's a small thread with some evidence and some suggestions. 1/
2/ ACM #SIGGRAPH, a major conference and trade fair with up to 20,000 attendees, will take place in Vancouver next month, right in time for the peak of a COVID wave.
At the lower end of that range, that means roughly 2,000 people will return home infected with COVID.
3/ Of these 2,000 people with COVID, a few hundred will be knocked out for days or weeks.
About 1% might suffer *debilitating* #LongCOVID effects, i.e. around 20 attendees facing a major illness for weeks, months, or longer.
nature.com/articles/s4146…
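(Sidenote: a back-of-the-envelope version of the arithmetic above, using the thread's own assumptions of a ~10% attack rate and a ~1% rate of debilitating Long COVID.)

```python
# Rough expected-value estimate with the thread's assumptions (illustrative only).
attendees = 20_000
attack_rate = 0.10        # lower end of the 10-20% range cited above
long_covid_rate = 0.01    # share of infections with debilitating Long COVID

infected = attendees * attack_rate         # ~2,000 infections
debilitated = infected * long_covid_rate   # ~20 severe Long COVID cases
print(f"Expected infections: {infected:.0f}, debilitating Long COVID: {debilitated:.0f}")
```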
Read 36 tweets
🔥 GroupViT by @nvidia is now available in @huggingface Transformers.

The model is capable of zero-shot semantic segmentation, requiring no pixel-level labels.🤯

For training, only 30M noisy (image, text) pairs were used.

Notebook: tinyurl.com/mrxn9vbx (1/3)
The model can be seen as an extension of @OpenAI's CLIP to semantic segmentation, with a clever grouping mechanism in the image encoder.😎

It clearly shows how language supervision can improve computer vision models!

Docs: huggingface.co/docs/transform…

Models: huggingface.co/models?other=g…
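A minimal usage sketch of the CLIP-style API (the checkpoint name is one of the released GroupViT checkpoints on the Hub; see the docs and notebook above for the full zero-shot segmentation workflow):

```python
# Minimal sketch: CLIP-style zero-shot image-text matching with GroupViT.
import requests
from PIL import Image
from transformers import AutoProcessor, GroupViTModel

processor = AutoProcessor.from_pretrained("nvidia/groupvit-gcc-yfcc")
model = GroupViTModel.from_pretrained("nvidia/groupvit-gcc-yfcc")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Image-text similarity scores (higher = better match).
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```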
🙏 Shout-out to @Jerry_XU_Jiarui, the paper's first author, who contributed the model to the library.

He also created an awesome Space for it (part of #CVPR2022's demo track): huggingface.co/spaces/CVPR/Gr…

(3/3)
Read 3 tweets
Applying deep learning to pathology is quite challenging due to the sheer size of the slide images (gigapixels!).

A common approach is to divide images into smaller patches, for which deep learning features can be extracted & aggregated to provide a slide-level diagnosis (1/9)
Unfortunately, dividing into small patches limits the context to cellular-level features, missing out on other relevant scales of information, such as larger-scale tissue organization. (2/9)
Additionally, it is difficult to model long-range dependencies with Transformers: the sheer number of patches makes computing attention prohibitively expensive. (3/9)
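A minimal sketch of the patch-then-aggregate baseline described above (illustrative only: the ResNet encoder, 256-px patches, and mean-pooling aggregator are placeholder choices, not any specific paper's method):

```python
import torch
import torch.nn as nn
import torchvision.models as models

patch_size = 256

# Placeholder patch encoder: a ResNet with the classification head removed.
encoder = models.resnet18(weights=None)
encoder.fc = nn.Identity()

# Simple slide-level head on top of aggregated patch features.
classifier = nn.Linear(512, 2)  # e.g. tumor vs. normal

def slide_prediction(slide: torch.Tensor) -> torch.Tensor:
    """slide: (3, H, W) tensor standing in for a (cropped) gigapixel image."""
    _, H, W = slide.shape
    feats = []
    for y in range(0, H - patch_size + 1, patch_size):
        for x in range(0, W - patch_size + 1, patch_size):
            patch = slide[:, y:y + patch_size, x:x + patch_size].unsqueeze(0)
            feats.append(encoder(patch))          # (1, 512) patch feature
    slide_feat = torch.cat(feats, dim=0).mean(0)  # naive aggregation: mean-pool
    return classifier(slide_feat)                 # slide-level logits

logits = slide_prediction(torch.randn(3, 1024, 1024))
print(logits.shape)  # torch.Size([2])
```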
Read 9 tweets
In our #CVPR2022 Oral, we introduce the atemporal probe (ATP) to analyze *atemporal* (single frame) bias in video-language, with surprising results! (see 🧵)

Led by Shyamal Buch with @CristbalEyzagu2, @adnothing, @jiajunwu_cs, @drfeifei

atp-video-language.stanford.edu

1/10
The promise of videos is the potential to go *beyond* image-centric understanding (people, objects, scenes, etc.) towards event temporality, causality, and dynamics. Ideally, we want video-language benchmarks and models to realize this promise.

2/10
Our paper focuses on a fundamental question in video research: to what extent can "image-centric" understanding address "video" understanding?

Consider the example below: can we answer the question with only a single frame?

3/10
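To make that question concrete, here is a hedged sketch of an "atemporal" single-frame baseline in the spirit of what the thread describes (not the paper's ATP model): score each candidate answer against one sampled frame with an off-the-shelf image-text model, ignoring temporal order entirely. The CLIP checkpoint and prompt format are placeholder choices.

```python
# Illustrative single-frame ("atemporal") baseline for video question answering.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def answer_from_single_frame(frame, question, candidate_answers):
    """frame: a PIL image taken from the video; returns the best-scoring answer."""
    prompts = [f"{question} {answer}" for answer in candidate_answers]
    inputs = processor(text=prompts, images=frame, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]  # one score per candidate
    return candidate_answers[int(logits.argmax())]
```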
Read 10 tweets
When representing a neural field, instead of having to learn and store one large field for the entire spatial region of interest, hybrid approaches combine feature vectors organized in a data structure (e.g. a grid) with a small neural net that produces the final result. #CVPR2022
There are lots of choices for the data structure that organizes the feature vectors: grids, point clouds, meshes…
And even more choices!
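A minimal sketch of the grid-based variant (a toy 2D example with bilinear feature interpolation; the dimensions and the tiny MLP are arbitrary illustrations):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridField2D(nn.Module):
    """Toy hybrid neural field: a learnable feature grid + a tiny MLP decoder."""
    def __init__(self, grid_res=64, feat_dim=16, out_dim=1):
        super().__init__()
        # Feature grid of shape (1, C, H, W), interpolated at query points.
        self.grid = nn.Parameter(0.01 * torch.randn(1, feat_dim, grid_res, grid_res))
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 32), nn.ReLU(),
            nn.Linear(32, out_dim),
        )

    def forward(self, coords):
        # coords: (N, 2) in [-1, 1]; grid_sample expects shape (1, N, 1, 2).
        sample_pts = coords.view(1, -1, 1, 2)
        feats = F.grid_sample(self.grid, sample_pts, align_corners=True)
        feats = feats.view(self.grid.shape[1], -1).t()   # (N, C)
        return self.mlp(feats)                            # (N, out_dim)

field = HybridField2D()
values = field(torch.rand(1024, 2) * 2 - 1)  # query 1024 random points
print(values.shape)  # torch.Size([1024, 1])
```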
Read 4 tweets
Happy to finally share our paper about differentiable Top-K Learning by Sorting that didn’t make it to #CVPR2022, but was accepted for #ICML2022! We show that you can improve classification by actually considering top-1 + runner-ups… 1/6🧵

#ComputerVision #AI #MachineLearning
Paper: arxiv.org/abs/2206.07290

Great work by @FHKPetersen in collaboration with Christian Borgelt and @OliverDeussen. 2/6🧵

@MITIBMLab @goetheuni @UniKonstanz
Idea: Top-k accuracy is used to evaluate many ML tasks, but training is usually limited to top-1 (or one other fixed k). We propose a differentiable top-k classification loss that allows training on any combination of top-k predictions, e.g. top-2 and top-5. 3/6🧵
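The paper builds its loss on differentiable sorting networks; as a much simpler illustration of the general idea (a smooth surrogate for "the true class ranks within the top k", not the authors' method), one could write something like:

```python
import torch

def soft_topk_loss(logits, targets, k=2, tau=1.0):
    """Smooth surrogate for P(true class is within the top-k predictions).

    logits: (B, C) class scores; targets: (B,) class indices.
    Illustrative relaxation only, not the differentiable sorting networks
    used in the paper.
    """
    true_scores = logits.gather(1, targets[:, None])                 # (B, 1)
    # Soft count of classes scoring higher than the true class
    # (the true class contributes sigmoid(0) = 0.5, so subtract it).
    higher = torch.sigmoid((logits - true_scores) / tau).sum(dim=1) - 0.5
    soft_rank = 1.0 + higher
    # Smooth indicator of "rank <= k"; its negative log is the loss.
    p_topk = torch.sigmoid((k + 0.5 - soft_rank) / tau)
    return -torch.log(p_topk + 1e-8).mean()

logits = torch.randn(8, 10, requires_grad=True)
targets = torch.randint(0, 10, (8,))
loss = soft_topk_loss(logits, targets, k=5)
loss.backward()
```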
Read 7 tweets
Check out our #CVPR2022 paper! We improve multimodal zero-shot text-to-video retrieval on Youcook2/MSR-VTT by leveraging a fusion transformer and a combinatorial loss. 1/🧵

#ComputerVision #AI #MachineLearning

@MITIBMLab @goetheuni @MIT_CSAIL @IBMResearch
If you want to go directly to the paper/code, please check out:
paper: arxiv.org/abs/2112.04446
Github link: github.com/ninatu/everyth…

Great work by @ninashv__, @Brian271828, @arouditchenko, Samuel Thomas, Brian Kingsbury, @RogerioFeris, David Harwath, and James Glass.
We propose a modality-agnostic fusion transformer that learns to exchange information between multiple modalities (e.g. video, audio, and text) and builds an embedding that aggregates multimodal information.
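A hedged sketch of what a contrastive objective summed over modality combinations could look like (a generic symmetric InfoNCE over every pair of embeddings; the exact combinations and fusion used in the paper may differ, see the GitHub link above):

```python
import itertools
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings (positives on the diagonal)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def combinatorial_loss(embeddings: dict):
    """Sum a contrastive loss over every pair of modality embeddings.

    embeddings: e.g. {"video": (B, D), "audio": (B, D), "text": (B, D)} tensors,
    possibly including fused combinations such as "video+audio".
    """
    total = 0.0
    for (_, e1), (_, e2) in itertools.combinations(embeddings.items(), 2):
        total = total + info_nce(e1, e2)
    return total

B, D = 16, 256
loss = combinatorial_loss({
    "video": torch.randn(B, D),
    "audio": torch.randn(B, D),
    "text": torch.randn(B, D),
})
```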
Read 5 tweets
Are you working on federated learning over heterogeneous data? Use Vision Transformers as a backbone!
In our upcoming #CVPR2022 paper, we perform extensive experiments demonstrating the effectiveness of ViTs for FL:

paper: arxiv.org/abs/2106.06047
code: github.com/Liangqiong/ViT…
@vickyqu0 @yuyinzhou_cs @mldcmu @StanfordDBDS @StanfordAILab We find that ViTs are more robust to distribution shift, reduce catastrophic forgetting over devices, accelerate convergence, and reach better models.

Using ViTs, we are able to scale FL up to the edge case of heterogeneity: 6000 & 45000 clients with only 1 sample per client!
By virtue of their robustness and generalization properties, ViTs also converge faster with fewer communicated parameters, which makes them appealing for efficient FL.

ViTs can be used with optimization FL methods (FedProx, FedAvg-Share) to further improve speed & performance.
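As a bare-bones illustration of the setup (not the paper's code): FedAvg with a ViT backbone. The client loop, local epochs, learning rate, and the torchvision model are placeholder choices, and FedProx / FedAvg-Share would add their own terms on top.

```python
import copy
import torch
import torchvision.models as models

def fedavg_round(global_model, client_loaders, lr=3e-4, local_epochs=1):
    """One FedAvg round: local SGD on each client, then average the weights."""
    client_states = []
    for loader in client_loaders:
        local = copy.deepcopy(global_model)
        opt = torch.optim.SGD(local.parameters(), lr=lr)
        local.train()
        for _ in range(local_epochs):
            for images, labels in loader:
                opt.zero_grad()
                loss = torch.nn.functional.cross_entropy(local(images), labels)
                loss.backward()
                opt.step()
        client_states.append(local.state_dict())

    # Uniform (unweighted) average of the client parameters.
    avg_state = {
        k: torch.stack([s[k].float() for s in client_states]).mean(0)
        for k in client_states[0]
    }
    global_model.load_state_dict(avg_state)
    return global_model

# ViT backbone as the shared global model (weights and num_classes are placeholders).
global_model = models.vit_b_16(weights=None, num_classes=10)
```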
Read 4 tweets
Very glad I can finally talk about our newly-minted #CVPR2022 paper. We extended mip-NeRF to handle unbounded "360" scenes, and it got us ~photorealistic renderings and beautiful depth maps. Paper: arxiv.org/abs/2111.12077 (explainer video in the original tweet).
We "contract" Euclidean space into a bounded domain, which gets hard because we need to warp the mip-NeRF Gaussians that model 3D volumes of space. The trick for making this work is linearizing the contraction (thanks JAX!) and using the same math as an extended Kalman filter.
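For the curious, here is a small sketch of that idea, assuming the contraction published in the paper (identity inside the unit ball, (2 - 1/||x||) x/||x|| outside) and using JAX autodiff for the Jacobian; the EKF-style Gaussian update below is my own minimal rendering, not the authors' code.

```python
import jax
import jax.numpy as jnp

def contract(x):
    """Contraction: identity inside the unit ball, (2 - 1/||x||) * x/||x|| outside,
    so all of Euclidean space lands in a ball of radius 2."""
    norm = jnp.linalg.norm(x)
    return jnp.where(norm <= 1.0, x, (2.0 - 1.0 / norm) * x / norm)

def contract_gaussian(mean, cov):
    """Push a Gaussian (mean, cov) through the contraction by linearizing it,
    extended-Kalman-filter style: mean -> f(mean), cov -> J cov J^T."""
    J = jax.jacfwd(contract)(mean)           # 3x3 Jacobian of the contraction
    return contract(mean), J @ cov @ J.T

mean = jnp.array([3.0, 0.0, 0.0])
cov = 0.1 * jnp.eye(3)
new_mean, new_cov = contract_gaussian(mean, cov)
```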
Big scenes need big models! We use a tiny MLP (queried repeatedly) to model coarse scales, and a huge MLP (queried sparingly) to model the finest scale, and distill the huge "NeRF MLP" into that tiny "proposal MLP". The trick: histograms from both MLPs *must* bound each other.
Read 5 tweets
I'm glad I can finally tell you about our paper MTTR (arxiv.org/abs/2111.14821), which got accepted to #CVPR2022, led by Adam Botach and supervised by @ChaimBaskin (1/n)
In this work we tackle the complex multimodal problem of referring video segmentation: segmenting an object in a video given its textual description. (2/n)
We propose a very simple (even if it may not look so) end-to-end trainable pipeline, consisting of a single multimodal Transformer model. It is free of text-related inductive-bias components and requires no additional mask-refinement post-processing steps. (3/n) [Figure: a detailed overview of MTTR]
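As a very rough structural sketch of what "a single multimodal Transformer" over video and text tokens could look like (toy dimensions and heads, not MTTR's actual architecture; see the arXiv paper for the real design):

```python
import torch
import torch.nn as nn

class TinyMultimodalTransformer(nn.Module):
    """Toy stand-in: jointly encode per-frame visual tokens and text tokens
    with one Transformer encoder, then predict a coarse mask logit per visual token."""
    def __init__(self, d=256, n_heads=8, n_layers=4, vocab=30522):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, d)
        self.frame_proj = nn.Linear(3 * 16 * 16, d)   # flattened 16x16 patches -> tokens
        layer = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mask_head = nn.Linear(d, 1)              # per visual token mask logit

    def forward(self, frame_patches, text_ids):
        # frame_patches: (B, num_visual_tokens, 768); text_ids: (B, L)
        vis = self.frame_proj(frame_patches)
        txt = self.text_embed(text_ids)
        tokens = torch.cat([vis, txt], dim=1)         # one joint multimodal sequence
        out = self.encoder(tokens)
        vis_out = out[:, : vis.size(1)]               # keep only the visual tokens
        return self.mask_head(vis_out)                # (B, num_visual_tokens, 1)

model = TinyMultimodalTransformer()
logits = model(torch.randn(2, 2 * 16 * 16, 768), torch.randint(0, 30522, (2, 12)))
print(logits.shape)
```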
Read 22 tweets
