Do we need RGB to train neural networks? We skip decoding JPEG to RGB, directly feed the encoded JPEG to ViT, and speed up train/eval by up to 39.2%/17.9% without accuracy loss!
JPEG splits an image into small blocks (8×8 pixels). ViT works on patches. That makes ViT a natural match for training directly from JPEG.
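To make the block/patch correspondence concrete, here is a toy sketch, not the actual pipeline. It assumes a grayscale image, 8×8 blocks, and a 16×16 ViT patch (so one token covers a 2×2 group of blocks), and it skips quantization, chroma subsampling, and entropy coding. The helper names `dct_matrix`, `blockwise_dct`, and `to_patch_tokens` are made up for illustration.

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis: row k is the k-th cosine frequency.
    k = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    m[0] /= np.sqrt(2.0)
    return m

def blockwise_dct(img, b=8):
    # Split an (H, W) image into b x b blocks and apply a 2D DCT to each.
    h, w = img.shape
    d = dct_matrix(b)
    blocks = img.reshape(h // b, b, w // b, b).transpose(0, 2, 1, 3)
    return d @ blocks @ d.T  # shape (H/b, W/b, b, b)

def to_patch_tokens(coeffs, blocks_per_side=2):
    # Group neighbouring blocks so each token covers one ViT patch.
    gh, gw, b, _ = coeffs.shape
    s = blocks_per_side
    t = coeffs.reshape(gh // s, s, gw // s, s, b, b).transpose(0, 2, 1, 3, 4, 5)
    return t.reshape((gh // s) * (gw // s), s * s * b * b)

# Stand-in for a decoded luma channel; a real loader would read the
# coefficients straight from the JPEG file instead of recomputing them.
img = np.random.default_rng(0).random((224, 224))
tokens = to_patch_tokens(blockwise_dct(img))
print(tokens.shape)  # (196, 256)
```

Each of the 196 tokens holds the DCT coefficients of four neighbouring blocks, which is the same spatial footprint as a 16×16 RGB patch.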
Data augmentation is vital for training a well-performing model. We augment the JPEG data directly to speed up training, instead of converting to RGB, augmenting, and converting back.
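As one illustration of what augmenting without a round trip through RGB can look like, here is a sketch of a horizontal flip done purely on blockwise DCT coefficients. It is a textbook DCT property, not necessarily one of the augmentations in the pipeline: mirroring a block negates its odd horizontal frequencies. The sketch assumes unquantized float coefficients on a single channel, and the helper names are hypothetical.

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis: row k is the k-th cosine frequency.
    k = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    m[0] /= np.sqrt(2.0)
    return m

def blockwise_dct(img, b=8):
    # (H, W) image -> (H/b, W/b, b, b) per-block DCT coefficients.
    h, w = img.shape
    d = dct_matrix(b)
    blocks = img.reshape(h // b, b, w // b, b).transpose(0, 2, 1, 3)
    return d @ blocks @ d.T

def hflip_dct(coeffs):
    # Reverse the block order along the width, then negate the odd
    # horizontal frequencies (last axis) inside every block.
    signs = (-1.0) ** np.arange(coeffs.shape[-1])
    return coeffs[:, ::-1] * signs

img = np.random.default_rng(0).random((32, 48))
flipped_in_dct = hflip_dct(blockwise_dct(img))
flipped_in_pixels = blockwise_dct(img[:, ::-1])
same = np.allclose(flipped_in_dct, flipped_in_pixels)
print(same)  # True
```

The flip costs one index reversal and one sign multiply, with no inverse DCT or re-encoding step.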
Our ViT-Ti trains up to 39.2% faster and evaluates up to 17.9% faster than its RGB counterpart, without accuracy loss. Our data augmentation pipeline is also up to 93.2% faster than prior work. For more details, please check out our website!