(1/8) *new paper* “LLMs can self-improve”
w/ *self-generated CoTs* (“logical dark knowledge”), no GT labels:
- SoTA (74.4%->82.1% GSM8K, 90.0%->94.4% OpenBookQA, 63.4%->67.9% ANLI-A3) by fine-tuning
- SoTA “zero-shot” (GSM8K 70.1% -> 74.2%) by prompting arxiv.org/abs/2210.11610
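The recipe, roughly (a hedged sketch, not the paper's code; `sample_cot` / `extract_answer` are placeholder callables you'd supply): sample CoT paths, majority-vote the answer via self-consistency, keep the rationales that agree with the vote as pseudo-labels, then fine-tune on them.

```python
# Rough sketch of the self-improvement loop described above (my paraphrase,
# not the paper's implementation).
from collections import Counter

def build_pseudo_labeled_cots(questions, sample_cot, extract_answer,
                              n_samples=32, min_vote_share=0.5):
    """sample_cot(q) -> one sampled chain-of-thought string (user-supplied);
    extract_answer(cot) -> final answer parsed from a CoT (user-supplied)."""
    data = []
    for q in questions:
        cots = [sample_cot(q) for _ in range(n_samples)]
        answers = [extract_answer(c) for c in cots]
        top_answer, votes = Counter(answers).most_common(1)[0]
        if votes / n_samples >= min_vote_share:
            # Keep only the rationales whose answer matches the majority vote.
            data.extend((q, c) for c, a in zip(cots, answers) if a == top_answer)
    return data  # fine-tune the same LM on these (question, CoT) pairs
```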
(2/8) inspiration #1: I like analogies. When @kojima_tks @yusuke_iwasawa_ shared the initial “step by step” results, my reaction was that it’s (1) the “unreal engine” trick of NLP, (2) the temperature trick in distillation arxiv.org/abs/1503.02531 @geoffreyhinton, so we called it “logical dark knowledge”😃
(3/8) inspiration #2: CoT+self-consistency arxiv.org/abs/2203.11171 was used everywhere. Most impressive to me was its calibration. The voting distribution is *very* well calibrated: monotonic & sometimes even under-confident! e.g. when it predicts with 70%+ confidence, it’s correct ~99% of the time!
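If you want to check this yourself, a minimal sketch (assuming you've already collected per-question vote shares and 0/1 correctness flags):

```python
# Reliability-curve sketch: bin questions by self-consistency vote share
# (used as confidence) and compare against empirical accuracy per bin.
import numpy as np

def reliability_curve(vote_share, correct, n_bins=10):
    vote_share, correct = np.asarray(vote_share), np.asarray(correct)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.clip(np.digitize(vote_share, edges) - 1, 0, n_bins - 1)
    curve = []
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            curve.append((vote_share[mask].mean(), correct[mask].mean()))
    return curve  # list of (mean confidence, empirical accuracy) pairs
```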
- 8 months or so to build humanoids from scratch: two iterations. Far from Boston Dynamics in locomotion, and far from human bi-dexterous manipulation, but given the 8-month window, the results were amazing. Nicely leveraged as much of the self-driving pipeline + Dojo compute as possible. 2/
- "generalist" conditional occupancy network: a single "big" network which outputs both voxels and semantics from images. Trained on LARGE dataset from auto labeling. Given where conditional/generative NeRF/OccNets are in academia (arxiv.org/abs/2209.10684), blown away by scale 3/
Can pre-trained language models be used for offline RL? We look to answer this question in our new work and demonstrate SoTA-level performance on various offline RL benchmarks when adapting pre-trained LMs for RL 🤯
We look at adapting pre-trained language models (e.g. GPT2) and image models (e.g. ImageGPT) for Decision Transformer in offline RL and show consistent improvement in performance over all strong baselines, e.g. DT, TD3+BC, CQL: 2/
Interestingly, we find that the vision init does not converge, whereas even a small pre-trained language model, ChibiT (チビ means small or mini in Japanese 😆), trained on Wiki, improves over DT and is comparable to GPT2. Perhaps there are some similarities between RL trajectories & language 🤔 3/
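A minimal sketch of the setup (illustrative only, not our exact code; the class name and shapes here are mine): swap DT's randomly-initialized transformer for a pre-trained GPT2 backbone from HuggingFace, keeping DT's (return-to-go, state, action) token embeddings.

```python
# Illustrative sketch: a Decision-Transformer-style model whose backbone is
# initialized from a pre-trained GPT2.
import torch
import torch.nn as nn
from transformers import GPT2Model

class LMInitDecisionTransformer(nn.Module):
    def __init__(self, state_dim, act_dim, lm_name="gpt2"):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained(lm_name)  # pre-trained LM weights
        h = self.backbone.config.n_embd
        self.embed_rtg = nn.Linear(1, h)            # return-to-go token
        self.embed_state = nn.Linear(state_dim, h)
        self.embed_action = nn.Linear(act_dim, h)
        self.predict_action = nn.Linear(h, act_dim)

    def forward(self, rtg, states, actions):
        # rtg: (B, T, 1), states: (B, T, state_dim), actions: (B, T, act_dim)
        B, T, _ = states.shape
        tokens = torch.stack(
            [self.embed_rtg(rtg), self.embed_state(states), self.embed_action(actions)],
            dim=2,
        ).reshape(B, 3 * T, -1)                     # (R_1, s_1, a_1, R_2, s_2, a_2, ...)
        hidden = self.backbone(inputs_embeds=tokens).last_hidden_state
        # Predict a_t from the hidden state at each s_t token (positions 1, 4, 7, ...).
        return self.predict_action(hidden[:, 1::3])
```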
If overwhelmed by the # of papers in *offline* RL, check out our @NeurIPSConf Spotlight with Scott Fujimoto: we show how a few-line change to TD3 (TD3+BC) can be competitive with SoTA algorithms while halving training time. Inspired by #minimalism #zen #konmari arxiv.org/abs/2106.06860
We propose "BC as a regularizer", which adds negligible compute cost to the original TD3 objective but makes it quite performant on offline RL.
For the table, we followed the "algorithm" vs. "implementation" separation suggested in our other NeurIPS paper
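Concretely, the "few lines" are roughly this (a sketch paraphrasing the actor update; `actor`/`critic` are the usual TD3 networks and the batch comes from the offline dataset):

```python
# TD3+BC actor loss sketch: TD3's deterministic policy gradient term plus a
# behavior-cloning (MSE to dataset actions) regularizer, with the adaptive
# weight lambda = alpha / mean|Q| (alpha = 2.5 in the paper).
import torch
import torch.nn.functional as F

def td3_bc_actor_loss(actor, critic, states, actions, alpha=2.5):
    pi = actor(states)
    q = critic(states, pi)
    lam = alpha / q.abs().mean().detach()   # normalize for the scale of Q
    return -lam * q.mean() + F.mse_loss(pi, actions)
```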
Toy MuJoCo + Box2d envs in OpenAI Gym are moving to #brax! 100x GPU/TPU speedup + purely pythonic + jax/pytorch-enabled, ready to be unleashed! Exciting news for the #brax #braxlines #jax teams. Also check out #composer, where I am adding more demos github.com/openai/gym/iss…
#brax still cannot (and probably won't ever) match the full specs of mujoco/pybullet. But especially with the open-sourcing plans for mujoco, excited to see where the synergies could be.
Good to see a lot of large-scale, algorithmic deep RL researchers are aligned: "I personally believe that hardware accelerator support is more important, hence choosing Brax."
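For reference, a minimal sketch of the jax-side workflow behind that speedup (brax's env API at the time; exact names may differ across versions):

```python
# Sketch: run a Brax env end-to-end under jax.jit on GPU/TPU.
import jax
import jax.numpy as jnp
from brax import envs

env = envs.create(env_name="ant")      # MuJoCo-like env, written in pure JAX
reset_fn = jax.jit(env.reset)          # jit-compile reset/step for the speedup
step_fn = jax.jit(env.step)

state = reset_fn(rng=jax.random.PRNGKey(0))
for _ in range(100):
    action = jnp.zeros(env.action_size)  # plug a policy in here
    state = step_fn(state, action)
```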