just pushed my first multi-turn RL environment to @PrimeIntellect
the setup: the model gets the story title + question from QuALITY (long stories, multiple-choice questions).
its only tool: agentic RAG search over the story.
this is an idea I have been toying with for a while but hadn’t gotten around to building. I had a paper last year about a twist on a RAG method and primarily experimented on this dataset.
i really like this dataset; it’s sort of harder-to-read short stories, and the questions really require (imo) a good and subtle understanding of the story.
so I liked the idea of building an agentic RAG system over this dataset - each story gets chunked up and embedded using OpenAI’s embeddings - then the agent gets to choose the query to embed and search.
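a rough sketch of that chunk → embed → search loop, in case it helps picture it. this is not the environment's actual code: `chunk_story`, `embed`, and `StoryIndex` are hypothetical names, and a toy bag-of-words vector stands in for the OpenAI embeddings so the snippet runs offline.

```python
import math
import re
from collections import Counter


def chunk_story(text, chunk_words=9):
    """Split a story into small fixed-size word chunks (size is an assumption)."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words])
            for i in range(0, len(words), chunk_words)]


def embed(text):
    """Stand-in embedding: lowercase bag-of-words counts.

    The real setup uses OpenAI embeddings; this toy version only
    keeps the example self-contained.
    """
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class StoryIndex:
    """Holds the chunks and their vectors for one story."""

    def __init__(self, story, chunk_words=9):
        self.chunks = chunk_story(story, chunk_words)
        self.vectors = [embed(chunk) for chunk in self.chunks]

    def search(self, query, k=2):
        """The single tool the agent calls: embed its query, return top-k chunks."""
        query_vec = embed(query)
        scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(self.vectors)]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [self.chunks[i] for _, i in scored[:k]]


story = ("The captain locked the airlock before the storm arrived. "
         "Nobody on the station trusted the new engineer after that.")
index = StoryIndex(story)
top_chunks = index.search("who locked the airlock", k=1)
```

the agent never sees the full story, only whatever chunks its own queries pull back, so the wording of each query matters a lot.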
the chunks are very small, so it’s a pretty difficult task. But I think learning here would require really reasoning about the question and the structure of this kind of writing.
thanks to @PrimeIntellect for building this and @willccbb for the invite! I think this is such an incredible initiative that can go in so many exciting directions. Looking forward to publishing a lot more RL environments and building agi together :) !
introducing qqWen: our fully open-sourced project (code+weights+data+detailed technical report) for full-stack finetuning (pretrain+SFT+RL) of a series of models (1.5b, 3b, 7b, 14b & 32b) for a niche financial programming language called Q
Again I really like this idea - for most practical agentic work I have done, you almost always just want to use a big API model - it works the best, and is the quickest way to get a good prototype
the predicted context embedding is fed into the frozen network, which (with sampling) generates reasoning chains as normal, which then get scored, and the gradient is computed in the normal way
i’m particularly excited about this project - the other ones felt fun, but exploratory - this feels like pulling everything together into a single framework to produce an end model that could be really useful.
I built the dataset by taking Cityscapes + segmentation masks, gridding each image, and labeling any square as positive if >10% of its pixels were cars or motorcycles.
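the labeling rule is roughly this. a minimal sketch, not the actual build script: `label_grid` and the cell size are hypothetical, the mask is a plain list of rows of class ids, and the ids 26 (car) and 32 (motorcycle) assume the standard Cityscapes labelIds encoding.

```python
# Assumed Cityscapes labelIds: 26 = car, 32 = motorcycle.
VEHICLE_IDS = frozenset({26, 32})


def label_grid(mask, cell, positive_ids=VEHICLE_IDS, threshold=0.10):
    """Return a 2D list of booleans, one per full grid square.

    A square is positive when strictly more than `threshold` of its
    pixels belong to one of `positive_ids`. Partial squares at the
    right and bottom edges are dropped (an assumption).
    """
    height, width = len(mask), len(mask[0])
    labels = []
    for top in range(0, height - cell + 1, cell):
        row = []
        for left in range(0, width - cell + 1, cell):
            hits = sum(
                mask[y][x] in positive_ids
                for y in range(top, top + cell)
                for x in range(left, left + cell)
            )
            row.append(hits / (cell * cell) > threshold)
        labels.append(row)
    return labels


# 4x4 toy mask: one car pixel in the top-left 2x2 square (25% > 10%).
toy_mask = [
    [26, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
toy_labels = label_grid(toy_mask, cell=2)
```

note the comparison is strict, so a square with exactly 10% vehicle pixels stays negative.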