The motivation here is how we can actually find and fix LLM-generated buggy code completions
Important problem. I'm not yet following the approach. Seems to involve some kind of mutation testing. I will need to read this. They are talking about why unit tests aren't good oracles, which I buy. But what is their oracle? Still lost on that
It's cool that they're actually catching bugs in LLM-generated code, though. That's important. I'm still unclear about the repair approach and whether I'd trust it, especially since they seem to rely on metrics like BLEU that I don't trust
They did some human evaluation of found bugs, which is definitely good
It'd be good to, at the very least, set a higher standard of testing for these systems, whether or not I buy the repair part. I'm just too worried that automatic repair could introduce bugs into the LLM-generated code that aren't caught by the testing process
Anyways, it's good people are thinking about this problem seriously
Lots of counting examples in the comments where tokenization isn't relevant but the model still messes up. The thing is, counting is easy to teach transformer models, so I don't think the problem in those cases is counting itself.
One thing I find really fun about going to different dojos when I travel is that every place seems to have a slightly different emphasis and style. So everywhere you go you learn something new
Here it seems they spend more time on turnovers and breaking grip, so I learned a lot about that. But they spend less time on throw technique and combos, and the white belts don't do randori at all
So they seem confused when I do not know much about turnovers and breaking grip but feel very comfortable doing randori. It's nice though because I get some extra tricks to bring back home (if I can remember them)
Just thank god people actually care about this problem
Still scares me a bit to use program repair tools on LLM-generated buggy code; it feels like too many layers of possible failure that could mislead the user into false confidence
This paper includes an interesting sample of LLM-generated bugs on which program repair tools currently fail, though
It's weird meeting people I've only known on Twitter and realizing how much better everything is when you actually talk to each other
Like this medium just sucks for finding common ground. It's good for meeting people and feeling entertained and getting the word out and so on, but for finding common ground it really seems to do the opposite, just make everyone sound like a caricature
I guess it is just something I need to keep in mind while I'm on here. Not sure how practical it is but time and time again I'm reminded how much better direct conversation is for certain purposes