Thread by @Pier4r on Thread Reader App

So there are increasingly arguments for "no human reviews the code", provided that the agent(s) build the code together with a suite of tests.

I see multiple problems with this approach
1/

- Assuming that the agent(s) can be relatively reliable (say 99.9%), there will still be cases of serious failure where people need to jump in, and good luck finding the problem if there is zero knowledge of the codebase.

2/
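
To put a rough number on the 99.9% figure: a minimal sketch, assuming that 99.9% means the chance that a single agent change is free of serious failure and that changes fail independently of each other (both are simplifying assumptions, not claims from the thread). The function name is made up for illustration.

```python
def prob_at_least_one_failure(per_change_reliability: float, n_changes: int) -> float:
    """Chance that at least one of n independent changes fails seriously.

    Assumes every change succeeds with the same probability and that
    failures are independent, which real codebases will not strictly satisfy.
    """
    if not 0.0 <= per_change_reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if n_changes < 0:
        raise ValueError("n_changes must be non-negative")
    # P(no failure in n changes) = reliability ** n, so take the complement.
    return 1.0 - per_change_reliability ** n_changes


if __name__ == "__main__":
    for n in (100, 1000, 5000):
        p = prob_at_least_one_failure(0.999, n)
        print(f"{n} changes: {p:.1%} chance of at least one serious failure")
```

Under these assumptions, roughly 1000 unreviewed changes already make a serious failure more likely than not, which is when someone with zero knowledge of the codebase has to jump in.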

- Given such a failure, it is also hard to just tell the agent "fix it", because the agent may need additional details (otherwise it would have already fixed it), and those details can come only from someone who is involved in the codebase.

3/

- Then there is the problem of maintainability. What if the agent is doing a good job but is creating a spaghetti codebase that soon the agent itself can no longer develop because of the increased complexity?

4/

We have "taste" for complexity, but we cannot reliably write tests for it. So how do you catch that if you aren't involved in the codebase?

5/

- Then there is the problem (already partially addressed) that tests catch only the problems that fit the tests. What about the rest? As an analogy: what if the agent builds a beautiful aircraft that works great but falls apart after 500 hours of flight?

6/

All the most normal and logical tests can pass, but there is unlikely to be a test for "500 hours of flight". Especially if the agent writes its own tests (since it learned from similar human tests and combinations of those).

7/
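
The aircraft analogy can be turned into a toy sketch. Everything here is hypothetical and only meant to illustrate the point: the `Airframe` class, its 500-hour fatigue limit, and both check functions are invented for this example. The "normal and logical" short test passes, while the defect only shows up in an endurance run that nobody thought to write.

```python
class Airframe:
    """Toy component with a hidden defect that appears only after long use."""

    FATIGUE_LIMIT_HOURS = 500  # assumption borrowed from the thread's analogy

    def __init__(self) -> None:
        self.hours_flown = 0

    def fly(self, hours: int) -> str:
        """Log flight hours; fail once accumulated wear passes the limit."""
        self.hours_flown += hours
        if self.hours_flown > self.FATIGUE_LIMIT_HOURS:
            raise RuntimeError("structural failure")
        return "landed"


def short_test_suite_passes() -> bool:
    """The kind of test that usually gets written: a few short flights."""
    plane = Airframe()
    return all(plane.fly(2) == "landed" for _ in range(10))


def survives_endurance(total_hours: int) -> bool:
    """The test that is usually missing: one airframe flown for many hours."""
    plane = Airframe()
    try:
        for _ in range(total_hours):
            plane.fly(1)
    except RuntimeError:
        return False
    return True


if __name__ == "__main__":
    print("short suite passes:", short_test_suite_passes())
    print("survives 501 hours:", survives_endurance(501))
```

A green short suite says nothing about hour 501. Someone has to know the design well enough to ask for the endurance test in the first place.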

And then there is the problem of deskilling. When we do not use our brain because there is a tool for it (GPS, calculator, whatnot), we tend to lose that skill if we do not exercise it in other ways.

Without the skill, how do we notice problems?

8/

How do we notice that we gave wrong inputs or requirements and garbage comes out? A calculator is almost always correct, but if one mistypes, the wrong answer can come out. How do you notice it if you do not have a sense for numbers?

9/

If you have no sense of orientation or geography, you can type Denver instead of Detroit as the destination in a GPS navigator. How do you notice that this is wrong if you never review anything?

10/

And finally, I think that accountability is also important. If the tool is "self-built", do you want to take responsibility for its failures? Will there be an industry of "professional scapegoats"?

11/

All of this holds at least until the agents are good enough that their reliability (and skill) is incredible. Like Stockfish level (many levels beyond superhuman in chess), but in any field.

In that case we will likely use agents the way we use Stockfish.

12/

We will use Stockfish as an oracle for possible answers (or confirmations/refutations) while still wanting to understand why the answer is the way it is. Hence we still want to involve ourselves in the solution, even if the solution is given to us.

13/13

