Giving coding agents an ability to self-check their own work is one of the most useful techniques I came across. I develop a lot of numerical software in my free time and most models are very helpful here but no model has earned my complete trust. They simply don’t consult with me presence of uncertainty and rely too much on the parametric knowledge instead of consulting external sources. The usual runaround happens when the model states that it knows the answer to some problem, codes the algorithm, and the algorithm looks plausibly correct but doesn’t really work. A lot of debugging follows and the result is a lot of wasted time and frustration.

Because of this, I resorted to following a methodology of supplying the agent with a set of input-input pairs, or test vectors developed in other software. The methodology is simple but offers big leverage. The idea is to use numpy, Mathematica, etc. to compute analytical answers to problems with known convenient values that a valid code must reproduce. Supplying those values to the agent helps it develop better tests. In connection with a proper skill or a plugin like superpowers greatly enhances the effectiveness of the agents. An extension of this idea is to ask the model to use external tools to construct its own tests and stating it explicitly. The idea is that the ground truth answers do not originate from the model, who in this case only acts as a problem constructor and a tool orchestrator during test construction phase. Once the answers are baked in, the agent will keep working on the code until the tests pass.