Does the non-deterministic nature of property-based testing hurt build repeatability?

I am learning FP and was introduced to the concept of property-based testing (PBT). To someone coming from the OOP world, PBT looks both useful and dangerous. It does check a lot of inputs, but what if one (or a few) of those inputs would fail, yet didn't come up during your first, let's say, Jenkins build? The next time you run the build, the test may or may not fail. Doesn't that kill the entire idea of repeatable builds?

I see that some people have explored options to make the tests deterministic, but then if such a test doesn't catch an error, it will never catch it.

So what's the better approach here? Do we sacrifice build repeatability to eventually uncover a bug, or do we take the risk of never uncovering it but get our repeatability back?

(I hope that I have properly understood the concept of PBT, but if I haven't, I would appreciate it if somebody could point out my misconceptions.)

Doing a lot of property-based testing I don't see indeterminism as a big problem. I basically experience three types of it:

  1. A property is really indeterministic because some external factor (e.g. a timeout, a delay, DB config) makes it so. Such flaky tests also show up in example-based testing and should be eliminated by making the external factor deterministic.

  2. A property fails rarely because the triggering condition is only sometimes met by pseudo-random data generation. Most PBT libraries have ways to reproduce those failing runs, e.g. by re-using the random seed of the failing test run or even by remembering the exact failing inputs in a database of some sort. Those failures reveal real problems and are one of the reasons why we're doing random test case generation in the first place.

  3. Coverage assertions ("this condition will be hit in at least 5 percent of all cases") may fail from time to time even though they are generally true. This can be mitigated by raising the number of tries. Some libraries, e.g. QuickCheck, do their own calculation of how many tries are needed to confirm or reject coverage assumptions and thereby mostly eliminate those false positives.
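
To make the second case concrete, here is a minimal sketch of seed-based replay using only Python's standard library. It is not how jqwik or QuickCheck implement it; `check_property`, `gen_number` and `prop_not_multiple_of_97` are hypothetical names, and the property is just a stand-in for a condition that random data hits only occasionally.

```python
import random

def check_property(prop, gen, tries=100, seed=None):
    """Run `prop` on `tries` generated inputs.

    Returns None if every input passed, otherwise (seed, failing_input)
    so that the exact same run can be replayed later.
    """
    if seed is None:
        # No seed given: pick a fresh one, as a normal CI run would.
        seed = random.SystemRandom().randrange(2**32)
    rng = random.Random(seed)  # all randomness flows from this one seed
    for _ in range(tries):
        value = gen(rng)
        if not prop(value):
            return seed, value
    return None

# Hypothetical generator and property: the property fails only for the
# roughly 1% of inputs that are multiples of 97.
def gen_number(rng):
    return rng.randrange(1, 10_000)

def prop_not_multiple_of_97(n):
    return n % 97 != 0

if __name__ == "__main__":
    result = check_property(prop_not_multiple_of_97, gen_number)
    if result is None:
        print("passed this time")
    else:
        seed, value = result
        print(f"failed for {value}; replay with seed={seed}")
        # Replaying with the reported seed reproduces the same failure.
        assert check_property(prop_not_multiple_of_97, gen_number, seed=seed) == result
```

The point of the design is that the seed is the only source of randomness, so logging it on failure is enough to turn a sporadic failure into a repeatable one.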

The important thing is to always follow up on flaky failures and find the bug, the indeterministic external factor, or the wrong assumption in the property's invariant. When you do that, sporadic failures will occur less and less often. My personal experience is mostly with jqwik, but other people have told me similar stories.

You can have both non-determinism and reproducible builds by generating the randomness outside the build process. You could generate it during development or during external testing.

One example would be to seed your property-based tests and to automatically modify this seed on commit. You're still making a tradeoff: a developer could be alerted to a bug unrelated to what they're working on, and you lose some test capacity since the tests might change less often.

You can tip the tradeoff further in the deterministic direction by making the seed change less often. You could for example have one seed for each program component or file, and only change it when a related file is committed.
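
One possible way to get a per-component seed that only changes when that component changes is to derive it from the component's contents. This is a sketch under that assumption, not a feature of any particular PBT library; `seed_for` and the component names are hypothetical.

```python
import hashlib
import random

def seed_for(component: str, content: bytes) -> int:
    """Derive a stable seed from a component's name and file contents.

    The seed stays the same across builds until the content changes,
    so a build of an unchanged component is exactly repeatable.
    """
    digest = hashlib.sha256(component.encode("utf-8") + b"\0" + content).digest()
    return int.from_bytes(digest[:8], "big")

if __name__ == "__main__":
    # Hypothetical component: in a real build, `content` would be read
    # from the source files that belong to the component.
    old_seed = seed_for("parser", b"def parse(s): ...")
    new_seed = seed_for("parser", b"def parse(s): return s.split()")
    print(old_seed != new_seed)  # a commit touching the parser changes the seed
    rng = random.Random(new_seed)  # feed this generator to the property tests
```

Hashing the contents is only one choice; a counter bumped by a commit hook would give the same tradeoff.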

A different approach would be to not change the seed during development at all. You would instead have automatic QA doing periodic or continuous testing with random seeds and use them to generate bug reports/issues that can be dealt with when convenient.
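
A rough sketch of what such a QA job might look like, again with only the standard library. The names `sweep`, `buggy_max`, `gen_ints` and `prop_max` are hypothetical, and a real job would file an issue instead of printing.

```python
import random

def sweep(prop, gen, seeds, tries=20):
    """Run the property once per seed and collect one report per failing seed."""
    reports = []
    for seed in seeds:
        rng = random.Random(seed)
        for _ in range(tries):
            value = gen(rng)
            if not prop(value):
                # The seed and input are all a developer needs to reproduce it.
                reports.append({"seed": seed, "input": value})
                break
    return reports

# Hypothetical code under test: wrong when every element is negative.
def buggy_max(xs):
    best = 0
    for x in xs:
        best = max(best, x)
    return best

def gen_ints(rng):
    return [rng.randint(-5, 5) for _ in range(rng.randint(1, 4))]

def prop_max(xs):
    return buggy_max(xs) == max(xs)

if __name__ == "__main__":
    # Fresh seeds for every QA run; the development build keeps its fixed seed.
    fresh = [random.SystemRandom().randrange(2**32) for _ in range(10)]
    for report in sweep(prop_max, gen_ints, fresh):
        print("bug report:", report)
```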

johanneslink's analysis of non-determinism is spot on.

There's one thing I would like to add: non-determinism is not only a rare and small cost, it's also beneficial. If the first run of your test suite is successful, insisting on determinism means insisting that future runs (of the same suite against the same system) will find zero bugs.

Usually most test suites contain many independent tests of many independent system parts, and commits rarely change large parts of the system. So even across commits, most tests test exactly the same thing before and after, where once again determinism guarantees that you will find zero bugs.

Allowing for randomness means every run has at least a chance of discovering a bug.
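
A back-of-the-envelope calculation illustrates the point. It assumes, purely for illustration, that each run with a fresh seed independently has the same probability `p` of hitting a given bug; `chance_found` is a hypothetical helper.

```python
def chance_found(p: float, runs: int) -> float:
    """Probability that at least one of `runs` independent runs finds the bug,
    if each run finds it with probability `p`."""
    return 1.0 - (1.0 - p) ** runs

if __name__ == "__main__":
    # A bug that a single run finds only 1% of the time:
    for runs in (1, 10, 100):
        print(runs, round(chance_found(0.01, runs), 3))
    # With a fixed seed, a first run that passed will pass on every repeat
    # against the same system, so the chance of finding the bug stays at zero.
```

Under that assumption, a bug found in only 1% of runs is found with a probability of about 63% within 100 runs.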

That of course raises the question of regression tests. I think the standard argument is something like this: to maximize value per effort you should focus your testing on the most bug-prone parts of the code. Having observed a bug in the past provides evidence about which part of the code is buggy (and which kind of bug it's likely to have). You should use that evidence to guide your testing effort. (Often with a laser-like focus on one concrete bug.)

I think this is a very reasonable argument. I also think there's more than one way of making good use of the evidence provided by bugs.

For example, you might write a generator which produces data of the same kind and shape as the data which triggered the bug the first time, and/or which is tailor made to trigger the bug.

And/or, you might want to write tests verifying specifically those properties that were violated by the buggy behavior.

If you want to judge how good these tests are, I recommend running them a couple of times (on normally sized input batches). If they trigger the bug every time, they're likely to do so in the future as well.
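
As a sketch of both ideas, a tailored generator and a repeated evaluation of it, consider a hypothetical leap-year function that forgot the century rule. All names here are illustrative; `calendar.isleap` from the standard library serves as the reference.

```python
import calendar
import random

def buggy_is_leap(year: int) -> bool:
    # Hypothetical past bug: ignores the rule for century years like 1900.
    return year % 4 == 0

def gen_century_year(rng: random.Random) -> int:
    # Tailored generator: only years shaped like the one that exposed the bug.
    return 100 * rng.randint(1, 99)

def run_catches_bug(seed: int, tries: int = 100) -> bool:
    """One test run: does any generated year expose the bug?"""
    rng = random.Random(seed)
    for _ in range(tries):
        year = gen_century_year(rng)
        if buggy_is_leap(year) != calendar.isleap(year):
            return True
    return False

def catch_rate(seeds) -> float:
    """Fraction of runs (one per seed) that re-trigger the known bug."""
    seeds = list(seeds)
    return sum(run_catches_bug(s) for s in seeds) / len(seeds)

if __name__ == "__main__":
    # Evaluate the regression property over a handful of independent runs.
    print(catch_rate(range(5)))
```

About three out of four century years expose this bug, so a run of 100 tries misses it only with a vanishingly small probability. A general generator over all years would hit a failing year far less often per try.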

Here's a (hopefully thought-)provoking question: is it worse to release software which has a bug it has had before, or to release software with new bugs? In other words: is catching past bugs more important than catching new ones, or do we do it primarily because it's easier?

If you think we do it in part because it's easier, then I don't think it matters that re-catching the bug is probabilistic. What you should really care about is something like the average bug-catching ability of property testing. Its benefits elsewhere should outweigh the fairly small chance that an old bug squeaks through, even though it got caught in (say) 5 consecutive runs of the tests when you evaluated your regression tests.

Now, if you can't reliably generate random inputs that trigger the bug even though you understand the bug just fine, or the generator which does it is large and complicated and thus costly to maintain, hand-picking a regression example seems like a perfectly reasonable choice.

