I can't speak to how external people validate that stim is working, but I can talk about how I test it internally. Basically, your mental model of the situation should be that if I screw up, things go bad. Every once in a while I have a nightmare about finding a bug on the level of RANDU, invalidating years of research. So I try my best. But I am a single point of failure.
For data, you can look through the bug fixes noted in stim's release notes. The worst bugs so far have been:
- (1.2.1) Fixed the effects of X_ERROR and Z_ERROR being swapped in TableauSimulator (note: this is not the path used when compiling a sampler, which is how it was missed for so long)
- (1.4.0) Fixed error analysis incorrectly handling MR gates operating on the same qubit multiple times.
- (1.5.0) Fixed a bug in the frame simulator where MY (Y basis measurement) acted like MRY (Y basis demolition measurement)
- (1.7.0) Fixed error analysis evaluating overlapping MPP targets in the wrong order, creating bad detector error models
- (1.9.0) Fixed loop folding during error analysis incorrectly folding loops whose observables included measurements from only the last few iterations
- (1.9.0) Worked around the pseudo random number generator state being duplicated when using multiprocessing with start method "fork". Samplers now seed from external entropy when constructed, instead of using entropy acquired at startup (see the sketch after this list for an illustration of the fork pitfall).
- (1.10.0) Fixed files being opened in text mode instead of binary mode, resulting in bad parsing/serialization on Windows due to \n bytes being turned into \r\n and vice versa.
- (1.12.1) Fixed HERALDED_PAULI_CHANNEL_1 targeting fixed indices instead of the given qubits
- (1.14.0) Fixed HERALDED_PAULI_CHANNEL_1 permuting the X/Y/Z error channels
- (1.14.0) Fixed various internal methods not correctly propagating Pauli terms through MXX, MYY, and MZZ instructions that operated on the same qubit more than once
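To illustrate the pitfall behind that 1.9.0 workaround, here is a minimal sketch of how the "fork" start method duplicates random number generator state across child processes. It uses numpy's global RNG rather than stim's internals, and the `sample` helper is just an illustrative name:

```python
# Minimal sketch of the "fork duplicates RNG state" pitfall (illustration only,
# not stim's internals): every forked child inherits a copy of the parent's
# memory, including the global RNG state, so children that never reseed all
# draw the same "random" bits.
import multiprocessing as mp
import numpy as np

def sample(tag):
    # Uses the global RNG state copied from the parent at fork time.
    print(tag, np.random.randint(0, 2, size=8))

if __name__ == "__main__":
    ctx = mp.get_context("fork")  # "fork" is only available on Unix-like systems
    procs = [ctx.Process(target=sample, args=(f"child {i}",)) for i in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # All three children print identical bit strings. Reseeding from OS entropy
    # inside each child (e.g. np.random.default_rng() with no seed) breaks the
    # duplication, which is analogous to stim's samplers now seeding from
    # external entropy when they are constructed.
```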
There are two clear patterns here: newer gates are more likely to have mistakes (often the fix applies to a gate added in the previous version), and non-unitary gates are more likely to have mistakes.
The reason unitary gates are less likely to have mistakes is that they're so easy to cross-check. There are a lot of simulators in stim (a tableau simulator, a graph simulator, a flip/frame simulator, a state vector simulator; not all exposed to python) and there's a lot of data about gates in stim (every gate specifies a hardcoded unitary matrix, tableau, and decomposition). Stim has unit tests checking each simulator's implementation, and each gate's data, against each other.
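As a toy Python-level analogue of that kind of cross-check (not one of stim's actual C++ unit tests), you can verify that two different sampling paths agree on an invariant of the same circuit:

```python
# Toy cross-check sketch: the tableau simulator and the compiled sampler
# (which uses frame simulation under the hood) should both report perfectly
# correlated outcomes for a noiseless Bell pair circuit.
import stim

circuit = stim.Circuit("""
    H 0
    CNOT 0 1
    M 0 1
""")

# Path 1: the tableau simulator.
sim = stim.TableauSimulator()
sim.do(circuit)
m0, m1 = sim.current_measurement_record()
assert m0 == m1

# Path 2: the compiled measurement sampler.
samples = circuit.compile_sampler().sample(shots=1000)
assert all(row[0] == row[1] for row in samples)
```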
A graph simulator is very different from a tableau simulator, so you're likely to make different mistakes when writing them; verifying that they agree on the behavior of circuits is a way of gaining confidence that both are correct. Similarly, a tableau is a very different style of specification of a gate's behavior than a unitary matrix, so if you enter them independently and then verify that code deriving a tableau from a unitary matrix reproduces the tableau you hardcoded from the matrix you hardcoded, that adds confidence they are correct.
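Here is a small sketch of that second kind of check at the Python level (again an illustration, not the internal test suite), assuming a stim version recent enough to have stim.Tableau.from_unitary_matrix:

```python
# Toy gate-data cross-check sketch: a hand-written unitary for the Hadamard
# gate should match the unitary derived from stim's tableau for H (up to
# global phase), and deriving a tableau from that matrix should reproduce
# stim's tableau for H.
import numpy as np
import stim

hand_written_h = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

tableau = stim.Tableau.from_named_gate("H")
derived_unitary = tableau.to_unitary_matrix(endian="little")

def fix_global_phase(u):
    # Rescale so the first non-negligible entry is real and positive.
    flat = u.flatten()
    k = int(np.argmax(np.abs(flat) > 1e-8))
    return u * (abs(flat[k]) / flat[k])

assert np.allclose(fix_global_phase(hand_written_h), fix_global_phase(derived_unitary))
assert stim.Tableau.from_unitary_matrix(hand_written_h, endian="little") == tableau
```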
If you go into stim's code and search for `(const auto &gate : GATE_DATA.items) {` (or variations like `for (const auto &g : GATE_DATA.items) {`) you will find most of these cross-checking tests. There are also tests in stimcirq that compare stim gates to their cirq equivalents. For example, test_frame_simulator_sampling_noisy_gates_agrees_with_cirq_data does statistical tests confirming that cirq and stim's noisy gates apply the same distribution of Pauli errors. There are also a few tests that validate against qiskit, as part of the qasm export functionality.
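For flavor, a heavily simplified version of that kind of statistical check (not the actual stimcirq test, and without cirq involved) looks something like this:

```python
# Toy statistical check sketch: sampling a circuit whose only randomness is an
# X_ERROR(0.2) should flip the measured bit with empirical frequency close to
# 0.2. The stimcirq test does the analogous comparison between stim's and
# cirq's noisy gates.
import stim

shots = 100_000
circuit = stim.Circuit("""
    X_ERROR(0.2) 0
    M 0
""")
samples = circuit.compile_sampler().sample(shots=shots)
empirical = samples[:, 0].mean()

# Allow ~5 standard deviations of slack (sigma = sqrt(p*(1-p)/shots)).
assert abs(empirical - 0.2) < 5 * (0.2 * 0.8 / shots) ** 0.5
```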