💾 Archived View for thrig.me › blog › 2022 › 12 › 23 › test-suite.gmi captured on 2024-07-09 at 00:56:23. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
Suppose we wish to verify that a calculator application returns correct numbers. This will be a very simple calculator, as the verification is the important part, and a real test suite for a real calculator would be too long and too boring. So, our calculator specification:
add the two numbers given as the first two arguments to the program and print that result
This is problematic: what happens if the numbers are larger than 9007199254740992 (2^53, past which double-precision floats can no longer represent every integer), or outgrow 64-bit integers, or whatever the ceiling du jour happens to be? Or when the numbers are too large to fit into ARG_MAX and the program can no longer be run? Ignoring such pesky details, we might implement our adder as follows.
#!/usr/bin/env tclsh8.6
# adder - adds the first two numbers and prints the result
namespace path {::tcl::mathop}
puts [+ [lindex $argv 0] [lindex $argv 1]]
And of course we test this code, manually:
$ ./adder 2 2
4
And since it works we ship it and we are done!
This is how perhaps too much software is developed. The test is no good, as it was done manually--if at all--and perhaps the next CluelessIntern or SeniorArchitect (both are dangerous) will forget to test the program after making their brilliant changes.
Or maybe students will submit an adder program, and the teacher wants to test that the submissions are correct, via an automated test suite--stack 'em deep and teach 'em cheap!
A test suite may not be without unexpected risks:
https://www.foo.be/docs/tpj/issues/vol1_2/tpj0102-0009.html
This particular contest was inspired by Profs. Mitch Resnick and Mike Eisenberg, who in 1987 conducted a three-way Prisoner's Dilemma contest at MIT. Your august editor entered that contest (in which the strategies were Scheme functions instead of Perl subroutines), noticed that the contest rules (there were four) didn't prohibit mutators, and wrote a function that changed its opponents' histories, yielding the highest possible score. I was disqualified, and a fifth rule excluding such strategies was added the next year.
Whoops.
A test suite has to start somewhere; what about using the manual test we used, above? 2 2 + is 4.
Luckily I just so happen to have some code that helps automate program testing, which will spare us the boilerplate of running programs and collecting their output, at the cost of introducing yet more code.
#!/usr/bin/env perl
# test-adder - test that `adder` is correct
use 5.24.0;
use warnings;
use Test2::V0;
use Test::UnixCmdWrap;

my $adder = Test::UnixCmdWrap->new( cmd => './adder' );
$adder->run( args => '2 2', stdout => qr/^4$/ );
done_testing;
And we can now prove (by way of App::Prove) that the adder is correct.
$ prove test-adder
test-adder .. ok
All tests successful.
Files=1, Tests=3, 1 wallclock secs ( 0.01 usr 0.10 sys + 0.22 cusr 0.58 csys = 0.91 CPU)
Result: PASS
And that a buggy adder is buggy (and why it is buggy):
$ prove test-adder
test-adder .. 1/?
# Failed test 'STATUS adder 2 2'
# at test-adder line 9.
# Got: code=42 signal=0 iscore=0
# Expected: code=0 signal=0 iscore=0
# Failed test 'STDOUT adder 2 2'
# at test-adder line 9.
# STDOUT
# Failed test 'STDERR adder 2 2'
# at test-adder line 9.
# STDERR narf
This was produced by
#!/bin/sh
# adder - adds the first two numbers and prints the result
echo >&2 narf
exit 42
where the programmer has helpfully included buggy documentation in addition to bad code. When it rains...
There are some number of programs that implement adder, and a larger number that do not. Programs that pass the test suite fall into both of these sets. Maybe. For example, the following program passes the test suite, but does not implement adder:
echo 4
There is no documentation, and the implicit use of /bin/sh may surprise some, but this is a correct program according to the test suite.
$ cat adder
echo 4
$ prove test-adder
test-adder .. ok
All tests successful.
Files=1, Tests=3, 0 wallclock secs ( 0.02 usr 0.07 sys + 0.16 cusr 0.48 csys = 0.73 CPU)
Result: PASS
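A second or third test case is enough to reject "echo 4". What follows is a minimal sketch in POSIX sh rather than the Perl harness used above; the check_cases function and its table of cases are made up for illustration and are not part of Test::UnixCmdWrap.

```shell
#!/bin/sh
# check-cases - hypothetical sketch, not part of Test::UnixCmdWrap

# check_cases COMMAND - run COMMAND on each "a b want" row of the table
# below and succeed only if every row produces the wanted output.
# The body is a subshell so its variables do not leak into the caller.
check_cases() (
    cmd=$1
    failures=0
    while read -r a b want; do
        # stdin comes from /dev/null so COMMAND cannot eat the table
        got=$("$cmd" "$a" "$b" </dev/null 2>/dev/null) || got="(nonzero exit)"
        if [ "$got" != "$want" ]; then
            echo "not ok - $cmd $a $b: got '$got', want '$want'" >&2
            failures=$((failures + 1))
        fi
    done <<'EOF'
2 2 4
0 0 0
1 2 3
40 2 42
EOF
    [ "$failures" -eq 0 ]
)
```

Calling "check_cases ./adder" should succeed for the Tcl adder and fail on the second row for the "echo 4" impostor. A program that hardcodes all four answers would still pass, so this narrows the set of wrong programs without eliminating it.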
As a test suite becomes larger--perhaps it also needs to check that there is a -h help option, etc--it will more firmly pin down whether an implementation correctly implements an interface. A protocol could also be tested with such a suite: does your software DHCP properly? Lacking such a test suite, there are workarounds in ISC DHCP server code on account of a certain Redmond-based company shipping a buggy DHCP client.
This test suite should be made more robust, especially if there are malicious actors who will game the system. Humans, in other words. Auditing for correct documentation might be tricky; perhaps one could flag scripts that contain no documentation for manual review? And what if the documentation is wrong, or misleading? Another problem is that the test suite may need a reference implementation, hopefully one that is not too buggy. The tests might be static, but sufficiently large so as to make implementing the correct code easier than simply regurgitating known test outputs for given inputs. Maybe one implementation could be used to test another? Or the requirements could be simplified so that invalid inputs are easier to detect, and so that more of the test space can be audited? The CPU usage and runtime of the tests may become a concern; is there a way to perform quick versus deep tests?
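The idea of testing one implementation against another might look something like the following POSIX sh sketch. The names reference_add and check_adder are hypothetical. It assumes that shell arithmetic is a trustworthy reference for small non-negative inputs, and that awk's rand() is random enough for the purpose. The COUNT argument is one crude answer to quick versus deep: a handful of pairs for a quick run, many thousands for a deep one.

```shell
#!/bin/sh
# random-check - hypothetical sketch: test one adder against another

# reference_add A B - the trusted implementation (an assumption: shell
# arithmetic is correct for the small non-negative inputs used here)
reference_add() {
    echo $(( $1 + $2 ))
}

# check_adder COMMAND COUNT [SEED] - compare COMMAND with reference_add on
# COUNT pseudo-random pairs, stopping at the first mismatch. The body is
# a subshell so that "exit" only leaves this function.
check_adder() (
    cmd=$1 count=$2 seed=${3:-1}
    awk -v n="$count" -v seed="$seed" 'BEGIN {
        srand(seed)
        for (i = 0; i < n; i++)
            printf "%d %d\n", int(rand() * 1000000), int(rand() * 1000000)
    }' |
    while read -r a b; do
        want=$(reference_add "$a" "$b")
        # stdin comes from /dev/null so COMMAND cannot eat the input pairs
        got=$("$cmd" "$a" "$b" </dev/null 2>/dev/null) || got="(nonzero exit)"
        if [ "$got" != "$want" ]; then
            echo "not ok - $cmd $a $b: got '$got', want '$want'" >&2
            exit 1
        fi
    done
)
```

For example, "check_adder ./adder 10" would be a quick test and "check_adder ./adder 100000 42" a deep one with a different seed. Regurgitating known outputs no longer works, though the suite now inherits every bug and limit of the reference.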
#!/bin/sh
# adder - a dangerous snake
sleep 2147483647
exec dc -e "$1 $2 + p"
Is this a valid adder implementation?
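Whatever the answer, a test suite probably should not wait 68 years to find out. One way to bound the runtime is a watchdog that kills the program after a time limit. This is a rough sketch in plain POSIX sh; run_limited is a hypothetical helper, not something Test::UnixCmdWrap is known to provide.

```shell
#!/bin/sh
# run-limited - hypothetical sketch of a time limit in plain POSIX sh

# run_limited SECONDS COMMAND [ARGS...] - run COMMAND, sending it SIGTERM
# if it is still running after SECONDS; exits with COMMAND's exit status
# (above 128 when it was killed by the signal). The body is a subshell so
# that "exit" and the background jobs stay local to this function.
run_limited() (
    limit=$1
    shift
    "$@" &
    pid=$!
    # watchdog: its output is discarded so that it does not hold open any
    # pipe the caller may be reading from
    ( sleep "$limit"; kill "$pid" ) >/dev/null 2>&1 &
    watchdog=$!
    status=0
    wait "$pid" || status=$?
    kill "$watchdog" 2>/dev/null || :
    exit "$status"
)
```

With this, "run_limited 5 ./adder 2 2" would fail after roughly five seconds for the snake above instead of hanging. Only the top-level process is signalled, so children it started (such as that sleep) may linger; a more thorough harness would likely kill the whole process group.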
https://metacpan.org/pod/Test::UnixCmdWrap
https://thrig.me/src/Test2-Tools-Command.git
tags #expect #testing #perl #security