💾 Archived View for capsule.adrianhesketh.com › 2022 › 06 › 14 › linear-to-binary-search captured on 2023-01-29 at 02:29:54. Gemini links have been rewritten to link to archived content
⬅️ Previous capture (2022-07-16)
-=-=-=-=-=-=-
This week, I've been playing around with a parsing library I wrote, working out how how I can rewrite it to make (hopefully sensible) use of Go generics.
It has an Input which maintains an Index, and you can `Peek` characters, but also `Take` characters which moves the `Index` - i.e. moves a pointer through the string. You use it by building up bigger parsers from parsers. Each of them rolls back if it can't parse the whole thing.
type UUID string var hexParser = parse.RuneIn("0123456789abcdefABCDEF") var uuidStringParser = parse.StringFrom( parse.StringFrom(parse.Times(8, hexParser)), parse.Rune('-'), parse.StringFrom(parse.Times(4, hexParser)), parse.Rune('-'), parse.StringFrom(parse.Times(4, hexParser)), parse.Rune('-'), parse.StringFrom(parse.Times(4, hexParser)), parse.Rune('-'), parse.StringFrom(parse.Times(12, hexParser)), ) var uuidParser = parse.Convert(uuidStringParser, func(s string) (UUID, error) { return UUID(s), nil }) func main() { uuid := "123e4567-e89b-12d3-a456-426655440000" input := parse.NewInput(uuid) match, ok, err := uuidParser.Parse(input) fmt.Println("match:", match) fmt.Println("ok:", ok) fmt.Println("err:", err) fmt.Println("pos:", input.Position()) }
I wanted to add a Position feature, so that you can get a more human, line index and column position, not just the byte or character index within the file. So I created an array of the character positions of each line ending in the input string and stored it within the `InputString` struct as `newLines`.
To get the current line and col from a given index, you can read (forwards or backwards) through the array until you get to the right line. That's what I did first.
func (in *InputString) Position() (line, column int) { var previousLineEnd int for lineIndex, lineEnd := range in.newLines { if in.charIndex > previousLineEnd && in.charIndex < lineEnd { return lineIndex + 1, in.charIndex - previousLineEnd } previousLineEnd = lineEnd } return -1, -1 }
Although it was quick to write, I was unhappy that it was a linear search. The speed taken is proportional to your position in the file - i.e. it will get slower for later lines because it has to read through the preceding lines.
A binary search would be much more efficient for large files. Instead of searching each item in the array in order, it will try the item that's in the middle of all the values, immediately discounting half of the range. The search algorithm next tries a midpoint within the remaining values, repeating this process until it finds the right answer.
I thought about writing a binary search myself, but then I found that there's one in the standard library called `sort.Search` - I never thought to look in sort to find that.
func (in *InputString) Position() (line, column int) { lineIndex := sort.Search(len(in.newLines), func(lineIndex int) bool { return in.charIndex <= in.newLines[lineIndex] }) var previousLineEnd int if lineIndex > 0 { previousLineEnd = in.newLines[lineIndex-1] + 1 } return lineIndex, in.charIndex - previousLineEnd }
I wrote a quick benchmark.
package bench import ( "math/rand" "sort" "testing" ) func linearSearch(data []int, value int) (index int) { for i, v := range data { if v > value { return i } } return 0 } func binarySearch(data []int, value int) (index int) { return sort.Search(len(data), func(index int) bool { return data[index] < value }) } func TestLinearAndBinary(t *testing.T) { // Generate some lines. lineIndices := make([]int, 1000) var index int for i := 1; i < len(lineIndices); i++ { index += rand.Intn(200) + 1 lineIndices[i] = index } for i := 0; i < index; i++ { l := linearSearch(lineIndices, index) b := binarySearch(lineIndices, index) if l != b { t.Fatalf("expected l and b to match, got l=%v, b=%v for input %d", l, b, i) } } } func BenchmarkLinearAndBinary(b *testing.B) { // Generate some lines. lineIndices := make([]int, 10000) var index int for i := 1; i < len(lineIndices); i++ { index += rand.Intn(200) + 1 lineIndices[i] = index } data := lineIndices[:10] b.Run("linear: 10", func(b *testing.B) { bench(b, data, linearSearch) }) b.Run("binary: 10", func(b *testing.B) { bench(b, data, binarySearch) }) data = lineIndices[:100] b.Run("linear: 100", func(b *testing.B) { bench(b, data, linearSearch) }) b.Run("binary: 100", func(b *testing.B) { bench(b, data, binarySearch) }) data = lineIndices[:1000] b.Run("linear: 1000", func(b *testing.B) { bench(b, data, linearSearch) }) b.Run("binary: 1000", func(b *testing.B) { bench(b, data, binarySearch) }) data = lineIndices[:10000] b.Run("linear: 10000", func(b *testing.B) { bench(b, data, linearSearch) }) b.Run("binary: 10000", func(b *testing.B) { bench(b, data, binarySearch) }) } func bench(b *testing.B, data []int, f func([]int, int) int) { max := data[len(data)-1] for i := 0; i < b.N; i++ { for index := 0; index < max; index++ { f(data, index) } } }
For very small input sets, it costs more to do the setup and call the function than it saves, but somewhere between 50 and 100 values, it starts paying back.
Given my use case, in most cases, there will be more than 100 lines, so it makes sense to keep it.
BenchmarkLinearAndBinary/linear:_10-10 305419 3745 ns/op BenchmarkLinearAndBinary/binary:_10-10 152731 7827 ns/op BenchmarkLinearAndBinary/linear:_50-10 23444 51182 ns/op BenchmarkLinearAndBinary/binary:_50-10 18067 66441 ns/op BenchmarkLinearAndBinary/linear:_100-10 6616 182412 ns/op BenchmarkLinearAndBinary/binary:_100-10 7824 152947 ns/op BenchmarkLinearAndBinary/linear:_150-10 2707 444944 ns/op BenchmarkLinearAndBinary/binary:_150-10 4393 271768 ns/op BenchmarkLinearAndBinary/linear:_1000-10 68 17096097 ns/op BenchmarkLinearAndBinary/binary:_1000-10 500 2400367 ns/op BenchmarkLinearAndBinary/linear:_10000-10 1 1559057416 ns/op BenchmarkLinearAndBinary/binary:_10000-10 32 35949938 ns/op
Process for creating a React page