💾 Archived View for freeside.wntrmute.net › log › 2021 › 20210602.gmi captured on 2023-09-28 at 15:47:55. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

Classic CS problems in Java, chapter 2 writeup

2021-06-02 14:25 PDT

Chapter 2

This chapter covers basic search technqiues. Search is a really broad term: "find an answer." I guess it's kind of like computation in that regard.

Searching lists

The most basic search problem is finding an element in a sequence of items. The basic approach is called a linear search: look at each item in the sequence and see if it's what we're looking for. This requires that the items in the sequence have an equality test defined. To compare the different items, we'll use a list of the first couple Fibonacci values:



search_list = [0, 1, 2, 3, 5, 8, 13, 21]

def linear_search(lst, key):
    for i in range(len(lst)):
        if lst[i] == key:
            return True
    return False

It's pretty straightforward:



linear_search(search_list, 3)
lst[0] != 3
lst[1] != 3
lst[2] != 3
lst[3] == 3

It takes four steps to find an early value. This isn't great but if the sequence isn't sorted, there's not much we can do. This has a linear worst case, e.g. O(n). If we look for a value that's not in the sequence, we have to look at each element in the list to determine that it's not present.

However, if the sequence is sorted, we can use a binary search. This requires that the keys are comparable, that is we have to have some way of checking whether an item is less than another. It works by defining a search window and checking the middle item in that window. If it's the target, great. If not, we slide the window based on whether our key is greater or less than the middle.



def binary_search(lst, key):
    low = 0
    hi = len(lst)
    mid = (low + hi) / 2
    while lst[mid] != key:
        if low == hi:
            return False
        if lst[mid] < key:
            low = mid+1
        else:
            hi = mid-1
        mid = (low + hi) / 2

binary_search(lst, key=3)

  lst = 0 1 2 3 5 8 13 21
index = 0 1 2 3 4 5  6  7

low = 0
high = 8
middle = (low + high) / 2 = 4

lst[4] = 5

Since the value of lst[4] is greater than our key, we need to search the lower part of the list.



low = 0
high = 3
middle = (low + high) / 2 = 2

lst[2] = 2

Because the value of lst[2] is less than our key, we have to increase the lower bound.



low = 3
high = 3
middle = (low + high) / 2 = 3

lst[3] = 3

We found the search key in 3 steps. Now, if we wanted to look for 13, it would take 7 steps for the linear search. How much faster is binary search?



   low   high   mid   lst[mid]
1   0     8      4      5
2   5     8      7      21
3   5     7      6      13

Three steps to find an element at the end of this list.

If it's a one-off search on an ad-hoc list, the linear search is probably almost always going to be faster than sorting and binary searching. However, once we search more than a few times, the cost of sorting once (and maybe maintaining a sorted data structure) is amortized by the savings from the binary search speedup.

I want to test this in practice, so I wrote some code to generate a sorted list of a million integers and ~4,000 randomly-chosen set of search keys and included the start and end values. According to JUnit, binary search was almost 50% faster. However, in this first test, all of the keys were found in the search list. I generated another test with a million even integers, then generated a random list of search keys. This means there is roughly a 50% chance that a key will not be in the list. The times were similar, roughly 40 ms for the linear search cases and 20ms for the binary search.

That's the tradeoff: if you are going to search the list several times, it's probably worth it to sort the list and use a binary search. Otherwise, just use a linear search.

Pathfinding

Pathfinding can be thought of like finding your way through a maze. There are three approaches covered in the book, and they're really the three main staples.

In general, pathfinding involves maintaining three data structures: the current node, a frontier list containing nodes to be explored, and a set of explored nodes. There needs to be a way to test whether the current node is the goal, and a way to generate next steps. For the maze, we might see if the current node is the exit, and to generate next steps, we want a list of all the places we can from here ("successor" states). I remember the AI class I took modeled this as a tuple, but in this case we basically define class that has the properties



boolean goalTest(node)
node[] successors()

Depth-first search (DFS) will attempt to follow a given path as far as it can go until it reaches a dead end (or the goal), then backtrack to the previous decision point. It's often faster than its sibling algorithm, breadth-first search, but finds longer paths. It uses a stack for the frontier. On the test maze, here's the path it took:



S*******X 

	**

X  X  X  *
    X*****
X*****  X 

	*
	X X  X
	*X X
	********

      XX G

(S is the starting point, X are blocked off cells, G is the goal, and * is the path).

The path is long at 37 nodes, and the algorithm examined 53 states before finding the goal.

Breadth-first search (BFS) looks one node out from the current node, using a queue instead of a stack. It finds the shortest path but usually takes longer to do so. It's important to note that any time DFS will find a solution, BFS will also find a solution. On the example maze, it found a path of 19 nodes but had to examine 89 states to get there.



S       X 

	*

X* X  X   

	X

X*      X 

	
	X X  X
	X X
	*******

      XX*G

A* is another search algorithm that finds the shortest path and tends to be faster than a breadth-first search. It chooses nodes based on a combination of a cost and a heuristic function, trying to get the lowest possible combination. The cost generally asks how expensive was it to get to the current node; for a maze, it's the number of steps taken to get there. For a car, it might be the number of miles driven so far. The heuristic is a measure of how close the goal is. For a grid search like the maze, the Manhattan distance function is a good heuristic. In order for A* to find an optimal solution, the heuristic has to be *admissible*, that is, it can never overestimate the cost. In the maze solver, lower step counts and nodes that bring us closer to the goal are chosen. The path chosen by A* through the demo maze is also 19 nodes long, but it only examined 58 states to get there.



S       X 

	*****

X  X *X   
    X**   
X     * X 

	**

  X X  X* 
  X X   * 

	

      XX*G

To get a feel for the performance in general, I wrote a function to collect some stats on a series of mazes. I kept generating random mazes and threw out any where DFS couldn't find a solution. Once I'd collected 100 solved mazes, I took the average path length and number of states examined for each algorithm.



       Path      nodes
      length    examined
DFS     31         51
BFS     19         78
 A*     19         49

DFS is the slowest, but is on par with A* for the number of nodes examined.

If we have a good cost function and an admissible heuristic function, A* is almost always the best choice. In fact, it's really only going to be beaten by algorithms that can do some kind of precalculating over the search space. If we don't have these two functions, then we have to choose between faster solutions or shorter solutions. DFS and BFS are also used as the basis for more complicated search algorithms like uniform-cost and backtracking search.

The book used the missionaries and cannibals problem to demonstrate applying pathfinding to problems. It feels more suited to a constraint solving system, but that might be a wrong intuition.

The missionaries and cannibals problem

Java notes

The Collections interface has a lot of useful tools for collections of data, including linear and binary search.
The Stack class uses the standard push and pop methods.
The Queue interface replaces push and pop with offer and poll, not enqueue and dequeue like you sometimes see elsewhere.
Queue is an abstract interface; LinkedList is a common implementation class.
PriorityQueue requires Comparable elements.
The Set interface is abstract; HashSet is a common implementation class.