💾 Archived View for yujiri.xyz › software › guide › data-structures.gmi captured on 2023-06-16 at 16:56:15. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

Learn programming for good instead of profit

Data structures

A data structure is a way of storing a complex piece of data. For something simple like a number, you just store it somewhere, usually on the stack. But what if you have something more complex, like a list of things that needs to be able to change size, or a mapping between two sets of data, like a list of words and the definition of each one?

Prerequisite: null

One thing first. So you've heard about pointers: integer values that are used to identify a memory address instead of keeping track of a number. A null pointer is a pointer whose value is 0. For the purposes of userland programs (that is, not kernels), there's never meaningful data stored at address 0, so null pointers are used to represent the lack of a value. Some of these data structures use null pointers to do things like mark the end of a list.

Array

The array is the simplest data structure. It's for storing a list of things in a specific order, and its strategy is "just put them all in a row in memory", so if the first item is at address N, the second item is at address N + 1*S, and the third item is at N + 2*S (where S is the size of one of the items).

What if the items don't all have the same length? Then you have to store them behind pointers: instead of an array of items, you'd have an array of pointers to items, with the items themselves probably stored in random locations on the heap. Pointers have a fixed length (generally 8 bytes on modern computers).

In low-level languages like C, Zig, and Rust, the term "array" is reserved for stack-allocated, fixed-size arrays. They call heap-allocated, growable arrays something else: ArrayList in Zig, Vec in Rust. Higher-level languages tend to use the term "array" more loosely.

Linked list

The linked list is another data structure for storing a list of things in a specific order, but it works very differently from an array. Instead of all the items being in a row, they can be wherever the heck, and next to each item is a pointer to the next one.

So, traversing a linked list works like this:

Start at the first item
Look at the pointer stored next to it
Follow that pointer to get to the second item
Look at the pointer stored next to it
... until you find that one of the pointers is null, which means you're at the end of the list.

It probably seems like a linked list is just more convoluted than an array, and also takes up more memory (since each item also has to have a pointer stored with it). It has other disadvantages on top of that: to get to the last item, you have to traverse the entire thing following a bunch of pointers. With an array, you can get to the last item by just going to address N + CS, where N is the address of the first one, S is the size of the individual items, and C is the number of items in the array. So why would anyone use a linked list?

Well, a linked list can be much faster than an array at certain operations. For example, inserting or removing an item at an arbitrary position. Imagine you want to insert a new item near the beginning of a long array: you first have to move *every subsequent item* in the array to make room. Likewise if you delete an item near the beginning, you have to move every subsequent item backward to collapse the empty space.

To add an item to a linked list, at any position, you only have to put a pointer to the next item beside it, and update the previous item's next-item pointer to point to the new item. To remove an item, at any position, all you have to do is update the previous item's next-item pointer to point to the next one, skipping over the deleted item.

Your first impression was right though: usually, the cons outweigh the pros, and so linked lists are rarely used in modern programming.

Hash map

This is one of the more complex ones, but also the most important (after the humble array).

A hash map is a different kind of data structure than the ones above. Instead of just a list of things, it stores a list of "keys", and a corresponding "value" for each one. You can think of it like a dictionary, where the keys are words and their values are their definitions. (Some languages, like Python, actually call hash maps dictionaries because of this analogy.)

A hash map works by using a "hash function" that computes a "hash" for each key. This hash is used to determine where the value for that key is stored. Therefore, to find the value for a given key, you just compute the hash of your key, then go to the resulting location and read the value. Note three important properties of this design:

The time it takes to look up a value in a hash map doesn't depend on how many items are in the hash map. With an array or linked list, it does: you have to check every item in series to see if it's the one you're looking for.

Automatic deduplication: if you insert a value for a key that's already in there, it'll overwrite the old value instead of creating a duplicate entry.

Hash maps generally don't preserve order. If you need the items to be in a specific order, you probably don't want to use a hash map.

Others

There are tons of other special-purpose data structures like trees, ring buffers, and bloom filters. But almost all the time, you'll just use arrays and the occasional hash map. I don't encourage you to bother learning others until you run into them.

contact

subscribe via RSS