The best thing I learned in algorithms class is:
For all realizable n, ln(n) is less than 40.
So when amortizing over a large window (say, a block size of 512), it can be the case that, _for any realizable n_, the logarithmic amortized costs are completely dominated by the constant non-amortized costs (adding an element to a block in this case)... making the operation effectively constant-time.
Practically speaking, it is not good advice to just treat log(n) as if it is constant. While it's true that log(n) is always small in practice, the conclusion should not be to ignore it, but rather to pay attention to the constant factor. And in practice, _usually_ data structures with O(log n) complexity also have a bigger hidden constant. For example, std::unordered_map is much faster in practice than std::map. Of course, this is not strictly correct, it's just a heuristic. Quicksort, with its O(n log n) complexity, is a counter-example to this.
To be clear, that is not the advice I'm giving -- but rather, when your performance looks like _p*log n + q_, if _q_ is much greater than _p/40_ -- that is, the constant term dwarfs the logarithmic term -- then it is safe to consider it constant.
> p*log n + q, if q is much greater than p/40 -- that is, the constant term dwarfs the logarithmic term
I think you meant to say "if q is much greater than p TIMES 40".
Ah good catch, yes you are correct.
That seems like a pretty good argument _for_ treating it as constant though and just shifting your focus to how large the constants actually are.
In a virtual memory system, random access to an array of size n takes O(log n) time, and the constant factors in that O(log n) are also nontrivial. Algorithms that do O(log n) computation with O(log n) independent elements tend to take O(log^2 n) time, while those that do O(log n) computation with O(log n) contiguous elements or O(log n) iterations with O(1) elements still take O(log n) time. If the constant factors are small enough, it can be hard to distinguish the latter two from algorithms doing O(1) computation with O(1) elements.
In practice, for any memory system with caches and limited by the speed of light, random (unpredictable) access to an array of size n takes time much closer to O(sqrt(n)) than O(log(n)). There's an excellent article discussing this that you can search for, and it holds both in empirical tests on modern hardware and in the theoretical physical limit.
That depends on your perspective. If multiple levels of memory hierarchy are relevant (such as when scaling from 1 MiB to 1 GiB), you will see something resembling O(sqrt(n)). If you remain within the same level (e.g. from 1 GiB to 1 TiB), the scaling resembles O(log n) more closely. Or, in other words, it depends on whether you assume that cache size grows with n or is independent of it.
And of course, nothing is important until you’ve profiled your code and measured that it is.
Random fact of the day: The currently best known upper bound for the complexity of the Union-Find algorithm is the inverse Ackermann function, which can be treated as a constant (4) for all remotely practical (and most of the impractical) values of n.
Tarjan's proof of the inverse Ackermann bound is hard to understand; it's much easier to understand the proof of the log*(n) bound, which is also an extremely slowly growing function:
https://people.eecs.berkeley.edu/~vazirani/algorithms/chap5....
where log* is the iterated log function; i.e. the number of times you have to apply log repeatedly until you reach a value of 1 or smaller. For reference, log_2*(10^10000) = 5.
https://en.wikipedia.org/wiki/Iterated_logarithm
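For a concrete feel, here's a tiny sketch of the iterated log (to avoid big-number arithmetic it starts from log2(10^10000) = 10000 * log2(10) and counts that first application by hand):

    #include <cmath>
    #include <cstdio>

    // Iterated log: how many times log2 must be applied before the value drops to <= 1.
    int log_star(double x) {
        int count = 0;
        while (x > 1.0) {
            x = std::log2(x);
            ++count;
        }
        return count;
    }

    int main() {
        // log2(10^10000) = 10000 * log2(10) ~= 33219.3; the "+ 1" accounts for
        // that first log application, done by hand above.
        double after_one_log = 10000 * std::log2(10.0);
        std::printf("log*(10^10000) = %d\n", 1 + log_star(after_one_log));  // prints 5
    }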
I wonder if it is true that everything is O(1) if there is an upper bound? Even for an algorithm with O(n^n) complexity, it is still O(1) if n is bounded, just with an extremely large constant.
n has to be bounded much much smaller, and the constant overhead of the algorithm in question much much larger, for any of those approximations to be valid. The approximation works for log n only because it's not uncommon to have constant overheads which dwarf log n for all plausible n (and only works in such cases!) This is very unlikely to be true even for a linear algorithm.
While I very much appreciate and respect the author's detailed analysis, I am still not 100% convinced without a corresponding benchmark result. If it's really O(log n) vs O(1), it should be very easy to verify in some micro-benchmark.
Instead of that, the author mentioned that:
The tests that I did run, initially the libstdc++ and the libc++ versions have roughly the same performance, but the performance diverges after a number of insertions somewhere between 6,400 and 12,800. After that the runtime of the libc++ version is roughly 2-3 times that of the libstdc++ version. (...) the outcomes of which depend on the compiler version anyway.
This does not seem right.
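Something along these lines would already be enough to probe it (a rough sketch; sizes, warm-up and timing methodology are arbitrary and would need more care in a real benchmark):

    #include <chrono>
    #include <cstdio>
    #include <deque>

    // Time n push_fronts for growing n and check whether ns/op stays flat (O(1))
    // or creeps up with log n.
    int main() {
        for (std::size_t n = 1 << 10; n <= (std::size_t{1} << 24); n <<= 2) {
            std::deque<int> d;
            auto start = std::chrono::steady_clock::now();
            for (std::size_t i = 0; i < n; ++i) d.push_front(static_cast<int>(i));
            auto stop = std::chrono::steady_clock::now();
            double ns = std::chrono::duration<double, std::nano>(stop - start).count();
            std::printf("n = %9zu  ns/push_front = %.2f\n", n, ns / n);
        }
    }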
Here are the results of the benchmark that I did:
It didn't look quite right to me so I didn't post it.
Here's another benchmark that I did where you can clearly see that something has gone wrong:
CppReference documents the time complexity as constant:
https://en.cppreference.com/w/cpp/container/deque/push_front
So does this mean:
- We're talking about different complexities (amortized vs. something else)
- The libc++ implementation isn't standards-conformant
- The analysis in the C++ standard is incorrect
- Something else
None of those, exactly. Generally when complexity is specified in the standard, it's with respect to a specific operation. In this case the standard specifies that it's the number of calls to the constructor of T that must be constant, which means that push_back and push_front cannot copy or move existing elements.
There is never a way to guarantee that the physical amount of time is bounded by O(1) in the worst case. You could always have pathological scenarios or data structures written so that performing a move or copy of a single T could require O(n^n) time for all anyone knows.
Memory allocation in particular is very hard to analyze.
It's not conformant.
To make bullet points on HN, put a blank line between the paragraphs.
Anyone who programs low-latency software knows std::deque is an absolutely horrible-performance deque implementation due to how much it allocates and the pointer chasing involved. In most cases it's better to use a vector-backed deque (ring buffer) for anything performance-sensitive.
That is not true.
Anyone who programs low-latency C++ knows that the libstdc++ implementation is great (which is what 99.9% of people use) while others tend to be less stellar.
It's just a segmented vector. The libstdc++ implementation always allocates one segment even if empty, and while I've seen low-latency guidelines arguing for empty things not allocating, my personal guideline is to always allocate reasonable capacity on construction rather than on first insertion.
A ring buffer is a completely different data structure, it only works for fixed-capacity queues.
>which is what 99.9% of people use
Maybe we have different ideas about what constitutes "low-latency", but in HFT std::deque is rarely used. Much like std::unordered_map, which allocates on every insert, potentially costing up to a microsecond each.
>It's just a segmented vector.
https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USER...
We can look at the source. This "segmented vector" is `_Tp** _M_map`, essentially a vector of vectors. This means potentially twice the pointer-chasing, as we have to do two lookups. It also means an allocation/deallocation each time a new segment is added/destroyed, which leads to more allocation than when using a vector (you first need to allocate the segment, then potentially allocate entries within it).
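For illustration, element access in such a layout boils down to two dereferences (a minimal sketch, not the actual libstdc++ code):

    #include <cstddef>

    // Hypothetical segmented layout: a "map" of pointers to fixed-size blocks.
    // operator[] costs two lookups: one into the map, one into the block.
    template <typename T, std::size_t BlockElems>
    struct segmented_view {
        T**         map;    // pointers to blocks of BlockElems elements each
        std::size_t start;  // offset of element 0 inside the first block

        T& operator[](std::size_t i) const {
            std::size_t j = start + i;
            return map[j / BlockElems][j % BlockElems];  // the "twice the pointer-chasing"
        }
    };

    int main() {
        int block0[4] = {0, 1, 2, 3}, block1[4] = {4, 5, 6, 7};
        int* blocks[] = {block0, block1};
        segmented_view<int, 4> v{blocks, 1};   // logical element 0 is block0[1]
        return v[4] == 5 ? 0 : 1;              // j = 5 -> blocks[1][1]
    }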
>A ring buffer is a completely different data structure, it only works for fixed-capacity queues.
Where possible it's better to use a fixed capacity queue, especially one where the capacity is a power of 2 so wrap-around calculation can be done more efficiently. But the same kind of thing can be done for a dynamic-capacity queue, by resizing the whole backing array when capacity is reached.
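A minimal sketch of the fixed-capacity, power-of-two variant (hypothetical, no overflow checking), just to show the mask trick:

    #include <cstddef>
    #include <vector>

    // Fixed-capacity ring buffer; capacity must be a power of two so that
    // wrap-around is a bit-mask instead of a modulo or a branch.
    template <typename T>
    class ring_buffer {
        std::vector<T> buf;
        std::size_t    head = 0;   // index of the front element
        std::size_t    count = 0;

    public:
        explicit ring_buffer(std::size_t pow2_capacity) : buf(pow2_capacity) {}

        void push_back(const T& v) {           // assumes count < capacity
            buf[(head + count++) & (buf.size() - 1)] = v;
        }

        void push_front(const T& v) {          // assumes count < capacity
            head = (head + buf.size() - 1) & (buf.size() - 1);
            buf[head] = v;
            ++count;
        }

        T& operator[](std::size_t i) { return buf[(head + i) & (buf.size() - 1)]; }
        std::size_t size() const { return count; }
    };

    int main() {
        ring_buffer<int> rb(8);    // capacity: power of two
        rb.push_back(1);
        rb.push_front(0);
        return (rb[0] == 0 && rb[1] == 1) ? 0 : 1;
    }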
I've worked in many HFT shops and that's not my experience.
A lot of those firms have poor C++ code, though; std::deque was typically used in the better ones.
It's not one allocation per element, as said previously it's a segmented vector. It works similarly to virtual memory: fixed-size segments are allocated for contiguous storage, while another parent array points to the segments. So appending/prepending either needs to append/prepend to the last/first segment or to allocate a new segment if full, which is what makes those operations more efficient than on a vector. Yes there is indirection, that's what segmented means. You also get stability of the memory location which is important for many data structures.
The article even tells you how big the segments are in the two implementations being looked at.
Going back to HFT, a similar data structure that's definitely heavily used there is a pool, which has a configurable segment size (potentially variable) and doesn't have order, enabling faster deallocation with a freelist.
I thought a resizable ring buffer was a valid implementation, but maybe I'm thinking of Rust's VecDeque and there's something about the semantics of std::deque that doesn't allow that.
Ohhhh, I think I found the issue with a ring buffer implementation. Would edit my post if I was in the edit window.
resize() doesn't invalidate existing references in std::deque.
I don't think insert at front/back is allowed an amortized bound, it has to be flat constant. But then I don't see how libc++'s is conforming if this article is true, so I'm not entirely sure what's going on there.
How could you make any operation O(1) with resizing?
It is possible to deamortize the O(n) resizing step, meaning you spend O(1) extra time over the O(n) operations leading to the resize, so that when the resize needs to happen, you don't need to spend O(n) time. This assumes that allocation/deallocation takes O(1) time regardless of the buffer size.
Deamortized resizing can work as follows: When inserting into a vector that is half full, allocate a new buffer twice the size and copy two elements into the buffer. For the subsequent insertions, copy two elements every time you insert one element. This guarantees that you are done copying all elements into the new buffer once the old buffer is full, and you only performed a constant number of instructions with each operation.
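A rough sketch of that scheme (push_back only, hypothetical and unoptimized; the point is just that no single insertion does more than a constant amount of copying):

    #include <cstddef>
    #include <memory>
    #include <utility>

    // Deamortized growable array. Once the buffer is half full, a buffer of twice
    // the capacity is allocated and two old elements are migrated on every
    // subsequent insertion, so migration finishes well before the old capacity
    // would have been exhausted. No push_back ever moves more than two old elements.
    template <typename T>
    class deamortized_vector {
        std::size_t cap = 4, size_ = 0;            // capacity of `buf`, total element count
        std::size_t old_count = 0, migrated = 0;   // migration bookkeeping
        std::unique_ptr<T[]> buf{new T[cap]};      // active buffer
        std::unique_ptr<T[]> next;                 // buffer being migrated into, if any

    public:
        void push_back(const T& v) {
            if (!next && 2 * size_ >= cap) {       // half full: start migrating
                next.reset(new T[2 * cap]);
                old_count = size_;
                migrated = 0;
            }
            if (next) {
                next[size_++] = v;                 // new elements go straight to the new buffer
                for (int k = 0; k < 2 && migrated < old_count; ++k, ++migrated)
                    next[migrated] = std::move(buf[migrated]);
                if (migrated == old_count) {       // migration done: switch buffers
                    buf = std::move(next);         // leaves `next` empty
                    cap *= 2;
                }
            } else {
                buf[size_++] = v;
            }
        }

        T& operator[](std::size_t i) {
            if (next && migrated <= i && i < old_count) return buf[i];  // not migrated yet
            return next ? next[i] : buf[i];
        }

        std::size_t size() const { return size_; }
    };

    int main() {
        deamortized_vector<int> v;
        for (int i = 0; i < 100; ++i) v.push_back(i);
        return (v.size() == 100 && v[57] == 57) ? 0 : 1;
    }

This assumes allocation itself is O(1) regardless of buffer size, as stated above.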
It's possible to deamortize many quite advanced data structures in theory - see e.g. the lecture notes on "Maintaining Order in a List" on this course webpage:
... but as with everything in theoretical computer science you are not guaranteed that the implementation works better in practice than the "naive" approach. Deamortization techniques do have a lot of practical relevance in hard realtime applications, but I don't know how much of "real" hard realtime stuff uses academic techniques and how much of it is mostly engineering.
Amortized O(1)
Amortized O(1) does not satisfy the specified requirements.
Oh, ok, it used to be amortized constant O(1), but changed at some point apparently.
https://web.archive.org/web/20150310134852/https://en.cppref...
That's an error on that website's part. It's not an authoritative source.
It's always been non-amortized since 1998.
HFT needn’t mean low latency is required but of course the two tend to be related.
If trivial allocations might cost you a microsecond then you can’t allocate at all on a latency-sensitive critical path and should probably get a better allocator. Also, be wary of power-of-two sized things because if they are aligned they are likely to have poor cache performance (too many things compete for the same set of cache lines).
Most STL data structures take an allocator template argument. If you fight the compiler enough to make your own std::allocator with preallocation then the STL is actually pretty nice.
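For example, a bare-bones arena allocator along these lines (hypothetical names, no alignment subtleties or thread safety) already keeps a std::deque off the general-purpose heap on the hot path:

    #include <cstddef>
    #include <deque>
    #include <new>
    #include <vector>

    // Monotonic arena: memory is reserved once up front; allocations just bump a
    // cursor, and deallocation is a no-op (everything is released with the arena).
    struct arena {
        std::vector<std::byte> storage;
        std::size_t used = 0;
        explicit arena(std::size_t bytes) : storage(bytes) {}

        void* bump(std::size_t bytes) {
            std::size_t a = alignof(std::max_align_t);
            std::size_t aligned = (used + a - 1) & ~(a - 1);
            if (aligned + bytes > storage.size()) throw std::bad_alloc{};
            used = aligned + bytes;
            return storage.data() + aligned;
        }
    };

    template <typename T>
    struct arena_allocator {
        using value_type = T;
        arena* a;

        arena_allocator(arena& ar) : a(&ar) {}
        template <typename U>
        arena_allocator(const arena_allocator<U>& other) : a(other.a) {}

        T* allocate(std::size_t n) { return static_cast<T*>(a->bump(n * sizeof(T))); }
        void deallocate(T*, std::size_t) {}   // freed en masse when the arena dies

        template <typename U> bool operator==(const arena_allocator<U>& o) const { return a == o.a; }
        template <typename U> bool operator!=(const arena_allocator<U>& o) const { return a != o.a; }
    };

    int main() {
        arena ar(1 << 20);                     // 1 MiB reserved before the hot path
        arena_allocator<int> alloc(ar);
        std::deque<int, arena_allocator<int>> d(alloc);
        for (int i = 0; i < 1000; ++i) d.push_back(i);   // no malloc/free in this loop
    }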
I think he might be talking about a segmented ring buffer, which is similar to a deque but each segment is a ring buffer. It gives much better insertion complexity.
Anyway I have to echo his point that I've never found a situation where deque performs better than vector, even when you really might expect it to. I guess the extra allocations are too slow. Maybe it would be more useful with arenas or something.
> my personal guideline is to always allocate reasonable capacity on construction rather than on first insertion.
That works until there is a decent probability that the capacity you'll actually need will always be zero.
How do you feel about the absl containers? Are they also slow for low-latency?
They are OK. But "slow for low latency" is not really defined; for many low-latency cases, only static or at-startup allocation is permitted, thus not deque at all. If you might need to build for production with MSVC, definitely do not use the std::deque that comes with it.
std::deque is odd in that there's no standard size for the blocks used. We have libstdc++ at about 512 bytes, libc++ at 4096 bytes, and MS STL at around 64 bytes. This leaves a bunch of room for optimizing for certain cases: smaller is more insert-friendly, larger is more operator[]-read-friendly. But push_front after a pop_front should be O(1) on all of them; otherwise it is moving/copying data to make room. Another nice/similar data structure is something like Boost's devector, which adds a second pointer to vector for front capacity.
Boost's deque lets you decide the block size.
> MS STL at around 64bytes.
Citation? Are you sure it isn't 64 size-independent elements?
It's worse, actually. The MSVC STL uses chunks of 16 bytes (or of however large your element is, if that's greater). Source:
https://github.com/microsoft/STL/blob/main/stl/inc/deque#L56...
They have made clear that this won't be changed, for ABI stability reasons.
That makes std::deque basically unusable in a portable context. In virtually any situation, "allocate each element separately" and "allocate elements in 4k chunks" are on opposite ends of the performance spectrum.
That is quite unfortunate, and I freely admit that I'm unfamiliar with MSVC's current quirks.
> They have made clear that this won't be changed, for ABI stability reasons.
Also unfortunate. Once upon a time, users had to be extra careful to control which version of the runtime DLL their program transitively linked to since MS would issue a new ABI-breaking version with every major compiler release. Sounds like they got ABI stability religion just in time to lock in some unfortunate decisions.
Thanks, I remembered incorrectly.
So iiuc:
The libc++ implementation is bad. The libstdc++ implementation is fine. The issue with the former is that it doesn't have enough spare capacity so it has to shift elements around too often.
Actually I think the push_front is even worse than stated: O(n). Consider an array with capacity N+1, containing N elements. Now if you alternate push_front and pop_back, every push_front will cause a memmove of N elements.
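A way to probe that claim empirically (hypothetical; whether the cost per operation actually grows with N depends on which standard library you build against):

    #include <chrono>
    #include <cstdio>
    #include <deque>

    // Keep the deque at a steady size n and alternate push_front / pop_back.
    // If push_front ends up shifting internal storage, ns/op should grow with n.
    int main() {
        const std::size_t ops = 100000;
        for (std::size_t n = 1 << 12; n <= (std::size_t{1} << 20); n <<= 2) {
            std::deque<int> d(n, 0);
            auto start = std::chrono::steady_clock::now();
            for (std::size_t i = 0; i < ops; ++i) {
                d.push_front(static_cast<int>(i));
                d.pop_back();
            }
            auto stop = std::chrono::steady_clock::now();
            double ns = std::chrono::duration<double, std::nano>(stop - start).count();
            std::printf("n = %8zu  ns/op = %.2f\n", n, ns / ops);
        }
    }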
Oh and to make like a 4th addition to this comment: It's kind of funny that the code is so unreadable that the author found it more helpful to look at the generated assembler code.
Reading C or C++ standard library code is a bit of a joke. Almost nothing is done in a direct or clear way, and internal methods have cryptic names hidden behind macros.
Maybe for a good reason, I dunno. But it would be nice if the code were clearer, so you could make sense of it when gdb or callgrind jumps into an inlined chunk...
> internal methods have cryptic names
They choose names like _Capitalized and __lowercase because those identifiers are reserved for the implementation. It's a consequence of the preprocessor's lack of hygiene.
So where you might see a convention of naming members like m_thingish_foo, in the implementation's library headers they would be named _M_thingish_foo or __m_thingish_foo.
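A small illustration of why that matters (the macro name here is arbitrary; the point is only that user macros can legally collide with ordinary identifiers):

    // A user is allowed to define a macro like this before including a standard
    // header; the preprocessor would then rewrite any header-internal identifier
    // that happened to have the same spelling.
    #define m_data oops
    #include <vector>   // survives only because the header spells it _M_... instead

    int main() {
        std::vector<int> v{1, 2, 3};
        return v.size() == 3 ? 0 : 1;
    }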
Yeah, true. Maybe I am just whining because the inlined chunks in e.g. callgrind or gdb look so random. I should use "-Og" more often...
GDB has a pluggable pretty-printing system.
http://sourceware.org/gdb/wiki/STLSupport
That doesn't necessarily help if you are backtracing from your own lambda up through some <algorithm>. But it does help many other use cases with the standard library.
Very nice post!
Small comment: Ideally, big-O notation is for upper bounds. If you are doing lower bounds, you should ideally use big-Omega notation. But Omegas are harder to format in plain text, so it may be better to abuse notation and use big-O...
Interesting!
It reminds me of the hash maps that are said to be amortized O(1) but can be O(n) for some sequences of operations in various languages like Python and JS:
https://blog.toit.io/hash-maps-that-dont-hate-you-1a96150b49...
The most important thing to know about std::deque is that the implementation in MSVC is really just _awfully bad_. So, if your code might need to be built there, you should consider using an implementation from somewhere other than std::.