The best thing I learned in algorithms class is:
For all realizable n, ln(n) is less than 40.
So when amortizing over a large window (say, a block size of 512), it can be the case that, _for any realizable n_, the logarithmic amortized costs are completely dominated by the constant non-amortized costs (adding an element to a block in this case)... making the operation effectively constant-time.
Practically speaking, it is not good advice to just treat log(n) as if it is constant. While it's true that log(n) is always small in practice, the conclusion should not be to ignore it, but rather to pay attention to the constant factor. And in practice, _usually_ data structures with O(log n) complexity also have a bigger hidden constant. For example, std::unordered_map is much faster in practice than std::map. Of course, this is not strictly correct, it's just a heuristic. Quicksort, with its O(n log n) complexity, is a counter-example to this.
To be clear, that is not the advice I'm giving -- but rather, when your performance looks like _p*log n + q_, if _q_ is much greater than _p/40_ -- that is, the constant term dwarfs the logarithmic term -- then it is safe to consider it constant.
> p*log n + q, if q is much greater than p/40 -- that is, the constant term dwarfs the logarithmic term
I think you meant to say "if q is much greater than p TIMES 40".
Ah good catch, yes you are correct.
That seems like a pretty good argument _for_ treating it as constant though and just shifting your focus to how large the constants actually are.
In a virtual memory system, random access to an array of size n takes O(log n) time, and the constant factors in that O(log n) are also nontrivial. Algorithms that do O(log n) computation with O(log n) independent elements tend to take O(log^2 n) time, while those that do O(log n) computation with O(log n) contiguous elements or O(log n) iterations with O(1) elements still take O(log n) time. If the constant factors are small enough, it can be hard to distinguish the latter two from algorithms doing O(1) computation with O(1) elements.
In practice, for any memory system with caches and limited by the speed of light, random (unpredictable) access to an array of size n takes time much closer to O(sqrt(n)) than O(log(n)). There's an excellent article discussing this that you can search for, and it holds both in empirical tests on modern hardware and in the theoretical physical limit.
That depends on your perspective. If multiple levels of memory hierarchy are relevant (such as when scaling from 1 MiB to 1 GiB), you will see something resembling O(sqrt(n)). If you remain within the same level (e.g. from 1 GiB to 1 TiB), the scaling resembles O(log n) more closely. Or, in other words, it depends on whether you assume that cache size grows with n or is independent of it.
And of course, nothing is important until you’ve profiled your code and measured that it is.
Random fact of the day: The currently best known upper bound for the complexity of the Union-Find algorithm is the inverse Ackermann function, which can be treated as a constant (4) for all remotely practical (and most of the impractical) values of n.
Tarjan's proof of the inverse Ackermann bound is hard to understand; it's much easier to understand the proof of the log*(n) bound, which is also an extremely slowly growing function:
https://people.eecs.berkeley.edu/~vazirani/algorithms/chap5....
where log* is the iterated log function; i.e. the number of times you have to apply log repeatedly until you reach a value of 1 or smaller. For reference, log_2*(10^10000) = 5.
https://en.wikipedia.org/wiki/Iterated_logarithm
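For a concrete feel, here's a tiny sketch of the iterated log (to avoid big-number arithmetic it starts from log2(10^10000) = 10000 * log2(10) and counts that first application by hand):

    #include <cmath>
    #include <cstdio>

    // Iterated log: how many times log2 must be applied before the value drops to <= 1.
    int log_star(double x) {
        int count = 0;
        while (x > 1.0) {
            x = std::log2(x);
            ++count;
        }
        return count;
    }

    int main() {
        // log2(10^10000) = 10000 * log2(10) ~= 33219.3; the "+ 1" accounts for
        // that first log application, done by hand above.
        double after_one_log = 10000 * std::log2(10.0);
        std::printf("log*(10^10000) = %d\n", 1 + log_star(after_one_log));  // prints 5
    }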
I wonder if it is true that everything is O(1) if there is an upper bound? Even for an algorithm with O(n^n) complexity, it is still O(1) if n is bounded, just with an extremely large constant.
n has to be bounded much much smaller, and the constant overhead of the algorithm in question much much larger, for any of those approximations to be valid. The approximation works for log n only because it's not uncommon to have constant overheads which dwarf log n for all plausible n (and only works in such cases!) This is very unlikely to be true even for a linear algorithm.
While I very much appreciate and respect the author's detailed analysis, I am still not 100% convinced without a corresponding benchmark result. If it's really O(log n) vs O(1), it should be very easy to verify in some micro-benchmark.
Instead of that, the author mentioned that:
The tests that I did run, initially the libstdc++ and the libc++ versions have roughly the same performance, but the performance diverges after a number of insertions somewhere between 6,400 and 12,800. After that the runtime of the libc++ version is roughly 2-3 times that of the libstdc++ version. (...) the outcomes of which depend on the compiler version anyway.
This does not seem right.
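Something along these lines would already be enough to probe it (a rough sketch; sizes, warm-up and timing methodology are arbitrary and would need more care in a real benchmark):

    #include <chrono>
    #include <cstdio>
    #include <deque>

    // Time n push_fronts for growing n and check whether ns/op stays flat (O(1))
    // or creeps up with log n.
    int main() {
        for (std::size_t n = 1 << 10; n <= (std::size_t{1} << 24); n <<= 2) {
            std::deque<int> d;
            auto start = std::chrono::steady_clock::now();
            for (std::size_t i = 0; i < n; ++i) d.push_front(static_cast<int>(i));
            auto stop = std::chrono::steady_clock::now();
            double ns = std::chrono::duration<double, std::nano>(stop - start).count();
            std::printf("n = %9zu  ns/push_front = %.2f\n", n, ns / n);
        }
    }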
Here are the results of the benchmark that I did:
It didn't look quite right to me so I didn't post it.
Here's another benchmark that I did where you can clearly see that something has gone wrong:
CppReference documents the time complexity as constant:
https://en.cppreference.com/w/cpp/container/deque/push_front
So does this mean:
- We're talking about different complexities (amortized vs. something else)
- The libc++ implementation isn't standards-conformant
- The analysis in the C++ standard is incorrect
- Something else
None of those, exactly. Generally when complexity is specified in the standard, it's with respect to a specific operation. In this case the standard specifies that it's the number of calls to the constructor of T that must be constant, which means that push_back and push_front cannot copy or move existing elements.
There is never a way to guarantee that the physical amount of time is bounded by O(1) in the worst case. You could always have pathological scenarios or data structures written so that performing a move or copy of a single T could require O(n^n) time for all anyone knows.
Memory allocation in particular is very hard to analyze.
It's not conformant.
To make bullet points on HN, put a blank line between the paragraphs.
Anyone who programs low-latency software knows std::deque is an absolutely horrible-performance deque implementation due to how much it allocates and the pointer chasing involved. In most cases it's better to use a vector-backed deque (ring buffer) for anything performance-sensitive.
That is not true.
Anyone who programs low-latency C++ knows that the libstdc++ implementation is great (which is what 99.9% of people use) while others tend to be less stellar.
It's just a segmented vector. The libstdc++ implementation always allocates one segment even if empty, and while I've seen low-latency guidelines arguing for empty things not allocating, my personal guideline is to always allocate reasonable capacity on construction rather than on first insertion.
A ring buffer is a completely different data structure, it only works for fixed-capacity queues.
>which is what 99.9% of people use
Maybe we have different ideas about what constitutes "low-latency", but in HFT std::deque is rarely used. Much like std::unordered_map, which allocates on every insert, potentially costing up to a microsecond each.
>It's just a segmented vector.
https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USER...
We can look at the source. This "segmented vector" is `_Tp** _M_map`, essentially a vector of vectors. This means potentially twice the pointer-chasing, as we have to do two lookups. It also means an allocation/deallocation each time a new segment is added/destroyed, which leads to more allocation than when using a vector (you first need to allocate the segment, then potentially allocate entries within it).
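For illustration, element access in such a layout boils down to two dereferences (a minimal sketch, not the actual libstdc++ code):

    #include <cstddef>

    // Hypothetical segmented layout: a "map" of pointers to fixed-size blocks.
    // operator[] costs two lookups: one into the map, one into the block.
    template <typename T, std::size_t BlockElems>
    struct segmented_view {
        T**         map;    // pointers to blocks of BlockElems elements each
        std::size_t start;  // offset of element 0 inside the first block

        T& operator[](std::size_t i) const {
            std::size_t j = start + i;
            return map[j / BlockElems][j % BlockElems];  // the "twice the pointer-chasing"
        }
    };

    int main() {
        int block0[4] = {0, 1, 2, 3}, block1[4] = {4, 5, 6, 7};
        int* blocks[] = {block0, block1};
        segmented_view<int, 4> v{blocks, 1};   // logical element 0 is block0[1]
        return v[4] == 5 ? 0 : 1;              // j = 5 -> blocks[1][1]
    }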
>A ring buffer is a completely different data structure, it only works for fixed-capacity queues.
Where possible it's better to use a fixed capacity queue, especially one where the capacity is a power of 2 so wrap-around calculation can be done more efficiently. But the same kind of thing can be done for a dynamic-capacity queue, by resizing the whole backing array when capacity is reached.
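A minimal sketch of the fixed-capacity, power-of-two variant (hypothetical, no overflow checking), just to show the mask trick:

    #include <cstddef>
    #include <vector>

    // Fixed-capacity ring buffer; capacity must be a power of two so that
    // wrap-around is a bit-mask instead of a modulo or a branch.
    template <typename T>
    class ring_buffer {
        std::vector<T> buf;
        std::size_t    head = 0;   // index of the front element
        std::size_t    count = 0;

    public:
        explicit ring_buffer(std::size_t pow2_capacity) : buf(pow2_capacity) {}

        void push_back(const T& v) {           // assumes count < capacity
            buf[(head + count++) & (buf.size() - 1)] = v;
        }

        void push_front(const T& v) {          // assumes count < capacity
            head = (head + buf.size() - 1) & (buf.size() - 1);
            buf[head] = v;
            ++count;
        }

        T& operator[](std::size_t i) { return buf[(head + i) & (buf.size() - 1)]; }
        std::size_t size() const { return count; }
    };

    int main() {
        ring_buffer<int> rb(8);    // capacity: power of two
        rb.push_back(1);
        rb.push_front(0);
        return (rb[0] == 0 && rb[1] == 1) ? 0 : 1;
    }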
I've worked in many HFT shops and that's not my experience.
A lot of those firms have poor C++ code, though; std::deque was typically used in the better ones.
It's not one allocation per element, as said previously it's a segmented vector. It works similarly to virtual memory: fixed-size segments are allocated for contiguous storage, while another parent array points to the segments. So appending/prepending either needs to append/prepend to the last/first segment or to allocate a new segment if full, which is what makes those operations more efficient than on a vector. Yes there is indirection, that's what segmented means. You also get stability of the memory location which is important for many data structures.
The article even tells you how big the segments are in the two implementations being looked at.
Going back to HFT, a similar data structure that's definitely heavily used there is a pool, which has a configurable segment size (potentially variable) and doesn't have order, enabling faster deallocation with a freelist.
I thought a resizable ring buffer was a valid implementation, but maybe I'm thinking of Rust's VecDeque and there's something about the semantics of std::deque that doesn't allow that.
Ohhhh, I think I found the issue with a ring buffer implementation. Would edit my post if I was in the edit window.
resize() doesn't invalidate existing references in std::deque.
I don't think insert at front/back is allowed an amortized bound, it has to be flat constant. But then I don't see how libc++'s is conforming if this article is true, so I'm not entirely sure what's going on there.
How could you make any operation O(1) with resizing?
It is possible to deamortize the O(n) resizing step, meaning you spend O(1) extra time over the O(n) operations leading to the resize, so that when the resize needs to happen, you don't need to spend O(n) time. This assumes that allocation/deallocation takes O(1) time regardless of the buffer size.
Deamortized resizing can work as follows: When inserting into a vector that is half full, allocate a new buffer twice the size and copy two elements into the buffer. For the subsequent insertions, copy two elements every time you insert one element. This guarantees that you are done copying all elements into the new buffer once the old buffer is full, and you only performed a constant number of instructions with each operation.
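A rough sketch of that scheme (push_back only, hypothetical and unoptimized; the point is just that no single insertion does more than a constant amount of copying):

    #include <cstddef>
    #include <memory>
    #include <utility>

    // Deamortized growable array. Once the buffer is half full, a buffer of twice
    // the capacity is allocated and two old elements are migrated on every
    // subsequent insertion, so migration finishes well before the old capacity
    // would have been exhausted. No push_back ever moves more than two old elements.
    template <typename T>
    class deamortized_vector {
        std::size_t cap = 4, size_ = 0;            // capacity of `buf`, total element count
        std::size_t old_count = 0, migrated = 0;   // migration bookkeeping
        std::unique_ptr<T[]> buf{new T[cap]};      // active buffer
        std::unique_ptr<T[]> next;                 // buffer being migrated into, if any

    public:
        void push_back(const T& v) {
            if (!next && 2 * size_ >= cap) {       // half full: start migrating
                next.reset(new T[2 * cap]);
                old_count = size_;
                migrated = 0;
            }
            if (next) {
                next[size_++] = v;                 // new elements go straight to the new buffer
                for (int k = 0; k < 2 && migrated < old_count; ++k, ++migrated)
                    next[migrated] = std::move(buf[migrated]);
                if (migrated == old_count) {       // migration done: switch buffers
                    buf = std::move(next);         // leaves `next` empty
                    cap *= 2;
                }
            } else {
                buf[size_++] = v;
            }
        }

        T& operator[](std::size_t i) {
            if (next && migrated <= i && i < old_count) return buf[i];  // not migrated yet
            return next ? next[i] : buf[i];
        }

        std::size_t size() const { return size_; }
    };

    int main() {
        deamortized_vector<int> v;
        for (int i = 0; i < 100; ++i) v.push_back(i);
        return (v.size() == 100 && v[57] == 57) ? 0 : 1;
    }

This assumes allocation itself is O(1) regardless of buffer size, as stated above.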
It's possible to deamortize many quite advanced data structures in theory - see e.g. the lecture notes on "Maintaining Order in a List" on this course webpage:
... but as with everything in theoretical computer science you are not guaranteed that the implementation works better in practice than the "naive" approach. Deamortization techniques do have a lot of practical relevance in hard realtime applications, but I don't know how much of "real" hard realtime stuff uses academic techniques and how much of it is mostly engineering.
Amortized O(1)
Amortized O(1) does not satisfy the specified requirements.
Oh, ok, it used to be amortized constant O(1), but changed at some point apparently.
https://web.archive.org/web/20150310134852/https://en.cppref...
That's an error on that website's part. It's not an authoritative source.
It's always been non-amortized since 1998.
HFT needn’t mean low latency is required but of course the two tend to be related.
If trivial allocations might cost you a microsecond then you can’t allocate at all on a latency-sensitive critical path and should probably get a better allocator. Also, be wary of power-of-two sized things because if they are aligned they are likely to have poor cache performance (too many things compete for the same set of cache lines).
Most STL data structures take an allocator template argument. If you fight the compiler enough to make your own std::allocator with preallocation then the STL is actually pretty nice.
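For example, a bare-bones arena allocator along these lines (hypothetical names, no alignment subtleties or thread safety) already keeps a std::deque off the general-purpose heap on the hot path:

    #include <cstddef>
    #include <deque>
    #include <new>
    #include <vector>

    // Monotonic arena: memory is reserved once up front; allocations just bump a
    // cursor, and deallocation is a no-op (everything is released with the arena).
    struct arena {
        std::vector<std::byte> storage;
        std::size_t used = 0;
        explicit arena(std::size_t bytes) : storage(bytes) {}

        void* bump(std::size_t bytes) {
            std::size_t a = alignof(std::max_align_t);
            std::size_t aligned = (used + a - 1) & ~(a - 1);
            if (aligned + bytes > storage.size()) throw std::bad_alloc{};
            used = aligned + bytes;
            return storage.data() + aligned;
        }
    };

    template <typename T>
    struct arena_allocator {
        using value_type = T;
        arena* a;

        arena_allocator(arena& ar) : a(&ar) {}
        template <typename U>
        arena_allocator(const arena_allocator<U>& other) : a(other.a) {}

        T* allocate(std::size_t n) { return static_cast<T*>(a->bump(n * sizeof(T))); }
        void deallocate(T*, std::size_t) {}   // freed en masse when the arena dies

        template <typename U> bool operator==(const arena_allocator<U>& o) const { return a == o.a; }
        template <typename U> bool operator!=(const arena_allocator<U>& o) const { return a != o.a; }
    };

    int main() {
        arena ar(1 << 20);                     // 1 MiB reserved before the hot path
        arena_allocator<int> alloc(ar);
        std::deque<int, arena_allocator<int>> d(alloc);
        for (int i = 0; i < 1000; ++i) d.push_back(i);   // no malloc/free in this loop
    }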
I think he might be talking about a segmented ring buffer, which is similar to a deque but each segment is a ring buffer. It gives much better insertion complexity.
Anyway I have to echo his point that I've never found a situation where deque performs better than vector, even when you really might expect it to. I guess the extra allocations are too slow. Maybe it would be more useful with arenas or something.
> my personal guideline is to always allocate reasonable capacity on construction rather than on first insertion.
That works until there is a decent probability that the capacity you'll actually need will always be zero.
How do you feel about the absl containers? Are they also slow for low-latency?
They are OK. But "slow for low latency" is not really defined; for many low-latency cases, only static or at-startup allocation is permitted, thus not deque at all. If you might need to build for production with MSVC, definitely do not use the std::deque that comes with it.
std::deque is odd in that there's no standard size for the blocks used. We have libstdc++ at about 512 bytes, libc++ at 4096 bytes, and MS STL at around 64 bytes. This leaves a bunch of room for optimizing for certain cases: smaller is more insert-friendly, larger is more operator[]-read-friendly. But push_front after a pop_front should be O(1) on all of them; otherwise it is moving/copying data to make room. Another nice/similar data structure is something like Boost's devector, which adds a second pointer to vector for front capacity.
Boost's deque lets you decide the block size.
> MS STL at around 64bytes.
Citation? Are you sure it isn't 64 size-independent elements?
It's worse, actually. The MSVC STL uses chunks of 16 bytes (or of however large your element is, if that's greater). Source:
https://github.com/microsoft/STL/blob/main/stl/inc/deque#L56...
They have made clear that this won't be changed, for ABI stability reasons.
That makes std::deque basically unusable in a portable context. In virtually any situation, "allocate each element separately" and "allocate elements in 4k chunks" are on opposite ends of the performance spectrum.
That is quite unfortunate, and I freely admit that I'm unfamiliar with MSVC's current quirks.
> They have made clear that this won't be changed, for ABI stability reasons.
Also unfortunate. Once upon a time, users had to be extra careful to control which version of the runtime DLL their program transitively linked to since MS would issue a new ABI-breaking version with every major compiler release. Sounds like they got ABI stability religion just in time to lock in some unfortunate decisions.
Thanks, I remembered incorrectly.
So iiuc:
The libc++ implementation is bad. The libstdc++ implementation is fine. The issue with the former is that it doesn't have enough spare capacity so it has to shift elements around too often.
Actually I think the push_front is even worse than stated: O(n). Consider an array with capacity N+1, containing N elements. Now if you alternate push_front and pop_back, every push_front will cause a memmove of N elements.
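A way to probe that claim empirically (hypothetical; whether the cost per operation actually grows with N depends on which standard library you build against):

    #include <chrono>
    #include <cstdio>
    #include <deque>

    // Keep the deque at a steady size n and alternate push_front / pop_back.
    // If push_front ends up shifting internal storage, ns/op should grow with n.
    int main() {
        const std::size_t ops = 100000;
        for (std::size_t n = 1 << 12; n <= (std::size_t{1} << 20); n <<= 2) {
            std::deque<int> d(n, 0);
            auto start = std::chrono::steady_clock::now();
            for (std::size_t i = 0; i < ops; ++i) {
                d.push_front(static_cast<int>(i));
                d.pop_back();
            }
            auto stop = std::chrono::steady_clock::now();
            double ns = std::chrono::duration<double, std::nano>(stop - start).count();
            std::printf("n = %8zu  ns/op = %.2f\n", n, ns / ops);
        }
    }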
Oh and to make like a 4th addition to this comment: It's kind of funny that the code is so unreadable that the author found it more helpful to look at the generated assembler code.
Reading C or C++ standard library code is a bit of a joke. Almost nothing is done in a direct or clear way, and internal methods have cryptic names hidden behind macros.
Maybe for a good reason, I dunno. But it would be nice if the code were clearer, so you could make sense of it when gdb or callgrind jumps into an inlined chunk...
> internal methods have cryptic names
They choose names like _Capitalized and __lowercase because those identifiers are reserved for the implementation. It's a consequence of the preprocessor's lack of hygiene.
So where you might see a convention of naming members like m_thingish_foo, in the implementation's library headers they would be named _M_thingish_foo or __m_thingish_foo.
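A small illustration of why that matters (the macro name here is arbitrary; the point is only that user macros can legally collide with ordinary identifiers):

    // A user is allowed to define a macro like this before including a standard
    // header; the preprocessor would then rewrite any header-internal identifier
    // that happened to have the same spelling.
    #define m_data oops
    #include <vector>   // survives only because the header spells it _M_... instead

    int main() {
        std::vector<int> v{1, 2, 3};
        return v.size() == 3 ? 0 : 1;
    }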
Yeah, true. Maybe I am just whining because the inlined chunks in e.g. callgrind or gdb look so random. I should use "-Og" more often...
GDB has a pluggable pretty-printing system.
http://sourceware.org/gdb/wiki/STLSupport
That doesn't necessarily help if you are backtracing from your own lambda up through some <algorithm>. But it does help many other use cases with the standard library.
Very nice post!
Small comment: Ideally, big-O notation is for upper bounds. If you are doing lower bounds, you should ideally use big-Omega notation. But Omegas are harder to format in plain text, so it may be better to abuse notation and use big-O...
Interesting!
It reminds me of the hash maps that are said to be amortized O(1) but can be O(n) for some sequences of operations in various languages like Python and JS:
https://blog.toit.io/hash-maps-that-dont-hate-you-1a96150b49...
The most important thing to know about std::deque is that the implementation in MSVC is really just _awfully bad_. So, if your code might need to be built there, you should consider using an implementation from somewhere other than std::.