_>What’s interesting about debugging this is that it actually is much easier to debug than a Linux system. Why? Because there is only one program in question. You aren’t whipping out lsof to figure out what process is spewing out crap connections or which process didn’t have a proper log rotation setup and prevented you from SSHing into the instance. [...] Unikernels are so much easier to debug than normal Linux systems._
I have no strong opinion either way, but just to add a counterpoint from Bryan Cantrill's 2016 article[1]:
_>[...] the most profound reason that unikernels are unfit for production — and the reason that (to me, anyway) strikes unikernels through the heart when it comes to deploying anything real in production: Unikernels are entirely undebuggable. There are no processes, so of course there is no ps, no htop, no strace — but there is also no netstat, no tcpdump, no ping! And these are just the crude, decades-old tools. There is certainly nothing modern like DTrace or MDB. From a debugging perspective, to say this is primitive understates it: this isn’t paleolithic — it is precambrian. As one who has spent my career developing production systems and the tooling to debug them, I find the implicit denial of debugging production systems to be galling, and symptomatic of a deeper malaise among unikernel proponents: total lack of operational empathy. Production problems are simply hand-waved away — services are just to be restarted when they misbehave. _
And a previous 2018 HN subthread had differing opinions on debugging unikernels:
https://news.ycombinator.com/item?id=17260604
[1]
https://www.joyent.com/blog/unikernels-are-unfit-for-product...
I think most of this can be significantly reduced with mature development processes and discipline. A mature org almost never needs to SSH into a production instance in the first place—extensive debug information is available via logging and monitoring, and when a problem arises in production you try to replicate it in lower environments if necessary. This doesn’t always work and sometimes it would be faster to just SSH and call it a day, but in exchange for that convenience you get a bunch of nice properties. Moreover, you can always compile in the debugging utilities that you absolutely need (or even better—create network interfaces for them).
Having worked mostly in cowboy organizations, I just don't see how you can fix tricky network issues without access to the machines. And once you need access sometimes, you might as well use it always, because it's certainly handy.
Application issues usually aren't so hard to replicate, but even there, sometimes it's easiest to figure out how to replicate the issue in a test environment by debugging in production. Once the issue is understood, you can validate the changes in a test environment and push to production.
All that to say: if you don't have access to an equivalent of tcpdump, netstat, top, etc., it's going to be a lot harder to debug your system. And if you have a system that needs the performance of a unikernel, you're going to experience tricky problems. If you just want a simple system, it's better IMHO to just run a minimal daemon set: sshd, getty (if you have a console), ntpd, syslogd, crond, a local caching DNS server (I like unbound), a monitoring program, and your application. A serviceable unikernel is going to need an equivalent to most of those plus the debugging tools, so maybe not as simple as one might hope.
> Having worked mostly in cowboy organizations, I just don't see how you can fix tricky network issues without access to the machines. And once you need access sometimes, you might as well use it always, because it's certainly handy.
The big, abstract idea is that you don't worry about individual instances (be they VMs or container instances or unikernel apps or etc), but you worry about the process for stamping out those instances. If you have a process for reproducibly creating machines, then you can test that process in lower environments (stamp out some instances, validate them, and destroy them) before promoting that process to production. When you have production issues, your first line of defense should be comprehensive logging and monitoring, but failing that you try to reproduce issues in lower environments and failing that you can deploy a version of your app with monitoring tools baked in (e.g., instead of a scratch base image, you deploy a version with a full ubuntu base image). If you're running into issues so often that you feel the need to deploy your debug tools to production all the time, then you're not appropriately investing in your logging/monitoring (ideally every time you need a debug tool in production, you should go back and add equivalent instrumentation so you don't need that debug tool any more). This is the theory, anyway. Practice involves a lot more nuance and judgement.
Yeah, my test environment is never going to match the diversity of the real world. I just don't have the equipment or imagination to ad-hoc create the garbage networks the real world has.
Running a new, clean instance doesn't help most networking problems that are an interaction between (usually) kernel bugs in different systems.
How do you figure out that your kernel is sending way too many packets out when it gets an ICMP MTU-exceeded with the MTU set to the current size, without getting lucky with tcpdump? That was a FreeBSD kernel bug/oversight (since fixed) interacting with a Linux routing bug (already fixed when I found it, but not deployed), where large receive offload resulted in a packet too large to forward even though the on-the-wire packets were properly sized.
Or that time when syncookies were broken and a very short connection could get restarted but not properly synchronized, and both sides sent challenge ACKs for all packets they received until one side timed out (up to line rate if the other party has good connectivity or is localhost). That one needs luck with tcpdump too.
Or when IP fragment processing had a near-infinite loop if you had the right timing, resulting in that rx queue being hung for hours. Dropping into the kernel debugger made that very quick to debug.
It's quite hard to diagnose MTU issues in general without tcpdump, too. No sane person is going to log all packets for a machine that's pushing 2-20Gbps, or the retention is going to be too short to be useful.
Tcpdump could maybe be handled by port mirroring on a switch, but a recurring unknown event that triggered a partial network outage took seconds to diagnose with the tools on production, and would have taken an unknowable amount of time otherwise. Upstream fixed it by accident, and few people would have experienced it because it was unlikely to occur with default settings.
I fix a lot of problems with tcpdump, so all of my problems look tcpdump-shaped.
Just as an aside, both of these use cases, the MTU being the wrong size (GCP, which I mentioned in the article) and adding SYN cookies, are direct problems that we've had to diagnose and fix in Nanos. The latter because the TCP/IP stack we chose didn't have SYN cookies to begin with.
We were able to use tcpdump in that situation as well, to test and verify the problem was solved; it didn't require tcpdump to be in the guest.
All of these problems I wouldn't expect an average user to fix regardless though.
> All of these problems I wouldn't expect an average user to fix regardless though.
I wouldn't expect an average user to run a unikernel.
I'd expect those running unikernels in production to be people who really need the performance, and those who really need the performance are going to be running up against weird issues all the time, and need to have at least a couple of people on staff who can solve weird issues. When you have that kind of scale, one-in-a-million events are frequent.
I also don't necessarily understand the desire to run a unikernel in a hypervisor; might it be better to run a regular kernel on bare metal, or the unikernel on bare metal (yeah, it's harder to develop and probably debug the unikernel that way, but performance!).
Maybe isolation is a bigger deal to some people than me.
Not to mention he even points out an obvious fix: "Production problems are simply hand-waved away — services are just to be restarted when they misbehave."
Even if my telemetry doesn't tell me all I need, if it's easy to reproduce, I can reproduce it in an environment that has all the debug stuff I want (and I can futz with it to my heart's content safely, too).
If it's hard to reproduce...then I don't know that I really want to spend time on it if a reboot will just fix it.
While there are probably some in between cases (hard to reproduce, unable to be caught by solid telemetry, we have an instance that is in a bad state that I could remove from the LB and safely futz with, and actually gain useful information from an after-the-fact debugging rather than inspection as it runs under real load), I'm not sure I can recollect one that fits those criteria across 6 or so different companies and domains. That said, that's hardly exhaustive, and I'll admit at only two of those were we using anything akin to a unikernel.
Very well said. This is basically what I picture when I think of "mature organizations". You strive to never touch things in production except to restart them or deploy a different version. When you find yourself wishing you had some debug utility, that's certainly a sign that you need better observability. Rolling out a new release with your improved instrumentation isn't a good fit for an urgent problem, but these cases become increasingly rare over time as your observability improves, and moreover you can always deploy a "debug" version of your application with various debug utilities compiled in.
From what I recall, a lot of unikernel hype started around MirageOS and OCaml. When your language encourages immutability, debugging on the live system is simply not that useful. Take a snapshot of the live system and replay it on a local machine like you describe where you have considerably more sophisticated tools.
I assume unikernels are also big on reproducible builds, which means if there are issues you can't debug like above, then you can easily and quickly deploy a new version with more tracing code inserted exactly where you want. "Traditional debugging" tools can't get this precise, so I can't say I'm convinced by that objection, although I agree developing in this paradigm would require some domain knowledge/experience to set things up right.
I don't see why a unikernel couldn't have a dedicated debug interface, much like MCUs that lack a console but have JTAG.
It seems to be a tooling problem. But it's a large deficit, and there's a catch-22 in that the people most concerned with tooling try to use architectures with mature tooling, and don't write it for the immature ones.
There is also the problem of a lot of extra code volume without visible interfaces separating the parts. Once there's a networking problem, is it in your code, in the driver, or in some other part of the code that is interfering with yours? But that too is a matter of architecture details and tooling.
Because someone needs to go write code for them, or make Linux tools work in said unikernel.
That interface would have to exist at the host level (just like JTAG exists at the hardware level), but you won't see cloud vendors exposing a GDB stub interface to your VM instances any time soon.
That was 2016. I don't know the state of unikernel debugging today, but I can imagine a unikernel having all the observability goodies that Solaris/Illumos have, or Linux for that matter.
mdb? maybe via a console just like on Solaris/Illumos, or from the hypervisor.
gdb? maybe via a gdb server in the unikernel, or from the hypervisor.
DTrace, eBPF? maybe via a server like a gdb server, but for those, in the unikernel.
I can't imagine why those things wouldn't be possible.
I'm not sure I get what the point of unikernels is given his explanation.
GCP runs your VM as a Linux process with KVM. If you run a unikernel on that... that's roughly comparable to running your workload directly on the host - i.e. something more like App Engine. You still have Linux on top taking care of everything. You've just replaced the system call interface with a paravirtualized hardware interface. Which... is quite likely to be a bit slower if anything. You also have the overhead of two-stage paging (even for a unikernel, apps will want to use things like mmap(), which I assume means they can't run with guest paging off; and there's no paravirt API to mess with the VM memory layout to emulate this at the host level AFAIK)
Of course you could in principle run a unikernel on bare metal, but I don't see that ever taking off. You need more software to manage a real machine, plus drivers. It's a lot easier to build your own OS when you're targeting fairly simple VM platforms with paravirtualized hardware. Bare metal isn't the same, and who wants to dedicate a whole machine to one process?
Basically, this sounds to me like a less than ideal workaround to reduce VM overhead in existing cloud environments; not something that makes a whole lot of sense when you look at the big picture.
The ASLR story is also silly. You can get the same effect with traditional fork/exec request handling (think CGI) on a normal OS, and of course that's going to be faster than booting the unikernel _and_ the app for each request.
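For the curious, this is roughly what that CGI-style pattern looks like; a minimal sketch (mine, not the article's), where `./handler` is a hypothetical request-handler binary and the exec() is what gives each request a freshly randomized address space:
```c
/* Minimal sketch of CGI-style fork/exec request handling.
   "./handler" is a hypothetical request-handler binary; the
   exec() re-randomizes the address space layout per request. */
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    if (bind(srv, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(srv, 16) < 0) {
        perror("bind/listen");
        return 1;
    }
    for (;;) {
        int conn = accept(srv, NULL, NULL);
        if (conn < 0)
            continue;
        if (fork() == 0) {             /* one short-lived process per request */
            dup2(conn, STDIN_FILENO);  /* CGI convention: socket on stdin/stdout */
            dup2(conn, STDOUT_FILENO);
            close(conn);
            close(srv);
            execl("./handler", "handler", (char *)NULL); /* fresh ASLR here */
            _exit(127);                /* only reached if exec failed */
        }
        close(conn);
        while (waitpid(-1, NULL, WNOHANG) > 0) {} /* reap finished children */
    }
}
```
Booting even a fast unikernel plus the app per request is strictly more work than this, which is the comparison being made.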
> Bare metal isn't the same, and who wants to dedicate a whole machine to one process?
1) I actually do that at work. A few bare metal machines each have to run just one large-ish process. If I had the time I'd really invest in a unikernel type setup just for the simplicity of it.
2) if you're using any cloud-hosted VM to run your apps as the post describes, you could benefit a lot from a unikernel setup. I haven't used AWS et al much the last few years, but at my last gig we hosted everything in its own VM, which would've been great to slim down hard.
The alternative is not between bare metal and paravirtualization. Hardware assisted virtualisation is a thing. The hypervisor can actually be very thin.
I think what you are describing is actually an argument in favor of unikernels. The idea is to get the best of both worlds. You get the isolation guarantee provided by virtualisation without the cost of a full OS running on a full OS, and with the added benefit of compiler optimisation running on the whole stack at once.
Intellectually it makes a lot of sense. The question is: can it actually work without creating a lot of friction during development, and deliver these theoretical gains?
The host interface is paravirtualized. Hardware assisted virtualization only covers the CPU in most cases. All the hardware and I/O is going to be using paravirtualized devices (virtio or similar).
There is no inherent reason why that interface is going to be more secure than the syscall interface. If the goal is security, there are easier ways of hardening that without reaching for virtualization. Virtualization itself doesn't really buy you anything there; we only consider VMs more secure due to how existing implementations work, not due to anything intrinsic.
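To make "easier ways of hardening" concrete, one example (my sketch, not the commenter's) is Linux seccomp, which shrinks the syscall surface of a process with no hypervisor involved. Strict mode is the bluntest form; a real service would use a seccomp-BPF allowlist instead:
```c
/* Minimal sketch of hardening the syscall interface without
   virtualization: seccomp strict mode leaves the process only
   read(), write(), _exit() and sigreturn(). Any other syscall
   kills it with SIGKILL. */
#include <linux/seccomp.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <unistd.h>

int main(void) {
    printf("entering strict seccomp...\n");
    fflush(stdout);
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0) {
        perror("prctl");
        return 1;
    }
    const char msg[] = "still alive: write() is permitted\n";
    write(STDOUT_FILENO, msg, sizeof msg - 1);
    /* Anything else, e.g. open("/etc/passwd", ...), would be
       fatal from here on. */
    _exit(0);
}
```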
> Hardware assisted virtualization only covers the CPU in most cases. All the hardware and I/O is going to be using paravirtualized devices (virtio or similar).
Well, no. IOMMU is a thing. You can do passthrough of I/O devices if you need to.
> There is no inherent reason why that interface is going to be more secure than the syscall interface. If the goal is security, there are easier ways of hardening that without reaching for virtualization.
There is no question that running software in a VM is safer than running it straight on the kernel. The isolation enforced is far stronger.
You could legitimately ask if it’s safer than running it inside a container. That’s a good question especially now that the orchestration surrounding containers is better.
A unikernel keeps the advantage of discarding plenty of attack vectors, however, just by virtue of every unnecessary part of the kernel not being there.
> we only consider VMs more secure due to how existing implementations work, not due to anything intrinsic.
It’s a weird argument. We do indeed consider VMs more secure than running things directly on the kernel because the way they work intrinsically means they are.
You can certainly harden systems in different ways and enforce isolation through different means, but VMs have the advantage of being a fairly old and well-known technology with good tooling around them, especially in the enterprise world.
When I studied in Uni, it seemed like there were two different paths to reducing the number of copies and system calls in handling a network request.
1. Move the network stack to userspace. Then the code running in userland can get at the request without a copy.
2. Move the running code into the kernel. Then the code running in the kernel can get at the request without a copy.
Unikernels are a way to do 2. People just really don't want to pay for that syscall boundary.
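A rough, illustrative way to see what "paying for the syscall boundary" means (my micro-benchmark sketch; absolute numbers vary a lot with CPU and speculative-execution mitigations):
```c
/* Time a tight loop of a trivial syscall against a trivial
   user-space operation. Numbers are illustrative only. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

static double elapsed_ns(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
}

int main(void) {
    enum { N = 1000000 };
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        syscall(SYS_getpid);           /* forces a real kernel round trip;
                                          plain getpid() may be cached in libc */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("syscall:    %.1f ns/op\n", elapsed_ns(t0, t1) / N);

    volatile long sink = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        sink += i;                     /* stand-in for staying in user space */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("no syscall: %.1f ns/op\n", elapsed_ns(t0, t1) / N);
    (void)sink;
    return 0;
}
```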
I think the parent's point is that running a unikernel on paravirtualized hardware is actually neither: the entire VM runs in userspace as seen from the host, but you still have the data copy from the host's physical network interface to the paravirtual interface on the VM.
But you're paying for the virtualization boundary, which is usually more expensive than a syscall boundary. You're still a VM guest running under the host kernel, which is a comparable situation to a userspace process.
I don’t think App Engine shares a Linux kernel between users. That would be a surprising level of confidence in lack of Linux system call vulnerabilities that no other cloud vendor seems to have (see AWS’s Firecracker VM or Google’s own gVisor).
So at the end of the day, you’re still going to be dealing with a hypervisor. It’s also worth noting that at least AWS appears to hardware accelerate their paravirtualized interfaces so it’s not necessarily the case that IO operations are causing VM exits.
App Engine absolutely shares a Linux between users. They have a heavily audited sandbox with multiple layers of defence. Of course customers don't get to call arbitrary system calls, but it's also not a VM.
GCP indicates that App Engine standard environment at least uses gVisor[1] (i.e. KVM with emulated Linux syscall interface). The docs also indicate you can run custom binaries on most of the standard runtime languages[2] (not to mention being able to ship Go binaries), so I don’t see how they could get around allowing system calls.
App Engine must have used something other than gVisor originally since Go didn’t even exist when it launched, but I’m having trouble finding any concrete indication as to whether they shared kernels between multiple customers.
[1]:
https://cloud.google.com/blog/products/containers-kubernetes...
[2]:
https://cloud.google.com/appengine/docs/the-appengine-enviro...
They definitely shared kernels between customers (and with other internal stuff).
I don't know if they're using the KVM backend now, but if they are, that's definitely more recent. There was no virtualization involved when it all started.
Source: I'm a xoogler.
gVisor's kvm backend is still experimental AFAIK; the ptrace backend is the default.
I don't know if they use Firejail or not but that sounds a lot like Firejail.
Interesting. Basically the argument is as follows:
Conventional operating systems are multi-user: one user shares all their files with all their apps, but effort is put into protecting one user from another. This then gets breached _all the time_ (escalation of privilege), and fails to capture the problem that attacks come from applications.
Unikernel/cloud systems are "zero-user": we have something that might be called a "process" in a traditional operating system, but instead of lots of fallible fine grained security boundaries we have only one, the hardware virtualization. The "process" runs directly below the hypervisor and interacts with "virtual hardware", predominantly with other networked services.
That raises the question - how is the hardware virtualization for these systems handled and secured? Do you get a virtual PCI bridge to a virtual network card?
> effort is put into protecting one user from another.
This is because it dates back to the times when you'd have multiple real users logged into a server on a campus. There's a bunch of *nix perms stuff that makes no sense in modern production environments.
And in my home environment, I only ever have one user, and … is relevant.
> hardware virtualization
Is this similar to Qubes OS?
You can think of this like Qubes but for the server-side instead of the desktop/consumer. It's a very good analogy in my opinion.
>zero-user
Wouldn't the normal "single user" term be more appropriate? Or by Zero user you mean something else?
I think 'zero user' is right, there just isn't any concept of users or permissions.
It's really moving the classic OS resource-sharing interface to the hypervisor: that's where the storage and network come from. Given that, what machinery do you need to run just one application? Maybe a little filesystem, a TCP stack, interfaces to manipulate threads and kernel locks. That's what gets packaged up with your application into a bootable VM.
So whoever exploits that one process has complete control over everything, right ?
Well, they have control over the application, and whatever its VM has access to .. which is supposed to be limited to only the things that the application needs.
The point raised in TFA is that a one-process unikernel has a much smaller attack surface than a Linux instance running the process would have.
If anyone wants to learn more about Unikernels, Ian (the author of the blog post for this HN post) was on my podcast back in April.
That episode is at:
https://runninginproduction.com/podcast/79-nanovms-let-you-r...
It covers topics like what a Unikernel is, but also ways it can be practically applied to build web applications. We also talked about how he built a few apps using them, a few of his open source projects related to Unikernels, and how you can run them on popular cloud hosting providers today.
Do you, or other readers, know of other/more companies operating in the Unikernel space? I'm curious about the landscape of opportunities to work in this area. Would it be wrong to think Unikernels and high CPU/GPU multiprocessing requirements might be a good combination?
Not something we've directly worked on yet but there is a lot of interest and other groups/companies doing ML at the edge utilizing things like GPU PCI passthrough with co-processors attached. That way you could isolate say ffmpeg in one unikernel, your inference app in another, maybe a stats app that shoots rollups to the cloud, etc. I think as costs keep coming down this will get ever more interesting over time. You can for instance already run multiple unikernels on a single rpi4 with ARMv8. The one that ships with v9 will be really nice.
That sounds an awful lot like reinventing a multitasking operating system
At the end of the day most software is not composed of a single application. To further complicate matters, many applications can't even run on a single server today. Many of the companies that users here work at have applications that span hundreds or thousands of servers.
I think part of the argument here is that our software outgrew the one server/one operating system model a very long time ago.
Hi Ian:
> _I think part of the argument here is that our software outgrew the one server/one operating system model a very long time ago._
That it certainly did. What's your opinion of WASM, and where do you see WASM and its nanoprocess model (capability-based security) fitting into all of this [0]? I look at it as a tech midway between KVM and unikernels, and firmly believe WASM is poised to get big soon.
[0]
https://www.infoq.com/news/2019/11/wasm-nanoprocesses-securi...
WASM (and WASI) are interesting and I personally know a lot of people excited about them. We've also run a few wasm runtimes such as wasmer inside Nanos.
Having said that, there are a handful of technical limitations {raw sockets, 32-bit, dynamic linking, performance, security, threads, TLS} that exist today and prevent a bunch of applications from being deployed there. While most of them can be dealt with technically, I think that just because of where WASM descends from (the browsers), there are some organizational (read: standardization) issues that will probably prevent a lot of it from being dealt with. It's been a while since I looked at it, but to me none of these are hard technical issues; they are just issues where the WASM community will need to decide which direction they want to go and how married to the browsers they want to be.
At my last job we had a former game dev create a single high-compute executable combining an ffmpeg core, an SQLite manager, a REST server, a facial recognition engine, and a notifications push server. Considering the number of OS services such a Swiss-army-knife application requires, would it be appropriate for a unikernel, as a single app? Reason being, when combining all those aspects, huge optimizations become available that are simply not possible when the work is spread across multiple executables.
The author of the article is the CEO of NanoVMs, which makes the Nanos unikernel and tooling.
It looks pretty interesting with Go, php, and nodejs support.
I tried dabbling in MirageOS in the past, and actually love OCaml as a language, but the local setup required to code, build, and deploy is too much for after working a full day.
The OPS tool and Nanos look much easier to use. I haven't dug through the unikernel code to see how real it is, but the efficiency is def. attractive after dealing w/ the k8s & Docker stack all day.
Yeah, their tooling is very easy to install. I'm having a fiddle with it now :).
Agreed. Pragmatism & DX looks to have had a huge focus here, which alone makes it much more interesting than other projects which I found fairly rough at the edges in comparison.
Operating systems are there to help you share resources.
You're going to have numerous applications share the same NIC, PCI bus and memory. As far as performance is concerned, running those applications bare metal on a traditional kernel makes a lot more sense than creating virtual resources that each application can pretend to have exclusive ownership of.
The only way to get efficient networking out of a VM is to dedicate the hardware to that VM which defeats the point.
Certainly in 1970 that was the case. Nowadays things look different (in some ways): Hypervisors already mediate the hardware, presenting virtual NICs etc up to the guests. Partitioning hypervisors split up huge machines into smaller slices with dedicated (real) hardware per slice. And sometimes people really want to run a single application baremetal for performance reasons. And ask yourself how many CPUs you have even in your PC - it's a lot more than you think. Why not run single applications on one of the several CPUs that live in your hard disk for example?
I guess you might think that if you never did high-performance networking.
The whole setup adds a lot of latency and affects throughput too.
Unikernels don't want to replace systems where multiple competing processes are running (heck, most unikernels don't even support processes). The most obvious things you could run as unikernels are services and DBs.
Nanos has this basic assumption that it will never be deployed on bare metal. It assumes it is going into a virtualized environment such as AWS or GCP or Azure or whatever.
I'm criticizing the statements made in the article, which are that traditional kernels are obsolete and only made sense in the 70s.
Yet they still are the best thing for bare metal.
Basically, this is not about operating system kernel but about the notion of running a single process with somebody else's PAAS, which runs an actual operating system with a kernel and virtualization layer.
Doing this on bare metal makes a lot less sense because then suddenly you need to worry about managing all the hardware, initializing it, dealing with file systems, storage, memory and all the other stuff that operating systems tend to do (or at least orchestrate). Modern VMs on the other hand don't really need to do any of that and simply hand off that stuff to the underlying OS via kvm or some other virtualization API. You could do all of that in your unikernel on top of bare metal of course.
That's what embedded software manufacturers of course do. They build firmware that goes into a ROM and the embedded hardware boots the firmware. That's a classic unikernel. I guess some network router manufacturers might do something similar though most of them seem to like running linux or bsd variants and not reinvent a lot of wheels instead.
MirageOS (unikernel) plug:
and repo:
https://github.com/mirage/mirage
A bit OT: what kind of performance do you normally expect from OCaml? I looked at the Prime Sieve challenge[1] recently, and with the current submissions[2] the Standard ML solution is 3 times faster than the OCaml one. I've also spent some time optimizing this SML solution and got it several times faster still. With a bit more work I think I could get it up to the speed of the fastest C solution. It's compiled with MLton. From that POV OCaml is a bit disappointing, since I've also viewed SML as a "hipster language" that hasn't had that much money put into it, while OCaml saw quite a bit more development, so I would expect the OCaml compiler to be able to produce better code. I've briefly looked at the OCaml code, but I don't know OCaml as well as SML, so I can't really say what the issue may be in the OCaml case.
[1]: https://plummerssoftwarellc.github.io/PrimeView/report?id=rb...
[2]: https://github.com/PlummersSoftwareLLC/Primes
I would expect such a difference between OCaml and MLton (in either direction in fact) to be mostly the result of an apple-to-orange comparison. In this case, the OCaml code looks neither idiomatic nor optimized.
MLton is a whole-program compiler, so it's not exactly comparable to OCaml. OCaml compiles insanely fast, MLton not so much.
I'm wondering if anyone has seen any attempts at creating unikernels that work with major scripting languages like PHP, Python or Ruby.
I know this would seem a bit like a contradiction to the ethos of unikernels, but it would give the benefits of unikernels (i.e. smaller attack surface) to a much larger audience.
Nanos has a concept called Packages [1], i.e. prebuilt images that package a language runtime or application component, e.g. Redis. It looks like they have packages for various scripting languages.
[1]
https://nanovms.gitbook.io/ops/packages
That should be possible if you use a web server like Apache as main application and then link the module that your scripting language requires. Not sure whether NanoVMs already support it though.
Hey, Unikraft has support for running the Python[0] and Ruby[1] runtimes as unikernels. We're about to release PHP too :) Watch this space!
[0]: https://github.com/unikraft/app-python3
[1]: https://github.com/unikraft/app-ruby
I think we will be stuck with JVM on top of k8s container on top of linux VM on top of kvm on top of linux forever
At some point someone (will it require a combo developer/MBA?) is going to make a name for themselves by finally being heard when they say all this complexity is unnecessary. The raw compute at our finger tips is being wasted, replaced by pointless complexity for resume inflation and unnecessary cloud bills.
Great entrant in the space that is actually usable with ergonomic tooling and good docs:
Promising project that's inactive but was one of the first ones I found with reasonable ergonomics and no lock-in to a specific language that I didn't use:
https://github.com/rumpkernel/rumprun
Unfortunately it looks to be unmaintained as of now, but I expect the examples still work etc. (https://github.com/rumpkernel/rumprun/issues/135)
He's wrong about Linux, it's possible to link regular applications into Linux-based unikernels:
https://github.com/unikernelLinux/linux
Umm what is this?
“This branch is 71 commits ahead, 27784 commits behind torvalds:master” and the lack of any docs/readme isn’t very promising
Given 15k commits per Linux release (see e.g. http://lkml.iu.edu/hypermail/linux/kernel/2105.1/00457.html), that's actually trailing just two upstream kernel releases. That's pretty good for a small project.
It's under development, and does indeed lack much in the way of docs at the moment, but it's a way to link regular applications into Linux to make unikernels that will run in VMs or on baremetal.
If there is often no scheduler, how does multithreading work? Also, how do things like valgrind or compiler sanitizers work for unikernels?
Many high-throughput server architectures, like modern thread-per-core designs, already do all of their own thread scheduling in user space. Similarly, they tend to do their own storage I/O scheduling in user space as well, bypassing the kernel cache and sometimes the file system. Similar with networking.
For this type of software, which treats Linux as little more than a device driver, a unikernel looks pretty similar to how they already interact with a normal OS, so porting to a unikernel should be relatively straightforward. I think this is more the case that unikernels target.
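For readers unfamiliar with the pattern, here is a minimal sketch (mine; real engines such as Seastar add per-core queues and userspace I/O on top) of the thread-per-core idea: the application pins one thread to each CPU and does its own scheduling inside the loop. Compile with -pthread; Linux-specific (GNU extensions).
```c
/* Thread-per-core sketch: one pinned OS thread per CPU, each
   running its own loop, with the application (not the kernel)
   deciding what work runs where. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static void *core_loop(void *arg) {
    long core = (long)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET((int)core, &set);
    /* Pin this thread to its core; from here on the kernel
       scheduler has essentially nothing left to decide for it. */
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    printf("core %ld: running its own scheduler loop\n", core);
    /* for (;;) { poll this core's queue; run tasks; } */
    return NULL;
}

int main(void) {
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);
    pthread_t tids[64];
    if (ncores > 64)
        ncores = 64;
    for (long i = 0; i < ncores; i++)
        pthread_create(&tids[i], NULL, core_loop, (void *)i);
    for (long i = 0; i < ncores; i++)
        pthread_join(tids[i], NULL);
    return 0;
}
```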
Well, Nanos definitely does scheduling and it supports multiple threads. In this day/age you kind of have to. In fact we are at the tail end of removing our version of the BKL.
Very good podcast (with transcript!) about kernels / lib kernels / unikernels:
https://signalsandthreads.com/what-is-an-operating-system/
> However, when it crashed, we could easily export the VM image and download it locally and run it. We could attach gdb to it and pinpoint directly where it was having an issue.
How does this work when running the VM again? How does one ensure the same issue is recreated?
Implicit in that is having a reproduction case, even if it's just "run the thing with a workload until it crashes".
Right, that's worse than being able to connect to a VM with a full OS and investigating the problem there by analysing a core dump or something. It's an ergonomics issue with unikernels that I was hoping they had a better solution for.
You can easily get core dumps and there's a whole wealth of debugging tools available such as ftrace/strace/etc.
I didn't see in the documentation where the files are supplied to the run process... if it can be fed just the minimum files required, that's close to good enough to do capability based security.
https://nanovms.gitbook.io/ops/configuration#files
I most commonly use
https://nanovms.gitbook.io/ops/configuration#dirs
or
https://nanovms.gitbook.io/ops/configuration#mapdirs
to avoid copies and keep the same layout.
I believe that's one of the things the `ops` tool does, i.e. takes an image and bundles it with e.g. script or config files.
So they've found out that unikernels are faster than running a kernel in a VM. But what about running your app in a Docker container on bare metal?
Docker is specific to the Linux kernel.
Not really. You can have Windows Docker containers.
That's achieved by running a Linux kernel using virtualisation, if I understand correctly. (Wikipedia tells me WSL2 runs a full Linux kernel. [0]) Same goes for Docker on macOS.
Docker containers are built atop the Linux syscall interface, after all. The goal of Docker is to allow you to bring your own userland without bringing your own kernel, so that many containers can run atop a single shared kernel instance. That's what distinguishes it from 'conventional' virtualisation.
If you really want 'bare metal containers', that's pretty much what unikernels are going for.
[0]
https://en.wikipedia.org/wiki/Windows_Subsystem_for_Linux
Also somewhat relevant:
https://docs.docker.com/engine/security/seccomp/
No, there are Docker Windows images that run on an NT kernel.
https://docs.microsoft.com/en-us/virtualization/windowsconta...
Thanks, I'd missed that. A quite different codebase then, but still under the Docker brand.
To correct my earlier point then: containers, pretty much by definition, run atop a shared kernel, not on bare metal.
It's the same codebase AFAIK, just ported to multiple kernels.
I think part of the issue here is how overloaded the term "bare metal" has become. In context I believe the parent is contrasting unikernel on hypervisor to container on kernel, seeing one coarse privilege boundary in both cases, reducing the obvious benefits of a unikernel design.
> It's the same codebase AFAIK, just ported to multiple kernels.
Presumably plenty of the code is specific to Windows.
Docker/Linux works by using various features of the Linux kernel to, well, _contain_, containers. As I understand it this is in contrast to BSD Jails and Solaris Zones, where the whole job is done 'officially' by the kernel.
> In context I believe the parent is contrasting unikernel on hypervisor to container on kernel, seeing one coarse privilege boundary in both cases, reducing the obvious benefits of a unikernel design.
Good point.
> Presumably plenty of the code is specific to Windows.
There's some code specific to windows but way less than you might think.
> Docker/Linux works by using various features of the Linux kernel to, well, contain, containers. As I understand it this is in contrast to BSD Jails and Solaris Zones, where the whole job is done 'officially' by the kernel.
Yes, that's technically true, but it doesn't really matter much to the discussion at hand at the end of the day. In all these cases (including if Solaris zone/FreeBSD jail support were added to Docker), Docker has the same amount of work to do and it looks very similar. The containerization features of Linux are absolutely significantly more ad hoc than those of NT, Solaris, or FreeBSD, but at the end of the day a complete userspace component for managing those containers is the same amount of work, and that work can look very, very similar. The difference mainly comes in:
* What happens with an incomplete container-managing daemon. When Linux presents a new resource to be containerized, it has historically been a new interface that needs support in the container-managing daemon for the resource to be forked off of the root namespace or something. Systems with a cohesive concept of a container have a better chance of breaking that resource away from the root namespace by default, without configuration. A complete implementation in both cases requires manual config and setup.
* Linux's containerizing features can potentially be split up and used in ways not originally intended. You can sort of see this in clone(2) as well: there, fork and thread creation have been coalesced into the same syscall, and you just provide a bitset of resources to share or not. Share nothing? Pretty much a fork. Share the virtual address space, file descriptor table, tgid, etc.? You've just created a thread. Because of that you can do really neat things like sharing a virtual address space but not the file descriptor table, if that's useful to you for some reason (see the sketch after this comment).
At the end of the day, it's more or less the same work for docker regardless of the mechanism used to push the container's config to the kernel.
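To make the clone(2) flag-bitset point concrete, a small Linux-only sketch (mine, not the commenter's): the child shares the parent's address space but not its file descriptor table, a combination that is neither a classic fork nor a classic thread.
```c
/* clone(2) with a hand-picked resource bitset: CLONE_VM shares
   the virtual address space; omitting CLONE_FILES keeps the fd
   tables separate. */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared_counter = 0;

static int child(void *arg) {
    (void)arg;
    shared_counter++;        /* visible to parent: address space is shared */
    close(STDOUT_FILENO);    /* invisible to parent: fd table is private */
    return 0;
}

int main(void) {
    const long stack_size = 1024 * 1024;
    char *stack = malloc(stack_size);
    if (!stack)
        return 1;
    /* Stack grows down on x86, so pass the top of the buffer. */
    pid_t pid = clone(child, stack + stack_size, CLONE_VM | SIGCHLD, NULL);
    if (pid < 0) {
        perror("clone");
        return 1;
    }
    waitpid(pid, NULL, 0);
    /* Prints 1: the child incremented our memory, yet our stdout is
       still open because the fd tables were separate. */
    printf("shared_counter = %d\n", shared_counter);
    free(stack);
    return 0;
}
```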
Interesting, thanks.
Do you work on Docker internals?
Some points to consider:
1: This "big idea" will require a lot of differentiated feature work to make it be on-par with those dumb old slow Linux/Windows/Mac execution environments. You won't be able to develop or test or troubleshoot/debug apps for Unikernels the same way you do normal apps today. That _might_ turn out to be a benefit... but any such huge change carries hurdles which need to be jumped over before we can become comfortable with it.
2: On _"if the datacenter is the computer, then the cloud is its operating system"_. It's really better if you don't think about datacenters. It's fine to think about _regions_ and _availability zones_. Those abstractions actually help you design a better, more resilient, more performant application. But there might be 5 different datacenters serving a single AZ. Datacenters are messy and have all kinds of weird considerations... best to stay in the logical abstraction world, rather than physical.
3: If we're talking about the future of computing, here's a few other big ideas for you:
- Immutable SaaS. You've heard of immutable infrastructure? It's where, for example, you use Packer to create a VM image, and deploy a new EC2 instance with that VM image. If there's a bug in your app, you deploy the old VM image to an EC2 instance (or keep the old instance online and send traffic back to it, called blue/green deployment). No ssh'ing to a machine and 'git clone'ing and running puppet and restarting Nginx or any of that nonsense. Just spin up a VM with a pre-baked pre-tested validated set of software and configuration. It's the best, most reliable way to manage deployments, period. But that's just "the infrastructure". What about SaaS that you don't control like a virtual machine? We need SaaS to be immutable now, too. Literally the only reason Terraform exists is that 0% of the cloud's SaaS APIs are immutable. The majority of AWS APIs cannot be treated like giant blocks of versioned configuration that can be reverted at a moment's notice. Calling the APIs results in failure that has to be worked around by the client, rather than the SaaS taking care of it for you. This leads to complexity in management and lots of failures. Same goes for things like Kubernetes, which lie to you about "declarativeness", but meanwhile in the background the state is mutating like crazy and you have to pray that the change you deployed last week will work the same this week.
- Versioned Immutable Data. A database should not be one giant constantly-mutating-pile-of-state that is impossible to revert without writing a bunch of scripts ahead of time. Modern databases _need_ to start supporting not only transactions, but a dead-simple way to roll back to any change in time. This will result in a tectonic shift in how we can deploy and manage data in real time, enabling faster, more complex work.
- First class support in every language for versioned function/subroutine DAGs. Essentially, every function that calls every other function in a program needs to have a version, and an explicit declaration of what versions of what functions it is compatible with. This functionality already exists in C today, but nobody uses it because they're unaware of its utility. If we all started using it, it would obsolete entire classes of software created to deal with the problems it would solve. Package management and backwards/forwards compatibility would just natively be taken care of, with no need to "manage" what versions of what software you have installed. We could support feature flags and upgrade-in-place code that could be reverted in a microsecond just by pointing at the previous DAG of software versions that an app used. "Deployment" would become flipping what version of a function is current, and it could automatically flip back to a previous one after a certain rate of errors. And because it's a DAG, it could work over entire distributed systems.
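The existing C facility being alluded to is, presumably, GNU symbol versioning; a hedged sketch (library layout, version script, and symbol names all invented for illustration):
```c
/* Two implementations of process() coexist in one shared library;
   old binaries keep resolving to the version they linked against,
   new links get the default. Build roughly as:
     gcc -shared -fPIC -Wl,--version-script=vers.map process.c -o libproc.so

   vers.map:
     V1 { global: process; local: *; };
     V2 { global: process; } V1;
*/
#include <stdio.h>

int process_v1(int x) { return x + 1; }   /* the original behavior  */
int process_v2(int x) { return x * 2; }   /* the "new deployment"   */

/* Bind each implementation to a version node: "process@V1" stays
   available forever; "process@@V2" is the default for new links. */
__asm__(".symver process_v1, process@V1");
__asm__(".symver process_v2, process@@V2");
```
Flipping which version is the default (the "deployment") is a relink of the library; existing consumers are untouched, which is the revert-in-a-microsecond property the comment describes.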