💾 Archived View for nanako.mooo.com › gemlog › 2023-11-01-a.gmi captured on 2024-06-16 at 12:48:56. Gemini links have been rewritten to link to archived content
⬅️ Previous capture (2023-12-28)
-=-=-=-=-=-=-
This is all some stuff I've been considering for a few weeks. The post itself came about due to my own Mastodon thread:
Benben, my VGM player, does just two things: it plays VGM files, and it bulk-converts them to WAV or AU. That's it. During playback, it uses three processes to help split up the work, and to help further isolate the actual playback code from the user interface code. The three processes it uses are:
Benben is written in Crystal, which provides a "Fiber" class to help you split up your work. Fibers are very similar to operating system threads, except they are cooperatively scheduled by the Crystal runtime itself [1][2]. At the time of writing, Crystal also defaults to scheduling these on a single operating system thread [2]. However, using the compile time flag '-Dpreview_mt" allows the scheduler to use multiple worker threads, where each worker thread has its own scheduler and a queue of runnable Fibers [3]. Which worker thread gets which new Fiber is decided in a round-robin way [3]. Given this, I think it's safe to say that this uses a M:1 model when using the default, and a hybrid M:N threading model when a program is compiled with the "-Dpreview_mt" flag [4].
Benben currently enforces the use of "-Dpreview_mt" when compiling it. If you use the included Rakefile, this is taken care of for you. There is an experimental branch called "allow-build-without-preview_mt" that lets you build without "-Dpreview_mt", but the results leave a lot to be desired. For example, normal playback of a VGM works [5], but the user interface flickers heavily, and the program will deadlock if you press "q" to quit. Rendering works, but is very slow.
Still, the design of Benben is such that "-Dpreview_mt" is the only model that really makes sense. Rendering VGMs to WAV or AU files is one of those problems that are "embarrassingly parallel" [6]. Each song is self-contained, so no song depends on any other song during rendering, meaning that one song can be rendered per thread of execution. This is exactly what Benben does. When used with its default options, Benben will spawn X number of Fibers, where each Fiber renders a single song; X is equal to the number of logical cores in this case, or the total number of songs, whichever is less. However, X can also be controlled with the "--jobs" command line option, which lets a user specifically request the number of Fibers to use, though this may still be less if they have fewer total songs than jobs.
However, remember that Fibers are not threads. M Fibers get scheduled over M operating system threads. Programs compiled with Crystal default to using only four operating system threads when "-Dpreview_mt" is used, but this can be modified at runtime by changing the "CRYSTAL_WORKERS" environment variable. So if you want all your Fibers to get distributed over 12 operating system threads, you'd build your program with "-Dpreview_mt", then run it like this:
CRYSTAL_WORKERS=12 ./myprogram
This is pretty neat and reminds me of Go and its goroutines. But this is also where my problems start.
Let's take my desktop as an example. It has a Core i9-10850K processor in it, which means it's a 10 core/20 thread machine. Let's just call this "20 logical cores" for short. In my "~/.bashrc" I have "CRYSTAL_WORKERS=20" because I want Crystal programs to use up to all my cores by default.
So, 20 logical cores, and "CRYSTAL_WORKERS=20". Now lets also assume I'm rendering 61 VGM files to WAV. Benben knows I have 20 logical cores, so unless I specify the "--jobs" option when I run it, Benben will spawn 20 Fibers to render 20 VGM files to WAV in parallel. In this situation, Benben is effectively doing one Fiber per scheduler thread because I have "CRYSTAL_WORKERS" set to 20. And, since the operating system threads are preemptively scheduled, that means my Fibers are _effectively_ preemptively scheduled. This is exactly the behavior I want: N songs = N preemptive threads, a 1F:1T model.
If I reduce "CRYSTAL_WORKERS" to fewer (e.g. four, the default for Crystal if the environment variable was never set), then Benben tries to render 20 songs at once over 20 Fibers that are scheduled over _less than 20 operating system threads_. The MF:NT model absolutely works here, though is slower than when Benben uses a 1F:1T model.
But, I'm a rather technical user. I not only know what an environment variable is and how to use one, but I also know how the "CRYSTAL_WORKERS" environment variable works.
Let's assume a non-technical user with a machine identical to mine downloads a pre-compiled binary of Benben and tries to render 61 VGM files to WAV. "CRYSTAL_WORKERS" won't be set for them, so the Crystal runtime will only launch four worker threads when Benben starts. They are already getting a reduced experience. Taking this further, let's assume they are at least technical enough to open up their system monitoring program, and upon doing so, see that only four of their CPU cores are doing work. "What? I thought Benben was supposed to use all my cores to render songs? Is it broken?" Now not only do they have a subpar experience, but it looks like my software is misbehaving.
Now, let's assume they have a way of contacting me (Mastodon, email, Fossil ticket, whatever) and they ask me what's up with this behavior. I could explain to them that it's due to how the language it's written in (Crystal) works. Unfortunately this not only makes my program look kinda subpar, but makes Crystal look bad as well. I could also tell them to use fewer "--jobs", but them I'm back at a situation where Benben looks like crap software. After all, they have 20 cores and they want to fully use them, damnit! Without messing with "environment variables" or whatever the heck those are!
Overall, a workable but underwhelming experience.
Of course it just gets even worse if the program was compiled without "-Dpreview_mt".
The problem basically boils down to Crystal's MF:NT model of multitasking. While this is an excellent model for many problems (servers that need to answer requests immediately come to mind), it's not a "one-size-fits-all" solution. And unfortunately, Crystal does not have true First-Class Threads that I can spawn myself [7]. This problem manifests itself as performance issues, which are not only frustrating in nature, but could potentially make Benben look bad, and possibly also Crystal if push came to shove. Crystal _does_ seem to have a natural drive to be optimized for backend applications that run on servers, but it also seems to want to be a general purpose programming language. If it's limited to a MF:NT model (or a M:1 model, when "-Dpreview_mt" isn't used), then it simply misses out on being a good option for a whole set of programs.
While looking into this problem, I have come across a few threads [11][12] that touch on issues that are similar to my own. In them, the question of what should be the default tends to come up. Should Crystal default to four worker threads? Or should it default to spawning X worker threads where X equals the number of logical CPU cores? If it did the latter, what would that mean for RAM usage? How about contention? No two programs are the same, after all. There was also a suggested solution: a _compile_ time tuning knob that would set the default number of worker threads.
But, to all of these questions, I feel the best solution is for the language to not limit the programs written with it. Rather, a good program should instead offer the tuning knobs _itself_, not the language. This is why Benben has a "--jobs" option. I feel this is especially true for programs that run on servers, where tuning is vital. If the language imposes a limit itself, then that's one less knob for the end user (or at least a more troublesome knob).
To the suggestion of a compile-time knob for the default number of workers, this seems like an even worse idea. I have no way to predict the number of cores an end user will have. Of the computers I own myself, I have a few with four logical cores (some of which are in big.LITTLE), one with 12, one with six (in a big.LITTLE config), and one with 20. A compile-time knob just moves the goal posts, and does nothing to solve the problems I've discussed in the previous section.
There are a few stopgap solutions, though.
One possible solution is to detect the number of CPU cores very, very early at program launch, then modify the number of worker threads that get spawned by the runtime. This would require a redefinition of the "main" function in Crystal:
fun main(argc : Int32, argv : UInt8**) : Int32 LibC.setenv("CRYSTAL_WORKERS", System.cpu_count.to_s, 1) Crystal.main(argc, argv) end
The problem with this solution is that it is WAY before command line arguments are processed by Benben, so "--jobs" still gets wonky. I'm also not 100% sure if "System.cpu_count" is available at that time. But it's a start.
A second possible solution is to monkey patch the Crystal runtime's scheduler:
class Crystal::Scheduler private def self.worker_count env_workers = ENV["CRYSTAL_WORKERS"]? if env_workers && !env_workers.empty? workers = env_workers.to_i? if !workers || workers < 1 Crystal::System.print_error "FATAL: Invalid value for CRYSTAL_WORKERS: #{env_workers}\n" exit 1 end workers else System.cpu_count end end end
Unfortunately this runs into the same problems as when you monkey patch _any_ API: you need to track the upstream code to ensure your monkey patch will continue to work properly with the rest of the program. This does seem like a good possible short-term solution, however (though be sure to check out the Benchmarks section below).
The solution I would prefer long-term is for Crystal to give us a proper Thread API. It can still have its Fibers and the scheduler, but it should expose the scheduler as a public API so we can disable it or tweak it to our needs. But most importantly: **Allow us to spawn native First-Class Threads (operating system threads)**. After all, sometimes you just want a real operating system thread, not a Fiber.
One thing to note is that Benben did not start life as a Crystal program. My usual MO is to prototype stuff in Common Lisp, then examine if it would be beneficial to do actual production code in another language (read: Crystal). Benben, and it's underlying VGM library YunoSynth, were no different. They started life in Common Lisp, and I still have that code. In fact I've even expanded it a bit out of curiosity and have found it to perform almost identically to the Crystal version as far as CPU usage goes [8]. More importantly: the Common Lisp implementation I target (SBCL) gives me easy-to-use true First-Class Threads. There's also libraries [9] available to let me use Green Threads (which are very similar to a Fiber in Crystal) alongside normal First-Class Threads if I really, really want some cooperative multitasking (highly unlikely for Benben). There's also the SB-SIMD package that comes with SBCL that gives me a very, very nice interface to do SIMD stuff, potentially making the Common Lisp version faster than the Crystal version currently is.
Basically, I could always just drop Crystal and go back to writing Benben[10] in Common Lisp.
I really don't want to do this, though. Not because of difficulty - it wouldn't be at all difficult, and only mildly time consuming since all the hard parts are already solved. I don't want to do this because I don't want to hurt Crystal. It's still a new language, and I really hope to bring (positive) attention to it with Benben. Not that Benben is the _only_ program I have written in Crystal... I have many others, including the Gemini server that served you this page. But Benben and it's cousin, midi123, are probably the most publicly visible of my programs.
But, if push came to shove, the mother in me will come out and protect my own child first and foremost.
So, I've talked about how Benben works in regards to Fibers, "-Dpreview_mt", and the "CRYSTAL_WORKERS" environment variable. How about some actual numbers?
These benchmarks were done on my desktop computer, some of the specs of which are:
Benben was compiled using Crystal v1.10.1. The compiler was built on my system from source with LLVM 13.0.0. Each run of Benben was performed on a system with just a terminal and Emacs (for record keeping) running in a Sawfish X11 session to keep other programs from interfering as much as possible. There were at least 10 seconds of pause time between each run to allow the CPU to cool to a "normal" base temperature.
The numbers below are geometric means taken from three runs for each configuration. Three things were measured: the time taken, the average samples rendered per second, and the memory used. The time taken and average samples per second are reported directly by Benben and do NOT include the time it takes to startup and load the XSPF playlist The memory measurements were taken using a program called xtime and DO take the startup and XSPF load into account. The source code (Crystal) for xtime is included above the footnotes in its own section.
The configuration permutations aren't exhaustive, but I think I got the important ones. If someone wants me to add a few more on, I will.
These runs all rendered 61 songs to 44.1Khz 16-bit stereo WAV. The songs were referenced in an XSPF playlist and consisted of songs from the Sharp X68000 computer and the NeoGeo, both of which require somewhat more involved emulation cores (YM2151 and YM2610, respectively).
These benchmarks were done using two checkouts from Benben's repository:
If you wish to examine the source repository, it's available here:
The YunoSynth code was at checkout 5f351d5e3c31782ef1e56dcaf0b0c8664f79af82. It's repository is here:
Benben was built using this, with "-Dpreview_mt" added/removed as-needed:
shards build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40
Some issues to note:
The source code (Common Lisp) used to calculate the geometric means, along with the xtime code, is in its own section above the footnotes. The data file with the raw values can be accessed here (it can be directly processed with the Common Lisp code):
Finally, here are the results when using "-Dpreview_mt":
CRYSTAL_WORKERS=20, --jobs 20, WITHOUT Fiber.yield added Time: 00:00:23:88967938 Samples per Sec: 6,806,318 Memory Used: 112.62 MiB CRYSTAL_WORKERS=20, --jobs 4, WITHOUT Fiber.yield added Time: 00:00:49:561379837 Samples per Sec: 3,157,491 Memory Used: 82.06 MiB CRYSTAL_WORKERS=20, --jobs 1, WITHOUT Fiber.yield added Time: 00:01:25:153285603 Samples per Sec: 1,829,959 Memory Used: 53.44 MiB CRYSTAL_WORKERS=20, --jobs 20, Fiber.yield added Time: 00:00:21:316036046 Samples per Sec: 7,445,727 Memory Used: 152.65 MiB CRYSTAL_WORKERS=4, --jobs 20, Fiber.yield added Time: 00:00:49:71518083 Samples per Sec: 3,186,044 Memory Used: 113.33 MiB CRYSTAL_WORKERS=4, --jobs 4, Fiber.yield added Time: 00:01:03:478200751 Samples per Sec: 2,462,458 Memory Used: 73.96 MiB CRYSTAL_WORKERS=4, --jobs 1, Fiber.yield added Time: 00:01:28:427111020 Samples per Sec: 1,766,431 Memory Used: 58.58 MiB CRYSTAL_WORKERS=1, --jobs 20, Fiber.yield added Time: 00:02:31:215964094 Samples per Sec: 1,030,649 Memory Used: 13.94 GiB CRYSTAL_WORKERS=1, --jobs 1, Fiber.yield added Time: 00:02:26:233878863 Samples per Sec: 1,066,631 Memory Used: 1.20 GiB
And here are results WITHOUT the "-Dpreview_mt" flag:
--jobs 20, Fiber.yield added Time: 00:02:25:331626453 Samples per Sec: 1,072,926 Memory Used: 112.62 MiB --jobs 4, Fiber.yield added Time: 00:02:24:504344037 Samples per Sec: 1,078,986 Memory Used: 82.06 MiB --jobs 1, Fiber.yield added Time: 00:02:24:695090870 Samples per Sec: 1,077,629 Memory Used: 53.44 MiB
Benben is fastest when compiled with "-Dpreview_mt", and is allowed to use 20 worker threads and 20 jobs. This is when it behaves as if it has true First-Class Threads that are preemptively scheduled for this machine. When compiled with "-Dpreview_mt", Benben is slowest when using one worker thread with 20 jobs. I interpret this as meaning a 1F:1T model is ideal, and that a true 1:1 "no Fibers, all First-Class Threads" would potentially be even more ideal.
Reducing the worker threads causes a larger performance drop than leaving "CRYSTAL_WORKERS" set to 20 and reducing the number of "--jobs". I interpret this as meaning the MF:NT model works best when M is <= N, and MF:NT results in slower performance than a 1F:1T model.
Fiber.yield usually results in slightly faster performance when "-Dpreview_mt" is used, but also slightly higher RAM usage. However, when compiled with "-Dpreview_mt", using one worker thread, and Fiber.yield is added, memory usage is absurdly high (in the gigabytes) compared to using more worker threads.
Fiber.yield must be used when compiling without "-Dpreview_mt" because of the cooperative multitasking.
When compiled without "-Dpreview_mt", Benben runs its absolute slowest.
Memory allocation and deallocation are almost certainly playing _some_ part to these numbers. However, I don't believe it's Benben causing these extra allocations. YunoSynth and Benben are both written specifically to allocate things like audio buffers only once, and minimize the allocation of strings as much as possible. Any extra allocations are more than likely to be in the standard library.
Overall, I feel these benchmarks support these conclusions:
xtime source code:
# Based on https://github.com/kostya/benchmarks/blob/master/xtime.rb class EnergyStats PATH = "/sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj" getter? hasEnergyMetrics : Bool = false @accum : UInt64 = 0 @energy : UInt64 = 0 @maxEnergy : UInt64 = 0 def initialize @hasEnergyMetrics = File.readable?(PATH) @maxEnergy = File.read(PATH).to_u64 if @hasEnergyMetrics end @[AlwaysInline] def update return unless @hasEnergyMetrics newVal = File.read(PATH).to_u64 if @energy == 0 # first reading @accum = 0 elsif newVal > @energy @accum += newVal - @energy elsif newVal < @energy # counter has been reset @accum += @maxEnergy - @energy + newVal end @energy = newVal end def val sprintf("%0.3f", 0.000001_f64 * @accum) end end @[AlwaysInline] def mem(pid : Int64) : Int64 `ps p #{pid} -o rss`.lines[-1].to_i64 * 1024 end def usage : Nil STDERR << %|Usage: xtime <program> [args...] xtime [-e] [-s] <program> [args...] -q : Don't print command name -e : Show stderr output from process -s : Show stdout output from process Stderr and Stdout are both hidden by default. | exit 1 end usage if ARGV.empty? # Check options output = Process::Redirect::Close outerr = Process::Redirect::Close quiet = false outfile = nil outname = nil outlang = nil loop do break if ARGV.empty? case ARGV[0] when "-q" quiet = true ARGV.shift when "-e" outerr = Process::Redirect::Inherit ARGV.shift when "-s" output = Process::Redirect::Inherit ARGV.shift when "-o" ARGV.shift outname = ARGV.shift outlang = ARGV.shift outfile = ARGV.shift when "-h" then usage when "--" then break else break end end # ARGV may now be empty... if ARGV.empty? STDERR << "ERROR: No program given\n" exit 1 end # Used to track memory usage highestMem : Int64 = 0 newMem : Int64 = 0 # Start energy tracking, if possible energy = EnergyStats.new # Start the process and timer start = Time.monotonic proc = Process.new(ARGV[0], ARGV[1..], output: output, error: outerr) pid = proc.pid # This fiber checks memory usage spawn do highestMem = mem(pid) loop do sleep 5.milliseconds newMem = mem(pid) highestMem = newMem if newMem > highestMem energy.update end end # All done, report proc.wait finish = Time.monotonic finalMem = highestMem.humanize_bytes(format: :JEDEC) if energy.hasEnergyMetrics? STDERR << "#{ARGV[0]}: == #{finalMem}, #{(finish - start)} (#{energy.val} J energy) ==\n" else STDERR << "#{ARGV[0]}: == #{finalMem}, #{(finish - start)} (? J energy) ==\n" end
Code used to calculate geometric means:
(require 'computable-reals) (require 'cl-sdm) ;; For STRING-REPLACE and SPLIT-STRING ;; Nanoseconds per second (defconstant +ns+ 1000000000) ;; Each time in TIMES should be a STRING in the format: hh:mm:ss.nnnnnnnnn (defun calculate-geometric-mean-time (&rest times) (labels ((parse-time (time) ;; Converts a "hh:mm:ss.nnnnnnnnn" string into a (:HOURS hh :MINUTES mm ;; :SECONDS ss :FRACTION nnnnnnnnn) plist. (loop for part-num in (mapcar #'parse-integer (sdm:split-string (sdm:string-replace time "." ":") #\: :as-list t)) for part in '(:hours :minutes :seconds :fraction) ;; Rather overkill, but meh append (list part part-num) into ret finally (return (+ (getf ret :fraction) (* (getf ret :seconds) +ns+) (* (getf ret :minutes) 60 +ns+) (* (getf ret :hours) 60 60 +ns+)))))) (let ((gm (cr:raw-approx-r (cr:expt-r (reduce #'* (mapcar #'parse-time times)) (/ 1 (length times)))))) (multiple-value-bind (gm-total-sec gm-ns) (truncate gm +ns+) (let* ((gm-hour (truncate (truncate gm-total-sec 60) 60)) (gm-min (truncate (- gm-total-sec (* gm-hour 60 60)) 60)) (gm-sec (- gm-total-sec (* gm-hour 60 60) (* gm-min 60)))) ;; Return a new string (format nil "~2,'0d:~2,'0d:~2,'0d:~2,'0d" gm-hour gm-min gm-sec gm-ns)))))) (defun calculate-geometric-mean-samples (&rest samples) (format nil "~:d" (cr:raw-approx-r (cr:expt-r (reduce #'* samples) (/ 1 (length samples)))))) (defun calculate-geometric-mean-memory (&rest memory) (sdm:pretty-print-bytes (cr:raw-approx-r (cr:expt-r (reduce #'* memory) (/ 1 (length memory)))))) (defun process-file (filename) (format t "~a" (with-output-to-string (out) (with-open-file (in filename) (loop for line = (read-line in nil nil) while line do (unless (or (sdm:empty-string-p line) (sdm:string-starts-with line "#")) (if (sdm:string-starts-with line "name:") (format out "== ~a ==~%" line) (let* ((parts (sdm:split-string line #\, :as-list t))) (format out " GM Time: ~a~%" (apply #'calculate-geometric-mean-time (subseq parts 0 3))) (format out " GM Samples per Sec: ~a~%" (apply #'calculate-geometric-mean-samples (mapcar #'parse-integer (subseq parts 3 6)))) (format out " GM Memory Used: ~a~%" (apply #'calculate-geometric-mean-memory (mapcar #'parse-integer (subseq parts 6)))) (write-char #\Newline out)))))))))
--------- Page served by Aya https://nanako.mooo.com/fossil/aya/ Aya is under the GNU Affero GPLv3 license