💾 Archived View for nanako.mooo.com › gemlog › 2023-11-01-a.gmi captured on 2024-02-05 at 10:04:37. Gemini links have been rewritten to link to archived content

View Raw

More Information

⬅️ Previous capture (2023-12-28)

-=-=-=-=-=-=-

Some Benchmarks and Issues with Parallel Processing in Crystal

This is all some stuff I've been considering for a few weeks. The post itself came about due to my own Mastodon thread:

Mastodon thread

Benben, my VGM player, does just two things: it plays VGM files, and it bulk-converts them to WAV or AU. That's it. During playback, it uses three processes to help split up the work, and to help further isolate the actual playback code from the user interface code. The three processes it uses are:

Benben is written in Crystal, which provides a "Fiber" class to help you split up your work. Fibers are very similar to operating system threads, except they are cooperatively scheduled by the Crystal runtime itself [1][2]. At the time of writing, Crystal also defaults to scheduling these on a single operating system thread [2]. However, using the compile time flag '-Dpreview_mt" allows the scheduler to use multiple worker threads, where each worker thread has its own scheduler and a queue of runnable Fibers [3]. Which worker thread gets which new Fiber is decided in a round-robin way [3]. Given this, I think it's safe to say that this uses a M:1 model when using the default, and a hybrid M:N threading model when a program is compiled with the "-Dpreview_mt" flag [4].

Benben currently enforces the use of "-Dpreview_mt" when compiling it. If you use the included Rakefile, this is taken care of for you. There is an experimental branch called "allow-build-without-preview_mt" that lets you build without "-Dpreview_mt", but the results leave a lot to be desired. For example, normal playback of a VGM works [5], but the user interface flickers heavily, and the program will deadlock if you press "q" to quit. Rendering works, but is very slow.

Still, the design of Benben is such that "-Dpreview_mt" is the only model that really makes sense. Rendering VGMs to WAV or AU files is one of those problems that are "embarrassingly parallel" [6]. Each song is self-contained, so no song depends on any other song during rendering, meaning that one song can be rendered per thread of execution. This is exactly what Benben does. When used with its default options, Benben will spawn X number of Fibers, where each Fiber renders a single song; X is equal to the number of logical cores in this case, or the total number of songs, whichever is less. However, X can also be controlled with the "--jobs" command line option, which lets a user specifically request the number of Fibers to use, though this may still be less if they have fewer total songs than jobs.

However, remember that Fibers are not threads. M Fibers get scheduled over M operating system threads. Programs compiled with Crystal default to using only four operating system threads when "-Dpreview_mt" is used, but this can be modified at runtime by changing the "CRYSTAL_WORKERS" environment variable. So if you want all your Fibers to get distributed over 12 operating system threads, you'd build your program with "-Dpreview_mt", then run it like this:

CRYSTAL_WORKERS=12 ./myprogram

This is pretty neat and reminds me of Go and its goroutines. But this is also where my problems start.

A Few Quick Definitions

A Hypothetical Example

Let's take my desktop as an example. It has a Core i9-10850K processor in it, which means it's a 10 core/20 thread machine. Let's just call this "20 logical cores" for short. In my "~/.bashrc" I have "CRYSTAL_WORKERS=20" because I want Crystal programs to use up to all my cores by default.

So, 20 logical cores, and "CRYSTAL_WORKERS=20". Now lets also assume I'm rendering 61 VGM files to WAV. Benben knows I have 20 logical cores, so unless I specify the "--jobs" option when I run it, Benben will spawn 20 Fibers to render 20 VGM files to WAV in parallel. In this situation, Benben is effectively doing one Fiber per scheduler thread because I have "CRYSTAL_WORKERS" set to 20. And, since the operating system threads are preemptively scheduled, that means my Fibers are _effectively_ preemptively scheduled. This is exactly the behavior I want: N songs = N preemptive threads, a 1F:1T model.

If I reduce "CRYSTAL_WORKERS" to fewer (e.g. four, the default for Crystal if the environment variable was never set), then Benben tries to render 20 songs at once over 20 Fibers that are scheduled over _less than 20 operating system threads_. The MF:NT model absolutely works here, though is slower than when Benben uses a 1F:1T model.

But, I'm a rather technical user. I not only know what an environment variable is and how to use one, but I also know how the "CRYSTAL_WORKERS" environment variable works.

Let's assume a non-technical user with a machine identical to mine downloads a pre-compiled binary of Benben and tries to render 61 VGM files to WAV. "CRYSTAL_WORKERS" won't be set for them, so the Crystal runtime will only launch four worker threads when Benben starts. They are already getting a reduced experience. Taking this further, let's assume they are at least technical enough to open up their system monitoring program, and upon doing so, see that only four of their CPU cores are doing work. "What? I thought Benben was supposed to use all my cores to render songs? Is it broken?" Now not only do they have a subpar experience, but it looks like my software is misbehaving.

Now, let's assume they have a way of contacting me (Mastodon, email, Fossil ticket, whatever) and they ask me what's up with this behavior. I could explain to them that it's due to how the language it's written in (Crystal) works. Unfortunately this not only makes my program look kinda subpar, but makes Crystal look bad as well. I could also tell them to use fewer "--jobs", but them I'm back at a situation where Benben looks like crap software. After all, they have 20 cores and they want to fully use them, damnit! Without messing with "environment variables" or whatever the heck those are!

Overall, a workable but underwhelming experience.

Of course it just gets even worse if the program was compiled without "-Dpreview_mt".

The Problem

The problem basically boils down to Crystal's MF:NT model of multitasking. While this is an excellent model for many problems (servers that need to answer requests immediately come to mind), it's not a "one-size-fits-all" solution. And unfortunately, Crystal does not have true First-Class Threads that I can spawn myself [7]. This problem manifests itself as performance issues, which are not only frustrating in nature, but could potentially make Benben look bad, and possibly also Crystal if push came to shove. Crystal _does_ seem to have a natural drive to be optimized for backend applications that run on servers, but it also seems to want to be a general purpose programming language. If it's limited to a MF:NT model (or a M:1 model, when "-Dpreview_mt" isn't used), then it simply misses out on being a good option for a whole set of programs.

While looking into this problem, I have come across a few threads [11][12] that touch on issues that are similar to my own. In them, the question of what should be the default tends to come up. Should Crystal default to four worker threads? Or should it default to spawning X worker threads where X equals the number of logical CPU cores? If it did the latter, what would that mean for RAM usage? How about contention? No two programs are the same, after all. There was also a suggested solution: a _compile_ time tuning knob that would set the default number of worker threads.

But, to all of these questions, I feel the best solution is for the language to not limit the programs written with it. Rather, a good program should instead offer the tuning knobs _itself_, not the language. This is why Benben has a "--jobs" option. I feel this is especially true for programs that run on servers, where tuning is vital. If the language imposes a limit itself, then that's one less knob for the end user (or at least a more troublesome knob).

To the suggestion of a compile-time knob for the default number of workers, this seems like an even worse idea. I have no way to predict the number of cores an end user will have. Of the computers I own myself, I have a few with four logical cores (some of which are in big.LITTLE), one with 12, one with six (in a big.LITTLE config), and one with 20. A compile-time knob just moves the goal posts, and does nothing to solve the problems I've discussed in the previous section.

There are a few stopgap solutions, though.

Stopgap Solutions

One possible solution is to detect the number of CPU cores very, very early at program launch, then modify the number of worker threads that get spawned by the runtime. This would require a redefinition of the "main" function in Crystal:

fun main(argc : Int32, argv : UInt8**) : Int32
  LibC.setenv("CRYSTAL_WORKERS", System.cpu_count.to_s, 1)
  Crystal.main(argc, argv)
end

The problem with this solution is that it is WAY before command line arguments are processed by Benben, so "--jobs" still gets wonky. I'm also not 100% sure if "System.cpu_count" is available at that time. But it's a start.

A second possible solution is to monkey patch the Crystal runtime's scheduler:

class Crystal::Scheduler
  private def self.worker_count
    env_workers = ENV["CRYSTAL_WORKERS"]?

    if env_workers && !env_workers.empty?
      workers = env_workers.to_i?
      if !workers || workers < 1
        Crystal::System.print_error "FATAL: Invalid value for CRYSTAL_WORKERS: #{env_workers}\n"
        exit 1
      end

      workers
    else
      System.cpu_count
    end
  end
end

Unfortunately this runs into the same problems as when you monkey patch _any_ API: you need to track the upstream code to ensure your monkey patch will continue to work properly with the rest of the program. This does seem like a good possible short-term solution, however (though be sure to check out the Benchmarks section below).

The Ideal Solution (In My Opinion)

The solution I would prefer long-term is for Crystal to give us a proper Thread API. It can still have its Fibers and the scheduler, but it should expose the scheduler as a public API so we can disable it or tweak it to our needs. But most importantly: **Allow us to spawn native First-Class Threads (operating system threads)**. After all, sometimes you just want a real operating system thread, not a Fiber.

The Alternatives

One thing to note is that Benben did not start life as a Crystal program. My usual MO is to prototype stuff in Common Lisp, then examine if it would be beneficial to do actual production code in another language (read: Crystal). Benben, and it's underlying VGM library YunoSynth, were no different. They started life in Common Lisp, and I still have that code. In fact I've even expanded it a bit out of curiosity and have found it to perform almost identically to the Crystal version as far as CPU usage goes [8]. More importantly: the Common Lisp implementation I target (SBCL) gives me easy-to-use true First-Class Threads. There's also libraries [9] available to let me use Green Threads (which are very similar to a Fiber in Crystal) alongside normal First-Class Threads if I really, really want some cooperative multitasking (highly unlikely for Benben). There's also the SB-SIMD package that comes with SBCL that gives me a very, very nice interface to do SIMD stuff, potentially making the Common Lisp version faster than the Crystal version currently is.

Basically, I could always just drop Crystal and go back to writing Benben[10] in Common Lisp.

I really don't want to do this, though. Not because of difficulty - it wouldn't be at all difficult, and only mildly time consuming since all the hard parts are already solved. I don't want to do this because I don't want to hurt Crystal. It's still a new language, and I really hope to bring (positive) attention to it with Benben. Not that Benben is the _only_ program I have written in Crystal... I have many others, including the Gemini server that served you this page. But Benben and it's cousin, midi123, are probably the most publicly visible of my programs.

But, if push came to shove, the mother in me will come out and protect my own child first and foremost.

Benchmarks

So, I've talked about how Benben works in regards to Fibers, "-Dpreview_mt", and the "CRYSTAL_WORKERS" environment variable. How about some actual numbers?

These benchmarks were done on my desktop computer, some of the specs of which are:

Benben was compiled using Crystal v1.10.1. The compiler was built on my system from source with LLVM 13.0.0. Each run of Benben was performed on a system with just a terminal and Emacs (for record keeping) running in a Sawfish X11 session to keep other programs from interfering as much as possible. There were at least 10 seconds of pause time between each run to allow the CPU to cool to a "normal" base temperature.

The numbers below are geometric means taken from three runs for each configuration. Three things were measured: the time taken, the average samples rendered per second, and the memory used. The time taken and average samples per second are reported directly by Benben and do NOT include the time it takes to startup and load the XSPF playlist The memory measurements were taken using a program called xtime and DO take the startup and XSPF load into account. The source code (Crystal) for xtime is included above the footnotes in its own section.

The configuration permutations aren't exhaustive, but I think I got the important ones. If someone wants me to add a few more on, I will.

These runs all rendered 61 songs to 44.1Khz 16-bit stereo WAV. The songs were referenced in an XSPF playlist and consisted of songs from the Sharp X68000 computer and the NeoGeo, both of which require somewhat more involved emulation cores (YM2151 and YM2610, respectively).

These benchmarks were done using two checkouts from Benben's repository:

If you wish to examine the source repository, it's available here:

Benben Source Code

The YunoSynth code was at checkout 5f351d5e3c31782ef1e56dcaf0b0c8664f79af82. It's repository is here:

YunoSynth Source Code

Benben was built using this, with "-Dpreview_mt" added/removed as-needed:

shards build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40

Some issues to note:

The source code (Common Lisp) used to calculate the geometric means, along with the xtime code, is in its own section above the footnotes. The data file with the raw values can be accessed here (it can be directly processed with the Common Lisp code):

Raw Result Data

Finally, here are the results when using "-Dpreview_mt":

CRYSTAL_WORKERS=20, --jobs 20, WITHOUT Fiber.yield added
  Time:            00:00:23:88967938
  Samples per Sec: 6,806,318
  Memory Used:     112.62 MiB

CRYSTAL_WORKERS=20, --jobs 4, WITHOUT Fiber.yield added
  Time:            00:00:49:561379837
  Samples per Sec: 3,157,491
  Memory Used:     82.06 MiB

CRYSTAL_WORKERS=20, --jobs 1, WITHOUT Fiber.yield added
  Time:            00:01:25:153285603
  Samples per Sec: 1,829,959
  Memory Used:     53.44 MiB
  
CRYSTAL_WORKERS=20, --jobs 20, Fiber.yield added
  Time:            00:00:21:316036046
  Samples per Sec: 7,445,727
  Memory Used:     152.65 MiB

CRYSTAL_WORKERS=4, --jobs 20, Fiber.yield added
  Time:            00:00:49:71518083
  Samples per Sec: 3,186,044
  Memory Used:     113.33 MiB

CRYSTAL_WORKERS=4, --jobs 4, Fiber.yield added
  Time:            00:01:03:478200751
  Samples per Sec: 2,462,458
  Memory Used:     73.96 MiB

CRYSTAL_WORKERS=4, --jobs 1, Fiber.yield added
  Time:            00:01:28:427111020
  Samples per Sec: 1,766,431
  Memory Used:     58.58 MiB

CRYSTAL_WORKERS=1, --jobs 20, Fiber.yield added
  Time:            00:02:31:215964094
  Samples per Sec: 1,030,649
  Memory Used:     13.94 GiB

CRYSTAL_WORKERS=1, --jobs 1, Fiber.yield added
  Time:            00:02:26:233878863
  Samples per Sec: 1,066,631
  Memory Used:     1.20 GiB

And here are results WITHOUT the "-Dpreview_mt" flag:

--jobs 20, Fiber.yield added
  Time:            00:02:25:331626453
  Samples per Sec: 1,072,926
  Memory Used:     112.62 MiB

--jobs 4, Fiber.yield added
  Time:            00:02:24:504344037
  Samples per Sec: 1,078,986
  Memory Used:     82.06 MiB

--jobs 1, Fiber.yield added
  Time:            00:02:24:695090870
  Samples per Sec: 1,077,629
  Memory Used:     53.44 MiB

Benchmark Discussion

Benben is fastest when compiled with "-Dpreview_mt", and is allowed to use 20 worker threads and 20 jobs. This is when it behaves as if it has true First-Class Threads that are preemptively scheduled for this machine. When compiled with "-Dpreview_mt", Benben is slowest when using one worker thread with 20 jobs. I interpret this as meaning a 1F:1T model is ideal, and that a true 1:1 "no Fibers, all First-Class Threads" would potentially be even more ideal.

Reducing the worker threads causes a larger performance drop than leaving "CRYSTAL_WORKERS" set to 20 and reducing the number of "--jobs". I interpret this as meaning the MF:NT model works best when M is <= N, and MF:NT results in slower performance than a 1F:1T model.

Fiber.yield usually results in slightly faster performance when "-Dpreview_mt" is used, but also slightly higher RAM usage. However, when compiled with "-Dpreview_mt", using one worker thread, and Fiber.yield is added, memory usage is absurdly high (in the gigabytes) compared to using more worker threads.

Fiber.yield must be used when compiling without "-Dpreview_mt" because of the cooperative multitasking.

When compiled without "-Dpreview_mt", Benben runs its absolute slowest.

Memory allocation and deallocation are almost certainly playing _some_ part to these numbers. However, I don't believe it's Benben causing these extra allocations. YunoSynth and Benben are both written specifically to allocate things like audio buffers only once, and minimize the allocation of strings as much as possible. Any extra allocations are more than likely to be in the standard library.

Overall, I feel these benchmarks support these conclusions:

xtime and Geometric Mean Source Code

xtime source code:

# Based on https://github.com/kostya/benchmarks/blob/master/xtime.rb

class EnergyStats
  PATH = "/sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj"

  getter? hasEnergyMetrics : Bool = false
  @accum : UInt64 = 0
  @energy : UInt64 = 0
  @maxEnergy : UInt64 = 0

  def initialize
    @hasEnergyMetrics = File.readable?(PATH)
    @maxEnergy = File.read(PATH).to_u64 if @hasEnergyMetrics
  end

  @[AlwaysInline]
  def update
    return unless @hasEnergyMetrics

    newVal = File.read(PATH).to_u64
    if @energy == 0
      # first reading
      @accum = 0
    elsif newVal > @energy
      @accum += newVal - @energy
    elsif newVal < @energy
      # counter has been reset
      @accum += @maxEnergy - @energy + newVal
    end
    @energy = newVal
  end

  def val
    sprintf("%0.3f", 0.000001_f64 * @accum)
  end
end

@[AlwaysInline]
def mem(pid : Int64) : Int64
  `ps p #{pid} -o rss`.lines[-1].to_i64 * 1024
end

def usage : Nil
  STDERR << %|Usage: xtime <program> [args...]
       xtime [-e] [-s] <program> [args...]

  -q : Don't print command name
  -e : Show stderr output from process
  -s : Show stdout output from process

Stderr and Stdout are both hidden by default.
|
  exit 1
end

usage if ARGV.empty?

# Check options
output = Process::Redirect::Close
outerr = Process::Redirect::Close
quiet = false
outfile = nil
outname = nil
outlang = nil
loop do
  break if ARGV.empty?
  case ARGV[0]
  when "-q"
    quiet = true
    ARGV.shift
  when "-e"
    outerr = Process::Redirect::Inherit
    ARGV.shift
  when "-s"
    output = Process::Redirect::Inherit
    ARGV.shift
  when "-o"
    ARGV.shift
    outname = ARGV.shift
    outlang = ARGV.shift
    outfile = ARGV.shift
  when "-h" then usage
  when "--" then break
  else break
  end
end

# ARGV may now be empty...
if ARGV.empty?
  STDERR << "ERROR: No program given\n"
  exit 1
end

# Used to track memory usage
highestMem : Int64 = 0
newMem : Int64 = 0

# Start energy tracking, if possible
energy = EnergyStats.new

# Start the process and timer
start = Time.monotonic
proc = Process.new(ARGV[0], ARGV[1..], output: output, error: outerr)
pid = proc.pid

# This fiber checks memory usage
spawn do
  highestMem = mem(pid)
  loop do
    sleep 5.milliseconds
    newMem = mem(pid)
    highestMem = newMem if newMem > highestMem
    energy.update
  end
end

# All done, report
proc.wait
finish = Time.monotonic
finalMem = highestMem.humanize_bytes(format: :JEDEC)

if energy.hasEnergyMetrics?
  STDERR << "#{ARGV[0]}: == #{finalMem}, #{(finish - start)} (#{energy.val} J energy) ==\n"
else
  STDERR << "#{ARGV[0]}: == #{finalMem}, #{(finish - start)} (? J energy) ==\n"
end

Code used to calculate geometric means:

(require 'computable-reals)
(require 'cl-sdm) ;; For STRING-REPLACE and SPLIT-STRING

;; Nanoseconds per second
(defconstant +ns+ 1000000000)

;; Each time in TIMES should be a STRING in the format: hh:mm:ss.nnnnnnnnn
(defun calculate-geometric-mean-time (&rest times)
  (labels
      ((parse-time (time)
         ;; Converts a "hh:mm:ss.nnnnnnnnn" string into a (:HOURS hh :MINUTES mm
         ;; :SECONDS ss :FRACTION nnnnnnnnn) plist.
         (loop for part-num in (mapcar #'parse-integer (sdm:split-string (sdm:string-replace time "." ":")
                                                                         #\: :as-list t))
               for part in '(:hours :minutes :seconds :fraction) ;; Rather overkill, but meh
               append (list part part-num) into ret
               finally (return (+ (getf ret :fraction)
                                  (* (getf ret :seconds) +ns+)
                                  (* (getf ret :minutes) 60 +ns+)
                                  (* (getf ret :hours) 60 60 +ns+))))))

    (let ((gm (cr:raw-approx-r (cr:expt-r (reduce #'* (mapcar #'parse-time times)) (/ 1 (length times))))))
      (multiple-value-bind (gm-total-sec gm-ns)
          (truncate gm +ns+)
        (let* ((gm-hour (truncate (truncate gm-total-sec 60) 60))
               (gm-min (truncate (- gm-total-sec (* gm-hour 60 60)) 60))
               (gm-sec (- gm-total-sec (* gm-hour 60 60) (* gm-min 60))))
          ;; Return a new string
          (format nil "~2,'0d:~2,'0d:~2,'0d:~2,'0d" gm-hour gm-min gm-sec gm-ns))))))

(defun calculate-geometric-mean-samples (&rest samples)
  (format nil "~:d" (cr:raw-approx-r (cr:expt-r (reduce #'* samples) (/ 1 (length samples))))))

(defun calculate-geometric-mean-memory (&rest memory)
  (sdm:pretty-print-bytes (cr:raw-approx-r (cr:expt-r (reduce #'* memory) (/ 1 (length memory))))))

(defun process-file (filename)
  (format t "~a"
          (with-output-to-string (out)
            (with-open-file (in filename)
              (loop for line = (read-line in nil nil) while line do
                (unless (or (sdm:empty-string-p line)
                            (sdm:string-starts-with line "#"))
                  (if (sdm:string-starts-with line "name:")
                      (format out "== ~a ==~%" line)
                      (let* ((parts (sdm:split-string line #\, :as-list t)))
                        (format out "  GM Time:            ~a~%"
                                (apply #'calculate-geometric-mean-time (subseq parts 0 3)))
                        (format out "  GM Samples per Sec: ~a~%"
                                (apply #'calculate-geometric-mean-samples
                                       (mapcar #'parse-integer (subseq parts 3 6))))
                        (format out "  GM Memory Used:     ~a~%"
                                (apply #'calculate-geometric-mean-memory
                                       (mapcar #'parse-integer (subseq parts 6))))
                        (write-char #\Newline out)))))))))

Footnotes

---------
Page served by Aya https://nanako.mooo.com/fossil/aya/
Aya is under the GNU Affero GPLv3 license