FOSDEM event "ML inference acceleration for lightweight VMMs"

Anastassios Nanos and Babis Chalios

Type devroom

Starts on day 1 (2021-02-06) at 12:15 (Brussels time, UTC+1) in room Virtualization (duration 00:45)

Matrix room #virtualization:fosdem.org

The debate on how to deploy applications, as monoliths or as microservices, is in full swing. Part of this discussion relates to how the new paradigm incorporates support for accessing accelerators, e.g. GPUs and FPGAs. That kind of support has been available to traditional programming models for the last couple of decades, and its tooling has evolved to be stable and standardized (e.g. CUDA, OpenCL/OpenACC, TensorFlow).

On the other hand, what does it mean for a highly distributed application instance (i.e. a Serverless deployment) to access an accelerator? Should the function invoked to classify an image, for instance, link against the whole acceleration runtime and program the hardware device itself? It seems quite counter-intuitive to create such bloated functions.

Things get more complicated when we consider the low-level layers of the service architecture. To ensure user and data isolation, infrastructure providers employ virtualization techniques. However, generic hardware accelerators are not designed to be shared by multiple untrusted tenants. Current solutions (device passthrough, API-remoting) impose inflexible setups, present security trade-offs and add significant performance overheads.

To this end, we introduce vAccel, a lightweight framework to expose hardware acceleration functionality to VM tenants. Our framework is based on a thin runtime system, vAccelRT, which is, essentially, an acceleration API: it offers a set of operators, such as machine learning and linear algebra operators, that rely on generic hardware acceleration frameworks to increase performance. vAccelRT abstracts away any hardware/vendor-specific code by employing a modular design: backends implement bindings for popular acceleration frameworks, and the frontend exposes a function prototype for each available acceleration function. On top of that, using an optimized paravirtual interface, vAccelRT is exposed to a VM’s user-space, where applications can benefit from hardware acceleration via a simple function call.
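
The frontend/backend split described above can be illustrated with a minimal Python sketch. Every name in it (AccelRuntime, cpu_sgemm, forward_to) is hypothetical and chosen for illustration only; none of this is the vAccelRT API. The sketch assumes the simplest dispatch policy (the most recently registered backend wins) and uses a direct function call as a stand-in for the paravirtual interface between guest and host.

```python
from typing import Callable, Dict, List

# All names below are hypothetical illustrations, not the vAccelRT API.
Operator = Callable[..., object]


class AccelRuntime:
    """Frontend: one stable function per operator, dispatched to a backend."""

    def __init__(self) -> None:
        # Operator name -> implementations, in registration order.
        self._impls: Dict[str, List[Operator]] = {}

    def register(self, op_name: str, impl: Operator) -> None:
        """Let a backend offer an implementation of the named operator."""
        self._impls.setdefault(op_name, []).append(impl)

    def call(self, op_name: str, *args: object) -> object:
        """Dispatch to a backend; the most recently registered one wins."""
        impls = self._impls.get(op_name)
        if not impls:
            raise NotImplementedError(f"no backend implements {op_name!r}")
        return impls[-1](*args)

    # Frontend prototypes: the only functions an application needs to know.
    def sgemm(self, a, b):
        return self.call("sgemm", a, b)

    def image_classify(self, image: bytes) -> str:
        return self.call("image_classify", image)


def cpu_sgemm(a, b):
    """Reference backend: plain-Python matrix multiply of nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def forward_to(host: AccelRuntime, op_name: str) -> Operator:
    """Guest-side backend that hands the call to a host runtime.

    The direct Python call stands in for the paravirtual transport.
    """
    def impl(*args: object) -> object:
        return host.call(op_name, *args)
    return impl


if __name__ == "__main__":
    host = AccelRuntime()
    host.register("sgemm", cpu_sgemm)

    guest = AccelRuntime()
    guest.register("sgemm", forward_to(host, "sgemm"))

    print(guest.sgemm([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
    # prints [[19, 22], [43, 50]]
```

In this sketch the guest application only ever calls guest.sgemm(...); which backend runs, and on which side of the VM boundary, is decided by what was registered. An operator with no registered backend, such as image_classify here, raises NotImplementedError rather than failing silently.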

In this talk we present the design and implementation of vAccel on two KVM VMMs: QEMU and AWS Firecracker. We go through a brief design description and focus on the key aspects of enabling hardware acceleration for machine learning inference for lightweight VMs on both x86_64 and aarch64 architectures. Our current implementation supports jetson-inference and TensorRT, as well as Google Coral TPU, while facilitating integration with NVIDIA GPUs (CUDA) and Intel Iris GPUs (OpenCL).

Finally, we present a demo of vAccel in action, using a containerized environment to simplify configuration and deployment.

FOSDEM schedule page