if you're a web dev, you already know what openapi (previously swagger) is. if not, i'll give you a basic rundown.
let's say you are building a REST API. so (in broad strokes):
openapi is a schema format for REST APIs. an openapi schema is a single json (or yaml) file that describes your entire API: all of the endpoints, which methods they accept, which parameters/fields are required, complex validation rules, and documentation explaining the purpose and usage of each endpoint.
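to make that concrete, here's a rough sketch of the kind of information an openapi schema carries, written out here as a python dict (the endpoint, fields, and descriptions are invented for illustration):

```python
# a minimal sketch of the kind of information an openapi schema holds.
# the endpoint, fields, and descriptions here are invented for illustration.
import json

schema = {
    "openapi": "3.0.0",
    "info": {"title": "pet store", "version": "1.0.0"},
    "paths": {
        "/pets/{petId}": {
            "get": {
                "summary": "fetch a single pet by its id",
                "parameters": [{
                    "name": "petId",
                    "in": "path",
                    "required": True,
                    "schema": {"type": "integer", "minimum": 1},
                }],
                "responses": {"200": {"description": "the requested pet"}},
            }
        }
    },
}

print(json.dumps(schema, indent=2))
```

dump that dict as json and you have a document shaped like an openapi schema, which is exactly the kind of artifact generic tooling can consume.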
there are two main ways to deploy openapi:
the schema format is standardized, so the information in it can be parsed by code -- and an entire ecosystem of tooling has sprung up around openapi schemas:
at this point you may be hopping up and down in your chair, dying to tell me about a project that does nearly what i'm looking for. please hold that thought for now -- there are a lot of almost-but-not-quite "openapi for binary formats" projects out there, but upon closer inspection it is evident that their differences are fundamental. the changes that would be needed to make those projects suit this goal would undermine the core purpose those projects serve today.
let's consider the nature of binary formats. binary formats are usually sequences and lists of potentially-nested structures and unions. attributes (like type or size) of one field are often derived from fields that came before. in many cases, a binary format is a common structured header followed by a payload whose size and structure depends on details from the header. de-serializing must be done piecemeal when there are fields whose attributes depend on earlier fields.
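as a sketch (not any real format), here's what that piecemeal de-serialization looks like in python -- the payload's size and meaning can't be known until the header has been decoded:

```python
# a minimal sketch (not any particular real format): a fixed header carries a
# type tag and a payload length, and the payload can't be read until the
# header has been decoded.
import struct

def parse_record(data: bytes):
    # header: 2-byte magic, 1-byte type tag, 4-byte payload length (big-endian)
    magic, tag, length = struct.unpack_from(">2sBI", data, 0)
    if magic != b"VS":
        raise ValueError("bad magic")
    offset = struct.calcsize(">2sBI")
    payload = data[offset:offset + length]   # size known only after the header
    if tag == 1:
        body = payload.decode("utf-8")        # type known only after the header
    else:
        body = payload
    return {"tag": tag, "body": body}

print(parse_record(b"VS\x01\x00\x00\x00\x05hello"))
```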
let's consider the needs of an "openapi for binary formats". we need a schema file that is machine readable and human editable. the schema must define (keeping in mind that some aspects may be derived from the values of fields before them):
from that schema, we want to be able to generate (or be able to write a generator for):
ASN.1 is a declarative syntax for describing data structures, with an ecosystem of tools that can generate serializers and deserializers from ASN.1 files. it dates back to the eighties and is still used today.
protobuf (and its cousin, cap'n'proto) is an ipc serialization format that allows defining each field and type in a schema file, and then generating clients and servers in a myriad of languages.
both of these tools are only interested in allowing the author to model the logical representation of the data, not the physical. they are useful if you have data and want to automatically convert that into a file format or protocol, but not useful if you want to describe an existing file format or protocol.
(ASN.1 does allow for different encoding rules, which tweak how much metadata is embedded in the physical representation, but it offers no mechanism for full control.)
these tools, and others in the same category, provide guarantees about their serialized format (such as backwards-compatibility), guarantees which can only be met by taking full control of the physical representation -- so serialization container formats are fundamentally unsuited as an "openapi for binary formats".
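to make the logical-vs-physical point concrete, here's a small python sketch contrasting how protobuf encodes a uint32 field (a tag byte plus a varint, chosen by protobuf) with the fixed 4-byte little-endian word an existing binary format might demand:

```python
# protobuf owns its wire encoding. a uint32 field set to 300 is serialized as
# a tag byte plus a varint, not as the fixed 4-byte little-endian word an
# existing binary format might require. (hand-rolled varint encoder shown
# here for clarity.)
import struct

def protobuf_varint(value: int) -> bytes:
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit set
        else:
            out.append(byte)
            return bytes(out)

field_number, wire_type = 1, 0             # wire type 0 = varint
tag = protobuf_varint((field_number << 3) | wire_type)
print((tag + protobuf_varint(300)).hex())  # 08ac02   <- protobuf's choice
print(struct.pack("<I", 300).hex())        # 2c010000 <- a fixed-layout format's choice
```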
wuffs and spicy both provide an environment for writing binary serializers/de-serializers in a turing-complete programming language specifically designed to be transpiled into a myriad of other languages. the transpiled code can then be packaged as a library or imported directly into a codebase. these transpile-first languages provide strong memory safety guarantees, making them an attractive option for systems code.
with wuffs or spicy, you are not writing a schema of your format, you are writing a parser. you can transpile that parser to other languages, but there isn't much you can do to generate ancillary tooling like documentation and test tools. a schema would have a specific spot and syntax for validation rules, so any tool that wants to generate test cases would know exactly where to look; a hand-written parser, by contrast, has validation rules written however and wherever the author felt was most appropriate. unfortunately, this expressive power is also the reason that transpile-first languages are unsuitable as an "openapi for binary formats".
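here's a python stand-in for that situation (a toy format invented for illustration) -- note how the validation rules are scattered wherever it was convenient to put them, so no generic tool could find and reuse them:

```python
# in a hand-written parser the validation rules live wherever the author
# happened to put them, so a tool that wants to generate test cases has no
# single place to look for them. (toy format invented for illustration.)
import struct

def parse_message(data: bytes):
    version, count = struct.unpack_from(">BH", data, 0)
    if version not in (1, 2):                 # validation rule #1, buried here
        raise ValueError("unsupported version")
    offset, entries = 3, []
    for _ in range(count):
        (length,) = struct.unpack_from(">B", data, offset)
        offset += 1
        entry = data[offset:offset + length]
        offset += length
        assert len(entry) == length, "truncated entry"  # rule #2, buried elsewhere
        entries.append(entry)
    return version, entries

print(parse_message(b"\x01\x00\x02\x03abc\x02hi"))
```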
binary parsing frameworks are available in many programming languages; such frameworks let the developer define objects and attach extra metadata (often through language annotations) describing the physical representation, validation rules, conversion functions, etc. -- the framework can then use that metadata to serialize and de-serialize data at runtime.
these tools are great for declaratively writing bulletproof low-code serializers/de-serializers with rich features, but if you need similar code in a new language, you cannot re-use your existing work. language-specific parser frameworks are not, by themselves, an "openapi for binary formats"; however, if such a schema format existed, frameworks could play a role in generating the schema contents (similar to how some REST backend frameworks generate openapi schemas).
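as one example of this category, here's a sketch using the python `construct` library, where a single declaration drives both parsing and building (the toy format is invented):

```python
# one example of this category in python is the `construct` library: the
# format is declared once as data, and the same declaration drives both
# parsing and building. (the toy format here is invented for illustration.)
from construct import Struct, Int16ub, Bytes, this

record = Struct(
    "length" / Int16ub,             # physical: 2-byte big-endian integer
    "payload" / Bytes(this.length)  # attribute derived from an earlier field
)

parsed = record.parse(b"\x00\x05hello")
print(parsed.length, parsed.payload)                    # 5 b'hello'
print(record.build(dict(length=5, payload=b"hello")))   # b'\x00\x05hello'
```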
kaitai struct is a schema format for binary formats. a kaitai schema is a yaml file that describes a format as a sequence of fields, covering both physical and logical representations, with attributes that can reference other fields, with user-defined types and enums, and with a compiler that can generate de-serializers in several languages as well as documentation materials.
kaitai struct was designed for reverse engineers working with undocumented binary data. it is bundled in their binary viewer application, and reverse engineers write their schemas bit by bit, slowly annotating the data as they learn new things about the format.
kaitai does not support data serialization, and the maintainers have no definite plans to add it. the current language constructs include some features that would make serialization ambiguous in some cases -- adding serialization would mean dropping some features from kaitai.
parsing a kaitai schema is significantly more involved than parsing an openapi schema. one contributing factor is that kaitai features an embedded expression language, allowing almost any attribute to be bound to an expression involving other fields rather than a simple constant. this makes it hard for an ecosystem to thrive, because the technical barrier to entry is high.
as a potential "openapi for binary formats", kaitai is closer to the goldilocks zone than any other project. while its purpose may not be fully compatible with this goal, it certainly represents prior art in the space and a valuable design reference.
kaitai struct provides a promising DSL for binary data schemas, but its usefulness is limited by its complexity. in order to create a new tool that works with kaitai files, the most practical option is to extend the existing code generator with new functionality, as opposed to making a standalone tool (as can be done with openapi). this creates a centralized bottleneck for ecosystem growth.
this is a challenging problem to solve. the kaitai code generator does a lot of pre-compile passes on the input to evaluate expressions, compute field sizes, padding, and alignment, and resolve types. this is work that would be required by most tools wanting to work with kaitai files, but the code that does this work is written in scala and can only be invoked from JVM-based languages. (edit: actually, it looks like there's a way to run it in javascript as well?)
if the core of the kaitai compiler were instead written as a library in a native compiled language, and then that library were made available in many different programming languages (e.g., a library written in C/zig/rust and thin FFI wrappers to package the library for other languages, like protobuf), perhaps that would bootstrap a diverse tooling ecosystem.
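purely as a hypothetical sketch of what a thin FFI wrapper over such a native core might look like (the library name `libksc` and the function `ksc_resolve_schema` are invented -- nothing like this exists today):

```python
# purely hypothetical sketch of the "native core + thin FFI wrapper" idea.
# there is no such library today; the name `libksc` and the function
# `ksc_resolve_schema` are invented for illustration.
import ctypes
import json

lib = ctypes.CDLL("libksc.so")                       # hypothetical native core
lib.ksc_resolve_schema.argtypes = [ctypes.c_char_p]  # takes a schema as text
lib.ksc_resolve_schema.restype = ctypes.c_char_p     # returns resolved json

def resolve(schema_text: str) -> dict:
    """ask the native core to evaluate expressions, sizes, and types,
    returning a 'flattened' schema that any tool could consume."""
    raw = lib.ksc_resolve_schema(schema_text.encode("utf-8"))
    return json.loads(raw.decode("utf-8"))
```

with a core like that, writing a documentation generator or test-case generator in any language would be a weekend project rather than a fork of the compiler.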
the question of serialization remains; however, the technical challenges seem largely connected to the use of a kaitai feature called "instances" rather than "sequences". this feature allows defining a field at a fixed offset, rather than defining it relative to the field before or after it. it is obviously very useful to reverse engineers, who may understand only a certain chunk of a binary blob and need to annotate its fields before moving on, but it wouldn't be a desirable feature in a schema that serves as the source definition for a binary format.
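here's a small python sketch of why fixed-offset instances read fine but serialize ambiguously (the offsets and fields are invented):

```python
# a sketch of why fixed-offset "instances" read fine but write ambiguously.
# (the offsets and fields here are invented for illustration.)
import struct

blob = bytes(range(16))

# reading: each instance just seeks to its offset, ignoring everything else.
width  = struct.unpack_from("<H", blob, 4)[0]   # instance at offset 4
height = struct.unpack_from("<H", blob, 10)[0]  # instance at offset 10
print(width, height)

# writing: what belongs in bytes 0-3, 6-9, and 12-15? the schema never said,
# so a serializer cannot reconstruct the blob from just `width` and `height`.
```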
there is a lot of overlap between the needs of kaitai struct and the needs of a hypothetical "openapi for binary formats", but the small part that does not overlap creates technical obstacles to the main missing feature (serialization). furthermore, the chosen technology (scala) limits the ecosystem. can the differences in purpose be bridged, or will it take a new project with subtly different goals?
kaitai struct issue tracker: serialization
invertible syntax descriptions: unifying parsing and pretty printing
-----
(my alias is woodrowbarlow)