💾 Archived View for anachronauts.club › ~voidstar › log › 2022-03-24-openapi-for-binfmt.gmi captured on 2022-06-04 at 00:09:43. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

why isn't there a swagger/openapi for binary formats?

what's openapi?

if you're a web dev, you already know what openapi (previously swagger) is. if not, i'll give you a basic rundown.

let's say you are building a REST API. so (in broad strokes):

you have a url endpoint for each object in your database
each endpoint can take different read/write operations (represented via http method)
each request can include parameters, some requests can also include a payload
each endpoint returns payloads; a certain type on success, other types in cases of failure

openapi is a schema format for REST APIs. an openapi schema is a single json file that describes your entire API: all of the endpoints, which methods they accept, which parameters/fields are required, complex validation rules, and documentation explaining the purpose and usage of each endpoint.

there are two main ways to deploy openapi:

you can hand-write an openapi schema and then generate the skeleton of your backend code from that; or,
some backend frameworks (for example, fastapi in python) will generate an openapi schema from your backend code.

the schema format is standardized, so the information in it can be parsed by code -- and an entire ecosystem of tooling has sprung up around openapi schemas:

client libraries for your API can be automatically generated in many different languages, so that REST API requests can be a simple function call with support for language features like custom exceptions and class definitions for json objects and inline documentation.
a documentation webpage for your API can be automatically generated with information about endpoints, input formats and validation, possible errors, output data, etc., and even including interactive try-it-out widgets to make requests in the browser.
test tools can look at the validation rules of your openapi schema and automatically perform boundary case testing to ensure accurate input validation.
and so, so much more.

openapi for binary formats?

at this point you may be hopping up and down in your chair, dying to tell me about a project that does nearly what i'm looking for. please hold that thought for now -- there are a lot of almost-but-not-quite "openapi for binary formats" projects out there, but upon closer inspection it is evident that their differences are fundamental. the changes that would be needed to make those projects suit this goal would undermine the core purpose those projects serve today.

let's consider the nature of binary formats. binary formats are usually sequences and lists of potentially-nested structures and unions. attributes (like type or size) of one field are often derived from fields that came before. in many cases, a binary format is a common structured header followed by a payload whose size and structure depends on details from the header. de-serializing must be done piecemeal when there are fields whose attributes depend on earlier fields.

let's consider the needs of an "openapi for binary formats". we need a schema file that is machine readable and human editable. the schema must define (keeping in mind that some aspects may be derived from the values of fields before them):

the data's physical representation (how many bits/bytes, endianness, alignment, padding, terminator sequence, etc.)
how to map physical representation to logical representation (data type, e.g. string vs number, signed vs unsigned, user-defined structures and enums, conversion logic, etc.)
validation rules for data fields in their logical representation (e.g. range validation, optional fields, etc.)
spots to add documentation for everything! fields, structs, enums, etc.

from that schema, we want to be able to generate (or be able to write a generator for):

a library/package/module/whatever in any language that can deserialize a stream of bytes into native language constructs (e.g. classes, for oop languages), and serialize such constructs (after modifying them) back into a stream of bytes. this includes field validation and usage of language features like custom exceptions and inline documentation as appropriate.
documentation, including descriptions of each field, type, enum, or value alongside generated information about physical and logical representations, size, alignment, padding, sequencing, ordering, repeating, etc.; possibly including interactive validator tools.
automated test tools that specificly probe edge cases that are determined through static analysis of the schema.
diagnostics tools, like plugins for wireshark.

ASN.1, protobuf: serialization container formats

ASN.1 is a declarative syntax for describing data structures, with an ecosystem of tools that can generate serializers and deserializers from ASN.1 files. it dates back to the eighties and is still used today.

protobuf (and its cousin, cap'n'proto) is an ipc serialization format that allows defining each field and type in a schema file, and then generating clients and servers in a myriad of languages.

both of these tools are only interested in allowing the author to model the logical representation of the data, not the physical. they are useful if you have data and want to automatically convert that into a file format or protocol, but not useful if you want to describe an existing file format or protocol.

(ASN.1 does allow for different profiles which can tweak how much metadata is embedded in the physical representation, but no mechanism for full control.)

these tools, and others in the same category, provide guarantees about their serialized format (such as backwards-compatibility), guarantees which can only be met by taking full control of the physical representation -- so serialization container formats are fundamentally unsuited as an "openapi for binary formats".

wuffs, spicy: parsing in transpile-first languages

wuffs and spicy both provide an environment to write binary serializiers/de-serializers in a turing-complete programming language that was specifically designed to be transpiled into a myriad of other languages. the transpiled code can then be packaged as a library or imported directly into a codebase. these transpile-first languages provide strong memory safety guarantees, making this an attractive option for systems code.

with wuffs or spicy, you are not writing a schema of your format, you are writing a parser. you can transpile this to other languages, but there isn't much you can do to generate the ancillary tooling like documentation and test tools. a schema would have a specific spot and syntax for validation rules, so any tool that wants to generate test cases would know exactly where to look; however, a hand-written parser has validation rules written however and wherever the author felt is most appropriate. unfortunately, this expressive power is also the reason that transpile-first languages are unsuitable as an "openapi for binary formats".

binfmt, construct: language-specific parser frameworks

binary parsing frameworks are available in many programming languages; such frameworks allow the developer to define objects and add extra metadata (often through language annotations) to describes physical representation, validation rules, conversion functions, etc. -- then the framework can then use that metadata to serialize and de-serialize data at runtime.

these tools are great for declaratively writing bulletproof low-code serializers/de-serializers with rich features, but if you need similar code in a new language, you cannot re-use your existing work. language-specific parser frameworks are not, by themselves, an "openapi for binary formats"; however, if such a schema format existed, frameworks could play a role in generating the schema contents (similar to how some REST backend frameworks generate openapi schemas).

kaitai struct: reverse engineering toolsets

kaitai struct is a schema format for binary formats. a kaitai schema is a yaml file that describes a format as a sequence of fields, describing both physical and logical representations, with attributes that can reference other fields, with user-defined types and enums, and with a compiler that can generate de-serializers in several languages as well as documentation materials.

kaitai struct was designed for reverse engineers working with undocumented binary data. it is bundled in their binary viewer application, and reverse engineers write their schemas bit by bit, slowly annotating the data as they learn new things about the format.

kaitai does not support data serialization, and have no definite plans to add serialization. their current language constructs include some features that would make serialization ambiguous in some cases -- adding serialization would mean dropping some features from kaitai.

parsing a kaitai schema is significantly more involved than parsing an openapi schema. one contributing factor is that kaitai features an embedded expression language, allowing almost any attribute to be bound to expressions involving other fields, rather than simple constants. this makes is hard for an ecosystem to thrive because of a high technical barrier for entry.

as a potential "openapi for binary formats", kaitai is closer to the goldilocks zone than any other project. while its purpose may not be fully compatible with this goal, it certainly represents prior art in the space and a valuable design reference.

conclusion

kaitai struct provides a promising DSL for binary data schemas, but its usefulness is limited by its complexity. in order to create a new tool that works with kaitai files, the most practical option is to extend the existing code generator with new functionality, as opposed to making a standalone tool (as can be done with openapi). this creates a centralized bottleneck for ecosystem growth.

this is a challenging problem to solve. that kaitai code generator does a lot of pre-compile passes on the input to evaluate expressions, compute field sizes, padding, and alignment, and resolve types. this is work that would be required by most tools that would want to work with kaitai files, but the code that does this work is written in scala and can only be invoked from JVM-based languages .(edit: actually, it looks like there's a way to run it in javascript as well?)

if the core of the kaitai compiler were instead written as a library in a native compiled language, and then that library were made available in many different programming languages (e.g., a library written in C/zig/rust and thin FFI wrappers to package the library for other languages, like protobuf), perhaps that would bootstrap a diverse tooling ecosystem.

the question of serialization remains; however, the technical challenges seem largely connected to the use of a feature in kaitai called "instances" rather than "sequences". this feature allows defining a field at a fixed offset, rather than defining it in context to the field before or after. this feature is obviously very useful to reverse engineers, who may understand only a certain chunk of a binary blob and need to annotate its fields before moving on, but it wouldn't be a desireable feature in a schema that serves as the source definition for a binary format.

there is a lot of overlap in the needs of kaitai struct and the needs of a hypothetical "openapi for binary formats", but the small amount that does not overlap is providing technical obstacles to the main feature that would be needed (serialization). furthermore, the chosen technology (scala) limits the ecosystem. can the differences in purpose be bridged, or will it require the start of a new project with subtly different goals?