💾 Archived View for thebird.nl › blog › 2023 › zig-pointers-and-c.gmi captured on 2023-05-24 at 17:36:39. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
I write bindings against C libraries by, essentially, migrating C code to Zig.
Zig can bind against C libraries, but its pointers are different from those in C.
A C pointer is simply an address in memory (RAM). You can do arithmetic on C pointers, e.g.,
char *p = malloc(128); p = p+1;
will make p point to the next item in the 'array' of char.
Code like this is a notorious source of errors, so Zig does not support it on ordinary pointers.
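As a minimal C sketch of what that arithmetic does: the pointer advances by the size of the element, not by one byte, and nothing in the language checks that it stays inside the buffer. The helper name sum_ints is made up for this illustration.

```c
#include <stddef.h>

/* Sum n ints by walking a pointer (hypothetical helper, illustration only). */
int sum_ints(const int *p, size_t n) {
    const int *end = p + n;   /* one past the last element: legal to form, not to dereference */
    int total = 0;
    while (p < end) {
        total += *p;
        p = p + 1;            /* advances by sizeof(int) bytes, not by 1 byte */
    }
    return total;
}
```

Passing a wrong n compiles without complaint and reads past the array, which is exactly the class of bug Zig tries to rule out.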
So, what are Zig pointers?
First there is the concept of a single pointer into RAM:
// Get the address of a variable:
const x: i32 = 1234;
const x_ptr = &x;

// Dereference a pointer:
try expect(x_ptr.* == 1234);

// When you get the address of a const variable, you get a const single-item pointer.
try expect(@TypeOf(x_ptr) == *const i32);

// If you want to mutate the value, you'd need an address of a mutable variable:
var y: i32 = 5678;
const y_ptr = &y;
try expect(@TypeOf(y_ptr) == *i32);
y_ptr.* += 1;
try expect(y_ptr.* == 5679);

// Taking an address of an individual element gives a
// single-item pointer. This kind of pointer
// does not support pointer arithmetic.
var array = [_]u8{ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
const ptr = &array[2];
try expect(@TypeOf(ptr) == *u8);

try expect(array[2] == 3);
ptr.* += 1;
try expect(array[2] == 4);

All of this comes straight from the Zig documentation. So, single-item pointers can't *move*.
https://ziglang.org/documentation/master/#Pointers
The next type is the many-item pointer, which is the building block of a slice (a slice is a many-item pointer plus a length):

[*]T - many-item pointer to unknown number of items.
  Supports index syntax: ptr[i]
  Supports slice syntax: ptr[start..end]
  Supports pointer arithmetic: ptr + x, ptr - x
  T must have a known size, which means that it cannot be anyopaque or any other opaque type.

You can do arithmetic on a many-item pointer, but it carries no length and is not bounds checked. A slice adds the length and the bounds checking; arithmetic on its ptr field is possible, but leaves len untouched:

test "pointer arithmetic with many-item pointer" {
    const array = [_]i32{ 1, 2, 3, 4 };
    var ptr: [*]const i32 = &array;

    try expect(ptr[0] == 1);
    ptr += 1;
    try expect(ptr[0] == 2);
}

test "pointer arithmetic with slices" {
    var array = [_]i32{ 1, 2, 3, 4 };
    var length: usize = 0;
    var slice = array[length..array.len];

    try expect(slice[0] == 1);
    try expect(slice.len == 4);

    slice.ptr += 1;
    // now the slice is in a bad state since len has not been updated

    try expect(slice[0] == 2);
    try expect(slice.len == 4);
}

The point being that Zig only allows arithmetic on pointer types that are explicitly declared to cover many items.
Finally, to support C, Zig has the concept of a zero-terminated buffer.
The syntax [*:x]T describes a pointer that has a length determined by a sentinel value. This provides protection against buffer overflow and overreads.
// This is also available as `std.c.printf`.
pub extern "c" fn printf(format: [*:0]const u8, ...) c_int;

pub fn main() anyerror!void {
    _ = printf("Hello, world!\n"); // OK

    const msg = "Hello, world!\n";
    const non_null_terminated_msg: [msg.len]u8 = msg.*;
    _ = printf(&non_null_terminated_msg);
}

The second call will throw a compile error, because the copy into a plain [msg.len]u8 array drops the 0 terminator. String literals themselves are zero-terminated in Zig, but arrays and slices of u8 in general are not!
So there is some safety added there too. See
https://ziglang.org/documentation/master/#Sentinel-Terminated-Pointers
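For comparison, here is a minimal C sketch of the work the sentinel type saves you. In C nothing records whether a buffer ends in a 0, so a buffer with only a known length has to be copied into a terminated one before it can go to printf and friends. The name dup_with_terminator is hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Copy len bytes and append the 0 terminator that C string functions expect.
   The caller frees the result. Returns NULL when allocation fails. */
char *dup_with_terminator(const char *buf, size_t len) {
    char *out = malloc(len + 1);
    if (out == NULL) return NULL;
    memcpy(out, buf, len);
    out[len] = '\0';
    return out;
}
```

In Zig the distinction lives in the type ([*:0]const u8 versus []const u8), so forgetting this step is a compile error rather than an overread at run time.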
You can play with these concepts of a single-item pointer, a many-item pointer (or slice) and a zero-terminated pointer.
All good, but how do we interact with C libraries?
To interact with C libraries we need to convert C pointers to Zig pointers/slices and back.
First we need to look into casting. Zig has the @as() operation which can only cast when no information is lost in the process. So you can cast from u8 to u16, but not the other way. Same for u16 to i32.
To go from a larger size to a smaller size you may use @truncate() instead.
To cast between different types, also check out @bitCast() and @intCast().
Reading the documentation, you will see how many ways there are to introduce errors into C programs that Zig helps avoid.
For pointers Zig can also cast. Here we take a zero-sentinel Zig slice and cast it to a C buffer of u8:
fn to_cstr(str: [:0] const u8) [*c]const u8 {
    return @ptrCast([*c]const u8,str);
}

fn to_cbuf(str: [] const u8) [*c]const u8 {
    return @ptrCast([*c]const u8,str);
}
The c in '[*c]' tells the Zig compiler that the pointer is used as a native C pointer.
So, when a zero-terminated buffer has to move to a C binding, we can simply create a [:0] const u8 buffer in Zig.
Going the other way, when we have a C string we can convert it to a Zig slice with

fn to_slice0(c_str: [*c] const u8) [:0] const u8 {
    return std.mem.span(@ptrCast([*:0]const u8, c_str));
}

fn to_slice(c_str: [*c] const u8) [] const u8 {
    return std.mem.span(c_str);
}
Note that declaring '[:0] const u8' is useful if we want to pass a text buffer back to C again using to_cstr.
If you pass the length of the buffer back as well, use the to_cbuf translation instead.
The latter approach can also be used for vectors of data, such as ints or doubles.
Also, std.mem.span does not make a copy of the string: no standard library function will make a copy of something unless you pass it an allocator.
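To make the "no copy" point concrete, here is a rough C equivalent of what a slice over a C string amounts to: the same pointer plus a length found by scanning for the terminator. The struct byte_view and the function view_of are made-up names for this sketch, not part of any library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical C counterpart of a Zig []const u8: a pointer plus a length. */
struct byte_view {
    const char *ptr;
    size_t len;
};

/* Wrap a 0-terminated C string without copying it. */
struct byte_view view_of(const char *c_str) {
    struct byte_view v = { c_str, strlen(c_str) };
    return v;
}
```

The view is only valid for as long as the original string is, which is the same ownership rule that applies to the Zig slices above.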
Very often C libraries use '**' to pass data around. This is pointer-to-pointer and considered somewhat obsolete in C++. The idea is that you pass a list of pointers to data around.
A 'char *' is a pointer to a list of char. Likewise 'char **' is a pointer to a list of pointers to char.
In Zig you'll get '[] u8' for a slice of char of known size and '[*] u8' for a list of unknown size, which reflects 'char *'.
To get a list of lists you would write '[][] u8' for known sizes (slices) and '[*][*] u8' for unknown sizes.
To create the C declaration in Zig you may end up with '[*c][*c] u8'.
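Note that the order of a type declaration in C is the reverse of Zig: C puts the element type first ('char **'), Zig puts the pointer part first ('[*][*] u8').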
One other famous source of errors in C is the 'void' pointer.
Void does not mean 'empty' here, it means unknown.
A pointer into an unknown data structure.
In C this can be cast any old way.
To deal with 'void' Zig has anyopaque. In principle it is preferred to use typing, e.g.
void gg(void **a) { <<-- correct? printf("%s\n", a); }
extern fn gg(a: [*:0]const u8) void;

pub fn main() void {
    gg("hello");
}
(note that void really means empty in zig).
C's void** is a pointer to a pointer to anyopaque, which in Zig is **anyopaque or [*]*anyopaque. (A many-item pointer to anyopaque itself is not allowed, because anyopaque has no known size.)

// assuming it just wants a pointer to a byte array
const slice: []const u8 = "123456";
c_function(@ptrCast(**anyopaque, slice.ptr));

// an array of anyopaque pointers
const slice = "123456";
const items: [*]*anyopaque = &[_]*anyopaque{@ptrCast(*anyopaque, slice.ptr)};

// *anyopaque to []u8
@ptrCast([*]u8, @alignCast(@alignOf(u8), c_pointer))[0..c_length]
I found a trick to copy a vector to Zig. Here we return a list of char * (that is, a char **) into a preallocated buffer in Zig. This is an efficient way to pass "vector<string>" to Zig: the string data is still 'owned' by the C++ code.
/// Get the C++ alts
pub fn alt(self: *const Self) ArrayList([] const u8) {
    var list = ArrayList([] const u8).init(allocator);
    const altsize = var_alt_num(self.v);
    var buffer = allocator.alloc(*anyopaque, altsize) catch unreachable;
    defer allocator.free(buffer);
    const res = var_alt(self.v,@ptrCast([*c]* anyopaque,buffer));
    var i: usize = 0;
    while (i < altsize) : (i += 1) {
        const s = res[i];
        const s2 = to_slice(s);
        list.append(s2) catch unreachable;
    }
    return list;
}

Note that 'to_slice' was defined earlier and that 'var_alt_num' is a separate C++ call to get the vector size. On the C++ side we do:

const char **var_alt(void *variant,const char **ret) {
    auto v = static_cast<Variant*>(variant);
    int idx = 0;
    for (auto &a: v->alt) {
        ret[idx] = a.c_str();
        idx++;
    }
    return ret;
}

where Variant is a class and v->alt is vector<string>. For the full code see
https://github.com/vcflib/vcflib/blob/master/src/zig/vcf.zig
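The same pattern can be sketched in plain C, stripped of the C++ and Zig specifics: the caller first asks for the count, allocates a buffer of that many pointers, and the callee only writes pointers into it. The struct variant and both function names below are hypothetical stand-ins, not the vcflib API.

```c
#include <stddef.h>

/* Hypothetical stand-in for the C++ Variant: it owns its strings. */
struct variant {
    const char *alt[3];
    size_t alt_num;
};

/* Separate call to learn how large the caller's buffer must be. */
size_t variant_alt_num(const struct variant *v) {
    return v->alt_num;
}

/* Write one pointer per string into the caller's buffer; no string data is copied. */
const char **variant_alt(const struct variant *v, const char **ret) {
    for (size_t i = 0; i < v->alt_num; i++) {
        ret[i] = v->alt[i];
    }
    return ret;
}
```

The design choice is that only the pointer table belongs to the caller. The strings stay with the callee, so the pointers must not outlive the object they came from.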
Based on the previous section I thought it should be easy to return a C buffer using [*c][*] structure to Zig.
But I kept getting segmentation faults. This is the real reason I started this writeup - to make a deeper dive into Zig's handling of C pointers.
Sometimes an [*c]anyopaque or [*c][*c]anyopaque is just that.
You want to pass it around and know nothing about its contents.
A typical example is a file or DB handle. Assign it once and pass it around to other functions.
Normally this is no problem, until you have a C library that returns a value as a parameter, as in
int mdb_env_create(MDB_env **env)
Create an LMDB environment handle. This function allocates memory for a MDB_env structure.

declared in Zig by the automated C translator (yes, Zig can convert included .h files because it has clang built in) as
pub extern fn mdb_env_create(env: [*c]?*MDB_env) c_int;
Note the ? means that it is allowed to have a NULL value according to the zig automated C translator.
A naive
var e: [*c]?*lmdb.MDB_env = null;
var i = lmdb.mdb_env_create(e);

compiles, but segfaults. Now, from the description of mdb_env_create it appears we need to pass in a working pointer that will be made to point to a new data structure. Can this be a NULL? And if it is a new C pointer, can Zig deal with that, even if it is supposedly opaque? Hmmm.
If we look at another implementation that calls mdb_env_create, this time in C++, we find
MDB_env* handle{nullptr};
lmdb::env_create(&handle);

nullptr is the slightly safer C++ equivalent of NULL. Another C example from the lmdb project itself:

int main(int argc,char * argv[])
{
    int rc;
    MDB_env *env;
    MDB_dbi dbi;
    MDB_val key, data;
    MDB_txn *txn;
    MDB_cursor *cursor;
    char sval[32];

    rc = mdb_env_create(&env);
    rc = mdb_env_open(env, "./testdb", 0, 0664);
    rc = mdb_txn_begin(env, NULL, 0, &txn);
    rc = mdb_open(txn, NULL, 0, &dbi);

    key.mv_size = sizeof(int);
    key.mv_data = sval;
    data.mv_size = sizeof(sval);
    data.mv_data = sval;

    sprintf(sval, "%03x %d foo bar", 32, 3141592);
    rc = mdb_put(txn, dbi, &key, &data, 0);
    rc = mdb_txn_commit(txn);
    if (rc) {
        fprintf(stderr, "mdb_txn_commit: (%d) %s\n", rc, mdb_strerror(rc));
        goto leave;
    }
    rc = mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
    rc = mdb_cursor_open(txn, dbi, &cursor);
    while ((rc = mdb_cursor_get(cursor, &key, &data, MDB_NEXT)) == 0) {
        printf("key: %p %.*s, data: %p %.*s\n",
            key.mv_data,  (int) key.mv_size,  (char *) key.mv_data,
            data.mv_data, (int) data.mv_size, (char *) data.mv_data);
    }
    mdb_cursor_close(cursor);
    mdb_txn_abort(txn);
leave:
    mdb_close(env, dbi);
    mdb_env_close(env);
    return 0;
}

This shows that the pointer itself has to exist, here on the stack, and that its address is what gets passed in. So passing in a null is not a great idea (see the assembly below).
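The out-parameter convention at work here can be shown in isolation with a small C sketch. The type handle and the function handle_create are hypothetical stand-ins for MDB_env and mdb_env_create; unlike a typical C library, the sketch checks for a NULL argument instead of crashing on it.

```c
#include <stdlib.h>

/* Hypothetical opaque handle, standing in for MDB_env. */
typedef struct handle { int maxreaders; } handle;

/* Allocate a handle and hand it back through the out-parameter.
   Returns 0 on success, -1 on failure. */
int handle_create(handle **out) {
    if (out == NULL) return -1;   /* without this check the store below would segfault */
    handle *h = calloc(1, sizeof *h);
    if (h == NULL) return -1;
    h->maxreaders = 126;          /* arbitrary illustrative default */
    *out = h;                     /* this store is why out must point at real storage */
    return 0;
}
```

The caller declares a handle pointer, which may itself be NULL, and passes its address. Passing a NULL where the address is expected is the mistake.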
Now the actual lmdb code is
int mdb_env_create(MDB_env **env)
{
    MDB_env *e;

    e = calloc(1, sizeof(MDB_env));
    if (!e)
        return ENOMEM;

    e->me_maxreaders = DEFAULT_READERS;
    (...)
    *env = e;
    return MDB_SUCCESS;
}
so we expect a valid pointer that can be set to point to 'e'. It points to one object and
can not be null. So the zig definition may be simplified to
pub extern fn mdb_env_create(env1: [*c]* anyopaque) c_int;

pub fn main() anyerror!void {
    var buffer = allocator.alloc(*anyopaque, 10000) catch unreachable;
    var ptr = @ptrCast([*c]* anyopaque,&buffer.ptr);
    const i = mdb_env_create(ptr);
    // ...
}
And this works after adding 'bin.linkSystemLibrary("pthread");' and setting 'export CC=clang' with the GNU Guix package zig 0.9 compiler!
Not sure what the problem is, but it appears the above code should normally work. Thinking about the segfault, one thing I should check is the generated code. It could just be that the pointer is optimized away because Zig does not see that it is being used.
Anyway, now we have a working example, we can simplify in steps down to
const lmdb = @import("lmdb.zig");
const MDB_env = lmdb.MDB_env;

pub extern fn mdb_env_create(env: ** MDB_env) c_int;

pub fn main() anyerror!void {
    var env = allocator.alloc(*MDB_env, 1) catch unreachable;
    var ptr = @ptrCast(**MDB_env,&env.ptr);
    const i = mdb_env_create(ptr);
    // ...
}
Starting to look good! Having a type beats using anyopaque.
And, again, realise that zig's ** is not the same as C's **.
Now, is that allocator really required? Here we create a pointer buffer of size 1 on the stack:
pub extern fn mdb_env_create(env: ** MDB_env) c_int;

pub fn main() anyerror!void {
    var env = [1]*MDB_env{ undefined };
    var ptr = @ptrCast(**MDB_env,&env);
    const i = mdb_env_create(ptr);
    var path: [*c]const u8 = "test";
    var flags: c_uint = 0;
    const mode: lmdb.mdb_mode_t = 0;
    var ret = lmdb.mdb_env_open(env[0],path,flags,mode);
    errdefer lmdb.mdb_env_close(env[0]);
    p("{} {}",.{i,ret});
    std.log.info("All your codebase are belong to us.", .{});
}
And it runs.
Not sure this works correctly right now FIXME
To find more on C and zig pointers, search for void and anyopaque online.
Calling the above allocators and @ptrCasts at every step will lead to unreadable code. So, let's try to simplify

var env = [1]*MDB_env{ undefined }; // set up environment area on the stack
var ptr = @ptrCast(**MDB_env,&env);
if (lmdb.mdb_env_create(ptr) != 0) {
    return LMDBerror.CannotStartEnv;
}
We would like to simplify that to a call
var env = lmdb_create_env();
Returning the stack buffer MDB_env won't work, so we end up with an allocator. No real harm here, performance wise. All allocators in Zig are explicit:
var env = lmdb_create_env(allocator);
Using a global allocator, this would work:

pub fn lmdb_create_env() anyerror![*]*MDB_env {
    // Create ptr buffer for 1 pointer
    var env = allocator.alloc(*MDB_env, 1) catch unreachable;
    var ptr = @ptrCast([*]*MDB_env,&env.ptr); // set address of buffer
    if (mdb_env_create(ptr) != 0) {
        return LMDBerror.CannotStartEnv;
    }
    return ptr;
}

var env = try lmdb_create_env();
defer allocator.free(env);
var lenv = env[0];
etc

It compiles and runs, but the allocator.free does not work! This is because we 'mangled' the buffer. First we allocate a buffer of *MDB_env. Next we cast it to a C pointer-to-pointer to `struct_MDB_env`. Then we return this last pointer. Finally we try to free this pointer, which is no longer the same memory! You might think
allocator.free(env[0]);
works, but it gives a slice error. What is worse is that env itself actually becomes nonsense because we get
a call to `@panic("Invalid free")`, so the free checker tells us that we did something wrong even though we received a valid C pointer and are able to extract it. Hmmm. The original env is not the later env.
Next thing I tried proves this:
const MyTuple = std.meta.Tuple(&.{ [*c]*MDB_env, []*MDB_env });

// Return environment buffer as a single pointer
// pub fn lmdb_create_env() anyerror![*c]*MDB_env {
pub fn lmdb_create_env() anyerror!MyTuple {
    // Create ptr buffer for 1 pointer
    var env = allocator.alloc(*MDB_env, 1) catch unreachable;
    var ptr = @ptrCast([*]*MDB_env,&env.ptr); // set address of buffer
    if (mdb_env_create(ptr) != 0) {
        return LMDBerror.CannotStartEnv;
    }
    var tuple: MyTuple = .{ ptr, env };
    return tuple;
}

var t = try lmdb_create_env();
var env = t[0];
defer allocator.free(t[1]);
because now I certainly free the right pointer in the right scope. Still bails out with @panic("Invalid free"). When I change the line
var ptr = @ptrCast([*]*MDB_env,&env[0]); // set address of buffer
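It works! Eh, that was a bit naive, no? The difference: &env.ptr is the address of the slice's own ptr field, which is a local variable, whereas &env[0] is the address of the first element in the allocated buffer. With &env.ptr the C code overwrote the slice's pointer itself, so the later free was handed memory the allocator never gave out. The moral of the story is that casting can still be a challenge if you don't think it through.
Note that it is nice to return a Tuple here! Even so, we probably don't need it now that we know what went wrong. env[0] and ptr[0] should really point to the same thing, so we can return env.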
pub fn lmdb_create_env() anyerror![]*lmdb.struct_MDB_env {
    // Create ptr buffer for 1 pointer
    var env = allocator.alloc(*MDB_env, 1) catch unreachable;
    var ptr = @ptrCast([*]*MDB_env,&env[0]); // set address of buffer
    if (mdb_env_create(ptr) != 0) {
        return LMDBerror.CannotStartEnv;
    }
    p("\n{any} {any}\n",.{ env[0], ptr[0] });
    return env;
}

var env = try lmdb_create_env();
defer allocator.free(env);
const lenv = env[0];
Effective! To carry around state we could have created a struct, but I think this suffices for now.
To comply with Zig policy we should pass in the allocator to make it explicit that we are allocating memory, so the function definition becomes:
pub fn lmdb_create_env(alloc: std.mem.Allocator) anyerror![]*lmdb.struct_MDB_env
lmdb is a key value store. It passes data around using its own special struct that has mv_size and mv_data of a string. In the first phase we can abstract the buffer handling so we use [] const u8 for in and out:
pub fn lmdb_get(txn: *lmdb.struct_MDB_txn, dbi: c_uint, key: [] const u8) anyerror![] const u8 {
    var data: lmdb.MDB_val = undefined;
    var k = &lmdb.MDB_val{ .mv_size = key.len, .mv_data = @intToPtr(*u8, @ptrToInt(key.ptr)) };
    if (lmdb.mdb_get(txn, dbi, k, &data) != 0)
        return LMDBerror.KeyNotFound;
    const v: []const u8 = @ptrCast([*]const u8, data.mv_data)[0..data.mv_size];
    return v;
}

Usage:

const v = try lmdb_get(txn,dbi,"current");
const blob = try lmdb_get(txn,dbi,v);
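Seen from the C side, the value struct is just a size and an untyped pointer, and wrapping a string in it copies nothing. The sketch below uses a hypothetical struct val that mirrors the two fields named above (mv_size and mv_data); it is not the real MDB_val declaration, and the helper names are made up.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the size-plus-pointer value struct described above. */
struct val {
    size_t mv_size;
    void  *mv_data;
};

/* Wrap a C string as a key; the terminator is not counted and nothing is copied. */
struct val val_from_str(const char *s) {
    struct val v;
    v.mv_size = strlen(s);
    v.mv_data = (void *)s;   /* the cast drops const, so the store must not write through it */
    return v;
}

/* Compare a value against a C string by length and bytes. */
int val_equals(const struct val *v, const char *s) {
    size_t n = strlen(s);
    return v->mv_size == n && memcmp(v->mv_data, s, n) == 0;
}
```

The const-dropping cast plays the same role as the @intToPtr(@ptrToInt(...)) round trip in the Zig code: the struct field is a non-const pointer even though the key is only read.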
That works well enough. But it gets annoying to move txn and dbi as state around. Zig can carry methods in a struct and pass around self. It really is minimalistic OOP and reminds us of Python and the like and creates a form of name spacing:
const LMDB = struct {
    fn get(txn: *lmdb.struct_MDB_txn, dbi: c_uint, key: [] const u8) anyerror![] const u8 {
        var data: lmdb.MDB_val = undefined;
        var k = &lmdb.MDB_val{ .mv_size = key.len, .mv_data = @intToPtr(*u8, @ptrToInt(key.ptr)) };
        if (lmdb.mdb_get(txn, dbi, k, &data) != 0)
            return LMDBerror.KeyNotFound;
        const v: []const u8 = @ptrCast([*]const u8, data.mv_data)[0..data.mv_size];
        return v;
    }
};

const v = try LMDB.get(txn,dbi1,"current");

We can introduce a pointer to self and call get as a method:

const LMDB = struct {
    const Self = @This();

    fn get(self: *const Self, txn: *lmdb.struct_MDB_txn, dbi: c_uint, key: [] const u8) anyerror![] const u8 {
        _ = self;
        var data: lmdb.MDB_val = undefined;
        var k = &lmdb.MDB_val{ .mv_size = key.len, .mv_data = @intToPtr(*u8, @ptrToInt(key.ptr)) };
        if (lmdb.mdb_get(txn, dbi, k, &data) != 0)
            return LMDBerror.KeyNotFound;
        const v: []const u8 = @ptrCast([*]const u8, data.mv_data)[0..data.mv_size];
        return v;
    }
};

const db = LMDB{};
const v = try db.get(txn,dbi1,"current");
Note the pointer to 'self' gets passed in automatically, just as in Python etc.
And now we can carry state with
const LMDB = struct {
    dbi: lmdb.MDB_dbi,
    txn: *lmdb.struct_MDB_txn,

    const Self = @This();

    fn get(self: *const Self, key: [] const u8) anyerror![] const u8 {
        var data: lmdb.MDB_val = undefined;
        var k = &lmdb.MDB_val{ .mv_size = key.len, .mv_data = @intToPtr(*u8, @ptrToInt(key.ptr)) };
        if (lmdb.mdb_get(self.txn, self.dbi, k, &data) != 0)
            return LMDBerror.KeyNotFound;
        const v: []const u8 = @ptrCast([*]const u8, data.mv_data)[0..data.mv_size];
        return v;
    }
};

const db = LMDB{ .dbi = dbi, .txn = txn };
const v = try db.get("current");
Now you can create an init function to create the structure
fn init(dbi: lmdb.MDB_dbi, txn: *lmdb.struct_MDB_txn) LMDB {
    return LMDB { .dbi = dbi, .txn = txn };
}

const db = LMDB.init(dbi,txn);
const v = try db.get("current");

Now we can move the boilerplate into the init function.
So, even if zig aims to replace C, it has enough power to replace some of the important parts of C++.
With our code we got a binary blob from lmdb. The blob is actually a matrix of byte values. So, once we have the rowsize we should be able to cast to a 2D matrix.
Zig has multidimensional arrays when the size is known at compile time. At runtime we'll need to use ArrayList to create a list of rows. Note the data ought to be copied because 'blob' is owned by lmdb. As the docs have it:
The memory pointed to by the returned values is owned by the database. The caller need not dispose of the memory, and may not modify it in any way. For values returned in a read-only transaction any modification attempts will cause a SIGSEGV.
Values returned from the database are valid only until a subsequent update operation, or the end of the transaction.
So, it will be safer to immediately convert a result into a Zig maintained version. But as you can see from this:
const db = try LMDB.init(txn);
const v = try db.get("current");
const blob = try db.get(v);
_ = blob;
const versions = try db.get("versions");
const hash = versions[0..32];

var cols = std.ArrayList(u8).init(allocator);
try cols.appendSlice(hash);
try cols.appendSlice(":ncols");
const ncols = try db.get_u64(cols.items);

var rows = std.ArrayList(u8).init(allocator);
try rows.appendSlice(hash);
try rows.appendSlice(":nrows");
const nrows = try db.get_u64(rows.items);
p("\nnrows,ncols = {d},{d}\n",.{nrows,ncols});

var m = try std.ArrayList([] const u8).initCapacity(allocator,ncols);
var j: usize = 0;
while (j < nrows) : (j += 1) {
    const start = j * ncols;
    const end = start + ncols;
    try m.append(blob[start..end]);
}
for (m.items) | r | {
    p("{d}",.{r});
}
we start a transaction and have short lived values. As long as lmdb has memory there should be no issue for the duration of the transaction. Making an extra copy will just slow things down and it is one reason for using Zig in the first place that we don't want C++ style hidden copying.
Anyway, the result is a 2-dimensional indexed matrix with slicing and safe boundary checks:
p("top left value is {d}, bottom right is {d}\n",.{m.items[0][0],m.items[nrows-1][ncols-1]});

top left value is 0, bottom right is 2

I may turn it into a list of vectors for Zig's built-in SIMD operations.
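The indexing behind this is plain row-major arithmetic: row j starts at offset j * ncols in the flat blob. A minimal C sketch of the same view, with the bounds check written out by hand since C has no slices; row_at and cell_at are made-up names, and the blob is assumed to hold one byte per cell.

```c
#include <stddef.h>

/* Return a pointer to the start of row j in a flat, row-major blob.
   No data is copied; the caller must keep j below the number of rows. */
const unsigned char *row_at(const unsigned char *blob, size_t ncols, size_t j) {
    return blob + j * ncols;
}

/* Element (j, i) with explicit bounds checks; returns -1 when out of range. */
int cell_at(const unsigned char *blob, size_t nrows, size_t ncols, size_t j, size_t i) {
    if (j >= nrows || i >= ncols) return -1;
    return row_at(blob, ncols, j)[i];
}
```

The Zig version gets the per-row check for free, because each row is a slice that carries its own length.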
But first we are going to multiply matrices:
Zig is great at inspection:
p("{s}",.{@typeName(@TypeOf(txn))});
displays
Zig code can be stepped through with the GNU debugger. Compile it and run
gdb --args main arg0
b mgamma.zig:25
set directories contrib/lmdb/libraries/liblmdb
Sometimes it is useful to check what the compiler generates:
zig build-obj -femit-asm=main.asm main.zig
This shows that
pub fn main() anyerror!void {
    var env1: ?*lmdb.MDB_env = null;
    var i = lmdb.mdb_env_create(&env1);

compiles to

    .type   mgamma.main,@function
mgamma.main:
    push    rbp
    mov     rbp, rsp
    sub     rsp, 48
    mov     qword ptr [rbp - 48], 0    <-- a NULL is passed in
    lea     rdi, [rbp - 48]
    call    mdb_env_create@PLT
And you wonder why your code segfaults in mdb_env_create?
It did not work, however, with the non-guix zig 0.11 compiler and brought out an 'illegal instruction' with the Guix 0.10 package.
The recent version of zig bails out with an illegal instruction, comparable to
https://github.com/NixOS/nixpkgs/issues/214356
To see how perfectionist the Zig team is and how hard they try to simplify the language: