I'm still trying to find the cause of the crash [1]. I made a small change (a switch from using lua_touserdata(), which can return NULL, to using luaL_checkudata(), which if it returns, will never return NULL; else it throws an error) which didn't help resolve the issue—the program still dumped core.
Both last week and today, it crashed in the same place in Ltls_write():
static int Ltls_write(lua_State *L) { struct tls **tls = luaL_checkudata(L,1,TYPE_TLS); size_t len; char const *data = luaL_checklstring(L,2,&len); lua_pushinteger(L,tls_write(*tls,data,len)); return 1; }
The variable tls was itself not NULL, but the value it pointed to was! There's only one place in the code where it's set to NULL—in the __gc metamethod (the bits with the Lua registry is to deal with callbacks from libtls which knows nothing about Lua):
static int Ltls___gc(lua_State *L) { struct tls **tls = luaL_checkudata(L,1,TYPE_TLS); if (*tls != NULL) { lua_pushlightuserdata(L,*tls); lua_pushnil(L); lua_settable(L,LUA_REGISTRYINDEX); tls_free(*tls); *tls = NULL; } return 0; }
This is only called when the object no longer has any references to it. I'm not aware of any place where this value gets overwritten (but I'm not discounting it outright quite yet). The crashes started after I fixed the leak, which turned out to be this function (this is the pre-fix version):
static int Ltls_close(lua_State *L) { lua_pushinteger(L,tls_close(*(struct tls **)lua_touserdata(L,1))); return 1; }
The leak happened because I left the reference to the object in the Lua registry, so the garbage collector would never reclaim the resource. The fix itself was easy:
static int Ltls_close(lua_State *L) { int rc = tls_close(*(struct tls **)luaL_checkudata(L,1,TYPE_TLS)); if ((rc != TLS_WANT_POLLIN) && (rc != TLS_WANT_POLLOUT)) Ltls___gc(L); lua_pushinteger(L,rc); return 1; }
Yes, it's slightly complicated by the fact that closing a TLS (Transport Layer Security) connection might require a few more packets of data between the two endpoints to close out the protocol, but that's taken into account by the framework I'm using. At least, I hope it is.
So my theory right now is that the __gc metamethod is being called before the connection is fully closed, but to confirm that, I somehow need to capture that scenario. It's consistent in that it seems to be the same client making the same requests causing the same crash. On the down side, I have no idea who runs the client, or where it is (other than an IP (Internet Protocol) address). And if I make the same requests with my client, it doesn't crash, so I can't reproduce it.
And there's no telling when it will happen again.
I know it's just a small detail I'm missing on my side that's causing the issue, but I just can't seem to locate the detail. To that end, I modified Ltls___gc() and Ltls_write() to attach some traceback information to see if my theory that Ltls___gc() is being called too early. Lua makes this pretty easy to do. I modified Ltls___gc():
static int Ltls___gc(lua_State *L) { struct tls **tls = luaL_checkudata(L,1,TYPE_TLS); if (*tls != NULL) { luaL_traceback(L,L,"__gc",0); lua_setuservalue(L,1); lua_pushlightuserdata(L,*tls); lua_pushnil(L); lua_settable(L,LUA_REGISTRYINDEX); tls_free(*tls); *tls = NULL; } return 0; }
I get a traceback message and assign it to the object about to be cleaned. In Lua, a userdata goes through the garbage collection twice—once to clean the resources that Lua isn't aware of (that's what this function does) and the second time (in the next garbage collection cycle) to reclaim the memory Lua uses to store the userdata. Then in Ltls_write() I have:
static int Ltls_write(lua_State *L) { struct tls **tls = luaL_checkudata(L,1,TYPE_TLS); size_t len; char const *data = luaL_checklstring(L,2,&len); if (*tls == NULL) { lua_getuservalue(L,1); if (!lua_isnil(L,-1)) syslog(LOG_NOTICE,"%s",lua_tostring(L,-1)); luaL_traceback(L,L,"write",0); syslog(LOG_NOTICE,"%s",lua_tostring(L,-1)); lua_pop(L,2); lua_pushinteger(L,-1); } else lua_pushinteger(L,tls_write(*tls,data,len)); return 1; }
This will (hopefully, knock on formica) give me two tracebacks—one when the garbage collection is called (or Ltls_close() is called) and then a write is attempted, and that will point me to the detail I'm missing.
The other theory is that I'm overwriting memory somewhere and that's a harder issue to track down.