I solved the issue, but I'm not sure what the issue was

It's a bug whose solution I can't explain.

So I have a Lua module that enables event-driven programming [1]. I also have a few modules that drive TCP (Transmission Control Protocol) and TLS (Transport Layer Security) connections [2]. To make it even easier, I have a module that presents a Lua file-like interface to network connections [3]—functions like obj:read() and obj:write().

The previous version of this interface module, org.conman.net.ios, used LPEG (Lua Parsing Expression Grammar) [4] to handle line-based IO (Input/Output) requests, as well as an extension to read headers from an Internet message [5]. Given the overhead of LPEG [6] I thought I might try using the built-in pattern matching of Lua. I reworked the code and a bench mark did show a decent and measurable improvement in speed and memory usage.

But the new code failed when transferring a sizable amount of data (about 6.7M (Megabytes)) over TLS. It took about two days to track down the problem, and I still don't have a root cause. The code works, but I don't know why it works. And that bugs me.

To further complicate matters, the code did work when I download the data from a server I wrote (using the same Lua code as the client) but it would fail when I tried downloading the data from another server (different TLS implementation, different language, etc.).

I was able to eventually isolate the issue down one function in org.conman.net.ios. Here was the original code:

local function write(ios,...)
  if ios._eof then
    return false,"stream closed",-2
  end

  local output = ""

  for i = 1 , select('#',...) do
    local data = select(i,...)
    if type(data) ~= 'string' and type(data) ~= 'number' then
      error("string or number expected, got " .. type(data))
    end

    output = output .. data
  end

  return ios:_drain(output)
end

It works, but I didn't like that it first accumulated all the output first before writing it. So when I rewrote org.conman.net.ios, I modified the function thusly:

local function write(ios,...)
  if ios._eof then
    return false,"stream closed",-2
  end
  
  for i = 1 , select('#',...) do
    local data = select(i,...)
    if type(data) ~= 'string' and type(data) ~= 'number' then
      error(string.format("bad argument #%d to 'write' (string expected, got %s)
    end
    
    data = tostring(data)
    
    local okay,err,ev = ios:_drain(data)
    if not okay then
      syslog('error',"ios:_drain() = %s",err)
      return okay,err,ev
    end
    
    ios._wbytes = ios._wbytes + #data
  end
  
  return true
end

Instead of accumulating the data into one large buffer, it outputs it piecemeal. To further confound things, this doesn't appear to have anything to do with reading, which is what I was having issues with.

The client only did one call to this function:

local okay,err = ios:write(location,"\r\n")

The request went out, I would start receiving data, but for some odd reason, the connection would just drop about 200K (Kilobytes) short of the full file (it was never a consistent amount either).

While the reading side was a different implementation, the writing side didn't have to be different, I just felt the second vesion to be a bit better, and it shouldn't make a difference, right? [There's that word! –Editor] [What word? –Sean] [“Should.” –Editor] But regardless of my feelings about how that certainly can't be at fault, I put the previous version of write() back and lo'! It worked!

I'm flummoxed!

I don't understand why the new version of write() would cause the TLS connection to eventually fail, but it did, for whatever reason. Weird.

[1] https://github.com/spc476/lua-conmanorg/blob/master/lua/nfl.lua

[2] https://github.com/spc476/lua-conmanorg/tree/master/lua/nfl

[3] https://github.com/spc476/lua-conmanorg/blob/master/lua/net/ios.lua

[4] http://www.inf.puc-rio.br/~roberto/lpeg/

[5] https://www.ietf.org/rfc/rfc5322.txt

[6] /boston/2020/12/19.2

Gemini Mention this post

Contact the author