Leaky abstractions at hot spots

Wlofie mentioned that clive [1], a C-based LiveJournal client, was vastly slower than jlj, a Perl-based LiveJournal client. Having encountered some weirdness like this in the past [2], I thought it might be fun to profile the code and see where the problem might lie.

Table: Results of profiling
% time	cumulative seconds	self seconds	calls	self µs/call	total µs/call	name
------------------------------
73.33	0.11	0.11	3693	29.79	29.79	getalloc
6.67	0.12	0.01	18630	0.54	0.54	lj_debug
6.67	0.13	0.01	3323	3.01	3.55	hash
6.67	0.14	0.01	1483	6.74	10.78	getp
6.67	0.15	0.01	4	2500.00	34335.39	delete_hash
0.00	0.15	0.00	3693	0.00	29.79	lj_free
0.00	0.15	0.00	1841	0.00	0.00	chomp
0.00	0.15	0.00	941	0.00	10.78	get
0.00	0.15	0.00	936	0.00	0.00	grow_alloctable
0.00	0.15	0.00	936	0.00	0.54	lj_malloc
0.00	0.15	0.00	936	0.00	0.00	newalloc
0.00	0.15	0.00	924	0.00	1.07	next
0.00	0.15	0.00	920	0.00	137.20	del
0.00	0.15	0.00	920	0.00	6.23	put
0.00	0.15	0.00	12	0.00	0.00	MD5Update
0.00	0.15	0.00	11	0.00	35.15	geti
0.00	0.15	0.00	11	0.00	35.15	getpi
0.00	0.15	0.00	7	0.00	0.38	lj_set_config
0.00	0.15	0.00	6	0.00	0.00	MD5Transform
0.00	0.15	0.00	4	0.00	0.00	MD5Data
0.00	0.15	0.00	4	0.00	0.00	MD5End
0.00	0.15	0.00	4	0.00	0.00	MD5Final
0.00	0.15	0.00	4	0.00	0.00	MD5Init
0.00	0.15	0.00	4	0.00	0.00	MD5Pad
0.00	0.15	0.00	4	0.00	0.00	closeconn
0.00	0.15	0.00	4	0.00	0.54	create_hash
0.00	0.15	0.00	4	0.00	0.00	getconn
0.00	0.15	0.00	4	0.00	1556.88	lj_send_request
0.00	0.15	0.00	4	0.00	0.54	md5_hex
0.00	0.15	0.00	4	0.00	0.00	reset
0.00	0.15	0.00	3	0.00	0.54	lj_realloc
0.00	0.15	0.00	3	0.00	1.07	lj_urlencode
0.00	0.15	0.00	2	0.00	36006.75	lj_getchallenge
0.00	0.15	0.00	2	0.00	36009.43	lj_setauth
0.00	0.15	0.00	1	0.00	0.00	create_getfriends_request
0.00	0.15	0.00	1	0.00	1.07	create_login_request
0.00	0.15	0.00	1	0.00	0.77	lj_ask_username
0.00	0.15	0.00	1	0.00	0.54	lj_config_file
0.00	0.15	0.00	1	0.00	76284.83	lj_getfriends
0.00	0.15	0.00	1	0.00	4.83	lj_init_config
0.00	0.15	0.00	1	0.00	73708.73	lj_login
0.00	0.15	0.00	1	0.00	150000.00	main
------------------------------

In looking over the results, it's obvious the problem spots are in delete_hash(), lj_getchallenge(), lj_setauth(), lj_getfriends() (the option I used to test the program with) or lj_login(). A quick glance at delete_hash() didn't show any problems, and commenting it out didn't solve anything.

So I did the next best thing: run the program under a debugger and follow the execution to see where it might be hanging. A breakpoint in lj_login() (another possible hot spot) and following the code from there led to the following bit of code (from lj_send_request()):

```
/*
 * First we ditch the headers...if we ever wanted to do
 * anything with the headers this is where we'd put the code
 * I'm using key here because it's just as good as any other buffer
 * I could set up a buffer just for this, but let's not waste memory
 */
while (fgets(key, LJ_BUFFER_MAX, conn) && strcmp(key, "\r\n"))
{
#ifdef DEBUG3
  int i = strlen(key);

  if (!strcmp(key + (i - 2), "\r\n"))
    key[i - 2] = '\0';
  lj_debug(3, "[%s]\n", key);
#endif
}

/*
 * Now we read in the key-value pairs. We're making the
 * assumption that neither the key nor the value will exceed
 * 1024 bytes. REVISIT THIS
 */
while (fgets(key, LJ_BUFFER_MAX, conn)
    && fgets(value, LJ_BUFFER_MAX, conn))
{
  chomp(key);
  chomp(value);
  lj_debug(3, "Putting [%s]->[%s]...", key, value);
  put(lj_hash, key, value);
}
closeconn(conn);
```

The code was blocking in the second loop on a call to fgets(). And I pretty much knew what was happening: leaky abstractions [3] (and it's fitting that the article talks about TCP (Transmission Control Protocol), which is exactly where the leaky abstraction is happening in this very piece of code).

fgets() is a nice function: it reads in a line of text, and when there's no more text to read, it returns NULL to indicate that the input is exhausted. But fgets() is (in the case of Unix) written on top of lower-level code, notably read(). When applied to a disk file, if there is no more data to be read, read() returns 0, meaning “no more data.”
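
To make the disk-file case concrete, here's a minimal sketch (the file name is just an example, not anything clive reads): read() signals end-of-file by returning 0, which is what lets fgets() hand back NULL without any waiting.

```
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
  char    buf[1024];
  ssize_t n;
  int     fd = open("/etc/passwd", O_RDONLY); /* any ordinary file will do */

  if (fd < 0)
    return 1;

  /*
   * read() returns the number of bytes it got, and 0 once the end of
   * the file is reached, so there's no ambiguity and no waiting.
   */
  while ((n = read(fd, buf, sizeof(buf))) > 0)
    fwrite(buf, 1, (size_t)n, stdout);

  close(fd);
  return 0;
}
```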

But we're not reading from a file. We're reading from a network connection. And therein lies a difference. The function read() also works for network connections, but it behaves just a bit differently in that case. In the case of a disk file, it's pretty apparent when there is no more data to read. But in the case of a TCP connection, it's not clear-cut when the other side has stopped sending data. So read() will just sit there, waiting for data to come in. If there's no data after a period of time (about 30 seconds on my computer), read() will finally conclude that there's no data to be had. That result then bubbles up to fgets() and the loop finally terminates.

But not until a long 30 seconds per request goes by.

No wonder clive is slow (and no, it has nothing to do with “keepalives” in the HTTP protocol—I tested that hypothesis).
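
For what it's worth, one common way to bound that wait is to put a receive timeout on the socket before handing it to stdio. This is a sketch of a mitigation only, not clive's code and not the real fix described next, and it assumes the connected socket descriptor is still in hand before it gets wrapped in a FILE *:

```
#include <sys/socket.h>
#include <sys/time.h>

/*
 * Sketch: give the socket a receive timeout.  `fd' is assumed to be
 * the connected socket descriptor.  Once set, a read() that has waited
 * `seconds' with no data fails with EAGAIN/EWOULDBLOCK instead of
 * blocking indefinitely, so fgets() returns NULL and the loop above
 * finishes much sooner.
 */
static int set_read_timeout(int fd, int seconds)
{
  struct timeval tv;

  tv.tv_sec  = seconds;
  tv.tv_usec = 0;

  return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}
```

That only shortens the stall, though; the program still doesn't know the response is complete until the timer runs out.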

The fix is to parse the headers for a “Content-Length” header and read only that many bytes, but since Wlofie is now using another program, and I don't really post to my LiveJournal [4] (except once a year), there's no incentive to really fix the program (but I did find the problem, which is what I set out to do).
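
For completeness, here's a minimal sketch of what that fix might look like. It assumes the same stdio-wrapped connection as the code above; the function name and the header handling are illustrative rather than clive's actual code.

```
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>

#define LJ_BUFFER_MAX 1024

/*
 * Illustrative sketch, not clive's code: skip the response headers,
 * remembering the Content-Length on the way, then read exactly that
 * many bytes of body instead of waiting for fgets() to see end-of-file.
 */
static char *read_response_body(FILE *conn)
{
  char    line[LJ_BUFFER_MAX];
  long    length = -1;
  char   *body;
  size_t  got;

  /* Headers end at a blank line. */
  while (fgets(line, sizeof(line), conn) && strcmp(line, "\r\n") != 0)
  {
    if (strncasecmp(line, "Content-Length:", 15) == 0)
      length = strtol(line + 15, NULL, 10);
  }

  if (length < 0)
    return NULL;  /* no Content-Length; some other strategy is needed */

  body = malloc((size_t)length + 1);
  if (body == NULL)
    return NULL;

  /*
   * Read exactly `length' bytes; there's no need to wait for the
   * server to close the connection to know the response is complete.
   */
  got = fread(body, 1, (size_t)length, conn);
  body[got] = '\0';

  return body;
}
```

The body could then be split on line boundaries and fed to the same key/value loop as before, without ever waiting on the connection to close.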

[1] http://wlofie.livejournal.com/51837.html

[2] /boston/2003/01/12.1

[3] http://www.joelonsoftware.com/articles/LeakyAbstractions.html

[4] http://spc476.livejournal.com/
