Discussion:
http_parser HPE_INVALID_CHUNK_SIZE errors after upgrading to node 0.8
Jimb Esser
2012-09-20 20:48:02 UTC
We're trying to migrate from node 0.6.21 to 0.8.9, and one of our processes
that does a database backup (a couple million items fetched over HTTP and
HTTPS from Amazon SimpleDB and DynamoDB APIs) has started getting
"HPE_INVALID_CHUNK_SIZE" errors, presumably from node's http_parser.c (and
then after that, it seems we're generally handed back a partial response,
and sometimes also a disconnected socket on the next request, which
black-holes the connection without some other fixes). It generally gets
one or two of these errors in the millions of fetched items (maybe 10s-100s
of thousands of individual HTTP requests over just a handful of HTTP 1.1
connections with connection keep-alive). Rolling back to 0.6.21 makes
everything fine. Doing a diff of http_parser.c between the two versions
shows it's had some major work done (though nothing right around where
this error is emitted), so perhaps a bug has been introduced, but I know
nothing of that code and don't know where to start debugging it. Any
suggestions? Any recent changes to related code I should try bisecting?
Anyone else getting this error?

Jimb Esser
Cloud Party, Inc
Jimb Esser
2012-09-20 22:00:50 UTC
As a follow-up, it looks like the data I get back from the response often
has "\r\n2000\r\n" in the middle of what otherwise looks like good data, so
I'm guessing something is going wrong with processing multi-part (chunked?
whatever the right term is) responses: it sometimes reads too much and then
tries to read the next chunk size from the middle of the response data.
Sometimes that produces an "invalid chunk size" error; if it instead gets
lucky and reads a small number, it just truncates my response (though it
leaves the "\r\n2000\r\n" in the middle of it).
Ben Noordhuis
2012-09-21 12:45:41 UTC
Do you have some captured traffic I can look at?
Jimb Esser
2012-09-21 16:40:59 UTC
I have no idea how to capture traffic on Linux, but I could do so if there
are some good instructions somewhere. The problem generally takes hundreds
of MB of traffic to reproduce, however, so I'm not sure whether capturing
that much traffic will change the timing and keep the problem from
occurring, but I'm willing to try.

I did a git bisect and found the revision in the v0.8 history which
"caused" the issue, but unfortunately, I think that is a red herring, as
the revision is just one that fixed keep-alive being broken on that branch
[1], and older than that revision keep-alive is simply off, so the traffic
is drastically different (not to mention orders of magnitude slower in my
test). On the 0.6 branch, however, even with keep-alive functioning
correctly, I do not see this problem, so that leads me to believe that
something broke while keep-alive was also broken on the 0.8 branch, which
makes it much harder to bisect. I may try cherry-picking in the keep-alive
fix while doing a bisect to see if I can narrow it down, though since each
test can take up to a half hour before it's conclusive, I'm not sure how
productive that will be.

[1]
https://github.com/Joyent/node/commit/a6f3451e2549d0b82ff18990e8bfe28ea4386aba
Ben Noordhuis
2012-09-21 23:27:04 UTC
tcpdump? It's easy:

tcpdump -C 1 -G 5 -i <interface> -n -s 1500 -w <filename> tcp port <port>

The -C and -G options make it rotate the log file every 5 seconds or
when it grows to 1 million bytes, whichever comes first. You may need
to increase the snarf size (-s) to 8192 or higher if you're capturing
loopback traffic.

You can replay the traffic with `tcpdump -A -r <filename>`.
Jimb Esser
2012-09-24 21:37:16 UTC
I tried tcpdump, and it starts saving off packets (writes a few dump files,
a couple tens of MB), but then it seems to stall out - nothing more gets
written even though my app is clearly still receiving data. By the time the
app gets the bad data, tcpdump has inevitably stopped, and searching
through the dumps turns up no trace of the bad data.

I added some logging to http_parser.c, however, simply dumping out the raw
data that it is parsing, and it looks like the data is bad before it gets
in there. Specifically, in one case I'm looking at (raw data, though all
text, at [1]), it looks like the tail of the buffer has 4377 bytes that
begin what should be an 8192-byte chunk, and then a duplicated copy of the
last ~4708 bytes or so, followed by "0\r\n\r\n". Sadly, now that I'm
familiar with the http_parser code, this really points to a problem
somewhere higher up... -_- I'll do some poking around with tcpdump to see
if I can get it or something else to dump more reliably; it would be good
to have absolute confirmation that the data was fine coming in at the TCP
level, although the fact that node 0.6 has no problems makes me pretty
certain the input is fine.
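
For what it's worth, the duplicated-tail claim is easy to check mechanically
against a binary dump of the buffer. A rough standalone sketch (the file
name and the n/back offsets are whatever the dump suggests; nothing here is
Node-specific):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Do the last n bytes of the buffer repeat the n bytes that start
 * `back` bytes earlier in the stream? */
static int tail_is_repeated(const unsigned char *buf, size_t len,
                            size_t n, size_t back) {
  if (back == 0 || len < n + back)
    return 0;
  return memcmp(buf + len - n, buf + len - n - back, n) == 0;
}

int main(int argc, char **argv) {
  /* usage: ./check <dumpfile> <n> <back> */
  if (argc != 4)
    return 1;
  FILE *f = fopen(argv[1], "rb");
  if (f == NULL)
    return 1;
  fseek(f, 0, SEEK_END);
  long len = ftell(f);
  fseek(f, 0, SEEK_SET);
  unsigned char *buf = malloc((size_t)len);
  if (buf == NULL || fread(buf, 1, (size_t)len, f) != (size_t)len)
    return 1;
  fclose(f);
  printf("tail repeated: %s\n",
         tail_is_repeated(buf, (size_t)len,
                          strtoul(argv[2], NULL, 10),
                          strtoul(argv[3], NULL, 10)) ? "yes" : "no");
  free(buf);
  return 0;
}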

On other fronts, I did more tests searching through node's git history and
found that before a specific revision [2] this seems to happen much less
frequently (once in 6 runs, as opposed to almost every run). I'm pretty
convinced there's nothing wrong with that change itself: I added logging
both before and after it and found nothing behaving differently, though the
frequency changed when I added the logging, which may just mean it's a
delicate timing issue. It does happen reliably on my work machine, my home
machine, and AWS instances - all pretty drastically different hardware -
and it never seems to happen on node 0.6.21 on any of those machines
(although if it was only happening 1 in 6 times before, I might simply
never have seen it in my earlier tests).

[1] http://thesilentb.com/files/bad_http_stream.txt
[2] https://github.com/joyent/node/commit/70033bd
Ben Noordhuis
2012-09-24 22:27:23 UTC
Can you post the patch you used for logging? I don't see anything
obviously wrong in the dump itself.

It's kind of odd that tcpdump hangs; I've never had issues with it
myself, even on maxed-out Gbit interfaces.

On a completely unrelated note:

{"S":"[6.24719116,-3.552713678800501e-15,2.252045820000001]"}

Clamp your floats, man! :-)
Jimb Esser
2012-09-24 22:48:56 UTC
The patch for logging was to replace the two occurrences of
SET_ERRNO(HPE_INVALID_CHUNK_SIZE) with:

{
  const char *p2;
  for (p2 = data; p2 != data + len; p2++) {
    /* hex-dump every byte handed to the parser, and mark the byte the
       parser was looking at when it hit the error; the cast avoids
       sign extension on bytes >= 0x80 */
    fprintf(stderr, "0x%02x, ", (unsigned char)*p2);
    if (p == p2) {
      fprintf(stderr, "/******/ ");
    }
  }
}
fprintf(stderr, "HPE_INVALID_CHUNK_SIZE case 1\n");
SET_ERRNO(HPE_INVALID_CHUNK_SIZE);

I then loaded the stderr output into a trivial JS program to write it out
as a binary file.
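
In retrospect, a possibly simpler variant of the same idea would be to
append the raw buffer straight to a binary file from the same spot,
skipping the hex-dump-and-reconstruct step entirely. A sketch only
(dump_parse_buffer and the path are just placeholders):

#include <stdio.h>

/* Append the raw buffer handed to the parser to a binary log so it
 * can be inspected (or diffed against a tcpdump capture) directly. */
static void dump_parse_buffer(const char *data, size_t len) {
  FILE *f = fopen("/tmp/http_parser_dump.bin", "ab");  /* example path */
  if (f == NULL)
    return;
  fwrite(data, 1, len, f);
  fclose(f);
}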

When you say you don't see anything obviously wrong in the dump itself, do
you mean it looks like a valid HTTP stream? It looks to me like, near the
end, it says an 8k data chunk is coming (2000\r\n), and then provides less
than 8k of data and starts another chunk (with the original 8k terminating
in the middle of the second chunk). I am basing this entirely on
assumptions about the HTTP protocol gleaned from guesswork and from reading
http_parser.c, though, so I could be quite mistaken =).
Ben Noordhuis
2012-09-25 01:17:07 UTC
The HTTP parser works across TCP packet boundaries. A request or
response doesn't necessarily fit in a single packet.

Unless you're benchmarking with a concurrency of 1 (a single client
that issues sequential requests), you'll see HTTP requests and
responses getting interleaved.
Jimb Esser
2012-09-25 04:38:34 UTC
Yes, but, as I understand it, this logging happens while processing a
single stream of data, in a single call to http_parser_execute(),
specifically inside a for loop running over exactly the data I logged.
That should be internally consistent, at least, right? If the stream says
it has an 8192-byte chunk coming, and there are more than that many bytes
of data left, then 8192 bytes later there should be another chunk-size
line in the stream - which there is not, which of course is why it's
throwing the HPE_INVALID_CHUNK_SIZE error as it reads non-hex characters
at the offset where it's expecting the next chunk size.
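
That framing assumption can also be checked directly against a dumped body.
A rough sketch of just the framing rules, ignoring chunk extensions and
trailers (this is not http_parser's logic, only my understanding of it):

#include <ctype.h>
#include <stdio.h>

/* Walk a captured chunked body: each chunk should be
 * "<hex size>\r\n" + <size> bytes + "\r\n", ending with a 0-size chunk.
 * Returns 0 if the framing is consistent, -1 at the first problem. */
static int check_chunked_framing(const unsigned char *buf, size_t len) {
  size_t pos = 0;
  for (;;) {
    size_t start = pos;
    unsigned long size = 0;
    while (pos < len && isxdigit(buf[pos])) {       /* hex size line */
      unsigned char c = buf[pos++];
      size = size * 16 + (isdigit(c) ? c - '0' : (tolower(c) - 'a' + 10));
    }
    if (pos == start || pos + 2 > len ||
        buf[pos] != '\r' || buf[pos + 1] != '\n') {
      fprintf(stderr, "bad chunk-size line at offset %zu\n", start);
      return -1;
    }
    pos += 2;
    if (size == 0)
      return 0;                                     /* "0\r\n" terminator */
    if (pos + size + 2 > len ||
        buf[pos + size] != '\r' || buf[pos + size + 1] != '\n') {
      fprintf(stderr, "chunk of %lu bytes not followed by CRLF at %zu\n",
              size, pos);
      return -1;
    }
    pos += size + 2;                                /* data + CRLF */
  }
}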

Anyway, I managed to capture it as it went bad with tcpdump (I added -U -B
1000000, not sure if that helped or the timing was just different). Short
result is: it looks like the data was bad as it came in over TCP! I'll go
into detail and post the dumps below, but this quickly leads me to ask
(after apologizing for thinking this was a node bug): why does this
generally fail on 0.8 and never on 0.6?

My guess is that on 0.6 an HTTP parse error simply disconnected the socket,
and all of our AWS API wrappers retry any request that gets a socket
disconnect (as those happen rather frequently with AWS). We also
automatically retry on a bunch of 5xx HTTP response codes, so if any of
those were being generated, we'd retry on those as well. I'll add some
additional logging and run this on node 0.6 again to see if I can verify
this.

Back to the details of the symptom at hand (just in case I misinterpreted
some of this data, it might be good to get another set of eyes on it). In
this case, it didn't actually hit an HTTP parse error; it simply handed
back what looks like bad data (including the HTTP chunk-size lines inserted
in the middle of it), perhaps because the socket got closed or something
(although it seems like a premature close should have generated a parse
error, since the parser was expecting more bytes? I'm not sure about that,
exactly). I've uploaded the parsed/printed TCP dump of the end of the
relevant stream (starting from before the last good chunk-size line) [1],
as well as the tail of the data that was returned to my app, starting from
the same point [2].

As far as I can tell, the stream indicated there was an 8 KB chunk coming,
8 KB later it indicated another 8 KB chunk was coming, but then 1296 bytes
later there's another chunk-size line ("\r\n2000\r\n"). At the packet
level, the final packet (starting at seq 230681) appears to be simply
repeated data from 1304 bytes back in the stream, but then it terminates
unexpectedly (it never gets to the 8 KB of data the earlier chunk size
indicated was coming). No other packets show up in this stream in my dump.

I cannot think of anything on our side (either my code or node's) that
could cause this, so my conclusion is that something is going wrong within
Amazon's DynamoDB (or in the routing between there and our servers, though
I've tried both inside and outside of their datacenters with the same
results). In retrospect, every bad packet I've gotten has been on DynamoDB
queries, despite us slamming a couple of other Amazon APIs with similar
levels of traffic during this test, so that also points the finger in that
direction. I think I will try to bring this up on the DynamoDB forums, and
just make our DDB client more robust when it gets low-level parsing errors
(although I'm a little afraid of what could happen if a corrupted response
happens to parse correctly because our data happens to line up!).

Thank you very much for your help with this.

[1] http://thesilentb.com/files/tcpdump.txt
[2] http://thesilentb.com/files/packet.txt

Jimb Esser
Cloud Party, Inc
Ben Noordhuis
2012-09-25 12:32:38 UTC
Glad you got it solved - or the next best thing, know what the culprit is. :-)

The reason you're seeing it with 0.8 and not 0.6 is probably that
http_parser.c became considerably more strict. Error handling on the
JS level hasn't changed significantly.
Jimb Esser
2012-09-25 16:59:59 UTC
Well, I did a bunch of testing on node 0.6, and after running for hours I
received no HTTP parsing errors, no JSON parsing errors on the returned
data, and no unexpected socket disconnects. On node 0.8 I generally get
one or two of these within 5-10 minutes. So there's definitely something
node 0.8 is doing differently that affects this, though the tcpdump
indicates the problem is on the other end, so there's basically nothing
more to try - but it is still worrisome :(. I think we'll have to stick
with node 0.6, at least in production, for a while yet.

For posterity, the thread I started with Amazon is here (no responses yet):
https://forums.aws.amazon.com/thread.jspa?threadID=105268

Jimb Esser