dominic.jones via nodejs
2018-09-13 08:45:36 UTC
We are trying to debug a poorly performing Node application and would
appreciate any help or advice from this community. We have a Node
application that serves as the user-facing frontend for a payment platform
(code here: https://github.com/alphagov/pay-frontend). We are in the process
of assessing and expanding our capacity to meet increasing demand.
We have a target of being able to serve X payment journeys per second.
A payment journey comprises 4 pages, two of which require a form submission.
Each page in the journey entails some communication between the node
application in question (that we helpfully call frontend) and other
microservices to establish the current status of the payment etc, on
average around 2 http calls per page.
By carrying out performance tests (using Gatling) we have found that in
order to meet our target of X tx/s, we have to provision around X/2
frontend nodes, i.e. each frontend node appears capable of processing
around 2 payment journeys per second on average.
This seems wrong - by my reckoning it is wrong by orders of magnitude.
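
To put that figure in terms of raw request rates, here is a back-of-envelope sketch using the numbers above. It assumes each form submission is a separate inbound request and that the ~2 downstream calls apply per page only; the function and field names are made up for illustration.

```javascript
// Back-of-envelope request rates per frontend node.
// Assumptions (not measurements): every form submission is a separate
// inbound request, and the ~2 downstream calls apply per page only.
function perNodeLoad(opts) {
  const inboundPerJourney = opts.pagesPerJourney + opts.formSubmissionsPerJourney;
  const downstreamPerJourney = opts.pagesPerJourney * opts.downstreamCallsPerPage;
  return {
    inboundPerSecond: opts.journeysPerSecond * inboundPerJourney,
    downstreamPerSecond: opts.journeysPerSecond * downstreamPerJourney,
  };
}

const load = perNodeLoad({
  journeysPerSecond: 2,
  pagesPerJourney: 4,
  formSubmissionsPerJourney: 2,
  downstreamCallsPerPage: 2,
});
console.log(load); // { inboundPerSecond: 12, downstreamPerSecond: 16 }
```

So under these assumptions a node saturates at roughly a dozen inbound requests and sixteen downstream calls per second.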
*Details about our tech stack*
- We are on AWS, and the frontends run in Docker containers on c5.large EC2
  instances.
- We use HTTPS internally.
- We are running Node 8 in production.
- The application is an Express app.
- We use http.request to make downstream requests, but have also experimented
  with the request library, with no appreciable difference.
- There are no major CPU-heavy processes in our frontend app, and event loop
  latency under normal load is fine.
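
For anyone wanting to compare numbers, event loop latency can be sampled with a timer along the following lines. This is only a sketch of one common approach, not necessarily how it is measured in our app; startLagMonitor and the 50ms threshold are hypothetical.

```javascript
// Samples event loop lag: how much later than scheduled a repeating timer fires.
function startLagMonitor(intervalMs, onSample) {
  let last = Date.now();
  const timer = setInterval(() => {
    const now = Date.now();
    // Anything beyond the requested interval is time the loop was busy.
    onSample(Math.max(0, now - last - intervalMs));
    last = now;
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for monitoring
  return () => clearInterval(timer);
}

// Usage: warn on any sample above a hypothetical 50ms threshold.
const stopMonitor = startLagMonitor(100, (lagMs) => {
  if (lagMs > 50) console.warn(`event loop lag: ${lagMs}ms`);
});
```

Running the same monitor under load as well as under normal traffic would show whether the loop itself is being starved.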
*What we have found so far*
- The frontend nodes are CPU bound.
- Under strain, near breaking point, profiling shows the frontends spending
  a large amount of time on work related to making downstream HTTP requests,
  but nothing obviously ludicrous.
- Whilst there is no obvious memory leak, the heap dump deltas show a
  proportionately large number of Sockets hanging around. I think this is
  just due to keepalives, though.
- Even when not under heavy load, the latency of a downstream request seems
  high for an internal request: we are seeing an average of ~20-40ms, vs
  around 2-5ms for a Java app that makes more or less identical calls.
- A breakdown of the phases of a request (from the request library's timing
  facility) shows that under low load, socket wait, DNS lookup and TCP
  connection take practically no time on average; the bulk of the time is
  spent waiting for the server response.
- Under load, it appears to be the time to establish a TCP connection and
  the time to 'firstByte' that account for the overall increase in HTTP
  request time.
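
The phase breakdown above came from the request library. A similar breakdown can be sketched on top of the core module's events, which also makes the TLS handshake visible as its own phase. timePhases and nowMs are hypothetical helpers, and 'firstByte' here is approximated by the 'response' event (headers received).

```javascript
// Millisecond clock based on process.hrtime(), which is available in Node 8.
function nowMs() {
  const [s, ns] = process.hrtime();
  return s * 1e3 + ns / 1e6;
}

// Records elapsed ms from request creation to each phase, then calls done(marks).
function timePhases(req, done) {
  const start = nowMs();
  const marks = {};
  const mark = (name) => () => { marks[name] = nowMs() - start; };

  req.once('socket', (socket) => {
    mark('socket')();
    // A pooled keep-alive socket is already connected, so these events
    // will not fire again for it; only listen on fresh connections.
    if (socket.connecting) {
      socket.once('lookup', mark('lookup'));
      socket.once('connect', mark('connect'));
      socket.once('secureConnect', mark('secureConnect')); // TLS sockets only
    }
  });
  req.once('response', () => {
    mark('firstByte')();
    done(marks);
  });
}

// Usage (not executed here):
// const req = https.request(options, (res) => res.resume());
// timePhases(req, (marks) => console.log(marks));
// req.end();
```

If 'lookup', 'connect' and 'secureConnect' show up on most requests under load, new connections are being opened rather than reused.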
*Things we have tried*
- We have tried configuring the standard agent with different values of
  maxSockets, maxFreeSockets...
- We have tried using different agents.
- We have tried disabling socket pooling entirely.
- We have tried two different client libs: the core http module, and request.
- We have matched the number of workers in our cluster to the number of CPUs.
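
For reference, the kind of agent configuration meant above looks roughly like this. The values are illustrative assumptions, not the settings used in pay-frontend, and downstreamOptions is a hypothetical helper. As far as I know, the default agent in Node 8 does not enable keep-alive, so without something like this each HTTPS call may pay for a new TCP connection and TLS handshake.

```javascript
const https = require('https');

// Illustrative values only; not the settings used in pay-frontend.
const keepAliveAgent = new https.Agent({
  keepAlive: true,     // reuse sockets (and their TLS sessions) across requests
  maxSockets: 50,      // cap on concurrent sockets per host
  maxFreeSockets: 10,  // idle sockets kept open per host
});

// Hypothetical helper: build request options that all share the one agent.
function downstreamOptions(hostname, path) {
  return { hostname, path, method: 'GET', agent: keepAliveAgent };
}

// Usage (not executed here):
// https.request(downstreamOptions('some-service.internal', '/status'), onResponse).end();
```

Sharing one agent across all downstream calls is what allows sockets to be pooled; creating an agent per request would defeat it.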
Some of these things have yielded gains of ~10%, but I am still convinced
there is something fundamentally wrong with the architecture and
configuration of the application - the throughput just seems too low.
I realise I haven't given enough detail to solve anything here, but if
anyone has any guidance on approaches that have worked for them, other
knobs to twiddle, guidance on better interpretation of profiling and heap
dumps, or any other useful pointers I would be very grateful.
Dom
--
Job board: http://jobs.nodejs.org/
New group rules: https://gist.github.com/othiym23/9886289#file-moderation-policy-md
Old group rules: https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
---
You received this message because you are subscribed to the Google Groups "nodejs" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nodejs+***@googlegroups.com.
To post to this group, send email to ***@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nodejs/df817fd9-ae4c-41bd-8f35-b61a7ae842f0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.