
Stress Testing a CF Server - Interesting Results

Over the years we've been enhancing a fairly large (500K+ lines) CF application. It's a system we inherited, and as we built up the infrastructure we developed a good sense of the operating capacity it needed.

As part of a Disaster Recovery project we relocated the environment to a state-of-the-art colocation facility, and since we were building a brand new home for this product we wanted to do some stress testing. Part of that was observing how the Web/CF/DB servers perform, but this is also the first time users would be accessing the servers over a WAN - so we wanted to simulate what the experience would be like with heavy HTTP load on the WAN.

Tool Selection

We needed to find some tools quickly, so we took a look at a few products including Microsoft's Web Application Stress Tool, Minq's PureLoad, Paessler's Webserver Stress Tool 7, and Apache's JMeter.

Microsoft's tool has technical potential, but from a usability standpoint it's difficult to work with. PureLoad is industrial strength and a great solution if you need to do this regularly, but the setup is too involved for a quick and simple test. JMeter is similarly industrial strength, but there's no way non-technical people could get it set up quickly.

So that left us with Paessler's Webserver Stress Tool (PWST). It has great features and supports complex scripting, but our use case was simply to have a bunch of people load it up on their desktops and fire away with a barrage of canned URLs - and PWST lets you do that in a matter of minutes.

Granted, when the URLs remain static you'll start to get caching at various levels of the system, so we made sure everyone generated their own unique set of URLs that, even if cached, returned significant amounts of data.
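The URL lists themselves came out of PWST, but the idea is easy to sketch. Something like the following - the endpoints and parameter names are invented for illustration, not our actual app's - gives each tester a unique, cache-busting set of URLs:

```python
# Sketch: give each tester a unique URL list by tagging every request
# with the tester's id and a random cache-busting token, so no two
# testers (and no two requests) hit the exact same cached resource.
import random
import string

BASE_URLS = [
    "http://app.example.com/report.cfm?dept=sales",  # invented endpoints,
    "http://app.example.com/search.cfm?q=widgets",   # not the real app
]

def unique_urls(tester_id, count=100):
    """Yield `count` URLs tagged with the tester id and a random token."""
    for _ in range(count):
        token = "".join(random.choices(string.ascii_lowercase, k=8))
        yield f"{random.choice(BASE_URLS)}&tester={tester_id}&nocache={token}"

if __name__ == "__main__":
    for url in unique_urls("tester42", count=5):
        print(url)
```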

Observing the Results

Our environment consists of a Cisco network load balancer using round robin with sticky sessions in front of four ColdFusion 8 Enterprise servers.

We progressively ramped up the load and could see it being distributed fairly evenly across the servers, with CPU and memory usage increasing at a linear rate.
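As a toy model (this is not our Cisco config, just an illustration of the mechanism), sticky round robin pins each new session to the next server in rotation, which is why the spread stays even:

```python
# Toy model of round robin with sticky sessions: new sessions are dealt
# out in rotation, and every later request from a session goes back to
# the server it was pinned to.
from collections import Counter
from itertools import cycle

SERVERS = ["cf1", "cf2", "cf3", "cf4"]  # four CF nodes, as in our setup

def pin_sessions(session_ids):
    """Assign each new session to the next server in the rotation."""
    rotation = cycle(SERVERS)
    return {sid: next(rotation) for sid in session_ids}

if __name__ == "__main__":
    pinned = pin_sessions(f"user{i}" for i in range(1000))
    print(Counter(pinned.values()))  # 250 sessions per node
```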

Another goal was to push the load to the point of failure, and as it ramped up towards 10X what the servers would ever realistically see, it was interesting to watch how the system failed. I expected one server to lock up first, and then the others to pick up the slack but become incredibly sluggish before the next one went down.

What actually happened is that within 10-15 seconds of the first server going down, the load was redistributed, and the remaining servers were already so close to the brink of failure that they all went down at pretty much the same time.

Comments
Given the scenario of load nearing 10x the expected load for the application, that doesn't actually surprise me in particular... If you had more servers I think you might see that more gradual failure you expected.

The problem here is that your load is distributed over only 4 servers, so when they reach the point where any one of them (and by the nature of load balancing, that means all 4 of them) is about to fail, there are only 3 servers left to take the remaining load... Well... all 4 of them were about to fail, and suddenly you're asking each of the remaining 3, right at the point of near failure, to increase its load by 33%. And hence, blammo! They all go down at once.

A more gradual failure in a load-balanced environment like this would require a lot more servers on the balancer. Say you had a farm of 11 of them; then as one approaches the failure stage, you're only asking each other server to take an additional 10% of the load. Maybe they can, maybe they can't, but they'll certainly be sluggish. ;) And of course the more servers you add to the farm, the better they'll handle the failure of any single server. But if the budget for the project can only afford 4 servers, then simultaneous failure imo is basically a given, precisely because the load balancing works so well. :)
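To put rough numbers on that (a toy calculation; the capacity figures are made up, not measured):

```python
# Toy model of the cascade: total load is fixed, so every server that
# drops out pushes its share onto the survivors. With the cluster
# already at ~100% per node, the jump after the first failure is fatal.
def per_server_load(total_load, servers_up):
    return total_load / servers_up

TOTAL = 4.0  # pretend each of the 4 nodes is at 100% of its capacity

for up in (4, 3, 2, 1):
    print(f"{up} up -> {per_server_load(TOTAL, up):.0%} of capacity each")

# Output:
# 4 up -> 100% of capacity each
# 3 up -> 133% of capacity each   <- the +33% jump described above
# 2 up -> 200% of capacity each
# 1 up -> 400% of capacity each
```

Re-run it with 11 nodes and the first failure only takes each survivor from 100% to 110%, which is where the more gradual degradation would come from.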

Although I will admit it's kind of counter-intuitive, because in the IT industry we tend to imagine "fail-over"... and the fact that we don't actually have fail-over doesn't seem to stop us from imagining that we do.
# Posted By ike | 11/24/08 8:21 PM
You may find that in production things are different, because sessions will vary more in length. I'm guessing that in testing, basically all the sticky sessions lasted roughly (if not exactly) the same duration. But with a real curve of users with different habits - some who come and go, and some whose sessions last all day - the sticky sessions may not be as evenly distributed across the servers, and you could wind up with a failure on one server by itself while the others are still well below their peak.
# Posted By ike | 11/24/08 8:25 PM
@ike: Thanks for sharing your experiences and insight!

Ya, the point-of-failure test was more of an academic experiment. Even under "normal" high loads we're able to run on just one web server if needed (though it's brutally slow at that point). Typically CPU is at 15% and memory at 50% (on a 2GB server), handling about 5-8 active requests at any given time.

We use virtual machines, so we're able to quickly scale out horizontally by duplicating an existing VM instance. The only thing holding us back from running, say, 8 servers is licensing. Paying for 8 CF Enterprise licenses & maintenance agreements is quite costly! :(
# Posted By Tariq Ahmed | 11/25/08 2:54 PM
