Shared Rhythm

October 25, 2009

Performance Is A Feature

Filed under: Shipito Ergo Sum — craigkn @ 8:23 pm

A comment posted against my recent post on Agile Testing reminded me that I had ignored an important topic usually grouped with other testing activities – performance testing.  Our team at Gearworks (now Xora) made significant progress in this area, so I thought I would take some time to explain what we did and how it worked.  The proof, as they say, is in the pudding – by the end of our efforts we had improved the scalability of the product by at least a factor of 5 and had effectively eliminated the most common sources of downtime in production.  We also succeeded at bridging the great divide between product operations and product development, and earned the trust and confidence of some fairly jaded members of the operations team.

My perspective on performance is that it is a feature of the product just like any other piece of business functionality.  I have seen teams ignore it altogether at their peril, treat it as a constraint (“the product must be performant” – oh, really; thanks for making that clear), or worse yet relegate it to an already overburdened QA team.  We all know that doing performance testing at the end of a development effort is risky – the product architecture is committed, time is short, and only the most compelling evidence will prevent deployment of the product – and yet this is how it is most often handled.

Banning State Park - Sandstone, MN

I’m going to go out on a limb and state that the limiting factor in the long-term scalability of any complex business system is the database.  We’ve had to tackle both CPU and I/O limitations, but regardless of the type of performance bottleneck, the source of the problem is always the application itself.  A good DBA can do wonders in making sure that everything is indexed properly, but once you move beyond the obvious mistakes that can be made in configuring the database itself, the burden of finding further improvement really lies with the development team.  I have a confession to make:  at one point in my career I fell completely out of love with object-oriented programming, as it was clearly the root of all database abuse.  Since then I have seen that the proper (if not expert) use of persistence layers such as XHibernate can alleviate much of that pain.  I’m willing to let OO back into the tent IF you also commit to becoming a guru in a persistence framework.  And no, don’t write your own or I’m putting you right back in the penalty box.

Enough on that front – my purpose for this post was not to get into the technical particulars of performance tuning (this is a black art in and of itself) but rather to step back one level from that and talk about how you organize and what process you follow in order to ensure that performance gets fair consideration.

When we tackled performance as a feature, we started by retaining the best DBA we could find with expertise dedicated to our database platform of choice.  We created a feature team dedicated to performance and assigned another engineer to work with him.  The first goal was to make the measurement of performance as cheap as possible, so we put effort into building out an environment where performance of a build could be characterized.  We also worked to ensure that we could roll back the environment to a known starting state (we actually saw disk and table fragmentation affecting the performance profile in a way that was uncharacteristic of our production environment).  Finally, we put time into extracting metrics from product logs that profiled the important figures of merit, especially frequency and duration of database calls by operation performed.
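
To make that last step concrete, here is a minimal sketch of profiling database calls from a log. It is an illustration only, not the tooling described above: the log format (`op=<operation> db_ms=<milliseconds>`), the operation names, and the function name `profile_db_calls` are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical log format (an assumption for this sketch):
#   "<timestamp> op=<operation> db_ms=<milliseconds>"
LINE_RE = re.compile(r"op=(?P<op>\S+)\s+db_ms=(?P<ms>\d+(?:\.\d+)?)")


def profile_db_calls(lines):
    """Return {operation: {"calls", "total_ms", "max_ms", "mean_ms"}}."""
    stats = defaultdict(lambda: {"calls": 0, "total_ms": 0.0, "max_ms": 0.0})
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip lines that are not database-call records
        entry = stats[match.group("op")]
        ms = float(match.group("ms"))
        entry["calls"] += 1                          # frequency
        entry["total_ms"] += ms                      # cumulative duration
        entry["max_ms"] = max(entry["max_ms"], ms)   # worst single call
    for entry in stats.values():
        entry["mean_ms"] = entry["total_ms"] / entry["calls"]
    return dict(stats)


# Made-up sample lines, including one that should be ignored.
sample_log = [
    "2009-10-25 20:23:01 op=save_timesheet db_ms=40",
    "2009-10-25 20:23:02 op=load_dashboard db_ms=120",
    "2009-10-25 20:23:03 op=save_timesheet db_ms=60",
    "2009-10-25 20:23:04 INFO heartbeat",
]
profile = profile_db_calls(sample_log)
for op, entry in sorted(profile.items()):
    print(op, entry["calls"], entry["mean_ms"], entry["max_ms"])
```

The same function can be pointed at a production log and at a load-test log, which gives two profiles in the same shape that can be compared directly.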

Once we had this basic infrastructure in place we could now do several things:

  • We could analyze our production logs, characterize the load profile, and compare it to the analysis of our performance run – this helped eliminate the need to try to theoretically model the load profile.  Instead, we simply measured both and used that to determine if our load test was over- or under-representing some aspect of system usage.
  • We could repeat the tests as often as needed.  The goal was to run them at least once per 2-week iteration.  By comparing the profile to the previous run we could quickly identify when some change to the code or schema had introduced a new problem area.
  • We could compare releases and use this information to assist in making the ship-ready decision.
  • We could use the results to determine where our next effort should be focused, which queries needed to be tuned, or even which parts of the system needed to be re-architected to expand overall throughput.
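
The first two comparisons in the list above can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the actual analysis used: the function names, the sample numbers, and the 1.25x regression threshold are all hypothetical.

```python
def load_mix(call_counts):
    """Convert {operation: call count} into {operation: share of all calls}."""
    total = sum(call_counts.values())
    if total == 0:
        return {}
    return {op: count / total for op, count in call_counts.items()}


def mix_skew(production_counts, test_counts):
    """Ratio of load-test share to production share, per operation.

    A value above 1 means the load test over-represents the operation,
    below 1 means it under-represents it, and 0.0 means the test never
    exercised it at all.
    """
    prod = load_mix(production_counts)
    test = load_mix(test_counts)
    return {op: test.get(op, 0.0) / share
            for op, share in prod.items() if share > 0}


def regressions(previous_mean_ms, current_mean_ms, threshold=1.25):
    """Operations whose mean database time grew by more than `threshold`x.

    The default threshold is an arbitrary value chosen for this sketch.
    """
    flagged = {}
    for op, current in current_mean_ms.items():
        previous = previous_mean_ms.get(op)
        if previous and current / previous > threshold:
            flagged[op] = current / previous
    return flagged


# Made-up numbers for illustration.
skew = mix_skew({"save": 800, "load": 200}, {"save": 500, "load": 500})
slower = regressions({"save": 50.0, "load": 100.0},
                     {"save": 52.0, "load": 180.0})
print(skew)
print(slower)
```

In this made-up data the load test under-represents `save` and over-represents `load`, and only `load` is flagged as slower than in the previous run.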

Now, if you are going to convince the powers that be of the need for this kind of investment, you need a business case.  Ours was fairly simple – we ran a hosted, multi-tenant system in which there were both penalties for downtime and clear costs to increase the production capacity of our system.  These costs came from either expanding the number of servers or increasing capacity in our database SAN.  At one point we were faced with a quote for $80,000 to rebuild our SAN on new hardware in order to ensure we had enough capacity to handle a drive failure in our SAN.  Our only alternative was to decrease the SAN I/O to the point where we had capacity to spare even during a drive rebuild.  With this kind of motivation we were successful in figuring out the latter and were able to avoid the expenditure.  In fact, over a 12-month period we increased the system’s capacity five-fold through progressive and occasionally dramatic improvements in the application’s use of the database and the database’s use of the SAN – as a result, although our number of users increased by 25% in that period, we not only avoided additional capital expenditures but also managed to reduce our operating cost.  Our goal was to continue this trend for the foreseeable future – we expected to be able to continue to consolidate services, reduce servers, and reduce overall operating costs while doubling load in the coming year or two.

Such compelling cases do depend on your business situation, but the point is clear – treat performance as a feature and organize a cross-functional team to tackle performance and you can make significant strides toward improved stability and scalability.  You do not have to spend a lot of money on duplicate hardware (we didn’t) or expensive load-generation tools (we used open source or home-grown tools), but you do have to invest the time to make performance testing pervasive and free.  If you do, then you will discover yet another example of how “quality is free”.
