Capacity Management (part of Service Management) is a long-established technology discipline. Reams have been written about the topic, spanning the basic through to the complex. To some degree, there are elements of voodoo involved in calculating capacity!
This brief article will distil some of our experience and that of industry leaders like Google. The key intent is to demystify capacity management and help you implement a working model.
What is Capacity Management?
Thankfully, this part is not hard to describe.
Capacity management ensures you have the appropriate resources for your service to be scalable, efficient, and reliable. Without Capacity Management, you risk running out of resources for your critical services and systems. Never good if you are trying to keep your customers happy and your business operational.
What are the Capacity Management Principles?
There are three core principles of capacity management:
- Services must use their resources efficiently. Sprawling services that require a lot of resources are expensive to deploy and maintain. This is mainly a problem in the Public Cloud world, where every provisioned resource is billed. In essence, you do not want to over-provision your resources, but you do want to be able to scale.
- Services must run reliably. Limiting resources to improve service cost efficiency can put the service at risk of outages during peak or unexpected periods. There is a tradeoff between cost efficiency and reliability.
- Service growth must be anticipated. Expanding resources can take a long time and comes with real-world limitations around deployment timing. For example, scaling may involve buying or deploying new infrastructure. It may also require increasing capacity for other software systems and infrastructure dependencies of the service.
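The tension between the first two principles can be made concrete with a small sizing sketch. The Python below is illustrative only: the function name `instances_needed`, the per-instance throughput figure, and the idea of holding a fixed number of spare instances are assumptions made for the example, not figures from a real service.

```python
import math


def instances_needed(peak_rps: float, rps_per_instance: float, spare_instances: int = 1) -> int:
    """Instances required to serve peak load, plus spares kept for failures."""
    if peak_rps < 0 or rps_per_instance <= 0 or spare_instances < 0:
        raise ValueError("load and spares must be non-negative, throughput must be positive")
    # Round up: a fraction of an instance cannot be deployed.
    base = math.ceil(peak_rps / rps_per_instance)
    return base + spare_instances


# Hypothetical numbers: 4,500 requests/s at peak, 400 requests/s per instance.
lean = instances_needed(4500, 400, spare_instances=0)  # cheapest, no failure tolerance
safe = instances_needed(4500, 400, spare_instances=2)  # tolerates two instance failures
print(lean, safe)  # prints "12 14"
```

The gap between the two results is the price of reliability in this simplified model. Real services rarely scale this cleanly, so measured throughput should replace the assumed per-instance figure.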
Where are the complexities?
The services that you operate can be unpredictable beasts. Furthermore, demand for the service is not always linear. If you have ever worked in financial services, you will already understand the correlation between unexpected market volatility and its impact on technology.
Capacity Management complexities
- Service performance. Understanding how different components of the service perform under stress.
- Service failure modes. Considering known failure modes and how the service behaves when subjected to them.
- Demand. Determining the expected user count and traffic, where the users are located, and the usage patterns.
- Organic growth. Estimating how demand may grow over time.
- Inorganic growth. Keeping in mind the long-term resource impact of adding new features or of the service becoming more successful than expected.
- Scaling. Understanding how the service scales when increasing resource allocations.
- Market analysis. Estimating how market changes or constraints affect your ability to scale.
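As a rough illustration of the organic growth item, the sketch below estimates how long existing capacity will last under steady compound growth and compares that with the lead time for adding more. The constant monthly growth rate, the function names, and all of the numbers are assumptions for the example; real demand is rarely this smooth.

```python
import math


def months_until_exhausted(current_demand: float, capacity: float, monthly_growth: float) -> float:
    """Months until compound growth pushes demand past the current capacity."""
    if current_demand <= 0 or capacity <= 0 or monthly_growth <= 0:
        raise ValueError("demand, capacity and growth rate must all be positive")
    if current_demand >= capacity:
        return 0.0  # already at or over capacity
    # Solve current_demand * (1 + g) ** m = capacity for m.
    return math.log(capacity / current_demand) / math.log(1 + monthly_growth)


def must_order_now(months_left: float, lead_time_months: float) -> bool:
    """True when the remaining runway is no longer than the provisioning lead time."""
    return months_left <= lead_time_months


# Hypothetical service: 600 units of demand, 1,000 units of capacity, 5% growth per month.
runway = months_until_exhausted(600, 1000, 0.05)
print(f"{runway:.1f} months of runway")
print("order now" if must_order_now(runway, lead_time_months=12) else "can wait")
```

Inorganic growth, such as a feature launch, does not fit this curve and is better handled as a separate step change added on top of the forecast.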
One key take-out of this article is that capacity management is not just about your service. In a world of shared resources (Storage, Compute and Networks), your system may have the capacity, but the shared infrastructure connecting to it may not.
For example, network bandwidth starvation will exhibit similar symptoms to a server running out of CPU resources. It does not matter that you have a powerful server. If you have a bandwidth bottleneck, then your service will suffer nonetheless.
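One simple way to spot this kind of bottleneck is to compare utilisation across every resource the service depends on, shared ones included, and not only the server itself. The sketch below assumes you already have a usage figure and a limit for each resource; the resource names and numbers are made up for illustration.

```python
def find_bottleneck(usage: dict[str, float], limits: dict[str, float]) -> tuple[str, float]:
    """Return the resource with the highest utilisation and that utilisation (0.0 to 1.0+)."""
    if not usage:
        raise ValueError("at least one resource is required")
    utilisation = {name: usage[name] / limits[name] for name in usage}
    worst = max(utilisation, key=utilisation.get)
    return worst, utilisation[worst]


# Hypothetical snapshot: a powerful server sitting behind a nearly saturated network link.
usage = {"cpu_cores": 4, "ram_gb": 24, "network_mbps": 950}
limits = {"cpu_cores": 32, "ram_gb": 128, "network_mbps": 1000}

name, level = find_bottleneck(usage, limits)
print(f"{name} is the tightest resource at {level:.0%}")
```

Here the CPU is barely used, yet the network link is the constraint that decides how the service behaves.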
What are the essential resources to assess?
In the real world, this may differ depending on the service’s functionality. However, the table below represents some of the most common hardware areas to include in capacity management planning.
Don’t forget about software
It is also essential to talk about software. Generally, Capacity Management has focused on hardware, but code can also be a limiting factor. For example, applications may only be single-threaded, creating a bottleneck for processing. In this example, no matter how high the hardware specifications are, you will be constrained by the code.
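A common way to put a number on this limit is Amdahl's law, which is not named in the discussion above but describes the same effect: the speed-up from adding workers is capped by the portion of the work that cannot run in parallel. The sketch below is a minimal illustration; the parallel fractions used are hypothetical and would need to be measured for a real application.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Theoretical speed-up from running the parallel part of a job on `workers` cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# A fully single-threaded application gains nothing from a 64-core server.
print(amdahl_speedup(0.0, 64))  # prints 1.0

# Even when half the work is parallel, the gain stays below 2x however many cores are added.
print(round(amdahl_speedup(0.5, 64), 2))
```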
How do you plan for success?
We have already outlined the complexities in Capacity Management. So let us start looking at how you can build a framework to help you succeed.
The fundamental answer to capacity planning is data. Whether it is data from an existing system or load-testing during development, it will provide valuable insights. Put simply, you cannot manage what you do not measure, and without measurements you certainly cannot anticipate the future.
Service monitoring is an essential aspect of capacity management, yet it is often poorly implemented within organisations. To most people, monitoring is about tracking whether a service is up or down. It is used far less often to assess performance and capacity.
The list below contains examples of areas to include when considering what metrics to collect. Of course, there are plenty more, which will differ from system to system. However, it does count as a good starter for ten.
- Incoming requests per second
- Latency-insensitive load
- Number of active users
- Number of total users
- Resource allocations
- Actual resource usage (RAM, CPU, Storage, etc.)
- Quota usage
- How many requests are throttled
As you can see from the list above, you are collecting metrics for the technology resources but also for the incoming load. This is a gap we often observe in organisations that have implemented capacity management. They can generally see the negative impact on technology but don’t understand what is driving it. Therefore, load drivers are just as important to understand as actual resource metrics.
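To show why load drivers and resource metrics belong together, the sketch below fits a straight line between requests per second and CPU usage, then estimates the load at which CPU would reach a chosen ceiling. The sample data, the function names, and the assumption of a linear relationship are all illustrative; real services often behave non-linearly near saturation.

```python
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("load samples must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def load_at_limit(slope: float, intercept: float, limit: float) -> float:
    """Load level at which the fitted resource usage reaches `limit`."""
    if slope <= 0:
        raise ValueError("usage does not increase with load in this fit")
    return (limit - intercept) / slope


# Made-up samples: incoming requests per second and the CPU percentage seen at that load.
requests_per_second = [100.0, 200.0, 300.0, 400.0]
cpu_percent = [15.0, 25.0, 35.0, 45.0]

slope, intercept = fit_line(requests_per_second, cpu_percent)
ceiling = load_at_limit(slope, intercept, limit=80.0)
print(f"CPU reaches 80% at roughly {ceiling:.0f} requests per second")
```

Without the request rate alongside the CPU figure, the same data would only show that CPU is rising, not what is driving it.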
Tools for the job
Once you understand the principles and what to start measuring, you need to be able to consolidate all of the data. This is where tools become an important part of capacity management.
We are often asked what tools should be used to amalgamate and make sense of all the data. The truthful answer is that it depends on your requirements and what your organisation may have already invested in. However, tool selection is really the least important of the problems to solve. As long as a tool can ingest data from various sources and make that data consumable, you should be all set.
As a result of the above, we generally steer clear of recommending tools. However, using any of the below (in line with a solid plan) will enable you to improve the management of capacity.
This article discussed the components and complexities of capacity management and provided a broad framework to implement. Managing capacity is essential to reliability and a key service management feature.
Always remember the key aspect of capacity management. When provisioning resources, examine the various demand signals (input) and their effect on the resource allocations (output). It helps to understand the expected peak demands the service may face and the amount of redundancy you’re required to build into it.
Finally, we cannot stress enough the importance of capturing metrics, including load drivers, hardware and software data points. Without data, you cannot understand the current impact of resource usage. Furthermore, trying to predict future capacity requirements becomes a finger-in-the-air exercise.