Cooling problems at mission critical data centres can arise from too high a load – or too low. A good understanding of the system’s capacities is vital, says Robert Tozer
Higher IT load densities are stretching the demands on mission critical facilities such as data centres. This, combined with the focus on improving operational efficiencies, means effective load and capacity management are key to maximising site performance.
An understanding of true resilient system capacities, load profiles and growth, and effective control processes, are required to maximise cooling efficiency and potential. This will also help avoid overloading and preventable failures, maintain system resilience, reduce downtime and ensure future load profiles can be met.
Many clients have found that their cooling systems deliver less capacity than expected. Because it is difficult to take measurements on site while the plant/system is performing at its full design condition, NTU (number of transfer units) and LMTD (log mean temperature difference) methods can be used to assess such cases.
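As a minimal sketch of how such an assessment might begin, the snippet below computes the LMTD, the implied UA value and the effectiveness and NTU of a heat exchanger from its terminal temperatures. The function names, the counterflow arrangement and all example figures are assumptions made for illustration and are not taken from the paper.

```python
import math


def lmtd(dt_a: float, dt_b: float) -> float:
    """Log mean temperature difference (K) from the two terminal differences."""
    if dt_a <= 0 or dt_b <= 0:
        raise ValueError("terminal temperature differences must be positive")
    if math.isclose(dt_a, dt_b):
        return dt_a  # limiting value when both ends have the same difference
    return (dt_a - dt_b) / math.log(dt_a / dt_b)


def ua_from_duty(duty_kw: float, dt_a: float, dt_b: float) -> float:
    """UA (kW/K) implied by a measured duty, using Q = UA * LMTD."""
    return duty_kw / lmtd(dt_a, dt_b)


def effectiveness(duty_kw: float, c_min_kw_per_k: float,
                  t_hot_in: float, t_cold_in: float) -> float:
    """Actual duty divided by the maximum thermodynamically possible duty."""
    return duty_kw / (c_min_kw_per_k * (t_hot_in - t_cold_in))


# Hypothetical dry cooler in counterflow: water 40 -> 35 degC, air 30 -> 34 degC
duty_kw = 500.0
dt_hot_end = 40.0 - 34.0    # water inlet minus air outlet
dt_cold_end = 35.0 - 30.0   # water outlet minus air inlet
c_water = duty_kw / (40.0 - 35.0)   # heat capacity rate of water, kW/K
c_air = duty_kw / (34.0 - 30.0)     # heat capacity rate of air, kW/K
c_min = min(c_water, c_air)

ua = ua_from_duty(duty_kw, dt_hot_end, dt_cold_end)
ntu = ua / c_min
eps = effectiveness(duty_kw, c_min, 40.0, 30.0)
print(f"LMTD={lmtd(dt_hot_end, dt_cold_end):.2f} K, UA={ua:.1f} kW/K, "
      f"NTU={ntu:.2f}, effectiveness={eps:.2f}")
```

One possible use is to derive UA from part-load site measurements and compare it with the value implied by the design data, which gives an indication of capacity at the design condition without having to run the plant there.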
Some shortfalls in chiller capacity have been found, but more often it is the heat rejection plant (dry coolers and cooling towers) that falls significantly short, in some cases by more than 50%. Because the cooling system has been oversized, this shortfall can go unnoticed until the full redundant capacity is required. A useful equation is:
The main reasons for the shortfall in capacity include untested plant, recirculation of air on the roof, rising ambient temperatures and elevated temperatures exceeding design conditions.
Part load operation
After the dot-com crisis in 2000, when the market slowdown left data centres unfilled and with minimal growth, problems arose with chilled water systems because they were required to run continuously at very low part loads.
The lesson from this is to clearly establish the lowest part-load at which a chiller could operate on a continuous basis. Modular design of cooling systems can avoid these problems.
Similar problems of low loads on chillers have occurred in large financial institution buildings where the main building chilled water system also supplies the data centre. Typically the data centre load is a small fraction of the building summer design cooling load.
At times of low load, such as winter weekends, nights and over Christmas, large centrifugal chillers have been required to run continuously at loads well below their minimum rating.
In both of the above cases, the designer considered only the maximum design loads. Part loads should also have been carefully considered, both as the data centre load grows with the addition of servers and for continuous operation during winter months.
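The part-load check described in this section can be sketched as follows. The 20% minimum load fraction, the unit capacities and the winter load are hypothetical figures chosen for illustration; real minimum continuous loads should come from the chiller manufacturer.

```python
def min_continuous_load_kw(unit_capacity_kw: float, min_load_fraction: float) -> float:
    """Lowest load (kW) that one chiller can carry on a continuous basis."""
    return unit_capacity_kw * min_load_fraction


def can_run_continuously(load_kw: float, unit_capacity_kw: float,
                         min_load_fraction: float) -> bool:
    """True if a single running unit can serve the load within its stable range."""
    lowest = min_continuous_load_kw(unit_capacity_kw, min_load_fraction)
    return lowest <= load_kw <= unit_capacity_kw


# Hypothetical winter-night data centre load on a shared chilled water system
winter_load_kw = 150.0

# One large centrifugal chiller sized for the building summer peak (assumed 3000 kW)
single_large_ok = can_run_continuously(winter_load_kw, 3000.0, 0.20)

# One module of a modular plant (assumed 500 kW units)
modular_unit_ok = can_run_continuously(winter_load_kw, 500.0, 0.20)

print("large chiller stable at winter load:", single_large_ok)
print("modular unit stable at winter load:", modular_unit_ok)
```

Running the same check across the expected load growth profile, rather than only at the design peak, is one way to expose the low-load condition at the design stage.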
An ideal metering installation will allow automated, live load trend monitoring. However, it is often only possible to collect snapshot readings. The problem with this data is that it is only accurate at the time it is recorded, which may not coincide with the peak load. If the remaining capacity is based on non-peak data, this could give a false sense of security because it will appear that there is more available capacity than actually exists, so there is a risk of overloading the system.
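A small sketch of this effect, using made-up load readings and an assumed resilient capacity, shows how a snapshot taken off-peak overstates the remaining capacity compared with the logged peak.

```python
def remaining_capacity_kw(resilient_capacity_kw: float, load_kw: float) -> float:
    """Capacity still available once the given load is served."""
    return resilient_capacity_kw - load_kw


# Hypothetical values for illustration only
resilient_capacity_kw = 600.0
trend_kw = [410.0, 395.0, 430.0, 520.0, 505.0, 440.0]  # logged load readings
snapshot_kw = trend_kw[1]                               # a one-off reading taken off-peak

headroom_from_snapshot = remaining_capacity_kw(resilient_capacity_kw, snapshot_kw)
headroom_from_peak = remaining_capacity_kw(resilient_capacity_kw, max(trend_kw))
overstated_by = headroom_from_snapshot - headroom_from_peak

print(f"headroom from snapshot: {headroom_from_snapshot:.0f} kW")
print(f"headroom from peak:     {headroom_from_peak:.0f} kW")
print(f"snapshot overstates headroom by {overstated_by:.0f} kW")
```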
Server rooms traditionally have computer room air-conditioning units – more often referred to as CRAC units – located around the room’s perimeter. These distribute cool air under the raised floor.
Such designs mix warm and cool air to achieve the required room temperatures. Although cooling may be available in the system, effective air management is required to ensure that this is delivered to critical equipment in order to maintain a satisfactory operating temperature. This is because poor air management causes cool air to mix with warmer air streams on its way to the server inlet.
While server room load densities were low, ie below 500 W/m², this did not constitute a major problem. However, at higher loads of 1000 W/m² and above, this mismanagement can lead to hot spots. This has resulted in typical data centre supply air temperatures to the servers of 25°C. In other words there is a 10°C increase through mixing, with the associated energy wasted.
We propose a novel data centre air management metric that addresses recirculation, negative pressure flow, bypass flow and the balance of server and CRAC unit flows. The problems of negative pressure flow, bypass flow and recirculation flow are indicated in Figure 1.
Through effective air management within the data centre it is possible to capitalise on energy savings through reduced fan power and increased chilled water temperature (reduced compressor power).
The air management model is presented in Figure 2.
Both negative pressure (NP) and bypass (BP) contribute to recirculation (R). If a cabinet is starved of air due to negative pressure, the servers will take return air (recirculation air). If air is bypassed back to the CRAC units, it is not available to supply the servers, which will then draw in recirculation air instead.
The above issues are also present within server cabinets and may be more significant due to poor separation and control of the hot air stream. Using these mass flow rates and temperatures, the following data centre air management metric equations are derived.
By taking a representative sample of air temperature measurements it is possible to gain an understanding of these properties in a data centre. Even if it is difficult to precisely measure NP, BP and R, there is always a specific value which is characteristic of every centre. Typically bypass and recirculation are 40-50%. Legacy data centres, particularly if not well managed, tend to have high levels of both bypass and recirculation flow rates, which are increased when subject to higher density loads.
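As a rough illustration of how temperature measurements can yield such values, the sketch below estimates recirculation and bypass from a simple adiabatic mixing balance. This is an assumed simplification (constant specific heat, no negative pressure flow) and is not presented as the paper's own metric equations; the function names and the server outlet and CRAC return temperatures are hypothetical.

```python
def recirculation(t_floor: float, t_server_in: float, t_server_out: float) -> float:
    """Fraction of server intake air that is recirculated server exhaust.

    Mixing balance: t_server_in = (1 - R) * t_floor + R * t_server_out
    """
    if t_server_out <= t_floor:
        raise ValueError("server outlet must be warmer than floor supply air")
    return (t_server_in - t_floor) / (t_server_out - t_floor)


def bypass(t_floor: float, t_crac_return: float, t_server_out: float) -> float:
    """Fraction of CRAC return air that is cool supply air which missed the servers.

    Mixing balance: t_crac_return = BP * t_floor + (1 - BP) * t_server_out
    """
    if t_server_out <= t_floor:
        raise ValueError("server outlet must be warmer than floor supply air")
    return (t_server_out - t_crac_return) / (t_server_out - t_floor)


# 15 degC floor supply and 25 degC server inlet reflect the 10 degC rise noted earlier;
# the 35 degC server outlet and 26 degC CRAC return are assumed values.
r = recirculation(t_floor=15.0, t_server_in=25.0, t_server_out=35.0)
bp = bypass(t_floor=15.0, t_crac_return=26.0, t_server_out=35.0)
print(f"recirculation = {r:.0%}, bypass = {bp:.0%}")
```

In practice each temperature would be a flow-weighted average over a representative sample of measurement points rather than a single reading.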
The ideal solution requires low negative pressure flow, bypass and recirculation and also a balance between the CRAC and server air requirements. This can be achieved through physical segregation between cold and hot air streams and variable air volume to match the air demands of the server units.
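To indicate why matching CRAC air volume to server demand can be worthwhile, the ideal fan affinity law relates fan power to volume flow rate. The symbols P (fan power) and V̇ (volume flow rate) are introduced here for illustration, and the relationship assumes a fixed system resistance curve and constant fan efficiency, which real installations only approximate.

```latex
\frac{P_2}{P_1} = \left(\frac{\dot{V}_2}{\dot{V}_1}\right)^{3},
\qquad
\frac{\dot{V}_2}{\dot{V}_1} = 0.8
\;\Rightarrow\;
\frac{P_2}{P_1} = 0.8^{3} \approx 0.51
```

Under these assumptions, a 20% reduction in air volume roughly halves fan power.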
CFD software is another tool which models data centre thermal performance. Air management metrics provide real feedback on an existing installation, whereas CFD provides theoretical feedback on a model with simplifying assumptions. CFD can be used for the virtual testing of design stage concepts, CRAC unit failure scenarios and increased server load scenarios.
Energy in data centres accounts for a significant part of the total cost of ownership (TCO). However, compared with commercial buildings, data centres typically have 100 times fewer people and 10 times the cooling load, year round. Therefore, tools such as BREEAM and LEED, which base their environmental impact assessment on commercial buildings, are not suitable for assessing data centres.
The various components of typical data centre systems and associated energy consumption are illustrated in Figure 3.
Energy consumption for data centres should be optimised in the following order:
- higher efficiency
- higher utilisation
- wider temperature and humidity range
Air management (objective: increase air and chilled water set points)
- minimise bypass
- minimise negative pressure flow
- minimise recirculation
- free cooling (air or water side)
- plant/system optimisation
- humidifiers, etc
- generator block heaters
Renewable power (mains/on-site)
Ideally, the supply air temperature of the CRAC unit would be about 22°C, the optimum environmental condition for IT equipment specified by ASHRAE. Allowing CRAC unit supply temperatures to tend towards this value significantly increases the possibility of instigating free cooling strategies.
Chilled water set points can be raised. This means that the number of hours in the year when outdoor conditions are suitable for total and partial free cooling is increased and the number of hours in the year available for direct external air free cooling is increased.
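The effect of a higher set point on free cooling hours can be sketched by counting the hours in which the ambient temperature is low enough to produce chilled water without the chillers. The approach temperature, the two set points and the short list of ambient temperatures are all assumptions for illustration; a real study would use a full year of hourly weather data for the site (and wet bulb temperatures where cooling towers are used).

```python
def free_cooling_hours(ambient_temps_c, chilled_water_supply_c: float,
                       approach_k: float) -> int:
    """Count hours in which ambient is cold enough for total free cooling.

    Total free cooling is assumed possible when
    ambient <= chilled water supply set point - approach.
    """
    threshold_c = chilled_water_supply_c - approach_k
    return sum(1 for t in ambient_temps_c if t <= threshold_c)


# Hypothetical sample of hourly ambient temperatures (degC), not site data
hourly_ambient_c = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24]
approach_k = 4.0  # assumed combined approach of the heat rejection plant and heat exchanger

hours_low_setpoint = free_cooling_hours(hourly_ambient_c, 7.0, approach_k)
hours_high_setpoint = free_cooling_hours(hourly_ambient_c, 14.0, approach_k)

print("free cooling hours at 7 degC set point: ", hours_low_setpoint)
print("free cooling hours at 14 degC set point:", hours_high_setpoint)
```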
More energy savings could be made by widening the operational window of conditions acceptable to IT equipment, ie running rooms hotter and reducing cooling. Employing free cooling, which may be air side or water side, direct or indirect, can make a sizeable reduction in the energy required for cooling.
Good designs should consider how commissioning requirements are met. A site-specific “commissioning handover” document should be prepared by the commissioning consultant/manager to provide details of design handover, installation and site handover for each part of the process.
The integrated systems test is crucial to prove that all systems operate and interface correctly in emergency/failure scenarios and recovery from failure; they must be dependable and justify the investment. This level of testing is too intrusive and risky to perform on a live mission critical facility.
Mission critical facilities are complex systems with high capital, operational and risk costs. The pace of technology change and mismatch between IT and plant replacement timescales can be tackled with flexible modular designs at reduced capital cost.
Operational risks can be reduced through holistic management where the stakeholders collaborate and communicate effectively. Accountability for energy costs, external pressures and education on what is achievable, for example the use of free cooling and increasing the environmental operational window, will help drive a reduction in consumption.
In the long term, increasing heat densities and pressures to reduce energy consumption may result in cooling systems evolving towards more direct energy paths, for example liquid cooled servers.
- This is an edited extract from a paper, Cooling challenges for mission critical facilities, by Robert Tozer, Martin Wilson and Sophia Flucker, presented to the Institute of Refrigeration earlier this year.
Building Sustainable Design