Carnegie Mellon University -- Department of Chemical Engineering
Beowulf Distributed Computer Cluster
Cluster Status
Online temperature monitoring
Thermal warning thresholds:
About the monitoring system:
In January of 2002, the cluster moved from its original home, an office in the basement of Doherty Hall, to its permanent home in a specially renovated lab. Among other improvements, the new cluster room has 400-amp electrical service and a cooling unit rated for 5 tons of refrigeration (17.5 kW) operating on a local chilled water loop.

The cluster generates quite a bit of heat and thus places a significant load on the chilled water main. If the chilled water system ever fails or is taken offline, the temperature in the cluster room increases rapidly (at rates exceeding 1 degree per minute) and quickly reaches the point where there is a significant risk of thermal damage to the computer equipment. We have observed that memory is especially sensitive to heat and becomes prone to failure when the ambient temperature exceeds 85°F.

As a safeguard, we have installed a Newport iSeries i3200 Temperature Meter connected to an EIS-2 iServer Ethernet interface. This system lets us monitor the ambient temperature in the cluster room online, using a standard 100-ohm RTD mounted above and behind the middle cluster rack (approximately the warmest part of the room). Automated scripts constantly monitor the temperature and take escalating actions (suspending the batch system, suspending all running jobs, and shutting down the entire cluster) if the ambient temperature rises above preset thresholds.
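
The escalation logic described above can be illustrated with a minimal Python sketch. This is not the cluster's actual script: the threshold values, the names `THRESHOLDS_F`, `select_action`, and `minutes_until`, and the assumption that the heating rate is in degrees Fahrenheit are all hypothetical. Only the three actions, the 85°F memory limit, and the rate of roughly 1 degree per minute come from the description above.

```python
from typing import List, Optional, Tuple

# Hypothetical thresholds in degrees Fahrenheit (the real values are not
# listed on this page), ordered from most severe to least severe.
THRESHOLDS_F: List[Tuple[float, str]] = [
    (90.0, "shut down the entire cluster"),
    (85.0, "suspend all running jobs"),
    (80.0, "suspend the batch system"),
]


def select_action(temp_f: float,
                  thresholds: List[Tuple[float, str]] = THRESHOLDS_F) -> Optional[str]:
    """Return the most severe action whose threshold temp_f exceeds, or None."""
    for limit_f, action in thresholds:
        if temp_f > limit_f:  # "rises above" is treated as strictly greater
            return action
    return None


def minutes_until(limit_f: float, current_f: float,
                  rate_f_per_min: float = 1.0) -> float:
    """Estimate minutes until limit_f is reached at a constant heating rate."""
    if rate_f_per_min <= 0:
        raise ValueError("rate_f_per_min must be positive")
    return max(0.0, (limit_f - current_f) / rate_f_per_min)


if __name__ == "__main__":
    for reading in (72.0, 82.5, 91.0):
        print(reading, "->", select_action(reading))
    # Time available before the 85°F memory limit if cooling fails at 70°F.
    print(minutes_until(85.0, 70.0), "minutes")
```

Checking thresholds from most to least severe means a single reading always maps to exactly one action. The `minutes_until` estimate shows why automation matters here: at 1 degree per minute, a room starting at 70°F would reach 85°F in about 15 minutes.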
Thu Sep 22 09:10:31 2005