Carnegie Mellon University -- Department of Chemical Engineering
Beowulf Distributed Computer Cluster


Online temperature monitoring

The current temperature is 65.6°F
Normal Operation

Thermal warning thresholds:

Threshold*     Mode       Action
( < 77.5°F )   Normal     Normal cluster operation
77.5°F         Warning    Batch queue system holds all queued jobs
80.0°F         Alert      Batch queue system suspends all running jobs
85.0°F         Critical   All cluster nodes are shut down
* - Note: the thresholds are all "latching" triggers; the thresholds listed above are the rising thresholds. The temperature must drop significantly below the rising threshold before the cluster resumes the previous mode. This behavior prevents the cluster mode from oscillating rapidly.
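
The latching behavior described in the note can be sketched as a small state machine. This is only an illustration, not the cluster's actual script: the class name `ThermalMonitor`, the method `update`, and especially the 2.5°F release margin `HYSTERESIS_F` are assumptions, since the page does not say how far the temperature must fall before a mode is released.

```python
from dataclasses import dataclass

# Rising thresholds (°F) taken from the table above, lowest first.
RISING = [(77.5, "Warning"), (80.0, "Alert"), (85.0, "Critical")]

# ASSUMPTION: the real release margin is not documented; 2.5°F is a placeholder.
HYSTERESIS_F = 2.5

MODES = ["Normal", "Warning", "Alert", "Critical"]


@dataclass
class ThermalMonitor:
    """Tracks the cluster mode with latching (hysteresis) thresholds."""

    level: int = 0  # index into MODES; 0 means Normal

    def update(self, temp_f: float) -> str:
        # Escalate to the highest mode whose rising threshold has been reached.
        for i, (threshold, _name) in enumerate(RISING, start=1):
            if temp_f >= threshold and i > self.level:
                self.level = i
        # Release one mode at a time, and only once the temperature has fallen
        # below the current mode's rising threshold minus the margin.
        while self.level > 0 and temp_f < RISING[self.level - 1][0] - HYSTERESIS_F:
            self.level -= 1
        return MODES[self.level]
```

With these assumed values, a reading of 78.0°F enters Warning, a later reading of 76.0°F stays in Warning (it is still above 75.0°F), and only a reading below 75.0°F returns the monitor to Normal. The sketch tracks the mode only; it does not model whatever manual steps are needed to bring nodes back after a Critical shutdown.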

About the monitoring system:

In January 2002, the cluster moved from its original home in an office in the basement of Doherty Hall to its permanent home in a specially renovated lab. Among other improvements, the new cluster room has 400-amp electrical service and a cooling unit rated for 5 tons of refrigeration (17.5 kW) operating on a local chilled water loop.

The cluster generates quite a bit of heat and thus places a significant load on the chilled water main. If the chilled water system ever fails or is taken offline, the temperature in the cluster room rises rapidly (at rates exceeding 1 degree per minute) and quickly reaches a point at which there is a significant risk of thermal damage to the computer equipment. We have observed that memory is especially sensitive to heat and becomes prone to failure when the ambient temperature exceeds 85°F.
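
To give a rough sense of the time margin this implies, the sketch below estimates how long the room would take to reach each threshold after a cooling failure. It assumes a constant rise of 1°F per minute (treating the "degree" above as Fahrenheit, like the rest of this page) and a starting point equal to the 65.6°F reading shown above. Because observed rates exceed 1 degree per minute, the results are upper bounds. The function name `minutes_to_thresholds` is hypothetical.

```python
# Rising thresholds (°F) from the table earlier on this page.
THRESHOLDS_F = {"Warning": 77.5, "Alert": 80.0, "Critical": 85.0}


def minutes_to_thresholds(start_f: float, rate_f_per_min: float = 1.0) -> dict:
    """Minutes until each threshold is reached, assuming a constant linear rise."""
    if rate_f_per_min <= 0:
        raise ValueError("rate_f_per_min must be positive")
    return {
        mode: max(0.0, (threshold - start_f) / rate_f_per_min)
        for mode, threshold in THRESHOLDS_F.items()
    }


if __name__ == "__main__":
    for mode, minutes in minutes_to_thresholds(65.6).items():
        print(f"{mode}: about {minutes:.1f} min")
```

Under these assumptions the room would reach the Warning threshold in about 12 minutes and the Critical threshold in under 20 minutes, which is why the response is automated rather than left to an operator.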

As a safeguard, we have installed a Newport iSeries i3200 Temperature Meter connected to an EIS-2 iServer Ethernet interface. This system lets us monitor the ambient temperature in the cluster room online, using a standard 100-ohm RTD mounted above and behind the middle cluster rack (approximately the warmest part of the room). Automated scripts constantly monitor the temperature and take escalating actions (holding all queued jobs, suspending all running jobs, and shutting down the entire cluster) if the ambient temperature rises above the preset thresholds.


Thu Sep 22 09:10:31 2005