======================================================================
BASIC CLUSTER INFORMATION
======================================================================
a) To log on to the cluster nodes, use the command "ssh nodename".
Please do not execute simulations on the front-end machine;
debugging and short interactive runs should be done on your own
desktop machine or on interactive nodes reserved for this purpose.
Node names are generated systematically. The machine named "cXnY"
is node Y in cabinet X. The cabinets hold different numbers of
nodes, but nodes are always numbered from 1 to <max> within a cabinet.
Note that it is (purposefully) made impossible to log on directly
to any of the nodes without first logging on to the front-end
machine named "beowulf.cheme.cmu.edu".
b) This computer system is frequently subject to unauthorized
login attempts from all over the world. You must use ssh
(protocol version 2) for all interactive connections.
In addition, we disallow direct access from most sites.
(see Chpt 8 for details on access restrictions)
Note: the concept of "shared accounts" is frowned upon.
Do NOT give away your password to _anyone_ at all.
This applies for ANY reason; just do not do it.
Use the command "passwd" to change your password.
When doing so, please:
- use a non-dictionary password.
- change it at least every 6 months or so.
- read the output from the "passwd" program. If it suggests
that your password is a bad one, your password is usually
crackable in a matter of hours: pick a different one.
You do not need a password to connect to any of the nodes once
you have logged on to the cluster.
======================================================================
ADMINISTRATIVE POLICIES
======================================================================
1) User Accounts
- user accounts are available to anyone in or related to the
research groups of Biegler, Hauan, Kitchin or Sholl.
- all user accounts are individual, never shared.
- do not give out your password to anyone; if they think
they need or should have a user account, let them email us.
2) Remote Login Access
The cluster is only accessible using ssh with protocol v2 or
higher. Free clients are available for all operating systems.
Direct access is also filtered by an explicit list of
IP addresses and domains. All connections are permitted from
CMU machines (.cmu.edu) and the Pittsburgh Supercomputing
Center (.psc.edu).
Selected DSL domains are also allowed if they:
- provide a static IP address
- are locally or regionally limited to Pittsburgh
example: Verizon is enabled by ".pitt.east.verizon.net"
The complete list is available as beowulf:/etc/hosts.allow
Due to large portscanning activity and numerous unauthorized
connection attempts, general access for large/national service
providers will NOT be provided. Typical examples would include
".aol.com", ".att.com" and ".comcast.net"; these domains
represent millions of computers and imply too much exposure.
If you use one of these providers, you may be able to contact
them and ask for details as to how they allocate IP addresses
based on regional info. If so, email us the info and we'll enable
login from a subset of the relevant machines.
The only other alternative is to first log on to an andrew
machine and connect to the cluster using ssh from andrew.
(you must use "ssh -2" from andrew to get the right protocol)
3) Disk Usage
- The cluster should not be used for permanent storage of files.
In general, all data and result files should be moved to your
own computers when they are no longer being written to.
- The cluster fileserver has a redundant (RAID-5) disk array
which offers some protection against hardware failure. However,
we do NOT make backups -- complete or incremental -- of user
data. You should take the necessary precautions to ensure that
your source code is backed up outside the cluster and that any
data files generated are moved to your own computer.
- At present we are not enforcing disk quotas. This is convenient
for performance reasons and also allows anyone to temporarily
generate large amounts of data. However, if the main /home disk
should fill up, it will ruin the simulations for everyone.
Please make sure this does not happen because of you.
To get a list of your 20 biggest files not accessed in the last
7 days, execute the command:
find ~/ -type f -atime +7 -ls | sort -n -k7,7 | tail -20
(you can of course change "7" or "20" to suit your needs)
- You are strongly encouraged to write temporary data to
the /scratch partition on execution nodes, in particular if
data is written continuously. This will both avoid disk storage
problems and make your program(s) run faster. This data is
available directly from the login machine through the network
file system in /beowulf/<nodename> (see Chpt 7 for details).
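A minimal sketch of the /scratch advice above, assuming a POSIX shell
with "mktemp" available. The function name run_in_scratch, the
SCRATCH_ROOT variable and the per-user directory layout are
illustrative assumptions, not part of the cluster setup.

```shell
# Run a command inside a private scratch directory, then copy the
# files it produced back to a destination directory.
# Assumption: SCRATCH_ROOT defaults to /scratch; the destination
# directory must already exist.
run_in_scratch() {
    dest=$1; shift
    work=$(mktemp -d "${SCRATCH_ROOT:-/scratch}/${USER:-job}.XXXXXX") || return 1
    # Run the command in a subshell so the caller's directory is unchanged.
    ( cd "$work" && "$@" )
    status=$?
    # Copy everything back (including dotfiles); keep scratch data if the copy fails.
    cp -R "$work"/. "$dest"/ || return 1
    rm -rf "$work"
    return "$status"
}
```

Example: run_in_scratch "$HOME/results" ./my_simulation input.dat
(here "my_simulation" and "input.dat" are placeholders).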
4) Fair CPU Usage
- The cluster is a multi-user environment where everyone would
like their calculations to be run as fast and as often as
possible. At present, there are no restrictions on the amount
of resources that may be simultaneously occupied by any one
individual. Please take care to help us continue this policy by:
(a) using the batch system
(b) not submitting an excessive number of jobs
(c) always leaving your batch jobs "rerunable"
(d) carefully estimating the resources your job(s) will need;
this helps the scheduler achieve maximum throughput
(e) not submitting your jobs to a specific (named) node or
group of nodes. While it is possible to continuously
request the fastest nodes available, it is not nice
... and everything is logged, so we will know ...
Also make sure you do not submit programs with a signal handler
that traps run control (SIGSTOP, SIGCONT) as this will interfere
both with systems for load balancing and temperature monitoring.
(If you don't know what this means, do not worry about it; you
will not be doing this "by accident".)
======================================================================
MANUAL PAGES
======================================================================
Most Unix commands have a manual page, accessible from
the command prompt by typing "man name-of-command".
example: "man man" = overview of the "man" command
Related/useful commands include:
apropos: "man apropos" = topical search in man pages
Use this when you know what you want to do, but
need the name(s) of the relevant command(s).
locate : "man locate" = search for file(s) by name
If you know the (partial) name of a file, use
"locate name-of-file" to get its full path.
(Please note that locate only indexes files on the
local machine, and thus will not list files from users'
home directories unless run while logged in to the file
server.)
The GNU Texinfo pages often contain more detailed information.
These pages are accessible through the "info" command.
While this information is considered 'dense', it is -- by far --
the most accurate/detailed help source available.
======================================================================
SPECIAL PARALLEL/CLUSTER SOFTWARE
======================================================================
A few special commands are required for parallel applications:
execution -> read "man mpirun"
MPI -> read manual pages for "mpi" + "type"
where "type" is one of "cc", "CC", "f77" or "f90"
(with no spaces, i.e. "man mpiCC" for a C++ compiler)
The default compiler for MPI is gcc; you may find this slow.
A better alternative is the Intel Compilers (icc, ifort),
but this will require some more work on your behalf.
======================================================================
LOGIN INFORMATION
======================================================================
1. When you log in to the front-end machine (beowulf), you will get
a list of the 5 nodes with lowest current usage. The format is:
c1n12 up 2+01:08, 0 users, load 0.00, 0.00, 0.00
c1n01 up 8+05:45, 0 users, load 0.98, 0.97, 1.00
c1n04 up 10+07:23, 0 users, load 1.65, 1.52, 2.18
c1n11 up 15+05:47, 0 users, load 1.98, 1.97, 1.91
c1n05 up 15+05:47, 0 users, load 2.00, 2.00, 2.00
The above output is generated with the script "node-load N" where
N is the number of lines in the output (default=5). Only the
login machine has the ability to "see" nodes in multiple cabinets;
any computing node will only report statistics for other machines
within the same logical network.
The "load" on a unix machine is equal to the average number of
processes in the "run queue"; i.e. executing or waiting for
cputime. Although each machine in principle could handle a load in
excess of 200, they are most efficient if the load is equal to the
number of processors in each machine. For our cluster, this number
is typically 2 or 4. If you have more than N processes running on
an N-cpu machine, all jobs will still execute, but the total
throughput will go down. The batch system takes care of this.
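A small sketch of the load rule above: list the nodes whose load is
below the number of CPUs. It assumes the line format shown in the
sample output and that the first load figure is the 1-minute average;
the function name idle_nodes is hypothetical.

```shell
# Read "node-load"-style lines on stdin and print the names of nodes
# whose first load figure is below the CPU count given as $1.
idle_nodes() {
    awk -v ncpu="$1" '
        $6 == "load" {
            load = $7
            sub(/,$/, "", load)      # drop the trailing comma
            if (load + 0 < ncpu + 0) print $1
        }'
}
```

Example: node-load 20 | idle_nodes 2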
2. When you log in to one of the nodes, you get a line of the form:
mem=932.8/3464.9/4250.0 swap=0.0/2056.3 cpu=2.00 1.89 1.83 3/92 2676
This is a (very) brief status monitor for the machine.
mem=x/y/z memory status : in-use/"free"/total [mb]
("free" = free+cached+buffered)
swap=x/y virtual memory: in-use/total [mb]
cpu 3x cpuload (1m, 5m, 15m)
number of processes (running/total)
last PID (process identification number)
General hints for maximum performance:
- make sure the node has enough available memory for your job.
- swapping is a (very) bad thing for speed.
- if the load is high (or there are many processes running),
you should consider logging in to a different node.
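The first two hints can be checked mechanically. The sketch below
assumes the status line has exactly the layout shown above; the
function name node_has_room is hypothetical.

```shell
# Succeed if the status line on stdin reports at least $1 MB of
# "free" memory and no swap in use; fail otherwise.
node_has_room() {
    awk -v need="$1" '
        {
            split($1, m, "[=/]")     # m[2]=in-use  m[3]="free"  m[4]=total
            split($2, s, "[=/]")     # s[2]=swap in-use  s[3]=swap total
            ok = (m[3] + 0 >= need + 0) && (s[2] + 0 == 0)
            seen = 1
            exit
        }
        END { exit (seen && ok) ? 0 : 1 }'
}
```

Example, with the status line saved in the variable "line":
echo "$line" | node_has_room 2000 && echo "enough memory"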
======================================================================
BATCH SYSTEM
======================================================================
The cluster has a fairly general batch system installed called PBS,
the Portable Batch System.
To get started, type "info pbs", "man qsub" and "man qstat".
Current status for the batch system is also available on the cluster
web site (http://beowulf.cheme.cmu.edu) under "Current Status". At
the command prompt, the alias 'qs' will show you the status of the
batch queues along with any currently running batch jobs owned /
submitted by you:
=====
server: beowulf
Queue Max Tot Ena Str Que Run Hld Wat Trn Ext Type
---------------- --- --- --- --- --- --- --- --- --- --- ----------
reject 0 0 yes yes 0 0 0 0 0 0 Execution
short 0 0 yes yes 0 0 0 0 0 0 Execution
long 0 0 yes yes 0 0 0 0 0 0 Execution
hog_s 0 0 yes yes 0 0 0 0 0 0 Execution
hog_l 0 0 yes yes 0 0 0 0 0 0 Execution
q_feed 0 0 yes yes 0 0 0 0 0 0 Route
=====
(a) There are 6 queues
routing queue: determines where to send jobs based on the
amount of cpu time & memory requested by the
user.
q_feed: name of the routing queue. Jobs typically live here
less than 1 second.
execution queues: the jobs are run here
short : cpu time < 24 hr; memory < 500 mb
long : cpu time > 24 hr; memory < 500 mb
hog_s : cpu time < 24 hr; memory > 500 mb
hog_l : cpu time > 24 hr; memory > 500 mb
special execution queue: rejects unspecified jobs
reject: jobs submitted without resource constraints
will end up here & get killed within 2 minutes.
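The routing rules above can be written out as a small function. This
is only an illustration of the table, not the actual PBS routing
configuration: the name route_queue is hypothetical, inputs are whole
hours and megabytes, and sending exactly 24 hr or exactly 500 mb to
the larger queue is an assumption (the table does not say).

```shell
# Print the execution queue for a job requesting $1 hours of cpu time
# and $2 MB of memory, following the limits listed above.
route_queue() {
    hours=$1
    mem_mb=$2
    if [ "$hours" -lt 24 ]; then len=s; else len=l; fi
    if [ "$mem_mb" -lt 500 ]; then
        if [ "$len" = s ]; then echo short; else echo long; fi
    else
        echo "hog_$len"
    fi
}
```

Example: "route_queue 4 200" prints "short".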
(b) Example batch submission command.
Suppose I have a script 'batchjob' that I know will run for
at most 4 hours and use no more than 200mb of memory:
command: qsub -l cput=4:00:00,mem=200mb batchjob
You must always specify either cputime or memory size (or both);
any job submitted with no resource constraints will be rejected.
Batch system defaults:
- memory : 100 mb (if you only specify cputime, mem limit = 100mb)
- cputime: 8 hrs (if you only specify memory, cpu limit = 8 hrs)
Note #1: Any job exceeding its limits will get KILLED by the system.
--> It is generally a good idea to "overestimate" the
system resources required by maybe 10-20%. That
said, you don't want to add (way) too much since
there is a mild preference in the job scheduler for
picking short/small jobs ahead of long/large ones.
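A sketch of the padding advice in Note #1: build (but do not run) a
qsub command line with both limits increased by a percentage. The
function name padded_qsub and the 15% default are assumptions; inputs
are whole hours and megabytes.

```shell
# Print a qsub command with cput and mem padded by $4 percent
# (default 15). Arguments: hours, memory in MB, script name, percent.
padded_qsub() {
    hours=$1
    mem_mb=$2
    script=$3
    pct=${4:-15}
    secs=$(( hours * 3600 * (100 + pct) / 100 ))
    mem=$(( mem_mb * (100 + pct) / 100 ))
    printf 'qsub -l cput=%d:%02d:%02d,mem=%dmb %s\n' \
        $(( secs / 3600 )) $(( secs % 3600 / 60 )) $(( secs % 60 )) \
        "$mem" "$script"
}
```

Example: "padded_qsub 4 200 batchjob 20" prints
qsub -l cput=4:48:00,mem=240mb batchjob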
Note #2: It is possible to change the resource requirements for
running jobs through the 'qalter' command. However,
_increasing_ the amount of any resource requires system
manager rights.
--> If you need this, please email the administrator(s)
at the address listed at http://beowulf.cheme.cmu.edu
Note #3: All batch jobs are executed from your home directory.
Read "man qsub" or see example script(s) in:
/store/examples/pbs
for possible ways to deal with directory issues.
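One common way to handle the directory issue in Note #3 is sketched
below. It assumes the standard PBS environment variable
PBS_O_WORKDIR (the directory where qsub was invoked) is set for jobs
on this cluster; the function name enter_workdir is hypothetical.
The scripts in /store/examples/pbs may do this differently.

```shell
# Change to the directory the job was submitted from instead of
# staying in the home directory. Outside the batch system, where
# PBS_O_WORKDIR is unset, stay in the current directory.
enter_workdir() {
    cd "${PBS_O_WORKDIR:-.}" || return 1
}
```

Call "enter_workdir || exit 1" as the first command of the batch
script, before any relative file names are used.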
Note #4: We have implemented a local set of restrictions to
control job resources more tightly than the standard
PBS system; this is an attempt to avoid inefficient
use of system resources (particularly memory).
(1) "-l" resource requests MUST be on the command line.
--> any PBS commands in your scripts are ignored.
(2) All nodes have property fields that describe their
available "per-cpu" memory. All batch jobs are
automatically assigned only to the subset of nodes on
which they can run without the per-job memory
requirement exceeding the per-cpu amount available.
This guarantees that we avoid swapping.
Consequences:
- you do NOT need to supply node properties yourself.
- your jobs will WAIT until they can run without
possibly swapping. This makes it more important
(for you!) not to overestimate memory requirements.
(3) Large-memory jobs (currently > 1gb per job) require
the use of a new command-line argument "-vcpu=N".
To submit a 50hr min job w/ 1.4GB of memory, use:
qsub -l cput=50:00:00,mem=1400mb -vcpu=2 script
This will allocate (and wait for!) a node where you
can allocate 2 CPUs and a total of 1400mb memory.
You could also submit this job as:
qsub -l cput=50:00:00,mem=1400mb -vcpu=3 script
or
qsub -l cput=50:00:00,mem=1400mb -vcpu=4 script
In these cases PBS will wait for a node where it can
allocate 3 or 4 CPUs, respectively (and 1.4GB mem).
===> in general, use "-vcpu=2" for jobs up to 2GB.
Your job will NOT gain any execution speed from
requesting more "virtual" CPUs than necessary; your
job will just prevent the CPUs from being used by
someone else.
(c) Upon submission, the batch system will reply with the request ID.
When the job is completed, 2 files will be created in the
directory from which you submitted the job:
batchjob.oNNN : output from 'batchjob' with request ID NNN
batchjob.eNNN : errors from 'batchjob' with request ID NNN
(d) The batch system is fairly robust with respect to error
conditions. If the file server goes down while your job is
running, your job will temporarily be suspended unless running
against the local /scratch partition. Once the server gets back
online, all jobs will continue from their previous state.
If the execution node goes down, any job lost will be
automatically restarted on the first available node assuming
you have not used the "qalter" command to mark the job as
"not rerunable". That said, automatic resubmission will not
happen until the original execution host is back online.
If your job is stuck on a dead node, you may rerun it with
the command "qrerun -W force JobId". You can only do this if your
job is set to be "rerunable"; otherwise you have to restart it.
======================================================================
GETTING HELP
======================================================================
The cluster is a research resource and has no support contracts;
however, limited assistance is offered on a voluntary basis.
NOTE: before reporting any problems, please seek help within your own
research group. Then think about it and try to solve it yourself
... and only THEN email the cluster administrators at the
contact address listed at http://beowulf.cheme.cmu.edu/
General guidelines for reporting problems:
(a) Briefly describe the problem:
- What are you trying to do.
- What happens ... and why is this wrong.
- Which machine(s) do(es) it happen on.
- Do you think it is an error in the cluster configuration
or simply something you don't understand how to make work.
(b) Include a verbatim log of ALL error messages:
- Cut and paste the exact commands + output.
(c) Include a stepwise description of how to reproduce the problem:
- If the problem involves any of your personal files, please
send the name of the directory and make sure all the files
in there are readable by a normal user.
- If the problem involves files in multiple directories,
add them all to a "tar.gz" file ("man tar", "man gzip")
and attach them to your email.
- Include the sequence of commands needed to reproduce
the problem and/or errors. If you have any special
environment variables set, state what they should be.
If you do not follow these very simple rules to help us minimize
time spent on support, you risk that your request is "delayed"
for quite some time. You are also likely to get a reply along the
lines of: "please submit a proper problem report."