| |
Posted by jkitchin,
Fri Sep 5 17:21:01 2008 |
The login node reboot messed up the queue system, and
all the running and queued job info was lost. Sorry.
|
| |
Posted by jkitchin,
Fri Sep 5 17:20:15 2008 |
the cluster
|
| |
Posted by jkitchin,
Fri Sep 5 12:28:59 2008 |
The login node was hung and had to be rebooted today.
I do not know why.
|
| |
Posted by jkitchin,
Thu Aug 7 07:29:42 2008 |
There will be another power outage tonight. The
cluster will be turned off at 6pm today and turned
back on tomorrow morning.
|
| |
Posted by jkitchin,
Fri Aug 1 14:07:42 2008 |
The queues are started again.
|
| |
Posted by jkitchin,
Fri Aug 1 12:56:46 2008 |
The queues have not been restarted. The nodes are
updating some software now. When that is done I will
restart the queues this afternoon.
|
| |
Posted by jkitchin,
Thu Jul 31 17:55:58 2008 |
The cluster is going down now! Please log out.
|
| |
Posted by jkitchin,
Wed Jul 30 11:08:11 2008 |
Correction: the cluster will be shut down Thursday
evening, JULY 31. All jobs in the queue will be cleared
at that time.
|
| |
Posted by jkitchin,
Fri Jul 25 08:53:59 2008 |
The power shutdown has again been moved. Now the
cluster will be shut down Thursday evening, Aug 31. All
jobs in the queue will be cleared at that time before
it is restarted.
|
| |
Posted by jkitchin,
Wed Jul 23 14:29:31 2008 |
The power will be shut down next Tuesday, so I am
postponing restarting the queue system until then. I
still don't recommend you submit jobs unless you think
they will finish by then. The cluster will be shut down
Monday evening.
|
| |
Posted by jkitchin,
Tue Jul 22 07:31:40 2008 |
The queues are all suspended now, pending a restart of
the queue system. Please do not submit any new jobs.
The queue system will be restarted on Thursday to give
existing jobs a chance to finish. Sorry for the
inconvenience.
j
|
| |
Posted by jkitchin,
Wed Jul 16 07:39:29 2008 |
It appears the PBS system crashed last night. The
queue system is running again but the queues are
suspended while I investigate the cause.
|
| |
Posted by jkitchin,
Thu Jul 10 17:51:56 2008 |
The qstat command is working again. It is my fault
this happened, and there may be other problems. Let me
know if you have them.
j
|
| |
Posted by jkitchin,
Thu Jul 10 16:26:03 2008 |
There is a permissions problem with qstat right now. I
am investigating it.
|
| |
Posted by jkitchin,
Thu Jun 5 15:33:33 2008 |
We are in the process of bringing the cluster back
online. It should be up by the end of the week.
|
| |
Posted by jkitchin,
Sun Jun 1 13:32:48 2008 |
The cluster will be moved tomorrow. Shutdown will
start at around 8am Monday June 2. Power will be
rewired to the cluster sometime this week, and after
that everything will be turned back on.
|
| |
Posted by jkitchin,
Wed May 28 08:25:46 2008 |
The cluster should be back up again. All jobs were lost
due to the power outage. Please resubmit them. All
servers should be on.
|
| |
Posted by jkitchin,
Tue May 27 18:33:46 2008 |
The cluster will be shut down this evening around
midnight due to a planned power outage tomorrow. It
will be turned back on tomorrow morning.
|
| |
Posted by jkitchin,
Fri May 23 08:53:37 2008 |
For some reason the cluster shut itself down
yesterday. I don't know why.
|
| |
Posted by jkitchin,
Wed May 21 17:56:28 2008 |
The cluster move is rescheduled again due to 3rd floor
renovations. The new shutdown date is scheduled for
June 2. The nodes will be shut off that morning and the
racks moved to the new cluster room. Once the
electrician rewires the power to the nodes and the
cooling is turned on, the nodes will be returned to
service as they are rewired.
|
| |
Posted by jkitchin,
Tue May 13 14:37:24 2008 |
The cluster shutdown and move is currently postponed.
I am not sure when it will happen, but I hope it will
be around May 20 now.
|
| |
Posted by jkitchin,
Mon Apr 28 15:35:00 2008 |
The cluster move date has been changed to May 14 for
now. More information to come.
|
| |
Posted by jkitchin,
Fri Apr 25 08:20:55 2008 |
We plan to shut the cluster down at midnight May 5 in
preparation for moving it back to its renovated room.
It will take a few days before it is back up because
we have to schedule the electrician to reconnect power
to the cluster.
|
| |
Posted by jkitchin,
Tue Mar 11 11:18:03 2008 |
The login node was way overloaded today, and we had to
reboot it. Unfortunately that also means we had to
kill all the jobs on the nodes to prevent new jobs
from being scheduled on nodes running old jobs. Sorry
for the inconvenience.
|
| |
Posted by jkitchin,
Sun Mar 2 13:09:25 2008 |
The cluster is only partially up. The servers are
running, but there is insufficient cooling and power
to run many of the nodes right now. Hopefully this is
fixed Monday.
|
| |
Posted by jkitchin,
Fri Feb 29 08:24:21 2008 |
The cluster will be shut down in a few minutes. Please
log out.
|
| |
Posted by jkitchin,
Fri Feb 15 08:11:10 2008 |
The cluster is planned to be shut down early in the
morning on Friday Feb. 29. It will be moved to a new
location temporarily during the renovation. All jobs
will be killed at that time. Please copy all data you
want to a local machine as I am unsure how long it
will take to move the cluster and get it back up.
Hopefully only 2-3 days.
|
| |
Posted by jkitchin,
Sat Jan 19 17:07:47 2008 |
All guest users should submit their jobs to the guest
queue. This is done by:
    qsub -q guest -l cput=...,mem=... yourjob.sh
Jobs found in other queues will be removed.
|
| |
Posted by jkitchin,
Tue Jan 1 13:49:51 2008 |
You can submit jobs now.
|
| |
Posted by jkitchin,
Sat Dec 29 08:49:29 2007 |
Since no new jobs can run anyway, the queues are
disabled. After all jobs finish I will do some cluster
maintenance before re-enabling the queues.
|
| |
Posted by jkitchin,
Sat Dec 29 08:46:46 2007 |
Jobs are not currently running because our PBS license
has expired. I requested new ones a few weeks ago and
hopefully they will arrive next week.
|
| |
Posted by jkitchin,
Sat Dec 15 13:14:15 2007 |
Cabinet 1 has been taken offline for maintenance. Running
jobs will be allowed to continue until Monday Dec 17 so
they can complete. At that point any jobs still running on
Cabinet 1 nodes will be killed so maintenance can be done.
|
| |
Posted by jkitchin,
Thu Dec 13 20:40:41 2007 |
There was some difficulty restarting the PBS server
today. Some jobs may have been lost.
|
| |
Posted by root,
Sun Nov 25 14:47:21 2007 |
The belt on the main cooling unit failed last night,
resulting in a cluster shutdown. The belt has been
repaired and the cluster nodes are back up.
|
| |
Posted by root,
Sun Nov 25 11:25:13 2007 |
You can't delete suspended jobs right now because the
cluster nodes are not on. It is not clear FMS will fix
this today, so it may be late tomorrow before everything
works again.
|
| |
Posted by root,
Sun Nov 25 10:35:12 2007 |
Please do not add more jobs to the queue system at
this time. Many jobs triggered the MPI status due to
their memory usage and will have to be deleted and the
queue restarted. It may happen that all of them get
deleted.
|
| |
Posted by root,
Sun Nov 25 10:32:31 2007 |
The main cooling unit is not on, so the cluster room
overheated and the cluster shut down. We will call FMS
tomorrow to have it looked at.
|
| |
Posted by root,
Fri Oct 19 07:57:24 2007 |
Please log out immediately. We are performing
maintenance on the file server and any changes to your
home directory may not be saved.
|
| |
Posted by root,
Mon Oct 15 07:47:13 2007 |
On Saturday, October 20 Doherty Hall will not have
electrical power from 6am to 6pm. We will shut the
beowulf cluster down at midnight on Friday. Please do
not submit jobs that will run past this time.
|
| |
Posted by root,
Sun Oct 14 17:55:09 2007 |
The PBS server was restarted again due to beowulf shutting
down after the UPS was turned off. Sorry for the
inconvenience.
|
| |
Posted by root,
Thu Oct 11 17:23:38 2007 |
The PBS system had to be restarted Oct 11 after
beowulf appeared to be frozen. The reason is still
unknown, but the result is all jobs were lost. Please
resubmit them. Sorry for the inconvenience.
|
| |
Posted by steinhau,
Mon Jun 4 17:17:11 2007 |
Cabinets #1 and #2 have been shut down while the DH
renovation project affects the cluster room.
Hopefully, this will allow the rest of the computers
to be minimally affected by the temporary losses of
cooling seen and/or expected.
|
| |
Posted by steinhau,
Mon May 28 12:12:33 2007 |
The batch system was restarted at noon on Mon May 28
due to a fileserver error condition. All running jobs
were removed; sorry.
|
| |
Posted by steinhau,
Tue May 22 13:08:22 2007 |
FMS has added to the list of complete building
electrical outages as part of their tests. DH will be
without power 05/22 *and* 05/23 between 7pm and 3am.
All cluster machines will be shut down around 5pm today
(May 22). The servers will be restarted tomorrow
morning, but the nodes will not be put back online
again until Thu morning (May 24).
|
| |
Posted by steinhau,
Fri Dec 10 23:08:15 2004 |
A new qsub script has been implemented for batch jobs.
- all in-script PBS commands are ignored
- resource control is more strictly enforced
- a new '-vcpu' option allows for large memory jobs
Type "qsub" to see the syntax; for details, see note #4 in
http://beowulf.cheme.cmu.edu/cgi-bin/unixhelp.cgi#chapter_09
|
| |
Posted by steinhau,
Mon Jul 19 11:32:45 2004 |
New doc for remote login access; see Chpt 2 of
http://beowulf.cheme.cmu.edu/cgi-bin/unixhelp.cgi
|