******* 6.4.3 Known bug list
* DAGMan doesn't detect when users mistakenly specify two DAG nodes with the same node name; instead it waits for the same node to complete twice, which never happens, and so DAGMan goes off into never-never land [see condor-admin 4502].
****** 6.4.3&1 things that need to be confirmed
- Better packaging of "security stuff"
- Adding debugging information about NFS permission problems & an FAQ entry
- Bitmask ordering preservation of the methods of authentication
- Robust parsing of the inherited socket string (for compatibility)
- Java universe "space in the pathname" fix. [ fixed ]
- Windows crash problem in the startd
- error propagation from authentication failure
- timeouts from the starter
- classads that take too long
- better UI for condor_rm -constraint
- fix the old-classad core dump when committing the transaction after actOnJobs in the schedd
- switch condor_submit_dag to a C program
******* new features in the 6.3.2 release that never got implemented
- authenticated syscall socket for the standard universe
- local rusage is always 0 in the new shadow/starter
- new shadow email sending needs to be made coherent (what should the different notification levels really mean?)
- new messages in condor_(rm|hold|release) when using constraints
- new-classad condor_q built, included in tarballs, etc.
- Daemon object correctness bug when getting the version of a daemon that isn't the one param() thinks it is
- speedier negotiation (aka oracle)
******* 6.2.1 Known bug list
?? ??  + When the startd is trying to get starter ClassAd info and the underlying starter binary can't be executed, popen() just makes it look like there's no data; you don't actually see that the exec() failed (since it didn't, /bin/sh exec'ed just fine *sigh*). So either we need to use DC pipes for this, or do some extra checking, so we can print a more readable and serious-looking error message when we can't exec() a given starter. See condor-admin #4754 for details.
DW 1 Day  + fix bug in the old starter w/ transferring/creating core files (for sure on Solaris, possibly other platforms) when running as root.
?? ??  + track down remaining wrong dprintf()'s in the startd that are causing segfaults. Tough, because we rarely see it.
DT ??  + We're missing some FORTRAN system calls. See Condor-Admin 2932 and 2955.
?? ??  + /dev/console doesn't do what we expect on Linux anymore... we need to find a way to detect physical console keyboard activity again.
PK ??  + if there are more than about 10000 files that condor_submit/condor_schedd has to check for writability, then things explode. You can use the -d option to condor_submit to turn off the feature, but it needs fixing.
?? ??  + CondorLoadAvg is bogus on SMP. It's just bogus everywhere.
?? ??  + if you put 'getenv = TRUE' in your submit file and your environment contains double quotes or newlines, submit will fail with a parse error.
?? ??  + You cannot pass spaces as part of an argument in the Arguments attribute in a command file. This should be fixed. [condor-admin #590]
?? ??  + condor_kbdd doesn't work properly under Compaq Tru64 5.1, and as a result, resources may not leave the ``Unclaimed'' state regardless of keyboard or pty activity. Compaq Tru64 5.0a and earlier appear to work properly. (condor-support #408)
?? ??  + condor_status doesn't handle this command correctly: "condor_status -sort loadavg -sort enteredcurrentactivity"; the enteredcurrentactivity column is not sorted correctly. (A chained-comparison sketch appears a couple of items below.)
?? ??  + pvm_spawn(,,PvmTaskDefault,,,) doesn't work in Condor-PVM [condor-admin #2217]
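  For the condor_status multi-key sort bug mentioned a couple of items above, the fix amounts to a chained comparison: consult the secondary key only when the primary keys are equal. A minimal sketch (not the actual condor_status code), assuming the ads are held in a vector of a hypothetical Ad struct:

      #include <algorithm>
      #include <vector>

      struct Ad {                          // hypothetical stand-in for a machine ad
          double loadavg;
          long   entered_current_activity;
      };

      // Sort on LoadAvg first; fall back to EnteredCurrentActivity only on ties.
      void sort_ads(std::vector<Ad>& ads) {
          std::sort(ads.begin(), ads.end(), [](const Ad& a, const Ad& b) {
              if (a.loadavg != b.loadavg)
                  return a.loadavg < b.loadavg;
              return a.entered_current_activity < b.entered_current_activity;
          });
      }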
?? ??  + Condor doesn't support pvm_spawn() calls that request that more than one worker be spawned (i.e., the fifth argument to pvm_spawn() currently must always be equal to one) [condor-admin #2244]
?? ??  + PVM workers don't start up with the job's working directory or environment variables. The PVM_EXPORT environment variable is ignored. [condor-admin #2217]
?? ??  + ProcAPI calls seem to fail a lot on Irix 6.5:
           "ProcAPI: Error opening /proc/58274, errno: 11" [condor-admin #2217]
           "ProcAPI: Error opening /proc/2013144, errno: 145" (seen in NCSA pool)
PK ??  + scheduler universe jobs (e.g. dagman jobs) still send Condor mail from root
?? ??  + startd seg faulted w/ the PVM req_new_proc command -- ProcAPI bug?
?? ??  + If the user job changes umask(), that affects the shadow itself.
?? ??  + utimes() does not work reliably on Solaris, maybe other platforms.
?? ??  + printer.remote prints duplicate lines in its .err file.
?? ??  + threads.C on DUX is broken with the vendor C++ compiler.
?? ??  + Memory problem in the negotiator
?? ??  + On some platforms, the code to lock the MasterLog to ensure there are not multiple masters running on the same machine does not work.
?? ??  + if the condor developers collector changes its IP address, we never know about it and keep sending world ads off to who knows where.
?? ??  + condor_master confusion: if you do a condor_off and a child daemon does not exit before you do a condor_on again, the daemon is never restarted.
?? ??  + constructor.C on IRIX 6.2 with g++ is broken. Must use gcc's linker and not the system's, or figure out a way to use the system's.
?? ??  + fix the problem of the CondorView applet not graphing info when there is no data, thus making the x-axis time line non-linear (see condor-support #242). This is largely a symptom of the lame/wrong behavior of condor_stats, which does not report any records which have all zeros in the fields.
PC 2 Days  + Fix our released API (libcondorapi.a) to be C linkage so people can use the log event processing stuff nicely. [condor-support #298]
?? ??  + in Fortran, the etime() function call does not work because it opens /proc and does ioctl PICOGETUSAGE, which of course fails. Should we support it? [condor-admin #1504]
?? ??  + Submitting a vanilla job with 'requirements = memory > 1024' into a pool with machines that have greater than 1024 MB of RAM just sits idle in the queue. Appending '&& opsys == "SOLARIS57"' (for a pool that has such) to the requirements allows the expression to be evaluated correctly.
?? ??  + condor_compile is fooled by putting main() in a user-supplied static library which goes on the link line [condor-admin #1358]. E.g., "gcc user_main.o mylib.a" where mylib.a contains main() and user_main contains a function that calls main().
?? ??  + When WANT_SUSPEND evaluates to undefined in the startd, it probably shouldn't EXCEPT and complain about "WANT_SUSPEND not defined in internal classad", since maybe it just references some attributes that aren't in the job ad. What, exactly, it should do instead is not yet clear to me, but what it does now should be changed.
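  For the WANT_SUSPEND item just above, one option is to treat an UNDEFINED evaluation as an answer rather than an internal error. A minimal sketch, assuming a three-valued evaluation result; the names here are hypothetical, not the startd's actual code:

      // Hypothetical three-valued result of evaluating a policy expression.
      enum EvalResult { EVAL_TRUE, EVAL_FALSE, EVAL_UNDEFINED };

      // Instead of EXCEPTing on UNDEFINED, fall back to a conservative default
      // (here: don't suspend) and leave it to the caller to log a warning so
      // the admin can fix the expression.
      bool want_suspend(EvalResult r) {
          switch (r) {
          case EVAL_TRUE:      return true;
          case EVAL_FALSE:     return false;
          case EVAL_UNDEFINED:
          default:             return false;   // conservative default, not an EXCEPT
          }
      }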
?? Done/1 Day  [Entered into the bug tracker as #89] + fix condor_submit so it warns you about things in your submit file that it is ignoring (whereas now it just silently ignores them). (Will did a lot of this, but no idea where it is.)
JF 4 Hours  + Add DUROC support to the Grid Manager
JF 4 Hours  + Allow the submit file to override the RSL string we construct, if it starts with an '&' or a '+' (for DUROC)
TT 1 Day  + Refresh the proxy on the remote machine (shut down and restart job managers...)
?? ??  + a Daemon Core checkpoint server. No new functionality, but we'll need to beat on this to get Kerberos support into the checkpoint server.
DW 2 hours  + condor_install should ask about DedicatedScheduler and set it for you!
DW ??  + Support multiple executables in an MPI job
?? ??  + Rewrite of MPI startup methods. Support MPICH-P4, MPICH-MPD, LAM, MPICH/NT, MPI/Pro
?? ??  + Submit/RM API. Basically hack up condor_submit.
DX ??  + View Server indexing & speedups. Also condor_history
DW ??  + the dedicated schedd should check DedicatedScheduler in the config file. If it doesn't match itself, and someone tries to submit an MPI job to it, it should refuse in such a way that condor_submit can tell the user to go submit to the right schedd.
?? ??  + idle_time calculations take 2 calls to stat() for every file, and we're stat()ing tons of files we don't need to be. We should optimize all this crap and clean it up.
?? ??  + add a config file hook for specifying which devices to check for idle_time.
+ sysapi should detect cyclical tty activity (i.e. elm updates, etc). [condor-admin #868]
ZM ??  + master agents: (a) transfer log, config, history files, (b) remote "top"
CS ??  + FileTransfer stuff should be directory aware (if a file you specify is really a directory, we should recursively deal with it).
+ Trap calls to CreateProcess on NT
+ Support core files for NT user jobs
+ Inject a DLL to handle the soft-kill signal on NT
TT ??  + Store the user's password encrypted in the registry, use NT magic to transfer it about
TT ??  + Run as the user on NT, instead of as nobody
TT ??  + Port the Scheduler Universe and DAGMan to NT
DW ??  + Something other than FIFO scheduling for dedicated resources
?? ??  + Create a "condor_last" tool, so the startd logs every job that is spawned there (user, command, date/time, anything else?), and condor_last allows you to query this information
PK ??  + New Shadow for the PVM Universe
?? ??  + New Shadow for the Standard universe
+ Sonny's firewall support
+ condor_vacate, condor_checkpoint
+ put the job in the HOLD state for long-term errors (and/or EXCEPTs in remote syscalls), with logic to avoid the "black hole" syndrome where one misbehaving machine in the pool can bring down the whole show (see [condor-admin #1254]). See also Pete Keller's long wish at the end.
+ future reservations in the dedicated scheduler
+ startd enforces resource limits on jobs, such as CPU time or memory on a machine. [condor-admin #846]
+ shadow spits out the image size into the user log on a job-completed event.
+ First Toolcore-based tool: command-line arg parsing, using the DaemonCore command client, consistent error reporting, usage message format, etc, etc.
+ On reconfig, daemons should check NETWORK_INTERFACE and bind their command socket to the new interface(s) if it changed [condor-admin #2217].
?? 4 Days  + Add multihomed support to Condor (see RUST [condor-admin #1761]).
+ the daemon core command port should try to bind to all interfaces by default, and only bind to a specific interface if NETWORK_INTERFACE is specified.
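  For the command-port item just above, "bind to all interfaces by default" is essentially the difference between binding to INADDR_ANY and binding to one specific address. A minimal sketch with plain sockets (not DaemonCore), with the NETWORK_INTERFACE handling simplified to a single dotted-quad string:

      #include <arpa/inet.h>
      #include <netinet/in.h>
      #include <string.h>
      #include <sys/socket.h>
      #include <unistd.h>

      // Bind a command socket: to every interface by default, or to the single
      // address named by NETWORK_INTERFACE when the admin has set one.
      int bind_command_port(int port, const char* network_interface /* may be NULL */) {
          int fd = socket(AF_INET, SOCK_STREAM, 0);
          if (fd < 0) return -1;

          struct sockaddr_in addr;
          memset(&addr, 0, sizeof(addr));
          addr.sin_family = AF_INET;
          addr.sin_port   = htons(port);
          addr.sin_addr.s_addr = network_interface
              ? inet_addr(network_interface)   // specific interface requested
              : htonl(INADDR_ANY);             // default: all interfaces

          if (bind(fd, (struct sockaddr*)&addr, sizeof(addr)) < 0) {
              close(fd);
              return -1;
          }
          return fd;
      }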
+ notify_user = file: put the job completion e-mail into a user-specified file.
+ some way in the config file to append email addresses to the notify_user list so admins can also get job completion emails [condor-admin #5352]
+ Network file system access under Windows NT
+ Add an -all option to The Tool, e.g. "condor_off -all" would shut down your whole pool. :^)
- SCHEDD_CLAIM_LINGER parameter that tells the schedd to hold on to idle claims for a given number of seconds before relinquishing them. Each time a new job is submitted, the schedd sees if it can run the job with one of the lingering claims. This can improve the efficiency of MW-style applications that wait for a job to exit, do some quick processing of the results, and then submit a new job.
+ SMP support is lame. The different VMs should be able to access information about the other VMs, the other VMs' job classads, etc, etc. Some (very reasonable) policies that users want (e.g. condor-admin #4762) are impossible with our current religion in the startd.
+ we should put the instance-lock functionality from the master into at least the startd, maybe even DaemonCore itself. It's lame that you can accidentally start up 2 startds w/ the same execute dir, 2 schedds with the same spool, etc, etc.
****** V6_3 items looking for a good version...
+ condor_submit segfaults with a large environment string (*might* already be fixed, but needs checking). [condor-admin #2196]
+? There needs to be a way to disable the non-root/non-condor OwnerCheck stuff, other than setting QUEUE_SUPER_USERS.
+ Add a more robust idle_time tester function in the sysapi tester.
+ Need to put support for checksums into Condor: checksum the builds so people can verify they have legit binaries, give a tool so people can do that easily, etc. Ultimately, we want the master to verify checksums before it'll spawn any processes...
+ Add a "PreemptReason" attribute to the startd classad when preempting.
+ Compress logs once they're rotated.
+ On reconfig, if there's an error, don't EXCEPT; go back to the last good config.
+T Portland Group Compilers
+ if there's no MasterLog, and you run w/ -f -t, you EXCEPT b/c you can't get a lock on a file that's not there!
+D SMP admin tool to handle (a) reservations, (b) configuration
+ condor_report_bug script, which sends email to condor-admin/condor-support along with relevant sections of all
+ master handles DaemonList change
+ soft-kill jobs when the user does a condor_rm (currently we hard-kill)
+ need to specify a separate rm soft-kill signal, distinct from the regular "shut down gracefully, but we'll start you again" soft-kill. (Works for the scheduler universe right now; needs a little more work for the other universes.)
+ Initial executable can be stored on the checkpoint server
+ separate downloadable condor link kit (libs + condor_compile)
+ clean up ATTR_SERVER_TIME for use by condor_status -state, so correct numbers are displayed even if time across the pool and central manager is not synchronized
+ scheduler universe API
+ submit more jobs to an existing cluster
+ ?? stdin/out/err set up by the starter to go to the shadow, even for VANILLA jobs. This would allow VANILLA jobs which just need stdin/out/err to not require a shared file system; it would also allow stderr messages sent before our MAIN() runs in STANDARD jobs to appear.
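  For the stdin/out/err item just above, the starter-side half is mostly dup2() before the exec. A minimal sketch (not the actual starter code), assuming the starter has already opened descriptors connected back to the shadow; the in_fd/out_fd/err_fd names are hypothetical:

      #include <sys/types.h>
      #include <unistd.h>

      // Wire the job's standard streams to descriptors the starter has already
      // connected to the shadow, then exec the job. Returns the child's pid,
      // or stays in the child and _exit()s if the exec fails.
      pid_t spawn_job_with_remote_stdio(const char* job_path, char* const argv[],
                                        int in_fd, int out_fd, int err_fd) {
          pid_t pid = fork();
          if (pid == 0) {                       // child: becomes the user job
              dup2(in_fd,  STDIN_FILENO);
              dup2(out_fd, STDOUT_FILENO);
              dup2(err_fd, STDERR_FILENO);
              close(in_fd); close(out_fd); close(err_fd);
              execv(job_path, argv);
              _exit(127);                       // exec failed
          }
          return pid;                           // parent: starter keeps running
      }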
+ Condor.pm should include a whole-cluster exit callback [condor-admin #2911]
+ tools for clusters:
  - include a distributed make for Condor
  - include a distributed telnet
  - include a distributed rsh (or at least condor_submit -block)
+ Java Universe support
+ Automatic cleaning of old entries in the history.
+ FS domain != DNS domain stuff.
DT ??  + On restart, we should truncate files we've opened for writing to the size they were at checkpoint time.
+ user-visible condor_submit improvements: command line args, environment defaults, ...
+ Put benchmarking in a separate thread or process. Having the whole SMP startd blocked whenever a VM wants to benchmark is a real pain.
+ bug: Fix getrusage()
+ Add DC acks for DC permission problems, etc. Create a CommandSock derived from ReliSock, use the high bit of the command int to specify if you want an ack, etc.
+ Checkpoint compression and related stuff is broken under Linux, possibly other OSes.
+ bug: When setting up a Personal Condor, if your CONDOR_HOST macro is set to garbage, the master segfaults while trying to update the collector ad.
+ A simple submit-file syntax for heterogeneous submit that hides all the nastiness of requirements expressions with LastCkptArch, etc.
+ When the startd is going to preempt a claim, run a user-specified command.
+ remove MAX_VIRTUAL_MACHINE_TYPES in the SMP startd, and just use an ExtArray.
+ better interface to condor_rm so you can say "condor_rm 1280-1290" to remove all clusters from 1280 to 1290. [condor-admin #3849]
******* V6.5
+ fix the "BindAnyCommandPort failed" message in the daemons so that when this command fails, it tells you why. See [condor-support #632]
+ cache group lookups in _set_priv so when you're changing between users you don't pound the crap out of the NIS master. (A small caching sketch appears at the end of this section.)
+ use new classads everywhere
+ use new classad collections everywhere
+ collector can handle new ad types without needing to change the source
+ rewrite the schedd
+ new world order for remote system calls (Doug's work, Unix<->NT integration)
+ improve the checkpoint protocol (i.e. a classad with meta info is stored with the checkpoint, allow files to be stored along with the checkpoint, etc). In addition to this, it would be nice if you could request a checkpoint for a jobid, and it would produce for you a tar file that contains an initial checkpoint, a current checkpoint, and all files necessary to restore the checkpoint. Then you could replace a job that left the queue prematurely with whatever is contained in the archive, with handy tools that would need to be written for it. [condor-admin #1386] is part of this larger picture.
+ central manager automatic failover
+ condor_send_signal
+ FileTransfer object maintains a local cache
+ ?? do not need to re-exec the shadow and starter every time (for short-running jobs). !!! Fix getrusage() if this occurs....
+ schedd can hold a claim on a resource for a specified period of time
+ centralized config server
+ config file is a classad
+ submit file language/syntax should be a classad
+ support in the new starter for vendor checkpointing (e.g. IRIX kernel checkpointing, Linux kernel checkpointing)
+ users can explicitly state want_checkpoint, and/or want_remotesyscall, and/or neither.
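  The _set_priv group-lookup item earlier in this section is essentially a memoization problem: look the groups up once per username and reuse the answer. A minimal sketch, assuming Linux getpwnam()/getgrouplist(); the GroupCache name is hypothetical and this is not the actual _set_priv code:

      #include <grp.h>
      #include <pwd.h>
      #include <map>
      #include <string>
      #include <vector>

      struct GroupCache {
          std::map<std::string, std::vector<gid_t> > cache;

          // Return the group list for 'user', hitting NIS/passwd only on the
          // first call for each username.
          const std::vector<gid_t>& lookup(const std::string& user) {
              std::map<std::string, std::vector<gid_t> >::iterator it = cache.find(user);
              if (it != cache.end()) return it->second;      // cache hit: no NIS traffic

              std::vector<gid_t> groups;
              struct passwd* pw = getpwnam(user.c_str());
              if (pw) {
                  int n = 16;
                  groups.resize(n);
                  if (getgrouplist(user.c_str(), pw->pw_gid, &groups[0], &n) == -1) {
                      groups.resize(n);                       // buffer too small; retry
                      getgrouplist(user.c_str(), pw->pw_gid, &groups[0], &n);
                  }
                  groups.resize(n);
              }
              return cache[user] = groups;                    // remember for next time
          }
      };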
******* Undecided which version:
+ In Condor-PVM, an infinite message loop happened that looked like this:
    3/25 09:24:33 (445.0) (1072):PVMd message is SM_CONFIG from t40002
    3/25 09:24:33 (450.0) (5618):PVMd message is SM_CONFIG from t40002
    3/25 09:24:33 (445.0) (1072):PVMd message is SM_CONFIG from t40002
    3/25 09:24:33 (450.0) (5618):PVMd message is SM_CONFIG from t40002
    3/25 09:24:33 (445.0) (1072):PVMd message is SM_CONFIG from t40002
    3/25 09:24:33 (450.0) (5618):PVMd message is SM_CONFIG from t40002
    3/25 09:24:33 (445.0) (1072):PVMd message is SM_CONFIG from t40002
  It appears that neither of the two processes wanted to deal with the message, and so it kept bouncing between them...
+ Add support in the standard universe for the ABSoft Fortran compiler.
+ Condor-PVM ignores the soft-kill signal, and it is a TON of work to fix. See admin rust #5439
+ Condor totally ignores tty activity on Linux machines with a USB keyboard. admin rust #5288
+ The standard universe library says a user can open a path up to the POSIX max path length. This is incorrect. Because we can add things like buffer:, compress:, ftp:, or combinations thereof, we shortchange the path length a user is actually allowed.
+ Implement (or see if we can) support for the Intel Fortran Compiler with condor_compile. See admin rust #5358
+ condor_c++_util/email.C: When trying to figure out a domain name for the EMAIL_DOMAIN, if all attempts to determine a domain name fail, then we will free a null pointer. Don't have time to fix now, so I'll just document it.
+ from Miron (2003-04-08): the Userlog-reader API should keep an event count and allow you to fast-forward/rewind N events
+ bug as described in [condor-admin #2452]
+ bug as described in [condor-admin #2769]
+ bug as described in [condor-admin #2888]. Might be a non-Red Hat issue, but I remember seeing this type of error in something else I couldn't reproduce locally, so it might be legit.
+ Make it so that Condor can notice when the whole system begins to have problems and then e-mail people about it. For example, a shadow notices that it can't ever talk to the checkpoint server, so it will send mail to an e-mail location about it.
+ (V6.1.13?) Add -constraint to as many programs as it makes sense for.
+ (V6.5?) Documented, clean APIs for external consumption:
  - collector query API
  - user job API (get info on resource, checkpoint, remap files, etc)
  - job submit and queue edit API
  - ??? Sched<->Shadow API so users can write their own shadows
+ (V6.3?) shadow gets the startd ad, and thereby we allow the job ad to refer to attributes in the startd ad
+ (V6.5) Schedd<->Negotiator protocol revisit: spinning-pie algorithm implemented with multiple pies, negotiator asks the schedd only for attributes that matter (signature), negotiator tells the schedd _why_ a match failed, etc, etc.
+ checkpoint/remote syscalls on WinNT
+ (V6.2?) make Condor a Red Hat RPM and/or a Debian installation package [condor-admin #151]
+ Fix the access(2) check in condor_daemoncore so that we have one that works off of the EUID instead of the real UID.
+ printf-compatible format specification language which can be used with classad values [condor-support #257]
+ have the starter record in its logs the uncommitted time of a job in addition to the committed time when a job vacates.
+ make sure f90 works on IRIX machines. [condor-admin #1174]
+ Make the collector able to bind to other ports. This could optionally be a COLLECTOR_PORT type thing; better yet, a portmapper could be developed to handle things magically.
  Nick & Pete have some (IMHO) good ideas for this.
* (v6.5?) Fix the cached IP address of the collector; currently, if the IP address of the collector changes, the startd (and probably everything else) updates will start silently going to the Great BitBucket in the Sky(tm).
  - Modify all of the server code to use a common update block of code, probably some intelligent object
  - Create a new sinful string object to manage these things, time itself out, etc. This should be used by the above.
******* Wish List:
- condor_q should try to print out its data as it gets it from the schedd, instead of waiting till the end. This way, if your queue is enormous, you start to see something right away, instead of waiting a few minutes for any output from the command. This would prevent any sorting, but it may be useful to some people... [admin #3841]
- the priv state code should do some caching. For instance, on Unix, every time init_user_ids() is called, we do a getpwent(). The result of this call should be cached by username (if not, we pound NIS on some platforms -- RUST from biostat). Another example: on Windows, we should cache the handle to the login.
- when perl has been around long enough, add this patch: [condor-admin #2019]. It looks like it requires perl 5_003, which not everyone on the architectures we support might have as of 2002.
- make condor_userprio and friends take -constraint
- in the job ad of standard universe jobs, the CondorVersion attribute should be the version of the syscall library linked in, not the version of condor_submit used. In general, perhaps we want "schedd_version", "syscall_lib_version", and maybe even "shadow_version" or something.
- condor_install should set up flocking.
- Support for suspending claims and/or activations in the startd for running short, high-priority jobs quickly without preempting currently running jobs. If a user submits a high-priority job, the schedd could suspend one of that user's currently running jobs and use the claim to run the new job. When the new job completes, the schedd unsuspends the previous job. Likewise, if the negotiator preempts a job to run a higher-priority job, we suspend the preempted job until the higher-priority job completes. We'd obviously only want this behavior in some scenarios, so we'd need to be able to control it in the startd and schedd with expressions that refer to job attributes. We could use this feature at NCSA on the Exemplar to run short, high-priority ChemViz jobs with long-running, lower-priority jobs that can't checkpoint.
- Optimize ProcAPI and ProcFamily. On modi4.ncsa.uiuc.edu, it takes over 3 seconds to complete a single call to ProcFamily::takesnapshot(). The master calls ProcFamily::takesnapshot() for each child process every minute, so with just two children, it spends over 10% of its time in that call! It would help if the ProcAPI/ProcFamily objects didn't rebuild the process tree from scratch for each child process. Maybe the tree could be a static class member that gets rebuilt at most once every 10 seconds. On each call, check the last update time of the tree to see if it needs to be rebuilt. Also, on Irix and Solaris, ProcAPI::getProcInfo() makes 3 ioctl() calls: PIOCPSINFO, PIOCUSAGE, and PIOCSTATUS. The PIOCSTATUS call seems redundant; the PIOCUSAGE call returns the same information. Since ProcAPI::getProcInfo() can be called thousands of times in a single call to ProcAPI::buildProcInfoList() on a large machine, we should be sure we're doing things as efficiently as possible in that method.
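  A minimal sketch of the "rebuild at most once every 10 seconds" idea from the item above, with hypothetical ProcessTree/build_process_tree() stand-ins rather than the real ProcAPI/ProcFamily code:

      #include <stddef.h>
      #include <time.h>

      struct ProcessTree { /* hypothetical stand-in for the snapshot data */ };

      // Hypothetical: in the real code this would be one full walk of /proc.
      static ProcessTree* build_process_tree() { return new ProcessTree(); }

      class ProcFamilySketch {
          static ProcessTree* tree;        // shared by every ProcFamily object
          static time_t       tree_built;  // when the tree was last rebuilt
      public:
          static const int REBUILD_INTERVAL = 10;   // seconds between rebuilds

          // Return the cached tree, rebuilding it at most once per interval,
          // so per-child snapshot calls stop re-walking the whole process list.
          static ProcessTree* snapshot() {
              time_t now = time(NULL);
              if (tree == NULL || now - tree_built >= REBUILD_INTERVAL) {
                  delete tree;
                  tree = build_process_tree();
                  tree_built = now;
              }
              return tree;
          }
      };

      ProcessTree* ProcFamilySketch::tree = NULL;
      time_t       ProcFamilySketch::tree_built = 0;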
- condor_compile could display a message about the version compatibility of the binary it's producing. This would be particularly useful for remote submitters compiling up binaries for many architectures, so they can make sure all their binaries will agree with the shadow version on their submit machine. Just spitting out the current Condor version number would help.
- If port 9618 is unavailable, the condor_collector will happily bind to a random port. This is totally broken, since no other daemons have any way to know about the new port. The most common case is incorrectly starting a second condor_collector while another is running. The correct behavior is to emit a complaint to our log files, perhaps also to standard out, and then exit. This bug might be present in condor_negotiator as well. CORRECTION: condor_collector requires "-p 9618" to ensure that it binds to the right port. Otherwise it defaults to a random port. This is surprising to users who try to start a collector by hand.
- The user's umask should follow the job around. This is particularly relevant for vanilla universe jobs. You can work around it by wrapping the job in a shell script to set the umask, but that's clumsy.
- If the condor_startd can't launch a condor_starter because of permissions problems, it currently silently fails. The condor_startd should emit warnings in this situation, perhaps once when the condor_startd starts, and once whenever it fails to launch the condor_starter.
- Create condor_backup and condor_restore. condor_backup would allow a user to pull the last checkpoint of his job down to a local file that he can back up himself. condor_restore would reinject that backup into the pool. This would give users control over their own backups here at the UW (the current checkpoint servers are not backed up), and would allow paranoid users to keep multiple backups at various points in the program's lifetime. (Based on general agreement at the 2002-11-08 staff meeting.)
- Whereas,
  + the proliferation of "Personal Condors" and "Condor-G"s is growing, alongside the proliferation of traditional Condor pools, and;
  + users are frequently managing more than one condor_config for more than one type of Condor installation, and;
  + it is often confusing for them (and sometimes for us), when they run condor_master, to know which condor_config it's picking up, or even that there is a possibility of there being more than one to pick up, and;
  + be it resolved that condor_master should print, when it is run, something like the following:
      starting condor daemons using /foo/bar/baz/condor_config as user "foo"...
      (ignoring alternate condor_config(s) in /bar/baz/bat and /bat/baz/bar)
      (to use an alternate condor_config, set CONDOR_CONFIG in your environment)
    ...unless a "-q" option is given, telling the master to be quiet. (From an Oct 14 email by pfc, and generally agreed to be a Good Idea.)
- Add NumberOfExceptions to job ads. This will allow, among other things, machines to prefer jobs with fewer exceptions. Reasoning: it's possible for a job to enter a cycle of endlessly excepting. A set of these jobs can constantly tie up execute machines, blocking jobs that can actually make forward progress. Once in place, other useful policies can be added (perhaps holding or even removing a job with too many exceptions). For example, thanks to the magic of Globus, occasionally Condor will be managing a vanilla job whose standard output and error are now and forever unwritable.
  The job starts, is unable to write the file, excepts, reenters the queue, and the cycle repeats. This tied up most of a large cluster for a while before it was hunted down.
- Have condor_submit do better error-checking on the "queue" statement [condor-admin #7313]
EP 1 Day  - Fix up condor_version
- If the spool, log, and execute directories don't exist, Condor should try to create them. This could save a step in the installation process, which would mean a lot for people administering Condor on large clusters of machines.
- It would be sweet if, when a program explodes, it gave us a stack trace in an e-mail to condor-system, or just did something more useful than exiting without a core file.
- allow running vanilla jobs to be saved shortly after a schedd crash/shutdown
- Users and administrators should be able to specify a maximum run time (wall clock or CPU time) for each job, using either a config file parameter or a submit file macro (job ClassAd attribute). The schedd would check the run times periodically, kill any jobs that have exceeded their maximums, and send email to the user. We would use this (for example) in the NCSA ChemViz web portal (chemviz.ncsa.uiuc.edu) to ensure that users' jobs don't run for more than 30 minutes.
- Schedd timeouts need to be re-evaluated. One example: if a bunch of machines are rebooted or there is a network outage, a bunch of shadows will except, and the schedd will get stuck in a tight loop in child_exit() trying to relinquish the matches. Each connect for each relinquish command will take 45 minutes to time out and fail. In this case, the schedd should either call DaemonCore::ServiceCommandSocket() somewhere in this loop, or the relinquish command should have a shorter timeout or should be UDP. The relinquish command could safely be UDP, since if it's lost, the match will time out anyway because the schedd won't send any more keepalives.
- Clean up the CondorView ckpt server module so it gets statistics directly from the ckpt server instead of using ftp/scp to grab the TransferLog file.
- Fix the startd VirtualMemory attribute so it actually reports available virtual memory instead of available swap space. For example, if an idle Linux machine has 4GB RAM and a 1GB swap file, we shouldn't restrict jobs to less than 1GB if there is in fact over 1GB of free RAM. At least on Linux, VirtualMemory should be MemFree+SwapFree from /proc/meminfo.
- Make separate condor_history status letters for C (completed, 0 exit status) and E (error, completed with non-0 exit status) [condor-admin #2081]
- Use NIS netgroups for some config file attributes. [condor-admin #2027]
- Set up the ability for hot-failover CMs. This ticket describes a quick hack; however, a better solution is needed. [condor-admin #1255]
- possibly pass more information on the subject line of a job completion email, such as the exit reason and status. [condor-admin #1126]
- Have configurable turn off/on times for any daemons, which the master enforces. [condor-admin #820]
- An updated condor_preen that allows selective removal of the history files based on cluster date. It would be great if it were possible to specify how many days of history to keep at once. [condor-admin #619] (Is this already done? This ticket is from 1998! -psilord)
- Re-enable dynamically linked user jobs on Linux
- Have a way for a user job to determine at run time which version of the syscall lib it is linked with. Possibly extend the VersionObject into user space somehow.
  This is to make it easier for users to work around development-series bugs in our syscall libraries.
- the ALIVE_INTERVAL ping between the schedd/startd allows an execute machine to quickly detect when a submit machine disappears. This should be enhanced to work both ways, so a submit machine can quickly detect when an execute machine disappears (instead of relying on SO_KEEPALIVE, which can take hours).
- condor_q -analyze should be aware of RESERVED_SWAP and tell the user if jobs cannot run because the submit machine is low on swap space.
- When daemons are about to spawn other daemons (schedd -> shadow, for example), the new daemon should be born with stderr dupped to the parent's log file fd, so that we can see error messages from the child in the parent's log file before the child successfully completes dprintf_init(). True, we won't get locking, but better to see the messages than not, even if it means a scrambled-looking log file for the parent right around the time of problems in the child.
- allow administrator-defined priority factor groups (like nice-user), each with a settable prio factor.
- user priority half-life configurable per user
- a pool-wide Condor sysadmin log. When a dprintf(D_SYSADMIN) appeared, it would just send a command to the central manager, which would append to a file.
- checkpoint only dirty pages instead of the entire image.
- overlap periodic checkpoints with computation (i.e., fork before a periodic checkpoint)
- checkpoint server load balancing
- checkpoint server fail-over: if a checkpoint server is down, jobs with checkpoints on that server shouldn't try to start until it's back up (check for a checkpoint server ClassAd?), and jobs which want to write a checkpoint to that server should automatically choose another one instead of retrying the dead server over and over again
- user can specify an "action" in the submit file for when an unsupported system call occurs: either "unsupported", "ignored", or send a SIGABRT.
- job hold state & resubmit with checkpoint: i.e., more robust job error recovery
- schedd/qmgmt support for large queues
- shared library checkpoint and remote system calls
- separated checkpoint and remote system call capabilities
- checkpoint support for mmap, pipes, sockets, X windows, kernel threads, file locks
- intelligent license management
- startd should detect swap and network activity
- metascheduler for jobs with large input/output files
- secure the Condor protocols
- CoCheck support
- accountant
- "remote shadow": shadow runs on the execute machine (instead of the submit machine) if possible (i.e. file access via NFS/AFS etc).
- packaged support for common applications
- partnerships with vendors to get applications linked for checkpointing and remote system calls
- minimize resource usage (esp. memory) in the shadow
- deal with 64-bit pointers properly in our remote syscall lib (esp. for OSF, IRIX, Alpha Linux)
- secure sandbox for the shadow
- secure sandbox provided by the starter to the resource owner (sandbox the user job) ... largely already done, but not the policy part. (see Jim B)
- Ignoring certain _users_ when stating ttys for activity, so that certain users, as desired, won't cause jobs to get suspended, etc.
- limits: CPU limits, memory limits, etc
- better handling of running out of swap space, i.e. migrate the job when malloc fails.... (see Jim B) [condor-admin #326]
- Add a new option to condor_status for -altfmt?
- condor_history should not exist.
  It should be "condor_q -history".
- dprintf change to split bits into subsystem and severity
- the config file should specify a range of ports, and CEDAR's bind() and _condor_local_bind() should both honor that (to help w/ firewalls).
- checkpoint/migrate directly between two machines (for example, when a faster machine is available)
- pre and post commands for jobs, so users can specify a command to be run on the submit machine (1) before the job is run for the first time, (2) each time before the job starts running on an execute machine, (3) each time after the job is preempted from an execute machine, (4) after the job completes
- the startd should fork off a user script which periodically injects dynamic host information into the startd classad. LSF calls this an ELIM (external load index manager). At any time, the script can send a classad expression to stdout; the startd selects on a pipe, reading the stdout of the script and injecting any updates into its ClassAd. It's possible to do this with condor_config_val -set and condor_reconfig, but that's cumbersome and inefficient. This is complicated, please see [condor-admin #856].
- add a configurable upper limit on the total disk space of core files from a particular job [condor-admin #271]
- GUI job editor [condor-admin #1784]
- automatic staging of the master binary on local disk for sites which put binaries on NFS/AFS: condor_init/condor_install should ask where to stage the binary and make the initial copy; the boot script should exec the local master binary; the master should periodically check its binary on NFS/AFS and update the local binary when the version on NFS/AFS changes
- A Group or DefaultGroup ClassAd attribute, akin to Owner, so administrators can define policies in terms of UNIX groups (see condor-admin #1896).
- Add an option to condor_hold so that the job is checkpointed if the job is running.
- Add a "condor_swap_jobs jobId1 jobId2" that replaces a running job (jobId1) with a queued job (jobId2).
- Add a pool monitoring service that reports when machines and/or daemons that should be in the pool aren't. It should send email in cases when the master wouldn't (for example, when the master is down), and it should dump out some HTML pages that display the status of the pool (an overview page with links to pages with details about each problem). The service needs to be redundant so admins will know when monitor(s) are down. It should work with the pool-wide Condor sysadmin log mentioned above, by logging its own D_SYSADMIN events and/or watching the log for events that should generate email or should be displayed on the Condor status web page. See Jim Basney for more details.
- Provide a cron job that periodically checks to see if the master is running and restarts it if it isn't. One option is to have cron start the master periodically and have the master silently exit if it detects another master already running. (The master is already supposed to detect this using a lock file, so this may simply be a matter of testing and documentation; a minimal lock-file sketch follows this item.) The other option would be to have a script run ps to look for the master and start it up if it doesn't find it. This seems more error-prone, but may be the better option if we can't find a reliable method for having the master detect if another master is already running (i.e., if the lock file approach doesn't work correctly in all cases). The final step would be to work with the CSL to get this installed in our pool.
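  A minimal sketch of the lock-file check mentioned in the item above, assuming flock() and a hypothetical lock path; the master's actual instance-lock logic may differ:

      #include <fcntl.h>
      #include <sys/file.h>
      #include <cstdio>

      int main() {
          const char* lockfile = "/var/lock/condor_master.lock";   // hypothetical path
          int fd = open(lockfile, O_CREAT | O_RDWR, 0644);
          if (fd < 0) { perror("open lock file"); return 1; }

          // LOCK_NB makes flock() fail immediately if another process holds the lock.
          if (flock(fd, LOCK_EX | LOCK_NB) < 0) {
              fprintf(stderr, "Another condor_master appears to be running; exiting quietly.\n");
              return 0;     // silent exit, as the cron-start option above suggests
          }

          // ... the lock is held for the life of this process; run the master here ...
          return 0;
      }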
- It would be cool if RSCs would flush the file buffers when a job is about to crash, so users will get their error messages and be better able to debug their problem. The RSC library could set up SIGSEGV, SIGABRT, SIGFPE, etc. signal handlers and flush the buffers before letting the job die from the signal. Depending on how messed up the job is, the flush might cause further problems, but it seems to me that it would generally do more good than harm. Of course, ideally users should debug their code outside of Condor.
- much better analysis of match failures and potential matches, put into tools with which the user can query how and why his/her jobs will/won't be scheduled. Nick Coleman and Rajesh are doing research in this area. [condor-admin #857]
******* DAGMan (taken wholesale from what was the section on the limitations of DAGMan)
PC ??  1. A general-purpose command socket will be used to direct DAGMan while it's running. Commands like CANCEL_JOB X or DELETE_ALL would be supported, as well as notification messages like JOB_SUBMIT or JOB_TERMINATE, etc. Eventually, a Java GUI would graphically represent the Dag's state, and offer buttons and dials for graphic Dag manipulation.
PC ??  2. DAG removal supported by a command socket command such as DELETE_ALL
PC ??  3. Condor log file: all jobs in a Dag must go to one log file, but the log file can be shared with other Dags and Condor jobs.
PC ??  4. Currently, all jobs must exit normally, else the DAG will be aborted. The wish is that a job can be ``undone'', or that there is some notion of a job instance. Hence, a job that exits abnormally or is cancelled by the user can be rerun such that the new run's log entry is unique from the old run's log entry (in terms of recovery).
PC ??  + DAGMan inserts attributes into child job ads, such as the job id of the parent in the Dag, the job id of the DAGMan job, etc.
PC ??  + DAGMan supports submitting multiple jobs per cluster. [condor-admin #1303]
PC ??  - Make DAGMan more intelligent about removing leaf jobs and understanding how to manipulate the DAG so jobs can be removed or added into the DAG. [condor-admin #2001]
****** Port Lists
?? ??  + Alpha Linux port w/ checkpointing [condor-admin #220 #350 #738 #776 #1166 #1190]
?? ??  + HPUX 11.0 port ?? (clipped)
?? ??  + AIX port ?? (clipped) [condor-admin #697 #1042 #1082 #1507]
+EP 2 Days  MacOS X
?? ??  + PowerPC Linux port ?? (clipped)
?? ??  + {free,open,net}bsd port ?? (clipped) [condor-admin #59] The FreeBSD port should be basically done after the OS X port...
?? 1 Day  + SparcLinux port ?? (clipped) EP did most of one in an afternoon after a month on the project, so it's not a very tough port... [condor-admin #1588]
******* Task list: (not direct development work, local to UWCS)
BS 1 Day  + Tinderbox/Bonsai Javascript popup boxes are b0rk3n with IE. Must fix.
BS 2 Weeks  + Tinderbox should gather up the results of individual tests and present them in the pop-up boxes
+D/PC ??  Finish automating the test suite
+ we need to add a make-sure-there-are-no-tokens sanity check to master_on, or just have it do a pagsh -- because otherwise, if it's run by a user with a token, it results in a nasty security embarrassment (userlogs, etc. written to AFS get owned by the user who ran condor_on, not the user who is running the jobs)
****** Appendix A: Long Keller Wishes :)
- This is a long wish. :) The scoop is to add a new event in the user log dictating when the job begins its journey to the execute machine to run, after the startd has been claimed but before the job actually starts.
  This spawned an idea for reworking how the protocol between the user job and the shadow works, to get a better handle on where time is spent doing what and to record what is happening in the user log. Below is a (slightly modified) excerpt from an email talking about this.
  I still like the new-event-in-the-user-log idea (when the job is scheduled to run on the resource, but before it actually starts running) and I do believe it would lead to a better understanding in the user log about what is going wrong. So many times I have to poke through piles of crap just to learn that the job is blowing up on restart and committing Suicide(), when I could just look in the user log and see "oh, the job is getting scheduled to run, and then is evicted without actually running". (I actually vote for sending a pseudo-call to the shadow from the user job *just before returning to user-code and after the restart is complete*.)
  Also, this would help our calculation of Goodput. I make the bold claim that we shouldn't count the time a job spends restarting as goodput. We should ONLY count the time the job is in user-code up until the start of a checkpoint. Then when the checkpoint finishes, we can count all the time in user code the job was actually doing what the user wanted. This means:
    0. Schedd tells the job to begin running on a startd. Time A.
    1. Schedd and shadow record the shadow birth time. Time B.
    2. User job starts up and sends an "in restart" pseudo call to the shadow. Time C.
    3. Job tells the shadow it is in user-code. Time D.
    4. Job runs a while....
    5. Just as the job goes into checkpointing code, it sends a Time E to the shadow.
    6. When the shadow sees either a completed checkpoint or a botched checkpoint, it records a Time F. If the checkpoint was successful, the shadow knows Time E was valid and the time is committed. If the checkpoint was botched, then it knows Time E is invalid and does things accordingly.
  Time the job ran from scheduling to last known committed time:      F - A
  Time the schedd spent doing God knows what before the job starts:   B - A
  Time spent between birth of shadow and transfer of executable to resource:  C - B
  Time spent restarting (transfer of checkpoint, etc):                D - C
  True Goodput (actual user code running):                            E - D
  Time spent checkpointing (or wasting time if it failed):            F - E
  Wall clock time the job has been "in the process of running":       Now - A
  This would make Good|Badput calculations more true to the mark, and it would show us problems in stages of how our jobs run that were hidden by the coarse granularity we have now. You could put this crap into the job ad (we do for some of it) and get up-to-the-minute timings that are damn near what they are supposed to be. The possibilities are endless for how you can munch on this data. Also, you can put most of these timed events in the user log to tell the user exactly what is happening with their job. This means we can look at it and see a decent record of what happened (or is happening) and get a good idea of how to fix it. (RUST would be cooler when they send us a job log that has something more interesting than a shadow exception in it, eh?)
  Of course, the above is subject to haggling and 6.3/5. And I might have misunderstood exactly what happens between steps 0 and 2. But you get the idea. I honestly think it should be worked into the skeletons of the new shadow. Adding the minor parts to the schedd might already be done or just be a few lines of code.
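  A minimal sketch of the A-through-F bookkeeping above, using plain time_t timestamps and difftime() for the interval arithmetic; the JobTimeline name and its fields are hypothetical, not shadow code:

      #include <time.h>

      // Hypothetical container for the timestamps A through F described above.
      struct JobTimeline {
          time_t scheduled;        // A: schedd tells the job to begin running
          time_t shadow_born;      // B: shadow birth time
          time_t restart_begun;    // C: job sends the "in restart" pseudo call
          time_t in_user_code;     // D: job tells the shadow it is in user code
          time_t ckpt_started;     // E: job enters checkpointing code
          time_t ckpt_finished;    // F: shadow sees the checkpoint complete (or fail)
      };

      // The interval arithmetic from the table above.
      inline double committed_run_time(const JobTimeline& t) { return difftime(t.ckpt_finished, t.scheduled); }     // F - A
      inline double schedd_overhead(const JobTimeline& t)    { return difftime(t.shadow_born, t.scheduled); }       // B - A
      inline double transfer_time(const JobTimeline& t)      { return difftime(t.restart_begun, t.shadow_born); }   // C - B
      inline double restart_time(const JobTimeline& t)       { return difftime(t.in_user_code, t.restart_begun); }  // D - C
      inline double true_goodput(const JobTimeline& t)       { return difftime(t.ckpt_started, t.in_user_code); }   // E - D
      inline double checkpoint_time(const JobTimeline& t)    { return difftime(t.ckpt_finished, t.ckpt_started); }  // F - E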
- set a submission file attribute, say "JOB_TRACK_SERVER = sinfulstring", that would cause every daemon that knows about the job as it is running to copy its log file entries about that job via a socket to a log server located at the sinful string (in addition to going to the local log files). This would allow us to watch, in real time, every daemon that is dealing with a particular job (or maybe even a set of jobs in a cluster). If the server goes away, then the connection is closed, and the daemon continues to go to the log file. This would help us _so much_ in trying to debug why a job fails to run or screws up for some reason on a machine that we might or might not have an account, or even jurisdiction, on. A lot of developer time is wasted in tracking a job or a set of jobs through log files that we might or might not own, and it is mentally taxing to do it over and over again while looking for elusive race conditions, or statistically significant (but you need a lot of runs) bugs.
+ A "blackhole machine" information gathering system that will propagate information about potential blackhole machines (as defined by a user-specified policy in the submit file) from the shadow to higher-order Condor daemons, like the schedd or the negotiator. What would happen is that if a job dies under circumstances that would make the user-specified blackhole policy activate (the job completed in less than 5 minutes when it was supposed to run 5 hours), the shadow will inform the schedd that the machine the shadow had been talking to is a blackhole candidate. The schedd could collect all of this information from the various shadows it controls, and over multiple executions of shadows for the same job, building summaries if so desired. Then it could forward the blackhole statistics to the collector as part of a larger blackhole ad that all schedds can contribute to. Also, there should be ways to mark when a machine ISN'T a blackhole, so we can allow detection of blackholes that only affect certain jobs or users, or blackholes that are only that way for a finite period of time. This would be a boon to allow admins to discover oddities in their pools and in places they potentially flock or glide into. After all of this works, then maybe we can put smarts into the pool to respond to blackhole situations in a more intelligent way, like having the negotiator not schedule jobs on machines that are known blackholes for that job. But that is way later. (A tiny candidate-check sketch follows.)
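  A minimal sketch of the shadow-side candidate test and a schedd-side tally for the blackhole idea above; the 5-minute/5-hour threshold and all names (is_blackhole_candidate, BlackholeTally) are hypothetical, standing in for whatever policy the user's submit file would express:

      #include <map>
      #include <string>

      // Shadow side: did this run look suspiciously short compared to what was expected?
      inline bool is_blackhole_candidate(double actual_runtime_secs,
                                         double expected_runtime_secs) {
          return actual_runtime_secs < 5 * 60 && expected_runtime_secs >= 5 * 3600;
      }

      // Schedd side: count candidate reports per execute machine so the statistics
      // can later be folded into a pool-wide "blackhole ad" sent to the collector.
      struct BlackholeTally {
          std::map<std::string, int> reports;   // machine name -> candidate count

          void report(const std::string& machine) { ++reports[machine]; }
          int  count(const std::string& machine) const {
              std::map<std::string, int>::const_iterator it = reports.find(machine);
              return it == reports.end() ? 0 : it->second;
          }
      };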