******* 6.4.3 Known bug list
* DAGMan doesn't detect when users mistakenly specify two DAG nodes with the same node name; instead it waits for the same node to complete twice, which never happens, and so DAGMan goes off into never-never land [see condor-admin 4502].
****** 6.4.3&1 things that need to be confirmed
- Better packaging of "security stuff"
- Adding debugging information about NFS permission problems & an FAQ entry
- Bitmask ordering preservation of the methods of authentication
- Robust parsing of the inherited socket string (for compatibility)
- Java universe "space in the pathname" fix. [ fixed ]
- Windows crash problem in the startd
- error propagation from authentication failure
- timeouts from the starter
- classads that take too long
- better UI for condor_rm -constraint
- fix the old-classad core dump when committing the transaction after actOnJobs in the schedd
- switch condor_submit_dag to a C program
******* new features in the 6.3.2 release that never got implemented
- authenticated syscall socket for the standard universe
- local rusage is always 0 in the new shadow/starter
- new shadow email sending needs to be made coherent (what should the different notification levels really mean?)
- new messages in condor_(rm|hold|release) when using constraints
- new-classad condor_q built, included in tarballs, etc.
- Daemon object correctness bug when getting the version of a daemon that isn't the one param() thinks it is
- speedier negotiation (aka oracle)
******* 6.2.1 Known bug list
?? ??  + When the startd is trying to get starter ClassAd info and the underlying starter binary can't be executed, popen() just makes it look like there's no data; you don't actually see that the exec() failed (since it didn't, /bin/sh exec'ed just fine *sigh*). So either we need to use DC pipes for this, or do some extra checking, so we can print a more readable and serious-looking error message when we can't exec() a given starter. See condor-admin #4754 for details.
DW 1 Day  + fix bug in the old starter w/ transferring/creating core files (for sure on Solaris, possibly other platforms) when running as root.
?? ??  + track down remaining wrong dprintf()'s in the startd that are causing segfaults. Tough, because we rarely see it.
DT ??  + We're missing some FORTRAN system calls. See Condor-Admin 2932 and 2955.
?? ??  + /dev/console doesn't do what we expect on Linux anymore... we need to find a way to detect physical console keyboard activity again.
PK ??  + if there are more than about 10000 files that condor_submit/condor_schedd has to check for writability, then things explode. You can use the -d option to condor_submit to turn off the feature, but it needs fixing.
?? ??  + CondorLoadAvg is bogus on SMP. It's just bogus everywhere.
?? ??  + if you put 'getenv = TRUE' in your submit file and your environment contains double quotes or newlines, submit will fail with a parse error.
?? ??  + You cannot pass spaces as part of an argument in the Arguments attribute in a command file. This should be fixed. [condor-admin #590]
?? ??  + condor_kbdd doesn't work properly under Compaq Tru64 5.1, and as a result, resources may not leave the ``Unclaimed'' state regardless of keyboard or pty activity. Compaq Tru64 5.0a and earlier appear to work properly. (condor-support #408)
?? ??  + condor_status doesn't handle this command correctly: "condor_status -sort loadavg -sort enteredcurrentactivity"; the enteredcurrentactivity column is not sorted correctly. (A chained-comparison sketch appears a couple of items below.)
?? ??  + pvm_spawn(,,PvmTaskDefault,,,) doesn't work in Condor-PVM [condor-admin #2217]
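  For the condor_status multi-key sort bug mentioned a couple of items above, the fix amounts to a chained comparison: consult the secondary key only when the primary keys are equal. A minimal sketch (not the actual condor_status code), assuming the ads are held in a vector of a hypothetical Ad struct:

      #include <algorithm>
      #include <vector>

      struct Ad {                          // hypothetical stand-in for a machine ad
          double loadavg;
          long   entered_current_activity;
      };

      // Sort on LoadAvg first; fall back to EnteredCurrentActivity only on ties.
      void sort_ads(std::vector<Ad>& ads) {
          std::sort(ads.begin(), ads.end(), [](const Ad& a, const Ad& b) {
              if (a.loadavg != b.loadavg)
                  return a.loadavg < b.loadavg;
              return a.entered_current_activity < b.entered_current_activity;
          });
      }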
?? ??  + Condor doesn't support pvm_spawn() calls that request that more than one worker be spawned (i.e., the fifth argument to pvm_spawn() currently must always be equal to one) [condor-admin #2244]
?? ??  + PVM workers don't start up with the job's working directory or environment variables. The PVM_EXPORT environment variable is ignored. [condor-admin #2217]
?? ??  + ProcAPI calls seem to fail a lot on Irix 6.5:
           "ProcAPI: Error opening /proc/58274, errno: 11" [condor-admin #2217]
           "ProcAPI: Error opening /proc/2013144, errno: 145" (seen in NCSA pool)
PK ??  + scheduler universe jobs (e.g. dagman jobs) still send Condor mail from root
?? ??  + startd seg faulted w/ the PVM req_new_proc command -- ProcAPI bug?
?? ??  + If the user job changes umask(), that affects the shadow itself.
?? ??  + utimes() does not work reliably on Solaris, maybe other platforms.
?? ??  + printer.remote prints duplicate lines in its .err file.
?? ??  + threads.C on DUX is broken with the vendor C++ compiler.
?? ??  + Memory problem in the negotiator
?? ??  + On some platforms, the code to lock the MasterLog to ensure there are not multiple masters running on the same machine does not work.
?? ??  + if the condor developers collector changes its IP address, we never know about it and keep sending world ads off to who knows where.
?? ??  + condor_master confusion: if you do a condor_off and a child daemon does not exit before you do a condor_on again, the daemon is never restarted.
?? ??  + constructor.C on IRIX 6.2 with g++ is broken. Must use gcc's linker and not the system's, or figure out a way to use the system's.
?? ??  + fix the problem of the CondorView applet not graphing info when there is no data, thus making the x-axis time line non-linear (see condor-support #242). This is largely a symptom of the lame/wrong behavior of condor_stats, which does not report any records which have all zeros in the fields.
PC 2 Days  + Fix our released API (libcondorapi.a) to be C linkage so people can use the log event processing stuff nicely. [condor-support #298]
?? ??  + in Fortran, the etime() function call does not work because it opens /proc and does ioctl PICOGETUSAGE, which of course fails. Should we support it? [condor-admin #1504]
?? ??  + Submitting a vanilla job with 'requirements = memory > 1024' into a pool with machines that have greater than 1024 MB of RAM just sits idle in the queue. Appending '&& opsys == "SOLARIS57"' (for a pool that has such) to the requirements allows the expression to be evaluated correctly.
?? ??  + condor_compile is fooled by putting main() in a user-supplied static library which goes on the link line [condor-admin #1358]. E.g., "gcc user_main.o mylib.a" where mylib.a contains main() and user_main contains a function that calls main().
?? ??  + When WANT_SUSPEND evaluates to undefined in the startd, it probably shouldn't EXCEPT and complain about "WANT_SUSPEND not defined in internal classad", since maybe it just references some attributes that aren't in the job ad. What, exactly, it should do instead is not yet clear to me, but what it does now should be changed.
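  For the WANT_SUSPEND item just above, one option is to treat an UNDEFINED evaluation as an answer rather than an internal error. A minimal sketch, assuming a three-valued evaluation result; the names here are hypothetical, not the startd's actual code:

      // Hypothetical three-valued result of evaluating a policy expression.
      enum EvalResult { EVAL_TRUE, EVAL_FALSE, EVAL_UNDEFINED };

      // Instead of EXCEPTing on UNDEFINED, fall back to a conservative default
      // (here: don't suspend) and leave it to the caller to log a warning so
      // the admin can fix the expression.
      bool want_suspend(EvalResult r) {
          switch (r) {
          case EVAL_TRUE:      return true;
          case EVAL_FALSE:     return false;
          case EVAL_UNDEFINED:
          default:             return false;   // conservative default, not an EXCEPT
          }
      }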
?? Done/1 Day  [Entered into the bug tracker as #89] + fix condor_submit so it warns you about things in your submit file that it is ignoring (whereas now it just silently ignores them). (Will did a lot of this, but no idea where it is.)
JF 4 Hours  + Add DUROC support to the Grid Manager
JF 4 Hours  + Allow the submit file to override the RSL string we construct, if it starts with an '&' or a '+' (for DUROC)
TT 1 Day  + Refresh the proxy on the remote machine (shut down and restart job managers...)
?? ??  + a Daemon Core checkpoint server. No new functionality, but we'll need to beat on this to get Kerberos support into the checkpoint server.
DW 2 hours  + condor_install should ask about DedicatedScheduler and set it for you!
DW ??  + Support multiple executables in an MPI job
?? ??  + Rewrite of MPI startup methods. Support MPICH-P4, MPICH-MPD, LAM, MPICH/NT, MPI/Pro
?? ??  + Submit/RM API. Basically hack up condor_submit.
DX ??  + View Server indexing & speedups. Also condor_history
DW ??  + the dedicated schedd should check DedicatedScheduler in the config file. If it doesn't match itself, and someone tries to submit an MPI job to it, it should refuse in such a way that condor_submit can tell the user to go submit to the right schedd.
?? ??  + idle_time calculations take 2 calls to stat() for every file, and we're stat()ing tons of files we don't need to be. We should optimize all this crap and clean it up.
?? ??  + add a config file hook for specifying which devices to check for idle_time.
+ sysapi should detect cyclical tty activity (i.e. elm updates, etc). [condor-admin #868]
ZM ??  + master agents: (a) transfer log, config, history files, (b) remote "top"
CS ??  + FileTransfer stuff should be directory aware (if a file you specify is really a directory, we should recursively deal with it).
+ Trap calls to CreateProcess on NT
+ Support core files for NT user jobs
+ Inject a DLL to handle the soft-kill signal on NT
TT ??  + Store the user's password encrypted in the registry, use NT magic to transfer it about
TT ??  + Run as the user on NT, instead of as nobody
TT ??  + Port the Scheduler Universe and DAGMan to NT
DW ??  + Something other than FIFO scheduling for dedicated resources
?? ??  + Create a "condor_last" tool, so the startd logs every job that is spawned there (user, command, date/time, anything else?), and condor_last allows you to query this information
PK ??  + New Shadow for the PVM Universe
?? ??  + New Shadow for the Standard universe
+ Sonny's firewall support
+ condor_vacate, condor_checkpoint
+ put the job in the HOLD state for long-term errors (and/or EXCEPTs in remote syscalls), with logic to avoid the "black hole" syndrome where one misbehaving machine in the pool can bring down the whole show (see [condor-admin #1254]). See also Pete Keller's long wish at the end.
+ future reservations in the dedicated scheduler
+ startd enforces resource limits on jobs, such as CPU time or memory on a machine. [condor-admin #846]
+ shadow spits out the image size into the user log on a job-completed event.
+ First Toolcore-based tool: command-line arg parsing, using the DaemonCore command client, consistent error reporting, usage message format, etc, etc.
+ On reconfig, daemons should check NETWORK_INTERFACE and bind their command socket to the new interface(s) if it changed [condor-admin #2217].
?? 4 Days  + Add multihomed support to Condor (see RUST [condor-admin #1761]).
+ the daemon core command port should try to bind to all interfaces by default, and only bind to a specific interface if NETWORK_INTERFACE is specified.
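  For the command-port item just above, "bind to all interfaces by default" is essentially the difference between binding to INADDR_ANY and binding to one specific address. A minimal sketch with plain sockets (not DaemonCore), with the NETWORK_INTERFACE handling simplified to a single dotted-quad string:

      #include <arpa/inet.h>
      #include <netinet/in.h>
      #include <string.h>
      #include <sys/socket.h>
      #include <unistd.h>

      // Bind a command socket: to every interface by default, or to the single
      // address named by NETWORK_INTERFACE when the admin has set one.
      int bind_command_port(int port, const char* network_interface /* may be NULL */) {
          int fd = socket(AF_INET, SOCK_STREAM, 0);
          if (fd < 0) return -1;

          struct sockaddr_in addr;
          memset(&addr, 0, sizeof(addr));
          addr.sin_family = AF_INET;
          addr.sin_port   = htons(port);
          addr.sin_addr.s_addr = network_interface
              ? inet_addr(network_interface)   // specific interface requested
              : htonl(INADDR_ANY);             // default: all interfaces

          if (bind(fd, (struct sockaddr*)&addr, sizeof(addr)) < 0) {
              close(fd);
              return -1;
          }
          return fd;
      }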
+ notify_user = file: put the job completion e-mail into a user-specified file.
+ some way in the config file to append email addresses to the notify_user list so admins can also get job completion emails [condor-admin #5352]
+ Network file system access under Windows NT
+ Add an -all option to The Tool, e.g. "condor_off -all" would shut down your whole pool. :^)
- SCHEDD_CLAIM_LINGER parameter that tells the schedd to hold on to idle claims for a given number of seconds before relinquishing them. Each time a new job is submitted, the schedd sees if it can run the job with one of the lingering claims. This can improve the efficiency of MW-style applications that wait for a job to exit, do some quick processing of the results, and then submit a new job.
+ SMP support is lame. The different VMs should be able to access information about the other VMs, the other VMs' job classads, etc, etc. Some (very reasonable) policies that users want (e.g. condor-admin #4762) are impossible with our current religion in the startd.
+ we should put the instance-lock functionality from the master into at least the startd, maybe even DaemonCore itself. It's lame that you can accidentally start up 2 startds w/ the same execute dir, 2 schedds with the same spool, etc, etc.
****** V6_3 items looking for a good version...
+ condor_submit segfaults with a large environment string (*might* already be fixed, but needs checking). [condor-admin #2196]
+? There needs to be a way to disable the non-root/non-condor OwnerCheck stuff, other than setting QUEUE_SUPER_USERS.
+ Add a more robust idle_time tester function in the sysapi tester.
+ Need to put support for checksums into Condor: checksum the builds so people can verify they have legit binaries, give a tool so people can do that easily, etc. Ultimately, we want the master to verify checksums before it'll spawn any processes...
+ Add a "PreemptReason" attribute to the startd classad when preempting.
+ Compress logs once they're rotated.
+ On reconfig, if there's an error, don't EXCEPT; go back to the last good config.
+T Portland Group Compilers
+ if there's no MasterLog, and you run w/ -f -t, you EXCEPT b/c you can't get a lock on a file that's not there!
+D SMP admin tool to handle (a) reservations, (b) configuration
+ condor_report_bug script, which sends email to condor-admin/condor-support along with relevant sections of all
+ master handles DaemonList change
+ soft-kill jobs when the user does a condor_rm (currently we hard-kill)
+ need to specify a separate rm soft-kill signal, distinct from the regular "shut down gracefully, but we'll start you again" soft-kill. (Works for the scheduler universe right now; needs a little more work for the other universes.)
+ Initial executable can be stored on the checkpoint server
+ separate downloadable condor link kit (libs + condor_compile)
+ clean up ATTR_SERVER_TIME for use by condor_status -state, so correct numbers are displayed even if time across the pool and central manager is not synchronized
+ scheduler universe API
+ submit more jobs to an existing cluster
+ ?? stdin/out/err set up by the starter to go to the shadow, even for VANILLA jobs. This would allow VANILLA jobs which just need stdin/out/err to not require a shared file system; it would also allow stderr messages sent before our MAIN() runs in STANDARD jobs to appear.
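  For the stdin/out/err item just above, the starter-side half is mostly dup2() before the exec. A minimal sketch (not the actual starter code), assuming the starter has already opened descriptors connected back to the shadow; the in_fd/out_fd/err_fd names are hypothetical:

      #include <sys/types.h>
      #include <unistd.h>

      // Wire the job's standard streams to descriptors the starter has already
      // connected to the shadow, then exec the job. Returns the child's pid,
      // or stays in the child and _exit()s if the exec fails.
      pid_t spawn_job_with_remote_stdio(const char* job_path, char* const argv[],
                                        int in_fd, int out_fd, int err_fd) {
          pid_t pid = fork();
          if (pid == 0) {                       // child: becomes the user job
              dup2(in_fd,  STDIN_FILENO);
              dup2(out_fd, STDOUT_FILENO);
              dup2(err_fd, STDERR_FILENO);
              close(in_fd); close(out_fd); close(err_fd);
              execv(job_path, argv);
              _exit(127);                       // exec failed
          }
          return pid;                           // parent: starter keeps running
      }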
+ Condor.pm should include a whole-cluster exit callback [condor-admin #2911]
+ tools for clusters:
  - include a distributed make for Condor
  - include a distributed telnet
  - include a distributed rsh (or at least condor_submit -block)
+ Java Universe support
+ Automatic cleaning of old entries in the history.
+ FS domain != DNS domain stuff.
DT ??  + On restart, we should truncate files we've opened for writing to the size they were at checkpoint time.
+ user-visible condor_submit improvements: command line args, environment defaults, ...
+ Put benchmarking in a separate thread or process. Having the whole SMP startd blocked whenever a VM wants to benchmark is a real pain.
+ bug: Fix getrusage()
+ Add DC acks for DC permission problems, etc. Create a CommandSock derived from ReliSock, use the high bit of the command int to specify if you want an ack, etc.
+ Checkpoint compression and related stuff is broken under Linux, possibly other OSes.
+ bug: When setting up a Personal Condor, if your CONDOR_HOST macro is set to garbage, the master segfaults while trying to update the collector ad.
+ A simple submit-file syntax for heterogeneous submit that hides all the nastiness of requirements expressions with LastCkptArch, etc.
+ When the startd is going to preempt a claim, run a user-specified command.
+ remove MAX_VIRTUAL_MACHINE_TYPES in the SMP startd, and just use an ExtArray.
+ better interface to condor_rm so you can say "condor_rm 1280-1290" to remove all clusters from 1280 to 1290. [condor-admin #3849]
******* V6.5
+ fix the "BindAnyCommandPort failed" message in the daemons so that when this command fails, it tells you why. See [condor-support #632]
+ cache group lookups in _set_priv so when you're changing between users you don't pound the crap out of the NIS master. (A small caching sketch appears at the end of this section.)
+ use new classads everywhere
+ use new classad collections everywhere
+ collector can handle new ad types without needing to change the source
+ rewrite the schedd
+ new world order for remote system calls (Doug's work, Unix<->NT integration)
+ improve the checkpoint protocol (i.e. a classad with meta info is stored with the checkpoint, allow files to be stored along with the checkpoint, etc). In addition to this, it would be nice if you could request a checkpoint for a jobid, and it would produce for you a tar file that contains an initial checkpoint, a current checkpoint, and all files necessary to restore the checkpoint. Then you could replace a job that left the queue prematurely with whatever is contained in the archive, with handy tools that would need to be written for it. [condor-admin #1386] is part of this larger picture.
+ central manager automatic failover
+ condor_send_signal
+ FileTransfer object maintains a local cache
+ ?? do not need to re-exec the shadow and starter every time (for short-running jobs). !!! Fix getrusage() if this occurs....
+ schedd can hold a claim on a resource for a specified period of time
+ centralized config server
+ config file is a classad
+ submit file language/syntax should be a classad
+ support in the new starter for vendor checkpointing (e.g. IRIX kernel checkpointing, Linux kernel checkpointing)
+ users can explicitly state want_checkpoint, and/or want_remotesyscall, and/or neither.
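  The _set_priv group-lookup item earlier in this section is essentially a memoization problem: look the groups up once per username and reuse the answer. A minimal sketch, assuming Linux getpwnam()/getgrouplist(); the GroupCache name is hypothetical and this is not the actual _set_priv code:

      #include <grp.h>
      #include <pwd.h>
      #include <map>
      #include <string>
      #include <vector>

      struct GroupCache {
          std::map<std::string, std::vector<gid_t> > cache;

          // Return the group list for 'user', hitting NIS/passwd only on the
          // first call for each username.
          const std::vector<gid_t>& lookup(const std::string& user) {
              std::map<std::string, std::vector<gid_t> >::iterator it = cache.find(user);
              if (it != cache.end()) return it->second;      // cache hit: no NIS traffic

              std::vector<gid_t> groups;
              struct passwd* pw = getpwnam(user.c_str());
              if (pw) {
                  int n = 16;
                  groups.resize(n);
                  if (getgrouplist(user.c_str(), pw->pw_gid, &groups[0], &n) == -1) {
                      groups.resize(n);                       // buffer too small; retry
                      getgrouplist(user.c_str(), pw->pw_gid, &groups[0], &n);
                  }
                  groups.resize(n);
              }
              return cache[user] = groups;                    // remember for next time
          }
      };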
******* Undecided which version:
+ In Condor-PVM, an infinite message loop happened that looked like this:
    3/25 09:24:33 (445.0) (1072):PVMd message is SM_CONFIG from t40002
    3/25 09:24:33 (450.0) (5618):PVMd message is SM_CONFIG from t40002
    3/25 09:24:33 (445.0) (1072):PVMd message is SM_CONFIG from t40002
    3/25 09:24:33 (450.0) (5618):PVMd message is SM_CONFIG from t40002
    3/25 09:24:33 (445.0) (1072):PVMd message is SM_CONFIG from t40002
    3/25 09:24:33 (450.0) (5618):PVMd message is SM_CONFIG from t40002
    3/25 09:24:33 (445.0) (1072):PVMd message is SM_CONFIG from t40002
  It appears that neither of the two processes wanted to deal with the message, and so it kept bouncing between them...
+ Add support in the standard universe for the ABSoft Fortran compiler.
+ Condor-PVM ignores the soft-kill signal, and it is a TON of work to fix. See admin rust #5439
+ Condor totally ignores tty activity on Linux machines with a USB keyboard. admin rust #5288
+ The standard universe library says a user can open a path up to the POSIX max path length. This is incorrect. Because we can add things like buffer:, compress:, ftp:, or combinations thereof, we shortchange the path length a user is actually allowed.
+ Implement (or see if we can) support for the Intel Fortran Compiler with condor_compile. See admin rust #5358
+ condor_c++_util/email.C: When trying to figure out a domain name for the EMAIL_DOMAIN, if all attempts to determine a domain name fail, then we will free a null pointer. Don't have time to fix now, so I'll just document it.
+ from Miron (2003-04-08): the Userlog-reader API should keep an event count and allow you to fast-forward/rewind N events
+ bug as described in [condor-admin #2452]
+ bug as described in [condor-admin #2769]
+ bug as described in [condor-admin #2888]. Might be a non-Red Hat issue, but I remember seeing this type of error in something else I couldn't reproduce locally, so it might be legit.
+ Make it so that Condor can notice when the whole system begins to have problems and then e-mail people about it. For example, a shadow notices that it can't ever talk to the checkpoint server, so it will send mail to an e-mail location about it.
+ (V6.1.13?) Add -constraint to as many programs as it makes sense for.
+ (V6.5?) Documented, clean APIs for external consumption:
  - collector query API
  - user job API (get info on resource, checkpoint, remap files, etc)
  - job submit and queue edit API
  - ??? Sched<->Shadow API so users can write their own shadows
+ (V6.3?) shadow gets the startd ad, and thereby we allow the job ad to refer to attributes in the startd ad
+ (V6.5) Schedd<->Negotiator protocol revisit: spinning-pie algorithm implemented with multiple pies, negotiator asks the schedd only for attributes that matter (signature), negotiator tells the schedd _why_ a match failed, etc, etc.
+ checkpoint/remote syscalls on WinNT
+ (V6.2?) make Condor a Red Hat RPM and/or a Debian installation package [condor-admin #151]
+ Fix the access(2) check in condor_daemoncore so that we have one that works off of the EUID instead of the real UID.
+ printf-compatible format specification language which can be used with classad values [condor-support #257]
+ have the starter record in its logs the uncommitted time of a job in addition to the committed time when a job vacates.
+ make sure f90 works on IRIX machines. [condor-admin #1174]
+ Make the collector able to bind to other ports. This could optionally be a COLLECTOR_PORT type thing; better yet, a portmapper could be developed to handle things magically.
  Nick & Pete have some (IMHO) good ideas for this.
* (v6.5?) Fix the cached IP address of the collector; currently, if the IP address of the collector changes, the startd (and probably everything else) updates will start silently going to the Great BitBucket in the Sky(tm).
  - Modify all of the server code to use a common update block of code, probably some intelligent object
  - Create a new sinful string object to manage these things, time itself out, etc. This should be used by the above.
******* Wish List:
- condor_q should try to print out its data as it gets it from the schedd, instead of waiting till the end. This way, if your queue is enormous, you start to see something right away, instead of waiting a few minutes for any output from the command. This would prevent any sorting, but it may be useful to some people... [admin #3841]
- the priv state code should do some caching. For instance, on Unix, every time init_user_ids() is called, we do a getpwent(). The result of this call should be cached by username (if not, we pound NIS on some platforms -- RUST from biostat). Another example: on Windows, we should cache the handle to the login.
- when perl has been around long enough, add this patch: [condor-admin #2019]. It looks like it requires perl 5_003, which not everyone on the architectures we support might have as of 2002.
- make condor_userprio and friends take -constraint
- in the job ad of standard universe jobs, the CondorVersion attribute should be the version of the syscall library linked in, not the version of condor_submit used. In general, perhaps we want "schedd_version", "syscall_lib_version", and maybe even "shadow_version" or something.
- condor_install should set up flocking.
- Support for suspending claims and/or activations in the startd for running short, high-priority jobs quickly without preempting currently running jobs. If a user submits a high-priority job, the schedd could suspend one of that user's currently running jobs and use the claim to run the new job. When the new job completes, the schedd unsuspends the previous job. Likewise, if the negotiator preempts a job to run a higher-priority job, we suspend the preempted job until the higher-priority job completes. We'd obviously only want this behavior in some scenarios, so we'd need to be able to control it in the startd and schedd with expressions that refer to job attributes. We could use this feature at NCSA on the Exemplar to run short, high-priority ChemViz jobs with long-running, lower-priority jobs that can't checkpoint.
- Optimize ProcAPI and ProcFamily. On modi4.ncsa.uiuc.edu, it takes over 3 seconds to complete a single call to ProcFamily::takesnapshot(). The master calls ProcFamily::takesnapshot() for each child process every minute, so with just two children, it spends over 10% of its time in that call! It would help if the ProcAPI/ProcFamily objects didn't rebuild the process tree from scratch for each child process. Maybe the tree could be a static class member that gets rebuilt at most once every 10 seconds. On each call, check the last update time of the tree to see if it needs to be rebuilt. Also, on Irix and Solaris, ProcAPI::getProcInfo() makes 3 ioctl() calls: PIOCPSINFO, PIOCUSAGE, and PIOCSTATUS. The PIOCSTATUS call seems redundant; the PIOCUSAGE call returns the same information. Since ProcAPI::getProcInfo() can be called thousands of times in a single call to ProcAPI::buildProcInfoList() on a large machine, we should be sure we're doing things as efficiently as possible in that method.
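  A minimal sketch of the "rebuild at most once every 10 seconds" idea from the item above, with hypothetical ProcessTree/build_process_tree() stand-ins rather than the real ProcAPI/ProcFamily code:

      #include <stddef.h>
      #include <time.h>

      struct ProcessTree { /* hypothetical stand-in for the snapshot data */ };

      // Hypothetical: in the real code this would be one full walk of /proc.
      static ProcessTree* build_process_tree() { return new ProcessTree(); }

      class ProcFamilySketch {
          static ProcessTree* tree;        // shared by every ProcFamily object
          static time_t       tree_built;  // when the tree was last rebuilt
      public:
          static const int REBUILD_INTERVAL = 10;   // seconds between rebuilds

          // Return the cached tree, rebuilding it at most once per interval,
          // so per-child snapshot calls stop re-walking the whole process list.
          static ProcessTree* snapshot() {
              time_t now = time(NULL);
              if (tree == NULL || now - tree_built >= REBUILD_INTERVAL) {
                  delete tree;
                  tree = build_process_tree();
                  tree_built = now;
              }
              return tree;
          }
      };

      ProcessTree* ProcFamilySketch::tree = NULL;
      time_t       ProcFamilySketch::tree_built = 0;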
- condor_compile could display a message about the version compatibility of the binary it's producing. This would be particularly useful for remote submitters compiling up binaries for many architectures, so they can make sure all their binaries will agree with the shadow version on their submit machine. Just spitting out the current Condor version number would help.
- If port 9618 is unavailable, the condor_collector will happily bind to a random port. This is totally broken, since no other daemons have any way to know about the new port. The most common case is incorrectly starting a second condor_collector while another is running. The correct behavior is to emit a complaint to our log files, perhaps also to standard out, and then exit. This bug might be present in condor_negotiator as well. CORRECTION: condor_collector requires "-p 9618" to ensure that it binds to the right port. Otherwise it defaults to a random port. This is surprising to users who try to start a collector by hand.
- The user's umask should follow the job around. This is particularly relevant for vanilla universe jobs. You can work around it by wrapping the job in a shell script to set the umask, but that's clumsy.
- If the condor_startd can't launch a condor_starter because of permissions problems, it currently silently fails. The condor_startd should emit warnings in this situation, perhaps once when the condor_startd starts, and once whenever it fails to launch the condor_starter.
- Create condor_backup and condor_restore. condor_backup would allow a user to pull the last checkpoint of his job down to a local file that he can back up himself. condor_restore would reinject that backup into the pool. This would give users control over their own backups here at the UW (the current checkpoint servers are not backed up), and would allow paranoid users to keep multiple backups at various points in the program's lifetime. (Based on general agreement at the 2002-11-08 staff meeting.)
- Whereas,
  + the proliferation of "Personal Condors" and "Condor-G"s is growing, alongside the proliferation of traditional Condor pools, and;
  + users are frequently managing more than one condor_config for more than one type of Condor installation, and;
  + it is often confusing for them (and sometimes for us), when they run condor_master, to know which condor_config it's picking up, or even that there is a possibility of there being more than one to pick up, and;
  + be it resolved that condor_master should print, when it is run, something like the following:
      starting condor daemons using /foo/bar/baz/condor_config as user "foo"...
      (ignoring alternate condor_config(s) in /bar/baz/bat and /bat/baz/bar)
      (to use an alternate condor_config, set CONDOR_CONFIG in your environment)
    ...unless a "-q" option is given, telling the master to be quiet. (From an Oct 14 email by pfc, and generally agreed to be a Good Idea.)
- Add NumberOfExceptions to job ads. This will allow, among other things, machines to prefer jobs with fewer exceptions. Reasoning: it's possible for a job to enter a cycle of endlessly excepting. A set of these jobs can constantly tie up execute machines, blocking jobs that can actually make forward progress. Once in place, other useful policies can be added (perhaps holding or even removing a job with too many exceptions). For example, thanks to the magic of Globus, occasionally Condor will be managing a vanilla job whose standard output and error are now and forever unwritable.
  The job starts, is unable to write the file, excepts, reenters the queue, and the cycle repeats. This tied up most of a large cluster for a while before it was hunted down.
- Have condor_submit do better error-checking on the "queue" statement [condor-admin #7313]
EP 1 Day  - Fix up condor_version
- If the spool, log, and execute directories don't exist, Condor should try to create them. This could save a step in the installation process, which would mean a lot for people administering Condor on large clusters of machines.
- It would be sweet if, when a program explodes, it gave us a stack trace in an e-mail to condor-system, or just did something more useful than exiting without a core file.
- allow running vanilla jobs to be saved shortly after a schedd crash/shutdown
- Users and administrators should be able to specify a maximum run time (wall clock or CPU time) for each job, using either a config file parameter or a submit file macro (job ClassAd attribute). The schedd would check the run times periodically, kill any jobs that have exceeded their maximums, and send email to the user. We would use this (for example) in the NCSA ChemViz web portal (chemviz.ncsa.uiuc.edu) to ensure that users' jobs don't run for more than 30 minutes.
- Schedd timeouts need to be re-evaluated. One example: if a bunch of machines are rebooted or there is a network outage, a bunch of shadows will except, and the schedd will get stuck in a tight loop in child_exit() trying to relinquish the matches. Each connect for each relinquish command will take 45 minutes to time out and fail. In this case, the schedd should either call DaemonCore::ServiceCommandSocket() somewhere in this loop, or the relinquish command should have a shorter timeout or should be UDP. The relinquish command could safely be UDP, since if it's lost, the match will time out anyway because the schedd won't send any more keepalives.
- Clean up the CondorView ckpt server module so it gets statistics directly from the ckpt server instead of using ftp/scp to grab the TransferLog file.
- Fix the startd VirtualMemory attribute so it actually reports available virtual memory instead of available swap space. For example, if an idle Linux machine has 4GB RAM and a 1GB swap file, we shouldn't restrict jobs to less than 1GB if there is in fact over 1GB of free RAM. At least on Linux, VirtualMemory should be MemFree+SwapFree from /proc/meminfo.
- Make separate condor_history status letters for C (completed, 0 exit status) and E (error, completed with non-0 exit status) [condor-admin #2081]
- Use NIS netgroups for some config file attributes. [condor-admin #2027]
- Set up the ability for hot-failover CMs. This ticket describes a quick hack; however, a better solution is needed. [condor-admin #1255]
- possibly pass more information on the subject line of a job completion email, such as the exit reason and status. [condor-admin #1126]
- Have configurable turn off/on times for any daemons, which the master enforces. [condor-admin #820]
- An updated condor_preen that allows selective removal of the history files based on cluster date. It would be great if it were possible to specify how many days of history to keep at once. [condor-admin #619] (Is this already done? This ticket is from 1998! -psilord)
- Re-enable dynamically linked user jobs on Linux
- Have a way for a user job to determine at run time which version of the syscall lib it is linked with. Possibly extend the VersionObject into user space somehow.
  This is to make it easier for users to work around development-series bugs in our syscall libraries.
- the ALIVE_INTERVAL ping between the schedd/startd allows an execute machine to quickly detect when a submit machine disappears. This should be enhanced to work both ways, so a submit machine can quickly detect when an execute machine disappears (instead of relying on SO_KEEPALIVE, which can take hours).
- condor_q -analyze should be aware of RESERVED_SWAP and tell the user if jobs cannot run because the submit machine is low on swap space.
- When daemons are about to spawn other daemons (schedd -> shadow, for example), the new daemon should be born with stderr dupped to the parent's log file fd, so that we can see error messages from the child in the parent's log file before the child successfully completes dprintf_init(). True, we won't get locking, but better to see the messages than not, even if it means a scrambled-looking log file for the parent right around the time of problems in the child.
- allow administrator-defined priority factor groups (like nice-user), each with a settable prio factor.
- user priority half-life configurable per user
- a pool-wide Condor sysadmin log. When a dprintf(D_SYSADMIN) appeared, it would just send a command to the central manager, which would append to a file.
- checkpoint only dirty pages instead of the entire image.
- overlap periodic checkpoints with computation (i.e., fork before a periodic checkpoint)
- checkpoint server load balancing
- checkpoint server fail-over: if a checkpoint server is down, jobs with checkpoints on that server shouldn't try to start until it's back up (check for a checkpoint server ClassAd?), and jobs which want to write a checkpoint to that server should automatically choose another one instead of retrying the dead server over and over again
- user can specify an "action" in the submit file for when an unsupported system call occurs: either "unsupported", "ignored", or send a SIGABRT.
- job hold state & resubmit with checkpoint: i.e., more robust job error recovery
- schedd/qmgmt support for large queues
- shared library checkpoint and remote system calls
- separated checkpoint and remote system call capabilities
- checkpoint support for mmap, pipes, sockets, X windows, kernel threads, file locks
- intelligent license management
- startd should detect swap and network activity
- metascheduler for jobs with large input/output files
- secure the Condor protocols
- CoCheck support
- accountant
- "remote shadow": shadow runs on the execute machine (instead of the submit machine) if possible (i.e. file access via NFS/AFS etc).
- packaged support for common applications
- partnerships with vendors to get applications linked for checkpointing and remote system calls
- minimize resource usage (esp. memory) in the shadow
- deal with 64-bit pointers properly in our remote syscall lib (esp. for OSF, IRIX, Alpha Linux)
- secure sandbox for the shadow
- secure sandbox provided by the starter to the resource owner (sandbox the user job) ... largely already done, but not the policy part. (see Jim B)
- Ignoring certain _users_ when stating ttys for activity, so that certain users, as desired, won't cause jobs to get suspended, etc.
- limits: CPU limits, memory limits, etc
- better handling of running out of swap space, i.e. migrate the job when malloc fails.... (see Jim B) [condor-admin #326]
- Add a new option to condor_status for -altfmt?
- condor_history should not exist.
  It should be "condor_q -history".
- dprintf change to split bits into subsystem and severity
- the config file should specify a range of ports, and CEDAR's bind() and _condor_local_bind() should both honor that (to help w/ firewalls).
- checkpoint/migrate directly between two machines (for example, when a faster machine is available)
- pre and post commands for jobs, so users can specify a command to be run on the submit machine (1) before the job is run for the first time, (2) each time before the job starts running on an execute machine, (3) each time after the job is preempted from an execute machine, (4) after the job completes
- the startd should fork off a user script which periodically injects dynamic host information into the startd classad. LSF calls this an ELIM (external load index manager). At any time, the script can send a classad expression to stdout; the startd selects on a pipe, reading the stdout of the script and injecting any updates into its ClassAd. It's possible to do this with condor_config_val -set and condor_reconfig, but that's cumbersome and inefficient. This is complicated, please see [condor-admin #856].
- add a configurable upper limit on the total disk space of core files from a particular job [condor-admin #271]
- GUI job editor [condor-admin #1784]
- automatic staging of the master binary on local disk for sites which put binaries on NFS/AFS: condor_init/condor_install should ask where to stage the binary and make the initial copy; the boot script should exec the local master binary; the master should periodically check its binary on NFS/AFS and update the local binary when the version on NFS/AFS changes
- A Group or DefaultGroup ClassAd attribute, akin to Owner, so administrators can define policies in terms of UNIX groups (see condor-admin #1896).
- Add an option to condor_hold so that the job is checkpointed if the job is running.
- Add a "condor_swap_jobs jobId1 jobId2" that replaces a running job (jobId1) with a queued job (jobId2).
- Add a pool monitoring service that reports when machines and/or daemons that should be in the pool aren't. It should send email in cases when the master wouldn't (for example, when the master is down), and it should dump out some HTML pages that display the status of the pool (an overview page with links to pages with details about each problem). The service needs to be redundant so admins will know when monitor(s) are down. It should work with the pool-wide Condor sysadmin log mentioned above, by logging its own D_SYSADMIN events and/or watching the log for events that should generate email or should be displayed on the Condor status web page. See Jim Basney for more details.
- Provide a cron job that periodically checks to see if the master is running and restarts it if it isn't. One option is to have cron start the master periodically and have the master silently exit if it detects another master already running. (The master is already supposed to detect this using a lock file, so this may simply be a matter of testing and documentation; a minimal lock-file sketch follows this item.) The other option would be to have a script run ps to look for the master and start it up if it doesn't find it. This seems more error-prone, but may be the better option if we can't find a reliable method for having the master detect if another master is already running (i.e., if the lock file approach doesn't work correctly in all cases). The final step would be to work with the CSL to get this installed in our pool.
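  A minimal sketch of the lock-file check mentioned in the item above, assuming flock() and a hypothetical lock path; the master's actual instance-lock logic may differ:

      #include <fcntl.h>
      #include <sys/file.h>
      #include <cstdio>

      int main() {
          const char* lockfile = "/var/lock/condor_master.lock";   // hypothetical path
          int fd = open(lockfile, O_CREAT | O_RDWR, 0644);
          if (fd < 0) { perror("open lock file"); return 1; }

          // LOCK_NB makes flock() fail immediately if another process holds the lock.
          if (flock(fd, LOCK_EX | LOCK_NB) < 0) {
              fprintf(stderr, "Another condor_master appears to be running; exiting quietly.\n");
              return 0;     // silent exit, as the cron-start option above suggests
          }

          // ... the lock is held for the life of this process; run the master here ...
          return 0;
      }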
- It would be cool if RSCs would flush the file buffers when a job is about to crash, so users will get their error messages and be better able to debug their problem. The RSC library could set up SIGSEGV, SIGABRT, SIGFPE, etc. signal handlers and flush the buffers before letting the job die from the signal. Depending on how messed up the job is, the flush might cause further problems, but it seems to me that it would generally do more good than harm. Of course, ideally users should debug their code outside of Condor.
- much better analysis of match failures and potential matches, put into tools with which the user can query how and why his/her jobs will/won't be scheduled. Nick Coleman and Rajesh are doing research in this area. [condor-admin #857]
******* DAGMan (taken wholesale from what was the section on the limitations of DAGMan)
PC ??  1. A general-purpose command socket will be used to direct DAGMan while it's running. Commands like CANCEL_JOB X or DELETE_ALL would be supported, as well as notification messages like JOB_SUBMIT or JOB_TERMINATE, etc. Eventually, a Java GUI would graphically represent the Dag's state, and offer buttons and dials for graphic Dag manipulation.
PC ??  2. DAG removal supported by a command socket command such as DELETE_ALL
PC ??  3. Condor log file: all jobs in a Dag must go to one log file, but the log file can be shared with other Dags and Condor jobs.
PC ??  4. Currently, all jobs must exit normally, else the DAG will be aborted. The wish is that a job can be ``undone'', or that there is some notion of a job instance. Hence, a job that exits abnormally or is cancelled by the user can be rerun such that the new run's log entry is unique from the old run's log entry (in terms of recovery).
PC ??  + DAGMan inserts attributes into child job ads, such as the job id of the parent in the Dag, the job id of the DAGMan job, etc.
PC ??  + DAGMan supports submitting multiple jobs per cluster. [condor-admin #1303]
PC ??  - Make DAGMan more intelligent about removing leaf jobs and understanding how to manipulate the DAG so jobs can be removed or added into the DAG. [condor-admin #2001]
****** Port Lists
?? ??  + Alpha Linux port w/ checkpointing [condor-admin #220 #350 #738 #776 #1166 #1190]
?? ??  + HPUX 11.0 port ?? (clipped)
?? ??  + AIX port ?? (clipped) [condor-admin #697 #1042 #1082 #1507]
+EP 2 Days  MacOS X
?? ??  + PowerPC Linux port ?? (clipped)
?? ??  + {free,open,net}bsd port ?? (clipped) [condor-admin #59] The FreeBSD port should be basically done after the OS X port...
?? 1 Day  + SparcLinux port ?? (clipped) EP did most of one in an afternoon after a month on the project, so it's not a very tough port... [condor-admin #1588]
******* Task list: (not direct development work, local to UWCS)
BS 1 Day  + Tinderbox/Bonsai Javascript popup boxes are b0rk3n with IE. Must fix.
BS 2 Weeks  + Tinderbox should gather up the results of individual tests and present them in the pop-up boxes
+D/PC ??  Finish automating the test suite
+ we need to add a make-sure-there-are-no-tokens sanity check to master_on, or just have it do a pagsh -- because otherwise, if it's run by a user with a token, it results in a nasty security embarrassment (userlogs, etc. written to AFS get owned by the user who ran condor_on, not the user who is running the jobs)
****** Appendix A: Long Keller Wishes :)
- This is a long wish. :) The scoop is to add a new event in the user log dictating when the job begins its journey to the execute machine to run, after the startd has been claimed but before the job actually starts.
  This spawned an idea for reworking how the protocol between the user job and the shadow works, to get a better handle on where time is spent doing what and to record what is happening in the user log. Below is a (slightly modified) excerpt from an email talking about this.
  I still like the new-event-in-the-user-log idea (when the job is scheduled to run on the resource, but before it actually starts running) and I do believe it would lead to a better understanding in the user log about what is going wrong. So many times I have to poke through piles of crap just to learn that the job is blowing up on restart and committing Suicide(), when I could just look in the user log and see "oh, the job is getting scheduled to run, and then is evicted without actually running". (I actually vote for sending a pseudo-call to the shadow from the user job *just before returning to user-code and after the restart is complete*.)
  Also, this would help our calculation of Goodput. I make the bold claim that we shouldn't count the time a job spends restarting as goodput. We should ONLY count the time the job is in user-code up until the start of a checkpoint. Then when the checkpoint finishes, we can count all the time in user code the job was actually doing what the user wanted. This means:
    0. Schedd tells the job to begin running on a startd. Time A.
    1. Schedd and shadow record the shadow birth time. Time B.
    2. User job starts up and sends an "in restart" pseudo call to the shadow. Time C.
    3. Job tells the shadow it is in user-code. Time D.
    4. Job runs a while....
    5. Just as the job goes into checkpointing code, it sends a Time E to the shadow.
    6. When the shadow sees either a completed checkpoint or a botched checkpoint, it records a Time F. If the checkpoint was successful, the shadow knows Time E was valid and the time is committed. If the checkpoint was botched, then it knows Time E is invalid and does things accordingly.
  Time the job ran from scheduling to last known committed time:      F - A
  Time the schedd spent doing God knows what before the job starts:   B - A
  Time spent between birth of shadow and transfer of executable to resource:  C - B
  Time spent restarting (transfer of checkpoint, etc):                D - C
  True Goodput (actual user code running):                            E - D
  Time spent checkpointing (or wasting time if it failed):            F - E
  Wall clock time the job has been "in the process of running":       Now - A
  This would make Good|Badput calculations more true to the mark, and it would show us problems in stages of how our jobs run that were hidden by the coarse granularity we have now. You could put this crap into the job ad (we do for some of it) and get up-to-the-minute timings that are damn near what they are supposed to be. The possibilities are endless for how you can munch on this data. Also, you can put most of these timed events in the user log to tell the user exactly what is happening with their job. This means we can look at it and see a decent record of what happened (or is happening) and get a good idea of how to fix it. (RUST would be cooler when they send us a job log that has something more interesting than a shadow exception in it, eh?)
  Of course, the above is subject to haggling and 6.3/5. And I might have misunderstood exactly what happens between steps 0 and 2. But you get the idea. I honestly think it should be worked into the skeletons of the new shadow. Adding the minor parts to the schedd might already be done or just be a few lines of code.
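  A minimal sketch of the A-through-F bookkeeping above, using plain time_t timestamps and difftime() for the interval arithmetic; the JobTimeline name and its fields are hypothetical, not shadow code:

      #include <time.h>

      // Hypothetical container for the timestamps A through F described above.
      struct JobTimeline {
          time_t scheduled;        // A: schedd tells the job to begin running
          time_t shadow_born;      // B: shadow birth time
          time_t restart_begun;    // C: job sends the "in restart" pseudo call
          time_t in_user_code;     // D: job tells the shadow it is in user code
          time_t ckpt_started;     // E: job enters checkpointing code
          time_t ckpt_finished;    // F: shadow sees the checkpoint complete (or fail)
      };

      // The interval arithmetic from the table above.
      inline double committed_run_time(const JobTimeline& t) { return difftime(t.ckpt_finished, t.scheduled); }     // F - A
      inline double schedd_overhead(const JobTimeline& t)    { return difftime(t.shadow_born, t.scheduled); }       // B - A
      inline double transfer_time(const JobTimeline& t)      { return difftime(t.restart_begun, t.shadow_born); }   // C - B
      inline double restart_time(const JobTimeline& t)       { return difftime(t.in_user_code, t.restart_begun); }  // D - C
      inline double true_goodput(const JobTimeline& t)       { return difftime(t.ckpt_started, t.in_user_code); }   // E - D
      inline double checkpoint_time(const JobTimeline& t)    { return difftime(t.ckpt_finished, t.ckpt_started); }  // F - E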
- set a submission file attribute, say "JOB_TRACK_SERVER = sinfulstring", that would cause every daemon that knows about the job as it is running to copy its log file entries about that job via a socket to a log server located at the sinful string (in addition to going to the local log files). This would allow us to watch, in real time, every daemon that is dealing with a particular job (or maybe even a set of jobs in a cluster). If the server goes away, then the connection is closed, and the daemon continues to go to the log file. This would help us _so much_ in trying to debug why a job fails to run or screws up for some reason on a machine that we might or might not have an account, or even jurisdiction, on. A lot of developer time is wasted in tracking a job or a set of jobs through log files that we might or might not own, and it is mentally taxing to do it over and over again while looking for elusive race conditions, or statistically significant (but you need a lot of runs) bugs.
+ A "blackhole machine" information gathering system that will propagate information about potential blackhole machines (as defined by a user-specified policy in the submit file) from the shadow to higher-order Condor daemons, like the schedd or the negotiator. What would happen is that if a job dies under circumstances that would make the user-specified blackhole policy activate (the job completed in less than 5 minutes when it was supposed to run 5 hours), the shadow will inform the schedd that the machine the shadow had been talking to is a blackhole candidate. The schedd could collect all of this information from the various shadows it controls, and over multiple executions of shadows for the same job, building summaries if so desired. Then it could forward the blackhole statistics to the collector as part of a larger blackhole ad that all schedds can contribute to. Also, there should be ways to mark when a machine ISN'T a blackhole, so we can allow detection of blackholes that only affect certain jobs or users, or blackholes that are only that way for a finite period of time. This would be a boon to allow admins to discover oddities in their pools and in places they potentially flock or glide into. After all of this works, then maybe we can put smarts into the pool to respond to blackhole situations in a more intelligent way, like having the negotiator not schedule jobs on machines that are known blackholes for that job. But that is way later. (A tiny candidate-check sketch follows.)
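  A minimal sketch of the shadow-side candidate test and a schedd-side tally for the blackhole idea above; the 5-minute/5-hour threshold and all names (is_blackhole_candidate, BlackholeTally) are hypothetical, standing in for whatever policy the user's submit file would express:

      #include <map>
      #include <string>

      // Shadow side: did this run look suspiciously short compared to what was expected?
      inline bool is_blackhole_candidate(double actual_runtime_secs,
                                         double expected_runtime_secs) {
          return actual_runtime_secs < 5 * 60 && expected_runtime_secs >= 5 * 3600;
      }

      // Schedd side: count candidate reports per execute machine so the statistics
      // can later be folded into a pool-wide "blackhole ad" sent to the collector.
      struct BlackholeTally {
          std::map<std::string, int> reports;   // machine name -> candidate count

          void report(const std::string& machine) { ++reports[machine]; }
          int  count(const std::string& machine) const {
              std::map<std::string, int>::const_iterator it = reports.find(machine);
              return it == reports.end() ? 0 : it->second;
          }
      };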