[drmaa-wg] DRMAA TEST SUITE
Peter Tröger
peter.troeger at hpi.uni-potsdam.de
Tue Mar 21 14:43:26 CST 2006
> Sorry, I do not agree. In the DRMS context, the job life cycle comprises all
> the job execution stages from the moment the job enters the DRM system. In this
> sense, whenever a job is submitted there should be a termination (whether it
> actually ran or not). For example, if you submit a job (qsub) and then
> kill it (qdel), it is obvious that the job terminated abnormally (it has
> been killed), although the job never entered the running state.
This is one possible interpretation, I agree. The DRMAA spec is aligned
to POSIX semantics here: only something that was running (i.e. executed)
before can be terminated.
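
To illustrate the POSIX side of this analogy, here is a minimal sketch
using plain processes instead of DRMAA jobs. It forks a child, optionally
kills it, and classifies the wait status with WIFEXITED / WIFSIGNALED.
The enum and the function name are made up for this example and are not
part of DRMAA or of the test suite.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical names for this sketch; they are not DRMAA constants. */
enum child_end { CHILD_EXITED, CHILD_SIGNALED, CHILD_OTHER };

/* Fork a child, optionally kill it, and classify how it ended. */
enum child_end run_and_classify(int kill_child)
{
    pid_t pid = fork();
    assert(pid >= 0);            /* sketch only: no error recovery */

    if (pid == 0) {
        if (kill_child)
            pause();             /* block until the parent's signal arrives */
        _exit(0);                /* normal end with exit status 0 */
    }

    if (kill_child)
        kill(pid, SIGKILL);

    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    assert(waited == pid);

    if (WIFEXITED(status))
        return CHILD_EXITED;     /* exit status readable via WEXITSTATUS */
    if (WIFSIGNALED(status))
        return CHILD_SIGNALED;   /* signal number readable via WTERMSIG */
    return CHILD_OTHER;
}
```

In both cases the child process existed before it ended. POSIX has no
wait status for something that was never started, which is the gap the
queued-but-never-run job falls into.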
> There is no relation between whether the job terminated normally and whether
> further information is available from the DRM. In the previous example (a job
> that has been killed) there may or may not be more information from the DRMS.
> But in any case, it is clear that the job terminated abnormally.
>
> The drmaa_wifexited description should concentrate on one aspect, since there
> is no obvious (or general) relation between job termination and getting further
> information from the DRM.
You are right. The main intention of drmaa_wifexited() is to tell you whether
additional information about the end of the job execution is available. The
final status of the job is provided by drmaa_job_ps(), and nothing else.
The confusion could possibly be resolved by slightly rewording the first
sentences of the drmaa_wif...() descriptions so that they avoid the word
"termination". This would not change the semantics.
I have no good proposal. DRMAA group?
>> ( Note: The testsuite assumes here that unusable input files are
>> detected by the DRM before the job starts. This seems to be realistic,
>> since file staging operations are usually not part of the job execution.)
>>
>
> I do not think so. Usually job preparation stages are part of the job
> execution, for example:
...
> Therefore I suggest removing ST_INPUT_FILE_FAILURE, ST_OUTPUT_FILE_FAILURE
> and ST_ERROR_FILE_FAILURE from the official test suite. In the previous DRMs
> at least, you can submit a job with output file /etc/passwd or an unusable
> input file; the job is queued, runs and fails.
During the last phone call, the group went through the code. We agree with
your impression that the three tests are currently not sufficient. The
descriptions of the "input / output / error stream" job template parameters
say that an invalid value should result in the job state
DRMAA_PS_FAILED, and nothing more. There is no description of what that
means for drmaa_wif...() calls, but the testsuite expects a particular
behavior. DRMAA section 2.6 clearly shows that DRMAA_PS_FAILED is possible
both for queued and running jobs.
Our proposal is to remove the call of drmaa_wifaborted() for
ST_INPUT_FILE_FAILURE / ST_ERROR_FILE_FAILURE / ST_OUTPUT_FILE_FAILURE.
The drmaa_wait() call does not hurt (since all submitted jobs must be
waitable), but the crucial part is testing the result of
drmaa_synchronize(). After this change, I would expect the test cases to
succeed on your system as well. In case of invalid input / output /
error files, the DRMAA implementation would only be expected to report a
job failure. This should work for all GridWay-supported systems, right?
Could you accept this proposal?
BTW: Condor is one example of a system where the existence of input
files is checked before the job is started. But at least your GRAM
example convinced me that the opposite also happens ;-) ...
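
To make the effect of the proposal concrete, here is a small
self-contained sketch. It does not call the DRMAA library; the struct,
the enum and both check functions are hypothetical stand-ins, and the
"current" check is only my assumption of what the test effectively
demands today (failed job state plus the aborted flag).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for this sketch; real code would use the
 * DRMAA_PS_* constants from drmaa.h. */
enum job_state { JOB_DONE, JOB_FAILED };

/* What a DRMAA library might report for a job with an unusable file. */
struct job_outcome {
    enum job_state final_state;  /* as drmaa_job_ps() would report it */
    bool aborted;                /* as drmaa_wifaborted() would report it */
};

/* Assumed current check: the job must fail AND be flagged as aborted,
 * i.e. it must have ended before entering the running state. */
bool current_check(const struct job_outcome *o)
{
    assert(o != NULL);
    return o->final_state == JOB_FAILED && o->aborted;
}

/* Proposed check: only the failed job state matters. */
bool proposed_check(const struct job_outcome *o)
{
    assert(o != NULL);
    return o->final_state == JOB_FAILED;
}
```

A DRM that rejects the file before the job starts would yield
{ JOB_FAILED, true } and passes both checks. A DRM where the job is
queued, runs and then fails would yield { JOB_FAILED, false } and passes
only the proposed check.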
> Sure. The problem is that the code is not clear either. From DRMAA 1.0 C
> bindings example:
...
> From this code it seems that a signaled job should end with a zero exited
> value from wifexited (as if it did not terminate normally), as opposed to
> your comments in the previous mails and the code in the DRMAA test suite.
You are right, as already said above. drmaa_wifexited() mainly indicates
the availability of additional information.
Regards,
Peter.