batch

Automate batch job processing

Synopsis

gmt batch mainscript -Nprefix -Tnjobs|min/max/inc[+n]|timefile[+pwidth][+sfirst][+w[str]|W] [ -D ] [ -Ftemplate ] [ -Iincludefile ] [ -M[job] ] [ -Q[s] ] [ -Sbpreflight ] [ -Sfpostflight ] [ -V[level] ] [ -W[dir] ] [ -Z ] [ -fflags ] [ -x[[-]n] ] [ --PAR=value ]

Note: No space is allowed between the option flag and the associated arguments.

Description

The batch module can generate GMT processing jobs using a single master script that is repeated for all jobs, with some variation using specific job variables. The module simplifies (and hides) most of the steps normally needed to set up a full-blown processing sequence. Instead, the user can focus on composing the main processing script and let the parallel execution of jobs be automatic. We can set up required data sets and do one-time calculations via an optional preflight script. After completion we can optionally assemble the data output and make summary plots or similar in the postflight script.

Required Arguments

mainscript: Name of a stand-alone GMT modern mode processing script that makes the parameter-dependent calculations. The script may access job variables, such as job number and others defined below, and may be written using the Bourne shell (.sh), the Bourne again shell (.bash), the C shell (.csh) or DOS batch language (.bat). The script language is inferred from the file extension and we build hidden batch scripts using the same language. Parameters that can be accessed are discussed below.

-Nprefix: Determines the prefix of the batch file products and the final sub-directory where intermediate job products can be found after execution.

-Tnjobs|min/max/inc[+n]|timefile[+pwidth][+sfirst][+w[str]|W]

Either specify how many jobs to make, create a one-column data set with values from min to max every inc, or supply a file with a set of parameters, one record (i.e., row) per job. The values in the columns will be available to the mainscript as named variables BATCH_COL0, BATCH_COL1, etc., while any trailing text can be accessed via the variable BATCH_TEXT. The number of records equals the number of jobs. Note that the preflight script is allowed to create timefile, hence we check for its existence both before and after the preflight script has completed. Note: If just njobs is given then only BATCH_JOB is available as no data file is available. For details on array creation, see Generate 1-D Array. Several modifiers are also available:

+n indicates that inc is the desired number of jobs from min to max instead of an increment.
+p can be used to set the tag width of the job number format used in naming the jobs. For instance, name_000010.grd has a tag width of 6. By default, this width is automatically set, but if you are splitting large jobs across several computers (via +s) then you must ensure the same tag width for all job names.
+s starts the output job numbering at first instead of 0. Note: All jobs are still included; this modifier only affects the numbering of the specific jobs on output.
+w will split the trailing text string into individual words that can be accessed via variables BATCH_WORD0, BATCH_WORD1, etc. By default we look for either tabs or spaces to separate the words. Append str to select other character(s) as the valid separator(s) instead. To just use TAB as the only valid separator, use modifier +W instead.
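
As a minimal sketch of how a timefile maps to job variables, the snippet below writes a hypothetical three-record file jobs.txt (one numerical column followed by trailing text) and a main script that records what each job receives. The file names, the values, and the use of a plain text product are illustrative assumptions, not something the module requires.

```shell
# Hypothetical timefile: one numerical column, then trailing text made of two words
cat << EOF > jobs.txt
10 north.txt coarse
20 south.txt medium
30 east.txt fine
EOF

# Main script (quoted EOF keeps the variables verbatim for batch to fill in)
cat << 'EOF' > main.sh
gmt begin
    # Each job writes the variables it was given to its own product file
    echo "job ${BATCH_JOB}: value=${BATCH_COL0} file=${BATCH_WORD0} label=${BATCH_WORD1}" > ${BATCH_NAME}.txt
gmt end
EOF
```

Under these assumptions the sequence could be launched with gmt batch main.sh -Ndemo -Tjobs.txt+w. Since numbering starts at 0, job 1 should then see BATCH_COL0 as 20, BATCH_TEXT as "south.txt medium", BATCH_WORD0 as south.txt and BATCH_WORD1 as medium.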

Optional Arguments

-D: Select this option if (1) the main script does not produce products named using the prefix BATCH_NAME, so we should not attempt to move such files to the top directory, or (2) the main script will handle the placement of any such product files directly.

-Ftemplate: Rather than build product file names from the BATCH_NAME prefix based on a single running number, use this C-format template instead and create unique names by formatting the data columns given by timefile. Some limitations apply: (1) If timefile has trailing text then it may be used with a single %s code as the last format statement in template. If no %s is found then any trailing text present will not be used. (2) The previous N format statements will be used to convert the first N data columns in timefile; there is no option to skip a column or to specify a specific order of columns in the template (but see -iflags to rearrange the input order). (3) Up to five numerical statements may be used (provided the timefile has enough columns), including none. E.g., -Fmy_data_%05.2lf_%07.0lf_%s will use the first two numerical columns in timefile as well as the trailing text to create a unique product prefix. Note: Since a GMT data set internally is using double precision variables you must use floating point format statements even if some or all of your data columns are integers. Finally, if your choice of format statement and trailing text yield tabs or spaces in the final prefix we will automatically replace those with underscores.
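
To get a feel for what such a template produces, the sketch below emulates the example template with the shell's own printf rather than with batch itself. The two numbers and the trailing text are made-up values for one hypothetical record, and the shell's printf takes %f where the C template uses %lf; treat the result as an illustration of the formatting rules, not as guaranteed module output.

```shell
# One hypothetical timefile record: two numerical columns plus trailing text
col0=12.5
col1=300
text="north east"

# Same layout as -Fmy_data_%05.2lf_%07.0lf_%s, written for the shell's printf
name=$(LC_ALL=C printf 'my_data_%05.2f_%07.0f_%s' "${col0}" "${col1}" "${text}")

# Replace any spaces or tabs with underscores, as described for the final prefix
name=$(printf '%s' "${name}" | tr ' \t' '__')

echo "${name}"    # my_data_12.50_0000300_north_east
```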

-Iincludefile: Insert the contents of includefile into the batch_init.sh script that is accessed by all batch scripts. This mechanism is used to add information (typically constant variable assignments) that the mainscript and any optional -S scripts can rely on.
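
A minimal sketch of this mechanism, assuming a Bourne shell main script: the include file holds plain variable assignments, and the main script refers to them like any other variable. The file name constants.sh and the variables INPUT_GRID and DIST_FLAG are hypothetical names chosen for illustration.

```shell
# Hypothetical include file with constant assignments shared by all scripts
cat << 'EOF' > constants.sh
INPUT_GRID=data.grd
DIST_FLAG=2
EOF

# Main script that relies on those constants plus the per-job variables
cat << 'EOF' > main.sh
gmt begin
    gmt grdfilter ${INPUT_GRID} -Fg${BATCH_COL0}+h -D${DIST_FLAG} -G${BATCH_NAME}.grd
gmt end
EOF
```

The include file would then be passed along with the other options, for instance gmt batch main.sh -Iconstants.sh -Nfilter -Twidths.txt.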

-M[job]: Instead of making and launching the full processing sequence, select a single master job [0] for testing. The master job will be run and its product(s) are placed in the workdir. While any preflight script will be run prior to the master job, the postflight script will not be executed (but it will be created).

-Q[s]: Debugging: Leave all files and directories we create behind for inspection. Alternatively, append s to only build the batch scripts but not perform any executions. One exception involves the optional preflight script set via -Sb which is always executed since it may produce data needed when building the main batch (or master) scripts.

-Sbpreflight: The optional GMT modern mode preflight script (written in the same scripting language as mainscript) can be used to download or copy data files or create files (such as timefile) that will be needed by mainscript. It is always run before the main sequence of batch scripts.

-Sfpostflight: The optional postflight script (written in the same scripting language as mainscript) can be used to perform final processing steps following the completion of all the individual jobs, such as assembling all the products into a single larger file, report overall statistics, etc. The script may also make one or more illustrations using the products or stacked data after the main processing is completed. Note: The postflight script does not have to be a GMT script.

-V[level]: Select verbosity level [w].

-W[dir]: By default, all temporary files and job products are created in the subdirectory prefix set via -N. You can override that selection by giving another dir as a relative or full directory path. If no path is given then we create a working directory in the system temp folder named prefix. The main benefit of a working directory is to avoid endless syncing by agents like DropBox or TimeMachine, or to avoid problems related to low space in the main directory. The product files will still be placed in the prefix directory. The dir is removed unless -Q is specified for debugging.

-Z: Erase the mainscript and all input scripts given via -I and -S upon completion. Not compatible with -Q.

-f[i|o]colinfo: Specify data types of input and/or output columns.

-x[[-]n]: Limit the number of cores to use when distributing the jobs. By default we try to use all available cores. Append n to only use n cores (if n is too large it will be truncated to the maximum cores available). Finally, give a negative n to select (all - n) cores (or at least 1 if n equals or exceeds all). The parallel processing does not depend on OpenMP; new jobs are launched when the previous ones complete. Note: One core is reserved by batch so in effect n-1 are used for the jobs.
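
The selection rule can be written out as a few lines of shell arithmetic. This is only a sketch of the rule as described above, not the module's implementation; the function name cores_for_batch is hypothetical.

```shell
# cores_for_batch TOTAL [N]: echo how many cores -x would select on a machine
# with TOTAL cores. N is the -x argument; leaving it out means "all cores".
cores_for_batch() {
    total=$1
    n=${2:-$1}
    if [ "${n}" -lt 0 ]; then
        n=$((total + n))            # negative n: all minus |n| cores
        [ "${n}" -ge 1 ] || n=1     # but never fewer than one
    elif [ "${n}" -gt "${total}" ]; then
        n=${total}                  # too large: truncate to what is available
    fi
    echo "${n}"
}

cores_for_batch 8 4     # 4
cores_for_batch 8 16    # 8
cores_for_batch 8 -2    # 6
cores_for_batch 8 -8    # 1
```

Per the note above, one of the selected cores is reserved by batch itself, so the number left for jobs is one less.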

-^ or just -: Print a short message about the syntax of the command, then exit (Note: on Windows just use -).
-+ or just +: Print an extensive usage (help) message, including the explanation of any module-specific option (but not the GMT common options), then exit.
-? or no arguments: Print a complete usage (help) message, including the explanation of all options, then exit.
--PAR=value: Temporarily override a GMT default setting; repeatable. See gmt.conf for parameters.

Generate 1-D Array

We will demonstrate the use of options for creating 1-D arrays via math. Make an evenly spaced coordinate array from min to max in steps of inc, e.g.:

gmt math -o0 -T3.1/4.2/0.1 T =
3.1
3.2
3.3
3.4
3.5
3.6
3.7
...

Append +b if we should take \(\log_2\) of min and max, get their nearest integers, build an equidistant \(\log_2\)-array using inc integer increments in \(\log_2\), then undo the \(\log_2\) conversion. E.g., -T3/20/1+b will produce this sequence:

gmt math -o0 -T3/20/1+b T =
4
8
16

Append +l if we should take \(\log_{10}\) of min and max and build an array where inc can be 1 (every magnitude), 2 (1, 2, 5 times magnitude) or 3 (1-9 times magnitude). E.g., -T7/135/2+l will produce this sequence:

gmt math -o0 -T7/135/2+l T =
10
20
50
100

For output values less frequently than every magnitude, use a negative integer inc:

gmt math -o0 -T1e-4/1e4/-2+l T =
0.0001
0.01
1
100
10000

Append +i if inc is a fractional number and it is cleaner to give its reciprocal value instead. To set up times for a 24-frames per second animation lasting 1 minute, run:

gmt math -o0 -T0/60/24+i T =
0
0.0416666666667
0.0833333333333
0.125
0.166666666667
...

Append +n if inc is meant to indicate the number of equidistant coordinates instead. To have exactly 5 equidistant values from 3.44 to 7.82, run:

gmt math -o0 -T3.44/7.82/5+n T =
3.44
4.535
5.63
6.725
7.82
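
The arithmetic behind +n can be checked independently of GMT: n equidistant values from min to max are spaced by (max - min)/(n - 1). The sketch below uses awk purely as a calculator for the values in this example.

```shell
# Spacing for n equidistant values is (max - min) / (n - 1)
values=$(awk -v min=3.44 -v max=7.82 -v n=5 'BEGIN {
    inc = (max - min) / (n - 1)        # 1.095 for these numbers
    for (i = 0; i < n; i++) print min + i * inc
}')
echo "${values}"
```

This prints 3.44, 4.535, 5.63, 6.725 and 7.82, one per line.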

Alternatively, let inc be a file with output coordinates in the first column, or provide a comma-separated list of specific coordinates, such as the first 6 Fibonacci numbers:

gmt math -o0 -T0,1,1,2,3,5 T =
0
1
1
2
3
5

Notes: (1) If you need to pass the list nodes via a dataset file yet be understood as a list (i.e., no interpolation), then you must set the file header to contain the string “LIST”. (2) Should you need to ensure that the coordinates are unique and sorted (in case the file or list are not sorted or have duplicates) then supply the +u modifier.

If you only want a single value then you must append a comma to distinguish the list from the setting of an increment.

If the module allows you to set up an absolute time series, append a valid time unit from the list year, month, day, hour, minute, and second to the given increment; add +t to ensure time column (or use -f). Note: The internal time unit is still controlled independently by TIME_UNIT. The first 7 days of March 2020:

gmt math -o0 -T2020-03-01T/2020-03-07T/1d T =
2020-03-01T00:00:00
2020-03-02T00:00:00
2020-03-03T00:00:00
2020-03-04T00:00:00
2020-03-05T00:00:00
2020-03-06T00:00:00
2020-03-07T00:00:00

A few modules allow for +a which will paste the coordinate array to the output table.

Likewise, if the module allows you to set up a spatial distance series (with distances computed from the first two data columns), specify a new increment as inc with a geospatial distance unit from the list degree (arc), minute (arc), second (arc), meter, foot, kilometer, miles (statute), nautical miles, or survey foot; see -j for calculation mode. To interpolate Cartesian distances instead, you must use the special unit c.

Finally, if you are only providing an increment and will obtain min and max from the data, then it is possible (max - min)/inc is not an integer, as required. If so, then inc will be adjusted to fit the range. Alternatively, append +e to keep inc exact and adjust max instead (keeping min fixed).

Parameters

Several parameters are automatically assigned and can be used when composing the mainscript and the optional preflight and postflight scripts. There are two sets of parameters: those that are constants and those that change with the job number. The constants are accessible by all the scripts:

BATCH_PREFIX: The common prefix of the batch jobs (it is set with -N).
BATCH_NJOBS: The total number of jobs (given or inferred from -T).

Also, if -I was used then any static parameters listed therein will be available to all the scripts as well. In addition, the mainscript has access to parameters that vary with the job counter:

BATCH_JOB: The current job number (an integer, e.g., 136).
BATCH_ITEM: The formatted job number given the precision (a string, e.g., 000136).
BATCH_NAME: The name prefix unique to the current job (i.e., prefix_BATCH_ITEM).

Furthermore, if a timefile was given then variables BATCH_COL0, BATCH_COL1, etc. are also set, yielding one variable per column in timefile. If timefile has trailing text then that text can be accessed via the variable BATCH_TEXT, and if word-splitting was explicitly requested by the +w modifier to -T then the trailing text is also split into individual word parameters BATCH_WORD0, BATCH_WORD1, etc.

Note: Any product(s) made by the processing scripts should be named using BATCH_NAME as their name prefix as these will be automatically moved up to the starting directory upon completion (unless -D is in effect). However, note that -F can be used to select more diverse product names based on the input parameters given via -T.
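
The relation between the job-specific names can be emulated with a few lines of shell. The prefix, job number, and tag width below are made-up values (batch sets the real ones), and the snippet only mirrors the naming described above; it is not code taken from the module.

```shell
# Made-up inputs: prefix as given via -N, a job number, and a tag width of 6
prefix=filter
job=136
width=6

item=$(printf "%0${width}d" "${job}")   # zero-padded job number
name="${prefix}_${item}"                # name prefix for this job's products

echo "BATCH_JOB=${job} BATCH_ITEM=${item} BATCH_NAME=${name}"
# BATCH_JOB=136 BATCH_ITEM=000136 BATCH_NAME=filter_000136
```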

Data Files

The batch scripts will be able to find any files present in the starting directory when batch was initiated, as well as any new files produced by mainscript or the optional scripts set via -S. No path specification is needed to access these files. Other files may require full paths unless their directories were already included in the DIR_DATA setting.

Custom gmt.conf files

If you have a gmt.conf file in the top directory with your main script prior to running batch then it will be used and shared across all the scripts created and executed unless your scripts use -C when starting a new modern mode session. The preferred way of changing GMT defaults is via gmt set calls in your input scripts. Note: Each script is run in isolation (modern) mode so trying to create a gmt.conf file via the preflight script to be used by other scripts is futile.
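
As a sketch of the preferred approach, the main script below changes two defaults with gmt set inside its own modern mode session. The plotting command is borrowed from the country example in this document; the choice of FONT_TITLE and MAP_FRAME_TYPE and their values is arbitrary.

```shell
cat << 'EOF' > main.sh
gmt begin ${BATCH_NAME} pdf
    # Change defaults inside the script itself; they apply to this job's session only
    gmt set FONT_TITLE 18p MAP_FRAME_TYPE plain
    gmt coast -R${BATCH_WORD0}+r2 -JQ10c -Glightgray -Slightblue -B -B+t"${BATCH_WORD1}"
gmt end
EOF
```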

Constructing the Main Script

A batch sequence is not very interesting if nothing changes between calls. For the process to change you need to have your mainscript either access a different data set as the job number changes, or you need to access only a varying subset of a data set, or the processing parameters need to change, or all of the above. There are several strategies you can use to accomplish these effects:

Your timefile passed to -T may list names of specific data files and you simply have your mainscript use the relevant BATCH_TEXT or BATCH_WORD? to access the job-specific file name.
You have a 3-D grid (or a stack of 2-D grids) and you want to interpolate along the axis perpendicular to the 2-D slices (e.g., time, or it could be depth). In this situation you will use the module grdinterpolate to have the mainscript obtain a slice for the correct time (this may be an interpolation between two different times or depths) and process this temporary grid file.
You may be creating data on the fly using math or grdmath, or perhaps processing data slightly differently per job (using parameters in the timefile) and computing these or the changes between jobs.
Use your imagination to pass whatever arguments are needed via timefile.

Technical Details

The batch module creates several hidden script files that are used in the generation of the products (here we have left the script file extension off since it depends on the scripting language used): batch_init (initializes variables related to the overall batch job and includes the contents of the optional includefile), batch_preflight (optional since it derives from -Sb and computes or prepares needed data files), batch_postflight (optional since it derives from -Sf and processes files once all the batch jobs complete), batch_job (accepts a job counter argument and processes data for those parameters), and batch_cleanup (removes temporary files at the end of the process). For each job, there is a separate batch_params_###### script that provides job-specific variables (e.g., job number and anything given via -T). The preflight and postflight scripts have access to the information in batch_init, while the batch_job script in addition has access to the job-specific parameter file. Using -Qs will build these scripts without executing them (apart from any preflight script), so you can examine them. Note: The mainscript is duplicated per job and many of these are run simultaneously on all available cores. Multi-threaded GMT modules will therefore be limited to a single core per call. Because we do not know how many products each batch job makes, we ensure each job creates a unique file when it is finished. Checking for these special (and empty) files is how batch learns that a particular job has completed and it is time to launch another one.
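
The completion-file idea can be illustrated with a toy scheduler in plain shell. This is a conceptual sketch only and not the module's implementation: run_job is a stand-in for one batch job, the marker file names are invented, and the limits (six jobs, two at a time) are arbitrary.

```shell
# Toy scheduler: run at most max_running jobs at once and detect completion
# by looking for empty marker files, one per finished job.
njobs=6
max_running=2
workdir=$(mktemp -d)

run_job() {                 # stand-in for one job; real work would go here
    sleep 0.1
    touch "${workdir}/job_$1.done"
}

started=0; finished=0; running=0
while [ "${finished}" -lt "${njobs}" ]; do
    # Launch new jobs while there are free slots and jobs left to start
    while [ "${running}" -lt "${max_running}" ] && [ "${started}" -lt "${njobs}" ]; do
        run_job "${started}" &
        started=$((started + 1)); running=$((running + 1))
    done
    # Each marker found means one job has completed and freed a slot
    for marker in "${workdir}"/job_*.done; do
        [ -e "${marker}" ] || continue
        rm -f "${marker}"
        finished=$((finished + 1)); running=$((running - 1))
    done
    sleep 0.05
done
wait
rmdir "${workdir}"
echo "completed ${finished} of ${njobs} jobs"
```

Polling for marker files keeps the scheduler independent of how many products each job writes, which is the motivation given above.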

Shell Limitations

As we cannot control how a shell (e.g., bash or csh) implements piping between two processes (it often involves a sub-shell), we advise against using commands in your main script that involve piping the result from one GMT module into another (e.g., gmt blockmean ….. | gmt surface …). Because batch is running many instances of your main script simultaneously, odd things can happen when sub-shells are involved. In our experience, piping in the context of batch scripts may corrupt the GMT history files, resulting in stray messages from some jobs, such as region not set, etc. Split such pipe constructs into two using a temporary file when writing batch main scripts. Note: Piping from a non-GMT module into a GMT module or vice versa is not a problem (e.g., echo ….. | gmt convert …).
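
A sketch of the recommended split, written as a main script: the first module writes to a temporary file named after the job number and the second reads it back. The input file points.txt, the region, the grid spacing, and the use of BATCH_COL0 as the surface tension are hypothetical choices made only to give the commands something concrete to work on.

```shell
cat << 'EOF' > main.sh
gmt begin
    # Two steps via a job-specific temporary file rather than one pipe between GMT modules
    gmt blockmean points.txt -R0/10/0/10 -I0.5 > tmp_${BATCH_JOB}.txt
    gmt surface tmp_${BATCH_JOB}.txt -R0/10/0/10 -I0.5 -T${BATCH_COL0} -G${BATCH_NAME}.grd
    rm -f tmp_${BATCH_JOB}.txt
gmt end
EOF
```

Including the job number in the temporary file name keeps simultaneous jobs from overwriting each other's intermediate data.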

Hints for Batch Makers

Composing batch jobs is relatively simple, but you have to think in terms of variables. Examine the examples we describe. Then, start by making a single script (i.e., your mainscript) and identify what things should change with time (i.e., with the job number). Create variables for these values. If they are among the listed parameters that batch creates automatically then use those names. Unless you only require the job number you will need to make a file that you can pass via -T. This file should then have all the values you need, per job (i.e., per row), with values across all the columns you need. If you need to assign various fixed variables that do not change with time, then your mainscript will look shorter and cleaner if you offload those assignments to a separate includefile (via -I). To test your mainscript, start by using options -Q -M to ensure that your master job results are correct. The -M option simply runs one job of your batch sequence (you can select which one via the -M argument [0]). Fix any issues with your use of variables and options until this works. You can then try to remove -Q. We recommend you make a very short (i.e., via -T) and small batch sequence so you don’t have to wait very long to see the result. Once things are working you can beef up the number of jobs.
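
One way to follow this advice, sketched under the assumption that your main.sh reads a single column from the timefile: write a three-record timefile and run a single master job with the debugging option on. The file name short.txt, its values, and the prefix trial are made up, and the gmt call is guarded so the snippet does nothing beyond writing the file where GMT or main.sh is absent.

```shell
# A three-record timefile keeps the trial short (the values are made up)
printf '%s\n' 10 20 30 > short.txt

# Run only master job 1 and leave all files behind for inspection
if command -v gmt > /dev/null 2>&1 && [ -f main.sh ]; then
    gmt batch main.sh -Ntrial -Tshort.txt -Q -M1
fi
```

Once the master job looks right, the same command without -Q and -M would run the full (still short) sequence, after which the timefile can be extended.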

Examples

We extract a subset of bathymetry for the Gulf of Guinea from the 2x2 arc minute resolution Earth DEM and compute Gaussian filtered high-pass grids using filter widths ranging from 10 to 200 km in steps of 10 km. When the grids are all completed we determine the standard deviation in the results. To replicate our setup, try:

cat << EOF > pre.sh
gmt begin
    gmt math -o0 -T10/200/10 T = widths.txt
    gmt grdcut -R-10/20/-10/20 @earth_relief_02m -Gdata.grd
gmt end
EOF
cat << 'EOF' > main.sh
gmt begin
    gmt grdfilter data.grd -Fg${BATCH_COL0}+h -G${BATCH_NAME}.grd -D2
gmt end
EOF
cat << 'EOF' > post.sh
gmt begin ${BATCH_PREFIX} pdf
    gmt grdmath ${BATCH_PREFIX}_*.grd -S STD = ${BATCH_PREFIX}_std.grd
    gmt grdimage ${BATCH_PREFIX}_std.grd -B -B+t"STD of Gaussians residuals" -Chot
    gmt coast -Wthin,white
gmt end show
EOF
gmt batch main.sh -Sbpre.sh -Sfpost.sh -Twidths.txt -Nfilter -V -Z

Of course, the syntax of how variables are used varies according to the scripting language. Here, we actually build the pre.sh, main.sh, and post.sh scripts on the fly, hence we must keep the shell from expanding any variables (they start with a dollar sign that needs to be written verbatim). By putting EOF in quotes, the redirect will not replace the variables but leave them as verbatim text. At the end of the execution we find 20 grids (e.g., filter_07.grd), as well as the filter_std.grd file obtained by stacking all the individual grids and computing a standard deviation. The information needed to do all of this is hidden from the user; the actual batch scripts that we execute are derived from the user-provided main.sh script and batch supplies the extra machinery. The batch module automatically manages the parallel execution loop over all jobs using all available cores and launches new jobs as others complete.

As another example, we get a list of all European countries and make a simple coast plot of each of them, placing their name in the title and the 2-character ISO code in the upper left corner, then in postflight we combine all the individual PDFs into a single PDF file and delete the individual files. Here, we place the EOF tag in quotes, which prevents the un-escaped variables from being interpreted:

cat << EOF > pre.sh
gmt begin
    gmt coast -E=EU+l > countries.txt
gmt end
EOF
cat << 'EOF' > main.sh
gmt begin ${BATCH_NAME} pdf
    gmt coast -R${BATCH_WORD0}+r2 -JQ10c -Glightgray -Slightblue -B -B+t"${BATCH_WORD1}" -E${BATCH_WORD0}+gred+p0.5p
    echo ${BATCH_WORD0} | gmt text -F+f16p+jTL+cTL -Gwhite -W1p
gmt end
EOF
cat << 'EOF' > post.sh
gmt psconvert -TF -F${BATCH_PREFIX} ${BATCH_PREFIX}_*.pdf
rm -f ${BATCH_PREFIX}_*.pdf
EOF
gmt batch main.sh -Sbpre.sh -Sfpost.sh -Tcountries.txt+w"\t" -Ncountries -V -W -Z

Here, the postflight script may not even be a GMT script. In our case we simply run psconvert (which just calls gs (Ghostscript)) and delete what we don’t want to keep.

macOS Issues

Note: The limit on the number of concurrently open files is relatively small by default on macOS and when executing numerous jobs at the same time it is not unusual to get failures in batch jobs with the message “Too many open files”. We refer you to this helpful article for various solutions.
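
As a quick check before a large run, the shell's ulimit builtin reports the per-process limit on open files and can raise it for the current session, up to the hard limit set by the system. This is a generic shell sketch and not necessarily one of the solutions in the article referred to above; the target of 4096 is an arbitrary example value.

```shell
# Show the current soft limit on open files for this shell session
current=$(ulimit -n)
echo "Open-file limit: ${current}"

# Try to raise it for this session only; this fails if the hard limit is lower
wanted=4096
if [ "${current}" != "unlimited" ] && [ "${current}" -lt "${wanted}" ]; then
    ulimit -n "${wanted}" 2> /dev/null || echo "Could not raise the limit to ${wanted}"
fi
```

The change only affects the shell in which it is made, so batch would need to be started from that same session.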