Vesta SDL Primitive Function: _run

`_run_tool( platform: text, command: list(text), stdin: text = "", stdout_treatment: text = "report", stderr_treatment: text = "report", status_treatment: text = "report_nocache", signal_treatment: text = "report_nocache", fp_content: int = -2, wd: text = ".WD", existing_writable: bool = FALSE): binding`

_run_tool is the function for invoking an external tool (such as a compiler or linker) from a Vesta SDL program. Thus, it is arguably the central function of Vesta SDL.

Most users should not have to directly use, or even understand, _run_tool. In most cases it will be invoked by an abstract bridge function which implements some higher-level operation (e.g. "compile this source file", "build a program from these sources", etc.). In fact, one of the major goals of the design of Vesta SDL is abstraction of complex build targets into such functions. Only those user who need to write new bridges should have to understand _run_tool.

The documentation of _run_tool is broken up into the following sections:

A Simple example
Encapsulation
Capturing Standard Output
Result Files
_run_toolReturn Value
Providing Standard Input
Modifying Files
Changing the Working Directory
Failure
Deterministically Fingerprinting Result Files
platform and Host Selection
Controlling how Dependencies are Recorded [New]
Summary

Also see the language specification for another description of _run_tool.

A Simple example

Suppose we have a very simple program which we would like to run inside a Vesta evaluation. This program simply prints "Hello world" to its standard output. Let's also suppose that we have a statically linked copy of this program, compiled for Linux on the IA-32 architecture, in a file named "hello_world". If we place this file into a Vesta checkout working directory, a simple model to invoke it might look like this:

files
  hello_world;    // import the executable into a variable
{
  // Set up the required ./envVars and a root filesystem with the
  // executable in the working directory.
  . = [ envVars = [ ],
        root/.WD = [ hello_world ],
      ];

  // The platform to execute the command on.
  platform = "Linux2.4-ia32";

  // The command line to execute.
  cmd = <"hello_world">;

  // Run the program
  r = _run_tool(platform, cmd);
  return r;
}

If this model is evaluated, the output will be something like this:

0/linuxhost.foobar.com: hello_world 
Hello world!

The first parameter to _run_tool is a string specifying which platform the command is to be executed on. For now, you can just think of this as an arbitrary string. (This is described in detail below, and the evaluator man page also describes _run_tool host selection.)

The second parameter to _run_tool is a list of text values which specifies the command line to be executed. In this case the program takes no command-line arguments, so the list contains just one element: the name of the executable to run.

Like any other function call, there is an implicit final parameter to _run_tool which defaults to the value of the variable named "." (aka "dot"). _run_tool expects this parameter to have the following type:

binding(
  envVars:binding(:text),
  root:binding
)

That is, it must be a binding with at least two names defined: "envVars", and "root".

The value of ./envVars must a binding containing only text values. This specifies the complete set of environment variables when the command is run. In other words, when _run_tool invokes a command, it does so in an encapsulated environment. In this case, since we know that "hello_world" doesn't need any environment variables to execute, we leave ./envVars empty.

The value of ./root defines the entire filesystem seen by the command being executed. All filesystem accesses are redirected such that they are interpreted as look-ups within the binding ./root. That is, _run_tool invokes commands in encapsulated filesystems. (This is implemented with the chroot(2) system call.) The only file we need is the hello_world executable, which we place in a directory which the command will see as "/.WD". This is the default working directory for _run_tool.

Note that if hello_world were not statically linked, then we would also need to include whatever shared libraries it needed to run (e.g. /usr/lib/libc.so) in the value of ./root.

Encapsulation

Probably the most important feature of _run_tool is its encapsulation of both the filesystem and the environment in which commands are executed. The goal of encapsulation is simple: control all the inputs which have an effect on the result of a build. This provides several important properties:

Users can neither make nor break a build with their personal environment variable settings. If a build works for one user, it will work exactly the same for everyone.
Builds are not affected by changes to files in the filesystem at large. Since the root filesystem is passed in as an argument to _run_tool, changes to the compiler version in /usr/bin or the library versions in /usr/lib simply cannot affect tools run from a Vesta evaluation. (Some people say this solves the "sands shifting beneath your feet" problem.)
Builds are guaranteed the be repeatable over time. Since the environment variables and filesystem are specified precisely and completely in the arguments to _run_tool, re-evaluating the same Vesta SDL model always produces the same results.
Dependency detection is automatic, language-independent, and complete. When a tool is invoked with _run_tool, the evaluator keeps track of which files in ./root the command reads. When the command completes, the evaluator records the parameters, the dependencies, and the result in the cache server. If _run_tool is later invoked again with the same command-line, the same ./envVars, and all the files which the previous _run_tool call used having the same contents, the saved result from the old call can be re-used without re-running the tool.
Developers can benefit from work done by each others builds. Since _run_tool calls are user-independent and have all their dependencies recorded, one user can make use of the result files produced by a build that another user previously ran.

Of course, _run_tool does not provide a complete virtual environment. The person writing Vesta SDL code invoking a command with _run_tool should be aware of any cases in which the behavior of the tool depends on something other than environment variables and files. Some tools make use of information obtained from the operating system, such as the date and time when they are run. Also, any tool which makes network connections is obviously problematic. For the kind of processes Vesta SDL was intended to support (running a sequence of tools that process source files and produce result files), this sort of issue normally isn't a problem.

Capturing Standard Output

Returning to our earlier example, suppose that we want to capture the standard output of hello_world. Perhaps we want to return it as the result of the model, so that even if the _run_tool call is cached, the user who evaluates the build can get the output text. Another reason we might want to save it is that we want to use it as an input to another tool.

The fourth parameter to _run_tool, stdout_treatment, determines what is done with the standard output of the command. The default for this parameter is "report", which means that it should be displayed to the terminal of the user performing the build. To capture it, we'll use one of the other possible choices: "value".

files
  hello_world;    // import the executable into a variable
{
  // Set up the required ./envVars and a root filesystem with the
  // executable in the working directory.
  . = [ envVars = [ ],
        root/.WD = [ hello_world ],
      ];

  // The platform to execute the command on.
  platform = "Linux2.4-ia32";

  // The command line to execute.
  cmd = <"hello_world">;

  // Run the program
  r =  _run_tool(platform, cmd,
		 /*stdin=*/ "" /* (the default value) */,
		 /*stdout_treatment=*/ "value" /* (save stdout) */);

  // Return just the standard output
  return r/stdout;
}

If this revised model is evaluated, the output will be something like this:

0/linuxhost.foobar.com: hello_world

You'll note that the standard output was not displayed in the output of the evaluator. If you add "-shipto /tmp/hello_world.out" to the evaluator command line (before the model name), then the output will be placed in the file /tmp/hello_world.out.

If we wanted to both display and capture the standard output, then we would pass "report_value" for stdout_treatment. The table below summarizes all the possible values for stdout_treatment.

`stdout_treatment`	Handling of standard output
`"report"` (default)	Displayed by the evaluator but not captured.
`"value"`	Captured and returned in the result of `_run_tool`, but not displayed.
`"ignore"`	Discarded without being displayed or captured. (Think "`> /dev/null`".)
`"report_value"`	Both displayed and returned in the result of `_run_tool`. (Think "`\| tee`".)
`"report_nocache"`	Displayed by the evaluator and not captured. If non-empty, the evaluator will not add a cache entry for this `_run_tool` call. (Causes a tool to be re-executed in subsequent evaluations if it produced any output.)

The possible values for stderr_treatment, and their effects on the handling of the standard error stream, are exactly the same.

Result Files

Suppose that hello_world takes a command line option that specifies a file to which it should write its message, rather than standard output. So the command line "hello_world -o hi_there.dat" would cause hello_world to write its message to a file named "hi_there.dat". We can re-write the model to use this feature.

files
  hello_world;    // import the executable into a variable
{
  // Set up the required ./envVars and a root filesystem with the
  // executable in the working directory.
  . = [ envVars = [ ],
        root/.WD = [ hello_world ],
      ];

  // The platform to execute the command on.
  platform = "Linux2.4-ia32";

  // The command line to execute.
  cmd = <"hello_world", "-o", "hi_there.dat">;

  // Run the program
  r =  _run_tool(platform, cmd);

  // Return the file written
  return r/root/.WD/hi_there.dat;
}

The sub-binding named "root" within the result of _run_tool represents changes made to the filesystem by the command during its execution. As mentioned earlier, the default working directory is "/.WD". Since we didn't specify an absolute path for the output file, it will be written in the working directory. So, we can find the contents of the file written with the expression "r/root/.WD/hi_there.dat".

(Note that the value of ./root is unaffected by _run_tool. Vesta SDL is a functional language, so calling _run_tool, or any other function, can't have any side-effects.)

`_run_tool`Return Value

Now that we've seen a few examples using it, let's take a detailed look at the value returned by _run_tool. Its data type could be written like this:

binding(code   : int,
        signal : int,
        stdout_written : bool,
        stderr_written : bool,
        stdout : text,
        stderr : text,
        root   : binding)

So, in the examples above:

r/code is the exit status of the process invoked by _run_tool. (Unless it failed, this will be 0.)
r/signal is the signal that terminated the process, or 0 if it exited voluntarily. (This is usually also 0.)
r/stdout_written is TRUE if the command wrote anything to its standard output stream and FALSE otherwise. (This would be TRUE for the first example above and the second, but FALSE for the third.)
Similarly, r/stderr_written is TRUE if the command wrote anything to its standard error stream and FALSE otherwise.
r/stdout is a text value containing the bytes written to standard output by the tool. Note that r/stdout is only defined if stdout_treatment is "value" or "report_value". (So r/stdout would only be defined in the second example above. )
Similarly, r/stderr is a text value containing the bytes written to standard error by the tool. Note that r/stderr is only defined if stderr_treatment is "value" or "report_value". (So actually, r/stderr would not be defined in any of the preceding examples, as in all of them we left stderr_treatment with its default value of "report".)
r/root is a binding representing the changes made to the filesystem by the tool. More specifically:
- Any files which the tool creates or modifies while it runs which still exist when the tool exits will have their contents in text values within the root sub-binding of the result. Above, we used this to get the contents of the file which the tool wrote as "/.WD/hi_there.dat" with the expression "r/root/.WD/hi_there.dat". If the tool wrote some other file such as "/foo/bar/blah.bin", we could get its contents at the end of the tool's run with "r/root/foo/bar/blah.bin".
- Any files which the tool deletes during its run will have a value of FALSE in the root sub-binding of the result. This is the case both for files which existed when the tool is started (those in ./root when _run_tool is called), and any files created and then deleted by the tool (such as temporary files used by the tool to store intermediate results). So, if there was a file "/foo/bar/blah.bin" at the start of the tool which the tool deleted, then "r/root/foo/bar/blah.bin" would evaluate to FALSE. Similarly, if the tool created and then deleted a file named "/tmp/hello.XXXXuKmKuL", then "r/root/tmp/hello.XXXXuKmKuL" would be FALSE.

Note that if you wanted to apply the changes made to the filesystem by a tool to subsequent code, you would have to do something like this:

/**nocache**/
remove_deleted(b: binding): binding
{
  res: binding = [];
  foreach [ n = v ] in b do
    res += if (v == FALSE) then []
           else if _is_binding(v) then [ $n = remove_deleted(v) ]
           else [ $n = v ];
  return res;
};

// ...

r = _run_tool(/* ... */);

// Apply filesystem changes from this point forward
new_root = remove_deleted(./root ++ r/root);
. += [ root = new_root ];

Providing Standard Input

The standard input for commands executed by _run_tool is empty by default. It's controlled by the stdin parameter. This parameter is literally the entire contents of the standard input stream, much like it had been redirected from a file.

Continuing with our running example, let's suppose that hello_world will replace the word "world" in its output with a word read from standard input if it is passed the "-i" command-line flag. To use this, we need only pass a text string with the word we want for the stdin parameter to _run_tool.

files
  hello_world;    // import the executable into a variable
{
  // Set up the required ./envVars and a root filesystem with the
  // executable in the working directory.
  . = [ envVars = [ ],
        root/.WD = [ hello_world ],
      ];

  // The platform to execute the command on.
  platform = "Linux2.4-ia32";

  // The command line to execute.
  cmd = <"hello_world", "-i">;

  // Run the program
  r = _run_tool(platform, cmd,
                /*stdin=*/ "foobar");

  return r/root/.WD/message.txt;
}

Evaluating this model would produce output like this.

0/linuxhost.foobar.com: hello_world -i
Hello foobar!

Modifying Files

By default, the files passed into a _run_tool call appear read-only to the command executed. However sometimes we want a tool to be able to modify a file (rather than just creating new ones). To allow modification of existing files, pass TRUE for the existing_writable parameter.

Building on earlier examples, suppose hello_world has another command line option that specifies a file to which it should append its message. So the command line "hello_world -a message.txt" would cause hello_world to append its message to the end of a file named "message.txt".

files
  hello_world;    // import the executable into a variable
  message.txt;    // file to be modified
{
  // Set up the required ./envVars and a root filesystem with the
  // executable in the working directory.
  . = [ envVars = [ ],
        root/.WD = [ hello_world, message.txt ],
      ];

  // The platform to execute the command on.
  platform = "Linux2.4-ia32";

  // The command line to execute.
  cmd = <"hello_world", "-a", "message.txt">;

  // Run the program
  r = _run_tool(platform, cmd,
                /*stdin=*/ "",
                /*stdout_treatment=*/ "report",
                /*stderr_treatment=*/ "report",
                /*status_treatment=*/ "report_nocache",
                /*signal_treatment=*/ "report_nocache",
                /*fp_content=*/ -2,
                /*wd=*/ ".WD",
                /*existing_writable=*/ TRUE);

  return r/root/.WD/message.txt;
}

In this example, the file message.txt must exist in the directory with this model file and the hello_world executable. If it contains the text:

Goodbye home...

Then the result of the model will be this text:

Goodbye home...
Hello world!

(Of course the original message.txt in the directory with the model will be unmodified. It's treated as an immutable source, kept separate from the changes made by tools.)

Changing the Working Directory

As mention above, the working directory when the command executed by a _run_tool starts defaults to "/.WD". (This corresponds to the value of ./root/.WD at the time of the _run_tool call.) For many tools, the working directory isn't important, so the default is often used.

However, a different working directory can be specified with _run_tool's wd parameter. The leading slash should be omitted when specifying the working directory. So, if we wanted to run a tool in the directory /foo/bar, we would pass "foo/bar" for the wd parameter. Here's a model illustrating this with hello_world:

files
  hello_world;    // import the executable into a variable
{
  // Set up the required ./envVars and a root filesystem with the
  // executable in the working directory.
  . = [ envVars = [ ],
        root/foo/bar = [ hello_world ],
      ];

  // The platform to execute the command on.
  platform = "Linux2.4-ia32";

  // The command line to execute.
  cmd = <"hello_world", "-o", "hi_there.dat">;

  // Run the program
  r =  _run_tool(platform, cmd,
                /*stdin=*/ "",
                /*stdout_treatment=*/ "report",
                /*stderr_treatment=*/ "report",
                /*status_treatment=*/ "report_nocache",
                /*signal_treatment=*/ "report_nocache",
                /*fp_content=*/ -2,
                /*wd=*/ "foo/bar");

  // Return the file written
  return r/root/foo/bar/hi_there.dat;
}

Failure

There are several different ways that a _run_tool call can fail. The handling of some of these cases is affected by the values passed for the status_treatment and signal_treatment parameters.

If the tool exits with a non-zero status (indicating failure), the default behavior is to display an error message and halt the evaluation. This is controlled by the status_treatment parameter, which has the following possible values:

`status_treatment`	Handling of tool exit status
`"report_nocache"` (default)	If the exit status is non-zero do not add a cache entry for this `_run_tool` call (or any of the functions in the call stack above it). If the `-k` ("keep going") flag was specified on the evaluator command-line, record the exit status in the `_run_tool` result and continue. Otherwise, treat this as a run-time error and halt the evaluation.
`"report"`	Record the exit status in the `_run_tool` result, and continue regardless of its value.

If the tool was terminated by a signal (segmentation fault, floating-point exception, etc.), the default behavior is to display an error message and halt the evaluation. This is controlled by the signal_treatment parameter, which has the same possible values and associated effects as status_treatment.
If the tool can't be executed (the specified executable doesn't exist in ./root, or isn't executable, or has the wrong format, etc.), an error message will be printed and the evaluation will halt.
If there is an error with the arguments to _run_tool (./root or ./envVars is not set, the wrong data type was passed for one of the parameters, an illegal value was passed for stdout_treatment, etc.), an error message will be printed and the evaluation will halt.

Deterministically Fingerprinting Result Files

All files in the Vesta repository, both sources and derived files produced by commands executed with _run_tool are assigned a unique identifying number called a fingerprint. Fingerprints are primarily used by the evaluator and cache server to identify when a previously produced (e.g. an object file produced by a previous compilation) result can be re-used. Fingerprints have a fixed size, so it's much faster to compare two fingerprints than to compare the complete contents of two files.

There are two different ways in which fingerprints can be assigned: by content, and arbitrarily. Fingerprinting by content is essentially a check-sum. Arbitrarily assigned fingerprints are simply chosen in a way that is very likely to be unique.

The main value in having files fingerprinted by content is that it allows for cache hits when a file is identical but was produced in a different way. For example, the vadvance command can fingerprint source files by content. That way if a developer makes a change and performs a build, and then removes the change (maybe because it didn't work), when they vadvance again, the repository will recognize that the file's contents are the same. A subsequent build using that file could re-use a cached result from a build performed before the change was made.

Fingerprinting of derived files produced during a _run_tool calls is controlled by the fp_content parameter, and the associated configuration variable [Evaluator]FpContent. The table below summarizes how these two affect the fingerprinting of derived files.

`fp_content`	Fingerprinting of derived files
A positive integer	Any derived files whose size in bytes is less than `fp_content` will be fingerprinted by content. All other derived files will be given an arbitrary unique fingerprint.
-1	All derived files will be fingerprinted by content.
0	All derived files will be given an arbitrary unique fingerprint.
-2 (default)	Act as though the value of `fp_content` is the value of `[Evaluator]FpContent`. (Thus: if `[Evaluator]FpContent` is set to a positive integer all files smaller than that number of bytes will be fingerprinted by content, if it is set to -1 all derived files will be fingerprinted by content, and if it is 0 all derived files will be given an arbitrary unique fingerprint.)
`TRUE`	Synonym for -1. (All derived files will be fingerprinted by content.)
`FALSE`	Synonym for 0. (All derived files will be given an arbitrary unique fingerprint.)

Why would you want to fingerprint derived files by content? Suppose you have a compiler which will produce bit-wise identical results from semantically equivalent sources even if the sources are not bit-wise identical. gcc behaves this way: adding or removing comments of white-space will not change the binary contents of the object files it produces. If you fingerprint the result files of such a compiler by content, then subsequent dependent tool invocations could re-use previously cached results, even though the sources changed. If gcc's derived files are deterministically fingerprinted, then a developer who builds, adds source comments, and rebuilds would see the evaluator run gcc but skip the final link. The compilation would be re-run because the source file changed, but the previously cached link operation could be re-used because the object file would be recognized as identical to what was used in the previous build.

Why wouldn't you always fingerprint all derived files by content? Because there is a computational cost for fingerprinting files. (It takes approximately 1 second per megabyte on machines circa 2000, and obviously goes down as computing power goes up.) Furthermore, the computation is performed by the repository server, not the client machine performing the evaluation. Since the repository is a central resource shared by all users at a site, care must be taken when deciding to fingerprint derived files by content.

`platform` and Host Selection

Vesta is designed to be a multi-platform system. A single evaluation can use _run_tool to execute commands on different computers, even of different CPU architectures and running different operating systems.

To execute the command specified by a _run_tool call, the evaluator contacts the RunToolServer daemon running on an appropriate machine. (This could of course be the same machine running the evaluator, but it need not be.) The selection of which hosts are considered appropriate for a given _run_tool call is controlled by the platform argument to _run_tool and associated settings in the Vesta configuration file.

To select a host, the evaluator looks up several values in the section named after the platform argument. Here's a typical section from a vesta.cfg defining a platform for Linux machines with IA-32 processors running a 2.4 kernel (corresponding to the platform "Linux2.4-ia32" that we've been using above):

[Linux2.4-ia32]
sysname = Linux
release = 2.4.*
version = *
machine = i?86
cpus    = 1
cpuMHz  = 0
memKB   = 0
hosts   = romulus remus

The setting for [Linux2.4-ia32]hosts is a list of hostnames to be considered as candidates for this platform. In this case it specifies two machines: one named "romulus" and one named "remus".

The other variables in this section (sysname, release, version, machine, cpus, cpuMHz, memKB) describe the characteristics a machine must have to be used for this platform.. The RunToolServer daemon provides information corresponding to each of these fields about the machine on which it is running to any evaluator that queries it. In detail here's what each of these fields mean and how they are used:

`platform` section setting	Meaning	Matching
`sysname`	The operating system type (the same thing returned by "`uname -s`").	Matched against the OS type of each host like a shell wild-card (aka glob) pattern. (The example above has to wild-card characters, so only matches machines running Linux.)
`release`	The operating system release (the same thing returned by "`uname -r`").	Matched against the OS release of each host like a shell wild-card pattern. (The example above matches "`2.4.17`", "`2.4.3-12`", and "`2.4.2-2smp`", but not "`2.2.17`".)
`version`	The operating system version (the same thing returned by "`uname -v`").	Matched against the OS version of each host like a shell wild-card pattern. (The example above matches any system version.)
`machine`	The machine type (the same thing returned by "`uname -m`").	Matched against the machine type of each host like a shell wild-card pattern. (The example above matches all of "`i386`", "`i486`", "`i586`", "`i686`", but would not match "`alpha`", "`9000/785`", or "`ia64`".)
`cpus`	The number of CPUs. (How to get this value varies between operating systems, but on Linux you can get it with the command "`grep -c '^cpu[0-9]' /proc/stat`".)	Any system with at least this many CPUs will be considered acceptable. (The example above would accept uni-processor machines, dual-processors, and any other machine, unless someone invents a 0-processor computer. :-)
`cpuMHz`	The CPU speed in MHz. (How to get this value varies between operating systems, but on Linux you can usually find it in `/proc/cpuinfo`.)	Any system with at least this CPU speed will be considered acceptable. (The example above would accept machines regardless of speed. It can be set to a higher value if you have a reason to limit tool executions to faster machines.)
`memKB`	The memory size in kilobytes. (How to get this value varies between operating systems, but on Linux you can usually find it in `/proc/cpuinfo`.)	Any system with at least this much physical memory will be considered acceptable. (The example above would accept machines regardless of physical memory size. It can be set to a higher value if you have a reason to limit tool executions to machines with more memory.)

After finding the set of hosts which match the criteria specified for the platform, the evaluator simply selects one which is the least heavily loaded. (Note that as of this writing, the RunToolServer only reports load in terms of the number of tool invocations it is currently executing not the load average on the whole system.) This can be used to distribute the computational load of a build over a number of hosts (when running the evaluator multi-threaded with models using _par_map).

Here are some other examples of platform sections that you might see in a vesta.cfg file (without any hosts lists):

;
; Tru64 Alpha machines
;
[DU4.0]
sysname = OSF1
release = V[45].0
version = *
machine = alpha
cpus    = 1
cpuMHz  = 0
memKB   = 0

;
; Linux Alpha with a 2.4 kernel
;
[Linux2.4-alpha]
sysname = Linux
release = 2.4.*
version = *
machine = alpha
cpus    = 1
cpuMHz  = 0
memKB   = 0

You can define as many platforms as you want, give them whatever names you want, and make them as general or as specific as you like. One interesting example of the use of additional platform definitions is the way one site partitioned certain tool invocations which were known to have large memory requirements. They defined a platform named "bigmemDU4.0" which had memKB set to 1024000, so that only hosts with a gigabyte of memory or more would be considered candidates for that platform. Then in their models they used "DU4.0" for most of their tool invocations but switched to this "bigmemDU4.0" for the memory-intensive tool runs. This allowed them to make use of a large pool of personal workstations for most tool runs, while limiting certain tools to servers with large physical memory (as sending them to the workstations would cause virtual memory thrashing).

The evaluator man page has a shorter description of _run_tool host selection. You may also find it useful to refer to RunToolServer man page.

Controlling how Dependencies are Recorded

[This section refers to a new feature introduced in eval/91. Though some people are using it, it is not yet available in any major or minor Vesta release.]

For caching, the primary key of each _run_tool call includes all the arguments (platform, command, stdin, stdout_treatment, stderr_treatment, status_treatment, signal_treatment, fp_content, wd, and existing_writable) plus ./envVars. (It must include all of ./envVars, because it's impossible to determine which environment variables a tool uses and which ones it ignores.) The secondary dependencies are recorded as the tool runs. The following table lists different kinds of filesystem accesses and the corresponding dependency that would be recorded. (Secondary dependencies are written with a leading character representing the kind of dependency followed by a slash followed by a path repesenting a specific value.)

Example of Filesystem Access	Recorded Dependency
Opening the file `/foo/bar`	`N/./root/foo/bar` ("N" type dependencies are on the entire value.)
Calling `stat(2)` on the file `/foo/bar`	`N/./root/foo/bar` (This is the same as opening the file. Recorded dependencies can't distinguish between opening a file and checking its attributes such as size and executable status.)
Looking for a file/directory `/foo/bar` that doesn't exist	`!/./root/foo/bar` ("!" type dependencies are on the existence of a particular name in a binding.)
Listing a directory `/foo/bar`	`B/./root/foo/bar` ("B" type dependencies are on the list of names in a binding. Note that this includes the order of the names in the binding, as bindings are ordered lists of name/value pairs.)
Looking for a directory `/foo/bar` but not looking inside it (rare)	`T/./root/foo/bar` ("T" type dependencies are on the type of a value.)

In some cases, it may be desirable to change the way _run_tool is cached. For example, if you are certain that particular files will always be read by a tool, it may be desirable to include them in the primary key. If there is an empty directory in which temporary files are created with random names, rather than recording one "!" dependency for each such filename it may be preferable to record a dependency on the entire directory. ./tool_dep_control can be used to make these kinds of adjustments.

./tool_dep_control may be left unset. If it is set, it should be a binding. It has three sub-bindings:

pk: Any names in this binding bound to TRUE or a non-zero integer will cause the corresponding piece of ./root to be incorporated into the primary key of the _run_tool call. You can include individual files or entire directories into the primary key this way. If anything included in the primary key changes in a later evaluation, the _run_tool call will not be able to get a hit in the cache (even if the previous tool didn't actually use the file that was included in the primary key). Any files or directories included in the primary key will not be recorded as secondary dependency in the cache. (However, files/directories included in the primary key that the tool makes use of will still be recorded in dependencies and propagated along with the result value of _run_tool as they might not be included in the primary key of higher level function calls.) It's worth noting that the order of elements in ./tool_dep_control/pk and its sub-bindings do not affect the primary key. The binding elements are sorted by name before using them to form the primary key. (Normally binding order is significant when computing primary keys.)
coarse: Any names in this binding bound to TRUE or a non-zero integer will cause the corresponding directory in ./root to be recorded coarsely, as though the tool used all of it. If a directory is recorded coarsely and the tool uses only one file in the directory, future calls will cache miss if anything in the directory changes regardless of whether the tool actually used it. Even adding or removing files to such a directory will cause a cache miss. A common use of coarse directory recording is for temporary directories where a tool creates files with random temporary names. Recording the directory coarsely will avoid accumulating many "!" dependencies across different tool runs.
coarse_names: Any names in this binding bound to TRUE or a non-zero integer will cause the corresponding directory in ./root to have non-existence dependencies recorded coarsely, as though the tool had listed the entire directory. If the tool looks for names which don't exist inside a directory recorded this way, future calls will cache miss if the set of names in the directory has changed or if the order of the names has changed. (Remember that directories like bindings have an order to the names they contain.) If ./tool_dep_control/coarse_names is a true boolean or a non-zero integer rather than a binding, all fine-grained recording of non-existence dependencies will be disabled.

Suppose you have an input file which a tool will always read that is named on the tool command line and placed in the working directory. Making the input file part of the _run_tool primary key will split up cache entries for different input file contents. This will reduce the number of cache entries that need to be considered when checking for a cache hit and make builds more efficient. You could do this with SDL code something like the following:

run_foo(input_file:binding(:text))
{
  // ...

  // Place the input file in the working directory
  . ++= [ root/.WD = input_file ];

  // Include the input file in the tool primary key
  . ++= [ tool_dep_control/pk/.WD/$(_n(input_file)) = TRUE ];

  tool_result = _run_tool(./target_platform,
                          <"foo", _n(input_file)>);

  // ...
};

Imagine a complex tool provided by a vendor that that generates C code and then compiles it by invoking a C compiler. Suppose that each time it runs it creates a header file with a random temporary name which is included by the C file being compiled. Suppose also that this file is searched for along the include path even though it is placed in the working directory. Over time this could result in a large number of secondary dependencies accumulating across multiple _run_tool cache entries:

!/./root/usr/include/pKWNnmjYGE.h
!/./root/usr/include/YGCwYNLNMn.h
!/./root/usr/include/0XThnvz0Nf.h
!/./root/usr/include/5S0SS9B6CP.h
!/./root/usr/include/MQvxLwk0GY.h
...

If all the cache entries have the same primary key, having so many existence secondary dependencies will force the evaluator to check for the existence of all previous temporary file names on each successive tool invocation. One way to avoid this would be to record the names in /usr/include coarsely, which might make sense if the set of header files in that directory don't change very often.

// Keep from accumulating existence dependencies for temporary names in /usr/include
. ++= [ tool_dep_control/coarse_names/usr/include = TRUE ];

If ./tool_dep_control is not set, it defaults to [coarse=[tmp=TRUE,usr/tmp=TRUE,var/tmp=TRUE]]. In other words, any accesses of the directories "/tmp", "/usr/tmp", and "/var/tmp", will record a dependency on the entire directory. (These directories are often used for temporary files and are typically empty at the start of a _run_tool call.) However, if ./tool_dep_control is set thse directories are not recorded coarsely unless specified in ./tool_dep_control/coarse. You can of course add in these default coarse directories with the following statement:

. ++= [ tool_dep_control/coarse = [tmp=TRUE,usr/tmp=TRUE,var/tmp=TRUE] ];

Finally, the following table shows how the same _run_tool call would record dependencies and be cached with different settings for ./tool_dep_control. The tool is a fictional one named (/usr/bin/foo) which reads some files from the working directory and some from another directory (/usr/share/foo). For purposes of illustrating some of the operating system pieces we'll assume it's running on a Linux-like system.

`./tool_dep_control`	Primary Key	Secondary Dependencies	Description
[ ]	A	`N/./root/usr/bin/foo N/./root/lib/ld-linux.so.2 !/./root/etc/ld.so.preload !/./root/etc/ld.so.cache N/./root/lib/libc.so.6 N/./root/lib/libm.so.6 N/./root/dev/null N/./root/.WD/a.x !/./tmp/edizLqM816.z N/./root/usr/share/foo/b.y !/./root/usr/share/foo/c.y N/./root/.WD/c.y !/./root/usr/share/foo/d.y N/./root/.WD/d.y !/./tmp/192WmQkAd0.z`	This is the original call with no modifications made to the the normal dependency recording and primary key. (We'll use letters to represent different primary keys rather than writing out several different 128-bit numbers in hex.) A few things to note: Initial loading of the program executable and shared libraries Using a search path to look for some names in `/usr/share/foo` and then in `/.WD` Creating temporary files in `/tmp` Accessing `/dev/null`
[ coarse = [ lib = TRUE ] ]	A	`N/./root/usr/bin/foo N/./root/lib !/./root/etc/ld.so.preload !/./root/etc/ld.so.cache N/./root/dev/null N/./root/.WD/a.x !/./tmp/edizLqM816.z N/./root/usr/share/foo/b.y !/./root/usr/share/foo/c.y N/./root/.WD/c.y !/./root/usr/share/foo/d.y N/./root/.WD/d.y !/./tmp/192WmQkAd0.z`	Here we've made the recording of `/lib` coarse. Rather than the three secondary dependencies on specific files within `/lib`, `_run_tool` acts as though the tool read the entire `/lib` directory. Looking for a cache hit or miss will be less work because there are fewer secondary dependencies. However, if anything at all in `/lib` is different on a later `_run_tool` (i.e. if files and directories never used by this tool are changed, added, or removed), a cache hit on this earlier entry will not be possible. Since `/lib` is primarily basic components provided by the operating system that change infrequently, changes to its contents would probably mean that the `_run_tool` is using a different OS version and would miss anyway.
[ pk = [ lib = TRUE ] ]	B (Includes the entire `/lib` directory)	`N/./root/usr/bin/foo !/./root/etc/ld.so.preload !/./root/etc/ld.so.cache N/./root/dev/null N/./root/.WD/a.x !/./tmp/edizLqM816.z N/./root/usr/share/foo/b.y !/./root/usr/share/foo/c.y N/./root/.WD/c.y !/./root/usr/share/foo/d.y N/./root/.WD/d.y !/./tmp/192WmQkAd0.z`	An alternative to recording `/lib` as a single coarse dependency would be to include it in the `_run_tool` primary key. Since we expect it to be used by this `_run_tool` call every time, it's perfectly reasonable to put it in the primary key. (You would not want to include a directory in the primary key if the tool would use it in some cases but not in others.) Putting `/lib` in the primary key causes the evaluator to fingerprint the directory before beginning the cache lookup process. This is different from recording `N/./root/lib` as a coarse dependency, as the cache first tells the evaluator the secondary dependencies to compute and then the cache searches for a match. Doing the work up front means more work done in the evaluator and less done in the cache server. (In a large installation this means distributing more work to clients and doing less in a central location.) It's worth noting that putting a directory in the primary key removes recorded secondary dependencies when caching `_run_tool`, but it doesn't change how they are recorded. While the tool runs, it will still record the three secondary dependencies on specific files in `/lib` ("`N/./root/lib/ld-linux.so.2`", "`N/./root/lib/libc.so.6`", and "`N/./root/lib/libm.so.6`"). If these files are passed down in the value of `./root` through several layers of function calls, then these secondary dependencies can propagate back up through those calls anywhere the result of our `_run_tool` is used. For this reason, it may make sense to put a directory in both `./tool_dep_control/pk` and `./tool_dep_control/coarse`.
[ coarse = [ lib = TRUE, tmp = TRUE ] ]	A	`N/./root/usr/bin/foo N/./root/lib !/./root/etc/ld.so.preload !/./root/etc/ld.so.cache N/./root/dev/null N/./root/.WD/a.x N/./tmp N/./root/usr/share/foo/b.y !/./root/usr/share/foo/c.y N/./root/.WD/c.y !/./root/usr/share/foo/d.y N/./root/.WD/d.y`	Now we've made the recording of `/tmp` coarse as well. Instead of two existence dependencies for the temporary files created by the tool, we now simply record a dependency on the whole value of `/tmp`. Since `/tmp` is empty at the start of each time we make this `_run_tool` call, this is fine. More importantly, we avoid accumulating many different existence dependencies that we would have to check each time we look for a cache hit or miss. When performing a lookup, some work must be done for each secondary dependency in the union of all secondary dependency sets across all current cache entries with the same primary key. Suppose this `_run_tool` call had been made 100 times in the past and that each of those created two temporary files each with a different name. Checking that `/tmp` is empty is much more efficient than individually checking to see whether `/tmp` contains each of those 200 different temporary file names.
[ coarse = [ lib = TRUE ], coarse_names = [ tmp = TRUE ] ]	A	`N/./root/usr/bin/foo N/./root/lib !/./root/etc/ld.so.preload !/./root/etc/ld.so.cache N/./root/dev/null N/./root/.WD/a.x B/./tmp N/./root/usr/share/foo/b.y !/./root/usr/share/foo/c.y N/./root/.WD/c.y !/./root/usr/share/foo/d.y N/./root/.WD/d.y`	An alternative to recording `/tmp` coarsely would be to just record its names coarsely as we've done here. The difference is that the dependency is only on the list of names (i.e. the output of "`ls /tmp`") rather than the entire contents of the directory. Rather than having a cache miss if anything in `/tmp` changes, this would cause a cache miss if any files or directories were added to or removed from `/tmp`. Since there are no files or directories in `/tmp` initially they're effectively the same in this case. However, if there were any files in `/tmp` that might change without the set of files in `/tmp` changing, using `./tool_dep_control/coarse_names` rather than `./tool_dep_control/coarse` could avoid some false cache misses.
[ coarse = [ lib = TRUE, tmp = TRUE ], pk = [ .WD/a.x = TRUE ] ]	C (Includes source file `/.WD/a.x`.)	`N/./root/usr/bin/foo N/./root/lib !/./root/etc/ld.so.preload !/./root/etc/ld.so.cache N/./root/dev/null N/./tmp N/./root/usr/share/foo/b.y !/./root/usr/share/foo/c.y N/./root/.WD/c.y !/./root/usr/share/foo/d.y N/./root/.WD/d.y`	Let's suppose that the file `/.WD/a.x` is frequently modified and is the primary input to our fictional tool. The SDL code that makes the `_run_tool` places it in `/.WD` and passes its name on the tool command line. We know that every time we run the tool it will read this file. Including it in the primary key will separate cache entries for different runs of the tool when this source file has different contents. This means there will be fewer potential cache entries to consider when searching for a cache hit, which means the lookup process will be more efficient. Also, since it is now in the primary key it does not appear in the secondary dependencies of the `_run_tool` cache entry. (There's no point in having it in both the primary key and the secondary dependencies.) This doesn't mean that the evaluator forgets that the tool read it. higher-level function calls that use the result of this `_run_tool` will still include a dependency on `a.x`.
[ coarse = [ lib = TRUE, tmp = TRUE ], pk = [ .WD/a.x = TRUE, usr/bin/foo = TRUE ] ]	D (Includes source file `/.WD/a.x` and tool executable `/usr/bin/foo`.)	`N/./root/lib !/./root/etc/ld.so.preload !/./root/etc/ld.so.cache N/./root/dev/null N/./tmp N/./root/usr/share/foo/b.y !/./root/usr/share/foo/c.y N/./root/.WD/c.y !/./root/usr/share/foo/d.y N/./root/.WD/d.y`	Obviously every time the tool is run it will read the tool executable file `/usr/bin/foo`. Suppose there are multiple different versions of the tool in use. It might even be under active development as an in-house tool. Including it in the primary key will separate cache entries using different versions of the tool. Just as with the input file, this separates cache entries and gives each cache lookup operation fewer entries to consider.
[ coarse = [ lib = TRUE, tmp = TRUE ], pk = [ .WD/a.x = TRUE, usr/bin/foo = TRUE ], coarse_names = [ usr/share/foo = TRUE ] ]	D (Includes source file `/.WD/a.x` and tool executable `/usr/bin/foo`.)	`N/./root/lib !/./root/etc/ld.so.preload !/./root/etc/ld.so.cache N/./root/dev/null N/./tmp N/./root/usr/share/foo/b.y B/./root/usr/share/foo N/./root/.WD/c.y N/./root/.WD/d.y`	For some reason, our tool is searching for the files `c.y` and `d.y` in `/usr/share/foo` even though they're in `/.WD`. Perhaps for our tool `/usr/share/foo` is a library of common files similar to how `/usr/include` is used by the C compiler. Maybe it always searches this shared directory before the local directory, which means it will often record such non-existence secondary dependencies. Let's suppose that the set of files in the shared directory doesn't change very often (i.e. files don't get added to or removed from the library), though the contents of the shared files may change. If we add `/usr/share/foo` to `./tool_dep_control/coarse_names`, we can collapse all the non-existence dependencies on names in `/usr/share/foo` to a single dependency on the set of names in that directory. Because we expect the contents of the shared files to change, recording a single dependency on the entire directory or putting the directory in the primary key would be too coarse. If the contents of unused files changed, our `_run_tool` call would get a false cache miss. Recording a dependency on the set of names alone could still cause a false cache miss (if anything in `/usr/share/foo` were added or removed), but in some cases it may still be a good trade-off over the non-existence dependencies.

Summary

The parameters to _run_tool are as follows:

Parameter Type Default Description

platform text none A string specifying the platform on which the tool should be run. Refer to the description above and/or the evaluator man page for more information.

command list(text) none The command-line to execute. (Used with the execve(2) system call to start the command.) Note that the file to execute (specified by the first element of command) must be present in the filesystem passed in through ./root.

stdin text "" The standard input given to the invoked tool. Acts as though the standard input is from a file with the contents of this text string. (Can be a file accessed with a files clause, as those are just text strings.) Refer to the example above for more information.

stdout_treatment

text
(limited values)

"report"

Determines the handling of the standard output stream of the executed command. The possible values are summarized below.

Value	Handling of standard output
`"report"` (default)	Displayed by the evaluator but not captured.
`"value"`	Captured and returned in the result of `_run_tool` under the name `stdout`, but not displayed. (See the example above.)
`"ignore"`	Discarded without being displayed or captured. (Think "`> /dev/null`".)
`"report_value"`	Both displayed and returned in the result of `_run_tool`. (Think "`\| tee`".)
`"report_nocache"`	Displayed by the evaluator and not captured. If non-empty, the evaluator will not add a cache entry for this `_run_tool` call. (Causes a tool to be re-executed in subsequent evaluations if it produced any output.)

stderr_treatment text
(limited values) "report" Determines the handling of the standard error stream of the executed command. The possible values are the same as those for stdout_treatment.

status_treatment

text
(limited values)

"report_nocache"

Determines what happens if the command exits with a non-zero status. The possible values are summarized below.

Value	Handling of tool exit status
`"report_nocache"` (default)	If the exit status is non-zero do not add a cache entry for this `_run_tool` call (or any of the functions in the call stack above it). If the `-k` ("keep going") flag was specified on the evaluator command-line, record the exit status in the `_run_tool` result under the name `code` and continue. Otherwise, treat this as a run-time error and halt the evaluation.
`"report"`	Record the exit status in the `_run_tool` result, and continue regardless of its value.

signal_treatment text
(limited values) "report_nocache" Determines what happens if the command is terminated by a signal (segmentation fault, floating-point exception, etc.) rather than exiting voluntarily. The possible values are the same as those for status_treatment.

fp_content

int or
bool

-2

Along with the configuration setting [Evaluator]FpContent, determines the method used to assign fingerprints to derived files. (See the discussion above.) The possible values and their effects are summarized below.

Value	Fingerprinting of derived files
A positive integer	Any derived files whose size in bytes is less than `fp_content` will be fingerprinted by content. All other derived files will be given an arbitrary unique fingerprint.
-1	All derived files will be fingerprinted by content.
0	All derived files will be given an arbitrary unique fingerprint.
-2 (default)	Act as though the value of `fp_content` is the value of `[Evaluator]FpContent`. (Thus: if `[Evaluator]FpContent` is set to a positive integer all files smaller than that number of bytes will be fingerprinted by content, if it is set to -1 all derived files will be fingerprinted by content, and if it is 0 all derived files will be given an arbitrary unique fingerprint.)
`TRUE`	Synonym for -1. (All derived files will be fingerprinted by content.)
`FALSE`	Synonym for 0. (All derived files will be given an arbitrary unique fingerprint.)

wd text ".WD" Specifies the current working directory at the start of the command's execution with the leading slash omitted. Note that this is relative to the filesystem passed in ./root. See the example above for more information.

existing_writable bool FALSE Determines whether files existing at the start of the command's execution (those passed in ./root) will be writable by the tool. (Note that the default is for existing files to be read-only.) See the example above for more information.

.

binding(
  envVars:binding(
    :text
  ),
  root:binding
  tool_dep_control:binding(
    pk:binding,
    coarse:binding,
    coarse_names:binding
  ),
)

.

The special variable named "." (aka "dot"). For _run_tool, dot must have two sub-bindings named envVars and root. The value of ./envVars defines the complete set of environment variables when the command is run. The value of ./root defines the entire filesystem seen by the command being executed. Dot may also have a sub-binding named tool_dep_control that can be used to control how _run_tool calls are cached.

See the description accompanying the first example above, the following section on encapsulation, and the section on controlling dependencies for more information.

The return type of _run_tool is as:

binding(code   : int,
        signal : int,
        stdout_written : bool,
        stderr_written : bool,
        stdout : text,
        stderr : text,
        root   : binding)

The purpose of each name in the result binding is summarized below.

Name	Type	Description
`code`	`int`	The exit status of the process invoked by `_run_tool`. Note that if the exit status is non-zero, evaluation will halt with a run-time error unless the `status_treatment` parameter is `"report"` or the the `-k` ("keep going") flag is specified on the evaluator command-line.
`signal`	`int`	The signal that terminated the process invoked by `_run_tool`, or 0 if it exited voluntarily. Note that if the process is terminated by a signal, evaluation will halt with a run-time error unless the `signal_treatment` parameter is `"report"` or the the `-k` ("keep going") flag is specified on the evaluator command-line.
`stdout_written`	`bool`	Indicates whether the command wrote anything to its standard output stream.
`stderr_written`	`bool`	Indicates whether the command wrote anything to its standard error stream.
`stdout`	`text`	The bytes written to standard output by the tool. Note that the name `stdout` is only defined if the `stdout_treatment` parameter is `"value"` or `"report_value"`. (See the above example of capturing standard output.)
`stderr`	`text`	The bytes written to standard error by the tool. Note that the name `stderr` is only defined if the `stderr_treatment` parameter is `"value"` or `"report_value"`.
`root`	`binding`	A record of the filesystem changes made by the tool. Specifically: Any files which the tool creates or modifies while it runs which still exist when the tool exits will have their contents in text values within the `root` sub-binding of the result. Any files which the tool deletes during its run will have a value of `FALSE` in the `root` sub-binding of the result. This is the case both for files which existed when the tool is started (those in `./root` when `_run_tool` is called), and any files created and then deleted by the tool (such as temporary files used by the tool to store intermediate results). Also see the above example on result files.

Also see the earlier section on the _run_tool return value.

Kenneth C. Schalk <ken@xorian.net> / Primitive Functions / Vesta SDL Programmer's Reference