Overview of LML_da and its configuration


LML_da extracts status data from the target system's resource management system. It parses the output of batch-system-specific commands such as qstat, pbsnodes or llq and converts it into an XML format that can be processed by the LLview client. The first steps of the data retrieval process generate data formatted in LML, an XML format designed specifically for transferring status data of supercomputers. Depending on availability and access rules, LML_da can obtain the following information:

  • compute nodes managed by the batch system
  • batch jobs running and waiting in queues
  • information about reservations, rules for job-queue mapping and queue-node mapping
  • system state (failed components, ...)
  • additional information, e.g. power usage, I/O-throughput, memory usage, temperature, ... (depending on system)

LML_da is a set of Perl scripts that use the standard batch system query functions to obtain the information described above. Which status data can be retrieved depends on the access rights of the user account that executes LML_da. For example, some query commands can only be executed from a privileged user account.

For large systems the queries can take a long time (more than 60 seconds), because information about all jobs and all nodes of the system is requested. To avoid overloading the batch system management daemons, LML_da should be installed by a system administrator and run as a single instance on the system via crontab (e.g. once a minute).

Technical structure of LML_da

The status data is generated in a sequence of steps, executed by the main script. An example call of LML_da is:

./ -conf=<workflow-config-file>

The main steps are calling the driver scripts, combining the output files, adding unique color information and generating the XML intended for the LLview client. These steps are configured in a workflow configuration file. This file is similar to a simple shell script in which each command is wrapped in an XML tag. In addition, each step can be activated or deactivated with an XML attribute. Another XML attribute defines dependencies between steps, which makes it easier to reuse a set of associated steps.

In the following, an example workflow file is examined step by step in order to give a starting point for configuring LML_da for a specific target system. The entire workflow file is given here.

Step by step workflow documentation

The structure of a workflow configuration file is given by the following frame:

  <var key="var1" value="./LMLtmp" />              
  <var key="var2" value="./LMLperm" />

<step id="step1" .../>
<step id="step2" .../>
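To make the structure more concrete, the following Python sketch reads a small workflow and lists its variables and active steps. It is only an illustration and not part of LML_da: the root element name `workflow`, the placement of `var` inside `vardefs` and the `echo` commands are assumptions, while the attributes `active`, `id`, `exec_after`, `type` and `exec` are the ones used in this document.

```python
import xml.etree.ElementTree as ET

# Hypothetical workflow: the root element name "workflow" and the echo
# commands are placeholders chosen for this sketch.
WORKFLOW = """
<workflow>
  <vardefs>
    <var key="var1" value="./LMLtmp" />
    <var key="var2" value="./LMLperm" />
  </vardefs>
  <step active="1" id="step1" exec_after="" type="execute">
    <cmd exec="echo first" />
  </step>
  <step active="0" id="step2" exec_after="step1" type="execute">
    <cmd exec="echo second" />
  </step>
</workflow>
"""

root = ET.fromstring(WORKFLOW.strip())

# Collect all variable definitions as a key -> value mapping.
variables = {var.get("key"): var.get("value") for var in root.iter("var")}

# Keep only steps marked active="1", with their dependency and commands.
active_steps = [
    (step.get("id"), step.get("exec_after"),
     [cmd.get("exec") for cmd in step.findall("cmd")])
    for step in root.iter("step")
    if step.get("active") == "1"
]

print(variables)
print(active_steps)
```

In this sketch step2 is skipped because its active attribute is set to 0.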

Variable definitions

The vardefs section specifies variables that can be referenced throughout the workflow. Here, you can specify parameters such as the installation directory of LML_da or the paths to temporary and permanent directories. The following vardefs section defines the installation directory of LML_da ($instdir), a temporary directory for volatile data ($tmpdir), a permanent directory for important generated data ($permdir) and a directory for history data ($histdir). All subsequent occurrences of these variables are replaced by the assigned values. In addition, the special variables $stepinfile and $stepoutfile hold the paths to the input and output file of each step. With these variables you can think of a step as a pipe: $stepinfile is processed and the step's output can be accessed at $stepoutfile.

	<var key="instdir" value="/path/to/lml/da" />
	<var key="tmpdir" value="./tmp" />
	<var key="permdir" value="./perm" />
	<var key="histdir" value="$permdir/hist" />
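The substitution of variables can be pictured with a short Python sketch. It is not the LML_da implementation; it assumes that definitions are resolved in the order they are written, so that $histdir can refer to $permdir, and that unknown names are left untouched. The function names `expand` and `resolve_vardefs` are made up for this example.

```python
import re

# Matches $name, where name consists of letters, digits and underscores.
VARIABLE = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")

def expand(text, variables):
    """Replace each $name in text by its value; unknown names stay as-is."""
    return VARIABLE.sub(lambda m: variables.get(m.group(1), m.group(0)), text)

def resolve_vardefs(vardefs):
    """Resolve definitions in order, so later values may use earlier ones."""
    resolved = {}
    for key, value in vardefs:
        resolved[key] = expand(value, resolved)
    return resolved

vardefs = [
    ("instdir", "/path/to/lml/da"),
    ("tmpdir", "./tmp"),
    ("permdir", "./perm"),
    ("histdir", "$permdir/hist"),
]
variables = resolve_vardefs(vardefs)

print(variables["histdir"])  # ./perm/hist
print(expand("cp $stepinfile $permdir/LMLraw_torque.xml", variables))
```

The second call leaves $stepinfile unexpanded, because that variable is only known while a step is being processed.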

Step 1: driver scripts

The first step is usually to parse the output of the batch system commands and convert it into separate LML files (one for global system information, one for jobs, one for nodes and so forth). This step differs for every target system; it calls the adapter scripts located in $instdir/rms/<target-system>/. Example configurations for other target systems, such as LML_da_BGQ_sample.conf or LML_da_SLURM_sample.conf, are located directly in $instdir.

<step active="1" id="getdata" exec_after="" type="execute">
		/usr/bin/perl $instdir/rms/TORQUE/ 
		$tmpdir/sysinfo_LML.xml" />
		/usr/bin/perl $instdir/rms/TORQUE/ 
		$tmpdir/nodes_LML.xml" />
		$instdir/rms/TORQUE/ $tmpdir/jobs_LML.xml" />

The paths to the batch system commands are configured via the variables CMD_JOBINFO and CMD_NODEINFO. The commands (<cmd>) are executed in the given order. This step is configured as active and does not depend on a previous step (exec_after is empty). The other steps name the step they depend on in the exec_after attribute. From these attributes a dependency chain for all steps can be derived, which defines the order in which the steps are processed. Depending on the target system, different adapter scripts need to be triggered here. For most target systems the scripts presented here form the basis of the data retrieval scripts. For debugging or testing, the commands can also be executed independently of the workflow: replace the variables by their values and run the expanded command from a shell.
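How a processing order follows from the exec_after attributes can be sketched in Python. The step ids are the ones used in this document; the function `execution_order` is hypothetical, and treating a dependency on an inactive or unknown step as an error is an assumption of this sketch, not documented LML_da behaviour.

```python
def execution_order(steps):
    """Order active steps so that each one runs after its exec_after step."""
    pending = [s for s in steps if s["active"] == "1"]
    ordered, done = [], set()
    while pending:
        # A step is ready if it has no dependency or its dependency has run.
        ready = [s for s in pending
                 if s["exec_after"] == "" or s["exec_after"] in done]
        if not ready:
            # Assumption: an unresolvable dependency is reported as an error.
            names = ", ".join(s["id"] for s in pending)
            raise ValueError("unresolvable dependency among: " + names)
        for step in ready:
            ordered.append(step["id"])
            done.add(step["id"])
        pending = [s for s in pending if s["id"] not in done]
    return ordered

# Steps deliberately listed out of order; only the attributes matter.
steps = [
    {"id": "cpLLviewFile", "active": "1", "exec_after": "Usage"},
    {"id": "addcolor", "active": "1", "exec_after": "cppermfile"},
    {"id": "getdata", "active": "1", "exec_after": ""},
    {"id": "Usage", "active": "1", "exec_after": "convertLLview"},
    {"id": "combineLML", "active": "1", "exec_after": "getdata"},
    {"id": "convertLLview", "active": "1", "exec_after": "addcolor"},
    {"id": "cppermfile", "active": "1", "exec_after": "combineLML"},
]

print(execution_order(steps))
```

The result starts with getdata, the only step with an empty exec_after, and ends with cpLLviewFile.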

Step 2: LML merge

The following step merges the three separate LML files into a single file for further processing.

<step active="1" id="combineLML" exec_after="getdata" type="execute">
		-v -o $stepoutfile $tmpdir/sysinfo_LML.xml 
		$tmpdir/jobs_LML.xml $tmpdir/nodes_LML.xml" />

Here the special variable $stepoutfile is used, which is automatically replaced with the path to a temporary file. In this case $stepoutfile is replaced with $tmpdir/datastep_combineLML.xml, as the generic naming scheme for $stepoutfile is $tmpdir/datastep_$stepname.xml. In the next step the variable $stepinfile is set to the value of $stepoutfile of the previous step. Note that LML_da expects a step to process the input file into a derived file located at $stepoutfile. If the command does not generate $stepoutfile, LML_da moves $stepinfile to $stepoutfile.
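The naming scheme and the fallback move described above can be illustrated with a small Python sketch. It only mimics the described behaviour in a throwaway directory; the helper names `step_outfile` and `finish_step` and the placeholder file content are assumptions made for this example.

```python
import os
import shutil
import tempfile

def step_outfile(tmpdir, step_id):
    """Path following the scheme $tmpdir/datastep_$stepname.xml."""
    return os.path.join(tmpdir, f"datastep_{step_id}.xml")

def finish_step(stepinfile, stepoutfile):
    """If the command produced no output file, move the input file there."""
    if not os.path.exists(stepoutfile):
        shutil.move(stepinfile, stepoutfile)
    return stepoutfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Output of the merge step, standing in for real LML data.
    merged = step_outfile(tmpdir, "combineLML")
    with open(merged, "w") as handle:
        handle.write("<lml/>")  # placeholder content, not real LML

    # A step that only copies its input writes nothing to $stepoutfile.
    stepinfile = merged
    stepoutfile = step_outfile(tmpdir, "cppermfile")
    next_stepinfile = finish_step(stepinfile, stepoutfile)

    input_was_moved = (os.path.exists(next_stepinfile)
                       and not os.path.exists(merged))

print(os.path.basename(next_stepinfile), input_was_moved)
```

In this sketch the file produced by combineLML ends up at the output path of the following step, so the step after that still finds its input at $stepinfile.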

Step 3: save raw LML file

So far the merged file is only stored at $stepoutfile. To make it persistent, an explicit copy step is necessary, as configured in the following step.

<step active="1" id="cppermfile" exec_after="combineLML" type="execute">
	<cmd exec="cp $stepinfile $permdir/LMLraw_torque.xml" />
</step>

Step 4: add colors

The next step assigns unique colors to all jobs and nodes. These colors are used in the visualization to identify resources assigned to each job.

<step active="1" id="addcolor" exec_after="cppermfile" type="execute">
	<cmd exec="/usr/bin/perl $instdir/LML_color/ 
		-colordefs $instdir/LML_color/default.conf -dbdir $permdir 
		-o $stepoutfile $permdir/LMLraw_torque.xml" />
	<cmd exec="cp $stepoutfile $permdir/LML_color.xml" />
</step>

To preserve the result of this step, the output file is copied directly to $permdir. This command is optional, but the more step files are preserved, the easier it is to debug the workflow.

Step 5: convert LML to LLview XML

All steps up to this point generate or process LML formatted data. At this point the LML file is converted into an older XML data format on which the LLview client operates. The conversion is run by the following step configuration.

<step active="1" exec_after="addcolor" id="convertLLview" type="execute">
	exec="/usr/bin/perl $instdir/LML2llview/ 
	-o $permdir/LML_color.xml 
	$stepinfile" />

The $stepinfile path is used as the input file. The result overwrites the $permdir/LML_color.xml file. Special configuration parameters can be passed to this step's script. The parameter (SMTenabled=12)(CONF_CPUS=24)(CONF_CORES=12) defines that by default a compute node has 12 physical cores, each providing two CPUs, so that a node has 24 CPUs. A node is estimated to be using SMT (Simultaneous Multithreading) if more than 12 CPUs are in use. If your system does not support SMT, CONF_CPUS and CONF_CORES can be set to the same value. These parameters are very useful for homogeneous systems, where all nodes are configured identically. If data for a compute node is missing due to an error on the node, the default value CONF_CORES is used for the number of configured cores. For heterogeneous systems, these configuration parameters can be omitted.
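The effect of these parameters can be written down as a short Python sketch. It only restates the rules given above with the example values 12 and 24; the function names are hypothetical and the actual conversion script may apply further logic.

```python
# Example values from the parameter string discussed above.
SMT_ENABLED = 12   # threshold: more CPUs in use than this suggests SMT
CONF_CPUS = 24     # default number of CPUs per node
CONF_CORES = 12    # default number of physical cores per node

def uses_smt(cpus_in_use, smt_enabled=SMT_ENABLED):
    """A node is estimated to use SMT if more than smt_enabled CPUs are used."""
    return cpus_in_use > smt_enabled

def configured_cores(reported_cores, conf_cores=CONF_CORES):
    """Fall back to the default when a node reports no data (None here)."""
    return conf_cores if reported_cores is None else reported_cores

print(uses_smt(12))            # False: exactly 12 CPUs fit on 12 cores
print(uses_smt(13))            # True: more CPUs than physical cores
print(configured_cores(None))  # 12: missing data, default is used
```

Representing missing node data as None is a choice made for this sketch only.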

Step 6: collect load history

The next step collects the load of the system in a usage file, which makes it possible to present the system load history. The interval parameter configures the step width of the history diagram in hours. In this case one load value is stored every 15 minutes. Decreasing the interval increases the resolution of the history on the x-axis. The parameter maxlogentries specifies the number of collected data values. Here 288 values are stored, which adds up to a history range of three days.

<step active="1" exec_after="convertLLview" id="Usage" type="execute">
	<cmd exec="/usr/bin/perl $instdir/UsageDB/ 
	--infile $permdir/LML_color.xml
	--outfile $permdir/llview_usage.xml
	--v " />
</step>

For this step to work properly the directory $permdir/db needs to be created manually before executing LML_da. Otherwise, the step will fail.
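The history range quoted for this step follows directly from the two parameters. The Python lines below redo the arithmetic with the values from the text (an interval of 0.25 hours and 288 entries); the function name is made up for this example.

```python
def history_range_hours(interval_hours, maxlogentries):
    """Time span covered by the history: one entry per interval."""
    return interval_hours * maxlogentries

interval_hours = 0.25   # one load value every 15 minutes
maxlogentries = 288     # number of stored values

hours = history_range_hours(interval_hours, maxlogentries)
print(hours, hours / 24)  # 72.0 hours, i.e. 3.0 days
```

Halving the interval while keeping maxlogentries doubles the resolution on the x-axis but halves the covered time span.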

Step 7: copy final result

The last step only copies the result to $permdir/llview_final.xml. This file should be used as the input file for the LLview client.

<step active="1" exec_after="Usage" id="cpLLviewFile" type="execute">
	<cmd exec="cp $permdir/llview_usage.xml  $permdir/llview_final.xml" />
</step>