Using the LLview client

After starting LLview with the command

> llview

the main window of LLview comes up and LLview tries to get data from the defined data source. If this is the first start of LLview and the system default configuration file contains no site-specific information, a local configuration of LLview is necessary. At least one data source has to be selected: either llqxml (executing LML_da on the remote machine) or WWW (retrieving the data from a web server). Additionally, the path to the llqxml command, or the web address and authorization information, has to be specified.
The components of the main window

The main window is a collection of diagrams and tables focusing on different aspects of the monitored system. At the top of the window is a combined menu bar and status bar. The Node Display on the left renders the system architecture and maps running jobs onto the compute resources. For supercomputers using torus networks, a logical view provides a better understanding of adjacent compute nodes. A summary of the Node Display is given in the Usage bar placed directly below the status bar. A Job List shows details of running jobs. In the center of this example configuration, a three-day history view shows the recent system load. Further statistics are rendered as histograms, showing for example the distribution of job wait times, the number of jobs in each queue, or the job sizes. At the bottom, a prediction component shows a predicted schedule of the submitted jobs. Most of these elements are mouse-pointer sensitive: moving the cursor over a processor box, a job-list entry, or a colored rectangle of the usage bar causes all other display elements to highlight the corresponding information. Moving the cursor over the machine picture in the Node Display shows the usage of this node in the Info Box. More details on each of these components are given in the following.
The menu bar contains the File and Option menus on the left side and the Help menu on the right side. The Change entry of the Option menu opens windows in which all elements of LLview can be configured (see "Managing Configuration Options").
The status bar contains an entry for defining the time step between two updates. The automatic update of the information displayed by LLview can be disabled with the active option. A direct update of the information can be forced with the reload button. The next entry allows searching for user IDs (by regular expression) in the job list. The last three entries of the status bar show the time of the last update, the time until the next update, and the currently selected data source. The time of the last update is the timestamp at which the XML data was recorded on the remote machine. If the WWW data source is selected, an update event will only request the data from the web server again; this is independent of whether the data on the web server itself has been updated.
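The user-ID search described above behaves like an ordinary regular-expression filter over the job list. The following sketch illustrates the idea; the job-list entries and field layout are invented for the example and are not LLview's actual data structures.

```python
import re

# Hypothetical job-list entries (owner, job name); the layout is
# illustrative, not LLview's actual data structure.
jobs = [("alice", "job.101"), ("bob", "job.102"), ("albert", "job.103")]

def match_jobs(pattern, jobs):
    """Return jobs whose owner matches the regular expression,
    mirroring the user-ID search field of the status bar."""
    rx = re.compile(pattern)
    return [job for job in jobs if rx.search(job[0])]

print(match_jobs(r"^al", jobs))  # matches alice and albert, not bob
```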
The node display is the main element of LLview. It shows a graphical representation of the Cluster Nodes and the usage of their processors by the running jobs.
For each node of the target system, the Nodes element of LLview is able to display the node name, the memory and CPU usage, the node state, and for each processor a colored box corresponding to the job running on this processor. Due to the large number of CPUs in today's supercomputers, LLview incorporates level-of-detail functionality. For example, for the JSC MPP system JUQUEEN each of the smallest colored rectangles represents a nodeboard containing 512 cores. Since the smallest job on JUQUEEN must use at least one nodeboard, this view loses no information. The same principle can be applied to cluster systems in order to decrease the number of painted rectangles. On LoadLeveler-controlled systems, for instance, the information about CPU and memory usage is derived from the LoadLeveler data-access entries "LL_MachineLoadAverage", "LL_MachineFreeRealMemory64", and "ConsumableMemory". The dark blue part of the memory bar shows the real memory usage of each node, the light blue part the requested memory (ConsumableMemory).
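The level-of-detail reduction can be pictured as collapsing a per-core job mapping into one rectangle per nodeboard. This is a minimal sketch, assuming a flat list of per-core job assignments (the data layout is invented); because the smallest job occupies a whole nodeboard, the first core of each board is representative for the whole board.

```python
CORES_PER_NODEBOARD = 512  # JUQUEEN nodeboard size, as stated above

def nodeboard_colors(core_jobs):
    """Collapse a per-core job mapping into one entry per nodeboard.
    Since every core of a board runs the same job, the first core
    of each board determines the board's color."""
    return [core_jobs[start]
            for start in range(0, len(core_jobs), CORES_PER_NODEBOARD)]

core_jobs = ["jobA"] * 512 + ["jobB"] * 1024
print(nodeboard_colors(core_jobs))  # ['jobA', 'jobB', 'jobB']
```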
The Job List shows details of all running jobs: the number of requested processors, the job owner, the consumed wall-clock time, the requested wall-clock time, a flag indicating whether the job is running under UNICORE control, the job class, the job specifier, and the estimated end time of each job. This information implies that job scheduling is done on a wall-clock-time basis and that the nodes are not used in time-shared mode. The job specifier gives a more precise description of the number of started processes: it contains the number of nodes (n), the number of processes on each node (p), and the number of threads per process (t).
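The n/p/t specifier determines the total process count as nodes times processes per node. The following sketch parses a specifier string of an assumed form `n=32:p=4:t=2`; the exact syntax LLview uses may differ, so treat the format as a hypothetical example.

```python
def parse_specifier(spec):
    """Parse a job specifier of the assumed form 'n=32:p=4:t=2'
    (nodes, processes per node, threads per process); the exact
    LLview syntax may differ from this example."""
    fields = dict(part.split("=") for part in spec.split(":"))
    nodes, procs, threads = (int(fields[k]) for k in ("n", "p", "t"))
    return {"nodes": nodes,
            "procs_per_node": procs,
            "threads_per_proc": threads,
            "total_processes": nodes * procs}

print(parse_specifier("n=32:p=4:t=2"))  # total_processes: 128
```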
The sorting order of this list can be changed by clicking on the header keyword of the corresponding column. Clicking again reverses the sorting order; the current order is indicated by a small triangle below the header.
The Usage bar at the top of the window shows the utilization of the whole machine. The jobs are marked as small rectangles, sorted by job size. This element gives a fast overview of the fragmentation of the machine. The white part of the usage bar indicates free nodes, the grey part the number of processors which are currently not available. The numbers printed next to the usage bar give the usage in percent, the number of free processors, and the number of entirely empty nodes. The nshd entry gives the number of processors which are wasted by jobs running in the not_shared node-usage mode.
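The numbers next to the usage bar follow directly from the processor counts. A minimal sketch of this bookkeeping, with invented counts (the real display also distinguishes empty nodes and not_shared waste, which are omitted here):

```python
def usage_summary(total, used, unavailable):
    """Derive two of the numbers printed next to the usage bar:
    utilization in percent and the number of free processors.
    All counts here are invented example values."""
    free = total - used - unavailable
    percent = 100.0 * used / total
    return percent, free

pct, free = usage_summary(total=1024, used=896, unavailable=32)
print(f"{pct:.1f}% used, {free} processors free")  # 87.5% used, 96 processors free
```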
The Info Box is an ASCII-based display element which provides additional information about the object currently focused by the user. As the cursor moves, the information in the box is updated automatically.
There are two statistics display elements. The first one shows different histograms of the current state of the scheduling system. The diagrams are fully configurable by selecting values for the x- and y-axis from a list of collected statistical data such as job size, waiting time, job wall-clock time, or queue name. The x-axis is automatically divided into value ranges. Both axes can have linear or logarithmic (log2, log10) scaling. The screenshot above shows a statistics window configured with five different diagrams, which can be selected by the small rectangles on the left side of the diagram.
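The automatic division of the x-axis into value ranges, including the logarithmic scaling options, can be sketched as follows. This is an illustrative binning routine, not LLview's actual implementation; the bin-boundary details are an assumption.

```python
import math

def bin_values(values, nbins, scale="linear"):
    """Scatter values into equal-width ranges for a histogram.
    scale may be 'linear', 'log2' or 'log10', mirroring the axis
    options described above; the binning details are a sketch."""
    tf = {"linear": lambda v: v,
          "log2": math.log2,
          "log10": math.log10}[scale]
    xs = [tf(v) for v in values]
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / nbins or 1.0   # guard against all-equal values
    counts = [0] * nbins
    for x in xs:
        counts[min(int((x - lo) / width), nbins - 1)] += 1
    return counts

# Job sizes binned on a log2 axis: three ranges over 2^0 .. 2^6.
print(bin_values([1, 2, 4, 8, 16, 32, 64], 3, scale="log2"))  # [2, 2, 3]
```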
The second statistic (history) window shows the usage of the system for the last three days. The diagram shows the history of two values: the number of processors used by small jobs and the number of processors used by large jobs. Moving the mouse over this window shows the corresponding values in the info window.
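Each point of this history curve splits the occupied processors between small and large jobs. A minimal sketch of one such sample, assuming a per-job processor count and an invented small/large threshold (the real cutoff is site specific):

```python
SMALL_JOB_LIMIT = 256  # assumed threshold; the real small/large cutoff is site specific

def history_sample(job_sizes):
    """One sample point of the history diagram: processors used by
    small jobs versus processors used by large jobs. The job sizes
    are invented example values."""
    small = sum(p for p in job_sizes if p <= SMALL_JOB_LIMIT)
    large = sum(p for p in job_sizes if p > SMALL_JOB_LIMIT)
    return small, large

print(history_sample([128, 512, 64]))  # (192, 512)
```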
Additional graphical components can be activated via the Options->Elements tab. For example, large ASCII-based job lists for running and submitted jobs can be enabled. The search pattern in the status bar is also applied to these lists.
The notebook window History shows the history of the machine usage in a graphical display. For this, the information of the usage bar is displayed vertically and appended to the previous information.
The Prediction window displays a Gantt chart for the estimated future job schedule.
Job scheduling prediction window
It shows the result of a simulation of the job scheduler. This simulation is based on the information stored in the XML delivered by LML_da. For this simulation LLview uses the priority value of each job, the current usage state of each node of the system, and some global information such as the maximum number of starters in each job class and per user. Furthermore, LLview can simulate a scheduler working in backfilling mode: the scheduler selects one of the waiting jobs as a top dog, which will be scheduled as soon as possible on the machine. All other jobs can only be scheduled if they do not interfere with this top dog.
The window shows a diagram with the time line in the x-direction and the number of nodes in the y-direction. The colored jobs stacked on the left side of the diagram are the currently running jobs. The blue vertical line marks the current position on the time line. The blue jobs plotted on the right side of the diagram are jobs from the waiting queue. The left border of each job's rectangle marks the predicted starting time. The height of a job box corresponds to the number of processors requested by the job; its length is defined by the requested run time of the job (wall-clock time limit).
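The backfilling rule described above can be reduced to a toy model: a waiting job may start ahead of the top dog only if it fits into the currently free nodes and is guaranteed to finish before the top dog's predicted start time. This sketch is a strong simplification of the real simulation (it ignores priorities, per-class limits, and node topology); all names and numbers are invented.

```python
def backfill(free_nodes, top_dog_start, waiting, now=0):
    """Toy backfilling: pick waiting jobs that can run before the top
    dog starts without delaying it. A job qualifies if it fits into
    the remaining free nodes and finishes before top_dog_start.
    Jobs are dicts with 'name', 'nodes' and 'walltime'."""
    chosen, avail = [], free_nodes
    for job in waiting:
        if job["nodes"] <= avail and now + job["walltime"] <= top_dog_start:
            chosen.append(job["name"])
            avail -= job["nodes"]
    return chosen

waiting = [{"name": "A", "nodes": 4, "walltime": 5},
           {"name": "B", "nodes": 8, "walltime": 20},   # too long: would delay the top dog
           {"name": "C", "nodes": 2, "walltime": 8}]
print(backfill(free_nodes=8, top_dog_start=10, waiting=waiting))  # ['A', 'C']
```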
Instead of using the internal scheduler prediction, LML_da can now be configured to use JuFo (Juelich Forecast) for the scheduling prediction. This tool is written in C++ and therefore needs to be installed as an additional module. While the internal prediction module is designed especially for LoadLeveler, JuFo can also be applied to other scheduling systems such as Moab. It is highly configurable and capable of simulating hundreds of jobs in near real time. JuFo is already in production use for the general-purpose cluster JUROPA.
Command Line Options

The following options are available when starting llview from the command line:

llview [-source www|locdata|exec] [-rcfile inifile] [-hist] [-mc]
|-source source||defines the data source from which the data should be requested|
www: from a web server
locdata: from a local tar file or local flat files
exec: execute the llqxml directly on the same host
|-rcfile inifile||use this inifile for loading and saving the local configuration options|
|-hist||enable the history sub panel of the main window|
|-mc||enable the multi-cluster mode of LLview. The configuration file (default: .llview_mc.rc) defines a list of machines and the corresponding configuration files for these machines.|
Key Bindings

The following key bindings are available in the LLview window.
|Mouse-3, Control-u||Data update/reload|
|Control-o||Open/Close Option Panel|
|Control-p||Print the node display to the PostScript file ./llview.ps|