PRISM USER'S GUIDE
Version 2.0, April 1994
Copyright (c) 1991-1994 Thinking Machines Corporation.


CHAPTER 6: OBTAINING PERFORMANCE DATA
**************************************

Prism lets you collect performance data on your C* or CM Fortran program. Collecting and analyzing performance data can help you uncover and correct bottlenecks that slow down a program.

Section 6.1 is an overview of obtaining performance data in Prism. To learn:

o How to write and compile your program to obtain performance data, see Section 6.2.

o How to obtain the most accurate performance data, see Section 6.3.

o How to collect performance data, see Section 6.4.

o How to display performance data, see Section 6.5.

o How to interpret performance data, see Section 6.6.

o How to save a file of performance data and reload it into Prism, see Section 6.7.

See Section 10.7 for a discussion of performance analysis in Node Prism.


6.1 OVERVIEW
-------------

Prism helps you determine where your C* or CM Fortran program is spending its time, and why.

To determine where your program is spending its time, Prism provides data at the level of the entire program, individual procedures within the program (with both call-graph and flat displays), and individual source lines within procedures. This allows you to zero in on the lines that have the greatest impact on a program's performance.

To determine why a procedure or a source line is a bottleneck in your program, Prism provides data on a program's use of several different computing resources, not just CPU time. For example, the code may be doing a lot of send/get communication or I/O. Providing data on the code's use of these resources makes it easier to determine how, or if, the code's performance can be improved.

Prism aggregates performance data separately for the front end (or partition manager) and for the CM (or the nodes); these are referred to as the serial subsystem and parallel subsystem, respectively.
This is necessary because both subsystems contribute independently to a program's execution time. (Note that there are not separate subsystems for performance data in Node Prism.)

In addition to displaying the data, Prism provides a performance advisor that gives an interpretation of the data. See Section 6.6.2 for more information on it.


6.2 WRITING AND COMPILING YOUR PROGRAM
---------------------------------------

Performance data is available for C* and CM Fortran programs.

To collect performance data, you must compile your program with the -cmprofile option. Don't use the -O0 option to turn off optimization for a CM-2/200 C* program.

If your program calls an individual routine not compiled with the -cmprofile option (such as a routine from a CM library like CMSSL), serial-subsystem data will be available for that routine. Specific information on parallel-subsystem resources will not be available; summary information is available only for routines written in CM Fortran Version 2.1. A routine not compiled with -cmprofile cannot call a routine compiled with -cmprofile.

Performance data is available if you compile your CM Fortran or CM-5 C* program with the -cmsim option and run it on a Sun-4. The timings will effectively be those for a one-node CM-5.


6.2.1 Including Timers within Your Program
-------------------------------------------

Prism collects performance data using the CM timing utility; see the CM Fortran User's Guide or C* User's Guide for a description of this utility. By default, Prism uses timers 5-63, leaving timers 0-4 available for use within the program itself. If you want to use more than five timers, use the environment variable CMPROF_N_USER_TIMERS to specify the number. For example, if you want to use 10 timers, set the variable as follows (for the C shell):

    % setenv CMPROF_N_USER_TIMERS 10

This reserves timers 0-9 for use within your program.
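
As a rough illustration of the timer split described above, the following Python sketch computes which timer numbers are left to the program and which remain for Prism, given a value for CMPROF_N_USER_TIMERS. It assumes 64 timers numbered 0-63, which is inferred from the default ranges quoted above; the name timer_split is hypothetical and is not part of Prism or the CM timing utility.

```python
TOTAL_TIMERS = 64  # assumption: timers are numbered 0-63, as the default ranges imply


def timer_split(n_user_timers=5):
    """Return (user_timers, prism_timers) as ranges of timer numbers.

    n_user_timers plays the role of CMPROF_N_USER_TIMERS; 5 is the
    documented default (timers 0-4 for the program, 5-63 for Prism).
    """
    if not 0 <= n_user_timers <= TOTAL_TIMERS:
        raise ValueError("n_user_timers must be between 0 and 64")
    return range(0, n_user_timers), range(n_user_timers, TOTAL_TIMERS)


# The example from the text: reserve 10 timers for the program.
user, prism = timer_split(10)
print(f"program: {user[0]}-{user[-1]}, Prism: {prism[0]}-{prism[-1]}")
```

The sketch only shows the bookkeeping; it says nothing about how many timers Prism needs in practice.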
Program execution becomes less efficient, and the performance data becomes more distorted, as you use more timers, leaving fewer available for Prism.

NOTE: If you fail to set this environment variable, and thereby try to use timers that Prism itself is using, the resulting performance data will be incorrect and possibly bizarre (for example, with utilization values well in excess of 100 percent). You receive a warning from Prism if you try to use timers that Prism is using.


6.3 OBTAINING THE MOST ACCURATE PERFORMANCE DATA
-------------------------------------------------

This section gives some hints on how to obtain the most accurate performance data in Prism.

Note these general points:

o Collecting performance data slows execution of the program. The exact degree to which this occurs depends on the program. In general, programs that work on small amounts of data in each node will be affected more by this performance-collection overhead; the overhead will be negligible in programs that work on large arrays relative to CM-5 partition size. Prism corrects for this effect when it presents its data. If you are using your own timers to measure performance during a run in which Prism is also collecting data, however, you need to be aware that this effect will inflate the values in these timers.

o Interrupting your program while collecting performance data (for example, by stopping at a breakpoint and printing values) will distort the data.

o As described in more detail below, utilization percentages are affected by system load and timesharing. To decrease the variability of these percentages between runs of the program, use a lightly loaded system.

Note these points with regard to serial-subsystem data:

o Running on a heavily loaded front end or partition manager may inflate somewhat the time allocated to its CPU resource. Once again, for most accurate results, run your program on a front end or partition manager that is lightly loaded.
o Paging on the front end or partition manager may cause some discrepancies in the data for the serial subsystem; these discrepancies will be greater on smaller programs.


6.4 COLLECTING PERFORMANCE DATA
--------------------------------

To collect performance data, you must turn collection on before running the program. Collection remains on until you explicitly turn it off.

o From the menu bar: Choose Collection from the Performance menu. (This selection is also available by default in the tear-off region.)

  Collection toggles the collection of performance data. Performance collection is off when the toggle box to the left of the menu selection is not filled in; this is the default. Choosing Collection turns it on, and the toggle box is filled in. To turn it off, choose Collection when the toggle box is filled in.

o From the command window: Issue the collection on command to turn collection on; issue collection off to turn it off. Issuing the collection command also affects the state of the toggle box in the Collection menu selection.

On a CM-2 or CM-200, Prism automatically turns safety off if you run a program with collection turned on; it turns safety back on if you subsequently run the program with collection turned off.


6.4.1 Collecting Performance Data Outside of Prism
---------------------------------------------------

You can also collect performance data by setting environment variables, without entering Prism. This is convenient if you can't enter Prism for some reason (for example, because the CM is only accepting batch jobs).

To turn on collection of performance data, set the environment variable CMPROFILING to t:

    % setenv CMPROFILING t

To turn collection off, set the environment variable to f.

To specify the program on which data is to be collected, set the environment variable CMPROFILING_EXECUTABLE_FILENAME to the name of the executable program.
For example:

    % setenv CMPROFILING_EXECUTABLE_FILENAME a.out

To specify the file to which the performance data is to be sent, set the environment variable CMPROFILING_DATA_FILENAME to the name of the file. For example:

    % setenv CMPROFILING_DATA_FILENAME perf.data

You can load this file into Prism for examination at a later time; Section 6.7 explains how.


6.5 DISPLAYING PERFORMANCE DATA
--------------------------------

To display performance data, the program must have finished execution. Choose Display Data from the Performance menu. A window appears, containing the data. Figure 33 shows an example.

[ Figure Omitted ]

Figure 33. The Performance Data window.

At the top of the window are two values that represent the program's execution time:

o Wallclock time represents the actual elapsed wallclock time from when the program started to when it finished; thus, it is sensitive to effects of timesharing and front-end or partition-manager load.

o CM elapsed represents the time during which the CM or the nodes were allocated to the program; this measurement takes into account timesharing on the CM, and should be reasonably consistent no matter what the load is on the front end or partition manager. Thus, this measurement gives an indication of what your program's run time would be on a dedicated system.

The Performance Data window contains three levels of performance data:

o Performance statistics for the resources that Prism measures, along with totals for each of the two subsystems.

o Per-procedure performance statistics for a specified resource or subsystem. You can choose either flat or call-graph display of these statistics.

o Per-source-line performance statistics for a specified resource and procedure.

All statistics are displayed as histograms in panes within the Performance Data window, along with the amount of time or the percentage of wallclock time that each histogram bar represents. If the program didn't use the resource, the histogram bar does not appear.
(Occasionally, however, a resource will show a utilization of 0% because of rounding.)

By default, the window displays the number of seconds used by the resource next to each histogram bar. Choose Units from the Options menu to change this. You have these choices:

o Choose Seconds (the default) to display the actual time, in seconds, that the histogram bar represents.

o Choose Microseconds to display the actual time in microseconds.

o Choose Utilization to display the percentage of the total wallclock time that the histogram bar represents.

NOTE: As of Prism Version 2.0, the Utilization percentages are based on wallclock time; this means they are sensitive to effects of load and timesharing. These percentages may therefore vary considerably from run to run of the program; the lighter the load, the higher the percentages will be. If you choose the Utilization display option, a popup message warns you of this variability.

Although the actual percentages may vary, the relative percentages for most resources will generally remain the same. For example, if Node cpu (user) uses twice as much time as any other resource on one run of the program, it should use twice as much time as any other resource on the next run. I/O time, however, is an exception; it can vary non-proportionally to the other resources from run to run.

Once collected, performance data is retained until you load another program (whether or not you leave collection on) or until you re-execute the currently loaded program with collection on.

Choose Close from the File menu to close the Performance Data window.


6.5.1 The Resources Pane
-------------------------

The Resources pane within the Performance Data window displays histogram bars showing a program's use of the measured resources, along with totals for each subsystem. (Note that there aren't separate subsystems for performance data in Node Prism.)
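
The relationship between the Seconds and Utilization units, and between individual resources and their subsystem total, can be sketched in Python as follows. The resource names are borrowed from the CM-5 Resources pane; the times and the wallclock value are invented for illustration, and the function utilization is hypothetical rather than anything Prism exposes.

```python
def utilization(seconds, wallclock_seconds):
    """Percentage of total wallclock time that a histogram bar represents."""
    return 100.0 * seconds / wallclock_seconds


# Hypothetical measurements, in seconds, for parallel-subsystem resources.
node_resources = {
    "Node cpu": 12.0,
    "Comm (Send/Get)": 6.0,
    "Comm (NEWS)": 2.0,
}
wallclock = 40.0  # hypothetical elapsed wallclock time for the run

# The subsystem total is the sum of the subsystem's resources.
node_total = sum(node_resources.values())

for name, secs in node_resources.items():
    print(f"{name:16s} {secs:6.1f} s {utilization(secs, wallclock):5.1f}%")
print(f"{'Total':16s} {node_total:6.1f} s {utilization(node_total, wallclock):5.1f}%")
```

If the same resource times are measured under a heavier load, so that wallclock time doubles, every percentage is halved while the ratios between resources are unchanged. This is the variability that the NOTE above warns about.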

You can use the Sort By selection from the Options menu to determine the order in which the resources are displayed. Choose Name (the default) to display the resource usages by subsystem. (In Node Prism, the time spent by the scalar portion of the program is listed first, followed by time spent by array operations and communication.) Choose Time to display the resources in order from the highest usage (at the top) to the lowest. The resources are ordered within a subsystem, except for performance data in Node Prism.


On a CM-2 or CM-200
-------------------

The Resources pane for a CM-2 or CM-200 series Connection Machine system provides this data:

o FE cpu (user)--This is front-end CPU time used by the program.

o FE cpu (system)--This is front-end CPU time used by the operating system on behalf of the program.

o FE I/O--This is time spent in I/O on the front end.

o FE Total is the total of these resources. It represents the program's use of the front-end subsystem.

o CM cpu--This is the time that the program spent in processing on the CM. It refers to the amount of time any CM processor was active.

o Comm (Send/Get)--This is the time that the program spent in router communication (sends and gets) on the CM.

o Comm (NEWS)--This is the time that the program spent in NEWS communication (also referred to as grid communication) on the CM.

o Comm (Reductions)--This is the time that the program spent doing data reductions on the CM.

o Comm (FE<-->CM)--This is the time spent in communication between the front end and the CM processors.

o CM I/O--This is the time spent in I/O between the CM processors and I/O devices.

o CM not profiled--This is the time spent on the CM by routines that weren't compiled with the -cmprofile option. (These include routines in CM libraries such as CMSSL.) If the routine had been compiled with -cmprofile, this time would be allocated to the other CM resources.
  This resource is not displayed for C* programs or for CM Fortran programs prior to Version 2.1; for these programs, CM time in routines not compiled with -cmprofile is not measured.

o CM Total is the total of these resources. It represents the program's use of the CM subsystem.

The total use of the CM subsystem can be less than or equal to the CM elapsed time shown at the top of the Performance Data window. The difference between the two is the time during which the CM is idle. To use the CM efficiently, CM idle time should be kept as low as possible.

Note that the totals for the front-end subsystem and the CM subsystem are separate, since the front end can be active while the CM is idle, and vice versa. On a dedicated system, the total utilization for either subsystem can be 100%, indicating that the subsystem was busy during the entire run.


On a CM-5
---------

The Resources pane provides this data for a CM-5 system:

o PM cpu (user)--This is partition-manager CPU time used by the program.

o PM cpu (system)--This is partition-manager CPU time used by the operating system on behalf of the program.

o PM I/O--This is time spent in I/O on the partition manager.

o PM Total is the total of these resources. It represents the program's use of the partition-manager subsystem.

o Node cpu--This is the time that the program spent in processing on the nodes. It refers to the amount of time any nodes were active.

o Comm (Send/Get)--This is the time that the program spent in router communication (sends and gets) on the CM.

o Comm (NEWS)--This is the time that the program spent in NEWS communication (also referred to as grid communication) on the nodes.

o Comm (Reductions)--This is the time that the program spent doing data reductions on the nodes.

o Comm (PM<-->Node)--This is the time spent in communication between the partition manager and the nodes.

o Node I/O--This is the time spent in I/O between the nodes and I/O devices.
o Node not profiled--This is the time spent on the nodes by routines that weren't compiled with the -cmprofile option. (These include routines in CM libraries such as CMSSL.) If the routine had been compiled with -cmprofile, this time would be allocated to the other node resources. This resource is available for C* programs as of Version 7.1.1 and for CM Fortran programs as of Version 2.1; for programs compiled with older versions of these compilers, time spent in routines not compiled with -cmprofile is not measured.

o Node+Comm Total is the total of these resources. It represents the program's use of the node subsystem.

The total use of the node subsystem can be less than or equal to the CM elapsed time shown at the top of the Performance Data window. The difference between the two is the time during which the nodes are idle. To use the nodes efficiently, node idle time should be kept as low as possible.

Note that the totals for the partition-manager subsystem and the node subsystem are separate, since the partition manager can be active while the nodes are idle, and vice versa. On a dedicated system, the total utilization for either subsystem can be 100%, indicating that the subsystem was busy during the entire run.


6.5.2 The Procedures Pane
--------------------------

The pane titled Resource: name in the Performance Data window displays histograms showing the utilization of a specific resource or subsystem by each procedure in a program; we call this the Procedures pane. You choose the resource or subsystem by left-clicking on it in the Resources pane. By default, the most-used resource appears in the Procedures pane. The name of the resource or subsystem appears in the title of the pane--for example: Resource: CM Total.

Use the Mode selection from the Options menu to choose how you want to display the procedure data:

o Choose Call Graph to display the dynamic call graph of the procedures.
o Choose Flat (the default) to display a list of all procedures in the program and their use of the resource or subsystem.

In flat mode, the Procedures pane displays a list of all procedures in the program and each one's total use of the selected resource or subsystem. This is useful for determining which procedures are consuming most of the time for the resource or subsystem. The Procedures pane in Figure 33 shows the data in flat mode.

NOTE: Data for the CM or Node not profiled resource is not available in flat mode.

In call-graph mode, you see which procedures call which other procedures, and the use of the selected resource or subsystem for each individual call. This gives a more detailed picture of the program's behavior. Figure 34 shows the call-graph display for the data shown in the Procedures pane in Figure 33. Note in Figure 34 that the time allocated to the MAIN routine includes the time spent in loop, which it calls.

[ Figure Omitted ]

Figure 34. A call-graph display.

To navigate down through the call graph, click anywhere on the line that lists a procedure (other than the procedure at the top); the display changes to show this procedure at the top, with the procedures it calls below it. Thus, in call-graph mode, the Procedures pane at any one time shows two levels of the call graph. To move up through the call graph, click on the top procedure in the display; the display changes to show the caller of this procedure at the top, with the procedures it calls beneath it.

As with the Resources pane, you can use the Sort By selection from the Options menu to arrange the procedures in the Procedures pane.

o Choose Time (the default) to list procedures according to their use of the resource or subsystem, from most to least.

o Choose Name to arrange the procedures in alphabetical order.

In call-graph mode, the sorting applies only to the children of the calling procedure; the calling procedure is always at the top of the display.
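
To make the call-graph behavior concrete, here is a small Python sketch in which a procedure's time includes the time of the procedures it calls, as Figure 34 shows for MAIN and loop, and in which only the children of the top procedure are sorted. It is a simplification under stated assumptions: the procedures init and update, all of the times, and both helper functions are hypothetical, and the sketch does not attempt to reproduce Prism's per-call accounting.

```python
# Hypothetical call graph: caller -> list of procedures it calls.
calls = {"MAIN": ["loop", "init"], "loop": ["update"], "init": [], "update": []}

# Hypothetical time, in seconds, spent in each procedure's own code.
own_time = {"MAIN": 0.5, "loop": 2.0, "init": 0.3, "update": 4.0}


def inclusive_time(proc):
    """Time for proc plus everything it calls (assumes no recursion)."""
    return own_time[proc] + sum(inclusive_time(c) for c in calls[proc])


def call_graph_view(top, sort_by="time"):
    """Two levels of the graph: the caller first, then its sorted children."""
    if sort_by == "time":
        children = sorted(calls[top], key=inclusive_time, reverse=True)
    else:  # sort by name
        children = sorted(calls[top])
    return [top] + children


for proc in call_graph_view("MAIN"):
    print(f"{proc:8s} {inclusive_time(proc):5.1f}")
```

Calling call_graph_view("loop") corresponds to clicking on the loop line to move down one level: loop becomes the top entry, with update beneath it.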

If a routine is not compiled with the -cmprofile option, Prism will display data only for serial-subsystem resources or for the CM or Node not profiled parallel-subsystem resource (if available); as mentioned above, all parallel-subsystem time for the routine is included in the not profiled resource.


6.5.3 The Source-Lines Pane
----------------------------

The pane titled Procedure: name displays performance data associated with each source line in a procedure; we call this the Source-Lines pane. Choose the procedure by left-clicking on the line for the procedure in the Procedures pane; by default, Prism displays the source code for the procedure that has the highest utilization of the most-used resource. The resource or subsystem for which the data is shown is the one displayed in the Procedures pane.

For slicewise and CM-5 CM Fortran programs, Prism actually calculates performance data at the level of basic blocks. These basic blocks can include one or more lines of source code; the lines are not necessarily contiguous. Prism allocates the amount of time spent in a basic block equally to each line in the block. In general, this will give an accurate picture of each line's contribution to the overall time spent in the basic block. It is possible, however, that the data may be misleading. To get a more accurate picture of per-line data, compile with the -g switch in addition to -cmprofile. This produces unoptimized code, however, and overall performance will be much worse.

Also note these points:

o Source-line data is not available for the serial subsystem, or for serial-subsystem resources.

o If a routine is not compiled with the -cmprofile option, source-line data is not available.


6.5.4 Displaying Performance Data in the Command Window
--------------------------------------------------------

To display an ASCII version of the performance data, issue the perf command from the command window.
As with other commands, you can redirect output to a file by using the syntax @ filename. This is useful if you are using Prism with the commands-only option, or if you want to study the data at a later time when you don't have a graphical interface available.

By default the perf command displays actual times, in seconds, for resources. Use the util argument to display utilization percentages.


6.6 INTERPRETING THE DATA
--------------------------

This section discusses how to make sense of the performance data that Prism provides.


6.6.1 Making Sense of the Times
--------------------------------

Recall that the totals for the serial subsystem (front end or partition manager) and the parallel subsystem (CM or nodes) are separate. The ideal is for the total of the parallel-subsystem resources to be as close as possible to the value of CM elapsed time (in other words, for the CM or nodes to be kept as busy as possible).

Recall also, as described in Section 6.3, that the load on either the serial subsystem or the parallel subsystem can affect your results. For most accurate results, run on lightly loaded systems. One indication of a light load is that the CM elapsed time will be close to the wallclock time.

Finally, recall that utilization percentages are computed using the total wallclock time, which is affected by load. If the system is heavily loaded, the percentages will be small. On a lightly loaded system, the percentages will be higher, although the relative differences between the percentages will be similar. On a dedicated system, the total utilization for either subsystem can be 100%, indicating that the subsystem was busy during the entire run.


6.6.2 Isolating Bottlenecks
----------------------------

Prism's performance data gives you a picture of how your program uses system resources. We assume you will want to use this information to try to improve the program's performance.
The key to improving performance is to find the bottlenecks in the program--the procedures, and the source lines within the procedures, whose use of a particular resource has the greatest impact on how long the program takes to complete.

This section describes how to use the performance data to find your program's bottlenecks. To help you in this analysis, Prism provides a performance advisor, which summarizes and analyzes the performance data that Prism has collected. To display this information, choose Advice from the Performance menu, or issue the command perfadvice. You can use this performance advisor, or you can analyze the data on your own, to isolate the bottlenecks in your program. The performance advisor provides answers to the questions discussed below; we believe that following this procedure provides the best method for interpreting the performance data.

We suggest asking these questions to isolate the bottlenecks in your program:

1  Which of the two subsystems that Prism measures does the program use more heavily?

   For example, if total serial-subsystem time is greater than total parallel-subsystem time, then reducing the use of the serial subsystem is likely to provide the greatest performance gains. Reducing the use of the parallel subsystem may improve performance, but you may also find that it will have no effect on performance, since the use takes place at the same time that the serial subsystem is also in operation.

2  Which resource within this subsystem has the highest usage?

   If your program uses the parallel subsystem more heavily than the serial subsystem, and Send/Get communication is the most-used parallel-subsystem resource, then you will obtain the greatest performance gains by reducing the use of this resource.

3  Which procedure uses this resource most heavily?

   This tells you where you will have the biggest payoff when attempting to reduce the use of the most heavily used resource.
4  Which source lines within this procedure use this resource most heavily?

   Finally, going to the source-line level isolates the specific lines of code that have the greatest effect on performance.

When you first display data for a program, by default the Performance Data window displays the most-used resource and the procedure that uses this resource the most; this helps you analyze your data without having to use the performance advisor.

NOTE: Under certain circumstances, the time assigned to the FE or PM cpu (user) resource will incorrectly include time that the front end or partition manager spent waiting for the CM or the nodes. In such cases, this resource may appear to be the bottleneck for the program, when in fact it is not. When this is the most-used resource, the performance advisor prints a warning, and the Procedures pane of the Performance Data window displays data for the most-used CM or Node resource instead. In general, this bug is probably not affecting your performance data if the FE or PM cpu (user) utilization is high while CM or Node cpu utilization is low. (In such a situation, the front end or partition manager is probably not spending time waiting for the CM or the nodes.)


6.6.3 Anomalous Performance Data
---------------------------------

It is possible that your performance data will simply appear incorrect. For example, you may get percentages in the thousands or millions. This problem can occur if you try to use timers that Prism has allocated for its own use. If you use more than five timers in your program, be sure to set the environment variable CMPROF_N_USER_TIMERS to the number that you use; see Section 6.2.1.


6.7 SAVING AND LOADING PERFORMANCE DATA FILES
----------------------------------------------

See Section 10.7.2 for a discussion of saving and loading performance data files in Node Prism.

You can save performance data you have collected for a program in a file; you can later load this file into Prism and re-display the data.
This lets you look at the progression of performance analyses as you work on your program. It is also useful if you do your original data collection outside of Prism or in commands-only Prism, and later want to look at your data in the graphical version.

Follow this procedure:

1  Collect the data as you normally do (that is, turn collection on and run the program to completion).

2  Choose Save Data from the Performance menu. (Alternatively, you can choose Display Data from the Performance menu to display the Performance Data window, then choose Save Data from the File menu in this window.) A dialog box appears; in it, specify the name of the file in which you want to save the data. If you don't supply a complete pathname, the filename is interpreted relative to the directory from which you started Prism. The data is then saved in this file.

   Alternatively, you can issue the perfsave command from the command window, specifying the name of the file in which the data is to be saved.

3  When you want to look at the data again, choose Load Data from the Performance menu (or from the File menu in the Performance Data window). A file-selection dialog box is displayed, from which you choose the file in which you saved the data. The data is then reloaded. If no program is loaded at the time, Prism loads the corresponding executable program; if another program is loaded, Prism displays a dialog box and asks if you want to load the program associated with the performance data. If you don't, the usefulness of the performance data will be limited, since Prism will incorrectly associate the data with the procedures and source lines of the program that is loaded.

   Alternatively, you can issue the perfload command from the command window, specifying the name of the file in which the data was saved.

Note these points in saving and loading performance data:

o The performance data is associated with a specific version of the program.
  If you modify the program, Prism will not be able to load the version for which the data was collected. (It prints a warning when it detects that its performance data file is out of date.) Therefore, if you want to use this feature to maintain a historical record of your attempts at improving a program's performance, you should rename the program whenever you change it, and save the earlier versions along with their performance data files.

o You can display only one set of performance data at a time within Prism. Therefore, if you want to compare data from different versions of a program on-screen, you have to run multiple instances of Prism.


*****************************************************************

The information in this document is subject to change without notice and should not be construed as a commitment by Thinking Machines Corporation. Thinking Machines reserves the right to make changes to any product described herein.

Although the information in this document has been reviewed and is believed to be reliable, Thinking Machines Corporation assumes no liability for errors in this document. Thinking Machines does not assume any liability arising from the application or use of any information or product described herein.

*****************************************************************

Connection Machine (r) is a registered trademark of Thinking Machines Corporation.
CM, CM-2, CM-200, and CM-5 are trademarks of Thinking Machines Corporation.
CMOST, CMAX, and Prism are trademarks of Thinking Machines Corporation.
C* (r) is a registered trademark of Thinking Machines Corporation.
Paris and CM Fortran are trademarks of Thinking Machines Corporation.
CMMD, CMSSL, and CMX11 are trademarks of Thinking Machines Corporation.
CMview is a trademark of Thinking Machines Corporation.
Thinking Machines (r) is a registered trademark of Thinking Machines Corporation.
SPARC and SPARCstation are trademarks of SPARC International, Inc.
Sun, Sun-4, and Sun Workstation are trademarks of Sun Microsystems, Inc.
UNIX is a trademark of UNIX System Laboratories, Inc.
The X Window System is a trademark of the Massachusetts Institute of Technology.
OSF and Motif are trademarks of The Open Software Foundation, Inc.
Worldview is a trademark of Interleaf, Inc.

Copyright (c) 1991-1994 by Thinking Machines Corporation. All rights reserved.

This file contains documentation produced by Thinking Machines Corporation. Unauthorized duplication of this documentation is prohibited.

Thinking Machines Corporation
245 First Street
Cambridge, Massachusetts 02142-1264
(617) 234-1000