CM FORTRAN PROGRAMMING GUIDE
Version 2.1, January 1994
Copyright (c) 1994 Thinking Machines Corporation.


CHAPTER 12: CM ARRAY LAYOUT
****************************

The CM-5's run-time system (RTS) distributes array elements among the processors (either nodes or vector units) that are available to the program. The particular distribution of an array is called its layout, and the layout chosen by the RTS without intervention from the user is the canonical layout.

CM Fortran normally spares the user the details of how arrays are laid out. However, you might improve program performance by using the compiler directive LAYOUT to control an array's layout yourself. Sophisticated use of this tool requires an understanding of how the RTS maps arrays onto the parallel memory and the implications of changing the canonical mapping.

You can also use the compiler switches -nopadding and -noaxisreorder to make global changes in the RTS's decision rules and thus in the canonical mapping for a particular program.

This chapter focuses on the LAYOUT directive and its impact on RTS decision making. Some related features will become clear once you understand how array layout is handled:

o the utility library procedures CMF_ALLOCATE_LAYOUT_ARRAY and CMF_ALLOCATE_DETAILED_ARRAY

o the utility library procedure CMF_ALIAS_DETAILED

o the compiler directive ALIGN

The ALIGN directive is described in more detail in the CM Fortran Language Reference Manual and the utility procedures in the CM Fortran Libraries Reference Manual. See also the CM-5 CM Fortran Performance Guide for additional information on the performance considerations in using all these features.


12.1 THE CANONICAL LAYOUT
--------------------------

This section describes how the current implementation of the run-time system lays out arrays.
For convenience, it focuses on the CM-5 with vector units, but the same principles apply to a CM-5 without vector units and to a CM-2/200 under the slicewise execution model (see Section 12.2 for the differences). These principles may change in future implementations.

Let's say you have an 8 x 12 array in a program running on a 16-vector-unit (4-node) partition. How does the RTS determine how to lay out the elements on the 16 vector units? Understanding how it does this requires understanding five concepts:

o physical grids

o garbage elements

o subgrids

o axis sequence

o subgrid sequence


12.1.1 Physical Grids
----------------------

When laying out an array, the RTS configures the vector units into a grid whose rank is the same as the rank of the array. The total number of vector units in a partition is always a power of 2; therefore, the number of vector units along each axis of the physical grid must be a power of 2.

Thus, for the 8 x 12 example, the RTS has these choices for arranging the 16 vector units into a 2-dimensional physical grid: 4 x 4, 8 x 2, 2 x 8, 16 x 1, and 1 x 16.


12.1.2 Garbage Elements
------------------------

The RTS tries to divide the array's elements equally among the vector units. In doing so, it follows these rules:

o Each vector unit must receive the same number of elements.

o The number of elements per vector unit must be a multiple of the value returned by the utility library function CMF_ARRAY_SIZE_QUANTUM.

  By default in the -vu execution model, this value is 8 (the length of the vector units' vector register). It is 1 in the -sparc model and 4 in the CM-2/200 slicewise model.

However, it is not possible to follow these rules in laying out an 8 x 12 array on 16 vector units; each vector unit could receive the same number of elements (6), but that number would not be a multiple of 8. In such a case, the RTS internally uses a machine array with larger extents along one or more axes, so that the rules can be followed.
It then lays out the actual array within this larger layout, leaving unused elements along the extended axes. These unused elements are referred to as garbage elements or padding. The RTS always adds garbage elements to the high end of an axis, and adds as few garbage elements as possible.

For our 8 x 12 array, the RTS would pad axis 2 to make a machine array of 8 x 16. Its layout on a 2 x 8 physical grid is shown in Figure 20.

[ Figure Omitted ]

Figure 20. An 8 x 12 array laid out on a 2 x 8 physical grid.

You can think of garbage elements as coming from two distinct sources:

o Global padding is added, if necessary, to make a machine array's size an integer multiple of CMF_NUMBER_OF_PROCESSORS.

o Vector-length padding is added, if necessary, to make the subgrid an integer multiple of CMF_ARRAY_SIZE_QUANTUM (the vector length of 8 in VU model).

The compiler switch -nopadding removes the multiple-of-8 rule for subgrids under the -vu execution model. With the switch, the subgrid can be any arbitrary length. Note, however, that global padding may still be added if the array does not divide evenly among the vector units.

With -nopadding, an 8 x 12 array would be laid out on a 4 x 4 physical grid, with 2 x 3 subgrids and no garbage elements.

[ Figure Omitted ]

Figure 21. An 8 x 12 array on 16 VUs with -nopadding.

The RTS deals with padding transparently, guaranteeing that garbage data does not escape and corrupt user data. However, padding does have performance consequences, especially for communication. Also, you must take it into account when specifying detailed layouts yourself.

You can determine whether any given array is padded by calling the utility library subroutine CMF_DESCRIBE_ARRAY. If the output shows a machine array with a different number of elements from the user array, the difference is padding. For example, this array fits onto the current partition without padding:

    DOUBLE PRECISION, ARRAY (128,128,8,16) :: A
    ...
    CALL CMF_DESCRIBE_ARRAY(A)

as shown by this output:

    Array geometry id: 0xa3610
    Rank: 4
    Number of elements: 2097152
    Extents: [128 128 8 16]
    Machine geometry id: 0xa9e50, rank: 4
    Machine geometry elements: 2097152
    ...


12.1.3 Subgrids
----------------

Continuing its layout strategy, the RTS divides the (machine) array into a number of sections, each section corresponding to a vector unit. These sections are called subgrids.

Note the layout requirements discussed so far:

o The physical grid is of the same rank as the user array and must have a power-of-2 number of vector units along each axis.

o Each vector unit must contain the same number of elements--in other words, the subgrid must be the same size on each. The subgrid also has the same rank as the physical grid and the user array.

o With the default -padding, the subgrid must be a multiple of 8 elements; with -nopadding, the subgrid can be any arbitrary size.

Our sample 8 x 12 array, padded to 8 x 16 by default, will have 8 elements per vector unit. But will they be laid out as a 4 x 2 subgrid, as shown in Figure 20, or, for example, as a 1 x 8 subgrid? The 1 x 8 layout implies a physical grid of 8 x 2, as shown in Figure 22.

[ Figure Omitted ]

Figure 22. An 8 x 12 array laid out on an 8 x 2 physical grid.

The basic rule that the RTS follows is to minimize the size of the subgrid. In other words, it uses as few garbage elements as possible. This doesn't help it choose between the 2 x 4 and 1 x 8 subgrids, which both have the same number of garbage elements. It does, however, eliminate subgrids that use fewer than all 16 vector units. In practice, the RTS chooses a layout that uses fewer than all the vector units only when the number of elements in the array is small relative to the number of vector units on which the program is running.
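
The requirements listed in this section can be explored with a small model. The Python sketch below is an illustration only, not the RTS algorithm: it assumes that each candidate subgrid is lengthened by the smallest amount that makes its size a multiple of the quantum, and the function names (power_of_two_grids, padded_subgrid, smallest_subgrid_candidates) are hypothetical. It lists the physical grids that tie for the smallest subgrid; it does not model how the RTS breaks the tie.

```python
from itertools import product
from math import ceil, prod


def power_of_two_grids(n_procs, rank):
    """Yield every physical grid of the given rank whose axis lengths are
    powers of 2 and whose product is n_procs."""
    powers = [1 << k for k in range(n_procs.bit_length())]
    for grid in product(powers, repeat=rank):
        if prod(grid) == n_procs:
            yield grid


def padded_subgrid(extents, grid, quantum):
    """Smallest subgrid that covers the array on this grid and whose size is
    a multiple of quantum (an assumed padding model, not the RTS's own)."""
    base = [ceil(e / p) for e, p in zip(extents, grid)]
    best = None
    # Lengthening one axis by at most quantum - 1 always reaches a multiple.
    for extra in product(range(quantum), repeat=len(base)):
        cand = tuple(b + x for b, x in zip(base, extra))
        if prod(cand) % quantum == 0 and (best is None or prod(cand) < prod(best)):
            best = cand
    return best


def smallest_subgrid_candidates(extents, n_procs, quantum):
    """Map each physical grid that achieves the minimum subgrid size
    to the subgrid it implies."""
    subgrids = {g: padded_subgrid(extents, g, quantum)
                for g in power_of_two_grids(n_procs, len(extents))}
    smallest = min(prod(s) for s in subgrids.values())
    return {g: s for g, s in subgrids.items() if prod(s) == smallest}


if __name__ == "__main__":
    # The chapter's example: an 8 x 12 array on 16 vector units, quantum 8.
    for grid, sub in smallest_subgrid_candidates((8, 12), 16, quantum=8).items():
        garbage = prod(sub) * 16 - 8 * 12
        print("physical grid", grid, "subgrid", sub, "garbage elements", garbage)
```

Under these assumptions the 8 x 12 example yields several tied candidates with 8 elements per vector unit, among them the 4 x 2 subgrid on a 2 x 8 physical grid and the 1 x 8 subgrid on an 8 x 2 physical grid. Passing quantum=1 models -nopadding and includes the 4 x 4 grid with 2 x 3 subgrids.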


12.1.4 Axis Sequence
---------------------

One piece of information left out of the layouts shown in Figure 20 and Figure 22 is the numbering of the vector units within the physical grid. In the 2 x 8 physical grid layout, for example, Figure 23 shows two ways in which the vector units could be numbered.

[ Figure Omitted ]

Figure 23. Two possible vector-unit numberings.

Vector units 0-3 are on node 0, vector units 4-7 are on node 1, etc.

In the example on the top in Figure 23, the vector-unit numbers increase fastest (that is, by the smallest interval) along axis 1; we call this an axis sequence of (1,2)--the first axis in the sequence is the one that varies fastest. In the example on the bottom in Figure 23, they increase fastest along axis 2; this corresponds to an axis sequence of (2,1).

By default, the current implementation of the RTS lays out multidimensional arrays so that the vector-unit numbers vary fastest along the highest-numbered axis--that is, it would choose the axis sequence of (2,1) in this example. The compiler switch -noaxisreorder causes it to choose axis sequence (1,2).


12.1.5 Subgrid Sequence
------------------------

The final issue with regard to the canonical layout is how the elements in the subgrid are arranged into linear order in the memory of a vector unit. This is known as the subgrid sequence. If you have a subgrid whose dimensions are 4 x 2, there are two possible layouts of the elements in vector-unit memory:

    (1,1)         (1,1)
    (1,2)         (2,1)
    (2,1)         (3,1)
    (2,2)   or    (4,1)
    (3,1)         (1,2)
    (3,2)         (2,2)
    (4,1)         (3,2)
    (4,2)         (4,2)

By default, the current implementation of the RTS chooses the layout on the left; it reorders the axes such that the highest-numbered axis varies fastest (that is, the adjacent subgrid elements along the highest-numbered axis are contiguous in memory). By specifying the non-default -noaxisreorder, you cause it to lay out elements in column-major order (leftmost is fastest).
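
The two subgrid sequences can be generated mechanically. The following Python sketch is an illustration only; the function name memory_order and its axis_reorder flag are hypothetical stand-ins for the default behavior and for -noaxisreorder, and serial axes are not modelled.

```python
from itertools import product


def memory_order(subgrid, axis_reorder=True):
    """List subgrid coordinates (1-based) in the order they occupy memory.

    axis_reorder=True  : highest-numbered axis varies fastest (the default).
    axis_reorder=False : leftmost axis varies fastest (column-major order).
    """
    ranges = [range(1, n + 1) for n in subgrid]
    if axis_reorder:
        # itertools.product already varies its last argument fastest.
        return list(product(*ranges))
    # Iterate with the axes reversed, then flip each tuple back.
    return [coords[::-1] for coords in product(*reversed(ranges))]


if __name__ == "__main__":
    print(memory_order((4, 2)))                      # default ordering
    print(memory_order((4, 2), axis_reorder=False))  # -noaxisreorder ordering
```

For a 4 x 2 subgrid, the first call reproduces the left-hand sequence above and the second reproduces the right-hand sequence.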

NOTE: The default sequence becomes more complicated when serial axes are specified (as shown below).


12.1.6 Putting It All Together
-------------------------------

It turns out that the RTS chooses a 2 x 8 physical grid for our 8 x 12 array. Given the information covered so far, we can now see exactly how the RTS would lay out the array using this physical grid (Figure 24).

[ Figure Omitted ]

Figure 24. Default layout of an 8 x 12 array on 16 vector units.

Note that the padding occurs in vector units 6, 7, 14, and 15; other layouts would put the padding in other vector units.

Note these general performance rules for vector units:

o Data movement within a vector unit is faster than data movement between vector units. Thus, in the default, data movement would be faster along axis 2.

o Data movement between the two vector units on a chip is faster than data movement across the two chips of a node.

o Data movement between vector units on the two chips of a node is faster than data movement off-node.

To take advantage of the chip-versus-node distinction, you can specify layout in detail, as shown below.


12.2 LAYOUT WITHOUT VECTOR UNITS
---------------------------------

The same general layout principles apply to all CM Fortran execution models. This section notes a few differences you will see if you are using the CM-5 global SPARC nodes model or the CM-2/200 slicewise model.


12.2.1 CM-5 SPARC Nodes Model
------------------------------

If your program is executing under the SPARC nodes model, the processing elements are the nodes instead of vector units. The layout rules are basically the same, except that the rule that the subgrid must be a multiple of 8 elements does not apply. The RTS does not add vector-length padding in the SPARC model, although it may need to add global padding.

If the RTS were to lay out our sample 8 x 12 array on 4 nodes, it could choose among the physical grids and subgrids shown in Figure 25.

[ Figure Omitted ]

Figure 25.
Subgrids and physical grids for an 8 x 12 array on four nodes.

No global padding is required in this example, because the 96 elements of the array divide evenly into 24 per subgrid. The RTS in this case would choose the 4 x 6 subgrid. The actual default layout is shown in Figure 26.

[ Figure Omitted ]

Figure 26. Default layout of an 8 x 12 array on four nodes.


12.2.2 CM-200 Slicewise Model
------------------------------

If your program is executing under the CM-2/200 slicewise model, the processing elements are the nodes instead of vector units. The layout rules are basically the same, except that the constraint on subgrid length is that it must be a multiple of 4 elements. The -nopadding switch is not supported on the CM-2 and CM-200.


12.3 CONTROLLING ARRAY LAYOUT: AXIS ORDERING AND WEIGHTS
---------------------------------------------------------

The compiler directive LAYOUT enables you to guide the RTS in laying out an array. Two fairly straightforward parameters you can specify are:

o serial axes

o weighting of axes

Some more detailed syntax is shown in Section 12.6.


12.3.1 Syntax
--------------

Only one LAYOUT directive can be applied to an array in a program, and the directive must be repeated in every program unit where that array is used. The directive has the form:

    CMF$ LAYOUT array-name ( axis-1-spec, axis-2-spec, ... )

Each array dimension must have exactly one axis descriptor (axis-spec), which specifies the ordering and weight of the dimension. In a simple LAYOUT directive, the ordering can be either :SERIAL or :NEWS, and each :NEWS keyword can be preceded by a literal or named integer constant that indicates the relative weight.
For example,

    DIMENSION A( 100,100,100 )
    CMF$ LAYOUT A( :SERIAL, 2:NEWS, :NEWS )

This directive specifies that array A is to be laid out with one serial dimension and two NEWS-ordered dimensions, and that dimension 2, with a weight of 2, is to be favored for interprocessor communication over dimension 3, which gets the default weight 1.

Since NEWS is the default axis ordering, the keyword may be omitted and just a weight supplied; or, both ordering and weight may be omitted as long as placeholder commas remain. Thus, the following directive is equivalent to that above:

    DIMENSION A( 100,100,100 )
    CMF$ LAYOUT A( :SERIAL, 2, )

Increasing the relative dimension weight does not guarantee a proportional increase in communication speed for a dimension, but it might help the system in selecting from among several possible layouts.

Note on :SEND Ordering

CM-2/200 systems support a second parallel ordering, SEND ordering, where elements are adjoining if their send addresses differ by one bit (as opposed to NEWS ordering, where elements are adjoining if their axis coordinates differ by one). SEND ordering is preferred with some CM-2/200 library procedures such as CMSSL FFTs.

CM Fortran on CM-5 systems accepts the axis specification :SEND in LAYOUT directives, but the RTS lays out a :SEND axis in the same way as a :NEWS axis.

You should avoid SEND ordering on the CM-5 whenever possible. Although the compiler accepts it without protest, it does not recognize that this ordering is identical to NEWS ordering. It therefore generates an expensive SEND routine for an elemental operation between a :NEWS-ordered array and an otherwise identical :SEND-ordered array. Also, some CM-5 libraries do not accept :SEND-ordered arrays.


12.3.2 Serial Axes
-------------------

Specifying an axis as :SERIAL in a LAYOUT directive has two effects:

o All elements along the axis will be located on the same vector unit.

  This layout is useful if communication tends to occur mostly along that one axis, since communication is faster within a vector unit than between vector units. It is also useful if the serial axis is used to simulate a "structure," whereby you specify a parallel "slice" of the array by supplying a scalar or triplet subscript for the serial axis.

  For example, if your array is 3 x 8 x 12 and you specify that axis 1 is to be serial, all elements along axis 1 would be located on the same vector unit; the physical grid of the remaining two axes is the same as it would be if axis 1 didn't exist--see Figure 24. One result of this is that the RTS never adds garbage elements to a serial axis; it always satisfies the requirements for garbage elements in the non-serial axes.

o By default (with the switch -axisreorder), the elements along a serial axis vary more slowly than elements along any non-serial axis--that is, they must be furthest away from each other in memory. For example, if you have a subgrid that is 2 x 3 x 4, by default the order of elements in memory is

      (1,1,1) (1,1,2) (1,1,3) (1,1,4) (1,2,1) (1,2,2) (1,2,3) (1,2,4) (1,3,1) (1,3,2) etc.

  If you were to specify that axis 2 is serial, the sequence would be

      (1,1,1) (1,1,2) (1,1,3) (1,1,4) (2,1,1) (2,1,2) (2,1,3) (2,1,4) (1,2,1) (1,2,2) etc.

  Axis 2 now varies most slowly.

If you have more than one serial axis, by default the highest-numbered (rightmost) axis varies fastest, just as it does for non-serial axes. In effect, serial and non-serial axes are ordered independently. The rules are:

o All NEWS axes vary faster than any serial axis. (This guarantees that a section formed by scalar indexing into all serial axes is a contiguous block of memory.)

o The NEWS axis declared rightmost varies faster than other NEWS axes, the next rightmost NEWS axis varies next fastest, and so on.

o The serial axis declared rightmost varies faster than other serial axes, the next rightmost serial axis varies next fastest, and so on.

If you specify the switch -noaxisreorder, then axes are laid out in the order declared, with the leftmost varying fastest, the next leftmost next fastest, and so on, regardless of which are serial or parallel.

For example, if you have a 2 x 3 x 4 subgrid with the middle axis serial and -noaxisreorder, the sequence would be:

    (1,1,1) (2,1,1) (1,2,1) (2,2,1) (1,3,1) (2,3,1) (1,1,2) (2,1,2) (1,2,2) (2,2,2) (1,3,2) (2,3,2) etc.


12.3.3 Weighting Axes
----------------------

The layout tools enable you to influence the RTS's choice of layout by differentially weighting the array axes to indicate which will be most used in communication operations. By default, the RTS weights all axes evenly.

When an axis is weighted more highly than others, the RTS tries to localize more of the elements along that axis within a node or vector unit, thereby reducing the cost of the communication along that axis. In the example of laying out an 8 x 12 array on 16 vector units, weighting axis 2 would tend to result in a 1 x 8 subgrid.

If you have more than two dimensions, you can assign different weights to each dimension, or assign the same weight to two or more of the dimensions, causing the RTS to treat them similarly.

Note that the run-time system will not necessarily be able to lay out the array to completely reflect the weights you assign to the axes. In the current implementation, once it has chosen a subgrid with the fewest elements, it uses the weights to trade off factors of 2 in the lengths of subgrid axes against factors of 2 in the length of physical-grid axes, attempting to give axes with higher weights a longer subgrid length and a smaller physical-grid length.

--------------------------------------------------
NOTE

The use of axis weights is deprecated in favor of the detailed LAYOUT syntax described below.
The RTS may find that layout is totally determined by other factors before weights are considered, and so it often cannot respond usefully to this directive.
--------------------------------------------------


12.4 PERFORMANCE ISSUES
------------------------

The main reason to use the layout tools is to improve performance over what you can obtain using the canonical layout. This section summarizes the performance issues you should consider (most of these issues have already been mentioned in previous sections). Although the discussion here refers to vector units, it also applies to programs running on CM-5s without vector units and on CM-2/200s under the slicewise execution model.


12.4.1 Effect of Subgrid Length and Physical Grid
--------------------------------------------------

As we discussed above, the RTS by default chooses the smallest subgrid size. This is the most efficient size for operations that are local to a vector unit. When you lengthen a subgrid axis, you improve the efficiency of communication along that axis at the expense of communication along other axes; you also decrease the overall efficiency of operations within the vector unit.

For example, assume that we're using a subgrid size of 8 x 4 and we want to do a shift of distance 1 along axis 2--that is, each element sends its value to the element that is one coordinate higher along axis 2. Figure 27 shows the subgrids on two vector units; each vector unit moves 8 values to the next vector unit, and 24 values within the vector unit.

[ Figure Omitted ]

Figure 27. On-VU and off-VU data movement for a NEWS operation along axis 2.

In general, the number of off-VU moves in an operation like this is equal to the total number of elements in the subgrid divided by the subgrid length of the axis along which the communication is taking place (32/4 = 8); this value is referred to as the subgrid-orthogonal-length.
The number of moves within the vector unit is the total number of elements in the subgrid minus the subgrid-orthogonal-length (32-8 = 24). This number is roughly proportional to the subgrid size. As long as the subgrid size stays roughly constant, changing the layout does not greatly affect the cost of these on-VU moves. Decreasing the subgrid-orthogonal-length of a subgrid axis does, however, result in better communication performance along the axis.


12.4.2 Effect of Serial Axes
-----------------------------

Making an axis serial guarantees that all elements along the axis are on the same vector unit--that is, no off-VU data movement is required. This means that a serial axis will have optimal within-VU performance, once again at the expense of communication along other axes.

But beware that, since all the other constraints on array layout must be satisfied by the non-serial axes, an array with only a small number of elements along the non-serial axes can end up with an inefficient layout, because it will require a large number of garbage elements.

This is particularly true for the default -padding in -vu model. Since the RTS uses block layout--whereby successive elements are placed "down memory" rather than across processors--it is not possible to lay out an array with 1 element of a parallel axis per processor. Instead, 8 elements are laid out in the first few processors, along with all the serial axis elements, leaving the remaining processors with garbage elements only.


12.4.3 Effect of Garbage Elements
----------------------------------

If an array has garbage elements, they are not actually part of the array's data, but must be taken into account by communication functions. Often this requires extra work to move data "over" the garbage locations, thus decreasing efficiency. You should therefore avoid choosing an array or a subgrid size that requires the creation of garbage elements along an axis that will be used heavily for communication.
For other communication operations, sometimes the existence of any garbage elements in the array will add overhead.


12.5 QUERYING AN ARRAY'S LAYOUT
--------------------------------

CM Fortran provides several utility library procedures that report how the run-time system actually laid out an array. You can use this information to determine whether you should use the layout tools to specify a different layout.

Besides CMF_DESCRIBE_ARRAY mentioned earlier, the library also provides the following subroutines. These are particularly useful for deriving layout information for use in calls to CMF_ALLOCATE_DETAILED_ARRAY and CMF_ALIAS_DETAILED.

    CMF_GET_AXIS_ORDERS (incompatible with the -noaxisreorder switch)
    CMF_GET_LOCAL_EXTENTS
    CMF_GET_LOCAL_STRIDES
    CMF_GET_PROCESSOR_MASKS

The last item, processor masks, refers to integer bitmasks that describe the layout of axes across particular processors. They are reported by CMF_DESCRIBE_ARRAY, and you can specify them as part of a detailed layout directive or procedure, as described in the following section.


12.6 SPECIFYING DETAILED ARRAY LAYOUTS
---------------------------------------

The LAYOUT directive accepts additional axis specifications that describe array layout in detail. These forms of the directive are not meaningful at the language level or to the CM Fortran compiler, but are passed on as instructions to the run-time system.

These detailed forms of the LAYOUT directive can be used with any array, including dynamically allocated arrays (although variable values used in the directive are established upon entry). A comparable feature for allocating an array dynamically and specifying its layout in detail is the utility procedure CMF_ALLOCATE_DETAILED_ARRAY, where variable values are those at the time the utility executes. This procedure is described in the CM Fortran Libraries Reference Manual.


12.6.1 Syntax
--------------

The LAYOUT directive accepts the following forms of axis descriptor that pertain to detailed layouts:

    CMF$ LAYOUT array-name ( axis-descriptor(s) )

An axis-descriptor is one of:

o :SERIAL

o :BLOCK = axis-spec :PROCS = axis-spec

o :BLOCK = axis-spec :PDESC = proc-spec

Each axis-spec or proc-spec may be a decimal integer constant, an integer-valued dummy argument, or an integer variable in COMMON. Integer expressions are not supported. The value of :PROCS= must be a power of 2.

The resulting axis layout can be either:

o the subgrid length (:BLOCK=) followed by the number of processors desired for that axis (:PROCS=)

o the subgrid length (:BLOCK=) followed by a bit-mask indicating which processors are desired (:PDESC=)

Keep in mind these restrictions on axis specifications:

o If one axis of an array is specified with :BLOCK :PROCS, then all axes of that array should be either :BLOCK :PROCS or :SERIAL. If one axis of an array is specified with :BLOCK :PDESC, then all axes of that array should be either :BLOCK :PDESC or :SERIAL.

o The compiler does not accept these detailed forms in an interface block when they use a dummy argument. Use only parameters, literal constants, or variables in COMMON in this situation.


12.6.2 Processor Masks
-----------------------

NOTE: This discussion assumes that you are compiling for the VU model. If you are compiling for the SPARC nodes model, the use of masks is the same but they apply to nodes instead of vector units. On the CM-2/200, they apply to nodes.

To understand processor masks, it is useful to review the concept of a physical grid. The physical grid of an array is the arrangement of vector units; it has one position for each VU, arranged in a grid whose rank is the same as that of the array (excluding serial axes, if any). The dimensions of the physical grid are each powers of 2, since there must be a power-of-2 number of VUs in the partition.
If axis i of the physical grid has length di, then we need log2(di) bits to represent a position in the physical grid along this axis. The processor mask for the axis is a mask with log2(di) bits set. When we number all of the VUs linearly, these bits are the ones that determine each VU's position along axis i in the physical grid. We call this linear numbering the physical address of the VU.

The vector units along the axis containing the least significant bit in the mask are contiguous. Also, in the current implementation, the bits for any one axis must be contiguous.

Let's assume we are going to run a program on 32 vector units, and we want a physical grid that is 4 x 8, as shown in Figure 28.

[ Figure Omitted ]

Figure 28. A physical grid with processor masks of 3 and 28.

We would represent the physical address of each of the 32 vector units with a number between 0 and 31; this requires 5 bits:

      bbbbb
    MSB   LSB

To specify a 4 x 8 physical grid:

o Axis 1 (4 vector units) requires the lowest 2 bits of the mask because the vector units along this axis are contiguous; its processor mask is therefore 3:

      00011
    MSB   LSB

o Axis 2 (8 vector units) requires bits 2, 3, and 4 of the mask (counting from bit 0 at the LSB); bits 0 and 1 are set to 0. Its processor mask is therefore 28:

      11100
    MSB   LSB

If you want the vector units along axis 2 to be contiguous, the masks would be:

    axis 1 = 11000
    axis 2 = 00111

or 24 and 7.

Note that the two least significant bits of the physical address denote the four vector units within a node. Since communication within a node is more efficient than communication between nodes, axes to which these two bits are assigned can be more efficient directions for communication.

What if we want to maximize the speed of communication along axis 1? To do that, we could allocate all the bits of the physical mask to axis 2. This would create these masks:

    axis 1 = 00000
    axis 2 = 11111

or 0 and 31. Elements along axis 1 would be on the same vector unit, and performance would be best along that axis.
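
The mask arithmetic in these examples can be checked with a short Python sketch. This is an illustration under stated assumptions, not part of CM Fortran: the names processor_masks and grid_position are hypothetical, and the sketch assumes that the axes are assigned consecutive bit fields starting at the least significant bit, in the order given.

```python
def processor_masks(grid, low_to_high):
    """Return one processor mask per axis, in axis order.

    grid        : physical-grid length of each axis (each a power of 2).
    low_to_high : 1-based axis numbers, starting with the axis that takes
                  the low-order bits (the axis whose VUs are contiguous).
    """
    masks = {}
    shift = 0
    for axis in low_to_high:
        length = grid[axis - 1]
        if length < 1 or length & (length - 1):
            raise ValueError("physical-grid lengths must be powers of 2")
        nbits = length.bit_length() - 1       # log2(length)
        masks[axis] = ((1 << nbits) - 1) << shift
        shift += nbits
    return [masks[axis] for axis in range(1, len(grid) + 1)]


def grid_position(address, mask):
    """Position of a VU along one axis, given its physical address and
    that axis's mask (0 for an axis with mask 0)."""
    if mask == 0:
        return 0
    lowest_bit = (mask & -mask).bit_length() - 1
    return (address & mask) >> lowest_bit


if __name__ == "__main__":
    print(processor_masks((4, 8), (1, 2)))    # axis 1 in the low-order bits
    print(processor_masks((4, 8), (2, 1)))    # axis 2 in the low-order bits
    print(processor_masks((1, 32), (1, 2)))   # all bits given to axis 2
```

The three calls correspond to the mask pairs 3 and 28, 24 and 7, and 0 and 31 discussed above. With masks 3 and 28, grid_position places the VU at physical address 13 (binary 01101) at position 1 along axis 1 and position 3 along axis 2, counting from 0.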

If the program is running on 64 vector units and you want a physical grid of 8 x 8, both axes require three bits. But note that the bit mask lets us specify the axis sequence:

o To make vector units along axis 1 contiguous, specify axis 1 in the low-order bits:

      000111

  and axis 2 in the high-order bits:

      111000

  or 7 and 56.

o To make vector units along axis 2 contiguous, reverse the masks: 56 and 7.

Note these constraints in the current implementation for specifying the processor masks:

o Each mask must represent a contiguous set of bits. For example, a mask of 5 (101 in binary) is illegal.

o The mask for one axis must not use any bits used by another axis. For example, masks of 7 (binary 0111) and 12 (binary 1100) are illegal in combination, because both use bit 2 (the third bit from the right).

o The sum of the masks must use all bits 0 through n, where n is less than the total number of bits that represent the vector units on which the program will run. For example, if you are going to run on 32 vector units, you can use all five bits, or the lowest four bits, or the lowest three bits, and so on. You can't use only the highest four bits. Typically the masks should together use all of the bits, except for very small arrays, which don't use all the vector units.

  If you use fewer than the total number of bits, the array will use fewer than the total number of vector units; this is usually inefficient.

o A serial axis has a processor mask of 0.


12.6.3 Examples of Detailed Layout Directives
----------------------------------------------

o The array A(16,16,4) could be laid out on 8 processors as follows:

      CMF$ LAYOUT A(:SERIAL,:BLOCK=8:PROCS=2,:BLOCK=1:PROCS=4)

  In this mapping, the first axis is totally local (it could also have been specified as :BLOCK=16:PROCS=1); the 16 elements of the second axis are divided equally between 2 processors; and the third axis is spread across 4 processors, one element in each.

o The array B(64,16) could be laid out on 16 processors, using :BLOCK :PDESC, as follows:

      CMF$ LAYOUT B(:BLOCK=16:PDESC=12,:BLOCK=4:PDESC=3)

  Since the mask :PDESC=12 has 2 nonzero bits, it represents a physical axis of length 4. The axis extent (64) equals subgrid length (16) times physical axis length (4). Similarly, the mask of 3 for the second axis has 2 nonzero bits, and 4*4 = 16.


12.6.4 Restrictions on Array Size with Padding Enabled
-------------------------------------------------------

Certain restrictions apply to detailed layouts when the program's execution environment pads user arrays such that the number of elements allocated per processor (for non-serial axes) is evenly divisible by the length of the vector registers in the CM-5 vector units or in the CM-2 nodes. These requirements vary by execution model; they apply only if the cmf switch -padding is enabled.

o For the CM-5 vector units (-vu) model with -nopadding, there are no restrictions on the product of the :BLOCK values.

o For the CM-5 vector units (-vu) model with -padding, the product of the :BLOCK values must be an integer multiple of 8. Axes labeled :SERIAL are excluded from the calculation.

o For the CM-5 nodes (-sparc) model, there are no restrictions on the product of the :BLOCK values.

o For the CM-2 and CM-200 slicewise (-slicewise) model, the product of the :BLOCK values must be an integer multiple of 4. Axes labeled :SERIAL are excluded from the calculation. (The switch -nopadding is not supported on the CM-2 or CM-200.)

For example, assume a 32-vector-unit partition on the CM-5 and this detailed layout:

    REAL B(100,32)
    CMF$ LAYOUT B(:SERIAL, :BLOCK=1:PROCS=32)

With -padding enabled, this layout causes a run-time error:

    *** RTS-FATAL-CMRTS: *** Fatal RTS Error ***:
    CMRT_intern_detailed_array_geometry: the product of the specified subgrid lengths, 100, must be a multiple of 8

With -nopadding, this directive gives the intended layout of one 100-element serial axis per processor.
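
The consistency rules for detailed layouts lend themselves to a quick pre-check before compiling. The Python sketch below is an illustration only and is not how the RTS validates a directive: the names SERIAL, procs_from_mask, and check_detailed_layout are hypothetical, and the quantum argument is assumed to be 8 for -vu with -padding, 4 for the slicewise model, and 1 where no restriction applies. Following the rule stated in this section, serial axes are excluded from the product of the :BLOCK values.

```python
from math import prod

SERIAL = "SERIAL"  # stands in for the :SERIAL axis descriptor


def procs_from_mask(mask):
    """Physical-axis length implied by a :PDESC mask: 2 ** (number of set bits)."""
    return 1 << bin(mask).count("1")


def check_detailed_layout(extents, axes, n_procs, quantum=8):
    """Return a list of problems found; an empty list means the checks passed.

    extents : array extent of each axis.
    axes    : SERIAL or a (block, procs) pair for each axis.
    n_procs : number of processors in the partition.
    """
    problems = []
    blocks, procs_used = [], 1
    for number, (extent, spec) in enumerate(zip(extents, axes), start=1):
        if spec == SERIAL:
            continue  # serial axes are excluded from both calculations
        block, procs = spec
        if procs < 1 or procs & (procs - 1):
            problems.append(f"axis {number}: :PROCS={procs} is not a power of 2")
        if block * procs < extent:
            problems.append(f"axis {number}: {block} x {procs} does not cover extent {extent}")
        blocks.append(block)
        procs_used *= procs
    if procs_used > n_procs:
        problems.append(f"layout needs {procs_used} processors, only {n_procs} available")
    if prod(blocks) % quantum:
        problems.append(f"product of :BLOCK values, {prod(blocks)}, is not a multiple of {quantum}")
    return problems


if __name__ == "__main__":
    # A(16,16,4) on 8 processors and B(100,32) on 32 vector units.
    print(check_detailed_layout((16, 16, 4), [SERIAL, (8, 2), (1, 4)], 8))
    print(check_detailed_layout((100, 32), [SERIAL, (1, 32)], 32))
    print(check_detailed_layout((100, 32), [SERIAL, (1, 32)], 32, quantum=1))
```

The first and third calls report no problems; the second reports the multiple-of-8 violation that -padding triggers for B(100,32). A :PDESC layout can be checked by converting each mask with procs_from_mask first.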
12.7 A NOTE ON ALIGN
---------------------

The ALIGN directive describes the layout of an array in terms of
specified axes of another array whose layout is already determined.
This directive causes the elements of the source array to be placed
in the same processors as certain sections of the target array.

12.7.1 Syntax
--------------

Like the LAYOUT directive, the ALIGN directive can be applied only
once to an array in a program, and it must be repeated in every
program unit where that array is used. Its format is:

      CMF$ ALIGN source-array ( axis-specs ) WITH target-array ( axis-specs )

The axis-specs of the source array assign a symbolic name to each of
its dimensions. The same symbolic names are then used in the
axis-specs of the target array to indicate the pairs of
source-and-target dimensions that are to be aligned in the same
processors. Any target dimensions that are not related to the source
array are identified with a scalar index value (or offset).

For example, to align a 5-element vector V with the first row of the
5 x 5 matrix A:

      DIMENSION V(5), A(5,5)
      CMF$ ALIGN V(I) WITH A(1,I)

The effect is to place V in the same geometry as A, in the same
processors as A's first row:

      VVVVV     AAAAA
      .....     AAAAA
      .....     AAAAA
      .....     AAAAA
      .....     AAAAA

Similarly, to align V with the last row of A:

      CMF$ ALIGN V(I) WITH A(5,I)

as illustrated by

      .....     AAAAA
      .....     AAAAA
      .....     AAAAA
      .....     AAAAA
      VVVVV     AAAAA

To align V with the first column of A:

      CMF$ ALIGN V(I) WITH A(I,1)

as illustrated by

      V....     AAAAA
      V....     AAAAA
      V....     AAAAA
      V....     AAAAA
      V....     AAAAA

Other alignments, including offset alignments, are illustrated in
the CM Fortran Language Reference Manual. Permuted alignments are
not supported.

12.7.2 Caveats on ALIGN
------------------------

The ALIGN directive can greatly enhance program performance when two
arrays that are not conformable are often used together.
By giving the arrays the same geometry and aligning specified
elements in the same processors, the directive lets the compiler use
local memory accesses instead of communication to operate on
corresponding elements of the two arrays.

There is, however, a major cost in memory use when you align an
array with another array of larger size or higher rank (which is
normally the intention). The aligned array is given the same
geometry as the target and is therefore allocated the same amount of
memory, even though much of this memory may never be used. A
5-element vector aligned with a 5 x 5 matrix, for instance, is given
the matrix's 5 x 5 geometry although only 5 of those positions hold
data.

See the CM-5 CM Fortran Performance Guide for more information on
the performance impact of ALIGN.

*****************************************************************

The information in this document is subject to change without
notice and should not be construed as a commitment by Thinking
Machines Corporation. Thinking Machines reserves the right to make
changes to any product described herein.

Although the information in this document has been reviewed and is
believed to be reliable, Thinking Machines Corporation assumes no
liability for errors in this document. Thinking Machines does not
assume any liability arising from the application or use of any
information or product described herein.

*****************************************************************

Connection Machine (r) is a registered trademark of Thinking
Machines Corporation.
CM, CM-2, CM-200, CM-5, CM-5 Scale 3, and DataVault are trademarks
of Thinking Machines Corporation.
CMOST, CMAX, and Prism are trademarks of Thinking Machines
Corporation.
C* (r) is a registered trademark of Thinking Machines Corporation.
Paris, *Lisp, and CM Fortran are trademarks of Thinking Machines
Corporation.
CMMD, CMSSL, and CMX11 are trademarks of Thinking Machines
Corporation.
CMview is a trademark of Thinking Machines Corporation.
Scalable Computing (SC) is a trademark of Thinking Machines
Corporation.
Scalable Disk Array (SDA) is a trademark of Thinking Machines
Corporation.
Thinking Machines (r) is a registered trademark of Thinking Machines
Corporation.
SPARC and SPARCstation are trademarks of SPARC International, Inc.
Sun, Sun-4, SunOS, Sun FORTRAN, and Sun Workstation are trademarks
of Sun Microsystems, Inc.
UNIX is a trademark of UNIX System Laboratories, Inc.
The X Window System is a trademark of the Massachusetts Institute of
Technology.

Copyright (c) 1989-1994 by Thinking Machines Corporation. All rights
reserved.

This file contains documentation produced by Thinking Machines
Corporation. Unauthorized duplication of this documentation is
prohibited.

Thinking Machines Corporation
245 First Street
Cambridge, Massachusetts 02142-1264
(617) 234-1000