USING THE CMAX CONVERTER
Version 1.0, July 1993
Copyright (c) 1993 Thinking Machines Corporation.

CHAPTER 1: OVERVIEW
*******************

CMAX--the CM Automated Translator--converts Fortran 77 programs into
CM Fortran. It greatly simplifies the task of porting serial Fortran
programs to the massively parallel Connection Machine system. The
conversion procedure is the same whether your target is the global
data parallel model of CM Fortran or the nodal model, which uses the
vector units for parallel processing and explicit message passing for
communication.

This chapter provides an overview of the converter: its goals and its
capabilities. Later chapters address the larger porting process:
preparing the program, using the converter, and customizing code for
best performance on the CM system.

1.1 WHY USE THE CONVERTER?
---------------------------

The CMAX Converter has three basic goals:

o It enables developers of new Fortran 77 programs to access the
  parallel processing resources of the CM system. For programs that
  follow the few simple conventions of scalable Fortran, conversion
  is largely automatic.

o It serves users in a heterogeneous computer environment, and
  third-party software developers, as a "preprocessor" for routine
  CM Fortran compilation. Since the program is maintained in Fortran
  77 for portability to multiple platforms, the converter provides a
  migration path onto and off of the CM system.

o It assists users in porting older Fortran 77 programs to CM
  Fortran. Nonstandard features and non-scalable programming idioms
  may not convert automatically, however, so some manual recoding is
  usually required.

The amount of recoding depends, naturally, on the size and condition
of the input program. It is not a specific goal of the CMAX Converter
to convert pre-Fortran 77, "dusty deck" programs, although it can be
helpful in the later stages of the porting process.
Dusty decks must first be brought into conformance with the Fortran
77 standard. From there, any of several paths can be followed,
depending on whether the program is to run on several architectures
or on the CM only. Chapter 2 of this manual provides more information
on the converter's place within an overall porting process.

1.2 A SIMPLE CONVERSION
------------------------

The CMAX Converter's basic actions are:

o to vectorize DO loops on arrays, that is, to translate them into CM
  Fortran array operations

o to indicate the home of each array: the CM serial control processor
  or the parallel processing unit

These actions enable the CM Fortran compiler to generate parallel
instructions and to allocate arrays on the appropriate part of the
system.

This section illustrates the converter by presenting the
straightforward conversion of a simple program. Program BART, Figure
1, numerically integrates the cosine function over a given interval
using Simpson's rule. The converter automatically translates BART
into the CM Fortran program shown in Figure 2.

1.2.1 Invoking the Converter
-----------------------------

The converter is currently available as a batch tool that can be
invoked from any shell prompt. The bare-bones conversion procedure is
the following.

1. Create a package containing the source program. A package is a set
   of files that the CMAX Converter treats as a complete program.

   % cmax bartpack -AddFiles= bart.f [other source files]

2. Convert the program in the package to CM Fortran. Each converted
   .f file is written to a .fcm file.

   % cmax bartpack

3. Compile the output file(s) for the CM system of your choice.

   % cmf [ cmf switches ] -o bart bart.fcm

4. Execute the program on the appropriate CM system.

   % bart

Chapter 2 of this manual presents more information about the cmax
command and the variations in the conversion procedure.

[ Figures Omitted ]

Figure 1. The BART program, coded in Fortran 77 with DO loops on
arrays.

Figure 2. The BART program, converted to CM Fortran.
1.2.2 The Converter's Action
-----------------------------

Notice what the converter has done to the BART program. It has
performed loop conversion to express parallelism, and it has
determined where on the CM system each operation will be performed.
This section is a brief introduction only; a more detailed list of
the transformations CMAX can perform appears in Appendix ???.

Loop Conversion
---------------

The converter has translated the two DO loops in the Fortran 77
program into array operations. The loop that operates on each array
element independently,

C Evaluate function and compute coefficients:
      DO I = 1,NELTS
        X = START + (I-1)*EPSILON
        VALUE(I) = COS(X)
        COEFF(I) = 2 + 2*MOD(I-1,2)
      END DO

turns into these array assignment statements:

C Evaluate function and compute coefficients:
      FORALL (I=1:NELTS) X100(I) = START+(I-1)*EPSILON
      VALUE(:NELTS) = COS(X100(:NELTS))
      FORALL (I=1:NELTS) COEFF(I) = 2+2*MOD(I-1,2)

The loop that sums the products of the respective elements of COEFF
and VALUE into a scalar,

C Compute total area using Simpson's rule:
      AREA = 0.0
      DO I = 1,NELTS
        AREA = AREA + COEFF(I)*VALUE(I)
      END DO

turns into this call to the intrinsic function DOTPRODUCT:

C Compute total area using Simpson's rule:
      AREA = 0.0
      AREA = AREA +
     .   DOTPRODUCT(COEFF(:NELTS),VALUE(:NELTS))

Array Homes
-----------

The converter has also inserted directives that control where on the
CM system the operations will occur, guaranteeing that the data to be
processed in parallel is allocated in the memory of the parallel
processors.
The arrays COEFF and VALUE,

      PARAMETER (MAXSLICES = 1000)
      REAL VALUE(MAXSLICES), COEFF(MAXSLICES)

become distributed across CM processors in NEWS order:

      PARAMETER (MAXSLICES = 1000)
      REAL VALUE(MAXSLICES), COEFF(MAXSLICES)
CMF$  LAYOUT COEFF(:NEWS)
CMF$  LAYOUT VALUE(:NEWS)

In addition, the scalar X,

      REAL X

is promoted to an array named X100 and distributed, since it will be
used in array operations:

      REAL X100(MAXSLICES)
CMF$  LAYOUT X100(:NEWS)

1.2.3 Is BART Typical?
-----------------------

BART's conversion is painless not only because the program is simple,
but also because it is, to a large extent, scalable. The basic
feature of scalable programs is that they operate on arrays of data,
which a compiler can split up among multiple processors for
independent or coordinated processing.

Contrast BART with program MAGGIE in Figure 3. MAGGIE is also a
Simpson integrator, but this implementation does not use arrays. The
CMAX Converter will not produce parallel code for MAGGIE.

The following section presents the conventions of scalable Fortran 77
programming. Although these conventions arose from a desire to use
data parallel systems effectively, they turn out to be a convenient
strategy for achieving good performance on many other architectures
as well.

[ Figure Omitted ]

Figure 3. The MAGGIE program, coded in Fortran 77 without arrays.

1.3 WRITING SCALABLE FORTRAN
-----------------------------

Fortran 77 is available on virtually all platforms and thus assures
convenient portability. A program's performance, however, is often
tuned to the details of a particular architecture, and thus may not
port well. With the recent explosion of architectures (vector
machines, highly pipelined machines, machines with multiple
functional units, massively parallel machines, and machines that
combine these features), the portability of a code's performance is
not guaranteed, and must be engineered in. We define scalability as
the portability of a code's performance.
In particular, a scalable program is one designed to execute
efficiently on any size data set, large or small, using any number of
processors, from one to thousands. Modern data parallel languages
like Fortran 90 provide constructs conducive to scalable programming;
in older languages like Fortran 77, one must follow certain
conventions to ensure scalability. The conventions express three
basic objectives:

o Make it easy for a compiler to recognize how data and computations
  may be split up for independent or coordinated processing.

o Avoid constructions that rely on a particular memory organization.

o Use data layout directives and library procedures (with some
  conditionalizing convention) to take advantage of the specific
  performance characteristics of the target platforms.

This section introduces the conventions of scalable programming in
Fortran 77, using fragments from a widely used computational
aerodynamics program called FLO67. This program uses a multigrid
scheme to simulate the 3-dimensional airflow past a swept wing. It is
implemented in about 5000 lines of code.

NOTE: The conventions are intended as rules of thumb rather than
hard-and-fast requirements. In places where different conventions may
conflict, as noted in the illustrations below, the goal being sought
should guide the programmer's judgment as to which to follow.

Rule 1: Scalable programs operate on most or all the data "at once."

In Fortran 77, this convention means looping over as many array axes,
and as much of each axis, as possible. The loop below, for example,
calculates the air pressure for each cell:

      DO 90 K=1,KL
      DO 90 J=1,JL
      DO 90 I=1,IL
        QQ = W(I,J,K,2)**2 + W(I,J,K,3)**2 + W(I,J,K,4)**2
        QQ = .5*QQ/W(I,J,K,1)
        P(I,J,K)=(GAMMA-1.)*DIM(W(I,J,K,5),QQ)
   90 CONTINUE

The 4-dimensional array W implements a 3-dimensional array of
5-element record structures; structure components are selected by
indexing along the fourth axis.
The first structure component is the air density; the second, third,
and fourth components are the momentum densities in the X, Y, and Z
directions, respectively; and the fifth component is the total energy
density.

NOTE: While data can be distributed across processors along the three
spatial dimensions of W, the fourth dimension is better left
undistributed (in CM Fortran parlance, it is a serial axis), since
operations on it are likely to be inherently sequential. This case
illustrates the judgmental quality of scalable programming, since a
serial axis warrants different treatment from axes in the data
domain.

Rule 2: Scalable programs operate on data elements homogeneously,
performing identical or similar operations on each.

The programming style encouraged by serial computers (and serial
thinking) often violates this convention. The code below computes
different values for interior and boundary cells:

      DO 10 J=2,M-1
      DO 10 I=2,N-1
        A(I,J) = A(I,J) + B(I,J)      ! Interior
   10 CONTINUE
      DO 20 J=1,M
        A(1,J) = C(1,J)               ! Edge
        A(N,J) = C(N,J)
   20 CONTINUE
      DO 30 I=2,N-1
        A(I,1) = C(I,1)               ! Edge
        A(I,M) = C(I,M)
   30 CONTINUE

In addition to violating the first rule (operate on all the data at
once), these loops perform different operations for different
elements. Rather than coding three loops as above, one could set up a
logical variable MASK that is .TRUE. for interior elements and
.FALSE. for boundary elements. The above computation could then be
expressed in this scalable code:

      DO 10 J=1,M
      DO 10 I=1,N
        IF (MASK(I,J)) THEN
          A(I,J) = A(I,J) + B(I,J)
        ELSE
          A(I,J) = C(I,J)
        END IF
   10 CONTINUE

Rule 3: Scalable programs exhibit locality of reference.

In Fortran 77, this convention means that expressions within loops
should reference similarly indexed portions of similarly shaped
arrays. Elements at the loop indices or nearby are used in
computations; distant elements are not.
A subroutine of FLO67 computes each element of array DTL as a
function of nearby elements of array RAD:

      DO 70 K=1,KL
      DO 70 J=1,JL
      DO 70 I=1,IL
        RADA = RAD(I+0,J+0,K+0) + RAD(I+1,J+0,K+0) +
     .         RAD(I+0,J+1,K+0) + RAD(I+1,J+1,K+0) +
     .         RAD(I+0,J+0,K+1) + RAD(I+1,J+0,K+1) +
     .         RAD(I+0,J+1,K+1) + RAD(I+1,J+1,K+1)
        DTL(I,J,K)=CFL/RADA
   70 CONTINUE

In data parallel architectures, this convention allows the compiler
to use nearest-neighbor communication instead of the more costly
general communication between processors.

Rule 4: Scalable programs exhibit numerical stability.

Some degree of imprecision is unavoidable in floating-point
calculations. Since most real numbers cannot be represented exactly
in digital form, these numbers are rounded to a representable number.
The cumulative round-off error for a series of computations can at
times lead to floating-point exceptions or incorrect results.

Scalable programs avoid relying on the correctness of an algorithm or
round-off tolerance tuned to a single architecture. This is
particularly true of array summations and other reductions, since
floating-point reductions are especially sensitive to the order of
evaluation. Because they may use a different order of evaluation,
vector and parallel machines may produce results markedly different
from those of serial machines, including run-time errors. For
example, if reversing the order of a summation loop changes the
answer in a significant way, then the computation is numerically
unstable.
Consider this program:

-----------------------------------------------------------------
      PROGRAM ROUNDABOUT
      REAL A(12), X
      INTEGER I

      A(1)  = +1E+8
      A(2)  = +1E+1
      A(3)  = +2E+8
      A(4)  = +2E+1
      A(5)  = +3E+8
      A(6)  = +3E+1
      A(7)  = -3E+1
      A(8)  = -3E+8
      A(9)  = -2E+1
      A(10) = -2E+8
      A(11) = -1E+1
      A(12) = -1E+8

C Add up from 1 to 12:
      X = 0.0
      DO I = 1,12
        X = X + A(I)
      END DO
      PRINT *, X

C Add up from 12 to 1:
      X = 0.0
      DO I = 12,1,-1
        X = X + A(I)
      END DO
      PRINT *, X

C Add up odd elements, then even:
      X = 0.0
      DO I = 1,11,2
        X = X + A(I)
      END DO
      DO I = 2,12,2
        X = X + A(I)
      END DO
      PRINT *, X
      END
-----------------------------------------------------------------

A system using IEEE 32-bit REAL values produces this output:

      -40.0000
       40.0000
        0.

Using IEEE 64-bit DOUBLE PRECISION instead of 32-bit REAL, the output
is:

        0.
        0.
        0.

Scalable programs maintain numerical stability on all target
architectures. This example illustrates two options for achieving
this:

o Go to a higher-precision representation.

o Change the algorithm. In this example, the numerically stable
  answer is obtained by summing first the odd elements, then the even
  elements.

Rule 5: Scalable programs use easily recognizable idioms to express
common, well-structured dependences.

In the loops shown thus far, each iteration can be executed
independently of the others; that is, there are no loop-carried data
dependences. Most algorithms do have some form of interaction between
data items. While code with arbitrary data dependences does not
generally execute well on vector or parallel machines, most
well-structured data-dependent computations do have efficient
low-level algorithms on serial, vector, and parallel machines.
Scalable code enables a compiler to recognize such computations by
expressing them with a small set of common idioms. (Appendix B of
this manual lists many of these idioms.)
Consider this convergence-checking code, from the EULER subroutine of
FLO67:

      NSUP = 0
      HRMS = 0
      DO 80 K=1,KL
      DO 80 J=1,JL
      DO 80 I=1,IL
        V1 = W(I,J,K,2)*W(I,J,K,2) +
     .       W(I,J,K,3)*W(I,J,K,3) +
     .       W(I,J,K,4)*W(I,J,K,4)
        IF (V1.GE.(GAMMA*P(I,J,K)*W(I,J,K,1)))
     .    NSUP = NSUP + 1                        ! COUNT
        V1 = W(I,J,K,5) + P(I,J,K)
        V1 = V1/W(I,J,K,1) - H0
        HRMS = HRMS + V1*V1                      ! SUM
   80 CONTINUE
      HRMS = SQRT(HRMS/FLOAT((NX-1)*NY*NZ))

The variable NSUP is incremented once for each cell that exceeds a
threshold. An incrementation contained in an IF statement is an
idiomatic way of expressing the reduction operation named COUNT in
Fortran 90. The variable HRMS is computed using another reduction
operation: the SUM of the square of a value computed for each cell.
This operation is also the idiomatic expression of a dot product.

Rule 6: Scalable programs do not assume a particular memory model.

Certain features of Fortran 77, particularly sequence association and
storage association, assume a linear model of memory.

o Sequence association is the mapping of a multidimensional object in
  column-major order to a linear sequence of values.

      REAL A(10,10), B(10,10)
      ...
      CALL SUBA(A, B(5,5))
      ...
      SUBROUTINE SUBA (C,D)
      REAL C(100), D(10)

  In this example, the elements of matrices A and B are sequence
  associated with elements of the dummy vector arguments C and D.

o Storage association occurs when two or more variables (or arrays)
  share the same storage. This feature is exploited by EQUIVALENCE,

      REAL A(2,50)
      COMPLEX C(100)
      EQUIVALENCE (A(1,1),C(1))

  and by some uses of COMMON,

      SUBROUTINE SUBA
      COMMON /BOUNDS/ IBX, IBY, IBZ
      ...
      SUBROUTINE SUBB
      COMMON /BOUNDS/ IBDS(3)

Sequence and storage association are difficult to implement and tend
to be inefficient on memory systems that are not linear. These
features are included in Fortran 90 for compatibility with Fortran
77, but scalable programs avoid them.
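For instance, a real/complex EQUIVALENCE like the one above can often
be avoided by keeping the real and imaginary parts in separate REAL
arrays. The sketch below is ours, not from FLO67; the names CREAL,
CIMAG, PR, and PI are illustrative only:

```fortran
C Instead of overlaying a COMPLEX array on REAL storage with
C EQUIVALENCE, keep the real and imaginary parts in arrays of
C their own.
      REAL CREAL(100), CIMAG(100)
C A complex multiply by the constant (PR,PI), written explicitly.
C CIMAG(I) is updated from the old CREAL(I), which is saved in TEMP
C and stored last.
      DO I = 1,100
        TEMP     = CREAL(I)*PR - CIMAG(I)*PI
        CIMAG(I) = CREAL(I)*PI + CIMAG(I)*PR
        CREAL(I) = TEMP
      END DO
```

Written this way, the loop has no storage association and no
loop-carried dependence, so it remains a candidate for vectorization.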
New features of Fortran 90, such as allocatable, automatic, and
assumed-shape arrays, reduce the need for some uses of sequence and
storage association. Other uses can be replaced by alternative
Fortran 77 practices, which are both clearer and more scalable (see
Chapter 4).

As general principles of scalable programming, the following rules
are corollaries of Rule 6.

Rule 6a: Multidimensional arrays should be declared as such and be
declared consistently throughout a program.

Arrays that change shape, particularly across subroutine boundaries,
impose a severe burden on systems with a nonlinear memory
organization such as the CM system. In addition, performance is
greatly affected by the way in which arrays are allocated, and the
number and size of dimensions must be made explicit when they are
allocated.

The original FLO67 code used storage pools from which it allocated
different mesh variables. Pointers into these pools were passed to
subroutines, which declared their arguments as multidimensional
arrays. To enhance scalability, this original code,

      DIMENSION W(IDX5),P(IDX), ...
      ...
      K1 = 1
      KW = 1
      NC = (IE+1) * (JE+1) * (KE+1)
      L1 = K1 + NC
      LW = KW + 5*NC
      ...
      CALL FLO (W(KW),P(K1), ..., W(LW),P(L1), ...)

was recoded as,

      DIMENSION W_0(0:IEM,0:JEM,0:KEM,5),P_0(0:IEM,0:JEM,0:KEM)
      DIMENSION W_1(0:IEM,0:JEM,0:KEM,5),P_1(0:IEM,0:JEM,0:KEM)
      ...
      CALL FLO (W_0,P_0, ..., W_1,P_1, ...)

Rule 6b: Data layout directives should be supplied where necessary
and helpful.

Most machines' performance is sensitive to data layout, because of
its impact on locality of access. Distributed-memory parallel
machines are particularly sensitive, since data layout directly
influences the frequency of interprocessor communication. Scalable
programs reflect an awareness of the layout conventions of the target
systems and use machine-specific directives to override the default
layout where appropriate.
In the FLO67 program, the 4-dimensional array W shown under Rule 1
implements a 3-dimensional array of 5-element record structures; the
structure components are selected by indexing along the fourth axis.
For the CM system, the fourth dimension is better left undistributed,
as specified by the CM Fortran compiler directive LAYOUT:

      REAL W(0:IE,0:JE,0:KE,5)
CMF$  LAYOUT W(:NEWS, :NEWS, :NEWS, :SERIAL)

A program might also benefit from aligning arrays in distributed
memory, by means of the directive ALIGN. These directives can be
inserted into the Fortran 77 source program, since they are ignored
by other Fortran compilers.

Rule 6c: Arrays should be of appropriate size and declared so that
they are easily changeable.

The optimal size for arrays and array dimensions varies among
machines. On Cray machines, certain array sizes can degrade
performance because of bank conflicts; the CM-2 system rewards
power-of-2 array dimensions; other machines may have constraints
stemming from page sizes or cache sizes; and so on.

Array sizes are easily changed between compilations if they are
declared with parameters instead of literal constants and if the
PARAMETER statements are easy to locate. (INCLUDE files are
convenient for this purpose.) This convention is also helpful in
scaling a Fortran 77 program to different problem sizes.

Rule 7: Scalable programs use machine-specific libraries where
available.

A call to a library routine fulfills the scalability goal of
specifying the desired operation without dictating its algorithm.
Sorting, matrix multiplication, equation solving, and Fourier
transforms are examples of operations for which no single algorithm
is universally optimal. Using library procedures, along with some
conditionalizing convention in the source program, helps make the
code machine-independent and also takes advantage of any
machine-specific tuning by the library implementor.

1.4 WHAT DOES THE CONVERTER DO?
--------------------------------

CM Fortran includes all of Fortran 77 plus some extensions that
support data parallel processing. Translating a program into CM
Fortran therefore involves:

o the Fortran 90 array operations and other language features that CM
  Fortran provides to express parallelism

o the CM Fortran compiler's conventions for laying out data (in
  linear versus distributed memory, depending on whether parallel
  processing is appropriate)

o the Fortran 90 method of passing array arguments and the semantics
  of procedure calls

The converter is designed to deal with all these translation issues.
By viewing the whole program, it can provide the cmf compiler with
code (including directives) that expresses parallelism; that avoids
"array home" conflicts by directing that arrays processed in parallel
are to be allocated in distributed memory; and that uses correct CM
Fortran methods of passing distributed arrays as arguments.

1.4.1 Array Operations
-----------------------

An array operation is a computation that is performed on an array as
a single entity, that is, on all the array's elements or a specified
subset of them. CM Fortran adopts Fortran 90 array notation to
indicate the set of elements. The array reference A is short for
A(1:N:1), indicating all the elements; the reference A(1:N/2:2)
indicates every other element in the first half. This triplet
notation (express or implied) is analogous in function to the control
variables in a DO construct, specifying a range of indices for array
elements to be processed.

But there is an important difference. A DO loop specifies indices in
a particular sequential order, and processes one or more entire
statements for an index before going on to the next index. An array
operation, in contrast, may process indices in any order, but it must
perform any given single operation (such as addition, multiplication,
or assignment) for all indices before performing the next operation.
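The distinction matters when a statement both reads and writes the
same array. The following sketch (ours, not from the manual's example
programs) contrasts the two behaviors:

```fortran
      REAL A(5)
C In the DO loop, each assignment sees the element updated on the
C previous iteration, so the value of A(1) propagates to every
C element:
      DO I = 2,5
        A(I) = A(I-1)
      END DO
C The array assignment below instead reads all of A(1:4) before any
C element is stored, producing a shift by one position:
      A(2:5) = A(1:4)
```

A DO loop that is equivalent to the array assignment is one whose
result does not depend on the order in which the iterations execute.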
The net effect is that an array operation can process array elements
in any order without changing the result. The CM system takes
advantage of this feature to process array elements in parallel.

The CMAX Converter analyzes Fortran 77 DO constructs to determine
whether they are functionally equivalent to any CM Fortran array
operation and, if so, converts the loop(s) into array notation. This
section illustrates the converter's loop-conversion capabilities;
output may be simplified to clarify the essential transformation.
Verbatim output from loop transformations is shown in Appendix B.

Loops without Dependences
-------------------------

Loop operations in which array elements do not interact vectorize
straightforwardly into elemental operations:

[Diagram Omitted]

In other loop operations, the data elements may interact, but there
is no dependence between loop iterations. The converter's dependence
analysis detects that a step-by-2 smoothing operation can vectorize
to an array assignment. (Notice that the converter and CM Fortran
accept the END DO statement.)

[Diagram Omitted]

Similarly, the apparent dependence in a "mirror" operation is
recognized as not a real dependence:

[Diagram Omitted]

Loops That Express Common Idioms
--------------------------------

Loops with certain well-structured dependences are functionally
equivalent to CM Fortran intrinsic functions, Utility Library
procedures, or FORALL statements. The converter recognizes the intent
of the loop and substitutes the corresponding array operation. For
example, consider these idiomatic loop constructions:

[Diagram Omitted]

Loops with Embedded Conditionals
--------------------------------

Loops with embedded IF statements may convert to a masked array
assignment, such as WHERE or FORALL, or to a masked intrinsic. For
example:

[Diagram Omitted]

Other Vectorization Capabilities
--------------------------------

The converter analyzes and may restructure code to facilitate
vectorization.
For example, it promotes scalar values that are used with arrays into
arrays of the appropriate shape, as illustrated above in the
conversion of program BART. The array name is derived from the
scalar's name, as with X and X100, and the array is aligned, if
necessary, with another CM array to enhance the locality of access of
their respective elements.

The converter may also fission loops to isolate vectorizable code
from inherently serial operations:

[Diagram Omitted]

The converter may also transform a loop with an embedded subroutine
call into a subroutine that contains the loop, thereby making the
loop vectorizable. This loop-pushing transformation can be visualized
as follows. Notice that the dummy variable X becomes an array in the
intermediate form, and the loop becomes an array assignment in the CM
Fortran version.

[Diagram Omitted]

Limits on Vectorization
-----------------------

If the converter cannot safely vectorize a DO loop, it passes the
code through unchanged. The CM Fortran compiler then treats it as
serial code. Some loop constructions that the converter does not
translate are the following.

o Loops that contain a STOP or RETURN statement, a computed or
  assigned GO TO statement, or a forward or backward branch. In
  general, nonstructured constructs cannot convert to array
  operations.

o Loops containing calls to functions that are passed in as
  arguments.

o Loops that contain certain complicated data dependences. For
  example, the converter cannot vectorize the following:

      REAL A(10), B(10)
      DO I=1,7
        B(I) = A(I+3) + 10
        A(I) = B(I+3) + 1
      END DO

  [Diagram Omitted]

o Loops for which the converter cannot resolve a question of data
  dependence. For example, the following loop can vectorize only if
  the index offset M is zero or positive:

  [Diagram Omitted]

  Where M is negative, the loop must execute serially. If M is a
  run-time value, the converter cannot resolve this question.
NOTE: The converter provides directives with which the user can
assert information that resolves some dependence issues. If the user
asserts, in this case,

      CMAX$NODEPENDENCE

meaning that M is necessarily zero or positive, the converter can
vectorize the loop.

o Loops with structured dependences that have no functionally
  equivalent parallel operation in CM Fortran. For example, a
  parallel-prefix (or "scan") operation along a dimension is
  vectorizable. A scan along the diagonal of two dimensions, however,
  cannot (at present) be computed in parallel, and the converter does
  not vectorize it.

      DO J=2,N
        DO I=2,N
          A(I,J) = A(I-1,J) + A(I,J-1)
        END DO
      END DO

1.4.2 Array Homes
------------------

Every CM system consists of two processing components with different
memory organizations:

o A serial control processor, called the front end on a CM-2 or
  CM-200 and the partition manager on the CM-5. This processor has
  the conventional linear memory organization.

o A parallel processing unit consisting of some number of processing
  elements, called processors (in the abstract) or nodes or vector
  units (depending on the CM hardware configuration). The memory of
  the parallel unit is distributed among the processors.

Data stored on the control processor, including all scalar data and
front-end arrays, is operated upon serially. Other arrays (CM arrays)
are laid out across the parallel processors, one or more elements in
the local memory of each processor, and are operated upon in
parallel. The processing component on which an array is stored is
called its home. The CMAX Converter guides the compiler's array home
decisions.

[Diagram Omitted]

NOTE: The compiler's decomposition of data and code into "serial"
versus "parallel" is the same regardless of whether a CM Fortran
program is to run globally or "on a node." In the latter case, each
CM-5 node serves as a control processor, and the four vector units
associated with each node are its dedicated parallel processing unit.
See the CMMD documentation set for more information about executing
CM Fortran programs "on a node."

The Compiler's Action
---------------------

The CM Fortran compiler decides the home of each array in a program
after attempting to determine whether it is to be processed serially
or in parallel. Usually, it goes by whether the array is used in an
array operation, although the programmer can control the decision
with a compiler directive such as LAYOUT.

One of the chores of CM Fortran programming is to keep array homes
consistent across procedures. Since the compiler does not at present
perform interprocedural analysis, it may give an array different
homes in different procedures. However:

o It is an error to perform an array operation on a front-end array.

o There is a severe performance penalty for performing a serial
  operation on a CM array.

o It is an error to pass an actual array argument from either home to
  a dummy argument with the other home. For example:

      PROGRAM FAILURE
      INTEGER A(100,100)
      DATA [initialize A]
      CALL SUB(A)
      PRINT *, A
      STOP
      END

      SUBROUTINE SUB(B)
      INTEGER B(100,100)
      B = B**2
      RETURN
      END

The compiler allocates array A on the control processor, since there
is nothing in the main program to indicate that A is to be processed
in parallel. Array B, on the other hand, becomes a CM array by virtue
of the array assignment in the subprogram. This program fails at run
time when the front-end array is passed to the subroutine.

The Converter's Action
----------------------

The CMAX Converter spares the programmer the problem of controlling
array homes by inserting a LAYOUT directive for every array. Since
the converter performs interprocedural analysis, it can determine
whether an array is used in an array operation anywhere in the
program and can thus avoid home mismatches across procedure
boundaries. It may also create variants of certain subprograms, so
that both front-end arrays and CM arrays can be passed to them as
arguments.
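One way to visualize the repair: give A a CM home in the main program
so that it matches the home of the dummy argument B. The sketch below
is ours; the directives that CMAX actually emits may differ in
detail:

```fortran
      PROGRAM REPAIRED
      INTEGER A(100,100)
CMF$  LAYOUT A(:NEWS, :NEWS)
C A now has a CM home matching that of dummy argument B, so the
C call is legal and B = B**2 executes in parallel.
      CALL SUB(A)
      PRINT *, A
      END

      SUBROUTINE SUB(B)
      INTEGER B(100,100)
CMF$  LAYOUT B(:NEWS, :NEWS)
      B = B**2
      RETURN
      END
```

Because the converter sees the whole program, it makes this kind of
home decision consistently for every array, in every procedure.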
The user can of course override the converter's action by means of
in-line directives or converter command-line options. These features,
and the situations where they might be useful, are described in
Chapter 2 of this manual.

The converter makes array home decisions by the following rules:

o A user-supplied cmf directive LAYOUT or cmax option overrides all
  internal rules. CMAX propagates LAYOUT directives across program
  boundaries and issues a warning if it encounters inconsistency in
  their use.

o The converter also respects the cmf directive ALIGN within a
  program unit, but does not propagate it across program boundaries.
  (See Chapter 2 for restrictions on ALIGN.)

o All arrays that CM Fortran restricts to the control processor are
  marked as front-end arrays. These include arrays of CHARACTER type
  and arrays subject to an EQUIVALENCE statement.

o All arrays whose total size is below a threshold number of elements
  are marked as front-end arrays. (The default threshold size is 8
  elements; you can specify a different number with the
  -ShortVectorLength=nn converter switch. See Chapter 2.)

o All arrays in COMMON that change type or shape across program
  boundaries are marked as front-end arrays.

o All arrays passed as arguments to procedures that define a dummy
  argument of a different type or shape are marked as front-end
  arrays, except in those cases where the converter is able to adjust
  the code to meet this CM Fortran restriction.

The last two rules reflect restrictions that CM Fortran imposes on
arrays stored in distributed memory. Section 1.4.3 describes the
current state of the converter's ability to generate code that meets
these restrictions, thereby enabling more arrays to become CM arrays
and be subject to array operations.

1.4.3 Argument Passing
-----------------------

CM Fortran adopts the Fortran 90 semantics for passing CM array
arguments; it retains the Fortran 77 semantics for front-end array
arguments.
In Fortran 77, where arrays are passed by reference, an array name as an argument indicates the array's first element. In CM Fortran, where CM arrays are passed by descriptor, a reference to the name of a CM array indicates all its elements. When the argument is a whole array, a procedure call looks the same in both cases:

      REAL A(1000)
      ....
      CALL SUB(A)

However, the argument A in Fortran 77 is short for A(1), whereas in CM Fortran it is short for A(1:1000). The semantic difference becomes apparent when the procedure is to operate on only part of the array. For instance, in the following fragment the first subroutine call operates on the first half of the array and the second call on the second half:

[Diagram Omitted]

Both the calls on the left fail when A is a CM array because it is an error to change array size or shape across subroutine boundaries. In fact, the argument to SUB_2 is not an array at all, since the expression A(501) indicates a single array element. Where A is a CM array, this element is transferred to the control processor and passed as a scalar argument.

The CMAX Converter's interprocedural analysis enables it to generate code that meets the argument-passing requirements for CM arrays in three particular cases:

o Array arguments with a scalar subscript (one-dimensional)

o Array arguments with elided axes

o Assumed-size dummy array arguments (without change in rank)

Array Arguments with a Scalar Subscript
---------------------------------------

"Hidden array" arguments, like A(501) above, violate CM Fortran argument-passing restrictions by passing a scalar argument to a dummy array. The converter changes the call to pass the whole array, its length, and an integer index into it, and changes the procedure definition accordingly. At present, this action occurs only if both the actual and the dummy arrays are one-dimensional.
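A hedged sketch of this rewriting, using a SUB_2 call like the one discussed above (the argument names and exact generated form are illustrative; actual CMAX output may differ):

*     Original Fortran 77: a "hidden array" argument
      REAL A(1000)
      CALL SUB_2(A(501))
      ...
      SUBROUTINE SUB_2(B)
      REAL B(500)

*     After conversion: whole array, length, and index passed
      REAL A(1000)
      CALL SUB_2(A, 1000, 501)
      ...
      SUBROUTINE SUB_2(B, NB, IB)
      REAL B(NB)

The subroutine body would then be rewritten to operate on the section B(IB:NB) rather than on B(1:500), so that the whole array can have a CM home and the operation can be vectorized.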
[Diagram Omitted]

Array Arguments with Elided Axes
--------------------------------

In Fortran 77, one may pass an (n-k)-dimensional contiguous slice of an n-dimensional array to a subprogram in the following way: every index in the call other than the rightmost k indices is the lower bound of the corresponding dimension.

      REAL X(100,200,5)
      CALL FOO(X(1,1,3))
      ...
      SUBROUTINE FOO(Y)
      REAL Y(100,200)
      ...

This type of memory reference violates CM Fortran argument-passing restrictions by reshaping an array across program boundaries. However, CMAX recognizes the intention of axis elision in this situation and transforms the code to pass a 2-dimensional section of the 3-dimensional array. The lower-bound indices are turned into colons, indicating the whole dimension. (The user could improve the performance of the output by inserting a CM Fortran LAYOUT directive to designate the rightmost axis of X a serial, or non-distributed, axis.)

      REAL X(100,200,5)
      ...
      CALL FOO(X(:,:,3))
      ...
      SUBROUTINE FOO(Y)
      REAL Y(100,200)
CMF$  LAYOUT Y(:NEWS, :NEWS)
      ...

Assumed-Size Dummy Array Arguments
----------------------------------

CM Fortran does not support assumed-size dummy CM arrays (those with dimension lists ending in *), since Fortran 77 permits collapsing multiple axes into the * axis. Where CMAX has determined that the actual and dummy arguments are in fact of the same rank, and that they meet the other criteria for a CM home and parallel processing, it generates an assumed-shape dummy array.
For example, this subroutine:

      SUBROUTINE SUB(A,MAXA0,NSL)
      REAL A(MAXA0, MAXA0, *)
      DO IX = 1, MAXA0
        DO IY = 1, MAXA0
          DO ISL = 1, NSL
            A(IX, IY, ISL) = A(IX, IY, ISL) * 2
          END DO
        END DO
      END DO
      END

is translated to:

      SUBROUTINE SUB(A,MAXA0,NSL)
      REAL A(:,:,:)
CMF$  LAYOUT A(:NEWS,:NEWS,:NEWS)
      A(:,:,:NSL) = A(:,:,:NSL) * 2
      END

Array Arguments Confined to Front End
-------------------------------------

The converter confines to the control processor any array arguments that violate CM Fortran argument-passing restrictions in ways that the converter cannot work around. Specifically, it marks as front-end arrays any array arguments that:

o Are passed with a scalar index (a "hidden array argument") when either or both of the actual and dummy arguments are multidimensional.

o Change type or shape across subprogram boundaries (including array arguments and arrays in COMMON), except in recognized cases of axis elision or of scalar indexing into a 1-dimensional array.

o Are passed to assumed-size arrays when the actual and dummy arguments are of different ranks.

As a result, no loops on these arrays can be vectorized. See Chapter 4 for some ways to work around these restrictions in the Fortran 77 program.

-----------------------------------------------------------------

Contents copyright (C) 1993 by Thinking Machines Corporation. All rights reserved.

This file contains documentation produced by Thinking Machines Corporation. Unauthorized duplication of this documentation is prohibited.

*****************************************************************

The information in this document is subject to change without notice and should not be construed as a commitment by Thinking Machines Corporation. Thinking Machines reserves the right to make changes to any product described herein.

Although the information in this document has been reviewed and is believed to be reliable, Thinking Machines Corporation assumes no liability for errors in this document.
Thinking Machines does not assume any liability arising from the application or use of any information or product described herein.

*****************************************************************

Connection Machine (r) is a registered trademark of Thinking Machines Corporation.
CM, CM-2, CM-200, CM-5, CM-5 Scale 3, and DataVault are trademarks of Thinking Machines Corporation.
CMOST, CMAX, and Prism are trademarks of Thinking Machines Corporation.
C* (r) is a registered trademark of Thinking Machines Corporation.
Paris, *Lisp, and CM Fortran are trademarks of Thinking Machines Corporation.
CMMD, CMSSL, and CMX11 are trademarks of Thinking Machines Corporation.
Scalable Computing (SC) is a trademark of Thinking Machines Corporation.
Thinking Machines (r) is a registered trademark of Thinking Machines Corporation.
CONVEX is a trademark of CONVEX Computer Corporation.
Cray is a registered trademark of Cray Research, Inc.
SPARC and SPARCstation are trademarks of SPARC International, Inc.
Sun, Sun-4, and Sun Workstation are trademarks of Sun Microsystems, Inc.
UNIX is a registered trademark of UNIX System Laboratories, Inc.
The X Window System is a trademark of the Massachusetts Institute of Technology.

Copyright (c) 1993 by Thinking Machines Corporation. All rights reserved.

Thinking Machines Corporation
245 First Street
Cambridge, Massachusetts 02142-1264
(617) 234-1000