CM FORTRAN PROGRAMMING GUIDE
Version 2.1, January 1994
Copyright (c) 1994 Thinking Machines Corporation.


CHAPTER 3: EXECUTION MODELS
****************************

CM Fortran's execution model refers to the way a program makes use
of the hardware.  CM systems support three basic execution models,
plus some variations:

  o  global model, where a single program operates on arrays of data
     spread across all the parallel processors

  o  nodal model, where multiple copies of a program operate
     independently on subsections of the data and communicate in
     message-passing style

  o  global/local model, a combination of data parallel and message
     passing for maximum flexibility in programming

  o  simulator, where a data parallel program executes on a single
     processor (for purposes of testing and debugging)

The compiler always generates both serial and parallel code for its
built-in "two-machine" model of the hardware.  User-supplied switches
determine which system components serve as "the front end" and "the
CM."  These switches control the assembly-level language generated
for parallel code blocks and determine the libraries used for
linking.

    --------------------------------------------------
    NOTE

    The execution models are not object-code-compatible.
    If you compile and link separately, be sure to specify
    the target model on both command lines.
    --------------------------------------------------


3.1  GLOBAL DATA PARALLEL MODELS
--------------------------------

The global models describe a program executed by a single control
processor and a set of system components that serve as parallel
processors.  The parallel processors may be either the nodes or the
vector units of a CM-5, or the nodes of a CM-200 or CM-2.

The global models are the most elegant and convenient for
programming, since the compiler generates code that transparently
handles parallel memory allocation, array layout, and interprocessor
communication.  You can take some control over allocation and layout
by using dynamically allocated arrays and by supplying LAYOUT
directives to the compiler.  You cannot, however, control
communication except by using different language constructions.
Synchronization is always automatic in the global models: the
processors synchronize after every block of parallel computation and
every communication operation.


3.1.1  CM-5 Global SPARC Nodes Model
------------------------------------

In this model, the partition manager serves as the control processor,
and the node microprocessors are the parallel processors.  If the
system has vector units, they serve merely as memory controllers and
do not participate in the processing of parallel data.  Code for
parallel computations is generated in SPARC assembler.

To compile and link for the SPARC nodes model:

      % cmf -cm5 -sparc myfile.fcm

Arrays are allocated in the expected data parallel fashion, as shown
in Figure 7: front-end arrays on the partition manager and CM arrays
across the nodes.  The figure shows a 16-element vector A laid out on
four nodes.  A 4-element vector would be laid out one element per
node.

[ Figure Omitted ]

Figure 7.  SPARC nodes model.


3.1.2  CM-5 Global Vector-Units Model
-------------------------------------

In this model, the partition manager serves as the control processor,
and the vector units are the parallel processors.  The SPARC
microprocessors on the nodes are invisible to the program, although
they assist the vector units with OS services and communication.
Code for parallel computations is generated in DPEAC assembler.
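Under any of the global models, parallelism is expressed entirely
through CM Fortran array operations, and an optional LAYOUT directive
can influence how a CM array is spread across the parallel
processors.  The fragment below is a minimal sketch of this style;
the array names and sizes are illustrative only, and the CMF$ LAYOUT
form shown assumes the directive syntax described in Chapter 12.

      PROGRAM GLOBAL_STYLE
C     CM arrays: spread across the parallel processors.  The LAYOUT
C     directives are optional and shown only for illustration.
      REAL A(1024), B(1024)
CMF$  LAYOUT A(:NEWS)
CMF$  LAYOUT B(:NEWS)
C     Front-end scalar: held by the control processor.
      REAL TOTAL
C     Elementwise parallel computation; no explicit communication.
      A = 1.0
      B = 2.0 * A + 3.0
C     Reduction; the compiler generates the needed communication and
C     returns the result to the control processor.
      TOTAL = SUM(B)
      PRINT *, TOTAL
      END

In general, the same source can be compiled for any of the global
models; only the cmf command line changes.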
To compile and link for the vector-units model:

      % cmf -cm5 -vu [-nopadding] myfile.fcm

Arrays are allocated as shown in Figure 8: front-end arrays on the
partition manager and CM arrays across the vector units.  The figure
shows a 16-element vector A laid out across 16 vector units (4
nodes).

[ Figure Omitted ]

Figure 8.  Vector-units model without vector-length padding.

The significance of the -nopadding switch shown in the command line
is that it permits an array of any size, or a dimension of any
extent, to be spread evenly across the vector units, as in Figure 8.
Without this switch, the system allocates parallel memory in quanta
of eight elements per vector unit, the length of the vector register.
If an array's size is not a multiple of 8 times the partition size,
the unused memory at the high end of one or more axes contains
garbage data, or padding, as shown in Figure 9.

[ Figure Omitted ]

Figure 9.  Vector-units model with vector-length padding.

Padding is not a problem for program correctness, since the system
masks it out of any communication or store operation.  It can,
however, waste system resources and make for undesirable array
layouts, as it does with the 16-element vector shown here.  Since the
system uses block layout, the first two vector units contain all the
data, and the others contain garbage.  On 16 vector units, an array's
size needs to be a multiple of 128 elements (8 x num-procs) to be
stored without padding.  See Chapter 12 for more information on CM
array layouts.


3.1.3  CM-2/200 Slicewise Model
-------------------------------

In this model, the front end serves as the control processor, and the
nodes (the units of the 64-bit floating-point accelerator) are the
parallel processors.  Code for parallel computations is generated in
PEAC assembler.

To compile and link for the slicewise model:

      % cmf -cm200 -slice myfile.fcm

      % cmf -cm2 -slice myfile.fcm

Arrays are allocated as shown in Figure 10: front-end arrays on the
front end and CM arrays across the nodes.  For the slicewise model,
the system allocates parallel memory in quanta of four elements per
node, the length of the FPA vector register.  If an array's size is
not a multiple of 4 times the machine size, the unused memory at the
high end of one or more axes contains garbage data, or padding.  The
-nopadding switch, described in Section 3.1.2, is not supported on
the CM-200 and CM-2.

[ Figure Omitted ]

Figure 10.  Slicewise model with vector-length 4.


3.2  NODE-LEVEL MESSAGE-PASSING MODELS
--------------------------------------

The nodal models describe the independent execution of multiple
copies of a CM Fortran program on the nodes of a CM-5.  For each
copy, the microprocessor in the node serves as the CM Fortran "front
end," and the four vector units serve as its parallel processors.
The programs use the CM message-passing library CMMD.

The nodal models are appropriate for a more do-it-yourself style of
programming than the global models.  The programmer has the
convenience of using CM Fortran features to allocate and compute on
arrays, but uses CMMD library calls to cause the nodal programs to
communicate and synchronize with one another as needed.  CMMD also
provides routines for parallel I/O performed by all the nodes
cooperatively, since CM Fortran in the nodal models supports only
serial I/O from individual nodes acting independently.


3.2.1  CM-5 Nodal Model (Hostless)
----------------------------------

In this model, the node microprocessor serves as the control
processor and its four vector units are the parallel processors of a
CM Fortran program.
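Each copy computes on its own arrays with ordinary CM Fortran syntax
and learns its identity from CMMD.  The fragment below is a rough
sketch of the shape of such a hostless nodal program.  It is a sketch
only: the routine names CMMD_self_address and CMMD_partition_size are
assumptions about the CMMD Fortran binding (consult the CMMD
documentation for the exact interface), and the message-passing calls
themselves are omitted.

      PROGRAM NODAL_STYLE
C     The CMMD header named in the compile recipe below; assumed to
C     declare the CMMD routines referenced here.
      INCLUDE '/usr/include/cm/cmmd_fort.h'
C     A CM (parallel) array belonging to this copy of the program,
C     processed by the node's four vector units.
      REAL MYPART(256)
      INTEGER MYNODE, NNODES
C     Assumed CMMD queries: which copy is this, and how many copies
C     are running in the partition?
      MYNODE = CMMD_self_address()
      NNODES = CMMD_partition_size()
C     Each copy computes on its own share of the data with ordinary
C     CM Fortran array operations.
      MYPART = REAL(MYNODE)
      MYPART = MYPART * 2.0
C     CMMD send/receive and synchronization calls would appear
C     wherever the copies need to exchange data or coordinate.
      PRINT *, MYNODE, NNODES
      END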
Code for parallel computations is generated in DPEAC assembler.
Communication between programs is explicit via CMMD calls.

To compile and link for nodal CM Fortran (hostless) with CMMD message
passing:

      % cmf -cm5 -vu [-nopadding] -node myfile.fcm

      INCLUDE '/usr/include/cm/cmmd_fort.h'

The division of labor and the allocation of arrays among system
components are shown in Figure 11.  CMMD provides a server program,
running on the partition manager, that starts up copies of the CM
Fortran program on each of the nodes.  The partition manager is
invisible thereafter, serving mainly as an I/O controller.  For each
program, the node microprocessor stores and processes front-end
arrays (called serial arrays in CMMD), and the associated four vector
units process CM arrays (called parallel arrays in CMMD).

The programmer decides how to decompose the data among the nodal
programs.  Each program then allocates its share of the data across
its four parallel processors.  Recall the multiple-of-8 rule for
allocating memory on the vector units (Section 3.1.2).  The
-nopadding switch gives you finer control over the allocation of
parallel memory.

[ Figure Omitted ]

Figure 11.  Nodal CM Fortran, hostless model.


3.2.2  Fortran 77 on a Node (Hostless)
--------------------------------------

The nodal model of CM Fortran is clearer if you compare it with the
more familiar nodal Fortran 77, supported by the CM-5 and other
vendors' systems.  In this model, the node microprocessors execute
copies of a Fortran 77 program and communicate via a message-passing
library (CMMD in the case of the CM system).  See Figure 12.

[ Figure Omitted ]

Figure 12.  Nodal Fortran 77, hostless model.

The nodal Fortran 77 program executes entirely on a node
microprocessor.  It does not get the performance benefit of the
vector units, since they can execute only Fortran 90-style array
operations encoded in DPEAC assembler.

Nodal programs written in CM Fortran do get the benefit of vector
processing.  Their array syntax gives the compiler the dependence
information it needs to generate parallel computations on array
elements, which then execute on the vector units.

Nodal CM Fortran is supported only with the -vu switch.  A nodal
-sparc model of CM Fortran would have the programming convenience of
the Fortran 90 array syntax, but there would be no system component
to serve as the parallel processors.  For this reason, CM Fortran
does not support the combination of -sparc -node.


3.2.3  CM-5 Nodal Model (Host/Node)
-----------------------------------

This model is similar to the hostless model of CM Fortran: the node
microprocessor serves as the control processor, and its four vector
units are the parallel processors.  Code for parallel computations is
generated in DPEAC assembler.  Communication between programs is
explicit via CMMD calls.

The difference from the hostless model is that the programmer writes
the host program, instead of letting CMMD provide it.  The host
program cannot be in CM Fortran; it must be in a serial language such
as Fortran 77 or C.

To compile and link for nodal CM Fortran (host/node) with CMMD
message passing, you must specify on the cmf command line the files
that contain the host code and the appropriate way to link them.  Use
the cmf switches -host and -comphost for this purpose.
For example, assuming a Fortran 77 host program:

      % cmf -cm5 -vu -node file1.fcm -comphost f77 -host file2.f

      INCLUDE '/usr/include/cm/cmmd_fort.h'

In this example, the cmf command invokes the f77 compiler for the .f
file and, because of the -comphost switch, links it with Sun FORTRAN
libraries.  Each .f file must be preceded by -host.  The execution
model switches (-vu and -node) govern all the files not preceded by
-host.

The division of labor and the allocation of arrays among system
components are shown in Figure 13.  It is similar to the hostless
model shown in Figure 11, except that the partition manager is
running a user-written program and thus may play a more active role
at run time.

[ Figure Omitted ]

Figure 13.  Nodal CM Fortran, host/node model.


3.3  THE GLOBAL/LOCAL MODEL
---------------------------

The global/local model describes a CM Fortran program that executes
in global data parallel fashion across the entire partition of a
CM-5, but which can call local subroutines that run independently on
the nodes.  A copy of the local routine runs on each microprocessor,
using the four associated vector units as its parallel processors.

Code for parallel computations is generated in DPEAC assembler, in
both the global and local program units.  The global program uses
transparent compiler-generated communication; the local programs use
the CM message-passing library CMMD.

The global/local model combines the convenience of global programming
with the fine-tuned control of the message-passing style.  The global
program allocates memory and distributes CM (parallel) arrays as
usual; the local routines then treat their respective subarrays of
the global array as their own CM arrays, as in the nodal models.  The
global program must be in CM Fortran; the local routines can be in
either CM Fortran or C (to facilitate direct calls to C/DPEAC).
Global/local programming is described more fully in Chapter 13 of
this manual.

To compile and link for the global/local model, you must specify on
the cmf command line the files that contain local code and supply a
prototype file (described in Chapter 13) that defines the interface
between the global and local portions of the program.  For example,
assuming a CM Fortran local subprogram:

      % cmf -cm5 -vu global.fcm -local local.fcm file.proto

      in local.fcm:   INCLUDE '/usr/include/cm/cmmd_fort.h'
      in global.fcm:  INCLUDE '/usr/include/cm/cmgl.h'

A separate -local switch must precede each file that contains local
code.

The division of labor and the allocation of arrays among system
components are similar to the host/node model shown above, with one
crucial difference: front-end arrays can appear either on the
partition manager (in the global program) or on the nodes (in the
local routines).  The global/local library provides the procedure
CMGL_BROADCAST_SERIAL_ARRAY to copy a global front-end array to the
nodes; the local routines can also create local front-end arrays.  CM
arrays are distributed across the vector units of the partition.
Unlike front-end data, CM array data does not move when the program
shifts from the global view to the local view.

The scheme is shown in Figure 14 (assuming no vector-length padding
of the array shown).  Front-end arrays are allocated on the partition
manager in the global program, but explicitly broadcast to the nodes
for use as front-end arrays in the local programs.  CM arrays are
always allocated on the vector units; only the system's view of them
changes when the global program calls a local subprogram.

[ Figure Omitted ]

Figure 14.  Global/local model without vector-length padding.
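The two views can also be contrasted at the source level.  The
fragments below are schematic: the mechanism by which the global
program invokes a local routine (the prototype file and calling
interface) is defined in Chapter 13 and is not shown here, and the
16-element array and fixed 4-element subgrid assume the 4-node layout
used in the figures.

      PROGRAM GLOBAL_VIEW
C     A is one 16-element CM array spread across the partition, so
C     CSHIFT moves data between nodes; element A(1) wraps around to
C     position 16.
      INTEGER A(16), I
      FORALL (I = 1:16) A(I) = I
      A = CSHIFT(A, 1)
      PRINT *, A
      END


C     Local routine (schematic interface; the global/local calling
C     mechanism of Chapter 13 is not shown).  Each copy sees only its
C     own 4-element subgrid, so CSHIFT wraps within the node and no
C     internode data motion occurs.
      SUBROUTINE LOCAL_SHIFT(SUBGRID)
      INTEGER SUBGRID(4)
      SUBGRID = CSHIFT(SUBGRID, 1)
      RETURN
      END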
The global program operates on the entire array, whereas the local
routines operate asynchronously on their respective portions of it.
For example, consider a 16-element vector A, laid out across the
vector units of 4 nodes.  Figure 15 shows the different effects of
calling the circular shift function CSHIFT from the global program
and from the local routine.

[ Figure Omitted ]

Figure 15.  Computing on arrays and subarrays in the global/local
model.


3.4  CM FORTRAN SIMULATOR
-------------------------

In the simulator model, a CM Fortran program executes on a single
Sun-4 processor, which may be a CM-5 partition manager, a CM-2/200
front end, or a stand-alone workstation.  This model is convenient
for developing and debugging programs without tying up parallel
processing resources.

To compile and link for the CM Fortran simulator:

      % cmf -cmsim myfile.fcm

Notice that you do not specify a hardware platform, like -cm5, or the
execution model you eventually intend to use for the program, such as
-vu or -sparc.  The compiler generates code for its "two-machine"
model as usual, but links with a library that causes all operations
to be performed sequentially.  The simulator does not support message
passing.

Since the code is generated for two machines, there is a division of
labor of sorts, but the Sun-4 processor plays both the serial and
parallel roles.  The simulator generates two structures in serial
memory to serve as the stack and heap memory of the parallel
processors, as shown in Figure 16.  Code for parallel computations is
generated in SPARC assembler.

[ Figure Omitted ]

Figure 16.  CM Fortran simulator.

The CM Fortran simulator is intended only as a convenience for
program development.  It necessarily runs slower than comparable
Fortran 77 code on a Sun-4, since it partitions memory as shown and
calls low-level CM run-time functions for simulated "communication."
At the same time, since the compiler output is linked for a one-node
system, it cannot give the programmer an idea of how fast the same
code would run when linked for an n-node system.