ACML - Release Notes - version 4.3.0
------------------------------------

ACML Contents
-------------
ACML - the AMD Core Math Library - is a tuned math library designed for
high performance on AMD64 machines, including Opteron(TM) and
Athlon(TM) 64, and includes both 32-bit and 64-bit library versions.
Different versions are available for Linux, Windows and Solaris
operating systems.

A full suite of Basic Linear Algebra Subprograms (BLAS) routines is
provided, and key BLAS routines have been tuned for high performance on
AMD processors in both 32-bit and 64-bit modes.

A full suite of Linear Algebra (LAPACK) routines is provided and takes
advantage of the highly tuned BLAS for good performance.

A comprehensive suite of Fast Fourier Transforms (FFTs) is provided,
again with highly tuned code.

A comprehensive suite of Random Number Generators (RNGs) is provided,
again with highly tuned code.

New Features
------------

New features of release 4.3.0 of ACML since version 4.2.0:

(o) Performance of DGEMM and SGEMM has been further improved. This
    performance improvement carries through to other Level 3 BLAS and
    LAPACK routines that call DGEMM and SGEMM. This improvement is
    particularly significant when ACML runs on 64-bit Intel hardware.

(o) An experimental "fast memory allocation" scheme has been
    introduced which may allow you to improve performance of the
    matrix-matrix multiply routine DGEMM and any other routines (such
    as LAPACK) which make heavy use of DGEMM. See section 2.11 of the
    ACML user guide for full details.

(o) A bug which affected certain problem sizes in DGEMM and ZGEMM has
    been fixed.

(o) Level 1 BLAS routines have been re-tuned for newer hardware to
    make better use of available cache memory and hence improve
    performance. Routines affected include xDOT, xCOPY, xAXPY, and
    xSCAL routines.
(o) Assembly language kernels used by the real-complex FFT routines
    csfft, dzfft, scfft and zdfft have been re-tuned for newer
    hardware, leading to performance improvements of between 20% and
    40% for these routines.

(o) The 3D FFT expert driver routines cfft3dy and zfft3dy have been
    improved to make use of a faster code path when possible (i.e.
    when the arguments to the routines are such that memory use is
    contiguous).

(o) An open64 compiler build of ACML has been introduced.

(o) The documentation for the Fast Fourier Transform routine ZFFT2D
    understated the required length of the communication array COMM by
    100 elements. Programs that supplied the old documented amount may
    have crashed or returned incorrect results due to memory
    overwrites. Users of ZFFT2D should check carefully against the
    latest documentation and increase the size of COMM if necessary.
    Documentation for other FFT routines (including CFFT2D) was
    correct.

New features of release 4.2.0 of ACML since version 4.1.0:

(o) DGEMM has been retuned, in particular for the benefit of smaller
    problem sizes in 64-bit code. There is also improved performance
    for other double precision level 3 BLAS and some double precision
    LAPACK routines.

(o) Various of the uniform random number (base) generators have been
    tuned for better performance on 32-bit platforms. Work has been
    done on the Mersenne Twister, the Wichmann-Hill and the NAG basic
    generators, all of which help improve the higher level statistical
    distribution functions which call them.

(o) Work has been done to significantly improve the performance of the
    3D complex FFT routines in ACML. At the same time, the workspace
    memory requirements have been substantially reduced, making
    routine cfft3d better able to deal with large problem sizes. See
    the documentation of cfft3d in the ACML user guide for details of
    new workspace requirements.

New features of release 4.1.0 of ACML since version 4.0.1 include:

(o) Further improvements in performance of double and single precision
    real BLAS and LAPACK routines on AMD Family 10h (Barcelona)
    processors in 64-bit libraries.

(o) ZGEMM has been retuned for higher performance on both AMD K8 and
    AMD Family 10h (Barcelona) processors in 64-bit libraries. This
    also improves the performance of many other double precision
    complex BLAS and LAPACK routines.

(o) The ACML uniform random number generators, when asked to produce
    numbers in the range 0.0 to 1.0, occasionally produced numbers
    inconsistent with the documented expected return values. In
    particular, the expected occurrence or non-occurrence of the
    end-of-range values 0.0 and 1.0 was not consistent with
    documentation. At ACML 4.1.0 code and documentation have been
    harmonized, and the base generators should all return values in
    the semi-open interval (0.0,1.0].

(o) Support for Absoft Fortran compilers under Linux via the gfortran
    builds of ACML has been documented (see the ACML User Guide).

(o) Support for the older GNU g77 compiler has been removed.

New features of release 4.0.1 of ACML since version 4.0.0 include:

(o) A memory overwrite bug in DGEMM and SGEMM was introduced at ACML
    3.7.0 and occasionally (but rarely) led to incorrect results. This
    bug, which only affected 64-bit versions of ACML, has been fixed.

New features of release 4.0.0 of ACML since version 3.7.0 include:

(o) Support for AMD Family 10h (Barcelona) processors. Double and
    single precision BLAS and LAPACK routines have been retuned for
    optimum performance on Barcelona.

(o) ACML has been upgraded to comply with the latest version of
    standard LAPACK routines (version 3.1.1). All routines new at
    LAPACK 3.1.1 have been included, and all LAPACK 3.1.1 bug fixes
    and enhancements have been applied.

(o) A bug in the ACML version of LAPACK routine DGEBRD which was
    introduced at ACML 3.6.0, and which affected certain problem
    sizes, has been fixed.
(o) ACML 4.0.0 is only available in 64-bit implementations, with PGI,
    PathScale, gfortran and Intel compilers (under Linux) and with PGI
    and Intel compilers (under Microsoft Windows).

New features of release 3.7.0 of ACML since version 3.6.0 include:

(o) 3.7.0 features pre-launch AMD Family 10h support. Only certain
    ACML builds are available. The DGEMM and CFFT 1D routines have
    been updated for optimum performance on the upcoming Barcelona
    processors.

New features of release 3.6.2 of ACML since version 3.6.1 include:

(o) 3.6.2 is only available for the Intel FORTRAN builds. 3.6.2 is
    compiled with an Intel FORTRAN flag that allows correct
    propagation of NaNs in situations where the FORTRAN source code
    compares a floating point value to zero. For the Windows builds,
    the flag is /Qprec, and for Linux it is -mp1.

New features of release 3.6.1 of ACML since version 3.6.0 include:

(o) 3.6.1 is only available for the Linux GFORTRAN builds. The new
    version uses updated compilers for both scalar and OpenMP builds.
    The scalar build uses the released 4.1.2 GCC/GFORTRAN. The OpenMP
    build uses the 2007-03-16 4.2 GFORTRAN build available from the
    www.gfortran.org download pages. The OpenMP build fixes the
    segfault issue that happens when using the 3.6.0 GFORTRAN OpenMP
    library with newer 4.2 compilers.

New features of release 3.6.0 of ACML since version 3.5.0 include:

(o) Many LAPACK routines have been newly parallelized with OpenMP to
    take better advantage of multi-processor or multi-core hardware.
    The lists below mention double precision and double complex
    routines only. All equivalent single precision and single complex
    routines are also parallelized except where mentioned.

    Newly parallelized routines at this release include:

      Banded LU factorization and solver routines
        xGBTRF xGBTRS
      Packed Cholesky solvers
        DPPTRS ZPPTRS
      Banded Cholesky solvers
        DPBTRS ZPBTRS
      Iterative refinement and error bounds for various factorizations
      and storage formats:
        DGERFS ZGERFS DGBRFS ZGBRFS DGTRFS ZGTRFS
        DPORFS ZPORFS DPPRFS ZPPRFS DPBRFS ZPBRFS
        DPTRFS ZPTRFS DSYRFS ZHERFS ZSYRFS DSPRFS
        ZHPRFS ZSPRFS DTRRFS ZTRRFS DTPTRS DTPRFS
        ZTPTRS ZTPRFS DTBTRS DTBRFS ZTBTRS ZTBRFS
      Various form Q/apply Q routines
        DORGQR ZUNGQR DORGTR ZUNGTR DOPGTR ZUPGTR
      Reduction to tri-diagonal form
        DSYTRD ZHETRD (but not SSYTRD)
      Reduction from banded to tri-diagonal form
        DSBTRD ZHBTRD
      Symmetric eigenvalues (Bi-section algorithm)
        DSTEBZ
      Reduction to bi-diagonal form
        DGEBRD ZGEBRD
      Unsymmetric eigenproblems
        DHSEIN ZHSEIN
      Generalized eigenproblems, packed storage
        DSPGV DSPGVX DSPGVD ZHPGV ZHPGVX ZHPGVD

    Note that the following routines were parallelized at earlier
    releases of ACML:

      LU factorization and solver
        DGETRF DGETRS ZGETRF ZGETRS
      Cholesky factorization and solver
        DPOTRF DPOTRS ZPOTRF ZPOTRS
      QR Factorization
        DGEQRF ZGEQRF
      Symmetric eigenproblem (QR algorithm)
        DSTEQR ZSTEQR
      SVD (QR algorithm)
        DBDSQR ZBDSQR

(o) Under 32-bit Windows operating systems, a new PGI Visual Fortran
    compiler version of ACML now replaces the old Unix-like 32-bit PGI
    compiler build. This brings the 32-bit Windows PGI version of ACML
    into line with the 64-bit Windows PGI build.

(o) ACML now supports use of the Intel ifort compiler under both Linux
    and Windows.

(o) OpenMP versions of ACML now include "_mp" as part of the library
    name (e.g. libacml_mp.so instead of libacml.so) to make it easier
    to link to the appropriate version and avoid internal soname
    clashes.

(o) A bug in the parameter checking for FFT routines cfft2dx and
    zfft2dx which returned an incorrect error code has been fixed.
    Also, an error in the documented workspace requirements for
    cfft2dx and zfft2dx has been fixed.

New features of release 3.5.0 of ACML since version 3.1.0 include:

(o) Many level 1 BLAS routines are now able to take advantage of SSE3
    instructions for extra performance on processors equipped with
    them (including recent revision AMD Opteron and Athlon64
    processors). Routines affected include xDOT, xCOPY, xAXPY, and
    xSCAL routines.

(o) Two new expert interfaces for 3D FFTs have been added at this
    release. These have the names CFFT3DY and ZFFT3DY and are in
    addition to the previously available expert interfaces CFFT3DX and
    ZFFT3DX. The new interfaces allow for greater flexibility in the
    storage of input and output data within supplied arrays through
    the setting of increment arguments. In particular, data held in
    submatrices of a notional 3-dimensional array can be transformed.
    The interfaces CFFT3DY and ZFFT3DY do not contain the LTRANS
    parameter available in CFFT3DX and ZFFT3DX since the restrictions
    on increments required to return un-transposed data would be
    equivalent to using the CFFT3DX and ZFFT3DX routines.

    In summary, if the 3D data to be transformed (and the subsequent
    transformed data) are stored contiguously then the routines
    CFFT3DX or ZFFT3DX can be called with the possibility of not
    performing a final transpose; on the other hand, for data
    (original or transformed) not stored contiguously, the routines
    CFFT3DY or ZFFT3DY should be called with increment arguments
    describing the data storage arrangement.

(o) ACML now comes with a small set of "performance example" programs
    designed to demonstrate how fast a selection of ACML routines run
    on your machine; see the ACML User Guide for details.

(o) Versions of ACML compiled with gfortran and using OpenMP are now
    available for use on SMP machines.

(o) All versions of ACML built with PGI compilers are now compiled
    with the -Mlarge_arrays flag to allow access to large data arrays.
    The special large_arrays variants are therefore no longer needed
    and have been withdrawn.
(o) Versions of ACML for older 32-bit processors without SSE/SSE2
    instructions are no longer produced.

(o) A bug in LAPACK auxiliary routines SLAED6 and DLAED6 which caused
    LAPACK eigensystem routines CSTEDC, DSTEDC, SSTEDC and ZSTEDC to
    return incorrect results has been fixed.

(o) A bug which caused FFT routine ZFFT1D to return incorrect results
    when working on problem sizes which were very large powers of 7 or
    11 has been fixed.

(o) Calls of FFT routines in the degenerate cases where all problem
    dimensions are equal to 1 were omitting to multiply by the scaling
    factor. This has been corrected.

(o) A bug which caused the random number routines DRANDDISCRETEUNIFORM
    and SRANDDISCRETEUNIFORM sometimes to return integer values in the
    range [A,B+1] instead of [A,B] has been fixed.

(o) Errors in documentation of the use of the array COMM in some FFT
    routines have been fixed. Errors in documentation of the size of
    the array STATE in some random number routines have been fixed.

New features of release 3.1.0 of ACML since version 3.0.0 include:

(o) Most of the random number generators introduced into ACML at
    version 3.0.0 have been tuned in order to significantly improve
    performance of all 64-bit versions of ACML.

(o) Variants of ACML using 64-bit integer arguments

New features of release 3.0.0 of ACML since version 2.7.0 include:

(o) A new suite of random number generator routines, including a
    variety of base-level uniform generators as well as higher level
    statistical distribution routines. At this release of ACML, these
    routines have not been highly tuned for performance, but tuning
    will be performed for future ACML versions.

(o) ACML now supports the GNU gfortran compiler, important for users
    on operating systems where the older g77 compiler is no longer
    installed. Note that initial tests indicate that the gfortran
    version of ACML is not always as efficient as the g77 version.
    At the moment, if you have the option to choose either of the GNU
    compiler versions, and you want the highest performance, we
    recommend the g77 version. Naturally, this situation may change in
    future versions of ACML as the gfortran compiler matures.

New features of release 2.7.0 of ACML since version 2.6.0 include:

(o) Most ACML FFT routines now allow the user to generate an optimal
    plan to get best performance for a particular problem size,
    benefiting repeated FFT computations with the same size. Two
    expert driver FFT routines (namely cfft3dx and zfft3dx) needed
    slightly revised interfaces to allow this improvement. For more
    information see the FFT documentation in the ACML User Guide.

(o) A bug which caused program crashes when calling some ACML BLAS
    routines with extremely large data arrays has been fixed.

(o) 64-bit versions of ACML now include an expanded set of fast/vector
    math library functions.

(o) A 32-bit Windows ACML implementation using the __cdecl calling
    convention now exists, to complement the __stdcall version.

(o) 64-bit PGI versions of ACML now include variants compiled with the
    -Mlarge_arrays flag, which allows them to be called with larger
    data arrays than previously.

New features of release 2.6.0 of ACML since version 2.5.3 include:

(o) ACML has been tested and found to perform well on new AMD dual
    core chips, which allow users to take advantage of superior
    performance of OpenMP ACML versions even on a single processor
    machine.

(o) Further tuning improvements to some level 3 BLAS in 64-bit ACML
    versions, with improvements propagated to LAPACK routines.

(o) Under Linux, variant compiler versions of ACML can now be
    downloaded and installed independently.

New features of release 2.5.3 of ACML since version 2.1.0 include:

(o) A routine ILAENVSET, which allows the user some control of block
    sizes used by ACML LAPACK routines, has been added. See ACML
    documentation for details.
(o) New expert drivers for complex-complex FFT routines which allow
    greater control over data storage and scaling (names of expert
    driver routines end in the letter x).

(o) The PGI SMP 64-bit and 32-bit variants of ACML now have
    SMP-parallelized code for 2D, 3D and multiple 1D FFTs, in all
    supported data formats. Performance of these routines is now
    vastly improved when run on a multi-processor machine.

(o) SMP support has been added to various key level 2 and level 3 BLAS
    routines, thus improving performance of all other ACML routines
    that depend on them (such as LAPACK routines), when run on a
    multi-processor SMP machine.

(o) 64-bit versions of ACML now come with a companion library of
    fast/vector math library functions.

(o) A bug in FFT routine ZFFT1MX, where the user-supplied scale factor
    was not applied to a transform of length N=1, has been fixed.

(o) A bug in OpenMP versions of ACML which affected level 3 BLAS
    routines xGEMM, xTRSM, xTRMM, xSYMM and xHEMM, and in some
    circumstances could cause a segmentation violation, has been
    fixed.

New features added to release 2.1.0 of ACML since version 2.0.1 include:

(o) The performance of DGEMM in the 64-bit libraries for small
    matrices has been significantly increased.

(o) Further enhancements have been made to the performance of FFT
    routines in both the 64-bit and 32-bit versions of ACML. In the
    64-bit version, FFTs on sequences having a length containing prime
    factors of 7, 11 or 13 are significantly faster; for sequence
    lengths containing prime factors greater than 13, the FFT routines
    also show performance gains. In the 32-bit ACML version, single
    precision FFT routines on sequences of lengths with prime factors
    2 and/or 3 have significantly improved in performance, while those
    with lengths containing larger prime factors also show performance
    gains.

(o) Routines that were not safe for use by multi-threaded programs in
    some earlier releases of ACML have now been made thread safe.
    If you use ACML in a multi-threaded environment you are
    recommended not to use versions of ACML earlier than 2.1.0. (Note
    that this does not apply to the PGI-compiled SMP versions of ACML
    which have always been safe for use in OpenMP programs.)

New features added to release 2.0.1 of ACML since version 1.5 include:

(o) A revised blocking mechanism in 64-bit versions of ACML kernel
    BLAS routines has been implemented to improve performance. Peak
    speeds of routines like DGEMM, DSYMM, DTRMM and their single
    precision equivalents show significant improvements over earlier
    versions of ACML. These improvements naturally carry over to
    LAPACK routines which make extensive use of such BLAS.

(o) More highly tuned kernels have been added to 64-bit versions of
    ACML FFT routines. At earlier versions of ACML, highly tuned
    kernels only applied to FFT problem sizes which were a power of 2.
    Now problem sizes which have factors of 3, 5, 7, 11 and 13 have
    been tuned, leading to performance enhancements of more than a
    hundred percent in some cases.

(o) For user convenience, the data output from the FFT routines
    ZDFFT/CSFFT and ZDFFTM/CSFFTM has been rescaled by a factor of 2.
    This ensures that the result of transforming real data using
    DZFFT/SCFFT and then transforming back to real data using
    ZDFFT/CSFFT has the same scaling as the original data. See the
    example programs dzfft_example.f and dzfft_c_example.c on how to
    recover original real data from these transforms.

New features added to release 1.5 of ACML include:

(o) 28 key LAPACK routines, including linear equation solvers,
    singular value decomposition and symmetric eigensolvers, have been
    upgraded to obtain significant speedup running on multiprocessor
    (SMP) machines when using PGI's OpenMP compliant compilers.

(o) Significant performance improvements have been made to a set of
    kernel BLAS routines in the 32-bit versions of ACML which were not
    highly tuned in ACML 1.0.
    These improvements have beneficial effects for a large number of
    other 32-bit routines which depend on them.

(o) ACML now contains Level 1 BLAS routines for operations involving
    sparse vectors.

(o) All ACML routines now come with C as well as FORTRAN interfaces.
    The new interfaces allow the C programmer to pass arguments in a
    more natural way (for example, pass input scalar arguments by
    value instead of by reference), and dispense with the need for
    user-supplied workspace arguments.

(o) 32-bit ACML libraries now come in versions that will run on older
    processors which do not have the complete set of Streaming SIMD
    Extension (SSE) instructions.

Applicability
-------------
Different ACML versions are provided for use with the GNU compilers
gfortran/gcc, with the PGI compilers pgf77/pgf90/pgcc, with the open64
compilers openf95/opencc, with the Intel ifort compiler, with the NAG
f95 Fortran compiler, and with the Sun f95/cc compilers.

Note that from ACML 3.5.0, versions of ACML specific to older 32-bit
processors which do not support SSE/SSE2 (Streaming SIMD Extensions)
instructions are no longer produced. All versions of ACML described
below are for machines which do have SSE and SSE2 instructions. This
means they are suitable for use on all AMD 64-bit chips and newer
32-bit chips. Users wishing to use ACML on older machines without
SSE/SSE2 support should continue to use older (pre-3.5.0) versions of
ACML.

Installation material
---------------------
Under Linux/Solaris, ACML is by default installed in the directory
/opt/acml4.3.0. Under 32-bit and 64-bit Windows, ACML is by default
installed in the directory C:\AMD\acml4.3.0.
In all cases the following subdirectories are used for 64-bit builds:

  acml4.3.0/Doc
  acml4.3.0/util

plus, for the Linux PGI 64-bit compiler version of ACML:
  acml4.3.0/pgi64
  acml4.3.0/pgi64_mp

plus, for the Windows PGI 64-bit compiler version of ACML:
  acml4.3.0/win64
  acml4.3.0/win64_mp

plus, for the GNU gfortran 64-bit compiler version of ACML:
  acml4.3.0/gfortran64
  acml4.3.0/gfortran64_mp

plus, for the open64 64-bit compiler version of ACML:
  acml4.3.0/open64_64
  acml4.3.0/open64_64_mp

plus, for the Linux or Windows Intel 64-bit compiler version of ACML:
  acml4.3.0/ifort64
  acml4.3.0/ifort64_mp

plus, for the Sun 64-bit compiler version of ACML:
  acml4.3.0/sun64
  acml4.3.0/sun64_mp

Any 32-bit ACML builds will appear in similarly named directories, for
example
  acml4.3.0/gfortran32
  acml4.3.0/gfortran32_mp

Any or all of the different compiler variants of ACML can coexist in
the same install location.

ACML documentation is available in html, pdf, info and plain text
format in the following files:
  Doc/html/index.html
  Doc/acml.pdf
  Doc/acml.info
  Doc/acml.txt

Under Linux, the installation directories contain both static and
shared versions of ACML. For scalar versions of ACML, the static and
shared libraries will be named libacml.a and libacml.so respectively.
For multi-threaded versions of ACML, the static and shared libraries
will be named libacml_mp.a and libacml_mp.so respectively.

Under Windows, the installation directories contain both static and DLL
versions of ACML. For scalar versions of ACML, the static library will
be named libacml.lib; the DLL library will be named libacml_dll.dll,
and have an associated import library named libacml_dll.lib. For
multi-threaded versions of ACML, the static library will be named
libacml_mp.lib; the DLL library will be named libacml_mp_dll.dll, and
have an associated import library named libacml_mp_dll.lib.

In all cases, you should be careful to link to the library appropriate
to your hardware and operating system.
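
As an illustration only, the library naming rules described above for
Linux and Windows can be collected into a small lookup. The following
is a minimal Python sketch and is not part of ACML; the function name
acml_library_names and the labels "linux" and "windows" are
hypothetical and chosen for this example. Only the file names
themselves come from these notes.

```python
def acml_library_names(os_name, multithreaded):
    """Return the ACML library file names listed in these release notes.

    os_name: "linux" or "windows" (hypothetical labels for this sketch).
    multithreaded: True for the OpenMP ("_mp") build, False for scalar.
    """
    # Multi-threaded builds carry "_mp" in the library name.
    stem = "libacml_mp" if multithreaded else "libacml"
    if os_name == "linux":
        return {"static": stem + ".a", "shared": stem + ".so"}
    if os_name == "windows":
        return {
            "static": stem + ".lib",
            "dll": stem + "_dll.dll",
            # Import library that accompanies the DLL.
            "import": stem + "_dll.lib",
        }
    raise ValueError("unsupported os_name: %r" % (os_name,))
```

For example, acml_library_names("windows", True) lists the static
library, the DLL and the import library for the multi-threaded Windows
build. Which installation subdirectory holds these files depends on
the compiler variant installed.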
If you link to a library that uses SSE or SSE2 instructions when your
processor does not support them, you may get "illegal instruction" or
other errors when you run your program.

The ACML installation directory "util" contains a program named
cpuid.exe which can tell you whether or not your processor supports SSE
or SSE2 instructions. In addition, under Linux, you may be able to
determine whether or not your processor supports SSE or SSE2
instructions by examining the special file /proc/cpuinfo. In the
section named "flags", look for the appearance of "sse" and "sse2"; if
the flags don't appear, your processor doesn't have them.

Using the ACML Libraries
------------------------
For detailed information on how to link with the libraries, see
documentation (particularly the ACML User Guide, acml.pdf) in the
acml4.3.0/Doc directory.

To verify the exact version of ACML, and which compiler versions and
flags were used to build it, you can call the routine acmlinfo().

For the Windows versions of ACML, see also the batch scripts
acmlexample.bat and acmlallexamples.bat in the examples directories.

Required runtime libraries under Microsoft Windows
--------------------------------------------------
ACML versions 4.2.0 and later link with the standard runtime library
provided by the Microsoft Visual Studio 2008 compilers. The machine
ACML runs on must therefore either have VS2K8 (or the Windows SDK for
Windows Server 2008) installed, or have the runtime libraries installed
separately; they can be downloaded from the appropriate Microsoft
platform links provided below:

  Visual Studio 2K8 Redist: x86 x64

Recommendation for Windows Vista users with AMD's Family 0x10 processors
------------------------------------------------------------------------
(Barcelona, Shanghai cores)
---------------------------
A performance loss has been seen in CPU-bound single-threaded
applications on Windows Vista.
An unfavorable interaction between Vista's threading policy and the
ability of AMD's Family 0x10 processors to throttle clock speeds on a
core-by-core basis results in degraded performance. Single-threaded
applications on Vista tend to rotate around all available cores before
an individual core fully powers up from a throttled state. Linux and
XP builds of Windows have not shown this behavior, and fully-threaded
applications where the number of threads is equal to or greater than
the number of available cores will keep all the cores busy at their
unthrottled state.

Several solutions exist for users who may be affected by this
interaction. Users may pin their threads to individual cores by
setting thread affinity, which stops the threads from wandering
between cores. This is the most power efficient option as unused cores
will stay throttled; it can be achieved either programmatically or
through tools such as Process Explorer. An easier solution is to go
into the 'Power Options' pane of the Vista control panel, and change
the machine from 'Balanced' mode to 'High Performance' mode; this
stops Vista from throttling the cores. Lastly, in BIOS, the user can
turn off 'Cool'n'Quiet' to stop the processor itself from throttling.

Bug Reports
-----------
Bugs should be reported to tech.support@amd.com with the string
"ACML-Support" in the subject line.