ATLAS Changelog

New in ATLAS 3.11.36 Dev (Aug 16, 2015)

  • Fixed stack overflow in extract
  • Extended extract's integer ops to make generator writing more concise:
  • Added = to @iif
  • Added comparisons and bit-level operations to @iexp
  • Extended atlas_simd.h:
  • Support vsplat (bcast from reg) for systems where bcast is slow
  • Fixed omission of vvrsum4 for single precision VSX
  • Extended ammm generator to support vld/vsplat as well as bcast
  • Added -T 1 to gmmsearch, makes it test all legal kerns -> res/FAIL.OUT
  • Found & fixed an error in k-vec code generator
  • Fixed failure to negate new files in ArchNew & made gmmsearch use archdefs

New in ATLAS 3.11.35 Dev (Aug 16, 2015)

  • Added basic configure support for Linux/Power8
  • Addition of atlas_simd.h to provide SIMD support that is independent of architecture and vector length.
  • gnuvec, VSX, ARM64 (Advanced SIMD), AVX2, AVX, SSE3, SSE2, SSE1
  • Complete rewrite of gemm code generator to target atlas_simd.h.
  • Supports vectorizing K dim as well as M
  • No support for explicit K unrolling yet (required for old x86)
  • uammsearch.c completely broken until finished evolving code gens
  • Addition of atlas_cplxsimd.h to provide complex support for L1/L2
  • axpy/dot written as test cases, but not tuned or put in archdefs
  • Partial rewrite of src/blas/ammm for higher abstraction & maintainability
  • Fixed bug in ammm kernel generation where different kerns had the same filename, resulting in kern collisions, and thus errors.
  • Add ATL_ammm_syrk to provide 15-3% (small-asymp) serial Cholesky speedup
  • iFKO added to tarfile, but not yet hooked up to ATLAS install
  • Got rid of some race conditions in timers
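
The idea behind a vector-length-independent SIMD layer, as described for atlas_simd.h above, can be sketched roughly as follows. This is only an illustration built on the GCC/Clang vector extension (the "gnuvec" style target); the names ex_vec, EX_VLEN and ex_dot are hypothetical and are not the atlas_simd.h interface. The loop vectorizes along the K (inner-product) dimension and assumes n is a multiple of EX_VLEN.

```c
#include <string.h>

/* Hypothetical vector length (doubles per vector); change it and the
 * code below is unchanged, which is the point of the abstraction. */
#define EX_VLEN 2
typedef double ex_vec __attribute__((vector_size(EX_VLEN * sizeof(double))));

/* Dot product vectorized along K.  Assumes n % EX_VLEN == 0.
 * Not an ATLAS routine; illustration only. */
double ex_dot(const double *x, const double *y, int n)
{
   ex_vec acc, vx, vy;
   double sum = 0.0;
   int i;

   memset(&acc, 0, sizeof(acc));           /* zero the accumulator */
   for (i = 0; i < n; i += EX_VLEN)
   {
      memcpy(&vx, x + i, sizeof(vx));      /* unaligned vector load */
      memcpy(&vy, y + i, sizeof(vy));
      acc += vx * vy;                      /* elementwise multiply-add */
   }
   for (i = 0; i < EX_VLEN; i++)           /* horizontal sum of the lanes */
      sum += acc[i];
   return sum;
}
```

Because the kernel only refers to EX_VLEN and the vector type, retargeting to a wider vector would, in this sketch, be a one-line change.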

New in ATLAS 3.11.34 Dev (Apr 27, 2015)

  • Made it so complex amm kerns are tuned separately from real
  • Added tuning of square block factors for amm-based L3BLAS
  • Added ability for mmtime_pt to run serially, and to time trsmK

New in ATLAS 3.11.33 Dev (Apr 14, 2015)

  • Fixed bug tracker #246: error in SetAllBitsBV (fix by Dave Nuechterlein)
  • Fix for bug tracker #247 (missing symbols for shared object libs)
  • Added native ARM64 AtomicCnt functions based on Dave Nuechterlein's patch
  • Moved Rakib's prototype LU from threads/cbc to threads/cbc2d
  • Added ATL_[I,S,D,C,Z]SIZE to provide sizeof info in pure constant form, so it can be used in #if and other like cpp statements
  • Fixed bug in tgetf2 that caused the copy-code to be called all the time
  • Added new cache-based communication routines in src/threads/cbc - Read atlas_cbc.h for rundown of routines
  • Wrote amm-based TRSM (Left,Lower), but is not called for serial (not enough speedup currently to justify complexity)
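
The ATL_[I,S,D,C,Z]SIZE entry above exists because the C preprocessor cannot evaluate sizeof inside #if. A minimal sketch of the difference follows. The value given to ATL_DSIZE here is a locally defined stand-in (assumed to be 8, as on typical platforms), and the EX_ and ex_ names are hypothetical.

```c
#include <stddef.h>

/* Stand-in definition with an assumed value; a real build would get
 * this constant from ATLAS rather than define it here. */
#ifndef ATL_DSIZE
   #define ATL_DSIZE 8
#endif

/* "#if sizeof(double) == 8" is rejected by the preprocessor, but a
 * plain integer constant works in any cpp conditional. */
#if ATL_DSIZE == 8
   #define EX_DBL_PER_64B 8                 /* doubles per 64-byte line */
#else
   #define EX_DBL_PER_64B (64 / ATL_DSIZE)
#endif

/* Hypothetical helper returning the compile-time choice made above. */
int ex_doubles_per_line(void)
{
   return EX_DBL_PER_64B;
}

/* Sanity check that the stand-in agrees with the compiler. */
int ex_size_matches(void)
{
   return (size_t)ATL_DSIZE == sizeof(double);
}
```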

New in ATLAS 3.11.32 Dev (Feb 27, 2015)

  • Fixed bug in code generator where sometimes the pB ptr was not incremented after the first K iteration peel
  • Removed block-major gemm from ATLAS
  • Fixed error in f77getri wrapper
  • Changed ATLAS to default to using walltime
  • Added support for passing operand movement macros to amm kernel compiles
  • Fixed bugs in [u]ammsearch where ldc passed as kb rather than mb.
  • Added new square NBs to always try in ammm search:
  • 32 & 64 added because LAPACK defaults to them
  • 24 added for small symmetric operations
  • Added src/testing ATL_cmpmatBV, print1dBV, print2dBV
  • Sped up Corei2 ammm kernels, particularly single prec & double K-clean
  • Added special case for N=K=4 in real ammm for recursive LU: ATL_rk4n4.c
  • -> Now have atlas_simd.h for simple typeless SIMD; will extend as needed, and need to get rid of duplicate headers eventually
  • New Opteron/K10h8 ammm kernels (based on old block-major kernel)
  • Fixed bug in emit_*amm's GenMakefile, where kmajor rank-K kernels were being compiled with -DKB set to values that weren't a multiple of VLEN
  • Improved single precision Corei1 ammm kernel
  • Improved double precision Core2 ammm kernel based on CASES/ATL_dmm4x2x128_sse2.c
  • install_uamm now installs routines to copy between block-major storage and row/column-major storage in both directions
  • Added am2[rm,cm] copy routines for A/B in atlas_mmg.base
  • Added src/threads/cbc for use in cbc-LU
  • Removed blank line from samcases.idx to avoid ammsearch errors
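
The GenMakefile fix above concerns k-major kernels being built with a KB that was not a multiple of VLEN. A kernel that vectorizes along K consumes VLEN consecutive K elements per vector, so such a KB would leave a partial vector. A small sketch of the legality check and of rounding up to the next legal value is given below; both helpers are hypothetical and are not part of the ATLAS search code.

```c
/* Hypothetical helper: is kb usable by a kernel vectorized along K
 * with vector length vlen? */
int ex_kb_is_legal(int kb, int vlen)
{
   return vlen > 0 && kb > 0 && (kb % vlen) == 0;
}

/* Hypothetical helper: smallest multiple of vlen that is >= kb.
 * Assumes kb > 0 and vlen > 0. */
int ex_round_up_kb(int kb, int vlen)
{
   return ((kb + vlen - 1) / vlen) * vlen;
}
```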

New in ATLAS 3.11.31 Dev (Nov 10, 2014)

  • Added basic support for ARM64:
  • Pre-production hardware provided by Applied Micro (www.apm.com)
  • Extensive patches handling almost all ARM64 functionality provided by Dave Nuechterlein of apm, including assembly kernels
  • Fixed bug in ammsearch where generated code often used wrong/slow kerns
  • Fixed bug in 'Right' case of tsymm, where recursion called 'Left'
  • Fixed bug where threadpool only starts a handful of cores when first call is made with small problem

New in ATLAS 3.11.30 Dev (Aug 26, 2014)

  • New timing/ directory, in BLDdir/timing do:
  • in BLDdir/timing issue "make all" to build scripts
  • ./tvgenmf[lst,rng].sh wt no args gives help
  • The following will time LLt,LU,QR and gemm for the 3 problem sizes: ./tvgenmf_lst.sh "3 1000 2000 4000" 5 -P t -b "gemm"
  • Addition of timing manipulation tools in bin/tvec*:
  • manpages for them in ATLAS/man/tvec
  • build with "make tvec_all" in BLDdir/bin
  • New timers xl3time_[ab,sb] available in BLDdir/bin
  • Made it so latime, l3time & l3blastst stop manual cache flushing when matrix flushes itself due to size
  • Add fflush calls after prints in latime & l3time
  • Attempted to apply IBM patches for new ppc64le:
  • https://sourceforge.net/p/math-atlas/patches/66/
  • https://sourceforge.net/p/math-atlas/patches/65/
  • Hacked in a workaround and/or fixed:
  • https://sourceforge.net/p/math-atlas/bugs/238/
  • https://sourceforge.net/p/math-atlas/bugs/237/

New in ATLAS 3.11.29 Dev (Aug 8, 2014)

  • Added option to have threadpool that is always polling, and where the master process is assigned affinity 0. Should provide best perf when ATLAS handles all threading (but will interfere with other threads).
  • Presently default, much faster than using mutex or cond vars.
  • should work on Windows (untested)
  • Force turnoff with -D c -DATL_TP_FULLPOLL=0
  • Changed atlas_taffinity.h to only have #defines, so it is safe to include in any/every file.
  • Added ATL_setmyaffinity as standalone file, and made it and ATL_thread_start handle all #includes formerly done in atlas_taffinity.h
  • Changed PCA xover in ATL_getrfC to work better wt tpool
  • Got OpenMP option working again, use: -Si omp 1 -Fa alg -fopenmp -> Performance is horrific compared to pthreads
  • Removed launch & join option, and quite a bit of associated code
  • Changed xperctvecs so it can take a scalar divisor as well as vector

New in ATLAS 3.10.2 (Jul 11, 2014)

  • Fixed all errataed bugs:
  • Failure to init workspace can cause NaNs in SYRK
  • Complex row-major Q-type factorizations produce bad TAU
  • Failure to cast causes integer overflow on 64-bit platforms
  • Missing IBM S390 assembly file
  • Fixed Make.bin to have threaded latime built to do parallel cache flushing
  • Extended extract string lengths as patched by SAGE folks
  • Backported fixes & some arch support to configure framework, including host of Itanium and UST1 stuff provided by SAGE folks
  • NOTE: 3.10.2 is terribly out of date, and was released only because the threading rewrite is taking too long. If possible, you should use a developer release after testing that it works for your particular platform. In particular, developer releases are *much* faster for any x86 that uses AVX or later SIMD ISA, or any machine with ncores >= 8.
  • The developer release also supports ARM architectures better (though performance is not hugely better if you can get stable installed).
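
The "failure to cast" erratum above belongs to a common class of bug: a product of int dimensions overflows before it is widened. The sketch below shows the usual remedy; it is not the actual ATLAS fix, and ex_workspace_elts is a hypothetical helper.

```c
#include <stddef.h>

/* Hypothetical helper: number of elements in an m x n workspace.
 * Writing "m * n" would do the multiply in int and can overflow when
 * the product exceeds INT_MAX, even if the result is then stored in a
 * size_t.  Casting BEFORE the multiply makes the arithmetic itself
 * happen in size_t.  Assumes m >= 0 and n >= 0. */
size_t ex_workspace_elts(int m, int n)
{
   return (size_t)m * (size_t)n;
}
```

With m = n = 50000 the product is 2,500,000,000, which does not fit in a 32-bit int but is representable in a 64-bit size_t.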

New in ATLAS 3.11.28 Dev (Jun 12, 2014)

  • Fixed seg fault in archinfo_linux on x86
  • Added archdefs and archinfo_linux support for AMDLLANO64SSE3
  • Fixed numerous race conditions in PHI-specific SYRK
  • Fixed perf bug where LU failed to use PCA for high-core-count machines
  • Ensured parallel LU uses parallel swap, and serial LU uses serial swap
  • Added tgemm_amm case for tiny M,N, large K (inner product)
  • Prototyped access-major ttrsm, but it doesn't provide speedup over the old stuff except on PHI, where it is hugely faster.
  • Fixed bug in new goParallel so P > nthr is handled correctly
  • Added atlas_ttypes.h that can be included before thread tuning
  • Added optional thread pool that polls a bit before sleeping to improve performance where parallel jobs are performed right after each other
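
The "poll a bit before sleeping" thread pool above can be pictured with the following sketch: a worker spins on a flag for a bounded number of iterations and only falls back to a condition variable if no job arrives. This is an assumed design using pthreads and C11 atomics for illustration, not the ATLAS implementation; all ex_ names are hypothetical.

```c
#include <pthread.h>
#include <stdatomic.h>

typedef struct
{
   atomic_int      has_work;   /* set to 1 by the master when a job is posted */
   pthread_mutex_t mut;
   pthread_cond_t  cond;
} ex_pool_t;

void ex_pool_init(ex_pool_t *tp)
{
   atomic_init(&tp->has_work, 0);
   pthread_mutex_init(&tp->mut, NULL);
   pthread_cond_init(&tp->cond, NULL);
}

/* Master side: post a job and wake a sleeping worker, if any. */
void ex_post_work(ex_pool_t *tp)
{
   pthread_mutex_lock(&tp->mut);
   atomic_store(&tp->has_work, 1);
   pthread_cond_signal(&tp->cond);
   pthread_mutex_unlock(&tp->mut);
}

/* Worker side: returns 1 if the job was picked up while polling,
 * 0 if the worker had to take the slower mutex/cond path. */
int ex_wait_for_work(ex_pool_t *tp, int max_spins)
{
   int i;

   for (i = 0; i < max_spins; i++)         /* cheap path: bounded polling */
      if (atomic_exchange(&tp->has_work, 0))
         return 1;

   pthread_mutex_lock(&tp->mut);           /* slow path: block until posted */
   while (!atomic_exchange(&tp->has_work, 0))
      pthread_cond_wait(&tp->cond, &tp->mut);
   pthread_mutex_unlock(&tp->mut);
   return 0;
}
```

When jobs are posted back to back, the worker usually finds the next one during the polling loop and never pays for a sleep/wake cycle. Depending on the toolchain, building this may require -pthread.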

New in ATLAS 3.10.1 (May 2, 2014)

  • Fixed bad SSE guard that prevented PIII archdefs from working
  • Added return to main of ATLAS/tune/sysinfo/matime.c
  • Added ability for archinfo_x86.c to recognize more Corei2 platforms
  • Fixed premature KillAllMMNodes in emit_mm.c

New in ATLAS 3.11.27 Dev (May 2, 2014)

  • Fixed error in k-major cm2am code generator for alpha=X
  • Made it so C compiler search prioritizes exact match 'gcc' for gcc
  • Some aliases like c99-gcc don't work with pthreads due to include mismatch
  • Hacked ATLAS to use a thread pool rather than launch & join:
  • Only pthread version currently works
  • Wrote some real threaded kernels, using dynamic algorithms:
  • SYRK cases for large N and small N with large K
  • Specialized XeonPHI version for large N
  • GEMM case for tiny N & K, large M (QR & LU panel factorizations)
  • GEMM case for medium-sized squarish matrices (helps LU & QR)
  • These changes provide significant speedup for high-core-count archs
  • Produced bitvector ops in ATLAS/src/auxil/ATL_bitvec.c
  • Fixed it so XeonPHI uses 4*(P-1) physical cores in affinity array, and starts affinity IDs at 1 rather than 0.
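
For readers unfamiliar with the bitvector ops mentioned above, the basic technique looks roughly like this. It is a generic sketch, not the interface of ATL_bitvec.c; every name here is hypothetical.

```c
#include <stdlib.h>
#include <limits.h>

#define EX_WBITS (sizeof(unsigned long) * CHAR_BIT)   /* bits per word */

typedef struct
{
   size_t nbits;        /* number of valid bits */
   unsigned long *w;    /* packed storage, EX_WBITS bits per entry */
} ex_bv_t;

/* Allocate a bitvector of nbits bits, all cleared; NULL on failure. */
ex_bv_t *ex_bv_new(size_t nbits)
{
   ex_bv_t *bv = malloc(sizeof(*bv));
   if (!bv)
      return NULL;
   bv->nbits = nbits;
   bv->w = calloc((nbits + EX_WBITS - 1) / EX_WBITS, sizeof(unsigned long));
   if (!bv->w)
   {
      free(bv);
      return NULL;
   }
   return bv;
}

void ex_bv_free(ex_bv_t *bv)
{
   if (bv)
   {
      free(bv->w);
      free(bv);
   }
}

/* Set, clear and test bit i; callers must keep i < nbits. */
void ex_bv_set(ex_bv_t *bv, size_t i)
{
   bv->w[i / EX_WBITS] |= 1UL << (i % EX_WBITS);
}

void ex_bv_clear(ex_bv_t *bv, size_t i)
{
   bv->w[i / EX_WBITS] &= ~(1UL << (i % EX_WBITS));
}

int ex_bv_test(const ex_bv_t *bv, size_t i)
{
   return (int)((bv->w[i / EX_WBITS] >> (i % EX_WBITS)) & 1UL);
}
```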

New in ATLAS 3.11.26 Dev (Mar 18, 2014)

  • Unified single & double XeonPHI AMM kernel wt FMAC swizzle, so single automatically gets to use double kernels again
  • Added fixed KB=8 double PHI kernel for small-case performance
  • New XeonPHI archdefs
  • Conjugated TAU for row-major Q-variant factorizations (bugfix!)

New in ATLAS 3.11.24 Dev (Feb 11, 2014)

  • Kludged some limits on parallel scale when using AMM
  • New kernels, archdefs and configure changes for XeonPHI
  • Changed Make.bin's serial testers link to SBLASlib not BLASlib

New in ATLAS 3.11.23 Dev (Feb 4, 2014)

  • Got basic cross-compilation working when host/target share filesystem and you can ssh w/o passwd to target: --rtarg=
  • Added basic support for XeonPHI: --accel=2 --rtarg=mic0 - Expects icc in path and ready for use
  • Added SIMD vect support for MIC's AVX-512, called AVXZ in ATLAS
  • Added basic AMM AVXZ code generator -> presently disabled in search due to errors.
  • Fixed bugs in ammsearch where very large mu caused M/NB=0
  • Added PHI-specific assembly kernels

New in ATLAS 3.11.14 Dev (Oct 11, 2013)

  • Fixed several related bugs in uammsearch, including mixing non-square dimensions and failure to vectorize generated kernels & not trying K-vectorized kernels
  • Fixed ATL_damm2x12x2_sse2.S index file, which showed KU=1 not KU=2.

New in ATLAS 3.11.13 Dev (Sep 20, 2013)

  • Fixed bug in ATL_ammmNMK that occurred when main ammm kernel was used when K < kbmin or K%ku != 0.
  • Rewrote ammm_ src routs to call single rout to setup kernel usage
  • Fixed K=2, beta != 1 bug in ATL_ammm
  • Fixed bug in ammsearch causing it to use kernels wt out-of-bound KB
  • Added NB=16,32,64 explicit support to rank-K (32&64 for lapack)
  • Added kmaj user kernel support for rank-K & KCLEAN
  • Added ability to build user-generated AMM kerns separately from ATLAS kerns
  • Fixed bug in k-major kernel where some k-major kerns had MFLOPs inflated
  • Made emit_typ.c cast all ATL_sizeof macros to size_t

New in ATLAS 3.9.64 Dev (Feb 1, 2012)

  • Deleted MATGEN/*.o from lapack tester tarfile
  • Commented out nonsensical Q-type LWORK testing in error exit tests
  • Added new generic x86 architectures:
  • - x86x87, x86SSE1, x86SSE2, x86SSE3
  • Added (crappy) architectural defaults for generic archs:
  • - x86x8732, x86SSE132SSE1, x86SSE232SSE2
  • Added section on building generic libs in atlas_install
  • Added -M handling to gmmsearch.c's GetFlags
  • Fixed error when CacheEdge is 0 in threaded Level 2 BLAS and recursive Q-type factorizations
  • Changed makefile so rec Q-type factorizations depend on atlas_qrrmeth.h
  • Added lapack_test_pt_pt to test atlas threaded lapack + threaded blas
  • Added new archdefs for PIII32SSE1/PPRO2 for debian guys
  • -> Gcc 4.6.1 x87 performance is terrible, and gfortran has compiler bug that causes all blas testers to fail unless -O1 or lower opt thrown
  • Add new configure flag -Si ieee 0, which allows non-IEEE crap like ARM NEON to be used when set to 0
  • Added ARM NEON kernels for s/cGEMM, sGEMVT, sGER2K

New in ATLAS 3.9.49 Dev (Sep 2, 2011)

  • Fixed uninitialized var in all l2 kernel searches
  • Fixed out-of-mem bugs in GERC and GER2C
  • Fixed a bunch of warnings coming from clang