Dyalog APL

Dyalog APL, or simply Dyalog, is a modern APL in the APL2 tradition, first released by British company Dyadic Systems Ltd. (now Dyalog Ltd.) in 1983 for the Zilog Z8000 processor (the name Dyalog is a portmanteau of Dyadic and Zilog). Dyalog supports several platforms and interfaces with many languages and runtimes, including native shared libraries, .NET, the JVM, R, and Python. It is actively developed and has introduced many new primitives and concepts to array programming. The major categories of features Dyalog has introduced to APL are tacit programming (first by allowing named derived functions, and later through trains), lexically scoped functional programming using dfns, namespaces and object-oriented programming, and the addition of leading axis theory and the Rank operator to the nested array paradigm.

In 1995, two Dyalog developers—John Scholes and Peter Donnelly—were awarded the Iverson Award for their work on the interpreter. Gitte Christensen and Morten Kromberg were joint recipients of the Iverson Award in 2016.

Versions
Dyalog lists historical versions, along with release notes since 14.0, on its website. Its early history is recounted in more detail by Pete Donnelly in Dyalog APL: A Personal History (pdf).

Implementation
Dyalog APL is implemented primarily in C, with some parts written in C++ in order to use templates. C intrinsics are used to access instruction set extensions. Some architecture-specific assembly, both compiled separately and inline in C, is used for functionality that is not easily accessible from C, such as exception flags. Prior to version 17.0, assembly was also used for vectorized arithmetic; in 17.0 this code was replaced by a new C++ implementation.

Internal types
Dyalog uses the following numeric types:
 * 1-bit packed Boolean
 * 1-byte integer
 * 2-byte integer
 * 4-byte integer
 * 8-byte double
 * 16-byte complex (one double for each component)
 * 16-byte decimal float "decf" (BID or DPD)
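
To make the 1-bit packed Boolean type concrete, the following Python sketch packs eight Boolean values into each byte and sums them by counting set bits. The bit order (first element in the most significant bit) and the function names `pack_bits` and `sum_packed` are assumptions made for illustration, not a description of Dyalog's internals.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values, eight per byte.

    Assumed layout: the first element of each group of eight
    goes into the most significant bit of its byte.
    """
    out = bytearray((len(bits) + 7) // 8)
    for i, b in enumerate(bits):
        if b:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)


def sum_packed(packed):
    """Sum a packed Boolean array by counting the set bits of each byte."""
    return sum(bin(byte).count("1") for byte in packed)
```

With this layout a 1,000-element Boolean vector occupies 125 bytes, one eighth of what a 1-byte integer representation would need.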

Character encodings differ between Classic and Unicode interpreters. Classic interpreters use a custom 1-byte encoding for all characters and are therefore limited to a 256-character set, while characters in Unicode interpreters are stored as 1-, 2-, or 4-byte unsigned Unicode code point values.
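
As a rough illustration of the three character widths in Unicode interpreters, the Python sketch below picks the narrowest unsigned width that can hold every code point in a string. The function name `char_width` is hypothetical, and the rule of always choosing the narrowest width is an assumption about the general idea rather than a statement of when the interpreter actually narrows an array.

```python
def char_width(text):
    """Return the narrowest width in bytes (1, 2, or 4) that can hold
    every code point in `text` as an unsigned integer."""
    top = max(map(ord, text), default=0)
    if top < 2**8:
        return 1
    if top < 2**16:
        return 2
    return 4
```

For example, a plain ASCII string fits in 1 byte per character, a string containing the APL glyph U+2373 needs 2 bytes, and one containing a code point above U+FFFF needs 4.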

Nested and mixed arrays (that is, pointer arrays) are always stored as arrays of pointers, while simple numeric or character arrays are always stored using one of the above types. For both numbers and characters, an array may be represented using any type that can contain all the values. The interpreter may reduce the type of an array to the minimum possible ("squeeze" the array) during execution.
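
The idea behind squeezing can be sketched in Python as a search for the narrowest listed type whose range covers all values. This is a simplified illustration that handles only real numbers in a non-empty array and ignores the complex and decimal float types; the name `squeezed_type` and the returned labels are hypothetical.

```python
# Integer types from the list above, narrowest first: (label, min, max).
INT_TYPES = [
    ("1-bit Boolean", 0, 1),
    ("1-byte integer", -2**7, 2**7 - 1),
    ("2-byte integer", -2**15, 2**15 - 1),
    ("4-byte integer", -2**31, 2**31 - 1),
]


def squeezed_type(values):
    """Return the label of the narrowest type that can hold all `values`.

    Assumes a non-empty sequence of Python ints and floats.
    """
    whole = all(isinstance(v, int) or float(v).is_integer() for v in values)
    if whole:
        lo, hi = min(values), max(values)
        for label, tmin, tmax in INT_TYPES:
            if tmin <= lo and hi <= tmax:
                return label
    # Fractional values, or integers too large for 4 bytes.
    return "8-byte double"
```

Under these assumptions `squeezed_type([0, 1, 1])` gives the Boolean type, `squeezed_type([0, 1, 2])` the 1-byte integer type, and `squeezed_type([1.5])` the double type.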

Because there is no complex representation using decimal floats for the components, arrays containing both decimal floats and complex numbers have no common representation. Dyalog converts such arrays to complex numbers, resulting in a loss of precision for decf elements.
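
The precision loss can be demonstrated with Python's `decimal` module, using a 34-digit context as a stand-in for a 128-bit decimal float and Python's `complex` (two doubles) as a stand-in for the complex type. The function name `decf_roundtrip_error` is hypothetical, and the sketch only models the arithmetic, not the interpreter's conversion code.

```python
from decimal import Decimal, localcontext


def decf_roundtrip_error(text):
    """Return the absolute error introduced by storing the decimal
    number `text` as the real part of a double-based complex number."""
    with localcontext() as ctx:
        ctx.prec = 34                  # 34 significant digits, as in decimal128
        d = Decimal(text)              # exact decimal value
        z = complex(float(d), 0.0)     # components are 8-byte doubles
        return abs(Decimal(z.real) - d)
```

A value such as 0.5 is exactly representable as a double and survives unchanged, whereas 0.1 does not: the error returned for `"0.1"` is small but non-zero.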

Instruction set usage
Dyalog makes heavy use of vector instructions on all platforms, as well as other special instruction sets primarily on x86. Instruction set availability is checked at runtime, so that the minimum required instruction set remains low:
 * For 32-bit x86, only SSE2 is required.
 * For x86_64, there is no additional requirement, as every x86_64 processor supports SSE2. On macOS, SSE4.1 is required, since all x86 Apple machines support this instruction set.
 * For ARM32, there is no minimum requirement.
 * As of version 17.1, POWER7 and above are supported. Support for older systems was dropped because Dyalog compiles separate binaries for each POWER architecture.
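
Runtime checking of instruction set availability is commonly organised as a dispatch table: several implementations of the same operation are compiled in, and the best one the current processor supports is selected when the program runs. The Python sketch below models that pattern only. The feature names, the chunk widths, and the functions `make_chunked_add` and `select_kernel` are all hypothetical and do not reflect Dyalog's source.

```python
def make_chunked_add(width):
    """Build an element-wise add that processes `width` elements per step,
    standing in for a kernel that uses vector registers of that width."""
    def add(x, y):
        out = []
        for i in range(0, len(x), width):
            out.extend(a + b for a, b in zip(x[i:i + width], y[i:i + width]))
        return out
    add.width = width
    return add


# Kernels in order of preference: (required feature, implementation).
# A requirement of None marks the fallback that runs everywhere.
KERNELS = [
    ("avx2", make_chunked_add(8)),   # 256-bit registers: 8 four-byte integers
    ("sse2", make_chunked_add(4)),   # 128-bit registers: 4 four-byte integers
    (None, make_chunked_add(1)),     # plain scalar loop
]


def select_kernel(cpu_features):
    """Return the first kernel whose required feature is available."""
    for required, kernel in KERNELS:
        if required is None or required in cpu_features:
            return kernel
```

Every kernel produces the same result, so the choice affects only speed; this is what allows the minimum requirement to stay low while newer processors still benefit.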

In Dyalog 17.0, the code for vectorized scalar functions was unified and extended to support Intel AVX2 and ARM NEON in addition to Intel SSE2 and SSE4.1 and AltiVec (VMX) for IBM POWER. This code handles the scalar dyadic functions Plus, Minus, Times, Divide, Maximum, Minimum, and the comparison functions, as well as some functions derived by applying operators to them, such as Outer Product and Inner Product.

Dyalog also uses many other x86 extensions:
 * Since at least 12.1, SSE2 is used for scalar dyadics.
 * Since 17.0, AVX2 is used for scalar dyadics if available.
 * Since 14.1, SSE4.1 is used for Minimum and Maximum, and finding the range of an array. AVX2 can also be used for these purposes in 18.0.
 * Since 17.0, SSSE3 is used primarily for the shuffle instruction for permuting arrays and searching small lookup tables.
 * Since 14.0, SSE4.2 POPCNT is used to sum Boolean arrays.
 * Since 14.0, SSE4.2 CRC32 is used to compute fast hash functions.
 * Since 15.0, BMI2 is used for Boolean matrix transpose. Since 16.0, it is used for Boolean Compress and Expand, and several structural functions on Boolean arrays.
 * Since 18.0, CLMUL is used for xor reductions and scans.
 * Since 18.0, FMA3 is used to implement division by a singleton.

On POWER, Dyalog uses the POWER8 gather-bits-by-bytes instruction, which is equivalent to transposing an 8x8 bit matrix, for Boolean Transpose since version 15.0 (with wider applicability in 16.0). Since 18.0 it also uses the fused multiply-add instruction for division, as it does with FMA3 on x86.
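
The 8x8 bit matrix transpose mentioned above can be written out directly in Python. The sketch assumes the matrix is packed row by row into a 64-bit integer, with the element at row r and column c held in bit 8*r + c; this layout and the name `transpose8x8` are illustrative assumptions, and a hardware instruction performs the equivalent rearrangement in a single step rather than with a loop.

```python
def transpose8x8(m):
    """Transpose an 8x8 bit matrix packed into a 64-bit integer.

    Assumed layout: bit (8*r + c) holds the element at row r, column c.
    """
    out = 0
    for r in range(8):
        for c in range(8):
            bit = (m >> (8 * r + c)) & 1
            out |= bit << (8 * c + r)
    return out
```

A first row of all ones, `0xFF`, becomes a first column of all ones, `0x0101010101010101`, and applying the function twice returns the original matrix.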