Dyalog APL: Difference between revisions

627 bytes added, 09:49, 22 November 2019
→‎Instruction set usage: Version numbers for x86 extension first usage
(→‎Implementation: Instruction sets)
(→‎Instruction set usage: Version numbers for x86 extension first usage)
Line 301: Line 301:
In Dyalog 17.0, the code for vectorised [[scalar function]]s was unified and extended to support Intel [[wikipedia:AVX2|AVX2]] and ARM NEON, in addition to Intel [[wikipedia:SSE2|SSE2]] and [[wikipedia:SSE4.1|SSE4.1]] and AltiVec VMX for IBM POWER. This code is used for the scalar dyadics [[Plus]], [[Minus]], [[Times]], [[Divide]], [[Maximum]], [[Minimum]], and the [[comparison function]]s, as well as for some functions derived by applying operators to these functions, such as the [[Outer Product]] and [[Inner Product]].


Dyalog also uses many other x86 extensions: in version 18.0,
Dyalog also uses many other x86 extensions:
* [[wikipedia:SSE2|SSE2]], [[wikipedia:SSE4.1|SSE4.1]], and [[wikipedia:AVX2|AVX2]] are used for [[scalar dyadic]]s.
* Since at least [[Dyalog APL versions#12.1|12.1]], [[wikipedia:SSE2|SSE2]] is used for [[scalar dyadic]]s.
* [[wikipedia:SSSE3|SSSE3]] is used primarily for the shuffle instruction for permuting arrays and searching small lookup tables.
* Since [[Dyalog APL versions#17.0|17.0]], [[wikipedia:AVX2|AVX2]] is used for scalar dyadics if available.
* [[wikipedia:SSE4.2|SSE4.2]] POPCNT is used to sum Boolean arrays.
* Since [[Dyalog APL versions#14.1|14.1]], [[wikipedia:SSE4.1|SSE4.1]] is used for [[Minimum]] and [[Maximum]], and finding the range of an array. [[wikipedia:AVX2|AVX2]] can also be used for these purposes in [[Dyalog APL versions#18.0|18.0]].
* [[wikipedia:SSE4.2|SSE4.2]] CRC32 is used to compute fast hash functions.
* Since [[Dyalog APL versions#17.0|17.0]], [[wikipedia:SSSE3|SSSE3]] is used primarily for the shuffle instruction for permuting arrays and searching small lookup tables.
* [[wikipedia:BMI2|BMI2]] is used for Boolean [[Compress]] and [[Expand]], and several [[structural function]]s on Boolean arrays.
* Since [[Dyalog APL versions#14.0|14.0]], [[wikipedia:SSE4.2|SSE4.2]] POPCNT is used to sum Boolean arrays.
* [[wikipedia:CLMUL instruction set|CLMUL]] is used for [[xor]] [[reduction]]s and [[scan]]s (new in 18.0).
* Since [[Dyalog APL versions#14.0|14.0]], [[wikipedia:SSE4.2|SSE4.2]] CRC32 is used to compute fast hash functions.
* [[wikipedia:FMA instruction set|FMA3]] is used to implement [[Divide|division]] by a [[singleton]] (new in 18.0).
* Since [[Dyalog APL versions#15.0|15.0]], [[wikipedia:BMI2|BMI2]] is used for Boolean matrix transpose. Since [[Dyalog APL versions#16.0|16.0]], it is used for Boolean [[Compress]] and [[Expand]], and several [[structural function]]s on Boolean arrays.
* Since [[Dyalog APL versions#18.0|18.0]], [[wikipedia:CLMUL instruction set|CLMUL]] is used for [[xor]] [[reduction]]s and [[scan]]s.
* Since [[Dyalog APL versions#18.0|18.0]], [[wikipedia:FMA instruction set|FMA3]] is used to implement [[Divide|division]] by a [[singleton]].
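
The CLMUL entry above relies on a property of carry-less multiplication: multiplying a word of bits by an all-ones word yields the running xor of those bits. The following is a minimal Python sketch of that property only, not Dyalog's implementation. The function names (<code>clmul</code>, <code>xor_scan_word</code>, <code>xor_scan_reference</code>), the 64-bit default width, and the least-significant-bit-first ordering are all assumptions made for illustration.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less (GF(2)) product of two non-negative integers.

    Software model of what a CLMUL instruction computes; partial
    products are combined with xor instead of addition.
    """
    result = 0
    shift = 0
    while b:
        if b & 1:
            result ^= a << shift
        b >>= 1
        shift += 1
    return result


def xor_scan_word(x: int, width: int = 64) -> int:
    """Xor scan of the low `width` bits of x (bit 0 is the first element).

    Assumption: least-significant-bit-first order, chosen for illustration.
    Bit i of the result is the xor of bits 0..i of x.
    """
    mask = (1 << width) - 1
    return clmul(x & mask, mask) & mask


def xor_scan_reference(x: int, width: int = 64) -> int:
    """Bit-by-bit xor scan, used only to check xor_scan_word."""
    out = 0
    acc = 0
    for i in range(width):
        acc ^= (x >> i) & 1
        out |= acc << i
    return out
```

In this model the top bit of the scan is the xor reduction of the whole word, so the same multiplication serves both purposes. A real implementation would also need to carry the last bit of each word into the next one, which this sketch leaves out.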


It also uses the POWER8 [https://www.ibm.com/support/knowledgecenter/SSGH2K_13.1.3/com.ibm.xlc1313.aix.doc/compiler_ref/vec_gbb.html gather-bits-by-bytes] instruction, which is equivalent to transposing an 8x8 bit matrix, for [[Boolean]] [[Transpose]] since version 15.0 (with expanded applicability in 16.0). Since 18.0 it also uses the POWER fused multiply-add instruction for division, in the same way as x86 FMA3.
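
To make the 8x8 bit-matrix transpose concrete, here is a minimal Python model of the operation. It is a sketch of the mathematical effect only: the function name <code>transpose8x8</code> and the convention that bit 7 of each byte is the leftmost column are assumptions for illustration, and the sketch does not claim to reproduce the exact bit or byte ordering of the POWER8 instruction.

```python
def transpose8x8(rows):
    """Transpose an 8x8 bit matrix given as 8 bytes, one byte per row.

    Assumption: bit 7 of each byte is the leftmost column. Output byte c
    collects column c of the input, with input row 0 ending up in bit 7.
    """
    if len(rows) != 8 or any(not 0 <= r <= 0xFF for r in rows):
        raise ValueError("expected 8 values in the range 0..255")
    out = []
    for c in range(8):
        byte = 0
        for r in range(8):
            bit = (rows[r] >> (7 - c)) & 1
            byte |= bit << (7 - r)
        out.append(byte)
    return out
```

A larger Boolean matrix can be transposed by applying such an 8x8 kernel to each tile and writing the tiles to their mirrored positions, which is one plausible reason a single-instruction 8x8 transpose is useful here.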
