Performance

While dynamically-typed interpreted languages are typically considered to be slow (that is, by nature they lead implementations to run slowly), APL code which uses primarily flat arrays has been described as an excellent fit for modern hardware,<ref>Martin Thompson. "Rectangles All The Way Down" ([https://www.dyalog.com/uploads/conference/dyalog18/presentations/U12_Rectangles_All_The_Way_Down.pdf slides], [https://dyalog.tv/Dyalog18/?v=mK2WUDIY4hk video]) at [[Dyalog '18]].</ref> and [[Dyalog APL]] can in some cases perform better than straightforward [[wikipedia:C (programming language)|C]] implementations.<ref>Matthew Maycock. [https://ummaycoc.github.io/wc.apl/ Beating C with Dyalog APL: wc]. 2019-10.</ref><ref name="advantage">[[Marshall Lochbaum]]. "The Interpretive Advantage" ([https://www.dyalog.com/user-meetings/uploads/conference/dyalog18/presentations/D15_The_Interpretive_Advantage.zip slides (0.5 MB)], [https://dyalog.tv/Dyalog18/?v=-6no6N3i9Tg video]) at [[Dyalog '18]].</ref> Taking advantage of a high-performance implementation often requires writing in a flatter style, with few or no [[box]]es or [[nested]] arrays, and compiled or GPU-based APLs may not fully support nested arrays.
== Arrays and performance ==
The central concept underlying the high-performance array languages currently in use is to take advantage of the regularity of arrays. Implementations provide fast array algorithms for primitives, or short combinations of primitives, so that a program can make effective use of computer hardware as long as these primitives make up the bulk of its computation. Arrays allow for these fast algorithms when they are regular: that is, they have a rectangular shape (part of the definition of a multidimensional array) and their elements can all be stored with the same type. The type used should also be supported by the hardware (usually, the CPU's vector unit), which for example makes [[wikipedia:IEEE 754|IEEE 754]] floats the standard floating-point format in high-performance APLs.
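For illustration, once an array is known to be a homogeneous block of hardware-supported values, a primitive's inner loop reduces to plain typed C that a compiler can unroll and vectorize. This is a minimal sketch of the idea, not code from any particular interpreter:

<syntaxhighlight lang="c">
#include <stddef.h>

/* Element-wise addition over two homogeneous arrays of IEEE 754
   doubles. Because the element type is fixed and the data is
   contiguous, the compiler is free to unroll and vectorize
   this loop. */
void add_f64(double *z, const double *x, const double *y, size_t n) {
    for (size_t i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}
</syntaxhighlight>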
When APL programs are evaluated as a sequence of array operations that don't interact with each other, the time taken by a program is simply the total of the time taken by each primitive. Two different kinds of cost contribute to a primitive's performance: "fixed" costs that don't scale with the size of the arguments (but could vary somewhat based on other properties), and "variable" costs that do. These variable costs are usually proportional to the argument or result size, but may be more complicated. The simplest model of array primitive performance that takes this divide into account is that a given primitive has a constant "overhead" time for each evaluation, and then processes its input at a fixed rate, called "throughput". The advantage of array programming is that this throughput can be made much higher than in other paradigms. However, when the arrays used are small, a program's performance will instead be dominated by overhead, which tends to be high even relative to other interpreted languages because most APL interpreters store all data in arrays. Overhead for a primitive might be tens or hundreds of nanoseconds while throughput can be several elements per nanosecond, so that variable costs begin to dominate for arrays larger than about a thousand elements.
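The two-parameter model can be written down directly; the constants below are hypothetical round numbers chosen only to show the shape of the curve:

<syntaxhighlight lang="c">
#include <stdio.h>

/* Cost model for one primitive call:
   time = fixed overhead + elements / throughput.
   250 ns and 4 elements/ns are illustrative values, not
   measurements of any interpreter. */
static double time_ns(double n, double overhead_ns, double rate) {
    return overhead_ns + n / rate;
}

int main(void) {
    double overhead = 250.0, rate = 4.0;
    /* Variable cost equals fixed overhead at n = overhead * rate. */
    printf("crossover: %.0f elements\n", overhead * rate);
    printf("n=10:  %.0f ns (almost all overhead)\n", time_ns(10, overhead, rate));
    printf("n=1e6: %.0f ns (almost all throughput)\n", time_ns(1e6, overhead, rate));
    return 0;
}
</syntaxhighlight>

With these numbers the crossover falls at exactly a thousand elements, matching the rough figure above.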
To make array operations fast, arrays are stored in a regular format in memory, so that all data about the format can be kept in the array's "header" and read only before, not during, data processing. This means that elements in the array must have the same storage type for efficient operation. [[Flat array model]] languages with or without boxes require that arrays have a homogeneous type (all numbers, all characters, or all boxes); [[Nested array model|nested]] APLs do not, but store homogeneous arrays differently and aren't as efficient when processing mixed-type arrays. Fast APLs tend to use several [[internal type]]s to operate efficiently on subsets of the available numbers like integers or booleans, and may [[Array model#Numeric type coercion|coerce]] numbers to a common type to ensure that an array is homogeneous in memory.
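A header of this kind might look as follows in C; the field and type names are invented for the sketch and don't correspond to any specific interpreter:

<syntaxhighlight lang="c">
#include <stddef.h>
#include <stdint.h>

/* Hypothetical set of internal storage types. Real interpreters
   use more, such as 1-bit Booleans and several integer widths. */
typedef enum { ELT_BOOL, ELT_INT, ELT_FLOAT, ELT_CHAR, ELT_BOX } elt_type;

/* Everything about the format lives in the header, so it is read
   once up front and the ravel is then processed as one contiguous,
   homogeneous block. */
typedef struct {
    elt_type  type;     /* storage type shared by all elements */
    uint32_t  refcount; /* see "Reference counting" below      */
    size_t    rank;     /* number of axes                      */
    size_t   *shape;    /* length along each axis              */
    void     *data;     /* contiguous ravel of elements        */
} array_header;
</syntaxhighlight>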
In the simplest case, homogeneous array storage allows an implementation to use specialized code to run a primitive on each type. Because this special code is written in a low-level language with complete knowledge of the types in use, it can attain a throughput similar to that of the implementation language (usually a low-level language such as [[wikipedia:C (programming language)|C]]) for the primitive, although this will tend to leave overall algorithms running slower than that language because in APL they have been split into multiple passes. APL's array operations are also ideal for implementation with [[wikipedia:SIMD|SIMD]], or "single instruction, multiple data", operations, which perform a single action on several different values at once. In some cases, such as [[scalar function]]s, the primitives map directly onto SIMD operations; in others, such as [[Reverse]], they are easily implemented using SIMD (for Reverse, SIMD selection or "shuffle" instructions). While experimental SIMD machines (such as the APL-influenced [[wikipedia:CDC Star-100|CDC Star-100]]) were created as early as the 1960s, SIMD computing first entered the personal computing mainstream in the 1990s and has steadily grown in prominence for high-performance computing since then. In APL, CPU vector instruction sets such as Intel's [[wikipedia:Streaming SIMD Extensions|SSE]] are the most common way to access SIMD optimization, although [[Co-dfns]] instead runs on a GPU to attain much higher throughput at the cost of increased overhead and a restricted set of available algorithms.
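For example, Reverse on a vector of 32-bit values can be sketched with SSE2 shuffles, reversing four lanes per instruction. This shows the technique only; production code would also handle the leftover tail when the length is not a multiple of four:

<syntaxhighlight lang="c">
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Reverse x into z, four 32-bit elements at a time. Assumes n is
   a multiple of 4 and that z does not overlap x. */
void reverse_i32(int32_t *z, const int32_t *x, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(x + i));
        /* Shuffle lanes 0 1 2 3 into order 3 2 1 0. */
        v = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
        /* Write the reversed block to the mirrored position. */
        _mm_storeu_si128((__m128i *)(z + n - 4 - i), v);
    }
}
</syntaxhighlight>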
Not all primitives are equally amenable to SIMD implementation or other high-performance techniques. To get the most out of an APL implementation, the programmer must take advantage of the fast operations available. This may require significant testing to determine when primitives perform well, along with alternative strategies either to work around cases where an operation is slow or to take advantage of information the programmer knows but the language doesn't.


== Performant implementation ==


=== Alternate array representations ===
Internally, APL arrays are usually stored as two lists in memory. The first is a list of the shape (although it's also possible to store the "stride", enabling different views of the same data<ref>NumPy Reference. [https://numpy.org/doc/stable/reference/generated/numpy.ndarray.strides.html "ndarray.strides"]. Accessed 2020-11-09.</ref><ref>[[Nick Nickolov]]. [http://archive.vector.org.uk/art10501160 "Compiling APL to JavaScript"]. [[Vector Journal]] Volume 26 No. 1. 2013-09. (The strided representation was later removed from [[ngn/apl]].)</ref>). The second is the ravel of elements in the array. Nested arrays consist of pointers to arrays which may be distributed across memory; their use can lead to very inefficient memory read patterns, in contrast to flat arrays, which are stored as a single contiguous block.
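With a stride per axis, an element's position is the base offset plus the dot product of its index with the strides, so views such as a transposed or reversed axis need only permute or negate strides rather than copy data. A minimal sketch, with invented names:

<syntaxhighlight lang="c">
#include <stddef.h>

/* Hypothetical strided view over a flat buffer of doubles.
   stride[k] is the distance, in elements, between neighbours
   along axis k; it may be negative for a reversed axis. */
typedef struct {
    double    *data;
    size_t     rank;
    size_t    *shape;
    ptrdiff_t *stride;
} strided_view;

/* Element lookup: offset is the dot product of index and stride. */
double get(const strided_view *v, const size_t *index) {
    ptrdiff_t offset = 0;
    for (size_t k = 0; k < v->rank; k++)
        offset += (ptrdiff_t)index[k] * v->stride[k];
    return v->data[offset];
}
</syntaxhighlight>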


=== Reference counting and data reuse ===
