Performance
Performance refers to the speed with which programs are executed in a particular language implementation. While a language such as APL cannot inherently be fast or slow, it is often described as well suited to high-performance implementation, and there are many APL implementations focused partially or exclusively on performance. Currently developed array-family implementations that advertise high performance include Dyalog APL, J, K (both Kx and Shakti), and Q, while research projects focused primarily on performance include APEX, Co-dfns, SaC, Futhark, and TAIL.
While dynamically-typed interpreted languages are typically considered to be slow (that is, by nature they lead implementations to run slowly), APL code which uses primarily flat arrays has been described as an excellent fit for modern hardware,[1] and Dyalog APL can in some cases perform better than straightforward C implementations.[2][3] Taking advantage of a high-performance implementation often requires writing in a flatter style, with few or no boxes or nested arrays, and compiled or GPU-based APLs may not fully support nested arrays.
Performant implementation
Even the first APL implementation, APL\360, was considered fast for an interpreted language: Larry Breed said it executed programs "often one-tenth to one-fifth as fast as compiled code". He attributed its high performance to fast array operations, with development guided by analysis of user code, and its low system overhead to a well-implemented supervisor with complete control over system resources.[4] Performance of system operations remained a point of focus for various dialects in the time-sharing era, but in modern times resources such as files are simply accessed through the host operating system.
Internal datatypes
- Main article: Internal type
Most APLs expose only a small number of scalar types to the user: one or two numeric types (such as double-precision real or complex numbers), and a single character type. However, for performance reasons these types can be implemented internally using various subset types. For example, APL\360 uses numeric arrays of 1-bit Booleans, 4-byte integers, or 8-byte floating point numbers, but converts between them transparently so that from the user's perspective all numbers behave like 8-byte floats (as this type contains the others). In Dyalog APL this hierarchy is significantly expanded, adding 1-byte and 2-byte integers as well as 16-byte complex numbers containing the other types (however, Dyalog also allows the user access to decimal floats if requested, which breaks the strict hierarchy).
When working with large arrays, an implementation can dynamically choose the type of arrays as execution progresses. For some operations it is advantageous to force an array to the smallest possible type, a procedure known as "squeezing". The ability to dynamically change array type can be a practical advantage of interpreted array languages over statically typed compiled languages, since the interpreter is sometimes able to choose a smaller type than the compiler. This may be because the programmer chooses a suboptimal type, or because the interpreter can take advantage of situations where an array could in principle require a larger type but does not in a particular run of the program. With an implementation using vector instructions, a smaller internal type can directly translate to faster execution because a vector register (and hence a vector operation) can fit more elements when they are smaller.[3]
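To illustrate the idea, a "squeeze" pass can scan an array's values and report the narrowest internal type that holds all of them. The following is a minimal sketch in C with illustrative type names, not the code of any actual interpreter:

 #include <math.h>
 #include <stddef.h>
 #include <stdint.h>
 
 /* Internal element types, ordered so each type's values are contained
    in the next: Boolean within 1-byte int within 2-byte int, and so on. */
 typedef enum { T_BOOL, T_INT8, T_INT16, T_INT32, T_FLOAT64 } eltype;
 
 /* "Squeezing": find the smallest type that can represent every element
    of a float64 ravel, so the array can be stored more compactly. */
 eltype squeeze_type(const double *ravel, size_t n) {
     eltype t = T_BOOL;
     for (size_t i = 0; i < n; i++) {
         double v = ravel[i];
         if (v != floor(v)) return T_FLOAT64;          /* fractional or NaN */
         if (v == 0.0 || v == 1.0) continue;           /* fits in Boolean   */
         if      (v >= INT8_MIN  && v <= INT8_MAX)  { if (t < T_INT8)  t = T_INT8;  }
         else if (v >= INT16_MIN && v <= INT16_MAX) { if (t < T_INT16) t = T_INT16; }
         else if (v >= INT32_MIN && v <= INT32_MAX) { if (t < T_INT32) t = T_INT32; }
         else return T_FLOAT64;                        /* integral, but too wide */
     }
     return t;
 }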
Fast array operations
- Main article: APL primitive performance
Most of the effort in optimizing mainstream APL implementations is focused on optimizing particular array operations.[5]
Implementing APL with APL
- Main article: Magic function
The technique of implementing APL primitives using other primitives, or even simpler cases of the same primitive, can be advantageous for performance in addition to being easier for the implementer.[6] Even when a primitive does not use APL directly, reasoning in APL can lead to faster implementation techniques.[7]
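As an illustration of the technique (a hypothetical sketch in C rather than in APL itself), membership can be written as a thin layer over index-of, using the identity that an element of x is a member of y exactly when its index in y is valid: with 0-based indexing, x∊y is (y⍳x)<≢y. Written this way, the primitive automatically benefits from any optimization of the primitive it is built on:

 #include <stdlib.h>
 #include <stddef.h>
 
 /* Index-of (dyadic iota): for each x[i], the index of its first
    occurrence in y, or ny if it does not appear.  A real implementation
    would use hashing; a linear scan keeps the sketch short. */
 void index_of(const int *y, size_t ny, const int *x, size_t nx, size_t *out) {
     for (size_t i = 0; i < nx; i++) {
         size_t j = 0;
         while (j < ny && y[j] != x[i]) j++;
         out[i] = j;
     }
 }
 
 /* Membership defined in terms of index-of, in the style of a magic
    function: out[i] is 1 when x[i] appears in y. */
 void member(const int *y, size_t ny, const int *x, size_t nx, char *out) {
     size_t *idx = malloc(nx * sizeof *idx);
     if (!idx) return;                     /* allocation failure: give up */
     index_of(y, ny, x, nx, idx);
     for (size_t i = 0; i < nx; i++) out[i] = idx[i] < ny;
     free(idx);
 }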
Alternate array representations
Internally, APL arrays are usually stored as two lists in memory. The first is a list of the shape (although it's also possible to store the "stride", enabling different views of the same data[8][9]). The second is the ravel of elements in the array. Nested arrays consist of pointers to arrays that may be scattered across memory, so their use can lead to very inefficient memory read patterns, in contrast to flat arrays, which are stored as a contiguous block.
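In outline, such representations might look like the following hypothetical C sketch; the struct layouts and field names are illustrative, not those of any particular implementation:

 #include <stddef.h>
 
 /* A flat array: a shape list plus a contiguous ravel of elements. */
 typedef struct {
     size_t  rank;     /* number of axes */
     size_t *shape;    /* shape[0..rank-1] */
     double *ravel;    /* all elements, in ravel order, in one block */
 } array;
 
 /* A strided representation, as used by e.g. NumPy: a per-axis element
    step replaces the assumption of ravel order, so transposing or
    slicing can produce a new view of the same data without copying. */
 typedef struct {
     size_t     rank;
     size_t    *shape;
     ptrdiff_t *strides;  /* element step along each axis */
     double    *data;     /* points into a shared block */
 } strided_view;
 
 /* Index into a strided view: the offset is the sum of index*stride
    over all axes. */
 double view_get(const strided_view *v, const size_t *index) {
     ptrdiff_t off = 0;
     for (size_t a = 0; a < v->rank; a++)
         off += (ptrdiff_t)index[a] * v->strides[a];
     return v->data[off];
 }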
Reference counting and data reuse
Because APL's immutable arrays do not permit circular references (although other features like objects might), APL implementations almost universally use reference counting as a memory management technique. In some implementations, such as Dyalog APL, reference counting is supplemented with tracing garbage collection, which is run infrequently to handle circular references.
Because reference counting keeps track of the exact number of references, and not just whether an array is referenced or not, it can be used not only to find when an array can be released (reference count 0), but also to find when it can be reused when passed as an argument (reference count 1).[10] When permitted, reusing arguments can reduce memory usage and improve cache locality. In some cases, it also allows for faster primitive implementations: for example, Reshape can change only an array's shape while leaving its ravel data in place, Take can free trailing major cells from an array while leaving the remainder, and At can modify only part of an array.
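A simplified sketch of the argument-reuse check in C (hypothetical; a real interpreter tracks more state, such as the array's type, and handles allocation failure):

 #include <stdlib.h>
 #include <stddef.h>
 
 typedef struct {
     int     refcount;
     size_t  count;    /* number of elements */
     double *ravel;
 } array;
 
 array *array_alloc(size_t count) {       /* error handling omitted */
     array *a = malloc(sizeof *a);
     a->refcount = 1;
     a->count    = count;
     a->ravel    = malloc(count * sizeof *a->ravel);
     return a;
 }
 
 void release(array *a) {
     if (--a->refcount == 0) { free(a->ravel); free(a); }
 }
 
 /* Monadic minus: with a reference count of exactly 1, the argument is
    not visible anywhere else, so its ravel can be overwritten in place;
    otherwise a fresh result is allocated and the argument released. */
 array *negate(array *x) {
     array *z = (x->refcount == 1) ? x : array_alloc(x->count);
     for (size_t i = 0; i < x->count; i++) z->ravel[i] = -x->ravel[i];
     if (z != x) release(x);
     return z;
 }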
Operation merging and dynamic compilation
Ahead-of-time compilation
Performant usage
For the user, there are several strategies to consider when performance matters.
Changing representation
While an APL user cannot change the way the language stores their arrays, a common optimization strategy is to improve the layout of data within the arrays in a program. This typically means reducing the use of nested arrays with many leaves in favor of one or a few flat arrays. The most obvious such improvement is simply to change a nested array in which all child arrays have the same shape into a higher-rank array (the Mix of the nested array); the Rank operator can make working with such arrays easier. Roger Hui has advocated the use of inverted tables to store database-like tables: rather than a matrix in which each row mixes types, a vector of flat columns, so that elements in each column share a single type while different columns may have different types.[11] Bob Smith, before the introduction of nested APLs, suggested using a Boolean partition vector (like the one used by Partitioned Enclose) to encode vectors of vectors as flat arrays,[12] and Aaron Hsu has developed techniques for working with trees using flat depth, parent, or sibling vectors.[13]
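The inverted-table layout corresponds to the "structure of arrays" pattern familiar from other languages. A hypothetical C sketch, with illustrative field names, shows why it helps:

 #include <stddef.h>
 #include <stdint.h>
 
 /* Row-wise ("array of structs"): each record mixes types, so a single
    column cannot be scanned as one contiguous, uniformly typed block. */
 typedef struct {
     const char *name;
     int32_t     age;
     double      balance;
 } record;
 
 /* Inverted ("struct of arrays"): one flat, uniformly typed array per
    column -- the layout that fast flat-array operations work best on. */
 typedef struct {
     size_t       nrows;
     const char **name;     /* nrows entries per column */
     int32_t     *age;
     double      *balance;
 } inverted_table;
 
 /* Count rows with age >= 18: touches only one homogeneous,
    contiguous column rather than every record. */
 size_t count_adults(const inverted_table *t) {
     size_t n = 0;
     for (size_t i = 0; i < t->nrows; i++) n += t->age[i] >= 18;
     return n;
 }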
References
- ↑ Martin Thompson. "Rectangles All The Way Down" (slides, video) at Dyalog '18.
- ↑ Matthew Maycock. Beating C with Dyalog APL: wc. 2019-10.
- ↑ 3.0 3.1 Marshall Lochbaum. "The Interpretive Advantage" (slides (0.5 MB), video) at Dyalog '18.
- ↑ Larry Breed. "The Implementation of APL\360". 1967-08.
- ↑ Morten Kromberg and Roger Hui. "D11: Primitive Performance" (slides (1.3 MB), materials (1.4 MB), video) at Dyalog '13.
- ↑ Roger Hui. "In Praise of Magic Functions: Part I". Dyalog blog. 2015-06-22.
- ↑ Marshall Lochbaum. "Expanding Bits in Shrinking Time". Dyalog blog. 2018-06-11.
- ↑ NumPy Reference. "ndarray.strides". Accessed 2020-11-09.
- ↑ Nick Nickolov. "Compiling APL to JavaScript". Vector Journal Volume 26 No. 1. 2013-09. (The strided representation was later removed from ngn/apl.)
- ↑ Brian Cantrill. "A Conversation with Arthur Whitney". 2009.
- ↑ Roger Hui. "Inverted Tables" (slides (0.9 MB), video) at Dyalog '18.
- ↑ Bob Smith. "A programming technique for non-rectangular data" (included in Boolean functions (pdf)) at APL79.
- ↑ Aaron Hsu. "High-performance Tree Wrangling, the APL Way" (slides (0.3 MB), video) at Dyalog '18.