Key

From APL Wiki
Revision as of 20:59, 10 September 2022

Key (⌸) is a primitive monadic operator which takes a dyadic function operand and applies it to groups of indices or major cells of an argument, where the groups are determined by specified keys. It was introduced in Dyalog APL version 14.0 and is commonly compared to SQL's GROUP BY clause.

Description

In the monadic case, Key groups identical major cells together and applies the function operand once for each unique major cell. The function is applied with the unique major cell as left argument and the indices of the major cells that match it as right argument:

<syntaxhighlight lang=apl>
      {⍺⍵}⌸'Mississippi'
┌─┬────────┐
│M│1       │
├─┼────────┤
│i│2 5 8 11│
├─┼────────┤
│s│3 4 6 7 │
├─┼────────┤
│p│9 10    │
└─┴────────┘
</syntaxhighlight>

In the dyadic case, Key applies the function to collections of major cells from the right argument corresponding to unique elements of the left argument:

<syntaxhighlight lang=apl>
      'Mississippi'{⍺⍵}⌸'ABCDEFGHIJK'
┌─┬────┐
│M│A   │
├─┼────┤
│i│BEHK│
├─┼────┤
│s│CDFG│
├─┼────┤
│p│IJ  │
└─┴────┘
</syntaxhighlight>

The monadic case, <syntaxhighlight lang=apl inline>f⌸Y</syntaxhighlight>, is equivalent to <syntaxhighlight lang=apl inline>Y f⌸ ⍳≢Y</syntaxhighlight>.
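
The equivalence can be checked directly. The following is a minimal sketch, assuming a Dyalog APL session (version 14.0 or later); the variable name <syntaxhighlight lang=apl inline>Y</syntaxhighlight> is only illustrative:

<syntaxhighlight lang=apl>
      Y←'Mississippi'
      ⍝ monadic Key on Y versus dyadic Key with Y as keys and ⍳≢Y as data
      ({⍺⍵}⌸Y)≡Y{⍺⍵}⌸⍳≢Y
1
</syntaxhighlight>

Both sides generate their indices under the same index origin, so the comparison should not depend on the setting of <syntaxhighlight lang=apl inline>⎕IO</syntaxhighlight>.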

Problems

Vocabulary

A common problem with Key is the inability to control the order of the result (as Key will use the order of appearance) and the "vocabulary" (as Key will never include information for a major cell that doesn't occur). For example, here we want to count occurrences of the letters A, C, G, T:

<syntaxhighlight lang=apl>
      {⍺,≢⍵}⌸'TCCGCGGTGGCG'
T 2
C 4
G 6
</syntaxhighlight>

Since A is entirely missing in the argument, it isn't mentioned in the result either. Likewise, the result is mis-ordered due to T appearing before the first C and the first G. A common solution is to inject the vocabulary before the actual data, and then subtract one from each count:

<syntaxhighlight lang=apl>
      {⍺,¯1+≢⍵}⌸'ACGT','TCCGCGGTGGCG'
A 0
C 4
G 6
T 2
</syntaxhighlight>

Now that the meaning of each count is known, the operand's left argument can be ignored, and the decrementing can be factored out from the operand:

<syntaxhighlight lang=apl>
      ¯1+{≢⍵}⌸'ACGT','TCCGCGGTGGCG'
0 4 6 2
</syntaxhighlight>
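
The vocabulary technique can be wrapped in a small function. This is only a sketch: the name <syntaxhighlight lang=apl inline>CountIn</syntaxhighlight> is hypothetical, and it assumes vector arguments, a vocabulary without duplicates, and data whose items all occur in the vocabulary (otherwise extra counts are appended for the unexpected items):

<syntaxhighlight lang=apl>
      CountIn←{¯1+{≢⍵}⌸⍺,⍵}   ⍝ ⍺: vocabulary, ⍵: data
      'ACGT' CountIn 'TCCGCGGTGGCG'
0 4 6 2
      'ACGT' CountIn ''
0 0 0 0
</syntaxhighlight>

Because the vocabulary is prepended, the counts always follow the order of the vocabulary, even for empty data.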

Computing the unique

Key computes the set of unique major cells. Often, this collection is needed separately from the occurrence information, but can be hard to extract. For example, the following finds the index of the most frequently occurring letter:

<syntaxhighlight lang=apl>
      ⊃⍒{≢⍵}⌸'TCCGCGGTGGCG'
3
</syntaxhighlight>

Notice that 3 is an index into the set of unique letters, and so it is tempting to write:

<syntaxhighlight lang=apl>
      {(⊃⍒{≢⍵}⌸⍵)⌷∪⍵}'TCCGCGGTGGCG'
G
</syntaxhighlight>

However, while this code works, it is inefficient in that the unique is computed twice. This can be avoided by letting Key return the unique and using that:

<syntaxhighlight lang=apl>
      (keys counts)←,⌿{⍺,≢⍵}⌸'TCCGCGGTGGCG'
      keys⌷⍨⊃⍒counts
G
</syntaxhighlight>

Unfortunately, this can introduce a different inefficiency: the result of Key's operand can end up being a heterogeneous array (containing multiple datatypes). Such arrays are stored as pointer arrays, which consume memory for one pointer per element and force "pointer chasing" when the data is addressed. A possible workaround is to collect the unique keys separately from the counts:

<syntaxhighlight lang=apl>
      data←'TCCGCGGTGGCG'
      keys←0⌿data
      counts←{keys⍪←⍺ ⋄ ≢⍵}⌸data
      keys⌷⍨⊃⍒counts
G
</syntaxhighlight>

If there are a large number of unique values, the repeated updating of the accumulating <syntaxhighlight lang=apl inline>keys</syntaxhighlight> variable can be an issue in itself.
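
Indexing into the result of Unique with a position obtained from Key relies on Key visiting its keys in order of first appearance, which is the order in which Unique returns them. A quick check, as a sketch in a Dyalog APL session (the operand <syntaxhighlight lang=apl inline>{⍺}</syntaxhighlight> simply returns each key and ignores the indices):

<syntaxhighlight lang=apl>
      data←'TCCGCGGTGGCG'
      ⍝ the keys that Key passes to its operand, compared with Unique
      ({⍺}⌸data)≡∪data
1
</syntaxhighlight>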


