Error trapping with Dyalog APL

It is impossible to write bug-free code in a complex system or to foresee every error. Any application that is supposed to run in a production environment therefore needs some kind of error trapping. Doing this in a general and efficient manner is not easy. This article discusses techniques to solve this problem.

Versions covered
The techniques we are going to discuss have been available in Dyalog APL for a long time. When you implement error trapping you should be very careful: it is easy to create a non-interruptible loop. If this happens you have to kill the process in the task manager and the workspace is lost, so it is a good idea to save the workspace before you run code you have just changed.

Considerations
When an application is executed in a production environment, error trapping can be used to achieve the following goals:


 * Save a workspace as a snapshot reflecting the state of the application when the error appeared. This makes it easy to analyze a problem, and sometimes it is the only way to do so.
 * After having saved a snapshot it might be possible to restart the application. This might enable the user to run the application with different data, or to run different parts of the application on the same data.
 * Prevent the user from interrupting the application by pressing the keys associated with the weak and the strong interrupt.
 * Continue execution in case a developer has forgotten to remove a stop vector.

When an application is started, it needs to be initialized. If an error occurs at this early stage, there is normally no way to recover. Once an application is fully initialized, it might be a good idea to try to restart it after an error. However, if the restart procedure itself crashes, we must prevent an endless loop from generating tons of useless snapshot workspaces.

Preparation
It is good programming style to avoid bare numbers in code. Instead of writing 1001, for example, we should use a meaningful name for the event.

In a large system you want such names to be constants, so a user cannot change them. That is why they are implemented as niladic functions.
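As a minimal sketch, the event numbers used in this article could be wrapped in niladic functions like the following. The function names are hypothetical and chosen only for illustration; 1002 and 1003 are the events Dyalog associates with the weak and the strong interrupt.

```apl
∇ r←StopVectorEvent
  r←1001            ⍝ raised when execution hits a stop vector
∇

∇ r←WeakInterruptEvent
  r←1002            ⍝ raised by a weak interrupt
∇

∇ r←StrongInterruptEvent
  r←1003            ⍝ raised by a strong interrupt
∇
```

Because a function cannot be reassigned the way a variable can, code that calls StopVectorEvent always gets 1001.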

We also need a user-defined event for restarting the application. Its use is explained below.

According to the help file, users should use the range from 500 to 999 to define their own events.
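A sketch of such a user-defined event, again as a niladic function. Both the name RestartEvent and the number 500 are assumptions; any number from the user-defined range would do.

```apl
∇ r←RestartEvent
  r←500             ⍝ hypothetical choice from the user-defined range 500 to 999
∇
```

Code that wants to request a restart could then raise the event with ⎕SIGNAL RestartEvent.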

Setting ⎕TRAP
⎕TRAP allows us to implement a general mechanism on a global level. For discussion purposes let's assume the following:
 * 1) ⎕LX is set to run the application's main function.
 * 2) This function calls three sub-functions in turn.
 * 3) The first initializes the application: it opens files, interprets an INI file, takes the Windows registry into account, builds the GUI and so forth.
 * 4) The second simply runs the application.
 * 5) The third cleans up: it closes files and says goodbye.
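The assumed structure can be sketched as follows. All function names (Main, Init, Run, Cleanup) are hypothetical placeholders, and the three sub-functions are assumed to be defined elsewhere.

```apl
∇ Main
  Init              ⍝ hypothetical: open files, read the INI file, build the GUI
  Run               ⍝ hypothetical: run the application
  Cleanup           ⍝ hypothetical: close files, say goodbye
∇

⎕LX←'Main'          ⍝ Main is executed when the workspace is loaded
```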

Solving the stop vector problem
Let's start with solving the stop vector problem. The event number associated with a stop vector is 1001: as soon as APL stops on a stop vector, event 1001 is raised. This event can be caught with ⎕TRAP, so we can tell APL to execute (action code E) the expression given as the third item of the trap definition. An expression that resumes execution tells APL to simply ignore stop vectors.
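A trap definition along these lines might look like the sketch below. The literal 1001 is used instead of a named constant only to keep the example self-contained.

```apl
⍝ 1001 is the stop vector event; 'E' executes the expression where the event occurred.
⍝ →⎕LC branches to the line that was about to run, so execution simply resumes.
⎕TRAP←,⊂1001 'E' '→⎕LC'
```

The enclose and ravel make ⎕TRAP a one-item list of trap definitions, so further groups can be added later.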

Preventing users from interrupting an application
The same technique can be used to prevent the user from interrupting the application, accidentally or purposely. Depending on the type of the application it might be a good idea to allow the user to interrupt the application by pressing either the key for the weak or the strong interrupt, to ask the user for confirmation and then to restart the application. This would allow the user to quit a lengthy operation that needs more time than expected.

Here, however, we will use the simple approach of ignoring both interrupts.
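Assuming interrupts are to be ignored in the same way as stop vectors, all three events can share one group. This is a sketch, not necessarily the statement used in the original workspace; note that resuming after a strong interrupt restarts the interrupted line.

```apl
⍝ 1001 stop vector, 1002 weak interrupt, 1003 strong interrupt.
⍝ The parentheses are needed so that the event numbers form one item of the definition.
⎕TRAP←,⊂(1001 1002 1003) 'E' '→⎕LC'
```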

Restarting the application
For reasons explained in a minute we now have to define the “Restart the application” procedure. For this, for the first time we do not use the E action but the C action. The C is short for Cut back. This instructs APL to cut the state indicator back to the level where ⎕TRAP is localized (which is not necessarily where it was set) and to execute the expression in the third item there. However, if ⎕TRAP is not localized at all, i.e. it is global in the workspace, the state indicator is cut back completely and the expression is executed in the workspace. To make this work, the function in which ⎕TRAP is localized must of course contain a label to branch to, or the expression must call a function that returns a valid line number.
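The cut-back mechanism could be sketched like this. The event number 500, the label Restart and the functions Init, Run and Cleanup are all assumptions made for illustration.

```apl
∇ Main;⎕TRAP
  ⎕TRAP←,⊂500 'C' '→Restart'   ⍝ 500: hypothetical restart event
  Init                          ⍝ hypothetical: runs only once
 Restart:Run                    ⍝ a cut back lands here, in Main
  Cleanup
∇
```

Because ⎕TRAP is localized in Main, signalling event 500 anywhere below Main unwinds the stack to Main and branches to the label, so initialization is not repeated.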

Catching Errors
If an unexpected error occurs, we want to execute a particular function to do the hard work. In the trap definition, 0 stands for all events from 1 to 999, while 1000 stands for all events above 1000.

⎕TRAP may contain more than one error catching group. Since the contents of ⎕TRAP are scanned from left to right, a statement will ONLY be executed for an event not processed earlier. That is the reason why we must define the restart event first.
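Putting the groups together in the required order might look like the sketch below. The restart event 500, the label Restart and the handler name HandleError are hypothetical.

```apl
traps←⊂500 'C' '→Restart'               ⍝ restart event first
traps,←⊂(1001 1002 1003) 'E' '→⎕LC'     ⍝ then stop vectors and interrupts
traps,←⊂(0 1000) 'E' 'HandleError'      ⍝ finally everything else
⎕TRAP←traps
```

If the last group came first, it would already match 500 and 1001 to 1003, and the more specific groups would never be reached.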

For example, if a group for event 333 is followed by a group for event 0, event 333 will be caught by the 1st group and NOT by the 2nd, even though 0 stands for “events from 1 to 999”. Only the expression of the 1st group will be executed.
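A statement of this kind could look as follows, with Foo and Goo standing in for two hypothetical handler functions.

```apl
⍝ Event 333 matches the first group, so only Foo is executed for it.
⍝ Every other event from 1 to 999 falls through to the second group and runs Goo.
⎕TRAP←(333 'E' 'Foo')(0 'E' 'Goo')
```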

The error handling function
The function that is called when an unexpected error is trapped should do at least the following:


 * Perhaps neutralize
 * Save the  and   settings
 * Save the snapshot
 * Maybe create an HTML page with general information about the error
 * Ask the user about a restart
 * Either try a restart or shut the application down
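The opening of such a function might be sketched as below. This is only one interpretation of the list: the name HandleError is hypothetical, and it is an assumption that neutralizing means stopping the trap search and that the error details worth keeping are ⎕EN and ⎕DM.

```apl
∇ HandleError;⎕TRAP;en;dm
  ⎕TRAP←,⊂(0 1000) 'S'   ⍝ assumption: stop the trap search so errors in here are not trapped again
  en←⎕EN                  ⍝ event number of the error being handled
  dm←⎕DM                  ⍝ diagnostic message of that error
  ⍝ Remaining steps (not coded here): save the snapshot, report the error,
  ⍝ ask the user, then either restart or shut down.
∇
```

With the S action in a localized ⎕TRAP, an error inside the handler is processed normally instead of calling the handler again, which guards against the endless loop mentioned earlier.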

Developers and others
Of course the error trapping mechanism must distinguish between developers and others. Often it is good enough to check the APL version: use error trapping in case of runtime, otherwise not. If this is not possible, because some or all of the users are running the development version too, you can specify a parameter to tell the application that you are a developer. By default the application can then use error trapping.

Testing Error Trapping
Keep in mind that you want to have an easy opportunity to test the system with error trapping. So you may need another parameter that tells the system that error trapping has to be used. Last but not least, there should be an easy opportunity to let the application crash on purpose. I prefer to have a “developers menu”, which is displayed only to developers. Among other useful commands it offers a “Let's crash” option.

Control Structure
If you use :Trap, keep in mind that ⎕TRAP and :Trap are both taken into account. That means that when both are active, an error listed in the :Trap statement is caught by :Trap, while any other trapped event is caught by the ⎕TRAP setting.
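The interplay can be illustrated with a small sketch. The function name Demo and the choice of errors are assumptions: dividing by zero raises DOMAIN ERROR (event 11), while a right argument of the wrong length raises LENGTH ERROR (event 5).

```apl
∇ r←Demo y;⎕TRAP
  ⎕TRAP←,⊂5 'E' 'r←''caught by ⎕TRAP'' ⋄ →0'   ⍝ handles LENGTH ERROR only
  :Trap 11                                       ⍝ handles DOMAIN ERROR only
      r←1 2÷y
  :Else
      r←'caught by :Trap'
  :EndTrap
∇
```

Calling Demo 0 should take the :Else branch, while Demo 1 2 3 raises an event that :Trap does not list and is therefore left to ⎕TRAP.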

When using :Trap, try to be as specific as possible. For example, trapping all errors around an attempt to tie a file, and creating the file in the error branch, is faulty: it tries to create the file not only if it does not already exist but also if the current user lacks the right to tie it, for example because somebody else has already tied it exclusively. Therefore, it is better to be specific and trap only the error that signals a missing file. The best idea, however, is to check explicitly whether the file already exists. In general it is a good idea to use error trapping only for extraordinary problems.
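A specific trap for the file example might look like this sketch for a component file. The function name OpenOrCreate is hypothetical; 22 is the event number of FILE NAME ERROR, which is what tying a non-existent component file is expected to raise.

```apl
∇ tieNo←OpenOrCreate fileName
  :Trap 22                           ⍝ FILE NAME ERROR only
      tieNo←fileName ⎕FTIE 0         ⍝ 0 asks for the next free tie number
  :Else
      tieNo←fileName ⎕FCREATE 0      ⍝ reached only if the file could not be found
  :EndTrap
∇
```

Any other problem, such as a file that is already tied exclusively, is not trapped here and surfaces as a normal error instead of triggering a pointless attempt to create the file.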

The system function ⎕SIGNAL
Note that an event which is led can be intercepted with   but not If you execute this function: you get this:

Ensure future trouble
A very easy way to create problems in the future is to trap every error and then do nothing about it. This technique is called “silent trapping”. If something goes wrong, do not deal with it and do not tell anybody about it!
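For illustration, silent trapping typically looks something like the following; the function names are hypothetical.

```apl
∇ SilentlyDo
  :Trap 0
      DoSomething     ⍝ hypothetical; any error raised here vanishes without a trace
  :EndTrap
∇
```

There is no :Else branch, no logging and no message, so a failure in DoSomething leaves no evidence at all.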

Switching Error Trapping on or off
When you use error trapping, make sure that you can switch it off on a general level. The easiest way to implement this idea is to make the list of trapped events depend on a flag.

If the flag is true, error trapping is active; if not, nothing is trapped and the code inside the :Trap statement fails at the point where the error occurs. This makes it much easier to debug an application.
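One common way to write such a switch is to replicate the event number by the flag, as in this sketch. The names SafeCall and Process and the fallback result are assumptions.

```apl
∇ r←trapErrors SafeCall arg
  :Trap trapErrors/0          ⍝ 1/0 traps every error; 0/0 is empty, so nothing is trapped
      r←Process arg           ⍝ hypothetical worker function
  :Else
      r←⍬                     ⍝ hypothetical fallback result
  :EndTrap
∇
```

With 0 SafeCall data an error in Process suspends execution on the failing line, which is what a developer usually wants.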

You might need a more sophisticated mechanism for this, because under some circumstances you want to switch off most but not all error trapping statements. For example, if you use a logging mechanism which records every user action for later analysis, the logging code may cause an error itself, for example because the disk holding the log files is full. In such a case it might be inappropriate for the logging code to break the application. Therefore, you might protect this code with :Trap statements of its own.

In such a case it might be a good idea to control the behavior of the application on different levels for code which is really essential in terms of business logic, for example, and for code which is not essential.

But even in such a case the problem should be communicated. I found the idea of a watchdog application very useful, which, among other tasks, listens for UDP telegrams on a particular port. An application in trouble can then send a telegram to the watchdog, telling it about the problem. Using a type of error class, the client can tell the watchdog about the seriousness of the problem, and the watchdog can then decide to simply display it on its GUI or to send an SMS message and/or an email to the admin.

Code
The workspace below contains all the code needed to implement the ideas discussed above.