Last Changed: July 12th, 1997
This is a conversion from Oberon text to HTML, and from German to English. The converter software is still under development, and some features or information may be missing in this converted version. HTML hypertext facilities are not yet active in this document. To exploit the interactive facilities, use Oberon System 3 and the source of this text, available for download using binary ftp as Oberon System 3 archive. The converter from German to English is still under development as well. A previous version is also available for Oberon V4. To access this and other additional material use ftp.
For the convenience of our students, most of this information and the related material is available in German as well.

Introduction to Oberon

The Oberon Programming Language

G. Sawitzki <gs@statlab.uni-heidelberg.de>



13 Timing and Optimization

This is a preliminary version. Please do not circulate. Comments to <gs@statlab.uni-heidelberg.de>
For recent material, see <http://statlab.uni-heidelberg.de/>.

In this chapter, we discuss some aspects of the problem of timing and optimization of programs. More generally, the topic of this chapter is the use of resources by a program.

Time consumption, or more generally consumption of resources, may have several components. Comparison is made easier if we transform consumption of resources to a general scale. Our personal preference is to use human time consumed as a general base scale. On the other hand, resource usage may depend on computational parameters. If we have a model for the computation, comparison is made simpler if we can combine these parameters in a general measure of complexity, usually referred to as operations of some abstract machine. Time consumption is then modelled as operations/speed + setup time. Historical units for speed are MIPS = million instructions per second, where instructions refer to the instruction set of the classical DEC VAX, or FLOPS = floating point operations per second, where "operations" still needs an additional definition.

Other resources can often be converted to human time needed. For example, if you use an internet browser which requires xxx MB of RAM above your usual requirements, add the time you need to work to pay for xxx MB of RAM, plus the "system time" (= time needed to order, pay for and install the RAM) to the cost of the software. If the software is used by a group, add the time for training, configuration and support.

We make two fundamental assumptions. The first is that a program is both developed and used. Timing and optimizing a program has to take both into account. Development and usage may emphasize different aspects. For example, time optimizing a program may be time consuming during the development process, but time saving during application. If a program were used for eternity, the application aspect would prevail. With a real-life program, however, we have a finite scope, and there is a balance between development and application effort. For illustration, and to call attention to secondary expenses, we think of a cost model as
  cost  =  development  +  development testing
    +  maintenance  +  maintenance testing
    +  application    +  application testing
    +  replacement  +  replacement testing
We separated the testing cost for each step to draw attention to the fact that correct performance is not given for free at any stage of software usage. Maintenance cost includes the cost of maintaining subsequent versions of the software. Application testing cost is the expense of checking correct operation in the target environment, i.e. the additional expense of the user to check that the software actually operates correctly after installation. Replacement costs cover the effort to install, and possibly remove, versions of the software. While the development side (development and maintenance) is under the control of the software developers, the application side (application and replacement) usually is subject to speculation. A helpful estimate is to factorize the cost, e.g. as
  application= (usage man time units) * (usage cost per time unit)
    = (nr of users * expected software life time)
    * (expected usages per time unit * mean time per usage)
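For illustration only, the factorization can be evaluated with completely hypothetical figures (all numbers below are invented, not taken from any real project); a minimal sketch in Python:

```python
# Hypothetical cost estimate for the application side (all numbers invented).
nr_of_users = 200
expected_life_time_years = 3
usages_per_year = 250            # roughly once per working day
mean_time_per_usage_hours = 0.5
cost_per_hour = 50.0             # usage cost per time unit

# application = (usage man time units) * (usage cost per time unit)
usage_time_units = (nr_of_users * expected_life_time_years
                    * usages_per_year * mean_time_per_usage_hours)
application_cost = usage_time_units * cost_per_hour
print(application_cost)
```

Even with crude figures, such an estimate shows how quickly the application side dominates the development side for widely used software.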

The second assumption is that any optimization needs to be visible. In particular, we assume a measurement process, and optimizations are judged within the limitation of this measurement process. While development and maintenance resources can (in principle) be observed directly, application and replacement resources can only be estimated or predicted.

We have several levels of complexity. On the top level, we have the task as seen by the user. On the lowest level, we have machine instructions. Ideally, time calculations would add up and be consistent across all levels. In reality, we have sequence effects. One operation on a lower level may change the state of the global system, and this may affect the performance of the next operation. As a first approach, we can use an additive model, but we should always check for violations of this assumption. Caching, pipelining and other effects may force us to use more elaborate models.

Timing and optimization is full of surprises. It is a helpful habit to formulate assumptions explicitly, and to include checks of these assumptions. Two basic assumptions generally are false. Time is not a continuum: on a computer, time has a discrete structure. And time is not consistent: on a computer, two different timers will generally show inconsistent times. Here is a small program to check the timing of your Oberon installation:
  ItOTime.Grain ~
    returns the minimal time difference, over 100 runs
  ItOTime.Clock ~
    returns the number of time units per second
The source code is in ItO/ItOTime.Mod.
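The source is not reproduced here. As an illustration of what a grain measurement does, here is a rough Python sketch (our own approximation, not a transcription of ItOTime.Mod): it reads the clock repeatedly and records the minimal nonzero difference between two successive readings.

```python
import time

def grain(runs=100):
    """Estimate the clock granularity: the minimal nonzero difference
    between two successive clock readings, over a number of runs."""
    smallest = None
    for _ in range(runs):
        t0 = time.time()
        t1 = time.time()
        while t1 == t0:          # spin until the clock advances
            t1 = time.time()
        d = t1 - t0
        if smallest is None or d < smallest:
            smallest = d
    return smallest

print(grain())
```

On most systems the result is far coarser than the unit of the clock value suggests, which is exactly the point of the exercise below.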



Exercises:

Repeat the measurements on the grain size. Is there any variation? Compare with the documentation of your Oberon installation. Is the result plausible?

Is there any variation in the clock measurement? The clock measurement output has an additional line which marks whether a measurement is below or above the overall mean. Do any peculiarities (patterns etc.) occur? Try to formulate your observations on the measurement process explicitly. Can you verify whether your observations are reliable, or whether they just reflect artifacts?


Possibilities for optimization will be discussed first. Timing questions are discussed at the end of this chapter.

Optimization

The main optimization is to avoid a computation at all. This cannot be over-emphasized: if a computation is not necessary, don't do it.

If you have to do a computation, a list of common optimization techniques can be applied. In principle, these optimizations can be done by a compiler. Unfortunately, not all compilers perform these common optimizations, and proper documentation of the optimizations done by individual compilers is rare.

Although the common optimizations can be done automatically, the complexity of the necessary optimization algorithm sometimes forbids extensive optimization on the compiler level. On the other hand, the programmer may have a view of the intent and structure of the program which can be used to call the available optimizations into action. In many cases a clean program design is the most important step towards an effective implementation. In addition to this, a combination of an optimization-aware programming style and automatic optimization by the compiler is helpful. So we give a list of common optimization techniques, and add it as a check list for your programs or compilers at the end of this chapter.

Computation timing is not easy to interpret. The common practice is to represent an optimization gain as the reduction factor (time with optimization) / (time without optimization). However, the time without optimization enters as a norming constant here, which can be affected severely by other implementation issues. The other common choice is to assign an implementation-independent complexity to a problem, and to standardize using this complexity.

The programmer's view of code is generally in terms of statements, and a raw estimate can be derived by adding the statement costs. Hidden costs are introduced by implicit actions. In particular,
- procedure calls may involve an overhead for setting up a procedure frame and for passing parameters.
- control statements, like FOR...DO...END, may involve a control overhead.
- memory allocation, like NEW, may involve housekeeping and garbage collection
- access to components of structured data types, like A[i,j,k], may involve an address calculation (usually the evaluation of a polynomial to get Addr(A) + ((LEN(A,1)*LEN(A,2))*i + LEN(A,2)*j + k)*SIZE(A[i,j,k]))
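The address polynomial can be checked against a flattened representation; a small Python sketch (assuming row-major layout, with element size as a parameter):

```python
# Row-major address calculation for A[i,j,k] with dimensions L0 x L1 x L2:
# offset = ((L1*L2)*i + L2*j + k) * elem_size -- the polynomial the
# compiler has to evaluate on every access.
def offset(i, j, k, L1, L2, elem_size):
    return ((L1 * L2) * i + L2 * j + k) * elem_size

# Check against a flattened list standing in for a 2x3x4 array.
L0, L1, L2 = 2, 3, 4
flat = list(range(L0 * L1 * L2))   # element value == its flat index
i, j, k = 1, 2, 3
assert flat[offset(i, j, k, L1, L2, 1)] == (i * L1 + j) * L2 + k
print(offset(i, j, k, L1, L2, 1))
```

Two multiplications and two additions per access is exactly the hidden cost the text refers to; it is also the cost that the induction-variable technique below removes from loops.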

To compare optimizations, we may refer to a coarser segmentation. We will refer to a basic block as a group of statements which is entered at a unique initial statement and left at a unique statement, with no exit or branching except at the end of the block. This means a block is a "run" of statements which is always executed as a whole and can be treated as a self-contained unit. Since there are no branches in a basic block, all statements in the basic block are executed the same number of times, and time consumption factorizes as
  nr of block executions * block execution time

As a programmer, we have essentially two possibilities for optimization: reduction of operation cost and code motion.

Reduction of operation cost can be achieved by replacing expensive operations or data structures by cheaper ones with equivalent system status. The benefit is obvious, if we do not lose information. Some possibilities are exploited by nearly all compilers. For example, for integers on common CISC processors
  i*2  i+i  ASH(i,1)  are equivalent,
          the latter being the cheapest
  i:=i+k    INC(i,k)  are equivalent,
          the latter being the cheaper
Others are rarely found in compilers, and hand optimization is helpful. For example, we can check that a vector (x,y) has norm zero by
  (x*x)+(y*y) = 0
or
  (x=0)&(y=0)
the latter of course being cheaper (and more stable). But a compiler will rarely be able to make this reduction.
Reduction of operation cost strongly depends on the actual performance of the processor. For example, on common RISC processors both i+i and ASH(i, 1) take one cycle and can be considered equivalent. More advanced processors use one or more instruction pipelines, and the performance may depend on the previous state of the pipeline, making reduction of operation cost a non-trivial task.

Other reductions of operation cost may have a trade-off. For example, we may have the choice of representing a data structure as a linked list or as an array. Using a linked list replaces the polynomial address calculation by a step through a pointer, which may be more time efficient when it comes to higher dimensions. On the other hand, we have to spend the additional memory for a pointer per cell. Again the actual performance depends on the architecture of the processor. If the processor uses a data cache, the cache behaviour of a linked list may degrade drastically if its locality is bad.

Code motion can be used to move operations to places where they are executed less frequently, or at less time critical instances. The most prominent example is to move loop-invariant code outside a loop.

These possibilities can be applied to a range of targets. Simplest of all: constant elimination. The aim is to reduce run time (and possibly code size) by avoiding repeated evaluation of expressions with constant value. It comes in two flavours: expressions which are proper constants can be evaluated at compile time; expressions which are not proper constants, but are invariant during repeated invocations, can be stored in temporary variables to avoid repeated calculations.

- Constant Folding
Expressions consisting of constants only can be evaluated at compile time.
Hand optimization: move to the constant section. Potential drawback: most compilers evaluate constants using the precision of the compile time environment, not the run time environment. In particular, mathematical constants like e and pi may have different values at run time than at compile time if you use a cross-platform environment.
Implementation: generally implemented for INTEGER and BOOLEAN constants, rarely for other cases.
Floating point constant folding is very tricky since the order of evaluation is critical and on some architectures even the use of registers or memory locations might change the result. Compilers usually do not do it.

- Common Subexpression Elimination
Common subexpressions are parts of a block which have a matching form and do not change their values between computations.
If the same expression occurs repeatedly, the program can be simplified.
Hand optimization: Introduce a temporary variable to hold the common subexpression.
Implementation: Used to reduce space (and hence module loading time) in OMI-based compilers.
One important point is the following: compilers might optimize common subexpressions only if the subexpression does not contain procedure calls. E.g. in the following example the compiler might translate statement (1) into statement (2):

(1)  IF x > pi/180*45 THEN x := pi/180*45 END;
(2)  t := pi/180*45; IF x > t THEN x := t END;

However, if the expression contains procedure calls, then the compiler does not optimize common subexpressions due to possible side effects. In the following example the programmer has to perform the common subexpression elimination by hand:

(1)  IF x > f(y, z) THEN x := f(y, z) END;
(2)  t := f(y, z); IF x > t THEN x := t END;
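The gain can be made visible by counting invocations; a small Python sketch (the function f and its cost are hypothetical stand-ins):

```python
calls = 0

def f(y, z):
    """Stand-in for an expensive procedure; counts its invocations."""
    global calls
    calls += 1
    return y * z

def clamp_unoptimized(x, y, z):
    if x > f(y, z):
        x = f(y, z)        # f evaluated a second time
    return x

def clamp_optimized(x, y, z):
    t = f(y, z)            # common subexpression eliminated by hand
    if x > t:
        x = t
    return x

calls = 0
clamp_unoptimized(10, 2, 3)
unopt = calls
calls = 0
clamp_optimized(10, 2, 3)
opt = calls
print(unopt, opt)
```

Note that the two variants are only equivalent because f here has no side effects besides the counter; with genuine side effects, the hand optimization would change the program's meaning, which is exactly why the compiler refuses to do it.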


- Loop invariant code motion
This is a special case of common subexpressions. If a subexpression does not change its value during repeated evaluations of a loop, it can be moved before the loop. For example
  WHILE i > limit-2 DO ... END;
can be changed to
  t:=limit-2; WHILE i>t DO... END;
if limit is a loop invariant. Identification of loop invariants, reduction of induction variables and loop-invariant code motion are usually used in combination.

- Induction variables
Induction variables are variables in loops which keep a fixed relation to the loop index. Since typically these variables are evaluated very often, reduction of operation cost can be very effective. A first attempt may be to reduce operation cost by the introduction of new auxiliary loop constants. If that is not feasible, the next attempt is to keep the computation time for induction variables at a minimal level, e.g. by reordering nested loops, that is by using code motion.
For an example, consider the code fragment
  i := 0;
  WHILE i < j DO
    a[i] := 0;
    INC(i)
  END;

A straightforward compiler generates:

  i := 0;
  WHILE i < j DO
    adr := baseadr(a);
    idx := size(basetyp)*i;
    Mem[adr+idx] := 0;
    i := i + 1
  END;

A compiler with loop invariant code motion generates:

  i := 0; adr := baseadr(a);
  WHILE i < j DO
    idx := size(basetyp)*i;
    Mem[adr+idx] := 0;
    i := i + 1
  END

A compiler with loop invariant code motion and induction variables generates:

  i := 0; adr := baseadr(a);
  WHILE i < j DO
    Mem[adr] := 0;
    i := i + 1;
    adr := adr + size(basetyp)
  END



Techniques for elimination of constants may be obstructed by aliasing, that is by different names used for the same logical entity. A common strategy to avoid part of this problem is propagation of copies. This strategy means using the oldest reference path if multiple reference paths are possible. It is a strategy to find "hidden" constants. For example,
  x:=y;
  a[z]:=b;
  a[w]:=x;
is transformed to
  x:=y;
  a[z]:=b;
  a[w]:=y;

The next group of strategies addresses proper use of the control flow of a program.

- Range-tracking and dead check removal
Range tracking means to move range checks for variables or expressions from a lower to a more global level. For example, access to an array in a loop does not need a range check at every step if we can guarantee that the maximum and minimum indices are within the proper bounds. More generally, dead check removal means to remove check code which can be proven to never fail. For example, assignments do not need an overflow check if the global minimum and maximum is under strict control.
If there is compiler control on a statement level, appropriate compiler options can be used to avoid compiler-generated checks in safe code regions. If the compiler can only be controlled on a module level, this asks for a clean separation of "safe" modules from common ones.

- Pre-calculation of array access
Reduces the operation cost for the special case of array access. For example, repeated access of the form [variable+constant] may be removed by either shifting the array one position, or using a redefined variable. In general, a change of the data structure should be considered in these instances as an alternative, for example using several one-dimensional arrays instead of two-dimensional arrays, or storing arrays in transposed form.

- Dead code elimination
Code fragments which can never be reached can be eliminated. If the code is guarded by a constant expression, this is usually done by the compiler, as is the elimination of procedures which are never called. If the code is a part of a procedure guarded by a variable expression, hand removal may be necessary. Dead code often is generated by propagation of constants ("IF debugging THEN ... END;").
With cached memory access, code elimination may have the side effect to improve locality, thus reducing cache miss rate and leading to better time performance.

- Optimization of Boolean expressions
This is achieved by arranging Boolean expressions to stop evaluation as soon as the result is established. In Oberon, short-circuit evaluation is done by definition. In other languages a careful use of short-circuit operators or nested IF statements may be necessary.
Care has to be taken that no intended side effects are broken, as when the evaluation of one part of the expression changes intentionally some parameter which is used later on.

- Tabulation
Pre-calculation and storage of tabulated values may be used to reduce time in search and decision situations. This can rarely be achieved in an automatic way. Of course tabulation only shifts the problem, and there is a memory/speed trade-off which has to be taken into account.
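As a minimal sketch of tabulation (the table of sines for whole degrees is our own example, not from the ItO material):

```python
import math

# Tabulation: precompute sin for whole degrees once, then look values up.
# The table trades 360 stored floats for repeated calls to math.sin --
# the memory/speed trade-off mentioned above.
SIN_TABLE = [math.sin(math.radians(d)) for d in range(360)]

def sin_deg(d):
    """Sine of an integer number of degrees, by table lookup."""
    return SIN_TABLE[d % 360]

print(sin_deg(30))
```

Whether the lookup actually wins depends on the cost of the indexing and, on cached architectures, on whether the table stays resident in the cache.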

- Special casing and use of recursive structures
A more dynamic variant of tabulation. Care should be taken to avoid implicit use of recursive structures. A tail-recursive procedure can always be replaced by an iteration, usually with less overhead.

- Loop unrolling
Loop unrolling reduces the number of loop passes by repeating the loop code inline. This allows adapting to particularities of processors (e.g. using one LONGINT access instead of two INTEGER accesses). High-quality compilers may provide loop unrolling, but it is still not commonplace.

- Software pipelining
The goal of software pipelining is to overlap the execution of several loop iterations, just like a processor pipeline overlaps the execution of several instructions. Software pipelining is often considered superior to loop unrolling as it allows a higher degree of overlap. The "provisional means" algorithm to calculate mean and variance is an example to reduce two or more loops to one.
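The "provisional means" algorithm itself is not spelled out in this chapter; the following Python sketch shows the usual one-pass form (often attributed to Welford), which fuses what would naively be two loops over the data into one:

```python
def provisional_means(xs):
    """One-pass ('provisional means') computation of mean and sample
    variance, fusing the two naive loops (mean, then squared deviations)."""
    mean = 0.0
    m2 = 0.0                      # sum of squared deviations so far
    for n, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / n         # provisional mean after n items
        m2 += delta * (x - mean)
    var = m2 / (len(xs) - 1) if len(xs) > 1 else 0.0
    return mean, var

print(provisional_means([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
```

Besides saving a loop, the one-pass form is also numerically more stable than the textbook formula sum(x^2)/n - mean^2.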

- Variable placement
This has been an important technique for CPUs with a small number of registers, and its relevance for programmers decreased with the use of RISC processors and caches. Some languages and compilers allow programmer hints for variable placement, such as reserving registers for variables. Variable placement and register allocation may be done more effectively by a compiler, using elaborate information about the relative speed of register, cache and memory access.

- Algebraic/analytic reduction (Strength reduction)
This is one of the more powerful forms of reduction of operation costs. If we can replace an expression by an equivalent one whose evaluation consumes less time, we contribute to time optimization.
The most trivial of these reductions, such as removal of operations on neutral elements (x+0=0+x=x), are done by most compilers. Others require programmer support. Example: to get equidistant points on a circle, we can use
  fak := 2 * pi / n;
  FOR i := 1 TO n DO
    NodeTable[i].Posx := -Math.Sin(i * fak);
    NodeTable[i].Posy := Math.Cos(i * fak)
  END;
We can instead use
  fak := 2 * pi / n;
  nsin:=Math.Sin(fak);
  ncos:=Math.Cos(fak);
  x:=0;y:=1;
  FOR i := 1 TO n DO
    NodeTable[i].Posx := ncos*x-nsin*y;
    NodeTable[i].Posy := nsin*x+ncos*y;
    x:=NodeTable[i].Posx; y:=NodeTable[i].Posy
  END;
thus avoiding repeated calculations of sin and cos at the expense of a few extra multiplications - generally a cheaper operation. On the other hand, roundoff errors may differ between the two implementations and should be taken into account.
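The equivalence of the two variants can be checked numerically; a Python sketch of the rotation recurrence, started from (-sin 0, cos 0) = (0, 1):

```python
import math

# Rotation recurrence for equidistant points on a circle: starting from
# (x, y) = (-sin 0, cos 0) = (0, 1), one rotation by fak per step yields
# (-sin(i*fak), cos(i*fak)) without calling sin/cos inside the loop.
n = 12
fak = 2 * math.pi / n
nsin, ncos = math.sin(fak), math.cos(fak)
x, y = 0.0, 1.0
max_err = 0.0
for i in range(1, n + 1):
    x, y = ncos * x - nsin * y, nsin * x + ncos * y
    max_err = max(max_err,
                  abs(x - (-math.sin(i * fak))),
                  abs(y - math.cos(i * fak)))
print(max_err)   # accumulated roundoff, tiny for small n
```

For large n the roundoff of the recurrence accumulates with each step, while the direct computation does not; this is the trade-off mentioned above.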

To conclude, we should mention types of optimization which are usually exclusive for the compiler. It may make sense to optimize a compiler for a certain hardware architecture or implementation. On this level, the compiler can try for

- Peephole optimization: local machine level variant of reduction of operation cost

- Instruction scheduling: arrange the instruction sequence to match the architecture of CPUs with independent execution units, avoiding interlocks and wait cycles

- Cache optimizations: arrange data and code access so as to make the most efficient use of any caches available.

Timing

At first look, timing is the simple process of getting the time needed for an operation. Unfortunately, time measurement is not a smooth process, and the granularity of computer time adds some difficulty. Moreover, most operations are not directly observable, which makes timing more complicated. Finally, each measurement changes the process under observation, and we have to work to separate the effects of the timing from those of the process under investigation.

Timing can be done in a more or less obtrusive way. The most obtrusive timing is interrupt timing: you start a clock process which generates interrupts at fixed intervals. At interrupt time, you check the program counter, look up the program fragment the program counter is in, and increase a corresponding hit counter. This requires implementation of the timing process on a rather low (near-system) level by the programmer. For the application programmer/user, interrupt timing seems convenient, since it does not require code modification. Unfortunately interrupt timing changes the run time behaviour of the program, and it is nearly impossible to separate effects of the measurement process from the proper timing of the process under investigation. Artifacts may be drastic in particular on modern architectures, where cache conditions may be critical for the performance of a program, and interrupt timing may distort these cache conditions. A slightly less obtrusive timing can be achieved by a profiler which instruments the source code with timing instructions at the beginning and end of each procedure. The measurement process still distorts the timing behaviour, but less drastically than interrupt timing. We will concentrate on application level timing, and discuss the necessary error corrections for this case. The general rules can (and should) be applied to profile timing as well (an implementation is available as gsProfiler).

Performance is usually reported as operations per time, so that bigger means better. For a critical analysis it is more convenient to use the reciprocal scale, time per operation. Ultimately we want to compare times needed, and assessment of error propagation is simpler on this scale. So we calculate coefficients of variation on the time scale, and any error bounds for speed are calculated on the time scale and then converted to speed.

As an indicator of the quality of a measurement, we use the relative error, or coefficient of variation
  stddev of estimate / true value of parameter
which indicates the relative precision if the estimator is unbiased. Of course, since we do not know the true value offhand, we have to resort to estimators like
  estimated stddev of estimate / est. mean value of estimate.
We should not start a discussion unless we can guarantee a coefficient of variation of less than 10%, and we should get below 1% if we want to get serious.
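A minimal sketch of the estimated coefficient of variation for a mean of repeated timings (the timing figures below are invented):

```python
import math

def coeff_of_variation(samples):
    """Estimated coefficient of variation of the estimated mean:
    (estimated stddev of the estimate) / (estimated mean)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    stddev_of_mean = math.sqrt(var / n)   # stddev of the mean shrinks as 1/sqrt(n)
    return stddev_of_mean / mean

# Ten hypothetical timings (ms); is the 10% threshold met?
timings = [10.1, 9.9, 10.3, 10.0, 9.8, 10.2, 10.1, 9.9, 10.0, 10.2]
print(coeff_of_variation(timings))
```

For these well-behaved figures the coefficient is well below 1%; real timing series are rarely this tame.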

To get a usable description, we have to introduce some formalization. We assume that the compute process to be timed takes some time
  C  compute process time
If some preparation or cleanup is needed, we collect this as
  P  preparation and clean up time
We can choose the number of repetitions, and may have the freedom to run the preparation without doing the proper process calculation. With N repetitions of the compute process, preparation included, and M additional repetitions of the preparation process, we have measurement for times
  TC = sum(i=1..N) (Ci + Pi) + err
  TP = sum(i=1..M) Pi + err
where err is the error term reflecting variations in the process to be timed. The measurements will not give the true time of the processes, but may be affected by some additional measurement error. We denote the measurements by T'C, T'P respectively.

In theory, we will assume that we take samples under equivalent conditions, so the systematic terms are constant and these errors are independent, and the error terms in each row have a common identical distribution. In practice, we have to take care to establish these conditions. If we do not have additional information or measurements on the preparation process, we cannot separate timing of the preparation and the proper computing process, and all we can achieve is a joint estimate for both.

If we have measurements on the preparation process, we can use
  C^ = (T'C - N*(T'P/M)) / N = T'C/N - T'P/M
as an estimate for C. Writing down this estimate is a straightforward approach. Work has to be done to judge the quality of this estimator. How reliable are the figures it produces? If we can establish stochastic independence of the timings for C and P, the variance decomposes as
  Var(C^) = Var(T'C)/N^2 + Var(T'P)/M^2.
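A worked instance with hypothetical numbers (all figures invented, in arbitrary time units):

```python
# Plug-in estimate C^ = T'C/N - T'P/M and its variance under independence.
N, M = 100, 50
TC = 250.0        # measured total of N runs of compute + preparation
TP = 50.0         # measured total of M runs of preparation only
var_TC = 4.0      # estimated variance of the total T'C
var_TP = 1.0      # estimated variance of the total T'P

C_hat = TC / N - TP / M                      # 2.5 - 1.0 per-run estimate
var_C_hat = var_TC / N**2 + var_TP / M**2    # variance decomposition
print(C_hat, var_C_hat)
```

Note how the division by N^2 and M^2 rewards large repetition counts: the totals carry most of the measurement noise, the per-run estimate much less.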

Clock discretization

Since reading and registering the time changes the time behaviour, we try to keep the proper timing and registering operations to a minimum and include them in the preparation and clean up time P, or ignore them. For any time T we want to take, the best we can get is a time reading T' = t1-t0 within the precision of the clock.


The time reading error T'-T = d0-d1 is best modelled as a stochastic error. Let delta be the precision of the clock. The errors d0 and d1 are related to T in a fixed relation: d1 = (d0+T) MOD delta. This relation can be understood completely. For simplicity, we will scale by delta, that is we assume delta=1 and T = k + epsilon for some k=0,... and epsilon<1. For an ideal measurement using a time reading discretized by delta, epsilon is the unavoidable discretization error. With a process starting at a random offset d0 after the previous time signal, if d0 < 1-epsilon, T'=k and the measurement error is -epsilon. If d0 > 1-epsilon, T+d0 > k+1, so T'=k+1 and the measurement error is 1-epsilon. So given epsilon we have this error dependency on d0:



Now d0 reflects the starting time of our measurement. If we can assume a starting time independent of the clock cycle, we are led to a uniform distribution for d0. If the execution length is modelled by a distribution and does not depend on the starting time, we have independence between epsilon and d0. If we can assume a uniform distribution for the "true discretization error" epsilon, this gives a triangular distribution for the observable discretization error d0-d1 on [-delta, delta], with mean = 0 and variance (1/6) delta^2. We denote the mean of the discretization error d0-d1 by E(d) and the variance by Var(d).

We can of course try to avoid the initial discretization offset and start the measurement after synchronizing with the clock, forcing d0 = 0. This may however be counterproductive. We end up with the conditional distribution given d0=0, which is uniform with a worse bias of E(d) = delta/2 but a more favorable variance of Var(d) = (1/12) delta^2.
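The triangular distribution and its variance delta^2/6 can be verified by simulation; a Python sketch (with delta scaled to 1, as above):

```python
import random

# Simulate the observable discretization error for delta = 1:
# random start offset d0 ~ U[0,1), true time T = k + eps with eps ~ U[0,1).
# The measurement error is -eps if d0 < 1-eps, else 1-eps; it should be
# triangular on [-1, 1] with mean 0 and variance 1/6.
random.seed(1)
n = 200_000
total = 0.0
total_sq = 0.0
for _ in range(n):
    eps = random.random()
    d0 = random.random()
    err = (1 - eps) if d0 > 1 - eps else -eps
    total += err
    total_sq += err * err

mean = total / n
var = total_sq / n
print(mean, var)   # close to 0 and 1/6
```

The same simulation with d0 forced to 0 reproduces the uniform case: bias 1/2 in magnitude, variance 1/12.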

Immediate timing precision

From observations on T' we get an estimated mean E(T') and estimated variance V(T'). Estimating the mean of T by E(T'), we write T as T'-d. We can estimate the variance of T' from observations, and we have the true variance of d, but unfortunately T' and d are not independent due to the common (unobservable) term epsilon. So for the variance of T, we have to use a conservative estimate
  V(T) = (Std(T') + Std(d))^2
where Std(T'), Std(d) are estimates of the standard deviations of T' resp. d. If we estimate a coefficient of variation from this, we must take into account that we do not know the true mean of T. We know the true mean E(d), and for practical purposes where we use the coefficient of variation to derive confidence intervals, we can take an adjusted estimate of V(T') + Var(d) for Var(T) to give
  V(T)^(1/2) / (E(T') - E(d))
as an estimate for the coefficient of variation, and thus as an indicator of the relative precision. If we have to guarantee the estimate of the coefficient of variation itself, we would take a more conservative estimate for the mean of T, yielding
  V(T)^(1/2) / (E(T') - delta)
as an estimate for the coefficient of variation.

If we can chain execution of operations, we effectively increase the time of execution from T' to an order of N*T' without adding additional time truncation error. This leads to a coefficient of variation which combines the usual sample size effect with a deflation of the discretization error, leading to an improved estimator on the original scale.

Scaffold timing precision

More often, we have to prepare an initial setup before we can do the proper calculation: we build up a "scaffold" to prepare for the real timing. So the time T=TC is a sum TC = C + P with empirical timing T'C. We will usually have the possibility to time the preparation without doing the proper computation, giving a timing T'P for P, but we will not be able to time C without preparation. A plug-in estimator is to estimate C by C^ := T'C - T'P. The above considerations apply to T'C and T'P. To judge the quality of C^, we try to avoid interaction of the measurement processes for TC and TP. As estimate, we plug in the adjusted estimators for the variance with the independence assumption as above, yielding
  V(C) = V(TC) + V(TP)
and use
  V(C)^(1/2) / C^
as an estimate for the coefficient of variation, or
  V(C)^(1/2) / (C^ - 2*delta)
as a conservative estimate for the relative precision.

If we can chain execution of operations, we now have a choice. We can, but need not, use the same number of repetitions for the measurement of the scaffold P as for the proper measurement with target P+C. We can use N repetitions for the full computation and M repetitions for the setup timing only, yielding
  E(TC) = N*(C + P) + E(d)
  E(TP) = M*P + E(d)
Assuming independence of the measurement series, if VarC and VarP are the original (single execution) variances of the computing time, and EC, EP the corresponding means, this leads to
  Var(C^) = (N*VarC + Var(d))/N^2 + (M*VarP + Var(d))/M^2  (*)
consuming an expected compute time
  ET = E(TC) + E(TP)                          (**)
where we have to choose N and M to meet our time constraints and resources while meeting the target. We can either minimize (*) subject to ET <= Tlimit, or we can minimize (**) while keeping the variance below a requested limit. Either way we have to balance variance reduction for T'C and T'P, compensating for the discretization aspect. This optimization problem can be solved directly, but the simple rule to keep in mind is that the total variance reduction contribution is proportional to the variance of the components, with weight proportional to 1/effort attributed to the components.

If the setup is a major source of variance, we should consider an approximate decomposition VarP ≈ VarC + VarP-C, giving
  Var(C^) ≈ (N·VarC + Var(d))/N^2 + (M·VarC + M·VarP-C + Var(d))/M^2
    = VarC·(1/N + 1/M) + Var(d)·(1/N^2 + 1/M^2) + VarP-C/M
which shows that both N and M contribute to the reduction of the VarC contribution, while only M reduces the VarP-C contribution.

Immediate timing revisited

Looking again at the immediate timing, we see that it actually has an implicit scaffold. We can treat the setup needed to do the proper timing as a preparation and apply the rules for scaffold timing.
So if timing creates considerable overhead, we can reduce the variance of the estimates for our process to be timed by running additional empty timing runs and correct for the timing overhead as above.

Real World Timing

In the examples above, we used coefficients of variation as an informative measure of relative precision. This requires knowledge of mean and variance of the target variable, and of course both are not known a priori but must be estimated from observations, thus introducing an additional source of error.

We can stay with a naive application of coefficients of variation, and be pleased with small interval estimates. But we run the risk of being led astray by random variation, discussing seeming timing effects which are but an artifact introduced by incomplete knowledge of the target mean and variance.

To avoid being led astray by random variation, we have to use confidence intervals. The number of independent observations used to determine the variance estimate is critical for the reliability. A strict analysis and a proper model may be too much of an effort. Instead, we use rules of thumb to get a first approximation. Assuming a Gaussian distribution as an approximate error distribution, and allowing for a chance of 1 in 10 to be out of bounds, i.e. a confidence level of 90%, we can use the table of the t-distribution to get approximate confidence intervals. With k samples to estimate the variance (k-1 degrees of freedom) we catch the true parameter with 90% (99%) confidence using
  confidence interval
    = estimated mean ± t · estimated standard deviation
with
  k     t(90%)   t(99%)
  2     6.3      63.7
  3     3        10
  6     2        3.5
  16    1.75     2.9
  ..    1.65     2.58
As a rule of thumb, we can use these figures as factors to get confidence intervals from our estimated mean and standard deviation. To be more precise, we can use exact quantiles from the t distribution as factors.
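The rule of thumb can be sketched as follows (illustrative Python; the function name is made up, and we take the standard deviation of the mean of k repetitions, i.e. s/sqrt(k), as the relevant scale):

```python
import statistics

# Rule-of-thumb t factors from the table above, 90% confidence,
# indexed by the number k of samples used to estimate the variance.
T90 = {2: 6.3, 3: 3.0, 6: 2.0, 16: 1.75}

def rough_interval(samples, t_factor):
    """Confidence interval: mean ± t · standard deviation of the mean."""
    k = len(samples)
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / k ** 0.5  # standard deviation of the mean
    return mean - t_factor * sem, mean + t_factor * sem
```

For example, with the three observations 10.0, 10.2, 9.8 and the factor t(90%) = 3 for k = 3, the interval covers the sample mean 10.0 with a half-width of about 0.35.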
We did not model the exact dependence of the error contributions, but made some simplifying plausible assumptions, and we are assuming a Gaussian distribution as a convenient smooth approximation. So we will not have exact confidence intervals.
But we have left vague approximations and reached explicit statements: the calculated intervals should cover the true (unobservable) time with guaranteed confidence of 90% (99%). This can be used to derive predictions on observable timings. So if we have additional timing resources, we can repeat the timing process and check the model assumptions and see whether we can improve the timing process.

A step by step illustration

Most timing routines will have one or two groups of three-line output.
The first of these lines gives confidence intervals based on the assumption that you have an exact (continuous time) measurement and exact knowledge of the error variance. This is the simplest approximation, but the most error-prone.
The second line corrects for the discretization error. It will give larger intervals than the first one, but these are more reliable.
The third line additionally takes into account that you do not have exact knowledge of the error variance. Instead, the error variance is estimated from observations. Again, the assumption of an approximate Gaussian distribution is used. If the information used to estimate the error variance is increased, the interval limits shown here will converge to the Gaussian approximation given in the second line. The figures given in the third line can be used as conservative bounds for practical use.

Timing gets more complicated if we move to smaller times. The extreme case is to time an empty loop, i.e. to time the timing overhead.
An initial solution is
  ItOTime.Nothing
To illustrate the various choices, ItOTime.Nothing gives an abundant report. In a first table, the plain measurements and elementary derived statistics are listed.
Next, the time statistic is evaluated. Confidence intervals are given for three variants: blue-eyed statistics assuming a plain Gaussian model, a Gaussian model with error correction for discretization, and a conservative correction adding a worst-case correction for the mean and using studentized confidence intervals.
As a complement, these results are translated to speed scale.
Next, the speed statistics (not recommended) are evaluated. As for the time scale, we give confidence intervals for blue-eyed statistics assuming a plain Gaussian model, for a Gaussian model with error correction for discretization, and for a conservative correction adding a worst-case correction for the mean. As a complement, these results are translated to time scale.
The recommended variant is to use a time based model with variance correction, using derived intervals for speed estimation.
The output is not corrected for artefacts. So if the Gaussian approximation leads to unreasonable results, you will see garbage in the output.

ItOTime.Nothing is a parametrized command so that we can iterate executions.
We can use it as
  ItOTime.Nothing
    <minimal nr of time grains> <nr of repeated runs>
e.g.
  ItOTime.Nothing 2 5
  ItOTime.Nothing 4 5
  ItOTime.Nothing 8 5
  ItOTime.Nothing 16 5
to see how variance is reduced if we move away from the initial time discretization.

Given limited real-world resources, we have a choice of either fixing a time span and taking observations until the time is used up, or fixing a number of observations and taking the time when the observations are finished.
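The two variants can be sketched as follows (illustrative Python, with made-up function names):

```python
import time

def fixed_count(f, n):
    """Take exactly n observations and record the elapsed time."""
    start = time.perf_counter()
    for _ in range(n):
        f()
    return n, time.perf_counter() - start

def fixed_time(f, budget):
    """Take observations until the time budget (in seconds) is used up."""
    n = 0
    start = time.perf_counter()
    while time.perf_counter() - start < budget:
        f()
        n += 1
    return n, time.perf_counter() - start
```

With a fixed count the elapsed time is random; with a fixed time span the number of observations is random. Either way, one of the two quantities must be treated as an outcome of the experiment.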

The simple timing loop consists of several components
  - increasing a counter
  - reading the time
  - checking limits
  - enclosing control structure.
To get an idea about the relative contribution, we can unroll the loop. For example, we still use a counter increment of 1, but we read the time and check the limits only in 1 of 8 steps.
  ItOTime.Nothing8
implements this variant, with parametrizations as above
e.g.
  ItOTime.Nothing8 2 5
  ItOTime.Nothing8 4 5
  ItOTime.Nothing8 8 5
  ItOTime.Nothing8 16 5
The example illustrates the large contribution from reading the time. So if we can avoid it, we should do so.
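The unrolling idea behind ItOTime.Nothing8 can be sketched in Python as well (made-up function names; the counter still advances by 1 per step, but the clock is read only once per 8 steps):

```python
import time

def timed_loop(budget):
    """The plain timing loop: count empty iterations until the budget
    (in seconds) expires, reading the clock in every iteration."""
    n = 0
    start = time.perf_counter()
    while time.perf_counter() - start < budget:
        n += 1
    return n

def timed_loop8(budget):
    """The unrolled variant: the clock is read and the limit checked
    only once per 8 counter increments."""
    n = 0
    start = time.perf_counter()
    while time.perf_counter() - start < budget:
        for _ in range(8):
            n += 1
    return n
```

Comparing the counts reached in the same budget gives a rough idea of how much of the plain loop is spent reading the clock.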

To get a time measurement with prescribed precision, we can use a compound strategy. Here is one strategy, aiming to get a measurement with relative error (coefficient of variation) of alpha=1%.
-   Get an initial timing.
  If the process is above time resolution, take
  just one measurement.
  If it is below time resolution, find a sufficient
  measurement count.
-   Using this measurement count, get an initial
  estimate for the mean and variance of the speed, and
  the time per execution.
A verbose example is
  ItOTime.SomeThing
You can restrict the total time used by adding a time limit [in seconds], such as
  ItOTime.SomeThing 1 ~ 1 second. Not yet implemented
ItOTime.SomeThing tries to do a timing within this limit (but it does not guarantee it). If you add a second parameter, it will be used as a limit for the relative precision, for example
  ItOTime.SomeThing 60 0.01
      ~ 1% precision or 60 second. Not yet implemented
For the second form, the time limit will usually serve as an emergency exit to avoid too lengthy calculations.
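The first two steps of the compound strategy can be sketched as follows (illustrative Python, not the ItOTime.SomeThing implementation; function names and the default grain are made up):

```python
import statistics
import time

def find_count(f, grain, min_grains=2):
    """Double the repetition count until one chained run covers at
    least min_grains clock grains of size `grain` (in seconds)."""
    n = 1
    while True:
        start = time.perf_counter()
        for _ in range(n):
            f()
        if time.perf_counter() - start >= min_grains * grain:
            return n
        n *= 2

def initial_estimate(f, grain=1e-3, runs=5):
    """Steps 1 and 2 of the strategy: find a sufficient measurement
    count, then estimate the mean time per execution and its
    coefficient of variation from repeated chained runs."""
    n = find_count(f, grain)
    ts = []
    for _ in range(runs):
        start = time.perf_counter()
        for _ in range(n):
            f()
        ts.append((time.perf_counter() - start) / n)
    mean = statistics.mean(ts)
    cv = statistics.stdev(ts) / mean
    return mean, cv, n
```

A fuller implementation would then increase the number of runs until the coefficient of variation drops below the requested alpha, or the time limit is reached.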



Exercises:
A Givens rotation is used to rotate a general vector onto the x axis. It is used as an elementary step in solving linear equations.
The file ItO/ItODROTG.Mod contains two variants of the Givens rotation, taken from the BLAS system (BLAS = Basic Linear Algebra Subroutines).
DROTGF is machine-translated from the original FORTRAN source. DROTGO is a manual translation which should be equivalent to DROTGF. Compare these implementations. Use a random number generator to time both routines with random arguments DA, DB and compare the timings. Try to improve DROTGF.
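For reference, what a Givens rotation computes can be sketched as follows (a plain textbook version in Python; the real BLAS DROTG additionally computes a reconstruction parameter z and handles scaling and signs differently):

```python
import math

def drotg(da, db):
    """Construct a Givens rotation (c, s) and the length r so that
    c*da + s*db = r and -s*da + c*db = 0, i.e. the rotation maps
    the vector (da, db) onto the x axis."""
    if db == 0.0:
        return 1.0, 0.0, da
    r = math.hypot(da, db)  # sqrt(da^2 + db^2), avoiding overflow
    return da / r, db / r, r
```

For example, `drotg(3.0, 4.0)` yields c = 0.6, s = 0.8, r = 5.0, and the rotated second component is zero.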





Introduction to the Oberon programming language. ItO/Ch13.Text
gs (c) G. Sawitzki, StatLab Heidelberg
<http://statlab.uni-heidelberg.de/projects/oberon/intro/>





