Home
Up
Intro
Contents
Chapter
1
2
3
4
5
6
7
8
9
10
Design
Assert
Timing
EBNF
Report
Pas
Last Changed: July 12th, 1997
This is a conversion from Oberon text to HTML, and from German to English. The converter software is still under development,
and some features or information may be missing in this converted version.
HTML hypertext facilities are not yet active in this document.
To exploit the interactive facilities, use Oberon System 3 and the source of this text,
available for download using binary ftp as Oberon System 3 archive.
The converter from German to English is still under development as well.
A previous version is also available for Oberon V4.
To access this and other additional material use
ftp.
For the convenience of our students, most of this information and the related material is available
in German as well.
Introduction to Oberon
The Oberon Programming Language
13 Timing and Optimization
This is a preliminary version. Please do not circulate. Comments to <gs@statlab.uni-heidelberg.de>
For recent material, see <http://statlab.uni-heidelberg.de/>.
In this chapter, we discuss some aspects of the problem of timing and optimization
of program. More generally, the topic of this chapter is the usage of resources
used by a program.
Time consumption, or more generally consumption of resources, may have several
components. Comparison is made easier, if we transform consumption of resources
to a general scale. Our personal preference is to use human time consumed
as a general base scale. On the other side, resource usage may depend on
computational parameters. If we have a model for the computation, comparison
is made simpler if we can combine these parameters in a general measure of
complexity, usually referred to as operations of some abstract machine. Time
consumption is then modelled as operations*speed + setup time. Historical
units for speed are MIPS= million instructions per second, where instructions
refer to the instruction set of the classical DEC VAX, or FLOPS= floating
point operations per second, where "operations" still needs and additional
definition.
Other resources can often be converted to human time needed. For example,
if you use an internet browser which requires xxx MB of RAM above your usual
requirements, add the time you need to work to pay xxx MB of RAM, plus the
"system time"(=time needed to order, pay and install the RAM) to the cost
of the software. If the software is used for a group, add the time for training,
configuration and support.
We make two fundamental assumptions. The first is that a program is developed
and used. Timing and optimizing a program has to take into account both.
Sometimes, development and usage may lead to different aspects. For example,
time optimizing a program may be time consuming during the development process,
but time saving during application. If a program is used for eternity, the
application aspect will prevail. With a real life program however we have
a finite scope, and there is a balance between development and application
effort. For illustration, and to call attention to secondary expenses, we
think of a cost model as
cost = development + development
testing
+ maintenance + maintenance
testing
+ application + application
testing
+ replacement + replacement
testing
We separated testing cost for each step to draw the attention to the fact
that correct performance is not given for free at any stage of the software
usage. Maintenance cost includes the cost of maintaining subsequent versions
of the software. Application testing cost is the expense to check correct
operation in the target environment, i.e. the additional expense of the user
to check that a software actually operates correctly after installation.
Replacement costs cover for the effort to install, and possibly remove versions
of the software. While the development side (development and maintenance)
is under control of the software developers, the application side (application
and replacement) usually is subject to speculation. A helpful estimate is
to factorize the cost e.g. as
application= (usage man time units)
* (usage cost per time unit)
= (nr of users *
expected software life time)
* (expected usages
per time unit * mean time per usage)
The second assumption is that any optimization needs to be visible. In particular,
we assume a measurement process, and optimizations are judged within the
limitation of this measurement process. While development and maintenance
resources can (in principle) be observed directly, application and replacement
resources can only be estimated or predicted.
We have several level of complexity. On the top level, we have the task as
seen from the user. On the lowest level, we have machine instructions. Ideally,
time calculations would add up and be consistent across all levels. In reality,
we have sequence effects. One operation on a lower level may change the system
state of the global system, and this may affect the performance of the next
operation. As a first approach, we can use an additive model, but we always
should check for violations of this assumption. Caching, pipelining and other
effects may force us to use more elaborate models.
Timing and optimization is full of surprises. It is a helpful habit to formulate
assumptions explicitly, and to include checks of these assumptions. Two basic
assumptions generally are false. Time is no continuum. On a computer, time
has a discrete structure. And time is not consistent. On a computer, two
different timers will generally show inconsistent time. Here is a small program
to check the timing of your Oberon installation:
ItOTime.Grain ~
returns the minimal
time difference, over 100 runs
ItOTime.Clock ~
returns the number
of time units per second
The source code is in ItO/ItOTime.Mod.
Exercises:
Repeat the measurements on the grain size. Is
there any variation? Compare with the documentation of your Oberon installation.
Is the result plausible?
Is there any variation in the clock measurement? The clock measurement output
has an additional line which marks whether a measurement is below or above
the overall mean. Do any peculiarities (patterns etc.) occur? Try to formulate
you observations on the measurement process explicitly. Can you verify whether
your observations reliable, or whether they just reflect artifacts?
Possibilities for optimization will be discussed first. Timing questions
are discussed at the end of this chapter.
Optimization
The main optimization is to avoid a computation at all. This cannot be over-emphasized:
if a computation is not necessary, don't do it.
If you have to do a computation, a list of common optimization techniques
can be applied. In principle, these optimizations can be be done by a compiler.
Unfortunately, not all compilers perform these common optimizations, and
proper documentation of the optimization done by individual compilers is
rare.
Although the common optimizations can be done automatically, the complexity
of the necessary optimization algorithm sometimes forbids extensive optimization
on the compiler level. On the other hand, the programmer may have a view
of the intent and structure of the program which can be used to call the
available optimizations into action. In many cases a clean program design
is the most important step towards an effective implementation. In addition
to this, a combination of an optimization-aware programming style and automatic
optimization by the compiler is helpful. So we give a list of common optimization
techniques, and add it as a check list for your programs or compilers at
the end of this chapter.
Computation timing is not easy to interpret. The common expression is to
represent an optimization gain as reduction factor (time with opimization)
/ (time without optimization). Howewer the time without optimization enters
as a norming constant here, which can be affected severely by other implementation
issues. The other common choice is to assign an implementation independent
complexity to a problem, and to standardize using this complexity.
The programmer view of a code is generally in terms of statements and a raw
estimated can be derived by adding the statement costs. Hiden costs are introduced
by implicit actions. In particular,
- procedure calls may involve an overhead for setting up a procedure frame
and for passing parameters.
- control statements, like FOR...DO...END, may involve a control overhead.
- memory allocation, like NEW, may involve houskeeping and garbage collection
- access to components of structured data types, like A[i,j,k] may involve
an address calculation (usually the evaluation of a polynomial to get Addr(A)+
(LEN(A,2)*LEN(A,1))*i+LEN(A,1)*j + k)*SIZE(A[i,j,k]) )
To compare optimizations, we may refer to a coarser segmentation. We will
refer to a basic block as a group of statements which is entered by a unique
initial statement and left at a unique statement, with no exception or branching
except at the end of the block. This means a block is a "run" of statements
which is always executed as a whole and can be treated as a self contained
unit. Since there are no branches in a basic block, all statements in the
basic block are executed the same number of times, and time consumption factorizes
as
nr of block executions * block
execution time
As a programmer, we have essentially two possibilities for optimization:
reduction of operation cost and code motion.
Reduction of operation cost can be achieved by replacing expensive
operations or data structures by cheaper ones with equivalent system status.
The benefit is obvious, if we do not lose information. Some possibilities
are exploited by nearly all compilers. For example, for integers on common
CISC processors
i*2 i+i ASH(i,1) are
equivalent,
the
latter being the cheapest
i:=i+k INC(i,k) are
equivalent,
the
latter being the cheaper
Others are rarely found in compilers, and hand optimization is helpful. For
example, we can check that a vector (x,y) has norm zero by
(x*x)+(y*y) = 0
or
(x=0)&(y=0)
the latter of course being cheaper (and more stable). But a compiler will
rarely be able to make this reduction.
Reduction of operation cost strongly depends on the actual performance of
the processor. For example, on common RISC processors - both i+i and ASH(i,
1) take one cycle and can be considered equivalent. More advanced processors
use one or more instruction pipelines, and the performance may depend on
the previous state of the pipeline, making reduction of operation cost a
non trivial task.
Other reductions of operation cost may have a trade off. For example, we
may have the choice of representing a data structure as a linked list or
as an array. Using a linked list reduces the polynomial address calculation
by a step through a pointer, which may be more time efficient if it comes
to higher dimensions. On the other hand, we have to spend the additional
memory for a pointer per cell. Again the actual performance depends on the
architecture of the processor. If the processor uses a data cache, with a
linked list the cache behavior may change drastically if the temporal locality
is very bad.
Code motion can be used to move operations to places where they are
executed less frequently, or at less time critical instances. The most prominent
example is to move loop-invariant code outside a loop.
These possibilities can be applied to a range of targets. Simplest of all:
Constant Elimination. The aim is to reduce run time (and possibly code size)
by avoiding repeated evaluation of expression with constant value. It comes
in two flavours. Expressions which are proper constants and can be evaluated
at compile time. Expressions which are not proper constants, but are invariant
during repeated invocations can be stored to temporary variables to avoid
repeated calculations.
- Constant Folding
Expressions consisting of constants only can be evaluated at compile time.
Hand optimization: move to constant section. Potential draw back: most compilers
evaluate constants using the precision of the compile time environment, not
the run time environment, In particular, mathematical constants like e, pi,
may have different values at run time than at compile time, if you use cross-platform
environment,
Implementation: generally implemented for INTEGER and BOOLEAN constants,
rarely for other cases.
Floating point constant folding is very tricky since the order of evaluation
is critical and on some architectures even the use of registers or memory
locations might change the result. Compilers usually do not do it.
- Common Subexpression Elimination
Common subexpressions are parts of a block which have a matching form
and do not change their values between computations.
If the same expression occurs repeatedly, the program can be simplified.
Hand optimization: Introduce a temporary variable to hold the common subexpression.
Implementation: Used to reduce space (and hence module loading time) in OMI-based
compilers.
One important point is the following: Compilers might optimize common subexpressions
if the subexpression does not contain procedure calls e.g in the following
example the compiler might translate statement (1) into statement (2)
(1) IF x > pi/180*45 THEN x :=
pi/180*45 END;
(2) t := pi/180*45; IF x > t THEN
x := t END;
However, if the expression contains procedure calls then the compiler does
not optimize common subexpressions due to possible sideeffects. In the following
example the programmer has to perform the common subexpression elimination
by hand:
(1) IF x > f(y, z) THEN x := f(y,
z) END;
(2) t := f(y, z); IF x > t THEN
x := t END;
- Loop invariant code motion
This is a special case of common subexpressions. If a subexpression does
not change its value during repeated evaluations of a loop, it can be moved
before the loop. For example
WHILE (i >limit -2) DO... END;
can be changed to
t:=limit-2; WHILE i>t DO... END;
if limit is a loop invariant. Identification of loop invariants, reduction
of induction variables and loop invariant code motion are usually used in
combination.
- Induction variables
Induction variables are variables in loops which keep a fixed relation
to the loop index. Since typically these variables are evaluated very often,
reduction of operation cost can be very effective. A first attempt may be
to reduce operation cost by the introduction of new auxilliary loop constants.
If that is not feasible, the next attempt is to keep the computation time
for induction variables at a minimal level, e.g. by sorting nested loops,
that is by using code motion.
For an example, consider the code fragment
i := 0;
WHILE i < j DO
a[i] := 0;
INC(i)
END;
A straightforward compiler generates:
i := 0;
WHILE i < j DO
adr := baseadr(a);
idx := size(basetyp)*i;
Mem[adr+idx] := 0;
i := i + 1
END;
A compiler with loop invariant code motion generates:
i := 0; adr := baseadr;
WHILE i < j DO
idx := size(basetyp)*i;
Mem[adr+idx] := 0;
i := i + 1
END
A compiler with loop invariant code motion and induction variables generates:
i := 0; adr := baseadr;
WHILE i < j DO
Mem[adr] := 0;
i := i + 1;
adr := adr + size(basetyp)
END
Techniques for elimination of constants may be obstucted by aliasing,
that is by different names used for the same logical entity. A common strategy
to avoid part of this problem is propagation of copies. This strategy
means to use the oldest reference path if multiple reference paths are possible.
This is a strategy to find "hidden" constants. For example,
x:=y;
a[z]:=b;
a[w]:=x;
is transformed to
x:=y;
a[z]:=b;
a[w]:=y;
The next group of strategies addresses proper use of the control flow of
a program.
- Range-tracking and dead check removal
Range tracking means to move range checks for variables or expressions
from a lower to a more global level. For example, access to an array in a
loop does not need a range check at every step if we can guarantee that the
maximum and minimum indices are within the proper bounds. More generally,
dead check removal means to remove check code which can be proven to never
fail. For example, assignments do not need an overflow check if the global
minimum and maximum is under strict control.
If there is compiler control on a statement level, appropriate compiler options
can be used to avoid compiler generated check in save code regions. If the
compiler can only be controlled on a module level, this asks for a clean
separation of "save" modules and common ones.
- Pre-calculation of array access
Reduces the operation cost for the special case of array access. For
example, repeated access of the form [variable+constant] may be removed by
either shifting the array one position, or using a redifined variable. In
general, a change of the data structure should be considered in these instances
as an alternative, for example using several one dimensional arrays instead
of two dimensional arrays, or storing arrays in transposed form.
- Dead code elimination
Code fragments which never can be reached can be eliminated. If the code
is guarded by a constant expression, this is usually done by the compiler,
as is the elimination of procedures which are never called. If the code is
a partial code of a procedure guarded by a variable expression, hand removal
may be nessary. Dead code often is generated by propagation of constants.
("IF debugging THEN END;").
With cached memory access, code elimination may have the side effect to improve
locality, thus reducing cache miss rate and leading to better time performance.
- Optimization of Boolean expressions
This is achieved by arranging Boolean expressions to stop evaluation
as soon as the result is established. In Oberon, shortcut evaluation is done
by definition. In other languages a careful use of shortcut operators or
nested IF statements may be necessary.
Care has to be taken that no intended side effects are broken, as when the
evaluation of one part of the expression changes intentionally some parameter
which is used later on.
- Tabulation
Pre-calculation and storage of tabulated value may be used to reduce time
in search and decision situation. This can be rarely achieved in an automatic
way. Of course tabulation is only shifting the problem, and there is a memory/speed
trade off which has to taken into account.
- Special casing and use of recursive structures
A more dynamic variant of tabulation. Care should be taken to avoid implicit
use of recursive structures. A tail recursive propagation can always be replaced
by an iteration, usually with less overhead.
- Loop unrolling
Loop unrolling reduces the number of loop passes by repeating the loop
code inline. This allows to adapt to particularities of processors (e.g.
using LONGINT access instead of 2*INTEGER access). High-quality compilers
may provide loop unrolling, but it is still not common place.
- Software pipelining
The goal of software pipelining is to overlap the execution of several
loop iterations, just like a processor pipeline overlaps the execution of
several instructions. Software pipelining is often considered superior to
loop unrolling as it allows a higher degree of overlap. The "provisional
means" algorithm to calculate mean and variance is an example to reduce two
or more loops to one.
- Variable placement
This has been an important technique for CPUs with a small number of
registers, and the relevance for programmers decreased with the use of RISC
processors and caches. Some languages and compilers allow programmer hints
for variable placement, such as reserving registers for variable. Variable
placement and register allocation may be done more effectively by a compiler,
using elaborate information about the relative speed of register, cache and
memory access.
- Algebraic/analytic reduction (Strength reduction)
This is one of the more powerful forms or reduction of operation costs.
If we can replace an expression by an equivalent one whose evaluation consumes
less time, we contribute to time optimization.
The most trivial of these reductions, such as removal of operations on neutral
elements (x+0=0+x=x) are done by most compilers. Other require programmer
support. Example: To get equidistiant points on a circle, we can use
fak := 2 * pi / n;
FOR i := 1 TO n DO
NodeTable[i].Posx
:= -Math.Sin(i * fak);
NodeTable[i].Posy
:= Math.Cos(i * fak)
END;
We can instead use
fak := 2 * pi / n;
nsin:=Math.Sin(fak);
ncos:=Math.Cos(fak);
x:=1;y:=0;
FOR i := 1 TO MaxUsedNode DO
NodeTable[i].Posx
:= ncos*x-nsin*y;
NodeTable[i].Posy
:= nsin*x+ncos*y;
x:=NodeTable[i].Posx;
y:=NodeTable[i].Posy
END;
thus avoiding repeated calculations of sin and cos at the expense of two
more multiplications - generally a cheaper operation. On the other side,
roundoff errors may be different for both implementations and should be taken
into account.
To conclude, we should mention types of optimization which are usually exclusive
for the compiler. It may make sense to optimize a compiler for a certain
hardware architecture or implementation. On this level, the compiler can
try for
- Peephole optimization: local machine level variant of reduction
of operation cost
- Instruction scheduling: arrange the instruction sequence to match
the architecture of CPUs with independent execution units, avoiding interlocks
and wait cycles
- Cache optimizations: arrange data and code access with respect to
make most efficient use of any caches available.
Timing
At a first look, timing is the simple process to get the time needed for
an operation. Unfortunately, time measurement is not a smooth process, and
the granularity of computer time adds some difficulty. Moreover, most operations
are not directly observable, which makes timing more complicated. Finally,
each measurement changes the process under observation, and we have to work
to separate the effects of the timing from those of the process under investigation.
Timing can be done in a more or less obtrusive way. The most obtrusive timing
is interrupt timing: you start a clock process, giving interrupts at fixed
intervals. At interrupt time, you check the program counter, look up the
program fragment th program counter is in, and increase a corresponding hit
counter. This requires implementation of the timing process on a rather low
(system near) level for the programmer. For the application programmer/user,
interrupt timing seems convenient, since it does not require code modification.
Unfortunately interrupt timing changes the run time behaviour of the program,
and it is nearly impossible to separate effects of the measurement process
from the proper timing of the process under investigation. Artifacts in particular
may be drastic for modern architectures where cache conditions may be critical
for the performance of a program, and interrupt timing may distort these
cache conditions. A slighly less obtrusive timing can be achieved by a profiler
which instruments the source code with timing instructions at the beginning
and end of each procedure. Still the measurement process is distorting the
timing behaviour, but less drastically than for interrupt timing. We will
concentrate on application level timing, and discuss the necessary error
corrections for this case. The general rules can (and should) be applied
to profile timing as well (an implemetation is available as gsProfiler).
Performance is usually reported as operations per time, so that bigger means
better. For a critical analysis it is more convenient to use the reciprocal
scale, time per operation. Ultimately we want to compare times needed, and
assess of error propagation is simpler for this scale. So we calculate coefficient
of variations on the time scale, and any error bounds for speed are calculated
on the time scale and then converted to speed.
As an indicator of the quality of a measurement, we use the relative error,
or coefficient of variation
stddev of estimate / true value
of parameter
which indicates the relative precision if the estimator is unbiased. Of course
since we do not know the true value off hand, we have to resume to estimators
like
estimated stddev of estimate /
est. mean value of estimate.
We should not start discussion unless we can guarantee a coefficient of variation
of less than 10%, and we should get below 1% if we want to get serious.
To get a usable description, we have to introduce some formalization. We
assume that the compute process to be timed takes some time
C compute
process time
If some preparation or cleanup is needed, we collect this as
P preparation
and clean up time
We can choose the number of repetitions, and may have the freedom to run
the preparation without doing the proper process calculation. With N repetitions
of the compute process, preparation included, and M additional repetitions
of the preparation process, we have measurement for times
TC ;
ei=1..N (Ci+Pi) + erri
TP ;
ei=1..M Pi + erri
where err is the error term reflecting variations in the process to be timed.
The measurements will not give the true time of the processes, but may be
affected by some additional measurement error. We denote the measurements
by T'C, T'P respectively.
In theory, we will assume that we take samples under equivalent conditions,
so the systematic terms are constant and these errors are independent, and
the error terms in each row have a common identical distribution. In practice,
we have to take care to establish these conditions. If we do not have additional
information or measurements on the preparation process, we cannot separate
timing of the preparation and the proper computing process, and all we can
achieve is a joint estimate for both.
If we have measurements on the preparation process, we can use
C^ = (T'C - N 7
(T'P/M)) / N = T'C/N - T'P/M
as an estimate for C. Writing down this estimate is straight forward approach.
Work has to be done to judge the quality of this estimator. How reliable
are the figures it produces? If we can establish stochastic independence
of the timing for C and P, variance decomposes as
Var(C^) = Var(T'C)/N2
+ Var(T'P)/M2.
Clock discretization
Since reading and registering the time changes the time behaviour, we try
to keep the proper timing and registering operations to a minimum and include
it in the preparation and clean up time P or ignore it. For any time T we
want to take, the best we can get is a time reading T' = t1-t0 within the
precision of the clock.
The time reading error T'-T= d0-d1 is best modelled as a stochastic error.
Let delta be the precision of the clock. The errors d0 and d1 are
related to T in a fixed relation: d1 is the remainder of (d0+T) div delta.
This relation can be understood completely. For simplicity, we will scale
by delta, that is we assume delta=1 and T= k + epsilon for some k=0,...
and epsilon<1. For an ideal measurement using a time reading with
discretized by delta, epsilon is the unavoidable discretization
error. With a process starting at a random offset d0 after the previous time
signal, if d0<1-epsilon, T'=k and the measurement error is -epsilon.
If d0>1-epsilon, T+d0> km, so T'=k+1 and the measurement error is
1-epsilon. So given epsilon we have this error dependency on
d0:
Now d0 reflects the starting time of our measurement. If we can assume a
starting time indpendent of the clock cycle, we are led to a uniform distribution
for d0. If the execution length is modelled by distribution and does not
depend on the starting time, we have independence between epsilon and d0,
If we can assume a uniform distribution for the "true discretization error"
epsilon, this gives a triangular distribution for the observable discretization
error d0-d1 on [-delta,delta], with mean = 0, and variance
1/6 delta2. We denote the mean of the discretization error
d0-d1 by E(d) and the variance by Var(d).
We can of course try to avoid the initial discretization offset and start
simulation after synchronizing with the clock, forcing d0 ;
0. This may however be counterproductive. We end at the conditional distribution
for d0=0, which is uniform with a worse bias of E(d)=delta/2 but
a more favorable variance of Var(d)=1/12 delta2.
Immediate timing precision
From observations on T' we get an estimated mean E(T') and estimated variance
V(T'). Estimating the mean of T by E(T'), we write T as T'-d. We can estimate
the variance of T' from observations, and we have the true variance of d,
but unfortunately T' and d are not independent due to the common (unobservable)
term epsilon. So for the variance of T, we have to use a conservative
estimate
V(T) = (Std(T')+Std(d))2
where Std(T'), Std(d) are estimates of the standard deviation
of T' resp. d. If we estimate a coefficient of variation from this, we must
take into account that we do not know the true mean for T. We know the true
mean of E(d), and for practical purposes where we use the coefficient of
variation to derive confidence intervals, we can take an adjusted estimate
of V(T')+Var(d) for Var(T) to give
(V(T))1/2 /
(E(T') - E(d))
as an estimate for the coefficient of variation, and thus as an indicator
of the relative precision. If we have to guarantee for the estimate of the
coefficient of variation itself, we would take a more conservative estimate
for the mean of T, yielding
V(T)1/2
/ (E(T') - delta).
as an estimate for the coefficient of variation.
If we can chain execution of operations, we effectively increase the time
of execution from T' to an order of NT' without adding additional time truncation
error. This leads to a coefficient of variation which combines the usual
sample size effect with a deflation of the discretization error, leading
to an improved estimator on the original scale.
Scaffold timing precision
More often, we have to pepare an initial setup before we can do the proper
calculation: we build up a "scaffold" to prepare for the real timing. So
the time T=TC is a sum TC=C+P with empirical timing
T'C. We will usually have the possibility to time the preparation
without doing the proper computation, giving a timing T'P for
P, but we will not be able to time C without preparation. A plug in estimator
is to estimate C by C^:=T'C-T'P. The above considerations
apply to T'C and T'P. To judge the quality of C^, we
try to avoid interaction of the measurement processes for TC and
TP. As estimate, we plug in the adjusted estimators for the variance
with independence assumption as above, yielding
V(C); V(TC)
+ V(TP)
and use
V(C)1/2 / C^
as an estimate for the coefficient of variation, or
V(C)1/2 / (C^-27delta)
as a conservative estimate for the relative precision.
If we can chain execution of operations, we now have a choice. We can, but
we need not use the same number of repetitions for the measurement of the
scaffold P as for the proper measurement with target P+C. We can use N repetitions
for the full computation and M repetitions for the setup timing only, yielding
E(TC) = N7(C
+ P) + E(d)
E(TP) = M7
P + E(d)
Assuming independece of the measurement series, if VarC and VarP
are the original (single execution) variances of the computing time, and
EC, EP the corresponding means, this leads to
Var(C^); (N7VarC
+Var(d))/N2 + (M7VarP+Var(d))/M2 (*)
consuming an expected compute time
ET= E(TC) + E(TP) (**)
where we have to choose N and M to meet our time constraints and resources
while meeting the target. We can either minimize (*) subject to ET #
Tlimit, or we can minimize (**) while keeping the variance below
a requested limit. Any way we have to balance variance reduction for T'C
and T'P, compensating for the discretization aspect. This optimization
problem can be solved directly, but the simple rule is to keep in mind that
the total variance reduction contribution is proportional to the variance
of the components, with weight proportional to 1/effort attributed to the
components.
If the setup is a major source of variance, we should consider an approximate
decomposition
Var(C^) ; ((N7VarC+Var(d))/N2
+
(M7VarC
+ M7VarP-C +Var(d))/M2)
= (((N+M)7VarC+Var(d))(1/N2+1/M2)
+ M7VarP-C)
which shows that the sum N+M contributes to the reduction of the T' variance
estimate, and M is used for the T'' estimate.
Immediate timing revisited
Looking again at the immediate timing, we see that actually it has an implicit
scaffold. We can treat the setup to do the proper timing as a preparation
and apply the rules for scaffold timing.
So if timing creates considerable overhead, we can reduce the variance of
the estimates for our process to be timed by running additional empty timing
runs and correct for the timing overhead as above.
Real World Timing
In the examples above, we used coefficients of variation as an informative
measure of relative precision. This requires knowledge of mean and variance
of the target variable, and of course both are not known a priori but must
be estimated from observations, thus introducing an additional source of
error.
We can stay with a naive application of coefficients of variation, and be
pleased with small interval estimators. But we run the risk of being lead
astray by random variantion, discussing seeming timing effects which are
but an artifact introduced by incomplete knowledge of the target mean and
variance.
To avoid being lead astray by random variation, we have to use confidence
intervals. The number of indpendent observations used to determine the variance
estimate is critical for the reliabiltiy. A strict analysis and a proper
model may be too much of an effort. Instead, we use rules of thumb to get
a first approximation. Assuming a Gaussian distribution as an approximate
error distribution, and allowing for a chance of 1 in 10 to be out of bound,
i.e. a confidence level of 10%, we can use the table of the t- distribution
to get a approximate confidence intervals. With k samples to estimate the
variance (k-1 degrees of freedom) we catch the true parameter with 90% (99%)
confidence using
confidence interval =
= estimated mean +/- t 7estimated
standard deviation
with
k t
(90%) t(99%)
2 6.3 63.7
3 3 10
6 2 3.5
16 1.75 2.9
.. 1.65 2.58
As a rule of thumb, we can use these figures as factors to get confidence
intervals from our estimates and standard deviation. To be more precise,
we can use exact quantiles from the t distribution as factors.
We did not model the exact dependence of the error contributions, but made
some simplifying plausible assumptions, and we are assuming a Gaussian distribution
as a convenient smooth approximation. So we will not have exact confidence
intervals.
But we have left vague approximations and reached explicit statements: the
calculated intervals should cover the true (unobservable) time with guaranteed
confidence of 90% (99%). This can be used to derive predictions on observable
timings. So if we have additional timing resources, we can repeat the timing
process and check the model assumptions and see whether we can improve the
timing process.
A step by step illustration
Most timing routines will have one or two groups of three-lined output.
The first of these lines gives confidence intervals based on the assuption
that you have exact (continuous time) measurement, and exact knowledge of
the error variance. This is the most simple approximation, but the most error
prone.
The second line corrects for the discretization error. It will give larger
intervals than the first one, but these are more reliable.
The third line additionally takes into account that you do not have exact
knowledge of the error variance. Instead, the error variance is estimated
from observations. Again, the assumption of an approximate Gaussian distribution
is used. If the information used to estimate the error variance is increased,
the interval limits shown here will converge to the Gaussian approximation
given in the second line. The figures given in the third line can be used
as conservative bounds for practical use.
Timing gets more complicated if we move to smaller times. The extreme case
is to time an empty loop, i.e. to time the timing overhead.
An initial solution is
ItOTime.Nothing
To illustrate the various choices, ItOTime.Nothing gives an abundant report.
In a first table, the plain measurements and elementary derived statistics
are listed.
Next, the time statistic is evaluated. Confidence intervals are given for
three variants: blue eyed statistics assuming a plain Gaussian model, a gausssian
model with error correction for discretization, and a conservative correction
adding a worst-case correction for the mean and using studentized confidence
intervals.
As a complement, these results are translated to speed scale.
Next, the speed statistics (not recommended) is evaluated. As for the time
scale, we give confidence intervals for blue eyed statistics assuming a plain
Gaussian model, for a gausssian model with error correction for discretization,
and for a conservative correction adding a worst-case correction for the
mean. As a complement, these results are translated to time scale.
The recommended variant is to use a time based model with variance correction,
using derived intervals for speed estimation.
The output is not corrected for artefacts. So if the Gauss aproximation leads
to unreasonable results, you will see garbage in the output.
ItOTime.Nothing is a parametrized command so that we can iterate executions.
We can use it as
ItOTime.Nothing
<minimal nr of
time grains> <nr of repeated runs>
e.g.
ItOTime.Nothing 2 5
ItOTime.Nothing 4 5
ItOTime.Nothing 8 5
ItOTime.Nothing 16 5
to see how variance is reduced if we move away from the initial time discretization.
Given real world limited resources, we have a choice of either fixing a time
span and taking observations until the time is used up, or fixing a number
of observations and taking the time when the observations are finished.
The simple timing loop consists of several components
- increasing a counter
- reading the time
- checking limits
- enclosing control structure.
To get an idea about the relative contribution, we can unroll the loop. For
example, we still use a counter increment of 1, but we read the time and
check the limits only in 1 of 8 steps.
ItOTime.Nothing8
implements this variant, with parametrizations as above
e.g.
ItOTime.Nothing8 2 5
ItOTime.Nothing8 4 5
ItOTime.Nothing8 8 5
ItOTime.Nothing8 16 5
The example illustrates the large contribution from reading the time. So
if we can avoid it, we should do.
To get a time measurement with prescribed precision, we can use a compound
strategy. Here is one strategy, aiming to get a measurement with relative
error (coefficient of variation) of alpha=1%.
- Get an initial timing.
If the process is above time resolution,
take
just one measurement.
If it is below time resolution,
find a sufficient
measurement count.
- Using this measurement count,
get an initial
estimate for the mean and variance
of the speed, and
the time per execution.
A verbose example is
ItOTime.SomeThing
You can add a time limit to restrict the total time used by adding a time
limit [in seconds] such as
ItOTime.SomeThing 1 ~ 1 second.
Not yet implemented
ItOTime.SomeThing tries to do a timing within this limit (but it does not
guarantee it). If you add a second parameter, it will be used as a limit
for the relative precision, for example
ItOTime.SomeThing 60 0.01
~ 1%
precision or 60 second. Not yet implemented
For the second form, the time usually will serve as an emergency exit to
avoid too lengthy calculations.
Exercises:
Givens' rotation is used to rotate a general vector
to the x axis. It is used as an elementary step to solve linear equations.
The file ItO/ItODROTG.Mod contains two variants of Givens' rotation, taken
from the BLAS system (BLAS=basic linear algebra subroutines).
DROTGF is machine translated from the original FORTRAN source. DROTGO is
a manual translation which should be equivalent to DROTGF. Compare these
implementations. Use a random number generator to time both routines with
random argument DA,DB and compare the timing. Try to improve DROTGF.
Introduction to the Oberon programming language. ItO/Ch13.Text
gs (c) G. Sawitzki, StatLab Heidelberg
<http://statlab.uni-heidelberg.de/projects/oberon/intro/>
Home
Up
Intro
Contents
Chapter
1
2
3
4
5
6
7
8
9
10
Design
Assert
Timing
EBNF
Report
Pas