For practical tips and general recommendations on seminar papers and
expositions, see [here…] (in German).
For copyright reasons, access to some course material may be restricted. These links are shown in this style.
For messages, please use
this mail address.
SS 2017
G. Sawitzki: Statistical Data Analysis
Module: MH21
Wed. 11-13 or by appointment, INF 205, SR 8
LSF reference: 11MMAV0829
Please register using
this form.
Topics for SS 2017
 Tools for data analysis
 Residual analysis and regression diagnostics
 Resampling and permutation methods
 Dimension diagnostics and dimension reduction
The course topics will include material not covered in the previous courses (see below).
This course supplements the lectures of WS 2016/17. Knowledge of that course or an equivalent,
and a working knowledge of R, are required.
Statistical data analysis is the investigative branch of statistics. Data analysis has two aims:
 find informative features in data,
and
 bring them to human perception.
And, in doing this, data analysis has to
 avoid artefacts coming from random fluctuation,
 avoid artefacts from perception.
To meet both aims and to avoid both pitfalls, data analysis draws heavily on mathematical
statistics on the one hand, and on various kinds of research in the humanities on the other.
Data analysis is a means to understand the measurement process by which the data are generated. A thorough investigation of the measurement process and its statistical implications is vital for data analysis.
Residual analysis is a special field of data analysis that is particularly well developed. In this area, data analysis is used to scrutinize data sets for observations that might critically influence the application of classical statistical methods. We will use the experience gained in residual analysis as a guiding example.
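As a minimal sketch of the kind of scrutiny meant here, the following R code computes residuals, leverage and two influence measures for a simple regression. It uses R's built-in mtcars data as a stand-in (an assumption; the course data sets are not reproduced here), and the cut-offs are common rules of thumb, not prescriptions from the course.

```r
# Simple linear model on the built-in mtcars data (stand-in data, an assumption)
fit <- lm(mpg ~ wt, data = mtcars)

# Raw residuals and internally studentized residuals
res_raw  <- residuals(fit)
res_stud <- rstandard(fit)

# Leverage (hat values) and two influence measures
lev   <- hatvalues(fit)
cooks <- cooks.distance(fit)
dff   <- dffits(fit)

# Flag observations using rule-of-thumb thresholds (conventions only)
p <- length(coef(fit))
n <- nrow(mtcars)
flagged <- which(lev > 2 * p / n | cooks > 4 / n)
print(mtcars[flagged, c("mpg", "wt")])

# Residuals vs fitted values
plot(fitted(fit), res_raw, xlab = "Fitted values", ylab = "Raw residuals")
abline(h = 0, lty = 2)
```

Flagged observations are candidates for a closer look, not automatically errors; whether they matter depends on the measurement process behind the data.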
Tentative Time Table SS 2017
 2017-04-19
 Short Introduction
The main topics for this term will be chosen in this session. Requests and suggestions are welcome.
WS 2016/2017
To be scheduled for WS 2016/2017
Resampling and permutation methods
Dimension diagnostics and dimension reduction
Tentative Time Table WS 2016/2017
Assignment/project exercises due: Feb. 3, 2017.
For those who do not prepare a written assignment: written examination on Feb. 8.
2016-12-21
Outlook
Questions & answers
2016-12-14
Forward regression
Robust diagnostic regression analysis
Atkinson, Anthony and Riani, Marco
Springer (2012).
Transformation: Box&Cox
Exercises are in
exercises20161214.pdf
An evaluation copy of the DataDesk software used for the Cars data set can be downloaded
from DataDescription, Inc.
The colour graphic example is in Colour perception.
Project lowdim gives
some background on the topics touched on in the current series of lectures.
2016-12-07
Robust regression
Breakdown
Influence function
Robust estimation of location
Maximum depth regression
Exercises are in
exercises20161207.pdf
Regression depth
Rousseeuw, Peter J and Hubert, Mia
Journal of the American Statistical Association 94 (1999) 388-402.
Robust statistics:
the approach based on influence functions
Hampel, F.R., E.M. Ronchetti, P.J. Rousseeuw, and W.A. Stahel
Wiley: New York 1986.
2016-11-30
Added variables
Tukey multiple comparison
Residual analysis and regression diagnostics
Robust regression

Robust statistics:
the approach based on influence functions
Hampel, F.R., E.M. Ronchetti, P.J. Rousseeuw, and W.A. Stahel
Wiley: New York 1986.
2016-11-23
Residual analysis and regression diagnostics (cont.)
Regression confidence bands
F-tests
Lit: Computational Statistics, Ch. 2.4
Exercises are in
exercises20161123.pdf
2016-11-16
Residual analysis and regression diagnostics (cont.)
t-tests
Residual distribution
Studentized residuals
Outliers and leverage
Influence: DFFITS, Cook's distance
Exercises (new projects) are in
exercises20161116.pdf
2016-11-09
Box&whisker plots
Residual analysis and regression diagnostics
Raw residuals
Residuals vs fitted
Exercises are in
exercises20161109.pdf

Computational Statistics: An Introduction to R Ch. 2
G. Sawitzki
CRC Press: Boca Raton 2009.
2016-11-02
Simple diagnostics
distribution functions
Kolmogoroff-Smirnov
Confidence bands
Exercises are in
exercises20161102.pdf
2016-10-26
Simple diagnostics
histograms
distribution functions
sample size requirements
Lit: A draft section that puts the problems in context
has been placed in onedim.pdf.
Exercises are in exercises20161025.pdf
2016-10-19
Statistical Background
Distributions, samples and statistics
Parametric models
Test procedures, power function, test level
Neyman-Pearson setting.
Non-admissibility
The course topics will include material not covered in the previous courses (see below).
Tentative Time Table 2015/2016
The course topics 2015/2016 will overlap with the 2013/2014 course (see below).
Agenda and time table for this term will be fixed in the first sessions.
 Prediction error, Mallows' Cp, information criteria
 Find "honest" estimators for error.
Find a measure for the quality of a model.
 Model selection
 In a polynomial regression, find an order of the polynomial to use.
In multiple regression, find the regressors to use.
 Cluster analysis & unsupervised learning
 For data as in classification, but with unknown classes:
Find clustering.
Estimate a number of clusters.
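For the polynomial regression item above, one possible "honest" error estimate is the leave-one-out prediction error. The following is a minimal sketch on simulated data (the data, the candidate degrees and the name loo_mse are assumptions made for illustration); it uses the fact that for least squares fits the leave-one-out residual equals e_i / (1 - h_i), with h_i the leverage.

```r
set.seed(1)

# Simulated data with a cubic trend (illustrative assumption)
x <- seq(-2, 2, length.out = 60)
y <- 1 + x - 0.5 * x^3 + rnorm(length(x), sd = 0.5)

# Leave-one-out mean squared prediction error for a polynomial of given degree.
# For least squares, the leave-one-out residual is e_i / (1 - h_i),
# so no explicit refitting is needed.
loo_mse <- function(degree) {
  fit <- lm(y ~ poly(x, degree))
  mean((residuals(fit) / (1 - hatvalues(fit)))^2)
}

degrees <- 1:8
cv <- sapply(degrees, loo_mse)
best <- degrees[which.min(cv)]

print(data.frame(degree = degrees, loo_mse = cv))
print(best)
```

The in-sample residual sum of squares can only decrease as the degree grows, whereas the leave-one-out error typically rises again once the model starts fitting noise; Mallows' Cp and information criteria address the same problem with an explicit penalty instead.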
To be scheduled
Optimal procedures
Literature

Maximum likelihood: An introduction
Le Cam, Lucien
International Statistical Review
58 No.2
153-171
(1990)

Theory of point estimation
Lehmann, E. L.
Wiley: New York 1983.
Influence functions and robustness
Literature

Robust statistics:
the approach based on influence functions
Hampel, F.R., E.M. Ronchetti, P.J. Rousseeuw, and W.A. Stahel
Wiley: New York 1986.
Current time table
 2016-02-04 Last lecture
 2016-02-04
 Summary and Review
 2016-02-02
 2016-01-28
 Data Analysis for cDNA microarrays
slides (Goettingen lectures)
Video
 2016-01-26

Multiple comparison
Scheffé confidence bands
Tukey multiple comparison
Lit: Computational Statistics, Ch. 2.4
Assignments due
 2016-01-21
 Nonparametric methods:
Permutation tests
Rank based methods
Monte Carlo
Lit: Computational Statistics, Ch. 3.3
 2016-01-19
 Resampling methods III
Cross-validation
See Arlot, Sylvain and Celisse, Alain:
A survey of cross-validation procedures for model selection
 2016-01-14
 Resampling methods II
Bootstrap & jackknife
See Jon Wellner:
Bootstrap and Jackknife Estimation of Sampling Distributions
 2016-01-12
 Resampling methods I
Test and training sets
Bootstrap
Jackknife
Cross-validation
Data: Low birthweight (birthweight)
 2016-01-07
 Regression trees
Data: Scottish hill races, cars, fat
 Classification trees
Data: Olive oils (olive)
Season Break
 2015-12-17
 Open for questions
 Inverse analysis
Inverse classification.
Quadrant test, corner test.
 2015-12-15
 Parallel coordinates (parallel)
See A. Inselberg:
Parallel Coordinates: Visualization, Exploration and Classification of High-Dimensional Data.
Data: Old Faithful geyser data (geyser)
Data: Olive oils (olive)
 Recurrence plots (recurrence8)
Data: Heart rate (pulse8)
See also: R Heart rate variation on R-Forge,
in particular the tutorial.
N. Marwan: Structures in Recurrence Plots
 2015-12-10

Data: Old Faithful geyser data cont. (geyser)
 2015-12-08
 Statistical decision theory
Literature

Statistical decision theory and Bayesian analysis
Berger, James O.
Springer Science & Business Media 2013.

Mathematical statistics: A decision theoretic approach
Ferguson, Thomas S
Probability and Mathematical Statistics, Vol. 1, Academic Press, New York 1967.

Statistical decision functions
Wald, Abraham
Wiley, New York 1950.
 Course Evaluation 
 2015-12-03
  No lecture 
Scientific advent matinee 11:15-12:45, 14:00(!!!)-16:00
 2015-12-01

Classification trees
Data: Olive oils (olive)
 Parallel coordinates
See Parallel Coordinates: Visualization, Exploration and Classification of High-Dimensional Data.
Data: Old Faithful geyser data (geyser)
Data: Olive oils (olive)
 2015-11-26
 Projection pursuit
Data: Olive oils (olive)
Software: ggobi. See also: rggobi().
 2015-11-24
 Projection pursuit
Data: Olive oils (olive)
Software: ggobi. See also: rggobi().
 2015-11-19
 Principal components
Data: Body fat (fat), Cars
 2015-11-17
  no lecture 
 2015-11-12
 ANOVA
Box-Cox transformations
Errors in Variables
Data: Scottish hill races, Cars
 2015-11-10
 PRIM
Video: PRIM9. See the ASA video library.
 2015-11-05
 Linear regression: influence measures
Robust regression
Data: Scottish hill runners data (scottish)
 2015-11-03
 Example: colour data
Example: cars data
 2015-10-29
 Leave-one-out estimators
Some more data sets
R: data()
UC Irvine Machine Learning Repository,
e.g. Auto MPG
Cook & Swayne data (cookchapdata)
 2015-10-27
  no lecture 
 2015-10-22
 Linear models: Residuals
Lit: Computational Statistics, Ch. 2.2.3 ff
 2015-10-20
 Linear models
Lit: Computational Statistics, Ch. 2
Data: Scottish hill runners data (scottish)
 2015-10-15
 Diagnostics of one-dimensional data
Lit: Computational Statistics, Ch. 1
Data: Old Faithful geyser data (geyser)
 2015-10-13
 Introduction
Tentative Time Table (2013/14)
 2014-02-04
 Review
 2014-01-28
 Tree Based Methods
 2014-01-21
 Principal Component Analysis
 2014-01-14
 Brushing, Linking and Prosections
 2014-01-07
 Projection Pursuit
 2013-12-17
 Curse and Blessing of Dimensionality
D. Donoho:
The Curses and Blessings of Dimensionality. Stanford, 2002
 2013-12-10
 Influence and Breakdown; Robust Regression
 2013-12-03
 Outliers in Regression; Diagnostics
 2013-11-26
 Regression Diagnostics and Residual Analysis III
 2013-11-19
 Regression Diagnostics and Residual Analysis II
 2013-11-12
 Regression Diagnostics and Residual Analysis
 2013-11-05
 Distribution Diagnostics III: Trend, correlation, …
 2013-10-29
 Distribution Diagnostics II: Modality
 2013-10-22
 Distribution Diagnostics I
Next Topics
 PRIM
 Brushing & Linking
 Resampling
 Tree based methods
Themes
Projection Pursuit

Keywords
 Picturing
 Rotation
 Isolation
 Masking

(John Tukey, 1973) 
PRIM 9 
Literature
Peter J. Huber: Projection Pursuit. The Annals of Statistics, Vol. 13, No. 2. (Jun., 1985), pp. 435-475
G. P. Nason: Exploratory Projection Pursuit
Mathew Ward: Projection Pursuit: A Brief Overview
FlowingData: John Tukey and the Beginning of Interactive Graphics
See also:
Data Sets
 PRIM
 See Section 5.2 in (Cook and Swayne, 2007). Provided as example in ggobi.
 Data
Comments
 Color perception
 Documentation Data
 Chemical Diabetes
 Literature
 Mineral water
 See Section 7.3 in (Cook and Swayne, 2007),
and R package classifly.
See also http://www.mineralwaters.org/ for comparisons and detailed analysis.
Brushing

Keywords
 Linking,
 Scatterplot brushing

(John A. McDonald, 1980+) 
ORION 
Data Sets
 Boston housing data
 original at UCI Machine Learning Repository.
 UNT version.
 Harrell version.
 R data set:
data("Boston", package="MASS")
 Cars
 Henderson and Velleman (1981). Building Regression Models
Interactively. Biometrics 37, 400.
Software: DataDesk (demo version)
Smoothing and Kernel Density Estimation
Data Sets
R (reduced version): data(faithful)
R (complete version): data("geyser",package="MASS")
 Old Faithful Geyser Data

A look at some data on the Old Faithful geyser
A. Azzalini and A. W. Bowman
Applied Statistics
39
357-365
(1990)
Literature

A Brief Survey of Bandwidth Selection for Density Estimation
M. C. Jones and J. S. Marron and S. J. Sheather
Journal of the American Statistical Association
91
401-407
(1996)


Bandwidth Selection in Kernel Density Estimation: A Review
B. Turlach
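
The effect of bandwidth selection discussed in the literature above can be tried directly on the reduced Old Faithful data. This is a minimal sketch using only base R; the choice of the two selectors compared (the default rule of thumb and the Sheather-Jones selector) is an assumption made for illustration, not a recommendation from the course.

```r
# Reduced version of the Old Faithful data shipped with R
data(faithful)
x <- faithful$eruptions

# Kernel density estimates with two bandwidth selectors
d_nrd0 <- density(x)             # default: rule-of-thumb bandwidth (bw.nrd0)
d_sj   <- density(x, bw = "SJ")  # Sheather-Jones selector

# Overlay both estimates; the raw data are shown as a rug
plot(d_nrd0, main = "Old Faithful eruption durations")
lines(d_sj, lty = 2)
rug(x)
legend("top", legend = c("nrd0", "SJ"), lty = c(1, 2), bty = "n")

# Compare the selected bandwidths
print(c(nrd0 = d_nrd0$bw, SJ = d_sj$bw))
```

For bimodal data such as these, different selectors can give visibly different amounts of smoothing, which is why the estimate should be inspected for more than one bandwidth.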
Principal Component Analysis
Data
R:
library("UsingR")
data(fat)
Literature
 mkb92ma: Chapter 8.3


Multivariate Analysis
K. V. Mardia, J. T. Kent and J. M. Bibby
(1979)

 Branden2005Robustclassifi


Robust classification in high dimensions based on the SIMCA method
K. Branden and M. Hubert
Chemometrics and Intelligent Laboratory Systems
79
10-21
(2005)
In this paper we first investigate
the robustness of the SIMCA method for classifying high-dimensional
observations. It turns out that both stages of the algorithm, the
estimation of principal components and the construction of a
classification rule, can be highly disturbed by the presence of
outliers. Therefore we propose a robust procedure RSIMCA which is based
on a robust Principal Component Analysis method for high-dimensional
data (ROBPCA). Various simulations and real examples reveal the
robustness of our approach. (c) 2005 Elsevier B.V. All rights reserved.

One-Dimensional Diagnostics
Literature
 gs94oned


Diagnostic Plots for One-Dimensional Data
G. Sawitzki
in:
P. Dirschedl, R. Ostermann (eds.) Computational Statistics. Papers
Collected on the Occasion of the 25th Conference on Statistical
Computing at Schloss Reisensburg. Physica-Verlag, Heidelberg 1994.
pp. 237-258
(1994)
Software and more information: http://www.statlab.uni-heidelberg.de/projects/onedim/.

In preparation
Dimension Reduction
Literature
 Li1991SlicedInverse


Sliced Inverse Regression for Dimension Reduction
K.-C. Li
Journal of the American Statistical Association
86
316-327
(1991)

Resampling
Classification and Regression Trees, DART
Literature
 Breiman1984CART


Classification and Regression Trees
L. Breiman, J. Friedman, R. Olshen and C. Stone
(1984)

 593439


J. H. Friedman
(Aug. 1996a)
"Local Learning Based on Recursive Covering"
(software)

See also...
Courses to look at
Hao Helen Zhang (University of Arizona):
MATH 574M - Statistical Machine Learning and Data Mining 2013
Andreas Buja (University of Pennsylvania):
Lectures on statistics and data analysis,
Columbia University 2009
Heike Hofmann et al. (Iowa State University):
Visualizing Quantitative Information
Ross Ihaka et al. (Auckland):
Computational Data Analysis and Graphics
Hadley Wickham (Rice University):
Data Visualisation
Data
D. F. Andrews and A. M. Herzberg:
Data
XX, 442 pp.
(Springer 1985) Data sets online
D. Cook and D. F. Swayne: Interactive and Dynamic Graphics for Data Analysis
(Springer 2007), Data Descriptions (Feb 2007, PDF, 1.5Mb), Data: See
Data section of the book home page.
See also Data.
Literature
 (Belsley et al. 1980)

Belsley, Kuh & Welsch, Regression Diagnostics, Wiley, 1980.
 (Cook and Swayne, 2007)

D. Cook and D. F. Swayne: Interactive and Dynamic Graphics for Data Analysis
(Springer 2007), Data Descriptions (Feb 2007, PDF, 1.5Mb)
Software
 DataDesk
 http://www.datadesk.com/
 ggobi

www.ggobi.org
 R

www.cran.r-project.org
$Source: /u/math/sa3/cvswww/www/www.statlab.uniheidelberg.de/studinfo/sda/index.html,v $
$Revision: 1.38 $
$Date: 2017/03/08 17:49:16 $