G. Sawitzki Statistical Data Analysis StatLab Heidelberg Last revision: 2017-03-08 by gs

Practical tips and general recommendations for seminar papers and expositions, see [here…] (in German).
For copyright reasons, access to some course material may be restricted. These links are shown in this style.

For messages, please use this mail address.

SS 2017
G. Sawitzki: Statistical Data Analysis

Module: MH21
Wed. 11-13 or by appointment, INF 205, SR. 8

LSF reference: 11MMAV0829
Please register using this form.

Topics for SS 2017

Tools for data analysis
Residual analysis and regression diagnostics
Resampling and permutation methods
Dimension diagnostics and dimension reduction

The course topics will include material not covered in the previous courses (see below).

This course is a supplement to the lectures of WS 2016/17. Knowledge of that course, or equivalent, and a working knowledge of R are required.



Statistical data analysis is the investigative branch of statistics. Data analysis has two aims:

To meet both aims, and to avoid both pitfalls, data analysis draws heavily on mathematical statistics on one hand, and on various kinds of research in the humanities on the other hand.

Data analysis is a means to understand the measurement process by which the data are generated. A thorough investigation of the measurement process and its statistical implications is vital for data analysis.

Residual analysis is a special field in data analysis which is particularly well developed. In this area, data analysis is used to scrutinize data sets for observations which might have a critical influence on the application of classical statistical methods. We will use the experience gained in residual analysis as a guiding example.

Tentative Time Table SS 2017

2017-04-19
Short Introduction
The main topics for this term will be chosen in this session. Requests and suggestions are welcome.

WS 2016/2017

To be scheduled for WS 2016/2017

Resampling and permutation methods
Dimension diagnostics and dimension reduction

Tentative Time Table WS 2016/2017

Assignment/project exercises due: Feb. 3, 2017. For those who do not prepare a written assignment: written examination on Feb. 8.
2016-12-21
Outlook
Questions & answers

2016-12-14
Forward regression
Robust diagnostic regression analysis
Atkinson, Anthony and Riani, Marco
Springer (2012).
Transformation: Box&Cox
Exercises are in exercises20161214.pdf

An evaluation copy of the DataDesk software used for the Cars data set can be downloaded from DataDescription, Inc.

The colour graphic example is in Colour perception.

Project lowdim gives some background on the topics touched on in the current series of lectures.

2016-12-07
Robust regression
Breakdown
Influence function
Robust estimation of location
Maximum depth regression
Exercises are in exercises20161207.pdf
Regression depth
Rousseeuw, Peter J and Hubert, Mia
Journal of the American Statistical Association 94 (1999) 388-402.
Robust statistics: the approach based on influence functions
Hampel, F.R., E.M. Ronchetti, P.J. Rousseeuw, and W.A. Stahel
Wiley: New York 1986.
2016-11-30
Added variables
Tukey multiple comparison
Residual analysis and regression diagnostics
Robust regression
Robust statistics: the approach based on influence functions
Hampel, F.R., E.M. Ronchetti, P.J. Rousseeuw, and W.A. Stahel
Wiley: New York 1986.
2016-11-23
Residual analysis and regression diagnostics (cont.)
Regression confidence bands
F-tests
Lit: Computational Statistics, Ch. 2.4
Exercises are in exercises20161123.pdf
2016-11-16
Residual analysis and regression diagnostics (cont.)
t-tests
Residual distribution
Studentized residuals
Outliers and leverage
Influence: DFFITS, Cook's distance
Exercises (new projects) are in exercises20161116.pdf
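
The diagnostics listed for this session can be tried out with functions from base R. The following is a minimal sketch only. It assumes the built-in cars data (speed, dist) as a stand-in for the course data sets, and the cut-offs used to flag observations (leverage above 2p/n, absolute studentized residual above 2) are common rules of thumb, not values taken from the lecture.

```r
# Sketch: residual and influence diagnostics for a simple linear model.
# Assumption: the built-in `cars` data stand in for the course data sets.
fit <- lm(dist ~ speed, data = cars)

infl <- data.frame(
  raw      = residuals(fit),       # raw residuals
  student  = rstudent(fit),        # externally studentized residuals
  leverage = hatvalues(fit),       # diagonal of the hat matrix
  dffits   = dffits(fit),          # DFFITS
  cook     = cooks.distance(fit)   # Cook's distance
)

p <- length(coef(fit))             # number of regression coefficients
n <- nrow(cars)                    # number of observations

# Rule-of-thumb cut-offs (an assumption, not from the course material)
flagged <- which(infl$leverage > 2 * p / n | abs(infl$student) > 2)
infl[flagged, ]
```

The flagged rows are candidates for a closer look, not automatic deletions.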
2016-11-09
Box&whisker plots
Residual analysis and regression diagnostics
Raw residuals
Residuals vs fitted
Exercises are in exercises20161109.pdf
Computational Statistics: An Introduction to R Ch. 2
G. Sawitzki
CRC Press: Boca Raton 2009.
2016-11-02
Simple diagnostics
distribution functions
Kolmogorov-Smirnov
Confidence bands
Exercises are in exercises20161102.pdf
2016-10-26
Simple diagnostics
histograms
distribution functions
sample size requirements
Lit: A draft section that puts the problems in context has been placed in onedim.pdf.
Exercises are in exercises20161025.pdf
2016-10-19
Statistical Background
Distributions, samples and statistics
Parametric models
Test procedures, power function, test level
Neyman-Pearson setting.
Non-Admissibility

The course topics will include material not covered in the previous courses (see below).


Tentative Time Table 2015/2016

The course topics 2015/2016 will overlap with the 2013/2014 course (see below). Agenda and time table for this term will be fixed in the first sessions.

Prediction error, Mallows' Cp, information criteria
Find "honest" estimators for error.
Find measure for the quality of a model.
Model selection
In a polynomial regression, find an order of the polynomial to use.
In multiple regression, find the regressors to use.
Cluster analysis & unsupervised learning
For data as in classification, but with unknown classes:
Find clustering. Estimate a number of clusters.
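
As an illustration of the model selection task for polynomial regression, the order of the polynomial can be compared by an estimate of prediction error. The sketch below uses leave-one-out cross-validation. It assumes the built-in cars data (not the course's Cars data set) and an arbitrary range of candidate orders 1 to 4. The helper name loo_mse is hypothetical.

```r
# Sketch: choose a polynomial order by leave-one-out cross-validation.
# Assumptions: built-in `cars` data; candidate orders 1..4 chosen for illustration.
loo_mse <- function(degree, data = cars) {
  errs <- sapply(seq_len(nrow(data)), function(i) {
    # fit without observation i, then predict observation i
    fit <- lm(dist ~ poly(speed, degree, raw = TRUE), data = data[-i, ])
    data$dist[i] - predict(fit, newdata = data[i, , drop = FALSE])
  })
  mean(errs^2)                     # estimated mean squared prediction error
}

degrees <- 1:4
cv <- sapply(degrees, loo_mse)
names(cv) <- degrees
best_degree <- degrees[which.min(cv)]
cv
```

The in-sample residual sum of squares always decreases with the order, so it is not an "honest" error estimate. The cross-validated error need not decrease.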

To be scheduled

Optimal procedures

Literature

Maximum likelihood: An introduction
Le Cam, Lucien
International Statistical Review  58 No.2  153-171  (1990)
Theory of point estimation
Lehmann, E. L.
Wiley: New York 1983.
Influence functions and robustness

Literature

Robust statistics: the approach based on influence functions
Hampel, F.R., E.M. Ronchetti, P.J. Rousseeuw, and W.A. Stahel
Wiley: New York 1986.

Current time table

2016-02-04 (last lecture)
Summary and Review
2016-02-02
2016-01-28
Data Analysis for cDNA microarrays
slides (Goettingen lectures)
Video
2016-01-26
Multiple comparison
Scheffé confidence bands
Tukey multiple comparison
Lit: Computational Statistics, Ch. 2.4

Assignments due

2016-01-21
Nonparametric methods:
Permutation tests
Rank based methods
Monte-Carlo
Lit: Computational Statistics, Ch. 3.3
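
A permutation test combined with Monte-Carlo sampling can be sketched in a few lines of R. The example below is an illustration under stated assumptions: it compares mean fuel consumption (mpg) between the two transmission groups in the built-in mtcars data, uses the difference in means as test statistic, and draws 2000 random relabellings instead of enumerating all permutations. The function name perm_test is hypothetical.

```r
# Sketch: two-sample Monte-Carlo permutation test for a difference in means.
# Assumptions: built-in `mtcars` data; 2000 random permutations.
perm_test <- function(x, y, n_perm = 2000) {
  observed <- mean(x) - mean(y)
  pooled <- c(x, y)
  n_x <- length(x)
  stats <- replicate(n_perm, {
    idx <- sample(length(pooled), n_x)      # random relabelling of the pooled sample
    mean(pooled[idx]) - mean(pooled[-idx])
  })
  # two-sided Monte-Carlo p-value; the +1 counts the observed labelling
  p <- (1 + sum(abs(stats) >= abs(observed))) / (n_perm + 1)
  list(observed = observed, p_value = p)
}

set.seed(1)
x <- mtcars$mpg[mtcars$am == 1]
y <- mtcars$mpg[mtcars$am == 0]
res <- perm_test(x, y)
res
```

Counting the observed labelling in numerator and denominator keeps the Monte-Carlo p-value strictly positive.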
2016-01-19
Resampling methods III
Cross-validation
See Arlot, Sylvain and Celisse, Alain: A survey of cross-validation procedures for model selection
2016-01-14
Resampling methods II
Bootstrap & jackknife
See Jon Wellner: Bootstrap and Jackknife Estimation of Sampling Distributions
2016-01-12
Resampling methods I
Test and training sets
Bootstrap
Jackknife
Cross-validation
Data: Low birthweight (birthweight)
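
For orientation, bootstrap and jackknife estimates of a standard error can be sketched as follows. This is a minimal illustration, assuming the built-in faithful data instead of the birthweight data and the sample mean as statistic. For the mean, the jackknife standard error coincides with the classical sd(x)/sqrt(n), which gives a simple check.

```r
# Sketch: bootstrap and jackknife standard errors of a statistic.
# Assumptions: built-in `faithful` data; statistic = sample mean; 1000 bootstrap samples.
x <- faithful$waiting
theta <- function(v) mean(v)
n <- length(x)

set.seed(1)
boot <- replicate(1000, theta(sample(x, replace = TRUE)))  # resample with replacement
se_boot <- sd(boot)

jack <- sapply(seq_len(n), function(i) theta(x[-i]))       # leave-one-out values
se_jack <- sqrt((n - 1) / n * sum((jack - mean(jack))^2))

c(bootstrap = se_boot, jackknife = se_jack, classical = sd(x) / sqrt(n))
```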
2016-01-07
Regression trees
Data: Scottish hill races, cars, fat
Classification trees
Data: Olive oils (olive)

Season Break

2015-12-17
Open for questions

Inverse analysis
Inverse classification.
Quadrant test, corner test.

2015-12-15
Parallel coordinates (parallel)
See A. Inselberg: Parallel Coordinates: Visualization, Exploration and Classification of High-Dimensional Data.
Data: Old Faithful geyser data (geyser)
Data: Olive oils (olive)

Recurrence plots (recurrence8)
Data: Heart rate (pulse8)
See also: R Heart rate variation on Rforge, in particular the tutorial.
N. Marwan: Structures in Recurrence Plots
2015-12-10
Data: Old Faithful geyser data cont. (geyser)
2015-12-08
Statistical decision theory

Literature

Statistical decision theory and bayesian analysis
Berger, James O.
Springer Science & Business Media 2013.
Mathematical statistics: A decision theoretic approach
Ferguson, Thomas S
Probability and Mathematical Statistics, Vol. 1, Academic Press, New York 1967.
Statistical decision functions
Wald, Abraham
Wiley: New York 1950.
-- Course Evaluation --
2015-12-03
-- No lecture --
Scientific advent matinee 11:15 - 12:45, 14:00(!!!)-16:00
2015-12-01
Classification trees
Data: Olive oils (olive)
Parallel coordinates
See A. Inselberg: Parallel Coordinates: Visualization, Exploration and Classification of High-Dimensional Data.
Data: Old Faithful geyser data (geyser)
Data: Olive oils (olive)
2015-11-26
Projection pursuit
Data: Olive oils (olive)
Software: ggobi. See also: rggobi().
2015-11-24
Projection pursuit
Data: Olive oils (olive)
Software: ggobi. See also: rggobi().
2015-11-19
Principal components
Data: Body fat (fat), Cars
2015-11-17
- no lecture -
2015-11-12
ANOVA
Box-Cox-Transformations
Errors in Variables
Data: Scottish hill races, Cars
2015-11-10
PRIM
Video: PRIM-9. See the ASA video library.
2015-11-05
Linear regression: influence measures
Robust regression
Data: Scottish hill runners data (scottish)
2015-11-03
Example: colour data
Example: cars data
2015-10-29
Leave one out estimators

Some more data sets
R: data()
UC Irvine Machine Learning Repository, e.g. Auto MPG
Cook & Swayne data (cook-chap-data)

2015-10-27
- no lecture -
2015-10-22
Linear models: Residuals
Lit: Computational Statistics, Ch. 2.2.3 ff
2015-10-20
Linear models
Lit: Computational Statistics, Ch. 2
Data: Scottish hill runners data (scottish)
2015-10-15
Diagnostic of one dimensional data
Lit: Computational Statistics, Ch. 1
Data: Old Faithful geyser data (geyser)
2015-10-13
Introduction

Tentative Time Table (2013/14)

2014-02-04
Review
2014-01-28
Tree Based Methods
2014-01-21
Principal Component Analysis
2014-01-14
Brushing, Linking and Prosections
2014-01-07
Projection Pursuit
2013-12-17
Curse and Blessing of Dimensionality
D. Donoho: The Curses and Blessings of Dimensionality. Stanford, 2002
2013-12-10
Influence and Breakdown; Robust Regression
2013-12-03
Outliers in Regression; Diagnostics
2013-11-26
Regression Diagnostics and Residual Analysis III
2013-11-19
Regression Diagnostics and Residual Analysis II
2013-11-12
Regression Diagnostics and Residual Analysis
2013-11-05
Distribution Diagnostics III: Trend, correlation, …
2013-10-29
Distribution Diagnostics II: Modality
2013-10-22
Distribution Diagnostics I

Next Topics

PRIM
Brushing & Linking
Resampling
Tree based methods

Themes

Projection Pursuit


Tukey & Prim 9 Keywords

  • Picturing
  • Rotation
  • Isolation
  • Masking
(John Tukey, 1973) PRIM 9

Literature

Peter J. Huber: Projection Pursuit. The Annals of Statistics, Vol. 13, No. 2. (Jun., 1985), pp. 435-475
G. P. Nason: Exploratory Projection Pursuit
Matthew Ward: Projection Pursuit: A Brief Overview
FlowingData: John Tukey and the Beginning of Interactive Graphics
See also:

http://www.public.iastate.edu/~dicook/JSS/paper/paper.html

Data Sets

PRIM
See Section 5.2 in (Cook and Swayne, 2007). Provided as example in ggobi.
Data Comments
Color perception
Documentation Data
Chemical Diabetes
Literature
Mineral water
See Section 7.3 in (Cook and Swayne, 2007), and R package classifly.
See also http://www.mineralwaters.org/ for comparisons and detailed analysis.

Brushing


M. A. Fisherkeller; Orion Keywords
  • Linking,
  • Scatterplot brushing
(John A. McDonald, 1980+) ORION

Data Sets

Boston housing data
original at UCI Machine Learning Repository.
UNT version.
Harrell version.
R data set: data("Boston", package="MASS").
Cars
Henderson and Velleman (1981). Building Regression Models Interactively. Biometrics 37, 400.
Software: DataDesk (demo version)


Smoothing and Kernel Density Estimation

Data Sets

R (reduced version): data(faithful)
R (complete version): data("geyser",package="MASS")
Old Faithful Geyser Data
A look at some data on the Old Faithful geyser
A. Azzalini and A. W. Bowman
Applied Statistics  39  357--365  (1990)

Literature

A Brief Survey of Bandwidth Selection for Density Estimation
M. C. Jones and J. S. Marron and S. J. Sheather
Journal of the American Statistical Association  91  401--407  (1996)


Bandwidth Selection in Kernel Density Estimation: A Review
B. Turlach
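
The effect of bandwidth selection can be seen directly on the Old Faithful data. The following sketch assumes the reduced data set faithful from base R and compares two selectors available in density(): the default rule of thumb (bw.nrd0) and the Sheather-Jones selector (bw = "SJ"). It is meant as a starting point for experiments, not as a recommendation of either selector.

```r
# Sketch: kernel density estimates with two bandwidth selectors.
# Assumption: reduced Old Faithful data set `faithful` from base R.
x <- faithful$eruptions

d_default <- density(x)             # default rule-of-thumb bandwidth (bw.nrd0)
d_sj      <- density(x, bw = "SJ")  # Sheather-Jones bandwidth

c(nrd0 = d_default$bw, SJ = d_sj$bw)

plot(d_sj, main = "Old Faithful eruption durations")
lines(d_default, lty = 2)           # dashed: default bandwidth
rug(x)                              # mark the observations on the axis
```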


Principal Component Analysis

Data

R:
library("UsingR"); data(fat)

Literature

mkb92ma: Chapter 8.3

Multivariate Analysis
K. V. Mardia, J. T. Kent and J. M. Bibby
      (1979)


Branden2005Robust-classifi

Robust classification in high dimensions based on the SIMCA method
K. Vanden Branden and M. Hubert
Chemometrics and Intelligent Laboratory Systems  79  10--21  (2005)

In this paper we first investigate the robustness of the SIMCA method for classifying high-dimensional observations. It turns out that both stages of the algorithm, the estimation of principal components and the construction of a classification rule, can be highly disturbed by the presence of outliers. Therefore we propose a robust procedure RSIMCA which is based on a robust Principal Component Analysis method for high-dimensional data (ROBPCA). Various simulations and real examples reveal the robustness of our approach. (c) 2005 Elsevier B.V. All rights reserved.



One-Dimensional Diagnostics

Literature

gs94oned

Diagnostic Plots for One-Dimensional Data
G. Sawitzki
in: P. Dirschedl, R. Ostermann (eds.) Computational Statistics. Papers Collected on the Occasion of the 25th Conference on Statistical Computing at Schloss Reisensburg. Physica-Verlag, Heidelberg 1994. pp. 237--258  (1994)
Software and more information: http://www.statlab.uni-heidelberg.de/projects/onedim/.


In preparation

Dimension Reduction

Literature

Li1991Sliced-Inverse-

Sliced Inverse Regression for Dimension Reduction
K.-C. Li
Journal of the American Statistical Association  86  316-327  (1991)


Resampling

Classification and Regression Trees, DART

Literature

Breiman1984CART

Classification and Regression Trees
L. Breiman, J. Friedman, R. Olshen and C. Stone
      (1984)

593439

J. H. Friedman (Aug. 1996a)
"Local Learning Based on Recursive Covering"
(software)


See also...

Courses to look at

Hao Helen Zhang (University of Arizona): MATH 574M - Statistical Machine Learning and Data Mining 2013
Andreas Buja (University of Pennsylvania): Lectures on statistics and data analysis, Columbia University 2009
Heike Hofmann et al. (Iowa State University): Visualizing Quantitative Information
Ross Ihaka et al. (Auckland): Computational Data Analysis and Graphics
Hadley Wickham (Rice University): Data Visualisation


Data

D. F. Andrews and A. M. Herzberg: Data
    XX, 442 pp. (Springer 1985). Data sets online
D. Cook and D. F. Swayne: Interactive and Dynamic Graphics for Data Analysis
    (Springer 2007), Data Descriptions (Feb 2007, PDF, 1.5Mb), Data: See Data section of the book home page.

See also Data.



Literature

(Belsley et al. 1980)
Belsley, Kuh & Welsch, Regression Diagnostics, Wiley, 1980.
(Cook and Swayne, 2007)
D. Cook and D. F. Swayne: Interactive and Dynamic Graphics for Data Analysis
    (Springer 2007), Data Descriptions (Feb 2007, PDF, 1.5Mb)




Software

DataDesk
http://www.datadesk.com/
ggobi
www.ggobi.org
R
www.cran.r-project.org



