Stat 445: Introduction to Exploratory Data Analysis

This is archived information for Stat 445 Sect 201 (Spring, 2005).

Project #1 Marking Results

I was quite impressed with the quality of reports. Almost all students had obviously put in a lot of work, and even when the final grade was low, it was not for lack of effort but rather because a student had misunderstood the assignment (usually handing in a lab-like assignment with R code and unedited R output but almost no explanatory text).

Distribution of Marks

The distribution of marks was as follows:

[Figure: histogram of Project #1 grades (median = 39.5/60 = 66%)]

The distributions of the individual mark components (and their correlations with the total mark) are given by the following set of boxplots.

[Figure: boxplots of the Project #1 grade components]

The components and their maximum possible values were:

desc     (6)   description of data and problem
eda      (12)  exploratory data analysis
cda      (9)   confirmatory analysis
conc     (6)   conclusions
open     (3)   open issues and limitations
level    (6)   level appropriate to audience
org      (6)   organization and structure
pres     (6)   presentation of analysis
clarity  (6)   clarity

The component meanings are described in more detail on the Marking Guidelines webpage.
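As a quick arithmetic check on the list above, the component maxima can be added up in R. This is only a sketch: the vector name max_marks and the other variable names are mine, while the numbers are copied from the list and from the histogram caption.

```r
# Maximum marks per component, copied from the list above.
max_marks <- c(desc = 6, eda = 12, cda = 9, conc = 6, open = 3,
               level = 6, org = 6, pres = 6, clarity = 6)

# The maxima sum to the 60 marks used as the denominator of the median.
total_max <- sum(max_marks)

# Median reported with the histogram, as a percentage of the maximum.
median_pct <- round(100 * 39.5 / total_max)

# Share of the total carried by each component, largest first.
weights <- sort(max_marks / total_max, decreasing = TRUE)
print(round(weights, 2))
```

Together "eda" and "cda" carry 21 of the 60 marks, which is why they are the natural candidates for the strongest relationship with the total.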

Since the total mark is the sum of the components, we would expect a positive correlation between the total mark and each component. Because "eda" and "cda" have the largest maximum values, we might expect them to have larger correlations than the others, but in fact the three components most strongly correlated with the total are "pres" (0.79), "eda" (0.77), and "clarity" (0.76).
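The correlations quoted above could be computed along the following lines. This is a sketch on synthetic data, not the class marks: the data frame marks, the class size n = 40, and the uniform half-mark scoring are all assumptions made for illustration, so the printed values will not match 0.79, 0.77, and 0.76.

```r
set.seed(445)  # arbitrary seed so the sketch is reproducible
n <- 40        # hypothetical class size; the real one is not stated

max_marks <- c(desc = 6, eda = 12, cda = 9, conc = 6, open = 3,
               level = 6, org = 6, pres = 6, clarity = 6)

# Synthetic marks: each component drawn independently, in half-mark
# steps, between 0 and its maximum. These are NOT the real marks.
marks <- as.data.frame(lapply(max_marks, function(m) {
  sample(seq(0, m, by = 0.5), n, replace = TRUE)
}))
marks$total <- rowSums(marks[names(max_marks)])

# Correlation of each component with the total, largest first.
comp_cor <- sort(sapply(marks[names(max_marks)], cor, y = marks$total),
                 decreasing = TRUE)
print(round(comp_cor, 2))
```

When the components are independent, as in this synthetic setup, a component's expected correlation with the total grows with its own spread, so "eda" would tend to come out on top. That is what makes the high rank of "pres" and "clarity" in the real marks worth a closer look.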

Fitting the model total ~ pres gives the following fit:

fitted total mark = 19.9 + 4.97 pres

with R^2 = 0.62 and a residual standard error (RSE) of 5.0. The R function add1(...) shows which additional component adds the most explanatory power to this model. The best two-component model turns out to be total ~ pres + clarity, with the following fit:

fitted total mark = 11.1 + 3.57 pres + 3.32 clarity

with R^2 = 0.84 and RSE = 3.3. The correlation between "pres" and "clarity" is only 0.43, and including the interaction term (i.e., fitting total ~ pres*clarity) doesn't give a significantly better model (the F test gives p = 0.42).
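The model-building steps described above might look roughly like this in R. Again this is a sketch on synthetic data: the data frame marks, the class size, and the object names (fit1, step1, fit2, fit3, cmp) are assumptions of mine, so the coefficients, R^2 values, and p-value will differ from those reported here. Only lm(), add1(), update(), and anova() are the standard R functions.

```r
set.seed(445)  # arbitrary seed so the sketch is reproducible
n <- 40        # hypothetical class size; the real one is not stated

max_marks <- c(desc = 6, eda = 12, cda = 9, conc = 6, open = 3,
               level = 6, org = 6, pres = 6, clarity = 6)

# Synthetic marks (NOT the real data), in half-mark steps.
marks <- as.data.frame(lapply(max_marks, function(m) {
  sample(seq(0, m, by = 0.5), n, replace = TRUE)
}))
marks$total <- rowSums(marks[names(max_marks)])

# One-component model: total ~ pres.
fit1 <- lm(total ~ pres, data = marks)

# add1() tries each remaining component in turn and reports the
# residual sum of squares (RSS) and an F test for each addition.
scope <- reformulate(names(max_marks))  # ~ desc + eda + ... + clarity
step1 <- add1(fit1, scope = scope, test = "F")
print(step1)

# Component whose addition leaves the smallest RSS.
best <- rownames(step1)[which.min(step1$RSS)]

# Two-component model from the text, then the same model with the
# pres:clarity interaction (equivalent to total ~ pres * clarity).
fit2 <- update(fit1, . ~ . + clarity)
fit3 <- update(fit2, . ~ . + pres:clarity)

# F test comparing the models with and without the interaction.
cmp <- anova(fit2, fit3)
print(cmp)

# R^2 and residual standard error of the two-component model.
print(c(r.squared = summary(fit2)$r.squared, rse = summary(fit2)$sigma))
```

On the synthetic data, best need not be "clarity"; the sketch only shows where the numbers quoted above would come from.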

Therefore, it seems that good performance on the whole project can be predicted well by good performance on two relatively independent aspects: the presentation of analysis and its clarity (by which I mean the logical flow from problem to analysis to conclusion).

Top-Scoring Projects

Note: Sorry for the earlier technical difficulties. The sample projects are now available again.

Here are the 6 projects that received the highest marks. (All authors wished to remain anonymous, presumably so fellow group members wouldn't make them do all the hard work for the final project.)

A note on report length: I believe I said the projects should be around 6 pages, single-spaced, with an 8-page maximum. Now, these top-scoring projects were generally closer to 8 pages than 6, but the text was typically double-spaced (or maybe 1.5-spaced) and the figures were sometimes big. Also, just about everyone wrote an 8-page report, and lots of long projects got comparatively low grades, so I still maintain that a 6-page report had as good a chance of getting the highest mark as an 8-page report.
