Thursday, July 31, 2014

Let The Data Speak For Itself


As a Six Sigma practitioner (and not a trained statistician), it is very easy to learn all the statistical
tools such as ANOVA, the t-test, the Capability Study, MSE, and Control charts. Statistical analysis
software, such as Minitab, makes completing one's analysis just a matter of clicking buttons.

In all the delight of creating wonderful charts and analyses, it is easy for any Statistical
Practitioner, no matter how experienced, to forget that not all data sets have a normal distribution, and
not all statistical tools are robust to non-normal data.

What I like to do first is channel John Tukey, one of the main proponents of exploratory data analysis
(EDA). Dr. Tukey encouraged researchers to first examine the data without any assumption of distribution.
Let the data speak for itself. One tool created by Dr. Tukey is the Box Plot, a graphical tool that
shows the data in quartiles, with outliers indicated.
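
As a rough illustration of what a Box Plot summarizes, here is a minimal Python sketch that computes the quartiles and flags outliers using the common 1.5 x IQR rule. The data values and the names `box_plot_summary` and `times` are made up for this example, and different software packages use slightly different quartile conventions, so the numbers may not match Minitab or Excel exactly.

```python
import statistics

def box_plot_summary(data):
    """Return quartiles and outliers using the 1.5 * IQR fence rule."""
    # statistics.quantiles with n=4 returns the three quartile cut points
    # (Python's default "exclusive" method; other tools may differ slightly).
    q1, median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}

# Hypothetical data with one unusually large value
times = [2, 3, 5, 7, 9, 10, 11, 12, 15, 40]
summary = box_plot_summary(times)
print(summary)
```

No distribution is assumed anywhere in this calculation; it only uses the ordering of the data.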

Another very good tool for comparing data sets without assuming a distribution is to create an Empirical Distribution Function graph (EDF) for each data set and compare how well they overlap each other.

The best way to explain the EDF is by example. Say I have a set of data points that I have arranged in
order from lowest value to highest value [2,3,5,7,9,10,11,12,15,19]. We can see that 100% of
the data lies at or below the value "19" (10th value/10 total data points). We can also see that 50% of the data
lies at or below the value "9" (the 5th value/10), and that 10% of the data lies at or below the value "2"
(1/10). From this we can plot a graph with the percentage (or proportion) on the Y axis and the data points on the X axis.
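
The same calculation can be sketched in a few lines of Python. This is only an illustration using the ten data points from the example; the function name `edf_at` is my own choice, not something from a statistics package.

```python
def edf_at(data, x):
    """Proportion of data points that lie at or below x."""
    return sum(1 for value in data if value <= x) / len(data)

data = [2, 3, 5, 7, 9, 10, 11, 12, 15, 19]

# One (x, proportion) pair per data point, ready to plot:
# data values on the X axis, proportions on the Y axis.
edf_points = [(x, edf_at(data, x)) for x in sorted(data)]

for x, proportion in edf_points:
    print(f"{x:>3}  {proportion:.0%}")
```

Counting the points at or below each value, rather than using the position in the sorted list, keeps the proportions correct when the data contains repeated values.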

But how do we use this technique to help us evaluate data? Suppose one is faced with two data sets, for instance time to complete a task under Condition 1 versus time to complete a task under Condition 2. The Six Sigma problem solver wants to see if the two data sets are the same. But since they are not sure if the data is normal ('time to an event' data is often not normal), they may use the EDF comparison first.

Let us first look at the raw data. Always a good start (see Data Sets A). We can see the two data sets (Data 1 is the time to complete the task under Condition 1 and Data 2 is for Condition 2).

We can draw a couple of conclusions from this data. Data 1 and Data 2 have similar low times, but Data 1 has some times much higher than Data 2.

But let's calculate the EDF data (see Data Sets B). We can see that we now have the data sets aligned with the proportion data. So let's plot the EDF (see CDF Data 1-2). The shapes of the two EDFs for Data 1 and Data 2 are very dissimilar. The EDF graphical comparison highlights these differences.

Where this really comes in handy is when two data sets are more similar (see CDF KS1). Notice that the maximum difference between the two EDF curves is highlighted. This maximum distance (in this case 0.85 - 0.55 = 0.3) can be used as a test statistic for significance. This is called the Kolmogorov-Smirnov Test, and more on this may be found HERE.
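
To make the maximum-distance idea concrete, here is a minimal Python sketch. The two data sets are made up for illustration (they are not the Data 1 and Data 2 shown in the figures), and the names `edf_at`, `ks_statistic`, and `ks_critical_value` are my own. The critical value uses the standard large-sample approximation for the two-sample test, so treat it as a rough guide for small samples.

```python
import math

def edf_at(data, x):
    """Proportion of data points that lie at or below x."""
    return sum(1 for value in data if value <= x) / len(data)

def ks_statistic(sample_1, sample_2):
    """Largest vertical gap between the two EDF curves."""
    # The gap can only change at an observed value, so checking
    # every value in the pooled data is enough.
    pooled = sorted(set(sample_1) | set(sample_2))
    return max(abs(edf_at(sample_1, x) - edf_at(sample_2, x)) for x in pooled)

def ks_critical_value(n, m, alpha=0.05):
    """Large-sample approximation of the two-sample critical distance."""
    c = math.sqrt(-0.5 * math.log(alpha / 2))  # about 1.36 for alpha = 0.05
    return c * math.sqrt((n + m) / (n * m))

# Hypothetical task times (not the data from this post)
condition_1 = [12, 15, 18, 21, 25, 30, 34, 41, 48, 60]
condition_2 = [11, 13, 14, 16, 17, 19, 20, 22, 24, 26]

d = ks_statistic(condition_1, condition_2)
d_crit = ks_critical_value(len(condition_1), len(condition_2))
print(f"max distance = {d:.2f}, approximate critical value = {d_crit:.2f}")
print("different" if d > d_crit else "no significant difference detected")
```

If SciPy is available, `scipy.stats.ks_2samp(condition_1, condition_2)` runs the same test and also reports a p-value.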

This test may be done by creating the graphs in Excel and using the simple calculations of the KS-Test. Minitab also has a macro that runs the KS-Test.

Now go "get all Tukey" on your data and evaluate it first without assuming a distribution.

John
