Contingency Tables as a Tool for EDA

Have you ever wondered how a simple-looking contingency table, where the data is organized into rows and columns, can be used to explore the relationship between two categorical variables?

Well, the counts in a contingency table alone don’t really help answer that question, as no pattern is readily evident. However, if we use proportions (or percentages), we can start seeing a relationship.

Marginal and Conditional Distribution

The percentages give us the marginal and conditional distributions of our variables. The marginal distribution of a categorical variable shows what is expected in the normal state of affairs if we don’t consider the other categorical variable at all. The conditional distribution shows how that variable is distributed within a single category of the other variable.

The idea is very simple: we need to look at the marginal and conditional distributions together to discover whether there is a relationship between our categorical variables.

We need to see if the marginal distribution holds up when we compare it to the conditional distribution of our variable of interest.

Mathematically, if P(A|B) = P(A) for every category of A and B, then there is no relationship between the two variables.

Let us understand this through an example. We have the Titanic dataset and the question we are asking is:

Is there any association between passenger class and survival rate?

The contingency table of the data is shown below:

Outcome       First  Second  Third  Grand Total
Survived        136      87    119          342
Not Survived     80      97    372          549
Grand Total     216     184    491          891

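Here the outcome variable is survival, so we first need to know how it behaves in the normal state of affairs, that is, its marginal distribution. In the table below, the Grand Total column gives the marginal distribution of survival, and the First, Second, and Third columns give its conditional distribution within each class (each count divided by its column total).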

Outcome       First  Second  Third  Grand Total
Survived        63%     47%    24%          38%
Not Survived    37%     53%    76%          62%
Grand Total    100%    100%   100%         100%
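
The percentages above can be reproduced from the raw counts. The following is a minimal sketch in plain Python; the variable names (counts, class_totals, marginal, conditional) are illustrative choices, not part of the dataset or of any library.

```python
# Counts copied from the contingency table above.
counts = {
    "Survived":     {"First": 136, "Second": 87, "Third": 119},
    "Not Survived": {"First": 80,  "Second": 97, "Third": 372},
}
classes = ["First", "Second", "Third"]

# Column totals (passengers per class) and the grand total.
class_totals = {c: sum(row[c] for row in counts.values()) for c in classes}
grand_total = sum(class_totals.values())

# Marginal distribution of the outcome: row total / grand total.
marginal = {o: sum(row.values()) / grand_total for o, row in counts.items()}

# Conditional distribution of the outcome given class: cell / column total.
conditional = {
    o: {c: row[c] / class_totals[c] for c in classes}
    for o, row in counts.items()
}

for outcome in counts:
    cells = "  ".join(f"{c}: {conditional[outcome][c]:.0%}" for c in classes)
    print(f"{outcome:<13} {cells}  Overall: {marginal[outcome]:.0%}")
```

Each conditional percentage divides a cell by its column total, while each marginal percentage divides a row total by the grand total, which is why every column in the percentage table sums to 100%.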

If we do not take class into account, the survival rate in this dataset is 38%. For a first class passenger, however, the likelihood of survival is 63%. The second and third class passengers also deviate from the marginal rate, at 47% and 24% respectively.

We conclude that the conditional probability of survival within each class does not match the marginal probability of survival, which ignores class.

Thus we can say that there is a strong relationship between survival rate and passenger class.

Summary

In summary, here are the steps needed to determine if there is a relationship between two categorical variables.

  1. Identify the variable of interest (the outcome variable) in the contingency table.
  2. Determine the marginal and conditional distributions of the outcome variable.
  3. Compare the conditional distributions with the marginal distribution.
  4. If the conditional distributions differ noticeably from the marginal distribution, there is a relationship between the categorical variables.
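
The steps above can be sketched as one small helper. This is only an illustration: the function name outcome_gaps and the nested-dictionary layout of the table are assumptions made for the example, and the helper reports the gaps without deciding how large a gap counts as a relationship.

```python
def outcome_gaps(table, outcome):
    """Return the marginal rate of `outcome` and, for each group,
    the conditional rate minus the marginal rate."""
    groups = list(table[outcome])
    # Total number of observations in each group (column totals).
    group_totals = {g: sum(row[g] for row in table.values()) for g in groups}
    grand_total = sum(group_totals.values())
    # Step 2: marginal rate of the outcome, ignoring the grouping variable.
    marginal = sum(table[outcome].values()) / grand_total
    # Step 3: conditional rate within each group, compared with the marginal.
    gaps = {g: table[outcome][g] / group_totals[g] - marginal for g in groups}
    return marginal, gaps


# Step 1: the outcome variable is survival; the groups are passenger classes.
titanic = {
    "Survived":     {"First": 136, "Second": 87, "Third": 119},
    "Not Survived": {"First": 80,  "Second": 97, "Third": 372},
}

marginal, gaps = outcome_gaps(titanic, "Survived")
print(f"Overall survival rate: {marginal:.1%}")
for group, gap in gaps.items():
    # Step 4: gaps far from zero suggest a relationship.
    print(f"{group}: {gap * 100:+.1f} percentage points vs. overall")
```

A gap near zero in every group would mean the conditional distribution matches the marginal one, which is the no-relationship case P(A|B) = P(A).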

 

The chi-square test of independence is a statistical way of testing the independence of two categorical variables using the same principle.
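
As a rough illustration of that principle, the sketch below computes the Pearson chi-square statistic for the Titanic table by hand, comparing each observed count with the count expected if survival were independent of class. It uses only the standard library; the closed-form p-value exp(-x/2) is valid only because this table has exactly 2 degrees of freedom, so treat it as a shortcut for this example rather than a general method.

```python
import math

# Observed counts; columns are First, Second, Third class.
observed = [
    [136, 87, 119],  # Survived
    [80, 97, 372],   # Not Survived
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Count expected under independence: row total * column total / n.
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

# Degrees of freedom: (rows - 1) * (columns - 1).
dof = (len(observed) - 1) * (len(col_totals) - 1)

# Shortcut that holds only for 2 degrees of freedom:
# the upper-tail probability of the chi-square distribution is exp(-x / 2).
assert dof == 2
p_value = math.exp(-chi2 / 2)

print(f"chi-square = {chi2:.2f}, dof = {dof}, p-value = {p_value:.2e}")
```

A very small p-value means the observed counts would be very unlikely if class and survival were independent, which agrees with the comparison of marginal and conditional percentages.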

Check out this post to see how the chi-square test can be used to test the independence of two categorical variables.

 

 
