Chi Square (\(\chi^2\))
All of the analyses we have covered this semester involve categorical independent variables and a continuous dependent variable. But what should you do when your outcome data are also categorical? Luckily, there is a set of categorical data analytic techniques you can use when your data are categorical. The one we will focus on this week is the \(\chi^2\) goodness-of-fit test.
The \(\chi^2\) Goodness-of-Fit Test
In this example, 200 people are asked to “draw” cards from an imaginary deck of cards, and their responses are recorded. We are interested in determining whether the cards people selected are really random. Load the lsr package if you haven’t already done so previously, and then load the dataset (randomness.Rdata, available here: Download Randomness Dataset). Save the file to your working directory for easy loading.
load("randomness.Rdata")
For this example, we will focus on the first choice that people made (“choice_1” in the dataset). Use the table function to see the distribution of responses across the four types of playing cards. Save the table to an object called “observed”.
observed <- table(cards$choice_1)
observed
##
## clubs diamonds hearts spades
## 35 51 64 50
We want to test whether this distribution of scores is random. Our null hypothesis is that all four suits of cards are chosen with equal probability; in other words, each suit should have a 25% chance of being drawn. We will test whether the distribution we observed is the same as or different from what we expected.
We can save the probabilities to an object, and then multiply those probabilities by our sample size (N = 200) to get our expected values.
probabilities <- c(.25, .25, .25, .25)
N <- 200
expected <- (N*probabilities)
expected
## [1] 50 50 50 50
The formula for the \(\chi^2\) goodness-of-fit statistic is:
\[\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}\]
We need to subtract the expected scores from our observed scores and then square the differences. Then we divide each of those values by the corresponding expected value. Finally, we sum all of those numbers together to get the value of \(\chi^2\).
sum((observed - expected)^2 / expected)
## [1] 8.44
The last step is to determine whether our \(\chi^2\) value is statistically significant. We can use the pchisq function to find the p-value associated with our \(\chi^2\) statistic, with df = 3 (the number of categories minus one). We set lower.tail = FALSE to get the p-value for a \(\chi^2\) score equal to or greater than 8.44.
pchisq(8.44, df=3, lower.tail=FALSE)
## [1] 0.03774185
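Equivalently, you could compare the observed statistic to a critical value from the \(\chi^2\) distribution. Here is a quick sketch (not part of the original walk-through) using the qchisq function:
# critical value of chi-square for alpha = .05 with 3 degrees of freedom
qchisq(.95, df = 3)
The critical value is roughly 7.81; because our statistic of 8.44 exceeds it, we reach the same decision as with the p-value.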
Using the lsr package to run the goodness-of-fit test
Luckily, the lsr package has a function that will run the \(\chi^2\) goodness-of-fit test for us: goodnessOfFitTest.
goodnessOfFitTest(cards$choice_1)
##
## Chi-square test against specified probabilities
##
## Data variable: cards$choice_1
##
## Hypotheses:
## null: true probabilities are as specified
## alternative: true probabilities differ from those specified
##
## Descriptives:
## observed freq. expected freq. specified prob.
## clubs 35 50 0.25
## diamonds 51 50 0.25
## hearts 64 50 0.25
## spades 50 50 0.25
##
## Test results:
## X-squared statistic: 8.44
## degrees of freedom: 3
## p-value: 0.038
Goodness-of-fit tests can also be used when you know what the distribution of scores across categories is in a population and the probabilities aren’t necessarily equal to one another. For example, pretend that we knew that people in general tend to select red cards 60% of the time and black cards 40% of the time. We can specify the probabilities for the population and then test our observed data against those probabilities.
redpref <- c(clubs=.2, diamonds =.3, hearts= .3, spades = .2)
goodnessOfFitTest(cards$choice_1, p=redpref)
##
## Chi-square test against specified probabilities
##
## Data variable: cards$choice_1
##
## Hypotheses:
## null: true probabilities are as specified
## alternative: true probabilities differ from those specified
##
## Descriptives:
## observed freq. expected freq. specified prob.
## clubs 35 40 0.2
## diamonds 51 60 0.3
## hearts 64 60 0.3
## spades 50 40 0.2
##
## Test results:
## X-squared statistic: 4.742
## degrees of freedom: 3
## p-value: 0.192
The goodnessOfFitTest function in the lsr package is very user-friendly, as it was designed for introductory statistics students. As you know, most functions in R do not give you as much helpful output, so you have to know what you are doing! There is a chisq.test function in base R that will calculate the chi-square statistic for you (but it won’t give as much output as the lsr package).
options(digits = 3)
#equal probability
chisq.test(observed)
##
## Chi-squared test for given probabilities
##
## data: observed
## X-squared = 8, df = 3, p-value = 0.04
#red more likely example
chisq.test(observed, p=c(.20,.30,.30,.20))
##
## Chi-squared test for given probabilities
##
## data: observed
## X-squared = 5, df = 3, p-value = 0.2
The Test of Independence
The goodness-of-fit test is for one categorical variable where you want to test the observed proportions against a known population distribution (or against equal probability, i.e., chance). The \(\chi^2\) test of independence is used when you have two categorical variables and you want to know whether they are associated with (i.e., not independent of) one another.
Creating a Contingency Table
We’ll use the cats dataset available from the Andy Field DSUR companion website for Discovering Statistics Using R. Or, you can access a csv version here:
cats <- read.csv("cats.csv", fileEncoding = "UTF-8-BOM")
Note the fileEncoding = "UTF-8-BOM" argument. This bit of code will prevent a common issue that sometimes occurs when reading in csv files, where the variable name for the first column comes in with stray symbols. If you don’t include that piece of code and find that the first variable of your dataframe is not what you expect, add this argument to your read-in and try again.
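As a quick sanity check that the file read in cleanly (a small sketch using the cats data frame created above):
# confirm the column names and peek at the first few rows
names(cats)
head(cats)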
Let’s take a look at a contingency table of the two variables, Dance and Training. The addmargins function will add a column and a row to the table containing the corresponding sums.
#creating a contingency table
cats_frequencies <- addmargins(table(cats$Dance, cats$Training))
cats_frequencies
##
## Affection as Reward Food as Reward Sum
## No 114 10 124
## Yes 48 28 76
## Sum 162 38 200
Remember the formula for the \(\chi^2\) statistic:
\[\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}\]
But now we’ll need to calculate expected values for 4 conditions (Food/No, Food/Yes, Affection/No, and Affection/Yes), replacing the single expected value per category with an expected frequency \(E_{ij}\) for each cell of the contingency table.
To calculate the expected values, we use the formula:
\[E_{ij} = \frac{R_i \times C_j}{N}\]
We take the row total for row i, multiply it by the column total for column j, and then divide by N. For example, to calculate the expected value for Food/Yes, we would do the following:
\[E_{\text{Food/Yes}} = \frac{76 \times 38}{200} = 14.44\]
If we were going to continue doing this by hand, we would then calculate the expected values for the other three conditions. Then we would need to plug the expected and observed values for each of the four conditions back into the formula for \(\chi^2\).
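Here is a minimal sketch of that by-hand calculation in R, assuming the cats data frame from above is loaded; the object names (obs, expected, chi_sq) are just for illustration:
# observed frequencies (without the margin sums)
obs <- table(cats$Dance, cats$Training)
# expected frequency for each cell: (row total x column total) / N
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
# uncorrected chi-square statistic, summed over all four cells
chi_sq <- sum((obs - expected)^2 / expected)
chi_sq
# p-value with df = (rows - 1) x (columns - 1) = 1
pchisq(chi_sq, df = 1, lower.tail = FALSE)
This reproduces the uncorrected statistic (the 25.35 that Field reports); the functions below apply Yates’ continuity correction by default, which is why their value is slightly smaller.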
Of course, R can do the dirty work for us. Navarro has created another user-friendly function in the lsr package to run the \(\chi^2\) test of independence: associationTest.
library(lsr)
associationTest(~ Dance+Training, data=cats)
##
## Chi-square test of categorical association
##
## Variables: Dance, Training
##
## Hypotheses:
## null: variables are independent of one another
## alternative: some contingency exists between variables
##
## Observed contingency table:
## Training
## Dance Affection as Reward Food as Reward
## No 114 10
## Yes 48 28
##
## Expected contingency table under the null hypothesis:
## Training
## Dance Affection as Reward Food as Reward
## No 100.4 23.6
## Yes 61.6 14.4
##
## Test results:
## X-squared statistic: 23.5
## degrees of freedom: 1
## p-value: <.001
##
## Other information:
## estimated effect size (Cramer's v): 0.343
## Yates' continuity correction has been applied
We can also use the chisq.test function to calculate the \(\chi^2\) test of independence:
chisq.test(cats$Training, cats$Dance)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: cats$Training and cats$Dance
## X-squared = 24, df = 1, p-value = 1e-06
Yates’ Continuity Correction
Note that for both of the above analyses, the answer is the same: 23.5, p < .001. However, Field gets a slightly different answer (25.35). Why the difference? Both associationTest and chisq.test apply Yates’ continuity correction to the \(\chi^2\) statistic by default when you have a 2 x 2 contingency table. With a 2 x 2 table, the \(\chi^2\) statistic tends to come out a little too large relative to its continuous sampling distribution, making the test slightly too liberal. To adjust for this, the Yates correction is applied by subtracting 0.5 from the absolute deviation scores (|observed - expected|) before squaring. There is some debate about whether this correction is overly conservative, or whether it is necessary at all. Just know that most statistical software (and, in this case, these functions in R) applies the correction automatically when you run a 2 x 2 \(\chi^2\) test.
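If you want to see the uncorrected value for yourself, chisq.test has a correct argument that turns the Yates correction off (a quick sketch, not part of the original walk-through):
# suppress Yates' continuity correction to get the uncorrected chi-square
chisq.test(cats$Training, cats$Dance, correct = FALSE)
This should reproduce the 25.35 that Field reports, which also appears as the uncorrected Pearson test in the CrossTable output below.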
Field uses the CrossTable function in the gmodels package to run the \(\chi^2\) test of independence:
#install.packages("gmodels")
library(gmodels)
CrossTable(cats$Training, cats$Dance, fisher = TRUE, chisq = TRUE, expected = TRUE, format = "SPSS")
##
## Cell Contents
## |-------------------------|
## | Count |
## | Expected Values |
## | Chi-square contribution |
## | Row Percent |
## | Column Percent |
## | Total Percent |
## |-------------------------|
##
## Total Observations in Table: 200
##
## | cats$Dance
## cats$Training | No | Yes | Row Total |
## --------------------|-----------|-----------|-----------|
## Affection as Reward | 114 | 48 | 162 |
## | 100.440 | 61.560 | |
## | 1.831 | 2.987 | |
## | 70.370% | 29.630% | 81.000% |
## | 91.935% | 63.158% | |
## | 57.000% | 24.000% | |
## --------------------|-----------|-----------|-----------|
## Food as Reward | 10 | 28 | 38 |
## | 23.560 | 14.440 | |
## | 7.804 | 12.734 | |
## | 26.316% | 73.684% | 19.000% |
## | 8.065% | 36.842% | |
## | 5.000% | 14.000% | |
## --------------------|-----------|-----------|-----------|
## Column Total | 124 | 76 | 200 |
## | 62.000% | 38.000% | |
## --------------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 25.4 d.f. = 1 p = 4.77e-07
##
## Pearson's Chi-squared test with Yates' continuity correction
## ------------------------------------------------------------
## Chi^2 = 23.5 d.f. = 1 p = 1.24e-06
##
##
## Fisher's Exact Test for Count Data
## ------------------------------------------------------------
## Sample estimate odds ratio: 6.58
##
## Alternative hypothesis: true odds ratio is not equal to 1
## p = 1.31e-06
## 95% confidence interval: 2.84 16.4
##
## Alternative hypothesis: true odds ratio is less than 1
## p = 1
## 95% confidence interval: 0 14.3
##
## Alternative hypothesis: true odds ratio is greater than 1
## p = 7.71e-07
## 95% confidence interval: 3.19 Inf
##
##
##
## Minimum expected frequency: 14.4
Assumptions of \(\chi^2\)
There are two assumptions you need to be concerned with when doing a \(\chi^2\) test:
1. Independence of observations. Each person, item, or entity can only contribute to one cell of the contingency table.
2. Expected frequencies per cell are sufficiently large. The expected frequency of each cell of a contingency table should be greater than or equal to 5. In larger tables, you can get away with 80% of the cells having an expected frequency greater than 5 and none below 1 (one way to check the expected counts is sketched after the Fisher test output below). If expected cell counts are small (and the sample size is likely to be small), you can use Fisher’s exact test instead of \(\chi^2\):
fisher.test(cats$Training, cats$Dance)
##
## Fisher's Exact Test for Count Data
##
## data: cats$Training and cats$Dance
## p-value = 1e-06
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 2.84 16.43
## sample estimates:
## odds ratio
## 6.58
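As promised above, one way to check the expected-frequency assumption (a quick sketch, not part of the original walk-through) is to pull the expected counts out of a chisq.test result:
# expected cell counts; each should be at least 5 for the chi-square test to be appropriate
chisq.test(cats$Training, cats$Dance)$expected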
Effect size
For a 2 x 2 contingency table, there are a few different effect sizes you can calculate and report: the odds ratio, Cramer’s V, and Phi (\(\phi\)).
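The sample odds ratio can be computed directly from the contingency table. Here is a quick sketch (not part of the original walk-through); note that it differs slightly from the conditional estimate that fisher.test reports:
# odds of dancing for food-trained vs. affection-trained cats
odds_food <- 28 / 10        # dancers / non-dancers, food as reward
odds_affection <- 48 / 114  # dancers / non-dancers, affection as reward
odds_food / odds_affection  # about 6.65 (fisher.test's conditional estimate was 6.58)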
Phi
Phi (\(\phi\)) measures the strength of association between two categorical variables in a 2 x 2 contingency table: \(\phi = \sqrt{\chi^2 / N}\), where N is the total sample size. For a 2 x 2 table, Phi and Cramer’s V are equivalent.
Cramer’s V
Cramer’s V measures the strength of association between two nominal/categorical variables. It is included in the output of associationTest, and you can also get it using the cramersV function in the lsr package. In the formula below, N is the total sample size and k is the smaller of the number of rows and the number of columns.
The formula for Cramer’s V is:
\[V = \sqrt{\frac{\chi^2}{N(k-1)}}\]
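As a quick check (a sketch using values from the output above, not part of the original walk-through), you can plug the Yates-corrected \(\chi^2\) statistic into the formula by hand:
# by-hand Cramer's V using the corrected chi-square reported above
chi_sq_yates <- 23.5  # X-squared statistic from associationTest / chisq.test
N <- 200              # total sample size
k <- 2                # smaller of the number of rows and number of columns
sqrt(chi_sq_yates / (N * (k - 1)))  # about 0.343, matching the lsr output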
To calculate Cramer’s V directly, you can use the cramersV function:
cramersV(cats$Training, cats$Dance)
## [1] 0.343