Categorical Predictors

Packages you should have loaded to successfully run this week’s exercises: haven, mosaic, dplyr, ggplot2, forcats, ez, emmeans, and car.

This week we will focus on running and interpreting regression analyses with categorical predictors. We will use the NELS dataset referenced in the Keith book. The questions we are interested in answering are: Does family structure affect adolescents’ use of dangerous and illegal substances? Are adolescents from intact families less or more likely to use alcohol, tobacco, and other drugs? (Keith, page 120).

NELS <- read_sav("NELS n=1000.sav")

Data Preparation

Before we dive into the regression analysis, we first need to create the variables that we need for the analysis. The NELS dataset as a variable called FamComp that has 6 levels:

1 = Mother & father
2 = Mother and male guardian
3 = Father and female guardian
4 = Other two adult families
5 = Adult female only
6 = Adult male only

We are interested in comparing three groups of adolescents: those who live with 2 parents, those who live with one parent and one guardian, and those who live with a single parent. We will need to do some recoding/collapsing before we are ready to analyze this variable. To create a famstruc variable, we will leave group 1 alone, collapse groups 2 and 3, delete group 4, and collapse groups 5 and 6. All other groups/levels will be dropped from the variable. We will use the forcats package to help recode the variable into what we need.

NELS <- NELS %>%
  mutate(famcomp = as_factor(famcomp),
         famstruc = fct_collapse(famcomp, "Two-parent family" = c("MOTHER & FATHER"),
                                 "One parent, one guardian" = c("MOTHER & MALE GUARDN", "FEMALE GDN & FATHER"), 
                                 "Single-parent family" = c("ADULT FEMALE ONLY", "ADULT MALE ONLY")),
         famstruc = factor(famstruc, levels = c("Two-parent family", "One parent, one guardian", "Single-parent family")))

#checking to make sure that the categories add up to what they should 
fct_count(NELS$famcomp)

## # A tibble: 11 x 2
##    f                        n
##    <fct>                <int>
##  1 MOTHER & FATHER        677
##  2 MOTHER & MALE GUARDN   101
##  3 FEMALE GDN & FATHER     17
##  4 OTH 2 ADULT FAMILIES     8
##  5 ADULT FEMALE ONLY      156
##  6 ADULT MALE ONLY         21
##  7 MULTIPLE RESPONSE        0
##  8 REFUSAL                  0
##  9 MISSING                  0
## 10 LEGITIMATE SKIP          0
## 11 <NA>                    20

fct_count(NELS$famstruc)

## # A tibble: 4 x 2
##   f                            n
##   <fct>                    <int>
## 1 Two-parent family          677
## 2 One parent, one guardian   118
## 3 Single-parent family       177
## 4 <NA>                        28

There isn’t a general substance abuse variable in the NELS dataset, but there are several questions that we could combine to create a composite of substance abuse. Create a variable that is the mean of f1s77, f1s78a, and f1s80aa. These three variables represent lifetime total use of cigarettes, alcohol, and marijuana. These variables were measured on different scales, so we will need to standardized them before we create the composite variable.

#creating z-scores
NELS <- NELS %>%
  mutate(zf1s77 = scale(f1s77),
         zf1s78a = scale(f1s78a),
         zf1s80aa = scale(f1s80aa))

summary(NELS$zf1s77)

##        V1         
##  Min.   :-0.3668  
##  1st Qu.:-0.3668  
##  Median :-0.3668  
##  Mean   : 0.0000  
##  3rd Qu.:-0.3668  
##  Max.   : 5.1786  
##  NA's   :96

#creating the substance abuse variable
NELS <- NELS %>%
  mutate(Substance = rowMeans(dplyr::select(NELS, zf1s77, zf1s78a, zf1s80aa)))

#descriptive statistics
favstats(~Substance, data=NELS)

##        min         Q1     median        Q3      max          mean        sd   n
##  -0.811183 -0.4874187 -0.1636545 0.2060405 3.349166 -0.0007963072 0.7720046 855
##  missing
##      145

ANOVA versus Regression with Categorical Predictors

As you might remember from last semester, if we have a categorical IV and a continuous DV, we can use an ANOVA to compare the means across groups. In this example, we have one categorical IV with 3 levels, and a continuous DV. So let’s run this as a one-way ANOVA and graph the means. ezANOVA has trouble with missing data so I am going to create a smaller dataset with just the variables we need first that contains only complete cases.

#examining the means and standard deviations by the famstructure variable
favstats(Substance ~ famstruc, data = NELS)

##                   famstruc       min         Q1       median        Q3      max
## 1        Two-parent family -0.811183 -0.4874187 -0.163654467 0.1601098 3.349166
## 2 One parent, one guardian -0.811183 -0.4874187 -0.001772328 0.5298047 2.979471
## 3     Single-parent family -0.811183 -0.4874187 -0.163654467 0.6414183 2.979471
##          mean        sd   n missing
## 1 -0.05845186 0.7262071 597      80
## 2  0.11956000 0.7615271  94      24
## 3  0.19184481 0.9361701 142      35

#creating dataset with no missing
NELS_short <- NELS %>%
  dplyr::select(stu_ids, famstruc, Substance) %>%
  na.omit()  #removes cases with missing data

#running the anova model
anova_model <- ezANOVA(data=NELS_short,
                       wid = stu_ids,
                       between=famstruc,
                       dv = Substance,
                       type = 3,
                       detailed= TRUE,
                       return_aov = TRUE)
print(anova_model)

## $ANOVA
##        Effect DFn DFd      SSn      SSd        F            p p<.05        ges
## 1 (Intercept)   1 830 3.305773 491.8238 5.578809 0.0184090634     * 0.00667658
## 2    famstruc   2 830 8.594215 491.8238 7.251782 0.0007547294     * 0.01717407
## 
## $`Levene's Test for Homogeneity of Variance`
##   DFn DFd      SSn      SSd        F            p p<.05
## 1   2 830 5.092453 282.1811 7.489404 0.0005975776     *
## 
## $aov
## Call:
##    aov(formula = formula(aov_formula), data = data)
## 
## Terms:
##                 famstruc Residuals
## Sum of Squares    8.5942  491.8238
## Deg. of Freedom        2       830
## 
## Residual standard error: 0.7697784
## Estimated effects may be unbalanced

The above results correspond to the SPSS results presented in Keith on page 120-121. The overall ANOVA is statistically significant, F(2,830) = 7.25, p <.001. Let’s now graph the means and examine post-hoc mean comparisons to see which groups are significantly different from one another.

cleanup = theme(panel.grid.major = element_blank(), 
                panel.grid.minor = element_blank(), 
                panel.background = element_blank(), 
                axis.line.x = element_line(color = "black"),
                axis.line.y = element_line(color = "black"),
                legend.key = element_rect(fill = "white"),
                text = element_text(size = 15))

mean_plot <-ggplot(data=NELS_short, aes(x = famstruc , y = Substance)) +
  stat_summary(fun.y = mean, 
               geom = "point") +
  stat_summary(fun.y = mean, 
               geom = "line", aes(group=1)) +
  stat_summary(fun.data = mean_cl_normal, 
               geom = "errorbar", 
               width = .2, 
               position = "dodge")  + 
  cleanup
mean_plot

emmeans(anova_model$aov, pairwise~famstruc)

## $emmeans
##  famstruc                  emmean     SE  df lower.CL upper.CL
##  Two-parent family        -0.0585 0.0315 830  -0.1203  0.00339
##  One parent, one guardian  0.1196 0.0794 830  -0.0363  0.27540
##  Single-parent family      0.1918 0.0646 830   0.0650  0.31864
## 
## Confidence level used: 0.95 
## 
## $contrasts
##  contrast                                        estimate     SE  df t.ratio
##  Two-parent family - One parent, one guardian     -0.1780 0.0854 830 -2.084 
##  Two-parent family - Single-parent family         -0.2503 0.0719 830 -3.483 
##  One parent, one guardian - Single-parent family  -0.0723 0.1024 830 -0.706 
##  p.value
##  0.0938 
##  0.0015 
##  0.7599 
## 
## P value adjustment: tukey method for comparing a family of 3 estimates

Based on the results of the ANOVA model, we can conclude that there are significant differences between the three groups. The post-hoc comparisons tell us that the two-parent family structure is associated with lower substance abuse compared to the other two groups. Further, the difference between the one-parent one guardian group and the single parent group is not statistically significant.

Linear Regression Model

Now we will run this as a regression model and see how the results compare.

mod1 <- lm(Substance ~ famstruc, data=NELS_short)
summary(mod1)

## 
## Call:
## lm(formula = Substance ~ famstruc, data = NELS_short)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0030 -0.4290 -0.1052  0.2186  3.4076 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      -0.05845    0.03150  -1.855 0.063904 .  
## famstrucOne parent, one guardian  0.17801    0.08542   2.084 0.037467 *  
## famstrucSingle-parent family      0.25030    0.07187   3.483 0.000523 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7698 on 830 degrees of freedom
## Multiple R-squared:  0.01717,	Adjusted R-squared:  0.01481 
## F-statistic: 7.252 on 2 and 830 DF,  p-value: 0.0007547

Compare the results of the one-way ANOVA and the post-hoc comparisons against the results of linear regression model. What do you notice?
How do you interpret the coefficients in the regression model?

Note that there is one comparison missing from the regresion–one parent, one guardian versus single-parent family. If we decide we need that comparison (you might not!), how can we get it? The easiest way is to re-run the regression model and change the reference group.

#method 1
contrasts(NELS$famstruc) <- contr.treatment(3, base = 2)

#method 2
NELS$famstruc <- relevel(NELS$famstruc, ref=3)

#re-run the model
mod2 <- lm(Substance ~ famstruc, data=NELS)
summary(mod2)

## 
## Call:
## lm(formula = Substance ~ famstruc, data = NELS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0030 -0.4290 -0.1052  0.2186  3.4076 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       0.19184    0.06460   2.970 0.003066 ** 
## famstrucTwo-parent family        -0.25030    0.07187  -3.483 0.000523 ***
## famstrucOne parent, one guardian -0.07228    0.10236  -0.706 0.480256    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7698 on 830 degrees of freedom
##   (167 observations deleted due to missingness)
## Multiple R-squared:  0.01717,	Adjusted R-squared:  0.01481 
## F-statistic: 7.252 on 2 and 830 DF,  p-value: 0.0007547

Summary of the results

To determine if family structure predicted substance use, we used a linear regression model. Family structure was broken down into three groups: single parent family, one parent and one guardian, and two-parent families. The family structure variable was then dummy coded with two-parent family structure group as the reference category. The overall regression model was statistically significant, F(2, 830) = 7.252, p<.001, $R^{2}$ = .017. Substance abuse was significantly lower in adolescents from two-parent families versus one parent, one guardian families, b = 0.178, SE = .085, t(830)= 2.08, p = .037, and single-parent families, b = 0.25, SE = .072, t(830) = .25. The difference between one parent and one guadarian versus single parent was not statistically significant, t(830)= -0.706, p = .48.

Last updated on Jun 1, 2020