Statsmodels Tutorial¶
Authors: Tabitha Githinji, Najma Golicha
1. Introduction to Statsmodels¶
Statsmodels is a Python package used for statistical modeling and econometric analysis. It provides tools for estimating models, performing hypothesis tests, and interpreting relationships between variables in a statistically rigorous way.
Unlike machine learning libraries that focus primarily on prediction, Statsmodels is designed for inference and explanation. It produces detailed statistical output such as:
- Coefficients
- Standard errors
- Confidence intervals
- p-values
- Model fit statistics (R², AIC, etc.)
2. When and Why Use Statsmodels¶
Statsmodels is used when the goal is to understand relationships between variables rather than just make predictions.
Typical use cases include:
- Testing hypotheses about group differences
- Estimating the effect of variables using regression
- Building interpretable statistical models
- Analyzing time series or economic data
3. Comparison to R (stats package)¶
Statsmodels plays a similar role in Python to the stats package in R.
In R, statistical modeling tools are built into the base language and its stats package.
In Python, Statsmodels provides those statistical inference tools on top of pandas and NumPy.
Both allow:
- Linear regression
- Hypothesis testing
- Distribution-based analysis
- Statistical summaries
The key difference is ecosystem structure:
R is statistics-first; Python is general-purpose, and Statsmodels fills the statistics gap.
4. Core Functional Areas of Statsmodels¶
This section introduces the main categories of tools used in Statsmodels.
4.1 Summary Statistics and Hypothesis Testing¶
Statsmodels provides formal statistical testing tools beyond pandas descriptive summaries.
These are used when you want inference, not just description.
Common functions:
- statsmodels.stats.weightstats.ttest_ind → two-sample t-test
- statsmodels.stats.api.DescrStatsW → weighted descriptive statistics
- statsmodels.stats.proportion → tests for proportions
These allow you to test whether differences between groups are statistically significant.
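As a minimal sketch, the two-sample t-test can be run on simulated data; the group means, spread, and sample sizes below are made up purely for illustration.
# Illustrative sketch: two-sample t-test on simulated data (values are made up)
import numpy as np
from statsmodels.stats.weightstats import ttest_ind

rng = np.random.default_rng(42)
group_a = rng.normal(loc=600, scale=100, size=200)  # hypothetical group A
group_b = rng.normal(loc=630, scale=100, size=200)  # hypothetical group B

tstat, pvalue, dof = ttest_ind(group_a, group_b)  # returns (t statistic, p-value, degrees of freedom)
print(tstat, pvalue, dof)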
4.2 Linear Regression (OLS)¶
The most widely used feature in Statsmodels is Ordinary Least Squares (OLS) regression. The main function is statsmodels.formula.api.ols(), which fits an OLS regression model. You write your model as a formula, similar to R: y ~ x1 + x2. You then call .fit() to estimate the model and .summary() to get the full statistical output.
Key outputs:
- Coefficients (effect sizes)
- p-values (statistical significance)
- R-squared (model fit)
- Standard errors
- Confidence intervals
This is directly comparable to lm() in R.
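A runnable sketch of this workflow, using a small simulated DataFrame (the variable names and coefficient values are arbitrary):
# Sketch: fit an OLS model with the formula interface on simulated data
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 2 + 1.5 * df["x1"] - 0.5 * df["x2"] + rng.normal(size=100)

results = smf.ols("y ~ x1 + x2", data=df).fit()  # formula interface, similar to lm(y ~ x1 + x2) in R
print(results.summary())                          # coefficients, p-values, R-squared, confidence intervals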
4.3 Categorical Variables¶
Statsmodels handles categorical variables using the C() function.
model = smf.ols("y ~ C(category) + x", data=df).fit()
This automatically:
- Converts categories into dummy variables
- Selects a reference group
- Avoids manual encoding
This is useful for nominal variables such as region or race: the dummy coding and the choice of baseline category happen automatically when the model is fit.
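If you want a specific baseline instead of the default (the first category in sorted order), the formula syntax also accepts a Treatment() contrast. A sketch, using the same placeholder names as above and a hypothetical level "A" as the reference:
# Optional sketch: set the baseline category explicitly with Treatment()
# (same placeholder df, category, and x as above; "A" is a hypothetical level)
model = smf.ols('y ~ C(category, Treatment(reference="A")) + x', data=df).fit()
print(model.params)  # dummy coefficients are now measured relative to the "A" group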
4.4 Logistic Regression (Binary Outcomes)¶
For binary dependent variables, Statsmodels uses logistic regression.
logit_model = smf.logit("y ~ x1 + x2", data=df).fit()
logit_model.summary()
Key features:
- Models probability of an event
- Outputs log-odds coefficients
- Provides full statistical inference output
The syntax is similar to linear regression, but the model uses a logit link function. You still get a full statistical summary, which makes it easy to interpret coefficients, odds ratios, and significance levels. This is helpful when you’re working with classification problems but still want the kind of statistical output that scikit‑learn doesn’t provide.
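A common follow-up, sketched below, is to exponentiate the coefficients so they can be read as odds ratios; this continues from the hypothetical logit_model above.
# Sketch: reading logit coefficients (log-odds) as odds ratios
import numpy as np

odds_ratios = np.exp(logit_model.params)      # exponentiate the log-odds coefficients
or_conf_int = np.exp(logit_model.conf_int())  # confidence intervals on the odds-ratio scale
print(odds_ratios)
print(or_conf_int)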
These four areas - summary stats, linear regression, categorical regression, and logistic regression - cover the majority of real‑world statistical modeling. Together, they show how Statsmodels supports both explanation and inference, which is the main reason people use it.
5. Dataset Overview: Wages Data¶
We will apply Statsmodels to the Wages dataset from The Statistical Sleuth (Ramsey & Schafer, 2002).
This dataset contains information on 25,631 full-time male workers in the United States aged 18–70.
Variables used:
- Wage: weekly wage (response variable)
- Education: years of schooling
- Experience: years of labor market experience
- Black: race indicator (Black vs Non-Black)
- Region: US census region (MW, NE, S, W)
- SMSA: Urban vs Non-urban
We are particularly interested in this dataset because it relates to labor economics and human capital theory. It allows us to examine how education and experience influence wages, as well as explore patterns of inequality and regional variation using regression analysis.
6. Research Questions¶
This analysis investigates wage disparities using regression modeling.
Some of the research questions we aim to answer are:
1. Is there evidence of a racial wage gap after controlling for education and experience?
2. Does geographic region and urban location explain additional variation in wages?
3. Does the racial wage gap differ across regions (interaction effects)?
We will use:
- OLS regression models
- Nested model comparison (F-tests)
- Coefficient interpretation
- Model diagnostics
7. Loading and Inspecting the Data¶
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.weightstats import ttest_ind
# loading the data
wage = pd.read_csv("wage.csv")
wage.head()
| | Wage | Education | Experience | Black | SMSA | Region |
|---|---|---|---|---|---|---|
| 0 | 354.94 | 7 | 45 | No | Yes | NE |
| 1 | 370.37 | 9 | 9 | No | Yes | NE |
| 2 | 754.94 | 11 | 46 | No | Yes | NE |
| 3 | 593.54 | 12 | 36 | No | Yes | NE |
| 4 | 377.23 | 16 | 22 | No | Yes | NE |
8. Data Cleaning and Encoding¶
We convert the categorical variables (Black and SMSA) into numeric format for analysis.
wage["Black"] = wage["Black"].map({"Yes": 1, "No": 0})
wage["SMSA"] = wage["SMSA"].map({"Yes": 1, "No":0})
wage.head()
| | Wage | Education | Experience | Black | SMSA | Region |
|---|---|---|---|---|---|---|
| 0 | 354.94 | 7 | 45 | 0 | 1 | NE |
| 1 | 370.37 | 9 | 9 | 0 | 1 | NE |
| 2 | 754.94 | 11 | 46 | 0 | 1 | NE |
| 3 | 593.54 | 12 | 36 | 0 | 1 | NE |
| 4 | 377.23 | 16 | 22 | 0 | 1 | NE |
9. Exploratory Data Analysis¶
wage.describe()
| | Wage | Education | Experience | Black | SMSA |
|---|---|---|---|---|---|
| count | 25631.000000 | 25631.000000 | 25631.000000 | 25631.000000 | 25631.000000 |
| mean | 640.162470 | 13.076275 | 18.586555 | 0.077562 | 0.742850 |
| std | 444.283273 | 2.904286 | 12.424661 | 0.267487 | 0.437071 |
| min | 50.390000 | 0.000000 | -4.000000 | 0.000000 | 0.000000 |
| 25% | 356.130000 | 12.000000 | 9.000000 | 0.000000 | 0.000000 |
| 50% | 567.230000 | 12.000000 | 16.000000 | 0.000000 | 1.000000 |
| 75% | 826.210000 | 16.000000 | 27.000000 | 0.000000 | 1.000000 |
| max | 18777.200000 | 18.000000 | 63.000000 | 1.000000 | 1.000000 |
Looking at the descriptive statistics, the first thing that stands out is the very large standard deviation of weekly wages (around 444). This tells us that wages vary a lot across individuals in the dataset. The minimum wage is around 50, while the maximum is over 18,000, which shows that the distribution is extremely spread out.
The mean wage is about 640, but because the standard deviation is so large and the maximum value is so extreme, the mean may not represent a “typical” worker very well. This suggests that a few very high earners pull the average upward, so we will use the median to compare groups.
# Wage difference by Region
wage.groupby("Region")["Wage"].median()
Region
MW    581.67
NE    593.54
S     498.58
W     569.80
Name: Wage, dtype: float64
When we compare median wages across regions, the South has the lowest median wage (about 498), followed by the Midwest (around 582) and the West (about 570). The Northeast has the highest median wage at roughly 594. These differences suggest that regional labor markets differ in meaningful ways, and that region may play an important role in explaining wage variation once we move into regression analysis.
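A quick visual check of these regional differences can also help; the sketch below uses the seaborn and matplotlib imports from section 7 and plots wages on the log scale so that extreme earners do not dominate the picture.
# Optional visual check of regional wage differences
wage["LogWage"] = np.log(wage["Wage"])          # log scale reduces the influence of extreme earners
sns.boxplot(x="Region", y="LogWage", data=wage)
plt.title("Log weekly wage by region")
plt.show()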
10. Hypothesis Testing¶
We test the null hypothesis that wages do not differ between urban and non-urban workers.
# Wage difference by SMSA (urban vs non-urban)
group1 = wage[wage["SMSA"] == 1]["Wage"]
group2 = wage[wage["SMSA"] == 0]["Wage"]
ttest_ind(group1, group2)
(19.984874212965273, 3.505710453796782e-88, 25629.0)
There is a statistically significant difference in wages between SMSA and non‑SMSA workers.
The t‑statistic is 19.98 and the p‑value is effectively zero (3.5e‑88), indicating that the observed difference in mean wages is far too large to be due to random chance.
We reject the null hypothesis and conclude that workers in metropolitan areas earn significantly different wages than those in non‑metropolitan areas.
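Because the wage distribution is heavily skewed and the two groups likely have unequal variances, a Welch-style version of the test (which drops the equal-variance assumption) is a reasonable robustness check. A sketch, reusing group1 and group2 from above:
# Robustness check: Welch t-test, which does not assume equal variances across groups
tstat, pvalue, dof = ttest_ind(group1, group2, usevar="unequal")
print(tstat, pvalue, dof)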
11. Regression Analysis¶
Modeling Strategy
To understand wage disparities, we estimate a sequence of nested OLS regression models. Each step adds variables to test whether they explain additional variation in wages and whether earlier relationships remain stable.
We proceed as follows:
- Model 1: Human capital baseline (education + experience)
- Model 2: Adds race (Black indicator) to test wage gap
- Model 3: Adds urban location (SMSA) to test geographic effects
- Model 4 (extension): Regional differences and interaction effects (race × region)
This stepwise structure allows us to isolate the contribution of each factor. Because wages are highly skewed, we apply a log transformation to better approximate a normal distribution and improve model interpretation.
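As a quick sanity check on that skewness claim, pandas can report the sample skew of the raw and log-transformed wages:
# Quick check of the skew that motivates the log transformation
print(wage["Wage"].skew())           # raw weekly wages
print(np.log(wage["Wage"]).skew())   # log-transformed wages, for comparison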
Model 1 - Human Capital Baseline
model1 = smf.ols("np.log(Wage) ~ Education + Experience", data=wage).fit()
print(model1.summary())
OLS Regression Results
==============================================================================
Dep. Variable: np.log(Wage) R-squared: 0.256
Model: OLS Adj. R-squared: 0.256
Method: Least Squares F-statistic: 4414.
Date: Fri, 01 May 2026 Prob (F-statistic): 0.00
Time: 16:34:26 Log-Likelihood: -20624.
No. Observations: 25631 AIC: 4.125e+04
Df Residuals: 25628 BIC: 4.128e+04
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 4.6058 0.018 250.273 0.000 4.570 4.642
Education 0.1014 0.001 83.558 0.000 0.099 0.104
Experience 0.0184 0.000 64.944 0.000 0.018 0.019
==============================================================================
Omnibus: 1659.008 Durbin-Watson: 1.788
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3153.192
Skew: -0.470 Prob(JB): 0.00
Kurtosis: 4.438 Cond. No. 136.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The first model estimates log wages as a function of education and experience. The results show that both variables are strong and statistically significant predictors of wages. Specifically, an additional year of education is associated with approximately a 10.1% increase in wages, while an additional year of experience is associated with about a 1.8% increase, holding other factors constant.
The model explains approximately 26 percent of the variation in log wages, indicating that while human capital is important, a substantial portion of wage variation remains unexplained.
Overall, this model supports standard human capital theory, suggesting that individuals with higher levels of education and experience tend to earn higher wages. However, it also highlights that additional factors beyond human capital likely influence wage outcomes.
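For small coefficients, the coefficient in a log-wage model is approximately the proportional change in wages; the exact effect is 100·(exp(β) − 1). A short check of the exact percentages implied by Model 1:
# Exact percentage effects implied by the log-wage coefficients: 100 * (exp(beta) - 1)
pct_effects = 100 * (np.exp(model1.params) - 1)
print(pct_effects[["Education", "Experience"]])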
Model 2 - Adding Race
model2 = smf.ols("np.log(Wage) ~ Education + Experience + Black", data=wage).fit()
print(model2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: np.log(Wage) R-squared: 0.266
Model: OLS Adj. R-squared: 0.266
Method: Least Squares F-statistic: 3099.
Date: Fri, 01 May 2026 Prob (F-statistic): 0.00
Time: 16:38:08 Log-Likelihood: -20451.
No. Observations: 25631 AIC: 4.091e+04
Df Residuals: 25627 BIC: 4.094e+04
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 4.6466 0.018 252.401 0.000 4.610 4.683
Education 0.0998 0.001 82.540 0.000 0.097 0.102
Experience 0.0184 0.000 65.172 0.000 0.018 0.019
Black -0.2350 0.013 -18.675 0.000 -0.260 -0.210
==============================================================================
Omnibus: 1699.141 Durbin-Watson: 1.792
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3289.481
Skew: -0.474 Prob(JB): 0.00
Kurtosis: 4.477 Cond. No. 138.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In the second model, we include a binary indicator for race (Black) to examine whether a racial wage gap persists after controlling for education and experience. The results show that the coefficient on the Black indicator is negative and statistically significant.
Specifically, the coefficient of −0.235 implies that Black workers earn roughly 21% lower wages than non-Black workers (exp(−0.235) − 1 ≈ −0.21), holding education and experience constant. This indicates a substantial wage gap that cannot be explained by differences in human capital alone.
The inclusion of the race variable does not meaningfully alter the coefficients on education and experience, suggesting that these effects are stable across groups. While the model’s explanatory power increases slightly, the more important result is the identification of a persistent and statistically robust wage gap associated with race.
Overall, this provides evidence of structural wage inequality in the data, as differences in observable human capital characteristics do not fully account for disparities in earnings.
Model 3 - Adding Urban Location and Region
model3 = smf.ols("np.log(Wage) ~ Education + Experience + Black + SMSA + C(Region)", data=wage).fit()
print(model3.summary())
OLS Regression Results
==============================================================================
Dep. Variable: np.log(Wage) R-squared: 0.283
Model: OLS Adj. R-squared: 0.283
Method: Least Squares F-statistic: 1443.
Date: Fri, 01 May 2026 Prob (F-statistic): 0.00
Time: 16:40:37 Log-Likelihood: -20159.
No. Observations: 25631 AIC: 4.033e+04
Df Residuals: 25623 BIC: 4.040e+04
Df Model: 7
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 4.5764 0.020 232.281 0.000 4.538 4.615
C(Region)[T.NE] 0.0360 0.010 3.702 0.000 0.017 0.055
C(Region)[T.S] -0.0612 0.009 -6.743 0.000 -0.079 -0.043
C(Region)[T.W] -0.0018 0.010 -0.182 0.855 -0.021 0.018
Education 0.0970 0.001 80.773 0.000 0.095 0.099
Experience 0.0184 0.000 65.810 0.000 0.018 0.019
Black -0.2304 0.013 -18.206 0.000 -0.255 -0.206
SMSA 0.1578 0.008 20.467 0.000 0.143 0.173
==============================================================================
Omnibus: 1767.285 Durbin-Watson: 1.832
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3555.155
Skew: -0.479 Prob(JB): 0.00
Kurtosis: 4.553 Cond. No. 154.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Model 3 adds regional indicators and an SMSA (urban) variable to examine whether geographic factors explain wage differences beyond education, experience, and race.
The results show significant regional variation: relative to the baseline region (the Midwest), workers in the Northeast earn about 3.7% higher wages, while those in the South earn about 6% lower wages. The West shows no significant difference. In addition, the SMSA coefficient (0.1578) is positive and significant, indicating that workers in metropolitan areas earn roughly 17% higher wages, consistent with an urban wage premium.
The coefficient on Black remains negative and statistically significant (−0.2304), implying that Black workers earn about 21% lower wages even after accounting for geographic factors. Although slightly reduced, the gap persists, suggesting that location does not fully explain racial wage disparities.
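The modeling strategy calls for nested model comparisons; statsmodels provides these F-tests through anova_lm. A sketch comparing the three nested models fit so far:
# Nested F-tests: does each added block of variables significantly improve fit?
print(sm.stats.anova_lm(model1, model2, model3))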
Model 4 - Interaction Between Race and Region
model4 = smf.ols("np.log(Wage) ~ Education + Experience + Black * Region", data=wage).fit()
model4.summary()
| Dep. Variable: | np.log(Wage) | R-squared: | 0.271 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.271 |
| Method: | Least Squares | F-statistic: | 1059. |
| Date: | Fri, 01 May 2026 | Prob (F-statistic): | 0.00 |
| Time: | 20:04:55 | Log-Likelihood: | -20365. |
| No. Observations: | 25631 | AIC: | 4.075e+04 |
| Df Residuals: | 25621 | BIC: | 4.083e+04 |
| Df Model: | 9 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 4.6601 | 0.019 | 239.211 | 0.000 | 4.622 | 4.698 |
| Region[T.NE] | 0.0607 | 0.010 | 6.078 | 0.000 | 0.041 | 0.080 |
| Region[T.S] | -0.0548 | 0.010 | -5.751 | 0.000 | -0.073 | -0.036 |
| Region[T.W] | 0.0034 | 0.010 | 0.333 | 0.739 | -0.017 | 0.023 |
| Education | 0.0989 | 0.001 | 81.926 | 0.000 | 0.097 | 0.101 |
| Experience | 0.0183 | 0.000 | 64.992 | 0.000 | 0.018 | 0.019 |
| Black | -0.1928 | 0.031 | -6.287 | 0.000 | -0.253 | -0.133 |
| Black:Region[T.NE] | -0.0035 | 0.043 | -0.082 | 0.935 | -0.088 | 0.081 |
| Black:Region[T.S] | -0.0418 | 0.035 | -1.192 | 0.233 | -0.110 | 0.027 |
| Black:Region[T.W] | 0.0382 | 0.051 | 0.743 | 0.458 | -0.063 | 0.139 |
| Omnibus: | 1711.542 | Durbin-Watson: | 1.804 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 3339.172 |
| Skew: | -0.475 | Prob(JB): | 0.00 |
| Kurtosis: | 4.492 | Cond. No. | 507. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Model 4 introduces interaction terms between race and region to examine whether the racial wage gap varies across geographic areas. Education and experience remain strong and statistically significant predictors, and regional wage differences continue to be present.
The coefficient on Black remains negative and significant (−0.1928), implying that Black workers earn roughly 18% lower wages on average, holding other factors constant.
However, the interaction terms between Black and region are not statistically significant, indicating that the size of this wage gap does not differ meaningfully across regions.
Overall, these results suggest that while region influences wage levels, it does not moderate racial wage disparities. The inclusion of interaction terms also does not improve model fit, reinforcing the stability of the racial wage gap across locations.
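One way to formalize the claim that the interactions do not improve fit is an F-test comparing Model 4 against a reduced model without the Black × Region terms; the reduced model below is introduced only for this comparison.
# F-test for the interaction block: compare Model 4 to a reduced model without Black:Region terms
model4_reduced = smf.ols("np.log(Wage) ~ Education + Experience + Black + Region", data=wage).fit()
print(sm.stats.anova_lm(model4_reduced, model4))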
12. Models Summary¶
Across all models, several consistent patterns emerge:
- Education and experience are stable and strongly significant predictors of wages, confirming the importance of human capital.
- A persistent racial wage gap exists across all specifications, even after controlling for education, experience, geographic region, and urban location.
- Geographic factors do influence wages: workers in metropolitan areas earn more than those in non-metropolitan areas, and wages vary across regions. However, these geographic differences do not explain or eliminate racial disparities.
- The interaction analysis shows that the racial wage gap does not significantly differ across regions, suggesting that this inequality is structurally consistent rather than location-dependent.
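For presentation, statsmodels can also place the four models side by side in a single coefficient table via summary_col; this is an optional convenience, not required for the analysis.
# Side-by-side coefficient table for all four models
from statsmodels.iolib.summary2 import summary_col

print(summary_col([model1, model2, model3, model4],
                  model_names=["Model 1", "Model 2", "Model 3", "Model 4"],
                  stars=True))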
13. Findings and Conclusion¶
The analysis produces four main findings.
Human capital variables such as education and experience are strong and consistent predictors of wages. Individuals with higher education levels and more labor market experience tend to earn higher wages, which supports standard economic theory. However, these variables explain only a modest portion of total wage variation.
There is clear and persistent evidence of a racial wage gap. Across all model specifications, Black workers earn significantly less than non-Black workers, even after controlling for education, experience, and geographic factors. This suggests that observable human capital differences do not fully account for wage inequality.
Geographic location plays an important role in wage determination. Workers in metropolitan areas earn more than those in non-metropolitan areas, and wages vary across regions of the United States. However, these geographic differences do not eliminate racial disparities in wages.
The racial wage gap does not significantly vary by region. The interaction analysis shows no meaningful evidence that the effect of race on wages depends on geographic location.
In conclusion, the key result is that a statistically significant racial wage gap persists across all model specifications and remains stable even after accounting for geographic differences and labor market characteristics. The findings indicate that wage inequality is shaped by multiple factors, but observable characteristics such as education, experience, and geography are insufficient to fully explain persistent differences across racial groups.