Scikit-learn Tutorial: Predicting Happiness with Machine Learning
Author: Aliyah Ismail, Eugenie Kawera, Doreen Musahara
1. What is scikit-learn?
Scikit-learn is a Python library used for machine learning. It helps us build models to predict outcomes and understand patterns in data. It was created by a group of developers and is widely used in data science.
It is useful because it follows a clear workflow: prepare the data, train a model, make predictions, and evaluate how well the model performs. It also works well with pandas and NumPy, which makes it easy to use with tools we already know from SDS 271.
In this tutorial, we will use scikit-learn to predict a country's happiness score using variables like GDP, social support, health, freedom, generosity, and corruption.
Why use scikit-learn?
Before learning scikit-learn, we will use pandas, NumPy, and seaborn to clean data, calculate summaries, and make plots. Those tools are good for exploring data, but scikit-learn helps us go further by building models that can make predictions.
For instance, instead of manually calculating a regression using formulas in pandas or NumPy, scikit-learn gives us tools like LinearRegression() that can fit the model for us. It also gives us tools to split data, evaluate models, and compare different machine learning methods.
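As a quick taste of that workflow, here is a minimal sketch on made-up numbers (not the happiness data yet) showing how little code a fitted model takes:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 2x + 1 exactly
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression()
model.fit(X, y)                 # learn slope and intercept from the data

print(model.coef_[0])           # slope, approximately 2.0
print(model.intercept_)         # intercept, approximately 1.0
print(model.predict([[4.0]]))   # prediction for a new x = 4, approximately 9.0
```

The same `fit()`/`predict()` pattern applies to the happiness models we build below.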
When would you use scikit-learn?
You would use scikit-learn when you want to build a machine learning model.
Regression: when you want to predict a number, like happiness score.
Classification: when you want to predict a category, like happy vs. not happy.
Clustering: when you want to find hidden groups in data, like grouping countries with similar social and economic conditions.
For this tutorial, we will focus mostly on regression because our outcome variable, happiness score, is numeric.
Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.cluster import KMeans
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
2. The dataset: World Happiness Report
The dataset comes from the World Happiness Report, which measures how happy people are across different countries. This edition was published in 2015 and is based on global survey data collected through the Gallup World Poll, covering 158 countries across different regions of the world. In the surveys, people rate their own life satisfaction, and the dataset pairs those ratings with economic and social indicators that might explain them.
The variables in this dataset include: Happiness Score, GDP per capita, Social support, Healthy life expectancy, Freedom, Generosity, and Trust (Government Corruption).
df = pd.read_csv("2015.csv")
df.head()
| | Country | Region | Happiness Rank | Happiness Score | Standard Error | Economy (GDP per Capita) | Family | Health (Life Expectancy) | Freedom | Trust (Government Corruption) | Generosity | Dystopia Residual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Switzerland | Western Europe | 1 | 7.587 | 0.03411 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.41978 | 0.29678 | 2.51738 |
| 1 | Iceland | Western Europe | 2 | 7.561 | 0.04884 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.14145 | 0.43630 | 2.70201 |
| 2 | Denmark | Western Europe | 3 | 7.527 | 0.03328 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.48357 | 0.34139 | 2.49204 |
| 3 | Norway | Western Europe | 4 | 7.522 | 0.03880 | 1.45900 | 1.33095 | 0.88521 | 0.66973 | 0.36503 | 0.34699 | 2.46531 |
| 4 | Canada | North America | 5 | 7.427 | 0.03553 | 1.32629 | 1.32261 | 0.90563 | 0.63297 | 0.32957 | 0.45811 | 2.45176 |
df.columns
Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score',
'Standard Error', 'Economy (GDP per Capita)', 'Family',
'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)',
'Generosity', 'Dystopia Residual'],
dtype='object')
Variables
Happiness Score: This is the main outcome variable. It measures how people rate their overall life satisfaction on a scale.
GDP per capita: This represents the economic level of a country and shows how much income is available per person.
Social support: This measures whether people feel they have someone to rely on in times of need.
Healthy life expectancy: This reflects how long people are expected to live in good health.
Freedom to make life choices: This shows how free people feel to make decisions about their own lives.
Trust / Government corruption: This measures how much people trust their government and how they perceive corruption levels.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158 entries, 0 to 157
Data columns (total 12 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   Country                        158 non-null    object
 1   Region                         158 non-null    object
 2   Happiness Rank                 158 non-null    int64
 3   Happiness Score                158 non-null    float64
 4   Standard Error                 158 non-null    float64
 5   Economy (GDP per Capita)       158 non-null    float64
 6   Family                         158 non-null    float64
 7   Health (Life Expectancy)       158 non-null    float64
 8   Freedom                        158 non-null    float64
 9   Trust (Government Corruption)  158 non-null    float64
 10  Generosity                     158 non-null    float64
 11  Dystopia Residual              158 non-null    float64
dtypes: float64(9), int64(1), object(2)
memory usage: 14.9+ KB
df.shape
(158, 12)
df.describe()
| | Happiness Rank | Happiness Score | Standard Error | Economy (GDP per Capita) | Family | Health (Life Expectancy) | Freedom | Trust (Government Corruption) | Generosity | Dystopia Residual |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 158.000000 | 158.000000 | 158.000000 | 158.000000 | 158.000000 | 158.000000 | 158.000000 | 158.000000 | 158.000000 | 158.000000 |
| mean | 79.493671 | 5.375734 | 0.047885 | 0.846137 | 0.991046 | 0.630259 | 0.428615 | 0.143422 | 0.237296 | 2.098977 |
| std | 45.754363 | 1.145010 | 0.017146 | 0.403121 | 0.272369 | 0.247078 | 0.150693 | 0.120034 | 0.126685 | 0.553550 |
| min | 1.000000 | 2.839000 | 0.018480 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.328580 |
| 25% | 40.250000 | 4.526000 | 0.037268 | 0.545808 | 0.856823 | 0.439185 | 0.328330 | 0.061675 | 0.150553 | 1.759410 |
| 50% | 79.500000 | 5.232500 | 0.043940 | 0.910245 | 1.029510 | 0.696705 | 0.435515 | 0.107220 | 0.216130 | 2.095415 |
| 75% | 118.750000 | 6.243750 | 0.052300 | 1.158448 | 1.214405 | 0.811013 | 0.549092 | 0.180255 | 0.309883 | 2.462415 |
| max | 158.000000 | 7.587000 | 0.136930 | 1.690420 | 1.402230 | 1.025250 | 0.669730 | 0.551910 | 0.795880 | 3.602140 |
df.isnull().sum()
Country                          0
Region                           0
Happiness Rank                   0
Happiness Score                  0
Standard Error                   0
Economy (GDP per Capita)         0
Family                           0
Health (Life Expectancy)         0
Freedom                          0
Trust (Government Corruption)    0
Generosity                       0
Dystopia Residual                0
dtype: int64
Why This Dataset Matters to Us
We chose this dataset because it connects to our interests in Economics and SDS. It allows us to study how economic and social factors relate to quality of life.
This dataset is also personally meaningful to us because we are interested in seeing how our countries (Rwanda and Somaliland) are doing in terms of happiness compared to other countries in the world. We especially want to understand how factors like GDP, region (or continent), and social conditions relate to happiness in these places.
Our focus is to explore whether countries like ours follow similar patterns as others or if there are differences. This helps us better understand how different factors shape people's daily lives and well-being.
3. Exploratory data analysis
Before using scikit-learn, we first use pandas and seaborn to understand the data.
df = df.rename(columns={
"Country": "country",
"Region": "region",
"Happiness Rank": "rank",
"Happiness Score": "happiness_score",
"Standard Error": "standard_error",
"Economy (GDP per Capita)": "gdp_per_capita",
"Family": "social_support",
"Health (Life Expectancy)": "healthy_life_expectancy",
"Freedom": "freedom",
"Trust (Government Corruption)": "corruption",
"Generosity": "generosity",
"Dystopia Residual": "dystopia_residual"
})
df.head()
| | country | region | rank | happiness_score | standard_error | gdp_per_capita | social_support | healthy_life_expectancy | freedom | corruption | generosity | dystopia_residual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Switzerland | Western Europe | 1 | 7.587 | 0.03411 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.41978 | 0.29678 | 2.51738 |
| 1 | Iceland | Western Europe | 2 | 7.561 | 0.04884 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.14145 | 0.43630 | 2.70201 |
| 2 | Denmark | Western Europe | 3 | 7.527 | 0.03328 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.48357 | 0.34139 | 2.49204 |
| 3 | Norway | Western Europe | 4 | 7.522 | 0.03880 | 1.45900 | 1.33095 | 0.88521 | 0.66973 | 0.36503 | 0.34699 | 2.46531 |
| 4 | Canada | North America | 5 | 7.427 | 0.03553 | 1.32629 | 1.32261 | 0.90563 | 0.63297 | 0.32957 | 0.45811 | 2.45176 |
Distribution of happiness scores
sns.histplot(data=df, x="happiness_score", kde=True)
plt.title("Distribution of Happiness Scores")
plt.xlabel("Happiness Score")
plt.ylabel("Count")
plt.show()
This plot shows how happiness scores are spread across countries. Most countries are in the middle range, while fewer countries have very low or very high happiness scores.
Relationship between GDP and happiness
sns.scatterplot(data=df, x="gdp_per_capita", y="happiness_score")
plt.title("GDP per Capita and Happiness Score")
plt.xlabel("GDP per Capita")
plt.ylabel("Happiness Score")
plt.show()
This plot helps us see whether countries with higher GDP per capita also tend to have higher happiness scores. The points trend upward as GDP increases, which suggests a positive relationship.
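If we want to quantify that upward trend rather than just eyeball it, we can fit a least-squares line and overlay it on the scatter. A sketch with synthetic stand-in data (the real plot above uses our `df`):

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for gdp vs. happiness (made-up relationship)
rng = np.random.default_rng(1)
gdp = rng.uniform(0, 1.7, size=80)
happiness = 2.2 * gdp + 3.0 + rng.normal(scale=0.5, size=80)

# Fit a degree-1 polynomial (a straight line) to quantify the trend
slope, intercept = np.polyfit(gdp, happiness, deg=1)

plt.scatter(gdp, happiness, alpha=0.6)
xs = np.linspace(gdp.min(), gdp.max(), 100)
plt.plot(xs, slope * xs + intercept, "r--", label=f"slope = {slope:.2f}")
plt.legend()
plt.show()
```

A positive slope confirms the visual impression that higher GDP goes with higher happiness.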
Correlation heatmap
numeric_df = df[["happiness_score", "gdp_per_capita", "social_support",
"healthy_life_expectancy", "freedom", "generosity", "corruption"]]
sns.heatmap(numeric_df.corr(), annot=True, cmap="coolwarm")
plt.title("Correlation Between Variables")
plt.show()
The heatmap shows how strongly each variable is correlated with happiness score. GDP per capita, social support, and healthy life expectancy have the highest positive correlations with happiness (above 0.7), meaning countries with higher income, stronger social networks, and better health tend to report higher happiness. Freedom also shows a moderate positive correlation. Generosity and corruption have weaker correlations, suggesting they matter less in predicting happiness on their own. We also notice that GDP per capita and healthy life expectancy are strongly correlated with each other (around 0.8). This makes sense: wealthier countries tend to have better healthcare systems. This overlap means we should be careful about over-interpreting the individual contribution of each feature in the regression model.
4. Regression with scikit-learn
Choose features and target variable
In machine learning, the features are the variables we use to make predictions. The target is the variable we are trying to predict.
Here, we want to predict happiness_score.
X = df[["gdp_per_capita", "social_support", "healthy_life_expectancy",
"freedom", "generosity", "corruption"]]
y = df["happiness_score"]
X.head()
| | gdp_per_capita | social_support | healthy_life_expectancy | freedom | generosity | corruption |
|---|---|---|---|---|---|---|
| 0 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.29678 | 0.41978 |
| 1 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.43630 | 0.14145 |
| 2 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.34139 | 0.48357 |
| 3 | 1.45900 | 1.33095 | 0.88521 | 0.66973 | 0.34699 | 0.36503 |
| 4 | 1.32629 | 1.32261 | 0.90563 | 0.63297 | 0.45811 | 0.32957 |
y.head()
0    7.587
1    7.561
2    7.527
3    7.522
4    7.427
Name: happiness_score, dtype: float64
Train-test split
We will split the data into training data and testing data.
The training set is used to teach the model.
The testing set is used to check how well the model works on new data.
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
print("Training data size:", X_train.shape)
print("Testing data size:", X_test.shape)
Training data size: (126, 6)
Testing data size: (32, 6)
Fit Linear Regression Model
A linear regression model predicts a numeric value. In this case, it predicts the happiness score based on a country's social and economic variables.
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
y_pred_linear = linear_model.predict(X_test)
y_pred_linear[:5]
array([4.7601377 , 6.46907653, 4.61372277, 3.05290197, 4.96414597])
Important scikit-learn methods:
.fit() trains the model using the training data.
.predict() makes predictions using new data.
.score() gives a quick measure of model performance.
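To illustrate `.score()` concretely (on synthetic data, since our happiness model is evaluated in detail below): for a regressor, `.score(X, y)` is shorthand for the R^2 of its predictions, the same value `r2_score` computes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic regression data, made up for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.5, -0.5]) + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X, y)

# .score() on a regressor returns R^2, matching r2_score on its predictions
print(model.score(X, y))
print(r2_score(y, model.predict(X)))
```

Classifiers use the same method name, but there `.score()` returns accuracy instead.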
Create a table comparing actual and predicted happiness scores:
results = pd.DataFrame({
"actual_happiness": y_test,
"predicted_happiness": y_pred_linear})
results.head()
| | actual_happiness | predicted_happiness |
|---|---|---|
| 128 | 4.307 | 4.760138 |
| 45 | 5.987 | 6.469077 |
| 134 | 4.194 | 4.613723 |
| 156 | 2.905 | 3.052902 |
| 90 | 5.057 | 4.964146 |
Evaluate the Linear Regression model
We will use two common metrics:
Mean Squared Error (MSE): measures how far predictions are from the real values. Smaller is better.
R^2 score: measures how much variation in happiness score the model explains. Closer to 1 is better.
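To demystify these metrics, here is a small check on made-up numbers showing that computing them by hand matches scikit-learn's functions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Small made-up example of actual vs. predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])

# MSE: mean of the squared prediction errors
mse_manual = np.mean((y_true - y_pred) ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_pred))  # both 0.25
print(r2_manual, r2_score(y_true, y_pred))             # both 0.95
```

Now we apply the same two metrics to our linear regression predictions.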
linear_mse = mean_squared_error(y_test, y_pred_linear)
linear_r2 = r2_score(y_test, y_pred_linear)
print("Linear regression MSE:", linear_mse)
print("Linear regression R^2:", linear_r2)
Linear regression MSE: 0.24193882833563737
Linear regression R^2: 0.8294705100069293
plt.figure(figsize=(6, 5))
plt.scatter(y_test, y_pred_linear, alpha=0.7, color="steelblue")
plt.plot([y_test.min(), y_test.max()],
[y_test.min(), y_test.max()], 'r--', label="Perfect prediction")
plt.xlabel("Actual Happiness Score")
plt.ylabel("Predicted Happiness Score")
plt.title("Linear Regression: Actual vs Predicted")
plt.legend()
plt.tight_layout()
plt.show()
Points close to the red line mean accurate predictions. The model performs well overall (R^2 = 0.83), but a few countries are notable outliers.
The R^2 = 0.83 means that about 83% of the variation in happiness scores across countries is explained by the variables in our model. This shows that factors like GDP, social support, health, freedom, generosity, and corruption do a strong job in predicting happiness.
This high value suggests that the model fits the data well, but it is not perfect. There is still about 17% of the variation that is not explained, which could be due to other factors like culture, history, or individual experiences that are not included in the dataset.
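Linear regression is also interpretable: the fitted model's `.coef_` attribute holds one weight per feature. A sketch on synthetic data (the feature names and effect sizes here are made up, not the happiness coefficients) showing how to pair coefficients with column names:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features with known effects: +2.0 for the first, -1.0 for the second
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(100, 2)),
                 columns=["gdp_per_capita", "freedom"])
y = 2.0 * X["gdp_per_capita"] - 1.0 * X["freedom"] + 5.0

model = LinearRegression().fit(X, y)

# Pair each coefficient with its feature name for readability
coefs = pd.Series(model.coef_, index=X.columns)
print(coefs)             # recovers roughly +2.0 and -1.0
print(model.intercept_)  # roughly 5.0
```

The same pattern applied to our fitted `linear_model` would show how much each happiness predictor contributes, though the strong correlation between GDP and health noted earlier makes individual weights harder to interpret.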
Random Forest Regression
Let's try another model.
Linear regression assumes a straight-line relationship between the features and the outcome. But real-world data is not always that simple.
A Random Forest Model is more flexible. It uses many decision trees and combines their results to make a prediction. This can sometimes improve prediction accuracy.
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
rf_mse = mean_squared_error(y_test, y_pred_rf)
rf_r2 = r2_score(y_test, y_pred_rf)
print("Random Forest MSE:", rf_mse)
print("Random Forest R^2:", rf_r2)
Random Forest MSE: 0.2638721893812502
Random Forest R^2: 0.8140108795760777
Compare Two Models
model_comparison = pd.DataFrame({
"Model": ["Linear Regression", "Random Forest"],
"MSE": [linear_mse, rf_mse],
"R^2": [linear_r2, rf_r2]})
model_comparison
| | Model | MSE | R^2 |
|---|---|---|---|
| 0 | Linear Regression | 0.241939 | 0.829471 |
| 1 | Random Forest | 0.263872 | 0.814011 |
This table compares the performance of the two models. The Linear Regression model has a lower MSE (0.242) and a higher R^2 (0.829) compared to the Random Forest model, which has an MSE of 0.264 and R^2 of 0.814.
This means that Linear Regression performs slightly better for this dataset. It suggests that the relationship between happiness and the predictors is mostly linear, so a simple model is enough to explain the data. Random Forest, which is more complex, does not improve the results in this case.
Feature importance
Random Forest also allows us to look at feature importance. Feature importance tells us which variables were most useful for making predictions.
feature_importance = pd.DataFrame({
"feature": X.columns,
"importance": rf_model.feature_importances_
})
feature_importance = feature_importance.sort_values(by="importance", ascending=False)
feature_importance
| | feature | importance |
|---|---|---|
| 0 | gdp_per_capita | 0.419622 |
| 1 | social_support | 0.193982 |
| 2 | healthy_life_expectancy | 0.186078 |
| 3 | freedom | 0.099359 |
| 5 | corruption | 0.056074 |
| 4 | generosity | 0.044885 |
sns.barplot(data=feature_importance, x="importance", y="feature")
plt.title("Feature Importance from Random Forest")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.show()
This plot shows which variables the Random Forest model relied on most when predicting happiness score. GDP per capita, social support, and healthy life expectancy have the highest importance, making them the strongest predictors of happiness in this dataset.
5. Clustering countries
Clustering is used when we want to find hidden groups in the data. Unlike regression, clustering does not use a target variable. Instead, it groups observations that are similar to each other.
Here, we will group countries based on their social and economic characteristics.
Standardize the data
Clustering is affected by scale, so we standardize the variables first. Standardizing means putting the variables on a similar scale.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
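To see concretely what standardizing does, a small check on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0]])

X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Each column now has mean 0 and standard deviation 1,
# so no single feature dominates K-Means distance calculations
print(X_demo_scaled.mean(axis=0))  # approximately [0, 0]
print(X_demo_scaled.std(axis=0))   # approximately [1, 1]
```

Without this step, the feature with the largest raw values would dominate the clustering.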
Fit K-Means clustering
K-Means is a clustering method that groups data into a chosen number of clusters.
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
df["cluster"] = kmeans.fit_predict(X_scaled)
df[["country", "happiness_score", "cluster"]].head()
| | country | happiness_score | cluster |
|---|---|---|---|
| 0 | Switzerland | 7.587 | 2 |
| 1 | Iceland | 7.561 | 2 |
| 2 | Denmark | 7.527 | 2 |
| 3 | Norway | 7.522 | 2 |
| 4 | Canada | 7.427 | 2 |
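We fixed n_clusters=3 above, but that choice is ours to justify. One common heuristic is the elbow method: fit K-Means for several values of k and look for the point where the inertia (within-cluster sum of squares) stops dropping sharply. A sketch on synthetic blob data (not our happiness features):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups
X_demo, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# Inertia for k = 1..6; the "elbow" suggests a reasonable k
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X_demo)
    inertias[k] = km.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

Here the drop in inertia flattens out after k = 3, matching the three groups that were built into the data.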
Explore the clusters
df.groupby("cluster")[["happiness_score", "gdp_per_capita", "social_support",
"healthy_life_expectancy", "freedom", "generosity", "corruption"]].mean()
| cluster | happiness_score | gdp_per_capita | social_support | healthy_life_expectancy | freedom | generosity | corruption |
|---|---|---|---|---|---|---|---|
| 0 | 5.576582 | 0.980120 | 1.056143 | 0.732992 | 0.401858 | 0.184052 | 0.088927 |
| 1 | 4.206500 | 0.369585 | 0.737940 | 0.336673 | 0.366064 | 0.254268 | 0.132607 |
| 2 | 6.844517 | 1.302792 | 1.250101 | 0.856586 | 0.609351 | 0.353076 | 0.310519 |
This table shows the average happiness score and average predictor values for each cluster. We can use it to understand what makes the groups different. For example, one cluster may include countries with higher GDP, stronger social support, and higher happiness scores.
Visualize the clusters
sns.scatterplot(
data=df,
x="gdp_per_capita",
y="happiness_score",
hue="cluster",
palette="Set2"
)
plt.title("Country Clusters Based on Social and Economic Factors")
plt.xlabel("GDP per Capita")
plt.ylabel("Happiness Score")
plt.show()
This plot shows how countries are grouped based on their characteristics. Countries in the same cluster are more similar to each other than to countries in other clusters.
Our Focus
In this project, we focus on Rwanda and Somaliland. We want to understand how their happiness scores compare to other countries and how factors like GDP, region, and social conditions affect their well-being.
focus_countries = df[df["country"].isin(["Rwanda", "Somaliland region"])]
focus_countries
| | country | region | rank | happiness_score | standard_error | gdp_per_capita | social_support | healthy_life_expectancy | freedom | corruption | generosity | dystopia_residual | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 90 | Somaliland region | Sub-Saharan Africa | 91 | 5.057 | 0.06161 | 0.18847 | 0.95152 | 0.43873 | 0.46582 | 0.39928 | 0.50318 | 2.11032 | 1 |
| 153 | Rwanda | Sub-Saharan Africa | 154 | 3.465 | 0.03464 | 0.22208 | 0.77370 | 0.42864 | 0.59201 | 0.55191 | 0.22628 | 0.67042 | 1 |
Rwanda's actual score (3.47) is more than a point below what the linear regression model predicts (4.86; these predictions are computed below). This gap suggests that factors outside our dataset, such as historical trauma or how respondents interpret the happiness question culturally, affect Rwanda's score. Somaliland's prediction (4.96) is very close to its actual score (5.06), meaning the model captures it well. This difference shows a key limitation of machine learning: it can only learn from the variables it is given.
global_avg = df[[
"happiness_score",
"gdp_per_capita",
"social_support",
"healthy_life_expectancy",
"freedom",
"generosity",
"corruption"]].mean()
global_avg
happiness_score            5.375734
gdp_per_capita             0.846137
social_support             0.991046
healthy_life_expectancy    0.630259
freedom                    0.428615
generosity                 0.237296
corruption                 0.143422
dtype: float64
Compare to global averages
We selected Rwanda and Somaliland because they represent our home contexts. This allows us to connect the data to real-world experiences and better understand how happiness is measured in these countries.
focus_countries[[
"country",
"happiness_score",
"gdp_per_capita",
"social_support",
"healthy_life_expectancy",
"freedom",
"generosity",
"corruption"]]
| | country | happiness_score | gdp_per_capita | social_support | healthy_life_expectancy | freedom | generosity | corruption |
|---|---|---|---|---|---|---|---|---|
| 90 | Somaliland region | 5.057 | 0.18847 | 0.95152 | 0.43873 | 0.46582 | 0.50318 | 0.39928 |
| 153 | Rwanda | 3.465 | 0.22208 | 0.77370 | 0.42864 | 0.59201 | 0.22628 | 0.55191 |
Rwanda and Somaliland have lower GDP compared to the global average, but their happiness scores are not as low as expected. This suggests that other factors like social support or freedom may also play an important role in explaining happiness.
Region / Continent comparison
df.groupby("region")[["happiness_score", "gdp_per_capita"]].mean()
| region | happiness_score | gdp_per_capita |
|---|---|---|
| Australia and New Zealand | 7.285000 | 1.291880 |
| Central and Eastern Europe | 5.332931 | 0.942438 |
| Eastern Asia | 5.626167 | 1.151780 |
| Latin America and Caribbean | 6.144682 | 0.876815 |
| Middle East and Northern Africa | 5.406900 | 1.066974 |
| North America | 7.273000 | 1.360400 |
| Southeastern Asia | 5.317444 | 0.789054 |
| Southern Asia | 4.580857 | 0.560486 |
| Sub-Saharan Africa | 4.202800 | 0.380473 |
| Western Europe | 6.689619 | 1.298596 |
df[df["country"].isin(["Rwanda", "Somaliland region"])][["country", "region"]]
| | country | region |
|---|---|---|
| 90 | Somaliland region | Sub-Saharan Africa |
| 153 | Rwanda | Sub-Saharan Africa |
focus_X = focus_countries[[
"gdp_per_capita",
"social_support",
"healthy_life_expectancy",
"freedom",
"generosity",
"corruption"]]
focus_countries = focus_countries.copy()  # work on a copy to avoid SettingWithCopyWarning
focus_countries["predicted_happiness"] = linear_model.predict(focus_X)
focus_countries[["country", "happiness_score", "predicted_happiness"]]
| | country | happiness_score | predicted_happiness |
|---|---|---|---|
| 90 | Somaliland region | 5.057 | 4.964146 |
| 153 | Rwanda | 3.465 | 4.862202 |
df[df["country"].isin(["Rwanda", "Somaliland region"])][["country", "cluster"]]
| | country | cluster |
|---|---|---|
| 90 | Somaliland region | 1 |
| 153 | Rwanda | 1 |
Rwanda and Somaliland are grouped in the same cluster (cluster 1), together with countries that have similar economic and social conditions. This suggests that they share similar patterns in factors like GDP, health, and social support.
sns.scatterplot(
data=df,
x="gdp_per_capita",
y="happiness_score",
hue="region")
sns.scatterplot(
data=focus_countries,
x="gdp_per_capita",
y="happiness_score",
color="Red",
s=100)
plt.title("Rwanda and Somaliland Compared to Global Data")
plt.show()
This plot shows where Rwanda and Somaliland fall compared to other countries. Even with lower GDP, their position shows that happiness is influenced by multiple factors, not just economic wealth.
6. Conclusion
Scikit-learn fits directly into the workflow we know from SDS 271: we used pandas to load and clean the data, seaborn to explore it, and then passed it into scikit-learn to build models. The library's consistent pattern (fit(), predict(), score()) makes it easy to swap between models like LinearRegression and RandomForestRegressor with minimal code changes, something pandas alone cannot do. Use scikit-learn when you want to go beyond exploration and actually predict or group: regression for numeric outcomes, classification for categories, clustering when you have no labels. For this dataset, it revealed that GDP, social support, and health explain about 83% of the variation in happiness globally, but also that Rwanda's lived experience falls outside what the data can capture.