The purpose of this regression study is to identify the key factors influencing medical insurance charges and develop a predictive model for pricing. By analyzing variables such as age, BMI, smoking status, number of children, and region, the study determines which factors significantly impact costs. It also examines the interaction between smoking and BMI, the nonlinear effect of age, and regional pricing differences. The findings help insurers refine risk assessment and pricing strategies for more accurate and fair premium calculations.
The dataset for this study was sourced from Kaggle.com, containing information on medical insurance charges alongside variables such as age, sex, BMI, number of children, smoker status, and region. The data was cleaned by first checking for missing values, and any incomplete records were removed to maintain accuracy. Categorical variables were converted into dummy variables, with southeast used as the reference category for region. The dataset was then reorganized by grouping independent variables together for efficient regression modeling. This structured approach ensured the data was properly prepared for analysis.
Datasource: https://www.kaggle.com/datasets/mirichoi0218/insurance
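The preparation steps described above can be sketched in Python with pandas. This is a minimal illustration, not the study's actual script: the rows are made up, and the column names (age, sex, bmi, children, smoker, region, charges) are assumed from the variables named in the text rather than confirmed against the file. The helper prepare is hypothetical.

```python
import pandas as pd

# Made-up rows standing in for the Kaggle file. Column names are assumed
# from the variables named in the text, not confirmed against the CSV.
raw = pd.DataFrame({
    "age": [25, 40, 31, 52, 60],
    "sex": ["female", "male", "male", "female", "male"],
    "bmi": [27.9, 33.8, None, 22.7, 28.9],  # one missing value
    "children": [0, 1, 3, 0, 2],
    "smoker": ["yes", "no", "no", "no", "yes"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northeast"],
    "charges": [17000.0, 6500.0, 5200.0, 9800.0, 31000.0],
})


def prepare(df):
    """Drop incomplete rows and dummy-code the categorical columns."""
    clean = df.dropna().copy()
    # Binary indicators for the two-level variables.
    clean["sex_male"] = (clean["sex"] == "male").astype(int)
    clean["smoker_yes"] = (clean["smoker"] == "yes").astype(int)
    # One indicator per region except southeast, the reference category.
    for name in ["northeast", "northwest", "southwest"]:
        clean["region_" + name] = (clean["region"] == name).astype(int)
    predictors = ["age", "bmi", "children", "sex_male", "smoker_yes",
                  "region_northeast", "region_northwest", "region_southwest"]
    # Independent variables grouped together, response column last.
    return clean[predictors + ["charges"]].reset_index(drop=True)


prepared = prepare(raw)
print(prepared)
```

Leaving southeast without its own column makes it the baseline: each region coefficient in the regression is then read as a difference from the southeast.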
Age
Body Mass Index (BMI)
Smoking Status
Number of Children
Regions
This study employed multiple linear regression analysis to examine the factors influencing medical insurance charges. Categorical variables such as sex, smoker status, and region were converted into dummy variables, with southeast as the reference category. An interaction term (Smoker × BMI) was included to assess whether BMI affects insurance costs differently for smokers. Additionally, a quadratic term for Age (Age²) was tested to capture nonlinear effects. The model was refined through stepwise variable selection, removing insignificant predictors to improve interpretability while maintaining strong predictive accuracy. The final model, selected based on Adjusted R² and statistical significance, provided insights into the most impactful cost drivers.
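As a rough sketch of how the interaction and quadratic terms enter the model, the example below fits ordinary least squares with NumPy on simulated data and compares a base model with an extended one by Adjusted R². The data, the simulated coefficients, and the helper fit_ols are all assumptions made for illustration. They are not the study's data, estimates, or software, and the stepwise selection itself is not reproduced here.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit with an intercept.

    Returns (coefficients, R-squared, adjusted R-squared); the first
    coefficient is the intercept.
    """
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return beta, r2, adj_r2


# Simulated data with arbitrary coefficients (not the study's estimates).
rng = np.random.default_rng(0)
n = 500
age = rng.integers(18, 65, n).astype(float)
bmi = rng.normal(30.0, 6.0, n)
smoker = (rng.random(n) < 0.2).astype(float)
charges = (2000.0 + 3.0 * age**2 + 50.0 * bmi + 1400.0 * smoker * bmi
           + rng.normal(0.0, 500.0, n))

# Base model: main effects only.
base = np.column_stack([age, bmi, smoker])
# Extended model: adds the Smoker x BMI interaction and Age squared.
extended = np.column_stack([age, bmi, smoker, smoker * bmi, age**2])

_, r2_base, adj_base = fit_ols(base, charges)
beta_ext, r2_ext, adj_ext = fit_ols(extended, charges)

print("base:     R2 = %.3f, adjusted R2 = %.3f" % (r2_base, adj_base))
print("extended: R2 = %.3f, adjusted R2 = %.3f" % (r2_ext, adj_ext))
print("interaction coefficient: %.1f" % beta_ext[4])
```

Adjusted R² penalizes each added predictor, so it only rises when a new term explains more variation than its extra degree of freedom costs. That makes it a reasonable criterion for comparing models of different sizes.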
The initial regression analysis identified age, BMI, number of children, smoking status, and the northeast region as significant factors influencing medical insurance charges (p < 0.05). Smoking had the strongest impact, with smokers paying $23,038 to $24,659 more than non-smokers. Age and BMI also significantly increased costs, while sex and the northwest and southwest regions were not significant. These results provided a foundation for refining the model by testing interactions and nonlinear effects.
The regression analysis incorporating the interaction between smoker status and BMI confirmed that BMI has a significantly stronger effect on insurance charges for smokers than for non-smokers (p < 0.001). While BMI alone was not significant (p = 0.358), the interaction term (Smoker × BMI) was highly significant (coefficient = 1443.10, p < 0.001), indicating that for smokers, each additional unit of BMI leads to a substantial increase in costs. This improved model fit, with R² increasing from 0.751 to 0.841, confirming that the interaction effect better explains cost variations than BMI alone.
Although BMI alone is not statistically significant (p = 0.358) and is associated with only a moderate per-unit increase in charges (Figure 3), the interaction between BMI and smoker status (coefficient = 1443.10, p < 0.001) shows that for smokers, each additional unit of BMI leads to a substantial increase in charges.
The visualization of the interaction shows that non-smokers have relatively stable charges across BMI levels, whereas smokers experience a steep increase in costs as BMI rises. This finding highlights the importance of considering combined lifestyle factors when assessing insurance pricing and health risk.
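To make the interaction concrete, the sketch below shows how the BMI slope differs by smoking status in a model with a Smoker × BMI term. Only the interaction coefficient (1443.10) comes from the text. The BMI main-effect coefficient is not reported here, so it is left as a caller-supplied placeholder, and the function names are hypothetical.

```python
INTERACTION_COEF = 1443.10  # Smoker x BMI coefficient reported in the text


def bmi_slope(bmi_coef, smoker):
    """Change in predicted charges per one-unit increase in BMI.

    bmi_coef is the BMI main-effect coefficient. Its value is not given
    in the text, so the caller must supply it (placeholder).
    """
    return bmi_coef + INTERACTION_COEF * (1 if smoker else 0)


def extra_smoker_cost(bmi_increase):
    """Additional predicted charges a smoker incurs, beyond a non-smoker,
    for the same increase in BMI."""
    return INTERACTION_COEF * bmi_increase


# The gap between the two slopes equals the interaction coefficient,
# whatever the main-effect coefficient is.
gap = bmi_slope(0.0, True) - bmi_slope(0.0, False)
print(round(gap, 2))
print(round(extra_smoker_cost(5), 2))
```

Under these assumptions, a five-unit rise in BMI adds about $7,215 more to a smoker's predicted charges than to a non-smoker's, consistent with the diverging lines described above.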
The regression analysis incorporating a quadratic term for Age (Age²) confirmed a nonlinear relationship between Age and insurance charges. While the linear Age term was not significant (p = 0.607), Age² was highly significant (coefficient = 3.74, p < 0.001), indicating that insurance costs increase at an accelerating rate as individuals get older. This improved model fit, with R² increasing to 0.843, making it a more precise representation of how Age impacts insurance pricing.
A visualization of this relationship shows an upward curve: charges are lower and relatively stable for younger individuals and rise sharply with age. The pattern is modeled with the quadratic equation y = ax² + bx + c, where a is the coefficient of Age² (3.74), b is the coefficient of Age (-33.21), and c is the intercept (1647.79), giving y = 3.74 × Age² - 33.21 × Age + 1647.79.
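The quadratic can be evaluated directly with the coefficients quoted above. The short sketch below computes the curve and its slope at a few ages. It assumes the equation is read on its own, with the other predictors left out, and the function names are illustrative.

```python
A_AGE_SQ = 3.74        # coefficient of Age squared (from the text)
B_AGE = -33.21         # coefficient of Age (from the text)
C_INTERCEPT = 1647.79  # intercept (from the text)


def age_curve(age):
    """Value of the quadratic y = a*Age^2 + b*Age + c."""
    return A_AGE_SQ * age**2 + B_AGE * age + C_INTERCEPT


def age_slope(age):
    """Derivative of the quadratic: change in y per additional year."""
    return 2 * A_AGE_SQ * age + B_AGE


for age in (20, 40, 60):
    print(age, round(age_curve(age), 2), round(age_slope(age), 2))
```

The slope rises from roughly $116 per year at age 20 to roughly $416 per year at age 60, which is the accelerating pattern the curve depicts. The turning point of the parabola falls at about age 4.4, below the adult range, so the fitted curve increases at every adult age.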
The final regression analysis included only significant variables, improving model interpretability while maintaining strong predictive accuracy (R² = 0.8426). The results confirmed that smoking remains the strongest cost driver, with smokers paying $17,750 to $23,476 more than non-smokers (p < 0.001). The interaction between BMI and smoking was also highly significant (p < 0.001), showing that higher BMI leads to greater insurance costs for smokers. Additionally, having more children increased costs ($659 to $873 per child, p < 0.001), and living in the northeast was associated with higher charges compared to the southeast (p = 0.0015). The final model strikes a balance between explanatory power and simplicity, explaining about 84.3% of the variation in insurance costs.