DataFrame-level Stats Examples#
Basic examples of statistical test methods.
Used Libraries#
The following methods selectively use these libraries:
import plotly.io as pio
pio.renderers.default = "notebook"
Normality Test#
normality() - Tests for normal distribution
from frameon import load_dataset, FrameOn as fo
titanic = fo(load_dataset('titanic'))
titanic.stats.normality(
dv='age',
between='class',
method='jarque_bera',
show_qqplot=True
)
⚠️ Found 177 (19.87%) missing values in dependent variable 'age' ⚠️ Group 'First' in factor 'class' has 30 (13.89%) missing values in 'age' ⚠️ Group 'Second' in factor 'class' has 11 (5.98%) missing values in 'age' ⚠️ Group 'Third' in factor 'class' has 136 (27.70%) missing values in 'age' Removed 177 rows (19.87% of data) with missing values in: age, class Normality Test Report - Dependent Variable: age - Grouping Variable: class - Alpha Level: 0.05 - Test Method: jarque_bera 🔹 Hypothesis Formulation - H0: The data follows a normal distribution - H1: The data does not follow a normal distribution
| W | pval | normal | Reject H0 | |
|---|---|---|---|---|
| group | ||||
| Third | 27.48 | 0.00 | False | True |
| First | 1.57 | 0.46 | True | False |
| Second | 0.80 | 0.67 | True | False |
Warning: Non-normal distributions detected in groups: Third
Levene’s Test#
levene() - Tests equality of variances across groups.
from frameon import load_dataset, FrameOn as fo
titanic = fo(load_dataset('titanic'))
titanic.stats.levene(
dv='age',
between='class'
)
⚠️ Found 177 (19.87%) missing values in dependent variable 'age'
⚠️ Group 'First' in factor 'class' has 30 (13.89%) missing values in 'age'
⚠️ Group 'Second' in factor 'class' has 11 (5.98%) missing values in 'age'
⚠️ Group 'Third' in factor 'class' has 136 (27.70%) missing values in 'age'
Removed 177 rows (19.87% of data) with missing values in: age, class
Levene's Test Report
- Dependent Variable: age
- Groups: Third vs First
- Alpha Level: 0.05
| W | pval | equal_var | |
|---|---|---|---|
| levene | 5.620164 | 0.003787 | False |
Conclusion: At alpha = 0.05, we reject the null hypothesis that the variances are equal (p=0.0038).
T-Test#
ttest() - Compares means between two groups using independent samples t-test.
from frameon import load_dataset, FrameOn as fo
titanic = fo(load_dataset('titanic'))
titanic.stats.ttest(
dv='age',
between='alive',
)
⚠️ Found 177 (19.87%) missing values in dependent variable 'age' ⚠️ Group 'no' in factor 'alive' has 125 (22.77%) missing values in 'age' ⚠️ Group 'yes' in factor 'alive' has 52 (15.20%) missing values in 'age' Removed 177 rows (19.87% of data) with missing values in: age, alive T-Test Report - Dependent Variable: age - Between Factor: alive (no vs yes) - Alpha Level: 0.05 - Alternative: two-sided 🔹 Hypothesis Formulation - H0: The mean of no equals the mean of yes - H1: The mean of no differs from the mean of yes 🔹 Descriptive Statistics
| Group | count | mean | std | min | max |
|---|---|---|---|---|---|
| no | 424 | 30.63 | 14.17 | 1.0 | 74.0 |
| yes | 290 | 28.34 | 14.95 | 0.42 | 80.0 |
🔹 Homogeneity of Variance (Levene's Test)
| W | pval | equal_var | |
|---|---|---|---|
| levene | 1.195 | 0.275 | True |
- Using Student's t-test - equal variances assumed (pval = 0.275 ≥ 0.05)
🔹 T-Test Results
| T | dof | alternative | p-val | CI95% | cohen-d | BF10 | power | |
|---|---|---|---|---|---|---|---|---|
| T-test | 2.067 | 712 | two-sided | 0.039 | [0.11 4.45] | 0.157 | 0.685 | 0.541 |
🔹 Conclusion: At alpha = 0.05, we reject the null hypothesis (p-val = 0.0391).
Mann-Whitney U Test#
mwu() - Non-parametric test for comparing two independent groups.
from frameon import load_dataset, FrameOn as fo
titanic = fo(load_dataset('titanic'))
titanic.stats.mwu(
dv='age',
between='alive'
)
⚠️ Found 177 (19.87%) missing values in dependent variable 'age' ⚠️ Group 'no' in factor 'alive' has 125 (22.77%) missing values in 'age' ⚠️ Group 'yes' in factor 'alive' has 52 (15.20%) missing values in 'age' Removed 177 rows (19.87% of data) with missing values in: age, alive Mann-Whitney U Test Report - Dependent Variable: age - Between Factor: alive (no vs yes) - Alpha Level: 0.05 - Alternative: two-sided 🔹 Hypothesis Formulation - H0: The distribution of no equals the distribution of yes - H1: The distribution of no differs from the distribution of yes 🔹 Descriptive Statistics
| Group | count | median | mean | std | min | max |
|---|---|---|---|---|---|---|
| no | 424 | 28.0 | 30.63 | 14.17 | 1.0 | 74.0 |
| yes | 290 | 28.0 | 28.34 | 14.95 | 0.42 | 80.0 |
🔹 Mann-Whitney U Test Results
| U-val | alternative | p-val | RBC | CLES | |
|---|---|---|---|---|---|
| MWU | 65278.0 | two-sided | 0.16 | 0.062 | 0.531 |
Effect Sizes: - Rank-Biserial Correlation (RBC) = 0.06 (small effect) - Common Language Effect Size (CLES) = 0.53 (small effect) 🔹 Conclusion: At alpha = 0.05, we fail to reject the null hypothesis (p-val = 0.1605).
ANOVA#
anova() - One-way analysis of variance for comparing means across multiple groups.
from frameon import load_dataset, FrameOn as fo
titanic = fo(load_dataset('titanic'))
titanic.stats.anova(
dv='age',
between='class'
)
⚠️ Found 177 (19.87%) missing values in dependent variable 'age' ⚠️ Group 'First' in factor 'class' has 30 (13.89%) missing values in 'age' ⚠️ Group 'Second' in factor 'class' has 11 (5.98%) missing values in 'age' ⚠️ Group 'Third' in factor 'class' has 136 (27.70%) missing values in 'age' Removed 177 rows (19.87% of data) with missing values in: age, class One-way ANOVA Report - Dependent Variable: age - Between Factors: class - Alpha Level: 0.05 🔹 Hypothesis Formulation - H0: All group means are equal (Third, First, Second) - H1: At least one group mean differs among Third, First, Second 🔹 Descriptive Statistics
| Group | count | mean | std | min | max | median |
|---|---|---|---|---|---|---|
| First | 186 | 38.23 | 14.8 | 0.92 | 80.0 | 37.0 |
| Second | 173 | 29.88 | 14.0 | 0.67 | 70.0 | 29.0 |
| Third | 355 | 25.14 | 12.5 | 0.42 | 74.0 | 24.0 |
🔹 Homogeneity of Variance Check
| W | pval | equal_var | |
|---|---|---|---|
| levene | 5.62 | 0.004 | False |
- Using Welch's ANOVA - Levene's test showed unequal variances (pval = 0.004 < 0.05)
🔹 Welch ANOVA Results
| Source | ddof1 | ddof2 | F | p-unc | np2 |
|---|---|---|---|---|---|
| class | 2 | 359.444 | 53.355 | 0.0 | 0.139 |
🔹 Post-hoc Analysis
| A | B | mean(A) | mean(B) | diff | se | T | df | pval | hedges |
|---|---|---|---|---|---|---|---|---|---|
| First | Second | 38.233 | 29.878 | 8.356 | 1.52 | 5.496 | 356.897 | 0.0 | 0.578 |
| First | Third | 38.233 | 25.141 | 13.093 | 1.272 | 10.293 | 325.228 | 0.0 | 0.981 |
| Second | Third | 29.878 | 25.141 | 4.737 | 1.254 | 3.777 | 308.829 | 0.001 | 0.364 |
🔹 Conclusion: Significant effects found (p-unc = {aov['p-unc'].iloc[0]:.3f}) Significant pairwise comparisons: - First vs Second: pval = 0.000 - First vs Third: pval = 0.000 - Second vs Third: pval = 0.001
Kruskal-Wallis Test#
kruskal() - Non-parametric alternative to one-way ANOVA.
from frameon import load_dataset, FrameOn as fo
titanic = fo(load_dataset('titanic'))
titanic.stats.kruskal(
dv='age',
between='class'
)
⚠️ Found 177 (19.87%) missing values in dependent variable 'age' ⚠️ Group 'First' in factor 'class' has 30 (13.89%) missing values in 'age' ⚠️ Group 'Second' in factor 'class' has 11 (5.98%) missing values in 'age' ⚠️ Group 'Third' in factor 'class' has 136 (27.70%) missing values in 'age' Removed 177 rows (19.87% of data) with missing values in: age, class Kruskal-Wallis Test Report - Dependent Variable: age - Grouping Variable: class - Groups: Third, First, Second - Alpha Level: 0.05 🔹 Hypothesis Formulation - H0: The distributions of all groups are equal (Third, First, Second) - H1: At least one group distribution differs among Third, First, Second 🔹 Descriptive Statistics
| Group | count | median | mean | std | min | max |
|---|---|---|---|---|---|---|
| First | 186 | 37.0 | 38.23 | 14.8 | 0.92 | 80.0 |
| Second | 173 | 29.0 | 29.88 | 14.0 | 0.67 | 70.0 |
| Third | 355 | 24.0 | 25.14 | 12.5 | 0.42 | 74.0 |
🔹 Kruskal-Wallis Test Results
| Source | ddof1 | H | p-unc |
|---|---|---|---|
| class | 2 | 95.995 | 0.0 |
🔹 Post-hoc Analysis (Dunn's Test)
| A | B | p-corr |
|---|---|---|
| First | Second | 0.0 |
| First | Third | 0.0 |
| Second | Third | 0.0 |
🔹 Conclusion: Significant difference found (p-unc = 0.000) Significant pairwise comparisons: - First vs Second: pval = 0.000 - First vs Third: pval = 0.000 - Second vs Third: pval = 0.000
Chi-Square Test#
chi2_independence() - Tests categorical variable independence
from frameon import load_dataset, FrameOn as fo
titanic = fo(load_dataset('titanic'))
titanic.stats.chi2_independence(
x='class',
y='alive',
test='pearson'
)
Categorical Association Report - class vs alive - Total observations: 891 🔹 Hypothesis Formulation - H0: class and alive are independent (Third, First, Second vs no, yes) - H1: class and alive are associated (Third, First, Second vs no, yes) 🔹 Contingency Table
| alive | no | yes |
|---|---|---|
| class | ||
| First | 80 | 136 |
| Second | 97 | 87 |
| Third | 372 | 119 |
🔹 Statistical Tests
| test | lambda | chi2 | dof | pval | cramer | power | cramer strength |
|---|---|---|---|---|---|---|---|
| pearson | 1.0 | 102.889 | 2.0 | 0.0 | 0.34 | 1.0 | Moderate |
🔹 Conclusion: At alpha = 0.05, we reject the null hypothesis of independence (pearson test, p = 0.000). No significant association found between class and alive.
Bootstrap#
bootstrap() - Computes bootstrap confidence intervals
from frameon import load_dataset, FrameOn as fo
titanic = fo(load_dataset('titanic'))
titanic.stats.bootstrap(
dv='age',
between='alive',
reference_group='no',
statistic='mean_diff',
plot=True
)
⚠️ Found 177 (19.87%) missing values in dependent variable 'age' ⚠️ Group 'no' in factor 'alive' has 125 (22.77%) missing values in 'age' ⚠️ Group 'yes' in factor 'alive' has 52 (15.20%) missing values in 'age' Removed 177 rows (19.87% of data) with missing values in: age, alive
| method | statistic | n_resamples | standard_error | ci | alternative |
|---|---|---|---|---|---|
| BCa | 2.28 | 9999 | 1.11 | [0.16, 4.53] | two-sided |
Linear Regression#
ols() - Ordinary least squares regression
from frameon import load_dataset, FrameOn as fo
diamonds = fo(load_dataset('diamonds'))
diamonds.stats.ols(
formula='price ~ carat + C(cut, Treatment("Ideal"))',
cov_type='HC3',
show_plots=True
)
| Variable | VIF |
|---|---|
| carat | 2.08 |
| C(cut, Treatment("Ideal"))[T.Premium] | 1.491 |
| C(cut, Treatment("Ideal"))[T.Very Good] | 1.352 |
| C(cut, Treatment("Ideal"))[T.Good] | 1.158 |
| C(cut, Treatment("Ideal"))[T.Fair] | 1.079 |
| Dep. Variable: | price | R-squared: | 0.856 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.856 |
| Method: | Least Squares | F-statistic: | 2.092e+04 |
| Date: | Mon, 13 Oct 2025 | Prob (F-statistic): | 0.00 |
| Time: | 03:11:16 | Log-Likelihood: | -4.7142e+05 |
| No. Observations: | 53940 | AIC: | 9.429e+05 |
| Df Residuals: | 53934 | BIC: | 9.429e+05 |
| Df Model: | 5 | ||
| Covariance Type: | HC3 |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| Intercept | -2074.5457 | 16.092 | -128.920 | 0.000 | -2106.086 | -2043.006 |
| C(cut, Treatment("Ideal"))[T.Fair] | -1800.9240 | 51.723 | -34.819 | 0.000 | -1902.301 | -1699.547 |
| C(cut, Treatment("Ideal"))[T.Good] | -680.5921 | 22.873 | -29.755 | 0.000 | -725.423 | -635.761 |
| C(cut, Treatment("Ideal"))[T.Premium] | -361.8468 | 17.209 | -21.026 | 0.000 | -395.577 | -328.116 |
| C(cut, Treatment("Ideal"))[T.Very Good] | -290.7886 | 16.684 | -17.430 | 0.000 | -323.488 | -258.089 |
| carat | 7871.0821 | 24.892 | 316.207 | 0.000 | 7822.293 | 7919.871 |
| Omnibus: | 14616.138 | Durbin-Watson: | 1.027 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 150962.278 |
| Skew: | 1.007 | Prob(JB): | 0.00 |
| Kurtosis: | 10.944 | Cond. No. | 8.39 |
Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
Robust Linear Model#
rlm() - Regression with robust standard errors
from frameon import load_dataset, FrameOn as fo
diamonds = fo(load_dataset('diamonds'))
diamonds.stats.rlm(
formula='price ~ carat + C(cut, Treatment("Ideal"))',
)
| Variable | VIF |
|---|---|
| carat | 2.08 |
| C(cut, Treatment("Ideal"))[T.Premium] | 1.491 |
| C(cut, Treatment("Ideal"))[T.Very Good] | 1.352 |
| C(cut, Treatment("Ideal"))[T.Good] | 1.158 |
| C(cut, Treatment("Ideal"))[T.Fair] | 1.079 |
| Dep. Variable: | price | No. Observations: | 53940 |
|---|---|---|---|
| Model: | RLM | Df Residuals: | 53934 |
| Method: | IRLS | Df Model: | 5 |
| Norm: | HuberT | ||
| Scale Est.: | mad | ||
| Cov Type: | H1 | ||
| Date: | Mon, 13 Oct 2025 | ||
| Time: | 03:11:17 | ||
| No. Iterations: | 23 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| Intercept | -1834.9416 | 8.689 | -211.172 | 0.000 | -1851.972 | -1817.911 |
| C(cut, Treatment("Ideal"))[T.Fair] | -1263.4005 | 24.022 | -52.593 | 0.000 | -1310.483 | -1216.318 |
| C(cut, Treatment("Ideal"))[T.Good] | -490.9289 | 14.651 | -33.507 | 0.000 | -519.645 | -462.213 |
| C(cut, Treatment("Ideal"))[T.Premium] | -231.6302 | 10.192 | -22.727 | 0.000 | -251.606 | -211.654 |
| C(cut, Treatment("Ideal"))[T.Very Good] | -220.2565 | 10.525 | -20.926 | 0.000 | -240.886 | -199.627 |
| carat | 7219.3427 | 8.535 | 845.812 | 0.000 | 7202.614 | 7236.072 |
If the model instance has been used for another fit with different fit parameters, then the fit options might not be the correct ones anymore .
Generalized Linear Model#
glm() - Generalized linear regression
from frameon import load_dataset, FrameOn as fo
import statsmodels.api as sm
titanic = fo(load_dataset('titanic'))
titanic.stats.glm(
formula='survived ~ age + sex + pclass',
family=sm.families.Binomial()
)
| Variable | VIF |
|---|---|
| pclass | 3.359 |
| age | 2.924 |
| sex[T.male] | 2.859 |
| Dep. Variable: | survived | No. Observations: | 714 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 710 |
| Model Family: | Binomial | Df Model: | 3 |
| Link Function: | Logit | Scale: | 1.0000 |
| Method: | IRLS | Log-Likelihood: | -323.65 |
| Date: | Mon, 13 Oct 2025 | Deviance: | 647.29 |
| Time: | 03:11:17 | Pearson chi2: | 768. |
| No. Iterations: | 5 | Pseudo R-squ. (CS): | 0.3587 |
| Covariance Type: | HC3 |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| Intercept | 5.0560 | 0.526 | 9.607 | 0.000 | 4.023 | 6.089 |
| sex[T.male] | -2.5221 | 0.203 | -12.444 | 0.000 | -2.920 | -2.124 |
| age | -0.0369 | 0.008 | -4.610 | 0.000 | -0.053 | -0.021 |
| pclass | -1.2885 | 0.148 | -8.718 | 0.000 | -1.579 | -0.998 |
Quantile Regression#
quantreg() - Quantile regression
from frameon import load_dataset, FrameOn as fo
diamonds = fo(load_dataset('diamonds'))
diamonds.stats.quantreg(
formula='price ~ carat + C(cut, Treatment("Ideal"))',
q=0.95
)
| Variable | VIF |
|---|---|
| carat | 2.08 |
| C(cut, Treatment("Ideal"))[T.Premium] | 1.491 |
| C(cut, Treatment("Ideal"))[T.Very Good] | 1.352 |
| C(cut, Treatment("Ideal"))[T.Good] | 1.158 |
| C(cut, Treatment("Ideal"))[T.Fair] | 1.079 |
| Dep. Variable: | price | Pseudo R-squared: | 0.7297 |
|---|---|---|---|
| Model: | QuantReg | Bandwidth: | 220.4 |
| Method: | Least Squares | Sparsity: | 6958. |
| Date: | Mon, 13 Oct 2025 | No. Observations: | 53940 |
| Time: | 03:11:18 | Df Residuals: | 53934 |
| Df Model: | 5 |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| Intercept | -2220.0986 | 14.000 | -158.574 | 0.000 | -2247.540 | -2192.658 |
| C(cut, Treatment("Ideal"))[T.Fair] | -886.8169 | 40.760 | -21.757 | 0.000 | -966.707 | -806.927 |
| C(cut, Treatment("Ideal"))[T.Good] | -291.6197 | 24.700 | -11.807 | 0.000 | -340.031 | -243.208 |
| C(cut, Treatment("Ideal"))[T.Premium] | -102.2676 | 17.318 | -5.905 | 0.000 | -136.211 | -68.325 |
| C(cut, Treatment("Ideal"))[T.Very Good] | 137.0000 | 17.739 | 7.723 | 0.000 | 102.231 | 171.769 |
| carat | 1.063e+04 | 16.741 | 635.114 | 0.000 | 1.06e+04 | 1.07e+04 |
Ordered Model#
ordered_model() - Regression for ordered categorical dependent variables.
from frameon import load_dataset, FrameOn as fo
import pandas as pd
diamonds = fo(load_dataset('diamonds'))
cut_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
diamonds['cut'] = pd.Categorical(diamonds['cut'], categories=cut_order, ordered=True)
diamonds.stats.ordered_model(
formula='cut ~ carat + price',
distr='logit'
)
Optimization terminated successfully.
Current function value: 1.349193
Iterations: 28
Function evaluations: 33
Gradient evaluations: 33
| Dep. Variable: | cut | Log-Likelihood: | -72775. |
|---|---|---|---|
| Model: | OrderedModel | AIC: | 1.456e+05 |
| Method: | Maximum Likelihood | BIC: | 1.456e+05 |
| Date: | Mon, 13 Oct 2025 | ||
| Time: | 03:11:19 | ||
| No. Observations: | 53940 | ||
| Df Residuals: | 53934 | ||
| Df Model: | 2 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| carat | -2.1468 | 0.045 | -47.500 | 0.000 | -2.235 | -2.058 |
| price | 0.0002 | 5.37e-06 | 38.579 | 0.000 | 0.000 | 0.000 |
| Fair/Good | -4.4558 | 0.033 | -136.610 | 0.000 | -4.520 | -4.392 |
| Good/Very Good | 0.4227 | 0.015 | 28.551 | 0.000 | 0.394 | 0.452 |
| Very Good/Premium | 0.3132 | 0.009 | 36.365 | 0.000 | 0.296 | 0.330 |
| Premium/Ideal | 0.0793 | 0.008 | 10.283 | 0.000 | 0.064 | 0.094 |
Mixed Linear Model#
mixedlm() - Linear regression with random effects.
from frameon import load_dataset, FrameOn as fo
penguins = fo(load_dataset('penguins'))
penguins.stats.mixedlm(
"body_mass_g ~ bill_length_mm + C(sex)",
groups="island"
)
| Variable | VIF |
|---|---|
| bill_length_mm | 2.173 |
| C(sex)[T.MALE] | 2.173 |
| Model: | MixedLM | Dependent Variable: | body_mass_g |
| No. Observations: | 333 | Method: | REML |
| No. Groups: | 6 | Scale: | 0.0000 |
| Min. group size: | 1 | Log-Likelihood: | 2410.1918 |
| Max. group size: | 1 | Converged: | Yes |
| Mean group size: | 1.0 |
| Coef. | Std.Err. | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| Intercept | 3876.707 | 330.617 | 11.726 | 0.000 | 3228.709 | 4524.705 |
| C(sex)[T.MALE] | 171.865 | 19.946 | 8.617 | 0.000 | 132.771 | 210.958 |
| bill_length_mm | -8.892 | 8.505 | -1.046 | 0.296 | -25.562 | 7.777 |
| Group Var | 518.642 | 279357.703 |
Feature Importance Analysis#
feature_importance_analysis() - Analyzes feature importance for data analysis.
from frameon import load_dataset, FrameOn as fo
iris = fo(load_dataset('iris'))
iris.stats.feature_importance_analysis(
target_column='species',
feature_columns=iris.columns.difference(['species', 'species_id']).tolist()
)