Regression-Based Simpson’s Paradox
# imports
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
import mlsim
from mlsim.anomaly import sp_plot
The basic version of Simpson’s Paradox (SP) in regression form is the cluster-based one: we sample from a Gaussian mixture model and control the parameters that determine its shape.
# setup
r_clusters = -.6 # correlation coefficient of clusters
cluster_spread = .8 # pearson correlation of means
p_sp_clusters = .5 # portion of clusters with SP
k = 5 # number of clusters
cluster_size = [2,3]
domain_range = [0, 20, 0, 20]
N = 200 # number of points
p_clusters = [1.0/k]*k
We choose the portion of clusters that exhibit SP and can then draw samples.
p_sp_clusters = .9
sp_df2 = mlsim.anomaly.geometric_2d_gmm_sp(r_clusters,cluster_size,cluster_spread,
p_sp_clusters, domain_range,k,N,p_clusters)
sp_plot(sp_df2,'x1','x2','color')
Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
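
To make the structure of the generated data concrete, here is a minimal sketch of the same idea written directly in NumPy rather than with `mlsim`. Cluster means sit on an increasing line, while points inside each cluster are negatively correlated, so the pooled trend and the within-cluster trends have opposite signs. The means, covariance, and sample sizes are illustrative assumptions, not the values `geometric_2d_gmm_sp` uses internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed cluster centers on an increasing line (positive trend across clusters)
means = np.array([[2, 2], [6, 6], [10, 10], [14, 14], [18, 18]], dtype=float)
# Assumed within-cluster covariance: unit variance, correlation -0.6
cov = np.array([[1.0, -0.6], [-0.6, 1.0]])
n_per_cluster = 100

# Sample every cluster and remember which cluster each point came from
points = np.vstack([rng.multivariate_normal(m, cov, size=n_per_cluster) for m in means])
labels = np.repeat(np.arange(len(means)), n_per_cluster)

# Correlation over the pooled data
overall_r = np.corrcoef(points[:, 0], points[:, 1])[0, 1]
# Correlation inside each cluster
cluster_r = [
    np.corrcoef(points[labels == c, 0], points[labels == c, 1])[0, 1]
    for c in range(len(means))
]

print(f"overall r = {overall_r:.2f}")
print("per-cluster r =", np.round(cluster_r, 2))
```

The pooled correlation comes out positive while each per-cluster correlation is negative, which is the reversal that defines SP in this setting.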

We can change the parameters to see how the generated data varies.
# setup
r_clusters = -.4 # correlation coefficient of clusters
cluster_spread = .8 # pearson correlation of means
p_sp_clusters = .6 # portion of clusters with SP
k = 5 # number of clusters
cluster_size = [4,4]
domain_range = [0, 20, 0, 20]
N = 200 # number of points
p_clusters = [.5, .2, .1, .1, .1]
sp_df3 = mlsim.anomaly.geometric_2d_gmm_sp(r_clusters,cluster_size,cluster_spread,
p_sp_clusters, domain_range,k,N,p_clusters)
sp_plot(sp_df3,'x1','x2','color')
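
The unequal `p_clusters` above changes how many points land in each cluster. As a rough sketch, assuming `p_clusters` acts as the mixture weights (how `mlsim` assigns points internally is not shown here), the expected cluster sizes can be previewed with NumPy alone:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 200
p_clusters = [.5, .2, .1, .1, .1]

# Draw a cluster index for every point according to the assumed mixture weights
assignments = rng.choice(len(p_clusters), size=N, p=p_clusters)
# Count how many points each cluster received
counts = np.bincount(assignments, minlength=len(p_clusters))

print(counts)  # close to N * p_clusters, up to sampling noise
```

With these weights the first cluster holds about half of the points, so it dominates the pooled trend far more than any of the small clusters.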

Multiple Views
The first extension is to add multiple independent views; a wrapper function handles that.
many_sp_df = mlsim.anomaly.geometric_indep_views_gmm_sp(2,r_clusters,cluster_size,cluster_spread,p_sp_clusters,
domain_range,k,N,p_clusters)
sp_plot(many_sp_df,'x1','x2','A')
sp_plot(many_sp_df,'x3','x4','B')
many_sp_df.head()
| | x1 | x2 | x3 | x4 | A | B |
|---|---|---|---|---|---|---|
| 0 | 13.172482 | 14.230178 | 12.409012 | 10.026851 | A3 | B1 |
| 1 | 7.908730 | 11.563656 | 11.753394 | 10.979174 | A2 | B1 |
| 2 | 4.377354 | 8.629879 | 14.536545 | 15.306251 | A0 | B0 |
| 3 | 4.943237 | 7.607691 | 8.204048 | 7.204525 | A0 | B2 |
| 4 | 3.957393 | 9.682433 | 13.900453 | 15.426102 | A0 | B0 |
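
Each view pairs two continuous columns with one grouping column (`x1`, `x2` with `A`; `x3`, `x4` with `B`). One way to check a view for SP numerically is to compare the pooled correlation with the per-group correlations. The helper `sp_summary` below is a hypothetical name introduced for illustration and is not part of `mlsim`; the tiny frame is hand-made toy data, not generator output.

```python
import pandas as pd

def sp_summary(df, x, y, group):
    """Return the pooled correlation of x and y and the correlation within each group."""
    overall = df[x].corr(df[y])
    per_group = pd.Series(
        {name: g[x].corr(g[y]) for name, g in df.groupby(group)}
    )
    return overall, per_group

# Toy data: y falls with x inside each group, but rises across groups
toy = pd.DataFrame({
    "x1": [1, 2, 3, 11, 12, 13],
    "x2": [3, 2, 1, 13, 12, 11],
    "A":  ["A0", "A0", "A0", "A1", "A1", "A1"],
})

overall, per_group = sp_summary(toy, "x1", "x2", "A")
print(f"overall r = {overall:.2f}")
print(per_group)
```

On the frame generated above, the second view would be checked the same way with `sp_summary(many_sp_df, 'x3', 'x4', 'B')`.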


The views do not need to share the same parameters. Each parameter can instead be given as a list with one value per view.
# setup
r_clusters = [.8, -.2] # correlation coefficient of clusters
cluster_spread = [.8, .2] # pearson correlation of means
p_sp_clusters = [.6, 1] # portion of clusters with SP
k = [5,3] # number of clusters
cluster_size = [4,4]
domain_range = [0, 20, 0, 20]
N = 200 # number of points
p_clusters = [[.5, .2, .1, .1, .1],[1.0/3]*3]
many_sp_df_diff = mlsim.anomaly.geometric_indep_views_gmm_sp(2,r_clusters,cluster_size,cluster_spread,p_sp_clusters,
domain_range,k,N,p_clusters)
sp_plot(many_sp_df_diff,'x1','x2','A')
sp_plot(many_sp_df_diff,'x3','x4','B')
many_sp_df_diff.head()
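
The scalar-or-list convention for per-view parameters can be sketched as a small expansion step. This is an assumption about the general pattern, not the actual logic inside `geometric_indep_views_gmm_sp`; `per_view` is a hypothetical helper, and it only covers scalar-valued parameters such as `r_clusters` or `k`. Parameters that are lists even for a single view, such as `cluster_size` or `p_clusters`, would need a different check (for example, testing for a nested list).

```python
def per_view(param, n_views):
    """Expand a scalar-valued parameter into one value per view.

    A list or tuple must already hold exactly n_views values;
    any other value is repeated for every view.
    """
    if isinstance(param, (list, tuple)):
        if len(param) != n_views:
            raise ValueError(f"expected {n_views} values, got {len(param)}")
        return list(param)
    return [param] * n_views

n_views = 2
print(per_view(-.4, n_views))        # [-0.4, -0.4]
print(per_view([.8, -.2], n_views))  # [0.8, -0.2]
print(per_view([5, 3], n_views))     # [5, 3]
```

Raising on a length mismatch keeps a mistyped list from being silently reused across views.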

