Demo for the DoWhy causal API

This notebook shows a simple example of adding a causal extension (the df.causal namespace) to any pandas DataFrame.
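
Extensions of this kind are usually built on the accessor mechanism that pandas provides, and importing dowhy.api in the first cell below is what makes the causal namespace available. The following is a minimal sketch of that mechanism only, not DoWhy's code: the namespace demo, the class DemoAccessor, and the method group_means are hypothetical names chosen for illustration.

```python
import pandas as pd


# "demo" is a hypothetical namespace used only for this illustration.
@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    def __init__(self, pandas_obj):
        # pandas passes in the DataFrame the accessor is attached to.
        self._obj = pandas_obj

    def group_means(self, by, outcome):
        """Return the mean of `outcome` for each value of `by`."""
        return self._obj.groupby(by)[outcome].mean()


frame = pd.DataFrame({"v0": [True, False, True, False],
                      "y": [5.0, 0.0, 6.0, 1.0]})

# Every DataFrame now exposes the registered namespace.
means = frame.demo.group_means(by="v0", outcome="y")
```

Once a namespace is registered, it is available on every DataFrame in the session, which is why the cells below can call df.causal on an ordinary frame.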

[1]:
import dowhy.datasets
import dowhy.api

import numpy as np
import pandas as pd

from statsmodels.api import OLS
[2]:
data = dowhy.datasets.linear_dataset(
    beta=5,
    num_common_causes=1,
    num_instruments=0,
    num_samples=1000,
    treatment_is_binary=True,
)
df = data['df']
# Add noise to the outcome. Without noise, the variance of Y | X, Z is zero and MCMC fails.
df['y'] = df['y'] + np.random.normal(size=len(df))
# data['dot_graph'] = 'digraph { v ->y;X0-> v;X0-> y;}'

treatment = data["treatment_name"][0]
outcome = data["outcome_name"][0]
common_cause = data["common_causes_names"][0]
df
[2]:
W0 v0 y
0 0.076236 True 5.553473
1 0.060844 False 0.368952
2 -0.028633 False -0.494726
3 -1.427719 True 5.404953
4 -0.671959 False -1.001133
... ... ... ...
995 0.700718 True 5.251570
996 0.946380 True 7.180262
997 -0.004310 True 4.482246
998 0.618509 True 4.455502
999 -1.381954 False -0.086505

1000 rows × 3 columns

[3]:
# data['df'] is just a regular pandas.DataFrame
# In variable_types, 'b' marks a binary variable and 'c' a continuous one.
df.causal.do(
    x=treatment,
    variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
    outcome=outcome,
    common_causes=[common_cause],
    proceed_when_unidentifiable=True,
).groupby(treatment).mean().plot(y=outcome, kind='bar')
[3]:
<AxesSubplot: xlabel='v0'>
[Figure: bar chart of the mean of y for each value of v0]
[4]:
# Set the treatment to 1 and sample with the weighting method.
df.causal.do(
    x={treatment: 1},
    variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
    outcome=outcome,
    method='weighting',
    common_causes=[common_cause],
    proceed_when_unidentifiable=True,
).groupby(treatment).mean().plot(y=outcome, kind='bar')
[4]:
<AxesSubplot: xlabel='v0'>
[Figure: bar chart of the mean of y for each value of v0]
[5]:
# Causal data frame with the treatment set to 1, using the dataset's graph.
cdf_1 = df.causal.do(
    x={treatment: 1},
    variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
    outcome=outcome,
    dot_graph=data['dot_graph'],
    proceed_when_unidentifiable=True,
)

# Causal data frame with the treatment set to 0.
cdf_0 = df.causal.do(
    x={treatment: 0},
    variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
    outcome=outcome,
    dot_graph=data['dot_graph'],
    proceed_when_unidentifiable=True,
)

[6]:
cdf_0
[6]:
W0 v0 y propensity_score weight
0 -0.352919 False 0.446923 0.584806 1.709968
1 0.814069 False 0.333252 0.282563 3.539036
2 -2.343394 False 1.050524 0.925262 1.080776
3 0.828998 False -0.382004 0.279270 3.580766
4 1.064405 False 0.701340 0.230561 4.337242
... ... ... ... ... ...
995 0.895503 False 0.163814 0.264889 3.775160
996 2.014276 False -0.379485 0.096009 10.415706
997 -0.376085 False -0.769199 0.590935 1.692233
998 1.836714 False 1.371098 0.114205 8.756149
999 0.428957 False -0.634386 0.374900 2.667378

1000 rows × 5 columns

[7]:
cdf_1
[7]:
W0 v0 y propensity_score weight
0 1.303760 True 5.675314 0.812527 1.230728
1 1.436695 True 6.111159 0.833644 1.199553
2 0.631619 True 6.052427 0.675366 1.480680
3 0.354007 True 6.152660 0.605733 1.650893
4 -2.741986 True 2.913577 0.049674 20.131449
... ... ... ... ... ...
995 0.721921 True 5.594780 0.696601 1.435543
996 1.337709 True 3.280717 0.818109 1.222331
997 -2.741986 True 2.913577 0.049674 20.131449
998 1.812561 True 5.540571 0.883099 1.132376
999 0.124867 True 5.605328 0.544679 1.835943

1000 rows × 5 columns
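
The propensity_score and weight columns above are consistent with inverse propensity weighting: in the first row of cdf_0, 1 / 0.584806 ≈ 1.709968. The sketch below illustrates that idea with NumPy alone on synthetic data. It is not DoWhy's implementation; the helper names fit_logistic and ipw_ate, the data-generating process, and the use of normalized weighted means are all assumptions made for illustration.

```python
import numpy as np


def fit_logistic(X, t, n_iter=25):
    """Fit a logistic regression by Newton's method (X includes an intercept column)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (t - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        beta += np.linalg.solve(hess, grad)
    return beta


def ipw_ate(w, t, y):
    """Inverse-propensity-weighted difference in mean outcomes (hypothetical helper)."""
    X = np.column_stack([np.ones_like(w), w])
    beta = fit_logistic(X, t)
    p_treated = 1.0 / (1.0 + np.exp(-X @ beta))
    # Propensity of the treatment each unit actually received.
    p_observed = np.where(t == 1, p_treated, 1.0 - p_treated)
    weight = 1.0 / p_observed
    mean_1 = np.average(y[t == 1], weights=weight[t == 1])
    mean_0 = np.average(y[t == 0], weights=weight[t == 0])
    return mean_1 - mean_0


# Synthetic data (an assumption): one confounder w, true effect of 5.
rng = np.random.default_rng(0)
n = 5000
w = rng.normal(size=n)
t = (rng.random(n) < 1.0 / (1.0 + np.exp(-w))).astype(float)
y = 5.0 * t + 0.5 * w + rng.normal(size=n)

naive = y[t == 1].mean() - y[t == 0].mean()  # confounded comparison
ate = ipw_ate(w, t, y)                        # weighted comparison
```

Because w raises both the chance of treatment and the outcome, the naive difference overstates the effect, while the weighted difference should land close to the true value of 5.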

Comparing the estimate to Linear Regression

First, we estimate the effect using the causal data frames, then compute the half-width of its 95% confidence interval.

[8]:
(cdf_1['y'] - cdf_0['y']).mean()
[8]:
$\displaystyle 5.06489946748145$
[9]:
1.96*(cdf_1['y'] - cdf_0['y']).std() / np.sqrt(len(df))
[9]:
$\displaystyle 0.0887509312439617$
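
Taken together, the two outputs above give an estimate of about 5.065 ± 0.089, that is, an interval of roughly (4.976, 5.154), which contains the true effect beta=5. The same arithmetic can be wrapped in a small helper; mean_diff_ci is a hypothetical name, and the toy arrays below are made up for illustration. It uses ddof=1 to match the default of the pandas std() call above.

```python
import numpy as np


def mean_diff_ci(y1, y0, z=1.96):
    """Mean of y1 - y0 with a normal-approximation confidence interval."""
    d = np.asarray(y1, dtype=float) - np.asarray(y0, dtype=float)
    estimate = d.mean()
    # Standard error of the mean difference, using the sample standard deviation.
    half_width = z * d.std(ddof=1) / np.sqrt(len(d))
    return estimate, estimate - half_width, estimate + half_width


# Toy values (not from the notebook) to show the call.
estimate, lower, upper = mean_diff_ci([6.0, 5.0, 7.0, 6.0], [1.0, 0.0, 1.0, 2.0])
```

With the notebook's data, the call would be mean_diff_ci(cdf_1['y'], cdf_0['y']).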

Next, we compare this to the estimate from OLS.

[10]:
model = OLS(np.asarray(df[outcome]), np.asarray(df[[common_cause, treatment]], dtype=np.float64))
result = model.fit()
result.summary()
[10]:
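                            OLS Regression Results
==============================================================================
Dep. Variable:                       y   R-squared (uncentered):         0.942
Model:                             OLS   Adj. R-squared (uncentered):    0.941
Method:                  Least Squares   F-statistic:                    8038.
Date:                 Fri, 16 Dec 2022   Prob (F-statistic):              0.00
Time:                         19:37:21   Log-Likelihood:               -1410.4
No. Observations:                 1000   AIC:                            2825.
Df Residuals:                      998   BIC:                            2835.
Df Model:                            2
Covariance Type:             nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.2405      0.036      6.761      0.000       0.171       0.310
x2             5.0122      0.048    104.315      0.000       4.918       5.106
==============================================================================
Omnibus:                         1.561   Durbin-Watson:                  1.932
Prob(Omnibus):                   0.458   Jarque-Bera (JB):               1.428
Skew:                           -0.062   Prob(JB):                       0.490
Kurtosis:                        3.137   Cond. No.                        1.94
==============================================================================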


Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
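
Because the regressors were passed as a plain array built from df[[common_cause, treatment]], x1 is the common cause W0 and x2 is the treatment v0, so the OLS estimate of the effect is 5.0122, close to the causal data frame estimate of about 5.065 and to the true beta=5. As note [1] says, the model was fit without a constant. The sketch below, using NumPy least squares on synthetic data, shows the same regression with and without an intercept column. The data-generating process and all variable names are assumptions made for illustration, and the treatment here is assigned independently of w0, unlike in the notebook's dataset.

```python
import numpy as np

# Synthetic data (an assumption): true treatment effect 5, no true intercept.
rng = np.random.default_rng(0)
n = 1000
w0 = rng.normal(size=n)
v0 = (rng.random(n) < 0.5).astype(float)
y = 5.0 * v0 + 0.25 * w0 + rng.normal(size=n)

# Without a constant: columns are [w0, v0], matching x1 and x2 in the summary.
X_no_const = np.column_stack([w0, v0])
coef_no_const, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)

# With a constant: columns are [1, w0, v0].
X_const = np.column_stack([np.ones(n), w0, v0])
coef_const, *_ = np.linalg.lstsq(X_const, y, rcond=None)

effect_no_const = coef_no_const[1]
effect_const = coef_const[2]
```

When the data have no true intercept, both fits should give a similar treatment coefficient; if an intercept were present, omitting the constant would push part of it into the other coefficients.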