Title: | Fitting Linear Models with Endogenous Regressors using Latent Instrumental Variables |
---|---|
Description: | Fits linear models with endogenous regressors using latent instrumental variable approaches. The methods included in the package are Lewbel's (1997) <doi:10.2307/2171884> higher moments approach as well as Lewbel's (2012) <doi:10.1080/07350015.2012.643126> heteroscedasticity approach, Park and Gupta's (2012) <doi:10.1287/mksc.1120.0718> joint estimation method that uses a Gaussian copula, and Kim and Frees's (2007) <doi:10.1007/s11336-007-9008-1> multilevel generalized method of moments approach that deals with endogeneity in a multilevel setting. These are statistical techniques to address the endogeneity problem where no external instrumental variables are needed. See the publication related to this package in the Journal of Statistical Software for more details: <doi:10.18637/jss.v107.i03>. Note that with version 2.0.0 sweeping changes were introduced which greatly improve functionality and usability but break backwards compatibility. |
Authors: | Raluca Gui [cre, aut], Markus Meierer [aut], Rene Algesheimer [aut], Patrik Schilter [aut] |
Maintainer: | Raluca Gui <[email protected]> |
License: | GPL-3 |
Version: | 2.4.10 |
Built: | 2024-11-09 04:54:07 UTC |
Source: | https://github.com/mmeierer/rendo |
Confidence Intervals for Bootstrapped Model Parameters
## S3 method for class 'rendo.boots'
confint(object, parm, level = 0.95, ...)
object |
a fitted model object with bootstrapped parameters. Typically from |
parm |
a specification of which parameters are to be given confidence intervals, either a vector of numbers or a vector of names. If missing, all parameters are considered. |
level |
the confidence level required. |
... |
ignored, for consistency with the generic function. |
Computes the two-sided percentile confidence intervals from the bootstrapped parameter estimates. The intervals are obtained by selecting the quantiles of the bootstrapped parameter estimates corresponding to the given alpha level.
A minimum of 1/min(level, 1-level) parameter estimates are needed to derive the confidence interval. Otherwise there is no natural way to derive the percentiles (i.e. one cannot reasonably estimate the 95% quantile from only 7 values).
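The percentile interval described above can be sketched in base R. This is an illustration only, not the package's internal code: the function name percentile_ci and the simulated matrix boots.params are hypothetical, and quantile() is used with its default interpolation type, which may differ from what the package does.

```r
# Hypothetical bootstrap output: one row per bootstrap run, one column per parameter.
set.seed(42)
boots.params <- cbind("(Intercept)" = rnorm(1000, mean = 2,  sd = 0.10),
                      P             = rnorm(1000, mean = -1, sd = 0.05))

percentile_ci <- function(boots, level = 0.95) {
  # Refuse to derive percentiles from too few bootstrapped estimates.
  min.needed <- 1 / min(level, 1 - level)
  if (nrow(boots) < min.needed) {
    stop("Not enough bootstrapped estimates for the requested level.")
  }
  alpha <- 1 - level
  probs <- c(alpha / 2, 1 - alpha / 2)
  # Lower and upper quantile of every parameter's bootstrap distribution.
  ci <- t(apply(boots, 2, quantile, probs = probs, names = FALSE))
  colnames(ci) <- paste0(format(100 * probs, trim = TRUE), " %")
  ci
}

ci <- percentile_ci(boots.params, level = 0.95)
print(ci)
```

With level = 0.95 the check requires at least 20 bootstrapped estimates, so a call with only 7 rows stops with an error instead of returning unreliable bounds.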
Fits linear models with continuous or discrete endogenous regressors (or a mixture of both) using Gaussian copulas, as presented in Park and Gupta (2012). This is a statistical technique to address the endogeneity problem where no external instrumental variables are needed. The important assumption of the model is that continuous endogenous variables are NOT normally distributed; preferably, they have a skewed distribution. The corrections proposed by Qian, Koschmann, and Xie (2024, p.19-22) are implemented. These mitigate the bias of the original paper for small and moderate sample sizes.
copulaCorrection(formula, data, num.boots = 1000, verbose = TRUE, ...)
formula |
A symbolic description of the model to be fitted. See the "Details" section for the exact notation. |
data |
A data.frame containing the data of all parts specified in the formula parameter. |
num.boots |
Number of bootstrapping iterations. Defaults to 1000. |
verbose |
Show details about the running of the function. |
... |
Arguments for the log-likelihood optimization function in the case of a single continuous endogenous regressor. Ignored with a warning otherwise. |
The underlying idea of the joint estimation method is that using information contained in the observed data, one selects marginal distributions for the endogenous regressor and the structural error term, respectively. Then, the copula model enables the construction of a flexible multivariate joint distribution allowing a wide range of correlations between the two marginals.
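Consider the model:
$$Y_t = \beta_0 + \beta_1 P_t + \beta_2 X_t + \epsilon_t$$
where $t = 1, \dots, T$ indexes either time or cross-sectional units, $Y_t$ is the response variable, $X_t$ is an exogenous regressor, $P_t$ is a continuous endogenous regressor, $\epsilon_t$ is a normally distributed structural error term with mean zero and $E(\epsilon^2) = \sigma_\epsilon^2$, and $\beta_0$, $\beta_1$, and $\beta_2$ are model parameters.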
The marginal distribution of the endogenous regressor $P_t$ is obtained using the Epanechnikov kernel density estimator (Epanechnikov, 1969):
$$\hat{h}(p) = \frac{1}{T \cdot b} \sum_{t=1}^{T} K\left(\frac{p - P_t}{b}\right)$$
where $P_t$ is the endogenous regressor, $K(x) = 0.75 \cdot (1 - x^2) \cdot I(|x| \le 1)$, and the bandwidth $b$ is the one proposed by Silverman (1986), $b = 0.9 \cdot T^{-1/5} \cdot \min(s, IQR/1.34)$. Here, $IQR$ is the interquartile range, $s$ is the sample standard deviation of the data, and $T$ is the number of time periods observed in the data.
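As a rough illustration of the density estimate described above, the following base R sketch evaluates an Epanechnikov kernel density with Silverman's rule-of-thumb bandwidth, b = 0.9 · T^(-1/5) · min(s, IQR/1.34). The function names and the simulated regressor are hypothetical and not taken from the package.

```r
# Epanechnikov kernel: K(x) = 0.75 * (1 - x^2) on [-1, 1], zero elsewhere.
epanechnikov <- function(x) 0.75 * (1 - x^2) * (abs(x) <= 1)

# Silverman's rule-of-thumb bandwidth (exponent -1/5).
silverman_bw <- function(p) {
  0.9 * length(p)^(-1 / 5) * min(sd(p), IQR(p) / 1.34)
}

# Kernel density estimate at the points p.eval, based on the sample p.
kde_epanechnikov <- function(p.eval, p) {
  b <- silverman_bw(p)
  sapply(p.eval, function(x) sum(epanechnikov((x - p) / b)) / (length(p) * b))
}

set.seed(1)
P    <- rt(500, df = 3)                      # toy skew-free but non-normal regressor
grid <- seq(min(P) - 1, max(P) + 1, length.out = 2000)
dens <- kde_epanechnikov(grid, P)

# A density should integrate to roughly one (simple Riemann sum).
area <- sum(dens) * (grid[2] - grid[1])
print(area)
```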
After obtaining the joint distribution of the error term and the continuous endogenous regressor, the model parameters are estimated using maximum likelihood estimation.
The additional parameters used during model fitting and printed in summary hence are:
rho
The correlation between the endogenous regressor and the error.
sigma
The variance of the model's error.
With more than one continuous endogenous regressor or an endogenous discrete regressor, an alternative approach to the estimation using Gaussian copula should be applied. This approach is similar to the control function approach (Petrin and Train, 2010). The core idea is to apply OLS estimation on the original set of explanatory variables in the model equation above, plus an additional regressor $P_t^* = \Phi^{-1}(H(P_t))$. Here, $H(P_t)$ is the marginal distribution of the endogenous regressor $P_t$. Including this regressor removes the correlation between the endogenous regressor and the structural error $\epsilon_t$, so that OLS provides consistent parameter estimates. Due to identification problems, the discrete endogenous regressor cannot have a binomial distribution.
Hence, only in the case of a single continuous endogenous regressor maximum likelihood estimation is used. In all other cases, augmented OLS based on Gaussian copula is applied. This includes cases of multiple endogenous regressors of both discrete and continuous distributions.
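The augmented OLS idea can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the package's implementation: the data-generating process is made up, and rescaling the empirical distribution function by n/(n+1) to keep the normal quantiles finite is an assumption of this sketch.

```r
set.seed(123)
n   <- 2000
rho <- 0.5                                      # toy correlation in the copula
X1  <- rnorm(n)
z1  <- rnorm(n)
z2  <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)
P   <- qexp(pnorm(z1))                          # skewed endogenous regressor
eps <- z2                                       # structural error, correlated with P
y   <- 2 + 1.5 * X1 - 1 * P + eps

# Empirical marginal distribution H of P, rescaled so that H < 1 everywhere.
H      <- ecdf(P)
p.star <- qnorm(H(P) * n / (n + 1))             # P* = Phi^{-1}(H(P))

# OLS on the original regressors plus the additional regressor P*.
fit <- lm(y ~ X1 + P + p.star)
print(coef(fit))
```

The standard errors printed by summary(fit) would not be valid here, because P* is itself estimated from the data.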
In the case of discrete endogenous regressors, a random seed needs to be assigned because the marginal distribution function of the endogenous regressor is a step function. This means that the value of $P_t^*$ lies between two values, $\Phi^{-1}(H(P_t - 1))$ and $\Phi^{-1}(H(P_t))$, and is therefore drawn at random from that range. The reported upper and lower bounds of the 95% bootstrapped confidence interval nonetheless give an indication of the variance of the estimates.
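A minimal sketch of why the seed matters for a discrete regressor, assuming the value is drawn uniformly between H(P - 1) and H(P) before the normal quantile is applied. The uniform draw and all variable names are assumptions of this illustration.

```r
set.seed(2024)                       # fixes the random draw below
P <- rpois(1000, lambda = 3)         # toy discrete endogenous regressor

H     <- ecdf(P)                     # step function
upper <- H(P)                        # H(P_t)
lower <- H(P - 1)                    # H(P_t - 1), zero for the smallest value

# Draw one value per observation inside its step, then map to the normal scale.
u      <- runif(length(P), min = lower, max = upper)
p.star <- qnorm(u)

summary(p.star)
```

Running the last three statements again without resetting the seed yields different p.star values, which is why results are only reproducible with a fixed seed.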
Since the inference procedure in both cases, augmented OLS and maximum likelihood, occurs in two stages (first the empirical distribution of the endogenous regressor is computed and then used in constructing the likelihood function), the usual standard errors are not correct. Therefore, in both cases, the standard errors and the confidence intervals are obtained from the sampling distributions resulting from bootstrapping. Since the distribution of the bootstrapped parameters is highly skewed, we report the percentile confidence intervals. Moreover, the variance-covariance matrix is also computed from the bootstrapped parameters, and not from the Hessian.
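The bootstrap-based standard errors and variance-covariance matrix can be illustrated with a small base R sketch. A plain lm() fit stands in for the full two-stage procedure, which would have to be re-run on every resample; the data and names are hypothetical.

```r
set.seed(7)
n   <- 500
dat <- data.frame(X1 = rnorm(n), P = rexp(n))
dat$y <- 2 + 1.5 * dat$X1 - dat$P + rnorm(n)

num.boots <- 200
# Each run resamples rows with replacement and refits the model.
boots.params <- t(replicate(num.boots, {
  idx <- sample.int(n, replace = TRUE)
  coef(lm(y ~ X1 + P, data = dat[idx, ]))
}))

# Variance-covariance matrix and standard errors from the bootstrapped coefficients.
boots.vcov <- cov(boots.params)
boots.se   <- sqrt(diag(boots.vcov))
print(boots.se)
```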
The formula argument follows a two part notation:
A two-sided formula describing the model (e.g. y ~ X1 + X2 + P) to be estimated and a second right-hand side part in which the endogenous regressors and their distributional assumptions are indicated (e.g. continuous(P)). These two parts are separated by a single vertical bar (|).
In the second part, the special functions continuous, discrete, or a combination of both, are used to indicate the endogenous regressors and their respective distribution. Both functions use the ... parameter in which the respective endogenous regressors are specified.
Note that no argument to continuous or discrete is to be supplied as character but as symbols without quotation marks.
See the example section for illustrations on how to specify the formula parameter.
For all cases, an object of classes rendo.copula.correction, rendo.boots, and rendo.base is returned, which is a list and contains the following components:
formula |
The formula given to specify the fitted model. |
terms |
The terms object used for model fitting. |
model |
The model.frame used for model fitting. |
coefficients |
A named vector of all coefficients resulting from model fitting. |
names.main.coefs |
a vector specifying which coefficients are from the model. For internal usage. |
names.vars.continuous |
The names of the continuous endogenous regressors. |
names.vars.discrete |
The names of the discrete endogenous regressors. |
fitted.values |
Fitted values at the found solution. |
residuals |
The residuals at the found solution. |
boots.params |
The bootstrapped coefficients. |
For the case of a single continuous endogenous regressor, the returned object further contains the following components:
start.params |
A named vector with the initial set of parameters used to optimize the log-likelihood function. |
res.optimx |
The result object returned by the function |
For all other cases, the returned object further contains the following component:
res.lm.real.data |
The linear model fitted on the original data together with generated p.star data. |
The function summary can be used to obtain and print a summary of the results.
Depending on the returned object, the generic accessor functions coefficients, fitted.values, residuals, vcov, logLik, AIC, BIC, and nobs are available.
Park, S. and Gupta, S., (2012), "Handling Endogenous Regressors by Joint Estimation Using Copulas", Marketing Science, 31(4), 567-86.
Qian, Y., Koschmann, A., and Xie, H. (2024). "A Practical Guide to Endogeneity Correction Using Copulas". National Bureau of Economic Research, w32231.
Epanechnikov V (1969). "Nonparametric Estimation of a Multidimensional Probability Density." Teoriya veroyatnostei i ee primeneniya, 14(1), 156–161.
Silverman B (1986). "Density Estimation for Statistics and Data Analysis". CRC Monographs on Statistics and Applied Probability. London: Chapman & Hall.
Petrin A, Train K (2010). "A Control Function Approach to Endogeneity in Consumer Choice Models." Journal of Marketing Research, 47(1), 3–13.
summary for how fitted models are summarized
vcov for how the variance-covariance matrix is derived
confint for how confidence intervals are derived
optimx for possible elements of parameter optimx.args
data("dataCopCont")
data("dataCopCont2")
data("dataCopDis")
data("dataCopDis2")
data("dataCopDisCont")

## Not run:
# Single continuous: log-likelihood optimization
c1 <- copulaCorrection(y~X1+X2+P|continuous(P), num.boots=10, data=dataCopCont)
# same as above, with start.parameters and number of bootstrappings
c1 <- copulaCorrection(y~X1+X2+P|continuous(P), num.boots=10, data=dataCopCont,
                       start.params = c("(Intercept)"=1, X1=1, X2=-2, P=-1))

# All following examples fit linear model with Gaussian copulas

# 2 continuous endogenous regressors
c2 <- copulaCorrection(y~X1+X2+P1+P2|continuous(P1, P2),
                       num.boots=10, data=dataCopCont2)
# same as above
c2 <- copulaCorrection(y~X1+X2+P1+P2|continuous(P1)+continuous(P2),
                       num.boots=10, data=dataCopCont2)

# single discrete endogenous regressor
d1 <- copulaCorrection(y~X1+X2+P|discrete(P), num.boots=10, data=dataCopDis)
# two discrete endogenous regressors
d2 <- copulaCorrection(y~X1+X2+P1+P2|discrete(P1)+discrete(P2),
                       num.boots=10, data=dataCopDis2)
# same as above, using the combined discrete(P1, P2) notation
d2 <- copulaCorrection(y~X1+X2+P1+P2|discrete(P1, P2),
                       num.boots = 10, data=dataCopDis2)

# single discrete, single continuous
cd <- copulaCorrection(y~X1+X2+P1+P2|discrete(P1)+continuous(P2),
                       num.boots=10, data=dataCopDisCont)

# For single continuous only: use own optimization settings (see optimx())
# set maximum number of iterations to 50'000
res.c1 <- copulaCorrection(y~X1+X2+P|continuous(P),
                           optimx.args = list(itnmax = 50000),
                           num.boots=10, data=dataCopCont)

# print detailed tracing information on progress
res.c1 <- copulaCorrection(y~X1+X2+P|continuous(P),
                           optimx.args = list(control = list(trace = 6)),
                           num.boots=10, data=dataCopCont)

# use method L-BFGS-B instead of Nelder-Mead and print report every 50 iterations
res.c1 <- copulaCorrection(y~X1+X2+P|continuous(P),
                           optimx.args = list(method = "L-BFGS-B",
                                              control=list(trace = 2, REPORT=50)),
                           num.boots=10, data=dataCopCont)

# For coef(), the parameter "complete" determines if only the
# main model parameters or also the auxiliary coefficients are returned

c1.all.coefs <- coef(res.c1) # also returns rho and sigma
# same as above
c1.all.coefs <- coef(res.c1, complete = TRUE)

# only main model coefs
c1.main.coefs <- coef(res.c1, complete = FALSE)

## End(Not run)
A dataset with two exogenous regressors, X1, X2, and one endogenous, continuous regressor, P, having a T-distribution with 3 degrees of freedom. An intercept and a dependent variable, y, are also included. The true parameter values for the coefficients are: b0 = 2, b1 = 1.5, b2 = -3 and the coefficient of the endogenous regressor, P, is equal to a1 = -1.
data("dataCopCont")
A data frame with 2500 observations on 4 variables:
y
a numeric vector representing the dependent variable.
X1
a numeric vector, normally distributed and exogenous.
X2
a numeric vector, normally distributed and exogenous.
P
a numeric vector, continuous and endogenous having T-distribution with 3 degrees of freedom.
Raluca Gui [email protected]
A dataset with two exogenous regressors, X1, X2, and two endogenous, continuous regressors, P1 and P2, having a T-distribution with 3 degrees of freedom. An intercept and a dependent variable, y, are also included. The true parameter values for the intercept and the exogenous regressors' coefficients are: b0 = 2, b1 = 1.5, b2 = -3. The coefficient of the endogenous regressor P1 is equal to a1 = -1 and of P2 is equal to a2 = 0.8.
data("dataCopCont2")
A data frame with 2500 observations on 5 variables:
y
a numeric vector representing the dependent variable.
X1
a numeric vector, normally distributed and exogenous.
X2
a numeric vector, normally distributed and exogenous.
P1
a numeric vector, continuous and endogenous having T-distribution with 3 degrees of freedom.
P2
a numeric vector, continuous and endogenous having T-distribution with 3 degrees of freedom.
Raluca Gui [email protected]
A dataset with two exogenous regressors, X1, X2, and one endogenous, discrete (Poisson distributed) regressor, P. An intercept and a dependent variable, y, are also included. The true parameter values for the coefficients are: b0 = 2, b1 = 1.5, b2 = -3 and the coefficient of the endogenous regressor, P, is equal to a1 = -1.
data("dataCopDis")
A data frame with 2500 observations on 4 variables:
y
a numeric vector representing the dependent variable.
X1
a numeric vector, normally distributed and exogenous.
X2
a numeric vector, normally distributed and exogenous.
P
a numeric vector, discrete (Poisson distributed) and endogenous.
Raluca Gui [email protected]
A dataset with two exogenous regressors, X1, X2, and two endogenous, discrete (Poisson distributed) regressors, P1 and P2. An intercept and a dependent variable, y, are also included. The true parameter values for the coefficients of the intercept and the exogenous variables are: b0 = 2, b1 = 1.5, b2 = -3. The true parameter values for the coefficients of the endogenous regressors are a1 = -1 for P1 and a2 = 0.8 for P2.
data("dataCopDis2")
A data frame with 2500 observations on 5 variables:
y
a numeric vector representing the dependent variable.
X1
a numeric vector, normally distributed and exogenous.
X2
a numeric vector, normally distributed and exogenous.
P1
a numeric vector, having a Poisson distribution with parameter lambda equal to 3, and endogenous.
P2
a numeric vector, having a Poisson distribution with parameter lambda equal to 3, and endogenous.
Raluca Gui [email protected]
A dataset with two exogenous regressors, X1, X2, and two endogenous regressors, P1, having a Poisson distribution with lambda parameter equal to 3, and P2, having a T-distribution with 3 degrees of freedom. An intercept and a dependent variable, y, are also included. The true parameter values for the coefficients are: b0 = 2, b1 = 1.5, b2 = -3. The coefficient of the endogenous regressor P1 is set to a1 = -1 and that of P2 is set to a2 = 0.8.
data("dataCopDisCont")
A data frame with 2500 observations on 5 variables:
y
a numeric vector representing the dependent variable.
X1
a numeric vector, normally distributed and exogenous.
X2
a numeric vector, normally distributed and exogenous.
P1
a numeric vector, discrete and endogenous having Poisson distribution with parameter lambda equal to 3.
P2
a numeric vector, continuous and endogenous having T-distribution with 3 degrees of freedom.
Raluca Gui [email protected]
A dataset with two exogenous regressors, X1, X2, one endogenous, continuous regressor P, and the dependent variable y. The true parameter values for the coefficients are: b0 = 2, b1 = 1.5, b2 = 3 and the coefficient of the endogenous regressor, P, is equal to a1 = -1.
data("dataHetIV")
A data frame with 2500 observations on 4 variables:
y
a numeric vector representing the dependent variable.
X1
a numeric vector, normally distributed and exogenous.
X2
a numeric vector, normally distributed and exogenous.
P
a numeric vector, continuous and endogenous regressor, normally distributed.
Raluca Gui [email protected]
A dataset with two exogenous regressors, X1, X2, and one endogenous, continuous regressor P. An intercept and a dependent variable, y, are also included. The true parameter values for the coefficients are: b0 = 2, b1 = 1.5, b2 = 3 and the coefficient of the endogenous regressor, P, is equal to a1 = -1.
data("dataHigherMoments")
A data frame with 2500 observations on 4 variables:
y
a numeric vector representing the dependent variable.
X1
a numeric vector, normally distributed and exogenous.
X2
a numeric vector, normally distributed and exogenous.
P
a numeric vector, continuous and endogenous regressor, normally distributed.
Raluca Gui [email protected]
data("dataHigherMoments")

# to recover the parameters,
# on average over many simulations
higherMomentsIV(formula = y ~ X1 + X2 + P|P|IIV(iiv=yp),
                data=dataHigherMoments)
A dataset with one endogenous regressor P, an instrument Z used to build P, an intercept and a dependent variable, y. The true parameter values for the coefficients are: b0 = 3 for the intercept and a1 = -1 for P.
data("dataLatentIV")
A data frame with 2500 observations on 3 variables:
y
a numeric vector representing the dependent variable.
P
a numeric vector representing the endogenous variable.
Z
a numeric vector used in the construction of the endogenous variable, P.
Raluca Gui [email protected]
A dataset simulated to exemplify the use of the multilevelIV() function. It has 2767 observations, clustered into 40 level-three units and 1347 level-two units. The endogenous regressor is X15 with a true coefficient value of -1.
data("dataMultilevelIV")
A data frame with 2767 observations clustered into 40 level-three units and 1347 level-two units.
y
a numeric vector representing the dependent variable.
X11
a level-one numeric vector representing a categorical exogenous variable with true parameter value equal to 3.
X12
a level-one numeric vector representing a binomial distributed exogenous variable with true parameter value equal to 9.
X13
a level-one numeric vector representing a binomial distributed exogenous variable with true parameter value equal to -2.
X14
a level-two numeric vector representing a normally distributed exogenous variable with true parameter value equal to 2.
X15
a level-two numeric vector representing a normally distributed endogenous variable, correlated with the level-two errors.
Its true parameter value equals -1 and it has a correlation with the level-two errors equal to 0.7.
X21
a level-two numeric vector representing a binomial distributed exogenous variable with true parameter value equal to -1.5.
X22
a level-two numeric vector representing a binomial distributed exogenous variable with true parameter value equal to -4.
X23
a level-two numeric vector representing a binomial distributed exogenous variable with true parameter value equal to -3.
X24
a level-two numeric vector representing a normally distributed exogenous variable with true parameter value equal to 6.
X31
a level-three numeric vector representing a normally distributed exogenous variable with true parameter value equal to 0.5.
X32
a level-three numeric vector representing a truncated normally distributed exogenous variable with true parameter value equal to 0.1.
X33
a level-three numeric vector representing a truncated normally distributed exogenous variable with true parameter value equal to -0.5.
SID
a numeric vector identifying each level-three observation.
CID
a numeric vector identifying each level-two observation.
Raluca Gui [email protected]
This function estimates the model parameters and associated standard errors for a linear regression model with one endogenous regressor. Identification is achieved through heteroscedastic covariance restrictions within the triangular system as proposed in Lewbel (2012).
hetErrorsIV(formula, data, verbose = TRUE)
formula |
A symbolic description of the model to be fitted. See the "Details" section for the exact notation. |
data |
A data.frame containing the data of all parts specified in the formula parameter. |
verbose |
Show details about the running of the function. |
The method proposed in Lewbel (2012) identifies structural parameters in regression models with endogenous regressors by means of variables that are uncorrelated with the product of heteroskedastic errors. The instruments are constructed as simple functions of the model's data. The method can be applied when no external instruments are available or to supplement external instruments to improve the efficiency of the IV estimator. Consider the model:
$$Y_t = \beta_0 + \beta_1 P_t + \beta_2 X_t + \epsilon_t$$
where $t$ indexes either time or cross-sectional units. The endogeneity problem arises from the correlation of $P_t$ and $\epsilon_t$. As such: $P_t = Z_t + \nu_t$, where $Z_t$ is a subset of variables in $X_t$. The errors, $\epsilon$ and $\nu$, may be correlated with each other.
Structural parameters are identified by an ordinary two-stage least squares regression of $Y$ on $X$ and $P$, using $X$ and $(Z - \bar{Z})\nu$ as instruments. A vital assumption for identification is that $cov(Z, \nu^2) \neq 0$. The strength of the instrument is proportional to the covariance of $(Z - \bar{Z})\nu$ with $\nu$, which corresponds to the degree of heteroskedasticity of $\nu$ with respect to $Z$ (Lewbel 2012).
The assumption that the covariance between $Z$ and the squared error is different from zero can be empirically tested (this is checked in the background when calling the hetErrorsIV function). If it is zero or close to zero, the instrument is weak, producing imprecise estimates with large standard errors.
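The construction of the internal instrument can be sketched by hand in base R. This is a simplified illustration, not the package's code: the data-generating process is invented, Z is taken to be the single exogenous regressor X1, and two-stage least squares is done with two lm() calls, so the second-stage standard errors are not valid.

```r
set.seed(99)
n   <- 5000
X1  <- rnorm(n)
u   <- rnorm(n)                           # common shock that makes P endogenous
nu  <- u + exp(0.5 * X1) * rnorm(n)       # error heteroskedastic in X1
eps <- u + rnorm(n)
P   <- 1 + X1 + nu
y   <- 2 + 1.5 * X1 - 1 * P + eps

# Step 1: residuals from regressing the endogenous regressor on Z (here X1).
nu.hat <- resid(lm(P ~ X1))

# Step 2: internal instrument (Z - mean(Z)) * nu.hat.
iiv <- (X1 - mean(X1)) * nu.hat

# Step 3: two-stage least squares by hand, using X1 and iiv as instruments.
P.hat <- fitted(lm(P ~ X1 + iiv))
tsls  <- lm(y ~ X1 + P.hat)

# Sample analogue of cov(Z, nu^2); it must be clearly non-zero for a usable instrument.
cov.check <- cov(X1, nu.hat^2)
print(coef(tsls))
print(cov.check)
```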
The formula argument follows a four part notation:
A two-sided formula describing the model (e.g. y ~ X1 + X2 + P), a single endogenous regressor (e.g. P), and the exogenous variables from which the internal instrumental variables should be built (e.g. IIV(X1) + IIV(X2)), each part separated by a single vertical bar (|).
The instrumental variables that should be built are specified as (multiple) functions, one for each instrument. This function is IIV and uses the following arguments:
...
The exogenous regressors to build the internal instruments from. If more than one is given, separate instruments are built for each.
Note that no argument to IIV is to be supplied as character but as symbols without quotation marks.
Optionally, additional external instrumental variables to also include in the instrumental variable regression can be specified. These external instruments have to be already present in the data and are provided as the fourth right-hand side part of the formula, again separated by a vertical bar.
See the example section for illustrations on how to specify the formula parameter.
Returns an object of classes rendo.ivreg and ivreg. It extends the object returned from function ivreg of package AER and slightly modifies it by adapting the call and formula components. The summary function prints additional diagnostic information as described in documentation for summary.ivreg.
All generic accessor functions for ivreg such as anova, hatvalues, or vcov are available.
Lewbel, A. (2012). Using Heteroskedasticity to Identify and Estimate Mismeasured and Endogenous Regressor Models, Journal of Business & Economic Statistics, 30(1), 67-80.
Angrist, J. and Pischke, J.S. (2009). Mostly Harmless Econometrics: An Empiricists Companion, Princeton University Press.
data("dataHetIV")

# P is the endogenous regressor in all examples
# X1 generates a weak instrument but for the examples
# this is ignored

# 2 IVs, one from X1, one from X2
het <- hetErrorsIV(y~X1+X2+P|P|IIV(X1)+IIV(X2), data=dataHetIV)

# same as above
het <- hetErrorsIV(y~X1+X2+P|P|IIV(X1,X2), data=dataHetIV)

# use X2 as an external IV
het <- hetErrorsIV(y~X1+P|P|IIV(X1)|X2, data=dataHetIV)

summary(het)
Fits linear models with one endogenous regressor using internal instruments built using the approach described in Lewbel A. (1997). This is a statistical technique to address the endogeneity problem where no external instrumental variables are needed. The implementation allows the incorporation of external instruments if available. An important assumption for identification is that the endogenous variable has a skewed distribution.
higherMomentsIV(formula, data, verbose = TRUE)
formula |
A symbolic description of the model to be fitted. See the "Details" section for the exact notation. |
data |
A data.frame containing the data of all parts specified in the formula parameter. |
verbose |
Show details about the running of the function. |
Consider the model:
$$Y_t = \beta_0 + \beta_1 X_t + \alpha P_t + \epsilon_t$$
$$P_t = Z_t + \nu_t$$
The observed data consist of $Y_t$, $X_t$ and $P_t$, while $Z_t$, $\epsilon_t$, and $\nu_t$ are unobserved. The endogeneity problem arises from the correlation of $P_t$ with the structural error $\epsilon_t$, since $E(\epsilon \nu) \neq 0$. The structural and measurement errors are required to have mean zero, but no restriction is imposed on their distribution.
Let $\bar{S}$ be the sample mean of a variable $S_t$ and $G_t = G(X_t)$ for any given function $G$ that has finite third own and cross moments. Lewbel (1997) proves that the following instruments can be constructed and used with two-stage least squares to obtain consistent estimates:
$$q_{1t} = (G_t - \bar{G})$$
$$q_{2t} = (G_t - \bar{G})(P_t - \bar{P})$$
$$q_{3t} = (G_t - \bar{G})(Y_t - \bar{Y})$$
$$q_{4t} = (Y_t - \bar{Y})(P_t - \bar{P})$$
$$q_{5t} = (P_t - \bar{P})^2$$
$$q_{6t} = (Y_t - \bar{Y})^2$$
The instruments $q_{5t}$ and $q_{6t}$ can be used only when the measurement and the structural errors are symmetrically distributed. Otherwise, the use of the instruments does not require any distributional assumptions for the errors. Given that the regressors $G(X) = X$ are included as instruments, $G(X)$ should not be linear in $X$ in $q_{1t}$.
Let small letters denote deviations from the sample mean: $s_i = S_i - \bar{S}$. Then, using as instruments the variables $q_{1t}$ to $q_{6t}$ together with a constant and $X_t$, two-stage least squares estimation provides consistent estimates for the parameters of the model equation under the assumptions set out in Lewbel (1997).
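The internal instruments are products of demeaned variables and are easy to build by hand. The sketch below constructs all six for a toy data set, using G(X) = X^2. The data, the helper center(), and the pairing of column names with the iiv options g, gp, gy, yp, p2, y2 are assumptions of this illustration rather than package code.

```r
set.seed(5)
n  <- 1000
X1 <- rnorm(n)
P  <- rexp(n)                              # skewed regressor (toy data)
y  <- 2 + 1.5 * X1 - P + rnorm(n)

center <- function(s) s - mean(s)          # deviation from the sample mean

G  <- X1^2                                 # G(X) = X^2, not linear in X
g  <- center(G)
p  <- center(P)
yc <- center(y)

iivs <- data.frame(
  g  = g,                                  # (G - mean(G))
  gp = g * p,                              # (G - mean(G)) * (P - mean(P))
  gy = g * yc,                             # (G - mean(G)) * (Y - mean(Y))
  yp = yc * p,                             # (Y - mean(Y)) * (P - mean(P))
  p2 = p^2,                                # (P - mean(P))^2
  y2 = yc^2                                # (Y - mean(Y))^2
)
head(iivs)
```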
The formula argument follows a four part notation:
A two-sided formula describing the model (e.g. y ~ X1 + X2 + P), a single endogenous regressor (e.g. P), and the exogenous variables from which the internal instrumental variables should be built (e.g. IIV(iiv=y2)), each part separated by a single vertical bar (|).
The instrumental variables that should be built are specified as (multiple) functions, one for each instrument. This function is IIV and uses the following arguments:
iiv
Which internal instrument to build. One of g, gp, gy, yp, p2, y2 can be chosen.
g
Which function g represents in iiv. One of x2, x3, lnx, 1/x can be chosen. Only required if the type of internal instrument demands it.
...
The exogenous regressors to build the internal instrument. If more than one is given, separate instruments are built for each. Only required if the type of internal instrument demands it.
Note that no argument to IIV is to be supplied as character but as symbols without quotation marks.
Optionally, additional external instrumental variables to also include in the instrumental variable regression can be specified. These external instruments have to be already present in the data and are provided as the fourth right-hand side part of the formula, again separated by a vertical bar.
See the example section for illustrations on how to specify the formula parameter.
Returns an object of classes rendo.ivreg and ivreg. It extends the object returned from function ivreg of package AER and slightly modifies it by adapting the call and formula components. The summary function prints additional diagnostic information as described in documentation for summary.ivreg.
All generic accessor functions for ivreg such as anova, hatvalues, or vcov are available.
Lewbel A (1997). “Constructing Instruments for Regressions with Measurement Error When No Additional Data are Available, With an Application to Patents and R&D.” Econometrica, 65(5), 1201–1213.
data("dataHigherMoments")
# P is the endogenous regressor in all examples

# 2 IVs with g*p, g=x^2, separately for each regressor X1 and X2.
hm <- higherMomentsIV(y~X1+X2+P|P|IIV(iiv=gp, g=x2, X1, X2),
                      data = dataHigherMoments)
# same as above
hm <- higherMomentsIV(y~X1+X2+P|P|IIV(iiv=gp, g=x2, X1) + IIV(iiv=gp, g=x2, X2),
                      data = dataHigherMoments)

# 3 different IVs
hm <- higherMomentsIV(y~X1+X2+P|P|IIV(iiv=y2) + IIV(iiv=yp) + IIV(iiv=g,g=x3,X1),
                      data = dataHigherMoments)

# use X2 as external IV
hm <- higherMomentsIV(y~X1+P|P|IIV(iiv=y2)+IIV(iiv=g,g=lnx,X1)| X2,
                      data = dataHigherMoments)
summary(hm)
Fits linear models with one endogenous regressor and no additional explanatory variables using the latent instrumental variable approach presented in Ebbes et al. (2005). This is a statistical technique to address the endogeneity problem where no external instrumental variables are needed. The key assumptions of the model are that the latent variables are discrete with at least two groups with different means and that the structural error is normally distributed.
latentIV(
  formula,
  data,
  start.params = c(),
  optimx.args = list(),
  verbose = TRUE
)
formula |
A symbolic description of the model to be fitted. Of class "formula". |
data |
A data.frame containing the data of all parts specified in the formula parameter. |
start.params |
A named vector containing a set of parameters to use in the first optimization iteration. The names have to correspond exactly to the names of the components specified in the formula parameter. If not provided, a linear model is fitted to derive them. |
optimx.args |
A named list of arguments which are passed to optimx. |
verbose |
Show details about the running of the function. |
Consider the model:
$$Y_t = \beta_0 + \alpha P_t + \epsilon_t$$
$$P_t = \pi' Z_t + \nu_t$$
where $t$ indexes either time or cross-sectional units, $Y_t$ is the dependent variable, $P_t$ is a k x 1 continuous, endogenous regressor, $\epsilon_t$ is a structural error term with mean zero and $E(\epsilon^2)=\sigma_\epsilon^2$, and $\alpha$ and $\beta_0$ are model parameters. $Z_t$ is an l x 1 vector of instruments, and $\nu_t$ is a random error with mean zero and $E(\nu^2)=\sigma_\nu^2$. The endogeneity problem arises from the correlation of $P_t$ and $\epsilon_t$ through $E(\epsilon\nu)=\sigma_{\epsilon\nu}$.
latentIV considers $Z_t$ to be a latent, discrete, exogenous variable with an unknown number of groups, and $\pi$ is a vector of group means. It is assumed that $Z_t$ is independent of the error terms $\epsilon_t$ and $\nu_t$ and that it has at least two groups with different means. The structural and random errors are considered normally distributed with mean zero and variance-covariance matrix $\Sigma$:
$$\Sigma = \begin{pmatrix} \sigma_\epsilon^2 & \sigma_{\epsilon\nu} \\ \sigma_{\epsilon\nu} & \sigma_\nu^2 \end{pmatrix}$$
The identification of the model lies in the assumption of the non-normality of $P_t$, the discreteness of the unobserved instruments and the existence of at least two groups with different means.
The method has been implemented such that the latent variable has two groups. Ebbes et al. (2005) show in a Monte Carlo experiment that even if the true number of categories of the instrument is larger than two, estimates are approximately consistent. Besides, overfitting in terms of the number of groups/categories reduces the degrees of freedom and leads to efficiency loss. For a model with additional explanatory variables a Bayesian approach is needed, since in a frequentist approach identification issues appear.
Identification of the parameters relies on the distributional assumptions of the latent instruments as well as that of the endogenous regressor $P_t$. Specifically, the endogenous regressor should have a non-normal distribution while the unobserved instruments, $Z_t$, should be discrete and have at least two groups with different means (Ebbes, Wedel, and Böckenholt, 2009). A continuous distribution for the instruments leads to an unidentified model, while a normal distribution of the endogenous regressor gives rise to inefficient estimates.
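The setting described in this section can be illustrated by simulating data with a two-group latent instrument and correlated errors, and then observing that ordinary least squares is biased. This is only a sketch in base R: the parameter values, the group means and all object names are hypothetical choices for illustration, and no latentIV() call is made.

```r
set.seed(42)
n <- 10000

# Hypothetical true parameters of the structural equation
beta0 <- 1
alpha <- -0.5

# Latent discrete instrument: two groups with different means
group_means <- c(0, 3)
z <- sample(1:2, n, replace = TRUE, prob = c(0.5, 0.5))

# Correlated normal errors (structural error eps, random error nu)
Sigma <- matrix(c(1, 0.5,
                  0.5, 1), nrow = 2)
errs <- matrix(rnorm(2 * n), n, 2) %*% chol(Sigma)
eps <- errs[, 1]
nu  <- errs[, 2]

# Endogenous regressor and response
P <- group_means[z] + nu
y <- beta0 + alpha * P + eps

# OLS ignores the correlation between P and eps and is therefore biased
ols <- lm(y ~ P)
coef(ols)
```

Because the covariance between the two errors is positive here, the OLS slope is pulled above the hypothetical true value of alpha. The mixture of two group means is also what makes P non-normal, which is the feature the identification argument relies on.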
Additional parameters used during model fitting and printed in summary are:
The instrumental variables are assumed to be divided into two groups. pi1 represents the estimated group mean of the first group.
The estimated group mean of the second group of the instrumental variables.
The probability of being in the first group of the instruments.
The variance, $\sigma_\epsilon^2$.
The covariance, $\sigma_{\epsilon\nu}$.
The variance, $\sigma_\nu^2$.
An object of classes rendo.latent.IV and rendo.base is returned which is a list and contains the following components:
formula |
The formula given to specify the fitted model. |
terms |
The terms object used for model fitting. |
model |
The model.frame used for model fitting. |
coefficients |
A named vector of all coefficients resulting from model fitting. |
names.main.coefs |
a vector specifying which coefficients are from the model. For internal usage. |
start.params |
A named vector with the initial set of parameters used to optimize the log-likelihood function. |
res.optimx |
The result object returned by the function optimx. |
hessian |
A named, symmetric matrix giving an estimate of the Hessian at the found solution. |
m.delta.diag |
A diagonal matrix needed when deriving the vcov to apply the delta method on theta5 which was transformed during the LL optimization. |
fitted.values |
Fitted values at the found optimal solution. |
residuals |
The residuals at the found optimal solution. |
The function summary can be used to obtain and print a summary of the results. The generic accessor functions coefficients, fitted.values, residuals, vcov, confint, logLik, AIC, BIC, case.names, and nobs are available.
Ebbes P., Wedel M., Böckenholt U., Steerneman A. G. M. (2005). “Solving and Testing for Regressor-Error (in)Dependence When no Instrumental Variables are Available: With New Evidence for the Effect of Education on Income.” Quantitative Marketing and Economics, 3, 365–392.
Ebbes P., Wedel M., Böckenholt U. (2009). “Frugal IV Alternatives to Identify the Parameter for an Endogenous Regressor.” Journal of Applied Econometrics, 24(3), 446–468.
summary for how fitted models are summarized
optimx for possible elements of parameter optimx.args
data("dataLatentIV")

# function call without any initial parameter values
l <- latentIV(y ~ P, data = dataLatentIV)
summary(l)

# function call with initial parameter values given by the user
l1 <- latentIV(y ~ P, start.params = c("(Intercept)"=2.5, P=-0.5),
               data = dataLatentIV)
summary(l1)

# use own optimization settings (see optimx())
# set maximum number of iterations to 50'000
l2 <- latentIV(y ~ P, optimx.args = list(itnmax = 50000),
               data = dataLatentIV)

# print detailed tracing information on progress
l3 <- latentIV(y ~ P, optimx.args = list(control = list(trace = 6)),
               data = dataLatentIV)

# use method L-BFGS-B instead of Nelder-Mead and print report every 50 iterations
l4 <- latentIV(y ~ P, optimx.args = list(method = "L-BFGS-B",
                                         control=list(trace = 2, REPORT=50)),
               data = dataLatentIV)

# read out all coefficients, incl auxiliary coefs
lat.all.coefs <- coef(l4)
# same as above
lat.all.coefs <- coef(l4, complete = TRUE)
# only main model coefs
lat.main.coefs <- coef(l4, complete = FALSE)
Estimates multilevel models (max. 3 levels) employing the GMM approach presented in Kim and Frees (2007). One of the important features is that, using the hierarchical structure of the data, no external instrumental variables are needed, unlike traditional instrumental variable techniques. Specifically, the approach controls for endogeneity at higher levels in the data hierarchy. For example, for a three-level model, endogeneity can be handled if present at level two, at level three or at both levels. Level one endogeneity, where the regressors are correlated with the structural errors (errors at level one), is not addressed. Moreover, random slopes, if included, cannot be endogenous. Also, the dependent variable has to have a continuous distribution. The function returns the coefficient estimates obtained with fixed effects, random effects and the GMM estimator proposed by Kim and Frees (2007), such that a comparison across models can be done. Asymptotically, the multilevel GMM estimators share the same properties as the corresponding fixed effects estimators, but they allow the estimation of all the variables in the model, unlike the fixed effects counterpart.
To facilitate the choice of the estimator to be used for the given data, the function also conducts an omitted variable test based on the Hausman test for panel data (Hausman, 1978). It allows comparing a robust estimator and an estimator that is efficient under the null hypothesis of no omitted variables, and comparing two robust estimators at different levels. The results of these tests are returned when calling summary() on a fitted model.
multilevelIV(
  formula,
  data,
  lmer.control = lmerControl(optimizer = "Nelder_Mead", optCtrl = list(maxfun = 1e+05)),
  verbose = TRUE
)
formula |
A symbolic description of the model to be fitted. See the "Details" section for the exact notation. |
data |
A data.frame containing the data of all parts specified in the formula parameter. |
lmer.control |
An output from lmerControl. |
verbose |
Show details about the running of the function. |
Multilevel modeling is a generalization of regression methods that recognizes the existence of data hierarchies by allowing for residual components at each level in the hierarchy. For example, a three-level multilevel model which allows for grouping of students within classrooms, over time, would include time, student and classroom residuals (see equation below). Thus, the residual variance is partitioned into four components: between-classroom (the variance of the classroom-level residuals), within-classroom (the variance of the student-level residuals), between-student (the variance of the student-level residuals) and within-student (the variance of the time-level residuals). The classroom residuals represent the unobserved classroom characteristics that affect students' outcomes. These unobserved variables lead to correlation between outcomes for students from the same classroom. Similarly, the unobserved time residuals lead to correlation between a student's outcomes over time. A three-level model can be described as follows:
$$y_{cst} = Z^1_{cst}\beta^1_{cs} + X^1_{cst}\beta_1 + \epsilon^1_{cst}$$
$$\beta^1_{cs} = Z^2_{cs}\beta^2_{c} + X^2_{cs}\beta_2 + \epsilon^2_{cs}$$
$$\beta^2_{c} = X^3_{c}\beta_3 + \epsilon^3_{c}$$
As in single-level regression, endogeneity is also a concern in multilevel models. The additional problem is that in multilevel models there are multiple independence assumptions involving various random components at different levels. Any moderate correlation between some predictors and a random component or error term can result in a significant bias of the coefficients and of the variance components. The multilevel GMM approach for addressing endogeneity uses both the between and within variations of the exogenous variables, but only the within variation of the variables assumed endogenous. The assumptions of the multilevel generalized method of moments model are that the errors at each level are normally distributed and independent of each other. Moreover, the slope variables are assumed exogenous. Since the model does not handle "level 1 dependencies", an additional assumption is that the level 1 structural error is uncorrelated with any of the regressors. If this assumption is not met, additional, external instruments are necessary.
The coefficients of the explanatory variables appear in the vectors $\beta_1$, $\beta_2$ and $\beta_3$. The term $\beta^1_{cs}$ captures latent, unobserved characteristics that are classroom and student specific while $\beta^2_{c}$ captures latent, unobserved characteristics that are classroom specific. For identification, the disturbance term $\epsilon_{cst}$ is assumed independent of the other variables, $Z^1_{cst}$ and $X^1_{cst}$.
When all model variables are assumed exogenous, the GMM estimator is the usual GLS estimator, denoted as REF. When all variables (except the variables used as slope) are assumed endogenous, the fixed-effects estimator is used, FE. While REF assumes all explanatory variables are uncorrelated with the random intercepts and slopes in the model, FE allows for endogeneity of all effects but sweeps out the random components as well as the explanatory variables at the same levels.
The more general GMM estimator proposed by Kim and Frees (2007) allows some of the explanatory variables to be endogenous and uses this information to build instrumental variables. The multilevel GMM estimator uses both the between and within variations of the exogenous variables, but only the within variation of the variables assumed endogenous. When all variables are assumed exogenous, the GMM estimator equals REF. When all covariates are assumed endogenous, GMM equals FE.
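The distinction between "between" and "within" variation can be made concrete with a small base R sketch. It only illustrates the decomposition of a variable into its group mean and the deviation from that mean; it is not the package's construction of the GMM instruments, and the data and object names are hypothetical.

```r
# Hypothetical variable observed for two groups (e.g. two students over time)
x     <- c(1, 2, 3, 10, 12, 14)
group <- c("a", "a", "a", "b", "b", "b")

# Between variation: the group mean, repeated for each observation
x_between <- ave(x, group)          # ave() uses mean() by default

# Within variation: deviation of each observation from its group mean
x_within <- x - x_between

x_between
x_within

# The within part sums to zero in every group, so it carries no
# information about anything that is constant within a group
tapply(x_within, group, sum)
```

Since the within part is orthogonal to every group-constant quantity, using only the within variation of a regressor removes its correlation with unobserved group-level effects, which is the intuition behind treating endogenous variables this way.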
The formula argument follows a two part notation: In the first part, the model is specified while in the second part, the endogenous regressors are indicated. These two parts are separated by a single vertical bar (|).
The first part follows the exact same model specification as required by the lmer function of package lme4 and internally will be used to fit a lmer model. In the second part, one or multiple endogenous regressors are indicated by passing them to the special function endo (e.g. endo(X1, X2)). Note that arguments to endo() must be supplied as symbols without quotation marks, not as character strings.
See the example section for illustrations on how to specify the formula parameter.
multilevelIV returns an object of class "rendo.multilevel".
The generic accessor functions coef, fitted, residuals, vcov, confint, and nobs are available. Note that an additional argument model with possible values "REF", "FE_L2", "FE_L3", "GMM_L2", or "GMM_L3" is available for summary, fitted, residuals, confint, and vcov to extract the features for the specified model.
Note that the obtained coefficients are rounded with round(x, digits=getOption("digits")).
An object of class rendo.multilevel is returned that is a list and contains the following components:
formula |
the formula given to specify the model to be fitted. |
num.levels |
the number of levels detected from the model. |
dt.model.data |
a data.table of model data including data for slopes and level group ids |
coefficients |
a matrix of rounded coefficients, one column per model. |
coefficients.se |
a matrix of coefficients' SE, one column per model. |
l.fitted |
a named list which contains the fitted values per model sorted as the input data |
l.residuals |
a named list which contains the residuals per model sorted as the input data |
l.vcov |
a list of variance-covariance matrices, named per model. |
V |
the variance–covariance matrix V of the disturbance term. |
W |
the weight matrix W, such that W=V^(-1/2) per highest level group. |
l.ovt |
a list of results of the Hausman OVT, named per model. |
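The component W is described above as $W = V^{-1/2}$. One way to obtain an inverse square root of a symmetric positive definite matrix is via its eigendecomposition, sketched below in base R. The function name inv_sqrt and the example matrix are hypothetical, and the package may compute W differently (for instance block-wise per highest level group, or with another factorization).

```r
# Symmetric inverse square root of a symmetric positive definite matrix
inv_sqrt <- function(V) {
  e <- eigen(V, symmetric = TRUE)
  # V = Q diag(lambda) Q'  implies  V^(-1/2) = Q diag(1/sqrt(lambda)) Q'
  e$vectors %*% diag(1 / sqrt(e$values), nrow = length(e$values)) %*% t(e$vectors)
}

# Hypothetical 2 x 2 variance-covariance matrix
V <- matrix(c(2.0, 0.5,
              0.5, 1.0), nrow = 2)
W <- inv_sqrt(V)

# Pre- and post-multiplying V by W gives (numerically) the identity
W %*% V %*% W
```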
Hausman J (1978). “Specification Tests in Econometrics.” Econometrica, 46(6), 1251–1271.
Kim, Jee-Seon and Frees, Edward W. (2007). "Multilevel Modeling with Correlated Effects". Psychometrika, 72(4), 505-533.
lmer for more details on how to specify the formula parameter
lmerControl for more details on how to provide the lmer.control parameter
summary for how fitted models are summarized
data("dataMultilevelIV")

# Two levels
res.ml.L2 <- multilevelIV(y ~ X11 + X12 + X13 + X14 + X15 + X21 + X22 + X23 + X24 +
                          X31 + X32 + X33 + (1|SID) | endo(X15),
                          data = dataMultilevelIV, verbose = FALSE)

# Three levels
res.ml.L3 <- multilevelIV(y ~ X11 + X12 + X13 + X14 + X15 + X21 + X22 + X23 + X24 +
                          X31 + X32 + X33 + (1| CID) + (1|SID) | endo(X15),
                          data = dataMultilevelIV, verbose = FALSE)

# L2 with multiple endogenous regressors
res.ml.L2 <- multilevelIV(y ~ X11 + X12 + X13 + X14 + X15 + X21 + X22 + X23 + X24 +
                          X31 + X32 + X33 + (1|SID) | endo(X15, X21, X22),
                          data = dataMultilevelIV, verbose = FALSE)

# same as above
res.ml.L2 <- multilevelIV(y ~ X11 + X12 + X13 + X14 + X15 + X21 + X22 + X23 + X24 +
                          X31 + X32 + X33 + (1|SID) | endo(X15, X21) + endo(X22),
                          data = dataMultilevelIV, verbose = FALSE)

# Fit above model with different settings for lmer()
lmer.control <- lme4::lmerControl(optimizer="nloptwrap",
                                  optCtrl=list(algorithm="NLOPT_LN_COBYLA",
                                               xtol_rel=1e-6))
res.ml.L2.cob <- multilevelIV(y ~ X11 + X12 + X13 + X14 + X15 + X21 + X22 + X23 + X24 +
                              X31 + X32 + X33 + (1|SID) | endo(X15, X21) + endo(X22),
                              data = dataMultilevelIV, verbose = FALSE,
                              lmer.control = lmer.control) # use different controls for lmer

# specify argument "model" in the S3 methods to obtain results for the respective model
# default is "REF" for all methods
summary(res.ml.L3)
# same as above
summary(res.ml.L3, model = "REF")

# complete pval table for L3 fixed effects
L3.FE.p <- coef(summary(res.ml.L3, model = "FE_L3"))

# variance covariance matrix
L2.FE.var  <- vcov(res.ml.L2, model = "FE_L2")
L2.GMM.var <- vcov(res.ml.L2, model = "GMM_L2")

# residuals
L3.REF.resid <- resid(res.ml.L3, model = "REF")
Predicted values based on linear models with endogenous regressors estimated using the Gaussian copula.
## S3 method for class 'rendo.copula.correction'
predict(object, newdata, ...)
object |
Object of class inheriting from "rendo.copula.correction" |
newdata |
An optional data frame in which to look for variables with which to predict. If omitted, the fitted values are returned. |
... |
ignored, for consistency with the generic function. |
predict.rendo.copula.correction produces a vector of predictions.
The model fitting function copulaCorrection
## Not run: 
data("dataCopCont")

c1 <- copulaCorrection(y~X1+X2+P|continuous(P), num.boots=10,
                       data=dataCopCont)

# returns the fitted values
predict(c1)

# using the data used for fitting also for predicting,
# correctly results in fitted values
all.equal(predict(c1, dataCopCont), fitted(c1)) # TRUE

## End(Not run)
Predicted values based on models fitted by instrumental variable regression with instruments generated from the data.
## S3 method for class 'rendo.ivreg'
predict(object, newdata, ...)
object |
Object of class inheriting from "rendo.ivreg" |
newdata |
An optional data frame without any instrumental variables in which to look for variables with which to predict. If omitted, the fitted values are returned. |
... |
ignored, for consistency with the generic function. |
predict.rendo.ivreg produces a vector of predictions.
The model fitting functions hetErrorsIV, higherMomentsIV.
data("dataHetIV")

het <- hetErrorsIV(y~X1+X2+P|P|IIV(X1, X2), data = dataHetIV)

# returns the fitted values
predict(het)

# using the data used for fitting also for predicting,
# correctly results in fitted values
all.equal(predict(het, dataHetIV), fitted(het)) # TRUE
Predicted values based on linear models estimated using the latent instrumental variables approach for a single endogenous regressor.
## S3 method for class 'rendo.latent.IV'
predict(object, newdata, ...)
object |
Object of class inheriting from "rendo.latent.IV" |
newdata |
An optional data frame in which to look for variables with which to predict. If omitted, the fitted values are returned. |
... |
ignored, for consistency with the generic function. |
predict.rendo.latent.IV produces a vector of predictions.
The model fitting function latentIV
data("dataLatentIV")

lat <- latentIV(y ~ P, data = dataLatentIV)

# returns the fitted values
predict(lat)

# using the data used for fitting also for predicting,
# correctly results in fitted values
all.equal(predict(lat, dataLatentIV), fitted(lat)) # TRUE
Predicted values based on multilevel models employing the GMM approach for hierarchical data with endogenous regressors.
## S3 method for class 'rendo.multilevel'
predict(
  object,
  newdata,
  model = c("REF", "FE_L2", "FE_L3", "GMM_L2", "GMM_L3"),
  ...
)
object |
Object of class inheriting from "rendo.multilevel" |
newdata |
An optional data frame in which to look for variables with which to predict. If omitted, the fitted values for the specified model are returned. |
model |
character string to indicate for which fitted model predictions are made. Possible values are: "REF", "FE_L2", "FE_L3", "GMM_L2", or "GMM_L3". |
... |
ignored, for consistency with the generic function. |
predict.rendo.multilevel produces a vector of predictions.
The model fitting function multilevelIV
data("dataMultilevelIV")

# Two levels
res.ml.L2 <- multilevelIV(y ~ X11 + X12 + X13 + X14 + X15 + X21 + X22 + X23 + X24 +
                          X31 + X32 + X33 + (1|SID) | endo(X15),
                          data = dataMultilevelIV, verbose = FALSE)

predict(res.ml.L2, model = "FE_L2")

# using the data used for fitting also for predicting,
# correctly results in fitted values
all.equal(predict(res.ml.L2, dataMultilevelIV, model = "GMM_L2"),
          fitted(res.ml.L2, model = "GMM_L2")) # TRUE
Fits linear models with endogenous regressors using latent instrumental variable approaches.
The methods included in the package are Lewbel's (1997) <doi:10.2307/2171884> higher moments approach as well as Lewbel's (2012) <doi:10.1080/07350015.2012.643126> heteroskedasticity approach, Park and Gupta's (2012) <doi:10.1287/mksc.1120.0718> joint estimation method that uses a Gaussian copula, and Kim and Frees's (2007) <doi:10.1007/s11336-007-9008-1> multilevel generalized method of moments approach that deals with endogeneity in a multilevel setting. These are statistical techniques to address the endogeneity problem where no external instrumental variables are needed.
The main functions to estimate models are:
latentIV()
the latent instrumental variables method of Ebbes et al. (2005)
copulaCorrection()
copula correction method proposed by Park and Gupta (2012)
hetErrorsIV()
heteroskedastic errors approach proposed by Lewbel (2012)
higherMomentsIV()
higher moments method proposed by Lewbel (1997)
multilevelIV()
multilevel GMM method proposed by Kim and Frees (2007)
Differences between current (2.0.0) and previous version of REndo
Note that with version 2.0.0 sweeping changes were introduced which greatly improve functionality but break backwards compatibility. Various bugs were fixed, performance was improved, the handling of S3 objects and methods across the package was harmonized, and a set of argument checks was added. Starting with REndo 2.0, all functions support the use of transformations such as I(x^2) or log(x) in the formulas. Moreover, the call of most of the functions (except latentIV() and multilevelIV()) changed from the previous versions, making use of the Formula package.
Check the NEWS file or our github page for the latest updates and for reporting issues.
See our publication in the Journal of Statistical Software for more details: doi:10.18637/jss.v107.i03.
Maintainer: Raluca Gui
Authors:
Markus Meierer
Rene Algesheimer
Patrik Schilter
Gui R, Meierer M, Schilter P, Algesheimer R (2023). “REndo: Internal Instrumental Variables to Address Endogeneity.” Journal of Statistical Software, 107 (3), 1-43. doi:10.18637/jss.v107.i03
summary method for a model of class rendo.copula.correction resulting from fitting copulaCorrection.
## S3 method for class 'rendo.copula.correction'
summary(object, ...)
object |
an object of class "rendo.copula.correction". |
... |
ignored, for consistency with the generic function. |
For a single continuous endogenous regressor, the estimation is carried out in two steps: first the empirical distribution of the endogenous regressor is obtained, and then the likelihood function is built. For all other cases the estimation is also carried out in two steps, and hence the standard errors reported by the fitted OLS model are not correct.
The standard errors and the confidence intervals are therefore obtained using bootstrapping with replacement as described in Efron (1979). The reported lower and upper boundaries are from the 95% bootstrapped percentile confidence interval. If there are too few bootstrapped estimates, no boundaries are reported.
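The idea of bootstrapped standard errors and percentile intervals can be sketched in base R as follows. The example resamples rows with replacement, refits a plain linear model each time, and takes the standard deviation and the empirical quantiles of the resulting estimates. The helper percentile_ci, the simulated data and the use of lm() are illustrative assumptions; in particular, the quantile type the package uses internally is not stated in this documentation.

```r
# Two-sided percentile interval from a vector of bootstrapped estimates
percentile_ci <- function(boot_estimates, level = 0.95) {
  a <- 1 - level
  quantile(boot_estimates, probs = c(a / 2, 1 - a / 2), names = FALSE)
}

# Hypothetical data and model
set.seed(7)
d   <- data.frame(x = rnorm(200))
d$y <- 1 + 2 * d$x + rnorm(200)

# Bootstrap with replacement: resample rows, refit, keep the slope
boot_slopes <- replicate(500, {
  idx <- sample(nrow(d), replace = TRUE)
  coef(lm(y ~ x, data = d[idx, ]))[["x"]]
})

se_boot <- sd(boot_slopes)                 # bootstrapped standard error
ci_boot <- percentile_ci(boot_slopes)      # 95% percentile interval
se_boot
ci_boot
```

With very few bootstrap replications the extreme quantiles are poorly determined, which is consistent with the note above that no boundaries are reported when there are too few bootstrapped estimates.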
For a single continuous endogenous regressor the model was fitted using maximum likelihood optimization. The related goodness of fit measures and convergence indicators are also reported here.
The function computes and returns a list of summary statistics which contains the following components:
coefficients |
a |
num.boots |
the number of bootstraps performed. |
names.main.coefs |
a vector specifying which coefficients are from the model. For internal usage. |
start.params |
a named vector with the initial set of parameters used to optimize the log-likelihood function. |
vcov |
variance covariance matrix derived from the bootstrapped parameters. |
names.vars.continuous |
the names of the continuous endogenous regressors. |
names.vars.discrete |
the names of the discrete endogenous regressors. |
For the case of a single continuous endogenous regressor, also the following components resulting from the log-likelihood optimization are returned:
AIC |
Akaike's An Information Criterion for the model fitted on the provided data. |
BIC |
Schwarz's Bayesian Criterion for the model fitted on the provided data. |
KKT1 |
first Kuhn, Karush, Tucker optimality condition as returned by optimx. |
KKT2 |
second Kuhn, Karush, Tucker optimality condition as returned by optimx. |
conv.code |
the convergence code as returned by optimx. |
log.likelihood |
the value of the log-likelihood function at the found solution for the provided data. |
Efron, B. (1979). "Bootstrap Methods: Another Look at the Jackknife", The Annals of Statistics, 7(1), 1-26.
The model fitting function copulaCorrection
confint for how the confidence intervals are derived
vcov for how the variance-covariance matrix is derived
optimx for explanations about the returned conv.code and KKT.
Function coef will extract the coefficients matrix and function vcov will extract the component vcov from the returned summary object.
summary method for a model of class rendo.latent.IV resulting from fitting latentIV.
## S3 method for class 'rendo.latent.IV'
summary(object, ...)
object |
an object of class "rendo.latent.IV". |
... |
ignored, for consistency with the generic function. |
The function summary.rendo.latent.IV computes and returns a list of summary statistics which contains the following components:
coefficients |
a |
start.params |
a named vector with the initial set of parameters used to optimize the log-likelihood function. |
names.main.coefs |
a vector specifying which coefficients are from the model. For internal usage. |
vcov |
variance covariance matrix derived from the hessian. |
AIC |
Akaike's An Information Criterion for the model fitted on the provided data. |
BIC |
Schwarz's Bayesian Criterion for the model fitted on the provided data. |
KKT1 |
first Kuhn, Karush, Tucker optimality condition as returned by optimx. |
KKT2 |
second Kuhn, Karush, Tucker optimality condition as returned by optimx. |
conv.code |
the convergence code as returned by optimx. |
log.likelihood |
the value of the log-likelihood function at the found solution for the provided data. |
The model fitting function latentIV
Function coef will extract the coefficients matrix and function vcov will extract the component vcov from the returned summary object.
summary method for class "rendo.multilevel".
## S3 method for class 'rendo.multilevel'
summary(object, model = c("REF", "FE_L2", "FE_L3", "GMM_L2", "GMM_L3"), ...)
object |
an object of class "rendo.multilevel", usually, a result of a call to |
model |
character string to indicate which fitted model should be summarized.
Possible values are: |
... |
ignored, for consistency with the generic function. |
The multilevelIV() function estimates three types of models: the usual random effects model (REF), the fixed effects model (FE), and the hierarchical GMM model (GMM) proposed by Kim and Frees (2007). The fixed effects and GMM estimators are calculated at each level. In a three-level model, the function therefore estimates, besides the random effects model, fixed effects models at level two (FE_L2) and at level three (FE_L3), as well as GMM estimators at level two (GMM_L2) and at level three (GMM_L3).
To facilitate the choice of estimator, the summary() function also returns an omitted variable test (OVT). This test is based on the Hausman test for panel data. The OVT allows the comparison of a robust estimator and an estimator which is efficient under the null hypothesis of no omitted variables. Moreover, it allows the comparison of two robust estimators at different levels.
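For orientation, the classical Hausman statistic on which such a test builds can be written as below. This is a sketch of the textbook form only; the symbols are introduced here for illustration, and the generalized test of Kim and Frees (2007) as computed by the package may differ in its details.

```latex
% \hat{\beta}_R : robust estimator (consistent with and without omitted variables)
% \hat{\beta}_E : estimator that is efficient under the null of no omitted variables
H = (\hat{\beta}_R - \hat{\beta}_E)^{\top}
    \left[ \widehat{\mathrm{Var}}(\hat{\beta}_R) - \widehat{\mathrm{Var}}(\hat{\beta}_E) \right]^{-1}
    (\hat{\beta}_R - \hat{\beta}_E)
```

Under the null hypothesis, H is asymptotically chi-squared distributed with degrees of freedom equal to the number of coefficients being compared, so a large value of H speaks against the efficient estimator.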
For the model specified in argument model, the summary() function returns the summary statistics of the estimated coefficients, together with the results of the omitted variable test between the specified model and each other model.
For the model specified in argument model, the function summary.rendo.multilevel computes and returns a list of summary statistics and the results of the omitted variable tests for the fitted multilevel object given in object.
An object of class summary.rendo.multilevel is returned. It is a list that contains the component call of argument object, plus:
summary.model |
the model parameter with which the summary function was called. |
coefficients |
a |
OVT.table |
results of the Hausman omitted variable test for the specified model compared to all other models. |
vcov |
variance-covariance matrix derived from the GMM fit of this model. |
The model fitting function multilevelIV
Function coef will extract the coefficients matrix and function vcov will extract the component vcov.
data("dataMultilevelIV")

# Fit two levels model
res.ml.L2 <- multilevelIV(y ~ X11 + X12 + X13 + X14 + X15 + X21 + X22 + X23 +
                            X24 + X31 + X32 + X33 + (1|SID) | endo(X15),
                          data = dataMultilevelIV, verbose = FALSE)

# Get summary for FE_L2 (does not print)
res.sum <- summary(res.ml.L2, model = "FE_L2")

# extract table with coefficients summary statistics
sum.stat.FE_L2 <- coef(res.sum)

# extract vcov of model FE_L2
FE_L2.vcov <- vcov(res.sum)
# same as above
FE_L2.vcov <- vcov(res.ml.L2, model = "FE_L2")
The variance-covariance matrix is derived from the bootstrapped parameter estimates stored in the object. It is based on Efron (1979) and calculates the result as follows:
where B is the number of bootstraps and θ̅ is the mean of the bootstrapped coefficients.
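The displayed formula is not reproduced in the text above. Assuming the usual sample covariance over the bootstrap draws, with the i-th bootstrapped coefficient vector written as θ_i (a symbol introduced here) and a divisor of B - 1 (an assumption; a divisor of B is also common), it would read:

```latex
\widehat{\mathrm{Var}}(\hat{\theta})
  = \frac{1}{B-1} \sum_{i=1}^{B} (\theta_i - \bar{\theta})(\theta_i - \bar{\theta})^{\top},
\qquad
\bar{\theta} = \frac{1}{B} \sum_{i=1}^{B} \theta_i
```

Each term of the sum is the outer product of one centered bootstrap draw with itself, so the result is a square, symmetric matrix with one row and one column per parameter.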
## S3 method for class 'rendo.boots'
vcov(object, ...)
object |
a fitted model object with bootstrapped parameters. Typically from |
... |
ignored, for consistency with the generic function. |
A matrix of the estimated covariances between the parameter estimates of the model. The row and column names correspond to the parameter names given by the coef method.
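The shape of the returned matrix can be illustrated in base R without fitting a model. This is a minimal sketch on simulated draws: the matrix `boots`, its parameter names and the divisor B - 1 are assumptions made for illustration and do not reflect how the package stores its bootstrap results internally.

```r
set.seed(1)
B <- 200  # number of bootstrap replications (assumed)

# Hypothetical bootstrap draws: one row per replication, one column per parameter
boots <- cbind("(Intercept)" = rnorm(B, mean = 2,  sd = 0.10),
               "P"           = rnorm(B, mean = -1, sd = 0.05))

theta.bar <- colMeans(boots)             # mean of the bootstrapped coefficients
centered  <- sweep(boots, 2, theta.bar)  # subtract the mean from every draw
V <- crossprod(centered) / (B - 1)       # sum of outer products, divided by B - 1

# Row and column names are the parameter names
dimnames(V)

# Agrees with the sample covariance of the draws
all.equal(V, cov(boots))
```

With the B - 1 divisor, the hand-computed matrix coincides with what `cov()` returns for the matrix of draws, which gives a quick sanity check for any bootstrapped covariance.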
Efron, B. (1979). "Bootstrap Methods: Another Look at the Jackknife", The Annals of Statistics, 7(1), 1-26.