% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/lm2.R
\name{lm2}
\alias{lm2}
\title{Enhanced alternative to lm()}
\arguments{
\item{se_type}{The type of standard error to use. Default is \code{"HC3"}.
Without clusters: \code{"HC0"}, \code{"HC1"}, \code{"HC2"}, or \code{"HC3"}.
When \code{clusters} is specified, \code{se_type} is automatically set to \code{"CR2"}.}

\item{notes}{Logical. If TRUE (default), print explanatory notes below the table
when the result is printed.}

\item{clusters}{An optional variable indicating clusters for cluster-robust standard 
errors. When specified, \code{se_type} is automatically set to \code{"CR2"} 
(bias-reduced cluster-robust estimator). Passed to \code{\link[estimatr]{lm_robust}}.}

\item{fixed_effects}{An optional right-sided formula containing the fixed effects 
to be projected out (absorbed) before estimation. Useful for models with many 
fixed effect groups (e.g., \code{~ firm_id} or \code{~ firm_id + year}). 
Passed to \code{\link[estimatr]{lm_robust}}.}

\item{...}{Additional arguments passed to \code{\link[estimatr]{lm_robust}}.}
}
\value{
An object of class \code{c("lm2", "lm_robust", "lm")}. This inherits 
  from \code{\link[estimatr]{lm_robust}} and can be used with packages like 
  \code{marginaleffects}. The object contains all components of an \code{lm_robust} 
  object plus additional attributes:
  \describe{
    \item{statuser_table}{A data frame with columns: \code{term}, \code{estimate}, 
      \code{SE.robust}, \code{SE.classical}, \code{t}, \code{df}, \code{p.value}, 
      \code{B} (standardized coefficient), and optionally \code{SE.cluster} when 
      clustered standard errors are used.}
    \item{classical_fit}{The underlying \code{lm} object with classical standard errors.}
    \item{na_counts}{Integer vector of missing value counts per variable.}
    \item{n_missing}{Total number of observations excluded due to missing values.}
    \item{has_clusters}{Logical indicating whether clustered standard errors were used.}
  }
  When printed, displays a formatted regression table with robust and classical 
  standard errors, effect sizes, and diagnostic red flags.
}
\description{
Runs a linear regression with better defaults (robust SE), and richer & better 
formatted output than \code{lm}. For robust and clustered errors it relies on \code{\link[estimatr]{lm_robust}}. 
The output reports classical and robust errors, number of missing observations per
variable, an effect size column (standardized regression coefficient), and a red.flag column per variable 
flagging the need to conduct specific diagnostics. It relies by default on HC3 for standard errors;
\code{lm_robust} relies on HC2 (and Stata's 'reg y x, robust' on HC1), which can have
inflated false-positive rates in smaller samples (Long & Ervin, 2000).
}
\details{
Robust standard errors and clustered standard errors are computed using 
\code{\link[estimatr]{lm_robust}}; see the documentation of that function for details (using by default CR2 errors)
The output shows both standard errors and when clustering errors it reports all three.
The red.flag column is based on the difference between robust and classical standard errors.

The \code{red.flag} column provides diagnostic warnings:
\itemize{
  \item \code{!}, \code{!!}, \code{!!!}: Robust and classical standard errors differ by 
    more than 25\%, 50\%, or 100\%, respectively. Large differences may suggest model 
    misspecification or outliers (but they may also be benign). When encountering a red flag,
    authors should plot the distributions to look for outliers or skewed data, and use \code{\link{scatter.gam}}
    to look for possible nonlinearities in the relevant variables.
    King & Roberts (2015) propose a higher cutoff, at 100\%, and a bootstrapped significance test; 
    \code{statuser} does not follow either recommendation. The former seems too liberal, the 
    latter too time consuming to include in every regression, plus the focus here is on individual variables rather than joint tests.
  \item \code{X}: For interaction terms, the component variables are correlated (|r| > 0.3 or p < .05),
    which means the interaction term is likely to be biased. See Simonsohn (2024) "Interacting with curves"
    \doi{10.1177/25152459231207787}.
}
}
\examples{
# Basic usage with data argument
lm2(mpg ~ wt + hp, data = mtcars)

# Without data argument (variables from environment)
y <- mtcars$mpg
x1 <- mtcars$wt
x2 <- mtcars$hp
lm2(y ~ x1 + x2)

# RED FLAG EXAMPLES

# Example 1: red flag catches a nonlinearity
# True model is quadratic: y = x^2
set.seed(123)
x <- runif(200, -3, 3)
y <- x^2 + rnorm(200, sd = 2)

# lm2() shows red flag due to misspecification
lm2(y ~ x)

# Follow up with scatter.gam() to diagnose it
scatter.gam(x, y)

# Example 2: red flag catches an outlier in y
# True model is y = x, but one observation has a very large y value
set.seed(123)
x <- sort(rnorm(200))
y <- round(x + rnorm(200, sd = 2), 1)
y[200] <- 100  # Outlier

# lm2() flags x
lm2(y ~ x)

# Look at distribution of y to spot the outlier
plot_freq(y)

# Example 3: red flag catches an outlier in one predictor
# True model is y = x1 + x2, but x2 has an extreme value
set.seed(123)
x1 <- round(rnorm(200),.1)
x2 <- round(rnorm(200),.1)
y <- x1 + x2 + rnorm(200, sd = 0.5)
x2[200] <- 50  # Outlier in x2

# lm2() flags x2 (but not x1)
lm2(y ~ x1 + x2)

# Look at distribution of x2 to spot the outlier
plot_freq(x2)

# CLUSTERED STANDARD ERRORS
# When observations are grouped (e.g., students within schools),
# use clusters to account for within-group correlation
set.seed(123)
n_clusters <- 20
n_per_cluster <- 15
cluster_id <- rep(1:n_clusters, each = n_per_cluster)
cluster_effect <- rnorm(n_clusters, sd = 2)[cluster_id]
x <- rnorm(n_clusters * n_per_cluster)
y <- 1 + 0.5 * x + cluster_effect + rnorm(n_clusters * n_per_cluster)
mydata <- data.frame(y = y, x = x, cluster_id = cluster_id)

# Clustered SE (CR2) - note the SE.cluster column in output
lm2(y ~ x, data = mydata, clusters = cluster_id)

# FIXED EFFECTS
# Use fixed_effects to absorb group-level variation (e.g., firm or year effects)
# This is useful for panel data or when you have many fixed effect levels
set.seed(456)
n_firms <- 30
n_years <- 5
firm_id <- rep(1:n_firms, each = n_years)
year <- rep(2018:2022, times = n_firms)
firm_effect <- rnorm(n_firms, sd = 3)[firm_id]
x <- rnorm(n_firms * n_years)
y <- 2 + 0.8 * x + firm_effect + rnorm(n_firms * n_years)
panel <- data.frame(y = y, x = x, firm_id = factor(firm_id), year = factor(year))

# Absorb firm fixed effects (coefficient on x is estimated, firm dummies are not shown)
lm2(y ~ x, data = panel, fixed_effects = ~ firm_id)

# Two-way fixed effects (firm and year)
lm2(y ~ x, data = panel, fixed_effects = ~ firm_id + year)

}
\references{
King, G., & Roberts, M. E. (2015). How robust standard errors expose methodological 
problems they do not fix, and what to do about it. \emph{Political Analysis}, 23(2), 159-179.

Long, J. S., & Ervin, L. H. (2000). Using heteroscedasticity consistent standard errors 
in the linear regression model. \emph{The American Statistician}, 54(3), 217-224.

Simonsohn, U. (2024). Interacting with curves: How to validly test and probe 
interactions in the real (nonlinear) world. \emph{Advances in Methods and Practices in 
Psychological Science}, 7(1), 1-22. \doi{10.1177/25152459231207787}
}
\seealso{
\code{\link[estimatr]{lm_robust}}, \code{\link{scatter.gam}}
}
