Evidence-based Bayesian disaggregation

What this package does

Given an observed aggregate price index \(\mathrm{cpi}_t\) and a matrix of known sectoral aggregation weights \(W_{t,k}\), the value-added (VAB) shares, the goal is to recover the \(K\) latent sectoral price indices \(\varphi_{t,k}\) that compose the aggregate. The sectoral indices then feed a downstream nested Ornstein–Uhlenbeck model (bayesianOU) as the market price \(\varphi\).

The disaggregation is genuinely Bayesian: the aggregate enters as evidence (an observation density), and the sectoral indices come out as a posterior with credible intervals, not as a single deterministic re-weighting.

Why “evidence-based”: the F1–F6 history

The 0.1.x family advertised “MCMC-free Bayesian disaggregation”, but it had six defects. The aggregate CPI never entered the computation, so the “posterior” was derived from the prior weight matrix alone (F1). The Dirichlet concentration cancelled on renormalization (F2), and the temporal pattern cancelled too (F3). An “efficiency” term was a fixed constant (F4). There were no recovery tests (F5). A correlation helper opportunistically picked whichever of Pearson/Spearman was larger (F6). The foundational defect, not using the data, cannot be patched within a deterministic re-weighting; the fix is a model that conditions on the aggregate. The deterministic family has been removed; two honest Bayesian engines replace it.

The model (state-space, “Model A”)

Latent state in logs, with a random walk plus drift and partial pooling:

\[ \log \varphi_{t,k} = \log \varphi_{t-1,k} + \delta_k + \tau_k\,\eta_{t,k}, \qquad \eta_{t,k}\sim\mathcal N(0,1), \]

with \(\delta_k \sim \mathcal N(\delta_\mu,\delta_\sigma)\) and \(\log\tau_k \sim \mathcal N(\mu_{\log\tau}, \sigma_{\log\tau})\) (the drift and the innovation scale are pooled across sectors).

The cross-section at \(t=1\) is anchored at the aggregate level with an estimable dispersion \(\omega_{\text{struct}}\) (the real concentration the old Dirichlet \(\gamma\) failed to be):

\[ \log\varphi_{1,k} = \log(\text{phi1\_center}) + \omega_{\text{struct}}\,z_k . \]

The aggregate is the genuine observation:

\[ \mathrm{cpi}_t \sim \mathrm{Student\text{-}t}\!\left(\nu,\ \textstyle\sum_k W_{t,k}\,\varphi_{t,k},\ \sigma\right) \]

(Gaussian if student_obs = FALSE).
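
To make the generative story concrete, here is a minimal base-R sketch that simulates one data set from Model A. It does not call the package. Every hyperparameter value is an assumption picked for illustration, \(z_k\) is assumed standard normal, the second argument of each \(\mathcal N(\cdot,\cdot)\) is read as a standard deviation, and the gamma-based weights are a made-up stand-in for real VAB shares.

```r
set.seed(1)
T_len <- 30   # number of periods
K     <- 4    # number of sectors

# Hypothetical hyperparameter values (assumptions, not package defaults)
delta_mu      <- 0.02;      delta_sigma   <- 0.01
mu_log_tau    <- log(0.03); sigma_log_tau <- 0.30
omega_struct  <- 0.10;      phi1_center   <- 100
nu            <- 5;         sigma         <- 0.5

# Partial pooling: sector drifts and innovation scales share common priors
delta <- rnorm(K, delta_mu, delta_sigma)
tau   <- exp(rnorm(K, mu_log_tau, sigma_log_tau))

# Log random walk with drift, anchored at t = 1 around log(phi1_center)
log_phi <- matrix(NA_real_, T_len, K)
log_phi[1, ] <- log(phi1_center) + omega_struct * rnorm(K)  # z_k ~ N(0, 1) assumed
for (t in 2:T_len) {
  log_phi[t, ] <- log_phi[t - 1, ] + delta + tau * rnorm(K)
}
phi <- exp(log_phi)

# Made-up aggregation weights; each row sums to one
W_raw <- matrix(rgamma(T_len * K, shape = 5), T_len, K)
W     <- W_raw / rowSums(W_raw)

# Observation: Student-t noise with scale sigma around the weighted aggregate
agg <- rowSums(W * phi)
cpi <- agg + sigma * rt(T_len, df = nu)
```

Only `cpi` and `W` would be visible to the model; `phi` is the latent quantity the disaggregation tries to recover.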

Identification, honestly (rigour by layers)

The aggregate \(\sum_k W\varphi\) is strongly identified by the observation density. The per-sector split is only weakly identified: at each period one linear combination of the \(K\) sectors is pinned by the CPI, and the remaining \(K-1\) directions are governed by the cross-sectional prior plus temporal smoothness. So the per-sector intervals are honestly wide and prior-influenced. This is not a defect to hide — it is the correct uncertainty, and it is precisely why we feed the full posterior draws (not a point estimate) to the OU by multiple imputation: the sectoral uncertainty is propagated, not faked away.
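
The weak identification of the split can be checked numerically in a few lines of base R. The weights and sector levels below are made-up numbers; the point is only that two different sectoral vectors can produce exactly the same aggregate, so the CPI alone cannot tell them apart.

```r
# Made-up weights for one period (they sum to one) and one sectoral vector
w     <- c(0.4, 0.3, 0.2, 0.1)
phi_a <- c(100, 105, 95, 110)

# A direction d with sum(w * d) == 0: moving along it leaves the aggregate unchanged
d <- c(w[2], -w[1], 0, 0)
phi_b <- phi_a + 50 * d

agg_a <- sum(w * phi_a)
agg_b <- sum(w * phi_b)

# Same aggregate, different sectoral split
same_aggregate  <- isTRUE(all.equal(agg_a, agg_b))
different_split <- !isTRUE(all.equal(phi_a, phi_b))
print(c(same_aggregate = same_aggregate, different_split = different_split))
```

With \(K = 4\) there are three linearly independent directions of this kind at each period, which is why the prior and the temporal smoothness do the remaining work.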

Two engines, one trade-off

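Both engines are Bayesian: disaggregate_conjugate() is the closed-form one (Kalman/RTS) and disaggregate_statespace() is the MCMC one. Closed form buys speed and exactness at the cost of a simpler (Gaussian, linear) model; MCMC buys richness at the cost of sampling.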
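
As a rough picture of what "closed form" means here, one step of a generic linear-Gaussian filter with a scalar observation is written out below. This is a textbook sketch under assumptions, not the package's documented specification: the state \(x_t \in \mathbb R^K\) with filtered mean \(m_t\) and covariance \(P_t\), the drift vector \(d\), the innovation covariance \(Q\), and the gain \(g_t\) are all notation introduced for illustration, and \(w_t\) denotes row \(t\) of \(W\).

\[
\begin{aligned}
\text{predict:}\quad & m_{t\mid t-1} = m_{t-1} + d, \qquad P_{t\mid t-1} = P_{t-1} + Q,\\
\text{update:}\quad & S_t = w_t^\top P_{t\mid t-1}\, w_t + \sigma^2, \qquad g_t = P_{t\mid t-1}\, w_t / S_t,\\
& m_t = m_{t\mid t-1} + g_t\,\bigl(\mathrm{cpi}_t - w_t^\top m_{t\mid t-1}\bigr), \qquad P_t = P_{t\mid t-1} - g_t\, S_t\, g_t^\top .
\end{aligned}
\]

Each update is rank one: the observation corrects the state only along \(P_{t\mid t-1} w_t\), which is the filtering counterpart of the single linear combination pinned by the CPI at each period.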

A runnable example (closed-form engine)

library(BayesianDisaggregation)

sim <- simulate_disagg(T = 30, K = 4, seed = 1)   # synthetic CPI + VAB weights
bl  <- disaggregate_conjugate(sim$cpi, sim$W, n_draws = 100, seed = 1)
bl
#> <disagg_conjugate>  closed-form linear-Gaussian baseline (Kalman/RTS)
#>   periods T = 30, sectors K = 4, joint draws = 100
#>   aggregate Gaussian log-likelihood = -64.56

## the smoothed aggregate tracks the CPI tightly (aggregate is well identified)
round(cor(bl$agg_summary[, "median"], sim$cpi), 4)
#> [1] 0.999

## joint posterior draws: the [T, K, draws] contract consumed by the nested OU
dim(bl$phi_draws)
#> [1]  30   4 100

The MCMC engine (sketch, not evaluated here)

fit <- disaggregate_statespace(sim$cpi, sim$W, chains = 4, iter = 2000, warmup = 1000)
fit$diagnostics                 # rhat_max, divergences
dim(fit$phi_draws)              # T x K x draws
str(fit$phi_summary)            # median, q2.5, q97.5 (T x K each)

## couple to the nested OU (uncertainty propagated by Rubin's rule):
## bayesianOU::fit_ou_nested_mi(phi_draws = fit$phi_draws, X = Phi_index, ...)

From Excel directly, reusing the bundled readers:

cpi_file <- system.file("extdata", "CPI.xlsx", package = "BayesianDisaggregation")
w_file   <- system.file("extdata", "WEIGHTS.xlsx", package = "BayesianDisaggregation")
fit <- disaggregate_from_files(cpi_file, w_file, chains = 2, iter = 1000)

Data note: comparing index vs index

The model is about index levels, so the CPI must be a level series (FRED units “Index source base”, aggregation “Average” for annual data — never a rate-of-change), re-indexed to the same base as the production prices it will be compared against (e.g. 1982–1984 = 100 via the project’s convert_to_index). Feeding a percent-change series here is a category error: the aggregate would not be on the same scale as \(\sum_k W\varphi\).
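
A minimal base-R sketch of the re-indexing step follows. The helper `rebase_index()` is hypothetical and written only for this illustration; it is not the project's convert_to_index, whose signature is not shown here. The level series is made up.

```r
# Hypothetical helper: rescale a level series so its mean over base_years is 100
rebase_index <- function(level, years, base_years) {
  in_base <- years %in% base_years
  stopifnot(length(level) == length(years), any(in_base), all(level > 0))
  100 * level / mean(level[in_base])
}

years <- 1980:1986
level <- c(50, 55, 60, 62, 64, 70, 75)   # made-up index levels on some other base

# Level series on the 1982-1984 = 100 base: a valid input for the model
cpi_8284 <- rebase_index(level, years, 1982:1984)

# Percent change of the same series: a different scale, NOT a valid input
pct_change <- 100 * diff(level) / head(level, -1)
```

Rebasing only rescales the series by a constant, so growth rates are unchanged while the level becomes comparable with production prices on the same base.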

Coupling to the nested OU

disaggregate_statespace()$phi_draws (or disaggregate_conjugate(..., n_draws = M)$phi_draws) is a [T, K, M] array, exactly the multiple-imputation input of bayesianOU::fit_ou_nested_mi(). The OU refits once per imputation and combines the analyses by Rubin’s rule, so the disaggregation uncertainty becomes part of the OU posterior.
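
For intuition about the combination step, the classical scalar form of Rubin's rules can be sketched in base R. The helper `pool_rubin()` and the numbers are hypothetical, and how bayesianOU actually pools its per-imputation fits may differ from this textbook version.

```r
# Hypothetical helper: pool M per-imputation estimates of one scalar parameter
pool_rubin <- function(est, se) {
  M <- length(est)
  stopifnot(M >= 2, length(se) == M)
  q_bar <- mean(est)                   # pooled point estimate
  u_bar <- mean(se^2)                  # average within-imputation variance
  b     <- var(est)                    # between-imputation variance
  total <- u_bar + (1 + 1 / M) * b     # total variance under Rubin's rules
  c(estimate = q_bar, se = sqrt(total), within = u_bar, between = b)
}

# Made-up estimates and standard errors from M = 5 imputations
est <- c(0.48, 0.52, 0.55, 0.45, 0.50)
se  <- c(0.10, 0.11, 0.10, 0.12, 0.10)

pooled <- pool_rubin(est, se)
```

The pooled standard error exceeds the average within-imputation one whenever the estimates differ across imputations; that excess is the disaggregation uncertainty being carried forward.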