get_agent_report()
rendered a bare “no interrogation performed” plan with those columns
blank, on both the in-memory and the lazy paths. The synthetic report
agent is now marked as interrogated in constant time (without re-running
the per-column interrogation) and its validation set is built by
replicating a single template step, so the report is correct and
build_report_agent() stays fast on wide tables (~0.04 s for
1448 columns). coverage / summary were already
correct (issue #6).compare_datasets_from_yaml() is much faster on
wide tables, including all-pass (green) comparisons -
the nominal case of non-regression suites. Measured on 300 numeric
columns x 200 000 identical rows: the in-memory (data.frame) path drops
from ~13.8 s to ~5.2 s, and the lazy (DuckDB) path from ~64.5 s to ~9.8
s. The verdict (all_passed) and the extractable failing
cells are unchanged. Three levers (issue #4):
anyDuplicated() / duplicated() (a single
hashed pass) instead of dplyr::count() /
group_by(); lazy tables keep the SQL-native count. Same
warning message.<col>__ok booleans, with a fast path that
skips the special-value handling when a column has no
NA/NaN/Inf.<col>__ok /
<col>__eq booleans in a single templated SQL
SELECT instead of one dplyr::mutate()
expression per column, eliminating the dbplyr query-construction cost
that dominated wide lazy comparisons (verified equivalent on DuckDB and
SQLite).The lazy HTML report now keeps the genuine failure detail.
Printing result$reponse (or calling
datadiff_report_html()) on a failing comparison again
shows, for each failing column, the number of failing rows, the
offending cells, and the downloadable CSV extract - while still listing
every check performed, with the col_exists existence checks
and the col_vals_equal value checks presented as distinct
steps. The report is now built on top of the real interrogated agent
(which holds the extracts) rather than a counts-only synthesis, so
nothing is lost.
build_report_agent() threads warn_at /
stop_at so the report’s warn/stop colouring follows the
supplied thresholds instead of hard-coded values.
coverage /
summary and the lazy report, and are normalised to
ASCII.@noRd (documentation kept
in source, no .Rd).compare_datasets_from_yaml() now returns
coverage and summary. The
coverage data.frame is a faithful, instantly computed
record of every check performed (one row per column with its check type,
number of rows, number of failures and PASS/FAIL status);
summary holds the aggregate counts. This keeps the verified
checks visible even when the all-pass fast path skips the per-column
{pointblank} agent, so an all-green run no longer looks empty.
Printing result$reponse now lazily renders a full
{pointblank}-style report. result$reponse stays a real
interrogated agent (so pointblank::all_passed() and
pointblank::get_data_extracts() keep working), but printing
it builds the per-column report on demand from the pre-computed counts
(no re-interrogation) and memoizes the result, so the rendering cost is
paid only when the report is actually displayed.
New exported datadiff_report_html() writes that
{pointblank}-style report to a standalone HTML file.
compare_datasets_from_yaml() is dramatically faster
on wide tables. The verdict is computed in a single vectorised pass over
the precomputed comparison booleans; when everything passes, the
expensive per-column {pointblank} agent (one validation step per column,
roughly quadratic in the number of columns) is skipped entirely. On a
failing comparison, validation steps are built only for the columns that
actually fail. all_passed and the extracted failing cells
are unchanged.
add_tolerance_columns(): the per-column tolerance
computation was rewritten to avoid quadratic data.frame growth (the
result columns are bound in a single operation instead of one assignment
per column), with no change to the verdict.
Replace \(x) lambda syntax with
function(x) throughout to maintain compatibility with the
declared R (>= 4.0.0) dependency (\()
requires R >= 4.1.0).
All test files: replace hardcoded YAML file paths with
tempfile() + on.exit(unlink()) to comply with
the CRAN policy prohibiting tests from writing outside
tempdir().
compare_datasets_from_yaml(): change default
language from "fr" / "fr_FR" to
"en" / "en_US" for international use.
compare_datasets_from_yaml(): the
error_msg_no_key parameter is now used in the message
emitted when row counts differ without keys (was previously a dead
parameter).
write_rules_template(): fix @return
documentation (invisible(path), not
invisible(NULL)); fix default label prefix from
"comparaison" to "comparison"; fix
@param documentation for warn_at and
stop_at (actual default is 1e-14, not
0.01).
%||% is no longer exported to avoid namespace
conflicts with rlang.
DESCRIPTION: add URL and
BugReports fields; add
Config/testthat/edition: 3.
cand = ref + threshold mathematically, the
subtraction cand - ref can exceed the threshold by a few
ULPs (e.g. 100.01 - 100.00 = 0.0100000000000051 > 0.01
in double precision), causing a false validation failure. Fixed by
adding a correction of 8 * .Machine$double.eps * |ref| to
the threshold before comparing - a value proportional to the operand
magnitude that absorbs floating-point representation error without
meaningfully widening the user-specified tolerance. The correction is
applied in both the local data.frame path and the lazy SQL path.by_name example showing column-level
rule overrides (per-column absolute tolerance, mixed abs+rel,
case-insensitive character comparison) and a table explaining how
by_name takes precedence over by_type.Fix crash
(sign(cand_vals): non-numeric argument to a mathematical function)
when a column is numeric in the reference dataset but stored as
character in the candidate. The tolerance arithmetic
(sign(), abs(), subtraction) was applied
unconditionally to all numeric reference columns, regardless of the
candidate column type.
The fix detects type mismatches between reference and candidate at
the start of compare_datasets_from_yaml() and:
'year' (reference: numeric, candidate: character)).==.type_mismatch: <column>) to the pointblank report for
each mismatched column, so the mismatch is clearly surfaced in the
validation output.setup_pointblank_agent() gains a
type_mismatch_cols parameter (character vector) to receive
the list of columns with incompatible types between reference and
candidate.
Fix out-of-memory crash when comparing large Parquet files (e.g. two 4 GB files with 125 numeric columns × 4 M rows). Root causes and fixes:
Arrow virtual-scanner memory is untracked by
DuckDB. arrow::to_duckdb() registers an
Arrow-format virtual table whose read buffers are allocated
outside DuckDB’s memory manager. Combined Arrow + DuckDB memory
could exceed physical RAM before DuckDB’s spilling threshold was
reached, causing an
Out of Memory Error: Allocation failure. Fixed by
materialising file-backed Parquet datasets with DuckDB’s native
read_parquet() instead
(CREATE TEMP TABLE … AS SELECT * FROM read_parquet([…])),
so all memory is tracked and disk-spilling works end-to-end.
Non-file-backed or non-Parquet Arrow inputs continue to use
arrow::to_duckdb() as a fallback.
DuckDB’s default memory cap (80 % of total RAM) left no
headroom for R, Arrow, and OS memory. Fixed by adding an
explicit SET memory_limit after opening the DuckDB
connection, defaulting to "8GB". This forces aggressive
spilling well before system RAM is exhausted.
is_tbl_mssql(logical(0)) crash in
pointblank. After materialisation via
dplyr::compute(), pointblank’s reporting code inspected the
connection metadata of the lazy table and received
character(0), which propagated to
if(logical(0)). Fixed by collect()-ing the
slim boolean table (~125 logical columns) after compute()
so that pointblank receives a plain data.frame - no live
DuckDB connection required during interrogation.
compare_datasets_from_yaml() gains
duckdb_memory_limit (default "8GB"): the
SET memory_limit value passed to the user-managed DuckDB
connection when Arrow datasets are used. Raise it on machines with ample
free RAM to reduce disk I/O; lower it on memory-constrained machines.
Has no effect for data.frame or tbl_lazy
inputs.compare_datasets_from_yaml()
and write_rules_template() now accept Arrow Tables
(arrow::read_parquet(as_data_frame=FALSE)) and Arrow
Datasets (arrow::open_dataset()) in addition to data.frames
and lazy tables. Key design decisions:
arrow::to_duckdb() before pointblank validation, keeping
the entire pipeline lazy. No full collect() is performed -
even very large Parquet files that don’t fit in RAM can be
validated.__ok), equality columns
(__eq), and text preprocessing are computed via
dplyr::mutate() on the Arrow side before the DuckDB
handoff.stringr::str_trim() (Arrow
does not support trimws()).collect()
since neither Arrow nor DuckDB guarantee row order.Suggests.tbl_lazy / dbplyr).
compare_datasets_from_yaml() and
write_rules_template() now accept SQL-backed lazy tables in
addition to local data.frames. Key changes:
collect(head(x, 0L)) schema
(no data fetched).GROUP BY +
COUNT.__absdiff, __thresh,
__ok) and equality columns (__eq) are computed
via dplyr::mutate() + case_when() for SQL
translation.trimws(), tolower())
is translated to SQL TRIM() and LOWER() by
dbplyr.collect()
with a user-facing message, since SQL does not guarantee row order.Suggests.requireNamespace("dbplyr") guard when lazy tables
are passed, with a clear error message if the package is missing.numeric_rel in
write_rules_template() from 0.000000001 to
0. Relative tolerance is now disabled by default, which
avoids unintended failures when comparing values near zero.Validation now fails when columns present in the reference
dataset are missing from the candidate dataset. Previously, missing
columns were silently ignored. A dedicated failing step (labelled
col_exists: <column>) is added to the pointblank
report for each absent column.
setup_pointblank_agent() gains a
missing_in_candidate parameter (character vector) to
receive the list of columns absent from the candidate.
Change default value of check_count_default in
write_rules_template() from FALSE to
TRUE. Row-count validation is now enabled by default in
generated YAML templates.
Fix assignment of row_count_ok when the joined
comparison dataframe is empty (zero rows), which previously caused an
error.
compare_datasets_from_yaml():
extract_failed: Control whether to collect failed rows
(default: TRUE). Set to FALSE for lightweight validation without row
extraction.get_first_n: Limit to first n failed rows per
validation step.sample_n: Randomly sample n failed rows per validation
step.sample_frac: Sample a fraction (0-1) of failed
rows.sample_limit: Maximum rows when using sample_frac
(default: 5000).lang and locale parameters in
compare_datasets_from_yaml() and
setup_pointblank_agent(). French remains the default for
backwards compatibility. Supported languages include: en, fr, de, es,
pt, it, zh, ja, ru.Make YAML rules file optional with automatic default rules
(#7602523). The path parameter in
compare_datasets_from_yaml() is now optional. When NULL,
default rules are auto-generated based on the reference dataset
structure.
Make key parameter optional in
write_rules_template(). NULL means positional comparison
(row by row).
Fix critical edge cases for Inf/NaN handling (#79b9fc6):
Inf == Inf (same sign) now correctly returns
TRUE
NaN == NaN respects the na_equal
setting
Avoid NaN results from Inf - Inf
calculations
Add input validation for
compare_datasets_from_yaml():
data_reference and
data_candidate are data.framespath exists and is a single stringAdd duplicate keys detection with detailed warning message showing examples of duplicate key values.
Add reserved suffix conflict detection with warning when column
names contain the reference suffix (default:
__reference).