datadiff 0.5.0

Release consolidating the 0.4.x maintenance series (wide-table performance, lazy / Arrow / Parquet support, accurate HTML reports and IEEE 754 tolerance handling). No user-facing behaviour changes since 0.4.9; see the entries below for the details.

datadiff 0.4.9

Bug fixes

The HTML report of an all-pass comparison again shows the per-check evaluated row counts (the TBL / EVAL / UNITS / PASS columns). The all-pass fast path built a report agent that {pointblank} did not consider interrogated, so get_agent_report() rendered a bare “no interrogation performed” plan with those columns blank, on both the in-memory and the lazy paths. The synthetic report agent is now marked as interrogated in constant time (without re-running the per-column interrogation) and its validation set is built by replicating a single template step, so the report is correct and build_report_agent() stays fast on wide tables (~0.04 s for 1448 columns). coverage / summary were already correct (issue #6).

datadiff 0.4.8

Performance

compare_datasets_from_yaml() is much faster on wide tables, including all-pass (green) comparisons - the nominal case of non-regression suites. Measured on 300 numeric columns x 200 000 identical rows: the in-memory (data.frame) path drops from ~13.8 s to ~5.2 s, and the lazy (DuckDB) path from ~64.5 s to ~9.8 s. The verdict (all_passed) and the extractable failing cells are unchanged. Three levers (issue #4):
- Duplicate-key check on local data.frames now uses anyDuplicated() / duplicated() (a single hashed pass) instead of dplyr::count() / group_by(); lazy tables keep the SQL-native count. Same warning message.
- Tolerance arithmetic on the local path computes only the <col>__ok booleans, with a fast path that skips the special-value handling when a column has no NA/NaN/Inf.
- Lazy path builds the <col>__ok / <col>__eq booleans in a single templated SQL SELECT instead of one dplyr::mutate() expression per column, eliminating the dbplyr query-construction cost that dominated wide lazy comparisons (verified equivalent on DuckDB and SQLite).

datadiff 0.4.7

Bug fixes

The lazy HTML report now keeps the genuine failure detail. Printing result$reponse (or calling datadiff_report_html()) on a failing comparison again shows, for each failing column, the number of failing rows, the offending cells, and the downloadable CSV extract - while still listing every check performed, with the col_exists existence checks and the col_vals_equal value checks presented as distinct steps. The report is now built on top of the real interrogated agent (which holds the extracts) rather than a counts-only synthesis, so nothing is lost.
build_report_agent() threads warn_at / stop_at so the report’s warn/stop colouring follows the supplied thresholds instead of hard-coded values.

Documentation

Vignette and README document coverage / summary and the lazy report, and are normalised to ASCII.
DESCRIPTION quotes software names (‘pointblank’, ‘YAML’) per CRAN convention, clearing the “Possibly misspelled words” NOTE.
Internal helpers are marked @noRd (documentation kept in source, no .Rd).

datadiff 0.4.6

New features

compare_datasets_from_yaml() now returns coverage and summary. The coverage data.frame is a faithful, instantly computed record of every check performed (one row per column with its check type, number of rows, number of failures and PASS/FAIL status); summary holds the aggregate counts. This keeps the verified checks visible even when the all-pass fast path skips the per-column {pointblank} agent, so an all-green run no longer looks empty.
Printing result$reponse now lazily renders a full {pointblank}-style report. result$reponse stays a real interrogated agent (so pointblank::all_passed() and pointblank::get_data_extracts() keep working), but printing it builds the per-column report on demand from the pre-computed counts (no re-interrogation) and memoizes the result, so the rendering cost is paid only when the report is actually displayed.
New exported datadiff_report_html() writes that {pointblank}-style report to a standalone HTML file.

datadiff 0.4.5

Performance

compare_datasets_from_yaml() is dramatically faster on wide tables. The verdict is computed in a single vectorised pass over the precomputed comparison booleans; when everything passes, the expensive per-column {pointblank} agent (one validation step per column, roughly quadratic in the number of columns) is skipped entirely. On a failing comparison, validation steps are built only for the columns that actually fail. all_passed and the extracted failing cells are unchanged.
add_tolerance_columns(): the per-column tolerance computation was rewritten to avoid quadratic data.frame growth (the result columns are bound in a single operation instead of one assignment per column), with no change to the verdict.

datadiff 0.4.4

CRAN submission preparation

Replace \(x) lambda syntax with function(x) throughout to maintain compatibility with the declared R (>= 4.0.0) dependency (\() requires R >= 4.1.0).
All test files: replace hardcoded YAML file paths with tempfile() + on.exit(unlink()) to comply with the CRAN policy prohibiting tests from writing outside tempdir().
compare_datasets_from_yaml(): change default language from "fr" / "fr_FR" to "en" / "en_US" for international use.
compare_datasets_from_yaml(): the error_msg_no_key parameter is now used in the message emitted when row counts differ without keys (was previously a dead parameter).
write_rules_template(): fix @return documentation (invisible(path), not invisible(NULL)); fix default label prefix from "comparaison" to "comparison"; fix @param documentation for warn_at and stop_at (actual default is 1e-14, not 0.01).
%||% is no longer exported to avoid namespace conflicts with rlang.
DESCRIPTION: add URL and BugReports fields; add Config/testthat/edition: 3.

datadiff 0.4.3

Bug fixes

Fix inconsistent tolerance comparison due to IEEE 754 floating-point rounding. When cand = ref + threshold mathematically, the subtraction cand - ref can exceed the threshold by a few ULPs (e.g. 100.01 - 100.00 = 0.0100000000000051 > 0.01 in double precision), causing a false validation failure. Fixed by adding a correction of 8 * .Machine$double.eps * |ref| to the threshold before comparing - a value proportional to the operand magnitude that absorbs floating-point representation error without meaningfully widening the user-specified tolerance. The correction is applied in both the local data.frame path and the lazy SQL path.

Documentation

README: add a full by_name example showing column-level rule overrides (per-column absolute tolerance, mixed abs+rel, case-insensitive character comparison) and a table explaining how by_name takes precedence over by_type.

datadiff 0.4.2

Bug fixes

Fix crash (sign(cand_vals): non-numeric argument to a mathematical function) when a column is numeric in the reference dataset but stored as character in the candidate. The tolerance arithmetic (sign(), abs(), subtraction) was applied unconditionally to all numeric reference columns, regardless of the candidate column type.

The fix detects type mismatches between reference and candidate at the start of compare_datasets_from_yaml() and:
- Emits a warning listing each mismatched column with its reference and candidate types (e.g. 'year' (reference: numeric, candidate: character)).
- Excludes mismatched columns from tolerance and equality comparison to avoid the crash and prevent silent pass-through due to R’s type coercion in ==.
- Adds a dedicated failing validation step (labelled type_mismatch: <column>) to the pointblank report for each mismatched column, so the mismatch is clearly surfaced in the validation output.
setup_pointblank_agent() gains a type_mismatch_cols parameter (character vector) to receive the list of columns with incompatible types between reference and candidate.

datadiff 0.4.1

Bug fixes

Fix out-of-memory crash when comparing large Parquet files (e.g. two 4 GB files with 125 numeric columns × 4 M rows). Root causes and fixes:
- Arrow virtual-scanner memory is untracked by DuckDB. arrow::to_duckdb() registers an Arrow-format virtual table whose read buffers are allocated outside DuckDB’s memory manager. Combined Arrow + DuckDB memory could exceed physical RAM before DuckDB’s spilling threshold was reached, causing an Out of Memory Error: Allocation failure. Fixed by materialising file-backed Parquet datasets with DuckDB’s native read_parquet() instead (CREATE TEMP TABLE … AS SELECT * FROM read_parquet([…])), so all memory is tracked and disk-spilling works end-to-end. Non-file-backed or non-Parquet Arrow inputs continue to use arrow::to_duckdb() as a fallback.
- DuckDB’s default memory cap (80 % of total RAM) left no headroom for R, Arrow, and OS memory. Fixed by adding an explicit SET memory_limit after opening the DuckDB connection, defaulting to "8GB". This forces aggressive spilling well before system RAM is exhausted.
- is_tbl_mssql(logical(0)) crash in pointblank. After materialisation via dplyr::compute(), pointblank’s reporting code inspected the connection metadata of the lazy table and received character(0), which propagated to if(logical(0)). Fixed by collect()-ing the slim boolean table (~125 logical columns) after compute() so that pointblank receives a plain data.frame - no live DuckDB connection required during interrogation.

New parameters

compare_datasets_from_yaml() gains duckdb_memory_limit (default "8GB"): the SET memory_limit value passed to the user-managed DuckDB connection when Arrow datasets are used. Raise it on machines with ample free RAM to reduce disk I/O; lower it on memory-constrained machines. Has no effect for data.frame or tbl_lazy inputs.

datadiff 0.4.0

New features

Add Arrow/Parquet support. compare_datasets_from_yaml() and write_rules_template() now accept Arrow Tables (arrow::read_parquet(as_data_frame=FALSE)) and Arrow Datasets (arrow::open_dataset()) in addition to data.frames and lazy tables. Key design decisions:
- Arrow objects are converted to DuckDB-backed lazy tables via arrow::to_duckdb() before pointblank validation, keeping the entire pipeline lazy. No full collect() is performed - even very large Parquet files that don’t fit in RAM can be validated.
- Tolerance columns (__ok), equality columns (__eq), and text preprocessing are computed via dplyr::mutate() on the Arrow side before the DuckDB handoff.
- Text trimming on Arrow uses stringr::str_trim() (Arrow does not support trimws()).
- Positional comparison (no key) still requires collect() since neither Arrow nor DuckDB guarantee row order.
- arrow, duckdb, and stringr are added as Suggests.

datadiff 0.3.0

New features

Add lazy table support (tbl_lazy / dbplyr). compare_datasets_from_yaml() and write_rules_template() now accept SQL-backed lazy tables in addition to local data.frames. Key changes:
- Type detection uses a 0-row collect(head(x, 0L)) schema (no data fetched).
- Duplicate key detection uses SQL-native GROUP BY + COUNT.
- Tolerance columns (__absdiff, __thresh, __ok) and equality columns (__eq) are computed via dplyr::mutate() + case_when() for SQL translation.
- Text preprocessing (trimws(), tolower()) is translated to SQL TRIM() and LOWER() by dbplyr.
- Positional comparison (no key) falls back to collect() with a user-facing message, since SQL does not guarantee row order.
- DBI, dbplyr, and RSQLite are added as Suggests.
Add requireNamespace("dbplyr") guard when lazy tables are passed, with a clear error message if the package is missing.

datadiff 0.2.1

Bug fixes

Change default value of numeric_rel in write_rules_template() from 0.000000001 to 0. Relative tolerance is now disabled by default, which avoids unintended failures when comparing values near zero.

datadiff 0.2.0

New features

Validation now fails when columns present in the reference dataset are missing from the candidate dataset. Previously, missing columns were silently ignored. A dedicated failing step (labelled col_exists: <column>) is added to the pointblank report for each absent column.
setup_pointblank_agent() gains a missing_in_candidate parameter (character vector) to receive the list of columns absent from the candidate.

Bug fixes

Change default value of check_count_default in write_rules_template() from FALSE to TRUE. Row-count validation is now enabled by default in generated YAML templates.
Fix assignment of row_count_ok when the joined comparison dataframe is empty (zero rows), which previously caused an error.

datadiff 0.1.6

New features

Add memory optimization parameters to limit output size for large datasets. New parameters in compare_datasets_from_yaml():
- extract_failed: Control whether to collect failed rows (default: TRUE). Set to FALSE for lightweight validation without row extraction.
- get_first_n: Limit to first n failed rows per validation step.
- sample_n: Randomly sample n failed rows per validation step.
- sample_frac: Sample a fraction (0-1) of failed rows.
- sample_limit: Maximum rows when using sample_frac (default: 5000).
Add multilingual support for pointblank reports. New lang and locale parameters in compare_datasets_from_yaml() and setup_pointblank_agent(). French remains the default for backwards compatibility. Supported languages include: en, fr, de, es, pt, it, zh, ja, ru.

datadiff 0.1.5

New features

Make YAML rules file optional with automatic default rules (#7602523). The path parameter in compare_datasets_from_yaml() is now optional. When NULL, default rules are auto-generated based on the reference dataset structure.
Make key parameter optional in write_rules_template(). NULL means positional comparison (row by row).

Bug fixes

Fix critical edge cases for Inf/NaN handling (#79b9fc6):
- Inf == Inf (same sign) now correctly returns TRUE
- NaN == NaN respects the na_equal setting
- Avoid NaN results from Inf - Inf calculations
Add input validation for compare_datasets_from_yaml():
- Validate that data_reference and data_candidate are data.frames
- Validate that path exists and is a single string
Add duplicate keys detection with detailed warning message showing examples of duplicate key values.
Add reserved suffix conflict detection with warning when column names contain the reference suffix (default: __reference).

datadiff 0.1.0

Initial release with core functionality:
- YAML-based validation rules configuration
- Tolerance-based numeric comparisons (absolute and relative)
- Text normalization (case-insensitive, trim whitespace)
- Row count validation
- Column existence checks
- Integration with pointblank for validation reporting