---
title: "Reading bibliometric data into bibnets"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Reading bibliometric data into bibnets}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(bibnets)
```

## 1. Introduction and the standard schema

`bibnets` reads bibliographic data from two kinds of source. The first is the
standard database exports — Scopus, Web of Science, OpenAlex, Lens.org,
Dimensions, Crossref, BibTeX, and RIS — which it recognises and parses
automatically; give it a single file, several files, or a whole folder and it
works out each format on its own. The second is **any custom table of your
own**: a CSV or data frame that is not a known export, where you simply name
the columns that hold the authors, references, or keywords. Either way the
result is the same structure, the **bibnets format**, which every network
builder (`author_network()`, `keyword_network()`, `reference_network()`,
`document_network()`, `source_network()`, `country_network()`,
`institution_network()`, `conetwork()`) works from. The bibnets format is a
data frame with one row per paper: most columns hold a single value (title,
year, journal), while the fields that can have many values per paper
(authors, references, keywords) are list-columns holding a character vector
in each row.

In full, the bibnets format has these columns:

| Column | Type | Meaning |
|---|---|---|
| `id` | chr | Document identifier (EID, OpenAlex W-ID, DOI, etc.) |
| `title` | chr | Document title |
| `year` | int | Publication year |
| `journal` | chr | Source / journal / venue name |
| `doi` | chr | DOI without the `https://doi.org/` prefix |
| `cited_by_count` | int | Citations received (as reported by source) |
| `abstract` | chr | Abstract text; `NA` for sources that do not expose it |
| `type` | chr | Document type (article, review, book-chapter, ...) |
| `authors` | list | Character vector of author names per row |
| `references` | list | Character vector of cited references per row |
| `keywords` | list | Character vector of keywords per row |

Some sources add extra columns (such as `index_keywords`, `keywords_plus`,
`affiliations`, or `countries`); these are kept after the standard ones.

This vignette documents the `read_biblio()` entry point and each reader, the
generic-CSV path, network construction directly from custom columns and
separators, the `split_field()` helper, and the manual construction of a
compatible data frame.

## 2. Custom data and separators

### Custom CSV — map columns by name

For CSV files that do not match any of the recognised signatures
(in-house exports, custom dumps, public datasets), map each source column
onto a standard field **by name**. The identifier column is named via
`id`; each multi-valued field is named via its own argument — `authors`,
`keywords`, `references`, `countries`, `affiliations` — and `journal` for
the scalar source/venue. `sep` is the delimiter applied inside those cells.
Naming any of these columns implies `format = "generic"`, so you do not
need to pass `format` yourself.

Hypothetical call:

```{r generic-call, eval = FALSE}
data <- read_biblio(
  "my_data.csv",
  id       = "doc_id",
  authors  = "Authors",
  keywords = "Keywords",
  sep      = ";"
)
```

The same mapping is demonstrated here on the bundled OpenAlex CSV, which
uses `|` as the delimiter. The source columns have long dotted names;
mapping them by argument yields the standard `authors` and `keywords`
list-columns:

```{r generic-demo}
f <- system.file("extdata", "openalex_works.csv", package = "bibnets")
generic <- read_biblio(
  f,
  id       = "id",
  authors  = "authorships.author.display_name",
  keywords = "primary_topic.display_name",
  sep      = "|"
)
generic$authors[[1]]
generic$keywords[[1]]
```

Each mapped column is split on `sep` and stored under its standard name as a
list-column; the original source column is left in place. For any further
columns that have no dedicated argument, `list_cols` splits them in place
(keeping their original names).

### Custom columns and separators (no reader needed)

Often a dataset is already a plain data frame or CSV with its own column
names and its own delimiter — you do not need to coerce it into the
standard schema first. Every network builder accepts a column argument
named after the entity it builds (`authors`, `keywords`, `references`,
`journal`, `countries`, `affiliations`) plus a `sep` for splitting a
delimited character column. The builder splits, normalises, and builds in
one call.

```{r custom-columns}
papers <- data.frame(
  id            = 1:4,
  `Author Names`= c("Smith J, Doe A, Lee K", "Smith J, Lee K",
                    "Doe A, Lee K", "Smith J, Doe A"),
  Tags          = c("ml, ai", "ml, nlp", "ai, nlp", "ml, ai"),
  check.names   = FALSE,
  stringsAsFactors = FALSE
)

# Point the builder at the column and give it the delimiter — no renaming.
author_network(papers, authors = "Author Names", sep = ",")
keyword_network(papers, keywords = "Tags", sep = ",")
```

### The document identifier

The works dimension (the rows of the `works x entities` matrix) is the `id`
column. You do not have to supply one: `id = NULL` (the default) uses an
existing `id` column when present and otherwise numbers the rows, treating
each row as one document. The `papers` example above relies on the first
case, since its identifier column is already named `id`. To use a
differently-named identifier column, name it with the `id` argument:

```{r custom-id}
papers2 <- data.frame(
  paper_id = c("P1", "P2", "P3"),
  authors  = c("Alice, Bob", "Alice, Carol", "Bob, Carol"),
  stringsAsFactors = FALSE
)
author_network(papers2, authors = "authors", sep = ",", id = "paper_id")
```

Two entities are linked when they share the same `id`, so the identifier
controls what counts as "the same document" during projection.
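
The role of the identifier in projection can be sketched in base R. This is an illustration of the idea only, not the code `bibnets` runs: it rebuilds `papers2` from the chunk above, forms a works x authors incidence matrix, and projects it with `crossprod()`. The object names (`long`, `incidence`, `co`) are made up for this example.

```r
# Toy data, repeated from the example above so the sketch is self-contained.
papers2 <- data.frame(
  paper_id = c("P1", "P2", "P3"),
  authors  = c("Alice, Bob", "Alice, Carol", "Bob, Carol"),
  stringsAsFactors = FALSE
)

# One row per (document, author) pair; labels are trimmed and uppercased.
parts <- strsplit(papers2$authors, ",", fixed = TRUE)
long <- data.frame(
  id     = rep(papers2$paper_id, lengths(parts)),
  author = toupper(trimws(unlist(parts))),
  stringsAsFactors = FALSE
)

# Works x authors incidence matrix: rows are ids, columns are authors.
incidence <- unclass(table(long$id, long$author))

# Author x author counts of shared documents; the diagonal is zeroed.
co <- crossprod(incidence)
diag(co) <- 0
co
```

Changing which column supplies `id` changes the rows of `incidence`, and therefore which authors end up sharing a document.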

`sep` is any literal delimiter, so BibTeX-style `" and "` or pipe-delimited
exports work too:

```{r custom-sep}
bib <- data.frame(
  id      = 1:3,
  creators = c("Alice and Bob", "Alice and Carol", "Bob and Carol"),
  stringsAsFactors = FALSE
)
author_network(bib, authors = "creators", sep = " and ")
```

### A separate separator for references

In a coupling network the *entity* column and the *references* column can
use different delimiters. Reference strings frequently contain internal
commas (`"Smith J, 2020, Journal"`), so `references` is split on `";"` by
default, independent of `sep`. Override it with `references_sep` when your
references use another delimiter:

```{r references-sep}
d <- data.frame(
  id         = c("P1", "P2", "P3"),
  auth       = c("Alice, Bob", "Alice, Carol", "Bob, Carol"),
  references = c("R1, R2", "R1, R3", "R2, R3"),
  stringsAsFactors = FALSE
)
author_network(d, "coupling", authors = "auth", sep = ",",
               references_sep = ",")
```

### Quoted values

Values exported with surrounding quotes (`"Alice"`, or the CSV doubled
form `""Alice""`) are cleaned automatically — `strip_quotes = TRUE` is the
default, so a quoted label and its bare form collapse to the same node.
Internal apostrophes (e.g. `O'Brien`) are left untouched. Set
`strip_quotes = FALSE` to keep the quotes as part of the label.

```{r strip-quotes}
q <- data.frame(
  id      = 1:3,
  authors = c('"Alice"; "Bob"', '"Alice"; "Carol"', '"Bob"; "Carol"'),
  stringsAsFactors = FALSE
)
author_network(q)                       # quotes stripped -> ALICE, BOB, CAROL
```
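
The cleaning described above can be approximated with a single regular expression. The helper name `strip_outer_quotes()` is hypothetical and this is only a sketch of the behaviour, not the package's internal code: it removes runs of double quotes at the start and end of a label and leaves apostrophes alone.

```r
# Hypothetical helper: drop any run of double quotes at either end of a label.
strip_outer_quotes <- function(x) gsub('^"+|"+$', "", trimws(x))

labels <- c('"Alice"', '""Alice""', "Alice", "O'Brien")

# Quoted and bare forms collapse to one label; the apostrophe is untouched.
cleaned <- strip_outer_quotes(labels)
unique(toupper(cleaned))
```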

### A safety net for the wrong delimiter

If you pass a `sep` that does not actually split the column — for example
the data is pipe-delimited but you left `sep = ";"` — and the values
contain a structural delimiter (`";"`, `"|"`, or a tab), the builder warns
you instead of silently treating each whole cell as one entity:

```{r wrong-sep, warning = TRUE}
bad <- data.frame(
  id      = 1:3,
  authors = c("Smith J| Doe A", "Smith J| Lee K", "Doe A| Lee K"),
  stringsAsFactors = FALSE
)
invisible(author_network(bad))          # warns: values contain "|"
```

The check is deliberately quiet for commas and `" and "`, which appear
inside perfectly valid single labels (`"Last, First"` names,
one-reference-per-row citation strings, organisations like
`"Smith and Sons"`).
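
A check of this kind could look roughly like the sketch below. `check_sep()` is a hypothetical helper written for this vignette, not the function the builders call, and the exact conditions under which `bibnets` warns may differ. It warns only when `sep` splits nothing and another structural delimiter is present.

```r
# Hypothetical helper: warn when `sep` never splits but the values contain
# another structural delimiter (";", "|", or a tab).
check_sep <- function(x, sep = ";") {
  other   <- setdiff(c(";", "|", "\t"), sep)
  n_parts <- lengths(strsplit(x, sep, fixed = TRUE))
  if (all(n_parts <= 1L)) {
    present <- vapply(
      other,
      function(d) any(grepl(d, x, fixed = TRUE), na.rm = TRUE),
      logical(1)
    )
    found <- other[present]
    if (length(found) > 0L) {
      warning("`sep` did not split any value, but values contain: ",
              paste(found, collapse = " "))
      return(invisible(found))
    }
  }
  invisible(character(0))
}

bad <- c("Smith J| Doe A", "Smith J| Lee K", "Doe A| Lee K")
found <- suppressWarnings(check_sep(bad))              # "|" is detected

# Commas and " and " are not treated as structural, so this stays quiet.
quiet <- check_sep(c("Last, First", "Smith and Sons"))
```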

## 3. `read_biblio()`

`read_biblio()` accepts a single file, a vector of file paths, or a
directory. When `format = "auto"` (the default) it detects the format
from the contents of the file:

```{r read-biblio-signature, eval = FALSE}
data <- read_biblio("export.csv")          # auto-detect format
data <- read_biblio("scopus_dir/")         # entire directory, rbind'd
data <- read_biblio(c("a.csv", "b.csv"))   # multiple files, rbind'd
data <- read_biblio("file.csv", format = "scopus")   # force a format
```

When given a directory, `read_biblio()` collects every `.csv`, `.txt`,
`.bib`, `.ris`, `.xls`, and `.xlsx` file in it, reads each one, and
combines the results with `rbind()`. For more than one file a summary
message is emitted:

```
Read 3 files: 1247 rows total
```
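
The directory scan can be pictured with `list.files()`. The sketch below is base R only and makes two assumptions that are not confirmed by the package: that extensions are matched case-insensitively, and that subfolders are not searched. The folder and file names are throwaway examples.

```r
# Create a throwaway folder with a mix of files.
dir <- file.path(tempdir(), "biblio_demo")
dir.create(dir, showWarnings = FALSE)
invisible(file.create(file.path(dir, c("a.csv", "b.bib", "c.RIS", "notes.md"))))

# Keep only the extensions listed above (assumption: case is ignored).
files <- list.files(
  dir,
  pattern     = "\\.(csv|txt|bib|ris|xls|xlsx)$",
  ignore.case = TRUE,
  full.names  = TRUE
)
basename(files)
```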

Format detection inspects the first non-empty line of the file:

- BibTeX: the line begins with `@`
- RIS: the line begins with `TY  -`
- Web of Science plaintext: the line begins with `FN ` or `PT `
- CSV-based: detection inspects the header row. When the first line
  matches the Dimensions preamble (`"About the data: ..."`), line 2 is
  used instead. Header tokens determine the format: `eid` for Scopus,
  `lens id` for Lens.org, `publication id` or `dimensions url` for
  Dimensions, `authorships.author.display_name` for the OpenAlex flat
  CSV.

If detection fails, `read_biblio()` raises an error that lists the
supported formats and indicates how to pass `format` explicitly or name
the entity columns (`authors`, `keywords`, ...), which reads the file as a
generic CSV.
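
The detection rules listed above can be sketched as a small function over the lines of a file. `sniff_format()` is a hypothetical helper, the returned labels are illustrative, and the comma split is deliberately naive; the package's own detection may differ in detail.

```r
# Hypothetical helper, not part of bibnets: guess a format from the first lines.
sniff_format <- function(lines) {
  lines <- lines[nzchar(trimws(lines))]            # drop empty lines
  if (length(lines) == 0L) return(NA_character_)
  first <- lines[1]

  # Tagged text formats are recognised by how the first line starts.
  if (startsWith(first, "@"))     return("bibtex")
  if (startsWith(first, "TY  -")) return("ris")
  if (startsWith(first, "FN ") || startsWith(first, "PT ")) return("wos")

  # CSV: skip the Dimensions preamble, then look at the header tokens.
  header <- first
  if (grepl("About the data", first, fixed = TRUE) && length(lines) > 1L) {
    header <- lines[2]
  }
  # Naive comma split; adequate for header rows without embedded commas.
  tokens <- strsplit(header, ",", fixed = TRUE)[[1]]
  tokens <- tolower(trimws(gsub('"', "", tokens, fixed = TRUE)))

  if ("eid" %in% tokens)     return("scopus")
  if ("lens id" %in% tokens) return("lens")
  if (any(c("publication id", "dimensions url") %in% tokens)) return("dimensions")
  if ("authorships.author.display_name" %in% tokens) return("openalex")
  NA_character_                                    # nothing matched
}

sniff_format(c("", "@article{key,"))
sniff_format("Authors,Title,Year,EID")
sniff_format(c('"About the data: ..."', "Rank,Publication ID,DOI"))
```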

Two readers are not dispatched by `read_biblio()`:

- `read_openalex()` accepts an in-memory tibble from
  `openalexR::oa_fetch()`, not a file path.
- `read_crossref()` accepts the `data` element of `rcrossref::cr_works()`.

Both take R objects rather than files and are called directly.

## 4. Scopus

```{r scopus-call, eval = FALSE}
sc <- read_scopus("scopus.csv")
```

`read_scopus()` ingests the standard Scopus CSV export (`File -> Export ->
CSV` from the Scopus search UI). Mappings from Scopus columns to the
bibnets schema:

| Scopus column | Standard column |
|---|---|
| `EID` (or `Article No.`) | `id` |
| `Title` | `title` |
| `Year` | `year` |
| `Source title` | `journal` |
| `DOI` | `doi` (prefix stripped) |
| `Cited by` | `cited_by_count` |
| `Abstract` | `abstract` |
| `Document Type` | `type` |
| `Authors` (`;`-delimited) | `authors` (list) |
| `References` (`;`-delimited) | `references` (list) |
| `Author Keywords` (`;`-delimited) | `keywords` (list) |
| `Index Keywords` (`;`-delimited) | `index_keywords` (list, extra) |
| `Affiliations` (`;`-delimited) | `affiliations` (list, extra) |
| `Language of Original Document` | `language` (extra) |

Scopus stores all of a document's cited references as one
semicolon-delimited string in a single cell. `read_scopus()` splits that
string on `;` and applies `standardize_refs()` to each entry: uppercasing,
whitespace normalisation, and removal of a trailing DOI where present.
References differing only in case or trailing DOI then resolve to the same
node in co-citation and reference networks.
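
The three normalisation steps can be illustrated in base R. `normalise_ref()` below is a hypothetical stand-in, not the source of `standardize_refs()`, and its DOI pattern is an assumption about what a trailing DOI looks like; the reference strings and the DOI are invented.

```r
# Hypothetical stand-in for the normalisation described above.
normalise_ref <- function(x) {
  x <- toupper(x)                                     # 1. uppercase
  x <- gsub("\\s+", " ", trimws(x), perl = TRUE)      # 2. collapse whitespace
  # 3. drop a trailing DOI, with an optional "DOI:" or doi.org prefix
  sub("[,;]?\\s*(DOI:?\\s*|HTTPS?://(DX\\.)?DOI\\.ORG/)?10\\.[0-9]{4,9}/\\S+$",
      "", x, perl = TRUE)
}

with_doi    <- "Smith J.,  Deep nets, Nature, 2020, doi:10.1000/xyz123"
without_doi <- "SMITH J., deep nets, nature, 2020"

# Both variants end up as the same label.
normalise_ref(c(with_doi, without_doi))
```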

## 5. Web of Science

WoS exports come in two shapes:

```{r wos-call, eval = FALSE}
wos1 <- read_wos("savedrecs.txt")                       # plaintext (default)
wos2 <- read_wos("savedrecs.tsv", format = "tab")       # tab-delimited
```

The plaintext format is a tagged record syntax. Each record begins with a
`PT` (publication type) tag and ends with `ER` (end record). Within the
record, every field is introduced by a two-character tag at the start of a
line, with continuation lines indented:

| Tag | Field |
|---|---|
| `AU` | Authors (one per line) |
| `TI` | Title |
| `SO` | Source / journal |
| `PY` | Year |
| `DI` | DOI |
| `TC` | Times cited |
| `AB` | Abstract |
| `DT` | Document type |
| `DE` | Author keywords |
| `ID` | Keywords plus (extra: `keywords_plus`) |
| `CR` | Cited references (one per line) |

`read_wos()` walks the file, splitting on `ER` boundaries, and emits one
row per record. The tab-delimited variant carries the same fields in a
flat CSV-like grid. Either way the output schema is identical.
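
To make the tagged syntax concrete, the sketch below parses a single hand-written record in base R. `parse_tagged_record()` is a hypothetical helper for illustration, not the parser inside `read_wos()`; it assumes that `AU` and `CR` hold one value per line and that every other field is a single string wrapped over continuation lines.

```r
# Hypothetical parser for ONE tagged record, for illustration only.
parse_tagged_record <- function(lines, multi = c("AU", "CR")) {
  fields <- list()
  tag <- NULL
  for (ln in lines) {
    if (grepl("^ER\\s*$", ln)) break                  # end of record
    if (!nzchar(trimws(ln))) next                     # skip blank lines
    if (grepl("^[A-Z][A-Z0-9] ", ln)) {               # a new two-character tag
      tag   <- substr(ln, 1, 2)
      value <- trimws(substring(ln, 4))
    } else if (!is.null(tag) && grepl("^\\s", ln)) {  # indented continuation
      value <- trimws(ln)
    } else {
      next
    }
    fields[[tag]] <- c(fields[[tag]], value)
  }
  # Single-valued fields: join continuation lines into one string.
  single <- setdiff(names(fields), multi)
  fields[single] <- lapply(fields[single], paste, collapse = " ")
  fields
}

record <- c(
  "PT J",
  "AU Smith, J",
  "   Doe, A",
  "TI A study of",
  "   tagged records",
  "PY 2020",
  "CR Lee K, 2019, J TEST",
  "   Kim H, 2018, J OTHER",
  "ER"
)
rec <- parse_tagged_record(record)
rec$AU
rec$TI
```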

## 6. Dimensions

```{r dimensions-call, eval = FALSE}
dm <- read_dimensions("dimensions_export.csv")
```

The Dimensions CSV begins with a metadata row of the form

```
"About the data: This export was generated on YYYY-MM-DD ..."
```

before the column header. `read_dimensions()` detects this preamble and
skips it. If the line has been removed (for example, by manual editing
of the file), the reader continues to function because it identifies the
header row by the Dimensions header tokens `Publication ID` and
`Dimensions URL`.

Extras returned: `affiliations` and `countries` as list-columns,
analogous to the OpenAlex schema.

## 7. Lens.org

```{r lens-call, eval = FALSE}
ln <- read_lens("lens_export.csv")
```

Key Lens columns and how they map:

| Lens column | Standard column |
|---|---|
| `Lens ID` | `id` |
| `Title` | `title` |
| `Publication Year` | `year` |
| `Source Title` | `journal` |
| `DOI` | `doi` |
| `Cited by Count` | `cited_by_count` |
| `Abstract` | `abstract` |
| `Publication Type` | `type` |
| `Author/s` | `authors` (list) |
| `Reference Identifiers` | `references` (list) |
| `Keywords` | `keywords` (list) |

## 8. BibTeX & RIS

```{r bibtex-ris-call, eval = FALSE}
bt <- read_bibtex("library.bib")
ri <- read_ris("savedrecs.ris")
```

`read_bibtex()` parses `@type{key, field = {value}, ...}` blocks.
`read_ris()` parses tagged `TY  - ... ER  -` blocks; the structure is
equivalent to WoS plaintext, but with a different tag dictionary.

Standard BibTeX and RIS do not contain cited-reference data, so the
`references` column in the resulting data frame is empty on every row.
These formats are sufficient for co-authorship and keyword co-occurrence
networks. For co-citation, coupling, or direct citation networks, the
appropriate sources are Scopus, Web of Science, OpenAlex (via
`oa_fetch()`), Dimensions, Lens, or Crossref.

## 9. Crossref via rcrossref

```{r crossref-call, eval = FALSE}
library(rcrossref)
raw  <- cr_works(query = "graph neural networks", limit = 100)
data <- read_crossref(raw$data)
```

`read_crossref()` accepts the `data` element of the `cr_works()` result
(a data frame, not the wrapping list). The function handles the two
field-naming variants Crossref returns (`container.title` vs
`container-title`; `is.referenced.by.count` vs
`is-referenced-by-count`) and maps both to the standard schema.
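
Handling two spellings of the same field amounts to taking the first candidate column that exists. The helper `first_present()` below is hypothetical and only illustrates the pattern; it is not taken from `read_crossref()`.

```r
# Hypothetical helper: return the first candidate column found in `df`,
# or NA for every row when none of the candidates is present.
first_present <- function(df, candidates) {
  hit <- intersect(candidates, names(df))
  if (length(hit) == 0L) return(rep(NA_character_, nrow(df)))
  df[[hit[1]]]
}

variants <- c("container.title", "container-title")

dotted <- data.frame(container.title = "Journal A", stringsAsFactors = FALSE)
dashed <- data.frame(`container-title` = "Journal B",
                     check.names = FALSE, stringsAsFactors = FALSE)

first_present(dotted, variants)
first_present(dashed, variants)
```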

## 10. OpenAlex — two paths

OpenAlex ships data through two routes that bibnets supports separately.

### Path A: flat CSV

The package includes a 30-row OpenAlex flat CSV at
`inst/extdata/openalex_works.csv`, corresponding to the export produced
by downloading "Works" results from the OpenAlex web interface.
Multi-valued fields use `|` as the delimiter.

```{r openalex-csv-demo}
f <- system.file("extdata", "openalex_works.csv", package = "bibnets")
oa <- read_openalex_csv(f)
str(oa, max.level = 1)
head(oa[, c("id", "title", "year", "journal", "type")], 5)
```

The list-columns:

```{r openalex-csv-lists}
oa$authors[[1]]
oa$affiliations[[1]]
oa$countries[[1]]
```

References and abstracts are absent from the OpenAlex flat export:
`references` is empty and `abstract` is `NA` because the web download does
not include those fields. Use OpenAlex via `openalexR::oa_fetch()` and
`read_openalex()` when you need cited references or abstracts.

The remaining fields support several network constructions that do not
require references — co-authorship, country, institution, keyword,
source, and document networks:

```{r openalex-csv-networks}
co <- country_network(oa, counting = "fractional")
head(co, 5)
```

### Path B: in-memory tibble from `openalexR`

This path is used when references and abstracts are required.
`openalexR::oa_fetch()` returns a nested tibble with `author`,
`referenced_works`, `concepts`, and `keywords` list-columns;
`read_openalex()` converts it to the standard schema:

```{r openalex-fetch, eval = FALSE}
library(openalexR)
raw  <- oa_fetch(entity = "works", search = "learning analytics", per_page = 200)
data <- read_openalex(raw)
```

References are returned as OpenAlex Work IDs (e.g. `W2769342982`) rather
than formatted citation strings. The IDs are stable identifiers suitable
for co-citation and direct-citation networks; visualisations that need
human-readable labels can join the IDs back to titles in a separate step.
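
One way to do that join is a plain `match()` against a lookup table of IDs and titles. Everything in the sketch below is a placeholder: the IDs, the titles, and the helper name `label_ids()` are invented for illustration, and where the lookup table comes from is left open.

```r
# Hypothetical lookup table of Work IDs and titles (placeholder values).
lookup <- data.frame(
  id    = c("W1", "W2", "W3"),
  title = c("Paper one", "Paper two", "Paper three"),
  stringsAsFactors = FALSE
)

# A references list-column: Work IDs per document.
refs <- list(c("W2", "W3"), c("W1", "W9"), character(0))

# Replace each ID with its title; keep the ID when no title is known.
label_ids <- function(ids, lookup) {
  out <- lookup$title[match(ids, lookup$id)]
  out[is.na(out)] <- ids[is.na(out)]
  out
}
ref_labels <- lapply(refs, label_ids, lookup = lookup)
ref_labels
```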

## 11. Building data manually

When data does not come from any of the supported sources, a
bibnets-compatible data frame can be constructed directly. Two
requirements apply: standard scalar columns are character or integer, and
multi-valued fields are list-columns whose elements are character
vectors.

```{r manual-build}
df <- data.frame(
  id    = c("p1", "p2", "p3"),
  title = c("Paper A", "Paper B", "Paper C"),
  year  = c(2020L, 2021L, 2022L),
  stringsAsFactors = FALSE
)
df$authors <- list(
  c("ALICE", "BOB"),
  c("BOB", "CAROL"),
  c("ALICE", "CAROL", "DAVE")
)
df$references <- list(
  c("R1", "R2"),
  c("R1", "R3"),
  c("R2", "R3", "R4")
)
df$keywords <- list(
  c("graph", "network"),
  c("network", "embedding"),
  c("graph", "embedding", "neural")
)

author_network(df, "collaboration")
keyword_network(df)
reference_network(df)
```

`build_bipartite()` applies `toupper(trimws(...))` to every entity label
before constructing the sparse matrix, so `"graph"`, `"Graph"`, and
`"GRAPH"` are mapped to the same node `"GRAPH"`. Tests or comparisons
that reference node names should use uppercase strings.
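
The effect of that normalisation is easy to reproduce on a hand-built list-column. The sketch uses base R only; the variable names are illustrative.

```r
# Three spellings of the same keyword collapse to one label.
labels <- c("graph", " Graph", "GRAPH ")
unique(toupper(trimws(labels)))

# The same transformation applied to every element of a list-column.
keywords <- list(c("graph", "Network "), c(" network", "Embedding"))
keywords_norm <- lapply(keywords, function(v) toupper(trimws(v)))
keywords_norm
```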

## 12. The `split_field()` helper

`split_field()` converts a character column with semicolon-delimited (or
otherwise delimited) values into a list-column without going through
`read_biblio(format = "generic")`:

```{r split-field-demo}
split_field(c("Alice; Bob; Carol", "Dave; Eve"))
split_field(c("a|b|c", "d|e"), sep = "|")
```

This is the same operation that `read_scopus()` and the other readers
apply internally to multi-valued columns; it is exported for use in
custom pipelines.
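
For readers who want to see the shape of the operation, here is a rough base-R approximation. `split_sketch()` is a hypothetical name and not the implementation of `split_field()`; in particular, dropping empty pieces and turning `NA` cells into empty vectors are assumptions.

```r
# Rough approximation: split on a literal delimiter, trim each piece,
# and drop empty or missing pieces.
split_sketch <- function(x, sep = ";") {
  lapply(strsplit(x, sep, fixed = TRUE), function(v) {
    v <- trimws(v)
    v[!is.na(v) & nzchar(v)]
  })
}

semi <- split_sketch(c("Alice; Bob; Carol", "Dave; Eve"))
pipe <- split_sketch(c("a|b|c", "d|e"), sep = "|")
lengths(semi)
```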

## 13. Combining data from multiple sources

Different readers expose different extras: WoS provides `keywords_plus`,
Scopus provides `index_keywords`, OpenAlex provides `countries`. To
combine sources, restrict each frame to the standard columns and bind:

```{r combine-sources}
common <- c("id", "title", "year", "journal", "doi", "cited_by_count",
            "abstract", "type", "authors", "references", "keywords")

data(biblio_data)
b1 <- biblio_data
b2 <- biblio_data
b2$id <- paste0(b2$id, "_dup")

cols <- intersect(common, names(b1))
combined <- rbind(b1[, cols], b2[, cols])
nrow(combined)
```

Two practical notes:

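1. When document IDs overlap across sources (which occurs when Scopus and
   WoS both index the same article), making the IDs unique per source, as
   the `_dup` suffix does above, keeps rows that share an identifier from
   being treated as one document. Articles that are genuinely present in
   both sources still need to be removed, otherwise they inflate
   co-occurrence counts.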
2. Source-specific extras (e.g. WoS `keywords_plus`) should be retained
   on the per-source frame and merged selectively rather than coerced
   into the combined frame.
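
If the same article appears in more than one source, one option is to drop rows whose DOI has already been seen. The sketch below is a suggestion in base R, not a `bibnets` feature; the IDs and DOIs are invented, and rows without a DOI are kept because they cannot be matched this way.

```r
# Toy combined frame: the first and third rows are the same article.
combined_toy <- data.frame(
  id  = c("scopus_1", "scopus_2", "wos_1", "wos_2"),
  doi = c("10.1000/A1", NA, "10.1000/a1", NA),
  stringsAsFactors = FALSE
)

# Compare DOIs case-insensitively; never drop a row just because doi is NA.
key  <- tolower(trimws(combined_toy$doi))
keep <- is.na(key) | !duplicated(key)
deduped <- combined_toy[keep, ]
deduped$id
```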

## 14. Inspecting and sanity-checking

After reading, basic checks on the list-column sizes and the scalar
columns help detect silent corruption. Empty list-columns and
out-of-range years are common indicators that an export is incomplete.

```{r sanity-check}
data(scopus_quantum_cloud)
sc <- scopus_quantum_cloud

range(lengths(sc$authors))
range(lengths(sc$references))
range(lengths(sc$keywords))

head(sort(table(sc$journal), decreasing = TRUE), 5)
range(sc$year, na.rm = TRUE)
table(sc$type)
```

Indicators to check:

- A `lengths()` of `0` on every row of `references` for a Scopus or WoS
  file indicates that the export did not include the references column.
  Re-export from the source with the references field selected.
- A year of `0` or `NA` indicates an empty source field.
- A single dominant document type (e.g. only `"article"`) is expected
  for filtered searches; broader mixes are expected for thematic
  searches.
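
The first two indicators can be turned into a quick numeric summary. The sketch runs on a toy frame in the standard schema so that it is self-contained; the variable names are illustrative, and treating any year at or below zero as suspect is an assumption.

```r
# Toy frame with one good row, one zero year, and one missing year.
toy <- data.frame(
  id   = c("p1", "p2", "p3"),
  year = c(2020L, 0L, NA),
  stringsAsFactors = FALSE
)
toy$references <- list(c("R1", "R2"), character(0), character(0))

# Share of rows with no references, and whether every row is empty.
share_empty_refs <- mean(lengths(toy$references) == 0)
all_refs_empty   <- all(lengths(toy$references) == 0)

# Number of rows whose year is missing or not a plausible value.
n_bad_year <- sum(is.na(toy$year) | toy$year <= 0)

c(share_empty_refs = share_empty_refs, n_bad_year = n_bad_year)
```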

## 15. Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| `Could not detect file format` | First line doesn't match any signature | Pass `format = "scopus"` (etc.) explicitly, or name the entity columns (`authors`, `keywords`, ...) to read it as a generic CSV |
| Empty `references` list on every row | BibTeX/RIS or OpenAlex flat CSV — these don't carry citations | Use Scopus/WoS, OpenAlex via `oa_fetch()`, Dimensions, Lens, or Crossref |
| `Invalid multibyte string` on read | Wrong encoding | Most readers accept `encoding = "latin1"`; pass it through `read_biblio(..., encoding = "latin1")` |
| Author names look like `LASTNAME, F.J.` not `FJ LASTNAME` | Default is `flip_names = FALSE` | The reader returns names as-is from the source. Cluster them by string match downstream, or pass `flip_names = TRUE` if all names follow `Last, First` |
| Dimensions file silently fails | "About the data" preamble removed and column header edited | `read_dimensions()` skips the standard preamble and otherwise finds the header by its tokens, so the failure requires the column header itself to have been edited. Restore the original header row (including `Publication ID` and `Dimensions URL`) or re-export the file |
| Co-authorship network contains duplicate nodes (e.g. `"Alice"` and `"ALICE"`) | Mixed casing in the source | The standard readers and `build_bipartite()` apply `toupper(trimws(...))` to entity labels. Manually constructed frames should apply the same normalisation |

## Further reading

- The companion vignette, `vignette("bibnets")`, covers network
  construction on the in-package datasets.
- `vignette("parsing-author-names")` covers `parse_names()` for
  normalising author labels before a network is built.
- All network builders (`author_network()`, `keyword_network()`,
  `reference_network()`, `document_network()`, `source_network()`,
  `country_network()`, `institution_network()`, `conetwork()`) accept
  the same core arguments (`type`, `counting`, `similarity`,
  `threshold`, `top_n`, `format`) plus the custom-column arguments
  shown above (`id`, the entity column, `sep`, `references_sep`,
  `strip_quotes`), so switching between network types on data already
  in the standard schema requires only a function-name change.