A library and command-line tool that parses scientific names, including the
authorship, rank, hybrid markers and nomenclatural notes, into a structured
`ParsedName` model.
| Module | Purpose |
|---|---|
| `name-parser-api` | Pure model + interface module: `ParsedName`, `Authorship`, `Rank`, `NomCode`, `NameType`, the `NameParser` interface, plus formatter / Unicode utilities. Depend on this if you only need the data model. |
| `name-parser` | The parser implementation. Single public entry point: `org.gbif.nameparser.NameParserImpl`. |
| `name-parser-cli` | Command-line tools (`parse`, `compare`, `benchmark`) wrapping the parser, packaged as an executable shaded jar. |
Build everything with mvn install from the repo root.
```xml
<dependency>
  <groupId>org.gbif</groupId>
  <artifactId>name-parser</artifactId>
  <version>4.0.0-SNAPSHOT</version>
</dependency>
```

```java
NameParser parser = new NameParserImpl();
ParsedName pn = parser.parse("Vulpes vulpes silaceus Miller, 1907", null, null, null);
```

After `mvn install`, the executable jar is at
`name-parser-cli/target/name-parser-cli-<version>-shaded.jar`.

```shell
java -jar name-parser-cli-<version>-shaded.jar <command> [options]
```
| Command | What it does |
|---|---|
| `parse` | Stream a text file with one name per row through the parser and write a JSONL file (one JSON object per row). |
| `compare` | Stream two JSONL files in lockstep, report aggregate metrics and a per-row dump of every differing parsed value. |
| `benchmark` | Measure parser throughput against a name-per-line input file (count, total / avg / min / p50 / p95 / max). |
Run <command> --help for the full per-command option list.
All commands stream their input — memory use stays flat regardless of input size, so multi-million-row inputs are fine.
Sample inputs ship in `name-parser-cli/data/`:

- `benchmark-data.txt`: ~8k mixed names (hand-picked + test-assertion inputs + random Catalogue of Life rows with authorship) used for throughput benchmarking. Top up with more random names anytime via:

  ```shell
  python3 name-parser-cli/scripts/append-colnames-sample.py [-n 2000] [--seed 17]
  ```

  The script reservoir-samples `col-names.tsv` in a single pass and appends rows as `scientificName authorship`; manual edits to the benchmark file are preserved.
- `col-names.tsv`: the full Catalogue of Life names dump (~6.3M rows, ~340 MB, not tracked in git; drop your own copy here).
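For readers unfamiliar with reservoir sampling, the single-pass idea can be sketched as follows. This is not the code of `append-colnames-sample.py`, whose implementation is not shown here; it is a minimal illustration of the classic Algorithm R, and the function name `reservoir_sample` and the in-memory `rows` list are hypothetical stand-ins.

```python
import random
from typing import Iterable, List, Optional


def reservoir_sample(rows: Iterable[str], k: int, seed: Optional[int] = None) -> List[str]:
    """Pick k rows uniformly at random in one pass (Algorithm R).

    Only k rows are held in memory, however long the input is.
    """
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    reservoir: List[str] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Row i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                reservoir[j] = row
    return reservoir


# Hypothetical stand-in for the rows of col-names.tsv.
rows = [f"Genus{i} species{i}\tAuthor{i}, 1900" for i in range(10_000)]
sample = reservoir_sample(rows, k=5, seed=17)
```

Passing the same seed twice yields the same sample, which is presumably why the script exposes `--seed`.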
Each command's --input defaults assume you run it from the repo root.
```
Usage: name-parser-cli parse [options]

Options:
  --input=PATH    source file (default: data/col-names.tsv; '-' = stdin)
  --output=PATH   target file (default: <input>.<format-ext>; '-' = stdout)
  --format=FMT    output format: jsonl (default), json, csv, tsv
                  csv / tsv produce a flat ColDP Name file with header
  --quiet         suppress progress output
  -h --help       print this message and exit
```
Use - as the input or output path to stream from stdin / to stdout — the
command is fully unix-pipe friendly. Progress messages and the final summary
are written to stderr so stdout stays a clean data stream:
```shell
cat names.txt | name-parser-cli parse --input=- --output=- --format=tsv | head
xz -dc col-names.tsv.xz | name-parser-cli parse --input=- --output=- --format=jsonl > col.jsonl
```

The input format is auto-detected from the first non-blank, non-comment line:

- ColDP Name file (TSV or CSV): recognised when the header row contains any `ColdpTerm` property names (looked up via `ColdpTerm.find`). Only the columns the parser interface accepts are honoured: `ID`, `scientificName`, `authorship`, `rank`, `code`. Other columns are read but ignored.
- Plain text: one name per line. If a line contains a tab, only the substring before the first tab is treated as the name (so `col-names.tsv` is usable both as ColDP-style TSV and as bare plain text).

Lines starting with `#` and blank lines are skipped.
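The plain-text rules above can be mimicked in a few lines when preparing or checking input files. The sketch below is an approximation, not the CLI's own reader: the function name `iter_plain_text_names` is hypothetical, and treating whitespace-only lines as blank is an assumption.

```python
from typing import Iterable, Iterator


def iter_plain_text_names(lines: Iterable[str]) -> Iterator[str]:
    """Yield one name per usable line, following the plain-text rules."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Assumption: whitespace-only lines count as blank.
        if not line.strip() or line.startswith("#"):
            continue
        # Only the text before the first tab is treated as the name.
        yield line.split("\t", 1)[0]


example = [
    "# a comment line\n",
    "\n",
    "Felis catus\tLinnaeus, 1758\n",
    "Iridoviridae\n",
]
names = list(iter_plain_text_names(example))
```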
| Format | Description |
|---|---|
| `jsonl` (default) | One self-contained JSON object per line; consumed by `compare`. |
| `json` | Single document containing a JSON array of all rows (streamed; not held in memory). |
| `csv` / `tsv` | Flat ColDP Name file with header row. |
JSON / JSONL rows look like:

```json
{"line":42,"id":"42","input":"Felis catus","parsed":{ ...full ParsedName... }}
{"line":99,"id":"99","input":"Iridoviridae","error":{"type":"VIRUS","message":"..."}}
```

The `id` field is populated from the ColDP `ID` column when present; otherwise
it is omitted.
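Because each JSONL row is self-contained, the output is easy to post-process one line at a time. The following sketch tallies parsed rows against error types; the helper name `summarise_jsonl` is hypothetical, and the empty `parsed` object in the sample stands in for the full `ParsedName` payload, which is elided above.

```python
import io
import json
from collections import Counter
from typing import IO


def summarise_jsonl(stream: IO[str]) -> Counter:
    """Count parsed rows and error rows (keyed by error type) in a JSONL stream."""
    counts: Counter = Counter()
    for line in stream:
        if not line.strip():
            continue
        row = json.loads(line)
        if "error" in row:
            counts["error:" + row["error"]["type"]] += 1
        else:
            counts["parsed"] += 1
    return counts


# Two rows shaped like the examples above; "parsed" is left empty on purpose.
sample = io.StringIO(
    '{"line":42,"id":"42","input":"Felis catus","parsed":{}}\n'
    '{"line":99,"id":"99","input":"Iridoviridae","error":{"type":"VIRUS","message":"..."}}\n'
)
counts = summarise_jsonl(sample)
```

Reading line by line keeps memory use flat, in the same spirit as the CLI itself.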
Every structural ParsedName field maps to a ColDP column. Where the ColDP
Name entity lacks a column but the NameUsage entity defines one, that
NameUsage term is used (nameStatus, namePhrase, namePublishedInPage,
provisional, extinct). Parser-only fields without a ColDP equivalent are
written into custom columns prefixed with np: — strict ColDP readers ignore
unknown columns, so the file stays valid ColDP.
Multi-value rules: author lists join with | (the ColDP convention); notho
parts join with ,.
| `ParsedName` field | ColDP column |
|---|---|
| `id` (from input) | `ID` (falls back to verbatim `scientificName` when absent) |
| `canonicalNameWithoutAuthorship()` (`Candidatus` prefixed when applicable) | `scientificName` |
| `authorshipComplete()` | `authorship` |
| `rank`, `code` | `rank`, `code` (lower-cased) |
| `nomenclaturalNote` (or manuscript flag) | `nameStatus` |
| `uninomial`, `genus`, `infragenericEpithet`, `specificEpithet`, `infraspecificEpithet`, `cultivarEpithet` | same column names |
| `notho` (every flagged part, comma-joined) | `notho` |
| `originalSpelling` | `originalSpelling` |
| `combinationAuthorship.{authors,exAuthors,year}` | `combinationAuthorship`, `combinationExAuthorship`, `combinationAuthorshipYear` (authors joined with `\|`) |
| `basionymAuthorship.{authors,exAuthors,year}` | `basionymAuthorship`, `basionymExAuthorship`, `basionymAuthorshipYear` (authors joined with `\|`) |
| `publishedIn` (free text) | `namePublishedInPage` |
| `extinct` | `extinct` |
| `phrase` | `namePhrase` |
| `doubtful` | `provisional` |
| `type` (when not `SCIENTIFIC`) | `np:type` |
| `sanctioningAuthor` | `np:sanctioningAuthor` |
| `taxonomicNote` (sensu) | `np:taxonomicNote` |
| `unparsed` | `np:unparsed` |
| `warnings` (joined with `\|`) | `np:warnings` |
| (parser failure message) | `np:error` |
Unparsable rows are still written: ID, scientificName (the verbatim input)
and the np:type / np:error columns are populated.
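A consumer of the TSV output has to undo the multi-value joins and check `np:error` to spot unparsable rows. The sketch below shows one way to do that with the standard `csv` module. The helper names are hypothetical, the sample rows are made up for illustration (including the `notho` values, which are placeholders rather than real parser output), and reading with quoting disabled is an assumption about how the TSV is written.

```python
import csv
import io
from typing import Dict, List, Optional


def split_multi(value: Optional[str], sep: str) -> List[str]:
    """Split a joined multi-value cell; empty or missing cells give []."""
    return [part for part in (value or "").split(sep) if part]


def read_coldp_tsv(stream) -> List[Dict[str, object]]:
    """Read a flat ColDP Name TSV and unpack the columns discussed above."""
    # Assumption: cells are not quoted, so quoting is switched off.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for rec in reader:
        rows.append({
            "id": rec.get("ID"),
            "name": rec.get("scientificName"),
            "authors": split_multi(rec.get("combinationAuthorship"), "|"),
            "notho": split_multi(rec.get("notho"), ","),
            "error": rec.get("np:error") or None,
        })
    return rows


# Made-up sample data; values are illustrative only.
SAMPLE_TSV = (
    "ID\tscientificName\tcombinationAuthorship\tnotho\tnp:type\tnp:error\n"
    "1\tVulpes vulpes silaceus\tMiller\t\t\t\n"
    "2\tAus bus\tSmith|Jones\tgeneric,specific\t\t\n"
    "3\tIridoviridae\t\t\tVIRUS\tsome parser message\n"
)
rows = read_coldp_tsv(io.StringIO(SAMPLE_TSV))
failed = [r["id"] for r in rows if r["error"]]
```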
```
Usage: name-parser-cli compare [options] <a.jsonl> <b.jsonl> [diffs.txt]

Options:
  --a=PATH              first JSONL file (alt. to first positional arg)
  --b=PATH              second JSONL file (alt. to second positional arg)
  --output=PATH         write per-row diffs here (default: stdout)
  --ignore-whitespace   strip whitespace from string leaves before compare
  --max-diffs=N         cap per-row diff dump at N rows (default: 100)
  -h --help             print this message and exit
```
Both inputs are expected to come from the same source file (matching line
numbers, same row order). The summary reports rows compared / identical /
differing, status transitions (PARSED→ERROR, ERROR→PARSED, …) and the top
differing field paths. Whitespace inside parsed string values is significant by
default — pass --ignore-whitespace to suppress whitespace-only differences in
parsed values (the JSON formatting itself is ignored either way).
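The lockstep comparison described above can be approximated as follows. This is a sketch of the idea, not the `compare` command's implementation: all function names are hypothetical, the layout of the `parsed` object in the sample rows is assumed, and "ignore whitespace" is interpreted here as removing all whitespace from string leaves.

```python
import json
from itertools import zip_longest
from typing import Any, Dict, Iterable, Iterator, List, Tuple

_MISSING = object()  # marks a path present on one side only


def leaves(value: Any, path: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (field path, scalar) for every leaf of a nested JSON value."""
    if isinstance(value, dict):
        for key in sorted(value):
            yield from leaves(value[key], f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        for i, item in enumerate(value):
            yield from leaves(item, f"{path}[{i}]")
    else:
        yield path, value


def _norm(value: Any, ignore_ws: bool) -> Any:
    # Assumption: ignoring whitespace means dropping it from string leaves.
    if ignore_ws and isinstance(value, str):
        return "".join(value.split())
    return value


def diff_paths(a: Any, b: Any, ignore_ws: bool = False) -> List[str]:
    """Return the sorted field paths whose leaf values differ."""
    la = {p: _norm(v, ignore_ws) for p, v in leaves(a)}
    lb = {p: _norm(v, ignore_ws) for p, v in leaves(b)}
    return sorted(p for p in la.keys() | lb.keys()
                  if la.get(p, _MISSING) != lb.get(p, _MISSING))


def compare(lines_a: Iterable[str], lines_b: Iterable[str], ignore_ws: bool = False):
    """Walk two JSONL streams row by row; return (summary, per-row diffs)."""
    summary = {"compared": 0, "identical": 0, "differing": 0}
    diffs: Dict[int, List[str]] = {}
    for row_no, (raw_a, raw_b) in enumerate(zip_longest(lines_a, lines_b), start=1):
        if raw_a is None or raw_b is None:
            raise ValueError(f"inputs differ in length at row {row_no}")
        paths = diff_paths(json.loads(raw_a), json.loads(raw_b), ignore_ws)
        summary["compared"] += 1
        if paths:
            summary["differing"] += 1
            diffs[row_no] = paths
        else:
            summary["identical"] += 1
    return summary, diffs


# Hypothetical rows; the structure inside "parsed" is assumed for illustration.
file_a = [
    '{"line":1,"input":"Felis catus","parsed":{"genus":"Felis"}}',
    '{"line":2,"input":"Aus bus L. f.","parsed":{"combinationAuthorship":{"authors":["L. f."]}}}',
]
file_b = [
    '{"line": 1, "input": "Felis catus", "parsed": {"genus": "Felis"}}',
    '{"line":2,"input":"Aus bus L. f.","parsed":{"combinationAuthorship":{"authors":["L.f."]}}}',
]
summary, diffs = compare(file_a, file_b)
summary_ws, diffs_ws = compare(file_a, file_b, ignore_ws=True)
```

Parsing each row before comparing is what makes JSON formatting differences (as in row 1) invisible, while the whitespace inside `"L. f."` only stops counting as a difference once `ignore_ws` is set.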
```
Usage: name-parser-cli benchmark [options]

Options:
  --input=PATH   source file (default: data/benchmark-data.txt)
  --warmup       do an extra untimed pass over the input first to warm the JIT
  -h --help      print this message and exit
```
Pure throughput measurement: every input row is parsed and timed. JIT warmup
is opt-in via `--warmup`, in which case the input is streamed through the
parser once without timing before the timed pass; on subsequent runs the
HotSpot-warmed timings tend to be ~10× lower. Nothing is written to disk; the
report goes to stdout.
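To make the reported figures concrete, here is a small sketch of how count, total, avg, min, p50, p95 and max could be derived from per-row timings, with an optional untimed warmup pass. It is an illustration only: the names `benchmark`, `summarise` and `percentile` are hypothetical, the nearest-rank percentile definition is an assumption (the tool's method is not stated), and keeping every timing in a list is a simplification.

```python
import math
import time
from typing import Callable, Dict, List, Sequence


def percentile(sorted_values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty sequence."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarise(timings: Sequence[float]) -> Dict[str, float]:
    """Aggregate per-row timings into the figures the report lists."""
    ordered = sorted(timings)
    total = sum(ordered)
    return {
        "count": len(ordered),
        "total": total,
        "avg": total / len(ordered),
        "min": ordered[0],
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "max": ordered[-1],
    }


def benchmark(parse: Callable[[str], object], names: List[str], warmup: bool = False):
    """Time parse() on every name; optionally run one untimed pass first."""
    if warmup:
        for name in names:
            parse(name)  # results and timings are discarded
    timings = []
    for name in names:
        start = time.perf_counter_ns()
        parse(name)
        timings.append(time.perf_counter_ns() - start)
    return summarise(timings)


# str.upper stands in for a real parser call.
stats = benchmark(str.upper, ["Felis catus", "Vulpes vulpes"], warmup=True)
```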
Apache 2.0.