About
Motivation
Chemistry data repositories hold thousands of datasets, but finding the right one is hard. A researcher looking for "all 13C NMR spectra of thiazole derivatives measured above 290 K" or "all Suzuki couplings that used a palladium catalyst and achieved yields above 80%" cannot express these queries against plain DCAT-AP metadata. DCAT-AP describes datasets with free-text titles, descriptions, simple keyword tags, and coarse theme vocabularies. It currently has no structured way to say which analysis method generated a dataset, which molecule was studied, what reaction type was performed, which instrument was used, or what parameters were applied.
ChemDCAT-AP exists to make such queries possible. By structuring the metadata about how a dataset was generated and what it is about, it enables domain-specific faceted search within data repositories. Concrete examples of filters this enables:
- By analysis method → Find all datasets generated by heteronuclear single quantum coherence (HSQC) NMR spectroscopy, or by gas chromatography-mass spectrometry, or by X-ray powder diffraction.
- By analysed substance → Find all datasets about a specific molecule (via InChIKey), or about any compound containing a thiazole substructure (via SMILES substructure search on the composed_of chain).
- By reaction characteristics → Find all datasets about reactions that used a specific catalyst, produced a specific product, or achieved a yield above a threshold.
- By instrument parameters → Find all NMR datasets acquired with a specific pulse sequence, at a specific excitation frequency, or on a specific spectrometer model.
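To illustrate the kind of faceted filtering listed above, the following is a minimal Python sketch. It does not use ChemDCAT-AP's actual classes or any repository API: the `DatasetRecord` fields, the `facet_filter` helper, and the sample records are all hypothetical and only stand in for structured metadata that a repository might index.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetRecord:
    """Hypothetical, flattened view of one dataset's structured metadata."""
    title: str
    analysis_method: str                   # e.g. "13C NMR", "GC-MS"
    temperature_k: Optional[float] = None  # measurement temperature in kelvin
    reaction_type: Optional[str] = None
    catalyst_element: Optional[str] = None
    yield_percent: Optional[float] = None


def facet_filter(records, **criteria):
    """Keep the records whose fields satisfy every criterion.

    A criterion is either a plain value (tested for equality) or a
    callable predicate. A record with no value for a field never matches.
    """
    def matches(record):
        for name, wanted in criteria.items():
            value = getattr(record, name)
            if value is None:
                return False
            if callable(wanted):
                if not wanted(value):
                    return False
            elif value != wanted:
                return False
        return True

    return [r for r in records if matches(r)]


# Made-up sample records, for illustration only.
records = [
    DatasetRecord("NMR run A", "13C NMR", temperature_k=298.0),
    DatasetRecord("NMR run B", "13C NMR", temperature_k=280.0),
    DatasetRecord("Coupling C", "GC-MS", reaction_type="Suzuki coupling",
                  catalyst_element="Pd", yield_percent=91.0),
    DatasetRecord("Coupling D", "GC-MS", reaction_type="Suzuki coupling",
                  catalyst_element="Pd", yield_percent=62.0),
]

# "13C NMR spectra measured above 290 K"
warm_nmr = facet_filter(records, analysis_method="13C NMR",
                        temperature_k=lambda t: t > 290)

# "Suzuki couplings with a palladium catalyst and yields above 80%"
good_couplings = facet_filter(records, reaction_type="Suzuki coupling",
                              catalyst_element="Pd",
                              yield_percent=lambda y: y > 80)
```

Such filters only work because every criterion addresses a dedicated, typed field; with free-text titles and keywords alone, the temperature and yield thresholds could not be evaluated.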
These filters require structured, machine-actionable metadata at a level of detail that DCAT-AP alone cannot provide. DCAT-AP+ adds the generic provenance and attribute machinery that makes this expressivity possible, but using it for chemistry requires manually classifying every instance with the correct ontology term via rdf_type (e.g., typing a qualitative attribute as CHEMINF:000059 to indicate it is an InChIKey). ChemDCAT-AP bakes such domain knowledge into the schema: it provides intuitive subclasses and dedicated slots with the right ontology mappings already in place, so that developers and data stewards can produce precise chemistry metadata without needing to look up the correct ontology terms themselves.
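The difference between the two approaches can be sketched in plain Python. This is an illustration under stated assumptions, not the actual schema: ChemDCAT-AP fixes these mappings at the schema level, whereas the dictionary layout, the `kind` key, and both helper functions here are hypothetical. Only the term CHEMINF:000059 is taken from the text above.

```python
# Ontology term for an InChIKey, as named in the text above.
INCHIKEY_TERM = "CHEMINF:000059"


def generic_attribute(value, rdf_type):
    """Generic pattern: the caller must know and supply the ontology term."""
    return {"kind": "QualitativeAttribute", "value": value, "rdf_type": rdf_type}


def inchikey_attribute(value):
    """Domain-specific pattern: the ontology term is already built in."""
    return generic_attribute(value, INCHIKEY_TERM)


# Placeholder value, not a real InChIKey.
by_hand = generic_attribute("EXAMPLE-INCHIKEY", "CHEMINF:000059")
by_slot = inchikey_attribute("EXAMPLE-INCHIKEY")
```

Both calls yield the same typed attribute, but only the first requires the data steward to look up the correct term.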
Origin
ChemDCAT-AP grew out of a collaboration between NFDI4Chem (NFDI for Chemistry) and NFDI4Cat (NFDI for Catalysis-Related Sciences). Both consortia needed fine-grained, machine-actionable metadata for domain-specific search in their data repositories. Given the overlap between chemistry and catalysis (both need chemical identifiers, characterization methods, and reaction descriptions, among others), a joint effort was the natural choice. From the start it was clear that a DCAT-AP extension for this scope would need to be modular. As the schema took shape, it became apparent that its upper layer (structured provenance, generic attribute patterns, flexible classification) was entirely domain-agnostic and could serve as a reusable module for other domains. This layer was spun off as DCAT-AP+, and ChemDCAT-AP became the first domain profile built on top of it.
Consequently, ChemDCAT-AP serves a dual role:
- a foundation that will be implemented and further extended for the production metadata schemata used by NFDI4Cat and NFDI4Chem services, such as NFDI4Chem's Search Service or NFDI4Cat's Repo4Cat and Metadata4cat, and
- a reference implementation that demonstrates the DCAT-AP+ extension rules in practice. Every design decision documented here can serve as a template for other domains building their own DCAT-AP+ extension profiles.
Design principles
The collaborative development process between NFDI4Chem and NFDI4Cat followed four core principles:
- Conformance: Following DCAT-AP's extension guidelines, all extensions strictly adhere to the mandatory constraints of the official DCAT-AP 3.0 specification.
- Discoverability-centric: New properties are chosen to directly improve dataset findability for specific chemistry and catalysis use cases (see Motivation).
- Semantic grounding: Every added class and property is mapped to an established ontology (PROV-O, QUDT, BFO, IAO, OBI, CHEBI, CHMO, SIO, etc.) via class_uri and slot_uri.
- Simplicity for data stewards and developers: The schema remains usable without requiring deep ontology expertise. Dedicated typed slots, clear naming, and the quick start lower the adoption barrier.
For more on DCAT-AP+'s design rationale and the gap in DCAT-AP, see the DCAT-AP+ design patterns documentation.
Publication
The design, implementation, and evaluation of DCAT-AP+ and ChemDCAT-AP were presented at the 19th International Conference on Metadata and Semantics Research (MTSR 2025), Thessaloniki, Greece. Preprint: https://doi.org/10.48550/arXiv.2602.01822.
Funding
This work is funded by the German Research Foundation (DFG) as part of the National Research Data Infrastructure (NFDI):
| Project | DFG Grant | Link |
|---|---|---|
| NFDI4Chem — NFDI for Chemistry | 441958208 | nfdi4chem.de |
| NFDI4Cat — NFDI for Catalysis-Related Sciences | 441926934 | nfdi4cat.org |
License
ChemDCAT-AP is released under CC-BY 4.0. The repository code is licensed under MIT.