Dataset and activity shapes

ChemDCAT-AP defines four convenience shapes that connect the DCAT-AP+ Dataset and DataGeneratingActivity classes to the chemistry-specific subject matter classes (SubstanceSample, ChemicalReaction). These shapes add no new slots. They only narrow the ranges of inherited slots to the right chemistry classes.

Design pattern

In the design-pattern diagram, the left side shows the reaction pattern: a ReactionMonitoringDataset is about a ChemicalReaction and was generated by a ReactionMonitoring activity that evaluated the ChemicalReaction. The right side shows the substance pattern: a SubstanceSampleCharacterizationDataset is about a SubstanceSample and was generated by a SubstanceSampleCharacterization activity that evaluated the SubstanceSample. Dashed arrows indicate shape inheritance (is_a); solid arrows indicate slot relationships, labeled with both the LinkML slot name and the RDF predicate it maps to.

Why these shapes are needed

Consider what happens without them. You have a SubstanceSample with chemical composition, concentration, and physical properties. You want to describe a dataset about this sample. In DCAT-AP+, the Dataset class has an is_about_entity slot with range: EvaluatedEntity. But your SubstanceSample carries slots from ChemicalSubstanceMixin (like composed_of, has_concentration) that do not exist on EvaluatedEntity. If you try to validate or convert your instance data using the DCAT-AP+ Dataset shape, the LinkML tooling will reject it, because the data does not match the expected shape.

This is not a bug. LinkML validates against the shape declared in the range, and each shape defines exactly which slots are allowed. A SubstanceSample is a different shape from an EvaluatedEntity, even though it extends it via is_a. The tooling does not infer that "SubstanceSample inherits from EvaluatedEntity, so it should be accepted where EvaluatedEntity is expected." That would be OWL subsumption reasoning, which LinkML does not perform at validation time. The convenience shapes close this gap by narrowing the range to the chemistry class itself, so instance data is validated against the shape it actually has.
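The behaviour described above can be illustrated with a toy closed-shape check. This is a minimal sketch, not LinkML's validator: the `SHAPES` and `IS_ABOUT_ENTITY_RANGE` tables, the `unexpected_slots` helper, the `id` slot, and all values are hypothetical and exist only for illustration. Only the slot and class names composed_of, has_concentration, is_about_entity, EvaluatedEntity, and SubstanceSample come from the text.

```python
# Toy illustration only: this is NOT how LinkML is implemented. It mimics a
# closed shape, i.e. a shape that rejects any slot it does not declare.

# Slots each shape allows. "id" is a hypothetical identifier slot;
# composed_of and has_concentration are the ChemicalSubstanceMixin slots
# named in the text.
SHAPES = {
    "EvaluatedEntity": {"id"},
    "SubstanceSample": {"id", "composed_of", "has_concentration"},
}

# Range of is_about_entity for the generic and the narrowed dataset shape.
IS_ABOUT_ENTITY_RANGE = {
    "Dataset": "EvaluatedEntity",
    "SubstanceSampleCharacterizationDataset": "SubstanceSample",
}


def unexpected_slots(dataset_shape: str, entity: dict) -> set:
    """Return the slots of `entity` that the range shape does not declare."""
    range_shape = IS_ABOUT_ENTITY_RANGE[dataset_shape]
    return set(entity) - SHAPES[range_shape]


# Placeholder instance data for a substance sample.
sample = {
    "id": "ex:sample-1",
    "composed_of": ["ex:ethanol"],
    "has_concentration": "placeholder value",
}

# Checked against the generic Dataset shape, the chemistry slots are unknown.
print(unexpected_slots("Dataset", sample))

# Checked against the narrowed shape, nothing is left over.
print(unexpected_slots("SubstanceSampleCharacterizationDataset", sample))
```

The first call reports the two mixin slots as unexpected, and the second reports none, which mirrors why the narrowed range is needed.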

These four shapes also serve as templates: downstream profiles that need more granular alternatives can follow the same pattern of subclassing Dataset and DataGeneratingActivity with narrower ranges.

Coarse-grained by design

These shapes deliberately conflate measurement and analysis into a single DataGeneratingActivity. This is appropriate when a dataset captures the end result of a characterization workflow without needing to trace intermediate steps (raw data vs. processed data vs. derived results). When that distinction matters, use the DCAT-AP+ DataAnalysis chain instead: define your own DataGeneratingActivity and DataAnalysis subclasses in a sub-profile that imports ChemDCAT-AP, and model raw and derived datasets separately via AnalysisSourceData and AnalysisDataset.

Substance sample characterization

SubstanceSampleCharacterizationDataset extends Dataset. It constrains was_generated_by to range SubstanceSampleCharacterization and is_about_entity to range SubstanceSample.

SubstanceSampleCharacterization extends DataGeneratingActivity and constrains evaluated_entity to range SubstanceSample. It declares broad_mappings: OBI:0000070 (assay), indicating that this shape is semantically narrower than OBI's assay class.

The rdf_type slot (inherited from the ClassifierMixin) is how you specify what kind of characterization was performed. For an NMR measurement, you would set rdf_type to CHMO:0000595 (carbon-13 NMR spectroscopy) or another CHMO term. For an XRD analysis, you would use the corresponding CHMO class. The shape itself does not prescribe the method.
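To make the substance pattern concrete, the three shapes can be mirrored as plain Python dataclasses. This is only a sketch of how the slots relate: it is not code generated from the schema, the `id` fields and all `ex:` identifiers are hypothetical, and treating every slot as single-valued is a simplifying assumption. Dataclass annotations are also not enforced at runtime, so this shows structure, not validation.

```python
from dataclasses import dataclass


@dataclass
class SubstanceSample:
    id: str  # hypothetical identifier slot


@dataclass
class SubstanceSampleCharacterization:
    evaluated_entity: SubstanceSample  # range narrowed to SubstanceSample
    rdf_type: str  # CURIE of the method used, e.g. a CHMO term


@dataclass
class SubstanceSampleCharacterizationDataset:
    id: str  # hypothetical identifier slot
    was_generated_by: SubstanceSampleCharacterization
    is_about_entity: SubstanceSample


sample = SubstanceSample(id="ex:sample-1")

# The activity is classified through rdf_type; the shape does not fix the method.
nmr_run = SubstanceSampleCharacterization(
    evaluated_entity=sample,
    rdf_type="CHMO:0000595",  # carbon-13 NMR spectroscopy
)

# The dataset is about the same sample that the activity evaluated.
dataset = SubstanceSampleCharacterizationDataset(
    id="ex:dataset-1",
    was_generated_by=nmr_run,
    is_about_entity=sample,
)
```

Swapping the rdf_type value for another CHMO term would describe a different method without touching the shape definitions.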

When to use vs. when to define your own

Scenario	Use this shape?
A dataset records an NMR spectrum of a sample, and you don't need to separate measurement from processing	Yes
A dataset records the final structure assignment derived from multiple spectroscopic measurements	Consider using `AnalysisDataset` instead, with each measurement as a separate `DataGeneratingActivity` producing `AnalysisSourceData`
A catalysis study produces a dataset about catalyst performance measured via GC-MS	Yes, if the dataset captures the end result
An NMR sub-profile needs to model the pulse sequence and number of scans as distinct parameters, or the peak assignment as a distinct step	Define your own activity subclasses in your sub-profile

Reaction recording

ReactionMonitoringDataset extends Dataset. It constrains was_generated_by to range ReactionMonitoring and is_about_activity to range ChemicalReaction.

ReactionMonitoring extends DataGeneratingActivity and constrains evaluated_activity to range ChemicalReaction.

The same rdf_type mechanism applies: classify the ReactionMonitoring instance with the appropriate ontology term for the kind of recording process (reaction monitoring, experimental documentation, calorimetric evaluation, etc.).

When to use vs. when to define your own

Scenario	Use this shape?
A dataset documents a synthesis procedure with participants, conditions, and yield	Yes
A dataset captures real-time reaction monitoring data (e.g., in-situ IR) as raw data, plus a separate dataset with the derived kinetic analysis	Define your own activity subclasses; use `DataAnalysis`/`AnalysisDataset` for the derived dataset
A Chemotion Repository reaction record exported as a single dataset	Yes

Shared design properties

All four classes share class_uri with their parents (dcat:Dataset or prov:Activity). They are different shapes (SHACL node shapes with narrower property constraints), not new ontology classes. This follows the DCAT-AP+ foundational principle: multiple LinkML classes can reference the same ontology term, each representing a different usage context.

In practice this means a SPARQL query for ?x a dcat:Dataset will find both SubstanceSampleCharacterizationDataset and ReactionMonitoringDataset instances. The narrower shapes exist for validation and developer guidance, not for RDF-level type discrimination.
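The effect of the shared class_uri can be sketched with plain tuples standing in for RDF triples. This is an illustrative model under stated assumptions, not output from any LinkML or RDF tool: the `CLASS_URI` table restates the mapping described above, while the `type_triple` and `instances_of` helpers and the `ex:` identifiers are hypothetical. Additional rdf:type values supplied through the rdf_type slot are left out for brevity.

```python
RDF_TYPE = "rdf:type"

# Each shape serializes with the class_uri it shares with its parent.
CLASS_URI = {
    "Dataset": "dcat:Dataset",
    "SubstanceSampleCharacterizationDataset": "dcat:Dataset",
    "ReactionMonitoringDataset": "dcat:Dataset",
    "DataGeneratingActivity": "prov:Activity",
    "SubstanceSampleCharacterization": "prov:Activity",
    "ReactionMonitoring": "prov:Activity",
}


def type_triple(subject: str, shape: str) -> tuple:
    """Build the rdf:type triple an instance of `shape` would carry."""
    return (subject, RDF_TYPE, CLASS_URI[shape])


triples = [
    type_triple("ex:nmr-dataset", "SubstanceSampleCharacterizationDataset"),
    type_triple("ex:synthesis-dataset", "ReactionMonitoringDataset"),
    type_triple("ex:nmr-run", "SubstanceSampleCharacterization"),
]


def instances_of(class_uri: str) -> list:
    """Rough stand-in for: SELECT ?x WHERE { ?x a <class_uri> }."""
    return sorted(s for s, p, o in triples if p == RDF_TYPE and o == class_uri)


# Both dataset shapes are found; the shape name never appears in the triples.
print(instances_of("dcat:Dataset"))
```

Because the shape name is absent from the triples, telling the two dataset kinds apart at the RDF level would need other information, such as the type of the linked activity or subject.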