
D2 · Data engineering as infrastructure

Axis I · D2

Thesis. Data is neither an input to the model nor an accounting asset; it is the infrastructure of which the model is merely a consumer, on par with networks or storage systems. As long as data is treated as a derivative of the AI project, models are built on certified sand.

The distinction that cuts

Data pipeline vs. data infrastructure. A pipeline transports; an infrastructure persists, versions, and guarantees. The first is a project that ends; the second is a patrimony that outlives projects.
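The three verbs above (persist, version, guarantee) can be made concrete with a toy example. The sketch below is only an illustration under stated assumptions: the class name VersionedStore, its methods, and the "patients" dataset are hypothetical, and an in-memory dictionary stands in for real storage.

```python
import hashlib
import json


class VersionedStore:
    """Hypothetical in-memory store: every publish creates an immutable version."""

    def __init__(self):
        self._versions = {}  # (name, version_id) -> list of rows
        self._latest = {}    # name -> most recent version_id

    def publish(self, name, schema, rows):
        # Guarantee: reject any row whose keys differ from the declared schema.
        for row in rows:
            if set(row) != set(schema):
                raise ValueError(f"row {row!r} does not match schema {sorted(schema)}")
        # Version: the identifier is derived from the schema and the content.
        payload = json.dumps({"schema": sorted(schema), "rows": rows}, sort_keys=True)
        version_id = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
        # Persist: earlier versions are kept, never overwritten.
        self._versions[(name, version_id)] = list(rows)
        self._latest[name] = version_id
        return version_id

    def read(self, name, version_id=None):
        # Consumers may pin a version; without one they get the latest.
        version_id = version_id or self._latest[name]
        return list(self._versions[(name, version_id)])


store = VersionedStore()
v1 = store.publish("patients", ["id", "age"], [{"id": 1, "age": 40}])
v2 = store.publish("patients", ["id", "age"], [{"id": 1, "age": 41}])
pinned = store.read("patients", v1)  # a consumer pinned to v1 is unaffected by v2
```

A pipeline, in this reading, would be the code that calls publish; the infrastructure is what remains addressable once that code has finished running.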

Typical market error

Building the data chain for the AI project, hence specific to it, hence non-reusable, hence rebuilt for the next one. Hidden cost: 60 to 80% of AI budgets go to permanent re-engineering that is never accounted for as such.

Failure signals

No data catalogue traceable to the point of collection (in the sense of GDPR art. 30, records of processing activities). Schemas that change without versioning that blocks incompatible downstream use. No retention policy differentiated by data type. Quality indicators limited to completeness, with nothing on semantic coherence or distributional stability. No separation between the raw layer, the ontological pivot layer, and the analytical layer.
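Of these signals, distributional stability is the easiest to turn into a number. One common choice, not prescribed by this text, is the population stability index (PSI), which compares the binned distribution of a current batch against a reference batch. The sketch below is a minimal version; the function name, the number of bins, the smoothing constant eps, and the sample data are assumptions made for the illustration.

```python
import math
from collections import Counter


def population_stability_index(reference, current, bins=10, eps=1e-6):
    """PSI between two numeric samples, using equal-width bins on the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid a zero width when the reference is constant

    def shares(values):
        # Assign each value to a bin; values outside the reference range are clamped.
        counts = Counter(
            min(max(int((v - lo) / width), 0), bins - 1) for v in values
        )
        # eps keeps empty bins from producing log(0) or a division by zero.
        return [max(counts.get(i, 0) / len(values), eps) for i in range(bins)]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


reference = list(range(100))             # hypothetical values seen at training time
shifted = [v + 50 for v in reference]    # hypothetical batch that has drifted upward

psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, shifted)
```

A completeness check would pass on both batches, since neither has missing values; only the PSI separates them. Any alert threshold on the index is a convention to be set per dataset, not a property of the measure.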

References

ISO/IEC 8000 (data quality); ISO/IEC 11179 (metadata registries); FAIR principles (Wilkinson et al., Sci. Data 2016); for health, HL7 FHIR R5, OMOP CDM v5.4, CDISC SDTM; GDPR art. 5 (quality, minimisation, accuracy); HDS reference framework from ANS for hosting.

Ground of implementation

BioKG-TweenMe is a DuckDB database, schema V1.1, with 21 tables organized in five layers (ontological, phenotypic, genomic, epidemiological, exposomic) and diseases.mondo_id as the universal pivot. The architecture strictly separates sources (MONDO, HPO, Open Targets, GHO, EXPOSOME-Explorer) from uses. The instance illustrates the doctrine of the ontological pivot; it does not prove that every organization must adopt MONDO as its pivot. The choice of ontology is contextual; the principle of the pivot is not.
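The pivot principle can be sketched in a few lines. This is not the BioKG-TweenMe schema: only diseases.mondo_id comes from the description above. The tables phenotype_links and exposure_links, their columns, and the identifier values are hypothetical placeholders, and the standard-library sqlite3 module stands in for DuckDB so that the example runs without extra dependencies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
CREATE TABLE diseases (
    mondo_id TEXT PRIMARY KEY,   -- the pivot named in the text
    label    TEXT NOT NULL
);
CREATE TABLE phenotype_links (   -- hypothetical phenotypic-layer table
    hpo_id   TEXT NOT NULL,
    mondo_id TEXT NOT NULL REFERENCES diseases(mondo_id)
);
CREATE TABLE exposure_links (    -- hypothetical exposomic-layer table
    exposure TEXT NOT NULL,
    mondo_id TEXT NOT NULL REFERENCES diseases(mondo_id)
);
""")

# Placeholder identifiers, not real ontology records.
conn.execute("INSERT INTO diseases VALUES ('MONDO:demo-1', 'demo disease')")
conn.execute("INSERT INTO phenotype_links VALUES ('HP:demo-1', 'MONDO:demo-1')")
conn.execute("INSERT INTO exposure_links VALUES ('demo exposure', 'MONDO:demo-1')")

# A row pointing at an unknown pivot value is rejected at write time.
try:
    conn.execute("INSERT INTO phenotype_links VALUES ('HP:demo-2', 'MONDO:unknown')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Layers never reference each other directly; they meet only through the pivot.
rows = conn.execute("""
    SELECT d.label, p.hpo_id, e.exposure
    FROM diseases d
    JOIN phenotype_links p USING (mondo_id)
    JOIN exposure_links e USING (mondo_id)
""").fetchall()
```

Replacing MONDO with another ontology would change the key column and its values but not this shape, which is the sense in which the ontology is contextual and the pivot is not.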

Articulation

To be read jointly with D6 (sovereignty), since where data is located conditions the legality of its processing, and with D7 (evaluation). Without data traceability, evaluation produces numbers that can be neither reproduced nor defended when challenged.