SDTM Export — Design Notes
==========================

Status: **design / exploration** (no code yet). Target: **SDTMIG 3.4**,
primary output **Dataset-JSON v1.1**.

This document scopes a future SDTM (Study Data Tabulation Model) export
alongside the existing ODM 1.3.1 export. SDTM is one of the foundational
standards required for data submission to the FDA (and PMDA).

Why SDTM is a different kind of problem from ODM
------------------------------------------------

The ODM exporter is **mechanical and lossless**: it walks the clinicedc
structure (``RegisteredSubject`` → related visit → CRF → fields) and emits a
faithful hierarchical XML mirror. OIDs are derived from model labels and field
names; no clinical interpretation is needed, which is why it is generic across
any clinicedc trial.

SDTM is the opposite. It is a **fixed target model**, not a dump of the source.
Data must be reshaped and re-coded into a standard set of flat domain tables.

.. list-table::
   :header-rows: 1
   :widths: 20 40 40

   * - Aspect
     - ODM (today)
     - SDTM (proposed)
   * - Shape
     - Hierarchical XML mirror of the EDC
     - One flat table per **domain** (DM, AE, VS, LB, EX, CM, MH, …)
   * - Direction
     - Source-faithful
     - Conform-to-standard
   * - Variable names
     - Your field names (OIDs)
     - Fixed CDISC names (``USUBJID``, ``--TESTCD``, ``--ORRES``,
       ``VISITNUM``, …)
   * - Coding
     - As entered
     - CDISC Controlled Terminology + MedDRA / WHODrug
   * - Mapping
     - Automatic
     - **Requires per-CRF SME mapping**
   * - Output
     - ODM XML
     - ``.xpt`` (SAS v5) and/or **Dataset-JSON**, plus **define.xml**

The honest headline: SDTM **cannot** be fully auto-generated from arbitrary
clinicedc data the way ODM can. A *Findings* domain such as VS or LB is
**vertical** — one row per measurement (``VSTESTCD=SYSBP``, ``VSORRES=120``,
``VSORRESU=mmHg``) — whereas a CRF stores those as horizontal columns. That
transpose, plus the controlled-terminology decisions, is inherently a
study-specific, human-in-the-loop mapping exercise.

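
The horizontal-to-vertical transpose described above can be sketched in plain
Python. This is only an illustration under stated assumptions: the CRF field
names (``sys_blood_pressure``, ``dia_blood_pressure``), the subject
identifiers, and the ``VS_MAP`` dictionary are hypothetical and do not come
from clinicedc; a real mapping would be supplied per study by an SME.

.. code-block:: python

    # Hypothetical wide CRF records: one dict per subject visit,
    # one key per measurement (field names are made up for illustration).
    crf_rows = [
        {"USUBJID": "STUDY1-001", "VISITNUM": 1,
         "sys_blood_pressure": 120, "dia_blood_pressure": 80},
        {"USUBJID": "STUDY1-002", "VISITNUM": 1,
         "sys_blood_pressure": 135, "dia_blood_pressure": None},
    ]

    # Hypothetical mapping: CRF field -> (VSTESTCD, VSORRESU).
    VS_MAP = {
        "sys_blood_pressure": ("SYSBP", "mmHg"),
        "dia_blood_pressure": ("DIABP", "mmHg"),
    }


    def to_vs(rows):
        """Transpose wide CRF rows into one VS record per measurement."""
        out = []
        for row in rows:
            for field, (testcd, unit) in VS_MAP.items():
                value = row.get(field)
                if value is None:
                    # Simplification: a missing value yields no record here.
                    continue
                out.append({
                    "DOMAIN": "VS",
                    "USUBJID": row["USUBJID"],
                    "VSTESTCD": testcd,
                    "VSORRES": str(value),  # result stored as text
                    "VSORRESU": unit,
                    "VISITNUM": row["VISITNUM"],
                })
        return out


    vs_rows = to_vs(crf_rows)

Each wide row fans out into one record per mapped field, which is why the
mapping cannot be inferred from field names alone: something has to say which
column is ``SYSBP`` and in which unit it was collected.
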
SDTM observation classes
------------------------

Every domain belongs to one class: the three general observation classes
(Interventions, Events, Findings) plus Special Purpose and Trial Design.
Identify the class first; the domain follows.

* **Interventions** — CM (concomitant meds), EX (exposure), EC, SU, PR
* **Events** — AE (adverse events), MH (medical history), DS (disposition), CE
* **Findings** — VS (vital signs), LB (labs), EG (ECG), QS (questionnaires),
  FA (findings about)
* **Special Purpose** — DM (demographics), CO, SE (subject elements),
  SV (subject visits)
* **Trial Design** — TS, TA, TE, TV, TI, TD, TM

What clinicedc can supply (mostly) for free
--------------------------------------------

clinicedc already holds the structured metadata SDTM needs for a meaningful
slice:

* **Trial Design domains** (``TS, TA, TE, TV, TI``) — the ``visit_schedule`` +
  ``edc_protocol`` config already encode visits, epochs, arms, and timing.
  These map almost directly and are the most "free" win.
* **DM (Demographics)** — assembled from ``RegisteredSubject`` + consent +
  screening + a demographics CRF. Mostly derivable with a thin mapping.
* **SV / SE (Subject Visits / Elements)** — straight from the related-visit
  model already iterated in :file:`clinical_data_serializer.py`.

Everything else (AE, CM, MH, VS, LB, EG, EX, …) needs a **mapping-spec layer**:
a declarative config saying "CRF model ``X`` field ``Y`` → domain VS, variable
``VSORRES``, ``VSTESTCD=SYSBP``, unit mmHg".

What existing code is directly reusable
---------------------------------------

#. **define.xml is ODM.** Define-XML 2.0/2.1 is a CDISC extension of ODM
   (v1.3.2), the same schema family as the ODM 1.3.1 export already emitted
   (``MetaDataVersion``, ``ItemGroupDef``, ``ItemDef``, ``CodeList``).
   ``MetadataSerializer`` is the natural foundation for the define.xml that
   must accompany every SDTM submission.
#. **PII guards** (encrypted-field skip, consent whitelist) carry over
   unchanged and remain mandatory — SDTM datasets must never leak
   ``django_crypto_fields`` data.
#. The subject/visit traversal in :file:`clinical_data_serializer.py` is the
   same traversal a domain builder needs.

Output format: Dataset-JSON
---------------------------

Target **Dataset-JSON v1.1** (released 2024-12-05) first:

* No SAS dependency; trivially generated from a pandas ``DataFrame``
  (DataFrame rows become records; columns carry name/label/type metadata).
* Can optionally reference a Define-XML document for full metadata.

``.xpt`` (SAS Transport v5) is the format the FDA currently requires. It can be
added later as an additional serialization target (e.g. via ``pyreadstat``),
subject to its 8-char name / value-length constraints.

Proposed phased plan
--------------------

#. **Phase 1 — Trial Design, SV/SE, and DM as Dataset-JSON.** Auto-derive
   ``TS/TA/TE/TV/TI/SV/SE`` and ``DM`` from the visit schedule +
   registration/consent. Emit Dataset-JSON. Proves the pipeline end-to-end
   with little or no SME mapping.
#. **Phase 2 — Mapping-spec layer + 1–2 domains** (e.g. VS and AE) to validate
   the transpose + controlled-terminology approach.
#. **Phase 3 — define.xml** reusing ``MetadataSerializer``, plus **CORE**
   validation (CDISC's open-source conformance engine) wired into a management
   command, analogous to ``validate_odm_export``.

Open questions
--------------

* Controlled Terminology source / version pinning (CDISC CT packages; MedDRA &
  WHODrug licensing for AE/CM).
* How mapping specs are declared and stored (per-trial Python config vs.
  data-driven).
* Whether ``--SEQ`` and ``RELREC`` relationships are needed in early phases.
* SUPPQUAL handling for non-standard CRF variables.
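
To make the mapping-spec question above concrete, here is one possible shape
for the "per-trial Python config" option. It is a sketch, not a decided
design: ``FindingMapping``, ``MAPPINGS``, ``mappings_for``, and the model label
``myapp.vitalsigns`` are all hypothetical names. The only rule taken from SDTM
is that ``--TESTCD`` values are limited to 8 characters, use only letters,
digits, and underscores, and do not start with a digit; requiring uppercase is
an extra convention assumed here.

.. code-block:: python

    import re
    from dataclasses import dataclass

    # --TESTCD: at most 8 chars, letters/digits/underscore, no leading digit.
    # Uppercase-only is an assumption of this sketch.
    _TESTCD_RE = re.compile(r"^[A-Z_][A-Z0-9_]{0,7}$")


    @dataclass(frozen=True)
    class FindingMapping:
        """Hypothetical spec entry: one CRF field -> one Findings test."""

        model: str   # Django model label (made up for illustration)
        field: str   # CRF field name on that model
        domain: str  # target SDTM domain, e.g. "VS"
        testcd: str  # value for --TESTCD
        unit: str    # value for --ORRESU

        def __post_init__(self):
            # Fail at import time rather than during an export run.
            if not _TESTCD_RE.match(self.testcd):
                raise ValueError(f"invalid --TESTCD: {self.testcd!r}")


    # Hypothetical per-trial config.
    MAPPINGS = (
        FindingMapping("myapp.vitalsigns", "sys_blood_pressure",
                       "VS", "SYSBP", "mmHg"),
        FindingMapping("myapp.vitalsigns", "dia_blood_pressure",
                       "VS", "DIABP", "mmHg"),
    )


    def mappings_for(domain):
        """Return the spec entries that feed a given domain."""
        return [m for m in MAPPINGS if m.domain == domain]

A data-driven variant could load the same fields from JSON or a database table
and pass them through the same dataclass, so the validation would be shared by
both options.
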