Post

Data Dictionaries and Semi-Structured Data in Relational Systems

Data Dictionaries and Semi-Structured Data in Relational Systems

In the modern software landscape, the line between strict relational schemas and flexible, document-based storage is blurring. As engineers, we often find ourselves maintaining systems that must be rigid enough to ensure data integrity yet flexible enough to handle third-party payloads or rapidly changing requirements.

Based on my analysis of recent educational materials on data-driven systems, this post explores how we can leverage database metadata (data dictionaries) and effectively manage semi-structured data (XML and JSON) within traditional relational environments like MS SQL Server.

1. The Power of Introspection: Data Dictionaries

Before we dive into data formats, we must understand the environment hosting our data. A Data Dictionary acts as a central meta-store that describes the database itself—tables, columns, types, and constraints.

For a security researcher or a database administrator, the data dictionary is not just documentation; it is an active tool. It allows us to write idempotent scripts—scripts that yield the same result regardless of the database’s initial state—and perform schema migrations safely.

Practical Scenario: Safe Schema Migration

Imagine we are building a deployment pipeline. We need to ensure a temporary staging table exists before an ETL (Extract, Transform, Load) job runs, but we don’t want the script to fail if it already exists or if it has the wrong schema.

Instead of blindly running CREATE TABLE, we inspect the sys.objects or INFORMATION_SCHEMA views.

My Implementation:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
-- Check if our staging table exists using the system catalog
IF EXISTS (
    SELECT 1 
    FROM sys.tables 
    WHERE name = 'Staging_UserImports' 
    AND type = 'U' -- 'U' stands for User Table
)
BEGIN
    PRINT 'Table exists. Dropping for fresh recreation...';
    DROP TABLE [dbo].[Staging_UserImports];
END

-- Safely create the table knowing the name is free
CREATE TABLE [dbo].[Staging_UserImports] (
    ImportID INT IDENTITY(1,1) PRIMARY KEY,
    RawPayload NVARCHAR(MAX),
    ImportDate DATETIME DEFAULT GETDATE()
);

By querying the metadata, we can also audit the system, such as finding all columns that lack specific integrity constraints or listing active user permissions.


2. The Heavyweight Champion: Handling XML

While JSON dominates web APIs today, XML (Extensible Markup Language) remains ubiquitous in enterprise integration, healthcare, and finance due to its strictness and self-descriptive nature.

The Structure

XML differentiates itself with concepts like Namespaces, which prevent tag collisions when merging documents from different organizations (similar to namespaces in C# or Java).

My Implementation: A Financial Configuration Snippet

Below is a C# example using System.Xml.Serialization to handle a strictly typed configuration file, typical in banking software where validation is paramount.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
using System.Xml.Serialization;

// Using attributes to map XML nodes to C# classes
[XmlRoot("SystemConfig", Namespace = "http://secure-bank.com/core")]
public class CoreSystemConfig
{
    [XmlElement("TimeoutSeconds")]
    public int Timeout { get; set; }

    // Representing a complex nested object
    [XmlElement("AuditSettings")]
    public AuditConfig Audit { get; set; }
}

public class AuditConfig 
{
    [XmlAttribute("enabled")]
    public bool IsEnabled { get; set; }
    
    [XmlText]
    public string LogServerUrl { get; set; }
}

Querying with XPath

When you don’t need to deserialize the entire object graph, XPath is the standard query language for selecting nodes. It allows for precise navigation through the document tree using predicates.

My Implementation: Suppose we have a large XML payload of transaction logs and need to find all transactions valued over $10,000 using an XPath selector.

1
/TransactionBatch/Entry[Amount > 10000]

Or filtering by attributes:

1
//Entry[@currency='USD']

3. The Modern Standard: JSON

JSON (JavaScript Object Notation) has become the de facto standard for mobile and web communication because it is compact and parses natively in JavaScript. However, unlike XML, standard JSON lacks comments and explicit namespace support, and data types like Dates are often ambiguous (usually transmitted as strings).

Serialization in .NET

Modern .NET applications leverage System.Text.Json for high-performance serialization.

My Implementation: IoT Telemetry

Consider an IoT backend receiving sensor data.

1
2
3
4
5
6
7
8
9
10
11
public class SensorData
{
    public string DeviceId { get; set; }
    public double Temperature { get; set; }
    // Dealing with the date ambiguity in JSON
    public DateTime Timestamp { get; set; }
}

// Deserializing incoming payload
string rawJson = "{\"DeviceId\":\"Sens-01\", \"Temperature\": 24.5, \"Timestamp\": \"2024-01-13T10:00:00\"}";
var telemetry = System.Text.Json.JsonSerializer.Deserialize<SensorData>(rawJson);

To validate strict structures in JSON, we rely on JSON Schema, which serves a similar role to XSD in XML, defining required fields and types.

4. The Hybrid Approach: Semi-Structured Data in SQL Server

The most powerful takeaway from the material is how relational databases (specifically MS SQL Server) have evolved to host XML and JSON natively. This allows us to avoid “Many-to-Many join hell” when dealing with highly variable data, like product specifications or dynamic forms.

XML Data Type

SQL Server provides a native xml data type. This is superior to storing XML in a text field because the database validates the XML well-formedness upon insertion and allows for indexing.

We can use methods like .query(), .value(), and .modify() directly in T-SQL.

My Implementation: Updating a Medical Record

Imagine a Patients table where vital signs are stored in an XML column because different specialists record different metrics.

1
2
3
4
5
6
7
UPDATE Patients
SET Vitals.modify('
    replace value of 
    (/Vitals/HeartRate/text())[1] 
    with "85"
')
WHERE PatientID = 101;

This updates a specific node inside the XML document without rewriting the entire record.

JSON Integration

SQL Server handles JSON slightly differently. It does not have a dedicated binary JSON type (in older versions); it typically uses NVARCHAR but provides powerful functions to parse it on the fly, such as JSON_VALUE (for scalars) and OPENJSON (for tabular conversion)161616.

My Implementation: Flattening API Responses

If we receive a JSON array of webhook events and store them in a single log entry, we can “explode” that array into rows to join it with our relational tables.

1
2
3
4
5
6
7
8
9
10
11
12
13
DECLARE @jsonPayload NVARCHAR(MAX) = N'[
    {"EventId": 1, "Type": "Login", "User": "admin"},
    {"EventId": 2, "Type": "Logout", "User": "admin"}
]';

-- Convert JSON array into a relational result set
SELECT *
FROM OPENJSON(@jsonPayload)
WITH (
    EventId INT '$.EventId',
    EventType NVARCHAR(50) '$.Type',
    Username NVARCHAR(50) '$.User'
);

This effectively bridges the gap, allowing us to use standard SQL WHERE clauses on semi-structured JSON data17.

5. Summary Comparison

To wrap up, choosing between XML and JSON in your database architecture depends on your specific constraints:

FeatureXMLJSON
Schema ValidationNative (XSD/DTD)External (JSON Schema)
Data TypesRich (via Schema)Basic (String, Number, Bool, Array, Object) 18
SQL StorageNative xml typeNVARCHAR with ISJSON check 19
VerbosityHigh (closing tags)Low (Compact)

By mastering data dictionaries and these hybrid storage techniques, we can build systems that are robust, audit-compliant, and adaptable to future data requirements.


Attribution: This blog post is based on my interpretation and synthesis of the educational notes on Data Dictionaries and Semi-structured Data (BME JUT).

This post is licensed under CC BY 4.0 by the author.