Skip to main content

Data Abstraction and Abstract Data Types

What This Concept Is

A data abstraction is a promise about what a piece of data means, divorced from how it is stored. You commit to a set of constructors (how to make one), selectors (how to read fields), and operations (what you can do with it). Callers go through this interface; no one rummages in the innards.

An abstract data type (ADT) is the name for the whole package: the values, the promised behavior, and the operations -- not the memory layout. Rationals, sets, dictionaries, trees, queues -- all are ADTs. A specific representation (linked list, array, pair of ints) is an implementation detail.

SICP states the rule explicitly in §2.1: draw an abstraction barrier between the code that uses data and the code that implements it. Above the barrier you write what; below it you write how.

Why It Matters Here

Data abstraction is the engineering insurance policy for any non-trivial program:

  • you can change representation without rewriting clients (ints to bignums, list to hash table, flat record to database row)
  • two teammates can develop above and below the barrier in parallel, meeting only at the interface
  • debugging is cheaper: a bug in make-rat lives in one file; a bug in representation-leaking arithmetic scattered through the codebase lives in ten

It also sets up everything downstream. Higher-order procedures (Concept 04) manipulate ADT values opaquely. The interpreter (Concept 10) treats expressions, environments, and procedures as ADTs -- it never knows whether a pair is a cons cell or a Python tuple.

Concrete Example

SICP introduces rationals. The ADT says: a rational is a number n/d with numerator and denominator; you can add, multiply, print them. Nothing in that says "pair of integers."

Constructors and selectors (Scheme):

(define (make-rat n d)
(let ((g (gcd n d)))
(cons (/ n g) (/ d g))))
(define (numer r) (car r))
(define (denom r) (cdr r))

Operations built only from the selectors:

(define (add-rat x y)
(make-rat (+ (* (numer x) (denom y))
(* (numer y) (denom x)))
(* (denom x) (denom y))))

Notice: add-rat never says car or cdr directly. If tomorrow we change the representation to a record, or to a pair of bignums, add-rat does not change.

In C, the same idea appears as a struct + functions:

typedef struct { int n, d; } Rat;
Rat make_rat(int n, int d);
int rat_numer(Rat r);
int rat_denom(Rat r);
Rat add_rat(Rat x, Rat y);

Clients include a header; they never touch .n or .d directly.

Common Confusion / Misconception

Three traps recur in real code:

  1. "It's just a struct." A struct is a representation. You still need to decide whether clients may read its fields directly. If they do, you do not have data abstraction, you have a naming convention.
  2. Representation leak through operations. A Set ADT with a .internal_list method that returns the underlying list hands every client a foot-gun and rewrites the abstraction barrier in invisible ink.
  3. Premature multi-representation. Concept 03 (layering) is about multiple representations. You do not need the dispatch machinery until you actually have two representations. Build the first barrier first.

In Python, exposing internals is so easy (obj.__dict__, attribute access) that the barrier must be culturally enforced. In C, static and opaque typedef struct X; X_t * give stronger mechanical hiding.

How To Use It

When designing an ADT:

  1. Write the operations first, in terms of names that will be stable. (make-set, contains?, union.)
  2. Then pick selectors that the operations need. These are the only legal gateway to the representation.
  3. Only then pick a representation and implement constructors and selectors.
  4. Add an invariant sentence: "always in lowest terms" for rationals, "no duplicates" for sets.
  5. Write one client of the ADT using only operations and selectors; confirm nothing in the client mentions the representation.

Check Yourself

  1. Why is "numerator and denominator stored as two ints" not a sufficient description of the rational ADT?
  2. What is the difference between a representation leak and a performance leak?
  3. In Scheme, what language feature actually enforces the abstraction barrier for rationals -- and what does it rely on?

Mini Drill or Application

Design a Money ADT. Produce:

  • three operations: add, scale, format
  • two selectors: amount, currency
  • a one-line invariant
  • one representation choice (e.g. cents + ISO code)

Implement once in Python, once in Scheme, and once in C (as an opaque struct with functions in a header). Then imagine switching to a long cents type. Which client code must change? If any must change, the barrier leaked.

Read This Only If Stuck