Thinking in relations
What is the relational in “relational database”? Relations, basically typed associations among a number of things, are the unsung concept of data modeling: they apply everywhere, not just in the traditional RDBMS, and they dissolve other data models.
Because no matter the DBMS, you are probably modeling things in the world plus the relationships in which they stand to one another. Any time you model a relationship, that’s a relation. I have a hypothesis that thinking in relations is a useful way to approach data modeling, because it’s an abstract model that you can apply to pretty much any kind of database, and it’s simpler than thinking in terms of concretions like foreign keys and “join tables”.
What are relations?
A relation is a meaningful association among members of one or more sets. Person A and Person B are friends. Project A used Technology B. There are mathematical and philosophical definitions which we’ll visit later, but this is pretty much the platonic form of a relation.
A relation can be thought of as both the type of that association and, by extension, the set of instances of that association. Imagine a relational database table: the relation is both the table’s schema and its contents.
Relations in the database world
Relations show up everywhere, and different data models support them to different degrees.
The more concretely you can represent a relation, the more easily you can compute with it, that is, use it to solve problems. And the better you can align your relations to the way a given DBMS works, the more effectively you can rule out contradictions and invalid states.
Let’s look at how relations map onto different kinds of databases and their data models.
Relational databases: Tables
The classic way to model data for a relational DBMS is to divide your domain into entities and relationships, possibly with the use of an entity–relationship diagram. An entity is a description of something in the world. A relation, then, is like an additional fact that relates two or more entities. A table, in a relational DBMS, is a close stand-in for a relation1, and it can model either an entity or a relationship.
To take an example, let’s model the world of tea. A tea sold by some seller is an entity, and a style of tea is another. That the tea is in one or more styles is a relation.
The cool thing about this kind of relation is that it tends to be small and single-purpose: tea A is in style B. The table for that relation needs only a couple of columns, and it’s unlikely to grow. What’s more, the relation lives outside the entity tables and thus keeps them loosely coupled: neither tea A nor style B depends upon the relation C that relates the two, and that independence is nice for maintainability.
Breaking down your schema into small, independent facts makes normalization the default, and helps separate your “physical” schema (how your data is stored in tables) from your “logical” schema that you present to your application (views that aggregate facts from those tables). Your concrete tables can store small, atomic facts about the data, and your views can aggregate them into abstractions useful for querying and updating.
This takes more legwork, but hypothetically, it gives you gains in maintainability. You can change your physical schema underneath the protective abstraction of the logical schema. It limits your exposure to schema changes, because expanding the model often means just adding another table rather than mutating an existing one. And it helps you avoid destructive updates, since the picture can often be changed by adding and deleting whole facts rather than editing rows in place (taking you closer to an event-sourcing pattern).
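As a rough sketch of that split, borrowing the Prolog notation this post turns to below (names here are hypothetical): the stored facts play the role of the physical schema, and a derived rule plays the role of a view.

% “Physical” layer: small, atomic facts.
tea(da_hong_pao).
style(oolong).
tea_has_style(da_hong_pao, oolong).

% “Logical” layer: a view-like rule that presents the shape
% the application actually queries.
styled_tea(Tea, Style) :-
    tea(Tea),
    style(Style),
    tea_has_style(Tea, Style).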
None of this is giving you capabilities that you didn’t have before: it’s just a clearer way, in my opinion, to look at relational data modeling. A lot of traditional relational DB thinking works at the level of concrete techniques, for example, the use of “join tables” to implement “many-to-many relationships”. I think it’s simpler to think of the database as a collection of facts and relations. It’s clearer to think “a tea is in a style”, a plain fact, than “tea and style have a many-to-many relationship”.
If you recast this as a relation, i.e. just another fact, you gain clarity, because you can now think about it in a more expansive way. For example, if you take a page from the property graph playbook, your relational join table can be looked at as a property graph edge, and becomes an obvious place to qualify the relation with extra properties like date ranges.
Property graphs: Hyperedges
The property graph data model, used by graph databases, models everything as a directed graph of nodes and edges, in which both nodes and edges may contain further key-value properties. So the obvious counterpart of a relation in this model is the edge.
Edges are just a special case of hyperedges, which can relate any number of nodes, not just two. Historically, graph databases haven’t had first-class support for hyperedges, and require workarounds like creating extra entities to represent them. Hyperedges, then, are just relations of degree higher than 2.
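To make the degrees concrete, here is a hypothetical degree-3 relation written as a Prolog fact (the notation this post uses below), followed by the kind of workaround a graph database without hyperedges forces on you: reify the relation as an extra node with ordinary degree-2 edges.

% A degree-3 relation: a seller sold a tea in a given year (hypothetical).
sold(seller_a, da_hong_pao, 2024).

% Without hyperedges: reify the relation as an entity (sale_1)
% and attach three ordinary degree-2 edges to it.
sale_seller(sale_1, seller_a).
sale_tea(sale_1, da_hong_pao).
sale_year(sale_1, 2024).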
Document databases: Composition
We’ll call a document database one that stores large data structures as a single “document”, typically a blob of JSON. The classic example here is stuff like MongoDB, but for our purposes, you could also consider a big XML file the same thing.
In a document situation, relations melt into nested composition of data structures. An entity contains, right there in its document, most of the things it’s related to. Which is to be expected, because denormalization is the reason for being of this kind of database. Relations get broken up and horizontally fragmented across documents of a similar type.
Note the tension here between normalization and explicitness: the more normalized, the more explicit and concrete your relations. The more denormalized, the more implicit.
Prolog: Facts and rules
My experience with Prolog is limited to messing around, but here goes, because this is an interesting case. Logic programming languages like Prolog let you compute relations that you didn’t explicitly store.
A relation can be straightforwardly transcribed into Prolog using facts, and these facts all taken together constitute a knowledge base, akin to a database, which is the heart of any Prolog program. Say we’re building a knowledge base about tea. We can write a fact that da hong pao is an oolong tea:
oolong(da_hong_pao).
Facts of this kind are a little like properties, or attributes. You could redo this in a relational style by making it a fact of degree 2:
tea_has_style(da_hong_pao, oolong).
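Either encoding can then be queried; for example, assuming just the fact above is loaded:

?- tea_has_style(Tea, oolong).
Tea = da_hong_pao.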
So far, a pretty direct encoding of relations. But the real power comes from rules, which let you derive new facts by implication. So instead of stating it directly, we could (probably wrongly) declare a rule that every bruised, oxidized tea counts as an oolong tea. If we know da hong pao is bruised and oxidized, then Prolog will know that da hong pao is an oolong:
bruised(da_hong_pao).
oxidized(da_hong_pao).
oolong(Tea) :- bruised(Tea), oxidized(Tea).
That’s a relation too, even though you didn’t specify it directly! Logic programming is a powerful way to represent relations, since they don’t have to be fully materialized in your data. You only have to tell the language the logical conditions under which some relation holds, and it’ll compute it for you, just the same as if you’d made it an explicit fact.
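Queried, the derived fact behaves exactly like a stored one (assuming just the rule and the two facts above are loaded):

?- oolong(da_hong_pao).
true.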
MiniKanren: Relations
We come to the MiniKanren family of logic programming languages, which are similar to Prolog, but in which relations are even more of a first-class thing. I’m embarrassed that I don’t have a writeup for this, because I actually took a class years ago on MiniKanren taught in part by Will Byrd, and in fact have The Reasoned Schemer sitting around on my shelf. But much of what you can say about Prolog applies here. Watch this space.
RDF: Triples
My experience here is nil, but semantic web data in RDF seems basically a subcase of the property graph model. RDF encodes data in triples, which are directed edges from a subject to an object via a predicate; this means the grammar of RDF is relations of degree 2. (An extension to RDF called RDF* adds support for labeling edges with key/value properties.)
As far as I have seen, RDF cannot natively express hypergraphs, so relations of higher degree are out, unless you resort to workarounds like representing a relation as an entity. That RDF* page shows some of them.
Relations in math and philosophy
People more serious than me are out there studying relations for real, and I hope I haven’t done the topic an injustice. From math and philosophy we can grab plenty of relational concepts useful in data modeling:
In mathematics
Several mathematical properties of relations show up in data modeling.
When a relation R relates thing a to thing b, we say a R b. The degree or arity of a relation is the number of things it relates.
- Symmetric, non-symmetric, and asymmetric relations. A symmetric relation means a R b implies b R a. Asymmetric means the opposite (a R b implies ¬(b R a)), and it’s non-symmetric if a R b doesn’t say anything either way about b R a. Think social networks: Facebook friends are symmetric; Twitter follows are non-symmetric. (What would a social network with asymmetric friendship be? Some kind of creepy stalker thing.)
- Transitive relations. If a R b and b R c, then a R c. This lets you model various kinds of hierarchy, such as containment (region A contains region B which contains region C) or specialization (style A is a subtype of style B which is a subtype of style C). In Prolog, you express transitivity through recursive rules (see the sketch after this list).
- The converse of a relation. Think implication: if a implies b, then the converse is b is implied by a. In graph databases, making this explicit helps you express certain kinds of queries more naturally.
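Here is a minimal sketch of that recursive pattern, with hypothetical region names: only direct containment is stored, and the transitive relation is derived.

% Stored: direct containment only (hypothetical regions).
contains_directly(region_a, region_b).
contains_directly(region_b, region_c).

% Derived: the transitive “contains” relation.
contains(X, Y) :- contains_directly(X, Y).
contains(X, Y) :- contains_directly(X, Z), contains(Z, Y).

?- contains(region_a, region_c).
true.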
In philosophy
Analytical philosophy takes these mathematical definitions and adds further useful properties:
- Internal and external relations. Is a R b a necessary consequence of the natures of a and b (internal)? Or is it some outside circumstance that relates them, which might have been otherwise (external)?
- Unigrade and multigrade relations. Is the arity of R fixed? If so, it’s unigrade; if not, it’s multigrade. In a relational database, a single row in a table can express an instance of a unigrade relation, because the number of columns is fixed. How then to implement a multigrade relation? Say you had a table representing causality, like event A caused event B. Then a single instance of the relation (for example, A and C caused B) could span multiple rows (see the sketch after this list).
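A minimal sketch of that multi-row encoding, with hypothetical event names and each fact standing in for a row of the causality table:

% Each fact is a (cause, effect) row.
% The single instance “A and C caused B” spans two of them.
caused(event_a, event_b).
caused(event_c, event_b).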
Relations have a long history in philosophy, and it’s maybe useful to go look at it so you can really turn the concepts over in your head. As always, the summary on the Stanford Encyclopedia of Philosophy will plunge you straight into the deep end, though most of it is over my head. I don’t understand why, for example, the abstract nature of relations poses any kind of ontological problem. But I don’t do this for a living, and I don’t have experience with the kind of analytical rigor needed to really go after the problems of whether this or that type of relation actually “exists” in the world (to me, it’s an empty question).
Fortunately, for programming, all that’s mostly beside the point. Any kind of relation you can come up with can be said to “exist”; these are all just different formalisms we can use to whatever extent it suits our problem.
Relations in Data and Reality
The OG book on data modeling, William Kent’s Data and Reality, describes a few more properties of relations (therein called relationships) that we can use. Its chapter on relations offers a much more learned exploration of the topic of this page: the different types of relations are supported by actual databases only to varying degrees, and that informs how rich a data model we can put in a given database.
Role
The role, for Kent, is the basic meaning of the relation, and of each position in the relation. For example, in the ownership relation, the roles of the two ends would be “owner” and “owned”.
Domain and category
In math, the domain and range of a relation are defined as the type (or set) of values that can occur at a position in the relation. Kent calls such types “categories”, and uses “domain” to refer to the set of categories that might occur in that position in the relation, because you can come up with relations that make sense for multiple types of entities. In his example, the domain of an “owns” relation might contain the categories employee, department, and company, because these are all things that can own something.
Here’s where we hit a difficulty with relational databases: this kind of polymorphism is clearly really useful for data modeling, but we can’t do it with foreign keys, because a foreign key constraint can only reference a single table. At best, you can have a tuple of [foreign key, which table] with no constraint, and sacrifice referential integrity. To model this situation, we have to start breaking things into multiple, similar tables.
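For contrast, here is a hypothetical sketch in Prolog (the notation used elsewhere in this post), where nothing stops the owner position from ranging over several categories:

% Categories (hypothetical entities).
employee(alice).
department(shipping).
company(acme).

% The “owns” relation: its first position accepts any of the
% categories above, with no foreign-key-style restriction.
owns(alice, laptop_42).
owns(shipping, forklift_7).
owns(acme, warehouse_9).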
Complexity (cardinality)
What we’d probably call the cardinality of a relation, Kent calls “complexity”. That is, is the relation one-to-one? One-to-many? Or many-to-many? I’d personally include optionality under this heading; i.e. whether or not a relation is optional for a given type of entity. (Note that this is possibly the same thing as the internal/external distinction from above. A non-optional relation is probably an internal one, and an optional one is surely external.)
As is known, relational databases use foreign keys together with uniqueness constraints to model this situation.
Implication/composition
Per Kent, a relation might be implied by the presence of two or more others. In his example, an employee works in a division of a company if they work for a department that’s part of that division.
This is where logic programming systems like Prolog shine, because to model that in a traditional database, you would have to store, rather than compute, the full closure of these composite relations. But Prolog can derive it from facts and rules right out of the box:
works_for_division(Division, Employee) :-
    works_for_department(D, Employee),
    department_is_in_division(D, Division).
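With a couple of hypothetical facts loaded alongside that rule, the composite relation can be queried without ever being stored:

works_for_department(accounting, alice).
department_is_in_division(accounting, finance).

?- works_for_division(Division, alice).
Division = finance.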
Kent mentions the converse of a relation in this context, too. For example, to be the parent of someone implies that someone is your child. Again, Prolog:
child_of(P1, P2) :- parent_of(P2, P1).
Attributes vs. relations
See chapter 5 in Data and Reality. What is the difference between attributes and relations? It gets blurry when you look at it hard, because pretty much any attribute can be modeled as a relation. (See, for example, the Cell programming language, which in fact does this.)
I would like to suggest a midwit answer to this puzzle: an attribute is a relation to a category whose members it is unnatural to think of as separate entities. This could be for any number of reasons.
For example, you could in fact model Kent’s example, “Henry Jones weighs 175 pounds”, as a relation from the entity “Henry Jones” to the entity “175 pounds”. But “175 pounds” would be a rather contrived entity; it feels trivial, or needless. There’s a limitless number of weights; is each of them really an entity? There’s nothing that distinguishes 175 pounds in kind from 174 pounds, or 176 pounds. Nor does it represent anything tangible, anything that could exist by itself. To model numeric values, or anything of similar cardinality, as an entity in your data model feels intuitively like the wrong way to do it.
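Written down as a relation, Kent’s example is a perfectly ordinary degree-2 fact (a hypothetical encoding); the awkwardness only appears if you insist on treating the second position as a full-blown entity:

% “Henry Jones weighs 175 pounds” as a degree-2 relation.
weighs_pounds(henry_jones, 175).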
A “proper” entity, on the other hand, tends to be a composite of these simple attributes: say, an employee, representing different aspects of something tangible and in the world.