Universal Data Analytics as Semantic Spacetime (Part 3)

Part 3: The basics of space and time, and the most familiar multi-model databases you never knew…

In case you hadn’t noticed, there’s a very important multi-model database hidden in plain sight, right under our noses. We might never think about it in those terms, but we rely on it every day for everything we do. It’s the very space around us! We keep stuff on shelves, organize it in drawers, or leave it lying around. In IT, we created another shared space for our stuff, using another multi-model database — the good old hierarchical file system, consisting of all the directories and files on our disks. From there, we build other structures including key-lookup databases. At risk of pointing out the obvious, it’s worth a quick post to reflect on the similarities between these data sharing models, and see the practicalities surrounding multi-model database design — and the features we’ll use to describe semantic spacetime.

This post is elementary for more experienced programmers, but may be of interest to general readers.

Data with coordinates!

Databases are basically storage systems, warehouses, or parking lots, for data or even real goods, depending on how you look at it. Our world is less of an oyster than it is a database.

The contents of RAM or the surface of a disk are divided into blocks, which are numbered sequentially with addresses, like parking spaces, and are used to park data for later retrieval. Dividing space into fixed-size coordinate blocks helps to make it predictable, regular, and reusable. We build all kinds of abstractions on top of this basic model to arrange the parked information into virtual views, which look like files, directories, hierarchies, ledgers, or graphs.
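To make the parking-space picture concrete, here is a minimal sketch of how a position in storage maps onto a numbered block. The block size and the function name are assumptions chosen purely for illustration, not a description of any particular disk or file system.

```go
package main

import "fmt"

// blockSize is a hypothetical block size in bytes, chosen for illustration.
const blockSize = 4096

// locate converts a byte offset into a block number and a position inside that block.
func locate(offset int) (block, within int) {
	return offset / blockSize, offset % blockSize
}

func main() {
	block, within := locate(10000)
	fmt.Printf("offset 10000 lives in block %d, at byte %d of that block\n", block, within)
}
```

Because every block has the same size, finding the right "parking space" is simple arithmetic rather than a search.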

So, at a low level, memory of any kind is like a cargo ship, piled high with containers, all the same size, but containing different data. Without some kind of coordinates, map, or index of what’s in which container, finding anything in the pile would be a long and laborious process of systematic searching. Databases offer these maps in a wide variety, as well as query languages (of which SQL is the most famous). Let’s compare some of these views and how we approach them with our tools, Go and ArangoDB.
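The difference between searching the pile and consulting a map can be sketched in a few lines of Go. The container type and the function names here are hypothetical, invented only to illustrate the idea.

```go
package main

import "fmt"

// container is a hypothetical slot holding a label and some data.
type container struct {
	Label string
	Data  string
}

// scan looks into every container in turn until the label matches.
func scan(pile []container, label string) (int, bool) {
	for i, c := range pile {
		if c.Label == label {
			return i, true
		}
	}
	return -1, false
}

// buildIndex records the position of every label, so later lookups need no searching.
func buildIndex(pile []container) map[string]int {
	index := make(map[string]int, len(pile))
	for i, c := range pile {
		index[c.Label] = i
	}
	return index
}

func main() {
	pile := []container{{"tea", "..."}, {"bikes", "..."}, {"books", "..."}}

	pos, found := scan(pile, "books")
	fmt.Println("by searching:", pos, found)

	index := buildIndex(pile)
	fmt.Println("by index:", index["books"])
}
```

The scan gets slower as the pile grows, while the index costs some extra space and upkeep in exchange for a direct lookup.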

Containment structures and data values

ArangoDB’s approach to storage uses a hierarchy of container structures, with a few layers, to slice and dice data:

Database — a named data repository (like a disk volume).
Collection — a set of related records, under a common role name (like a directory or folder). Collections don’t normally represent different types of entity (as in SQL tables) but rather documents that play a similar role in a computation.
Document — a single structured record, like a programming “struct”, a form, or a file within a folder.
Values / data — the contents of a document in a variety of data formats: numbers, pictures, text, etc.
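As a rough mental model only, the layers above can be mimicked with nested Go maps. This is not the ArangoDB driver API; the type names and the lookup helper are assumptions made for the sake of illustration.

```go
package main

import "fmt"

// Document maps attribute names to values of any kind.
type Document map[string]interface{}

// Collection maps a document key to a document.
type Collection map[string]Document

// Database maps a collection name to a collection.
type Database map[string]Collection

// lookup walks down the layers: collection, then document, then attribute.
func lookup(db Database, coll, key, attr string) (interface{}, bool) {
	// Indexing a missing (nil) map is safe in Go and simply yields "not found".
	v, ok := db[coll][key][attr]
	return v, ok
}

func main() {
	db := Database{
		"Sensors": Collection{
			"kitchen": Document{"kind": "thermometer", "celsius": 21.5},
		},
	}
	v, ok := lookup(db, "Sensors", "kitchen", "celsius")
	fmt.Println(v, ok)
}
```

Each layer is just a named container for the next one down, which is all the hierarchy really says.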

Data types are combinations of the primitive data types: integers, floating point numbers, strings, and byte arrays. In older languages, using the ASCII representation, strings and byte arrays were the same thing, because every character fit into a single byte. In Unicode encodings such as UTF-8, a single character may need more than one byte. This means that strings and byte arrays sometimes have to be converted into one another.
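A small sketch shows the distinction in Go, where strings are stored as UTF-8 bytes. The helper name is made up for this example; the conversions and utf8.RuneCountInString come from the standard language and library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCount reports how many bytes and how many characters (runes) a string holds.
func byteAndRuneCount(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	s := "na\u00efve" // "naive" written with an i-diaeresis, which needs two bytes in UTF-8

	b := []byte(s)    // convert the string to a byte array
	back := string(b) // and convert the byte array back to a string

	nBytes, nRunes := byteAndRuneCount(s)
	fmt.Println(nBytes, nRunes, back == s)
}
```

The byte count and the character count disagree as soon as a string leaves plain ASCII, which is why the two types are kept apart.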

Let’s not forget why we want to keep data together under a single banner. This often has to do with context or the provenance of data. If data are measured under a single experiment, under the same conditions, they would belong together somehow. On the other hand, for data that are measured under different conditions, it’s important to highlight that so that we can distinguish them. It’s about chain of custody, the back-story of how results were found, and therefore what they mean in broader terms.

Data semantics and “schemas”

The shadow side of data is how we interpret the values we obtain. This has two aspects:

Naming of data types to distinguish interpretations
Matching methods and their interpretations to data types

In programming, semantics are basically associated with a unique name.

Data: [Type|Category|name] → Attributes | Behaviours

There are ways of naming types differently to make semantics explicit, but this adds a layer of bureaucracy that many researchers would find arduous.
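As one possible illustration of naming types to carry meaning, Go lets us give a new name to an existing primitive type and attach behaviour to that name. The temperature types below are hypothetical examples, not part of any library.

```go
package main

import "fmt"

// Celsius and Fahrenheit are hypothetical named types:
// the same kind of number underneath, but with different interpretations.
type Celsius float64
type Fahrenheit float64

// ToFahrenheit is a behaviour attached to the Celsius type by its name.
func (c Celsius) ToFahrenheit() Fahrenheit {
	return Fahrenheit(c*9/5 + 32)
}

func main() {
	boiling := Celsius(100)
	fmt.Println(boiling.ToFahrenheit())

	// var f Fahrenheit = boiling
	// The line above would not compile: the names keep the interpretations apart.
}
```

The extra conversions are exactly the kind of bureaucracy mentioned above: safer, but more to write.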

In Go programming, related data values can be collected under a “container type” using structured types. The syntax is similar to the old C syntax, except for the order of type and name.

type xyz struct {
    Member1 int
    Member2 string
    Member3 float64
}

A minor stumbling block in Go is that members can only be referred to from outside the package that defines them if they begin with a capital letter (facepalm).
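The capital-letter rule matters in practice, because libraries that turn structs into stored records, such as the standard encoding/json package, can only see the capitalized members. Here is a minimal sketch; the Reading type and its fields are invented for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Reading is a hypothetical record type used only for this illustration.
type Reading struct {
	Sensor string  // capitalized: visible to other packages, including encoding/json
	Value  float64 // capitalized: also visible
	note   string  // lower case: silently skipped when encoding
}

// encode returns the JSON form of a Reading.
func encode(r Reading) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}

func main() {
	out, err := encode(Reading{Sensor: "kitchen", Value: 21.5, note: "calibrated"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out) // the note member does not appear in the output
}
```

If a member mysteriously fails to show up in a stored document, a lower-case first letter is the first thing to check.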

The term “schema” was usually associated with a rigid template for data, from the Relational (SQL) database model. Modern “document oriented databases” allow templates to be ad hoc and extensible. This freedom is powerful, but can also lead to inconsistency without discipline. A completely general multi-model database, like ArangoDB, offers great freedom, and thus hands a certain responsibility to the user to maintain self-discipline.
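One simple form of self-discipline is to check that every free-form document still carries the attributes we rely on. The sketch below assumes documents arrive as JSON text; the list of required names and the function name are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// required lists hypothetical attribute names that every document should carry.
var required = []string{"name", "kind"}

// missing decodes a JSON document with no fixed template and reports
// which of the required attributes are absent.
func missing(raw string) ([]string, error) {
	var doc map[string]interface{}
	if err := json.Unmarshal([]byte(raw), &doc); err != nil {
		return nil, err
	}
	var absent []string
	for _, key := range required {
		if _, ok := doc[key]; !ok {
			absent = append(absent, key)
		}
	}
	return absent, nil
}

func main() {
	// The extra "battery" attribute is allowed; the missing "kind" is reported.
	absent, err := missing(`{"name": "probe-7", "battery": 0.8}`)
	fmt.Println(absent, err)
}
```

Extra attributes pass through untouched, so the template stays extensible while the essentials are still enforced.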

How we choose to store data is a technical issue — a matter of templating and naming — basically a kind of bureaucracy. Some major models include:

Data as files, archived by category, for related sets of numbers, text, films, etc.
Spreadsheets / tables / ledgers of related transactions or data points, usually used for numbers in experiments or financial transactions.
Taxonomy or hierarchy of inheritance for characteristics

For each of these choices, there is a matching database technology.

If we think about an application like the Internet of Things, the “things” are of many different kinds: devices, sensors, phones, and, of course, humans! They have different characteristics, which we capture as different attribute types. We could represent them as different “forms” to fill out, which translate into different “document types” or different “schemas” in the database sense.
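As a sketch of what such forms might look like in Go, here are two invented struct types with different attributes, each rendered as a JSON document. The type names, field names, and sample values are all assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Sensor and Person are hypothetical "forms" with different attributes.
type Sensor struct {
	Name    string  `json:"name"`
	Unit    string  `json:"unit"`
	Reading float64 `json:"reading"`
}

type Person struct {
	Name  string `json:"name"`
	Email string `json:"email"`
}

// toDocument renders any filled-in form as a JSON document.
func toDocument(form interface{}) (string, error) {
	b, err := json.Marshal(form)
	return string(b), err
}

func main() {
	things := []interface{}{
		Sensor{Name: "thermo-1", Unit: "C", Reading: 21.5},
		Person{Name: "Alice", Email: "alice@example.com"},
	}
	for _, thing := range things {
		doc, err := toDocument(thing)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(doc)
	}
}
```

Each struct plays the role of a form, and each encoded result is a document whose shape follows from the form that produced it.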