Structure, context, semantics – What does it all mean?

Lately, driven by the immense coverage that Big Data is getting, some old terms are getting a lot of airplay, and several of them are interrelated and interdependent. These terms include structured (vs. unstructured) data, metadata, context and semantics. None of them is new, but they are all important because they relate to what the (raw) data means.

So what is raw data?

Basically, it is data that comes directly from a data source before any processing has been applied. Let’s look at a concrete example: a sensor produces a measurement of a physical quantity like temperature, light, acceleration, etc. Typically, this sensor produces an electrical signal proportional to this quantity. The signal is then digitized, and we essentially now have a stream of bits that represents the measured quantity over time (e.g., the temperature every second). But if we transmit these bits to some recipient who needs to do something with this data, how will they know that these are not just random bits but actually represent the temperature at a certain place, updated every second? A stream of bits without additional information is not very useful (I apologize to all the Information Theory geeks out there for the gross oversimplification).

So we need some data about the (raw) data: metadata. How does this work? At the start of each data packet we add a header consisting of a known number of bits that encode the time of transmission, the location where the data was created, the number of raw data points in the data packet, the type of data (e.g., temperature) and any other required information. The recipient of this information can now easily store the data in a spreadsheet or relational database and, voilà, we have introduced structure into what was originally unstructured. But this is not the end of the story – all we have done is add some basic information about the data.
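To make the header idea concrete, here is a minimal sketch in Python. The field layout, the type codes and the function names are all assumptions made up for illustration, not a real protocol: a fixed-size header carries the timestamp, location, data type and sample count, followed by the raw samples.

```python
import struct

# Hypothetical header layout (an assumption for illustration only):
#   timestamp  uint32   seconds since the Unix epoch
#   latitude   float64  degrees
#   longitude  float64  degrees
#   type code  uint8    see DATA_TYPES below
#   count      uint16   number of raw samples that follow
# "!" selects network byte order with no padding between fields.
HEADER_FORMAT = "!IddBH"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)

# Hypothetical type codes
DATA_TYPES = {1: "temperature", 2: "light", 3: "acceleration"}


def build_packet(timestamp, lat, lon, type_code, samples):
    """Prefix the raw samples (signed 16-bit integers) with the header."""
    header = struct.pack(HEADER_FORMAT, timestamp, lat, lon, type_code, len(samples))
    payload = struct.pack(f"!{len(samples)}h", *samples)
    return header + payload


def parse_packet(packet):
    """Use the header to turn a stream of bytes into a structured record."""
    timestamp, lat, lon, type_code, count = struct.unpack(
        HEADER_FORMAT, packet[:HEADER_SIZE]
    )
    samples = struct.unpack(
        f"!{count}h", packet[HEADER_SIZE:HEADER_SIZE + 2 * count]
    )
    return {
        "timestamp": timestamp,
        "lat": lat,
        "lon": lon,
        "type": DATA_TYPES.get(type_code, "unknown"),
        "samples": list(samples),
    }


# Three temperature readings, in hundredths of a degree Celsius
packet = build_packet(1400000000, 40.7128, -74.006, 1, [2150, 2152, 2149])
record = parse_packet(packet)
print(record["type"], record["samples"])
```

Without the header, the recipient sees only the six payload bytes; with it, each record maps naturally onto a row in a spreadsheet or database table.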

So far, we have looked at a single sensor (or data source) without looking at related/nearby sensors, the environment around the sensor, relevant historical data, etc. In other words, we have not looked at the context (although one could argue that a timestamp and a geographical location are part of the context). For example, adverse weather can influence the reliability of the sensor and therefore of the data it produces; if we have multiple sensors of different types in the same area, we can apply sensor fusion to improve performance and accuracy. Context is therefore an additional layer that, when used with processing and analysis (e.g., sensor fusion), enriches the data.
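As one simple illustration of sensor fusion, the sketch below combines readings from several sensors using inverse-variance weighting. This is just one common technique, chosen here as an example; it assumes the sensors measure the same quantity with independent noise and that each sensor's variance is known. The function name and the numbers are hypothetical.

```python
def fuse(readings):
    """Combine (value, variance) pairs by inverse-variance weighting.

    Less noisy sensors (smaller variance) get a larger weight.
    Returns the fused value and the variance of the fused estimate.
    """
    weights = [1.0 / variance for _, variance in readings]
    total = sum(weights)
    value = sum(w * v for w, (v, _) in zip(weights, readings)) / total
    return value, 1.0 / total


# Two temperature sensors in the same area; the second is noisier,
# for example because it is exposed to bad weather (context!).
value, variance = fuse([(20.0, 0.25), (22.0, 1.0)])
print(f"fused: {value:.1f} degrees, variance: {variance:.2f}")
```

The fused variance is smaller than that of either sensor alone, which is the sense in which fusion improves accuracy.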

Some of us may find simple time series data (like temperature) pretty boring and “simple”, so what about a more complex data source such as a video camera? As an aside, in 2014 there were an estimated 210 million video surveillance cameras and a whopping 2 billion smartphones (with an implied built-in video camera). Video cameras are based on sensor arrays that convert light into an electrical signal. There are millions of sensors in the array (hence megapixel), which produce one frame every fraction of a second (e.g., live video in the US implies 30 frames per second). The video camera digitizes the signals, encodes the data (implying massive compression) and formats the data into packets. Here too, there is some form of header describing things like the resolution (number of pixels), encoding format, GPS coordinates and many more parameters. So we get a video clip with some basic metadata, but we have no idea what the clip is about. OK, you say, we could just look at it and know. Sure, but suppose you would like a machine (i.e., a computer) to automatically extract the portions of a clip that are interesting or significant (because sifting through 3 months of footage from a zillion cameras is way too labor intensive, not to mention boring and therefore error prone).
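To get a feel for why the encoding step implies massive compression, here is a back-of-the-envelope calculation. The resolution, color depth and compressed bitrate are assumed, illustrative values, not figures from any particular camera.

```python
# Illustrative assumptions (not taken from any particular camera):
WIDTH, HEIGHT = 1920, 1080              # roughly a 2-megapixel frame
BYTES_PER_PIXEL = 3                     # 8 bits each for red, green and blue
FPS = 30                                # frames per second
COMPRESSED_BITS_PER_SECOND = 4_000_000  # assumed encoded stream of 4 Mbit/s


def raw_bits_per_second(width, height, bytes_per_pixel, fps):
    """Bit rate of the uncompressed video stream."""
    return width * height * bytes_per_pixel * 8 * fps


raw = raw_bits_per_second(WIDTH, HEIGHT, BYTES_PER_PIXEL, FPS)
ratio = raw / COMPRESSED_BITS_PER_SECOND
print(f"raw: {raw / 1e9:.2f} Gbit/s, compression ratio: about {ratio:.0f}:1")
# prints: raw: 1.49 Gbit/s, compression ratio: about 373:1
```

Under these assumptions, the encoder has to shrink the stream by a factor of several hundred before it is packetized.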

This brings us to semantics, the study of meaning, and it is the meaningful that is interesting and significant (a synonym of meaningful). What we need to do, therefore, is to apply video analytics to the clips to extract some meaningful description of what happens in them (see one of my prior posts on the topic of what we can learn from video). Examples include a person entering a restricted area, the facial expression of a person looking at a specific ad, etc.
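As a toy illustration of the "person entering a restricted area" case, the sketch below assumes that some upstream video analytics step (not shown, and entirely hypothetical here) has already produced per-frame detections for a single tracked object. The zone coordinates, the Detection class and the function names are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    frame: int   # frame number in the clip
    label: str   # object class reported by the (hypothetical) detector
    x: float     # center of the bounding box, in pixels
    y: float


# Hypothetical restricted zone: (x_min, y_min, x_max, y_max) in pixels
RESTRICTED_ZONE = (100, 50, 300, 200)


def in_zone(detection, zone):
    x_min, y_min, x_max, y_max = zone
    return x_min <= detection.x <= x_max and y_min <= detection.y <= y_max


def entry_events(detections, zone=RESTRICTED_ZONE):
    """Return the frames at which a person moves from outside to inside the zone.

    Simplification: assumes at most one tracked person in the clip.
    """
    events = []
    was_inside = False
    for d in sorted(detections, key=lambda d: d.frame):
        if d.label != "person":
            continue
        inside = in_zone(d, zone)
        if inside and not was_inside:
            events.append(d.frame)
        was_inside = inside
    return events


clip = [
    Detection(1, "person", 50, 50),    # outside the zone
    Detection(2, "person", 150, 100),  # enters
    Detection(3, "person", 160, 110),  # still inside
    Detection(4, "person", 400, 100),  # leaves
    Detection(5, "person", 200, 150),  # enters again
]
print(entry_events(clip))  # prints: [2, 5]
```

The output is no longer pixels or bounding boxes but a short list of events, which is the kind of meaningful description a human reviewer actually cares about.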

The common thread of this whole discussion is the question “What does the data mean?” Extracting meaning from the data is a hierarchical process. Data processing and analytics are successively applied to get higher levels of abstraction, so that lower-level (raw) data is transformed into meaningful information. A model that captures this sort of hierarchy is the DIKW (Data, Information, Knowledge, Wisdom) model and a good and relevant graphic can be found here. As you climb the DIKW pyramid, simple features are transformed into complex symbols (not unlike many processes in the brain), and as you get closer to the summit, more sophisticated tools are needed, such as knowledge-based systems, which originally grew out of early Artificial Intelligence (AI) efforts.

Those among you with philosophical tendencies should look at the field of epistemology (i.e., the theory of knowledge), which deals with what knowledge is, how it can be acquired and other related topics. Of particular relevance is empiricism, which emphasizes sensory experience (or sense data) and brings us back to the beginning, namely how we gain knowledge from simple, raw data.