## dataset-design-and-temporal-concurrency
Introduction
Numerous articles have been published on data engineering and on using data to answer business inquiries. Best practices, techniques, and technology are concepts one will inevitably encounter. What I’d like to present are some design concepts drawn from my own experience, specifically dealing with the time-centric aspects of data engineering in the context of deconstructing analytic problem statements.
This article serves as the introduction to a short series drawing from my experiences as a data practitioner, specifically focusing on lessons I’ve learned in dealing with temporal co-occurrence. Two overarching goals that guide my design thinking in general, but especially when resolving temporal concurrency, are as follows:
- Reduce trial-and-error data engineering iterations by using a methodical treatment of the ontology (temporal ontology for this article series) of the relevant data sources and of the semantic relationships between the data sources and the problem statement at hand.
- Increase the likelihood of creating a minimal number of datasets that both satisfy the immediate task at hand and allow for some ability to address related inquiries not yet posed.
Moving forward, the base unit of time will be days, mainly because I have not had to deal with finer-grained units of time, but also because it’s simply easier to think and converse in terms of days.
Definitions & Conventions
- \(D_{i,j}\): a tabular dataset of \(i\) rows and \(j\) columns
- \(\Omega_k\): Row-based dataset qualifiers; specifically, a set of \(k\) rules/criteria \((\omega_1, \omega_2, \omega_3, \cdots, \omega_k)\), given as predicate statements, that qualify the rows of dataset \(D\) such that \(D_\Omega \subset D\).
- \(Y|X\): Condition, read as “Y given X” or “Y conditioned on X”: an existential constraint that subsets the left-hand side to those cases where the right-hand side exists or is true. For example, \(\text{Cost}_\text{item}|\big\{\text{Id}_\text{item}=90210\big\}\) is interpreted as “Item cost where item ID equals 90210”.
- \(\land, \lor\): Logical “and” and “or”, respectively
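To make the notation concrete, here is a minimal Python sketch of row qualification and conditioning. Everything in it is an assumption for illustration only: the toy dataset, the column names (`item_id`, `category`, `cost`), the helper `qualify`, and the second rule in `omega` are hypothetical. Only the item ID 90210 comes from the example above.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Rule = Callable[[Row], bool]

# Hypothetical dataset D with 4 rows and 3 columns (illustrative values only).
D: List[Row] = [
    {"item_id": 90210, "category": "A", "cost": 12.50},
    {"item_id": 90211, "category": "A", "cost": 7.25},
    {"item_id": 90210, "category": "B", "cost": 13.00},
    {"item_id": 90212, "category": "B", "cost": 4.75},
]


def qualify(dataset: List[Row], rules: List[Rule]) -> List[Row]:
    """Return D_Omega: the rows that satisfy every rule (rules joined by logical AND)."""
    return [row for row in dataset if all(rule(row) for rule in rules)]


# Omega with k = 2 predicate rules (omega_1 AND omega_2).
omega: List[Rule] = [
    lambda row: row["item_id"] == 90210,
    lambda row: row["cost"] > 12.75,
]

# D_Omega is a subset of D.
d_omega = qualify(D, omega)

# Cost_item | {Id_item = 90210}: item cost where item ID equals 90210.
cost_given_id = [
    row["cost"] for row in qualify(D, [lambda row: row["item_id"] == 90210])
]
```

Treating each \(\omega\) as a standalone predicate keeps the rules individually testable, and combining them with `all` mirrors the \(\land\) convention; an \(\lor\) combination would use `any` instead.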
Column (δ) Taxonomy
- δI : Information-carrying columns
- δG : Grouping columns (categorical, descriptive)
- δY : Measurements (e.g., purchase price, height, product ratings)
- δT : Temporal columns to include dates and temporal hierarchies
- δE : Record life-cycle tracking columns (for example, effective dates in slowly changing dimension parlance)
A Motivating Example
Consider a business request submitted 2024-02-11 and stated as follows [demonstration purposes only, so cut me some slack 😏]:
"I want to know trends related to total cost of care; inpatient average lengths of stay; lapses in medication adherence; and member counts for the period between January first of 2019 and the end of 2020. Med lapses should show monthly totals and cumulative monthly totals. Pull members between 30 and 50 years old and have had at least two inpatient visits within a six-week period. I need to see results by month; all services received and corresponding facilities; and member demographics."
Problem Re-statement
I mentioned consideration of “the semantic relationships between data source and the problem statement” in my introductory remarks. Two principles I always keep in mind when working with data are as follows:
Life is data, but data is not life
Life produces information, and data encodes it, always with some loss of information. In the reverse direction, data must be interpreted back to the real world in order to be considered information, precisely because information ties back to the real world. This semantic map should always be kept in mind.
The language of data follows the language of people
A basic framework for interpreting and communicating information is language. Language involves syntax and structure to make distinctions among information and provide meaning on the basis of those distinctions. Since data encodes information, it loosely follows the semantics and syntax of human language.
With that in mind, the business request can be restated in terms of the following:
| Who? | When? | What? | How? |
These will be addressed in subsequent posts:
- Part 1 will cover Who? and When?
- Part 2 will address What?
- Part 3 will cover How? as well as provide some concluding thoughts
Before closing things out, I want to note that the articles in this mini-series will be lengthy. The audience I have in mind is less-experienced data practitioners who haven’t had much exposure to navigating the challenges of data retrieval with temporal concurrency. Take your time: there’s no need to rush through the content, and you can always bookmark and come back later =) To reiterate an earlier point, the idea is to spend time working through the logic of a request before coding begins. Errors in logic are easier to detect using such an approach, which can help avoid writing “spaghetti” code.
I look forward to seeing you in Part 1!
Until next time, I wish you much success in your journey as a data practitioner!