
Dataset Design: Temporal Concurrency - How

## dataset-design-and-temporal-concurrency

  • δI : Information-carrying columns
  • δG : Grouping columns (categorical, descriptive)
  • δY : Measurements (e.g., purchase price, height, product ratings)
  • δT : Temporal columns to include dates and temporal hierarchies
  • δE : Record life-cycle tracking columns (for example, effective dates in slowly changing dimension parlance)

Welcome back!

In Part 2, we discussed the quantitative building blocks of the problem statement we’ve been working with. In this closing article, I’ll provide some considerations related to addressing temporal characteristics for the how of the problem statement.

The Setup

First, let’s review the key criteria governing how to display or interact with results:

\(H\): “… I need to see results by month; all services received and corresponding facilities; and member demographics.”

  • \(h_1\): Results by month
  • \(h_2\): Services and facilities
  • \(h_3\): Member demographics

The goal is to put everything covered so far together by expressing \(\big\langle \gamma_1, \gamma_2, \gamma_3, \gamma_4, \gamma_5 \big\rangle\) and \(\big\langle W, \omega_1, \omega_2, \omega_3 \big\rangle\) in the context of \(\big\langle h_1, h_2, h_3 \big\rangle\):

  • \(f(W, \omega_i)\to \gamma_j\)

The what, derived from the when and the who.

  • \(g(\gamma_j, h_k):= R_{\gamma\times h}\)

The \(\gamma\)-by-\(h\) report matrix, which defines how each metric is aggregated under each reporting criterion.
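
To make the idea of a report matrix concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the records, the column names (`month`, `facility`, `age_band`, `cost`), and the two stand-in metrics are hypothetical and are not the \(\gamma_j\) defined earlier in the series. It only shows the shape of \(R\): one aggregated result per (metric, criterion) pair.

```python
from collections import defaultdict

# Hypothetical, already-derived records: one row per member and service.
# Column names and values are illustrative only; they are not from the series.
records = [
    {"member": "A", "month": "2023-01", "facility": "North", "age_band": "18-34", "cost": 100.0},
    {"member": "A", "month": "2023-02", "facility": "North", "age_band": "18-34", "cost": 50.0},
    {"member": "B", "month": "2023-01", "facility": "South", "age_band": "35-64", "cost": 200.0},
    {"member": "C", "month": "2023-02", "facility": "South", "age_band": "35-64", "cost": 25.0},
]

# Stand-ins for h_1, h_2, h_3: the column each reporting criterion groups by.
H = {"h1_month": "month", "h2_facility": "facility", "h3_demographic": "age_band"}

# Stand-ins for the metrics: a name plus a function of the rows in one group.
METRICS = {
    "total_cost": lambda rows: sum(r["cost"] for r in rows),
    "member_count": lambda rows: len({r["member"] for r in rows}),
}


def report_matrix(rows, metrics, criteria):
    """Return {(metric, criterion): {group_value: aggregate}}."""
    matrix = {}
    for h_name, column in criteria.items():
        # Bucket the rows by the value of this criterion's column.
        groups = defaultdict(list)
        for row in rows:
            groups[row[column]].append(row)
        # Apply every metric to every bucket.
        for m_name, aggregate in metrics.items():
            matrix[(m_name, h_name)] = {k: aggregate(v) for k, v in groups.items()}
    return matrix


R = report_matrix(records, METRICS, H)
```

Each cell of `R` is itself a small table keyed by group value. This works cleanly here only because the hypothetical rows already carry a single, unambiguous month.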

The Fun Part

If \(h\) were limited to \(h_2\) and \(h_3\), the report matrix \(R\) would be quite straightforward to derive: simply aggregate each \(\gamma_j\) by each \(h_k\).

That’s not the fun part. 😏

The fun part is taking \(h_1\) into account, which adds a layer of complexity to the report matrix that varies by metric.

  • For example, consider \(\gamma_4\), an easy metric to aggregate across each of \(h_k\).
  • Contrast that with \(\gamma_1\), which presents a temporal windowing problem with respect to \(h_1\) since both involve windows of time:

Multiple members with multiple events spanning multiple segments of time. This should immediately raise some questions, such as:

  • How does one select the appropriate window of time for each member and event?
    • By event end or beginning?
    • By any event within the monthly window?
  • How does one address multiple events for a single member within a monthly window?
  • Bonus question: Is the arithmetic mean the best way to aggregate the data in the first place?
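
The sketch below illustrates why the first two questions matter. It is not a recommendation: the three events are made up, and the rule names (`"start"`, `"end"`, `"overlap"`) are hypothetical labels for the options listed above. The same events are counted per month under each rule so the results can be compared.

```python
from collections import Counter
from datetime import date

# Hypothetical events: (member, start, end). Illustrative values only.
events = [
    ("A", date(2023, 1, 20), date(2023, 2, 5)),   # straddles January and February
    ("A", date(2023, 2, 10), date(2023, 2, 12)),  # second event for member A in February
    ("B", date(2023, 1, 3), date(2023, 3, 15)),   # spans three months
]


def month_key(d):
    """Reduce a date to a (year, month) reporting bucket."""
    return (d.year, d.month)


def months_overlapped(start, end):
    """Every (year, month) touched by the closed interval [start, end]."""
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append((year, month))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


def count_events(events, rule):
    """Count events per month under one assignment rule."""
    counts = Counter()
    for _member, start, end in events:
        if rule == "start":
            counts[month_key(start)] += 1
        elif rule == "end":
            counts[month_key(end)] += 1
        elif rule == "overlap":
            # The event is counted once in every month it touches.
            counts.update(months_overlapped(start, end))
        else:
            raise ValueError(f"unknown rule: {rule}")
    return dict(counts)


by_start = count_events(events, "start")
by_end = count_events(events, "end")
by_overlap = count_events(events, "overlap")
```

With these three events, January holds two events by start date, none by end date, and two by overlap, while the overlap rule counts six event-months in total from only three events. Any average computed on top of these counts inherits whichever rule was chosen, which is why the choice needs to be settled with the business rather than in code.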

Fortunately, the goal here isn’t to resolve these questions but to illustrate the complexities of temporal concurrency at various levels. In my experience, the temporal aspects of an analytic use case can be very challenging, often requiring multiple iterations of discovery with business stakeholders. As a data practitioner, developing a sense of how time can affect an analytics initiative beforehand will make those conversations with the business easier to navigate.

🎉Congratulations!

You’ve made it to the end of this series. Just a few closing thoughts:

  • I encourage you to get in the habit of framing your problem statements in terms of the what, when, who, and how. It’s a simple way to ensure that you’re considering the right questions at the right time (slight pun intended). Don’t be in a rush to jump from problem statement to code: interrogate the problem statement like a detective in a ’70s TV show.

  • Always investigate how time affects your problem statement as it relates to inclusion criteria, metrics derivation, and reporting. The way you go about retrieving data or feature engineering can be greatly influenced by the temporal aspects of your problem.

  • If possible, during the discovery phase of your analytics project, determine what form the results will take in the final data product:

    • Knowing this can help guide your thought process when considering how much data to wrangle and what engineering tasks should occur within the data product vs. upstream in the data pipeline.
    • This is especially important when the analysis has temporal dynamics tied to dynamically calculated metrics: for example, moving averages or cumulative sums in an interactive dashboard that responds to user input such as filters and date range sliders.
  • Finally, dealing with temporal concurrency can be a real challenge, but don’t back down from that challenge. Speaking from personal experience, once you become comfortable with the complexities of time, you’ll find new ways to approach analytic initiatives and be able to enter into more complex problem spaces with confidence.

Until next time, I wish you much success in your journey as a data practitioner!

Life is data, but data is not life: analyze responsibly!