In 2018, I was involved in designing and building a workflow that would allow sharing data between multiple products of the same company on a single enterprise platform. The company had an enterprise contract with Microsoft Azure, with almost all of its infrastructure built and deployed across Azure regions. Building this workflow using the tools available in Azure was a no-brainer.
Stack
- azure-data-factory
- azure-kubernetes-services
- azure-functions
- dotnet-core
- node
- elasticsearch
- kanban
- OKRs
- azure-devops
Challenges
We had legacy systems built out of the reporting needs of individual products. These point solutions grabbed data from other products by hooking directly into their databases and data sources. Most of them involved custom background processes, written in dotnet or SSIS, that ran on a schedule to pull, transform, and load data into stores local to the products.
These solutions added cognitive load on the product teams, who had to learn and duplicate business rules in the transformations (T) that could not be captured in the source data sources themselves.
This created a dependency management nightmare, with product and schema changes having side effects beyond the domain boundary of the source product. If Product A changed, renamed, or deleted a particular data table, the change broke the data pipeline of Product B, whose team may or may not have had the capacity to work on the failures at that moment. Sometimes these failures were reported by the client instead of being caught by the product development CI/CD processes.
Solution
We onboarded products to the platform one by one using a domain-driven contract. Instead of exporting individual tables, products exported domain entities to a shared namespace. This made the source product more conscious of how domain changes would be consumed by downstream client products.
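To illustrate the idea, a domain contract might look something like the sketch below. The entity name and fields are hypothetical, not the actual contracts the products shipped; the point is that consumers depend on a versioned domain entity rather than on the producer's internal table layout.

```typescript
// Illustrative only: the entity shape and field names are hypothetical,
// not the actual contracts used on the platform.
interface CustomerOrder {
  schemaVersion: "1.0";
  orderId: string;
  customerId: string;
  placedAtUtc: string;     // ISO-8601 timestamp
  status: "open" | "fulfilled" | "cancelled";
}

// Each export drop is a batch of entities plus metadata about the export itself.
interface DomainExport<T> {
  entity: string;          // e.g. "CustomerOrder"
  schemaVersion: string;
  exportedAtUtc: string;
  records: T[];
}
```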
Implementation
Producers
- We initially set up a storage account that let us implement a namespace system for the products to share their data (a sketch of the producer write path follows this list)
- This allowed us to use Microsoft Entra ID (then Azure Active Directory) to handle the authentication and authorization bits.
- Products used their own domain knowledge to come up with different ways to export their domain models to the storage account. This prevented a situation where a single team was overloaded with the domain knowledge of every single product
- The solution enabled integrating data from third parties without implementing point solutions per product
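For a concrete sense of the producer side, here is a minimal sketch of the write path using today's Azure JavaScript SDKs (we ran different versions at the time). The account name, container name, and path convention are placeholders; the idea is that each producer authenticates through Entra ID and only writes under its own prefix in the shared namespace.

```typescript
import { BlobServiceClient } from "@azure/storage-blob";
import { DefaultAzureCredential } from "@azure/identity";

// Hypothetical namespace convention: one container for the shared namespace,
// with a per-product prefix so each producer only writes under its own path.
// Authentication goes through Entra ID, so no storage keys are shared around.
const account = "sharedplatformdata";        // placeholder account name
const containerName = "domain-exports";      // placeholder container name

async function publishExport(product: string, entity: string, payload: object) {
  const service = new BlobServiceClient(
    `https://${account}.blob.core.windows.net`,
    new DefaultAzureCredential()
  );
  const container = service.getContainerClient(containerName);

  // e.g. product-a/CustomerOrder/2018-06-01T12:00:00.000Z.json
  const blobName = `${product}/${entity}/${new Date().toISOString()}.json`;
  const body = JSON.stringify(payload);

  await container
    .getBlockBlobClient(blobName)
    .upload(body, Buffer.byteLength(body), {
      blobHTTPHeaders: { blobContentType: "application/json" },
    });
}
```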
Consumers
The platform provided an opportunity to set up a solution that could report across all data points, something that had long been a pipe dream of the company's CPO but could not previously be accomplished without heavy investment.
The main actors of this reporting domain consisted of
- catalog
- orchestrator
- indexer
catalog We set up a small microservice that scraped metadata and cataloged it into a datastore. An API on top of it let the catalog be consumed by a React front end for product managers.
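A rough sketch of what the catalog service might expose, with a hypothetical entry shape and endpoints; the real service persisted entries to a datastore rather than keeping them in memory.

```typescript
import express from "express";

// Hypothetical shape of a catalog entry; the real service scraped this
// metadata from the export namespace and stored it in a datastore.
interface CatalogEntry {
  product: string;          // producing product, e.g. "product-a"
  entity: string;           // exported domain entity name
  schemaVersion: string;
  lastExportedAtUtc: string;
  path: string;             // blob prefix where the latest export lives
}

const entries: CatalogEntry[] = []; // stand-in for the real datastore

const app = express();

// The React front end for product managers queried endpoints like these.
app.get("/api/catalog", (_req, res) => {
  res.json(entries);
});

app.get("/api/catalog/:product", (req, res) => {
  res.json(entries.filter((e) => e.product === req.params.product));
});

app.listen(3000);
```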
orchestrator We set up an Azure Data Factory instance to orchestrate our pipeline. Because of custom requirements around when the workflow needed to be triggered, we built a service on Azure Functions to manage the triggering bits. The pipeline used Azure Data Lake Analytics workspaces to perform transformations, and the results were dumped back into a private namespace of an Azure Storage account.
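The triggering service was essentially a thin wrapper that decided when the conditions for a run were met and then started the Data Factory pipeline. Below is a hedged sketch using the current @azure/functions and @azure/arm-datafactory packages (not the exact SDKs or function model we used at the time); the subscription, resource group, factory, and pipeline names are placeholders.

```typescript
import { app, HttpRequest, HttpResponseInit } from "@azure/functions";
import { DataFactoryManagementClient } from "@azure/arm-datafactory";
import { DefaultAzureCredential } from "@azure/identity";

// Placeholder identifiers: substitute the real subscription, resource group,
// factory, and pipeline names.
const subscriptionId = "<subscription-id>";
const resourceGroup = "<resource-group>";
const factoryName = "<data-factory-name>";
const pipelineName = "<pipeline-name>";

// The custom trigger logic lived in a function like this: it decides whether
// the conditions for a run are met (e.g. all expected exports have landed)
// and then kicks off the Data Factory pipeline on demand.
app.http("triggerPipeline", {
  methods: ["POST"],
  authLevel: "function",
  handler: async (_req: HttpRequest): Promise<HttpResponseInit> => {
    const client = new DataFactoryManagementClient(
      new DefaultAzureCredential(),
      subscriptionId
    );

    const run = await client.pipelines.createRun(
      resourceGroup,
      factoryName,
      pipelineName
    );

    return { status: 202, jsonBody: { runId: run.runId } };
  },
});
```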
indexer We used Elasticsearch as the OLAP data store. A Node-based indexer deployed on Azure Kubernetes Service would pull the data from the storage account and index it into ES. More on how we scaled this indexer later.
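A simplified sketch of the indexer loop, assuming the export layout from the producer sketch above; the storage account, container, and index names are placeholders, and the real deployment on AKS handled the batching and scaling concerns omitted here.

```typescript
import { BlobServiceClient } from "@azure/storage-blob";
import { DefaultAzureCredential } from "@azure/identity";
import { Client } from "@elastic/elasticsearch";

// Placeholder connection details; the real indexer ran as a deployment on AKS.
const account = "sharedplatformdata";
const containerName = "domain-exports";
const es = new Client({ node: "http://elasticsearch:9200" });

// Pull every export blob under a product/entity prefix, parse it, and bulk
// index the records into an entity-specific Elasticsearch index.
async function indexEntity(product: string, entity: string) {
  const container = new BlobServiceClient(
    `https://${account}.blob.core.windows.net`,
    new DefaultAzureCredential()
  ).getContainerClient(containerName);

  const docs: object[] = [];
  for await (const blob of container.listBlobsFlat({ prefix: `${product}/${entity}/` })) {
    const buffer = await container.getBlobClient(blob.name).downloadToBuffer();
    docs.push(...JSON.parse(buffer.toString("utf8")).records);
  }

  // The bulk helper batches the index requests for us.
  await es.helpers.bulk({
    datasource: docs,
    onDocument: () => ({ index: { _index: `${product}-${entity}`.toLowerCase() } }),
  });
}
```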