I was working on a project that involved munging data generated by multiple data sources, transforming it, and making it available to a downstream service that indexed this data into a data store suited to running analytical workloads at runtime.

stack used: azure-data-factory flink azure-function azure-entra azure-storage-account elasticsearch

What were the challenges?

The original team that built this workflow used Flink, which is an awesome tool for transforming data streams. However, this decision required the development team to manage its own Flink installation (it was 2019, the company I worked for ran on Azure, and at that point in time Azure did not have a managed Flink service).

There was a custom front-runner service that built the Flink queries from a custom query language. TL;DR: this language was proprietary and essentially not googleable when the people using it ran into issues or did not understand how to get their work done.

engineering bits

We were a small team and had to balance moving off the existing Flink setup onto a platform that (a) was available as a managed service in Azure at that point in time, and (b) was more manageable in terms of defining queries, not restricted to the domain knowledge of a few individuals who had other responsibilities in the team.

We decided to go ahead with moving the data pipeline to Azure Data Factory. We managed the infrastructure across its multiple environments using Infrastructure as Code on top of Terraform. We used Azure Functions written in dotnet (as that was the language people were most skilled in at the company) to figure out the dynamic jobs and tasks that needed to be executed.
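To make the "dynamic jobs" idea concrete, here is a minimal sketch of what such a function computes. All names and paths are hypothetical, and the real functions were written in dotnet; this Python version only illustrates the shape of the output a pipeline activity would iterate over.

```python
# Hypothetical sketch: given the source datasets discovered for a run,
# emit job definitions that a Data Factory ForEach activity could loop
# over. Paths and naming are assumptions, not the real layout.

def build_jobs(sources):
    """Turn source dataset names into plain, JSON-serialisable job specs."""
    return [
        {
            "name": f"transform-{source}",
            "input_path": f"raw/{source}",      # assumed landing zone
            "output_path": f"staged/{source}",  # assumed intermediate results area
        }
        for source in sources
    ]
```

Returning plain dictionaries like this keeps the function a thin "planner": the pipeline itself stays static, and only the list of work items changes per run.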

The transformations were moved to U-SQL (something that bit us later down the line, as Azure moved away from U-SQL and Azure Data Lake Analytics altogether and shifted its focus to Azure Synapse).

The Azure Data Factory pipelines were set up to write their transformed output into Azure storage accounts, which acted as intermediate results for the downstream services. This helped us decouple the downstream services from the Flink / Azure Data Factory implementation and let us migrate piecemeal: the team moved small chunks of workload from Flink to Azure Data Factory at a time. This also helped the team learn from its mistakes and improve upon them without incurring the wrath of downstream consumers.

learnings

  • The latest technology is not always the right choice. You have to consider the skill set available in your team.
  • SOLID principles used in software engineering can be applied to data engineering as well. In fact, there is no way around them: moving all the resources / pipelines at once and flipping a big ON/OFF toggle between the systems would have spelt disaster.
  • We learnt a lot about Azure Data Lake Analytics along the way, its pros and cons around resource management and costs, which helped us think about and design data pipelines with a larger cost footprint.