I was working as a full time Senior Software Developer in a project that indexed data generated by multiple data sources. This legacy system had outgrown from its initial single responsibility of “managing indicies in ElasticSearch” to handle more things were vague but still could be bucketed in the managing indicies bucket
The product that used this service had a requirement for a data source that could handle analytical query workload. This was 2018, and with the skills and experience available at hand Elasticsearch was chosen the data store. As this was a multi-tenant service it was decided to have an index per client and a metadata index to hold a mapping between tenant-id and index
stack used: azure-functions
azure-storage-queue
azure-storage
What were the challenges ?
Upstream systems would clean/join/transform multiple datasources and produce an intermediate result which was stored at an azure storage account on a nighthly cadance. Each such “upstream job” had an id
which would determine the location of the files. The indexer would then index these files as part of its “job”.
The indexer had the following roles which it performed most of the times
- figure out when the current indexing job completed and trigger upstream job (to a different service which would generate the intermediate result)
- poll the upstream system to check on when the next job could be started
- index the multiple files (usually ~100GBs per job)
- manage life cycle of intermediate results from nightly jobs
There were multiple challenges with the scope of the service and how the tasks were implemented
- Even though the indexes were multi tenant, the process was NOT. this was a huge red flag as it was possible that bad data / issues from one tenant could technically affect other clients
- General debugging was too difficult and performing root cause analysis would take couple of days which would affect the Quality of Service
- The service was not horizontally scalable. It was a node based process that spawned child processes to handle the files. There was a limit on the number of processes that could be spawned which was capped to the number of processors available on the server where this was deployed.
engineering bits
scaling horizontally
I moved to a manager-worker pattern where one instance of the service act as a manager and would generate azure-storage-queue tickets for each of the files that needed to be processed. The same service could also be deployed as a worker instance, which would listent to tickets and process them.
transparency
This helped to make the workload on the service more transparent as we could track metrics on the azure-storage-queue.
I deployed this service in a one-to-many manager worker configuration. This helped improved the performance of the system by a factor of N (n = number of workers). There was an non trival overhead of managing sending the tickets and polling (and retrying) for completeness of a worker Task
, however it was managable for the performance boost we were getting.
tenant based workload
This architecture helped upstream jobs to generate intermediate results per tenant. This gave us the power to process an individual tenant whenever it was required and also helped isolating failures per tenant.
responsibility
We were able to carve out a responsibility per thread in the manager service. We moved non trivial but non core responsibilities like storage account management out of this service altogether using the management services provided by azure storage account out of the box (you can setup life cycle management rules on azure storage account that move service level tier or delete blobs based on configuration)
learnings
- refactor > rewrite
- scope the responsibilities of your services. You won’t get this right the first or second time, however measuring the right metrics on your service and applying the feedback is crucial.