I worked on a project that involved sending surveys to students enrolled in higher education degree institutions at the end of the semester.

serverless azure-functions storage queue micro-service scaling

What were the challenges?

  • We used a legacy Windows service built on .NET Framework, with a crusty cron-style job system built on top of Quartz, to send emails
  • The process was clunky and could not scale to the demands of the growing product
  • This was a multi-tenant service; however, issues in one tenant affected the performance and quality of service offered to other tenants
  • There were too many unknowns about whether and when the emails would be delivered; the older service was largely a spaghetti implementation
  • It lacked tools for internal support people and clients to manage expectations and achieve their goals

Solution?

  • Scalable: We designed and implemented an end-to-end replacement built on top of Azure Functions and Azure Storage tables and queues. Azure Durable Functions were not out yet.
  • We emphasized making the system transparent, so that developers, internal support staff and clients alike could troubleshoot and manage the workflow
  • Release strategy: Email as a presentation layer gives you zero room for mistakes, because a sent email cannot be corrected. We released iteratively, moving a few tenants over at a time

engineering bits

the system consisted of four actors

  • Campaign Manager : creates a job
  • Job Manager : manages a single job
  • Task Manager : manages a single task in a job
  • Status Manager : manages statuses

The email campaigns were set up through a RESTful API that captured a Campaign domain entity. A Campaign defined when to send emails and the criteria for selecting the Recipients (for example, send email to all students who were enrolled in Spring 2024 CSE456).
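A minimal sketch of what such a Campaign entity might look like. The field names (`tenant_id`, `send_at`, `term`, `course_code`) and the `is_due` helper are assumptions for illustration; the original entity was more involved and is not reproduced here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RecipientCriteria:
    # Hypothetical fields: one possible shape for the recipient criteria.
    term: str          # e.g. "Spring 2024"
    course_code: str   # e.g. "CSE456"


@dataclass(frozen=True)
class Campaign:
    campaign_id: str
    tenant_id: str
    send_at: datetime            # when the emails should go out (UTC)
    criteria: RecipientCriteria  # who should receive them

    def is_due(self, now: datetime) -> bool:
        """A campaign is due once its scheduled send time has passed."""
        return self.send_at <= now


campaign = Campaign(
    campaign_id="c-1",
    tenant_id="tenant-a",
    send_at=datetime(2024, 5, 1, 9, 0, tzinfo=timezone.utc),
    criteria=RecipientCriteria(term="Spring 2024", course_code="CSE456"),
)
```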

Campaign Manager

  • A timer-based Azure function would poll the database for scheduled Campaigns that were due to be sent out. For each Campaign it would create a Job ticket and put it on the job-queue
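The polling step can be sketched roughly as follows. This is a plain Python simulation, not Azure Functions code: the list and the `deque` stand in for the database and the job-queue, and the `job_created` flag used to avoid enqueuing the same campaign twice is an assumption about how duplicates might be prevented.

```python
import json
from collections import deque
from datetime import datetime, timezone

# In-memory stand-ins (assumptions): the real system polled a database
# and wrote to an Azure Storage queue.
scheduled_campaigns = [
    {"campaign_id": "c-1",
     "send_at": datetime(2024, 5, 1, 9, 0, tzinfo=timezone.utc),
     "job_created": False},
    {"campaign_id": "c-2",
     "send_at": datetime(2024, 6, 1, 9, 0, tzinfo=timezone.utc),
     "job_created": False},
]
job_queue = deque()


def campaign_manager_tick(now):
    """One timer tick: enqueue a job ticket for every campaign that is due."""
    created = 0
    for campaign in scheduled_campaigns:
        if campaign["job_created"] or campaign["send_at"] > now:
            continue
        ticket = {"job_id": f"job-{campaign['campaign_id']}",
                  "campaign_id": campaign["campaign_id"]}
        job_queue.append(json.dumps(ticket))
        # Mark the campaign so the next tick does not enqueue it again.
        campaign["job_created"] = True
        created += 1
    return created


now = datetime(2024, 5, 2, tzinfo=timezone.utc)
first_tick = campaign_manager_tick(now)   # only c-1 is due
second_tick = campaign_manager_tick(now)  # nothing new to enqueue
```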

Job Manager

  • A queue-triggered Azure function, JobManager, would listen to the job-queue and be responsible for managing the Job. A JobManager instance would hydrate the criteria for the Job and generate a roster of Recipients who would get the email.
  • It would break the job into Tasks, one Task record per Recipient, and persist each Task in table storage
  • It would then publish a ticket for each Task to the task-queue
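The fan-out performed by the job manager might look something like this sketch. Again this is an in-memory simulation: `campaign_store`, `enrollments`, `task_table` and `task_queue` are hypothetical stand-ins for the campaign database, the enrollment data, table storage and the task-queue, and the record fields are illustrative.

```python
import json
from collections import deque

# Hypothetical stand-ins for the campaign database and enrollment data.
campaign_store = {"c-1": {"term": "Spring 2024", "course_code": "CSE456"}}
enrollments = {
    ("Spring 2024", "CSE456"): [
        "ada@example.edu", "alan@example.edu", "grace@example.edu",
    ],
}

task_table = {}       # stand-in for table storage, keyed by (job_id, recipient)
task_queue = deque()  # stand-in for the task-queue


def job_manager(job_ticket_json):
    """Handle one job ticket: build the roster, then fan out one task each."""
    job = json.loads(job_ticket_json)
    # Hydrate the criteria for this job from the campaign it belongs to.
    criteria = campaign_store[job["campaign_id"]]
    roster = enrollments.get((criteria["term"], criteria["course_code"]), [])
    for email in roster:
        # Persist first, then enqueue, so a ticket never points at a missing record.
        task_table[(job["job_id"], email)] = {"status": "Pending"}
        task_queue.append(json.dumps({"job_id": job["job_id"], "recipient": email}))
    return len(roster)


task_count = job_manager(json.dumps({"job_id": "job-c-1", "campaign_id": "c-1"}))
```

Persisting the record before publishing the ticket is a design choice in this sketch, so that a task manager never picks up a ticket whose record does not exist yet.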

Task Manager

  • A queue-triggered Azure function, TaskManager, would listen to the task-queue. The single responsibility of the task manager was to complete the task (sending the email to the address associated with the Recipient)
  • It would send the email with appropriate retries
  • It would update the status of the Recipient record in the storage table
  • We deployed these Azure functions on a consumption plan, which allowed us to scale as needed.
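As a rough illustration of "send with retries, then record the outcome", here is a sketch with a fake email provider. `TransientSendError`, `make_flaky_sender`, the `max_attempts` default and the `Sent` / `Failed` status values are all assumptions for the example; the real retry policy and third-party client are not shown in this post.

```python
import json

# Stand-in for the Recipient records in table storage.
recipient_table = {
    ("job-c-1", "ada@example.edu"): {"status": "Pending", "attempts": 0},
    ("job-c-1", "alan@example.edu"): {"status": "Pending", "attempts": 0},
}


class TransientSendError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def make_flaky_sender(failures_before_success):
    """Fake email provider: fails a fixed number of times, then succeeds."""
    remaining = {"n": failures_before_success}

    def send(recipient):
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise TransientSendError(f"temporary failure for {recipient}")

    return send


def task_manager(task_ticket_json, send_email, max_attempts=3):
    """Complete one task: try to send, retry on transient errors, record status."""
    task = json.loads(task_ticket_json)
    record = recipient_table[(task["job_id"], task["recipient"])]
    for attempt in range(1, max_attempts + 1):
        record["attempts"] = attempt
        try:
            send_email(task["recipient"])
        except TransientSendError:
            continue  # a real function would back off before the next attempt
        record["status"] = "Sent"
        return record["status"]
    record["status"] = "Failed"  # retries exhausted; still a terminal status
    return record["status"]


ada = json.dumps({"job_id": "job-c-1", "recipient": "ada@example.edu"})
alan = json.dumps({"job_id": "job-c-1", "recipient": "alan@example.edu"})
status_ok = task_manager(ada, make_flaky_sender(2))    # succeeds on attempt 3
status_bad = task_manager(alan, make_flaky_sender(5))  # gives up after 3 attempts
```

Recording a terminal `Failed` status, rather than leaving the record pending, is what lets a job be closed even when some emails could not be delivered.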

Status Manager

  • A timer-based Azure function would poll the in-progress jobs and update each job's status from the Recipient records maintained by the task managers, closing a job once all of its emails had been processed (successfully or with errors)
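The roll-up logic could be sketched like this. The status names (`InProgress`, `Completed`, `CompletedWithErrors`, `Sent`, `Failed`, `Pending`) and the two dictionaries standing in for the storage tables are assumptions for illustration.

```python
# Stand-ins for the job and Recipient tables.
job_table = {
    "job-c-1": {"status": "InProgress"},
    "job-c-2": {"status": "InProgress"},
}
recipient_table = {
    ("job-c-1", "ada@example.edu"): {"status": "Sent"},
    ("job-c-1", "alan@example.edu"): {"status": "Failed"},
    ("job-c-2", "grace@example.edu"): {"status": "Sent"},
    ("job-c-2", "linus@example.edu"): {"status": "Pending"},
}

TERMINAL_STATUSES = {"Sent", "Failed"}


def status_manager_tick():
    """One timer tick: close every in-progress job whose recipients are all done."""
    closed = []
    for job_id, job in job_table.items():
        if job["status"] != "InProgress":
            continue
        statuses = [rec["status"]
                    for (rec_job_id, _), rec in recipient_table.items()
                    if rec_job_id == job_id]
        # A job closes only when every recipient reached a terminal status.
        if statuses and all(s in TERMINAL_STATUSES for s in statuses):
            job["status"] = ("CompletedWithErrors" if "Failed" in statuses
                             else "Completed")
            closed.append(job_id)
    return closed


closed_jobs = status_manager_tick()
```

Here `job-c-1` is closed with errors, while `job-c-2` stays in progress because one recipient is still pending.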

This architecture let us scale the number of campaigns we could handle and, independently, the number of emails we could send in parallel. We used a third-party service to send the actual email.

Future

  • As we measured the size of the audience that future campaigns were targeting, we could plan ahead and scale our infrastructure components accordingly.

learnings

  • refactor » rewrite
  • account for failures. have a strategy for retry.
  • have an incremental strategy to release big rewrites in cases where you cannot refactor services. In this case we found a point where we could move the email campaigns for a small set of clients onto the new service and monitor failures / quality of service
  • have metrics drive the success of the story. In this rewrite we covered our bases by measuring everything, from the time taken to email an entire campaign to the number of support issues for newly migrated clients