I worked on a project that involved sending end-of-semester surveys to students enrolled at higher-education institutions.
serverless
azure-functions
storage
queue
micro-service
scaling
What were the challenges?
- We used a legacy Windows service built on .NET Framework, with a crusty cron-job system built on top of Quartz, to send emails
- The process was chunky and could not scale to the demands of the growing product
- The service was multi-tenant, so issues in one tenant affected the performance and quality of service offered to other tenants
- There were too many unknowns about if and when the emails would be delivered; the older service was a spaghetti implementation
- It lacked tooling for internal support staff and clients to manage expectations and achieve their goals
Solution?
- We designed and implemented a scalable, end-to-end replacement built on top of Azure Functions and Azure Storage tables and queues. (Azure Durable Functions had not been released yet.)
- Emphasis was placed on making the system transparent to troubleshoot, so that developers, internal support staff, and clients alike could manage the workflow
- Email as a presentation layer gives you zero room for mistakes: once sent, an email cannot be corrected. So the release strategy was iterative, moving a few tenants over at a time
engineering bits
the system consisted of four actors:
- Campaign Manager : creates a job
- Job Manager : manages a single job
- Task Manager : manages a single task in a job
- Status Manager : manages statuses
the email campaigns were set up via a RESTful API that captured a `Campaign`
domain entity. It defined when to send the emails and the criteria for selecting the `Recipient`s
(e.g., send the email to all students enrolled in Spring 2024 CSE456)
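To make the shape of that domain entity concrete, here is a minimal sketch of what a `Campaign` could look like. The field names (`campaign_id`, `send_at`, `criteria`) are hypothetical illustrations, not the actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Campaign:
    """A scheduled email campaign (hypothetical shape, not the real schema)."""
    campaign_id: str
    send_at: datetime                              # when to send the emails
    criteria: dict = field(default_factory=dict)   # recipient selection rules

# e.g. survey every student enrolled in Spring 2024 CSE456
survey = Campaign(
    campaign_id="c-001",
    send_at=datetime(2024, 5, 10, 9, 0),
    criteria={"term": "Spring 2024", "course": "CSE456"},
)
print(survey.criteria["course"])  # CSE456
```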
Campaign Manager
- A timer-based Azure Function would poll the database for scheduled `Campaign`s that were due to be sent out. For each `Campaign` it would create a `Job` ticket and put it on the `job-queue`
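In spirit, the Campaign Manager's polling step reduces to "select due campaigns, enqueue one job ticket each". Here is a sketch under stated assumptions: an in-memory deque stands in for the Azure Storage `job-queue`, the campaign list stands in for the database, and all names are hypothetical:

```python
from collections import deque
from datetime import datetime

job_queue = deque()  # stand-in for the Azure Storage job-queue

# stand-in for the campaigns table in the database
campaigns = [
    {"id": "c-001", "send_at": datetime(2024, 5, 10), "status": "scheduled"},
    {"id": "c-002", "send_at": datetime(2099, 1, 1), "status": "scheduled"},
]

def poll_campaigns(now: datetime) -> None:
    """Timer-trigger body: enqueue a Job ticket for every Campaign that is due."""
    for c in campaigns:
        if c["status"] == "scheduled" and c["send_at"] <= now:
            job_queue.append({"job_id": f"job-{c['id']}", "campaign_id": c["id"]})
            c["status"] = "queued"  # avoid re-enqueueing on the next poll

poll_campaigns(datetime(2024, 5, 11))
print([j["job_id"] for j in job_queue])  # ['job-c-001']
```

Marking the campaign `queued` inside the same poll is what keeps a repeating timer trigger from creating duplicate job tickets.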
Job Manager
- A queue-triggered Azure Function, `JobManager`, would listen to the `job-queue` and be responsible for managing the `Job`. A job-manager instance would hydrate the criteria for the `Job` and generate a roster of `Recipient`s who would get the email
- It would break the job into multiple `Task` records, one per recipient; the tasks would be persisted in table storage
- AND it would publish a ticket in the `task-queue`
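The fan-out step above can be sketched as follows. A dict stands in for the `Task` records in table storage and a deque for the `task-queue`; the identifiers and record shapes are made up for illustration:

```python
from collections import deque

task_queue = deque()   # stand-in for the Azure Storage task-queue
task_table = {}        # stand-in for the Task records in table storage

def handle_job(job: dict, recipients: list) -> None:
    """Queue-trigger body: fan a Job out into one Task per Recipient."""
    for i, email in enumerate(recipients):
        task_id = f"{job['job_id']}-t{i}"
        # persist the Task record first, then publish the ticket,
        # so a ticket never points at a missing record
        task_table[task_id] = {"recipient": email, "status": "pending"}
        task_queue.append({"task_id": task_id, "job_id": job["job_id"]})

# roster hydrated from the Campaign criteria (hypothetical addresses)
roster = ["a@example.edu", "b@example.edu", "c@example.edu"]
handle_job({"job_id": "job-c-001"}, roster)
print(len(task_table), len(task_queue))  # 3 3
```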
Task Manager
- A queue-triggered Azure Function, `TaskManager`, would listen to the `task-queue`. The single responsibility of the task manager was to complete the task (sending the email to the address associated with the `Recipient`)
- It would send the email with appropriate retries
- It would update the status of the `Recipient` record in the storage table
- We deployed these Azure Functions on a consumption plan, which let us scale as much as we needed
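The "send with appropriate retries" step might look roughly like the sketch below: retry on transient errors with exponential backoff, and record a terminal status either way. The function names and the flaky test double are hypothetical:

```python
import time

def send_with_retries(send, recipient, max_attempts=3, base_delay=0.0):
    """Try to send an email, backing off between attempts; return the final status."""
    for attempt in range(1, max_attempts + 1):
        try:
            send(recipient)
            return "sent"
        except Exception:
            if attempt == max_attempts:
                return "failed"                         # terminal, recorded in table storage
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff

# simulated provider call that fails twice with a transient error, then succeeds
calls = {"n": 0}
def flaky_send(recipient):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient SMTP error")

result = send_with_retries(flaky_send, "student@example.edu")
print(result, calls["n"])  # sent 3
```

Returning a terminal status ("sent" or "failed") rather than raising is what lets the Status Manager later roll every task up into a closed job.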
Status Manager
- A timer-based Azure Function would poll the in-progress jobs and update their status from the list of `Recipient` records managed by the task managers, thereby closing a job once all emails were processed (successfully or with errors)
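The roll-up rule the Status Manager applies can be sketched as: a job stays in progress until every `Recipient` record has reached a terminal status, then closes (with or without errors). The status names here are illustrative, not the real ones:

```python
TERMINAL = {"sent", "failed"}

def job_status(recipient_statuses):
    """Roll per-Recipient statuses up into a job status; close once all are terminal."""
    if all(s in TERMINAL for s in recipient_statuses):
        return "closed_with_errors" if "failed" in recipient_statuses else "closed"
    return "in_progress"

print(job_status(["sent", "pending", "sent"]))  # in_progress
print(job_status(["sent", "failed", "sent"]))   # closed_with_errors
print(job_status(["sent", "sent", "sent"]))     # closed
```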
this architecture let us scale the number of campaigns we could handle and, independently, the number of emails we could send in parallel. We used a third-party service to send the actual emails.
Future
- by measuring the size of the audience future campaigns were targeting, we could plan ahead and scale our infrastructure components accordingly
learnings
- refactor » rewrite
- account for failures. have a strategy for retry.
- have an incremental strategy for releasing big rewrites when you cannot refactor the service in place. In our case we found a point where we could move the email campaigns for a small set of clients onto the new service and monitor failures and quality of service
- let metrics drive the success of the story. In this rewrite we covered our bases by measuring everything, from the time taken to email an entire campaign to the number of support issues raised by newly migrated clients