Almost every business has a busiest day. Cyber Monday for consumer tech. Summer holiday bookings peak in January. Valentine's Day and Mother's Day for florists. A single day can make or break a business.
For universities, A-level results day is that day. This is when clearing gets into full swing.
The difference between a good and a bad results day is millions of pounds in revenue. This isn’t just theory. During 2019 clearing, a prominent university lost millions in revenue when its network failed due to inadequate failsafes. Its head of tech was soon without a job.
Building a failsafe for universities brings a set of unique challenges. Codiance has managed several clearing days for universities as part of Codiance Higher - our Enterprise CMS for Universities. More broadly, we’ve built always-on robust enterprise systems for pharmacies during the peak of COVID-19 and global companies operating in hundreds of markets.
Why Clearing Is Different
So why do universities have unique considerations? For clearing, prospective students will receive an email from UCAS that will direct them to relevant university pages. Traffic will either go directly to the university homepage or specific clearing pages.
All course and clearing data is integrated within the University CMS. At the same time, many contributors are updating content while thousands of students are accessing it, continuously.
So unlike many high-traffic websites and systems, a University CMS is very large, is continually being updated and could have over 100,000 assets.
So the fundamental consideration is very straightforward: increased traffic. Clearing day will see a tenfold increase in traffic. We need a robust failsafe.
In addition, with 25k+ pages, any fallback instantly introduces cache concerns that you wouldn't have with smaller websites. Several approaches support ‘no downtime’ in theory, but we needed one robust enough to deliver it for a site of this size.
Why a Load Balancer wasn’t the Solution
We initially looked at Azure Traffic Manager (TM). Unfortunately, as a simple traffic load balancer, it presented a key disadvantage when load testing.
Essentially, TM routes traffic geographically, without any knowledge of server load. As all clearing requests come from the same geographic region, TM sends all traffic to the same server. That made it a poor fit for clearing.
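To illustrate the difference, here is a minimal Python sketch comparing region-only routing with load-aware routing. It is a toy model, not Traffic Manager's actual behaviour or configuration; the server names, region names and both routing functions are hypothetical.

```python
from collections import Counter

# Hypothetical servers, keyed by the region each one serves.
SERVERS = {"region-1": "server-a", "region-2": "server-b"}


def route_by_geography(client_region: str) -> str:
    """Pick a server from the client's region alone, ignoring load."""
    return SERVERS.get(client_region, "server-a")


def route_by_load(active: Counter) -> str:
    """Pick whichever server currently holds the fewest requests."""
    return min(SERVERS.values(), key=lambda name: active[name])


def simulate(requests: int) -> tuple[Counter, Counter]:
    """Send every request from one region through both strategies."""
    geo, balanced = Counter(), Counter()
    for _ in range(requests):
        # All clearing traffic is assumed to come from the same region.
        geo[route_by_geography("region-1")] += 1
        balanced[route_by_load(balanced)] += 1
    return geo, balanced
```

Calling `simulate(1000)` puts all 1,000 requests on `server-a` under region-only routing, while the load-aware strategy splits them evenly between the two servers.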
Secondly, using an auto-scaling solution for a CMS of this size wouldn’t work. With auto-scaling, it’s the offline server that redirects traffic. You could find yourself in the following cycle:
- Traffic sent to failed server
- Failed server redirects to new server
- New server is still caching
- New server redirects traffic to second new server
- Second new server is still caching
- Rinse and repeat!
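The cycle above can be sketched as a small Python trace. This is a deliberately simplified model under stated assumptions: `traffic_rps` and `cold_capacity_rps` are hypothetical figures for incoming requests per second and for what a server can handle while its cache is still cold, and every replacement server is assumed to behave identically.

```python
def failover_chain(
    traffic_rps: int, cold_capacity_rps: int, max_replacements: int = 5
) -> list[str]:
    """Trace what happens when each replacement starts with a cold cache."""
    events = []
    for n in range(1, max_replacements + 1):
        if traffic_rps <= cold_capacity_rps:
            # The server can absorb the load while it builds its cache.
            events.append(f"server {n}: copes while caching, cycle ends")
            return events
        # Identical servers under identical load fail in the same way.
        events.append(f"server {n}: overloaded while still caching, redirects")
    events.append("no replacement ever finished caching")
    return events
```

For example, `failover_chain(1000, 200, max_replacements=3)` records three overloaded servers and then gives up, because nothing changes between attempts.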
Application Gateway (AG) and Multiple Servers
It’s always a balancing act between robustness, preparedness, and being over-prepared. It’s almost always a question of economics.
All this was taken into account when deciding to use Application Gateway (AG) and multiple servers.
On balance, although AG appears more expensive, the time saved and the reduced engineering support resulted in this being the most cost-effective approach.
So Here’s How it Works
Always on: AG is set up for everything, not just clearing. If you set it up for clearing alone, the subscription savings initially look worthwhile. However, managing switchover and preparation is a big undertaking. Staffing costs are dramatically reduced by having this always on.
Decision Tree Before Server: With AG, traffic can be managed with rules that you set before hitting any of the servers. If a server is down, traffic stops hitting that server and is redirected. No ‘new server caching’ cycle.
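As a rough illustration of deciding before any server is hit, the Python sketch below models a gateway that only routes to eligible backends. It is not Application Gateway's API or our actual rule set; the `Backend` and `Gateway` classes are hypothetical, and treating a cold cache as a reason to withhold traffic (`cache_warm`) is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    healthy: bool = True
    cache_warm: bool = True  # assumption: readiness includes a warm cache


class Gateway:
    """Toy gateway: picks a backend before any server sees the request."""

    def __init__(self, backends: list[Backend]) -> None:
        self.backends = backends
        self._next = 0

    def eligible(self) -> list[Backend]:
        """Backends that are up and ready to serve traffic."""
        return [b for b in self.backends if b.healthy and b.cache_warm]

    def route(self) -> str:
        """Round-robin across eligible backends only."""
        pool = self.eligible()
        if not pool:
            raise RuntimeError("no eligible backend available")
        backend = pool[self._next % len(pool)]
        self._next += 1
        return backend.name
```

Because the failed server is simply dropped from the pool, it never has to redirect anything itself, and a server that is still caching receives no traffic until it is ready.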
Two copies of database: This should be seen as essential, as a database failure would have a catastrophic impact on a CMS like this. We set up two copies of the database in a failover group configuration.
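The idea behind a failover group can be sketched in a few lines of Python: the application talks to one stable endpoint, and the group decides which copy sits behind it. This is a toy model under assumptions, not a real database client; the `FailoverGroup` class and the database names are hypothetical.

```python
class FailoverGroup:
    """Toy failover group: one stable endpoint over two database copies."""

    def __init__(self, primary: str, secondary: str) -> None:
        self.primary = primary
        self.secondary = secondary
        self.available = {primary: True, secondary: True}

    def mark_down(self, name: str) -> None:
        """Record a failure and promote the other copy if it is still up."""
        self.available[name] = False
        if name == self.primary and self.available[self.secondary]:
            # Callers keep using endpoint(); only the target changes.
            self.primary, self.secondary = self.secondary, self.primary

    def endpoint(self) -> str:
        """Return the copy that should currently receive traffic."""
        if not self.available[self.primary]:
            raise RuntimeError("both database copies are unavailable")
        return self.primary
```

With `group = FailoverGroup("db-region-1", "db-region-2")`, calling `group.mark_down("db-region-1")` makes `group.endpoint()` return `"db-region-2"` without the caller changing anything.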
Paired Server Architecture: We set up paired server regions. So if an entire region goes down, we stay operational. This may sound over-cautious, though a regional outage is exactly what we avoided in April 2021.
Reporting and Visibility: The improved reporting of this setup has also been key in ongoing support and optimisation. We can avoid many future issues before they arise.
We worked closely with Microsoft on this solution. We hope these insights help you avoid any costly errors.
We’re always keen to speak to businesses and universities regarding their specific needs. Please get in touch to see how we can help!