Load Testing the edu-ID IdP

On a Monday morning at the start of the fall semester 2024, many students were unable to log into their edu-ID account. It was a nightmare for students, for IT administrators, and for the edu-ID team, which worked hard to fix the issue as quickly as possible.

What was the cause of this incident? A retrospective analysis found that a database table was missing an index. Really, a missing index? Why did we not detect this earlier, given that the table had been in use for several months without any problems? It turns out that we did load test the new MFA API when launching it, but apparently not with a sufficiently large and diverse dataset. The problem therefore only became apparent under the high load at semester start.
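To illustrate why a missing index can stay hidden on a small table and hurt on a large one, here is a minimal sketch using SQLite from the Python standard library. SQLite is only a stand-in, and the table and column names (mfa_tokens, user_id) are hypothetical; they are not the actual edu-ID schema or database. The query plan shows a full table scan before the index exists and an index lookup afterwards.

```python
import sqlite3

# Hypothetical table and column names, used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mfa_tokens (id INTEGER PRIMARY KEY, user_id INTEGER, secret TEXT)"
)
conn.executemany(
    "INSERT INTO mfa_tokens (user_id, secret) VALUES (?, ?)",
    ((i % 5_000, f"secret-{i}") for i in range(20_000)),
)

QUERY = "SELECT secret FROM mfa_tokens WHERE user_id = ?"


def query_plan(connection, sql, params):
    """Return SQLite's plan description for a query as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return "; ".join(row[-1] for row in rows)


plan_without_index = query_plan(conn, QUERY, (42,))
conn.execute("CREATE INDEX idx_mfa_tokens_user_id ON mfa_tokens (user_id)")
plan_with_index = query_plan(conn, QUERY, (42,))

print("without index:", plan_without_index)  # every row is scanned
print("with index:   ", plan_with_index)     # rows are found via the index
```

With only a few rows, the scan is fast enough that nobody notices; its cost grows with the table size, which is why a small test dataset can mask the issue.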

The lesson is clear:

We need more thorough load tests!

We accepted the challenge and started working on a parallel edu-ID system, with three constraints in mind:

  • We did not want to interfere with legitimate logins;
  • We did not want to use personal user data;
  • We did not want to pollute our monitoring.

As a result, we set up an additional edu-ID environment, which is almost identical to the one used in “production” on eduid.ch, but with three major differences:

  • It is not publicly accessible;
  • All personal data is fully anonymized;
  • Any resulting logs and metrics are tagged with “staging” to avoid confusion.

We collaborated closely within our team and with sibling teams to get this up and running quickly. This enabled us to simulate our login flow with a load testing framework called Locust, carefully crafting login requests that cover as many different cases as possible: with/without MFA, using SMS/app codes, on SAML/OIDC resources.
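The case matrix described above can be sketched as a simple enumeration. This is not our actual Locust test code; the dictionary keys and the function name are hypothetical, and the sketch only shows how the combinations of protocol and second factor add up.

```python
from itertools import product

PROTOCOLS = ["SAML", "OIDC"]
# None stands for a login without a second factor.
SECOND_FACTORS = [None, "sms", "app"]


def login_scenarios():
    """Enumerate every protocol / second-factor combination exactly once."""
    return [
        {"protocol": protocol, "mfa": factor is not None, "factor": factor}
        for protocol, factor in product(PROTOCOLS, SECOND_FACTORS)
    ]


scenarios = login_scenarios()
for scenario in scenarios:
    label = scenario["factor"] or "no MFA"
    print(f'{scenario["protocol"]:4} login, {label}')
```

In a Locust test, each of these six scenarios would typically become its own task on a user class, possibly weighted to match the real mix of logins.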

Last year’s load peaked at no less than 45 logins/second on the edu-ID IdP (identity provider). Given the growth from 1.1M to about 1.3M active accounts over the past year, we extrapolated an expected peak of about 55 logins/second for this year. Based on these numbers, we performed three kinds of load tests:

  1. Smoke tests: A low number of logins, to check that the environment behaves as expected;
  2. Soak tests: Around the same load as last year, which is a realistic load for semester start;
  3. Stress tests: Double last year’s load for two hours, to find out whether there is a breaking point.

The different load testing scenarios we performed on the parallel edu-ID environment
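The numbers behind these test targets can be reproduced with a few lines of arithmetic. The sketch below assumes that peak load scales proportionally with the number of active accounts; that assumption is ours for illustration, and it yields roughly 53 logins/second, in line with the rounded figure of about 55.

```python
LAST_YEAR_PEAK = 45            # logins/second at last year's peak
ACCOUNTS_LAST_YEAR = 1.1e6     # active accounts a year ago
ACCOUNTS_THIS_YEAR = 1.3e6     # active accounts now


def scale_peak(peak, accounts_before, accounts_now):
    """Scale a peak rate linearly with account growth (an assumption)."""
    return peak * accounts_now / accounts_before


expected_peak = scale_peak(LAST_YEAR_PEAK, ACCOUNTS_LAST_YEAR, ACCOUNTS_THIS_YEAR)
stress_target = 2 * LAST_YEAR_PEAK  # "double last year's load"

print(f"expected peak: {expected_peak:.1f} logins/second")
print(f"stress target: {stress_target} logins/second")
```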

The new parallel edu-ID environment had only 4 IdP nodes instead of the 5 that run in production, but the results were promising even with 4 nodes. Here, we share the results of 1 soak test and 2 stress tests. Please note that the actual values of these metrics varied slightly over the duration of the tests; for the sake of simplicity, we only show average values here:

Metric                        Soak test   Stress test 1   Stress test 2
Concurrent users              700         1000            1500
Logins / sec                  63          90              105
HTTP requests / sec           600         850             1000
Average request duration      100 ms      135 ms          685 ms
99th % request duration       650 ms      1.5 s           8 s
Failed logins                 0 %         0 %             10 %
IdP node CPU utilization      50 %        73 %            80 %
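Two derived figures help when reading these results: the number of HTTP requests per login, and how often each simulated user completes a login. The sketch below computes both from the averages above; the variable and function names are ours, and since the inputs are averages, the outputs are only approximate.

```python
# Average values from the three runs above, keyed by concurrent users.
runs = {
    700: {"logins_per_sec": 63, "requests_per_sec": 600},
    1000: {"logins_per_sec": 90, "requests_per_sec": 850},
    1500: {"logins_per_sec": 105, "requests_per_sec": 1000},
}


def requests_per_login(run):
    """HTTP requests needed for one login, on average."""
    return run["requests_per_sec"] / run["logins_per_sec"]


def seconds_per_login_cycle(users, run):
    """Average time one simulated user takes per login, waits included."""
    return users / run["logins_per_sec"]


summary = {
    users: (
        round(requests_per_login(run), 1),
        round(seconds_per_login_cycle(users, run), 1),
    )
    for users, run in runs.items()
}

for users, (per_login, cycle) in summary.items():
    print(f"{users:5} users: {per_login} requests/login, {cycle} s per login cycle")
```

The request count per login stays nearly constant at about 9.5, while the login cycle stretches from about 11 to about 14 seconds in the largest run, which matches the longer request durations observed there.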

Thanks to these results, we were quite confident that we were ready for even 100 logins/second this year. Just to be safe, we also increased our resources for the first week of the semester, adding three new virtual machines to our pool of IdP nodes.

The result? This year, we never saw more than 30 logins/second during semester start. In hindsight, most of last year’s requests were likely users retrying after failed logins. We are therefore ready for at least three times the number of logins with the current infrastructure.

The actual logins/second numbers we saw during this year’s semester start

Of course, we are not planning to stop here. In the future, we plan to run a set of load tests a few weeks before each semester start (in August and January). So far, we have focused on the full login flow, but we will add new tests each time, gradually covering more backend components individually as well. We look forward to continuing to foster stability and reliability in edu-ID.
