After dealing with a TLS certificate expiration, Epic Games decides to make their experience a teaching moment for others — we’ll cover some of the key takeaways they shared and how you can prevent it from happening to your business
This server is unavailable.
These four words deliver feelings of dread and aggro to gamers as effectively as a punch to the gut. It means no battlegrounds, raids, or hours of exciting weeknight gameplay with friends. Or, worse, you might have to spend your free time with family instead — and what teenager wants that? Gross.
Seriously, though, widespread service disruptions can have a huge impact. The global online gaming market is a huge industry. As of 2020, it was worth $167 billion and is anticipated to reach $287.1 billion by 2026, according to recent data from ResearchAndMarkets.com. And online gaming service outages don’t just affect kids and teens. Data from LimeLight’s State of Online Gaming 2020 report shows that many gamers fall within the 26-35 age category (30.2%), followed by 36-45 year olds (28.3%) and gamers who are 60+ (26.8%).
Some of the worst types of downtime for businesses are those that are entirely avoidable… you know, like SSL/TLS certificate expirations.
Unfortunately for Epic Games (EG) fans (players of games like Fortnite, Houseparty, and Rocket League), the company discovered what happens when it allows even just one of its SSL/TLS certificates to expire. But unlike many companies in their position, Epic Games didn’t try to hide or downplay their mistake. Instead, they decided to be a boss and openly talked about the April 6 incident in an online article. Their goal? To help other companies learn from their mistakes.
Kudos, Epic Games. We respect that. And in honor of your uncommon transparency, we’re going to go over the highlights of your report and look at what companies can do differently to avoid ending up in the same position.
Let’s hash it out.
An Epic Play-by-Play: Breaking Down What Occurred
Certificate expirations suck no matter how you look at them. For businesses, they make a bad impression and leave you non-compliant. For users, they mean losing access to the services or products you paid for. Digital certificates are your organization’s digital identity as well as a way to secure your services, websites, and data from unauthorized access. And when even “only” one certificate expires, it creates a slew of problems that no organization wants to deal with.
In Epic Games’ situation, one of the internal TLS certificates they were using to encrypt their backend services for internal management tools and cross-service API calls expired. Of course, it’s important to note that it just takes one certificate to create a big mess. But in this case, thankfully, EG quickly narrowed down the issue to an expired certificate and got people from across their various teams to work together to resolve the issue.
But just how did everything go down? Epic Games was kind enough to provide a detailed timeline of events as they occurred on Tuesday, April 6 in their article.
Rather than go over every detail of that timeline in depth, we’re going to give you the highlights.
- They discovered that an internal wildcard SSL/TLS certificate expired. This certificate, which touched many internal backend services across their IT ecosystem, led to widespread service outages for users and employees alike. This immediately led EG’s IT team to go into incident management mode to deal with the issue.
- 25 minutes later, they started the certificate reissuance process. Thankfully, it didn’t take long for them to discover an expired certificate was the culprit behind the service outages. They quickly started the certificate reissuance process, which allowed them to start the recovery of select services. But the situation doesn’t end there…
- Their internal teams discovered other issues with connected services over the next few hours. A series of events and issues led them to identify other things that were amiss within their IT ecosystem that affected their launcher client and online store. Some of these issues included missing assets and invalid content. Luckily, EG says they were able to attain full recovery of all their affected services and systems by 5:35 p.m. UTC.
Epic Games reports that the whole situation lasted a little more than 5.5 hours from start to finish. But it seems like the online gaming giant took the hit to the chin like a champ and responded quickly to resolve the issues. They also decided to use it as an opportunity to spread the word about the importance of implementing effective certificate management. (We’ll speak more to that momentarily…)
Area of Effect: Who and What Were Impacted By the Certificate Expiration
Epic Games is a company with a large and growing customer base. Their Epic Games Store 2020 Year in Review report shares that their EGS community has 31.3 million daily active users (DAUs), which is a 192% increase over the previous year. They also report having more than 160 million Epic Games Store PC users who spent more than $700 million in 2020. So, you can see that we’re not talking about a small market here.
Because Epic Games used the affected wildcard certificate across hundreds of different production services, it means that the impact of its expiration was widespread across their ecosystem. This affected both their customers who were trying to use their products and their employees who were attempting to resolve and manage the downtime-related issues.
The biggest impacts were felt by their identity and authentication systems. As you can imagine, this resulted in:
- User login and purchase failures across multiple products and systems. This means anyone trying to log in during the hours of the outage couldn’t do so. They also couldn’t purchase items in the Epic Games Launcher client.
- Live service and gameplay disconnections and website failures. For users already in the middle of gaming, this boot from live gameplay resulted in extra frustrations because they couldn’t reconnect. EG’s product and marketing websites also were experiencing a lot of 403 errors due to an unrelated container update that had been made the day before.
- EG employees’ hands being temporarily tied due to internal tooling issues. The people who get it the worst in downtime situations are the customer service employees.
There Were Some Unexpected Positives That Came Out of the Situation…
An issue that started with an expired internal certificate quickly morphed into something much bigger. It served as an opportunity for EG to identify other unrelated issues that existed within their systems that they otherwise may have not discovered until cybercriminals exploited them.
One example is the “unexpected behaviors” that they discovered in the Epic Games Launcher client that resulted in unusual call patterns. It turns out, clients were using linear retry logic rather than a truncated exponential backoff. The former keeps retrying at a steady pace, so failing clients never ease off; the latter lengthens the wait after each failure, up to a cap, to prevent excessive connection attempts that increase traffic loads.
As a result, every time a user’s client sent a failed connection request, it would continuously send additional requests until it would receive a successful response. This glitch caused millions of launcher clients globally to send repeated requests continuously, which overloaded their systems. The result? “We were effectively DDoSed by our own clients.” This incident also enabled EG to discover issues in their web application firewall (WAF) ruleset. Fortunately, they were able to reduce the traffic and are now aware of their need for a standard process to deal with similar issues in the future.
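To make the difference concrete, here's a minimal sketch of truncated exponential backoff in Python. This is not Epic Games' launcher code. The function names, the one-second base, the 60-second cap, and the random jitter are all assumptions picked for illustration.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0, jitter=random.random):
    """Seconds to wait before retry number `attempt` (0-based).

    The ceiling doubles with every failed attempt and is truncated at `cap`.
    `jitter` returns a value in [0, 1) so clients do not all retry at once.
    """
    exponent = min(attempt, 30)  # keep 2 ** exponent small enough for float math
    ceiling = min(cap, base * (2 ** exponent))
    return jitter() * ceiling


def call_with_backoff(request, max_attempts=6, sleep=time.sleep, **delay_kwargs):
    """Call `request` until it succeeds or `max_attempts` is used up.

    Only ConnectionError is retried (an assumption for this sketch);
    the last failure is re-raised so the caller can surface it.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, **delay_kwargs))
```

With these example values, the wait ceiling grows 1, 2, 4, 8 seconds and so on until it hits the cap, and the client gives up after a fixed number of tries instead of hammering the service forever.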
A second unrelated issue they discovered affected the traffic on their Epic Games Store website. Instances were trying to fetch an asset ID that didn’t seem to exist, resulting in a bunch of 403 errors. After discovering the cause of the issue, they quickly fixed it and restored valid traffic.
The good news is that this certificate expiration set off a chain of events that forced Epic Games to take a hard look at their internal processes and tools. For example, they may not have realized the issue with their retry logic without their system first becoming overloaded with client traffic. This allowed them to see where they went wrong and implement changes, as well as share their insights to help others avoid following in their footsteps. So, while certificate mismanagement isn’t good, at least there was a relatively happy ending in this particular situation.
This brings us to our next point: how can you help your own company avoid dealing with the ramifications of an expired website security certificate?
Lvl Up Your Cybersec XP with Certificate Management & Network Discovery Tools
No one wants their business or services to experience an outage due to certificate mismanagement. This is why it’s integral for businesses — particularly those with hundreds or thousands of X.509 digital certificates — to have clear visibility of everything that touches their networks and IT systems. And this is where effective certificate management best practices and tools come into play.
A good certificate manager is one that enables you to discover all of the digital certificates that exist within your IT ecosystem. This means you’ll know where every certificate is across all endpoints and which systems each certificate is tied to or secures.
Sure, you can manually track your certificates using spreadsheets and calendar reminders, but this gets hairy at scale. Keyfactor and the Ponemon Institute report that organizations use an average of 88,750 keys and digital certificates. And if that number isn’t enough to surprise you, then consider that 74% of their 603 IT security and infosec survey participants think their organizations don’t actually know how many certificates or keys they have, let alone when they expire.
That’s not only embarrassing — it’s downright terrifying. And considering that SSL/TLS certificates have a one-year certificate validity period now, it means that certificates expire more quickly and require more stringent management.
Without effective certificate management, you may wind up having expired or revoked certificates on your network that you don’t know about. And each one is a weakness that cybercriminals can take advantage of. (Remember the Equifax data breach from a few years ago? The attackers got in through an unpatched web application vulnerability, but an expired certificate on a network traffic inspection device helped their activity go unnoticed for months.) And this is when you go from having “just” a temporary service outage to potentially a full-blown data breach situation.
Having the Right Tools Isn’t Enough: You Need to Know How to Use Them Effectively
You can be properly geared but still not get the Certificate Management Boss achievement. That’s because although having the right certificate manager is great, it’s just as important — if not more so — that you know how to use that tool effectively. This is true both from a general cybersecurity standpoint as well as a risk mitigation perspective.
It’s kind of like intimately knowing your character’s specs and attack/healing rotations in games. While wearing one set of armor and using a specific healing or attack rotation may be great for keeping your group alive in dungeons, it doesn’t mean that those same tools are effective when playing a tank or healer in raids. This is why you need to have not only the right gear (a certificate manager) but also must know the right tactics (certificate management best practices) for each situation.
In this situation, Epic Games admits that although they use a certificate manager, it was their own poor cert management practices that led to the expiration and resulting service outages. Basically, they say both organizational and technical process failures contributed to the situation.
We like how EG’s glass-half-full outlook turns this negative situation into a positive one by using the incident for introspection:
“The scope and length of the outage helped us discover not only explicit bugs in our systems, which we will work to correct, but also previously unquestioned assumptions in some of our internal processes, especially those governing certificate management.”
Epic Games called out their n00b certificate management mistakes and how they’re taking steps to fix them. Let’s explore three key things that they learned the hard way.
Active Monitoring Should Apply to All Systems — Including Internal Certificates
This next part boils down to a simple but profound misconfiguration issue. Epic Games says they were using a certificate monitoring service to monitor their domain name system (DNS) zones. However, they hadn’t enabled it to monitor individual certificates or endpoints. This means that the certs they were using for internal service-to-service communications weren’t being actively monitored.
Ever try to find a flashlight in your house when the power goes out? You’re left groping around in the dark and are likely to stub your toe a time or two. Lacking proper certificate management configurations and processes is kind of similar: it’s hard to find a problem if you’re keeping entire systems in the dark. In this case, EG’s active monitoring failure allowed a critical certificate to fail without them even realizing that it was going to expire.
Since then, they’ve manually audited all of their SSL/TLS certificates to ensure that there’s no additional oversight or other expired certificates that they missed.
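If you want a feel for what per-endpoint monitoring looks like, here's a rough Python sketch that uses only the standard library. It is an illustration, not the monitoring service Epic Games uses. The function names and the 30-day warning window are assumptions, and it presumes the machine running it trusts the CA that issued your internal certificates.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Days left before a notAfter time such as 'Apr  6 12:00:00 2021 GMT'."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).total_seconds() / 86400


def fetch_not_after(host, port=443, timeout=5.0):
    """Connect to host:port over TLS and return the certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def expiring_soon(endpoints, warn_days=30, fetch=fetch_not_after):
    """Return (host, port, days_left) for every endpoint that needs attention.

    days_left is None when the handshake fails certificate verification,
    which is what an already expired certificate triggers.
    """
    alerts = []
    for host, port in endpoints:
        try:
            remaining = days_until_expiry(fetch(host, port))
        except ssl.SSLCertVerificationError:
            alerts.append((host, port, None))
            continue
        if remaining <= warn_days:
            alerts.append((host, port, remaining))
    return alerts
```

Run something like this on a schedule against a list of internal hosts and ports, not just your public DNS zones, and an internal certificate that's about to lapse shows up well before your players notice.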
Automatic Certificate Renewals Should Be Enabled for All Certificates
Most certificate management systems offer automatic renewals. Unfortunately for Epic Games, they hadn’t enabled this feature for this certificate, which allowed the certificate to expire rather than be replaced automatically. Now, having learned their lesson, they say they’re moving to set existing certificates to auto-renew to avoid this issue in the future. This is a great move both from a general cybersecurity standpoint as well as a risk mitigation perspective.
Enabling certificates to renew automatically increases your certificate management effectiveness and agility. For organizations that are managing digital certificates at scale, automation entails freeing up your IT team from performing manual, repetitive tasks so they can focus on bigger-picture functions. And using a certificate management tool with automation also helps you to respond quickly to certificate revocation situations.
To learn more about the benefits of certificate management and automation for large-scale operations, be sure to check out this recent Forbes article by DigiCert Chief Technology Officer Jason Sabin.
You Shouldn’t Use the Same Wildcard Certificate Across All Major Systems or Services
Epic Games says that the reason this single certificate expiration was so impactful boils down to one key issue:
“The expiration issue had nothing to do with AWS ACM itself, but with our management of our own certificate. We will work on separating the blast radius of our certificates, and part of this will be updating our processes for certificate use with AWS ACM.”
This is a great point. While it’s true that a single wildcard SSL certificate can cover an unlimited number of subdomains, it doesn’t mean that you should use it to cover literally everything. It’s good to use different certificates across multiple systems to divvy up the potential risk in the event that something goes wrong with an individual certificate. This way, if there’s an issue with one certificate, such as an expiration or a certificate revocation, the impact will only be felt in a handful of systems instead of your entire IT environment.
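One way to see your own blast radius is to group services by the certificate they depend on. The Python sketch below assumes you already have an inventory that maps each service to a certificate identifier. The inventory format, the names, and the threshold of three services are all hypothetical.

```python
from collections import defaultdict


def blast_radius(inventory):
    """Map each certificate ID to the sorted list of services that depend on it."""
    by_cert = defaultdict(list)
    for service, cert_id in inventory.items():
        by_cert[cert_id].append(service)
    return {cert_id: sorted(services) for cert_id, services in by_cert.items()}


def overshared(inventory, max_services=3):
    """Certificates whose expiration would take down more than `max_services` services."""
    return {
        cert_id: services
        for cert_id, services in blast_radius(inventory).items()
        if len(services) > max_services
    }


# Hypothetical inventory: service name -> certificate identifier
inventory = {
    "auth": "wildcard-internal",
    "store": "wildcard-internal",
    "launcher-api": "wildcard-internal",
    "matchmaking": "wildcard-internal",
    "marketing-site": "www-public",
}
print(overshared(inventory))
```

In this made-up inventory, one wildcard certificate sits behind four services, so it gets flagged as a candidate for splitting into several narrower certificates.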
Quest Complete: Final Thoughts
Whether you’re a major online gaming company like Epic Games or a small business, effective certificate management is integral to your business’s success and data security. Digital certificates, as well as the tools and processes you use to manage the certificate lifecycle, are critical to your company’s cybersecurity posture.
We hope that EG’s honest look at their own certificate management failings benefits you by helping you avoid the same mistakes within your own IT environment. For some other examples of PKI certificate management mistakes, be sure to check out our other article.