Around IT in 256 seconds

#71: Erlang: let it crash!

April 26, 2022 | 3 Minute Read

Erlang is a programming language designed for highly scalable, fault-tolerant systems. Its primary use case used to be telecommunication. But these days it powers some of the biggest distributed systems. For example, half-billion WhatsApp users. The unique features of Erlang allow it to achieve amazing availability. A typical enterprise system may be unavailable for, let’s say, a few hours per year. This means 99.9% availability. Systems written in Erlang may even reach so_called nine nines. Or 99.9999999%. It means the system is unavailable for less than 31 milliseconds. Per year. How is that possible?

Our typical application is written as a single process. It handles thousands of users or connections at the same time. Erlang is different. In Erlang each unique user, connection or request typically spawns a new process.

But processes in Erlang are very lightweight and dynamic. They consume as little as 1 kilobyte of memory. So, it’s common to create hundreds of thousands of them. Internally, Erlang runtime manages them within a single operating system process.

So what’s the big deal with these processes? Well, Erlang does not encourage defensive programming. It means programs should not try to recover from errors themselves. Instead, errors simply terminate a whole process. Wait, what? This actually makes sense. If you are implementing a telephony switch and your process fails, you’ll lose one phone connection. Thousands of other concurrent connections are fine. But what if you build such software as a single, monolithic app? Chances are that a single failure cascades and bring down the whole server. That’s unacceptable.

OK, so how does Erlang deal with errors? A process itself should not try to recover. Instead, processes are arranged in a supervision tree. When a process dies, its supervisor is notified and can react. It may:

  • ignore termination of one of its children
  • restart that broken child
  • terminate itself, bringing down all siblings of the terminated child

All these strategies make sense under certain circumstances. Do you think this approach is naive? Quite the opposite! When was the last time you restarted the server because it was unstable or in some inconsistent state? We do it all the time. Turning something off and on again is a common strategy in our industry.

Erlang processes are highly isolated from each other. So when one process goes down, restarting it may be the best approach. Oh, by the way, how is the isolation achieved? Processes can’t talk to each other directly. They can barely send asynchronous messages. Sharing memory or direct calls is impossible.

Processes act as if they were deployed separately. Even though typically, they all live within the same Erlang virtual machine. Known as BEAM. However, you can easily deploy Erlang VM on multiple servers. Processes still talk to each other via message passing. So scaling out theoretically does not require any code changes. Scaling up is even easier, you just need memory to create a few more millions of processes on one machine.

But wait, there’s more! Erlang supports hot-swapping of code. This means you can deploy a new version of your processes without downtime. You can even have a new and old version working side-by-side!

Erlang is still used these days, despite quite an unusual programming model. If you want to give it a try, have a look at Elixir, a language built on top of Erlang’s VM with modern syntax.

That’s it, thanks for listening, bye!

More materials

Tags: BEAM, OTP, erlang

Be the first to listen to new episodes!

To get exclusive content: