December 3, 2023
I'm currently starting the process of rewriting an old piece of software at my company that handles customer registration (yes, registration is its own big chunk of work for a finance company). The reason is simple: no one in the office is bold enough to maintain the app. It's written in PHP 7.3 on the old CodeIgniter 3 framework, with a MySQL 5.7 database that has accumulated too many hacks and modifications to keep everything working together. The plan is to overhaul everything so that the operational side of the customer registration process becomes much, much simpler. Right now it's very hard to add a new integration with a third party, yet in 2024 we're planning to integrate with 3 new third-party companies. I don't want that effort to go to waste.
While I'm still early in the process, someone on Twitter asked about system design for junior developers, and I think this is the perfect moment to share. There are clickable links along the way; make sure you read them too.
System design is all about translating requirements into tangible resources, be it infrastructure, tools, or the application that solves the problem behind the requirements. System design is not merely about choosing which language you'll use, which database you'll use, or whether you need Redis, or Kafka, or even CQRS. No, it's not like that. If you read that first line carefully, system design is about meeting the requirements in a way that fits your surroundings, fits what you currently have, and fits what's possible within the timeline. You can't force an event-driven system onto the 3 backend developers who will be hands-on with the project when they only have 1 year of experience with regular CRUD software. Sure, the requirements might be better solved with an event-driven system, but would what you currently have (in this case, those 3 backend developers) be able to deliver it on schedule?
To be honest, system design is hard. There is no one-size-fits-all solution. All you can do is, continuously, over and over again, find engineering blogs from various companies, find out how they built their systems, find out how they decided to move from MongoDB to Cassandra and then to ScyllaDB, find out why they decided to port some of their services from Python to Rust, or from Ruby to Go, and find out the steps they took to make such changes. But just because some company is doing great with something doesn't mean you will get the same benefits. Always create a proof of concept for your own scenario.
Most applications work well with a simple three-tier architecture: a single frontend, a single backend, and a single database. But some problems require a better solution; some need a disaster recovery plan that activates a backup application in a different datacenter, or in a different availability zone. Other problems don't need a backend and a database at all; they can be solved with a frontend served behind a CDN proxy.
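To make the three-tier picture concrete, here is a minimal sketch in Python: an in-memory SQLite database standing in for the database tier, a stdlib HTTP server as the backend tier, and any HTTP client acting as the frontend tier. This is an illustration only, with made-up table and data, not a recommended stack.

```python
import json
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

# Tier 3: the database. An in-memory SQLite stands in for MySQL/Postgres/etc.
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO customers (name) VALUES ('Alice'), ('Bob')")
db.commit()

# Tier 2: the backend. A single process with a single endpoint.
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        rows = db.execute("SELECT id, name FROM customers").fetchall()
        body = json.dumps([{"id": i, "name": n} for i, n in rows]).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# Tier 1: the frontend is any HTTP client talking to this server, e.g.:
#   HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()
```

The point isn't the code; it's that each tier is a single, replaceable piece, and many systems never need more than this.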
Although at this point it looks like there is no single guide for actually learning system design other than analysing others who have already done it, I'll give you some questions about system design that should round out your thinking about your current requirements:
- How much traffic are we expecting? 1 request per second? 100 requests per second? 1 million requests per second? Do we need a load balancer for this? If so, where would we put it?
- Where are our users coming from? Do we need to deploy in multiple countries, or is one enough?
- Do we need a separate admin dashboard for our internal day-to-day operations? Should it be exposed publicly, or only through the office VPN?
- Are you sure you want NoSQL for the database? What makes you think a SQL database isn't sufficient for this use case?
- Who will manage the database migrations? Or will there not be any database migrations?
- What about database backups? Should we do them regularly and store them somewhere? Also, do we want snapshot backups or point-in-time backups?
- Do we need master-slave database replication, or can we postpone it until traffic gets high? Also, how do we scale up the database?
- How do we set up a local development environment that's as easy and flexible as possible?
- How will the frontend communicate with the backend? Via HTTP, gRPC, or... native TCP sockets?
- Do we need compression like gzip or zstd for the transport layer?
- Do we really need Kubernetes? Can't we just deploy this with Vercel, or with plain systemd?
- Wouldn't this be better done with RabbitMQ, which has a dead letter queue, instead of Kafka? Or do we really need a message queue at all?
- Integration tests against a real database are hard to set up within our limited GitHub Actions quota; should we just run unit tests instead? Or... do we need to move to Jenkins?
- Do we need a Grafana dashboard to monitor things, or can we offload that to a vendor like Datadog or Sentry? Would sending them PII meet our compliance policies and needs?
- How would we measure metrics like response times and error rates? Who would be looking at those on a regular basis?
- What's the scenario if the application goes down for some users? What are the things we need to do?
- For a B2B (business-to-business) API, would transferring data over the public internet with IP whitelisting be sufficient? Or do we need to set up the site-to-site VPN that the security team prefers?
- Also for the B2B API, do we need signature verification of every incoming and outgoing request using RSA?
- Still on the B2B API, how would the other party report an outage or intermittent failure on our side? What would our response be?
- For third-party APIs, do we need to create a live mock server to handle requests, or can we just hit the third party's development environment?
- How would we defend against DDoS attacks? At which OSI layer can we defend most effectively?
- Do we need to go microservices, or is a monolith enough?
- How do we keep the architecture maintainable?
- Do we need caching? Should we do it in-memory or with a distributed cache like Redis or Memcached? Or maybe it's time for us to try DragonflyDB?
- Why did you choose this particular language and framework? What's your justification for that decision?
- How would you keep the documentation up to date? What happens if some things are left undocumented?
- How would you avoid breaking API changes? Would you version through the request path, or through a custom HTTP header?
- If we need to scale up, what are the things that would have to change? How would you ensure that scaling up takes minimal effort?
- How do we test this out with end users? Do we need A/B testing, or do we just roll it out to everyone?
- How do we make sure the current system is easily extendable for the next feature users will ask for?
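On the transport-compression question above: before committing to gzip or zstd, measure what you'd actually save on your own payloads. A quick stdlib sketch in Python (the payload is made up for illustration):

```python
import gzip
import json

# A made-up JSON payload, roughly what a list endpoint might return.
payload = json.dumps(
    [{"id": i, "status": "registered"} for i in range(500)]
).encode()

compressed = gzip.compress(payload, compresslevel=6)

# Repetitive JSON compresses very well; verify it round-trips.
assert gzip.decompress(compressed) == payload
print(f"{len(payload)} bytes -> {len(compressed)} bytes")
```

If the ratio on your real payloads is underwhelming, or responses are tiny, the CPU cost of compressing may not be worth it.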
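On the question of signing B2B requests: the list mentions RSA; purely to show the shape of sign-and-verify, here is a sketch using HMAC-SHA256 from the Python stdlib instead (with RSA you'd sign with your private key and the other party would verify with your public key, typically via a library such as cryptography). The secret and payload here are made up.

```python
import hashlib
import hmac

SHARED_SECRET = b"not-a-real-secret"  # hypothetical; exchanged out of band

def sign(body: bytes) -> str:
    """Produce a hex signature the caller sends in a request header."""
    return hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()

def verify(body: bytes, signature: str) -> bool:
    """Recompute and compare in constant time."""
    return hmac.compare_digest(sign(body), signature)

body = b'{"customer_id": 42}'
sig = sign(body)
assert verify(body, sig)
assert not verify(b'{"customer_id": 43}', sig)  # tampered body fails
```

Whatever scheme you pick, agree with the other party on exactly which bytes get signed (body, headers, timestamp) before anyone writes code.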
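On the caching question: for a single-instance app, an in-process cache is often enough before reaching for Redis or Memcached. A toy TTL cache sketch in Python (the key names and TTL values are made up):

```python
import time

class TTLCache:
    """Tiny in-memory cache with per-entry expiry; a stand-in for a
    distributed cache when one process is all you have."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expiry timestamp, value)

    def set(self, key, value):
        self.store[key] = (time.monotonic() + self.ttl, value)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self.store[key]  # lazily evict expired entries
            return None
        return value

cache = TTLCache(ttl_seconds=0.05)
cache.set("customer:42", {"name": "Alice"})
assert cache.get("customer:42") == {"name": "Alice"}
time.sleep(0.06)
assert cache.get("customer:42") is None  # expired
```

The moment you run more than one instance, each process has its own copy and they drift; that's usually the point where Redis or Memcached earns its keep.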
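On avoiding breaking API changes: the two options mentioned above are versioning in the request path and versioning through a custom header. A minimal sketch of both lookups (the header name `X-Api-Version` is a made-up example):

```python
def version_from_path(path: str, default: str = "v1") -> str:
    """Extract '/v2/customers' -> 'v2'; fall back to a default."""
    first = path.strip("/").split("/")[0]
    return first if first.startswith("v") and first[1:].isdigit() else default

def version_from_header(headers: dict, default: str = "v1") -> str:
    """Read a custom header like 'X-Api-Version: v2'."""
    return headers.get("X-Api-Version", default)

assert version_from_path("/v2/customers") == "v2"
assert version_from_path("/customers") == "v1"
assert version_from_header({"X-Api-Version": "v3"}) == "v3"
assert version_from_header({}) == "v1"
```

Path versioning is visible in every log line and curl command; header versioning keeps URLs stable. Either works; what hurts is not deciding until after clients integrate.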
There are a lot more questions I could think of, but I don't want to overwhelm you at this point. Instead, I'll continue with one piece of advice about under-engineering and over-engineering.
With the decision partially in your hands, most people in the middle of their system design process will wonder: did I over-engineer this design?
The truth is, you won't know you've over-engineered something until you can think of a simpler solution that fits your needs. It happened to me once, when I used multiple databases to store event logs in different places due to availability and consistency concerns: ScyllaDB and InfluxDB. A few months later, after I had already resigned from the company, I could think of a better and much leaner solution for those same needs: a single ClickHouse database, and everything would just work. I knew I had over-engineered it not while I was building it, but way later.
You can largely avoid over-engineering through experience. Nothing beats real-life experience of doing things, experimenting with tools and designs, and choosing which way to go on a given project; it teaches so much more than just reading a book (that helps too, anyway) or reading someone else's blog post (like you're doing right now).
It's always better to over-engineer than to under-engineer. When a traffic spike hits that you couldn't otherwise handle, say from the viral marketing campaign your company just launched, your over-engineering saves you. Imagine you had under-engineered your system and that spike came anyway: you'd be in a panic, and nobody wants that.
There is so much more to look at, so much more to read, if you want to create better and better system designs. We're always still a long way from creating the best system.