Or how to save money by speeding up your system
“I’m sorry Django, it’s not you, it’s me.” Such could be a start of a cliched tech article or conference talk. “It was back in 2010 when we first met, and you looked great, probably because there weren’t many others on the market to consider.” A less romantic statement could follow.
Indeed, back in 2010 we migrated our news publishing app from .NET to Django, and we were thrilled. We didn’t like the locked-down nature of the Microsoft universe, PHP was already uncool, and Java frameworks were only for banks, insurance companies, or something. Besides those, the only open source frameworks on the market were Ruby on Rails and Django. Given the simplicity and likeability of Python, as well as our in-house Python expertise, Django was the obvious winner.
Django was great: mature and stable, with an amazing ORM, built-in authentication and authorization, an automatically built admin interface (almost a whole CMS for free), and a superb ecosystem of plugins or, as Djangonauts call them, “apps”. We were a great new couple, happily in love, went to conferences together, yada yada yada.
What went wrong?
The whole thing of course also had very real implications for the app’s performance and cost on AWS. The countless days we spent staring at AWS charts and running experiments just didn’t bring the improvements we felt were possible. The AWS cost kept increasing. At first we attributed it to growing traffic: around 15 employees really hammering the app, plus around 15k daily users who were also very active. But something just didn’t feel right; we knew we had to get better performance at lower cost.
Worst of all, our DB on RDS would randomly go berserk, with the CPU at 100% for no obvious reason. We would spin up an even bigger RDS instance, dive into logs and charts, redeploy the app. Were we hacked? Was it a DDoS? We tried everything under the sun to fix it, even with help from some Django community celebs, but nothing really cut it.
Given all this, we were constantly on the lookout for something in the Node.js community that would let us move away from Django seamlessly. But for various reasons none of the frameworks seemed really up to the task, and we tried quite a few.
It was in May, springtime in Paris, the perfect time to fall in love again. I was at a React conference in Paris and I attended a GraphQL / Hasura workshop by Vladimir Novick. At first I thought it was just another plug for someone’s open source project, but I was blown away in minutes.
The plan was quickly hatched to move the frontend part of our app to Next.js, fed by GraphQL from Hasura, connected to the same PostgreSQL database that Django would still use for the admin part. So in stage one we would move only the frontend to Next.js and leave the admin part on Django; someday later we would move the admin part to Node as well.
Three goals: Better developer experience, performance at least slightly better, cost at least slightly lower
We wanted to be sure of what we were doing, so we ran extensive tests and experiments on staging before deciding to use Hasura + Next.js in production. We did a proof of concept in three steps. It had to bring benefits in three areas; if it did, we would go ahead and port the app. We wanted a better developer experience, cost savings on AWS, and at least a bit of improvement in performance, with the ability to tune it further more easily than with the Django app.
Everything worked out of the box on the first try
Step 1 - set up Hasura to expose GraphQL (let’s see if it even works with our DB)
We set up Hasura on our staging DB, and the first thing we noticed was that everything, strangely, worked out of the box on the first attempt. Something like this very rarely happens, neither in the open source world nor with paid products. We threw a new middleware technology at a huge legacy DB, and everything worked, from installation to correctly resolving all the foreign keys and constructing the GraphQL schema. It was little short of a miracle. It took us maybe one hour in total, and we had a working GraphQL API for hundreds of tables and relationships. Wow.
In the figure below you can see all these database relationships recognized on the left, and the visual, auto-complete query builder with its JSON-esque GraphQL syntax.
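To give a flavour of what such a query looks like, here is a minimal sketch of a request body for a Hasura GraphQL endpoint. The table and field names (`news_article`, `author`, `published_at`) are hypothetical and not our real schema; `limit` and `order_by` are arguments Hasura generates for tracked tables, and a foreign key shows up as a nested relationship.

```typescript
// Hypothetical schema: a `news_article` table with a foreign key to `author`.
// Hasura exposes the foreign key as a nested object relationship.
interface GraphQLRequest {
  query: string;
  variables: Record<string, unknown>;
}

const LATEST_NEWS_QUERY = `
  query LatestNews($limit: Int!) {
    news_article(limit: $limit, order_by: { published_at: desc }) {
      id
      title
      published_at
      author {
        name
      }
    }
  }
`;

// Build the JSON body that would be POSTed to the GraphQL endpoint.
function buildLatestNewsRequest(limit: number): GraphQLRequest {
  if (!Number.isInteger(limit) || limit <= 0) {
    throw new Error("limit must be a positive integer");
  }
  return { query: LATEST_NEWS_QUERY, variables: { limit } };
}

const request = buildLatestNewsRequest(10);
console.log(JSON.stringify(request.variables)); // {"limit":10}
```

Passing `limit` as a variable rather than splicing it into the query string keeps the query text constant, which makes it easier to read and reuse.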
Step 2 - build a few pages to display the data
So with GraphQL working, it was time to build a frontend to test it out. We decided to rebuild the home page, the news listing page and a news detail page, only this time in Next.js instead of Django templates. We knew React, so we had that down pretty quickly: in a matter of two days our app with three pages was working.
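The data-loading side of such a page can be sketched roughly as follows. This is an illustration under assumptions, not our production code: the endpoint URL, the `news_article` table and the `fetchJson` parameter are hypothetical, and the function only mirrors the `{ props }` shape that Next.js data-fetching functions return, without importing Next.js itself.

```typescript
// Minimal shape of the HTTP function we need; the global `fetch` fits it.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string }
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

interface NewsItem {
  id: number;
  title: string;
}

interface NewsListResponse {
  data?: { news_article: NewsItem[] };
  errors?: { message: string }[];
}

const NEWS_LIST_QUERY = `
  query NewsList($limit: Int!) {
    news_article(limit: $limit, order_by: { published_at: desc }) {
      id
      title
    }
  }
`;

// Load the news list and return it in the `{ props }` shape a page expects.
async function loadNewsListProps(
  fetchJson: FetchLike,
  endpoint: string, // hypothetical, e.g. "https://hasura.example.com/v1/graphql"
  limit: number
): Promise<{ props: { news: NewsItem[] } }> {
  const response = await fetchJson(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query: NEWS_LIST_QUERY, variables: { limit } }),
  });
  if (!response.ok) {
    throw new Error(`GraphQL request failed with HTTP ${response.status}`);
  }
  const payload = (await response.json()) as NewsListResponse;
  // GraphQL reports query errors in the response body, so check it explicitly.
  if (payload.errors && payload.errors.length > 0) {
    throw new Error(payload.errors.map((e) => e.message).join("; "));
  }
  return { props: { news: payload.data ? payload.data.news_article : [] } };
}
```

In a real page, a function such as `getStaticProps` or `getServerSideProps` would call something like this with the global `fetch`. Injecting the HTTP function keeps the sketch easy to exercise without a running server.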
Step 3 - benchmark and compare to Django
First of all we ran a few experiments on the staging app, with just enough UI to test the system. We wanted to be sure that we would get some performance benefits before porting the production system.
We used a few benchmarks, including Apache Bench and Lighthouse, to see whether the new stack would indeed bring the gains we were after.
Apache Bench tests started giving much better results than Django, and there were very significant improvements in Lighthouse too. In fact it was so much better that we thought we might be making a mistake and not measuring the right things. So for weeks we kept hammering the app with more and more requests, trying to slow it down or break it in any way possible, but in the end it was obvious that “it just works”.
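Apache Bench works by sending a fixed number of requests while keeping a fixed number of them in flight at once. The sketch below illustrates only that idea; it is not how `ab` is implemented, and `runLoadTest` and its parameters are names made up for this example. The request itself is passed in as a function, so the sketch does not depend on any particular HTTP client.

```typescript
interface LoadTestResult {
  completed: number;
  failed: number;
  elapsedMs: number;
  requestsPerSecond: number;
}

// Run `total` calls of `task`, with at most `concurrency` running at once.
async function runLoadTest(
  task: () => Promise<void>,
  total: number,
  concurrency: number
): Promise<LoadTestResult> {
  let started = 0;
  let completed = 0;
  let failed = 0;
  const startTime = Date.now();

  // Each worker keeps picking up the next request until none are left.
  async function worker(): Promise<void> {
    while (started < total) {
      started += 1;
      try {
        await task();
        completed += 1;
      } catch {
        failed += 1;
      }
    }
  }

  const workers: Promise<void>[] = [];
  for (let i = 0; i < Math.min(concurrency, total); i += 1) {
    workers.push(worker());
  }
  await Promise.all(workers);

  const elapsedMs = Date.now() - startTime;
  // Guard against a zero elapsed time when the tasks finish very quickly.
  const requestsPerSecond = elapsedMs > 0 ? (completed / elapsedMs) * 1000 : 0;
  return { completed, failed, elapsedMs, requestsPerSecond };
}
```

Comparing two stacks this way only makes sense if both runs use the same request count and concurrency against comparable pages.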
But still, production is a different beast, and we knew it could bring all sorts of new issues, unforeseen on staging.
It worked great on staging, but we knew production is a different beast
Encouraged by the results of the staging experiments, we finally decided to move production to the same stack. The backend admin part would be left as-is on Django, but the frontend part would move to Hasura and Next.js. Below is a simplified diagram of how we set it up on AWS.
It is too complex to explain all the details of the setup: there are Docker files, nginx configs, DNS settings on Route 53, build systems, etc. It is also important to note that Hasura is used as read-only middleware for now. We are not using mutations to save data to the DB; instead, special APIs on Django accommodate certain frontend features like registration, login and content upload, which still happen by Next.js calling the Django API. This is obviously something we would like to get rid of in the future by calling the GraphQL mutations directly, but for the time being, given that it works nicely, it is good enough.
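The split between reads through Hasura and writes through Django can be sketched as a small routing function. Everything named here is hypothetical (the base URLs, the `/api/...` paths and the action names); it only illustrates the idea of keeping GraphQL read-only for now. The mutation check is deliberately naive, and a real setup would more likely enforce read-only access on the server side.

```typescript
// A read goes to Hasura as a GraphQL query; a write goes to a Django endpoint.
type Action =
  | { kind: "query"; graphql: string }
  | { kind: "register" | "login" | "upload"; payload: Record<string, unknown> };

interface RoutedRequest {
  url: string;
  method: "POST";
  body: string;
}

const HASURA_URL = "https://hasura.example.com/v1/graphql"; // hypothetical
const DJANGO_API_URL = "https://admin.example.com/api"; // hypothetical

function routeRequest(action: Action): RoutedRequest {
  if (action.kind === "query") {
    // Naive guard: refuse anything that starts with the `mutation` keyword.
    if (/^\s*mutation\b/.test(action.graphql)) {
      throw new Error("Hasura is read-only here; writes go through the Django API");
    }
    return {
      url: HASURA_URL,
      method: "POST",
      body: JSON.stringify({ query: action.graphql }),
    };
  }
  // Registration, login and uploads are still handled by Django.
  return {
    url: `${DJANGO_API_URL}/${action.kind}/`,
    method: "POST",
    body: JSON.stringify(action.payload),
  };
}
```

Keeping the decision in one place would also make a later switch to GraphQL mutations a local change rather than one scattered across pages.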
Production showed even bigger benefits
It took us a bit more than three months to rewrite all the frontend code. It was really a pleasure to move from Django templates to writing code in React. We could split the frontend into components tested in Storybook, write Jest tests, and use all the other familiar JS toolsets. Everyone immediately knew how to set up and run the project locally, and frontend devs could easily set up and modify GraphQL queries, something that wasn’t easy in the past. The DX improvement was clearly achieved. Developers were smiling again.
Then came the big day. Moving things to production is always scary, so we picked a weekend to get it done, test, and if needed revert. Weekends still bring lots of visits, but very few users and no employees are uploading content, so it was the perfect time to test things at scale without the fear of breaking people’s workflows and ruining their day.
In about an hour, after some fiddling with Postgres and DNS settings, the site was live, and we quickly jumped on CloudWatch, staring at the charts like maniacs. The results were stunning. The charts mostly speak for themselves, so I’ll just add a short commentary.
Database CPU Performance
The most problematic part of the stack is the database. It is the single source of truth, it has no real dynamic scaling possibilities on AWS, and it has to run all the time with all the data baggage accumulated over the years. It is sort of like the heart: if it stops, everything stops. Under Django it was often under stress for no obvious reason, so this was the very first metric we were interested in.
10x performance boost for ¼ of the price
Application CPU Performance
The situation with the application servers was now a bit different, because we have two apps: Django for the backend and Next.js/Hasura for the frontend. So we established two different environments on ELB, each with its own autoscaling rules, but we used the same instance types for both.
The left chart is the Django app and the right is Hasura / Next.js. You can see that after the switch Django CPU fell from ~30% to 4%, but that was expected since it is now only doing the backend work, running the Django admin. The new frontend app requires somewhere between 15% and 35%; sometimes it spikes to 60%, but rarely above.
Here as well we reduced the server size, from one ELB environment with m4.large instances to two environments with t3a.small or t3a.medium instances. We’re still experimenting a bit to find what works best, but roughly this brings us EC2 savings of some 30%.
Other Performance Metrics
- Apache Bench is how it all started, so it is the first thing to consider. Running the following command showed approximately a 5x performance boost:
ab -n 100 -c 3 "http://our.url"
- Lighthouse speed score went from single digits to comfortably in the 30s, about a 10x boost.
- Latency on the load balancer went from 1500-ish ms down to ~30 ms, so 50x better.
- Request count on all systems on AWS went from ~5k/s to ~80k/s, so roughly 16x.
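The improvement factors in the list above are simple before/after ratios. A quick sketch of the arithmetic, using the approximate figures quoted there (for latency lower is better, for request count higher is better); the function names are made up for this example:

```typescript
// Factor for metrics where lower is better (e.g. latency): before / after.
function speedupFactor(before: number, after: number): number {
  if (before <= 0 || after <= 0) {
    throw new Error("both values must be positive");
  }
  return before / after;
}

// Factor for metrics where higher is better (e.g. requests per second): after / before.
function growthFactor(before: number, after: number): number {
  if (before <= 0 || after <= 0) {
    throw new Error("both values must be positive");
  }
  return after / before;
}

console.log(speedupFactor(1500, 30)); // 50 (load balancer latency in ms)
console.log(growthFactor(5000, 80000)); // 16 (requests per second)
```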
About the author
This blog post was written under the Hasura Technical Writer Program by Alen Balja - Full stack tech lead and data scientist with experience in aerospace, health sciences and gaming.