Detecting performance issues in your Python application: logging & py-spy
Do you ever stare at your monitoring dashboard, wondering “what the f*** is going on?”
It has happened to me, because sometimes, even when you review and test your code, once a code base is big enough, issues can arise or only get discovered after a new update.
The context
We have a Python application that acts as a backend. It is used by multiple parties: internal tooling inside the network and a front end that talks to it. The issue we are going to inspect is a really common scenario I have found in my career: I have an app, it works perfectly everywhere, but in production it randomly has performance issues, even crashing the host in the worst case. I will focus on the two things I used the most when facing these problems: logging and inspecting the code at runtime with py-spy.
Below is a dirty & simplified architecture diagram:
## lazy to do diagrams :p
## simplified architecture
                      |-------|
front end       ----> |backend| ----> DB
internal tool 1 ----> |       | process some stuff
internal tool 2 ----> |       | process other stuff
internal tool 3 ----> |_______| process more stuff
The problem
Today, as an example, let’s look at a performance issue: CPU spikes and RAM usage growing over time, showing up “randomly” at different intervals and causing stability issues on this node.
The sudden drops in load were due to the OOM killer (https://neo4j.com/developer/kb/linux-out-of-memory-killer/) killing the problematic process.
This was… a weird balance: the app did not fail entirely, but it was really unstable.
Hunting down the bug
Our tool-belt
From experience, there are two main things that will help you here: logging and profilers. Some tips on logging:
- Try to log how long your app takes to do things (queries, data processing, requests to third-party or internal services); see the sketch after this list
- Try adding trace IDs; this will save you time when correlating logs (especially in async scenarios)
- awk, sort, uniq, head and grep are the most agile tools I know for parsing logs fast on a Linux box
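Here is a minimal sketch of the first two tips in Python. The decorator name, log format and trace-id handling are illustrative, not our actual code:

```python
import contextvars
import logging
import time
import uuid
from functools import wraps

# Hypothetical trace id, propagated through async code via a contextvar
trace_id = contextvars.ContextVar("trace_id", default="-")

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s trace=%(trace)s %(message)s",
)
log = logging.getLogger("timing")


def timed(func):
    """Log how many seconds a call took, tagged with the current trace id."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            log.info(
                "%s took %.3f seconds",
                func.__name__,
                elapsed,
                extra={"trace": trace_id.get()},
            )
    return wrapper


@timed
def fetch_report(customer_id):
    time.sleep(0.2)  # pretend this is a DB query or an HTTP call
    return {"customer": customer_id}


if __name__ == "__main__":
    trace_id.set(uuid.uuid4().hex[:8])  # one id per incoming request
    fetch_report(42)
```

With something like this in place, every slow call shows up in the logs with its duration and a trace id you can grep for.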
For the profiler part, Python has a de facto option, which is the one we will use today: https://github.com/benfred/py-spy
We will use the “dump” option, which dumps the stack traces of a running process.
So, how do we find the issue?
NOTE: Always experiment on a non-prod environment! Do not ruin your evening because of the thought: “let’s see if this is the issue”
Adding logging statements for how long requests took to complete allowed us to spot requests taking way too long. Some of them never finished and, using the hour as a reference, they matched the time frame of the OOM killer.
This example command got me the time in seconds every query took on 2024-01-10, sorted and aggregated down to the top 2 queries:
# the log uses a custom format, you will need to adjust the awk expressions
cat logging.log | grep "seconds" | awk '$2 >= "[2024-01-10"' | awk '{print $7,$3}' | sort | uniq | sort -nr | head -n 2
cat logging.log | grep 'xxxx.xxx seconds'
Then I can just grep for the seconds and find the problematic endpoint.
We also started inspecting with py-spy the PIDs of the processes that were using too many resources, which revealed the culprit: a query to the database with no limit.
py-spy dump --pid <PID> -llllllllllll --subprocesses
....
# py-spy dumped all the threads as verbose as possible
# we got info related to variables and other parts of the code that allowed us
# to find the problematic code processing the request and replicate the issue
# in testing environments
So the logs told us where to look (stuck on the DB), and py-spy revealed the code stuck querying the database… But there’s more. With statement logging on the DB we noticed that the problem wasn’t a single query: we were making tons of small queries that, in the end, were trying to gather half of the database.
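To make that concrete, here is a made-up illustration of the pattern; the `get_items` helper, the `db` object and the SQL are hypothetical, not our actual code:

```python
# Hypothetical illustration of the anti-pattern: when no filter is given,
# the code walks "all the ids" and issues one small query per id, which in
# practice means dragging half the database through the backend.
def get_items(db, ids=None):
    if ids is None:
        # no filter from the caller -> fall back to "everything"
        ids = [row[0] for row in db.execute("SELECT id FROM items")]

    results = []
    for item_id in ids:
        # one tiny query per id; individually cheap, catastrophic in bulk
        results.append(
            db.execute("SELECT * FROM items WHERE id = %s", (item_id,))
        )
    return results
```

Each query looks harmless in isolation, which is exactly why neither code review nor the statement log made it obvious at first glance.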
Note: I also recommend taking a look at the client IPs to make sure you do not have undesired clients trying to access the site. You could modify the awk expression above to do it.
So, what now?
We found the issue! But how did we solve it? If you take a look at the problem, it’s basically that we were not applying limits.
We added the primary key as our default filter when no filter is specified. This not only fixed the CPU / RAM spikes but also allowed us to fix legitimate calls that were misconfigured with no filters.
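A minimal sketch of the idea, reusing the hypothetical `get_items` helper from above; the PostgreSQL-flavoured SQL and the default page size are assumptions, not our real values:

```python
DEFAULT_PAGE_SIZE = 100  # assumed value, tune it for your app


def get_items(db, ids=None, limit=DEFAULT_PAGE_SIZE):
    if not ids:
        # No filter from the caller: instead of "give me everything",
        # default to a filter on the primary key and a bounded page.
        return db.execute(
            "SELECT * FROM items WHERE id > %s ORDER BY id LIMIT %s",
            (0, limit),
        )
    # Filtered path: one query for all requested ids instead of N tiny ones.
    return db.execute(
        "SELECT * FROM items WHERE id = ANY(%s)",
        (list(ids),),
    )
```

The nice side effect is that callers who forgot their filters now get a small, predictable page instead of silently pulling half the table.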
The results were… Fantastic!
We basically reduced the noisy CPU usage and fixed the RAM growth over time / CPU spikes by modifying a simple default filter.
Conclusions
This issue was tricky to find until it really became a problem; even SAST tools didn’t raise any warning about it. But by analyzing logs, making some correlations, and with the help of py-spy, it can be fairly easy to find out what’s going on inside your app. Don’t forget to check out the py-spy docs to get a grasp of the tool!
I’m re-editing this article to add something else. A friend of mine told me that we could implement a native process to execute the code in a sandbox and automatically send the stack before it gets killed by the OOM killer. I won’t be adding that process, since it’s out of scope, but here are some references to help you out:
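While you hunt those down, here is a rough sketch of the general idea, not the sandboxed process my friend had in mind: cap the process’ own memory with `resource.setrlimit` so it hits a catchable `MemoryError` before the kernel OOM killer steps in, and use `faulthandler` to dump the stacks of all threads to a file. The 2 GiB limit and file path are assumptions, and `RLIMIT_AS` is Linux-specific:

```python
import faulthandler
import resource

# Cap this process' address space (~2 GiB, assumed value) so allocations
# fail with MemoryError instead of the kernel OOM killer killing us.
LIMIT_BYTES = 2 * 1024 ** 3
resource.setrlimit(resource.RLIMIT_AS, (LIMIT_BYTES, LIMIT_BYTES))

# Keep the file open up front so the dump can happen under memory pressure.
crash_log = open("/tmp/stack_dump.log", "w")


def memory_hungry_work():
    data = []
    while True:
        data.append(bytearray(10 * 1024 * 1024))  # simulate the leak


try:
    memory_hungry_work()
except MemoryError:
    # Dump every thread's stack before bailing out, so there is something
    # to look at even if py-spy never got the chance to attach.
    faulthandler.dump_traceback(file=crash_log, all_threads=True)
    crash_log.flush()
    raise
```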
Remember: always apply limits :)
Lastly, thanks to the amazing co-workers who helped me analyze and chase down the performance issue ;)