WhyProfiler - the world's first hybrid profiler, now for Jupyter notebook and Python

updated on 12 April 2022

This post is by Natan Yellin, the founder of Robusta.dev - a Kubernetes troubleshooting, monitoring, and automation platform. Natan is also a long-time Semgrep fan.

I built a CPU profiler for Python and Jupyter notebook that not only identifies hotspots but can also suggest faster alternatives. It is powered by Semgrep, a powerful and easy-to-use static analysis tool. To the best of my knowledge, it is the world’s first hybrid CPU profiler that combines dynamic profiling with general-purpose static analysis. It is also the only Python profiler that both identifies hotspots and recommends equivalent code to fix them.

I’ll explain why I wrote it and how it works. But first, a demo:

WhyProfiler in action

An Introduction to Hybrid Profiling

A traditional dynamic profiler can identify hotspots in your code. It is then up to you to fix them.

On the other hand, a static analysis tool can identify potential optimizations but cannot prioritize them. (Unlike security use cases, where prioritizing static analysis output is easier.) There can be thousands or even millions of ways to optimize a codebase, but many of them have little impact on actual runtime performance.

By combining dynamic profiling and static analysis, we can get the best of both worlds: Targeted suggestions on how to fix your code in the places where it matters.

How it works

  1. Dynamic Profiling: You run your code - in this case, Python cells inside a Jupyter notebook. In the background, a profiler (here, yappi) records information on what your code is doing and when. It builds a heatmap of the most CPU-intensive lines and colors them accordingly.
  2. Static Analysis: When your code finishes running, Semgrep analyzes it against a database of performance-related rules. Each rule contains a pattern to look for and a way to rewrite matching code to be more performant.
  3. Hybrid Recommendations: The outputs from the profiler and Semgrep are compared. Suggested rewrites from Semgrep are discarded if they don’t match a hotspot in the code, so the user is presented with the potential fixes that matter most. (A rough sketch of these three steps follows below.)
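To make the flow concrete, here is a rough Python sketch of those three steps. It is illustrative rather than WhyProfiler’s actual code: it assumes the `semgrep` CLI is on the PATH, uses yappi’s per-function CPU times as a stand-in for WhyProfiler’s per-line heatmap, and attributes each Semgrep finding to the nearest `def` line above it. The function names and the 0.1-second hotness threshold are made up for the example.

```python
# A minimal sketch of the three steps above; illustrative, not WhyProfiler's actual code.
import bisect
import json
import subprocess

import yappi


def profile(entrypoint, source_path):
    """Step 1: run the user's code under yappi; map each function's def line to CPU seconds."""
    yappi.set_clock_type("cpu")
    yappi.start()
    entrypoint()  # in WhyProfiler this is the notebook cell being executed
    yappi.stop()
    return {
        stat.lineno: stat.ttot
        for stat in yappi.get_func_stats()
        if stat.module == source_path  # keep only functions defined in the profiled file
    }


def analyze(source_path, rules_dir):
    """Step 2: run Semgrep with a directory of performance rules and return its JSON results."""
    out = subprocess.run(
        ["semgrep", "--config", rules_dir, "--json", source_path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)["results"]


def recommend(func_costs, findings, min_cpu_seconds=0.1):
    """Step 3: keep only the findings whose (approximate) enclosing function was hot."""
    def_lines = sorted(func_costs)
    hot = {line for line, cost in func_costs.items() if cost >= min_cpu_seconds}

    def owner(line):
        # attribute a finding to the nearest def line at or above it (a rough approximation)
        i = bisect.bisect_right(def_lines, line) - 1
        return def_lines[i] if i >= 0 else None

    return [f for f in findings if owner(f["start"]["line"]) in hot]
```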

To implement this, I wrote a Jupyter server extension and nbextension.
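For readers unfamiliar with notebook extensions: the server-side half is just a hook that the notebook server calls when the extension is enabled, plus Tornado handlers that the front-end nbextension can talk to. The sketch below shows that shape with a hypothetical route and an empty payload; it is not WhyProfiler’s actual endpoint.

```python
# Minimal sketch of the server-extension half (classic Jupyter Notebook).
# The route name and payload are hypothetical; this is not WhyProfiler's actual endpoint.
from notebook.base.handlers import IPythonHandler
from notebook.utils import url_path_join


class RecommendationsHandler(IPythonHandler):
    """Endpoint the front-end nbextension could poll for hybrid recommendations."""

    def get(self):
        self.finish({"recommendations": []})  # placeholder payload


def load_jupyter_server_extension(nbapp):
    """Called by the notebook server when the extension is enabled."""
    web_app = nbapp.web_app
    route = url_path_join(web_app.settings["base_url"], "whyprofiler", "recommendations")
    web_app.add_handlers(".*$", [(route, RecommendationsHandler)])
```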

The Story Behind WhyProfiler

Several months ago I was experimenting with startup ideas. I had a hunch (based on my own experiences as a developer) that performance optimization was often done poorly and without proper data from production systems.

I wrote WhyProfiler and started speaking to people about it. I soon came to a surprising conclusion: most people didn’t care much about the performance of either their production code or their data-science notebooks. In almost all cases, the performance was “good enough” or “bad, but not prioritized.”

I asked people what they did care about and the discussion usually turned to alerts and errors in production. “Modern cloud environments are too complex,” was a common complaint, along with “We’re hiring top talent but our stack has too many parts and we can’t specialize in everything.” 

So I turned my focus to that and started wondering: instead of presenting people with recommendations on how to fix their code, could we understand - really understand - issues with cloud environments and the reasons behind alerts? Could we provide actionable advice and fixes for those instead?

In the security world, the answer is obviously yes. Semgrep is a wonderful example of this. It finds security issues in your code and presents you with fixes for them.

But could the same be done for DevOps and alerts? Was it possible to imagine a world where an alert about Kafka or MongoDB or even just high CPU throttling in a Kubernetes app didn’t just tell you what the problem was, but also recommended fixes? Could we turn the expertise behind how we respond to alerts into infrastructure-as-code, just like we’ve encapsulated the knowledge for setting up applications in Dockerfiles?

And just like that, Robusta.dev was born. But that’s a different story.

After launching Robusta, I wanted to open source WhyProfiler and hand it over to the community. I reached out to Clint Gibler about collaborating on that. He of course said yes and here we are :)

Getting Started

WhyProfiler is available on GitHub. See instructions there.

Current Status

There is still a lot to do here! At present, WhyProfiler has only one static analysis rule, but the framework is in place to add many more. Even putting the static analysis aside, it functions as a very convenient Python profiler for Jupyter notebook that colors lines according to their execution time.

I’m working full-time on Robusta.dev, but would welcome PRs for the following:

  1. More Semgrep rules - a hybrid profiler can only be as good as its rules; the more the better. (See the example after this list for the kind of rewrite a rule encodes.)
  2. Packaging - right now WhyProfiler is packaged as a standalone Docker container. It should really be installable from PyPI. Pushing it to PyPI is easy, but we’ll need some documentation on how to install and configure it locally.
  3. Support for Semgrep rules related to code correctness, like this one. Unlike performance rules, correctness rules should always be shown, even if the code didn’t take a long time to run.
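
To make the first item concrete, here is the kind of slow/fast pair a performance rule could encode - the pattern it matches and the rewrite it suggests. This is just an illustration, not one of WhyProfiler’s shipped rules:

```python
# Illustrative only: the kind of rewrite a performance rule could suggest.

# Pattern a rule might match: building a list with an explicit loop and .append()
squares = []
for x in range(10_000):
    squares.append(x * x)

# Equivalent, faster rewrite the rule could propose: a list comprehension
squares = [x * x for x in range(10_000)]
```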

Beyond WhyProfiler

If you want to troubleshoot Python (or other languages) on Kubernetes, you should use Robusta.dev's manual troubleshooting tools for Python and Java. (Golang and C# coming soon.) Robusta is the easiest way to solve memory leaks, profile CPU, and attach non-breaking debuggers to applications running on Kubernetes.

More generally, Robusta.dev provides automatic insights for alerts and errors in Kubernetes clusters. Most of the time you won’t need to manually troubleshoot anything, because the built-in evidence collection shows you the exact data you need at the right time. Try it out today. Installation takes 97.68 seconds.
