When your job is to support the safety and scientific endeavor of the Hubble Space Telescope, it is critical that you are a safe pair of hands when it comes to operating your infrastructure!
STScI understood the complexity and importance of their mission, and knew they had to change as they managed the transition to Kubernetes. We joke in the DevOps world that we cannot go into the data centre to fix our IT; STScI takes this to a whole new level!
The DevOps Platforms Team at STScI, led by James Wu, recognised the need for change to help developers build software of the highest quality without compromise. STScI had used a traditional service model with software running directly on VMs, so they decided that their move to cloud native needed a new approach to infrastructure. This was very much a developer-focused change, driven from the bottom up. The decision was taken to adopt a cloud native stack, together with a platform engineering operating model to support it. STScI evaluated and selected Kubernetes and Rancher as their underlying platform.
Recognising that operating this platform was going to create new challenges, on top of the goal of improving developer experience during the transition, James looked to the community for ideas on day 2 operations, found Robusta, and quickly developed his ideas around platform operations. Building, learning and operating multiple Rancher clusters, many of which have strict security requirements, brings some unique challenges, but the desire not to reinvent the wheel also figured highly on the agenda.
STScI already used Prometheus and were keen to keep embracing it, while simplifying the observability stack as much as possible.
STScI currently operates multiple clusters with limited Internet access and provides formidable developer support with a platform team of only three engineers. To do this, they embrace the built-in graphing, alerting, log collation and automation, reducing the need to keep building dashboards or writing complex queries.
STScI deployed Robusta and within a couple of hours were seeing results. Enriched Slack messages, with graphs, logs and other details, provided the context everyone needed to do their job, significantly reducing the need to access the clusters via kubectl.
The key is that Robusta is a dedicated Kubernetes Observability and Operations platform – it supercharges Prometheus and makes Kubernetes troubleshooting accessible to anyone.
Robusta lowers the barrier to making sense of alerts on Kubernetes with human-readable messages, a way to visualise events across the whole environment over time, and less noise in general. This eases the cognitive burden on engineers tasked with operating clusters and reduces the risk of configuration drift. Robusta monitors all the Prometheus alerts and Kubernetes errors that occur and surfaces the important ones, so that no problem goes unnoticed. The result: STScI can make quick changes to their clusters.
James and the team at STScI have made a lot of progress, but they are not done yet. On the agenda right now is cost optimisation, where they are planning to use KRR both to measure infrastructure efficiency and as a tool to help their developers optimise their applications at run time. And of course security is a major concern for STScI, so James and his team are also implementing role-based access controls (RBAC) in the Robusta UI to ensure everyone has all of the context they need without needing privileged access to the clusters. This gives RBAC-controlled access to automate many tasks, including managing node and pod lifecycle using the embedded workflow.
As James Wu said:
“With a small platform team at STScI, Robusta acts as a force multiplier and lets a lean team like ours punch above our weight when it comes to delivering capability”.
Finally, the team are keen to adopt the AI capabilities in the platform, which, fortunately, can be run on their own private LLMs. James commented that “Robusta strives to work closely with STScI to ensure they are always delivering the most value for STScI’s needs. We value the responsiveness and cooperation, and it goes a long way toward augmenting the capabilities on our team”.
Let’s face it: Kubernetes needs to be simple for STScI to operate. That telescope is dramatic enough!