At UC Berkeley, Filestore powers one of the largest JupyterHub deployments in US higher education.
What is JupyterHub?
JupyterHub, which makes Jupyter notebooks accessible to large groups of users, has emerged as a crucial platform for collaborative data science, enabling researchers, students, and developers to work together on challenging projects. Its ability to manage many user environments and provide access to shared resources has transformed how data science education is delivered at scale.
However, these advantages are not without difficulties, as anyone who has run a large-scale JupyterHub deployment will attest. As deployments scale, managing file storage for a diverse set of users and computationally demanding workloads quickly becomes a major challenge. Storage must be dependable, scalable, and performant to guarantee smooth operations and efficient workflows.
UC Berkeley uses JupyterHub as a collaborative data science environment for students, faculty, staff, and researchers, in what is possibly the largest such deployment in US higher education. Their deployment, Datahub, is a heavily customized Zero to JupyterHub installation consisting of more than 15 hubs serving 15,000 users across more than 35 departments and more than 100 courses. Naturally, uptime and accessibility are critical: assignments and projects have due dates, and quizzes and exams follow the academic calendar.
When UC Berkeley and Google first started talking, UC Berkeley was very forthright about the difficulty of supporting such a sizable and active user base, particularly with limited resources. Like many universities, they faced budget constraints that made it hard to staff a large IT team. In fact, a small team of just two full-time employees, helped by dedicated volunteers and part-time staff, was running this enormous JupyterHub deployment.
It soon became evident that their existing setup, in which user home directories lived on a self-managed NFS service hosted on Google Compute Engine and were mounted into user environments, was not keeping up with the demands of the growing deployment. Their expanding user base needed a more dependable, integrated experience, and the team had to find a way to absorb that growth without sacrificing usability or performance.
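For context, a common way to wire a self-managed NFS export into a Zero to JupyterHub deployment is a static PersistentVolume and claim, plus a singleuser storage stanza in the Helm chart’s values. The sketch below is a generic illustration of that pattern, not UC Berkeley’s actual configuration; the server IP, export path, and resource names are placeholders.

```yaml
# Illustrative only: NFS-backed home directories for Zero to JupyterHub.
# PersistentVolume pointing at the self-managed NFS server on Compute Engine.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: home-nfs
spec:
  capacity:
    storage: 1Ti
  accessModes: [ReadWriteMany]
  nfs:
    server: 10.0.0.2        # placeholder: internal IP of the NFS server VM
    path: /export/home
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: home-nfs
spec:
  accessModes: [ReadWriteMany]
  storageClassName: ""      # bind to the pre-created PV, not a dynamic class
  volumeName: home-nfs
  resources:
    requests:
      storage: 1Ti
```

In the chart’s values.yaml, each user then gets a subdirectory of the shared volume as their home directory:

```yaml
singleuser:
  storage:
    type: static
    static:
      pvcName: home-nfs
      subPath: "home/{username}"
```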
As a preeminent research university, they also had to balance the goals of cross-departmental training against the constraints of a limited IT budget.
This is where Filestore, Google Cloud’s managed NFS storage solution, comes into play. By sharing UC Berkeley’s path to Filestore, we hope to offer useful insights and practical advice to anyone facing comparable obstacles in their own environments.
What makes Filestore special?
The team was operating in near-constant crisis mode when Shane joined in October 2022. A mid-semester surge of new Datahub users had strained the capacity of the GKE architecture. Worse, the self-managed NFS service would frequently crash under the load.
The team was able to fix the GKE performance issues by re-architecting the configuration, separating specific course hubs and the JupyterHub support infrastructure into their own node pools. This improved performance for those users, but the underlying storage problems remained, and one key single point of failure stood out: the self-managed NFS service. As a band-aid to keep things running, the team had installed a systemd timer that automatically restarted the NFS service every 15 minutes.
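For the curious, a restart band-aid of that kind typically takes the form of a oneshot service plus a timer unit. The sketch below is generic, not the team’s actual configuration; the unit names are made up, and the NFS service name varies by distribution.

```ini
# /etc/systemd/system/nfs-restart.service (hypothetical name)
[Unit]
Description=Restart the NFS server as a stopgap against overload crashes

[Service]
Type=oneshot
# Assumes the kernel NFS server on a Debian-style system.
ExecStart=/usr/bin/systemctl restart nfs-kernel-server.service
```

```ini
# /etc/systemd/system/nfs-restart.timer (hypothetical name)
[Unit]
Description=Trigger nfs-restart.service every 15 minutes

[Timer]
OnCalendar=*:0/15
Persistent=true

[Install]
WantedBy=timers.target
```

Enabling the timer with systemctl enable --now nfs-restart.timer keeps the service cycling every quarter hour, which papers over the crashes without addressing the underlying overload.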
Although total outages were avoided, the self-managed infrastructure still struggled to keep up. The user base kept growing quickly, workloads were getting heavier, and the budget simply could not cover the constant demand for more servers and storage. They needed a solution that was both more effective and more economical.
At that point, they got in touch with Google Cloud and the Filestore team. In less than an hour, the UC Berkeley team was convinced that Filestore was the right choice. They were especially drawn to the Filestore Basic HDD tier, which was reasonably priced and let them choose instance sizes.
Before getting into UC Berkeley’s migration, it’s worth noting that Filestore offers three tiers: Basic, Zonal, and Regional, and selecting between them isn’t always an easy choice. Basic instances offer good performance at a reasonable price, although they limit capacity management (you cannot shrink an instance once created). Zonal instances offer very high performance for data science education workloads that demand minimal latency.
As the name implies, however, Zonal instances are confined to a single zone within a region; an outage in that zone can take workloads down with it. Filestore Regional, in contrast, synchronously replicates data across three zones in a region, protecting it if any one zone fails. The trade-offs among the three? Cost, storage-management flexibility, performance, and the storage SLA. Choosing one means weighing performance against your tolerance for downtime, with budget and capacity requirements inevitably factoring into the decision as well.
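As a concrete example, creating a Basic HDD instance with gcloud looks roughly like the following; the instance name, zone, share name, and capacity are placeholders rather than UC Berkeley’s actual values.

```sh
# Create a Basic HDD Filestore instance with a 2 TB file share on the
# default VPC network; BASIC_SSD, ZONAL, and REGIONAL are the other tiers.
gcloud filestore instances create datahub-homes \
  --zone=us-central1-b \
  --tier=BASIC_HDD \
  --file-share=name=homes,capacity=2TB \
  --network=name=default
```

The resulting instance exposes an ordinary NFS export, so it can be mounted directly or referenced from a Kubernetes PersistentVolume like the one sketched earlier.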
Making the switch from DIY NFS to Filestore
Shane and his team were eager to put Filestore to the test as soon as they had a firm grasp of it. They launched a proof-of-concept deployment, connecting a Filestore instance to a smaller JupyterHub environment. Being the hands-on technical lead that he is, Shane jumped right in, pushing the system further by running bonnie++ benchmarks from within a single-user notebook server pod.
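A benchmark run along those lines might look like the sketch below; the pod name and flag values are illustrative, and it assumes bonnie++ is installed in the single-user image.

```sh
# Run bonnie++ inside a single-user notebook pod against the NFS-mounted
# home directory. -s is sized in MiB and should exceed the pod's RAM so the
# page cache can't absorb the test; -n 64 creates 64*1024 small files to
# exercise create/stat/delete, a pattern notebook workloads hit constantly.
kubectl exec -it jupyter-testuser -- \
  bonnie++ -d /home/jovyan -s 4096 -n 64 -u jovyan
```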
Managing Filestore
Since switching to Filestore, Shane and his team at UC Berkeley have seen a level of performance and stability they never imagined possible. They describe Filestore as a “deploy-and-forget” solution: they haven’t had a single minute of downtime, and their users, the thousands of students who rely on Datahub, haven’t reported any performance problems.
Their management overhead has also dropped significantly. They have a few basic Google Cloud alerts configured to feed into their existing PagerDuty setup and notify them whenever a Filestore instance reaches 90% of its capacity. Even those warnings are rare, and expanding storage when necessary is simple.
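A minimal version of such an alert policy, assuming the used_bytes_percent metric that the Filestore documentation exposes for instances and a placeholder notification channel wired to PagerDuty, might look like this:

```sh
# Alert when any Filestore instance stays above 90% capacity for 5 minutes.
# PROJECT_ID and CHANNEL_ID are placeholders.
cat > filestore-90pct.json <<'EOF'
{
  "displayName": "Filestore instance over 90% full",
  "combiner": "OR",
  "conditions": [{
    "displayName": "used_bytes_percent > 90",
    "conditionThreshold": {
      "filter": "metric.type=\"file.googleapis.com/nfs/server/used_bytes_percent\" AND resource.type=\"filestore_instance\"",
      "comparison": "COMPARISON_GT",
      "thresholdValue": 90,
      "duration": "300s"
    }
  }],
  "notificationChannels": ["projects/PROJECT_ID/notificationChannels/CHANNEL_ID"]
}
EOF
gcloud alpha monitoring policies create --policy-from-file=filestore-90pct.json
```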
To further optimize their usage and keep costs under control, they have adopted a straightforward but effective strategy. At the end of each semester they archive user data to Cloud Storage, then right-size their Filestore instances based on usage patterns, either creating smaller instances or consolidating hubs onto shared ones so they only pay for the storage they need. For migrating data between instances, rsync remains their tool of choice. The process takes time, but it has become a standard part of their workflow.
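The end-of-semester workflow they describe can be sketched in two steps, shown below with placeholder paths and bucket names: archive to Cloud Storage, then rsync what remains onto a right-sized instance.

```sh
# 1. Archive last semester's home directories to a Cloud Storage bucket.
gsutil -m rsync -r /mnt/filestore-old/home gs://datahub-archive/2024-spring/home

# 2. Copy active data to the new, smaller Filestore instance, preserving
#    permissions, timestamps, and hard links.
rsync -aH --info=progress2 /mnt/filestore-old/home/ /mnt/filestore-new/home/
```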
In conclusion
UC Berkeley’s experience underscores an important lesson for anyone deploying large-scale educational platforms as force multipliers for teaching: as JupyterHub deployments grow in size and complexity, so do the demands on the supporting infrastructure. Success hinges on finding solutions that are both financially viable and technically sound. Despite some missing automation tools, a minor learning curve with Filestore Basic, and a higher price tag, Filestore proved to be that solution for Datahub, offering a potent combination of performance, reliability, and operational efficiency, and empowering the next generation of data scientists, statisticians, computational biologists, astronomers, and innovators.