Transformation in the workings of HPC.

Waking up to the world of services and clouds, HPC systems are undergoing a massive transformative shift. Compute-hungry AI, data analytics, and advances in physics drive much of this transformation. This article is a personal reflection on that shift and on how it intersects with my professional work.

The legacy systems we are transforming look like this:

  • They can rarely run more than one operating system.
  • They are difficult to re-deploy or rebuild.
  • They are difficult to tune and optimize during the lifetime of their operation.
  • They are difficult to re-purpose for specific or temporary needs.
  • They leave little room for exploring new setups or re-configurations.
  • They do not offer good programmatic access, which makes modern automation hard.
  • They are defined by hardware dependencies and requirements, which makes virtualization and containerization difficult objectives to reach.
  • They are difficult to secure while still maintaining flexibility, access, and collaboration.
  • They are hard to split into smaller resource pools for separate tenants.
  • They are difficult to integrate with public clouds and their services.
  • Many HPC tools were not built with a cloud context in mind, which complicates the adoption of modern cloud-oriented tooling.

You will argue that some advanced HPC systems already run on OpenStack (CERN, for example) to mitigate some of the problems above. I agree. However, I find these problems coming back higher up in the software layers. That is, deploying a modern software stack on top of an HPC system, getting it up to speed quickly and performing well, is not trivial even with access to the excellent primitives of OpenStack IaaS.

In fact, from a general perspective, the problem of deploying and operating these higher-order systems is monumental.

General problems require general solutions.

Partial solutions come from AI and GPU computing, these days often harbored in Docker and Kubernetes, where advanced setups utilize GPUs and high-speed interconnects in very innovative ways. Data analytics derives its problem solving from Spark, Hadoop, and the like. I see these areas as drivers that push traditional HPC systems in a direction that embraces and enables all of this software in a new generation of HPC systems. But that also requires a proper solution to the general problem of managing the tsunami of software systems and their clouds.
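
As a small illustration of how these pieces meet, the sketch below uses the official Kubernetes Python client to request a single GPU for a containerized workload. The namespace, image, and GPU count are assumptions of mine, and it presumes a cluster where the NVIDIA device plugin is installed; treat it as a sketch, not a recipe.

    # A minimal sketch: asking Kubernetes for one GPU from Python.
    # Assumes a cluster with the NVIDIA device plugin and a valid kubeconfig;
    # the namespace, image and GPU count are illustrative, not prescriptive.
    from kubernetes import client, config

    config.load_kube_config()  # use load_incluster_config() when running inside the cluster

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="gpu-smoke-test", namespace="default"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="cuda",
                    image="nvidia/cuda:12.2.0-base-ubuntu22.04",
                    command=["nvidia-smi"],
                    resources=client.V1ResourceRequirements(
                        limits={"nvidia.com/gpu": "1"}  # one GPU, scheduled like any other resource
                    ),
                )
            ],
        ),
    )

    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)

The interesting part is not the handful of lines, but that an accelerator becomes a resource that software can request, use, and release without a human in the loop.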

I’ll make an attempt to break out some of my design targets for a next-generation HPC system that addresses the general problems listed above:

  • Built fundamentally as a cloud, perhaps taking the NIST cloud definition as a reference, with an added hardware layer (which the standard does not cover).
  • Able to deploy advanced software stacks on top of generally usable HPC, specifically to meet the needs of GPU computing and I/O-intensive workloads.
  • Able to build any software stack fast and tear it down equally fast; this is a key feature.
  • Able to draw from and integrate with public cloud resources. It must be, because the wealth of services and solutions in general computer science will never be covered by your private cloud, and you cannot solve every problem alone.
  • Built on technologies that allow for programming the complete infrastructure setup. In practice this means that deployed services must provide proper API primitives that let programmers manipulate and control the infrastructure as a whole (see the sketch after this list). Managing software manually is dead; all hail our robot overlords.
  • Promotes and uses open standards when selecting APIs and data formats, to avoid “standards lock-in”.
  • Adopts the “subscriber” model of resource consumption. Users are not “owners” but “subscribers”, which means the smallest entity of access and accounting is the “user account”.
  • Packages subscriptions into “services”. This is what we can learn from public clouds, and a great way to get organized around priorities, costs and revenues.
  • Able to track resource consumption within the supplied services.
  • Explicitly focused on collaboration (sharing) across projects and IT environments. Working with distributed, remote, and cross-disciplinary teams requires this; your technology needs to support and accelerate this way of working.
  • Built on a security model that enables delegated access while providing transparent access and tenant isolation by default. Security is for real.
  • All things open source.

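To make the point about programmable infrastructure concrete, here is a minimal sketch using the openstacksdk Python library against a hypothetical private cloud. The cloud name, image, flavor, and network are placeholders for whatever your IaaS layer actually offers; the same pattern applies to any cloud that exposes proper API primitives.

    # A minimal sketch of "infrastructure as an API" using the openstacksdk library.
    # The cloud name, image, flavor and network below are placeholder assumptions
    # for an OpenStack-based private cloud; credentials come from clouds.yaml.
    import openstack

    conn = openstack.connect(cloud="my-hpc-cloud")

    # Provision a node for a temporary workload...
    server = conn.create_server(
        name="scratch-node-01",
        image="ubuntu-22.04",
        flavor="gpu.large",
        network="tenant-net",
        wait=True,
    )
    print(f"{server.name} is {server.status}")

    # ...and tear it down just as programmatically when the work is done.
    conn.delete_server(server.id, wait=True)

The particular library matters less than the fact that provisioning and tear-down become ordinary function calls that can be scripted, scheduled, and version-controlled.
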
Now, while the list above is not complete, a system with these characteristics probably makes for a successful compute center of the future across many disciplines.

However, change can also be destructive, which is worth a final remark:

The great philosopher Alan Watts says that people who feel joy and pleasure in what they do will, as a side effect, become artisans in their profession.

I reflect on this a lot, as change can be hurtful. It is important to me to provide a growing and joyful experience for the people affected by change. Working with modern tools and methods in a computer science context enables professionals to develop excellence and become artisans in their work, but ultimately transformation is a human process.

Technology is secondary.
