Develop or die

I’ve been the technical lead of an HPC center for over a decade now, and for some 15 years I’ve watched paradigms come and go within the systems engineering domain. This text is about the future development of HPC in relation to the more general software domain. There are some very tough challenges ahead, and I’ll try to explain what is going on.

But first, let me get you into my line of thinking with a bit of my own story.

Turning forty this year, after years of exploring and experimenting within computer science, has landed me in a very peculiar place. People reach out for advice on very difficult matters around building all sorts of crazy computer systems. Some set out to build successful businesses, others just want to build robots. I help out as much as I can, although my time is normally spoken for. It’s a weird feeling to be referred to as senior. Where did those years go?

My experience in building HPC systems started in the early 2000s, and from a computer science perspective I’ve tried to stay aligned with developments in the field. This is never easy in large organizations, but having an employer bold enough to let me develop the field has made it possible not only to deliver enterprise-grade HPC for over a decade, but also to develop other parts of it in exciting ways.

It didn’t always look like this, of course.

I’ve been that person, constantly promoting and adopting “open source” wherever I’ve gone. A concept that, at least in automotive (where I mostly roam), was pariah (at best) in my junior years. I came into contact with open source and Linux during my university studies and was convinced open source was going to take over the world. I could see Linux dominating the data centers from the single fact that the licensing model would allow it. There are of course tons of other reasons for that to happen, but the licensing model in itself was decisive.

I’m sorry, but a license model that doesn’t scale better than O(n) is just doomed in the face of O(1).

(I think I’m a nightmare to talk to sometimes.)
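
To make that O(n) vs O(1) remark concrete, here is a toy comparison. The figures are entirely made up; only the scaling behavior matters.

```python
# Toy illustration of the O(n) vs O(1) licensing argument above.
# All figures are hypothetical; the point is the growth behavior.

PER_SEAT_LICENSE = 1_500   # assumed cost per user, per year (O(n))
FLAT_SUPPORT_FEE = 50_000  # assumed flat fee, independent of users (O(1))

for users in (10, 100, 1_000, 10_000):
    per_seat_total = PER_SEAT_LICENSE * users  # grows linearly with the org
    flat_total = FLAT_SUPPORT_FEE              # flat, no matter how large n gets
    print(f"{users:>6} users: per-seat {per_seat_total:>12,}  flat {flat_total:>8,}")
```

Past some organization size, the O(n) model always loses, whatever the constants happen to be.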

Anyway, telling my story of computer science and systems engineering can’t be done without reflecting on some of the technology elements and methodologies that came about during the past decades. After all, meaningful projections about systems engineering can’t be made without referencing technology paradigms.

From the mid to late 1990s, “Unix sysadmins” ruled the domain of HPC. Enterprise systems outside of HPC were Microsoft Windows. For heavyweight systems, VMS and the UNIX dialects prevailed. Proprietary software was what mattered to management (still is?), and everything was home-rolled. The concept of getting “locked in” by proprietary systems was poorly understood, which showed in a weak grasp of real costs and of what the ability to build high-quality systems actually rests on. Building HPC systems at this time was an art form. Web front ends didn’t exist. Most people were self-taught, and HPC was very different from today. My first HPC cluster (for a large company) lived on wooden shelves from IKEA in a large wardrobe. The CTO of the IT division was a carpenter. I’ll refer to this era as the “Windows/Unix Era”.

The Windows/Unix Era started to morph during the 2000s into the Linux revolution. In just a few years, the available software and infrastructure (not only in HPC) was becoming more and more Linux. Other OS dialects such as HP-UX, VMS, AIX, Solaris and the mainframes ended up in phase-out, witnesses of an era slowly disappearing. Linux on client desktops entered the stage, and at this point the crafting of HPC systems was becoming an art of automation. Macs got thinner.

Sysadmin work meant, in practice, scripting: bash, ash, ksh and the like, followed by cfengine, Ansible and Puppet. Computer science was starting to become quasi-magic (which I think is how most people perceive computers in general), and advanced provisioning systems became popular. Automation!

I was myself using Rocks Clusters for my HPC platform, with some Red Hat Satellite assistance to achieve various forms of automation. The still ever-popular DevOps appeared, mainly in the web services domain, and some people started to do Python instead of bash (OMG!). Everything was now moving towards open source. It became impossible to ignore this development, even at my primary employer. I became the chairman of the open source forum and slowly started to formalize that field in a professional way. A decade behind giants such as Canonical, IBM or Red Hat, but hey – automotive is not IT, right?

However, time and DevOps development stood still in the HPC domain. I blame the performance loss in virtualization layers for that situation, or perhaps HPC being a victim of its own success. While VMware, Xen, KVM and the rest filled up the data centers in just a few years, virtualization never took off in HPC. Half a decade passed like this, and so-called “clouds” started to appear.

I personally hate the term cloud. It blurs and confuses the true essence, which is perhaps better put as: “someone else’s computer”.

Recent developments in the cloud have exposed one of the core problems with cloud resources. I’m not going to make the full case here, but generally speaking, the access, integrity, safety and confidentiality of any data are almost impossible to safeguard – if you are using someone else’s computer. There is a fairly accessible remedy, which I’ll dress up in a fancy quote:

“Architect distributed systems, keep your data local and always stay in control of your own computing. Encrypt.”

I think time will prove me right, and I’ve put that belief to the test by taking an active part in a long-running research project, LIM-IT, which sets out to map and understand lock-in effects of various kinds. The results from this project should be mandatory education for any IT manager who wants to stay on top of their game in my industry.

From a technology point of view, I can recommend looking at projects like Nextcloud, OpenStack, BitTorrent, etc. But there are many, many others that operate with this great mindset.
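
As a minimal sketch of the “keep your data local and encrypt” principle, assuming the Python cryptography package is available: encrypt on your own machine, keep the key to yourself, and let only ciphertext touch someone else’s computer. The filenames here are purely illustrative.

```python
# Minimal sketch: encrypt data before it ever leaves your machine,
# using the symmetric Fernet recipe from the cryptography package.
# The key must stay under your own control; everything else can travel.
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # store locally and safely, never upload
fernet = Fernet(key)

plaintext = b"simulation results that should stay confidential"
ciphertext = fernet.encrypt(plaintext)

# Only the ciphertext goes to someone else's computer...
with open("upload_me.bin", "wb") as f:
    f.write(ciphertext)

# ...and only you, holding the key, can recover the data.
assert fernet.decrypt(ciphertext) == plaintext
```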

So, where are we now? First of all, HPC is in desperate need of ingesting technology from other domains, especially low-hanging fruit like that from the Big Data domain. Just to take one conceptual example: making use of TCP/IP-based communication stacks to solve typical HPC problems. One of my research teams explored this two years ago by implementing a pCMC algorithm in SPARK as part of a thesis program. The intention was to explore and show that HPC systems indeed serve as excellent hybrid solutions for tackling typical data analytics problems, and vice versa. I’m sure the MPI doctors will object, but frankly, that’s a lot of “Not Invented Here” syndrome. The results from our research thesis spoke for themselves. Oh, and the code can be downloaded here under an open source license so you can try it yourself. (Credits to Tingwei Huang and Huijie Shen for excellent work on this.)
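
To give a feel for what this looks like in practice, here is a minimal sketch – not the thesis code, just the classic Monte Carlo estimate of pi – of expressing an embarrassingly parallel, HPC-style workload in Spark instead of MPI:

```python
# A minimal PySpark sketch of a parallel Monte Carlo computation.
# Not the thesis pCMC code – just the classic pi estimate, to show
# the pattern of mapping an HPC-style workload onto Spark.
import random

from pyspark.sql import SparkSession

def in_unit_circle(_):
    """Draw a random point in the unit square; 1 if it falls inside the circle."""
    x, y = random.random(), random.random()
    return 1 if x * x + y * y <= 1.0 else 0

if __name__ == "__main__":
    spark = SparkSession.builder.appName("mc-pi-sketch").getOrCreate()
    sc = spark.sparkContext

    n, partitions = 10_000_000, 100  # tune to the size of your cluster
    hits = sc.parallelize(range(n), partitions).map(in_unit_circle).sum()

    print(f"pi is roughly {4.0 * hits / n}")
    spark.stop()
```

The whole thing is a map and a reduction – the same shape an MPI programmer would recognize, minus the explicit communication.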

Now, there is a downside to everything, and so also when opening up for a completely new domain of software stacks. Once you travel down the path of revamping your HPC environment with new technologies from, for example, “Cloud” (oh, how I hate that word) and going all in on “Everything as a Service”, you are basically faced with an avalanche of technology: Hadoop, SPARK, ELK, OpenStack, Jenkins, K8s, Docker, LXC, Travis, etc. etc. Each of these stacks requires years, if not decades, to learn. There exist few, if any, people who master even a handful of them, and their skills are highly sought after on a global market for computer wizards. Even more to the point: they’re probably not on your payroll, because you just can’t afford them.

So, as a manager in HPC, or in some other sufficiently complex IT environment, you face a Gordian knot of a problem:

“How do I manage and operate an IT environment of vast complexity PLUS manage to keep my costs under some kind of control?”

Mark Shuttleworth, by the way, talks brilliantly about this in his keynotes nowadays, and I’m happy to repeat his points.

Most managers will unfortunately fail when facing this challenge, and they will fail in one of three ways:

  1. Managers in IT will fail to adopt the right kind of technologies and get “locked in” – through data lock-in, license lock-in, vendor lock-in, technology lock-in, or all of them – ending up in a budget disaster.
  2. Managers in IT will fail to recruit skilled enough people, or will recruit too many with the wrong skill set – and they will fail to deliver relevant services to the market.
  3. Managers in IT will do both of the above – delivering sub-performing services too late to market, at extreme cost.

If I’m right in my dystopian projections, we will see quite a few companies within IT go down in the coming years. The market will be brutal to those who fail to develop. Computer scientists will be able to ask for fantasy salaries, and there will be a latent panic in management.

I’ve spent a significant amount of time researching and working on this very challenging problem, which is general in nature but definitely applicable to my own expert domain of HPC. In a few days, I’ll fly out for the opportunity to present a snapshot of my work to one of Europe’s largest HPC centers, HLRS in Germany. It’s happy times indeed.
