This year is turning out to be a very interesting one in the world of clusters, grids, and supercomputing. Advances in computing tend to move forward in fits and starts, with developments in one area often rushing ahead of developments in others. The areas that have fallen behind then experience breakthroughs which allow them to catch up.
In the past decade, the plummeting costs of processing power and local disk storage, along with the development of Ethernet drivers by people like Donald Becker, have allowed for the construction of low cost clusters of computers which can work together to solve problems. That sounds great, but in and of itself it doesn't solve all problems in supercomputing; indeed, it creates others. Another major issue in supercomputing is that you often want areas of centralized storage, along with the files and filesystems that reside on that storage, which all of the computers in your cluster can read from and write to. Effectively, the computers in your cluster become clients of those pools of storage. There are two general types of such storage: NAS and SAN. And, lastly for this epistle, a new issue has cropped up with the construction of large clusters of compute "nodes", namely that such clusters are so cheap that they can be scaled to the point where they strain the electrical capacity of modern day data centers. Just last year we had to retrofit our data center at VLICA to add 900 amps of electrical capacity.
The electrical power problem has spurred renewed research into a hoary old idea in computing, namely using the graphics cards in computers to do more general purpose computational work and not just good ol' fashioned things like rendering. In line with what I wrote above about innovations in computing often developing at different rates, graphics cards have improved their capabilities at a far faster rate than general purpose CPUs. This has spurred the two giants of the graphics card industry, ATI (which was actually acquired by AMD) and NVidia, to develop solutions for this emerging market. Our developers at VLICA have tried several ideas. At this time it looks like NVidia's CUDA suite is the front runner, but the computing industry is nothing if not competitive and it can't be said that the game is over. Not by a long shot.
A primary issue still outstanding with GPGPU solutions is cost. Depending upon your applications, they can outperform general purpose CPU solutions by several orders of magnitude. However, the cost of the cards, along with the cost of the hardware to house them (often you need 3U sized servers), often means that the price per server is also several times that of a cheap Dell or HP server. In other words, as of the current time the increased cost per server cancels out the increased performance of the GPUs, but that is changing. Also, multicore processors are coming on strong, giving GPGPUs a new competitor. The main issue here is that programmers need to start writing code that allows multiple threads to execute simultaneously, and coding is hard enough already.
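To make the multithreading point concrete, here is a minimal sketch in Python of the kind of restructuring it implies: the work is split into chunks, each chunk is handed to a worker thread, and the partial results are combined. The function names and the choice of four workers are purely illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(chunk):
    """Work done by one thread: sum the squares of its slice."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(values, workers=4):
    """Split values into roughly equal chunks and process them concurrently."""
    # Ceiling division so every element lands in exactly one chunk.
    size = max(1, (len(values) + workers - 1) // workers)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    # Combine the per-thread partial results into the final answer.
    return sum(partials)
```

Note that in standard CPython the global interpreter lock keeps pure-Python, CPU-bound threads from actually running in parallel, so this shows the shape of the code rather than a guaranteed speedup. Even in this toy case the programmer has to think about partitioning the data and merging the results, which is exactly the extra burden described above.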
The other item of interest is that in the world of Unix and Linux, a 20+ year staple solution for filesystem sharing over networks is getting an overdue overhaul. That system is of course NFS. NFS was written for a world where you had one server serving up files on so called "mount points" defined on the server, and a fairly small number of clients. Originally the mount points were defined in static files; however, over the years the number of mount points and clients grew larger and larger. Eventually someone came up with the mechanism of doing mounts automatically via a product called autofs, so that systems administrators would not have to go through the trouble of adding and maintaining an endless number of new mount points in static files on an endlessly increasing number of clients. The actual mounting (and unmounting) of filesystems is done by the autofs mechanism, so that users and administrators do not have to issue endless streams of commands to make filesystems available to clients (and release them when they are through with them).
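The mount-on-demand behavior described above can be modeled with a small Python toy. This is only a sketch of the idea and performs no real mounts: the class name, the map entries, and the 300 second idle timeout are all hypothetical, and it is not how autofs is implemented.

```python
class ToyAutomounter:
    """Toy model of on-demand mounting; no real mounts are performed."""

    def __init__(self, mount_map, idle_timeout=300):
        self.mount_map = mount_map        # key -> "server:/export" (hypothetical entries)
        self.idle_timeout = idle_timeout  # seconds before an idle mount is released
        self.active = {}                  # key -> time of last access

    def access(self, key, now):
        """Mount on first access; afterwards just refresh the last-used time."""
        if key not in self.mount_map:
            raise KeyError(f"no map entry for {key}")
        newly_mounted = key not in self.active
        self.active[key] = now
        return newly_mounted

    def expire(self, now):
        """Release every mount that has been idle longer than the timeout."""
        idle = [key for key, last in self.active.items()
                if now - last > self.idle_timeout]
        for key in idle:
            del self.active[key]
        return sorted(idle)
```

The point of the model is that nobody lists mounts per client by hand: a shared map says what could be mounted, and the mount itself only exists while somebody is using it.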
This is all a fine and dandy solution, but running NFS this way creates serious problems. Namely, all the I/O has to go to and from the server, which can overload the server's networking capability. The number of mounts being served up can quickly reach into the thousands on a large cluster. Also, NFS has never guaranteed that multiple clients reading from and writing to the same file simultaneously would all see the same data in that file. File locking was always an issue, as was recovery in the event that a server or client failed. There are also security issues which I won't get into.
Several vendors have offered solutions for firms which have large SAN or NAS environments with lots of clients accessing data simultaneously. At VLICA, we have used Polyserve as our solution. What Polyserve offers is its own filesystem, which allows for concurrent reads and writes to filesystems in a SAN or NAS from many NFS servers, all of which can act in concert. You install and define Polyserve on the NFS servers, all of which are aware of each other. They form a "matrix" composed of all of the NFS servers defined at installation, as well as all of the NFS filesystems which are meant to be a part of this matrix. You then install a load balancing system between the many clients and the servers in the Polyserve matrix. This distributes I/O amongst the multiple NFS fileservers which have Polyserve installed on them, and from there the servers read and write to the SAN or NAS. Such a solution is known as a clustered filesystem.
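The load balancing step can be pictured with a small Python sketch. Round-robin is assumed here purely for illustration; this is not a claim about the policy any particular load balancer or Polyserve uses, and the client and server names are made up.

```python
from itertools import cycle


def assign_round_robin(clients, servers):
    """Hand each client the next NFS server in rotation (illustrative policy)."""
    if not servers:
        raise ValueError("at least one server is required")
    rotation = cycle(servers)
    # Clients are assigned in order, wrapping around the server list.
    return {client: next(rotation) for client in clients}
```

With eight clients and four servers, each server ends up with two clients, so no single NFS server carries all of the I/O.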
However, there are limits to what vendor solutions can do to address the problem of many clients writing to a pool of storage. The single biggest problem is that the performance is not linearly scalable. What I mean by that statement is this: if you have 4 servers which between them can achieve 1 GB per second of I/O between the SAN and clients, then 8 servers will achieve less than 2 GB per second. Vendors will claim that their solutions are linearly scalable, but their claims don't stand up under real world conditions. Also, you have the problem of having to install an ever greater number of servers into your NFS server pool. A classic summary of the problems involved in using NFS can be found in the paper entitled "Why NFS sucks", which was written by Olaf Kirch, noted NFS developer, for the 2006 Linux Symposium.
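One way to put a number on "not linearly scalable" is to compare measured throughput against what perfect linear scaling would predict. The sketch below does that in Python; the function name is an assumption of mine, and the 1.6 GB per second figure in the usage note is a hypothetical measurement, not a result from any real system.

```python
def scaling_efficiency(base_servers, base_throughput, servers, throughput):
    """Fraction of ideal linear scaling achieved (1.0 means perfectly linear)."""
    # Ideal throughput if every added server contributed as much as the originals.
    ideal = base_throughput * servers / base_servers
    return throughput / ideal
```

Starting from 4 servers at 1 GB per second, perfect scaling would give 2 GB per second at 8 servers. If 8 servers hypothetically delivered 1.6 GB per second, `scaling_efficiency(4, 1.0, 8, 1.6)` would come out at about 0.8, meaning a fifth of the added capacity was lost to overhead.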
A more elegant solution to the problem of NFS scalability would be for clients to be able to write directly to the storage pool, and in parallel, effectively cutting out the "middle man" of the NFS server. This is exactly what pNFS, which is part of a minor revision of NFS version 4 (NFSv4.1), is designed to do. A highly simplified way of describing how pNFS works is that the server essentially serves up only metadata to clients, telling them how and where to locate the files or filesystems in the storage pool. The I/O itself is not driven through the servers. The clients themselves have a so called "layout driver" in the Linux kernel which takes the metadata served up by the server, and I/O operations then run directly between the client and the storage pool. My supervisor is currently running tests using pNFS on a simple setup, and so far we are getting nearly linearly scalable increases in I/O performance in some instances. This stuff does show promise. pNFS will also almost certainly help adoption of NFSv4, since many IT managers have not seen many good reasons, outside of improved security, for adopting NFSv4 in their shops.
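The split between metadata and data can be sketched as a toy model in Python. Everything here is hypothetical (the class names, the device names, and the in-memory dictionaries standing in for storage); it illustrates the division of labor only, not the actual protocol or the kernel layout driver.

```python
from concurrent.futures import ThreadPoolExecutor


class MetadataServer:
    """Serves only layouts (where the data lives), never the data itself."""

    def __init__(self, layouts):
        self.layouts = layouts  # filename -> ordered list of (device, stripe_id)

    def get_layout(self, filename):
        return self.layouts[filename]


class Client:
    """Stands in for the client-side layout driver."""

    def __init__(self, metadata_server, storage_pool):
        self.mds = metadata_server
        self.storage = storage_pool  # device -> {stripe_id: bytes}

    def _read_stripe(self, location):
        device, stripe_id = location
        # Direct client-to-storage read; the metadata server is not involved.
        return self.storage[device][stripe_id]

    def read(self, filename):
        layout = self.mds.get_layout(filename)
        # Fetch all stripes in parallel; map() returns them in layout order.
        with ThreadPoolExecutor(max_workers=len(layout) or 1) as executor:
            stripes = executor.map(self._read_stripe, layout)
            return b"".join(stripes)
```

The server answers one small question (where is the file?) and then steps out of the way, so adding storage devices adds bandwidth without pushing more bytes through a single server.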
Enough for now. This has been a long entry and it is getting late. Another day of problems and black art sorcery awaits tomorrow at VLICA.