
The HUYGENS supercomputer system description

The new Huygens Supercomputer

The final configuration of the new National Supercomputer, when fully operational in mid 2008, will be based on IBM's next generation of POWER systems. The system will consist of a significant number of so-called Compute Nodes (CNs). These CNs are connected to each other through an InfiniBand (IB) network, enabling efficient execution of MPI-based applications which run across more than one CN. With respect to memory, the final configuration will meet the NCF procurement requirements of 4 - 8 GB per processor core.

The ratio between the standard CNs and the memory CNs in the system will be about 4 to 1. Furthermore, 2 standard CNs will serve as front-end nodes for interactive work and for submitting jobs to the batch system. In addition, 2 more standard CNs will be used for testing purposes (new software releases, etc.).

Processor Technology

Much of the information on the next generation of POWER technology is still confidential and will be released with the IBM announcement. Hence, the details given here about the expected next-generation POWER design are based on an article from RealWorld (view article), which does not necessarily reflect IBM's planned announcement. This article contains general information about a POWER6 chip architecture that is used in different configurations across a broad range of commercial and technical computing servers.

The processor cores of each CN will be based on IBM's next generation of POWER processors, which, like the POWER5, can perform 4 floating-point operations per clock cycle on 64-bit numbers. The next-generation POWER processor is expected to run at a clock frequency between 4.0 and 5.0 GHz. The total peak performance of the HUYGENS system will be more than 60 Tflop/s.
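These figures can be combined into a rough sizing estimate. The sketch below is a back-of-the-envelope calculation, assuming a mid-range 4.7 GHz clock; the exact frequency and core count are not stated in this description:

```python
import math

# Assumed mid-range clock; the text only gives a 4.0-5.0 GHz window.
clock_hz = 4.7e9
flops_per_cycle = 4           # 64-bit floating-point operations per clock cycle
target_flops = 60e12          # "more than 60 Tflop/s" system peak

per_core_peak = flops_per_cycle * clock_hz          # peak flop/s per core
min_cores = math.ceil(target_flops / per_core_peak) # cores needed to reach the target

print(f"per-core peak: {per_core_peak / 1e9:.1f} Gflop/s")
print(f"cores needed for 60 Tflop/s: {min_cores}")
```

With these assumptions each core peaks at 18.8 Gflop/s, so the system would need on the order of 3200 cores; the real figure depends on the clock frequency IBM eventually announces.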

Huygens will use the newest processor design in IBM's POWER family, which has existed for 17 years, since the introduction of the first RS/6000 in 1990.

It is based on RISC technology (Reduced Instruction Set Computer), which enables simple instructions to be executed in a pipelined fashion. The memory hierarchy consists of several levels of cache between the processor registers and main memory. From main memory, data travels to the processor registers via the L3 cache (32 MB, shared between two cores), then the L2 cache (4 MB per core), and finally the L1 cache (64 KB per core).
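The per-chip cache capacities implied by these numbers can be tallied as follows (a simple sketch, assuming the two-core chip layout described above):

```python
# Cache hierarchy per two-core chip, using the sizes quoted above.
CORES_PER_CHIP = 2
L1_PER_CORE_KB = 64       # private L1 per core
L2_PER_CORE_MB = 4        # private L2 per core
L3_SHARED_MB = 32         # L3 shared between the two cores

l1_total_kb = CORES_PER_CHIP * L1_PER_CORE_KB
l2_total_mb = CORES_PER_CHIP * L2_PER_CORE_MB
cache_total_mb = l1_total_kb / 1024 + l2_total_mb + L3_SHARED_MB

print(f"L1 total: {l1_total_kb} KB, L2 total: {l2_total_mb} MB, "
      f"L3: {L3_SHARED_MB} MB, overall: {cache_total_mb:.3f} MB per chip")
```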

Interconnect Technology

An essential part of the overall system configuration is the interconnect between the CNs. It is based on InfiniBand technology, delivered by QLogic switches. Each CN has IB4x links to the IB network. 288-port IB4x switches will be used, which means that each CN can be connected to every IB switch.

IB technology has been chosen because of its advantageous latency, which is important for the performance of MPI applications running across CNs. The MPI latency is on the order of a few microseconds.
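The effect of this latency can be illustrated with the usual alpha-beta model for message transfer time, t(n) = alpha + n/beta. The numbers below are illustrative assumptions (a 4 microsecond latency within the quoted "few microseconds", and a nominal 1.5 GB/s effective bandwidth), not measured values for Huygens:

```python
# Simple alpha-beta model: time to send an n-byte MPI message.
alpha = 4e-6      # assumed latency in seconds ("a few microseconds")
beta = 1.5e9      # assumed effective bandwidth in bytes/s (illustrative)

def transfer_time(n_bytes):
    """Estimated one-way transfer time for an n-byte message."""
    return alpha + n_bytes / beta

# Message size at which the latency and bandwidth terms are equal
# (the "half-bandwidth point"): n_half = alpha * beta.
n_half = alpha * beta
print(f"half-bandwidth message size: {n_half:.0f} bytes")
```

Messages much smaller than this crossover size (here about 6 KB) are latency-dominated, which is why a microsecond-scale latency matters so much for tightly coupled MPI codes.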

Storage and IO Bandwidth

NCF/SARA's procurement has turned out to be very challenging with respect to IO bandwidth. This was done on purpose, since the application workload on the system contains a relatively large proportion of IO-intensive codes. For this reason, the storage environment has been designed very carefully. First, there is a large amount of system-wide accessible storage, which is typically used for users' home file systems. This will amount to ca. 270 TB. Using IBM's General Parallel File System (GPFS), each CN has access to the full range of home file systems. The home file systems can be viewed as relatively far away from the CNs, although their aggregate IO bandwidth is approximately 12 GB/s.

In addition to the home file systems, there are two types of scratch file systems. These are fast scratch file systems and shared scratch file systems.

Fast scratch file systems are local to a CN and are typically meant for jobs which run within one CN, or for jobs which do not have shared IO requirements. Each CN has such a fast scratch file system. The total aggregate performance is about 24 GB/s across all standard CNs, with an additional 22 GB/s across the high-memory nodes. The fast scratch storage amounts to ca. 300 TB for all CNs together.

Shared scratch storage is an additional level of fast storage which is available to all CNs. It is typically meant for jobs which run across several CNs and for which each MPI process needs access to shared data. The amount of shared scratch storage is approximately 130 TB.

The total amount of usable storage is thus 270 + 300 + 130 = 700 TB. All storage is FC-based. A Storage Area Network based on Brocade switches will be used to give the CNs access to the data.
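As a quick sanity check, the capacities and aggregate bandwidths quoted above can be tallied (all figures are taken directly from the preceding paragraphs):

```python
# Storage tiers as described above: capacity in TB, aggregate bandwidth in GB/s.
tiers = {
    "home (GPFS)":    {"capacity_tb": 270, "bandwidth_gbs": 12},
    "fast scratch":   {"capacity_tb": 300, "bandwidth_gbs": 24 + 22},
    "shared scratch": {"capacity_tb": 130, "bandwidth_gbs": None},  # not quoted
}

total_tb = sum(t["capacity_tb"] for t in tiers.values())
print(f"total usable storage: {total_tb} TB")
```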

Archiving and Backup

As explained above, the system comes with a large amount of storage for home file systems. Until now, the home file systems on the SGI system were run with automatic migration software (DMF) to tape, so that file systems never filled up. A drawback was that large files were moved to tape rather quickly, while retrieving data from tape could be rather cumbersome.

For the new system, SARA expects to run a quota strategy on the home file systems. Behind the home file systems, an archive file system will be run, which is governed by DMF to interact with tape storage. Users will have to move their files from the home file system to the archive file system explicitly. Once on the archive file systems, files may be moved to tape automatically.

Backup is done using IBM's Tivoli Storage Manager (TSM) software, which is transparent to users.

Software Environment

In its Request for Proposal, NCF/SARA expressed a preference for a Linux-based operating system. IBM will run SUSE Linux on the system, with its own Fortran, C, and C++ compiler technology. Performance analysis tools are available to the users of the system. Numerical libraries are available in ESSL and Parallel ESSL (P-ESSL). IBM's Parallel Environment (PE) will be used for preparing and running MPI jobs.

The scheduling software is an important part of the system. Users log on to one of the two interactive CNs and perform their compilations and small-scale testing on these front-end CNs. Once they are ready for a larger parallel job, the batch system comes into action. Depending on the actual requirements of the job (how many processor cores, how much memory, large IO requirements, etc.), the batch system decides where on the system the job will run. Clearly, memory-demanding jobs will need to run on the memory CNs, and so on. The actual scheduler is IBM's LoadLeveler.
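To give an idea of how such a batch job is described, below is a minimal sketch of a LoadLeveler job script. The node and task counts, time limit, and program name are hypothetical, and the exact set of directives supported on Huygens may differ:

```shell
#!/bin/sh
# Minimal LoadLeveler job script (illustrative; directives start with "# @").
# @ job_name         = mpi_example
# @ job_type         = parallel
# @ output           = $(job_name).$(jobid).out
# @ error            = $(job_name).$(jobid).err
# @ node             = 2
# @ tasks_per_node   = 32
# @ wall_clock_limit = 01:00:00
# @ notification     = never
# @ queue

# Launch the MPI program through IBM's Parallel Environment (poe).
poe ./my_mpi_program
```

Such a script would be submitted with `llsubmit` and monitored with `llq`; LoadLeveler then matches the resource directives against the available CNs.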

Another important piece of software is the global file system software. As described in the storage section, the home file systems and the shared scratch file systems rely on a global file system. IBM's General Parallel File System (GPFS) will be used for this.

Physical Characteristics

  • 24-inch wide compute racks to hold the CNs
  • 19-inch wide storage racks to hold the storage and SAN switches
  • 19-inch wide racks to hold the InfiniBand switches
  • 19-inch wide racks to hold networking equipment

Support and Services

The system will be delivered with both hardware and software support during extended office hours. Furthermore, IBM has committed a large number of support days, both to the system administrators at SARA and to users in the field. This includes scripting and accounting, but also analysis, optimization, and parallelization of user applications, as well as courses and lectures for the user community.

Time Schedule

IBM deliveries begin in June 2007 with an initial configuration, which is significantly more powerful than the current configuration. This serves as a platform for the final system, which is expected to be available to users by August 2008.

Initial Configuration

The initial configuration will be delivered to SARA in June 2007 and is expected to be operational for users by August 2007. It will be based on IBM's System p575 with POWER5+ processors running at a 1.9 GHz clock frequency. The total peak performance will be 14.6 Tflop/s, with the CNs connected by InfiniBand switches. The software stack is identical to what will be available on the "next generation" POWER-based system.
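The quoted peak figure is consistent with the 4-flops-per-cycle POWER design described earlier. The core count below is inferred from the stated numbers, not given in this description:

```python
# POWER5+ at 1.9 GHz, 4 floating-point operations per clock cycle.
clock_hz = 1.9e9
flops_per_cycle = 4
per_core_peak = flops_per_cycle * clock_hz   # 7.6 Gflop/s per core

cores = 1920   # inferred: 14.6 Tflop/s / 7.6 Gflop/s per core, rounded
system_peak = cores * per_core_peak

print(f"system peak: {system_peak / 1e12:.1f} Tflop/s")
```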
