The HUYGENS supercomputer system description
The final configuration of the new National Supercomputer, when
fully operational in mid 2008, will be based on IBM's next
generation of POWER systems. The system will consist of a
significant number of so-called Compute Nodes (CNs). These CNs are
connected to each other through an InfiniBand (IB) network,
enabling efficient execution of MPI-based applications which run
across more than one CN. With respect to memory, the final
configuration will meet the NCF procurement requirements of 4 - 8
GB per processor core.
The system contains standard CNs and memory CNs (intended for
memory-demanding jobs) in a ratio of about 4 to 1. In addition, 2
standard CNs serve as front-end nodes for interactive work and for
submitting jobs to the batch system, and 2 further standard CNs are
used for testing purposes (new software releases, etc.).
Processor Technology
Much of the information on the next-generation POWER technology is
still confidential and will be released with the IBM announcement.
Hence, the details given here of the expected next-generation POWER
design are based upon an article from RealWorld, and do not
necessarily reflect IBM's planned announcement. The article contains
general information about a POWER6 chip architecture which is used in
different configurations across a broad range of commercial and
technical computing servers.
The processor core of each CN will be based on IBM's next
generation of POWER processors, which, like POWER5, can perform 4
floating-point operations per clock cycle on 64-bit numbers. The
next-generation POWER processor is expected to run at a clock
frequency between 4.0 and 5.0 GHz. The total peak performance of
the HUYGENS system will be more than 60 Tflop/s.
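The figures above imply a per-core peak and a rough lower bound on
the number of cores. The following is a minimal sketch of that
arithmetic, using only the numbers quoted in the text; the function
names are illustrative, and the resulting core counts are an
inference, not a published specification.

```python
import math

# Figures quoted in the text.
FLOPS_PER_CYCLE = 4          # 64-bit floating-point operations per clock cycle
TARGET_PEAK_TFLOPS = 60.0    # stated lower bound for the full system


def core_peak_gflops(clock_ghz: float) -> float:
    """Peak Gflop/s of one core: operations per cycle times clock in GHz."""
    return FLOPS_PER_CYCLE * clock_ghz


def cores_for_peak(clock_ghz: float,
                   target_tflops: float = TARGET_PEAK_TFLOPS) -> int:
    """Smallest whole number of cores whose combined peak reaches the target."""
    return math.ceil(target_tflops * 1000.0 / core_peak_gflops(clock_ghz))


# Evaluate both ends of the expected clock frequency range.
for clock in (4.0, 5.0):
    print(f"{clock} GHz: {core_peak_gflops(clock)} Gflop/s per core, "
          f"at least {cores_for_peak(clock)} cores for 60 Tflop/s")
```

At the lower end of the clock range more cores are needed to reach
the same peak, so the actual core count depends on the final clock
frequency.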
Huygens will use the newest processor design within IBM's family
of POWER processors, a family that has existed for 17 years, since
the introduction of the first RS/6000 in 1990.
It is based on RISC (Reduced Instruction Set Computer) technology,
which enables simple instructions to be executed in a pipelined
fashion. The memory hierarchy consists of several levels of cache
between main memory and the processor registers. On its way from
main memory to the registers, data passes through an L3 cache of
32 MB shared between two cores, then an L2 cache of 4 MB per core,
and finally an L1 cache of 64 kB per core.
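To make the cache hierarchy concrete, the sketch below reports the
level closest to the registers that could hold a working set of a
given size. It assumes binary units (1 kB = 1024 bytes) and that the
whole cache is available to that working set, which ignores the
sharing of the L3 cache between two cores; both are simplifications
for illustration, and the names are hypothetical.

```python
KIB = 1024          # ASSUMPTION: binary units
MIB = 1024 * KIB

# Cache levels, ordered from closest to the registers to farthest.
# Sizes are the ones quoted in the text; the L3 is shared between two cores.
CACHE_LEVELS = [
    ("L1", 64 * KIB),
    ("L2", 4 * MIB),
    ("L3", 32 * MIB),
]


def closest_level(working_set_bytes: int) -> str:
    """Return the level nearest the registers that can hold the working set."""
    for name, size in CACHE_LEVELS:
        if working_set_bytes <= size:
            return name
    return "main memory"


# A few sample working-set sizes.
for size in (32 * KIB, 1 * MIB, 16 * MIB, 64 * MIB):
    print(size, "bytes ->", closest_level(size))
```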
Interconnect Technology
An essential part of the overall system configuration is the
interconnect between the CNs. It is based on InfiniBand technology,
using QLogic switches. Each CN has IB4x links to the IB network.
288-port IB4x switches will be used, which means that each CN can
be connected to each IB switch.
IB technology has been chosen because of its advantageous
latency specifications, which are important for the performance of
MPI applications running across CNs. The MPI latency is on the
order of a few microseconds.
Storage and IO Bandwidth
NCF/SARA's procurement was deliberately very demanding with
respect to IO bandwidth, since the application workload on the
system contains a relatively large number of IO-intensive codes.
For this reason, the storage environment has been designed very
carefully. First, there is a large amount of system-wide accessible
storage, typically used for users' home file systems. This will
amount to ca. 270 TB. Using IBM's General Parallel File System
(GPFS), each CN has access to the full range of home file systems.
The home file systems can be viewed as relatively far away from the
CNs, although their aggregate IO bandwidth is approximately 12 GB/s.
In addition to the home file systems, there are two types of
scratch file systems. These are fast scratch file systems and
shared scratch file systems.
Fast scratch file systems are local to a CN, and are typically
meant for jobs which run within one CN, or for jobs which do not
have shared IO requirements. Each CN has such a fast scratch file
system. The aggregate performance is about 24 GB/s across all
standard CNs, with an additional 22 GB/s across the memory CNs.
The fast scratch storage totals ca. 300 TB for all CNs together.
Shared scratch storage is an additional level of fast storage
which is available to all CNs. It is typically meant for jobs which
run across several CNs and for which each MPI process needs access
to shared data. The amount of shared scratch storage is
approximately 130 TB.
The total amount of usable storage is 270 + 300 + 130 = 700 TB.
All storage is Fibre Channel (FC) based. A Storage Area Network
based on Brocade switches will be used to grant access to the data
from the CNs.
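The storage figures quoted in this section can be tallied as follows.
This is only a sketch of the arithmetic; the dictionary keys are
illustrative labels, and no bandwidth is listed for shared scratch
because the text does not give one.

```python
# Capacities in TB, as quoted in the text.
STORAGE_TB = {
    "home": 270,
    "fast scratch": 300,
    "shared scratch": 130,
}

# Aggregate IO bandwidth in GB/s, as quoted in the text.
BANDWIDTH_GB_S = {
    "home": 12,
    "fast scratch (standard CNs)": 24,
    "fast scratch (memory CNs)": 22,
}

total_tb = sum(STORAGE_TB.values())
fast_scratch_gb_s = (BANDWIDTH_GB_S["fast scratch (standard CNs)"]
                     + BANDWIDTH_GB_S["fast scratch (memory CNs)"])

print(f"total usable storage: {total_tb} TB")
for tier, tb in STORAGE_TB.items():
    # Share of the total capacity held by each tier.
    print(f"  {tier}: {tb} TB ({100.0 * tb / total_tb:.0f}%)")
print(f"fast scratch bandwidth over all CNs: {fast_scratch_gb_s} GB/s")
```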
Archiving and Backup
As explained above, the system comes with a large amount of
storage for home file systems. Until now, home file systems on the
SGI system used automatic migration software (DMF) to move data to
tape, so that file systems never filled up. The drawback was that
large files were moved to tape rather quickly, while retrieving
data from tape could be rather cumbersome.
For the new system, SARA expects to run a quota strategy on the
home file systems. Behind the home file systems, an archive file
system will be run, which is governed by DMF to interact with tape
storage. Users will have to move their files from the home file
system to the archive file system explicitly. Once on the archive
file system, files may be moved to tape automatically.
Backup is done using IBM's TSM software, which is transparent to
users.
Software Environment
In its Request for Proposal, NCF/SARA expressed a preference for
a Linux-based operating system. IBM will run SUSE Linux on the
system, with its own Fortran, C and C++ compiler technology.
Performance analysis tools are available to the users of the
system. Numerical libraries are available in ESSL and Parallel ESSL
(P-ESSL). IBM's Parallel Environment (PE) will be used for
preparing and running MPI jobs.
The scheduling software is an important part of the system.
Users log on to one of the two interactive CNs, and perform their
compilations and small tests on these front-end CNs. Once they
are ready for a larger parallel job, the batch system comes into
action. Depending on the actual requirements of the job (number of
processor cores, amount of memory, large IO requirements, etc.),
the batch system decides where the job will run on the system. For
example, memory-demanding jobs will need to run on the memory CNs.
The actual scheduler is IBM's LoadLeveler.
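The kind of placement decision described above can be illustrated
with a small sketch. This is not LoadLeveler's algorithm or
interface: the per-core memory limits are assumptions picked from the
4 - 8 GB range quoted earlier, and the Job fields and function names
are hypothetical. The scratch rule follows the storage section (fast
scratch for single-CN jobs or jobs without shared IO, shared scratch
otherwise).

```python
from dataclasses import dataclass

# ASSUMPTION: illustrative per-core memory limits taken from the 4 - 8 GB
# range in the text; the real values per node class are not given.
STANDARD_GB_PER_CORE = 4.0
MEMORY_GB_PER_CORE = 8.0


@dataclass
class Job:
    cns: int              # number of CNs requested
    gb_per_core: float    # memory needed per processor core
    shared_data: bool     # True if the processes need access to shared data


def node_class(job: Job) -> str:
    """Pick the node class from the job's memory demand per core."""
    if job.gb_per_core <= STANDARD_GB_PER_CORE:
        return "standard"
    if job.gb_per_core <= MEMORY_GB_PER_CORE:
        return "memory"
    raise ValueError("job needs more memory per core than any CN offers")


def scratch_type(job: Job) -> str:
    """Fast scratch is local to a CN; shared scratch is visible to all CNs."""
    if job.cns == 1 or not job.shared_data:
        return "fast"
    return "shared"


job = Job(cns=4, gb_per_core=6.0, shared_data=True)
print(node_class(job), scratch_type(job))
```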
Another important piece of software is the global file system
software. As described in the storage section, the home file
systems and the shared scratch file systems rely on a global file
system. IBM's General Parallel File System (GPFS) will be used for
this.
Physical Characteristics
- 24-inch wide compute racks to hold the CNs
- 19-inch wide storage racks to hold the storage and SAN
switches
- 19-inch wide racks to hold the InfiniBand switches
- 19-inch wide racks to hold networking equipment
Support and Services
The system will be delivered with both hardware and software
support during extended office hours. Furthermore, IBM has
committed to a large number of support days for both system
administrators at SARA and users in the field. This includes
scripting and accounting, but also analysis and
optimization/parallelization of user applications, and courses and
lectures for the user community.
Time Schedule
IBM deliveries begin in June 2007 with an initial configuration,
which is significantly more powerful than the current
configuration. This serves as a platform for the final system,
which is expected to be available to users by August 2008.
Initial Configuration
The initial configuration will be delivered to SARA in June
2007, and is expected to be operational for users by August 2007.
It will be based on IBM's System p575 with POWER5+ processors
running at a clock frequency of 1.9 GHz. The total peak performance
will be 14.6 Tflop/s, and the CNs will be connected by InfiniBand
switches. The software stack is identical to what will be available
on the "next generation" POWER-based system.
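As a rough cross-check of these numbers, the sketch below derives the
core count implied by the quoted peak performance. It assumes that
POWER5+ performs 4 floating-point operations per clock cycle, the
figure the text gives for POWER5; the resulting core count is an
inference, not a stated specification.

```python
FLOPS_PER_CYCLE = 4          # ASSUMPTION: POWER5+ matches the POWER5 figure
CLOCK_GHZ = 1.9              # quoted clock frequency
INITIAL_PEAK_TFLOPS = 14.6   # quoted peak of the initial configuration
FINAL_PEAK_TFLOPS = 60.0     # stated lower bound for the final system

# Peak of one core in Gflop/s, then the number of cores needed to
# reach the quoted system peak.
core_peak_gflops = FLOPS_PER_CYCLE * CLOCK_GHZ
implied_cores = INITIAL_PEAK_TFLOPS * 1000.0 / core_peak_gflops

# Ratio of the final system's peak to the initial configuration's peak.
speedup = FINAL_PEAK_TFLOPS / INITIAL_PEAK_TFLOPS

print(f"per-core peak: {core_peak_gflops:.1f} Gflop/s")
print(f"implied core count: about {implied_cores:.0f}")
print(f"final system peak is at least {speedup:.1f}x the initial one")
```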