IBM eServer pSeries
High Performance Switch
Tuning and Debug Guide
Version 1.0
April 2005
IBM Systems and Technology Group
Cluster Performance Department
Poughkeepsie, NY
Contents
1.0 Introduction ............................................................... 4
2.0 Tunables and settings for switch software .................................. 5
2.1 MPI tunables for Parallel Environment ...................................... 5
2.1.1 MP_EAGER_LIMIT ........................................................... 5
2.1.2 MP_POLLING_INTERVAL and MP_RETRANSMIT_INTERVAL ......
...
5.10 MP_PRINTENV .............................................................. 22
5.11 MP_STATISTICS ............................................................ 23
5.12 Dropped switch packets ................................................... 24
5.12.1 Packets dropped because of a software problem on an endpoint ........... 24
5.12.2 Packets dropped in the ML0 interface ................................... 26
5.12.3 Pa
1.0 Introduction
This paper is intended to help you tune and debug the performance of the IBM eServer® pSeries® High Performance Switch (HPS) on IBM Cluster 1600 systems. It is not intended to be a comprehensive guide, but rather to help in initial tuning and debugging of performance issues. Additional detailed information on the materials presented here can be found in sources noted in the text and listed in section 7.0. This paper assumes an understanding of MPI and AIX 5L™, and that you
2.0 Tunables and settings for switch software
To optimize the HPS, you can set shell variables for Parallel Environment MPI-based workloads and for IP-based workloads. This section reviews the shell variables that are most often used for performance tuning. For a complete list of tunables and their usage, see the documentation listed in section 7 of this paper.

2.1 MPI tunables for Parallel Environment
The following sections list the most common MPI tunables for applications that u
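As a quick illustration of how these shell variables are typically applied (the value and program name below are placeholders, not recommendations), they are exported in the shell before the job is launched under POE:

export MP_EAGER_LIMIT=65536
poe ./my_mpi_app -procs 32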
thread, and from within the MPI/LAPI polling code that is invoked when the application makes blocking MPI calls. MP_POLLING_INTERVAL specifies the number of microseconds an MPI/LAPI service thread should wait (sleep) before it checks whether any data previously sent by the MPI task needs to be retransmitted. MP_RETRANSMIT_INTERVAL specifies the number of passes through the internal MPI/LAPI polling routine between successive checks for data that needs to be resent. When the swi
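For example (the values here are placeholders chosen only to show the form; see the Parallel Environment documentation for defaults and valid ranges):

export MP_POLLING_INTERVAL=400000      # service-thread wait, in microseconds
export MP_RETRANSMIT_INTERVAL=10000    # polling passes between retransmit checks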
2.1.5 MP_TASK_AFFINITY
Setting MP_TASK_AFFINITY to SNI tells the Parallel Operating Environment (POE) to bind each task to the MCM containing the HPS adapter it will use, so that the adapter, CPU, and memory used by any task are all local to the same MCM. To prevent multiple tasks from sharing the same CPU, do not set MP_TASK_AFFINITY to SNI if more than four tasks share any HPS adapter. In that case, set MP_TASK_AFFINITY to MCM, which allows each MPI task t
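For example, the two settings discussed above are applied as follows (choose one, based on how many tasks share an adapter):

export MP_TASK_AFFINITY=SNI    # four or fewer tasks per HPS adapter
export MP_TASK_AFFINITY=MCM    # more than four tasks per HPS adapter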
Sometimes MPI-IO is used in an application as if it were basic POSIX read/write, either because there is no need for more complex read/write patterns or because the application was previously hand-optimized to use POSIX read/write. In such cases, it is often better to use the IBM_largeblock_io hint on MPI_FILE_OPEN. By default, the PE/MPI implementation of MPI-IO tries to take advantage of the information the MPI-IO interface can provide to do file I/O more efficiently. If the MPI-IO cal
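As a sketch of how the IBM_largeblock_io hint mentioned above can be passed to MPI_File_open (the file name, communicator, and access mode are placeholders; error handling is omitted):

#include <mpi.h>

/* Open a file with the IBM_largeblock_io hint set, so the PE/MPI-IO layer
   treats the accesses as large contiguous blocks. */
int open_with_largeblock_hint(MPI_Comm comm, const char *path, MPI_File *fh)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "IBM_largeblock_io", "true");
    int rc = MPI_File_open(comm, (char *)path,
                           MPI_MODE_RDWR | MPI_MODE_CREATE, info, fh);
    MPI_Info_free(&info);
    return rc;
}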
rfifosize    0x1000000    receive fifo size       False
rpoolsize    0x02000000   IP receive pool size    True
spoolsize    0x02000000   IP send pool size       True

3.0 Tunables and settings for AIX 5L
Several settings in AIX 5L impact the performance of the HPS. These include the IP and memory subsystems. The following sections provide a brief overview of the most commonly used tunables. For more information about these subjects, see the AIX 5
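Referring to the adapter attributes listed above, the values on a running partition can be displayed and, for the attributes marked True, changed with the standard AIX device commands. The interface name sni0 below is an example and the sizes are placeholders:

lsattr -El sni0
chdev -l sni0 -a spoolsize=0x02000000 -a rpoolsize=0x02000000 -P

The -P flag defers the change to the next boot, which is typically needed when the interface is in use.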
The overhead in maintaining the file cache can impact the performance of large parallel applications. Much of the overhead is associated with the sync() system call (by default, run every minute from the syncd daemon). The sync() system call scans all of the pages in the file cache to determine if any pages have been modified since the last sync(), and therefore need to be written to disk. This type of delay affects larger parallel applications more severely, and those with frequent sync
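One common way to limit how large the file cache can grow on AIX 5L, shown here only as an illustration (the percentages are placeholders, not recommendations from this paper), is to adjust the VMM file-page limits with vmo:

vmo -p -o minperm%=5 -o maxperm%=20 -o maxclient%=20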
3.3.1 svmon
The svmon command provides information about the virtual memory usage by the kernel and user processes in the system at any given time. For example, to see system-wide information about the segments (256 MB chunks of virtual memory), type the following command as root:
svmon -S
The command prints out segment information sorted according to values in the Inuse field, which shows the number of virtual pages in the segment that are mapped into the process address space.
PageSize   Inuse    Pin    Pgsp   Virtual
4KB        448221   3687   2675   449797
16MB       0        0      0      0

Vsid     Esid      Type  Description           LPage  Inuse   Pin  Pgsp  Virtual
1f187f   11        work  text data BSS heap    -      56789   0    0     56789
218a2    70000000  work  default shmat/mmap    -      33680   0    0
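For a single process, the related form svmon -P reports the segments in use by just that process, which is often a quicker way to see what one MPI task is consuming (replace the placeholder with the process ID of interest):

svmon -P <pid>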
statistics in 5-second intervals, with the first set of statistics being the statistics since the node or LPAR was last booted.
vmstat 5
The pi and po values in the page group are the numbers of 4KB pages read from and written to the paging device between consecutive samplings. If po is high, it could indicate that thrashing is taking place. In that case, it is a good idea to run the svmon command to see the system-wide virtual segment allocation.

3.4 Large page sizing
Some HPC applicat
adapter is configured. The size of the reservation is proportional to the number of user windows configured on the HPS adapter. A private window is required for each MPI task. Here is a formula to calculate the number of TLPs needed by the HPS adapter. In the formula below, number_of_sni refers to the number of sniX logical interfaces present in the partition. To obtain the num_windows, send pool size, and receive pool size values for the AIX partition, run the following command:
ls
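The command cut off above is most likely lsattr -El sniX (an assumption based on the attribute names it is said to report). The large pages themselves are reserved with vmo and take effect only after the boot image is rebuilt and the partition is rebooted; the region count below is a placeholder used to show the form:

vmo -r -o lgpg_regions=64 -o lgpg_size=16777216
bosboot -ad /dev/ipldevice
shutdown -Fr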
3.5 Large pages and IP support
One of the most important ways to improve IP performance on the HPS is to ensure that large pages are enabled. A number of large pages must be allocated for use by the HPS IP driver at boot time. Each snX interface needs one large page for its IP FIFO, plus the large pages for the send pool and receive pool, which are shared among all adapters. Here is the formula for the number of large pages, assuming that the send pool and receive pool each need two pa
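As a rough worked example under the assumptions stated above (16 MB large pages, and a send pool and receive pool of 0x02000000 = 32 MB each, so two large pages apiece): a partition with two snX interfaces would need about 2 pages for the IP FIFOs plus 2 for the send pool plus 2 for the receive pool, or roughly 6 large pages, before counting pages needed by any other large-page users.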
If you have eight cards for p690 (or four cards for p655), this command also indicates whether you have full memory bandwidth.

3.8 Debug settings in the AIX 5L kernel
The AIX 5L kernel has several debug settings that affect the performance of an application. To make sure you are running with all the debug settings in the kernel turned off, run the following command:
bosdebug -L
The output will look something like this:
Memory debugger     off
Memory sizes        0
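If any of these features shows as enabled, turn it off with the corresponding bosdebug flag (the specific disable flags vary by AIX level; see the bosdebug man page), then rebuild the boot image and reboot so the change takes effect:

bosboot -ad /dev/ipldevice
shutdown -Fr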
4.2 LoadLeveler daemons
The LoadLeveler® daemons are needed for MPI applications using HPS. However, you can lower their impact on a parallel application by changing the default settings for these daemons. You can lower the impact of the LoadLeveler daemons by:
• Reducing the number of daemons running
• Reducing daemon communication or placing daemons on a switch
• Reducing logging

4.2.1 Reducing the number of daemons running
Stop the keyboard daemon. In LoadL_config:
# Specify whet
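The configuration fragment above is cut off; as a sketch of the kind of LoadL_config entry being described (X_RUNS_HERE is the usual LoadLeveler keyword controlling the keyboard daemon, but confirm the exact keyword against your LoadLeveler documentation):

# Specify whether the keyboard daemon (kbdd) runs on this machine
X_RUNS_HERE = False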
SCHEDD_DEBUG = -D_ALWAYS

4.3 Settings for AIX 5L threads
Several variables help you use AIX 5L threads to tune performance. These are the recommended initial settings for AIX 5L threads when using HPS. Set them in the /etc/environment file:
AIXTHREAD_SCOPE=S
AIXTHREAD_MNRATIO=1:1
AIXTHREAD_COND_DEBUG=OFF
AIXTHREAD_GUARDPAGES=4
AIXTHREAD_MUTEX_DEBUG=OFF
AIXTHREAD_RWLOCK_DEBUG=OFF
To see the current settings on a running system, run the following command:
ps e
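The command above is truncated; on AIX, the environment of a running process can be displayed with a form such as ps eww <pid> (the exact flags are an assumption; check the ps man page), and the AIXTHREAD_* values should appear in that output for processes started after /etc/environment was updated.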
5.0 Debug settings and data collection tools
Several debug settings and data collection tools can help you debug a performance problem on systems using HPS. This section contains a subset of the most common setting changes and tools. If a performance problem persists after you check the debug settings and the data that was collected, call IBM service for assistance.

5.1 lsattr tuning
The lsattr command lists two trace and debug-level settings for the HPS links. The following sett
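As an illustration (the interface name sni0 is an example; repeat for each HPS interface in the partition), these settings can be inspected by listing the adapter attributes and scanning the output for the trace and debug-level entries:

lsattr -El sni0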
5.3 Affinity LPARs
On p690 systems, if you are running with more than one LPAR for each CEC, make sure you are running affinity LPARs. To check affinity between CPU, memory, and HPS links, run the associativity scripts on the LPARs. To check the memory affinity setting, run the vmo command.

5.4 Small Real Mode Address Region on HMC GUI
Because the HMC and hypervisor code on POWER4 systems use up physical memory, some physical memory is unavailable to the LPARs. To make sure tha
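To check the memory affinity setting mentioned above (a sketch; memory_affinity is the vmo parameter typically used for this on AIX 5L, but confirm for your AIX level):

vmo -a | grep memory_affinity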