Performance Guidelines for
AMD Athlon™ 64 and
AMD Opteron™ ccNUMA
Multiprocessor Systems
Application Note
Publication # 40555 Revision: 3.00
Issue Date: June 2006
© 2006 Advanced Micro Devices, Inc. All rights reserved. The contents of this document are provided in connection with Advanced Micro Devices, Inc. (“AMD”) products. AMD makes no representations or warranties with respect to the accuracy or completeness of the contents of this publication and reserves the right to make changes to specifications and product descriptions at any time without notice. The information contained herein may be of a preliminary or advance nature and is subject to chang
Contents
Revision History
Chapter 1 Introduction
1.1 Related Documents
A.2.1 What Resources Are Used When a Single Read-Only or Write-Only Thread Accesses Remote Data?
A.2.2 What Resources Are Used When Two Write-Only Threads Fire at Each Other (Crossfire) on an Idle System?
A.2.3 What Role Do Buffers Play in the Throughput Observed?
List of Figures
Figure 1. Quartet Topology
Figure 2. Internal Resources Associated with a Quartet Node
Figure 3. Write-Only Thread Running on Node 0, Accessing Data from 0, 1 and 2 Hops Away on an Idle System
Revision History
Date         Revision    Description
June 2006    3.00        Initial release.
Chapter 1 Introduction
The AMD Athlon™ 64 and AMD Opteron™ family of single-core and dual-core multiprocessor systems is based on the cache-coherent Non-Uniform Memory Access (ccNUMA) architecture. In this architecture, each processor has access to its own low-latency, local memory (through the processor’s on-die local memory controller), as well as to higher-latency remote memo
bandwidth test, it exercises both of these modes of operation. The test serves as a latency-sensitive test case when the test threads perform read-only operations and as a bandwidth-sensitive test case when the test threads carry out write-only operations. The discussion below explores the performance results of this test, with an emphasis on behavior exhibited when the test imposes h
[12] http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dngenlib/html/msdn_heapmm.asp
[13] http://msdn.microsoft.com/library/default.asp?url=/library/en-us/memory/base/low_fragmentation_heap.asp
[14] http://msdn2.microsoft.com/en-us/library/tt15eb9t.aspx
[15] https://www.pathscale.com/docs/UserGuide.pdf
[16] http://docs.sun.com/source/819-3688/parallel.html
Chapter 2 Experimental Setup
This chapter describes the experimental environment in which the performance study was carried out: the hardware configuration and the software test framework used.
2.1 System Used
All experiments and analysis discussed in this application note were performed on a Quartet system having four 2.2 GH
Figure 1. Quartet Topology (nodes N0, N1, N2 and N3, connected by four links)
The term hop is commonly used to describe access distances on NUMA systems. When a thread accesses memory on the same node as that on which it is running, it is a 0-hop access or local access. If a thread is running on one node but accessing memory that is resident on a different node, the access is a remote access. If th
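As an illustration of hop counting, the following Python sketch computes hop distances with a breadth-first search. The link layout used here (N0-N1, N0-N2, N1-N3, N2-N3, forming a square) is an assumption made for this example: Figure 1 shows four nodes and four links, but the exact connections are not recoverable from the text. The names `LINKS` and `hops` are hypothetical.

```python
from collections import deque

# ASSUMPTION: the four links form a square with no diagonal links.
# Each key is a node number; each value lists its directly linked neighbors.
LINKS = {0: (1, 2), 1: (0, 3), 2: (0, 3), 3: (1, 2)}


def hops(src, dst, links=LINKS):
    """Return the minimum number of links between two nodes (0 = local access)."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in links[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("destination node is not reachable")
```

Under the assumed layout, `hops(0, 0)` is 0 (a local access), `hops(0, 1)` is 1, and `hops(0, 3)` is 2.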
Figure 2. Internal Resources Associated with a Quartet Node (cores C0 and C1; three HyperTransport™ technology (HT) links, each labeled 4 GB/s per direction @ 2 GHz data rate)
From the perspective of the MCT, a memory request may come from either the local core or from another core over a coherent HyperTransport link. The former request is a l
resources approach saturation. The test has two modes: read-only and write-only. When the test threads are read-only, the throughput does not stress the capacity of the system resources and, thus, the test is more sensitive to latency. However, when the threads are write-only, there is a heavy throughput load on the system. This is described in detail in later sections of this do
characterization of the resource behavior in the system. These recommendations, coupled with these interesting cases, provide an understanding of the low-level behavior of the system, which is crucial to the analysis of larger real-world workloads.
2.3 Reading and Interpreting Test Graphs
Figure 3 below shows one of the graphs that will be discussed in detail later.
Time for writ
2.3.2 Labels Used
Each of the bars on the graph is labeled with the hop information for the thread.
2.3.3 Y-Axis Display
For the one-thread test cases on the idle system, the graphs show the time taken by a single thread, normalized to the time taken by the fastest single-thread case, which here is the time it takes a read-only thread to do local accesses on an idle system. In Figur
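The normalization described for the y-axis can be sketched in a few lines of Python. This is a minimal illustration, not the test framework's own code: the function name `normalize_times`, the bar labels, and the timings are all hypothetical, and the numbers are invented purely to show the arithmetic.

```python
def normalize_times(times):
    """Divide every measured time by the smallest one, so the fastest case reads 1.0."""
    if not times:
        raise ValueError("no measurements given")
    fastest = min(times.values())
    if fastest <= 0:
        raise ValueError("times must be positive")
    return {label: t / fastest for label, t in times.items()}


# HYPOTHETICAL timings in seconds, invented for illustration; not measured results.
example = {"read 0-hop": 2.0, "read 1-hop": 3.0, "write 0-hop": 2.5}
normalized = normalize_times(example)
```

With these made-up inputs the fastest case maps to 1.0 and every other bar is expressed as a multiple of it, which is how the graphs' y-axis values are meant to be read.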
Chapter 3 Analysis and Recommendations
This section lays out recommendations to developers. Several of these recommendations are accompanied by empirical results collected from test cases with analysis, as applicable. In addition to making recommendations for performance improvement, this section clarifies some of the common perceptions developers have about performance on AMD cc
3.1.2 Multiple Threads-Shared Data
When scheduling multiple threads that share data on an idle system, it is preferable to schedule the threads on both cores of an idle node first, then on both cores of the next idle node, and so on. In other words, schedule using core major order first followed by node major order. For example, when scheduling threads that share data on a dua
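The core-major-then-node-major order for threads that share data can be sketched as follows. The function names `placement_order` and `place_threads` are hypothetical, and the defaults of four nodes with two cores each are an assumption standing in for a dual-core Quartet system. The sketch only computes the intended order; actually pinning threads depends on how the operating system numbers its cores and is not shown.

```python
def placement_order(num_nodes, cores_per_node):
    """Yield (node, core) pairs, filling every core of a node before the next node."""
    for node in range(num_nodes):
        for core in range(cores_per_node):
            yield node, core


def place_threads(num_threads, num_nodes=4, cores_per_node=2):
    """Map each thread index to a (node, core) slot in the order above."""
    # ASSUMPTION: defaults model four dual-core nodes; adjust for other systems.
    slots = list(placement_order(num_nodes, cores_per_node))
    if num_threads > len(slots):
        raise ValueError("more threads than cores")
    return dict(enumerate(slots[:num_threads]))
```

For three threads, `place_threads(3)` assigns threads 0 and 1 to the two cores of node 0 and thread 2 to core 0 of node 1.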