Optimizing NVMe Performance – How to Identify and Enable True PCIe x16 Connectivity (macOS, Windows, Linux)

NVMe-based storage solutions are highly sensitive to PCIe lane assignment. To optimize transfer performance, the NVMe storage controller should be installed into a slot that matches its lane capability (an x16 controller in an x16 slot, for example). Unfortunately, this is not always possible; depending on the motherboard platform and the number of hosted PCIe devices, an ideal lane assignment may not be available.

It should be noted, however, that depending on the storage configuration and application, an ideal lane speed may not actually be required to maximize real-world transfer rates. The individual capabilities of each NVMe SSD and the target storage configuration (single disk, JBOD, or RAID array) must be taken into consideration. For example, a RAID 0 array comprised of two M.2 SSDs would not benefit from an x16 connection; x8 would be more than sufficient. Likewise, x8 lanes would provide sufficient bandwidth for a mixed configuration, such as RAID 1 combined with one or two additional standalone SSDs.
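As a rough rule of thumb, assuming PCIe 3.0 (where each lane delivers approximately 985 MB/s of usable bandwidth after encoding overhead):

x4 ≈ 3.9 GB/s
x8 ≈ 7.9 GB/s
x16 ≈ 15.8 GB/s

Two PCIe 3.0 x4 M.2 SSDs in RAID 0 can deliver at most roughly 7 GB/s combined, which fits comfortably within an x8 connection.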

All HighPoint SSD7000 NVMe RAID controllers can assign up to x4 lanes per NVMe SSD. For example, our industry-leading 4-channel x16 M.2 controllers assign 4 dedicated lanes per NVMe channel. Our 8-channel high port count (HPC) models typically assign two lanes per channel; however, the intelligent PCIe switch technology built into each controller allows it to assign up to 4 lanes per SSD, depending on the total number of SSDs in use. In addition, the 8 independent channels provide a higher level of flexibility; customers can configure the NVMe media to meet the demands of a variety of workflows. This is especially useful for high-resolution media capture applications, which typically require multiple GPUs.

PCIe Switch Controllers

One way to manage PCIe bandwidth in such environments is through the use of a daughter board with a PCIe switch controller, such as the Broadcom PEX8747, which features 48 PCIe 3.0 lanes. Not only does the switch increase the total PCIe slot count, it can also be used to assign a particular PCIe device to a particular CPU or CPUs.

The PEX8747 can distribute either x8 or x16 lanes to each of its four downstream ports and connect them to the CPU via its dedicated x16 upstream port. Customers can essentially link one or more GPUs with an SSD7000 NVMe RAID controller in order to streamline I/O and optimize transfers for a particular application.

 

NVMe BIOS Settings

Today’s workstation and server motherboards are highly intelligent and flexible – they provide a myriad of options and features that enable customers to fine-tune their hardware environment. However, the resulting BIOS and UEFI menus are far more complex than previous generations, and the default settings may actually impede NVMe storage performance.

There are several BIOS-related settings that are of particular importance to an NVMe storage configuration; chief among them are the PCIe mode settings.

PCIe Slot Mode

Some motherboards allow customers to specify the mode setting of each PCIe slot. In general, the slot should be set to operate at the highest available lane speed, with x16 being ideal. These lane settings are industry standard and apply equally to any high-performance PCIe device (GPUs, RAID controllers, Ethernet controllers).
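On a Linux platform, the slot's maximum and negotiated link width can be confirmed from the operating system with lspci. The bus address shown below (17:00.0) is only an example; substitute the address reported for your controller:

# Locate the controller's PCIe bus address
lspci | grep -i -E 'nvme|raid|non-volatile'

# Display the slot's maximum (LnkCap) and current (LnkSta) link speed/width
sudo lspci -s 17:00.0 -vv | grep -E 'LnkCap:|LnkSta:'

A controller running at full PCIe 3.0 x16 should report "Speed 8GT/s, Width x16" on the LnkSta line; a lower negotiated width indicates the slot is not providing the expected number of lanes.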

Bifurcation mode should not be used with HighPoint SSD7000 series RAID controllers. Bifurcation mode was designed for NVMe storage solutions that do not benefit from PCIe switch chipsets. It allows customers some control over how the PCIe lanes are distributed to the SSDs. However, it will only work for specific motherboards, specific NVMe devices, and specific NVMe configurations (primarily single NVMe SSDs), and will conflict with a dedicated NVMe RAID controller.

PCIe – Direct to CPU

This setting is essential for an NVMe storage solution; slots wired this way are sometimes referred to as the "northbridge" path. It ensures that the PCIe device has direct access to the system's processor or processors, which is critical for maximizing NVMe performance.

Some PCIe slots do not provide direct access and instead interface with the platform's secondary chipset, such as the motherboard's southbridge (Intel ICH/PCH, AMD FCH). PCIe devices that interface with the southbridge will perform more slowly, as the southbridge relies on the northbridge for access to the CPU.
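On Linux, a quick way to see whether a slot connects directly to the CPU or routes through the chipset is to view the PCIe topology as a tree (the exact layout varies by platform, and the motherboard manual remains the authoritative reference):

# Display the PCIe device tree; devices listed under a chipset/PCH bridge
# rather than directly under a CPU root port are routed through the southbridge
lspci -tv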

Application Tuning - Data Transfer Size

An often overlooked element when tuning NVMe storage performance is the transfer size of each I/O request. In general, the transfer size is determined by the target application; ideally, it should be equal to or larger than the array's block size and should take into account the number of SSDs in use.

For example, a RAID 0 array comprised of 8 NVMe SSDs with a block size of 512KB would need the application to send no less than 512KB × 8 (4MB) of data per I/O request, so that the data can be distributed efficiently across each disk.
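For command-line testing on Linux, the request size can be set explicitly. A minimal sketch using fio (covered in the Linux section below); the device path is a placeholder, and the 4MB request size matches the 8-disk, 512KB block size example above:

# Sequential write test issuing 4MB requests (8 SSDs x 512KB block size).
# Replace /dev/sdX with the block device assigned to your array.
# Warning: writing to a raw device is destructive; use only on an array with no data.
fio --name=seq-write --filename=/dev/sdX --rw=write --bs=4M --direct=1 --ioengine=libaio --iodepth=32 --runtime=60 --time_based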

Benchmarking NVMe Storage Performance

Unlike conventional SAS/SATA storage, NVMe storage was designed to interface directly with a system's CPU (or CPUs). As a result, NVMe storage can handle an exponentially greater number of I/O requests: a single queue with 256 commands for SAS, versus NVMe's 64K queues with 64K commands per queue. Thus, the ability to specify queue depth is of critical importance when testing the performance capabilities of NVMe storage.

Recommended Benchmark Utilities - Windows

For Windows platforms, we recommend two benchmark utilities for NVMe storage configurations: IOMeter and CrystalDiskMark.

IOMeter

IOMeter has been an industry standard benchmarking tool for many years. Originally introduced by Intel in 1998 and now an open source project, IOMeter is used to test and verify the performance capabilities of a wide range of storage configurations across multiple operating systems and hardware platforms. IOMeter is suitable for both single- and multi-CPU environments.

Website

When using IOMeter with an NVMe configuration, please use the scripts below for Read and Write performance testing:

https://www.dropbox.com/sh/avu8dbe8czbtp77/AAAkQ5wJDr2zLzprFaSb2iHJa?dl=0

CrystalDiskMark

CrystalDiskMark is an easy-to-use, graphical benchmark utility designed for testing disk configurations within a Windows environment. First introduced in 2007, it is now in widespread use and has spawned several variants for other platforms, including macOS. We recommend CrystalDiskMark for benchmarking single-CPU environments.

Website

Note: CrystalDiskMark requires that the platform utilize a CPU with a clock speed of 3.3 GHz or higher in order to generate accurate, repeatable benchmark results.

To test with CrystalDiskMark, make sure to change the Queues and Threads values under Settings:

Queue should be set to 512

Threads should be set to 8


Image 1: Intel(R) Core(TM) i5-9600K (3.7GHz)

Image 2: Intel(R) Xeon(R) Silver 4110 (2.1GHz)

Recommended Benchmark Utilities - macOS

We recommend using the ATTO Disk Benchmark utility to test the NVMe RAID array’s performance in a macOS environment.


Use the following parameters:

General Parameters

File Size: 16GiB
Queue Depth/Disk: 256
Write Pattern: 0x0000000 (default)
Streams/Disk: 1

I/O Size Range

Start: 2 MiB
End: 64 MiB

Note: Snapshot will run a single test. Continuous will repeat the test procedure until manually stopped.
 

After setting the parameters, click the Add Disk button and browse to the array volume.
Click the Start button to begin the performance test.

Recommended Benchmark Utilities - Linux

For testing NVMe performance with a Linux platform, we recommend FIO.

https://fio.readthedocs.io/en/latest/fio_doc.html#running-fio

FIO is a versatile, command line driven, open source testing utility capable of simulating I/O workloads for a variety of applications.

A guide for testing SSD7000 series controllers is available here.
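As a minimal sketch (the device path, request size, queue depth, and job count below are examples only; consult the guide above for the parameters recommended for your specific controller and array):

# Sequential read throughput test against the raw array block device.
# /dev/sdX is a placeholder; replace it with your array's device node.
# bs should reflect the array's block size multiplied by the number of
# member SSDs (see the Data Transfer Size section), while iodepth and
# numjobs control the queue depth and worker count.
fio --name=seq-read --filename=/dev/sdX --rw=read --bs=1M \
    --direct=1 --ioengine=libaio --iodepth=64 --numjobs=4 \
    --group_reporting --runtime=60 --time_based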

Testing Multi-CPU (Server) Environments

We recommend testing multi-CPU environments using IOMeter. IOMeter can be used to benchmark performance for macOS, Windows, and Linux platforms utilizing x86 architecture. We have provided NVMe-related scripts specifically for this type of platform.

Due to the nature of NVMe devices, you may need to inform IOMeter which CPU should be used for testing purposes. It is possible that the benchmark utility will perform tests on a CPU that is not used by the NVMe controller; if this occurs, the test results will be inaccurate (generally, far lower than what the system is truly capable of).
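The same concern applies to Linux testing with fio. As a hedged sketch (the PCIe address and node number are examples only), the NUMA node that owns the controller can be read from sysfs and the benchmark bound to that node with numactl:

# Determine which NUMA node (CPU) owns the controller's PCIe slot
cat /sys/bus/pci/devices/0000:17:00.0/numa_node

# Bind the benchmark to that node (node 1 in this example) so the workload
# runs on the CPU that is directly connected to the controller
numactl --cpunodebind=1 --membind=1 fio --name=seq-read --filename=/dev/sdX \
    --rw=read --bs=1M --direct=1 --ioengine=libaio --iodepth=64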

Test Scripts

Download the following IOMeter scripts for sequential Read and Write testing:

256k-seq-read.icf           download

256k-seq-write.icf           download

Test Procedure (Windows example shown below)

Step 1: Perform a default test session. Open IOMeter and run a performance test on the NVMe array using the following parameters:

Transfer request size = 256K
100% read


Record these results (a screenshot may prove useful).

Step 2: Run a comparison test to verify performance. First, you will need to determine which CPU is used by the SSD7101A-1. Consult the motherboard manual to verify which PCIe slots are associated with which CPU. Once this is determined, follow the procedure outlined below:

Open IOMeter and start a performance test using the CPU assigned to the SSD7101A-1. Let this run in the background.
Press Ctrl + Alt + Delete and select Task Manager.
Click the Details tab:


Search for Dynamo.exe. Dynamo is the IOMeter workload. Right click on this entry and select Set Affinity:

Affinity in this case, refers to the CPU and CPU threads that correspond with the PCIe slot hosting the SSD7101A-1 controller. In this example, the SSD7101A-1 is assigned to CPU2 SLOT2.

Notes:
Node 0 represents CPU1
Node 1 represents CPU2
Select a thread that corresponds with CPU2, such as CPU 0 (Node 1), as shown below:

After the thread has been specified, return to IOMeter and start a new performance test using the supplied scripts. Compare these results with the results generated in Step 1.
