Thursday, May 11, 2023

HCIBench analysis part 2: vSAN OSA vs. vSAN ESA with all Optane drives

 In my previous post, I compared what a huge difference having proper caching drives can make with vSAN OSA, replacing my read intensive disks with Intel Optane. One question remains, however: since Optane are high performance, mixed use drives, how would they get along with a vSAN ESA deployment? To find out, I reconfigured my vSAN cluster by deleting the witness VM, deploying a vSAN ESA witness, and loaded each server with 5 Intel Optane disks each. Let's see how it got along.


  • ESA has compression only mode on automatically, and cannot be disabled (OSA was tested without dedupe or compression enabled)
  • ESA best practices call for a minimum of three nodes. It can absolutely work with two nodes but would see true performance benefits in a right sized cluster

100% read, 4K random

Interesting result, as the vSAN OSA with 2x Optane for caching and 2 read intensive capacity disks actually managed 24K higher IOPS on this test. There are a few likely reasons that I think this would happen:

  1. The benchmark likely kept everything in the hot tier throughout the benchmark
  2. As mentioned previously, ESA is running compression while OSA is not. I plan on re-running this benchmark specifically with compression only enabled on the OSA configuration.
  3. ESA works best at the recommended configuration (3+ nodes).

70% read, 4K random

Where Optane improved write performance in the OSA model, ESA with 5 Optane disks per node improved further. We see an improvement of 53K IOPS, as well as higher throughput, and generally better latency across the board.

50% read/write, 8K random

We get a decent bump in performance in comparison to the OSA build. Read latency remains low, write latency remains about the same.

100% write, 256KB sequential

This is perhaps the biggest difference between all the tests, and highlights the key advantage of vSAN ESA especially for a 2 node cluster. Where some of the other tests showed percentage bumps in performance, ESA with write intensive disks managed to get over twice the throughput of vSAN OSA. 5.69GB/s represents 45.52 Gb/s over the network cards. Latency also improved dramatically.

Overall, the 2 node vSAN ESA with 5 Intel Optane disk per server configuration performs generally as expected, comparatively outperforming the OSA with capacity disks. While OSA went blow for blow with ESA, it should be reiterated that deduplication and compression were disabled; IOPS performance tends to be lower in favor of capacity. One thing that I've omitted from these is the usable capacity, of which I would defer to the vSAN calculators to get final numbers, but keep in mind: vSAN OSA accomplished it's numbers with a datastore size of ~13TB, whereas the 280GB Intel Optane disks granted a capacity just north of 2TB.

So what can we take away from this? Is OSA dead? Not by a long shot. In a real world, 2-node ROBO scenario, several questions should be asked:

  • What is the workload?
  • How much performance do you need?
  • Are mixed use drives going to be able to meet capacity demands?
  • What's the best bang for the buck hardware to meet workload requirements?
In a workload that favors capacity and read intensive workloads, I would favor vSAN OSA. For performance, ESA should absolutely be a consideration. And if you have the chance to get more nodes, the capacity and performance improvements tend to favor ESA.

Monday, May 8, 2023

HCIBench analysis part 1: OSA vs. OSA with Intel Optane

In my previous post, I shared my current vSAN setup details. In this post, we'll take a look at the performance of the original vSAN deployment, and see how much of a performance difference Optane can make when we replace sub-optimal cache disks with Optane for OSA, followed by a full Optane and ESA redeployment.

As previously stated, the cache disks that I have in my current vSAN OSA 2 node cluster are not optimized for caching as they are read intensive disks, and they are mismatched capacity. When running vSAN in a production environment, it is best to adhere to the vSAN HCL as well as the appropriate disks for cache and capacity. While my 3.84TB read intensive drives will be great for capacity, I should be using a write intensive cache disk. Fortunately, the Intel Optane 905's that were sent to me through the vExpert program and Intel should do the trick.

For benchmarking comparisons, I'm using HCIBench 2.8.1. This free utility is provided by VMware as a "Fling". It is an OVA template that, once deployed, allows us to automatically create Linux based VMs that run preconfigured Flexible I/O (FIO) benchmarks. Throughout the benchmarking process, I have selected "Easy Run", which will automatically deploy the number of VMs it feels is correct based on my hardware configuration.

The first thing I did was select the default benchmarks and deploy each on the original vSAN OSA configuration. The "Easy Run" mode determined it would be best to deploy 4 VMs, as I suspect the cache wasn't up to par. 

Of note, all OSA benchmarks were run with deduplication and compression disabled.  

Here's how it went:

100% read, 4KB random

This is decent, but expected considering all of the disks are read intensive. 

70% read, 4KB random

This is more of a realistic bench, with some write performance metrics. Read latency is pretty high.

50% read/write, 8KB random

This simulates database workloads with an equal mix of reads and writes. With the larger block size, performance improves a bit, but latency remains a concern.

100% write, 256KB sequential

The throughput bench, best for video and media. We come close to saturating the 10Gbe link between the two servers on this one. 

For the next tests, I'll remove the 1.2TB and 1.92TB cache disks and replace them with two of the Intel Optane 905's. These are 280GB each, and have drastically better read and write performance capabilities. I could, in theory, use four cache disks, but vSAN prefers a 1:1 cache to capacity disk ratio. This was also the point where I swapped out the 10Gbe 82599 network card and replaced it with a ConnectX-4 CX455 100Gbe card to ensure we don't run into a bandwidth bottleneck (although I doubt I'll be able to saturate 100Gb). We should see a measurable difference across the board in terms of vSAN OSA performance. 

Of note, when selecting "Easy Run" on the following tests, HCIBench deployed 8 benchmark VMs.

100% read, 4KB random

Here we can see an increase of 200K IOPS. Despite the read intensive nature of the original disks, the Optane disks are rated higher for reads and substantially higher for writes. I'm not sure why the results failed to capture the 95th percentile read latency, but I would expect it to be similar to the original OSA result

70% read, 4KB random

Unsurprisingly, the write performance improved dramatically, as well as read latency. 188K IOPS at 70/30 for a 2 node cluster is impressive.

50% read/write, 8KB random

Once again, we see the difference in performance thanks to the Optane drives. 172K IOPS and much lower latency.

100% write, 256KB sequential

This was surprising for me. Just by replacing the cache disks, we can see a near 3.5x bandwidth increase. We see that it would've more than saturated the 10Gbe link, so switching to the 100Gbe card gave us a better idea of what to expect. Average write latency also improved considerably!

Overall, we can see a clear advantage in using the Intel Optane disks. It is critical to choose a cache disk that excels in write intensive workloads, as "hot tier" data gets pushed to capacity over time. By contrast, ESA excels with uniform, mixed use disks. We've seen that Optane has impressive write capabilities, but it also has great read performance. Optane can do it all, but what will the HCIBench numbers reflect? Stay tuned for my next blog post, where we'll compare vSAN OSA with 4 Optane cache/4 capacity disks to the ESA with 10 Optane disks test. 

Tuesday, May 2, 2023

My current setup: vSAN OSA 2 node cluster

I'm excited to announce that I was selected for the vExpert Intel Optane giveaway. To prepare for the incoming drives, I've started a series of blog posts focusing on benchmarking my two-node vSAN cluster. This will serve as a baseline of what to expect in terms of differences between using best practices for OSA as well as the fundamental performance differences of ESA.

My BOM consists of:

Supermicro BigTwin 6029BT-DNC0R
2x X11DPT-B compute servers (OEM branded), each containing:
  • 2x Xeon Silver 4116
  • 768GB DDR4 (24x32GB)
  • Intel X550 RJ-45 10Gbe dual port SIOM network card
  • RSC riser to break up x16 slot into 2 x8
  • Intel 82599 SFP+ 10Gbe dual-port network card in slot 1 of the riser
  • 10Gtek PCI-e x8 to 2x U.2 NVMe drive adapter in slot 2 of the riser
  • Open PCI-e x16 low-profile slot
  • Backplane supports 2x U.2 NVMe with 4 SAS/SATA HDD/SSD per server
Each node contains a single NVMe cache disk along with two NVMe drives for capacity. The drive list is as follows:
  • Node 1: Intel DC P3500 1.2TB for cache, 2x SanDisk Skyhawk 3.84TB for capacity
  • Node 2: SanDisk Skyhawk 1.92TB for cache, 2x SanDisk Skyhawk 3.84TB for capacity
The cache disks are not recommended for two reasons: They are mismatched in size, and they are not write-intensive disks. Nevertheless, they are what I have on hand and should be enough to establish a baseline of vSAN OSA performance. 

The first port of the X550 will be used for management and VM traffic (green lines), and the second will be used for vSAN witness traffic (blue lines). I didn't have a switch capable of handling VLANs, so this will run over two "dumb" 8-port switches. The servers will be direct-connected over the 82599 network cards to pass vSAN storage traffic (red line).

Once we have benchmarked the original build, the plan is to swap the cache disks with Optane drives for OSA, then use them all for ESA. With the 10Gtek cards, each server can hold a maximum of 6 NVMe disks. vSAN OSA prefers to have a 1:1 cache-to-capacity disk ratio, so we will test with 2x Optane and 2x capacity disks, followed by 5x Optane for ESA. To make room for the second NVMe adapter, I'm going to use a ConnectX-4 100Gbe adapter in the open x16 slot for testing, while it is overkill, I don't want there to be a bottleneck (and the original build won't likely saturate the 10Gbe link currently in place). The latest HCIBench utility as of this writing is 2.8.1 and will be utilized in "easy run" mode. Stay tuned!

Evacuate ESXi host without DRS

One of the biggest draws to vSphere Enterprise Plus licensing is the Distributed Resource Scheduler feature. DRS allows for recommendations ...