We needed a second VPS provider. Our hosting stack — Docker, Traefik, Umbraco CMS, Gotenberg for PDF generation — runs multiple client sites as isolated containers on a single server. We had been running on a Hetzner Cloud CPX21 out of Ashburn without complaint. But when Raff Technologies showed up offering roughly double the resources at half the price, we did what we do with every vendor claim: we tested it.
What started as a quick sanity check turned into a full-day evaluation that changed our conclusion three times and exposed real flaws in how most people benchmark VPS providers.
This is not a post about which provider is better. It is a guide to benchmarking methodology — what we got wrong, what we learned, and why surface-level tests can lead you to the wrong decision.
The Baseline: Same Script, Two Servers
We wrote a bash script covering the standard toolkit: sysbench for CPU and memory, fio for disk I/O, curl for network throughput, Docker-specific operations, and a sustained load test to catch throttling. Identical script, identical parameters, both servers running Ubuntu 24.04 LTS.
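A trimmed sketch of that harness, not the full script — the function bodies use standard sysbench/fio/curl flags, but the exact parameters and the download URL here are illustrative placeholders:

```shell
#!/usr/bin/env bash
# Minimal sketch of the benchmark harness; parameters are illustrative.
set -u

cpu_test()  { sysbench cpu --cpu-max-prime=20000 --threads="$(nproc)" run; }
mem_test()  { sysbench memory --memory-total-size=10G run; }
disk_test() { fio --name=seqread --rw=read --bs=1M --size=1G --direct=1 \
                  --runtime=60 --time_based --filename=/tmp/fio-testfile; }
net_test()  { curl -o /dev/null -w '%{speed_download}\n' -s "$1"; }

# Guarded so sourcing this file defines the functions without running anything.
if [ "${RUN_BENCH:-0}" = 1 ]; then
  cpu_test; mem_test; disk_test
  net_test "https://example.com/100MB.bin"   # placeholder test file
fi
```

Running the identical file on both servers is the point: any difference in results then comes from the hardware, not the method.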
The servers under test:
| | Raff Technologies | Hetzner Cloud CPX21 |
|---|---|---|
| CPU | 4 vCPU (EPYC 8224P) | 3 vCPU (EPYC Rome) |
| RAM | 8 GB | 3.7 GB |
| Disk | 120 GB NVMe | 75 GB NVMe |
| Monthly Cost | $9.99 | ~$19 |
One operational note before the numbers: Raff's default Ubuntu image was 24.10, which is already end-of-life. The apt repositories were dead, so nothing installed. They confirmed 24.04 LTS was available, we rebuilt, and moved on. If you are evaluating Raff, pick the LTS image.
Lesson 1: The First Run Is Rarely the Whole Story
Round 1 told a clean, simple story:
- CPU: Hetzner faster per core (+7%); Raff faster in aggregate (+28%), thanks to the extra core
- Memory: a wash
- Disk I/O: Hetzner dominated — 2.6x faster sequential reads, 46% more random 4K IOPS
- Network: Hetzner at 8,138 Mbps versus Raff at 754 Mbps — a 10x gap
If we had stopped here, we would have written Raff off as a staging box. We almost did.
Instead, we shared the full report with Raff. Their response changed the trajectory of the evaluation.
Lesson 2: The Best Providers Engage Technically
Within minutes, Raff's support team responded — not with marketing rebuttals, but with server-side storage adjustments. They asked us to re-run.
Sequential read performance jumped from 524 MB/s to 1,631 MB/s, a 211% improvement that flipped the metric from a Hetzner win to a Raff win.
Random 4K IOPS did not change. We said so. They accepted it and pointed us toward a deeper test.
The takeaway for any vendor evaluation: share your data with the provider. The ones worth working with will engage technically. The ones who point you toward an FAQ are telling you something about what support looks like post-purchase.
Lesson 3: Benchmark the Workload You Actually Run
This was the expensive lesson.
Our fio tests ran at iodepth=1 — one I/O request in flight at a time. That measures single-request latency, which is useful for isolated sequential operations. But our production workload is not isolated or sequential. We run multiple Docker containers, each with its own SQLite database, all reading from disk concurrently. That is a parallel I/O workload, and NVMe drives are designed for deep queue parallelism.
Raff suggested we test at iodepth=16 with four parallel jobs. We were skeptical — it felt like moving the goalposts. But the logic was sound: test the pattern your servers actually experience.
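A sketch of the two invocations, assuming a scratch file at /tmp/fio-testfile (our choice); only the concurrency flags differ between them:

```shell
#!/usr/bin/env bash
# Shallow vs. deep queue: identical fio job, different concurrency.
# iodepth=16 keeps 16 requests in flight per job; numjobs=4 runs four
# parallel workers, approximating many containers hitting disk at once.
COMMON=(--rw=randread --bs=4k --direct=1 --size=1G
        --runtime=60 --time_based --filename=/tmp/fio-testfile)
SHALLOW=(--name=shallow --iodepth=1 --numjobs=1 "${COMMON[@]}")
DEEP=(--name=deep --iodepth=16 --numjobs=4 --group_reporting "${COMMON[@]}")

# Run only where fio is installed; this keeps the sketch safe to source.
if command -v fio >/dev/null 2>&1; then
  fio "${SHALLOW[@]}"
  fio "${DEEP[@]}"
fi
```

`--group_reporting` aggregates the four workers into one summary, which makes the deep-queue IOPS number directly comparable to the single-job run.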
We ran the deep queue test on both servers:
| Provider | iodepth=1 (4K Read) | iodepth=16 (4K Read) | Scaling Factor |
|---|---|---|---|
| Raff | 2,694 IOPS | 83,160 IOPS | 31x |
| Hetzner | 3,933 IOPS | 46,676 IOPS | 12x |
At realistic concurrency, Raff delivers 78% more random read IOPS at nearly half the latency (0.76 ms versus 1.37 ms). The server that looked slower in a single-threaded test was dramatically faster under parallel load.
If you run containers or databases and you are benchmarking with iodepth=1, you are measuring the wrong thing.
Lesson 4: Single-Stream Downloads Measure TCP, Not Bandwidth
Our initial network test — a single-stream curl download — showed Hetzner at roughly 10x Raff's throughput. That looked disqualifying.
Raff pointed out the test was bottlenecked by TCP window scaling and geographic distance to the test server, not by actual bandwidth capacity. Multi-stream testing with speedtest-cli told a different story:
| | Raff | Hetzner |
|---|---|---|
| Download | 2,279 Mbps | 1,566 Mbps |
| Upload | 1,910 Mbps | 1,458 Mbps |
| Ping | 3.94 ms | 10.04 ms |
The 10x gap was a testing artifact. Raff was actually 46% faster on multi-stream download to nearby U.S. infrastructure.
If your network benchmark uses a single TCP stream, you are testing TCP window scaling behavior, not your provider's pipe.
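A hedged sketch of both multi-stream approaches — the iperf3 server address below is a placeholder you would replace with a nearby public endpoint:

```shell
#!/usr/bin/env bash
# Multi-stream bandwidth sketch. speedtest-cli picks a nearby server and
# opens several connections; iperf3 -P 8 runs eight parallel TCP streams,
# so no single connection's window limits the measured throughput.
IPERF_SERVER="iperf.example.net"   # placeholder -- substitute a real endpoint
IPERF_STREAMS=8

if command -v speedtest-cli >/dev/null 2>&1; then
  speedtest-cli --simple
fi
if command -v iperf3 >/dev/null 2>&1; then
  iperf3 -c "$IPERF_SERVER" -P "$IPERF_STREAMS" -t 30
fi
```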
What the Numbers Mean in Practice
For our stack, the capacity math looks like this. Each Umbraco 13 container with SQLite consumes roughly 300–400 MB of RAM:
| Plan | Monthly Cost | Estimated Sites | Cost per Site |
|---|---|---|---|
| Raff $9.99 (4 vCPU / 8 GB) | $9.99 | ~18 | ~$0.55 |
| Raff $23.99 (8 vCPU / 16 GB) | $23.99 | ~38 | ~$0.63 |
| Hetzner CPX21 (3 vCPU / 3.7 GB) | ~$19 | ~7 | ~$2.71 |
At the $23.99 tier, we host 5x more sites for $5 more per month. For nonprofit clients where every dollar in overhead matters, that changes the math entirely.
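As a sanity check, the per-site math reduces to a few lines of arithmetic. The 1 GB system reserve and the 400 MB high-end budget below are our assumptions, so the result lands near, not exactly on, the table's estimates:

```shell
#!/usr/bin/env bash
# Hypothetical capacity math for the $9.99 plan (8 GB RAM).
# Reserve ~1 GB for OS/Docker/Traefik; budget 400 MB per site (high end).
ram_mb=8192; reserved_mb=1024; per_site_mb=400; plan_cost=9.99
sites=$(( (ram_mb - reserved_mb) / per_site_mb ))
cost_per_site=$(awk -v c="$plan_cost" -v s="$sites" 'BEGIN { printf "%.2f", c/s }')
echo "$sites sites, ~\$$cost_per_site per site"
```

With these assumptions the script prints 17 sites at roughly $0.59 each; the table's ~18 / ~$0.55 reflects a slightly tighter per-site budget, and the gap comes down entirely to the reserve you assume.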
The Intangibles That Do Not Show Up in fio
Support responsiveness. Raff made same-day infrastructure adjustments based on our data. Every technical suggestion they offered was sound and reproducible. That kind of engagement is what you get from small, technical providers — and it is worth more than a few percentage points on a benchmark.
Data residency options. Raff indicated they would stand up Canadian-domiciled servers on request. Several of our clients have data residency requirements Hetzner cannot serve from Ashburn. For compliance-sensitive workloads, server location is not a nice-to-have.
Honest engagement. They did not dispute our findings when the numbers were unfavorable. They fixed what they could, explained what they could not, and pointed us toward better methodology. That builds trust faster than any SLA document.
A Benchmarking Checklist for VPS Evaluations
If we learned one thing from this process, it is that default benchmarks can actively mislead. Here is what we would recommend for anyone evaluating providers:
- Test at realistic queue depths. If you run containers or databases, iodepth=1 undersells NVMe storage. Test at iodepth=16 with multiple parallel jobs.
- Use multi-stream bandwidth tests. Single-stream curl downloads measure TCP behavior, not bandwidth capacity. Use speedtest-cli or iperf3 with multiple streams.
- Run sustained load tests. Noisy neighbors and throttling do not show up in 10-second bursts. Run at least 60 seconds across multiple windows.
- Share your results with the provider. The response tells you more about the relationship than the numbers do.
- Test the workload you actually run. Generic benchmarks are a starting point, not an answer.
- Re-test after provider adjustments. Infrastructure is not static. A provider willing to tune based on real data is a provider worth keeping.
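The sustained-load item in the checklist can be sketched as a loop of repeated windows; the window count and length here are our choices, not a standard:

```shell
#!/usr/bin/env bash
# Sustained-load sketch: three back-to-back 60-second sysbench windows.
# Throttling or noisy neighbors show up as a drop in the later windows.
WINDOWS=3
WINDOW_SECONDS=60

if command -v sysbench >/dev/null 2>&1; then
  for i in $(seq 1 "$WINDOWS"); do
    echo "window $i of $WINDOWS"
    sysbench cpu --cpu-max-prime=20000 --threads="$(nproc)" \
             --time="$WINDOW_SECONDS" run | grep 'events per second'
  done
fi
```

Comparing the events-per-second line across windows is the quickest way to spot a provider that looks fast for ten seconds and then clamps down.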
The Decision
We are moving new Umbraco 13 deployments to Raff Technologies, starting at the $9.99 tier with a clear upgrade path to $23.99 as we onboard more clients. Hetzner continues running existing workloads while we migrate.
The benchmark script and the full technical report are both available below.
Download the benchmark script →
Download the full technical report →