My 100Gb Spine

As a sort of engineering art form, no two computer network designs are ever exactly alike, and in that spirit of variety, today I’m going to play some designer make-believe. I’m going to focus on building a new high-capacity, high-performance, and future-proofed IP underlay that should hopefully satisfy even the most performance-demanding customer applications.

For that I’m going to build a leaf-spine fabric with, at most, a 2:1 oversubscription ratio, able to support both 10Gb and 25Gb node connectivity, all without breaking the bank on capex or power and cooling costs. The imaginary business requirements include a scale goal of connecting 1,700 1RU nodes on day one, with a business stretch goal of 5,000 before ever needing to think about a redesign.
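For a quick sanity check on those numbers, here's a back-of-the-napkin calculation (a Python sketch; the 48-ports-per-leaf figure is an assumption based on common 1RU top-of-rack density, and happens to match the leaf I pick below):

```python
import math

def min_leaves(nodes: int, host_ports_per_leaf: int) -> int:
    """Minimum number of top-of-rack (leaf) switches needed to attach `nodes` hosts."""
    return math.ceil(nodes / host_ports_per_leaf)

print(min_leaves(1_700, 48))   # 36 leaves to hit the day-one goal
print(min_leaves(5_000, 48))   # 105 leaves to hit the stretch goal
```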

Given that info we should be good to start dreaming and digging through some vendor data sheets. So let’s go shopping!

The Spine

 

Cisco Nexus 9236C

I primarily chose this switch because of its incredible 100Gb and 25Gb port density for such a compact 1RU form. The forwarding performance isn't so shabby, either.
  • 7.2 Tbps bandwidth
  • 4.75 billion packets per second
  • 36 x 40/100Gb ports, or 144 x 10/25Gb ports (w/ breakout cables)
  • 1 to 2 μs latency
  • 30MB on-chip shared memory

The Leaf


Cisco Nexus 92160YC-X

Introduced in 2014, the Nexus 92160 and Nexus 93180 are the logical successors to the Nexus 9372. For this exercise I've decided to go with the former, mostly to keep it in the same 9200 family as the spine, but also because this isn't an ACI deployment. That said, the 93180's slightly better latency, larger MAC and host-entry tables, and extra 20MB of shared ASIC memory could easily sway that decision.
  • 48 x 10/25Gb and 6 x 40/100Gb ports
  • 3.2 Tbps of bandwidth
  • 2.5 billion packets per second
  • 1 to 2 μs latency
  • 20MB on-chip shared memory

Together: A 100Gb Spine-Leaf Network

Okay, I said earlier that my aim was at most a 2:1 oversubscription ratio, so a quick note on that. Let’s assume that the servers we’re deploying here haven’t made the leap to 25Gb yet and are still using 10Gb NICs, leaving us with a familiar 480Gbps of downlink per leaf. Paired with 400Gbps of uplink (four 100Gb uplinks per leaf), this network will enjoy a very nice 1.2:1 oversubscription ratio until someone decides to add 25Gb NICs to all of our metal boxes.
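If it helps to see the arithmetic, here's a minimal sketch of that oversubscription math (plain Python, nothing vendor-specific, assuming the four 100Gb uplinks per leaf described above):

```python
def oversubscription(downlink_gbps: float, uplink_gbps: float) -> float:
    """Ratio of host-facing bandwidth to fabric-facing bandwidth on a leaf."""
    return downlink_gbps / uplink_gbps

# 48 hosts at 10Gb each vs. 4 x 100Gb uplinks (one per spine switch)
print(f"{oversubscription(48 * 10, 4 * 100)}:1")   # 1.2:1
```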

With 100Gb uplinks, there can only be a maximum of 36 leaf switches, since that's the number of 100Gb ports on a spine switch.

Here's the topology we'd likely end up with:

A spine-leaf network supporting 1,728 nodes.

This gives us the ability to connect 1,728 nodes (36 x 48 = 1,728) at 10Gb each, with a rather nice 1.2:1 oversubscription ratio. Not bad at all.
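Putting the two constraints together, the spine's 36 ports and the leaf's 48 host ports, the maximum fabric size falls straight out of the port counts (again, just illustrative Python):

```python
SPINE_100G_PORTS = 36   # Nexus 9236C
LEAF_HOST_PORTS = 48    # Nexus 92160YC-X, 10/25Gb SFP28

max_leaves = SPINE_100G_PORTS            # one uplink from every leaf to every spine
max_nodes = max_leaves * LEAF_HOST_PORTS # 36 x 48
print(max_nodes)                         # 1728
```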

Since all of our leaf downlink ports are SFP28, this design also future-proofs every server's path to a 25Gb NIC upgrade. Once we decide to start moving in that direction, I'd add two more switches to the spine and uplink each leaf’s final two 100Gb ports, for an oversubscription ratio of 2:1. Here's what that'd look like:

Still supporting 1,728 nodes, but with 25Gbps links instead of 10Gb.
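Re-running the same ratio for that 25Gb scenario shows why those last two uplinks matter (illustrative only):

```python
# 48 hosts at 25Gb each vs. all six 100Gb leaf uplinks in service
downlink_gbps = 48 * 25   # 1,200 Gbps toward the servers
uplink_gbps = 6 * 100     #   600 Gbps toward the six spine switches
print(f"{downlink_gbps / uplink_gbps}:1")   # 2.0:1, right at the design ceiling
```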

Stretch Time

Now, how would I accomplish the stretch goal of being able to scale this fabric to support 5,000 hosts? First: By ordering lots and lots of 25Gb break-out cabling (and fiber cabling in general).

Since fabric scale is ultimately bound by the port density of a single spine switch, we can’t attach more leaves unless we somehow create more ports on those spine switches. Lucky for us, the Nexus 9236C can break out to a total of 144 SFP28 ports, as I mentioned earlier.

So, to pull this off, I'd carefully migrate the previously configured 100Gb links between the leaves and spines over to 25Gb, using 4x25Gb breakout cables in every possible spine port and every leaf uplink port. That increases our leaf capacity to 144, which means our spine also needs to grow to a total of 24 switches. Remember that a leaf switch's uplink port count determines the max spine width, while a spine switch's port count defines the max leaf count.


The same network now supports close to 7,000 nodes.
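The port accounting behind that figure works out like this (a sketch, assuming 4x25Gb breakouts in every spine port and every leaf uplink port):

```python
SPINE_100G_PORTS = 36       # Nexus 9236C
LEAF_UPLINK_100G_PORTS = 6  # Nexus 92160YC-X uplinks
LEAF_HOST_PORTS = 48
BREAKOUT = 4                # each 100Gb port splits into 4 x 25Gb

max_leaves = SPINE_100G_PORTS * BREAKOUT         # 144 leaves
spine_width = LEAF_UPLINK_100G_PORTS * BREAKOUT  # 24 spine switches
max_nodes = max_leaves * LEAF_HOST_PORTS         # 6,912 nodes

oversub = (LEAF_HOST_PORTS * 25) / (spine_width * 25)
print(max_leaves, spine_width, max_nodes, f"{oversub}:1")
# 144 24 6912 2.0:1 -- same ratio, four times the node count
```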

Note that with this rather less-obvious cabling config, our oversubscription ratio didn’t change, nor did our switch models or total bandwidth, yet we're now able to support 6,912 connected nodes. How magical is that? Granted, the cabling will be a challenge to deal with, but it's quite interesting how scale can increase so dramatically just by using different-speed uplinks.

I could have executed on the above fabric design from the get-go had the node count been better anticipated (though that's one big switch order to place on day one), but the nice part is that with this design you’re covered either way.

Other Thoughts

“But Buck, those 100Gb optic costs are going to kill you here.” FS.com’s got us covered at $199 a pop for 100Gb transceivers. Not so terrible compared to Cisco’s $1,999 list price for the QSFP-100G-SR4-S. One whole digit less, in fact. That’s pretty significant, and they should only continue to get cheaper as adoption increases.

Also, the reason I went with the Nexus 9200s over the 9300s is two-fold. First, the 9200 is optimized for running just NX-OS, and I have no use for ACI. In my made-up case we’ll need to account for multi-tenancy and workload mobility requirements, for which we'll use MP-BGP EVPN VXLAN to orchestrate the VNI overlays (more on that in another post, maybe). Second, for whatever reason, Cisco hasn't added a proper counterpart in the 9300 family with port density on par with the 9236C. Otherwise, it’s worth pointing out that the cost difference between the two switch families is negligible.

Thanks for reading. Happy building!
