Let's Build a Datacenter Network




It's quite common these days to hear of companies planning to migrate some or all of their infrastructure to third-party cloud providers such as AWS. However, for some organizations it still makes good sense to build physical, on-premises data centers to either augment that cloud presence or supplant it entirely. Today I'm going to pretend I'm working for one of those companies and come up with a network design to build out, just to get the juices flowing.

Be forewarned: brief this article is not, though I have glossed over a few details here and there for the sake of brevity. Really, I wanted to illustrate some of the decisions that go into the process for those unaccustomed to it or otherwise curious.

The Challenge

Let's say a startup has hired me to design a data center network for their existing co-lo space that will be used to host all of their services. All I've been given so far are four 42RU, dual-power cabinets in the datacenter cage and two upstream Internet providers (ISP1 and ISP2) delivering 1 Gbps of bandwidth each over two Ethernet handoffs (one per provider) left at the top of one of the cabinets.

I've also been told that the server inventory supporting all services will include the following:

Production Systems
  • 10 bare-metal servers running a Kubernetes cluster
  • 5 bare-metal servers running KVM
  • 5 bare-metal servers running Kafka
  • 20 bare-metal database shards
  • 5 bare-metal backup servers
  • 5 bare-metal file servers running GlusterFS
Staging Systems
Same server roles as production but just two servers for each:
  • 2 bare-metal servers running a Kubernetes cluster
  • 2 bare-metal servers running KVM
  • 2 bare-metal servers running Kafka
  • 2 bare-metal database shards
  • 2 bare-metal backup servers
  • 2 bare-metal file servers running GlusterFS
The production service application is two-tier, with a pool of web servers connected to the database shards on the back-end, spread out across compute instances running in the container cluster or as KVM virtual machines. All physical servers are 1RU in size and have dual power supplies.

Gluing all of these components together is, well, what I'm now on the hook for.

Brainstorm

So, with the above requirements now known, I'm going to start designing a network that will support them and more, including:
  • VLAN configuration for the network indicating membership for each server/group of servers
  • IP addressing and naming schema of all the servers and network gear
  • Internal and external IP routing
  • Built-in resiliency against network failure scenarios; ensuring redundancy at all conceivable levels
  • Room to scale to triple the current size without needing a design modification
My mind goes to a layer 2 design for this particular case. Why this and not the hot layer 3 ToR model instead? Because for this environment that approach violates the 'keep it simple' principle in a few crucial ways. Let's look at a few reasons not to overcomplicate this initially:
  1. Roughly 4,000 available VLAN IDs should be plenty to last a while without needing to dip our toes into the complex SDN waters
  2. CAM tables are adequate for the anticipated number of MAC entries (10K at most)
  3. VM/container mobility is easily and natively solved: simply keep the workloads in the same VLAN
  4. With 62 servers, we're not exactly web scale yet
Networks tend to get complicated on their own as time passes. As long as we use devices that are at least capable of supporting these features as future-proofing, there's no reason to junk it all up on day one. If our app does hit unicorn status and we suddenly need to scale 100x, it's entirely feasible to seed that transition to an L3 ToR spine/leaf design using these same switches, with or without an SDN overlay.
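
To make the VLAN and addressing deliverables concrete, a plan for this design could look something like the sketch below. The VLAN IDs, names and subnets are purely illustrative placeholders (one /24 per role, reusing the VLAN ID as the third octet), not values pulled from the diagrams, and hostnames could follow a simple <site>-<role>-<number> convention (e.g. den1-agg-01, den1-db-07):

    VLAN   Name           Subnet            Purpose
    10     internet       10.10.10.0/28     Transit toward the edge/firewall layer
    20     oob-mgmt       10.10.20.0/24     Out-of-band management (1Gb ToRs)
    110    prod-k8s       10.10.110.0/24    Production Kubernetes nodes
    120    prod-kvm       10.10.120.0/24    Production KVM hypervisors
    130    prod-kafka     10.10.130.0/24    Production Kafka brokers
    140    prod-db        10.10.140.0/24    Production database shards
    150    prod-storage   10.10.150.0/24    Backup and GlusterFS servers
    210+   stage-*        10.10.210.0/24+   Staging mirrors of the above, one VLAN per role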

Rack Layout

The goal I've set out for myself here is to build a network supporting the initial 62 servers at a 3:1 oversubscription ratio, with the ability to seamlessly expand to 12 additional cabinets without needing to change the aggregation switches or order new ones.

With that considered, here's what I've decided to go with initially:
  • For in-cab high availability, each cabinet will have two ToR switches in a vPC pair, with each host dual-homed to those ToRs via a 2x10Gb LACP bond
  • Each ToR switch will have a 2x40Gb vPC uplink to the aggregation switches, themselves also a vPC pair (a config sketch follows the rack diagram below)
  • Servers and network nodes (switches, routers) are dual-PSU units with an AC power connection to each circuit ("A/B power")
  • Where a node lacks dual PSUs, an N+1 configuration is required with each device on a different power circuit (e.g. the active box is A-powered, the standby box is B-powered)
Physical rack layout
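
To make the vPC and LACP pieces less abstract, here's a rough NX-OS-style sketch of what one ToR switch in a cabinet might carry (assuming Nexus-class ToRs, since the design leans on vPC). The interface numbers, vPC domain, trunking and keepalive addressing are all invented for illustration:

    feature vpc
    feature lacp

    vpc domain 11
      peer-keepalive destination 192.168.255.2 source 192.168.255.1

    ! vPC peer link to the other ToR in this cabinet
    interface port-channel1
      switchport mode trunk
      vpc peer-link

    ! 2x40Gb uplink toward the aggregation vPC pair
    interface port-channel100
      switchport mode trunk
      vpc 100
    interface Ethernet1/49-50
      channel-group 100 mode active

    ! One leg of a dual-homed server; the matching leg lands on the other ToR
    interface port-channel101
      switchport mode trunk
      vpc 101
    interface Ethernet1/1
      description prod-k8s-01, 2x10Gb LACP bond
      channel-group 101 mode active

On the server side, all that's needed is an 802.3ad (LACP) bond across the two 10Gb NICs.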

Network Layout

As stated, the network here will be an aggregated layer 2 model using all-active MLAG links (à la Cisco's virtual port-channel). Some notes:
  • Edge routers expandable to additional peering, transit and other circuit terminations (MPLS PE or CE functions, SD-WAN termination, virtual circuit termination, etc.)
  • In the event of a single edge router failure, all Internet ingress/egress traffic shifts to the remaining edge router and circuit by following BGP advertisements
  • Edge routers form an iBGP "core" for sharing learned BGP routes (a config sketch follows the network diagram below)
  • BFD is configured on the iBGP-facing and eBGP-facing links (where the provider supports it) for the fastest possible, sub-second link-failure detection
  • In the event of a single aggregation switch failure, all traffic fails over to the surviving switch by following HSRP (egress) and OSPF (ingress)
  • This implies that all connected servers and nodes need to support LACP bonding for the best possible redundancy
An all-active (MLAG) layer-2 network
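
As an example of the iBGP "core" and its BFD protection, here's roughly what the relevant IOS-XE-style configuration on one edge router could look like. The private AS number, crosslink interface and addressing are placeholders, not values from the design:

    ! Crosslink carrying the iBGP session to the second edge router
    interface TenGigabitEthernet0/0/7
     description iBGP crosslink to edge-02
     ip address 10.255.0.1 255.255.255.252
     bfd interval 300 min_rx 300 multiplier 3
    !
    router bgp 64512
     neighbor 10.255.0.2 remote-as 64512
     neighbor 10.255.0.2 fall-over bfd
     address-family ipv4
      neighbor 10.255.0.2 activate
      neighbor 10.255.0.2 next-hop-self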

Routing

Again, I've tried to remove up-front complication by keeping IP routing as simple as possible. Servers reach the outside world through either the load-balancing service or the firewall; everything else is switched and routed locally via switch virtual interfaces (SVIs).

Edge
  • Public IP prefix(es) are announced to each ISP neighbor router via eBGP
  • For faster convergence and sub-second failure detection, BFD is enabled on the provider-facing ports (where the ISP supports it) as well as on all southbound links
  • Full Internet routing tables are accepted from each ISP via eBGP, with the best routes shared across the iBGP "core"
  • AS_PATH prepending policies are used and tweaked for inbound load balancing (and MED where a provider hands us multiple ports)
  • Serve as OSPF ASBRs, responsible for advertising the default route into the IGP
  • BFD is registered with both the OSPF and BGP processes so either converges in sub-second time on link failure (a config sketch follows this list)
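Tying those edge bullets together, a trimmed-down IOS-XE-style sketch for one edge router might look like the following. The AS numbers, public prefix and neighbor addresses are documentation-range placeholders, and a real deployment would add prefix-list filtering toward the ISP:

    ! Static anchor so the public prefix can be originated via the network statement
    ip route 203.0.113.0 255.255.255.0 Null0
    !
    router bgp 64512
     ! Announce our public prefix and take full routes from ISP1
     neighbor 198.51.100.1 remote-as 64496
     neighbor 198.51.100.1 description ISP1
     neighbor 198.51.100.1 fall-over bfd
     address-family ipv4
      network 203.0.113.0 mask 255.255.255.0
      neighbor 198.51.100.1 activate
      neighbor 198.51.100.1 route-map ISP1-OUT out
    !
    ! Prepend our AS to make this path less attractive for inbound traffic
    route-map ISP1-OUT permit 10
     set as-path prepend 64512 64512
    !
    ! Advertise a default route into the IGP (requires a default in the table; add "always" to force it)
    router ospf 1
     default-information originate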
Aggregation
  • Default routes are learned from the edge routers via OSPF and withdrawn automatically if an edge fails
  • ECMP (maximum-paths >= 2) enabled in OSPF to allow use of multiple learned default routes
  • HSRP (active/active via vPC) is utilized for first-hop redundancy for all outbound flows on the "Internet" SVI(s) (a config sketch follows this list)
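On NX-OS, the aggregation side of those bullets could look roughly like this. The VLAN number, addressing and HSRP priority are placeholders; note that with vPC both HSRP peers forward traffic, which is where the "active/active" behavior comes from:

    feature ospf
    feature hsrp
    feature interface-vlan

    router ospf 1
      ! Explicitly allow both learned default routes to be installed (ECMP toward the edges)
      maximum-paths 2

    ! "Internet" transit SVI toward the edge/firewall layer
    interface Vlan10
      no shutdown
      ip address 10.10.10.2/28
      ip router ospf 1 area 0.0.0.0
      hsrp 10
        ip 10.10.10.1
        priority 110
        preempt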
Network Services
  • Load Balancer VIPs (public and private) accept all ingress/egress application flows
  • Default route for all subnets (except "Internet") points to the firewall (a firewall sketch follows the logical diagram below)
    • Central egress NAT gateway device for the datacenter
    • For the DB subnet this is optional; it exists for enhanced security-policy enforcement and inspection
    • Depending on DB performance requirements, the firewall could become a bottleneck for those flows, in which case the L3 gateway addresses may be better placed on aggregation-switch SVIs
  • CDN and DDoS mitigation services are performed via cloud-based services such as CloudFlare, Incapsula, Akamai, etc.

Segmentation and routing (logical layout)
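
For the firewall's role as default gateway and central NAT point, a minimal ASA-style sketch could look like the following. The interface names, addressing, internal supernet and the example database port are all hypothetical:

    ! Default route toward the "Internet" SVI / edge layer
    route outside 0.0.0.0 0.0.0.0 10.10.10.1
    !
    ! Hide all internal server subnets behind the outside interface address on egress
    object network INSIDE-NETS
     subnet 10.10.0.0 255.255.0.0
     nat (inside,outside) dynamic interface
    !
    ! Example inter-VLAN policy: only the prod-k8s subnet may reach the DB shards (5432 is just an example port),
    ! applied outbound on the hypothetical "db" interface facing the DB subnet
    access-list TO-DB extended permit tcp 10.10.110.0 255.255.255.0 10.10.140.0 255.255.255.0 eq 5432
    access-list TO-DB extended deny ip any any
    access-group TO-DB out interface db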

Inventory List

We now need some network hardware to run all of this. I'm going to grab a few items that I believe will work best and list a few reasons why. Time to go shopping!
  • Aggregation switch count/model is 2x Cisco Nexus 9272Q
    • 72x40Gb ports
    • Fixed, high-density chassis; allows scaling to 17 total cabinets (leaving 4 QSFP+ ports available for uplinks and the vPC peer link)
  • Top-of-Rack 1Gb switch count/model is 4x Cisco Catalyst 2960X-48TS-L
    • Dedicated to separate infrastructure management functions
    • Each server connects via a dedicated 1Gb Ethernet interface for out-of-band control-plane operations (SSH, SNMP, iLO, etc.)
  • Router count/model is 2x Cisco ASR 1001-HX (8x10Gb, 12x1Gb)
    • 16 GB DRAM config to support 3.5M IPv4 routes, 3M IPv6 routes
    • One ISP circuit per edge router, expandable to future transit and peering circuits across the two
    • Compact chassis, cost-effective option w/ excellent forwarding performance
  • Firewall count/model is 2x Cisco ASAv10
    • Alternative virtual firewalls: vSRX, Palo Alto-VM, etc.
    • Implemented as KVM VMs on purpose-built network hypervisor cluster
    • Used to filter between VLANs where required (e.g. the DB server subnets), and to handle server-initiated outbound NAT policies, outbound inspection, etc.
    • An excellent tool for compliance and overall security improvement
    • SVTI/routed VPN for hybrid-cloud interconnects; SSL VPN termination endpoint for staff
  • Load balancer is a KVM-based HAProxy cluster
    • Implemented as KVM VMs on the purpose-built network hypervisor cluster
    • Handles the entirety of inbound application traffic and load balances it to the appropriate container or VM workload pools (a sample configuration follows this list)
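To round out the inventory, a bare-bones HAProxy configuration for the public web VIP might look something like the sketch below. The VIP address, backend pool members and the choice of TCP passthrough for TLS are assumptions for illustration only:

    defaults
        mode tcp
        timeout connect 5s
        timeout client  30s
        timeout server  30s

    # Public HTTPS VIP, passed through at the TCP layer to the Kubernetes ingress nodes
    frontend fe_web_https
        bind 203.0.113.10:443
        default_backend be_k8s_ingress

    backend be_k8s_ingress
        balance roundrobin
        server k8s-01 10.10.110.11:443 check
        server k8s-02 10.10.110.12:443 check
        server k8s-03 10.10.110.13:443 check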

Conclusion

Allow me to end by saying the obvious: There's no single way to build a data center. This is just one way a team may decide to go, which should provide resiliency, scale and redundancy for their hosted application.

Thanks for reading!
