February Update on High-Availability Project
The project first described in my previous post, Ubuntu 22.04 LTS High Availability Router/Firewall with Two ISPs, has come a long way, but has also encountered many bumps in the road.
We quickly realized that testing some of this functionality in VirtualBox was not going to give accurate results due to the way VirtualBox handles network interfaces. As a result, I asked to deploy the physical servers on unused IP addresses so that we could test the proposed solution. The organization’s technical leadership promptly agreed, so we set a date and made our plans for late November.
When I arrived on-site, eager to proceed with the deployment, several infrastructure issues became immediately apparent. The servers selected and purchased by the organization would not fit in the rack. A new rack had been purchased, but it would not fit in the room. The on-site technical staff were at a loss as to how to proceed, knowing that they would not be permitted to purchase an additional server that would fit in the existing rack, and they certainly couldn’t build a new room. Additionally, the room with the existing rack had suffered recent water damage coming in from the roof, so they wanted to minimize the amount of active electrical equipment at that location, even though it was the demarcation point for both ISPs.
Taking stock of the situation, I suggested relocating the server to another server room with a larger rack. This secondary server room housed every server other than the router/firewall. We quickly discovered that the organization’s infrastructure was in far worse shape than anyone realized. The 10Gbps ring that was believed to be in place was actually alternating 10Gbps and 1Gbps links. The redundant connections to the server room turned out to be disconnected, with all services riding over a single link. Some CWDM equipment was partially deployed, with muxers on one end but not the other. A mishmash of multimode patch cables was miraculously carrying traffic between single-mode transceivers and inter-building fiber. Uninterruptible Power Supplies had dead or disconnected batteries, and some had dead control components. Devices were connected inconsistently, with some plugged into unprotected building circuits instead of generator-backed circuits. We had a lot of work to do before we could even turn on the new router/firewall hardware.
We started by trying to improvise a testing environment for the new servers, carving out a connection through the existing switches. The switch control plane accepted fairly routine configuration changes, but those changes were never actually applied in the data plane. This occurred on a switch that was a single point of failure for the entire organization, and there were no spares on hand, so we were hesitant to attempt power cycling the device.
Sometime later, a campus-wide power failure revealed that the generators at the demarc location did not kick on at all. Not only was the entire campus knocked out, but several systems did not come back on after power was restored. A mad scramble to restore bare functionality ensued.
This led me to assist in quickly deploying four new Juniper EX3400 switches in a highly redundant configuration. Two switches were deployed in the server room, and two switches were deployed in the demarcation room. In each room, the switches were cross-connected with 10Gbps DACs, and one switch was designated as primary and the other as secondary. I assisted in selecting components and fully deploying the rest of the passive CWDM system that had been haphazardly put in place by former technical staff. The finished system provided two separate fiber paths with four optical channels each between the server room and the demarcation room. The primary switch in the server room was connected to the primary switch in the demarcation room by single-mode fiber with 10Gbps CWDM optics, running over separate CWDM muxers and separate fiber pairs. The secondary switches were connected to each other in the same manner. I deployed RSTP configurations to prevent loops, which gave us redundant connectivity between the server room and the demarcation room. In each room, each switch was also connected by fiber into the campus infrastructure, ensuring multiple paths would be available.
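For readers who want a feel for the switch side, here is a minimal sketch of the kind of RSTP configuration involved on the EX3400s. The interface names and the bridge priority below are placeholders for illustration, not the organization’s actual configuration:

```
# Run RSTP on the CWDM uplink and the inter-switch DAC (hypothetical interface names)
set protocols rstp interface xe-0/2/0
set protocols rstp interface xe-0/2/1

# Mark ports facing servers as edge ports so they begin forwarding immediately
set protocols rstp interface ge-0/0/10 edge

# Lower the bridge priority on the switch that should win the root bridge election
set protocols rstp bridge-priority 4k
```

With RSTP in place, one of the redundant paths stays blocked until the active path fails, at which point the topology reconverges on the surviving links automatically.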
It was around this point that I discovered the organization did not have functional infrastructure for the supporting applications necessary to run the router/firewall configuration they had asked me to deploy. The technical staff had opted to migrate their old CentOS/Xen virtualization to VMware, but had failed to complete the deployment even with the assistance of a consultant hired for that purpose. I assisted with configuring the VMware system to interface with the network, and we split its redundant network connections across the two new switches so that the virtual environment would be more resilient.
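As a rough sketch, splitting the uplinks of an ESXi standard vSwitch across two physical switches looks something like the following. The vmnic and vSwitch names here are placeholders, not the organization’s actual values:

```
# Add a second physical uplink, cabled to the secondary switch, to the standard vSwitch
esxcli network vswitch standard uplink add --uplink-name=vmnic1 --vswitch-name=vSwitch0

# Make both uplinks active so VM traffic survives the loss of either physical switch
esxcli network vswitch standard policy failover set --active-uplinks=vmnic0,vmnic1 --vswitch-name=vSwitch0

# Verify the uplinks and teaming policy
esxcli network vswitch standard list
```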
Around this time, it was announced that VMware was being bought by Broadcom and that free ESXi would no longer be available. This meant I couldn’t set up a homelab using ESXi to mimic the organization’s environment, so I chose to start building a different solution for my lab on an inexpensive machine. I tried out a handful of options, including OpenStack on Ubuntu and Proxmox. OpenStack was feature-rich, but far too heavy for lightweight homelab hardware. Proxmox worked fine, but its default security configuration was lacking, and hardening it into a suitably secure environment would have taken too long.
In the end, I opted to deploy Ubuntu Server 22.04 LTS on the homelab server and use raw libvirt commands to stand up KVM instances. This proved to be exceptionally easy to deploy (default Ubuntu Server install options, plus a couple of apt install commands). It also ran with low load and high performance on the lightweight hardware I had available. With multiple instances running and actively doing work, everything still felt real-time, and the overall CPU load averaged under 0.4. For comparison, OpenStack on the same system was sluggish and had an average CPU load of over 1.5 before even starting the first instance. I’ll provide more information about how I’m using KVM instances for a test lab in a future post.
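For anyone who wants to replicate the lab host in the meantime, the setup was roughly as follows. The package list is the standard KVM/libvirt set for Ubuntu 22.04; the VM name, sizing, ISO path, and os-variant value are placeholders to adapt to your own environment:

```
# Install KVM, libvirt, and the command-line tooling on a fresh Ubuntu Server 22.04 install
sudo apt update
sudo apt install -y qemu-kvm libvirt-daemon-system libvirt-clients virtinst

# Allow the regular user to manage VMs without sudo (log out and back in afterwards)
sudo adduser "$USER" libvirt

# Stand up a test instance from an installer ISO (name, sizing, and path are placeholders)
virt-install \
  --name ha-lab-test \
  --memory 2048 \
  --vcpus 2 \
  --disk size=20 \
  --cdrom /var/lib/libvirt/images/ubuntu-22.04-live-server-amd64.iso \
  --os-variant ubuntu22.04 \
  --network network=default

# Confirm the instance is up
virsh list --all
```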
Back on-site, the organization experienced another major power outage. Due to growing concerns about old hardware failing, the organization asked me to move the transition along faster, even if the new systems were not tested, fully configured, or production-ready. While I was hesitant to do this, I knew the organization didn’t have many options with old, failing hardware putting their entire operation at risk. First, the redundant switch ring was transitioned to serve as the production backbone between the demarcation room and the server room. Next, several virtual machines were migrated to the new VMware server. Last, the new router/firewalls took over for the old servers, despite having only basic configurations loaded and operational.
At this point, I found myself working on production systems. Even slight changes needed to be made outside of business hours. Because the servers were not fully production-ready, we were often sidetracked correcting service-impacting issues with temporary fixes. There was no longer a place to test configurations without impacting actual user traffic.
We made it work, but progress was now much slower. Despite the issues, every step taken so far has dramatically improved the performance of the organization’s network. Video surveillance feeds no longer stutter due to congestion and packet loss, web browsing is more responsive, and our visibility into the network has improved. As the project progresses, I will provide more updates, and I hope to share a solution that can be implemented in a homelab to reproduce the desired outcome - an open-source, high-availability router/firewall setup that can use two ISPs.