27 January 2021

SDN / NFV

Managing data center physical infrastructure with Tungsten Fabric


A data center’s physical infrastructure can consist of many devices, including switches and routers. Managing them by hand is time-consuming and error-prone, and adding an SDN solution to a legacy data center network makes the problem even more complex. Tungsten Fabric, an open-source SDN controller, may be the answer. Read on to learn more.

Data center fabric

Modern data centers are built as a flat topology of two or three layers of densely interconnected devices, known as a fabric. This leaf-spine architecture is robust and easy to scale out: capacity grows by adding new devices rather than by replacing older devices with more powerful ones. The fabric provides an underlay that acts as a foundation for the more complex applications and business rules provided by an overlay.

Configuring underlay and overlay networking in this architecture from scratch can be challenging. Applying new business and application rules by hand is cumbersome and leads to errors. Even a simple user intention may require manual changes on multiple devices. That’s where automation with the Tungsten Fabric controller comes into play.

What Tungsten Fabric can do

Tungsten Fabric can help set up and manage data center fabric devices from the very beginning. Even when routers or switches are zeroized, it can provision them with all the configuration needed for underlay eBGP and overlay iBGP connectivity. It also accepts intents that establish overlay connections between the bare metal servers that are attached to the fabric and run workloads. Finally, the Tungsten Fabric controller assists with everyday operations like device upgrades and maintenance.

All these functions and operations can be handled programmatically by sending the appropriate intents or job requests to the controller's Config API. Users can also observe the system’s status through the TF Analytics API.

From zero to hero

Zero-touch provisioning

When fabric devices are physically connected but zeroized, that is, they have no configuration, they need to be provisioned first. The Tungsten Fabric controller offers a ZTP job that sets up the underlay and overlay from scratch. Users provide a minimal set of configuration knobs, such as eBGP ASNs, some IP addresses and a list of devices to be provisioned, and let the controller do its job. The following items are set up during this automatic configuration:

  • underlay eBGP, with ASNs from the configured range assigned to individual devices
  • management IP addresses of the devices, with a default gateway
  • IP addresses assigned to in-band interfaces
  • loopback interfaces used for iBGP peering in the overlay

Internally, during the ZTP job, the local DHCP server, which is part of the TF controller, is configured to respond only to requests sent by devices whose serial numbers are on the list provided. The DHCP response carries not only an IP address but also the address of the TFTP server that serves the initial configuration. This configuration sets the device hostname, admin credentials and other mandatory settings. After that step, device-job-manager executes the subsequent Ansible roles, which push the desired configuration. An example of a job specification that can be sent to the TF Config API is given in the "Examples of job payloads" section at the end of this article.
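The inputs described above can also be assembled in code. The following is a minimal sketch and not part of Tungsten Fabric: build_ztp_job is a hypothetical helper, the field names are copied from the fabric_onboard payload shown at the end of this article, and all addresses, passwords and serial numbers are placeholders.

```python
import json

DOMAIN = "default-global-system-config"


def build_ztp_job(fabric_name, devices, asn_min, asn_max, overlay_asn,
                  mgmt_cidr, mgmt_gateway, loopback_subnet, fabric_subnet,
                  root_password, node_profile="juniper-qfx5k"):
    """Assemble a fabric_onboard job request as a plain dict.

    Hypothetical helper: it only mirrors the field names of the example
    payload in this article and performs no validation.
    """
    return {
        "job_template_fq_name": [DOMAIN, "fabric_onboard_template"],
        "input": {
            "fabric_fq_name": [DOMAIN, fabric_name],
            "fabric_display_name": fabric_name,
            "node_profiles": [{"node_profile_name": node_profile}],
            "loopback_subnets": [loopback_subnet],
            "overlay_ibgp_asn": overlay_asn,
            # The three flags below are copied as-is from the example payload.
            "disable_vlan_vn_uniqueness_check": False,
            "enterprise_style": True,
            "import_configured": False,
            "fabric_asn_pool": [{"asn_min": asn_min, "asn_max": asn_max}],
            "management_subnets": [
                {"cidr": mgmt_cidr, "gateway": mgmt_gateway}
            ],
            "fabric_subnets": [fabric_subnet],
            "device_auth": {"root_password": root_password},
            # Per the article, only devices whose serial numbers are
            # listed here get a DHCP response during ZTP.
            "device_to_ztp": [
                {"serial_number": serial, "hostname": hostname}
                for serial, hostname in devices
            ],
        },
    }


job = build_ztp_job(
    fabric_name="fab1",
    devices=[("XXXXXXXXXXXX", "qfx1"), ("XXXXXXXXXXXX", "qfx2")],
    asn_min=64000, asn_max=65000, overlay_asn=63000,
    mgmt_cidr="10.10.4.32/24", mgmt_gateway="10.10.4.62",
    loopback_subnet="2.2.2.0/24", fabric_subnet="10.0.0.0/24",
    root_password="password",
)
print(json.dumps(job, indent=4))
```

Keeping the payload in a function like this makes it easy to generate the device list from an inventory file instead of editing JSON by hand.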

Existing underlay onboarding

Tungsten Fabric can add an overlay to an already existing IP fabric. In that process, the TF controller automatically imports the existing fabric by discovering its devices, interfaces and networks. The onboarded fabric can then be used as a foundation for an overlay network. The user provides a minimal number of configuration parameters and starts the job. The Device Manager handles all the configuration changes.

Make an intention

The connection between bare metal servers

Once a fabric is brought up, it’s time to connect the servers that run a workload. To do so, the overlay network layer needs to be configured, which is another job the Tungsten Fabric controller handles. Users specify only high-level intents, which are then translated into low-level configuration and pushed to the physical devices.

Users can create Virtual Networks with individual IP subnets and connect them using a Logical Router of type VXLAN. This creates an L3 connection between the entities attached to those Virtual Networks. Under the hood, it creates VRFs on the spines (where spine-based routing is used). The next step is to associate these abstract objects with physical interfaces, which is done by defining Virtual Port Groups. Once all intents are defined, the TF controller and its Device Manager component evaluate them and configure the physical devices automatically and asynchronously.
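To make the idea of intent translation more concrete, here is a deliberately simplified sketch. It is an illustration only: the classes, the render_spine_vrf function, the VRF naming scheme and the VNI values are all hypothetical and do not reflect the Tungsten Fabric schema or the configuration the Device Manager actually generates.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VirtualNetwork:
    """High-level intent: a named network with one IP subnet."""
    name: str
    subnet: str
    vni: int  # VXLAN network identifier (illustrative values below)


@dataclass
class LogicalRouter:
    """High-level intent: L3 connectivity between several networks."""
    name: str
    networks: List[VirtualNetwork] = field(default_factory=list)


def render_spine_vrf(router):
    """Derive a device-neutral VRF summary from a logical router intent.

    A real controller would turn this into vendor-specific configuration;
    here only the information a spine VRF would need is collected.
    """
    return {
        "vrf": f"{router.name}-vrf",
        "routed_subnets": sorted(vn.subnet for vn in router.networks),
        "vnis": sorted(vn.vni for vn in router.networks),
    }


vn_a = VirtualNetwork("vn-a", "10.1.1.0/24", 5001)
vn_b = VirtualNetwork("vn-b", "10.1.2.0/24", 5002)
lr = LogicalRouter("lr1", [vn_a, vn_b])

vrf = render_spine_vrf(lr)
print(vrf)
```

The point of the sketch is the separation of concerns: the user only describes networks and the router that joins them, while a single function derives what each spine needs.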

Maintenance

Activate and deactivate maintenance mode

Should a device need to be taken out of the fabric for maintenance, the TF controller can help. Two job templates can be used:

  • maintenance_mode_activate drains traffic from the switch so it can be safely removed. First, the Device Manager performs health checks to make sure the operation won’t affect traffic (that is, it’s hitless). Second, it changes the configuration on all multi-homed peers. Finally, it switches the selected device into maintenance mode.
  • maintenance_mode_deactivate restores the device to the fabric by deactivating maintenance mode. First, the Device Manager pushes the configuration to the deactivated node. Second, it reconfigures the peers so they can send traffic to the selected node again.

Device software upgrade

The TF controller is capable of performing a hitless device image upgrade. Users can request such an operation using the hitless upgrade strategy job (see the example below). 

During the upgrade job, devices are switched to maintenance mode and instructed to download and install a new software package. After a successful upgrade, the Device Job Manager pushes the desired configuration and checks that the devices are in a healthy state.

Tungsten Fabric Device Manager

TF Device Manager is a key component when it comes to operating a data center fabric. It receives intents and job requests from the TF Config API through a message bus and a database. This service breaks down into the following components:

  • device-manager: translates high-level intents into low-level configuration
  • device-job-manager: executes the fabric Ansible playbooks that configure routers and switches
  • DHCP server: in the zero-touch provisioning use case, the physical device gets a management IP address from a local DHCP server running alongside the device-manager
  • TFTP server: in the zero-touch provisioning use case, this server provides a script with the initial configuration

Jobs

Operations performed by the Device Manager are time-consuming and asynchronous. They are started by calls to a special execute-job endpoint of the Config API. Predefined jobs are listed at the job-templates endpoint and are registered by the fabric-ansible components during cluster provisioning. A more detailed list of predefined job templates can be found here. At the time of writing, only Juniper’s MX routers and QFX switches have an open-source plugin.
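A call to the execute-job endpoint can be sketched with the Python standard library alone. This is a sketch under assumptions rather than a reference client: the host name is a placeholder, port 8082 and the X-Auth-Token header are assumptions about a typical deployment, make_execute_job_request is a hypothetical helper, and the "input" section is shortened here, so a real request needs the full input of the chosen job template. The request is built but not sent.

```python
import json
import urllib.request


def make_execute_job_request(api_base, payload, token=None):
    """Build (but do not send) a POST request for the execute-job endpoint."""
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumption: the cluster uses token-based authentication.
        headers["X-Auth-Token"] = token
    return urllib.request.Request(
        url=f"{api_base.rstrip('/')}/execute-job",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


payload = {
    "job_template_fq_name": [
        "default-global-system-config",
        "fabric_onboard_template",
    ],
    # Shortened for illustration; see the full payloads at the end.
    "input": {"fabric_fq_name": ["default-global-system-config", "fab1"]},
}

request = make_execute_job_request("http://config-api.example:8082", payload)
print(request.get_method(), request.full_url)

# To submit the job against a real controller:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Because jobs run asynchronously, the response to this call only acknowledges the request; progress has to be followed separately.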

The Device Job Manager reports job progress by sending UVEs (User Visible Entities) to the Collector. Users can retrieve job status and logs using the Analytics API and its Query Engine. The URL has the following format:

https://<server-ip>:8081/analytics/uves/job-execution/default-global-system-config:fab1:default-global-system-config:fabric_onboard_template?flat
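The URL above follows a regular pattern, so it can be generated for any fabric and job template. A minimal sketch, assuming the key is always built as domain, fabric name, domain, template name joined by colons, as in the example; job_uve_url is a hypothetical helper and the server address is a documentation placeholder.

```python
def job_uve_url(server, fabric, template,
                domain="default-global-system-config", port=8081):
    """Build the job-execution UVE URL in the format shown above."""
    key = ":".join([domain, fabric, domain, template])
    return f"https://{server}:{port}/analytics/uves/job-execution/{key}?flat"


url = job_uve_url("192.0.2.10", "fab1", "fabric_onboard_template")
print(url)

# Fetching the UVE from a real Analytics API could then look like:
# import json, urllib.request
# with urllib.request.urlopen(url) as response:
#     status = json.load(response)
```

Polling this URL after submitting a job is one way to follow its progress and collect its logs.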

Summary

As you can see, Tungsten Fabric can be used to manage the entire lifecycle of data center physical devices. Automation makes the life of data center operators much easier thanks to the following functionalities:

  • zero-touch provisioning of physical devices
  • automatically adding an overlay network to an existing IP fabric
  • intent-based networking
  • help with everyday operations like device upgrades and maintenance

Additionally, Tungsten Fabric can connect both physical and virtual worlds, making it possible to connect your legacy data center devices with SDNs and manage them together from a single pane of glass.

Examples of job payloads

fabric_onboard job payload

{
    "job_template_fq_name": [
        "default-global-system-config",
        "fabric_onboard_template"
    ],
    "input": {
        "fabric_fq_name": [
            "default-global-system-config",
            "fab1"
        ],
        "fabric_display_name": "fab1",
        "node_profiles": [
            {
                "node_profile_name": "juniper-qfx5k"
            }
        ],
        "loopback_subnets": [
            "2.2.2.0/24"
        ],
        "overlay_ibgp_asn": 63000,
        "disable_vlan_vn_uniqueness_check": false,
        "enterprise_style": true,
        "import_configured": false,
        "fabric_asn_pool": [
            {
                "asn_min": 64000,
                "asn_max": 65000
            }
        ],
        "management_subnets": [
            {
                "cidr": "10.10.4.32/24",
                "gateway": "10.10.4.62"
            }
        ],
        "fabric_subnets": [
            "10.0.0.0/24"
        ],
        "device_auth": {
            "root_password": "password"
        },
        "device_to_ztp": [
            {
                "serial_number": "XXXXXXXXXXXX",
                "hostname": "qfx1"
            },
            {
                "serial_number": "XXXXXXXXXXXX",
                "hostname": "qfx2"
            }
        ]
    }
}

hitless_upgrade_strategy_job payload

{
    "job_template_fq_name": [
        "default-global-system-config",
        "hitless_upgrade_strategy_template"
    ],
    "input": {
        "fabric_uuid": "bf9a9e58-1a1e-4f47-8805-d6bb1a3d8b4c",
        "upgrade_mode": "upgrade",
        "image_devices": [
            {
                "image_uuid": "be9fd3ff-01af-4392-af5c-714af766e0fd",
                "device_list": [
                    "b389166b-bc48-4e51-b669-bd09e219bc9f"
                ]
            }
        ],
        "advanced_parameters": {
            "health_check_abort": false,
            "bulk_device_upgrade_count": 4,
            "Juniper": {
                "lacp": {
                    "lacp_down_local_check": true,
                    "lacp_down_peer_check": true
                },
                "active_route_count_check": true,
                "bgp": {
                    "bgp_flap_count": 4,
                    "bgp_down_peer_count": 0,
                    "bgp_flap_count_check": true,
                    "bgp_down_peer_count_check": true,
                    "bgp_peer_state_check": true
                },
                "alarm": {
                    "system_alarm_check": true,
                    "chassis_alarm_check": true
                },
                "l2_total_mac_count_check": true,
                "fpc": {
                    "fpc_cpu_5min_avg": 50,
                    "fpc_memory_heap_util": 45,
                    "fpc_cpu_5min_avg_check": true,
                    "fpc_memory_heap_util_check": true
                },
                "interface": {
                    "interface_error_check": true,
                    "interface_drop_count_check": true,
                    "interface_carrier_transition_count_check": true
                },
                "routing_engine": {
                    "routing_engine_cpu_idle": 60,
                    "routing_engine_cpu_idle_check": true
                }
            }
        }
    }
}
Paweł Marchewka

Software Engineer