You may already be familiar with some of our previous work, which mainly focused on isolation issues in hypervisors. This time we want to present an issue concerning availability in Cloud environments. It was already covered in some of our presentations, but will be explained in greater detail in this blog post.
In the course of one of our security assessments of a public IaaS Cloud environment, we encountered the following network setting:
- A multi-tenancy Cloud environment (that is, at least one hypervisor [in our case, VMware ESXi 5] hosting machines of different customers)
- A typical network infrastructure, where the hypervisor is connected to at least one physical switch.
As we were to assess the (typical Cloud) question of whether it is possible to violate any isolation requirements (again, typically on the hypervisor, network, management, and storage level), we enumerated all potential interfaces and started to fuzz different protocols. Very soon we observed interesting behavior in the environment, where the following relevant fact comes into play:
Even though this setting can be found in many typical environments, the relevant detail about this one is the presence of a certain feature on the physical switch, called BPDU Guard, which is also a common setting in many network infrastructures.
BPDU Guard is a Layer 2 protection mechanism which prevents non-core-network entities from participating in the Spanning Tree Protocol (STP). If BPDU Guard is enabled on a switch port, the switch monitors incoming traffic on that port for BPDUs and shuts the port down as soon as one arrives (admittedly, it is not quite correct that the port is shut down; it is actually put into an err-disable state: specific BPDU Guard details can again be found here). This way, malicious or misconfigured BPDUs never reach the STP domain, and it is possible to identify malicious hosts by looking for unreachable machines (or, depending on the maturity of your monitoring processes, even some log analysis traps 😉 ). In the following, we will use "malicious" for any entity that is sending BPDUs even though it is not supposed to. Consequently, BPDU Guard is usually enabled on access ports that connect a host, not a network device, to the network.
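To make the mechanism a bit more tangible, here is a minimal Python sketch of the behavior described above. It is a toy model and not how any switch vendor implements the feature: the `GuardedPort` class, its state names, and the port name `Gi0/1` are our own illustrative assumptions. The only protocol facts it relies on are that IEEE STP BPDUs are sent to the bridge group address `01:80:C2:00:00:00` and carry the LLC header `0x42 0x42 0x03`.

```python
from dataclasses import dataclass, field

STP_MULTICAST = bytes.fromhex("0180c2000000")  # IEEE 802.1D bridge group address
LLC_STP = bytes([0x42, 0x42, 0x03])            # DSAP, SSAP, control field for STP


def is_bpdu(frame: bytes) -> bool:
    """Return True if a raw Ethernet frame looks like an IEEE STP BPDU."""
    # Bytes 0-5: destination MAC, 6-11: source MAC,
    # 12-13: 802.3 length field, 14-16: LLC header.
    return (
        len(frame) >= 17
        and frame[0:6] == STP_MULTICAST
        and frame[14:17] == LLC_STP
    )


@dataclass
class GuardedPort:
    """Hypothetical model of an access port with BPDU Guard enabled."""
    name: str
    state: str = "forwarding"
    log: list = field(default_factory=list)

    def receive(self, frame: bytes) -> bool:
        """Process an incoming frame; return True if it was forwarded."""
        if self.state == "err-disabled":
            return False  # nothing passes until an operator re-enables the port
        if is_bpdu(frame):
            self.state = "err-disabled"
            self.log.append(f"{self.name}: BPDU received, port err-disabled")
            return False
        return True


# Two hand-built sample frames (all-zero source MAC and payload for brevity).
bpdu = STP_MULTICAST + bytes(6) + (38).to_bytes(2, "big") + LLC_STP + bytes(35)
ipv4 = bytes.fromhex("ffffffffffff") + bytes(6) + bytes.fromhex("0800") + bytes(46)

port = GuardedPort("Gi0/1")
before = port.receive(ipv4)   # regular traffic is forwarded
hit = port.receive(bpdu)      # the BPDU trips the guard
after = port.receive(ipv4)    # regular traffic is now dropped as well
print(before, hit, after, port.state)
```

Note that the model has no notion of who sent the frame: once a single BPDU arrives, everything behind the port is cut off.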
Coming back to the initial Cloud scenario, assume a malicious virtual machine hosted on the hypervisor is attacking the STP domain. The desired behavior of BPDU Guard would be to disconnect the malicious virtual machine from the network. However, as BPDU Guard is implemented on the physical switch, it cannot distinguish between the virtual machines on the host connected to the particular switch port. Hence BPDU Guard shuts down the switch port the hypervisor is connected to, making the whole hypervisor and all hosted virtual machines (which in some environments, especially Cloud environments, can comprise several hundred systems) unavailable. To state this explicitly: Abusing a security control which is perfectly valid and important in traditional environments can bring down a complete hypervisor, transforming a security control into a vulnerability as soon as you move to a multi-tenancy Cloud environment. To make things worse, many virtualized environments have live migration or failover features, which redeploy a virtual machine on another hypervisor once it becomes unavailable in order to avoid downtime. In our scenario, this means that the malicious machine is started on another hypervisor and, by continuing to send BPDUs, makes that one unavailable almost immediately as well. I think you see where this idea is going…
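The cascade can be sketched in a few lines of Python. This is a deliberately simplified simulation under our own assumptions: every hypervisor has exactly one uplink, failover restarts all VMs of a dead host on the next host whose uplink is still up, and the host and VM names are made up. Real failover logic is considerably more involved.

```python
from dataclasses import dataclass, field


@dataclass
class Hypervisor:
    """Hypothetical host with a single uplink to a BPDU-Guard-protected port."""
    name: str
    vms: list = field(default_factory=list)
    uplink_up: bool = True


def simulate(hypervisors, malicious_vm):
    """Return the order in which hypervisors lose their uplink."""
    order = []
    while True:
        host = next((h for h in hypervisors if malicious_vm in h.vms), None)
        if host is None or not host.uplink_up:
            break
        # The physical switch sees a BPDU on the uplink and err-disables
        # the whole port, not just the VM that sent it.
        host.uplink_up = False
        order.append(host.name)
        # Failover (assumption): restart all VMs on the next healthy host.
        target = next((h for h in hypervisors if h.uplink_up), None)
        if target is None:
            break  # no healthy host left
        target.vms.extend(host.vms)
        host.vms.clear()
    return order


cluster = [
    Hypervisor("esx1", ["evil-vm", "tenant-a"]),
    Hypervisor("esx2", ["tenant-b"]),
    Hypervisor("esx3", ["tenant-c"]),
]
outage_order = simulate(cluster, "evil-vm")
print(outage_order)
```

In this toy cluster the malicious VM is dragged along by the failover mechanism until no hypervisor has a working uplink anymore, taking the unrelated tenants down with it.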
While this scenario has already been described in this blog post, we also want to stress what it means for Cloud environments: You have to challenge known and established security controls/models as soon as you think about hosting, or moving to, a public, multi-tenancy Cloud environment. A well-established and reasonable security control can lead to significant (and trivially exploitable) attack scenarios in a Cloud environment. Thinking about some typical recommendations security people have been fighting with operations people about for quite a while: This scenario also perfectly justifies having different security zones for your virtual machines.
Even though this is a relevant message, the mitigation of the problem is probably just as relevant; unfortunately there is no simple solution. The following three options exist, each with specific advantages and disadvantages:
- Use BPDU Filter instead of BPDU Guard: This would mitigate the DoS vulnerability induced by BPDU Guard, but does not provide feedback about malicious virtual machines and hence does not address the root cause: the malicious machine on your network (unless your log analysis process is really mature). In addition, if BPDU Filter is configured on the wrong port, this may also affect the stability of your STP domain (BPDUs may be dropped silently, which can let switching loops form; STP and BPDU Guard are usually implemented for a reason ;) ). Please also refer to this blog post for some other details on how bridged vNICs or hypervisor NICs may impact your whole network infrastructure. Differences and design discussions of BPDU Guard and BPDU Filter can also be found here.
- Use another vSwitch: Older versions (pre ESXi 5.1) of the VMware vSwitch (and many other vSwitches) do not include any BPDU Guard or Filter mechanisms. For VMware environments, this can be addressed by using the Cisco Nexus 1000v, which has an implicit BPDU Filter implemented (however, the use of the Nexus 1000v may induce completely new operational problems), or by upgrading to ESXi 5.1, which includes a BPDU Filter feature (refer to this blog post for details and configuration guidance).
- Disable BPDU Guard: This mitigates the DoS vulnerability, but leaves your STP domain vulnerable. For us, not a particularly attractive option…
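The trade-offs between these options can be summarized as a small decision table. The following Python sketch only restates the reasoning above from the point of view of the physical switch port; the `Policy` and `Outcome` names are our own, it assumes the filter is placed on the correct port, and it ignores the log analysis caveat.

```python
from enum import Enum
from typing import NamedTuple


class Policy(Enum):
    BPDU_GUARD = "guard"
    BPDU_FILTER = "filter"
    NONE = "none"


class Outcome(NamedTuple):
    uplink_stays_up: bool       # do the other tenants keep their connectivity?
    stp_domain_protected: bool  # is the rogue BPDU kept out of the STP domain?
    operator_notified: bool     # is there an obvious signal that something is wrong?


def on_bpdu(policy: Policy) -> Outcome:
    """Simplified effect of one rogue BPDU arriving on the hypervisor uplink."""
    if policy is Policy.BPDU_GUARD:
        # Port goes err-disabled: STP is safe and the outage is hard to miss,
        # but every VM behind the port is offline.
        return Outcome(False, True, True)
    if policy is Policy.BPDU_FILTER:
        # BPDU is dropped silently: no outage, but also no feedback.
        return Outcome(True, True, False)
    # No protection: the BPDU is processed by STP like any other.
    return Outcome(True, False, False)


for policy in Policy:
    print(f"{policy.name:12} {on_bpdu(policy)}")
```

No row has all three flags set, which is exactly the "no simple solution" situation described above.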
We hope you enjoyed the read. Have a good weekend,