VMware Validated Design for SDDC 4.0 Generally Available

Disclaimer

This content is provided for historical reference and may no longer reflect current guidance or best practices.

We're excited to announce that on March 2nd, 2017, we released the VMware Validated Design for Software-Defined Data Center 4.0. This release is another milestone in our commitment to delivering standardized, proven, and robust data-center-level designs for the Software-Defined Data Center to our customers.

The VMware Validated Designs provide our customers comprehensive, prescriptive guidance to plan, build, and operate a Software-Defined Data Center. The designs are extensively tested to ensure all components and their specific versions are validated to work in unison, to scale to predetermined design objectives, and operate as our customers expect.

Unlike reference architectures, which may focus on an individual product or purpose without lifecycle management guidance, the VMware Validated Design for Software-Defined Data Center is a holistic approach to designing a full SDDC stack that’s applicable to a broad set of uses, with a commitment to ongoing upgrade guidance.

What's New in 4.0?

Bill of Materials

4.0 is the first VMware Validated Design release to incorporate the vSphere 6.5 wave of products and adjacencies. The Bill of Materials (BOM) has been completely refreshed with all new product versions in the SDDC. Customers can take advantage of the new performance enhancements, scalability, and robustness included in these latest versions for a complete Software-Defined Data Center.

This substantial platform uplift in 4.0, the first major vSphere version refresh since the 1.0 design launch, demonstrates the strength and flexibility inherent to the VMware Validated Design for Software-Defined Data Center, enabling existing customers to seamlessly upgrade their full SDDC stack.

The updated product versions include:

vSphere 6.5a
vSAN 6.5
NSX 6.3.1
vRealize Operations 6.4
vRealize Log Insight 4.0
vRealize Automation 7.2
vRealize Business for Cloud 7.2
Site Recovery Manager 6.5

A rundown of all the product, management pack, and content pack versions in this release is available in the release notes.

Documentation

Previous versions of the VMware Validated Design for Software-Defined Data Center provided the documentation in only PDF format. With the 4.0 release, we’re now making all content readily accessible in both PDF and online HTML formats.

With the new online format, you can now easily read, browse, search, and bookmark sections across the documentation from almost any device. I find this extraordinarily helpful on a day-to-day basis.

And with the PDF format, you can still download and take the documentation with you to read, annotate, or mark up on many readers, tablets, and specialized apps. You can access the PDF versions directly from the online docs.

Design Updates

Now let’s turn our attention to the design and do a rundown of some of the design updates for this release:

vSphere Availability – vSphere 6.5 introduced several improvements, one of which is in vSphere HA Admission Control. Admission control sets aside a predetermined amount of resources for use in the event of a host failure. The design now simply sets the number of host failures that the initial 4-node management cluster may tolerate to 1. Once configured, vSphere HA automatically calculates the percentage of resources to set aside and applies the “Percentage of Cluster Resources” admission control policy. As hosts are added to or removed from the cluster, the percentage is automatically recalculated. The same configuration may be applied to the shared edge/compute and additional compute clusters if desired.
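As a rough illustration of how the reserved percentage follows the cluster size, here is a minimal sketch. It assumes identically sized hosts, so the reservation is simply the tolerated failures divided by the host count; the function name is hypothetical, and the real vSphere HA calculation may differ in detail.

```python
def reserved_failover_percentage(host_count: int, host_failures_to_tolerate: int = 1) -> float:
    """Approximate share of cluster resources held back for failover.

    Assumption: all hosts contribute equal capacity, so tolerating N host
    failures means reserving N / host_count of the cluster.
    """
    if host_count <= 0:
        raise ValueError("host_count must be positive")
    if not 0 <= host_failures_to_tolerate < host_count:
        raise ValueError("failures to tolerate must be between 0 and host_count - 1")
    return 100.0 * host_failures_to_tolerate / host_count


# The percentage shrinks as hosts are added to the cluster.
for hosts in (4, 5, 6, 8):
    print(f"{hosts} hosts -> {reserved_failover_percentage(hosts):.1f}% reserved")
```

For the initial 4-node management cluster tolerating 1 host failure, this approximation gives a 25% reservation.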

Resource Pools – The design uses a 2-pod architecture: one pod for management and one for the initial shared edge/compute. Customers can add compute pods as needed. In this initial shared edge/compute pod we establish a resource pool for the required SDDC NSX Controllers and Edge Services Gateways with a CPU and memory reservation. The NSX components control all network traffic in and out of the SDDC as well as update route information for inter-SDDC communication. In a contention situation, it is imperative that these virtual machines receive all the resources required. We’ve raised the guidance for the memory reservation to 16GB (from 15GB) with a normal share level. We have retained the CPU share level at high. During a contention scenario, the SDDC NSX components for edge services in this cluster will receive more resources than all other workloads. As such, monitoring and capacity management should be proactive.

Compute Only Clusters – In prior versions, all compute only clusters were bound to their rack enclosure. In this release, we’ve lifted this restriction. Now, compute clusters may span the racks horizontally over the L3 connection if desired.

Orchestrated Restart and Startup – vSphere 6.5 now allows an administrator to create VM-to-VM dependency chains. These dependency rules are enforced when vSphere HA restarts VMs from failed hosts. This is great for multi-tier applications that do not recover well unless they are restarted in a specific order. The design now puts this new vSphere 6.5 feature to powerful use. We use this construct not only to protect (restart and anti-affinity) the solutions that establish the SDDC – for example, vRealize Automation – but also to control the start-up of the layers.

The design establishes virtual machine groups for layers or components in the solution. For example, we can create a “vRealize Automation Virtual Appliances” group where we define the two vRealize Automation virtual machines and do the same for similar layers. Once the groups have been defined, we create the virtual machine rules to set the startup order of virtual machine groups. This ensures the correct startup order of the SDDC management components. We create a set of “SDDC Cloud Management Platform” rules that define which group starts prior to another group. For example, we define that the “vRealize Automation Virtual Appliance” group members must start prior to the “vRealize Automation IaaS Web Servers” group members and so on.
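The group-to-group startup rules described above amount to a dependency graph, and a valid startup sequence is a topological ordering of it. The sketch below only illustrates that idea with Python's standard `graphlib` module (Python 3.9+); it is not how vSphere HA implements the rules. The first two group names come from the text, while the third group is hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key may start only after every group in its value set has started.
startup_rules = {
    "vRealize Automation IaaS Web Servers": {"vRealize Automation Virtual Appliances"},
    # Hypothetical third tier, added only to show a longer chain.
    "Example IaaS Manager Service group": {"vRealize Automation IaaS Web Servers"},
}


def startup_order(rules: dict[str, set[str]]) -> list[str]:
    """Return the groups in an order that satisfies every startup rule.

    Raises graphlib.CycleError if the rules contradict each other.
    """
    return list(TopologicalSorter(rules).static_order())


order = startup_order(startup_rules)
for position, group in enumerate(order, start=1):
    print(position, group)
```

A useful side effect of modeling the rules this way is that a circular dependency is reported as an error instead of silently producing a partial order.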

vMotion TCP/IP Stack – This design now includes the vMotion TCP/IP stack to isolate vMotion traffic and to assign it a dedicated default gateway. By using this stack, the design routes traffic for the migration of powered-on or powered-off virtual machines through a default gateway that is different from the gateway assigned to the default stack on the host, assigns a separate set of buffers and sockets, avoids routing table conflicts that might otherwise appear when many features use a common TCP/IP stack, and isolates traffic to improve security.

Platform Services Controller Availability – The design has introduced the use of a load-balancer to increase the availability of the Platform Services Controllers used in the SDDC stack. This is accomplished by establishing an NSX Edge Services Gateway in front of the PSCs with a streamlined deployment.

vSphere Update Manager and Download Service – The design has added several design decisions for operating vSphere Update Manager now that it’s built into the VCSA in vSphere 6.5, but I want to call your attention to the deployment model.

The design provides guidance for running vSphere Update Manager Download Service on a VM that’s connected to the region-specific application virtual network. The vCenter Server instances don’t or shouldn’t have access to the outside world, and the UMDS provides local storage and access to vSphere Update Manager repository data, avoids cross-region bandwidth usage for repository access, and provides a consistent deployment model. Each region has a single UMDS and each uses the default patch repositories by VMware to simplify the configuration.

Host Profiles – To simplify the configuration of hosts and ensure settings are uniform across the management cluster, the design now uses host profiles. Although this design decision is specifically for the management cluster, we encourage the use of the host profiles across the shared edge and compute cluster, as well as any additional compute cluster.

vDS – The design now provides guidance to enable the vSphere Distributed Switch Health Check on all distributed switches – those used for the management pod, shared edge/compute pod, and any additional compute only pod. The health check ensures that all VLANs are trunked to all hosts attached to the vDS and that the MTU sizes match the physical network.

In addition to enabling the health check, we’ve also updated each distributed switch to version 6.5.0 – provided by vSphere 6.5 – to ensure all new capabilities of this switch are available.
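To make the two health-check conditions concrete (every VLAN trunked to every host, and matching MTU sizes), here is a small sketch of the comparison itself. It is not the vDS Health Check mechanism or its API; the function, the data layout, and the host names are all hypothetical.

```python
def find_mismatches(required_vlans: set[int], vds_mtu: int,
                    host_uplinks: dict[str, dict]) -> list[str]:
    """Compare each host's physical uplink settings with what the switch expects.

    host_uplinks maps a host name to {"trunked_vlans": [...], "mtu": int}.
    Returns human-readable findings; an empty list means no mismatch.
    """
    problems = []
    for host, uplink in sorted(host_uplinks.items()):
        missing = required_vlans - set(uplink["trunked_vlans"])
        if missing:
            problems.append(f"{host}: VLANs not trunked: {sorted(missing)}")
        if uplink["mtu"] != vds_mtu:
            problems.append(f"{host}: MTU {uplink['mtu']} does not match vDS MTU {vds_mtu}")
    return problems


# Hypothetical inventory: the second host is missing a VLAN and has a smaller MTU.
uplinks = {
    "esxi-01.example.local": {"trunked_vlans": [1611, 1612, 1613], "mtu": 9000},
    "esxi-02.example.local": {"trunked_vlans": [1611, 1612], "mtu": 1500},
}
for finding in find_mismatches({1611, 1612, 1613}, 9000, uplinks):
    print(finding)
```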

Certificate Management – By default, vSphere 6.5 uses TLS/SSL certificates that are signed by VMCA. These certificates are not trusted by end-user devices or browsers. It’s a recommended practice to replace, at minimum, user-facing certificates with certificates that are signed by a third-party or enterprise Certificate Authority. Certificates for machine-to-machine communication may remain as VMCA signed certificates. This is also known as VMCA in hybrid mode – a mode we see many customers adopting since the release of vSphere 6.0.

The design now uses the SHA-2 or higher algorithm when signing certificates as the SHA-1 algorithm is considered less secure and has been deprecated.

A quick reminder that the VMware Validated Design for SDDC comes with the SSL certificate generator CertGenVVD. You can use this tool to generate Certificate Signing Requests (CSRs), OpenSSL CA signed certificates, and Microsoft CA signed certificates for all VMware products that are involved in the design. Using the CertGenVVD tool saves you tons of time when creating signed certificates. See VMware Knowledge Base article 2146215.

If the CertGenVVD tool is not an option for your environment, the design also includes a validated procedure to create and replace your certificates, all while using the new SHA-2 based certificates.

NSX Transport Zones – In prior versions of the design we created a universal transport zone that encompassed all shared edge/compute and compute pods from all regions for workloads that require mobility between regions. While this worked great, we discovered that the vRealize Automation cloud management platform was not able to deploy on-demand, universal network objects against the secondary NSX Manager in “Region B”.

To alleviate this issue in 4.0, the design has introduced a global transport zone in each region that encompasses all shared edge/compute and compute clusters for use with vRealize Automation on-demand network provisioning. The addition of this transport zone allows all regions to deploy on-demand network objects.

NSX DLR – In addition to the second transport zone that benefits the on-demand network objects across the regions, the design has added a Distributed Logical Router (or DLR) for the shared edge/compute and compute clusters in each region. This DLR provides the east/west routing for workloads that require these on-demand network objects delivered through vRealize Automation and NSX. It also reduces the hop count between nodes attached to it to 1, thereby reducing latency and improving performance.

NSX N/S ECMP ESGs – In the design, both the UDLR and DLR control VMs are deployed in high-availability mode and, of course, have their anti-affinity rules auto-generated through vSphere Availability. However, in the event of a control VM failover – for example, when a host outage forces the control VM to restart – the routing adjacency from upstream devices such as ToRs to subnets behind the UDLR is lost for a brief period while the second control VM in the pair takes over.

To address this gap in time and the resulting loss of adjacency, we’ve added a design decision to create one or more static routes on ECMP-enabled NSX edges for subnets behind the UDLR and DLR. These static routes have a higher admin cost than those that are dynamically learned, so they are used only when the dynamic routes are unavailable. Thus, network connectivity is not impacted during the control VM failover scenario.
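The fallback behavior of these static routes can be sketched as a simple route-selection rule: among routes to the same prefix, the one with the lowest admin cost wins, so the static route carries traffic only while the dynamic route is missing. The distances, prefix, and next hops below are hypothetical values chosen for illustration, not the values the design prescribes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str
    origin: str          # "dynamic" or "static"
    admin_distance: int  # lower value is preferred


def best_route(candidates: list[Route]) -> Optional[Route]:
    """Pick the preferred route for a prefix: the lowest admin distance wins."""
    return min(candidates, key=lambda route: route.admin_distance, default=None)


# Hypothetical entries for one subnet behind the UDLR.
dynamic = Route("198.51.100.0/24", "192.0.2.3", "dynamic", admin_distance=20)
static = Route("198.51.100.0/24", "192.0.2.3", "static", admin_distance=240)

# Normal operation: both routes exist, and the dynamic route is used.
print(best_route([dynamic, static]).origin)

# Control VM failover: the dynamic route is withdrawn, and the static route takes over.
print(best_route([static]).origin)
```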

Improved Free Space Recommendations – The design provides guidance for datastore usage, both on vSAN and non-vSAN backed storage.

vSAN is the primary storage for the management cluster. By design, when a vSAN datastore reaches 80% usage, vSAN will initiate a re-balance task. This can be a resource-intensive operation across the cluster. As such, we recommend always maintaining a buffer of at least 30% free space to limit this operation. If growth is required, the vSAN datastore can be easily extended with the addition of more disks for the vSAN capacity tier.

If a non-vSAN datastore is used for the shared edge/compute cluster, the design recommends ensuring that at least 20% of free space is always available. If the datastore runs out of free space, services that include the NSX Edge Services Gateways for north/south routing and NSX controllers may fail. This can be proactively managed using the monitoring and capacity management components built into the design.
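The two free-space thresholds above (30% for vSAN, 20% for non-vSAN datastores) lend themselves to a simple check. This is a minimal sketch with hypothetical function names and capacities; in the design itself, this monitoring is handled by the built-in monitoring and capacity management components.

```python
VSAN_MIN_FREE_PCT = 30.0      # buffer recommended for vSAN datastores
NON_VSAN_MIN_FREE_PCT = 20.0  # buffer recommended for non-vSAN datastores


def free_space_pct(capacity_gb: float, used_gb: float) -> float:
    """Percentage of the datastore that is still free."""
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    return 100.0 * (capacity_gb - used_gb) / capacity_gb


def meets_free_space_guidance(capacity_gb: float, used_gb: float, is_vsan: bool) -> bool:
    """True when the datastore keeps at least the recommended free-space buffer."""
    threshold = VSAN_MIN_FREE_PCT if is_vsan else NON_VSAN_MIN_FREE_PCT
    return free_space_pct(capacity_gb, used_gb) >= threshold


# Hypothetical 1,000 GB datastores at 75% usage:
print(meets_free_space_guidance(1000, 750, is_vsan=True))   # 25% free, below the 30% buffer
print(meets_free_space_guidance(1000, 750, is_vsan=False))  # 25% free, above the 20% buffer
```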

VM Swap as Sparse Object on vSAN – The design provides guidance for vSAN-backed clusters, such as the management cluster, to configure the VM swap as sparse objects. Enabling this setting on each host in the cluster creates virtual swap files as sparse objects on the distributed vSAN datastore. The sparse virtual swap files only consume capacity on vSAN as they are accessed, which significantly reduces the space consumed on vSAN if the VMs do not experience memory overcommitment.

To enable this capability, set the advanced setting vsan.SwapThickProvisionDisabled to 1 on every ESXi host in the vSAN cluster.

vRealize Automation Distributed Deployment – If you recall from previous releases, the SDDC management applications run in conjunction with NSX to leverage the capabilities of distributed routing, distributed firewalling, and load-balancing. In 4.0, we’ve made some design modifications to align with the latest practices, which now include changing the load-balancing algorithm from IP-Hash to Round-Robin for all pools, and updating the application profiles to maintain session persistence for the vRealize Automation Appliance and vRealize Automation IaaS Web Servers profiles with an expiration of 1800 seconds. Session persistence is not configured on the vRealize IaaS Manager and vRealize Orchestrator appliance profiles.
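To illustrate how round-robin distribution and session persistence interact, here is a minimal sketch of a pool that rotates through its members but keeps a returning client on the same member until its persistence entry expires. It is a conceptual model only, not NSX load balancer code; the class, the member names, and the choice to refresh the expiry on every request are assumptions.

```python
from itertools import cycle
from typing import Optional

PERSISTENCE_TTL_SECONDS = 1800  # expiration stated for the persistent profiles


class RoundRobinPool:
    """Round-robin member selection with optional per-client persistence."""

    def __init__(self, members: list[str], persistence_ttl: Optional[int] = None):
        self._members = cycle(members)
        self._ttl = persistence_ttl
        self._sessions: dict[str, tuple[str, float]] = {}  # client -> (member, expiry)

    def pick(self, client_id: str, now: float) -> str:
        if self._ttl is None:
            return next(self._members)  # no persistence: plain rotation
        entry = self._sessions.get(client_id)
        if entry is not None and now < entry[1]:
            member = entry[0]               # entry still valid: stay on the same member
        else:
            member = next(self._members)    # new or expired client: next in rotation
        # Assumption: each request refreshes the expiry.
        self._sessions[client_id] = (member, now + self._ttl)
        return member


pool = RoundRobinPool(["vra-a", "vra-b", "vra-c"], PERSISTENCE_TTL_SECONDS)
print(pool.pick("client-1", now=0))     # first member in the rotation
print(pool.pick("client-2", now=1))     # second member
print(pool.pick("client-1", now=100))   # sticks to its earlier member
print(pool.pick("client-1", now=2000))  # entry expired at 1900, so rotation resumes
```

Passing the current time in as an argument, rather than reading a clock inside the class, keeps the behavior easy to reason about and test.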

vRealize Operations Sizing – The starting size for the vRealize Operations analytics cluster is now 3 medium-sized nodes (down from 4). This starting configuration supports up to 1,000 VMs. If the SDDC expands past 1,000 virtual machines, additional data nodes can be added to the analytics cluster. However, the number of nodes should not exceed the number of ESXi hosts in the management pod minus 1.

For example, if the management pod contains 6 ESXi hosts, you can deploy a maximum of 5 vRealize Operations Manager nodes in the analytics cluster. This would support up to 5,000 virtual machines in the design.

With regard to capacity, the ESXi hosts must be large enough to accommodate VMs that require 32 GB RAM without bridging NUMA node boundaries, and the management pod must have enough ESXi hosts so that vRealize Operations Manager can run without violating vSphere DRS anti-affinity rules.
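The node limit in this example is a one-line rule, sketched below with hypothetical function names. Keeping the node count below the host count leaves room for each node to run on its own host even when one host is unavailable.

```python
def max_analytics_nodes(management_hosts: int) -> int:
    """Upper bound on analytics cluster nodes: ESXi hosts in the management pod minus 1."""
    if management_hosts < 2:
        raise ValueError("the management pod needs at least 2 hosts to run any node")
    return management_hosts - 1


def can_add_node(current_nodes: int, management_hosts: int) -> bool:
    """True when one more data node still respects the hosts-minus-1 limit."""
    return current_nodes + 1 <= max_analytics_nodes(management_hosts)


print(max_analytics_nodes(4))  # initial 4-host management cluster
print(max_analytics_nodes(6))  # the 6-host example from the text
print(can_add_node(current_nodes=3, management_hosts=4))
```

With the initial 4-host management cluster, the limit works out to the 3-node starting size, so growing the analytics cluster first requires adding hosts to the management pod.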

vRealize Log Insight – The design has added several updates to vRealize Log Insight so I’ll mention a few here.

The design configures vCenter Server Appliances and Platform Services Controller Appliances as syslog sources to send log data directly to vRealize Log Insight.
The design configures vRealize Log Insight to ingest events, tasks, and alarms from the Management and Compute vCenter Server instances. This ensures that all tasks, events, and alarms generated across all vCenter Server instances in a specific region of the SDDC can be captured and analyzed for the administrator.
The design explicitly configures vRealize Log Insight not to automatically update all deployed agents. The design is a tested BOM and we manually install updated versions of the Log Insight Agents for each of the specified components within the SDDC for precise maintenance.
The design configures log forwarding to use SSL to ensure that log forwarding operations from one region to the other are secure.
Lastly, the design configures a disk cache of 2 GB for event forwarding. This cache ensures that log forwarding between regions has a buffer for approximately 2 hours if a cross-region connectivity outage occurs. See the documentation for the sizing example.
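The relationship between the cache size and the outage window it covers is simple arithmetic, sketched below. The 1 GB per hour forwarding rate is only what the article's "2 GB for approximately 2 hours" figures imply; it is an assumption here, not a number taken from the documentation's sizing example.

```python
def buffer_hours(cache_gb: float, forward_rate_gb_per_hour: float) -> float:
    """Hours of forwarded events the disk cache can hold at a steady forwarding rate."""
    if forward_rate_gb_per_hour <= 0:
        raise ValueError("forward_rate_gb_per_hour must be positive")
    return cache_gb / forward_rate_gb_per_hour


CACHE_GB = 2.0
IMPLIED_RATE_GB_PER_HOUR = 1.0  # assumption: derived from "2 GB covers about 2 hours"

print(buffer_hours(CACHE_GB, IMPLIED_RATE_GB_PER_HOUR))  # the stated window
print(buffer_hours(CACHE_GB, 4.0))  # a busier region exhausts the same cache sooner
```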

Adoption of Service Accounts for Secure Application-to-Application Communication – The design now drives the adoption of service accounts for secure application-to-application communication. Each service account has the least amount of privilege on the SDDC components with which it must communicate. This is beneficial because a) in the event of a compromised account, the accessibility in the destination application remains restricted and b) it provides accountability in tracking request-response interactions between the components of the SDDC.

A list of all service accounts, their source and destination components, and required roles is provided in the Planning and Preparation section of the documentation under External Services and Active Directory Users and Groups. In addition, we’ve added a slick visual to help you understand the service account interactions between the SDDC components.

Backup and Recovery – The design has modified guidance to specify that any vSphere APIs for Data Protection (VADP) compatible solution may be used for backups, for example, VDP, Avamar, Commvault, NetBackup, Veeam, and many more. This allows you to adapt and apply the design decisions to your own selected backup solutions.

Note that we use VDP as an example method in the documentation – you may use any VADP-based solution that is compatible with vSphere 6.5.

And there you have it, a rundown of the major design updates for what’s new in the VMware Validated Design for SDDC 4.0.

For more information, and to get started:

Read the Documentation in PDF or HTML.
Read the Release Notes.
Download the Design Decision Checklist.
Learn What's New by Watching the What's New in 4.0 Video.