Configuring NSX Manager for Trend Micro Deep Security 9.6 SP1

This week was my first attempt at preparing a vCenter environment for agentless AV using Trend Micro Deep Security 9.6 SP1 and NSX Manager 6.2.4.  So far, it’s been a great learning experience (to stay positive about it), so I wanted to share.

Initially (about 4 months ago), vShield was deployed, and from day 1 we had problems.  Since I’d never worked with Trend Micro Deep Security, I relied on the team that managed it to tell me what their requirements were.  Sometimes that’s the quickest way to get things done, but what I realized is that even if something works in one environment, it always helps to validate compatibility across the board.  If I had done that, I would probably have saved myself a lot of headaches…

Deploying EOL Software? You should have known better...

So in anticipation of a “next time”, here are my notes from the installation and configuration.  Hopefully others will find this information useful, as some of it isn’t even covered in the Deep Security Installation Guide.

Please note, this is not a how-to.  It’s more of a reference for the things we discovered through this process that may not all be documented in one place.  With that said, here’s my attempt at pulling together what I’ve learned.

The process (after trial and error since the installation guide isn’t very detailed) that seems to work best is as follows:

  1. Validate product compatibility versions (vCenter, ESXi, Trend Micro Deep Security, NSX Manager)
  2. Deploy and configure NSX Manager, then register it with vCenter and the Lookup Service (a registration check is sketched just after this list).
  3. Deploy Guest Introspection Services to the cluster(s) requiring protection.
  4. Register the vCenter and NSX Manager from the Trend Micro Deep Security Manager.
  5. Deploy the Trend Micro Deep Security Service from vSphere Networking and Security in the Web Client.
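
Since DNS and registration problems are what bit us later, it helps to be able to confirm exactly what NSX Manager thinks it is registered to after step 2.  Below is a minimal Python sketch against the NSX 6.2 REST API (the /api/2.0/services/vcconfig and /api/2.0/services/ssoconfig endpoints; double-check them against the API guide for your version).  The hostname and credentials are placeholders, and verify=False is for lab self-signed certificates only.

    # Sketch: confirm NSX Manager's vCenter and Lookup Service registration via its REST API.
    # Assumptions: NSX 6.2.x endpoints, a lab with self-signed certs, placeholder credentials.
    import requests
    import urllib3
    from requests.auth import HTTPBasicAuth

    urllib3.disable_warnings()                  # lab only; use trusted certs in production
    NSX_MANAGER = "nsxmanager.example.com"      # placeholder FQDN; use DNS, not an IP
    AUTH = HTTPBasicAuth("admin", "changeme")   # placeholder credentials

    for endpoint in ("/api/2.0/services/vcconfig", "/api/2.0/services/ssoconfig"):
        resp = requests.get(f"https://{NSX_MANAGER}{endpoint}", auth=AUTH, verify=False, timeout=30)
        print(f"{endpoint} -> HTTP {resp.status_code}")
        print(resp.text)    # raw XML; look for the vCenter FQDN and Lookup Service URL

If either call comes back empty, or shows an IP address where you expect an FQDN, fix that before moving on to Guest Introspection.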

We first set out to deploy this in our datacenter months ago.  We initially started with vShield Manager, which is what was relayed to us by the team that manages Trend.  We ran into issues getting it deployed properly, and “things” were missing from the vSphere Web Client that the documentation said should be there.  We had support tickets open with both VMware and Trend Micro for at least a few months.  At one point, due to the errors we were getting, Trend and VMware both escalated the issue to their engineering/development teams.  At the end of the day, we (the customers) eventually figured out what was causing the problem… DNS lookups.

The Trend Micro Deep Security installation guide does not call this out as a hard requirement.  Although the product will let you enter and save IP addresses instead of FQDNs, it just doesn’t work that way, so use DNS whenever possible!
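
Since DNS lookups turned out to be our root cause, it’s worth scripting a quick sanity check before registering anything.  This is just a rough sketch using Python’s socket module; the FQDNs below are placeholders for your Deep Security Manager, vCenter, and NSX Manager.

    # Quick forward (A) and reverse (PTR) lookup check for the components involved.
    # The hostnames below are placeholders; substitute your own FQDNs.
    import socket

    HOSTS = [
        "dsm.example.com",         # Deep Security Manager
        "vcenter.example.com",     # vCenter Server
        "nsxmanager.example.com",  # NSX Manager
    ]

    for fqdn in HOSTS:
        try:
            ip = socket.gethostbyname(fqdn)        # A record
            ptr = socket.gethostbyaddr(ip)[0]      # PTR record
            match = "OK" if ptr.lower().rstrip(".") == fqdn.lower() else f"PTR mismatch ({ptr})"
            print(f"{fqdn} -> {ip} -> {match}")
        except socket.error as err:
            print(f"{fqdn} -> lookup failed: {err}")

If any of these fail or come back mismatched, fix DNS first; it will save you the support tickets.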

vShield

After this experience, I wouldn’t look at vShield again unless I were working in an older environment.  In fact, I may just respond with:

If it’s EOL, is no longer supported, AND incompatible; I won’t even try. “Road’s closed pizza boy! Find another way home!”

You don’t gain anything from deploying EOL software.  Most importantly, you don’t get any future security updates when vulnerabilities are discovered, and you won’t get any help if you call support about it.

In case you’re reading this and did the same thing I did, here are some things we noticed during this vShield experience:

  • Endpoint would not stay installed on some of the ESXi hosts, while it did on others.
  • There is additional work for this configuration if you’re using Auto Deploy and stateless hosts. (see VMware KB: 2036701)
  • When deploying the Trend Micro DSVAs to hosts where Endpoint did stay installed, the installation failed as soon as the appliance deployment was attempted.
    • This is where we discovered that using an FQDN instead of an IP address is preferred.
  • After successfully deploying the DSVAs, we still had problems with the virtual appliances and Endpoint staying registered, so it never actually worked.

Since I hadn’t checked this up-front, in the back of my mind I started questioning compatibility.  Sure enough, vShield is EOL and is not compatible with our vCenter and host versions.

VMware NSX Manager 6.2.4

With a little more research, I found that NSX with Guest Introspection has replaced vShield for endpoint/GI services, and as long as that’s all you’re using NSX for, the license is free.  With NSX 6.1 and later, there is also no need to “prepare stateless hosts” (see VMware KB: 2120649).

Before simply deploying and configuring, I checked all the compatibility matrices involved and validated that our versions are supported and compatible.  Be sure to check the resource links below, as there is some important information, especially around compatibility:

  • vCenter: v 6.0 build 5318203
  • ESXi: v 6.0 build 5224934
  • NSX: v 6.2.4 build 4292526
  • Trend Micro Deep Security: v 9.6 SP1

Note: NSX 6.3.2 can be deployed, but it requires at least TMDS 9.6 SP1 Update 3, which is why I went with 6.2.4; I’ll upgrade once TMDS is upgraded to support NSX 6.3.2.
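
If you want to confirm what you’re actually running before lining it up against the matrices, here’s a rough pyVmomi sketch that pulls the vCenter and host builds (assuming pyVmomi is installed and the account can read the inventory; the hostname and credentials are placeholders).

    # Pull vCenter and ESXi versions/builds to compare against the compatibility matrices.
    # Assumptions: pyVmomi installed, placeholder vCenter FQDN/credentials, lab SSL settings.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()   # lab only; use proper certificates in production
    si = SmartConnect(host="vcenter.example.com",
                      user="administrator@vsphere.local",
                      pwd="changeme",
                      sslContext=ctx)
    try:
        about = si.content.about
        print(f"vCenter: {about.version} build {about.build}")

        view = si.content.viewManager.CreateContainerView(
            si.content.rootFolder, [vim.HostSystem], True)
        for host in view.view:
            if str(host.runtime.connectionState) != "connected":
                print(f"{host.name}: not connected, skipping")
                continue
            product = host.config.product    # vim.AboutInfo for the host
            print(f"{host.name}: {product.version} build {product.build}")
        view.Destroy()
    finally:
        Disconnect(si)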

Resources:

 

What I’ve Learned

Here are some tips to ensure a smooth deployment for NSX Manager 6.2.4 and Trend Micro Deep Security 9.6 SP1.

  • Ensure your NTP servers are correct and reachable.
  • Use IP Pools if at all possible when deploying guest introspection services from NSX Manager.  (makes deployment easier and quicker)
  • Set up a datastore that will house ONLY NSX related appliances.  (makes deployment easier and quicker)
  • When you first set up NSX Manager, be sure to add your user account or domain group with admin access to it for management; otherwise, you won’t see it in the vSphere Web Client unless you’re logged in with the administrator@vsphere.local account.
  • Validate that there are DNS A and PTR records for the Trend Micro Deep Security Manager, vCenter, and NSX Manager; otherwise, anything you do in Deep Security to register your environment will fail.
  • Pay close attention to the known issues and workarounds in the “Compatibility Between NSX 6.2.3 and 6.2.4 with Deep Security” reference above, because you will see the error/failure they refer to.
  • If deploying in separate datacenters or across firewalls, be sure to allow all the necessary ports.  (A quick TCP reachability check like the sketch after this list helps confirm this.)
  • Unlike vShield Manager deploying Endpoint, NSX Manager deploys Guest Introspection at the cluster level.  When using NSX, you can’t deploy GI to only one host; you can only select a cluster to deploy to.
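
For the ports item above, blocked firewall rules are one of the easier things to rule out up front.  The sketch below only tests TCP reachability, and the host/port pairs are examples (443 between the Deep Security Manager, NSX Manager, and vCenter is the common one); build your real list from the Deep Security and NSX port documentation for your versions.

    # Minimal TCP reachability test between the management components.
    # The (host, port) pairs are examples only; use the official Deep Security / NSX
    # port requirements for your versions.  UDP services (NTP, for example) need a
    # different kind of check.
    import socket

    CHECKS = [
        ("vcenter.example.com", 443),     # example: DSM / NSX Manager -> vCenter
        ("nsxmanager.example.com", 443),  # example: DSM -> NSX Manager
        ("dsm.example.com", 4119),        # example: browser -> Deep Security Manager console
    ]

    for host, port in CHECKS:
        try:
            with socket.create_connection((host, port), timeout=5):
                print(f"{host}:{port} reachable")
        except OSError as err:
            print(f"{host}:{port} NOT reachable ({err})")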

If you’ve found this useful in your deployment, please comment and share!  I’d like to hear from others who have experienced the same!

 


Changing a VM’s Recovery VRA When a Host Crashes or Prior to Recovery Site Maintenance

Yesterday, we had one host in our recovery site PSOD, and that caused all kinds of errors in Zerto, primarily related to VPGs.  In our case, this particular host had both inbound and outbound VPGs attached to its VRA, and we were unable to edit any of them to recover from the host failure (the edit button in the VPG view was grayed out, along with the “Edit VPG” link when clicking into the VPG).  Previously, when this would happen, we would just delete the VPG(s) and recreate them, preserving the disk files as pre-seeded data.

When you have a few of these to re-do, it’s not a big deal; however, when you have 10 or more, it quickly becomes a problem.

One thing I discovered that I didn’t know was in the product: if you click into the VRA associated with the failed host and go to the MORE link, there’s an option in there to “Change VM Recovery VRA.”  This option allows you to tell Zerto that anything related to this VRA should now be pointed at another VRA.  Once I did that, I was able to edit the VPGs.  I needed to edit the VPGs that were outbound, because they were actually reverse-protected workloads that were missing some configuration details (NIC settings and/or Journal datastore).

Additionally, if you are planning host maintenance in the recovery site (replacing hosts, patching, etc.), these steps should be taken prior to taking the host and its VRA down to ensure uninterrupted protection.

 

Here’s how:

  1. Log on to the Zerto UI.
  2. Once logged on, click on the Setup tab.
  3. In the “VRA Name” column, locate the VRA associated with the failed host, and then click the link (name of VRA) to open the VRA in a new tab in the UI.
  4. Click on the tab at the top that contains VRA: Z-VRA-[hostName].
  5. Once you’re looking at the VRA page, click on the MORE link.
  6. From the MORE menu, click Change VM Recovery VRA.
  7. In the Change VM Recovery VRA dialog, check the box beside the VPG/VM, then select a replacement host. Once all VPGs have been updated, click Save.

Once you’ve saved your settings, validate that the VPG can be edited, and/or is once again replicating.
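
If you have a lot of VPGs and VRAs to sort through, the Zerto REST API can also help you see what is tied to the failed host before you start clicking.  This is only a rough sketch based on my understanding of the ZVM API (session endpoint /v1/session/add on port 9669, then /v1/vras and /v1/vpgs with the x-zerto-session header); verify the endpoints and field names against the API documentation for your Zerto version.  The ZVM address and credentials are placeholders.

    # Rough sketch: list VRAs and VPGs from the Zerto Virtual Manager REST API so you
    # can see what is tied to a failed host.  Endpoints/port are my assumption of the
    # ZVM API; verify for your version.  Hostname and credentials are placeholders.
    import requests
    import urllib3

    urllib3.disable_warnings()   # self-signed ZVM certificate in the lab
    ZVM = "zvm.example.com"
    BASE = f"https://{ZVM}:9669/v1"

    # Authenticate; the session token comes back in the x-zerto-session response header.
    resp = requests.post(f"{BASE}/session/add", auth=("domain\\zertoadmin", "changeme"),
                         verify=False, timeout=30)
    resp.raise_for_status()
    headers = {"x-zerto-session": resp.headers["x-zerto-session"]}

    for collection in ("vras", "vpgs"):
        items = requests.get(f"{BASE}/{collection}", headers=headers,
                             verify=False, timeout=30).json()
        print(f"--- {collection} ({len(items)}) ---")
        for item in items:
            # Print everything rather than guessing field names; search the output for
            # the failed host's name to find its VRA and the VPGs pointed at it.
            print(item)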

 


ESXi 6.0 U2 Host Isolation Following Storage Rescan

Following an upgrade to ESXi 6.0 U2, this particular issue has popped up a few times, and while we still have a case open with VMware support in an attempt to understand root cause, we have found a successful workaround that doesn’t require any downtime for the running workloads or the host in question.  This issue doesn’t discriminate between iSCSI and Fibre Channel storage, as we’ve seen it in both cases (SolidFire for iSCSI, IBM SVC for FC).  One common theme is that it happens in clusters with 10 or more hosts and many datastores.  It may also be helpful to know that we have two datastores that are shared between multiple clusters; these datastores are for syslogs and ISOs/templates.

 

Note: In order to perform the steps in this how-to, you will need to already have SSH running and available on the host, or access to the DCUI.

Observations

  • Following a host or cluster storage rescan, one or more ESXi hosts stop responding in vCenter while still running VMs (host isolation)
  • Attempts to reconnect the host via vCenter don’t work
  • Direct client connection (thick client) to the host doesn’t work
  • Attempts to run services.sh from the CLI cause the script to hang after “running sfcbd-watchdog stop”.  The last thing on the screen is “Exclusive access granted.”
  • The /var/log/vmkernel.log displays the following at this point: “Alert: hostd detected to be non-responsive”

Troubleshooting

The following troubleshooting steps were obtained from VMware KB article 1003409:

  1. Verify the host is powered on.
  2. Attempt to reconnect the host in vCenter.
  3. Verify that the ESXi host is able to respond back to vCenter at the correct IP address, and vice versa.
  4. Verify that network connectivity exists from vCenter to the ESXi host’s management IP or FQDN.
  5. Verify that port 903 TCP/UDP is open between vCenter and the ESXi host.
  6. Try to restart the ESXi management agents via the DCUI or SSH to see if that resolves the issue.
  7. Verify if the hostd process has stopped responding on the affected host.
  8. Verify if the vpxa agent has stopped responding on the affected host.
  9. Verify if the host has experienced a PSOD (Purple Screen of Death).
  10. Verify if there is an underlying storage connectivity (or other storage-related) issue.

Following these troubleshooting steps left me at step 7, where I was able to determine that hostd was not responding on the host.  The vmkernel.log further supports this observation.

Resolution/Workaround Steps

These are the steps I’ve taken to remedy the problem without having to take the VMs down or reboot the host:

  1. Since the hostd service is not responding, the first thing to do is run /etc/init.d/hostd restart from a second SSH session window (leaving the first one with the hung services.sh restart script process).
  2. While the hostd restart command runs, the hung session will update and print a new message to the screen.
  3. When you see that message, press Enter to be returned to the shell prompt.
  4. Now run /etc/init.d/vpxa restart, which is the vCenter Agent on the host.
  5. After that completes, re-run services.sh restart and this time it should run all the way through successfully.
  6. Once services are all restarted, return to the vSphere Web Client and refresh the screen.  You should now see the host is back to being managed, and is no longer disconnected.
  7. At this point, you can either leave the host running as-is, or put it into maintenance mode (vMotion all VMs off).  Export the log bundle if you’d like VMware support to help analyze root cause.
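
Once you know the drill, the restart portion (steps 1, 4, and 5) can be scripted for the next time it happens.  Here’s a rough sketch using paramiko (my choice; any SSH client or library works), which assumes SSH is already enabled on the host and leaves out the interactive part with the hung services.sh session.  The hostname and credentials are placeholders.

    # Sketch: run the agent restart sequence over SSH from a workstation.
    # Assumes SSH is enabled on the host; paramiko is my choice, not a requirement.
    # Hostname and credentials are placeholders.
    import paramiko

    HOST = "esxi01.example.com"
    COMMANDS = [
        "/etc/init.d/hostd restart",
        "/etc/init.d/vpxa restart",
        "services.sh restart",      # this one can take several minutes
    ]

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())   # lab convenience only
    client.connect(HOST, username="root", password="changeme")
    try:
        for cmd in COMMANDS:
            print(f"== {cmd}")
            stdin, stdout, stderr = client.exec_command(cmd, timeout=900)
            print(stdout.read().decode(), stderr.read().decode())
    finally:
        client.close()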

 

I hope you find this useful, and if you do, please comment and share!
