Issue
The issue first surfaced with the agents in the RGS queues. Agents reported that calls were taking a long time to answer, and as time went by this degraded to the point where they were unable to answer calls at all.
Later on I also noticed:
- Degradation of meeting audio and video
- If nothing is done, the Edge Server will eventually blue screen
The environment was Lync 2013 with an Enterprise Front End Pool of 3 Front Ends and a single Edge Server. The logs were not reporting any unusual behavior.
Troubleshooting
Restarting the Response Group Service made no difference. A peek at the Edge Server revealed that the RTCMEDIARELAY service was reporting a high number of Active Sessions.
Monitoring this service, I noticed that the session count continued to rise over time. Even more puzzling, the count kept rising even though the number of users was comparatively very low.
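To tell a genuine leak from normal ebb and flow, it helps to check whether periodic samples of the session count ever come back down. The following is only a minimal sketch of that check, not the tooling I used: the function name, the thresholds and the sample values are all hypothetical, and collecting the samples from the Edge Server is left out.

```python
from typing import Sequence


def is_steadily_rising(counts: Sequence[int], min_samples: int = 4, tolerance: int = 0) -> bool:
    """Return True when the samples look like a leak rather than normal load.

    A series qualifies when it has at least `min_samples` entries, no sample
    drops more than `tolerance` below the previous one, and the last sample
    is higher than the first.
    """
    if len(counts) < min_samples:
        return False  # too few samples to call it a trend
    never_drops = all(later >= earlier - tolerance
                      for earlier, later in zip(counts, counts[1:]))
    return never_drops and counts[-1] > counts[0]


# Hypothetical readings, for illustration only.
leaking = [120, 180, 260, 410, 650]   # only ever goes up
healthy = [120, 90, 130, 100]         # rises and falls with load

print(is_steadily_rising(leaking))
print(is_steadily_rising(healthy))
```

A healthy relay should release sessions as calls end, so a series that never drops while the user count stays low is the signal worth chasing.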
Using netstat, I took a peek at the active sessions, but everything looked normal: there was no sign of the sessions reported by the RTCMEDIARELAY service, or of sessions not being closed.
Perhaps a quick restart of the services would free up the sessions, I thought, but I couldn’t stop the services either (Stop-CsWindowsService), probably because the services were trying to shut down gracefully. Right, Edge Server reboot it is.
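The netstat comparison can be made repeatable by tallying TCP connections per state and setting that next to the session count the relay reports. This is a rough sketch, assuming English-language `netstat -ano` output saved as text; the function name is hypothetical and the sample rows and addresses are made up.

```python
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Count TCP connections per state in `netstat -ano` style text."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP rows read: Proto, Local Address, Foreign Address, State, PID.
        # UDP rows carry no state column and are skipped.
        if len(fields) >= 4 and fields[0] == "TCP":
            states[fields[3]] += 1
    return states


# Made-up sample output, for illustration only.
SAMPLE = """
Active Connections

  Proto  Local Address          Foreign Address        State           PID
  TCP    10.0.0.5:443           10.0.0.9:51234         ESTABLISHED     1234
  TCP    10.0.0.5:443           10.0.0.10:51300        TIME_WAIT       0
  TCP    10.0.0.5:5062          10.0.0.11:50000        ESTABLISHED     1234
  UDP    0.0.0.0:3478           *:*                                    1234
"""

print(count_tcp_states(SAMPLE))
```

If the totals stay flat while the relay's Active Sessions figure keeps climbing, the sessions are not being held open at the connection level, which is what I saw.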
The Edge server returned after a reboot and all appeared well, until a few days later when the RTCMEDIARELAY session count started to grow again.
Since I was tasked with building a new pool as a side-by-side migration to Skype for Business, I decided that it was not worth spending any more time chasing this issue, as the Lync 2013 environment would be decommissioned soon.
To my surprise, the new Edge Pool with 2 x Edge Servers on Skype for Business displayed the very same behavior, with only about 10 test users on the SFB pool.
Since this is now recurring on a brand new Topology, new Server OS and different product version (too much of a coincidence for my liking), I asked “So what’s the common factor?”
The virtual layer.
A quick call to the infrastructure engineer confirmed that VMware was the underlying technology and “hasn’t changed since Lync 2013..”
I had since noticed that if the Edge Server with the high session count was left unattended, the machine would eventually blue screen. It was the dump data that revealed a “known” issue in VMware.
The related article, titled “Windows virtual machines using the vShield Endpoint TDI Manager or NSX Network Introspection Driver (vnetflt.sys) driver fails with a blue diagnostic screen (2121307)”, describes the issue.
Solution
By simply uninstalling the “NSX File Introspection Driver” in the VMware Tools setup on the Edge Servers, I was able to remove the culprit.
Of course, one could also disable the vShield Endpoint TDI Manager by means of the registry modification mentioned in the article.
I rebooted the server and tested: all good. I then monitored it over a week to confirm.
Sorted.