
This post also provides some information. The CUDA driver returned to normal after rebooting. I also updated the driver to version 352.63.

It sounds like you would want to rework your app to avoid kernels that get close to the timeout limit. Your description of the driver "crashing" when a watchdog event is triggered does not sound right to me. It used to be the case, on both Linux and Windows, that in such a situation the current CUDA context is destroyed, but the CUDA driver itself recovers. This recovery could take up to several seconds. After recovery, other CUDA apps could be run. I ran on an RHEL-based workstation and driver recovery seemed to work quite well, although on a few occasions, after multiple consecutive timeout events, unloading and reloading the driver as described by txbob became necessary. This required stopping X and dropping into console mode, but it did not require rebooting the machine as a whole.
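One way to follow the advice about staying clear of the timeout limit is to split a long-running launch into several shorter ones. The sketch below only covers the planning step, and it rests on assumptions that are not from this thread: `plan_chunks`, `items_per_second` and `safety` are hypothetical names, the throughput figure has to be measured on the actual GPU, and the 50% safety margin is an arbitrary choice. Only the 2-second limit comes from the discussion.

```python
def plan_chunks(total_items, items_per_second, watchdog_seconds=2.0, safety=0.5):
    """Split total_items into (offset, count) ranges sized to finish
    within a fraction (safety) of the watchdog limit.

    items_per_second is an assumed, externally measured throughput.
    """
    if total_items < 0:
        raise ValueError("total_items must be non-negative")
    if items_per_second <= 0:
        raise ValueError("items_per_second must be positive")
    if not 0 < safety <= 1:
        raise ValueError("safety must be in (0, 1]")

    # Time budget for a single launch, kept below the watchdog limit.
    budget_seconds = watchdog_seconds * safety
    # Process at least one item per launch so the loop always advances.
    chunk = max(1, int(items_per_second * budget_seconds))

    return [(offset, min(chunk, total_items - offset))
            for offset in range(0, total_items, chunk)]


if __name__ == "__main__":
    # Hypothetical numbers: 10 million items at 2 million items per second.
    for offset, count in plan_chunks(10_000_000, 2_000_000):
        print(offset, count)
```

Each `(offset, count)` pair would correspond to one kernel launch over that sub-range, so that no single kernel runs anywhere near the limit.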

I tested the commands below, with the following output:

sudo nvidia-smi
sudo rmmod
sudo rmmod -f
rmmod: ERROR: Module nvidia is in use by:

I also tried the command below:

sudo rmmod -f nvidia
libkmod/libkmod-module.c:769 kmod_module_remove_module() could not remove 'nvidia': Resource temporarily unavailable
rmmod: ERROR: could not remove module nvidia: Resource temporarily unavailable
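The "in use by" error indicates that something still holds a reference to the `nvidia` module, so it cannot be unloaded yet. A rough way to see what is holding it is to read the "Used by" column of `lsmod`. The sketch below parses that output; it assumes the usual three-column `lsmod` layout, `module_users` is a hypothetical helper name, and the sample text is made up for illustration rather than taken from the machine in this thread.

```python
def module_users(lsmod_text, module="nvidia"):
    """Return (refcount, [dependent modules]) for `module` as listed in
    lsmod output, or None if the module does not appear."""
    for line in lsmod_text.splitlines():
        fields = line.split()
        # Expected layout: name, size, refcount, optional comma-separated users.
        if len(fields) >= 3 and fields[0] == module:
            refcount = int(fields[2])
            users = fields[3].split(",") if len(fields) > 3 else []
            return refcount, [u for u in users if u]
    return None


# Illustrative sample only; real output will differ per system.
SAMPLE = """Module                  Size  Used by
nvidia_uvm             77824  0
nvidia               8605696  3 nvidia_uvm
"""

if __name__ == "__main__":
    print(module_users(SAMPLE))
```

Modules listed as users would have to be removed before `nvidia` itself. A reference count larger than the number of listed modules suggests that processes (for example X or a running CUDA app) also hold the driver open and need to be stopped first.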


I was thinking that resetting the driver without rebooting would save a lot of hassle. Rebooting the server is very inconvenient.

The primary reason for me to reset the driver is that my application exceeds 2 seconds and the watchdog timer is triggered; consequently, the driver crashes.
