[TensorFlow] Kernel crash problem


TensorFlow: 2.7.0, CUDA: 11.5

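Recently a script started crashing whenever I ran it in Jupyter. The log is very long, and I could not find the cause right away. Some results on Google suggest it is a Jupyter problem. The Jupyter log from VS Code: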

Visual Studio Code (1.75.0, undefined, desktop)
Jupyter Extension Version: 2023.1.2000312134.
Python Extension Version: 2023.2.0.
Workspace folder g:\Projects
Benutzer gehört zur Experimentgruppe 'jupyterTestcf'
Benutzer gehört zur Experimentgruppe 'jupyterEnhancedDataViewer'
info 15:36:54.153: LSP Notebooks experiment is enabled
warn 15:36:56.091: Python environment D:\PROGRAMDATA\ANACONDA3\ENVS\OCR excluded as Uri is undefined
error 15:36:56.091: Failed to get env details from Python API for D:\PROGRAMDATA\ANACONDA3\ENVS\OCR without an error
warn 15:36:56.092: Python environment D:\PROGRAMDATA\ANACONDA3\ENVS\OCR excluded as Uri is undefined
error 15:36:56.092: Failed to get env details from Python API for D:\PROGRAMDATA\ANACONDA3\ENVS\OCR without an error
warn 15:36:56.155: Python environment D:\PROGRAMDATA\ANACONDA3\ENVS\OCR excluded as Uri is undefined
error 15:36:56.155: Failed to get env details from Python API for D:\PROGRAMDATA\ANACONDA3\ENVS\OCR without an error
info 15:36:57.631: Process Execution: > d:\ProgramData\Anaconda3\python.exe -m pip list
> d:\ProgramData\Anaconda3\python.exe -m pip list
info 15:37:07.096: Starting interactive window for resource 'g:\Projects\ML_Project\Mnist\train_digit_recognizer.py' with controller '.jvsc74a57bd0e42634819b8c191a5d07eaf23810ff32516dd8d3875f28ec3e488928fbd3c187.d:\ProgramData\Anaconda3\python.exe.d:\ProgramData\Anaconda3\python.exe.-m#ipykernel_launcher (Interactive)'
info 15:37:07.232: Starting Jupyter Session startUsingPythonInterpreter, .jvsc74a57bd0e42634819b8c191a5d07eaf23810ff32516dd8d3875f28ec3e488928fbd3c187.d:\ProgramData\Anaconda3\python.exe.d:\ProgramData\Anaconda3\python.exe.-m#ipykernel_launcher (Python Path: d:\ProgramData\Anaconda3, EnvType: Conda, EnvName: 'base', Version: 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)]) for 'Interactive-1.interactive' (disableUI=false)
info 15:37:07.246: Process Execution: > d:\ProgramData\Anaconda3\python.exe -c "import ipykernel; print(ipykernel.__version__); print("5dc3a68c-e34e-4080-9c3e-2a532b2ccb4d"); print(ipykernel.__file__)"
> d:\ProgramData\Anaconda3\python.exe -c "import ipykernel; print(ipykernel.__version__); print("5dc3a68c-e34e-4080-9c3e-2a532b2ccb4d"); print(ipykernel.__file__)"
info 15:37:07.320: Process Execution: > d:\ProgramData\Anaconda3\python.exe ~\.vscode\extensions\ms-toolsai.jupyter-2023.1.2000312134\pythonFiles\vscode_datascience_helpers\kernel_interrupt_daemon.py --ppid 29216
> d:\ProgramData\Anaconda3\python.exe ~\.vscode\extensions\ms-toolsai.jupyter-2023.1.2000312134\pythonFiles\vscode_datascience_helpers\kernel_interrupt_daemon.py --ppid 29216
info 15:37:07.320: Process Execution: cwd: ~\.vscode\extensions\ms-toolsai.jupyter-2023.1.2000312134\pythonFiles\vscode_datascience_helpers
cwd: ~\.vscode\extensions\ms-toolsai.jupyter-2023.1.2000312134\pythonFiles\vscode_datascience_helpers
info 15:37:07.659: Process Execution: > d:\ProgramData\Anaconda3\python.exe -m ipykernel_launcher --ip=127.0.0.1 --stdin=9008 --control=9006 --hb=9005 --Session.signature_scheme="hmac-sha256" --Session.key=b"207f713c-f091-40d8-991b-072ec472662d" --shell=9007 --transport="tcp" --iopub=9009 --f=c:\Users\boton\AppData\Roaming\jupyter\runtime\kernel-v2-292165ZlpIpBI1mHp.json
> d:\ProgramData\Anaconda3\python.exe -m ipykernel_launcher --ip=127.0.0.1 --stdin=9008 --control=9006 --hb=9005 --Session.signature_scheme="hmac-sha256" --Session.key=b"207f713c-f091-40d8-991b-072ec472662d" --shell=9007 --transport="tcp" --iopub=9009 --f=c:\Users\boton\AppData\Roaming\jupyter\runtime\kernel-v2-292165ZlpIpBI1mHp.json
info 15:37:07.659: Process Execution: cwd: g:\Projects\ML_Project\Mnist
cwd: g:\Projects\ML_Project\Mnist
info 15:37:07.685: ipykernel version & path 6.4.1, d:\ProgramData\Anaconda3\lib\site-packages\ipykernel\__init__.py for d:\ProgramData\Anaconda3\python.exe
info 15:37:10.149: Started Kernel base (Python 3.9.7) (pid: 27388)
info 15:37:10.200: Process Execution: > d:\ProgramData\Anaconda3\python.exe ~\.vscode\extensions\ms-toolsai.jupyter-2023.1.2000312134\pythonFiles\printJupyterDataDir.py
> d:\ProgramData\Anaconda3\python.exe ~\.vscode\extensions\ms-toolsai.jupyter-2023.1.2000312134\pythonFiles\printJupyterDataDir.py
warn 15:37:10.307: Got a non-existent Jupyer Data Dir file:///c%3A/Users/boton/AppData/Roaming/Python/share/jupyter
info 15:37:10.724: Generated code for 1 = <ipython-input-1-881d37c76070> with 61 lines
warn 15:37:14.786: StdErr from Kernel Process 2023-02-05 15:37:14.786565: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Ne
warn 15:37:14.786: StdErr from Kernel Process twork Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

warn 15:37:15.322: StdErr from Kernel Process 2023-02-05 15:37:15.322552: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/r
warn 15:37:15.322: StdErr from Kernel Process eplica:0/task:0/device:GPU:0 with 5706 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2060 SUPER, pci bus id: 0000:29:00.0, compute capability: 7.5

warn 15:37:18.970: StdErr from Kernel Process 2023-02-05 15:37:18.970811: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8300

info 15:41:23.375: Disposing kernel .jvsc74a57bd0e42634819b8c191a5d07eaf23810ff32516dd8d3875f28ec3e488928fbd3c187.d:\ProgramData\Anaconda3\python.exe.d:\ProgramData\Anaconda3\python.exe.-m#ipykernel_launcher for notebook Interactive-1.interactive due to selection of another kernel or closing of the notebook
info 15:41:23.375: Dispose Kernel 'Interactive-1.interactive' associated with 'g:\Projects\ML_Project\Mnist\train_digit_recognizer.py'
info 15:41:23.376: Dispose Kernel process 27388.
info 15:41:23.388: Dispose Kernel 'Interactive-1.interactive' associated with 'g:\Projects\ML_Project\Mnist\train_digit_recognizer.py'
info 15:41:24.391: Starting interactive window for resource 'g:\Projects\ML_Project\Mnist\train_digit_recognizer.py' with controller '.jvsc74a57bd0e42634819b8c191a5d07eaf23810ff32516dd8d3875f28ec3e488928fbd3c187.d:\ProgramData\Anaconda3\python.exe.d:\ProgramData\Anaconda3\python.exe.-m#ipykernel_launcher (Interactive)'
info 15:41:24.532: Starting Jupyter Session startUsingPythonInterpreter, .jvsc74a57bd0e42634819b8c191a5d07eaf23810ff32516dd8d3875f28ec3e488928fbd3c187.d:\ProgramData\Anaconda3\python.exe.d:\ProgramData\Anaconda3\python.exe.-m#ipykernel_launcher (Python Path: d:\ProgramData\Anaconda3, EnvType: Conda, EnvName: 'base', Version: 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)]) for 'Interactive-1.interactive' (disableUI=false)
info 15:41:24.539: Process Execution: > d:\ProgramData\Anaconda3\python.exe -c "import ipykernel; print(ipykernel.__version__); print("5dc3a68c-e34e-4080-9c3e-2a532b2ccb4d"); print(ipykernel.__file__)"
> d:\ProgramData\Anaconda3\python.exe -c "import ipykernel; print(ipykernel.__version__); print("5dc3a68c-e34e-4080-9c3e-2a532b2ccb4d"); print(ipykernel.__file__)"
info 15:41:24.621: Process Execution: > d:\ProgramData\Anaconda3\python.exe -m ipykernel_launcher --ip=127.0.0.1 --stdin=9008 --control=9006 --hb=9005 --Session.signature_scheme="hmac-sha256" --Session.key=b"7cbb1989-43ea-4959-86c2-6010f0565444" --shell=9007 --transport="tcp" --iopub=9009 --f=c:\Users\boton\AppData\Roaming\jupyter\runtime\kernel-v2-29216xNHC5cbquNCQ.json
> d:\ProgramData\Anaconda3\python.exe -m ipykernel_launcher --ip=127.0.0.1 --stdin=9008 --control=9006 --hb=9005 --Session.signature_scheme="hmac-sha256" --Session.key=b"7cbb1989-43ea-4959-86c2-6010f0565444" --shell=9007 --transport="tcp" --iopub=9009 --f=c:\Users\boton\AppData\Roaming\jupyter\runtime\kernel-v2-29216xNHC5cbquNCQ.json
info 15:41:24.621: Process Execution: cwd: g:\Projects\ML_Project\Mnist
cwd: g:\Projects\ML_Project\Mnist
info 15:41:24.887: ipykernel version & path 6.4.1, d:\ProgramData\Anaconda3\lib\site-packages\ipykernel\__init__.py for d:\ProgramData\Anaconda3\python.exe
info 15:41:26.692: Started Kernel base (Python 3.9.7) (pid: 26052)
info 15:41:27.230: Generated code for 1 = <ipython-input-1-881d37c76070> with 61 lines
warn 15:41:31.134: StdErr from Kernel Process 2023-02-05 15:41:31.135037: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized wi
warn 15:41:31.134: StdErr from Kernel Process th oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

warn 15:41:31.559: StdErr from Kernel Process 2023-02-05 15:41:31.559301: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 6008 MB memory:  -> device: 0, name: N
warn 15:41:31.559: StdErr from Kernel Process VIDIA GeForce RTX 2060 SUPER, pci bus id: 0000:29:00.0, compute capability: 7.5

warn 15:41:33.246: StdErr from Kernel Process 2023-02-05 15:41:33.247084: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8300

error 15:41:33.555: Disposing session as kernel process died ExitCode: 3221226505, Reason: d:\ProgramData\Anaconda3\lib\site-packages\traitlets\traitlets.py:2202: FutureWarning: Supporting extra quotes around strings is deprecated in traitlets 5.0. You can use 'hmac-sha256' instead of '"hmac-sha256"' if you require traitlets >=5.
  warn(
d:\ProgramData\Anaconda3\lib\site-packages\traitlets\traitlets.py:2157: FutureWarning: Supporting extra quotes around Bytes is deprecated in traitlets 5.0. Use '7cbb1989-43ea-4959-86c2-6010f0565444' instead of 'b"7cbb1989-43ea-4959-86c2-6010f0565444"'.
  warn(
2023-02-05 15:41:31.135037: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-05 15:41:31.559301: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 6008 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2060 SUPER, pci bus id: 0000:29:00.0, compute capability: 7.5
2023-02-05 15:41:33.247084: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8300

info 15:41:33.556: Dispose Kernel process 26052.
error 15:41:33.556: Raw kernel process exited code: 3221226505
error 15:41:33.559: Error in waiting for cell to complete [Error: Canceled future for execute_request message before replies were done
    at t.KernelShellFutureHandler.dispose (c:\Users\boton\.vscode\extensions\ms-toolsai.jupyter-2023.1.2000312134\out\extension.node.js:2:33213)
    at c:\Users\boton\.vscode\extensions\ms-toolsai.jupyter-2023.1.2000312134\out\extension.node.js:2:52265
    at Map.forEach (<anonymous>)
    at y._clearKernelState (c:\Users\boton\.vscode\extensions\ms-toolsai.jupyter-2023.1.2000312134\out\extension.node.js:2:52250)
    at y.dispose (c:\Users\boton\.vscode\extensions\ms-toolsai.jupyter-2023.1.2000312134\out\extension.node.js:2:45732)
    at c:\Users\boton\.vscode\extensions\ms-toolsai.jupyter-2023.1.2000312134\out\extension.node.js:17:139244
    at Z (c:\Users\boton\.vscode\extensions\ms-toolsai.jupyter-2023.1.2000312134\out\extension.node.js:2:1608939)
    at Kp.dispose (c:\Users\boton\.vscode\extensions\ms-toolsai.jupyter-2023.1.2000312134\out\extension.node.js:17:139221)
    at qp.dispose (c:\Users\boton\.vscode\extensions\ms-toolsai.jupyter-2023.1.2000312134\out\extension.node.js:17:146518)
    at process.processTicksAndRejections (node:internal/process/task_queues:96:5)]
warn 15:41:33.561: Cell completed with errors {
  message: 'Canceled future for execute_request message before replies were done'
}
info 15:41:33.563: Cancel all remaining cells true || Error || undefined
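
The only concrete clue in this log is the exit code 3221226505, which is easier to search for in hex. As a minimal sketch (the helper name `to_ntstatus` is made up for illustration), the conversion gives 0xC0000409, which Windows lists as STATUS_STACK_BUFFER_OVERRUN. That status is also used for fast-fail aborts in native code, so it most likely means the kernel process was killed by a native library rather than by a Python exception.

```python
# Exit code exactly as reported in the Jupyter log.
exit_code = 3221226505


def to_ntstatus(code: int) -> str:
    """Format a Windows process exit code as an 8-digit hex NTSTATUS value."""
    return f"0x{code & 0xFFFFFFFF:08X}"


print(to_ntstatus(exit_code))  # 0xC0000409
```

The hex value says nothing about which library aborted, which is why running the script outside Jupyter is the useful next step.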

So I switched to running the script directly in cmd, which printed:

2023-02-05 15:43:05.008763: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8300
Could not load library cudnn_cnn_infer64_8.dll. Error code 126
Please make sure cudnn_cnn_infer64_8.dll is in your library path!

Now the situation is much clearer. I found two similar solutions online.

Method 1:

For those still having this issue, please make sure you also have completed this step:

Download, unzip and add zlibwapi.dll to your system path.
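
Error code 126 is the Windows ERROR_MOD_NOT_FOUND, which is also reported when a dependency of the named DLL is missing, so it is worth checking both files. The sketch below is only a rough check under that assumption: the helper `find_on_path` is hypothetical, and PATH is only one part of the Windows DLL search order, so an empty result is a hint rather than proof.

```python
import os
from typing import List, Optional


def find_on_path(filename: str, path: Optional[str] = None) -> List[str]:
    """Return every directory in `path` (default: the PATH variable) that contains `filename`."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        # Skip empty entries produced by stray separators in PATH.
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits


# The DLL named in the error message, and the one the quoted advice says to add.
for dll in ("cudnn_cnn_infer64_8.dll", "zlibwapi.dll"):
    print(dll, "->", find_on_path(dll) or "not found on PATH")
```

Run it in a newly opened cmd window after editing the system PATH, since a console that is already open keeps the old value.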

Method 2:

This is the one that solved it for me.

I found a copy of the 64 bit zlibwapi.dll hiding under a different name in: C:\Program Files\NVIDIA Corporation\Nsight Systems 2022.4.2\host-windows-x64\zlib.dll

I copied and renamed it to: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin\zlibwapi.dll

since that folder is already in my PATH variable; and it worked. Turns out the CUDA Toolkit already has the file you need elsewhere. Seems like they could save a lot of trouble if they just made a change to the CUDA Toolkit installer.
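
The copy-and-rename step can also be scripted. This is a minimal sketch, not an official procedure: the helper `copy_as` is hypothetical, and the two paths are taken verbatim from the quoted post, so the Nsight Systems and CUDA version folders will probably differ on another machine. Writing into `C:\Program Files` normally requires an elevated prompt.

```python
import shutil
from pathlib import Path


def copy_as(src: str, dst_dir: str, new_name: str) -> Path:
    """Copy `src` into `dst_dir` under `new_name`, refusing to overwrite an existing file."""
    source = Path(src)
    target = Path(dst_dir) / new_name
    if not source.is_file():
        raise FileNotFoundError(f"source not found: {source}")
    if target.exists():
        raise FileExistsError(f"target already exists: {target}")
    shutil.copy2(source, target)
    return target


if __name__ == "__main__":
    # Paths as given in the quoted post; adjust the version folders to the local install.
    src = r"C:\Program Files\NVIDIA Corporation\Nsight Systems 2022.4.2\host-windows-x64\zlib.dll"
    dst = r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin"
    try:
        print("copied to", copy_as(src, dst, "zlibwapi.dll"))
    except OSError as exc:
        # Covers a missing source, an existing target, and missing write permission.
        print("copy skipped:", exc)
```

Refusing to overwrite is deliberate: if a `zlibwapi.dll` is already in the CUDA `bin` folder, the load failure probably has a different cause.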