Abstract: This article walks through building a deep learning environment based on Nvidia GPUs and Docker containers on a GPU cloud host: installing the NVIDIA driver and CUDA 10.0 via the runfile method, verifying the driver, installing Docker CE and nvidia-docker2, and running a TensorFlow convolutional model in a container.
Building a Deep Learning Environment Based on Nvidia GPUs and Docker Containers
GPU cloud host:
OS: Ubuntu 16.04 64-bit
GPU: 1 x Nvidia Tesla P40
1. Install the CUDA Driver

1.1 Pre-installation Actions

Install gcc, g++, and make:
# sudo apt-get install gcc g++ make
# gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
If the kernel headers matching the running kernel are not present, install linux-headers:
# sudo apt-get install linux-headers-$(uname -r)
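Whether the matching headers are already installed can be checked by listing the directory Ubuntu installs them into:

# ls /usr/src/linux-headers-$(uname -r)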
1.2 Install the NVIDIA Driver

CUDA can be installed in two ways:

1. Package manager installation
2. Runfile installation

This article uses the runfile method; a sketch of the package-manager alternative follows below for reference.
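For completeness, a minimal sketch of the package route on Ubuntu 16.04, assuming the CUDA 10.0 repository .deb has already been downloaded from NVIDIA (the file name and repository key below follow NVIDIA's published instructions for this release; verify them against the download page):

# sudo dpkg -i cuda-repo-ubuntu1604_10.0.130-1_amd64.deb
# sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
# sudo apt-get update
# sudo apt-get install cuda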
First, disable the Nouveau driver:
# lsmod | grep nouveau
nouveau              1495040  0
mxm_wmi                16384  1 nouveau
wmi                    20480  2 mxm_wmi,nouveau
video                  40960  1 nouveau
i2c_algo_bit           16384  1 nouveau
ttm                    94208  1 nouveau
drm_kms_helper        155648  1 nouveau
drm                   364544  3 ttm,drm_kms_helper,nouveau

# vi /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0

# sudo update-initramfs -u
update-initramfs: Generating /boot/initrd.img-4.4.0-62-generic
W: mdadm: /etc/mdadm/mdadm.conf defines no arrays.
Reboot the cloud host:
# reboot
After rebooting, check that the Nouveau driver is no longer loaded:
# lsmod | grep nouveau
#
Log on to http://developer.nvidia.com/c... and download the corresponding runfile:
# wget https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux
Start installing the CUDA driver:
# chmod +x cuda_10.0.130_410.48_linux
# sudo sh ./cuda_10.0.130_410.48_linux
Logging to /tmp/cuda_install_1699.log
Using more to view the EULA.
Do you accept the previously read EULA?
accept/decline/quit: accept

Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 410.48?
(y)es/(n)o/(q)uit: y

Do you want to install the OpenGL libraries?
(y)es/(n)o/(q)uit [ default is yes ]: y

Do you want to run nvidia-xconfig?
This will update the system X configuration file so that the NVIDIA X driver
is used. The pre-existing X configuration file will be backed up.
This option should not be used on systems that require a custom
X configuration, such as systems with multiple GPU vendors.
(y)es/(n)o/(q)uit [ default is no ]:

Install the CUDA 10.0 Toolkit?
(y)es/(n)o/(q)uit: y

Enter Toolkit Location
 [ default is /usr/local/cuda-10.0 ]:

Do you want to install a symbolic link at /usr/local/cuda?
(y)es/(n)o/(q)uit: y

Install the CUDA 10.0 Samples?
(y)es/(n)o/(q)uit: y

Enter CUDA Samples Location
 [ default is /root ]:

Installing the NVIDIA display driver...
Installing the CUDA Toolkit in /usr/local/cuda-10.0 ...
Missing recommended library: libGLU.so
Missing recommended library: libX11.so
Missing recommended library: libXi.so
Missing recommended library: libXmu.so

Installing the CUDA Samples in /root ...
Copying samples to /root/NVIDIA_CUDA-10.0_Samples now...
Finished copying samples.

===========
= Summary =
===========

Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-10.0
Samples:  Installed in /root, but missing recommended libraries

Please make sure that
 -   PATH includes /usr/local/cuda-10.0/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-10.0/lib64, or, add /usr/local/cuda-10.0/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-10.0/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall

Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.0/doc/pdf for detailed information on setting up CUDA.

Logfile is /tmp/cuda_install_1699.log
Installation succeeded!

Reboot the cloud host:
# reboot
Verify the device nodes (on a headless server nothing loads the NVIDIA kernel module at boot, so the /dev/nvidia* nodes must be created manually, e.g. with the script below):
# ls /dev/nvidia*
ls: cannot access "/dev/nvidia*": No such file or directory

# vi nvidia-probe.sh
#!/bin/bash
### BEGIN INIT INFO
# Provides:          jd.com
# Required-Start:    $local_fs $network
# Required-Stop:     $local_fs
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: nvidia service
# Description:       nvidia service daemon
### END INIT INFO

/sbin/modprobe nvidia
if [ "$?" -eq 0 ]; then
    # Count the number of NVIDIA controllers found.
    NVDEVS=`lspci | grep -i NVIDIA`
    N3D=`echo "$NVDEVS" | grep "3D controller" | wc -l`
    NVGA=`echo "$NVDEVS" | grep "VGA compatible controller" | wc -l`
    N=`expr $N3D + $NVGA - 1`
    for i in `seq 0 $N`; do
        mknod -m 666 /dev/nvidia$i c 195 $i
    done
    mknod -m 666 /dev/nvidiactl c 195 255
else
    exit 1
fi

/sbin/modprobe nvidia-uvm
if [ "$?" -eq 0 ]; then
    # Find out the major device number used by the nvidia-uvm driver.
    # Note: single quotes here so awk, not the shell, expands $1.
    D=`grep nvidia-uvm /proc/devices | awk '{print $1}'`
    mknod -m 666 /dev/nvidia-uvm c $D 0
else
    exit 1
fi

# chmod +x nvidia-probe.sh
# ./nvidia-probe.sh
# ls /dev/nvidia*
/dev/nvidia0  /dev/nvidiactl  /dev/nvidia-uvm
The device nodes now show up under /dev!
Configure the script to run at boot:
# cp nvidia-probe.sh /etc/init.d/
# sudo update-rc.d nvidia-probe.sh defaults 95

1.3 Post-installation Actions
Configure the environment variables:
# vi /etc/profile
......
export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
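To confirm the variables took effect, re-source the profile and check that nvcc (shipped with the toolkit) resolves:

# source /etc/profile
# nvcc --version
# echo $LD_LIBRARY_PATH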
Start the Persistence Daemon at boot:
# vi /etc/rc.local
......
/usr/bin/nvidia-persistenced --verbose
exit 0
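As an aside, persistence mode can also be enabled directly through nvidia-smi, a legacy interface NVIDIA documents alongside nvidia-persistenced; either mechanism works, the daemon is simply the newer one:

# sudo nvidia-smi -pm 1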
1.4 Verify the CUDA Driver

Check the driver version:
# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  410.48  Thu Sep  6 06:36:33 CDT 2018
GCC version:  gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10)
Verify using the deviceQuery sample:
# cd ~/NVIDIA_CUDA-10.0_Samples/1_Utilities/deviceQuery/
# make
"/usr/local/cuda-10.0"/bin/nvcc -ccbin g++ -I../../common/inc -m64 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o deviceQuery.o -c deviceQuery.cpp
"/usr/local/cuda-10.0"/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o deviceQuery deviceQuery.o
mkdir -p ../../bin/x86_64/linux/release
cp deviceQuery ../../bin/x86_64/linux/release

# cd ../../bin/x86_64/linux/release/
# ls
deviceQuery
# ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla P40"
  CUDA Driver Version / Runtime Version          10.0 / 10.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 22919 MBytes (24032378880 bytes)
  (30) Multiprocessors, (128) CUDA Cores/MP:     3840 CUDA Cores
  GPU Max Clock rate:                            1531 MHz (1.53 GHz)
  Memory Clock rate:                             3615 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z):  (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 7
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS
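Optionally, the bandwidthTest sample (also under 1_Utilities in the samples tree) exercises host-to-device transfers and makes a second quick sanity check; build and run it the same way as deviceQuery:

# cd ~/NVIDIA_CUDA-10.0_Samples/1_Utilities/bandwidthTest/
# make
# ./bandwidthTest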
References:
https://github.com/NVIDIA/nvi...
https://docs.nvidia.com/cuda/...

2. Install Nvidia-docker

2.1 Install Docker
Install docker-ce:
# sudo apt-get remove docker docker-engine docker.io
# sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
# curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
# sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
# sudo apt-get update
# sudo apt-get install docker-ce
# docker version
Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        e68fc7a
 Built:             Tue Aug 21 17:24:56 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.1-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       e68fc7a
  Built:            Tue Aug 21 17:23:21 2018
  OS/Arch:          linux/amd64
  Experimental:     false
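Before adding the GPU runtime, a plain container run confirms that the Docker engine itself works (hello-world is Docker's standard smoke-test image):

# sudo docker run --rm hello-world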
2.2 Install nvidia-docker

Install nvidia-docker2:
# Add the package repositories
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update

# Install nvidia-docker2 and reload the Docker daemon configuration
sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd
Verify nvidia-docker:
# docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
Thu Oct 25 09:03:27 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P40           On   | 00000000:00:07.0 Off |                    0 |
| N/A   20C    P8     9W / 250W |      0MiB / 22919MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
2.3 Configure the Default Docker Runtime

Set nvidia as the default runtime so containers can use the GPU without passing --runtime=nvidia explicitly:

# cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
Restart the Docker service:
# systemctl restart docker
# systemctl status docker
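With nvidia set as the default runtime, the earlier GPU test should now also succeed without the --runtime=nvidia flag:

# docker run --rm nvidia/cuda:9.0-base nvidia-smi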
2.4 Run a TensorFlow Convolutional Neural Network Model

Run the TensorFlow container with Docker:
# docker run --rm --name tensorflow -ti tensorflow/tensorflow:r0.9-devel-gpu
root@bd0fb3758da2:~# python --version
Python 2.7.6
root@bd0fb3758da2:~# python -m tensorflow.models.image.mnist.convolutional
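To double-check that TensorFlow inside the container actually sees the GPU, here is a small sketch using the TF 0.x session API (log_device_placement makes TensorFlow log the device chosen for each op; expect gpu:0 entries on a working setup):

root@bd0fb3758da2:~# python - <<'EOF'
import tensorflow as tf
# Log which device each op is placed on; on a working GPU setup this prints gpu:0.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    a = tf.constant([1.0, 2.0, 3.0])
    b = tf.constant([4.0, 5.0, 6.0])
    print(sess.run(a + b))
EOF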
References:
https://docs.docker.com/insta...
https://github.com/NVIDIA/nvi...