
@@ -29,3 +29,24 @@ Burst size is 256
Restart System To Complete VBIOS Update.
arcologos@arcologos-desktop:~/src/arcologos-infra/GPUs$ sudo ./amdvbflash-linux -cf firmwares/originals/Gigabyte.R9390.8192.150605_1.rom
AMDVBFLASH version 4.71, Copyright (c) 2020 Advanced Micro Devices, Inc.
File Checksum = 0xC2A3
arcologos@arcologos-desktop:~/src/arcologos-infra/GPUs$ sudo ./amdvbflash-linux -cf firmwares/current/ai-gpu-01.bin
AMDVBFLASH version 4.71, Copyright (c) 2020 Advanced Micro Devices, Inc.
File Checksum = 0x9388
arcologos@arcologos-desktop:~/src/arcologos-infra/GPUs$ sudo ./amdvbflash-linux -cb 0
AMDVBFLASH version 4.71, Copyright (c) 2020 Advanced Micro Devices, Inc.
BIOS Checksum = 0x8500
arcologos@arcologos-desktop:~/src/arcologos-infra/GPUs$ sudo ./amdvbflash-linux -cr 0
AMDVBFLASH version 4.71, Copyright (c) 2020 Advanced Micro Devices, Inc.
ROM Checksum = 0x8500
arcologos@arcologos-desktop:~/src/arcologos-infra/GPUs$ sudo ./amdvbflash-linux -p 0 firmwares/current/ai-gpu-01.bin
AMDVBFLASH version 4.71, Copyright (c) 2020 Advanced Micro Devices, Inc.
Flash already programmed

@@ -1,5 +1,13 @@
# GPU Configuration for AI Compute Cluster at Toomim Brothers
## Debugging Tree
Root
[debugging/Root.md]
Current Point
[Ubuntu 20.04](debugging/Ubuntu-20_04.md)
## Software Configuration
Ubuntu 22.04 LTS

@@ -0,0 +1,27 @@
# Setup
On a new host
## Min Browser
## Element Desktop
To share passwords and other random links.
## Git
### User Name and Email
### Cache Password
Needed because Gitea / Forgejo require that the repository directory belong to the `gitea` or `forgejo` user, and I believe
I've misconfigured it on `petra`.
According to this article,
https://www.freecodecamp.org/news/how-to-fix-git-always-asking-for-user-credentials/
the credential helper can be set to remember the password:
```
git config --global credential.helper store
```
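Note that the `store` helper writes credentials unencrypted to `~/.git-credentials`. A possible alternative, given here only as a sketch, is Git's built-in `cache` helper, which keeps them in memory for a limited time. The function name `use_credential_cache` and the one-hour timeout are illustrative assumptions, not part of the original setup.

```shell
# use_credential_cache [CONFIG_FILE]
# Hypothetical helper: select Git's in-memory credential cache with a
# one-hour timeout (3600 seconds is an arbitrary choice).
# With no argument it changes the global config, like the command above;
# with a path it edits only that file, which is handy for a dry run.
use_credential_cache() {
    if [ "$#" -gt 0 ]; then
        git config --file "$1" credential.helper 'cache --timeout=3600'
    else
        git config --global credential.helper 'cache --timeout=3600'
    fi
}
```

Calling `use_credential_cache` with no arguments switches the global setting, and `git config --global --get credential.helper` shows which helper is active afterwards.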

@@ -131,22 +131,3 @@ sudo usermod -G video -a $USER
``` ```
## Logging Out and Back In
The group `video` doesn't show up in the group list,
```
$ id
uid=1000(arcologos) gid=1000(arcologos) groups=1000(arcologos),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),120(lpadmin),133(lxd),134(sambashare)
```
but now we get an additional line at the end of `clinfo` before it hangs.
```
ERROR: clCreateKernel(-6)
```
Next is to try adding to the `render` group, and then installing
`libnuma-dev` as recommended in the above GitHub Issue thread comment.
## Add User to Render Group

@@ -1,133 +0,0 @@
# Kernel Boot Params
## Updating Grub
To select the `amdgpu` driver / module in place of `radeon`, following
https://askubuntu.com/a/1314983
I'll add the following flags to the `GRUB_CMDLINE_LINUX_DEFAULT` line in `/etc/default/grub`
```
GRUB_CMDLINE_LINUX_DEFAULT="radeon.si_support=0 radeon.cik_support=0 amdgpu.si_support=1 amdgpu.cik_support=1 amdgpu.dc=1 amdgpu.dpm=1 amdgpu.modeset=1"
```
Then run
```
sudo update-grub
```
and reboot, selecting Ubuntu 20.04 per this branch of the debugging tree.
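After the reboot, it may be worth confirming that the flags actually reached the kernel before testing OpenCL. A minimal sketch, assuming the standard `/proc/cmdline` interface; the helper name `has_kernel_param` is made up for illustration:

```shell
# has_kernel_param PARAM [CMDLINE_FILE]
# Succeeds if PARAM (e.g. amdgpu.cik_support=1) appears as a whole word
# on the kernel command line. CMDLINE_FILE defaults to /proc/cmdline.
has_kernel_param() {
    param="$1"
    file="${2:-/proc/cmdline}"
    # Put each parameter on its own line, then match the whole line literally
    # so that amdgpu.dc=1 does not accidentally match amdgpu.dc=10.
    tr ' ' '\n' < "$file" | grep -qxF -- "$param"
}

# Spot-check two of the flags set in /etc/default/grub.
for p in radeon.cik_support=0 amdgpu.cik_support=1; do
    if has_kernel_param "$p"; then
        echo "present: $p"
    else
        echo "missing: $p"
    fi
done
```

If a flag is reported missing, the usual suspects are a forgotten `sudo update-grub` or having booted a different GRUB entry.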
## Rebooting
After rebooting, `OpenCL` detects the GPU! But it hangs and does not return from the call
```
sudo clinfo -l
```
The call without the `sudo` yields just the CPU and `Number of devices: 0` as before.
Here's the glorious output:
```
$ sudo clinfo -l
[sudo] password for arcologos:
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.1 AMD-APP (3558.0)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 1
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: AMD Radeon (TM) R9 390 Series
Device Topology: PCI[ B#3, D#0, F#0 ]
Max compute units: 40
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 1005Mhz
Address bits: 64
Max memory allocation: 7301444400
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 16384
Max image 3D height: 16384
Max image 3D depth: 8192
Max samplers within kernel: 26545
Max size of kernel argument: 1024
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 8589934592
Constant buffer size: 7301444400
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 65536
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 3006477104
Max global variable size: 7301444400
Max global variable preferred total size: 8589934592
Max read/write image args: 64
Max on device events: 1024
Queue on device max size: 8388608
Max on device queues: 1
Queue on device preferred size: 262144
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
```
## Add User to Video Group
Per this response
https://github.com/RadeonOpenCompute/ROCm/issues/482#issuecomment-410551357
I add the current user to the `video` group. I forget where I first read that this was necessary; the next step
is to try the `render` group as well.
```
sudo usermod -G video -a $USER
```

@@ -1,6 +1,6 @@
## Logging Out and Back In
The group `video` doesn't show up in the group list,
``` ```
$ id $ id
@@ -16,4 +16,40 @@ ERROR: clCreateKernel(-6)
Next is to try adding to the `render` group, and then installing
`libnuma-dev` as recommended in the above GitHub Issue thread comment.
Here are some tips for checking group membership:
https://unix.stackexchange.com/a/96327
For example, you can verify that the groups exist and the user belongs to them
```
# /etc/group is world-readable, so neither sudo nor cat is needed
grep "$USER" /etc/group
```
You can also use
```
sudo addgroup $USER video
```
but this has the same effect as `usermod`.
You can also start a new shell in which that group is not just one of the user's groups but their
primary / login group (which seems like overkill).
```
newgrp video
```
This, however, still caused `clinfo -l` to hang with the new line `ERROR: clCreateKernel(-6)`.
Time to log out and log back in again.
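A running session keeps the group list it was given at login, which is why a fresh login is needed. As a rough check, `id -nG` with no argument reports the groups of the current process, while `id -nG` with a user name looks that user up in the group database. The helper `pending_groups` below is a hypothetical name; it prints groups that are recorded for the user but not yet active in the session.

```shell
# pending_groups SESSION_GROUPS RECORDED_GROUPS
# Both arguments are space-separated group names. Prints every group that
# appears in RECORDED_GROUPS but not in SESSION_GROUPS, one per line.
pending_groups() {
    session="$1"
    recorded="$2"
    for g in $recorded; do
        case " $session " in
            *" $g "*) ;;          # already active in this session
            *) echo "$g" ;;       # recorded, but needs a new login
        esac
    done
}

# Compare this shell's groups with what the group database says.
pending_groups "$(id -nG)" "$(id -nG "$(id -un)")"
```

An empty result means the session is already up to date; any names printed only take effect after logging in again.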
## Add User to Render Group
The reference for this is the AMD ROCm manual, which instructs that any user who needs access to the
compute (CL) nodes be added to the `video` and `render` groups.
```
https://amdgpu-install.readthedocs.io/_/downloads/en/21.10/pdf/
```
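Membership in those groups matters because the GPU device nodes are group-owned. A quick sketch for checking this, assuming the usual Linux paths `/dev/kfd` (the ROCm compute interface) and `/dev/dri/renderD128` (the first render node), neither of which is named in these notes; `node_group` is an illustrative helper built on GNU `stat`:

```shell
# node_group PATH
# Print the name of the group that owns PATH (GNU coreutils stat).
node_group() {
    stat -c '%G' -- "$1"
}

# Report ownership for each node that exists on this machine.
for node in /dev/kfd /dev/dri/renderD128; do
    if [ -e "$node" ]; then
        echo "$node is owned by group $(node_group "$node")"
    else
        echo "$node not found"
    fi
done
```

If a node turns out to be owned by `render` or `video` and the user is in neither group, a non-root `clinfo` would not be expected to see the GPU.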

@@ -0,0 +1,95 @@
# Render Group and Rebooting
When trying to do a hardware shutdown, I got these error messages for every instance
of the `clinfo` process that I had started up. They hang hard, and don't respond to
kill signals.
```
systemd-shutdown[1]: Waiting for process: clinfo, clinfo, clinfo, clinfo, clinfo
```
## Rebooting
New groups indeed show up upon reboot.
```
$ id
uid=1000(arcologos) gid=1000(arcologos) groups=1000(arcologos),4(adm),24(cdrom),27(sudo),30(dip),44(video),46(plugdev),109(render),120(lpadmin),133(lxd),134(sambashare)
```
## Killing `clinfo` Process
Repeatedly calling
```
ps -eax
```
and
```
sudo kill -KILL <pid>
```
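of anything mentioning `clinfo` eventually did kill the `bash` processes
waiting on `clinfo` after many minutes. However, the `clinfo` processes themselves
remain, stuck in uninterruptible sleep (state `D`, not zombie `Z`), which is why `kill -KILL` has no effect on them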
```
$ ps -eax | grep clinfo
1519 pts/0 D 0:44 [clinfo]
1553 pts/1 D 0:00 [clinfo]
```
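To see which processes are stuck and where in the kernel they are waiting, `ps` can print the wait channel alongside the state. A sketch, assuming the procps `ps` shipped with Ubuntu; `only_d_state` is a hypothetical filter name:

```shell
# only_d_state
# Read `ps` output on stdin and keep the header plus rows whose STAT column
# (second field) starts with D.
only_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# wchan shows the kernel function the process is sleeping in, when available.
ps -eo pid,stat,wchan:32,comm | only_d_state
```

A wait channel inside the GPU driver would point at the kernel module rather than at `clinfo` itself.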
## Installing `libnuma-dev`
One piece of advice from the GitHub issue was to install `libnuma-dev`,
and now I wish I had been paying more attention to NUMA when
working on SaLSa self-driving cars with Samhitha.
Note that the command below installs `numactl`, not `libnuma-dev`.
```
$ sudo apt install numactl
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following NEW packages will be installed:
numactl
0 upgraded, 1 newly installed, 0 to remove and 171 not upgraded.
Need to get 38.5 kB of archives.
After this operation, 150 kB of additional disk space will be used.
Get:1 http://us.archive.ubuntu.com/ubuntu focal/main amd64 numactl amd64 2.0.12-1 [38.5 kB]
Fetched 38.5 kB in 0s (140 kB/s)
Selecting previously unselected package numactl.
(Reading database ... 165065 files and directories currently installed.)
Preparing to unpack .../numactl_2.0.12-1_amd64.deb ...
Unpacking numactl (2.0.12-1) ...
Setting up numactl (2.0.12-1) ...
Processing triggers for man-db (2.9.1-1) ...
```
The two cores of the Celeron show up in `numactl` but not the GPU.
```
$ sudo numactl -s
policy: default
preferred node: current
physcpubind: 0 1
cpubind: 0
nodebind: 0
membind: 0
$ sudo numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1
node 0 size: 1574 MB
node 0 free: 103 MB
node distances:
node 0
0: 10
```
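`numactl` reports CPU and memory nodes, so a PCI device such as the GPU would not be expected to appear in that listing. The kernel does expose a per-device NUMA affinity in sysfs, though. A sketch, assuming the PCI address `0000:03:00.0` (bus 3, device 0, function 0 as reported by `clinfo`, with domain `0000` assumed); `pci_numa_node` is an illustrative name:

```shell
# pci_numa_node ADDRESS [SYSFS_ROOT]
# Print the NUMA node the kernel associates with a PCI device.
# A value of -1 means no NUMA affinity is recorded for the device.
pci_numa_node() {
    addr="$1"
    root="${2:-/sys}"
    cat "$root/bus/pci/devices/$addr/numa_node"
}

# Assumed address; `lspci -D` lists the real ones, including the domain.
pci_numa_node 0000:03:00.0 || echo "no such device at that address"
```

On a single-node machine like this one, the value can only be `0` or `-1`, so this is mainly a sanity check that the device is visible in sysfs at all.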
## Perhaps a Problem with `rocm-opencl` or the Ubuntu Distribution
A problem with this on Fedora prompted the `clinfo` maintainer to suggest taking it up with Fedora:
https://github.com/Oblomov/clinfo/issues/81
An unanswered report of `clinfo` hanging, on the AMD community forum:
https://community.amd.com/t5/drivers-software/clinfo-gets-hanged/td-p/444906

File diff suppressed because it is too large