Add debugging notes for video and render group, a full strace output, and numactl.

main
Paul Pham 2 years ago
parent c7f2ccb03b
commit 706b62058b

@ -29,3 +29,24 @@ Burst size is 256
Restart System To Complete VBIOS Update.
arcologos@arcologos-desktop:~/src/arcologos-infra/GPUs$ sudo ./amdvbflash-linux -cf firmwares/originals/Gigabyte.R9390.8192.150605_1.rom
AMDVBFLASH version 4.71, Copyright (c) 2020 Advanced Micro Devices, Inc.
File Checksum = 0xC2A3
arcologos@arcologos-desktop:~/src/arcologos-infra/GPUs$ sudo ./amdvbflash-linux -cf firmwares/current/ai-gpu-01.bin
AMDVBFLASH version 4.71, Copyright (c) 2020 Advanced Micro Devices, Inc.
File Checksum = 0x9388
arcologos@arcologos-desktop:~/src/arcologos-infra/GPUs$ sudo ./amdvbflash-linux -cb 0
AMDVBFLASH version 4.71, Copyright (c) 2020 Advanced Micro Devices, Inc.
BIOS Checksum = 0x8500
arcologos@arcologos-desktop:~/src/arcologos-infra/GPUs$ sudo ./amdvbflash-linux -cr 0
AMDVBFLASH version 4.71, Copyright (c) 2020 Advanced Micro Devices, Inc.
ROM Checksum = 0x8500
arcologos@arcologos-desktop:~/src/arcologos-infra/GPUs$ sudo ./amdvbflash-linux -p 0 firmwares/current/ai-gpu-01.bin
AMDVBFLASH version 4.71, Copyright (c) 2020 Advanced Micro Devices, Inc.
Flash already programmed

@ -1,5 +1,13 @@
# GPU Configuration for AI Compute Cluster at Toomim Brothers
## Debugging Tree
Root
[debugging/Root.md]
Current Point
[Ubuntu 20.04](debugging/Ubuntu-20_04.md)
## Software Configuration
Ubuntu 22.04 LTS

@ -0,0 +1,27 @@
# Setup
On a new host
## Min Browser
## Element Desktop
To share passwords and other random links.
## Git
### User Name and Email
### Cache Password
This is needed because Gitea / Forgejo require that the repository directory belong to the `gitea` or `forgejo` user,
and I believe I've misconfigured it on `petra`.
According to this
https://www.freecodecamp.org/news/how-to-fix-git-always-asking-for-user-credentials/
```
git config --global credential.helper store
```

@ -1,6 +1,6 @@
## Logging Out and Back In
The group `video` doesn't show up in the group list,
```
$ id
@ -16,4 +16,40 @@ ERROR: clCreateKernel(-6)
Next is to try adding to the `render` group, and then installing
`libnuma-dev` as recommended in the above GitHub Issue thread comment.
Here are some tips to check.
https://unix.stackexchange.com/a/96327
For example, you can verify that the groups exist and the user belongs to them
```
grep "$USER" /etc/group
```
You can also use
```
sudo addgroup $USER video
```
but this has the same effect as `usermod`.
You can also start a new shell in which that group is not just one of the user's groups but their
primary / login group (which seems like overkill).
```
newgrp video
```
This, however, still caused `clinfo -l` to hang with the new line `ERROR: clCreateKernel(-6)`.
Time to log out and log back in again.
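The need to log out follows from how group membership is applied: `usermod` and `addgroup` edit the group database, while an already-running shell keeps the group list it received at login. Below is a minimal sketch of a check that tells the two cases apart, assuming a POSIX shell and coreutils `id`. The helper names `has_word` and `group_status` are made up for illustration.

```shell
#!/bin/sh
# has_word LIST WORD: succeed if WORD is one of the space-separated items in LIST.
has_word() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# group_status GROUP: report whether GROUP is active in this shell,
# only configured in the group database, or absent for the current user.
group_status() {
    group=$1
    user=$(id -un)
    session=$(id -nG)            # groups held by the current process
    database=$(id -nG "$user")   # groups looked up in the group database
    if has_word "$session" "$group"; then
        echo "$group: active in this session"
    elif has_word "$database" "$group"; then
        echo "$group: configured, but not active until the next login"
    else
        echo "$group: not configured for $user"
    fi
}
```

Running `group_status video` and `group_status render` before and after logging back in should show the status move from configured to active.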
## Add User to Render Group
The reference is the AMD ROCm manual, which instructs adding any user who needs access to the
compute (CL) nodes to the `video` and `render` groups.
```
https://amdgpu-install.readthedocs.io/_/downloads/en/21.10/pdf/
```

@ -0,0 +1,95 @@
# Render Group and Rebooting
When trying to do a hardware shutdown, I got these error messages for every instance
of the `clinfo` process that I had started up. They hang hard, and don't respond to
kill signals.
```
systemd-shutdown[1]: Waiting for process: clinfo, clinfo, clinfo, clinfo, clinfo
```
## Rebooting
New groups indeed show up upon reboot.
```
$ id
uid=1000(arcologos) gid=1000(arcologos) groups=1000(arcologos),4(adm),24(cdrom),27(sudo),30(dip),44(video),46(plugdev),109(render),120(lpadmin),133(lxd),134(sambashare)
```
## Killing `clinfo` Process
Repeatedly calling
```
ps -eax
```
and
```
sudo kill -KILL <pid>
```
of anything mentioning `clinfo` eventually did kill the `bash` processes
waiting on `clinfo`, after many minutes. However, the real processes
are not zombies: they remain in uninterruptible sleep (state `D`), which
is why `SIGKILL` has no effect on them.
```
$ ps -eax | grep clinfo
1519 pts/0 D 0:44 [clinfo]
1553 pts/1 D 0:00 [clinfo]
```
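To tell such processes apart from true zombies (state `Z`), the first letter of the `ps` STAT column can be decoded. This is a rough sketch assuming procps `ps` on Linux. The function names `describe_state` and `list_by_name` are hypothetical, and the `wchan` column is only a hint at which kernel function the process is blocked in.

```shell
#!/bin/sh
# describe_state STAT: explain the first letter of a ps STAT field.
describe_state() {
    case "$1" in
        D*) echo "uninterruptible sleep" ;;
        Z*) echo "zombie" ;;
        R*) echo "running" ;;
        S*) echo "interruptible sleep" ;;
        T*) echo "stopped" ;;
        *)  echo "other" ;;
    esac
}

# list_by_name NAME: print pid, state, and kernel wait channel
# for every process whose command name equals NAME.
list_by_name() {
    ps -eo pid=,stat=,wchan=,comm= | while read -r pid stat wchan comm; do
        if [ "$comm" = "$1" ]; then
            echo "$pid $stat ($(describe_state "$stat")) waiting in: $wchan"
        fi
    done
}
```

For example, `list_by_name clinfo` would list each stuck `clinfo` with its decoded state.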
## Installing `libnuma-dev`
One piece of advice in the GitHub issue was to install `libnuma-dev`,
and now I wish I had been paying more attention to NUMA when
working on SaLSa self-driving cars with Samhitha. The session below
installs the `numactl` command-line tool.
```
$ sudo apt install numactl
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following NEW packages will be installed:
numactl
0 upgraded, 1 newly installed, 0 to remove and 171 not upgraded.
Need to get 38.5 kB of archives.
After this operation, 150 kB of additional disk space will be used.
Get:1 http://us.archive.ubuntu.com/ubuntu focal/main amd64 numactl amd64 2.0.12-1 [38.5 kB]
Fetched 38.5 kB in 0s (140 kB/s)
Selecting previously unselected package numactl.
(Reading database ... 165065 files and directories currently installed.)
Preparing to unpack .../numactl_2.0.12-1_amd64.deb ...
Unpacking numactl (2.0.12-1) ...
Setting up numactl (2.0.12-1) ...
Processing triggers for man-db (2.9.1-1) ...
```
The two cores of the Celeron show up in `numactl` but not the GPU.
```
$ sudo numactl -s
policy: default
preferred node: current
physcpubind: 0 1
cpubind: 0
nodebind: 0
membind: 0
$ sudo numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1
node 0 size: 1574 MB
node 0 free: 103 MB
node distances:
node 0
0: 10
```
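`numactl` reports only CPU and memory nodes, so a GPU would not be expected to appear in its output. The kernel exposes a PCI device's NUMA affinity through sysfs instead. Here is a minimal sketch that lists display-class PCI devices (class code starting with `0x03`) and their `numa_node` value, assuming the usual Linux sysfs layout. The function name `pci_numa_nodes` and its optional root argument are made up for illustration.

```shell
#!/bin/sh
# pci_numa_nodes [ROOT]: print the NUMA node of each display-class PCI device.
# ROOT defaults to the standard sysfs location; a value of -1 means the
# kernel has no NUMA affinity recorded for that device.
pci_numa_nodes() {
    root=${1:-/sys/bus/pci/devices}
    for dev in "$root"/*; do
        [ -r "$dev/class" ] || continue
        class=$(cat "$dev/class")
        case "$class" in
            0x03*)
                node=$(cat "$dev/numa_node" 2>/dev/null || echo unknown)
                echo "${dev##*/} numa_node=$node"
                ;;
        esac
    done
}
```

On a single-node machine like this one, calling `pci_numa_nodes` would most likely report `-1` or `0` for the GPU, so NUMA placement is probably not what makes `clinfo` hang here.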
## Perhaps a Problem with `rocm-opencl` or the Ubuntu Distribution
A similar problem on Fedora prompted the `clinfo` maintainer to suggest taking it up with Fedora:
https://github.com/Oblomov/clinfo/issues/81
An unanswered thread on the AMD community forum:
https://community.amd.com/t5/drivers-software/clinfo-gets-hanged/td-p/444906

File diff suppressed because it is too large