parent
706b62058b
commit
7996ab660d
@ -0,0 +1,53 @@
|
||||
# First Install
|
||||
|
||||
I first chose Ubuntu 22.04 because it was the latest Long Term Support (LTS)
release, and because ROCm, AMD's equivalent of NVIDIA's CUDA,
had Ubuntu 22.04 releases.
|
||||
|
||||
I assumed that any problems caused by driver / package mismatches and by
bleeding-edge releases would be balanced by frequent fixes.
|
||||
|
||||
However, I eventually abandoned this approach because it became clear from
multiple sources that the `amdgpu` unified open source driver stopped supporting
Hawaii chipsets (sometimes called `gfx7`), the hardware / OEM names for
the R9 390s.
|
||||
|
||||
I'm briefly summarizing my approach here to keep track of what I've already tried
and why, in case it's worth trying again in the future, for example if hardware support improves.
|
||||
|
||||
I installed three times, each time until the machine appeared borked (the screen went blank on
bootup, or showed a sick-computer black-and-white screen due to an unrecoverable kernel error).
|
||||
|
||||
In retrospect, these seem likely to have been solvable by two methods:
|
||||
|
||||
* uninstalling newer kernel versions and downgrading to a kernel that others have reported success with
|
||||
* fixing kernel boot parameters in GRUB and trying again.
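
For the second option, the edit to `/etc/default/grub` can be sketched as below. This is only an illustration: the helper name `add_kernel_params` is made up for this note, and the two parameters shown (`radeon.cik_support=0 amdgpu.cik_support=1`, which ask the kernel to hand Hawaii-class cards to `amdgpu` instead of `radeon`) are an assumption about what might be worth trying, not something these notes confirm.

```python
import re

def add_kernel_params(grub_text: str, params: list) -> str:
    """Return grub_text with params appended to GRUB_CMDLINE_LINUX_DEFAULT.

    Parameters that are already present are not added a second time.
    """
    pattern = re.compile(r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE)

    def _merge(match):
        current = match.group(2).split()
        for param in params:
            if param not in current:
                current.append(param)
        return match.group(1) + " ".join(current) + match.group(3)

    new_text, count = pattern.subn(_merge, grub_text)
    if count == 0:
        raise ValueError("GRUB_CMDLINE_LINUX_DEFAULT line not found")
    return new_text


# Sample file content (assumed, not copied from the real machine).
sample = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
print(add_kernel_params(sample, ["radeon.cik_support=0", "amdgpu.cik_support=1"]))
```

After changing the real file, `sudo update-grub` and a reboot are needed before the new parameters take effect.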
|
||||
|
||||
## On-board Display
|
||||
|
||||
A key part of this phase of debugging was setting the "Advanced Configuration" >
"Chipset Configuration" section of the UEFI / BIOS to use the on-board graphics for display
rather than the GPU, because the monitor is connected to the VGA port on the mainboard.
|
||||
|
||||
## The Latest Unified AMDGPU Driver
|
||||
|
||||
|
||||
## The Latest Radeon Driver
|
||||
|
||||
I still don't know the difference between the two kinds of drivers.
|
||||
|
||||
## DKMS Compile Errors
|
||||
|
||||
This line of inquiry ended when I repeatedly ran into DKMS compile errors.
I believe the module was being built under `/usr/src/module/amdgpu` or some similar path.
|
||||
|
||||
The make would fail due to a preprocessor error about the macro
|
||||
|
||||
```
|
||||
#define NULL ((void *)0)
|
||||
```
|
||||
|
||||
which looks legit to me. The make log also said that it was using gcc-11.4.0
while the kernel had been built with gcc-11.3.0.
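
A quick way to confirm that kind of compiler mismatch is to compare the gcc version recorded in `/proc/version` with the one reported by `gcc --version`. The sketch below only parses the two strings. The sample strings are illustrative stand-ins shaped like typical Ubuntu output (with a placeholder kernel version), not copied from the failing machine.

```python
import re

_GCC_VERSION = re.compile(r"gcc\b.*?(\d+\.\d+\.\d+)")

def gcc_version(text: str):
    """Return the first x.y.z version that follows the word 'gcc', or None."""
    match = _GCC_VERSION.search(text)
    return match.group(1) if match else None

def compilers_match(proc_version: str, gcc_output: str) -> bool:
    """True when the kernel's build compiler and the local gcc agree."""
    kernel_gcc = gcc_version(proc_version)
    local_gcc = gcc_version(gcc_output)
    return kernel_gcc is not None and kernel_gcc == local_gcc

# Illustrative inputs (assumed format; gcc versions taken from the make log above).
proc_version = "Linux version 0.0.0-example (gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, GNU ld 2.38)"
gcc_output = "gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0"
print(gcc_version(proc_version), gcc_version(gcc_output), compilers_match(proc_version, gcc_output))
```

On a real system the first string would come from reading `/proc/version` and the second from the first line of `gcc --version`.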
|
||||
|
||||
|
||||
@ -0,0 +1,28 @@
|
||||
# Second / Third Install
|
||||
|
||||
During the second / third install, I believe I focused on flashing the GPUs
using `amdvbflash`, a command-line Linux tool and the successor to `atiflash`.
|
||||
|
||||
My hypothesis was that the overclocked BIOS on the cards was preventing the kernel drivers
from detecting them, so that they never showed up in `clinfo` or `rocm-smi`, but I now consider that unlikely.
A failure caused by overclocking would more likely have shown up as unstable performance.
|
||||
|
||||
Instead, the cards were not detected and the driver modules were not loaded in the kernel
at all.
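
Whether a GPU driver module is loaded at all can be read from `/proc/modules` (the same data `lsmod` prints). A minimal sketch, with the module names of interest (`amdgpu`, `radeon`, `i915`) picked for this machine and a made-up sample standing in for the real file:

```python
GPU_MODULES = ("amdgpu", "radeon", "i915")

def loaded_gpu_modules(proc_modules_text: str) -> list:
    """Return the GPU driver modules that appear in /proc/modules-style text."""
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in GPU_MODULES if name in loaded]

# Made-up sample in the /proc/modules layout:
# name, size, use count, dependents, state, address.
sample = (
    "i915 3051520 3 - Live 0x0000000000000000\n"
    "snd_hda_intel 53248 0 - Live 0x0000000000000000\n"
)
print(loaded_gpu_modules(sample))
```

On the real machine the text would come from `open("/proc/modules").read()`. An empty result for `amdgpu` and `radeon` matches the "not loaded at all" symptom described above.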
|
||||
|
||||
## PCIe Level Gen 2
|
||||
|
||||
There appears to be a speed negotiation setting in BIOS for communication
|
||||
over the PCI Express bus.
|
||||
|
||||
One forum thread recommended setting it to Gen 1, but I haven't found
any issues with the link speed so far. I might try it again later.
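
The negotiated speed can also be checked from Linux rather than from the BIOS: the PCI sysfs directory for a device exposes `current_link_speed` and `max_link_speed`. The sketch below maps such a string to a PCIe generation. The exact text varies between kernel versions (for example `5 GT/s` versus `5.0 GT/s PCIe`), so it only looks at the leading number; the function name and lookup table are made up for this note.

```python
import re

# Per-lane transfer rates in GT/s for each PCIe generation.
GENERATION_BY_RATE = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4}

def pcie_generation(link_speed: str):
    """Map a sysfs link-speed string such as '5.0 GT/s PCIe' to a generation number."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s*GT/s", link_speed)
    if match is None:
        return None
    return GENERATION_BY_RATE.get(float(match.group(1)))

for text in ("2.5 GT/s", "5 GT/s", "5.0 GT/s PCIe", "8.0 GT/s PCIe"):
    print(text, "->", pcie_generation(text))
```

For the card in these notes the input would be the content of `/sys/bus/pci/devices/0000:03:00.0/current_link_speed`, using the device address that appears in the `dmesg` output.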
|
||||
|
||||
## Netboot
|
||||
|
||||
I set the BIOS / UEFI setting to Boot-on-LAN, which I assumed was
the same as Netboot, but the machine kept booting into GRUB.
|
||||
|
||||
Perhaps I need to disable all the other boot volumes in order to
|
||||
get Netboot into the boot order. Something to try later.
|
||||
@ -0,0 +1,143 @@
|
||||
# `dmesg` Output
|
||||
|
||||
```
|
||||
[ 4.728018] amdgpu: unknown parameter 'modeset' ignored
|
||||
[ 4.735631] [drm] amdgpu kernel modesetting enabled.
|
||||
[ 4.736026] amdgpu: CRAT table not found
|
||||
[ 4.736031] amdgpu: Virtual CRAT table created for CPU
|
||||
[ 4.736042] amdgpu: Topology: Add CPU node
|
||||
[ 4.736154] amdgpu 0000:03:00.0: enabling device (0000 -> 0003)
|
||||
[ 4.736247] [drm] initializing kernel modesetting (HAWAII 0x1002:0x67B1 0x1462:0x2015 0x80).
|
||||
[ 4.736251] amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
|
||||
[ 4.736262] [drm] register mmio base: 0xF0800000
|
||||
[ 4.736263] [drm] register mmio size: 262144
|
||||
[ 4.736266] [drm] PCIE atomic ops is not supported
|
||||
[ 4.736273] [drm] add ip block number 0 <cik_common>
|
||||
[ 4.736274] [drm] add ip block number 1 <gmc_v7_0>
|
||||
[ 4.736275] [drm] add ip block number 2 <cik_ih>
|
||||
[ 4.736276] [drm] add ip block number 3 <gfx_v7_0>
|
||||
[ 4.736277] [drm] add ip block number 4 <cik_sdma>
|
||||
[ 4.736278] [drm] add ip block number 5 <powerplay>
|
||||
[ 4.736279] [drm] add ip block number 6 <dm>
|
||||
[ 4.736280] [drm] add ip block number 7 <uvd_v4_2>
|
||||
[ 4.736282] [drm] add ip block number 8 <vce_v2_0>
|
||||
[ 5.005666] amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ROM BAR
|
||||
[ 5.005673] amdgpu: ATOM BIOS: MS-V30823-F6
|
||||
[ 5.005721] [drm] GPU posting now...
|
||||
[ 5.018960] [drm] PCIE gen 2 link speeds already enabled
|
||||
[ 5.018970] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
|
||||
[ 5.023944] amdgpu 0000:03:00.0: amdgpu: VRAM: 8192M 0x000000F400000000 - 0x000000F5FFFFFFFF (8192M used)
|
||||
[ 5.023951] amdgpu 0000:03:00.0: amdgpu: GART: 1024M 0x000000FF00000000 - 0x000000FF3FFFFFFF
|
||||
[ 5.023967] [drm] Detected VRAM RAM=8192M, BAR=256M
|
||||
[ 5.023969] [drm] RAM width 512bits GDDR5
|
||||
[ 5.023988] [drm] amdgpu: 8192M of VRAM memory ready
|
||||
[ 5.023991] [drm] amdgpu: 1181M of GTT memory ready.
|
||||
[ 5.023998] [drm] GART: num cpu pages 262144, num gpu pages 262144
|
||||
[ 5.024556] [drm] PCIE GART of 1024M enabled (table at 0x000000F400000000).
|
||||
[ 5.069604] amdgpu: hwmgr_sw_init smu backed is ci_smu
|
||||
[ 5.069652] intel_rapl_common: Found RAPL domain package
|
||||
[ 5.069655] intel_rapl_common: Found RAPL domain core
|
||||
[ 5.069656] intel_rapl_common: Found RAPL domain uncore
|
||||
[ 5.069657] intel_rapl_common: Found RAPL domain dram
|
||||
[ 5.070486] [drm] Found UVD firmware Version: 1.64 Family ID: 9
|
||||
[ 5.078070] [drm] Found VCE firmware Version: 50.10 Binary ID: 2
|
||||
[ 5.110017] [drm] dce110_link_encoder_construct: Failed to get encoder_cap_info from VBIOS with error code 4!
|
||||
[ 5.110050] [drm] dce110_link_encoder_construct: Failed to get encoder_cap_info from VBIOS with error code 4!
|
||||
[ 5.110079] [drm] dce110_link_encoder_construct: Failed to get encoder_cap_info from VBIOS with error code 4!
|
||||
[ 5.110099] [drm] Display Core initialized with v3.2.149!
|
||||
[ 5.112540] snd_hda_intel 0000:03:00.1: bound 0000:03:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
|
||||
[ 5.149636] [drm] UVD initialized successfully.
|
||||
[ 5.269685] [drm] VCE initialized successfully.
|
||||
[ 5.273454] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
|
||||
```
|
||||
|
||||
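It allocates 3969056 bytes, which is about 3.8 MiB (not close to 4 GB), on the `gart`,
the Graphics Address Remapping Table that lets the GPU address system memory.
What does match hardware reality is the 8192M of VRAM detected earlier in the log.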
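
To double-check the size in the `kfd` line, the byte count can be pulled out of the log line and converted to MiB. This is plain arithmetic on the number printed above; the function name is just for this note.

```python
import re

def gart_allocation_mib(dmesg_line: str):
    """Return the 'Allocated N bytes on gart' size in MiB, or None if absent."""
    match = re.search(r"Allocated (\d+) bytes on gart", dmesg_line)
    if match is None:
        return None
    return int(match.group(1)) / 2**20  # 1 MiB = 2**20 bytes

line = "[ 5.273454] kfd kfd: amdgpu: Allocated 3969056 bytes on gart"
print(f"{gart_allocation_mib(line):.2f} MiB")  # prints "3.79 MiB"
```

So the allocation is a few megabytes of bookkeeping space, roughly a thousand times smaller than 4 GB.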
|
||||
|
||||
```
|
||||
[ 15.258622] amdgpu: VI should always have 2 performance levels
|
||||
[ 178.440772] [drm] PCIE gen 2 link speeds already enabled
|
||||
[ 178.448020] [drm] PCIE GART of 1024M enabled (table at 0x000000F400000000).
|
||||
[ 178.450788] amdgpu 0000:03:00.0: amdgpu: SRBM_SOFT_RESET=0x00100040
|
||||
[ 178.500169] [drm] UVD initialized successfully.
|
||||
[ 178.620189] [drm] VCE initialized successfully.
|
||||
[ 178.621247] amdgpu: SW scheduler is used
|
||||
[ 178.671490] amdgpu 0000:03:00.0: [drm] Cannot find any crtc or sizes
|
||||
[ 179.165318] ------------[ cut here ]------------
|
||||
[ 179.165321] Load non-HWS mqd while stopped
|
||||
[ 179.165339] WARNING: CPU: 0 PID: 1519 at drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device_queue_manager.c:389 create_queue_nocpsch+0x372/0x710 [amdgpu]
|
||||
[ 179.165591] Modules linked in: nls_iso8859_1 intel_rapl_msr amdgpu intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp mei_hdcp kvm_intel kvm snd_hda_codec_realtek crct10dif_pclmul snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi ghash_clmulni_intel snd_hda_intel cryptd snd_intel_dspcfg snd_intel_sdw_acpi rapl snd_hda_codec binfmt_misc snd_hda_core snd_hwdep intel_cstate iommu_v2 gpu_sched snd_pcm radeon i915 snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device drm_ttm_helper snd_timer ttm joydev input_leds drm_kms_helper cec snd rc_core i2c_algo_bit fb_sys_fops syscopyarea sysfillrect mei_me sysimgblt soundcore mei mac_hid sch_fq_codel msr parport_pc ppdev lp parport ramoops reed_solomon efi_pstore drm pstore_blk pstore_zone ip_tables x_tables autofs4 hid_generic usbhid hid i2c_i801 crc32_pclmul i2c_smbus r8169 xhci_pci ahci libahci realtek lpc_ich xhci_pci_renesas video
|
||||
[ 179.165647] CPU: 0 PID: 1519 Comm: clinfo Not tainted 5.15.0-78-generic #85~20.04.1-Ubuntu
|
||||
[ 179.165650] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H81 Pro BTC R2.0, BIOS P1.20 07/22/2014
|
||||
[ 179.165651] RIP: 0010:create_queue_nocpsch+0x372/0x710 [amdgpu]
|
||||
[ 179.165831] Code: 0f b6 3d d6 ed 68 00 41 80 ff 01 0f 87 23 5f 36 00 41 83 e7 01 75 15 48 c7 c7 08 f4 37 c1 c6 05 b8 ed 68 00 01 e8 94 07 c3 cc <0f> 0b 49 8b 45 10 4c 89 70 08 49 89 06 48 8b 45 c0 49 89 46 08 4d
|
||||
[ 179.165834] RSP: 0018:ffff9cba4150bbd0 EFLAGS: 00010286
|
||||
[ 179.165836] RAX: 0000000000000000 RBX: ffff8c5f81149800 RCX: 0000000000000027
|
||||
[ 179.165838] RDX: 0000000000000027 RSI: ffff9cba4150ba10 RDI: ffff8c6080220588
|
||||
[ 179.165839] RBP: ffff9cba4150bc18 R08: ffff8c6080220580 R09: 0000000000000001
|
||||
[ 179.165840] R10: 0000000000000001 R11: 0000000000000020 R12: 0000000000000000
|
||||
[ 179.165841] R13: ffff8c5f98d5c610 R14: ffff8c5f82d4dc00 R15: 0000000000000000
|
||||
[ 179.165842] FS: 00007fb78370eb80(0000) GS:ffff8c6080200000(0000) knlGS:0000000000000000
|
||||
[ 179.165845] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
|
||||
[ 179.165846] CR2: 0000000001018040 CR3: 0000000004840006 CR4: 00000000000706f0
|
||||
[ 179.165848] Call Trace:
|
||||
[ 179.165849] <TASK>
|
||||
[ 179.165852] pqm_create_queue+0x191/0x450 [amdgpu]
|
||||
[ 179.166012] kfd_ioctl_create_queue+0xd3/0x2c0 [amdgpu]
|
||||
[ 179.166167] kfd_ioctl+0x2f9/0x480 [amdgpu]
|
||||
[ 179.166363] ? kfd_ioctl_dbg_address_watch+0x190/0x190 [amdgpu]
|
||||
[ 179.166557] ? init_generic_mmio_info+0x52a2/0x8a80 [i915]
|
||||
[ 179.166692] ? __fget_light+0xdc/0x110
|
||||
[ 179.166697] __x64_sys_ioctl+0x95/0xd0
|
||||
[ 179.166700] do_syscall_64+0x5c/0xc0
|
||||
[ 179.166706] ? exit_to_user_mode_prepare+0x3d/0x1c0
|
||||
[ 179.166710] ? do_user_addr_fault+0x1e0/0x660
|
||||
[ 179.166714] ? irqentry_exit_to_user_mode+0x9/0x20
|
||||
[ 179.166726] RIP: 0033:0x7fb78399e3ab
|
||||
[ 179.166729] Code: 0f 1e fa 48 8b 05 e5 7a 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b5 7a 0d 00 f7 d8 64 89 01 48
|
||||
[ 179.166731] RSP: 002b:00007ffc49475068 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
|
||||
[ 179.166735] RAX: ffffffffffffffda RBX: 00007ffc49475100 RCX: 00007fb78399e3ab
|
||||
[ 179.166737] RDX: 00007ffc49475100 RSI: 00000000c0584b02 RDI: 0000000000000005
|
||||
[ 179.166738] RBP: 00000000c0584b02 R08: 00000000000000a0 R09: 0000000000000000
|
||||
[ 179.166739] R10: 00007ffc49475100 R11: 0000000000000246 R12: 0000000000000000
|
||||
[ 179.166741] R13: 0000000000000005 R14: 0000000001018000 R15: 0000000000000001
|
||||
[ 179.166744] </TASK>
|
||||
[ 179.166745] ---[ end trace 58857c63ca0499f5 ]---
|
||||
[ 223.552951] ------------[ cut here ]------------
|
||||
```
|
||||
|
||||
Something about `create_queue` is failing. Further down in the `dmesg` log:
|
||||
```
|
||||
[ 223.556713] Fixing recursive fault but reboot is needed!
|
||||
[ 265.362935] amdgpu: Can't create new usermode queue because -1 queues were already created
|
||||
[ 265.362942] amdgpu: Pasid 0x8003 DQM create queue type 0 failed. ret -1
|
||||
```
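
When a log is this long, it helps to pull out just the lines that name the failing function and the process that triggered it. A rough sketch, assuming the log has been saved to a string. The patterns only cover the line shapes seen above (`WARNING: ... at file:line function+offset` and `Comm: <name>`) and are not a general `dmesg` parser; all names in the code are made up for this note.

```python
import re

WARNING_RE = re.compile(r"WARNING: .* at (\S+):(\d+) (\w+)\+")
COMM_RE = re.compile(r"Comm: (\S+)")

def summarize_warnings(dmesg_text: str) -> list:
    """Return (source_file, line_number, function, process) for each kernel WARNING."""
    results = []
    current = None
    for line in dmesg_text.splitlines():
        warning = WARNING_RE.search(line)
        if warning:
            # Start a new record; the process name is filled in by a later line.
            current = [warning.group(1), int(warning.group(2)), warning.group(3), None]
            results.append(current)
            continue
        comm = COMM_RE.search(line)
        if comm and current is not None and current[3] is None:
            current[3] = comm.group(1)
    return [tuple(item) for item in results]

# Two lines copied from the trace above.
sample = """\
[ 179.165339] WARNING: CPU: 0 PID: 1519 at drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device_queue_manager.c:389 create_queue_nocpsch+0x372/0x710 [amdgpu]
[ 179.165647] CPU: 0 PID: 1519 Comm: clinfo Not tainted 5.15.0-78-generic #85~20.04.1-Ubuntu
"""
for entry in summarize_warnings(sample):
    print(entry)
```

For the trace above this points at `create_queue_nocpsch` in `kfd_device_queue_manager.c`, triggered by `clinfo`.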
|
||||
|
||||
## Use `dkms` after all
|
||||
|
||||
I avoided `dkms` (Dynamic Kernel Module Support) because the module failed to build on Ubuntu 22.04 due to
preprocessor / gcc errors, but I now think that was because the 6.x kernel was not supported / tested
with `amdgpu`.
|
||||
|
||||
The long and detailed thread linked below seems to indicate that `amdgpu-dkms`, on kernel 5.15.x, might fix the
problem with `rocr-opencl`. I'm still not sure how `rocm`, `rocr`, `opencl`, and the
`amdgpu` driver interact. I do know now that `ROCm` originally stood for Radeon Open Compute, as the GitHub organization name in the link suggests.
|
||||
|
||||
https://github.com/RadeonOpenCompute/ROCm/issues/1624
|
||||
|
||||
That's the next thing to try,
after compiling a test `OpenCL` program in the next step.
|
||||
|
||||
## Use ROCm after all
|
||||
|
||||
According to the forum thread linked below, this may just be a lack of support in the ROCm stack that `OpenCL` depends on.
If so, I should find the thread that discusses `AMD` discontinuing support for the R9 / Hawaii / gfx7, start with
the ROCm version and driver from just before that point, and work my way backwards by downgrading.
|
||||
|
||||
https://forum.level1techs.com/t/amd-r9-390-finally-usable-on-linux/131922
|
||||
@ -0,0 +1,35 @@
|
||||
# OpenCL Demo
|
||||
|
||||
A toy program from
https://bbs.archlinux.org/viewtopic.php?id=254491
that I saved in the `opencl-demo` directory. Build it with `make`:
|
||||
|
||||
```
|
||||
cd ../opencl-demo
|
||||
make
|
||||
```
|
||||
|
||||
## Installing `opencl-headers`
|
||||
|
||||
The build can't find the `CL/cl.h` header at first, so the headers need to be installed
(see https://stackoverflow.com/a/45880123):
|
||||
```
|
||||
sudo apt install opencl-headers
|
||||
```
|
||||
|
||||
## Installable Client Driver (ICD)
|
||||
|
||||
I finally found out what this technology is.
It appears to let you compile against OpenCL
libraries in a platform-independent way
(the same compiled binary would run on Intel / AMD / Nvidia hardware).
|
||||
|
||||
https://stackoverflow.com/a/17456374
|
||||
|
||||
I can find the vendor definitions in
`/etc/OpenCL/vendors/`.
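
Each file in that directory is a small text file naming the vendor's OpenCL library, which the ICD loader opens at run time. A small sketch that lists them: the function name is made up for this note, and taking the first non-empty line as the library name is an assumption about the file layout.

```python
from pathlib import Path

def list_icd_vendors(vendors_dir: str = "/etc/OpenCL/vendors") -> dict:
    """Map each .icd file name to the library name or path it contains."""
    directory = Path(vendors_dir)
    if not directory.is_dir():
        return {}
    vendors = {}
    for icd_file in sorted(directory.glob("*.icd")):
        # The first non-empty line names the vendor library to load.
        lines = [ln.strip() for ln in icd_file.read_text().splitlines() if ln.strip()]
        vendors[icd_file.name] = lines[0] if lines else ""
    return vendors

for name, library in list_icd_vendors().items():
    print(f"{name}: {library}")
```

An empty result means no vendor has registered an OpenCL implementation, which would be one reason for `clinfo` to report no platforms.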
|
||||
|
||||
|
||||
|
||||