# `dmesg` Output
```
[ 4.728018] amdgpu: unknown parameter 'modeset' ignored
[ 4.735631] [drm] amdgpu kernel modesetting enabled.
[ 4.736026] amdgpu: CRAT table not found
[ 4.736031] amdgpu: Virtual CRAT table created for CPU
[ 4.736042] amdgpu: Topology: Add CPU node
[ 4.736154] amdgpu 0000:03:00.0: enabling device (0000 -> 0003)
[ 4.736247] [drm] initializing kernel modesetting (HAWAII 0x1002:0x67B1 0x1462:0x2015 0x80).
[ 4.736251] amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[ 4.736262] [drm] register mmio base: 0xF0800000
[ 4.736263] [drm] register mmio size: 262144
[ 4.736266] [drm] PCIE atomic ops is not supported
[ 4.736273] [drm] add ip block number 0 <cik_common>
[ 4.736274] [drm] add ip block number 1 <gmc_v7_0>
[ 4.736275] [drm] add ip block number 2 <cik_ih>
[ 4.736276] [drm] add ip block number 3 <gfx_v7_0>
[ 4.736277] [drm] add ip block number 4 <cik_sdma>
[ 4.736278] [drm] add ip block number 5 <powerplay>
[ 4.736279] [drm] add ip block number 6 <dm>
[ 4.736280] [drm] add ip block number 7 <uvd_v4_2>
[ 4.736282] [drm] add ip block number 8 <vce_v2_0>
[ 5.005666] amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ROM BAR
[ 5.005673] amdgpu: ATOM BIOS: MS-V30823-F6
[ 5.005721] [drm] GPU posting now...
[ 5.018960] [drm] PCIE gen 2 link speeds already enabled
[ 5.018970] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[ 5.023944] amdgpu 0000:03:00.0: amdgpu: VRAM: 8192M 0x000000F400000000 - 0x000000F5FFFFFFFF (8192M used)
[ 5.023951] amdgpu 0000:03:00.0: amdgpu: GART: 1024M 0x000000FF00000000 - 0x000000FF3FFFFFFF
[ 5.023967] [drm] Detected VRAM RAM=8192M, BAR=256M
[ 5.023969] [drm] RAM width 512bits GDDR5
[ 5.023988] [drm] amdgpu: 8192M of VRAM memory ready
[ 5.023991] [drm] amdgpu: 1181M of GTT memory ready.
[ 5.023998] [drm] GART: num cpu pages 262144, num gpu pages 262144
[ 5.024556] [drm] PCIE GART of 1024M enabled (table at 0x000000F400000000).
[ 5.069604] amdgpu: hwmgr_sw_init smu backed is ci_smu
[ 5.069652] intel_rapl_common: Found RAPL domain package
[ 5.069655] intel_rapl_common: Found RAPL domain core
[ 5.069656] intel_rapl_common: Found RAPL domain uncore
[ 5.069657] intel_rapl_common: Found RAPL domain dram
[ 5.070486] [drm] Found UVD firmware Version: 1.64 Family ID: 9
[ 5.078070] [drm] Found VCE firmware Version: 50.10 Binary ID: 2
[ 5.110017] [drm] dce110_link_encoder_construct: Failed to get encoder_cap_info from VBIOS with error code 4!
[ 5.110050] [drm] dce110_link_encoder_construct: Failed to get encoder_cap_info from VBIOS with error code 4!
[ 5.110079] [drm] dce110_link_encoder_construct: Failed to get encoder_cap_info from VBIOS with error code 4!
[ 5.110099] [drm] Display Core initialized with v3.2.149!
[ 5.112540] snd_hda_intel 0000:03:00.1: bound 0000:03:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
[ 5.149636] [drm] UVD initialized successfully.
[ 5.269685] [drm] VCE initialized successfully.
[ 5.273454] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
```
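The last line shows `kfd` allocating 3969056 bytes on the `gart`, which is about 3.8 MiB, not 4 GB. GART is the Graphics Address Remapping Table, the aperture through which the GPU addresses pages of system memory.
The numbers to compare against the hardware are the 8192M of VRAM and the 1024M GART reported earlier in the log.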
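The sizes in the log can be checked with a little arithmetic. This is a minimal sketch: the addresses and the byte count are copied from the `dmesg` lines above, while the helper name `range_mib` is made up for illustration.

```python
# Sanity-check the sizes amdgpu printed, using values copied from the log above.

MIB = 1024 * 1024


def range_mib(start: int, end: int) -> float:
    """Size in MiB of an inclusive address range [start, end]."""
    return (end - start + 1) / MIB


# "VRAM: 8192M 0x000000F400000000 - 0x000000F5FFFFFFFF"
vram_mib = range_mib(0x000000F400000000, 0x000000F5FFFFFFFF)

# "GART: 1024M 0x000000FF00000000 - 0x000000FF3FFFFFFF"
gart_mib = range_mib(0x000000FF00000000, 0x000000FF3FFFFFFF)

# "kfd kfd: amdgpu: Allocated 3969056 bytes on gart"
kfd_gart_mib = 3969056 / MIB

print(f"VRAM aperture:          {vram_mib:.0f} MiB")
print(f"GART aperture:          {gart_mib:.0f} MiB")
print(f"kfd allocation on GART: {kfd_gart_mib:.2f} MiB")
```

The two address ranges work out to 8192 MiB and 1024 MiB, agreeing with the sizes the driver prints, and the `kfd` allocation comes to roughly 3.79 MiB.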
```
[ 15.258622] amdgpu: VI should always have 2 performance levels
[ 178.440772] [drm] PCIE gen 2 link speeds already enabled
[ 178.448020] [drm] PCIE GART of 1024M enabled (table at 0x000000F400000000).
[ 178.450788] amdgpu 0000:03:00.0: amdgpu: SRBM_SOFT_RESET=0x00100040
[ 178.500169] [drm] UVD initialized successfully.
[ 178.620189] [drm] VCE initialized successfully.
[ 178.621247] amdgpu: SW scheduler is used
[ 178.671490] amdgpu 0000:03:00.0: [drm] Cannot find any crtc or sizes
[ 179.165318] ------------[ cut here ]------------
[ 179.165321] Load non-HWS mqd while stopped
[ 179.165339] WARNING: CPU: 0 PID: 1519 at drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device_queue_manager.c:389 create_queue_nocpsch+0x372/0x710 [amdgpu]
[ 179.165591] Modules linked in: nls_iso8859_1 intel_rapl_msr amdgpu intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp mei_hdcp kvm_intel kvm snd_hda_codec_realtek crct10dif_pclmul snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi ghash_clmulni_intel snd_hda_intel cryptd snd_intel_dspcfg snd_intel_sdw_acpi rapl snd_hda_codec binfmt_misc snd_hda_core snd_hwdep intel_cstate iommu_v2 gpu_sched snd_pcm radeon i915 snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device drm_ttm_helper snd_timer ttm joydev input_leds drm_kms_helper cec snd rc_core i2c_algo_bit fb_sys_fops syscopyarea sysfillrect mei_me sysimgblt soundcore mei mac_hid sch_fq_codel msr parport_pc ppdev lp parport ramoops reed_solomon efi_pstore drm pstore_blk pstore_zone ip_tables x_tables autofs4 hid_generic usbhid hid i2c_i801 crc32_pclmul i2c_smbus r8169 xhci_pci ahci libahci realtek lpc_ich xhci_pci_renesas video
[ 179.165647] CPU: 0 PID: 1519 Comm: clinfo Not tainted 5.15.0-78-generic #85~20.04.1-Ubuntu
[ 179.165650] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H81 Pro BTC R2.0, BIOS P1.20 07/22/2014
[ 179.165651] RIP: 0010:create_queue_nocpsch+0x372/0x710 [amdgpu]
[ 179.165831] Code: 0f b6 3d d6 ed 68 00 41 80 ff 01 0f 87 23 5f 36 00 41 83 e7 01 75 15 48 c7 c7 08 f4 37 c1 c6 05 b8 ed 68 00 01 e8 94 07 c3 cc <0f> 0b 49 8b 45 10 4c 89 70 08 49 89 06 48 8b 45 c0 49 89 46 08 4d
[ 179.165834] RSP: 0018:ffff9cba4150bbd0 EFLAGS: 00010286
[ 179.165836] RAX: 0000000000000000 RBX: ffff8c5f81149800 RCX: 0000000000000027
[ 179.165838] RDX: 0000000000000027 RSI: ffff9cba4150ba10 RDI: ffff8c6080220588
[ 179.165839] RBP: ffff9cba4150bc18 R08: ffff8c6080220580 R09: 0000000000000001
[ 179.165840] R10: 0000000000000001 R11: 0000000000000020 R12: 0000000000000000
[ 179.165841] R13: ffff8c5f98d5c610 R14: ffff8c5f82d4dc00 R15: 0000000000000000
[ 179.165842] FS: 00007fb78370eb80(0000) GS:ffff8c6080200000(0000) knlGS:0000000000000000
[ 179.165845] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 179.165846] CR2: 0000000001018040 CR3: 0000000004840006 CR4: 00000000000706f0
[ 179.165848] Call Trace:
[ 179.165849] <TASK>
[ 179.165852] pqm_create_queue+0x191/0x450 [amdgpu]
[ 179.166012] kfd_ioctl_create_queue+0xd3/0x2c0 [amdgpu]
[ 179.166167] kfd_ioctl+0x2f9/0x480 [amdgpu]
[ 179.166363] ? kfd_ioctl_dbg_address_watch+0x190/0x190 [amdgpu]
[ 179.166557] ? init_generic_mmio_info+0x52a2/0x8a80 [i915]
[ 179.166692] ? __fget_light+0xdc/0x110
[ 179.166697] __x64_sys_ioctl+0x95/0xd0
[ 179.166700] do_syscall_64+0x5c/0xc0
[ 179.166706] ? exit_to_user_mode_prepare+0x3d/0x1c0
[ 179.166710] ? do_user_addr_fault+0x1e0/0x660
[ 179.166714] ? irqentry_exit_to_user_mode+0x9/0x20
[ 179.166726] RIP: 0033:0x7fb78399e3ab
[ 179.166729] Code: 0f 1e fa 48 8b 05 e5 7a 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b5 7a 0d 00 f7 d8 64 89 01 48
[ 179.166731] RSP: 002b:00007ffc49475068 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 179.166735] RAX: ffffffffffffffda RBX: 00007ffc49475100 RCX: 00007fb78399e3ab
[ 179.166737] RDX: 00007ffc49475100 RSI: 00000000c0584b02 RDI: 0000000000000005
[ 179.166738] RBP: 00000000c0584b02 R08: 00000000000000a0 R09: 0000000000000000
[ 179.166739] R10: 00007ffc49475100 R11: 0000000000000246 R12: 0000000000000000
[ 179.166741] R13: 0000000000000005 R14: 0000000001018000 R15: 0000000000000001
[ 179.166744] </TASK>
[ 179.166745] ---[ end trace 58857c63ca0499f5 ]---
[ 223.552951] ------------[ cut here ]------------
```
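The warning `Load non-HWS mqd while stopped` fires in `create_queue_nocpsch` when `clinfo` issues the `kfd` create-queue ioctl. Further down in the dmesg log, queue creation fails outright.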
```
[ 223.556713] Fixing recursive fault but reboot is needed!
[ 265.362935] amdgpu: Can't create new usermode queue because -1 queues were already created
[ 265.362942] amdgpu: Pasid 0x8003 DQM create queue type 0 failed. ret -1
```
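To avoid scrolling through the whole log for these lines, the failures can be filtered out with a short script. This is a sketch under assumptions: the keyword list is hand-picked from the messages seen in this log and is not exhaustive, and `find_problems` is a name invented here.

```python
import re

# Matches the "[ 179.165321] message" shape seen in the log above.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

# Substrings picked from the messages in this log; an assumption, not a complete list.
KEYWORDS = ("WARNING:", "failed", "Can't create", "recursive fault")


def find_problems(log_text):
    """Return (timestamp, message) pairs for lines containing any keyword."""
    hits = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not look like dmesg entries
        ts, msg = float(m.group(1)), m.group(2)
        if any(k in msg for k in KEYWORDS):
            hits.append((ts, msg))
    return hits


# A few lines copied from the log above, used as sample input.
SAMPLE = """\
[ 178.621247] amdgpu: SW scheduler is used
[ 179.165339] WARNING: CPU: 0 PID: 1519 at drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device_queue_manager.c:389 create_queue_nocpsch+0x372/0x710 [amdgpu]
[ 223.556713] Fixing recursive fault but reboot is needed!
[ 265.362935] amdgpu: Can't create new usermode queue because -1 queues were already created
[ 265.362942] amdgpu: Pasid 0x8003 DQM create queue type 0 failed. ret -1
"""

for ts, msg in find_problems(SAMPLE):
    print(f"{ts:10.6f}  {msg}")
```

To run it against a real log, save the output of `dmesg` to a file and pass the file's contents to `find_problems` in place of `SAMPLE`.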
## Use `dkms` after all
I avoided `dkms` (Dynamic Kernel Module Support) because it failed to build in Ubuntu 22.04 due to
preprocessor / gcc errors, but I now think that was due to the kernel 6.x not being supported / tested
with `amdgpu`.
This long and detailed thread seems to indicate that `amdgpu-dkms`, on kernel 5.15.x, might fix the
problem with `rocr-opencl`. I'm still not sure how `rocm`, `rocr`, `opencl`, and the
`amdgpu` driver interact. I do know now that `ROCm` originally stood for Radeon Open Compute, with the `m` taken from platforM.
https://github.com/RadeonOpenCompute/ROCm/issues/1624
That's the next thing to try,
after we compile a test `OpenCL` program in the next step.
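Before installing `amdgpu-dkms`, it is worth confirming which kernel series is actually running. A minimal sketch, assuming the 5.15 series is the target (that comes from these notes, not from an official support matrix); `kernel_series` and `on_target_series` are hypothetical helper names.

```python
import platform
import re

# (5, 15) is the series these notes point at; an assumption, not a support matrix.
TARGET_SERIES = (5, 15)


def kernel_series(release):
    """Extract (major, minor) from a release string such as '5.15.0-78-generic'."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(m.group(1)), int(m.group(2))


def on_target_series(release):
    """True when the release string belongs to TARGET_SERIES."""
    return kernel_series(release) == TARGET_SERIES


running = platform.release()  # same string that `uname -r` prints on Linux
try:
    verdict = "matches" if on_target_series(running) else "does not match"
except ValueError:
    verdict = "could not be compared with"
print(f"running kernel {running} {verdict} the 5.15 series")
```

The release string in the trace above, `5.15.0-78-generic`, parses to `(5, 15)`, so that boot was already on the target series.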
## Use ROCm after all
According to this forum thread, this may just be a lack of support for this card in the ROCm stack that `OpenCL` depends on.
If so, I should find the thread that discusses `AMD` discontinuing support for the R9 / Hawaii / gfx7, start with
the ROCm version and driver from just before that point, and work my way backwards through older releases if needed.
https://forum.level1techs.com/t/amd-r9-390-finally-usable-on-linux/131922