From 7996ab660d39bbf0b38c2fb3fa32c692bbed3c1e Mon Sep 17 00:00:00 2001 From: Paul Pham Date: Sun, 6 Aug 2023 11:26:30 -0700 Subject: [PATCH] Fill in what I remember from the 22.04 experiments. --- .../001-Ubuntu-22.04/01-First-Install.md | 53 +++++++ .../001-Ubuntu-22.04/02-Second-Install.md | 28 ++++ debugging/002-Ubuntu-20_04/06-dmesg.md | 143 ++++++++++++++++++ debugging/002-Ubuntu-20_04/07-OpenCL-Demo.md | 35 +++++ 4 files changed, 259 insertions(+) create mode 100644 debugging/001-Ubuntu-22.04/01-First-Install.md create mode 100644 debugging/001-Ubuntu-22.04/02-Second-Install.md create mode 100644 debugging/002-Ubuntu-20_04/06-dmesg.md create mode 100644 debugging/002-Ubuntu-20_04/07-OpenCL-Demo.md diff --git a/debugging/001-Ubuntu-22.04/01-First-Install.md b/debugging/001-Ubuntu-22.04/01-First-Install.md new file mode 100644 index 0000000..7be5cce --- /dev/null +++ b/debugging/001-Ubuntu-22.04/01-First-Install.md @@ -0,0 +1,53 @@ +# First Install + +I first chose Ubuntu-22.04 because it's the latest Long Term Support (LTS) +release and I was reading about ROCm, the AMD equivalent of CUDA for nVidia, +and it had 22.04 Ubuntu releases. + +I assumed that whatever problems came from driver / package mismatches and errors in +bleeding-edge releases would be balanced by frequent fixes. + +However, I eventually abandoned this approach because it became clear from +multiple sources that the `amdgpu` unified open source driver stopped supporting +Hawaii chipsets (sometimes called `gfx7`) which are the hardware / OEM names for +the R9 390s. + +I'm briefly summarizing my approach here to keep straight what I've already tried +and why, in case it's useful to try again in the future for better hardware support, etc. + +I installed three times, until the machine appeared borked (the screen went blank on +bootup, or showed a sick-computer black-and-white screen due to an unrecoverable kernel error).
+ +In retrospect, these seem likely to have been solvable by two methods: + +* uninstalling newer kernel versions and downgrading to a kernel that others have reported success with +* fixing kernel boot parameters in GRUB and trying again. + +## On-board Display + +A key part of this phase of the debugging was setting the "Advanced Configuration" +"Chipset Configuration" part of the UEFI / BIOS to use the on-board graphics for display, +rather than the GPU. That's because we're using the VGA port on the mainboard. + +## The Latest Unified AMDGPU Driver + + +## The Latest Radeon Driver + +I still don't know the difference between the two kinds of drivers. + +## DKMS Compile Errors + +The end of this line of inquiry came from repeatedly encountering DKMS compile errors. +The modules got built under `/usr/src/module/amdgpu` or some similar path, I believe. + +The make would fail due to a preprocessor error about the macro + +``` +#define NULL ((void *)0) +``` + +which looks legit to me. The make log would say that it is using gcc-11.4.0 +but the kernel was built with gcc-11.3.0. + + diff --git a/debugging/001-Ubuntu-22.04/02-Second-Install.md b/debugging/001-Ubuntu-22.04/02-Second-Install.md new file mode 100644 index 0000000..4a78ea4 --- /dev/null +++ b/debugging/001-Ubuntu-22.04/02-Second-Install.md @@ -0,0 +1,28 @@ +# Second / Third Install + +During the second / third install, I believe I focused on flashing the GPUs +using the `amdvbflash` tool, the successor to `atiflash` that is +a command-line Linux tool. + +My hypothesis was that the overclocked BIOS was preventing the kernel drivers +from detecting the cards, as checked with `clinfo` or `rocm-smi`, but I now consider that unlikely. +Such a failure would more likely have shown up as unstable performance. + +The cards were not detected and the driver modules were not loaded in the kernel +at all. + +## PCIe Level Gen 2 + +There appears to be a speed negotiation setting in BIOS for communication +over the PCI Express bus.
+ +One forum thread recommended setting it to Gen 1, but I haven't found +any issues with speed connection so far. I might try it again later. + +## Netboot + +I set the BIOS / UEFI setting to Boot-on-LAN, which I assumed was +the same as Netboot, but continued getting to GRUB. + +Perhaps I need to disable all the other boot volumes in order to +get Netboot into the boot order. Something to try later. diff --git a/debugging/002-Ubuntu-20_04/06-dmesg.md b/debugging/002-Ubuntu-20_04/06-dmesg.md new file mode 100644 index 0000000..dfff359 --- /dev/null +++ b/debugging/002-Ubuntu-20_04/06-dmesg.md @@ -0,0 +1,143 @@ +# `dmesg` Output + +``` + 4.728018] amdgpu: unknown parameter 'modeset' ignored +[ 4.735631] [drm] amdgpu kernel modesetting enabled. +[ 4.736026] amdgpu: CRAT table not found +[ 4.736031] amdgpu: Virtual CRAT table created for CPU +[ 4.736042] amdgpu: Topology: Add CPU node +[ 4.736154] amdgpu 0000:03:00.0: enabling device (0000 -> 0003) +[ 4.736247] [drm] initializing kernel modesetting (HAWAII 0x1002:0x67B1 0x1462:0x2015 0x80). +[ 4.736251] amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported +[ 4.736262] [drm] register mmio base: 0xF0800000 +[ 4.736263] [drm] register mmio size: 262144 +[ 4.736266] [drm] PCIE atomic ops is not supported +[ 4.736273] [drm] add ip block number 0 +[ 4.736274] [drm] add ip block number 1 +[ 4.736275] [drm] add ip block number 2 +[ 4.736276] [drm] add ip block number 3 +[ 4.736277] [drm] add ip block number 4 +[ 4.736278] [drm] add ip block number 5 +[ 4.736279] [drm] add ip block number 6 +[ 4.736280] [drm] add ip block number 7 +[ 4.736282] [drm] add ip block number 8 +[ 5.005666] amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ROM BAR +[ 5.005673] amdgpu: ATOM BIOS: MS-V30823-F6 +[ 5.005721] [drm] GPU posting now... 
+[ 5.018960] [drm] PCIE gen 2 link speeds already enabled +[ 5.018970] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit +[ 5.023944] amdgpu 0000:03:00.0: amdgpu: VRAM: 8192M 0x000000F400000000 - 0x000000F5FFFFFFFF (8192M used) +[ 5.023951] amdgpu 0000:03:00.0: amdgpu: GART: 1024M 0x000000FF00000000 - 0x000000FF3FFFFFFF +[ 5.023967] [drm] Detected VRAM RAM=8192M, BAR=256M +[ 5.023969] [drm] RAM width 512bits GDDR5 +[ 5.023988] [drm] amdgpu: 8192M of VRAM memory ready +[ 5.023991] [drm] amdgpu: 1181M of GTT memory ready. +[ 5.023998] [drm] GART: num cpu pages 262144, num gpu pages 262144 +[ 5.024556] [drm] PCIE GART of 1024M enabled (table at 0x000000F400000000). +[ 5.069604] amdgpu: hwmgr_sw_init smu backed is ci_smu +[ 5.069652] intel_rapl_common: Found RAPL domain package +[ 5.069655] intel_rapl_common: Found RAPL domain core +[ 5.069656] intel_rapl_common: Found RAPL domain uncore +[ 5.069657] intel_rapl_common: Found RAPL domain dram +[ 5.070486] [drm] Found UVD firmware Version: 1.64 Family ID: 9 +[ 5.078070] [drm] Found VCE firmware Version: 50.10 Binary ID: 2 +[ 5.110017] [drm] dce110_link_encoder_construct: Failed to get encoder_cap_info from VBIOS with error code 4! +[ 5.110050] [drm] dce110_link_encoder_construct: Failed to get encoder_cap_info from VBIOS with error code 4! +[ 5.110079] [drm] dce110_link_encoder_construct: Failed to get encoder_cap_info from VBIOS with error code 4! +[ 5.110099] [drm] Display Core initialized with v3.2.149! +[ 5.112540] snd_hda_intel 0000:03:00.1: bound 0000:03:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu]) +[ 5.149636] [drm] UVD initialized successfully. +[ 5.269685] [drm] VCE initialized successfully. +[ 5.273454] kfd kfd: amdgpu: Allocated 3969056 bytes on gart +``` + +It appears to allocate 3969056 bytes, which is about 3.8 MiB rather than 4 GB, on the `gart` (the Graphics Address Remapping Table, which maps system memory for the GPU). +The part that does match hardware reality is the 8192M of VRAM reported above.
+ +``` +[ 15.258622] amdgpu: VI should always have 2 performance levels +[ 178.440772] [drm] PCIE gen 2 link speeds already enabled +[ 178.448020] [drm] PCIE GART of 1024M enabled (table at 0x000000F400000000). +[ 178.450788] amdgpu 0000:03:00.0: amdgpu: SRBM_SOFT_RESET=0x00100040 +[ 178.500169] [drm] UVD initialized successfully. +[ 178.620189] [drm] VCE initialized successfully. +[ 178.621247] amdgpu: SW scheduler is used +[ 178.671490] amdgpu 0000:03:00.0: [drm] Cannot find any crtc or sizes +[ 179.165318] ------------[ cut here ]------------ +[ 179.165321] Load non-HWS mqd while stopped +[ 179.165339] WARNING: CPU: 0 PID: 1519 at drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device_queue_manager.c:389 create_queue_nocpsch+0x372/0x710 [amdgpu] +[ 179.165591] Modules linked in: nls_iso8859_1 intel_rapl_msr amdgpu intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp mei_hdcp kvm_intel kvm snd_hda_co +dec_realtek crct10dif_pclmul snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi ghash_clmulni_intel snd_hda_intel cryptd snd_intel_dspcfg snd_intel_sdw_acpi rapl sn +d_hda_codec binfmt_misc snd_hda_core snd_hwdep intel_cstate iommu_v2 gpu_sched snd_pcm radeon i915 snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device dr +m_ttm_helper snd_timer ttm joydev input_leds drm_kms_helper cec snd rc_core i2c_algo_bit fb_sys_fops syscopyarea sysfillrect mei_me sysimgblt soundcore mei mac_hid sch_ +fq_codel msr parport_pc ppdev lp parport ramoops reed_solomon efi_pstore drm pstore_blk pstore_zone ip_tables x_tables autofs4 hid_generic usbhid hid i2c_i801 crc32_pcl +mul i2c_smbus r8169 xhci_pci ahci libahci realtek lpc_ich xhci_pci_renesas video +[ 179.165647] CPU: 0 PID: 1519 Comm: clinfo Not tainted 5.15.0-78-generic #85~20.04.1-Ubuntu +[ 179.165650] Hardware name: To Be Filled By O.E.M. 
To Be Filled By O.E.M./H81 Pro BTC R2.0, BIOS P1.20 07/22/2014 +[ 179.165651] RIP: 0010:create_queue_nocpsch+0x372/0x710 [amdgpu] +[ 179.165831] Code: 0f b6 3d d6 ed 68 00 41 80 ff 01 0f 87 23 5f 36 00 41 83 e7 01 75 15 48 c7 c7 08 f4 37 c1 c6 05 b8 ed 68 00 01 e8 94 07 c3 cc <0f> 0b 49 8b 45 10 4 +c 89 70 08 49 89 06 48 8b 45 c0 49 89 46 08 4d +[ 179.165834] RSP: 0018:ffff9cba4150bbd0 EFLAGS: 00010286 +[ 179.165836] RAX: 0000000000000000 RBX: ffff8c5f81149800 RCX: 0000000000000027 +[ 179.165838] RDX: 0000000000000027 RSI: ffff9cba4150ba10 RDI: ffff8c6080220588 +[ 179.165839] RBP: ffff9cba4150bc18 R08: ffff8c6080220580 R09: 0000000000000001 +[ 179.165840] R10: 0000000000000001 R11: 0000000000000020 R12: 0000000000000000 +[ 179.165841] R13: ffff8c5f98d5c610 R14: ffff8c5f82d4dc00 R15: 0000000000000000 +[ 179.165842] FS: 00007fb78370eb80(0000) GS:ffff8c6080200000(0000) knlGS:0000000000000000 +[ 179.165845] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 +[ 179.165846] CR2: 0000000001018040 CR3: 0000000004840006 CR4: 00000000000706f0 +[ 179.165848] Call Trace: +[ 179.165849] +[ 179.165852] pqm_create_queue+0x191/0x450 [amdgpu] +[ 179.166012] kfd_ioctl_create_queue+0xd3/0x2c0 [amdgpu] +[ 179.166167] kfd_ioctl+0x2f9/0x480 [amdgpu] +[ 179.166363] ? kfd_ioctl_dbg_address_watch+0x190/0x190 [amdgpu] +[ 179.166557] ? init_generic_mmio_info+0x52a2/0x8a80 [i915] +[ 179.166692] ? __fget_light+0xdc/0x110 +[ 179.166697] __x64_sys_ioctl+0x95/0xd0 +[ 179.166700] do_syscall_64+0x5c/0xc0 +[ 179.166706] ? exit_to_user_mode_prepare+0x3d/0x1c0 +[ 179.166710] ? do_user_addr_fault+0x1e0/0x660 +[ 179.166714] ? 
irqentry_exit_to_user_mode+0x9/0x20 +[ 179.166726] RIP: 0033:0x7fb78399e3ab +[ 179.166729] Code: 0f 1e fa 48 8b 05 e5 7a 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b5 7a 0d 00 f7 d8 64 89 01 48 +[ 179.166731] RSP: 002b:00007ffc49475068 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 +[ 179.166735] RAX: ffffffffffffffda RBX: 00007ffc49475100 RCX: 00007fb78399e3ab +[ 179.166737] RDX: 00007ffc49475100 RSI: 00000000c0584b02 RDI: 0000000000000005 +[ 179.166738] RBP: 00000000c0584b02 R08: 00000000000000a0 R09: 0000000000000000 +[ 179.166739] R10: 00007ffc49475100 R11: 0000000000000246 R12: 0000000000000000 +[ 179.166741] R13: 0000000000000005 R14: 0000000001018000 R15: 0000000000000001 +[ 179.166744] +[ 179.166745] ---[ end trace 58857c63ca0499f5 ]--- +[ 223.552951] ------------[ cut here ]------------ +``` + +Further down in the dmesg log, queue creation (`create_queue`) keeps failing. +``` +[ 223.556713] Fixing recursive fault but reboot is needed! +[ 265.362935] amdgpu: Can't create new usermode queue because -1 queues were already created +[ 265.362942] amdgpu: Pasid 0x8003 DQM create queue type 0 failed. ret -1 +``` + +## Use `dkms` after all + +I avoided `dkms`, Dynamic Kernel Module Support, because it failed to build in Ubuntu 22.04 due to +preprocessor / gcc errors, but I now think that was due to the kernel 6.x not being supported / tested +with `amdgpu`. + +This long and detailed thread seems to indicate that `amdgpu-dkms`, on kernel 5.15.x, might fix the +problem with `rocr-opencl`. I'm still not sure about the interaction between `rocm`, `rocr`, `opencl`, and the +`amdgpu` driver. I do know now that `ROCm` stands for Radeon Open Compute (not Open CL), with the `m` apparently from the end of "platform". + +https://github.com/RadeonOpenCompute/ROCm/issues/1624 + +That's the next thing to try, +after we compile a test `OpenCL` program in the next step.
+ +## Use ROCm after all + +According to this forum thread, the failure may just come from a lack of support in the ROCm stack that `OpenCL` depends on. +If so, I should find the thread that discusses `AMD` discontinuing support for the R9 / Hawaii / gfx7 and use +the ROCm version and driver from just before that, working my way backwards by downgrading. + +https://forum.level1techs.com/t/amd-r9-390-finally-usable-on-linux/131922 diff --git a/debugging/002-Ubuntu-20_04/07-OpenCL-Demo.md b/debugging/002-Ubuntu-20_04/07-OpenCL-Demo.md new file mode 100644 index 0000000..cd9d176 --- /dev/null +++ b/debugging/002-Ubuntu-20_04/07-OpenCL-Demo.md @@ -0,0 +1,35 @@ +# OpenCL Demo + +A toy program from +https://bbs.archlinux.org/viewtopic.php?id=254491 + +that I saved in the directory `./opencl-demo` and that you can build with `make` + +``` +cd ../opencl-demo +make +``` + +## Installing `opencl-headers` + +https://stackoverflow.com/a/45880123 + +The build can't find `CL/cl.h` at first, so install the headers: +``` +sudo apt install opencl-headers +``` + +## Installable Client Driver (ICD) + +I finally found out what this technology is. +It appears to let you compile against OpenCL +libraries in a platform-independent way +(the same compiled binary would run on Intel / AMD / Nvidia). + +https://stackoverflow.com/a/17456374 + +I can find the vendor definitions in +`/etc/OpenCL/vendors/` + + 