[pocl] Back @private Scratchpad with GPUCompiler.alloca #714

Draft
vchuravy wants to merge 2 commits into main from vc/alloca_intrinsic

@vchuravy

Member

Summary

Replaces the POCL back-end's MArray-backed @private scratchpad with a direct per-workitem stack allocation via GPUCompiler.alloca. The returned Ptr is wrapped in a CLDeviceArray over OpenCL "Function" storage (LLVM addrspace 0), which is where the SPIR-V target places allocas.

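@device_override @inline function KA.Scratchpad(ctx, ::Type{T}, ::Val{Dims}) where {T, Dims}
    # One stack allocation per workitem, sized for prod(Dims) elements of T.
    ptr = POCL.GPUCompiler.alloca(T, Val(prod(Dims)))
    # View the pointer as Function storage (addrspace 0) and wrap it with the requested dims.
    CLDeviceArray(Dims, reinterpret(POCL.LLVMPtr{T, POCL.AS.Function}, ptr))
end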

This drops the StaticArrays dependency from the POCL back-end (StaticArrays is still used by the CPU back-end).

Why

GPUCompiler.alloca emits a real entry-block alloca, in the target's alloca address space, that the optimizer can promote. This avoids both the unsoundness of emitting an alloca through llvmcall and the overhead and semantics of MArray. See the motivation in the companion GPUCompiler PR (JuliaGPU/GPUCompiler.jl#859).

Alignment

The alloca is aligned to Base.datatype_alignment(T), which is exactly the alignment CLDeviceArray uses for its element loads/stores (alignment(::CLDeviceArray{T})), so accesses are consistent. Element types that are isbits unions are intentionally unsupported (GPUCompiler.alloca guards on isbitstype(T)).
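For reference, a host-side sketch of the two checks described above. `scratch_alignment` is a hypothetical helper written for illustration, not part of this PR or of GPUCompiler; it only mirrors the stated `isbitstype(T)` guard and the `Base.datatype_alignment(T)` lookup.

```julia
# Hypothetical helper (illustration only): returns the alignment the alloca
# would get for T, or `nothing` if T fails the isbitstype guard.
function scratch_alignment(::Type{T}) where {T}
    isbitstype(T) || return nothing      # isbits unions are rejected here
    return Base.datatype_alignment(T)    # alignment in bytes
end

scratch_alignment(Float32)                  # 4 on common platforms
scratch_alignment(Union{Float32, Nothing})  # nothing: an isbits union is not an isbitstype
```

Note that `Union{Float32, Nothing}` satisfies `Base.isbitsunion` but not `isbitstype`, which is why it falls outside the supported element types.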

Status

Draft: depends on JuliaGPU/GPUCompiler.jl#859 (adds the alloca intrinsic). Project.toml compat is bumped to GPUCompiler = "1.23"; this PR can be un-drafted once that one is merged and released.

Testing

Verified end-to-end against the local GPUCompiler branch: @private Float32 (4,) lowers to alloca [16 x i8], align 4 in addrspace 0 with no surviving julia.gpu.alloca, and the kernel runs correctly on the POCL CPU device.
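The `[16 x i8]` above is consistent with the allocation being sized in bytes as `prod(Dims) * sizeof(T)`. A quick host-side check of that arithmetic, assuming this is how the size is derived (`scratch_bytes` is a hypothetical helper, not a GPUCompiler function):

```julia
# Hypothetical helper (illustration only): byte size of a scratchpad holding
# prod(dims) elements of T, assuming a dense layout with no padding.
scratch_bytes(::Type{T}, dims::Tuple) where {T} = prod(dims) * sizeof(T)

scratch_bytes(Float32, (4,))  # 4 elements * 4 bytes = 16
```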

🤖 Generated with Claude Code

vchuravy and others added 2 commits June 22, 2026 17:28
Back the POCL `Scratchpad` (`@private`) with `GPUCompiler.alloca`, a direct
per-workitem stack allocation, instead of a StaticArrays `MArray`. The returned
`Ptr` is wrapped in a `CLDeviceArray` over OpenCL "Function" storage (LLVM
addrspace 0), where the SPIR-V target places allocas. Its alignment
(`Base.datatype_alignment(T)`) matches `CLDeviceArray`'s element accesses.

Requires GPUCompiler 1.23 (JuliaGPU/GPUCompiler.jl#859), which adds the
`alloca` intrinsic. Drops the now-unused StaticArrays import from the POCL
back-end (StaticArrays is still used by the CPU back-end).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 22, 2026


Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 0.00%. Comparing base (a8022b2) to head (0b58f9b).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/pocl/backend.jl | 0.00% | 2 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (a8022b2) and HEAD (0b58f9b).

HEAD has 28 uploads less than BASE

| Flag | BASE (a8022b2) | HEAD (0b58f9b) |
|---|---|---|
| | 48 | 20 |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #714       +/-   ##
==========================================
- Coverage   62.51%   0.00%   -62.52%     
==========================================
  Files          23      22        -1     
  Lines        1926    1737      -189     
==========================================
- Hits         1204       0     -1204     
- Misses        722    1737     +1015     
