CUDA学习（五十六）-阿里云开发者社区

CUDA学习（五十六）

2018-02-22 2042

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

简介：

块内合作组的使用：
在本节中，通过一些使用示例说明了协作组功能。
发现模式：
通常开发人员需要使用当前活动的一组线程。没有对现有的线程进行假设，而是开发人员使用碰巧存在的线程。这在以下“在warp中跨线程聚合原子增量”示例中看到（使用正确的CUDA 9.0集内在函数编写）：

{
    unsigned int writemask = __activemask();
    unsigned int total = __popc(writemask);
    unsigned int prefix = __popc(writemask & __lanemask_lt());
    // Find the lowest-numbered active lane
    int elected_lane = __ffs(writemask) - 1;
    int base_offset = 0;
    if (prefix == 0) {
        base_offset = atomicAdd(p, total);
    }
    base_offset = __shfl_sync(writemask, base_offset, elected_lane);
    int thread_offset = prefix + base_offset;
    return thread_offset;
}

这可以通过合作组重新编写如下：

{
    cg::coalesced_group g = cg::coalesced_threads();
    int prev;
    if (g.thread_rank() == 0) {
        prev = atomicAdd(p, g.size());
    }
    prev = g.thread_rank() + g.shfl(prev, 0);
    return prev;
}

Warp-Synchronous代码模式：
开发人员可能已经有了warp-synchronous代码，他们以前对warp的大小做了隐含的假设，并会围绕这个数字进行编码。现在需要明确指定。

// If the size is known statically
auto g = tiled_partition<16>(this_thread_block());
// Can use g.shfl and all other warp-synchronous builtins

但是，用户可能希望更好地分配他的算法，但不需要warp-synchronous内建函数的优势。

auto g = tiled_partition(this_thread_block(), 8);

在这种情况下，组g仍然可以同步，您仍然可以在顶部构建各种并行算法，但不能访问shfl（）等。

__global__ void cooperative_kernel(...) {
    // obtain default "current thread block" group
    thread_group my_block = this_thread_block();
    // subdivide into 32-thread, tiled subgroups
    // Tiled subgroups evenly partition a parent group into
    // adjacent sets of threads - in this case each one warp in size
    thread_group my_tile = tiled_partition(my_block, 32);
    // This operation will be performed by only the
    // first 32-thread tile of each block
    if (my_block.thread_rank() < 32) {
        // ...
        my_tile.sync();
    }
}

Composition：
以前，在编写某些代码时，实现上存在着隐藏的限制。以这个例子：

device__ int sum(int *x, int n) {
    // ...
    __syncthreads();
    return total;
}
__global__ void parallel_kernel(float *x) {
    // ...
    // Entire thread block must call sum
    sum(x, n);
}

线程块中的所有线程必须到达__syncthreads（）屏障，但是，对于可能要使用sum（...）的开发人员而言，此约束是隐藏的。有了合作组，更好的书写方式是：

__device__ int sum(const thread_group& g, int *x, int n)
{
    // ...
    g.sync()
        return total;
}
__global__ void parallel_kernel(...)
{
    // ...
    // Entire thread block must call sum
    sum(this_thread_block(), x, n);
    // ...
}

timg

CUDA学习（五十六）

热门文章

最新文章

相关课程

相关电子书