cudaDeviceSynchronize(), Synchronous vs. Asynchronous Execution, and Timers in CUDA
Today I happened to test the simpleTexture example that ships with CUDA. In my previous blog post I took a close look at texture memory and modified this example; see that post for the full code.
Here I'll still paste the important part of the main function. When the program looks like this:
sdkCreateTimer(&timer);
sdkStartTimer(&timer);
// Execute the kernel (dimGrid/dimBlock as defined earlier in the simpleTexture sample)
transformKernel_integer<<<dimGrid, dimBlock>>>(dData, width, height);
// Check if kernel execution generated an error
getLastCudaError("Kernel execution failed");
checkCudaErrors(cudaDeviceSynchronize()); // synchronize first
sdkStopTimer(&timer);                     // then stop the timer
printf("Processing time: %f (ms)\n", sdkGetTimerValue(&timer));
printf("%.2f Mpixels/sec\n",
       (width * height / (sdkGetTimerValue(&timer) / 1000.0f)) / 1e6);
[Screenshot: timing output of this version]
And when the program looks like this instead:
sdkCreateTimer(&timer);
sdkStartTimer(&timer);
// Execute the kernel
transformKernel_integer<<<dimGrid, dimBlock>>>(dData, width, height);
// Check if kernel execution generated an error
getLastCudaError("Kernel execution failed");
sdkStopTimer(&timer);                     // stop the timer right after the kernel launch statement
checkCudaErrors(cudaDeviceSynchronize()); // and only then synchronize
printf("Processing time: %f (ms)\n", sdkGetTimerValue(&timer));
printf("%.2f Mpixels/sec\n",
       (width * height / (sdkGetTimerValue(&timer) / 1000.0f)) / 1e6);
[Screenshot: timing output of this version]
The results:
1. Both versions produce data files that exactly match the reference file.
2. The second version's processing time is 1/1000 of the first's. My friends and I were stunned... that can't be right!!!
And in my previous post I was even complaining about how slow texture was. So what on earth is going on?!
1. Is it a problem with cudaDeviceSynchronize()???
2. Is my measurement method wrong????
3. Is it a texture problem????
After mulling it over, first the documentation:
From the CUDA Toolkit Reference Manual:
5.1.2.12 __cudart_builtin__ cudaError_t cudaDeviceSynchronize (void)
Blocks until the device has completed all preceding requested tasks. cudaDeviceSynchronize() returns an error if one of the preceding tasks has failed. If the cudaDeviceScheduleBlockingSync flag was set for this device, the host thread will block until the device has finished its work.
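In other words, cudaDeviceSynchronize() also propagates errors from earlier asynchronous work back to the host. A minimal sketch of using its return value that way (the checkCudaErrors macro in the samples does essentially this):

// Minimal sketch: catching a failure from previously launched asynchronous work.
cudaError_t err = cudaDeviceSynchronize();
if (err != cudaSuccess)
{
    fprintf(stderr, "A preceding task failed: %s\n", cudaGetErrorString(err));
}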
From the CUDA C Programming Guide:
3.2.5 Asynchronous Concurrent Execution
3.2.5.1 Concurrent Execution between Host and Device
In order to facilitate concurrent execution between host and device, some function calls are asynchronous: Control is returned to the host thread before the device has completed the requested task. These are:
‣ Kernel launches;
‣ Memory copies between two addresses to the same device memory;
‣ Memory copies from host to device of a memory block of 64 KB or less;
‣ Memory copies performed by functions that are suffixed with Async;
‣ Memory set function calls.
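That first bullet is exactly what bit me: the launch statement returns immediately, so host code right after it runs while the kernel is still in flight. A minimal sketch of the pattern, where myKernel and do_unrelated_cpu_work are hypothetical placeholders:

// Minimal sketch, assuming a hypothetical kernel myKernel and CPU helper:
myKernel<<<dimGrid, dimBlock>>>(dData); // returns to the host immediately
do_unrelated_cpu_work();                // runs while the GPU is still busy
cudaDeviceSynchronize();                // only here does the host wait for the kernel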
Programmers can globally disable asynchronous kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is provided for debugging purposes only and should never be used as a way to make production software run reliably.
Kernel launches are synchronous in the following cases:
‣ The application is run via a debugger or memory checker (cuda-gdb, cuda-memcheck, Nsight) on a device of compute capability 1.x;
‣ Hardware counters are collected via a profiler (Nsight, Visual Profiler).
3.2.5.5.2 Default Stream
Kernel launches and host <-> device memory copies that do not specify any stream parameter, or equivalently that set the stream parameter to zero, are issued to the default stream. They are therefore executed in order.
From B.18 Execution Configuration:
The execution configuration is specified by inserting an expression of the form <<< Dg, Db, Ns, S >>> between the function name and the parenthesized argument list, where:
...
S is of type cudaStream_t and specifies the associated stream; S is an optional argument which defaults to 0.
So when setting a kernel's execution configuration, we usually don't bother with the fourth parameter S, but it defaults to 0, i.e. the default stream.
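This means my <<<dimGrid, dimBlock>>> launch is shorthand for launching on stream 0. A minimal sketch of spelling the configuration out, and of using a non-default stream instead (myKernel is again a hypothetical placeholder):

// <<<Dg, Db>>> is equivalent to <<<Dg, Db, 0, 0>>>: no dynamic shared memory, default stream.
myKernel<<<dimGrid, dimBlock, 0, 0>>>(dData);

// Launching on an explicitly created stream instead:
cudaStream_t stream;
cudaStreamCreate(&stream);
myKernel<<<dimGrid, dimBlock, 0, stream>>>(dData);
cudaStreamSynchronize(stream);  // wait only for this stream
cudaStreamDestroy(stream);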
3.2.5.5.3 Explicit Synchronization
cudaDeviceSynchronize() waits until all preceding commands in all streams of all host threads have completed.
cudaStreamSynchronize() takes a stream as a parameter and waits until all preceding commands in the given stream have completed. It can be used to synchronize the host with a specific stream, allowing other streams to continue executing on the device.
cudaStreamWaitEvent() takes a stream and an event as parameters (see Events for a description of events) and makes all the commands added to the given stream after the call to cudaStreamWaitEvent() delay their execution until the given event has completed. The stream can be 0, in which case all the commands added to any stream after the call to cudaStreamWaitEvent() wait on the event.
cudaStreamQuery() provides applications with a way to know if all preceding commands in a stream have completed.
To avoid unnecessary slowdowns, all these synchronization functions are usually best used for timing purposes or to isolate a launch or memory copy that is failing.
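cudaStreamWaitEvent() is the one that lets two streams depend on each other without blocking the host. A minimal sketch, where producer/consumer are hypothetical kernels and s1/s2 are pre-created streams:

// Minimal sketch: make stream s2 wait for work queued in stream s1,
// without blocking the host thread.
cudaEvent_t done;
cudaEventCreate(&done);

producer<<<dimGrid, dimBlock, 0, s1>>>(dData);
cudaEventRecord(done, s1);          // mark the point in s1 to wait on

cudaStreamWaitEvent(s2, done, 0);   // s2 stalls here until 'done' completes
consumer<<<dimGrid, dimBlock, 0, s2>>>(dData);

cudaEventDestroy(done);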
3.2.5.5.4 Implicit Synchronization
Two commands from different streams cannot run concurrently if any one of the following operations is issued in-between them by the host thread:
...
...
‣ Blocks all later kernel launches from any stream in the CUDA context until the kernel launch being checked is complete.
Operations that require a dependency check include any other commands within the same stream as the launch being checked and any call to cudaStreamQuery() on that stream. Therefore, applications should follow these guidelines to improve their potential for concurrent kernel execution (see the sketch after this list):
‣ All independent operations should be issued before dependent operations,
‣ Synchronization of any kind should be delayed as long as possible.
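A minimal sketch of what those two guidelines look like in practice, assuming hypothetical kernels step1/step2 and an array of pre-created streams[]:

// Good: queue all independent work across streams first...
for (int i = 0; i < nStreams; ++i)
    step1<<<dimGrid, dimBlock, 0, streams[i]>>>(dData[i]);
for (int i = 0; i < nStreams; ++i)
    step2<<<dimGrid, dimBlock, 0, streams[i]>>>(dData[i]);

// ...and synchronize once, as late as possible.
cudaDeviceSynchronize();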
After that baptism by documentation, let's search around for how others see this problem:
The guy asking the question had exactly my problem, though he only saw a 7~12x speed difference... mine was a factor of 1000...
The top-voted answer, given by aland, explains this huge time difference very well:
Although CUDA kernel launches are asynchronous, all GPU-related tasks placed in one stream (which is default behaviour) are executed sequentially.
So, for example,
kernel1<<<X,Y>>>(...); // kernel starts execution, CPU continues to next statement
kernel2<<<X,Y>>>(...); // kernel is placed in queue and will start after kernel1 finishes, CPU continues to next statement
cudaMemcpy(...); // CPU blocks until memory is copied, memory copy starts only after kernel2 finishes
So in your example there is no need for cudaDeviceSynchronize. However, it might be useful for debugging to detect which of your kernels has caused an error (if there is any).
This basically explains the time difference. We can go on to verify this answer's, and the documentation's, views on synchronization (explicit and implicit) and asynchrony ourselves.
Modify the code once more, like this:
sdkCreateTimer(&timer);
sdkStartTimer(&timer);
// Execute the kernel
transformKernel_integer<<<dimGrid, dimBlock>>>(dData, width, height);
// Check if kernel execution generated an error
getLastCudaError("Kernel execution failed");
// Measure once before synchronizing
sdkStopTimer(&timer);
printf("Processing time: %f (ms)\n", sdkGetTimerValue(&timer));
sdkResetTimer(&timer); // reset the timer
sdkStartTimer(&timer); // and start it again
checkCudaErrors(cudaDeviceSynchronize());
// Measure again after the synchronization
sdkStopTimer(&timer);
printf("After sync Processing time: %f (ms)\n", sdkGetTimerValue(&timer));
printf("%.2f Mpixels/sec\n",
       (width * height / (sdkGetTimerValue(&timer) / 1000.0f)) / 1e6);
sdkDeleteTimer(&timer);
This produces the new verification result:
[Screenshot: before-sync and after-sync timing output]
All I can say is, a result like this brought me right back down to earth... o(╯□╰)o... and here I thought it was so fast...
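A side note for next time: instead of a host-side timer plus cudaDeviceSynchronize(), CUDA events can time the GPU work directly. A minimal sketch, reusing the kernel and launch configuration from above:

// Timing the kernel with CUDA events instead of a host timer.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
transformKernel_integer<<<dimGrid, dimBlock>>>(dData, width, height);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);              // wait until the stop event has been reached

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time between the two events, in ms
printf("Kernel time: %f (ms)\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);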
bwb@HUST 2014.03.28
