Streams

To incorporate streams into the code, we hardcoded two streams. We chose two because that was the number of kernels or memory copies that could be in flight without having to worry about desynchronizing. The following was done:

1. Split the memory copy of the two arrays into two different streams.
2. Launched the kernels on streams to try to get the calls to run concurrently.
3. Commented out or removed unnecessary synchronization calls.

In the end, the output generated by TOMO shows little or no improvement. The 30% speedup we originally saw was not real: it appeared only because the last part of the arrays was not actually being calculated.

The visual profiler shows a very minor speedup. Our memcpy/compute overlap is still low, at 4.5% compared to 0% before. The biggest consumer of run time is ptomo_core. Unfortunately, although it accounts for 89% of the compute time, each individual call is fairly short. There are just a lot of them.

Theoretically, we should be able to split the arrays so that we process one chunk at a time. However, every attempt I have tried results in a longer execution time. This makes me believe that I am unable to get kernel-kernel concurrency.

With the streams we are able to get some concurrency: [http://imgur.com/sJvqa]
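The three steps above can be sketched roughly as follows. This is only an illustration, not the TOMO code: scale_kernel is a hypothetical stand-in for ptomo_core (which is not shown here), and the array size, block size, and the CUDA_CHECK macro are assumptions made for the example. It splits one array across two streams, queues the copy in, the kernel, and the copy out on each stream, and synchronizes only once at the end. The final loop checks every element, including the tail of the array, which is the kind of check that would catch a chunking bug like the one behind the false 30% speedup.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Assumed helper macro (not from the original code): abort on any CUDA error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical stand-in for ptomo_core: multiplies each element by a factor.
__global__ void scale_kernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int kStreams = 2;       // two hardcoded streams, as in the notes
    const int n = 1 << 20;        // assumed array length
    const int threads = 256;      // assumed block size

    float *h_data = nullptr, *d_data = nullptr;
    // Pinned host memory: async copies from pageable memory cannot overlap.
    CUDA_CHECK(cudaMallocHost((void **)&h_data, n * sizeof(float)));
    CUDA_CHECK(cudaMalloc((void **)&d_data, n * sizeof(float)));
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s)
        CUDA_CHECK(cudaStreamCreate(&streams[s]));

    // Round the chunk size up so the last chunk covers the array tail.
    const int chunk = (n + kStreams - 1) / kStreams;
    for (int s = 0; s < kStreams; ++s) {
        const int offset = s * chunk;
        const int count = (offset + chunk <= n) ? chunk : n - offset;
        if (count <= 0) continue;
        const size_t bytes = count * sizeof(float);
        const int blocks = (count + threads - 1) / threads;

        // Step 1: copy this chunk in on its own stream.
        CUDA_CHECK(cudaMemcpyAsync(d_data + offset, h_data + offset, bytes,
                                   cudaMemcpyHostToDevice, streams[s]));
        // Step 2: launch the kernel on the same stream.
        scale_kernel<<<blocks, threads, 0, streams[s]>>>(d_data + offset,
                                                         count, 2.0f);
        CUDA_CHECK(cudaGetLastError());
        CUDA_CHECK(cudaMemcpyAsync(h_data + offset, d_data + offset, bytes,
                                   cudaMemcpyDeviceToHost, streams[s]));
    }

    // Step 3: no synchronization inside the loop, one per stream at the end.
    for (int s = 0; s < kStreams; ++s)
        CUDA_CHECK(cudaStreamSynchronize(streams[s]));

    // Verify every element so an uncomputed tail cannot pass as a speedup.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_data[i] != 2.0f) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    for (int s = 0; s < kStreams; ++s)
        CUDA_CHECK(cudaStreamDestroy(streams[s]));
    CUDA_CHECK(cudaFree(d_data));
    CUDA_CHECK(cudaFreeHost(h_data));
    return mismatches == 0 ? 0 : 1;
}
```

Assuming the file is saved as streams_sketch.cu, it should build with nvcc streams_sketch.cu -o streams_sketch. Whether the two kernels actually overlap depends on the device and on how much of the GPU each launch occupies, so a sketch like this may show copy/compute overlap in the profiler without showing any kernel-kernel concurrency.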