Part A - Introduction

The Eco-System

Describe the heterogeneous computing ecosystem
Define multi-processor terminology and describe classifications
Introduce the language extensions that enable parallel programming

Manufacturers | Multi-Processors | Languages | Applications | Exercises

The ecosystem for heterogeneous computing includes multi-processor technology, serial programming languages, language extensions for parallel programming, the manufacturers of the computational units, and the applications that run on heterogeneous computers.  As power limitations demand greater energy efficiency, the need to program multi-processor technology increases. 

The Heterogeneous Computing Eco-System

This chapter introduces the major manufacturers of GPU multi-processor technology, defines the terminology associated with multi-processors, introduces some of the products currently available, and describes their performance and bandwidth differences.  This chapter concludes with a summary of the core programming languages and the extensions that the manufacturers provide for parallel programming in general, as well as a list of some of the domains that have benefitted from parallel processing.


Many modern tablets, notebooks, phones, desktops, and workstations contain both multi-core CPUs and many-core GPUs.  The companies that manufacture these computational units have followed two separate development strategies: discrete and fused.  The fused strategy prints multi-core CPUs and many-core GPUs on a single die.  The discrete strategy prints multi-core CPUs and many-core GPUs separately, allowing for mixtures from different manufacturers. 


Intel produces multi-processors that power over 85% of the world's Top500 supercomputers (June 2014).  The company has predicted growth of more than 20% in the High Performance Computing (HPC) market and 30% in the network market over the 5 year period between 2011 and 2016.  Its other target segments include enterprises, workstations and data centers. 

Intel's predictions

Intel released its discrete Xeon Phi coprocessor in Q2 of 2013.  Xeon Phi (code name Knights Corner) has 61 cores and can deliver well over 1 Teraflop of double precision performance.  Intel's next generation of Xeon Phi (code name Knights Landing) uses silicon photonics technology and is expected to deliver 3 Teraflops of performance.  Intel refers to the design of its Xeon Phi co-processors as Many Integrated Core Architecture (MIC). 


Nvidia produces discrete GPUs for three different market segments:

  • Tesla - high performance computing
  • Quadro - professional graphics
  • GeForce - digital gaming, entertainment

Tesla is Nvidia's many-core solution for double precision co-processing.  As of June 2014, 44 supercomputers in the Top500 list used Tesla GPUs, while 17 used Xeon Phi co-processors. 

Nvidia has introduced four different architectures over the last 7 years:

Nvidia architectures

Nvidia's roadmap also includes: 


Advanced Micro Devices (AMD) is the second largest global supplier of microprocessors based on the x86 architecture (after Intel).  AMD acquired ATI Technologies in 2006 and since then has competed with Nvidia for the discrete GPU market. 

AMD produces both discrete and fused hardware.  It manufactures discrete GPUs for three distinct market segments:

  • Radeon - digital gaming, entertainment
  • FirePro - professional graphics
  • Opteron - high performance computing

AMD's roadmap through 2016 identifies four priorities:

  • embedded computing
  • dense servers
  • high performance GPUs
  • professional graphics

AMD plans to stay with 28-nanometer technology, moving down to 20-nanometer and 16/14-nanometer FinFET technologies only on a case-by-case basis where the investment is clearly justified. 

Through Project Skybridge AMD plans to deliver families of x86 and ARM chips that are pin-compatible. 

Project Skybridge


Under its fused strategy, AMD manufactures Accelerated Processing Units (APUs).  APUs combine CPU and GPU technology on a single die.  The benefits of fusing these two computational units are high performance and low power consumption.  The overhead involved in transferring vast amounts of data between a CPU and a GPU is eliminated.  Data is directly accessible to both units and does not need to be loaded from system memory into GPU memory or vice versa before the GPU or CPU can access it. 

AMD released the third generation of APU technology (code name Kaveri) in 2014.


On June 12 2012, AMD and four other hardware vendors created the HSA (Heterogeneous System Architecture) Foundation as a non-profit consortium.  HSA's objective is to build a heterogeneous compute ecosystem across CPUs, GPUs, DSPs and other programmable and fixed-function devices using open-source technologies.  The key founders include:

As of August 2015, HSA had signed 43 member companies and 11 academic institutions to its consortium.

HSA's roadmap through 2014 identifies four milestones leading to GPUs that are fully integrated with CPUs:

  • 2011 - integration of CPU and GPU in silicon
  • 2012 - GPU can access CPU memory
  • 2013 - unified memory for CPU and GPU
  • 2014 - workloads run seamlessly on CPUs and GPUs in parallel

AMD predicts that HSA will maximize the amount of time spent in each parallel processing algorithm and minimize the amount of time devoted to overhead under industry standards:

HSA Roadmap

The orange bars show improvements in performance as AMD evolves from serial CPU technology to its parallel primitives library.  The purple bars show changes in actual time spent within the parallel processing algorithms. 


Multi-processor technology was originally developed for supercomputing applications.  About half of the world's supercomputers are located in the United States, with the next largest and growing share in China.  Much of the terminology used in supercomputing applies to heterogeneous computers.  The figure below highlights the more common terms. 

Multi Processor Components


  • A process is the actual execution of the set of instructions contained within a computer program.
  • Multitasking is the method of executing several processes across the same hardware resource over the same period of time.  This method creates the illusion of parallel processing by switching rapidly from one process to another.
  • A thread is an execution path within a process.  It is the smallest unit of processing that the hardware can schedule and execute as a separate task within the process.  Threads within the same process may share state, memory and address space. 
  • Context is the information that a task or process must save when it is interrupted.  Context switching is typically faster between threads than between processes.
  • Bandwidth is the available data communication per unit time.

  • A bus is a hardware sub-system that transfers data between a computer's components. 
  • A core is a computational unit that executes a single thread within a process.
  • A register is a unit of memory directly accessible by the components that make up a core.  Registers are at the top of the memory hierarchy and provide the fastest access to data within a multi-processor. 
  • An ALU is the digital circuit that performs arithmetic and logic operations on data stored in registers.
  • An FPU is the digital circuit that performs arithmetic operations on floating-point data stored in registers.
  • A control unit manages the flow of information through a processor and the operations performed on the data by the ALU and the FPU.
  • DRAM is dynamic random access memory, the volatile memory that typically serves as a computer's main memory. 
  • A cache is a memory unit between an initiator (typically a processor) and more persistent memory that facilitates faster access to data.
  • A cache miss is the absence of data in a specific cache that is requested by a process. 

  • Clock rate is the frequency at which the processor executes instructions. 
  • Latency is the idle time between a request for data and the instant that data is available to the initiator. 

Common performance measures are:


Flynn's Taxonomy

Michael Flynn classified multi-processor architectures into four categories (Flynn, M.J. (1972). "Some Computer Organizations and Their Effectiveness". IEEE Trans. Comput. C-21: 948.):

                 Single Instruction    Multiple Instruction
  Single Data    SISD                  MISD
  Multiple Data  SIMD                  MIMD
  • SISD - a single processor executes a single operation on data stored in one memory location. 
  • SIMD - multiple processors perform the same operation on data stored in multiple memory locations. 
  • MISD - multiple processors perform multiple operations on data stored in one memory location. 
  • MIMD - multiple processors perform multiple operations on data stored in multiple memory locations. 


An important sub-category of the MIMD class is:

  • SPMD - single program multiple data - multiple cooperating processes execute one program. 

SPMD and SIMD are not mutually exclusive.  A SIMD machine executes in lockstep on different data, while an SPMD machine executes the same program on multiple data, but not necessarily in lockstep.

CPUs and GPUs

Multi-core CPUs are MIMD machines. 

GPUs are SPMD machines. 


Manufacturers produce parallel processors by coupling individual processors either loosely or tightly:

  • loose coupling - connect processors at the high level for a distributed memory system.  Each processor has its own private memory.  The processors exchange information through messages at this top level. 


    Supercomputers and clusters are examples of loosely coupled processors.

  • tight coupling - connect processors at the lower bus level or even on-chip for a shared memory system.  These processors are much more energy efficient than loosely coupled ones. 


    Heterogeneous systems are tightly coupled.  Multi-core CPUs and many-core GPUs are extreme examples. 

Levels of Parallelism

Manufacturers implement parallelism at many different levels.  These levels include:

GPUs are data-level parallel processors. 

Caches and Latency

The architecture of modern computers is based on the stored-program concept.  When Alan Turing and John von Neumann proposed this concept in the first half of the twentieth century, they envisaged all program data and program instructions stored together in one location.  Register and memory access times were similar, and the differences between execution time and access time were not profound.  Program instructions and program data shared the same bus.  Because instructions and data were not simultaneously accessible, the CPU operated at sub-optimal speeds.  This constraint is called the von Neumann bottleneck.

The solution to the von Neumann bottleneck is the introduction of a cache between the CPU and main memory.  By holding a copy of some of the data that resides in main memory, the cache reduces the latency between the CPU's request for data and the availability of that data to the CPU.  If a copy of the requested data is in the cache, there is no need to access main memory, which reduces the bottleneck significantly. 

As the difference between computation time and access time has increased, the need for caching has increased.  Recent computers have multiple caches:


Most include up to three levels - L1, L2 and L3 - each level provides more immediate access to some of the information in main memory.  L1 is closest to the processor and most often built into the processor itself.  L3 is the closest to main memory and usually installed on the motherboard.  L2 is between L1 and L3 and may be built either into the processor or installed on the motherboard. 

Multi-Core CPUs

Source: CPU World

Cores and Threads:

  • Intel Core i7-5930K (Haswell) has 6 cores and can process 12 threads - 0.022 micron technology
  • Intel Core i7-6700K (Skylake) has 4 cores and can process 8 threads - 0.014 micron technology
  • AMD FX-8370 (Piledriver) has 8 cores and can process 8 threads - 0.028 micron technology

Caches and Latency (clock cycles in parentheses):

  • Intel Core i7-5930K (Haswell)
    • 6 X 32KB L1 instruction caches
    • 6 X 32KB L1 data caches
    • 6 X 256 KB L2 cache
    • 15MB L3 cache
  • Intel Core i7-6700K (Skylake)
    • 4 X 32KB L1 instruction caches (?)
    • 4 X 32KB L1 data caches (?)
    • 4 X 256 KB L2 cache (?)
    • 8MB L3 shared cache (?)
  • AMD FX-8370 (Piledriver)
    • 4 X 64KB L1 instruction caches
    • 8 X 16KB L1 data caches
    • 4 X 2MB L2 data caches
    • 8MB L3 shared cache

Instruction Performance (IPS):

  • 2011 Intel Core i7 - 2600K (Sandy Bridge) 111.0 GIPS
  • 2013 Intel Core i7 - 4770K (Haswell) 133.0 GIPS
  • 2014 Intel Core i7 - 5960X (Haswell-Extreme) 238.0 GIPS
  • 2011 AMD FX-8150 (Zambezi) 108.0 GIPS

Floating-Point Performance (FLOPS):

  • 2011 Intel Core i7 - 2600K (Sandy Bridge) 82.8 GFLOPS
  • 2011 AMD FX-8150 (Zambezi) 66.1 GFLOPS
  • 2009 Cray XT5 (Jaguar) 224K processing cores - 1.8 PFLOPS

Bandwidth (GB/s):

  • 2014 Intel Core i7 - 5930K (Haswell) DDR4-1333/1600/2133 68 GB/s
  • 2014 AMD FX-8370 (Vishera) DDR3-1866 29.9 GB/s

DDR3 SDRAM Memory Bandwidth - Peak Transfer Rate (GB/s):

  • 1066 8.5 GB/s
  • 1333 10.6 GB/s
  • 1866 14.9 GB/s

DDR4 SDRAM Memory Bandwidth - Peak Transfer Rate (GB/s):

  • 2133 17.0 GB/s

Many-Core GPUs

Source: Wikipedia July 19 2012

  • Nvidia GeForce GTX GPUs
    • Nvidia GTX280 (GT200 Tesla) June 17 2008
    • Nvidia GTX480 (GF100 Fermi) March 26 2010
    • Nvidia GTX680 (GK104 Kepler) March 22 2012
    • Nvidia GTX980 (GM204 Maxwell) September 18 2014
  • AMD Radeon HD GPUs
    • AMD Radeon HD-5870 (Evergreen) September 23 2009
    • AMD Radeon HD-6870 (Northern Islands) October 22 2010
    • AMD Radeon HD-7970 (Southern Islands) January 9 2012
    • AMD Radeon R9-295X2 (Hawaii) April 21 2014


  • Nvidia GTX280 (GT200 Tesla) 2008 - 240 cores
  • Nvidia GTX480 (GF100 Fermi) 2010 - 512 cores
  • Nvidia GTX680 (GK104 Kepler) 2012 - 1536 cores
  • Nvidia GTX980 (GM204 Maxwell) 2014 - 2048 cores

  • AMD Radeon HD-5870 (Evergreen) 2009 - 1600 cores
  • AMD Radeon HD-6870 (Northern Islands) 2010 - 1120 cores
  • AMD Radeon HD-7970 (Southern Islands) 2012 - 2048 cores
  • AMD Radeon R9-290 (Hawaii) 2014 - 2560 cores

Performance (FLOPS):

  • Nvidia GTX280 (GT200 Tesla) 2008 - 933 GFLOPS
  • Nvidia GTX480 (GF100 Fermi) 2010 - 1.35 TFLOPS
  • Nvidia GTX680 (GK104 Kepler) 2012 - 3.09 TFLOPS
  • Nvidia GTX980 (GM204 Maxwell) 2014 - 5 TFLOPS

  • AMD Radeon HD-5870 (Evergreen) 2009 - 2.09 TFLOPS single 418 GFLOPS double
  • AMD Radeon HD-6870 (Northern Islands) 2010 - 2.02 TFLOPS single
  • AMD Radeon HD-7970 (Southern Islands) 2012 - 3.79 TFLOPS single 947 GFLOPS double
  • AMD Radeon R9-290 (Hawaii) 2014 - 4.8 TFLOPS single 606 GFLOPS double

Bandwidth (GB/s):

  • Nvidia GTX280 (GT200 Tesla) 2008 - 142 GB/s - 512-bit GDDR3
  • Nvidia GTX480 (GF100 Fermi) 2010 - 177 GB/s - 384-bit GDDR5
  • Nvidia GTX680 (GK104 Kepler) 2012 - 192 GB/s - 256-bit GDDR5
  • Nvidia GTX980 (GM204 Maxwell) 2014 - 224 GB/s - 256-bit GDDR5

  • AMD Radeon HD-5870 (Evergreen) 2009 - 128 GB/s - 256-bit GDDR5
  • AMD Radeon HD-6870 (Northern Islands) 2010 - 134 GB/s - 256-bit GDDR5
  • AMD Radeon HD-7970 (Southern Islands) 2012 - 264 GB/s - 384-bit GDDR5
  • AMD Radeon R9-290 (Hawaii) 2014 - 320 GB/s - 512-bit GDDR5

Bus Interface (GB/s):

  • Nvidia GTX280 (GT200 Tesla) 2008 - PCIe 2.0 - 2007 - 500 MB/s * 16 lanes = 8GB/s
  • Nvidia GTX480 (GF100 Fermi) 2010 - PCIe 2.0 - 2007 - 500 MB/s * 16 lanes = 8GB/s
  • Nvidia GTX680 (GK104 Kepler) 2012 - PCIe 3.0 - 2010 - 1 GB/s * 16 lanes = 16 GB/s
  • Nvidia GTX980 (GM204 Maxwell) 2014 - PCIe 3.0 - 2010 - 1 GB/s * 16 lanes = 16 GB/s

  • AMD Radeon HD-5870 (Evergreen) 2009 - PCIe 2.1 - 2007 - 500 MB/s * 16 lanes = 8GB/s
  • AMD Radeon HD-6870 (Northern Islands) 2010 - PCIe 2.1 - 2007 - 500 MB/s * 16 lanes = 8GB/s
  • AMD Radeon HD-7970 (Southern Islands) 2012 - PCIe 3.0 - 2010 - 1 GB/s * 16 lanes = 16 GB/s
  • AMD Radeon R9-290 (Hawaii) 2014 - PCIe 3.0 - 2010 - 1 GB/s * 16 lanes = 16 GB/s

Note the switch in high performance GPU memory versions (GB/s):

  • GDDR3 - 11.2 GB/s
  • GDDR5 - 86.4 GB/s


Comparison of Intel's i7, AMD's FX Series, Nvidia's GTXs, AMD's Radeons, and Cray's Jaguar reveals that:

  • number of cores on many-core GPUs is 2 orders of magnitude greater than the number of cores on multi-core CPUs
  • memory bandwidth on a GPU is one order of magnitude faster than the transfer rate across the PCIe bus
  • GPU performance is in the TFLOPS range comfortably between CPU performance (80 GFLOPS) and supercomputer performance (Jaguar - 1.8 PFLOPS)

Multi-Core CPUs versus Many-Core GPUs

The figure below illustrates the architectural difference between multi-core and many-core processors.  Multi-core processors devote most of their transistors to control and caching, while many-core processors devote most of their transistors to computational units. 

Processor Layouts

The figures below show the differences in performance as CPUs and GPUs have improved over the last decade.  Computational performance of many-core processors, shown on the left, is one order of magnitude greater than computational performance of multi-core processors.  Memory bandwidth of many-core processors, shown on the right, is also one order of magnitude greater than that of multi-core processors. 

Computational Power  Memory Bandwidth

The figure below shows the bottleneck to information transfer between a multi-core CPU and a discrete many-core GPU.  The bandwidth of the PCIe (Peripheral Component Interconnect Express) bus limits the maximum throughput:

PCIe Bus

The bandwidth of the PCIe bus increased with the version number: 8 GB/s for PCIe 2.x and 16 GB/s for PCIe 3.x.  To minimize traffic across the PCIe bottleneck, we may decide to execute some tasks on the GPU that would otherwise have benefited from execution on the CPU.  Data that has been moved to the GPU should be processed on the GPU for as long as possible. 


The languages for parallel programming consist of core languages and extensions to them.  Some extensions have been developed specifically for HPC solutions, while others have been developed for heterogeneous computing in general.

The Core Languages

The core languages for parallel programming are currently Fortran, C and C++.  These languages generate the most efficient executable code on the CPU and provide the optimal platforms for adding parallel constructs on the GPU.  All three core languages are portable and retain direct connection to lower-level assembly languages.  Taken together, these core languages are currently more popular than any other programming language (TIOBE). 


Fortran was originally developed in 1957 for scientific and numerical computations.  Legacy applications in scientific communities are still maintained in Fortran and some scientists prefer this language for parallel programming.  Although C and C++ eclipsed its popularity amongst programmers many years ago, Fortran compilers continue to evolve and produce highly efficient executables.  Recent updates include object-oriented, generic, and parallel programming capabilities. 

ISO/IEC approved the most recent standard in September 2010 (ISO/IEC 1539-1:2010), which is informally known as Fortran 2008.


C was developed in the early 1970s for writing operating systems.  C is not specialized to scientific and mathematical computations, but like Fortran is close to assembly language.  Its focus remains optimal performance at the operating system level.  C provides highly efficient access to instruction sets on a machine-independent basis. 

ISO C11 incorporated multi-threading support through the <threads.h> library. 


C++ evolved out of and alongside C in the 1980s to provide object-oriented facilities.  C++ is a superset of C.  C++ is used to write large applications including real-time simulations and digital games.  Its focus remains delivering performance at the application systems level. 

ISO C++11 incorporated multi-threading support through the <thread>, <mutex>, <condition_variable>, and <future> libraries.  These libraries provide a memory model for multiple co-existing and interacting threads. 

The Traditional Extensions

The two principal language extensions to Fortran, C and C++ in HPC are MPI and OpenMP. 


MPI stands for Message Passing Interface.  The most recent official standard is MPI-3.  This standard is used for writing message-passing programs that are efficient, flexible, and portable.  MPI-3 was agreed upon by software developers, computer vendors, academics, and computer-science researchers, and was first published in 2012.  MPI-3 supports over 500 functions, including parallel I/O, for ISO Fortran, ISO C and ISO C++.

Open MPI

The Open MPI project is an open source implementation of the official MPI-3 standard.  Open MPI is used in many of the Top 500 supercomputers.


OpenMP stands for Open Multi-Processing.  The OpenMP programming model provides a simple and flexible interface for writing parallel applications on shared memory platforms.  The OpenMP project recognized that message passing is cumbersome and that room exists for a more abstract interface.  OpenMP consists of compiler directives, library routines, and environment variables for Fortran, C and C++ code. 

OpenMP 4.0 was published in July 2013.  Documentation and compiler support are listed on the OpenMP website.  GCC 4.7 supports OpenMP 3.1.  The command-line flag for compiling OpenMP code using gcc is -fopenmp.

The GPGPU Extensions

GPGPU stands for general purpose graphics processing unit.  GPGPU solutions expose the GPU to general purpose computations. 


Originally, GPUs were only programmed using the specialized graphical display languages developed for digital games.  Programmers would translate their compute problems into equivalent graphics problems, solve those problems using the existing game programming tools, and translate those graphical solutions back to solutions to their original compute problems.  The programming skill required was well beyond that of any applications programmer who was not also a game programmer. 

This complexity motivated research into language extensions to support application programmers.  The languages developed included:

The timeline below shows their relative positions in the evolution of GPGPU programming:

source: SIGGRAPH Asia 2010 (OpenCL by Example)

The two dominant language extensions as of 2015 are implementations of Khronos' OpenCL standard and Nvidia's CUDA.


OpenCL (Open Computing Language) is an open, royalty-free standard for parallel programming of heterogeneous systems.  The Khronos group manages this standard and released the OpenCL 2.0 specification on November 13 2013.  Implementations of this standard include:


CUDA (Compute Unified Device Architecture) is Nvidia's programming model and instruction set architecture.  CUDA works with all Nvidia GPUs since the G8x series:

  • Tesla architecture (2006):
    • G80 (8/11/2006) - 16 SMs 8 CUDA cores each
    • GT200 (17/6/2008) - 30 SMs 8 CUDA cores each
  • Fermi architecture (2010):
    • GF100 (26/3/2010) - 16 SMs 32 CUDA cores each
    • 1536 max threads per multi-processor
    • unified address space with full support for C++
    • pass by pointer, virtual functions, function pointers, new/delete, try/catch
  • Kepler architecture (2012):
    • GK110 (22/3/2012) - 15 SMXs 192 CUDA cores each, 32 SFUs (Special Function Units) and 32 LD/STs (Load/Store Units)
    • dynamic parallelism - spawn new threads without returning to the CPU
    • direct data transfer between GPUs
  • Maxwell architecture (2014):
    • GM107 (21/2/2014) - 5 SMXs 128 CUDA cores each
    • GM204 (18/9/2014) - 16 SMXs 128 CUDA cores each

NVIDIA released CUDA 7.5 on September 8 2015. 

Other Solutions

OpenACC is a collection of compiler directives similar to OpenMP for identifying regions of C, C++, Fortran code for offloading to an accelerator.  NVIDIA, Cray, PGI and CAPS defined OpenACC as an API cross-platform specification for accelerating applications on multi-core and many-core processors using directives. 

DirectCompute is a Microsoft API that ships as part of its DirectX collection.  DirectCompute supports GPGPU programming on Windows platforms. 

C++ AMP is a set of extensions to C++ developed by Microsoft for accelerating execution by taking advantage of data-parallel hardware on GPUs.  C++ AMP runs on Windows operating systems and GPUs that support DirectX.  AMD, in collaboration with Microsoft, announced the release of C++ AMP 1.2 for both Linux and Windows platforms on August 26 2014.

Cilk Plus is Intel's commercial version of Cilk, which provides extensions to C and C++.  Intel acquired Cilk from Cilk Arts, which had licensed Cilk from MIT, and now maintains Cilk Plus as a branch of GCC 4.9.  Intel claims that "The Intel Cilk Plus language, built on the Cilk technology developed at M.I.T. over the past two decades, is designed to provide a simple, well-structured model that makes development, verification and analysis easy.  Because Intel Cilk Plus is an extension to C and C++, programmers typically do not need to restructure programs significantly in order to add parallelism."  Intel also notes that Cilk Plus provides a simple tool for harnessing the power of its new Xeon Phi coprocessors. 

Intel offers a royalty-free library for modern processors named the Math Kernel Library (MKL).  MKL includes highly optimized code for linear algebra, vector math and statistics functions called from either C or Fortran code.


Domains that have benefitted from parallel processing include:

  • entertainment
  • gaming
  • advertising
  • mobile devices
  • medical imaging
  • weather and ocean patterns
  • air traffic control
  • tectonic plate shifts
  • experimental medicine
  • genetics
  • cell growth
  • quantum chemistry
  • mathematics
  • product design
  • automobile assembly lines
  • sound and light wave propagation
  • scientific computing
  • computational finance
  • data mining
  • statistical analysis


Primary Reading:

Secondary Reading:

  • Intel's roadmap for future processor families
  • The introduction on Wikipedia to DDR3 SDRAM memory.


Designed by Chris Szalwinski.  Licensed under a Creative Commons License.