✅ Quick Summary: CUDA Programming Steps
| Step | Action | Code/Function Used | Notes |
|------|--------------------------|------------------------------------------------------|-------------------------------|
| 1 | Declare and Allocate Host Variables | `int *h_A = (int *)malloc(size);` | Use standard C/C++ memory allocation; `size` is in bytes |
| 2 | Declare Device Variables | `int *d_A;` | Only declare pointers here |
| 3 | Allocate Device Memory | `cudaMalloc((void**)&d_A, size);` | Allocates memory on GPU |
| 4 | Copy Host → Device | `cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);` | Copy input data to GPU |
| 5 | Launch Kernel | `myKernel<<<gridDim, blockDim>>>(...);` | Set up grid/block dimensions |
| | | `cudaDeviceSynchronize();` | Block the host until the kernel has finished (kernel launches are asynchronous) |
| 6 | Copy Device → Host | `cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);` | Retrieve output from GPU (here from an output buffer `d_C` into `h_C`) |
| 7 | Free Memory | `cudaFree(d_A); free(h_A);` | Free every device and host buffer to avoid memory leaks |
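
The seven steps above can be put together in one small program. The following is a minimal sketch, not a definitive implementation: the kernel `vecAdd`, the helper `runVectorAdd`, the `CUDA_CHECK` macro, and the block size of 256 threads are all assumptions made for illustration and do not come from the table. Only the CUDA runtime calls (`cudaMalloc`, `cudaMemcpy`, `cudaDeviceSynchronize`, `cudaFree`, `cudaGetLastError`, `cudaGetErrorString`) are real API functions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__,    \
                    __LINE__, cudaGetErrorString(err_));              \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel: element-wise c[i] = a[i] + b[i].
__global__ void vecAdd(const int *a, const int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard threads that fall past the end of the arrays
        c[i] = a[i] + b[i];
    }
}

// Runs all seven steps for n elements; returns the number of wrong results.
int runVectorAdd(int n) {
    size_t size = (size_t)n * sizeof(int);

    // Step 1: declare and allocate host variables.
    int *h_A = (int *)malloc(size);
    int *h_B = (int *)malloc(size);
    int *h_C = (int *)malloc(size);
    if (!h_A || !h_B || !h_C) {
        fprintf(stderr, "host allocation failed\n");
        exit(EXIT_FAILURE);
    }
    for (int i = 0; i < n; ++i) {
        h_A[i] = i;
        h_B[i] = 2 * i;
    }

    // Step 2: declare device pointers (no memory behind them yet).
    int *d_A = nullptr, *d_B = nullptr, *d_C = nullptr;

    // Step 3: allocate device memory.
    CUDA_CHECK(cudaMalloc((void **)&d_A, size));
    CUDA_CHECK(cudaMalloc((void **)&d_B, size));
    CUDA_CHECK(cudaMalloc((void **)&d_C, size));

    // Step 4: copy inputs host -> device.
    CUDA_CHECK(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice));

    // Step 5: launch the kernel with enough blocks to cover n elements.
    int threadsPerBlock = 256;  // assumed block size
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);
    CUDA_CHECK(cudaGetLastError());       // catches launch errors
    CUDA_CHECK(cudaDeviceSynchronize());  // waits for the kernel to finish

    // Step 6: copy the result device -> host.
    CUDA_CHECK(cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost));

    // Verify on the host.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (h_C[i] != h_A[i] + h_B[i]) ++mismatches;
    }

    // Step 7: free device and host memory.
    CUDA_CHECK(cudaFree(d_A));
    CUDA_CHECK(cudaFree(d_B));
    CUDA_CHECK(cudaFree(d_C));
    free(h_A);
    free(h_B);
    free(h_C);

    return mismatches;
}

int main() {
    int mismatches = runVectorAdd(1 << 20);
    printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

Assuming the file is saved as `vec_add.cu` (a hypothetical name), it can be built with `nvcc vec_add.cu -o vec_add`. Checking the return value of every runtime call is a design choice of this sketch; the table's one-line snippets omit it for brevity.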