From 7078894c6fb704c174f96bb23fa926acf503096e Mon Sep 17 00:00:00 2001 From: Alireza Olama <42338525+Alirezalm@users.noreply.github.com> Date: Sun, 24 May 2026 14:23:07 +0300 Subject: [PATCH 1/3] Update README for Homework Assignment 4 details --- README.md | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/README.md b/README.md index 0f91d63..5929769 100644 --- a/README.md +++ b/README.md @@ -4,9 +4,9 @@ **Instructor: Alireza Olama** -## Homework Assignment 2: Optimizing Matrix Multiplication in C++ +## Homework Assignment 4: Optimizing Matrix Multiplication in C++ -**Due Date**: 08/05/2025 +**Due Date**: 31/05/2026 **Points**: 100 @@ -14,8 +14,7 @@ ### Assignment Overview -Welcome to the second homework assignment of the Parallel Programming course! In Assignment 1, you implemented a naive -matrix multiplication using a triple nested loop. In this assignment, you will optimize the performance of your naive +Welcome to the last homework assignment of the Parallel Programming course! In this assignment, you will optimize the performance of a naive matrix multiplication implementation using two techniques: 1. **Cache Optimization via Blocked Matrix Multiplication**: Improve data locality to reduce cache misses. @@ -23,7 +22,7 @@ implementation using two techniques: Your task is to implement both optimizations in the provided C++ `main.cpp` file, measure their performance, and compare the wall clock time of the naive, cache-optimized, and parallel implementations for each test case. This assignment builds -on your Assignment 1 code, so ensure your naive implementation is correct before starting. +on naive matmul implementation, so ensure your naive implementation is correct before starting. --- @@ -130,9 +129,9 @@ Example table format: #### Matrix Storage and Memory Management -- Continue using row-major order for all matrices, as in Assignment 1. 
+- Row-major order for all matrices - Use C-style arrays with manual memory management (`malloc` or `new`, `free` or `delete`). -- Do not use STL containers or smart pointers. +- Do not use smart pointers. --- @@ -157,8 +156,8 @@ Example table format: - Use CLion or Visual Studio with CMake. - Alternatively, use MinGW with `cmake -G "MinGW Makefiles"` and `make`. - **Linux/Mac Users**: - - Make sure gcc compiler is installed (`brew install gcc` on Mac). - - Configure cmake to use the correct compiler: + - Make sure the GCC compiler is installed (`brew install gcc` on Mac). + - Configure CMake to use the correct compiler: ```bash cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ . ``` @@ -235,4 +234,4 @@ git push origin student-name - Validate each implementation against `output.raw` to ensure correctness before optimizing. - Use small test cases to debug your blocked and parallel implementations. -Good luck, and enjoy optimizing your matrix multiplication! \ No newline at end of file +Good luck, and enjoy optimizing your matrix multiplication! From 46b0cb2c3eec4d891448aebe2212f89e67ff7ff7 Mon Sep 17 00:00:00 2001 From: Somoy Date: Wed, 27 May 2026 18:19:44 +0300 Subject: [PATCH 2/3] Update repository references in README.md --- README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 5929769..51c7f2a 100644 --- a/README.md +++ b/README.md @@ -173,11 +173,11 @@ Example table format: #### Fork and Clone the Repository -- Fork the Assignment 2 repository (provided separately). +- Fork the Assignment 4 repository (provided separately). 
- Clone your fork: ```bash - git clone https://github.com/parallelcomputingabo/Homework-2.git - cd Homework-2 + git clone https://github.com/AA-parallel-computing/Assignment-4-Optional.git + cd Assignment-4-Optional ``` #### Create a New Branch From 234bc5667b7e191d1abbe0ece4c1c1f2ac08db2a Mon Sep 17 00:00:00 2001 From: Niklas Linderoos Date: Wed, 3 Jun 2026 21:20:26 +0300 Subject: [PATCH 3/3] niklas-linderoos: Implemented optimized matrix multiplication --- README.md | 232 ++++-------------------------------------------------- main.cpp | 135 +++++++++++++++++++++++++++++-- 2 files changed, 142 insertions(+), 225 deletions(-) diff --git a/README.md b/README.md index 51c7f2a..d8dbb0b 100644 --- a/README.md +++ b/README.md @@ -12,226 +12,24 @@ --- -### Assignment Overview +### Task 1 +I am not sure what I did wrong, or whether it is just my hardware, but no matter the `block_size` I did not see an increase in performance; the blocked version usually ran slower than the naive solution. In my tests, a `block_size` of 64 caused the least performance loss. -Welcome to the last homework assignment of the Parallel Programming course! In this assignment, you will optimize the performance of a naive matrix multiplication -implementation using two techniques: +### Task 2 +On my hardware, I found that an `OMP_NUM_THREADS` of 4 yielded the best result, with larger speedups as the matrix sizes increased. -1. **Cache Optimization via Blocked Matrix Multiplication**: Improve data locality to reduce cache misses. -2. **Parallel Matrix Multiplication using `OpenMP`**: Parallelize the computation across multiple threads. - -Your task is to implement both optimizations in the provided C++ `main.cpp` file, measure their performance, and compare the -wall clock time of the naive, cache-optimized, and parallel implementations for each test case. This assignment builds -on naive matmul implementation, so ensure your naive implementation is correct before starting. 
- ---- - -### Technical Requirements - -#### 1. Cache Optimization (Blocked Matrix Multiplication) - -**Why Cache Optimization?** - -Modern CPUs rely on cache memory to reduce the latency of accessing data from main memory. Cache memory is faster but -smaller, organized in cache lines (typically 64 bytes). When a CPU accesses a memory location, it fetches an entire -cache line. Matrix multiplication can suffer from poor performance if memory accesses are not cache-friendly, leading to -frequent cache misses. - -The naive matrix multiplication (with triple nested loops) accesses memory in a way that may not exploit spatial and -temporal locality: - -- **Spatial Locality**: Accessing consecutive memory locations (e.g., elements in the same cache line). -- **Temporal Locality**: Reusing the same data multiple times while it’s still in the cache. - -Blocked matrix multiplication divides the matrices into smaller submatrices (blocks) that fit into the cache. By -performing computations on these blocks, you ensure that data is reused while it resides in the cache, reducing cache -misses and improving performance. - -**Blocked Matrix Multiplication Pseudocode** - -Assume matrices \( A \) (m × n), \( B \) (n × p), and \( C \) (m × p) are stored in row-major order. The blocked matrix -multiplication processes submatrices of size \( block_size × block_size \): - -```cpp -// C = A * B -for (ii = 0; ii < m; ii += block_size) - for (jj = 0; jj < p; jj += block_size) - for (kk = 0; kk < n; kk += block_size) - // Process block: C[ii:ii+block_size, jj:jj+block_size] += A[ii:ii+block_size, kk:kk+block_size] * B[kk:kk+block_size, jj:jj+block_size] - for (i = ii; i < min(ii + block_size, m); i++) - for (j = jj; j < min(jj + block_size, p); j++) - for (k = kk; k < min(kk + block_size, n); k++) - C[i * p + j] += A[i * n + k] * B[k * p + j] -``` - -- **block_size**: Chosen to ensure the block fits in the cache (e.g., 32, 64, or 128, depending on the system). 
-- **Outer loops (ii, jj, kk)**: Iterate over blocks. -- **Inner loops (i, j, k)**: Compute within a block, reusing data in the cache. - -**Task**: Implement the `blocked_matmul` function in the provided `main.cpp`. Experiment with different block sizes (e.g., -16, 32, 64) and report the best performance. - ---- - -#### 2. Parallel Matrix Multiplication with OpenMP - -**Why OpenMP?** - -`OpenMP` is a portable API for parallel programming in shared-memory systems. It allows you to parallelize loops with -minimal code changes, distributing iterations across multiple threads. In matrix multiplication, the outer loop(s) can -be parallelized, as each element of the output matrix \( C \) can be computed independently. - -**Parallelizing with OpenMP** - -Use OpenMP to parallelize the outer loop(s) of the naive matrix multiplication. For example, parallelize the loop over -rows of \( C \): - -```cpp -#pragma omp parallel for -for (i = 0; i < m; i++) - for (j = 0; j < p; j++) - for (k = 0; k < n; k++) - C[i * p + j] += A[i * n + k] * B[k * p + j]; -``` - -- The `#pragma omp parallel for` directive tells `OpenMP` to distribute iterations of the loop across available threads. -- Ensure thread safety: Since each iteration writes to a distinct element of \( C \), this loop is safe to parallelize - without locks. -- Use `omp_get_wtime()` to measure wall clock time for accurate performance comparisons. - -**Task**: Implement the `parallel_matmul` function in the provided `main.cpp` using `OpenMP`. Test with different numbers of -threads (e.g., 2, 4, 8) by setting the environment variable `OMP_NUM_THREADS`. - ---- - -#### 3. Performance Measurement - -For each test case (0 through 9 in the `data` folder): - -- Measure the **wall clock time** for: - - Naive matrix multiplication (`naive_matmul`). - - Cache-optimized matrix multiplication (`blocked_matmul`). - - Parallel matrix multiplication (`parallel_matmul`). 
-- Use `omp_get_wtime()` for timing, as it provides high-resolution wall clock time. -- Report the times in a table in your submission README.md, including: - - Test case number. - - Matrix dimensions (m × n × p). - - Wall clock time for each implementation (in seconds). - - Speedup of blocked and parallel implementations over the naive implementation. - -Example table format: +### Performance Measurement table: | Test Case | Dimensions (m × n × p) | Naive Time (s) | Blocked Time (s) | Parallel Time (s) | Blocked Speedup | Parallel Speedup | |-----------|------------------------|----------------|------------------|-------------------|-----------------|------------------| -| 0 | 512 × 512 × 512 | 2.345 | 0.987 | 0.543 | 2.38× | 4.32× | - ---- - -#### Matrix Storage and Memory Management - -- Row-major order for all matrices -- Use C-style arrays with manual memory management (`malloc` or `new`, `free` or `delete`). -- Do not use smart pointers. - ---- - -#### Input/Output and Validation - -- Use the same input/output format as Assignment 1: - - Input files: `data//input0.raw` (matrix \( A \)) and `input1.raw` (matrix \( B \)). - - Output file: `data//result.raw` (matrix \( C \)). - - Reference file: `data//output.raw` for validation. -- The executable accepts a case number (0–9) as a command-line argument. -- Validate correctness by comparing `result.raw` with `output.raw` for each implementation. - ---- - -### Build Instructions - -- Use the provided `CMakeLists.txt` to build the project. -- **Additional Requirements**: - - Ensure OpenMP is enabled in your compiler (e.g., `-fopenmp` for GCC). - - The provided CMake file includes OpenMP support. -- **Windows Users**: - - Use CLion or Visual Studio with CMake. - - Alternatively, use MinGW with `cmake -G "MinGW Makefiles"` and `make`. -- **Linux/Mac Users**: - - Make sure the GCC compiler is installed (`brew install gcc` on Mac). 
- - Configure CMake to use the correct compiler: - ```bash - cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ . - ``` - - Run `cmake .` to generate a Makefile, then `make`. -- **Testing OpenMP**: - - Set the number of threads using the environment variable `OMP_NUM_THREADS` (e.g., `export OMP_NUM_THREADS=4` on - Linux/Mac, or `set OMP_NUM_THREADS=4` on Windows). - - Test with different thread counts to find the best performance. - ---- - -### Submission Requirements - -#### Fork and Clone the Repository - -- Fork the Assignment 4 repository (provided separately). -- Clone your fork: - ```bash - git clone https://github.com/AA-parallel-computing/Assignment-4-Optional.git - cd Assignment-4-Optional - ``` - -#### Create a New Branch - -```bash -git checkout -b student-name -``` - -#### Implement Your Solution - -- Modify the provided `main.cpp` to implement `blocked_matmul` and `parallel_matmul`. -- Update `README.md` with your performance results table. - -#### Commit and Push - -```bash -git add . -git commit -m "student-name: Implemented optimized matrix multiplication" -git push origin student-name -``` - -#### Submit a Pull Request (PR) - -- Create a pull request from your branch to the base repository’s `main` branch. -- Include a description of your optimizations and any challenges faced. - ---- - -### Grading (100 Points Total) - -| Subtask | Points | -|---------------------------------------------|--------| -| Correct implementation of `blocked_matmul` | 30 | -| Correct implementation of `parallel_matmul` | 30 | -| Accurate performance measurements | 20 | -| Performance results table in README.md | 10 | -| Code clarity, commenting, and organization | 10 | -| **Total** | 100 | - ---- - -### Tips for Success - -- **Cache Optimization**: - - Experiment with different block sizes. Start with powers of 2 (e.g., 16, 32, 64). - - Use a block size that balances cache usage without excessive overhead. 
-- **OpenMP**: - - Test with different thread counts to find the optimal number for your system. - - Be cautious of false sharing (when threads access nearby memory locations, causing cache coherence issues). -- **Performance Measurement**: - - Run multiple iterations for each test case and report the average time to reduce variability. - - Ensure no other heavy processes are running during measurements. -- **Debugging**: - - Validate each implementation against `output.raw` to ensure correctness before optimizing. - - Use small test cases to debug your blocked and parallel implementations. +| 0 | 64x64x64 | 0.000579959 | 0.00141523 | 0.000540829 | 0.409797x | 1.07235x | +| 1 | 128x64x128 | 0.00276523 | 0.00941577 | 0.00164599 | 0.293681x | 1.67998x | +| 2 | 100x128x56 | 0.00197337 | 0.00393161 | 0.00154423 | 0.501922x | 1.27789x | +| 3 | 128x64x128 | 0.0071317 | 0.0123944 | 0.00348026 | 0.575398x | 2.04919x | +| 4 | 32x128x32 | 0.00046889 | 0.00108778 | 0.00046155 | 0.431053x | 1.0159x | +| 5 | 200x100x256 | 0.021566 | 0.0533294 | 0.00897992 | 0.404393x | 2.40159x | +| 6 | 256x256x256 | 0.149978 | 0.221313 | 0.073618 | 0.677673x | 2.03724x | +| 7 | 256x300x256 | 0.139749 | 0.171278 | 0.0644767 | 0.815918x | 2.16743x | +| 8 | 64x128x64 | 0.00394642 | 0.00917446 | 0.00163739 | 0.430153x | 2.41019x | +| 9 | 256x256x257 | 0.128063 | 0.181834 | 0.0329704 | 0.704285x | 3.88418x | -Good luck, and enjoy optimizing your matrix multiplication! 
diff --git a/main.cpp b/main.cpp index 65bf108..90d2ab1 100644 --- a/main.cpp +++ b/main.cpp @@ -3,50 +3,166 @@ #include #include #include +#include +#include + void naive_matmul(float *C, float *A, float *B, uint32_t m, uint32_t n, uint32_t p) { //TODO : Implement naive matrix multiplication + // A is m x n, B is n x p, C is m x p + for (int i = 0; i < m; i++) { + for (int j = 0; j < p; j++) { + float sum = 0; + for (int k = 0; k < n; k++) { + sum += A[i * n + k] * B[k * p + j]; + } + C[i * p + j] = sum; + } + } } void blocked_matmul(float *C, float *A, float *B, uint32_t m, uint32_t n, uint32_t p, uint32_t block_size) { // TODO: Implement blocked matrix multiplication // A is m x n, B is n x p, C is m x p // Use block_size to divide matrices into submatrices + for (int ii = 0; ii < m; ii += block_size){ + for (int jj = 0; jj < p; jj += block_size){ + for (int kk = 0; kk < n; kk += block_size){ + for (int i = ii; i < std::min(ii + block_size, m); i++){ + for (int j = jj; j < std::min(jj + block_size, p); j++) { if (kk == 0) C[i * p + j] = 0.0f; // clear C on the first k-block; the caller may not zero it + for (int k = kk; k < std::min(kk + block_size, n); k++) + C[i * p + j] += A[i * n + k] * B[k * p + j]; + } + } + } + } + } } void parallel_matmul(float *C, float *A, float *B, uint32_t m, uint32_t n, uint32_t p) { // TODO: Implement parallel matrix multiplication using OpenMP // A is m x n, B is n x p, C is m x p + #pragma omp parallel for + for (int i = 0; i < m; i++) { + for (int j = 0; j < p; j++) { + float sum = 0.0f; // accumulate locally; C may not be zero-initialized + for (int k = 0; k < n; k++) sum += A[i * n + k] * B[k * p + j]; + C[i * p + j] = sum; + } + } +} +float* read_file(const std::string &file, int *row, int *col){ + std::ifstream file_reader(file); + std::string meta_data = ""; + getline(file_reader, meta_data); + std::stringstream row_col(meta_data); + + row_col >> *row; + row_col >> *col; + + float *array = new float[*row * *col]; + + int index = 0; + std::string cur_line; + while (getline (file_reader, cur_line)) { + std::stringstream values(cur_line); + float value; + while (values >> value){ + 
array[index++] = value; + } + } + file_reader.close(); + return array; +} +float round_up(float value, int decimal_places) { + const int multiplier = std::pow(10, decimal_places); + return roundf(value * multiplier) / multiplier; +} +void write_file (const std::string &file, float *array, int row, int col){ + std::ofstream file_writer(file); + std::stringstream text; + // Write the first line with the row and column counts + text << row << " " << col << std::endl; + // Write the rest of the data in the right format + for(int i=0; i < row; i++){ + for(int j=0; j < col; j++){ + // Looks like the provided output.raw files have values rounded to 2 decimal places. Feels like this can cause some errors, but rerunning + text << round_up(array[i * col + j], 2)<< " "; + } + text << std::endl; + } + + + file_writer << text.str(); + file_writer.close(); } bool validate_result(const std::string &result_file, const std::string &reference_file) { - //TODO : Implement result validation + + // Read result file + int res_row; + int res_col; + float* res = read_file(result_file, &res_row, &res_col); + + // Read reference file + int ref_row; + int ref_col; + float* ref = read_file(reference_file, &ref_row, &ref_col); + + // Check whether the result and reference files have different numbers of rows or columns; + // if so, they can't be equal + if (res_row != ref_row || res_col != ref_col) { + return false; + } + + // Check if the values inside the files are the same (within a small margin of error) + int row = res_row; + int col = res_col; + float max_err = 1e-3; + + + for(int i=0; i < row; i++) { + for(int j=0; j < col; j++) { + float dif = std::abs(res[i * col + j] - ref[i * col + j]); + if (dif > max_err){ + std::cout << "Difference found at: " << i << ", " << j <" << std::endl; return 1; } - + int case_number = std::atoi(argv[1]); if (case_number < 0 || case_number > 9) { std::cerr << "Case number must be between 0 and 9" << std::endl; return 1; } - // Construct file paths std::string folder 
= "data/" + std::to_string(case_number) + "/"; std::string input0_file = folder + "input0.raw"; std::string input1_file = folder + "input1.raw"; std::string result_file = folder + "result.raw"; std::string reference_file = folder + "output.raw"; - + + int m; + int n; + int p; // TODO Read input0.raw (matrix A) - + float *A = read_file(input0_file, &m, &n); // TODO Read input1.raw (matrix B) - + float *B = read_file(input1_file, &n, &p); // Allocate memory for result matrices float *C_naive = new float[m * p]; @@ -59,7 +175,7 @@ int main(int argc, char *argv[]) { double naive_time = omp_get_wtime() - start_time; // TODO Write naive result to file - + write_file(result_file, C_naive, m, p); // Validate naive result bool naive_correct = validate_result(result_file, reference_file); @@ -68,11 +184,13 @@ int main(int argc, char *argv[]) { } // Measure performance of blocked_matmul (use block_size = 32 as default) + int block_size = 64; start_time = omp_get_wtime(); - blocked_matmul(C_blocked, A, B, m, n, p, 32); + blocked_matmul(C_blocked, A, B, m, n, p, block_size); double blocked_time = omp_get_wtime() - start_time; // TODO Write blocked result to file + write_file(result_file, C_blocked, m, p); // Validate blocked result @@ -87,6 +205,7 @@ int main(int argc, char *argv[]) { double parallel_time = omp_get_wtime() - start_time; // TODO Write parallel result to file + write_file(result_file, C_parallel, m, p); // Validate parallel result