From 7078894c6fb704c174f96bb23fa926acf503096e Mon Sep 17 00:00:00 2001
From: Alireza Olama <42338525+Alirezalm@users.noreply.github.com>
Date: Sun, 24 May 2026 14:23:07 +0300
Subject: [PATCH 1/3] Update README for Homework Assignment 4 details

---
 README.md | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/README.md b/README.md
index 0f91d63..5929769 100644
--- a/README.md
+++ b/README.md
@@ -4,9 +4,9 @@
 
 **Instructor: Alireza Olama**
 
-## Homework Assignment 2: Optimizing Matrix Multiplication in C++
+## Homework Assignment 4: Optimizing Matrix Multiplication in C++
 
-**Due Date**: 08/05/2025
+**Due Date**: 31/05/2026
 
 **Points**: 100
 
@@ -14,8 +14,7 @@
 
 ### Assignment Overview
 
-Welcome to the second homework assignment of the Parallel Programming course! In Assignment 1, you implemented a naive
-matrix multiplication using a triple nested loop. In this assignment, you will optimize the performance of your naive
+Welcome to the last homework assignment of the Parallel Programming course! In this assignment, you will optimize the performance of a naive matrix multiplication
 implementation using two techniques:
 
 1. **Cache Optimization via Blocked Matrix Multiplication**: Improve data locality to reduce cache misses.
@@ -23,7 +22,7 @@ implementation using two techniques:
 
 Your task is to implement both optimizations in the provided C++ `main.cpp` file, measure their performance, and compare the
 wall clock time of the naive, cache-optimized, and parallel implementations for each test case. This assignment builds
-on your Assignment 1 code, so ensure your naive implementation is correct before starting.
+on naive matmul implementation, so ensure your naive implementation is correct before starting.
 
 ---
 
@@ -130,9 +129,9 @@ Example table format:
 
 #### Matrix Storage and Memory Management
 
-- Continue using row-major order for all matrices, as in Assignment 1.
+- Row-major order for all matrices
 - Use C-style arrays with manual memory management (`malloc` or `new`, `free` or `delete`).
-- Do not use STL containers or smart pointers.
+- Do not use smart pointers.
 
 ---
 
@@ -157,8 +156,8 @@ Example table format:
     - Use CLion or Visual Studio with CMake.
     - Alternatively, use MinGW with `cmake -G "MinGW Makefiles"` and `make`.
 - **Linux/Mac Users**:
-    - Make sure gcc compiler is installed (`brew install gcc` on Mac).
-    - Configure cmake to use the correct compiler:
+    - Make sure the GCC compiler is installed (`brew install gcc` on Mac).
+    - Configure CMake to use the correct compiler:
       ```bash
       cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ .
       ```
@@ -235,4 +234,4 @@ git push origin student-name
     - Validate each implementation against `output.raw` to ensure correctness before optimizing.
     - Use small test cases to debug your blocked and parallel implementations.
 
-Good luck, and enjoy optimizing your matrix multiplication!
\ No newline at end of file
+Good luck, and enjoy optimizing your matrix multiplication!

From 46b0cb2c3eec4d891448aebe2212f89e67ff7ff7 Mon Sep 17 00:00:00 2001
From: Somoy <Somoy97@gmail.com>
Date: Wed, 27 May 2026 18:19:44 +0300
Subject: [PATCH 2/3] Update repository references in README.md

---
 README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 5929769..51c7f2a 100644
--- a/README.md
+++ b/README.md
@@ -173,11 +173,11 @@ Example table format:
 
 #### Fork and Clone the Repository
 
-- Fork the Assignment 2 repository (provided separately).
+- Fork the Assignment 4 repository (provided separately).
 - Clone your fork:
   ```bash
-  git clone https://github.com/parallelcomputingabo/Homework-2.git
-  cd Homework-2
+  git clone https://github.com/AA-parallel-computing/Assignment-4-Optional.git
+  cd Assignment-4-Optional
   ```
 
 #### Create a New Branch

From 234bc5667b7e191d1abbe0ece4c1c1f2ac08db2a Mon Sep 17 00:00:00 2001
From: Niklas Linderoos <n.linderoos@gmail.com>
Date: Wed, 3 Jun 2026 21:20:26 +0300
Subject: [PATCH 3/3] niklas-linderoos: Implemented optimized matrix
 multiplication

---
 README.md | 232 ++++--------------------------------------------------
 main.cpp  | 135 +++++++++++++++++++++++++++++--
 2 files changed, 142 insertions(+), 225 deletions(-)
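
Note on Task 1 (blocked version slower than naive): one possible cause, not
verified here, is that the submitted inner loops run i-j-k, so `B[k * p + j]`
is read with a stride of `p` floats on every step of the innermost loop. The
sketch below is a hypothetical variant, not the submitted code: it keeps the
same tiling but orders the inner loops i-k-j, hoists the `std::min` bounds out
of the loop conditions, and zeroes `C` itself. The name `blocked_matmul_ikj`
is an assumption made for illustration, and no timings were taken for it.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical variant of blocked_matmul (illustration only).
// A is m x n, B is n x p, C is m x p, all stored row-major.
void blocked_matmul_ikj(float *C, const float *A, const float *B,
                        uint32_t m, uint32_t n, uint32_t p, uint32_t block_size) {
    // C is accumulated with +=, so it has to start from zero.
    std::fill(C, C + static_cast<std::size_t>(m) * p, 0.0f);

    for (uint32_t ii = 0; ii < m; ii += block_size) {
        const uint32_t i_end = std::min(ii + block_size, m);
        for (uint32_t kk = 0; kk < n; kk += block_size) {
            const uint32_t k_end = std::min(kk + block_size, n);
            for (uint32_t jj = 0; jj < p; jj += block_size) {
                const uint32_t j_end = std::min(jj + block_size, p);
                for (uint32_t i = ii; i < i_end; i++) {
                    for (uint32_t k = kk; k < k_end; k++) {
                        // One element of A is reused across the whole j loop.
                        const float a = A[i * n + k];
                        // B and C are both walked along a row (unit stride).
                        for (uint32_t j = jj; j < j_end; j++)
                            C[i * p + j] += a * B[k * p + j];
                    }
                }
            }
        }
    }
}
```

It takes the same arguments as `blocked_matmul`, so it could be timed in its
place in `main` to check whether the loop order matters on a given machine.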

diff --git a/README.md b/README.md
index 51c7f2a..d8dbb0b 100644
--- a/README.md
+++ b/README.md
@@ -12,226 +12,24 @@
 
 ---
 
-### Assignment Overview
+### Task 1
+I am not sure what I did wrong, or whether it is just my hardware, but no block_size gave an increase in performance; the blocked version usually ran slower than the naive solution. In my tests, a block_size of 64 caused the least performance loss.
 
-Welcome to the last homework assignment of the Parallel Programming course! In this assignment, you will optimize the performance of a naive matrix multiplication
-implementation using two techniques:
+### Task 2
+On my hardware, an OMP_NUM_THREADS of 4 yielded the best result, with larger speedups as the matrix sizes increased.
 
-1. **Cache Optimization via Blocked Matrix Multiplication**: Improve data locality to reduce cache misses.
-2. **Parallel Matrix Multiplication using `OpenMP`**: Parallelize the computation across multiple threads.
-
-Your task is to implement both optimizations in the provided C++ `main.cpp` file, measure their performance, and compare the
-wall clock time of the naive, cache-optimized, and parallel implementations for each test case. This assignment builds
-on naive matmul implementation, so ensure your naive implementation is correct before starting.
-
----
-
-### Technical Requirements
-
-#### 1. Cache Optimization (Blocked Matrix Multiplication)
-
-**Why Cache Optimization?**
-
-Modern CPUs rely on cache memory to reduce the latency of accessing data from main memory. Cache memory is faster but
-smaller, organized in cache lines (typically 64 bytes). When a CPU accesses a memory location, it fetches an entire
-cache line. Matrix multiplication can suffer from poor performance if memory accesses are not cache-friendly, leading to
-frequent cache misses.
-
-The naive matrix multiplication (with triple nested loops) accesses memory in a way that may not exploit spatial and
-temporal locality:
-
-- **Spatial Locality**: Accessing consecutive memory locations (e.g., elements in the same cache line).
-- **Temporal Locality**: Reusing the same data multiple times while it’s still in the cache.
-
-Blocked matrix multiplication divides the matrices into smaller submatrices (blocks) that fit into the cache. By
-performing computations on these blocks, you ensure that data is reused while it resides in the cache, reducing cache
-misses and improving performance.
-
-**Blocked Matrix Multiplication Pseudocode**
-
-Assume matrices \( A \) (m × n), \( B \) (n × p), and \( C \) (m × p) are stored in row-major order. The blocked matrix
-multiplication processes submatrices of size \( block_size × block_size \):
-
-```cpp
-// C = A * B
-for (ii = 0; ii < m; ii += block_size)
-    for (jj = 0; jj < p; jj += block_size)
-        for (kk = 0; kk < n; kk += block_size)
-            // Process block: C[ii:ii+block_size, jj:jj+block_size] += A[ii:ii+block_size, kk:kk+block_size] * B[kk:kk+block_size, jj:jj+block_size]
-            for (i = ii; i < min(ii + block_size, m); i++)
-                for (j = jj; j < min(jj + block_size, p); j++)
-                    for (k = kk; k < min(kk + block_size, n); k++)
-                        C[i * p + j] += A[i * n + k] * B[k * p + j]
-```
-
-- **block_size**: Chosen to ensure the block fits in the cache (e.g., 32, 64, or 128, depending on the system).
-- **Outer loops (ii, jj, kk)**: Iterate over blocks.
-- **Inner loops (i, j, k)**: Compute within a block, reusing data in the cache.
-
-**Task**: Implement the `blocked_matmul` function in the provided `main.cpp`. Experiment with different block sizes (e.g.,
-16, 32, 64) and report the best performance.
-
----
-
-#### 2. Parallel Matrix Multiplication with OpenMP
-
-**Why OpenMP?**
-
-`OpenMP` is a portable API for parallel programming in shared-memory systems. It allows you to parallelize loops with
-minimal code changes, distributing iterations across multiple threads. In matrix multiplication, the outer loop(s) can
-be parallelized, as each element of the output matrix \( C \) can be computed independently.
-
-**Parallelizing with OpenMP**
-
-Use OpenMP to parallelize the outer loop(s) of the naive matrix multiplication. For example, parallelize the loop over
-rows of \( C \):
-
-```cpp
-#pragma omp parallel for
-for (i = 0; i < m; i++)
-    for (j = 0; j < p; j++)
-        for (k = 0; k < n; k++)
-            C[i * p + j] += A[i * n + k] * B[k * p + j];
-```
-
-- The `#pragma omp parallel for` directive tells `OpenMP` to distribute iterations of the loop across available threads.
-- Ensure thread safety: Since each iteration writes to a distinct element of \( C \), this loop is safe to parallelize
-  without locks.
-- Use `omp_get_wtime()` to measure wall clock time for accurate performance comparisons.
-
-**Task**: Implement the `parallel_matmul` function in the provided `main.cpp` using `OpenMP`. Test with different numbers of
-threads (e.g., 2, 4, 8) by setting the environment variable `OMP_NUM_THREADS`.
-
----
-
-#### 3. Performance Measurement
-
-For each test case (0 through 9 in the `data` folder):
-
-- Measure the **wall clock time** for:
-    - Naive matrix multiplication (`naive_matmul`).
-    - Cache-optimized matrix multiplication (`blocked_matmul`).
-    - Parallel matrix multiplication (`parallel_matmul`).
-- Use `omp_get_wtime()` for timing, as it provides high-resolution wall clock time.
-- Report the times in a table in your submission README.md, including:
-    - Test case number.
-    - Matrix dimensions (m × n × p).
-    - Wall clock time for each implementation (in seconds).
-    - Speedup of blocked and parallel implementations over the naive implementation.
-
-Example table format:
+### Performance Measurement table:
 
 | Test Case | Dimensions (m × n × p) | Naive Time (s) | Blocked Time (s) | Parallel Time (s) | Blocked Speedup | Parallel Speedup |
 |-----------|------------------------|----------------|------------------|-------------------|-----------------|------------------|
-| 0         | 512 × 512 × 512        | 2.345          | 0.987            | 0.543             | 2.38×           | 4.32×            |
-
----
-
-#### Matrix Storage and Memory Management
-
-- Row-major order for all matrices
-- Use C-style arrays with manual memory management (`malloc` or `new`, `free` or `delete`).
-- Do not use smart pointers.
-
----
-
-#### Input/Output and Validation
-
-- Use the same input/output format as Assignment 1:
-    - Input files: `data/<case>/input0.raw` (matrix \( A \)) and `input1.raw` (matrix \( B \)).
-    - Output file: `data/<case>/result.raw` (matrix \( C \)).
-    - Reference file: `data/<case>/output.raw` for validation.
-- The executable accepts a case number (0–9) as a command-line argument.
-- Validate correctness by comparing `result.raw` with `output.raw` for each implementation.
-
----
-
-### Build Instructions
-
-- Use the provided `CMakeLists.txt` to build the project.
-- **Additional Requirements**:
-    - Ensure OpenMP is enabled in your compiler (e.g., `-fopenmp` for GCC).
-    - The provided CMake file includes OpenMP support.
-- **Windows Users**:
-    - Use CLion or Visual Studio with CMake.
-    - Alternatively, use MinGW with `cmake -G "MinGW Makefiles"` and `make`.
-- **Linux/Mac Users**:
-    - Make sure the GCC compiler is installed (`brew install gcc` on Mac).
-    - Configure CMake to use the correct compiler:
-      ```bash
-      cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ .
-      ```
-    - Run `cmake .` to generate a Makefile, then `make`.
-- **Testing OpenMP**:
-    - Set the number of threads using the environment variable `OMP_NUM_THREADS` (e.g., `export OMP_NUM_THREADS=4` on
-      Linux/Mac, or `set OMP_NUM_THREADS=4` on Windows).
-    - Test with different thread counts to find the best performance.
-
----
-
-### Submission Requirements
-
-#### Fork and Clone the Repository
-
-- Fork the Assignment 4 repository (provided separately).
-- Clone your fork:
-  ```bash
-  git clone https://github.com/AA-parallel-computing/Assignment-4-Optional.git
-  cd Assignment-4-Optional
-  ```
-
-#### Create a New Branch
-
-```bash
-git checkout -b student-name
-```
-
-#### Implement Your Solution
-
-- Modify the provided `main.cpp` to implement `blocked_matmul` and `parallel_matmul`.
-- Update `README.md` with your performance results table.
-
-#### Commit and Push
-
-```bash
-git add .
-git commit -m "student-name: Implemented optimized matrix multiplication"
-git push origin student-name
-```
-
-#### Submit a Pull Request (PR)
-
-- Create a pull request from your branch to the base repository’s `main` branch.
-- Include a description of your optimizations and any challenges faced.
-
----
-
-### Grading (100 Points Total)
-
-| Subtask                                     | Points |
-|---------------------------------------------|--------|
-| Correct implementation of `blocked_matmul`  | 30     |
-| Correct implementation of `parallel_matmul` | 30     |
-| Accurate performance measurements           | 20     |
-| Performance results table in README.md      | 10     |
-| Code clarity, commenting, and organization  | 10     |
-| **Total**                                   | 100    |
-
----
-
-### Tips for Success
-
-- **Cache Optimization**:
-    - Experiment with different block sizes. Start with powers of 2 (e.g., 16, 32, 64).
-    - Use a block size that balances cache usage without excessive overhead.
-- **OpenMP**:
-    - Test with different thread counts to find the optimal number for your system.
-    - Be cautious of false sharing (when threads access nearby memory locations, causing cache coherence issues).
-- **Performance Measurement**:
-    - Run multiple iterations for each test case and report the average time to reduce variability.
-    - Ensure no other heavy processes are running during measurements.
-- **Debugging**:
-    - Validate each implementation against `output.raw` to ensure correctness before optimizing.
-    - Use small test cases to debug your blocked and parallel implementations.
+| 0         | 64x64x64               | 0.000579959    | 0.00141523       | 0.000540829       | 0.409797x       | 1.07235x         |
+| 1         | 128x64x128             | 0.00276523     | 0.00941577       | 0.00164599        | 0.293681x       | 1.67998x         |
+| 2         | 100x128x56             | 0.00197337     | 0.00393161       | 0.00154423        | 0.501922x       | 1.27789x         |
+| 3         | 128x64x128             | 0.0071317      | 0.0123944        | 0.00348026        | 0.575398x       | 2.04919x         |
+| 4         | 32x128x32              | 0.00046889     | 0.00108778       | 0.00046155        | 0.431053x       | 1.0159x          |
+| 5         | 200x100x256            | 0.021566       | 0.0533294        | 0.00897992        | 0.404393x       | 2.40159x         |
+| 6         | 256x256x256            | 0.149978       | 0.221313         | 0.073618          | 0.677673x       | 2.03724x         |
+| 7         | 256x300x256            | 0.139749       | 0.171278         | 0.0644767         | 0.815918x       | 2.16743x         |
+| 8         | 64x128x64              | 0.00394642     | 0.00917446       | 0.00163739        | 0.430153x       | 2.41019x         |
+| 9         | 256x256x257            | 0.128063       | 0.181834         | 0.0329704         | 0.704285x       | 3.88418x         |
 
-Good luck, and enjoy optimizing your matrix multiplication!
diff --git a/main.cpp b/main.cpp
index 65bf108..90d2ab1 100644
--- a/main.cpp
+++ b/main.cpp
@@ -3,50 +3,166 @@
 #include <string>
 #include <omp.h>
 #include <cmath>
+#include <stdint.h>
+#include <sstream>
+#include <algorithm>
 
 void naive_matmul(float *C, float *A, float *B, uint32_t m, uint32_t n, uint32_t p) {
     //TODO : Implement naive matrix multiplication
+    // A is m x n, B is n x p, C is m x p
+    for (int i = 0; i < m; i++) {
+        for (int j = 0; j < p; j++) {
+            float sum = 0;
+            for (int k = 0; k < n; k++) {
+                sum += A[i * n + k] * B[k * p + j];
+            }
+            C[i * p + j] = sum;
+        }
+    }
 }
 
 void blocked_matmul(float *C, float *A, float *B, uint32_t m, uint32_t n, uint32_t p, uint32_t block_size) {
     // TODO: Implement blocked matrix multiplication
     // A is m x n, B is n x p, C is m x p
     // Use block_size to divide matrices into submatrices
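+    for (uint32_t idx = 0; idx < m * p; idx++) C[idx] = 0.0f;  // C is accumulated with +=, so do not rely on the caller zeroing it
+    for (uint32_t ii = 0; ii < m; ii += block_size) {
+        for (uint32_t jj = 0; jj < p; jj += block_size) {
+            for (uint32_t kk = 0; kk < n; kk += block_size) {
+                // Multiply one block of A by one block of B, clamping at the matrix edges
+                for (uint32_t i = ii; i < std::min(ii + block_size, m); i++) {
+                    for (uint32_t j = jj; j < std::min(jj + block_size, p); j++)
+                        for (uint32_t k = kk; k < std::min(kk + block_size, n); k++)
+                            C[i * p + j] += A[i * n + k] * B[k * p + j];
+                }
+            }
+        }
+    }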
 }
 
 void parallel_matmul(float *C, float *A, float *B, uint32_t m, uint32_t n, uint32_t p) {
     // TODO: Implement parallel matrix multiplication using OpenMP
     // A is m x n, B is n x p, C is m x p
+    #pragma omp parallel for
+    for (int i = 0; i < m; i++) {
+        for (int j = 0; j < p; j++) {
+            float sum = 0.0f;  // local accumulator, so C does not need to be zeroed first
+            for (int k = 0; k < n; k++) sum += A[i * n + k] * B[k * p + j];
+            C[i * p + j] = sum;
+        }
+    }
+}
+float* read_file(const std::string &file, int *row, int *col){
+    std::ifstream file_reader(file);
+    std::string meta_data = "";
+    getline(file_reader, meta_data);
+    std::stringstream row_col(meta_data);
+    
+    row_col >> *row;
+    row_col >> *col;
+
+    float *array = new float[*row * *col];
+
+    int index = 0;
+    std::string cur_line;
+    while (getline (file_reader, cur_line)) {
+        std::stringstream values(cur_line);
+        float value;
+        while (values >> value){
+            array[index++] = value;
+        }
+    }
+    file_reader.close();
+    return array;
+}
+float round_up(float value, int decimal_places) {
+    const int multiplier = std::pow(10, decimal_places);
+    return roundf(value * multiplier) / multiplier;
+}
+void write_file (const std::string &file, float *array, int row, int col){
+    std::ofstream file_writer(file);
+    std::stringstream text;
+    // Write the first line with the row and column counts
+    text << row << " " << col << std::endl;
+    // Write the rest of the data in the right format
+    for(int i=0; i < row; i++){
+        for(int j=0; j < col; j++){
+            // The provided output.raw files appear to have values rounded to 2 decimal places. This can cause some errors, but rerunning 
+            text << round_up(array[i * col + j], 2)<< " ";
+        }
+        text << std::endl;
+    }
+
+
+    file_writer << text.str();
+    file_writer.close();
 }
 
 bool validate_result(const std::string &result_file, const std::string &reference_file) {
-   //TODO : Implement result validation
+   
+    // Read result file
+    int res_row;
+    int res_col;
+    float* res = read_file(result_file, &res_row, &res_col); 
+
+    // Read reference file
+    int ref_row;
+    int ref_col;
+    float* ref = read_file(reference_file, &ref_row, &ref_col); 
+
+    // If the result and reference files differ in row or column count,
+    // they cannot be equal
+    if (res_row != ref_row || res_col != ref_col) {
+        return false;
+    }
+
+    // Check if the values inside the files are the same (within a small margin of error)
+    int row = res_row;
+    int col = res_col;
+    float max_err = 1e-3;
+
+
+    for(int i=0; i < row; i++) {
+        for(int j=0; j < col; j++) {
+            float dif = std::abs(res[i * col + j] - ref[i * col + j]);
+            if (dif > max_err){
+                std::cout << "Difference found at: " << i << ", " << j <<std::endl;
+                return false;
+            }
+
+        }
+    }
+
+   return true;
 }
 
+
+
 int main(int argc, char *argv[]) {
     if (argc != 2) {
         std::cerr << "Usage: " << argv[0] << " <case_number>" << std::endl;
         return 1;
     }
-
+    
     int case_number = std::atoi(argv[1]);
     if (case_number < 0 || case_number > 9) {
         std::cerr << "Case number must be between 0 and 9" << std::endl;
         return 1;
     }
-
     // Construct file paths
     std::string folder = "data/" + std::to_string(case_number) + "/";
     std::string input0_file = folder + "input0.raw";
     std::string input1_file = folder + "input1.raw";
     std::string result_file = folder + "result.raw";
     std::string reference_file = folder + "output.raw";
-
+    
+    int m;
+    int n;
+    int p;
     // TODO Read input0.raw (matrix A)
-
+    float *A = read_file(input0_file, &m, &n);
 
     // TODO Read input1.raw (matrix B)
-
+    float *B = read_file(input1_file, &n, &p);
 
     // Allocate memory for result matrices
     float *C_naive = new float[m * p];
@@ -59,7 +175,7 @@ int main(int argc, char *argv[]) {
     double naive_time = omp_get_wtime() - start_time;
 
     // TODO Write naive result to file
-
+    write_file(result_file, C_naive, m, p);
 
     // Validate naive result
     bool naive_correct = validate_result(result_file, reference_file);
@@ -68,11 +184,13 @@ int main(int argc, char *argv[]) {
     }
 
     // Measure performance of blocked_matmul (use block_size = 32 as default)
+    int block_size = 64;
     start_time = omp_get_wtime();
-    blocked_matmul(C_blocked, A, B, m, n, p, 32);
+    blocked_matmul(C_blocked, A, B, m, n, p, block_size);
     double blocked_time = omp_get_wtime() - start_time;
 
     // TODO Write blocked result to file
+    write_file(result_file, C_blocked, m, p);
 
 
     // Validate blocked result
@@ -87,6 +205,7 @@ int main(int argc, char *argv[]) {
     double parallel_time = omp_get_wtime() - start_time;
 
     // TODO Write parallel result to file
+    write_file(result_file, C_parallel, m, p);
 
 
     // Validate parallel result