Fix no_dereference symlink race in Python API#1407
Conversation
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
d845ee8 to
b63bbb7
Compare
|
Hello, sorry for the delay in getting back to you, and thank you for the high-quality PR. I've reviewed it. It looks like the PR does what it's supposed to do. My main concern is that it adds quite some complex code, in the critical path, and in part it seems a duplicate of some logic we already have somewhere else (e.g., detecting whether a path is a DIRECTORY, etc.). Another observation: the no_dereference CLI option was never about security; it was about "if magika encounters a symlink, should it return 'symlink' or follow the path?". But I understand how one could assume to be a security feature. Two questions:
To be clear: in principle I'd like to fix this bug as I can surely miss something; here I'm just trying to weigh the current complex code vs. bug fix urgency. Thanks again! |
|
Thanks again for the thoughtful review, and I think those are both fair questions. For simplification/readability: I agree the earlier version of my patch overreached. It added too much logic in the hot path and duplicated some existing path classification behavior. I think the cleanest approach is to keep the existing
That keeps the change focused on the actual TOCTOU point, removes the duplicated directory/unknown/error handling, and makes the patch much easier to reason about. For realistic security impact: I also agree the practical security impact is limited, and I likely framed it too strongly in the initial report. The bug is real in the sense that The most realistic scenarios I can think of are not cases where Magika directly leaks file contents, but cases where Magika is part of a larger automated decision pipeline, for example:
Even in those cases, I agree the impact is more about violating the caller’s expectation and potentially influencing downstream logic than about a direct standalone information disclosure from Magika itself. So my honest view is:
|
|
Thank you! I agree with your assessment. The other consideration is that it's likely much easier to confuse magika's detection itself than to mount TOCTOU attacks like these :-) But again, I'd like to fix this no matter what. Thanks for the updated PR, will take a look and follow up! |
Summary
This PR fixes a TOCTOU symlink race in the Python API when
Magika(no_dereference=True)is used.Previously, Magika checked whether a path was a symlink using
path.is_symlink()and later opened the same path withopen(path, "rb"). Because the check and open were separate filesystem operations, a file could be replaced with a symlink between them.This could cause Magika to follow and process a symlink target despite
no_dereference=True.Impact
Applications may rely on
no_dereference=Trueto avoid following symlinks when scanning attacker-controlled or shared directories.Without an atomic open, an attacker with write access to a scanned directory can race the symlink check and cause Magika to process a different readable file than the one originally checked.
This weakens the expected symlink-protection behavior and can expose file-derived information to downstream consumers.
Fix
This PR makes symlink handling atomic at file-open time where supported.
The patch avoids relying only on a pre-open
is_symlink()check and ensures symlink targets are not followed whenno_dereference=True.Testing
Added/updated regression coverage for
Magika(no_dereference=True)symlink handling.Tested locally by confirming that symlinks are reported as symlinks and are not followed when
no_dereference=True.Notes
This is a defense-in-depth hardening fix for the Python API.