Skip to content

Fix no_dereference symlink race in Python API#1407

Open
Akalanka1337 wants to merge 2 commits into
google:mainfrom
Akalanka1337:fix-python-no-dereference-symlink-race
Open

Fix no_dereference symlink race in Python API#1407
Akalanka1337 wants to merge 2 commits into
google:mainfrom
Akalanka1337:fix-python-no-dereference-symlink-race

Conversation

@Akalanka1337

Copy link
Copy Markdown

Summary

This PR fixes a TOCTOU symlink race in the Python API when Magika(no_dereference=True) is used.

Previously, Magika checked whether a path was a symlink using path.is_symlink() and later opened the same path with open(path, "rb"). Because the check and open were separate filesystem operations, a file could be replaced with a symlink between them.

This could cause Magika to follow and process a symlink target despite no_dereference=True.

Impact

Applications may rely on no_dereference=True to avoid following symlinks when scanning attacker-controlled or shared directories.

Without an atomic open, an attacker with write access to a scanned directory can race the symlink check and cause Magika to process a different readable file than the one originally checked.

This weakens the expected symlink-protection behavior and can expose file-derived information to downstream consumers.

Fix

This PR makes symlink handling atomic at file-open time where supported.

The patch avoids relying only on a pre-open is_symlink() check and ensures symlink targets are not followed when no_dereference=True.

Testing

Added/updated regression coverage for Magika(no_dereference=True) symlink handling.

Tested locally by confirming that symlinks are reported as symlinks and are not followed when no_dereference=True.

Notes

This is a defense-in-depth hardening fix for the Python API.

@Akalanka1337 Akalanka1337 requested a review from reyammer as a code owner May 27, 2026 16:16
@google-cla

google-cla Bot commented May 27, 2026

Copy link
Copy Markdown

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@Akalanka1337 Akalanka1337 force-pushed the fix-python-no-dereference-symlink-race branch from d845ee8 to b63bbb7 Compare May 27, 2026 16:30
@reyammer

reyammer commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator

Hello,

sorry for the delay in getting back to you, and thank you for the high-quality PR.

I've reviewed it. It looks like the PR does what it's supposed to do.

My main concern is that it adds quite some complex code, in the critical path, and in part it seems a duplicate of some logic we already have somewhere else (e.g., detecting whether a path is a DIRECTORY, etc.).

Another observation: the no_dereference CLI option was never about security; it was about "if magika encounters a symlink, should it return 'symlink' or follow the path?". But I understand how one could assume to be a security feature.

Two questions:

  • Any idea how we could simplify this further / remove code dups / make the code more readable?
  • Do you have any realistic magika-related deployment scenario in mind where this bug is actually a security problem? I have difficulties imagining a scenario in which 1) the attacker has control over the file system AND 2) in which magika has anything security-related role AND 3) the output of magika (which is just a content type label, not the file content) would be of any importance in terms of "leaking something about a secret file".

To be clear: in principle I'd like to fix this bug as I can surely miss something; here I'm just trying to weigh the current complex code vs. bug fix urgency.

Thanks again!

@Akalanka1337

Copy link
Copy Markdown
Author

Thanks again for the thoughtful review, and I think those are both fair questions.

For simplification/readability:

I agree the earlier version of my patch overreached. It added too much logic in the hot path and duplicated some existing path classification behavior.

I think the cleanest approach is to keep the existing identify_path() flow almost unchanged and only harden the actual file open step when no_dereference=True. Concretely, that means:

  • keep the existing early path.is_symlink() behavior
  • keep the existing handling for missing paths, directories, permission errors, etc.
  • only replace open(path, "rb") with an os.open(..., O_NOFOLLOW)-based open when no_dereference=True
  • map ELOOP back to symlink

That keeps the change focused on the actual TOCTOU point, removes the duplicated directory/unknown/error handling, and makes the patch much easier to reason about.

For realistic security impact:

I also agree the practical security impact is limited, and I likely framed it too strongly in the initial report.

The bug is real in the sense that no_dereference=True can be bypassed by swapping the path between the symlink check and the later open. But whether that becomes a meaningful security issue depends heavily on how Magika is used downstream.

The most realistic scenarios I can think of are not cases where Magika directly leaks file contents, but cases where Magika is part of a larger automated decision pipeline, for example:

  • a multi-user or CI ingestion pipeline scanning files from a writable workspace where another actor can race path changes
  • malware analysis or file triage systems where Magika’s classification influences later handling
  • systems that rely on no_dereference=True as a correctness/safety expectation when traversing untrusted directories

Even in those cases, I agree the impact is more about violating the caller’s expectation and potentially influencing downstream logic than about a direct standalone information disclosure from Magika itself.

So my honest view is:

  • the bug is worth fixing
  • the strongest justification is correctness / robustness, with some security hardening value
  • describing it as a strong standalone information disclosure issue probably overstates the real-world impact

@reyammer

reyammer commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator

Thank you! I agree with your assessment. The other consideration is that it's likely much easier to confuse magika's detection itself than to mount TOCTOU attacks like these :-) But again, I'd like to fix this no matter what. Thanks for the updated PR, will take a look and follow up!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants