File renamed without changes.
2 changes: 1 addition & 1 deletion docker/dev/docker-compose-telemetry.yml
@@ -4,7 +4,7 @@ services:
image: otel/opentelemetry-collector:0.100.0
command: ["--config=/etc/otel-collector-config.yaml"]
volumes:
- ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
- ../../data/otel-collector-config.yaml:/etc/otel-collector-config.yaml
ports:
- "127.0.0.1:4317:4317"
- "127.0.0.1:4318:4318"
2 changes: 1 addition & 1 deletion docker/dev/docker-compose.yml
@@ -28,7 +28,7 @@ services:
- TIMESKETCH_PASSWORD=dev
- CHOKIDAR_USEPOLLING=true
- PROMETHEUS_MULTIPROC_DIR=/tmp/
- ENABLE_STRUCTURED_LOGGING=true
# - ENABLE_STRUCTURED_LOGGING=true
# Telemetry settings (Uncomment to enable locally)
# - TIMESKETCH_OTEL_MODE=otlp-grpc
# - TIMESKETCH_OTLP_GRPC_ENDPOINT=otel-collector:4317
61 changes: 12 additions & 49 deletions docs/OpenTelemetry.md
@@ -7,19 +7,16 @@ This document provides a comprehensive guide for developers, admins, and users o
## 1. Overview
Timesketch uses OpenTelemetry to provide distributed tracing across its web (Flask) and worker (Celery) components. This enables deep observability into request life cycles and background task performance.

### Key Benefits
* **Distributed Tracing:** Track a single request from an external tool (like `dftimewolf`) through the API and into background analyzers.
* **Log Correlation:** Trace IDs and Span IDs are automatically injected into structured JSON logs, allowing you to jump from a log line directly to a trace waterfall in tools like GCP Cloud Trace or Jaeger.
* **Standardized Protocol:** Uses the industry-standard OpenTelemetry Protocol (OTLP).

---

## 2. Architecture
The instrumentation is centralized in a dedicated module: `timesketch/lib/telemetry.py`.
The instrumentation is centralized in `timesketch/lib/telemetry.py`.

* **Flask Instrumentation:** Automatically captures spans for all HTTP requests, including route patterns and status codes.
* **Celery Instrumentation:** Captures spans for both task dispatching (producer) and execution (worker), maintaining the trace context across process boundaries.
* **Async Exporting:** Spans are exported asynchronously using a `BatchSpanProcessor` to ensure minimal impact on application performance.
* **Flask:** Captures all HTTP requests, status codes, and analyst identity.
* **Celery:** Maintains trace context across background tasks (analyzers, data imports).
* **OpenSearch:** Manual instrumentation captures search query structure (`db.statement`), targeted indices, and internal processing time (`took_ms`).
* **SQLAlchemy (Postgres):** Automatically captures SQL statements and database connection health.
* **Async Exporting:** Uses `BatchSpanProcessor` to export spans in the background, keeping the impact on application performance minimal.
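
As a rough sketch of the asynchronous export path listed above, the snippet below attaches a `BatchSpanProcessor` to a `TracerProvider` using the public OpenTelemetry SDK. The helper name `build_tracer_provider` and the choice of `ConsoleSpanExporter` are illustrative assumptions; this is not the code in `timesketch/lib/telemetry.py`.

```python
def build_tracer_provider():
    """Hypothetical helper: returns a TracerProvider that exports in batches."""
    # Imported lazily so this sketch only needs opentelemetry-sdk when called.
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import (
        BatchSpanProcessor,
        ConsoleSpanExporter,
    )

    provider = TracerProvider()
    # BatchSpanProcessor queues finished spans and exports them from a
    # background thread, so request handlers do not wait on the exporter.
    # ConsoleSpanExporter is a stand-in (assumption) for an OTLP exporter.
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    return provider
```

In a real deployment the console exporter would presumably be replaced by an OTLP or GCP exporter selected from the environment variables.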

---

@@ -33,6 +30,7 @@ Telemetry is controlled entirely via environment variables.
| `TIMESKETCH_OTLP_HTTP_ENDPOINT` | OTLP collector endpoint (HTTP). | `http://jaeger:4318/v1/traces` |
| `TIMESKETCH_OTLP_INSECURE` | Use insecure (non-TLS) connection. | `true` (default for dev) |
| `TIMESKETCH_ENV` | Environment identifier. | `production`, `development` |
| `ENABLE_STRUCTURED_LOGGING` | Enable JSON logging with trace context. | `true`, `false` |
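
To make the table concrete, here is a minimal sketch of how these variables could be read into a settings dictionary. The function `read_telemetry_settings`, the dictionary keys, and the fallback values for `TIMESKETCH_OTLP_INSECURE` and `TIMESKETCH_ENV` are assumptions for illustration; only the `false` default for `ENABLE_STRUCTURED_LOGGING` is taken from `timesketch/app.py`.

```python
import os


def read_telemetry_settings(environ=None):
    """Hypothetical helper: collect the telemetry variables from the table."""
    env = os.environ if environ is None else environ

    def as_bool(name, default):
        # Treat only the literal string "true" (any case) as enabled.
        return env.get(name, default).strip().lower() == "true"

    return {
        "mode": env.get("TIMESKETCH_OTEL_MODE", ""),
        "grpc_endpoint": env.get("TIMESKETCH_OTLP_GRPC_ENDPOINT", ""),
        "http_endpoint": env.get("TIMESKETCH_OTLP_HTTP_ENDPOINT", ""),
        # Assumed default: the table lists "true" as the dev default.
        "insecure": as_bool("TIMESKETCH_OTLP_INSECURE", "true"),
        # Assumed default: the table does not state one.
        "environment": env.get("TIMESKETCH_ENV", "development"),
        "structured_logging": as_bool("ENABLE_STRUCTURED_LOGGING", "false"),
    }


# Passing a dict instead of os.environ keeps the example deterministic.
settings = read_telemetry_settings(
    {
        "TIMESKETCH_OTEL_MODE": "otlp-grpc",
        "TIMESKETCH_OTLP_GRPC_ENDPOINT": "otel-collector:4317",
        "ENABLE_STRUCTURED_LOGGING": "true",
    }
)
print(settings["mode"], settings["grpc_endpoint"], settings["structured_logging"])
```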

### Supported Modes:
1. **`otlp-grpc`:** Best for local collectors (e.g., OTel Collector or Jaeger).
@@ -74,7 +72,7 @@ The Tilt dashboard will show `otel-collector` and `jaeger` resources, including

---

## 5. Visualization Options
## 6. Visualization Options

The local environment provides two ways to see your traces. You can switch between them by changing the `TIMESKETCH_OTLP_GRPC_ENDPOINT`.

@@ -91,7 +89,7 @@ The local environment provides two ways to see your traces. You can switch betwe

---

## 6. Triggering Activity & Verification
## 7. Triggering Activity & Verification
Generate some traffic to verify the setup:
```bash
# Trigger a Flask Trace (API Call)
@@ -102,14 +100,14 @@ docker exec timesketch-dev celery -A timesketch.lib.tasks call timesketch.lib.ta
```

**Check Application Logs:**
Verify that `trace_id` and `span_id` appear in the JSON output:
Verify that `trace_id` appears in the output:
```bash
docker logs timesketch-dev | grep trace_id
```
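
The `trace_id` in those log lines comes from a logging filter that runs before the formatter. The sketch below shows one way such a filter could work. The names `TraceContextFilter` and `gcp_correlation_fields` are hypothetical, and the real `telemetry.TraceLogFilter` may behave differently (for example, in what it sets when no span is active).

```python
import logging

try:
    from opentelemetry import trace
except ImportError:  # The sketch still runs without OpenTelemetry installed.
    trace = None


class TraceContextFilter(logging.Filter):
    """Hypothetical filter that copies the active trace context onto records."""

    def filter(self, record):
        # Always set both attributes so "%(trace_id)s" can be formatted.
        record.trace_id = ""
        record.span_id = ""
        if trace is not None:
            context = trace.get_current_span().get_span_context()
            if context.is_valid:
                record.trace_id = trace.format_trace_id(context.trace_id)
                record.span_id = trace.format_span_id(context.span_id)
        return True  # Never drop the record; this filter only annotates it.


def gcp_correlation_fields(trace_id, span_id, project_id):
    """Build the Cloud Logging fields that link a log entry to a trace."""
    if not (trace_id and project_id):
        return {}
    return {
        "logging.googleapis.com/trace": f"projects/{project_id}/traces/{trace_id}",
        "logging.googleapis.com/spanId": span_id,
    }


handler = logging.StreamHandler()
handler.addFilter(TraceContextFilter())
handler.setFormatter(
    logging.Formatter("%(name)s/%(levelname)s [trace_id=%(trace_id)s] %(message)s")
)
demo_logger = logging.getLogger("telemetry_demo")
demo_logger.addHandler(handler)
demo_logger.warning("outside any span, trace_id is empty")
```

Attaching the filter to the handler rather than the logger means records propagated from child loggers are annotated as well.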

---

## 7. Secure Private Access (GCP)
## 8. Secure Private Access (GCP)
If you are running Timesketch on a private GCE VM without an external IP, you can "proxy in" securely using **Identity-Aware Proxy (IAP) Tunneling**.

### Accessing the Web Interfaces
@@ -145,43 +143,8 @@ gcloud compute start-iap-tunnel timesketch-otel-lab 5000 \

---

## 8. Deployment Guide (GCP)
## 9. Deployment Guide (GCP)
To enable production tracing in GCP:
1. Set `TIMESKETCH_OTEL_MODE=otlp-default-gce`.
2. Ensure the service account running Timesketch has the `roles/cloudtrace.agent` role.
3. View your traces in the [GCP Trace Explorer](https://console.cloud.google.com/traces/explorer).

---

## 8. Information for Developers

### Automated Coverage
Most common operations are already covered by auto-instrumentation:
* **Web API:** All Flask routes, status codes, and HTTP methods.
* **Background Tasks:** All Celery task dispatching and executions.
* **Analyzers:** All analyzers automatically report `sketch_id`, `analyzer_name`, `timeline_id`, and execution status via the `BaseAnalyzer` interface.

### Adding Custom Attributes & Events
If you need to record specific domain metadata (e.g., number of matches found, search query used) from within your code, use the helpers in `timesketch.lib.telemetry`.

#### Example: Adding attributes in an Analyzer
```python
from timesketch.lib import telemetry

def analyze(self):
    # ... logic ...
    matches_found = len(results)

    # This will appear in the Span attributes in Jaeger/GCP
    telemetry.add_attribute_to_current_span("sigma.matches_count", matches_found)

    # Record a significant milestone as an event
    telemetry.add_event_to_current_span("Finished parsing rules")

    return f"Found {matches_found} matches."
```

#### Best Practices for Attributes
* **Use Namespace Prefixes:** To avoid collisions, prefix your attributes (e.g., `sigma.rule_id`, `sketch.member_count`).
* **Data Types:** Simple types (strings, ints, bools, floats) are stored natively. Complex objects (dicts, lists) are automatically serialized to JSON.
* **Avoid PII:** Never record sensitive user data or authentication tokens in span attributes.
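
The first two practices can be illustrated with a small, standalone sketch. The helper `normalize_attribute` and its default `sigma` namespace are hypothetical; it only mimics the described behavior (native storage for simple types, JSON for complex ones) and is not the implementation in `timesketch.lib.telemetry`.

```python
import json

PRIMITIVE_TYPES = (str, bool, int, float)


def normalize_attribute(key, value, namespace="sigma"):
    """Hypothetical helper: namespace the key and make the value span-safe."""
    # Prefix un-namespaced keys to avoid collisions between analyzers.
    full_key = key if "." in key else f"{namespace}.{key}"
    if isinstance(value, PRIMITIVE_TYPES):
        return full_key, value
    # Dicts, lists and other objects are stored as a JSON string.
    return full_key, json.dumps(value, sort_keys=True, default=str)


print(normalize_attribute("matches_count", 3))
print(normalize_attribute("sketch.tags", ["a", "b"]))
```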
1 change: 1 addition & 0 deletions requirements.txt
@@ -9,6 +9,7 @@ opentelemetry-api==1.24.0
opentelemetry-sdk==1.24.0
opentelemetry-instrumentation-flask==0.45b0
opentelemetry-instrumentation-celery==0.45b0
opentelemetry-instrumentation-sqlalchemy>=0.45b0
google-cloud-trace>=1.4.0
opentelemetry-exporter-gcp-trace>=1.6.0
opentelemetry-exporter-otlp==1.24.0
34 changes: 15 additions & 19 deletions timesketch/app.py
@@ -30,10 +30,6 @@
from flask_restful import Api
from flask_wtf import CSRFProtect

try:
    from opentelemetry import trace
except ImportError:
    trace = None
from timesketch.lib import telemetry

from timesketch.api.v1.routes import API_ROUTES as V1_API_ROUTES
@@ -267,20 +263,17 @@ def format(self, record):
"module": record.module,
}

        if trace:
            span_context = trace.get_current_span().get_span_context()
            if span_context.is_valid:
                t_id = trace.format_trace_id(span_context.trace_id)
                s_id = trace.format_span_id(span_context.span_id)
                log_record["trace_id"] = t_id
                log_record["span_id"] = s_id
                # GCP specific correlation fields
                project_id = os.getenv("GOOGLE_CLOUD_PROJECT")
                if project_id:
                    log_record["logging.googleapis.com/trace"] = (
                        f"projects/{project_id}/traces/{t_id}"
                    )
                    log_record["logging.googleapis.com/spanId"] = s_id
        # Add trace correlation if TraceLogFilter has run
        if hasattr(record, "trace_id"):
            log_record["trace_id"] = record.trace_id
            log_record["span_id"] = record.span_id
            # GCP specific correlation fields
            project_id = os.getenv("GOOGLE_CLOUD_PROJECT")
            if project_id:
                log_record["logging.googleapis.com/trace"] = (
                    f"projects/{project_id}/traces/{record.trace_id}"
                )
                log_record["logging.googleapis.com/spanId"] = record.span_id

if record.exc_info:
formatted_trace = self.formatException(record.exc_info)
@@ -290,6 +283,7 @@ def format(self, record):

logger_object = logging.getLogger("timesketch")
logger_filter = NoESFilter()
trace_filter = telemetry.TraceLogFilter()

use_structured_logging = (
os.environ.get("ENABLE_STRUCTURED_LOGGING", "false").lower() == "true"
@@ -299,6 +293,7 @@ def format(self, record):
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JSONLogFormatter(datefmt="%Y-%m-%dT%H:%M:%S%z"))
handler.addFilter(logger_filter)
handler.addFilter(trace_filter)

root = logging.getLogger()
for h in root.handlers[:]:
@@ -310,11 +305,12 @@ def format(self, record):

else:
logger_formatter = logging.Formatter(
"[%(asctime)s] %(name)s/%(levelname)s %(message)s"
"[%(asctime)s] %(name)s/%(levelname)s [trace_id=%(trace_id)s] %(message)s"
)
for handler in logger_object.parent.handlers:
handler.setFormatter(logger_formatter)
handler.addFilter(logger_filter)
handler.addFilter(trace_filter)


def create_celery_app():
2 changes: 2 additions & 0 deletions timesketch/lib/analyzers/interface.py
@@ -1157,6 +1157,7 @@ def run_wrapper(self, analysis_id):
analysis.set_status("DONE")

telemetry.add_attribute_to_current_span("status", "success")
telemetry.set_status_on_current_span("OK")
telemetry.add_event_to_current_span(f"Analyzer {self.name} completed")
except Exception as e: # pylint: disable=broad-except
analysis.set_status("ERROR")
@@ -1170,6 +1171,7 @@ def run_wrapper(self, analysis_id):
)
telemetry.add_attribute_to_current_span("status", "error")
telemetry.add_attribute_to_current_span("error_message", str(e))
telemetry.set_status_on_current_span("ERROR", description=str(e))
telemetry.add_event_to_current_span(f"Analyzer {self.name} failed")

# Update database analysis object with result and status