
[CLAUDE] annotate_types(dialect=...) is silently overridden by schema.dialect #7587

@RichardHughes-amp

Description


sqlglot.optimizer.annotate_types.annotate_types(expression, schema=..., dialect=...)
appears to accept a call-site dialect for type-annotation dispatch, but the
kwarg is silently dropped when schema is a Schema instance whose own
.dialect is set. The EXPRESSION_METADATA actually used comes from
schema.dialect, not the call-site dialect.

This makes it possible to write seemingly correct cross-dialect annotation
code that silently dispatches through one dialect's typing module for all
calls.

Reproduction (sqlglot main, commit 9f169ab)

from sqlglot import exp, parse_one
from sqlglot.optimizer.annotate_types import annotate_types
from sqlglot.optimizer.qualify import qualify
from sqlglot.schema import MappingSchema

# Schema built once, with hive
schema = MappingSchema({"t": {"e": "TIMESTAMP"}}, dialect="hive")

# Same SQL, same schema, but call-site dialect varies
sql = "SELECT date_add(e, 24) AS r FROM t"
for d in ["hive", "spark", "databricks"]:
    ast = qualify(parse_one(sql, read=d), schema=schema, dialect=d)
    annotated = annotate_types(ast, schema=schema, dialect=d)
    print(f"{d:10s} -> {annotated.selects[0].this.type}")

# Output (sqlglot main):
#   hive       -> UNKNOWN
#   spark      -> UNKNOWN     <-- Spark.EXPRESSION_METADATA is NOT consulted
#   databricks -> UNKNOWN     <-- Databricks.EXPRESSION_METADATA is NOT consulted
#
# Expected (or at least: what the signature suggests):
#   hive       -> UNKNOWN
#   spark      -> <whatever Spark typing says for TsOrDsAdd>
#   databricks -> <whatever Databricks typing says for TsOrDsAdd>

If the schema is rebuilt per-iteration with the matching dialect, the
expected per-dialect dispatch occurs. So the workaround is "build the
schema with the dialect you intend to annotate against."

Root cause

TypeAnnotator.__init__ (sqlglot/optimizer/annotate_types.py:202-205):

self.schema = schema
dialect = schema.dialect or Dialect()
self.dialect = dialect
self.expression_metadata = expression_metadata or dialect.EXPRESSION_METADATA

The schema's dialect wins; the dialect kwarg passed into annotate_types
is forwarded to ensure_schema(schema, dialect=...) and used only when
constructing a schema from raw input. Once a Schema instance exists, its
.dialect is the only source consulted for typing dispatch.

Suggested resolutions (in order of conservatism)

  1. Docstring note on annotate_types stating that schema.dialect
    takes precedence over the dialect kwarg for typing dispatch when a
    Schema instance is passed. Cheapest, no behavior change.

  2. Prefer the call-site dialect in TypeAnnotator.__init__ when one
    is forwarded, falling back to schema.dialect. Behavior change, but
    matches what the public signature implies.

  3. Plumb dialect through to TypeAnnotator so the precedence is
    explicit at the constructor level, not implicit via schema.dialect.
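For concreteness, option (2) amounts to a one-line precedence change. The sketch below models it in plain Python; FakeDialect and resolve_expression_metadata are illustrative stand-ins, not sqlglot APIs:

```python
# Pure-Python model of option (2): prefer the call-site dialect when one
# is explicitly passed, falling back to schema.dialect.

class FakeDialect:
    """Stand-in for a sqlglot Dialect carrying a typing-dispatch table."""

    def __init__(self, name, expression_metadata):
        self.name = name
        self.EXPRESSION_METADATA = expression_metadata


def resolve_expression_metadata(schema_dialect, call_site_dialect=None,
                                expression_metadata=None):
    """Precedence: explicit metadata > call-site dialect > schema.dialect."""
    dialect = call_site_dialect or schema_dialect or FakeDialect("default", {})
    return expression_metadata or dialect.EXPRESSION_METADATA


hive = FakeDialect("hive", {"TsOrDsAdd": "hive typing"})
spark = FakeDialect("spark", {"TsOrDsAdd": "spark typing"})

# With the change, the call-site dialect wins when passed...
assert resolve_expression_metadata(hive, spark)["TsOrDsAdd"] == "spark typing"
# ...and schema.dialect remains the fallback when it is not.
assert resolve_expression_metadata(hive)["TsOrDsAdd"] == "hive typing"
```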

I lean toward (1) as the smallest correct change — schema.dialect
winning is internally consistent (column types in a schema are
dialect-flavored), so the surprise is mostly a documentation gap.

Encountered while writing tests for #7588 (the Databricks
date_add/dateadd disambiguation patch).
