
[feature] Unicode property escapes not supported (\p{L}, \p{N}, etc.) #40

@jbachorik


Summary

Unicode property escapes (\p{L}, \p{N}, \P{L}, etc.) are not supported. Patterns using them are currently rejected or silently mis-matched.

Examples

  • \p{L} — any Unicode letter
  • \p{N} — any Unicode number
  • \p{Lu} — uppercase letter
  • \P{L} — negated: any non-letter
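For reference, the JDK's own java.util.regex already implements these escapes, so it could serve as an oracle for the expected behavior. A minimal sketch follows; the class and helper names are made up for illustration and are not part of this project:

```java
import java.util.regex.Pattern;

// Hypothetical reference check: uses java.util.regex, NOT this project's engine,
// to show what each property escape is expected to match.
final class PropertyEscapeReference {

    // True if the whole input matches the given pattern.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("\\p{L}", "\u00e9"));  // e-acute is a letter: true
        System.out.println(matches("\\p{N}", "\u0663"));  // Arabic-Indic digit three: true
        System.out.println(matches("\\p{Lu}", "\u00c9")); // uppercase E-acute: true
        System.out.println(matches("\\p{Lu}", "\u00e9")); // lowercase e-acute: false
        System.out.println(matches("\\P{L}", "1"));       // a digit is a non-letter: true
        System.out.println(matches("\\P{L}", "a"));       // a letter: false
    }
}
```

Differential tests against these results might be a cheap way to validate an implementation once the parser accepts the syntax.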

Impact

  • 66 PCRE tests are currently filtered out entirely because they use unsupported features, including Unicode properties
  • Common in real-world patterns for internationalized text

Implementation Notes

  • Difficulty: Medium-High (large Unicode category tables required)
  • Files: RegexParser.java (parse \p{...}), new UnicodePropertyCharClass AST node, ThompsonBuilder.java, charset integration
  • Unicode data can be derived from the JDK's Character class (e.g. Character.getType) to avoid external dependencies
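One possible way to build the category tables from the JDK is to scan every code point once with Character.getType and collapse the hits into sorted inclusive ranges. The sketch below is only an illustration under that assumption: UnicodeCategoryRanges, inCategory, ranges and contains are hypothetical names, it covers only the L, Lu and N categories from the examples, and it does not reflect how UnicodePropertyCharClass or the charset code would actually be structured.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: derives code point ranges for a general category
// from java.lang.Character, with no external Unicode data files.
final class UnicodeCategoryRanges {

    // True if cp belongs to the named general category (only L, Lu, N handled here).
    static boolean inCategory(String name, int cp) {
        int type = Character.getType(cp);
        switch (name) {
            case "L":
                return type == Character.UPPERCASE_LETTER
                    || type == Character.LOWERCASE_LETTER
                    || type == Character.TITLECASE_LETTER
                    || type == Character.MODIFIER_LETTER
                    || type == Character.OTHER_LETTER;
            case "Lu":
                return type == Character.UPPERCASE_LETTER;
            case "N":
                return type == Character.DECIMAL_DIGIT_NUMBER
                    || type == Character.LETTER_NUMBER
                    || type == Character.OTHER_NUMBER;
            default:
                throw new IllegalArgumentException("unsupported property: " + name);
        }
    }

    // Sorted, non-overlapping inclusive ranges {lo, hi}; negated=true models \P{...}.
    static List<int[]> ranges(String name, boolean negated) {
        List<int[]> out = new ArrayList<>();
        int start = -1; // start of the currently open range, or -1 if none
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            boolean in = inCategory(name, cp) != negated;
            if (in && start < 0) {
                start = cp;
            } else if (!in && start >= 0) {
                out.add(new int[] {start, cp - 1});
                start = -1;
            }
        }
        if (start >= 0) {
            out.add(new int[] {start, Character.MAX_CODE_POINT});
        }
        return out;
    }

    // Binary search over the sorted ranges.
    static boolean contains(List<int[]> ranges, int cp) {
        int lo = 0, hi = ranges.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int[] r = ranges.get(mid);
            if (cp < r[0]) hi = mid - 1;
            else if (cp > r[1]) lo = mid + 1;
            else return true;
        }
        return false;
    }
}
```

The resulting tables depend on the Unicode version bundled with the running JDK, so results may differ slightly from PCRE for recently added characters. Computing the ranges once per category and caching them would keep the full scan off the matching path.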


Labels: enhancement
