This is Alpha 3 Software. Finished releases must be moved to svn:dist/release following these promoting to release instructions.

3.14. Input validation

Overview

Input validation is critical for ATR's security posture. As a system that handles cryptographic signatures and release artifacts, ATR must ensure that all user input is properly validated before processing. This page documents the validation strategies and patterns used throughout the codebase.

Defense in depth

ATR employs multiple layers of validation:

  1. Transport layer: HTTPS required, enforced by httpd
  2. Request layer: Size limits enforced by httpd (MAX_CONTENT_LENGTH)
  3. Form layer: Pydantic models validate structure and types
  4. Application layer: Business logic validation in route handlers, interaction helpers, and storage writers
  5. Database layer: SQLAlchemy ORM with parameterized queries, plus constraints
  6. Markdown layer: Via cmarkgfm
  7. Output layer: Jinja2 auto-escaping for HTML output

Each layer provides independent protection, so a failure in one layer does not by itself compromise the system.

Form validation with Pydantic

All form inputs in ATR are validated through Pydantic models defined in form.py. The base class for forms is Form, which extends Pydantic's BaseModel.

Defining form fields

Form fields are defined using Python type annotations and the label function:

class ExampleForm(Form):
    name: str = label("Project name", "Enter the project name")
    count: int = label("Count", widget=Widget.NUMBER)
    email: EmailStr = label("Contact email", widget=Widget.EMAIL)

The label function accepts a description (shown to users), optional documentation, and an optional widget hint for rendering.

Validation process

When a form is submitted, ATR:

  1. Extracts form data from the request via quart_request
  2. Passes the data to the Pydantic model for validation
  3. If validation fails, collects errors via flash_error_data
  4. Displays errors to the user with flash_error_summary
  5. If validation succeeds, proceeds with the validated data

Pydantic provides built-in validators for common types (strings, integers, emails, URLs) and supports custom validators via decorators.

Custom validators

For complex validation logic, use Pydantic's @model_validator decorator:

import re

from pydantic import model_validator

class ReleaseForm(Form):
    version: str = label("Version")

    # Runs after field validation, so self.version is already a str
    @model_validator(mode="after")
    def validate_version_format(self):
        if not re.match(r"^\d+\.\d+\.\d+", self.version):
            raise ValueError("Version must start with X.Y.Z")
        return self

CSRF protection

All POST forms must include a CSRF token. The token is generated by csrf_input and validated automatically by Quart-WTF:

def csrf_input() -> htm.VoidElement:
    csrf_token = utils.generate_csrf()
    return htpy.input(type="hidden", name="csrf_token", value=csrf_token)

In templates, include the CSRF token in every form:

<form method="post">
    {{ csrf_input() }}
    <!-- other form fields -->
</form>

The CSRF token is tied to the user's session and validated on form submission. Requests without a valid CSRF token are rejected. When using the form module renderer, the CSRF token is added automatically.

Validation rules by input type

ASF User IDs

User IDs are validated against a strict pattern in principal.py:

if not re.match(r"^[-_a-z0-9]+$", user):
    raise CommitterError("Invalid characters in User ID")

Only lowercase alphanumeric characters, hyphens, and underscores are permitted.
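
As a quick illustration of how this pattern behaves, the sketch below wraps it in a hypothetical helper (is_valid_uid is an assumption for illustration, not an ATR function). One detail of Python's re module is relevant here: `$` also matches just before a trailing newline, so re.fullmatch is the stricter form. Whether that matters depends on how the caller prepares the input.

```python
import re

UID_PATTERN = re.compile(r"[-_a-z0-9]+")


def is_valid_uid(user: str) -> bool:
    """Hypothetical helper: True only if the whole string is a valid UID."""
    # fullmatch anchors at both ends and, unlike "$", does not tolerate
    # a trailing newline.
    return UID_PATTERN.fullmatch(user) is not None


# The pattern from principal.py, applied with re.match as in the original.
assert re.match(r"^[-_a-z0-9]+$", "jdoe") is not None
assert re.match(r"^[-_a-z0-9]+$", "J.Doe") is None
# "$" matches before a final newline, so re.match accepts this input...
assert re.match(r"^[-_a-z0-9]+$", "jdoe\n") is not None
# ...while the fullmatch-based helper rejects it.
assert not is_valid_uid("jdoe\n")
```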

Email addresses

Email validation uses Pydantic's EmailStr type, which relies on the email-validator package to check address syntax:

from pydantic import EmailStr

class ContactForm(Form):
    email: EmailStr = label("Email address")

URLs

URL validation uses Pydantic's HttpUrl type:

from pydantic import HttpUrl

class LinkForm(Form):
    website: HttpUrl = label("Website URL")

Version strings

Version strings are validated according to project-specific patterns. The general pattern requires a numeric major.minor prefix and accepts any suffix after it, which covers semantic versions with optional suffixes:

VERSION_PATTERN = re.compile(r"^[0-9]+\.[0-9]+.*$")
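
To make the behavior of this pattern concrete, the following sketch runs it against a few sample strings. The sample values are illustrative assumptions, not taken from ATR's test suite.

```python
import re

VERSION_PATTERN = re.compile(r"^[0-9]+\.[0-9]+.*$")

# Illustrative samples only.
accepted = ["1.2", "1.2.3", "2.0.0-rc1", "10.4.1+build.7"]
rejected = ["1", "v1.2", "", ".5"]

for candidate in accepted:
    assert VERSION_PATTERN.match(candidate) is not None

for candidate in rejected:
    assert VERSION_PATTERN.match(candidate) is None

# Anything may follow the "major.minor" prefix, so this also matches.
assert VERSION_PATTERN.match("1.2 anything goes") is not None
```

Because the suffix is unconstrained, project-specific patterns are the place to enforce a tighter format.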

Committee and project names

Committee and project names are validated against the set of known committees and projects from LDAP and the ASF project database. Unknown names are rejected.

File names

File names in uploads are sanitized to prevent path traversal:

  • Directory separators (/, \) and the path token .. are rejected or stripped
  • Null bytes are rejected
  • Only expected extensions are permitted per upload type
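
The rules above can be sketched as a single check. This is a minimal illustration under stated assumptions: check_upload_name and the extension allowlist are hypothetical names, and the sketch rejects offending names outright rather than stripping anything.

```python
# Hypothetical allowlist for one upload type.
ALLOWED_SIGNATURE_EXTENSIONS = (".asc", ".sig")


def check_upload_name(name: str, allowed: tuple[str, ...]) -> str:
    """Return name unchanged if it is a plain file name with an allowed extension."""
    if not name or "\x00" in name:
        raise ValueError("Empty name or null byte")
    if "/" in name or "\\" in name:
        raise ValueError("Directory separators are not allowed")
    if name in (".", ".."):
        raise ValueError("Path tokens are not allowed")
    # str.endswith accepts a tuple of suffixes.
    if not name.lower().endswith(allowed):
        raise ValueError("Unexpected file extension")
    return name
```

Rejecting is the simpler policy to reason about, since the stored name is always exactly what the user supplied.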

Data integrity validation

Beyond input validation, ATR performs data integrity validation on database records using validate.py. This catches inconsistencies that may have been introduced by bugs, migrations, or manual database edits.

Committee validation

Committee records are not created through a form. They are synced from LDAP and Whimsy by _update_committees, so their fields have no Pydantic input layer, and the data integrity validators are the main place each field is confirmed.

The committee function checks:

  • key must be a valid committee key (the safe.CommitteeKey character set)
  • name must be set, trimmed, and not prefixed with "Apache "
  • committee_members, committers, and release_managers entries must each look like an ASF UID
  • child_committees must be empty (not used)

Per-field coverage:

| Field | Input layer | Data integrity |
|---|---|---|
| key | none (LDAP project name) | valid committee key |
| name | none (Whimsy) | set, trimmed, no "Apache " prefix |
| is_podling | set during sync | type-enforced (bool) |
| parent_committee_key | none | foreign key |
| committee_members | none (LDAP) | ASF UID format |
| committers | none (LDAP) | ASF UID format |
| release_managers | none | ASF UID format |
| child_committees | none | must be empty |

Project validation

Projects do have an input layer. AddProjectForm validates the key and display name at creation. EditVersionSchemeForm rejects a version_pattern or cycle_match that is not a valid regex, and also rejects a cycle_match that has no capture group. The data integrity validators mirror these and cover fields that have no form.

The project function checks:

  • key must be a valid project key (the safe.ProjectKey character set)
  • category must use comma-separated labels without colons
  • committee_key must be set (project must be linked to a committee)
  • created timestamp must be in the past
  • created_by, if set, must look like an ASF UID
  • cycle_match, if set, must be a regex with at least one capture group
  • full_name must be set and start with "Apache "
  • programming_languages must use comma-separated labels without colons
  • release_policy_id must be None (not used)
  • version_pattern, if set, must be a compilable regex

Per-field coverage:

| Field | Input layer | Data integrity |
|---|---|---|
| key | safe.ProjectKey (charset, lowercase) + AddProjectForm (committee-prefixed) | valid project key |
| name | AddProjectForm (Apache prefix, case rules) | set, "Apache " prefix |
| status | none (enum) | type-enforced (enum) |
| description | EditMetadataForm (free text) | none |
| category | AddCategoryForm (free text) | comma-separated, no colons |
| programming_languages | AddLanguageForm (free text) | comma-separated, no colons |
| version_method | EditVersionSchemeForm (enum) | type-enforced (enum) |
| version_pattern | EditVersionSchemeForm (regex validity) | compilable regex |
| cycle_match | EditVersionSchemeForm (regex and capture group) | regex with capture group |
| branch_template | EditVersionSchemeForm (free text) | none, not currently enforced |
| committee_key | safe.CommitteeKey | must be set |
| created | set at creation | in the past |
| created_by | none | ASF UID format |

Release validation

The release function checks:

  • created timestamp must be in the past
  • name must match the expected pattern for project and version
  • Release directory must exist on disk and contain files
  • package_managers must be empty (not used)
  • released timestamp must be in the past or None
  • sboms must be empty (not used)
  • Vote logic must be consistent (cannot have vote_resolved without vote_started)
  • votes must be empty (not used)

Running validation

Data integrity validation can be run via the admin interface or programmatically:

async for divergence in validate.everything(data):
    print(f"{divergence.source}: {divergence.divergence}")

These validators are complementary to the live checks described below. Data integrity validation inspects stored records for drift or corruption. Business logic validation stops inconsistent actions before ATR accepts them.

Business logic validation

Field validation is only the first step. ATR also checks whether an action still makes sense in the wider state of the release. These rules compare data across releases, revisions, committees, queued tasks, stored policy, and message delivery settings. They live mainly in interaction.py, storage writers, mail.py, and shared helpers in util.py.

Vote initiation

Before a vote can start, release_ready_to_start_vote checks that the release still has a latest revision and that the release is still attached to a committee. It also checks that the requested vote mode agrees with the stored project policy, so ATR does not let a user start a manual vote for a project configured for standard voting, or the reverse.

That same validation step then checks the surrounding release state. The user must be a committee member for the project or an ATR administrator. The latest revision must have no blocker results, and the release candidate draft must contain files. When ATR actually promotes the release into the voting phase, promote_to_candidate adds a task state check and refuses the transition while queued or active tasks still exist for that revision. The same writer atomically enforces that the transition can only happen from the expected latest revision, using a caller-supplied expected revision and a latest-revision subquery in the UPDATE statement as a compare-and-swap guard. This binds vote initiation to release phase, revision state, policy, committee membership, check results, file storage, and task execution rather than to form fields alone.
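
The compare-and-swap guard can be illustrated with plain sqlite3. This is a sketch only: the two-table schema, the column names, and the promote function are hypothetical and far simpler than ATR's models and storage writer. It shows the general technique, in which the UPDATE only matches when the caller's expected revision equals the latest revision found by a subquery.

```python
import sqlite3

# Hypothetical schema for illustration; not ATR's real tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE releases (name TEXT PRIMARY KEY, phase TEXT NOT NULL);
    CREATE TABLE revisions (
        release_name TEXT NOT NULL,
        seq INTEGER NOT NULL,
        number TEXT NOT NULL
    );
    INSERT INTO releases VALUES ('demo-1.0', 'draft');
    INSERT INTO revisions VALUES ('demo-1.0', 1, '00001');
    INSERT INTO revisions VALUES ('demo-1.0', 2, '00002');
    """
)


def promote(conn: sqlite3.Connection, release_name: str, expected_revision: str) -> bool:
    """Move a draft to the candidate phase only if expected_revision is the latest."""
    cursor = conn.execute(
        """
        UPDATE releases
           SET phase = 'candidate'
         WHERE name = ?
           AND phase = 'draft'
           AND ? = (
               SELECT number FROM revisions
                WHERE revisions.release_name = releases.name
                ORDER BY seq DESC
                LIMIT 1
           )
        """,
        (release_name, expected_revision),
    )
    conn.commit()
    # Zero rows updated means the guard failed: wrong phase or stale revision.
    return cursor.rowcount == 1


stale = promote(conn, "demo-1.0", "00001")   # caller saw an outdated revision
fresh = promote(conn, "demo-1.0", "00002")   # caller saw the latest revision
repeat = promote(conn, "demo-1.0", "00002")  # release has already left the draft phase
```

Because the comparison happens inside the UPDATE itself, there is no window between reading the latest revision and changing the phase in which another writer could add a revision unnoticed.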

Trusted Publishing

Trusted Publishing settings are validated when they are stored and again when they are used. On write, validate_trusted_publishing_constraints and policy.py normalize the configured repository name, branch, and workflow paths and reject incomplete or impossible combinations. A workflow path cannot be stored without a repository name. A branch cannot be stored without a repository name. Repository names are stored without a slash. Every workflow path must begin with .github/workflows/.
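
The write-time rules can be sketched as a function that returns a list of problems. The function name, its parameters, and the choice to report a slash as an error (rather than normalize it away) are assumptions for illustration, not the behavior of validate_trusted_publishing_constraints.

```python
WORKFLOW_PREFIX = ".github/workflows/"


def trusted_publishing_problems(
    repository_name: str, branch: str, workflow_paths: list[str]
) -> list[str]:
    """Hypothetical write-time check; returns an empty list if nothing is wrong."""
    problems = []
    repository_name = repository_name.strip()
    branch = branch.strip()
    # Ignore blank entries, as a form with empty rows might submit them.
    paths = [p.strip() for p in workflow_paths if p.strip()]

    if "/" in repository_name:
        problems.append("repository name must not contain a slash")
    if paths and not repository_name:
        problems.append("workflow paths require a repository name")
    if branch and not repository_name:
        problems.append("a branch requires a repository name")
    for path in paths:
        if not path.startswith(WORKFLOW_PREFIX):
            problems.append(f"workflow path must begin with {WORKFLOW_PREFIX}: {path}")
    return problems
```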

At request time, _trusted_project_checks and _trusted_project compare the GitHub token claims with the stored policy. The repository must be under apache. The workflow reference must begin with that same repository, must include a git ref, and must resolve to a workflow path under .github/workflows/. ATR then looks up the project by repository name and by the phase-specific workflow path that was stored for compose, vote, or finish. Distribution callbacks add one more contextual check in trusted_jwt_for_dist, which refuses the request unless the named release exists and is in the expected phase for the requested operation. The cryptographic validation of the token itself is described in authentication security.
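
The claim checks can be sketched as follows, assuming the token has already been cryptographically verified and that its repository and workflow_ref claims are passed in as strings. parse_workflow_ref is a hypothetical helper, and the claim values in the usage note are illustrative.

```python
def parse_workflow_ref(repository: str, workflow_ref: str) -> tuple[str, str]:
    """Return (workflow_path, git_ref), or raise ValueError if a rule is violated."""
    if not repository.startswith("apache/"):
        raise ValueError("repository must be under apache")
    # The trailing slash matters: without it, "apache/example" would also
    # be a prefix of "apache/example-other/...".
    prefix = repository + "/"
    if not workflow_ref.startswith(prefix):
        raise ValueError("workflow_ref must begin with the repository")
    workflow_path, separator, git_ref = workflow_ref[len(prefix):].partition("@")
    if not separator or not git_ref:
        raise ValueError("workflow_ref must include a git ref")
    if not workflow_path.startswith(".github/workflows/"):
        raise ValueError("workflow path must be under .github/workflows/")
    return workflow_path, git_ref
```

With repository "apache/example" and workflow_ref "apache/example/.github/workflows/release.yml@refs/heads/main", the sketch returns the workflow path and the git ref as a pair. The path can then be compared with the one stored for the relevant phase.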

Email delivery

Email validation in ATR also depends on context. validate_email_recipients requires a primary recipient and rejects duplicate addresses across To, Cc, and Bcc. send then requires the sender to use @apache.org, and _validate_recipient rejects any envelope recipient outside @apache.org and its subdomains. This means that vote and release mail must go to ASF-controlled addresses even if the address itself would be syntactically valid.
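
A minimal sketch of these two contextual rules follows. Both function names are hypothetical, and the case-insensitive duplicate comparison is an assumption. The domain check shows why a plain suffix test on "apache.org" would not be enough.

```python
def check_recipients(to: list[str], cc: list[str], bcc: list[str]) -> None:
    """Require a primary recipient and reject duplicates across To, Cc, and Bcc."""
    if not to:
        raise ValueError("At least one primary recipient is required")
    seen: set[str] = set()
    for address in [*to, *cc, *bcc]:
        normalized = address.strip().lower()  # assumption: compare case-insensitively
        if normalized in seen:
            raise ValueError(f"Duplicate recipient: {address}")
        seen.add(normalized)


def is_asf_address(address: str) -> bool:
    """True if the address is at apache.org or one of its subdomains."""
    local, separator, domain = address.rpartition("@")
    if not separator or not local:
        return False
    domain = domain.lower()
    # Match the exact domain or a dot-delimited subdomain, so that
    # "notapache.org" and "apache.org.example.com" are both refused.
    return domain == "apache.org" or domain.endswith(".apache.org")
```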

Output encoding

ATR uses Jinja2 for templating with auto-escaping enabled by default. All variables rendered in templates are automatically HTML-escaped:

<!-- This is safe; user_input is escaped -->
<p>Hello, {{ user_input }}</p>

When HTML output is intentionally generated (e.g., via htpy), it must be explicitly marked safe using markupsafe.Markup:

import markupsafe
safe_html = markupsafe.Markup("<strong>Bold</strong>")

For Markdown rendering, ATR uses markupsafe.Markup(cmarkgfm.github_flavored_markdown_to_html(markdown_text)). This relies on cmarkgfm's default safe mode, which omits raw HTML and unsafe links from the rendered output.

Never mark user-controlled data as safe without proper sanitization.

File upload security

File uploads are handled with several security measures:

Size limits

Maximum upload size is enforced at the httpd layer via MAX_CONTENT_LENGTH. This prevents denial-of-service attacks via large uploads.

Extension validation

Each upload type has an allowlist of permitted file extensions. Files with unexpected extensions are rejected.

Storage location

Uploaded files are stored outside the application in configured directories (e.g., state/unfinished/). They are not directly accessible via HTTP.

File handling

Files are processed as quart.datastructures.FileStorage objects in quart_request and validated before being written to disk. Empty files (where the browser sends a file input with no selection) are filtered out.

Injection prevention

SQL injection

ATR uses SQLAlchemy ORM exclusively for database access. All queries use parameterized statements:

# Safe: parameterized query
result = await session.exec(
    select(Project).where(Project.key == project_name)
)

Direct SQL string concatenation is never used.
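
The effect of parameter binding can be demonstrated with the standard sqlite3 module. This is a stand-in chosen so the example runs without dependencies. It is not how ATR issues queries, and the table here is a hypothetical one-column reduction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE project (key TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO project (key) VALUES (?)", [("alpha",), ("beta",)])

hostile = "alpha' OR '1'='1"

# Bound parameter: the whole string is compared as a value, so nothing matches.
bound_rows = conn.execute(
    "SELECT key FROM project WHERE key = ?", (hostile,)
).fetchall()

# String concatenation, shown only for contrast: the quote characters become
# part of the SQL, the OR clause is always true, and every row is returned.
concatenated_rows = conn.execute(
    f"SELECT key FROM project WHERE key = '{hostile}'"
).fetchall()
```

An ORM query such as the one above compiles to the first form, with the value sent separately from the statement text.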

Cross-site scripting (XSS)

XSS is prevented through:

  • Jinja2 auto-escaping (enabled by default)
  • markupsafe.Markup for trusted HTML only
  • Content Security Policy headers (configured in httpd)

Path traversal

Path traversal is prevented by:

  • Using pathlib.Path for all file operations
  • Validating that paths remain within expected directories
  • Rejecting file names containing path separators

For form fields that accept file or directory paths, always use form.RelPath (or form.RelPathList for multiple paths). These types automatically call to_relpath(), which rejects path traversal sequences, absolute paths, and empty values at the Pydantic validation layer. This is the preferred approach because it prevents path traversal before the handler code runs.
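
To show the kind of checks involved, here is a minimal sketch of a relative-path validator. It is an assumption about the general technique, not the implementation of to_relpath(). The name to_relpath_sketch is hypothetical, as is the extra rule that rejects backslashes and null bytes.

```python
import pathlib


def to_relpath_sketch(value: str) -> pathlib.PurePosixPath:
    """Hypothetical validator: return a safe relative path or raise ValueError."""
    if not value or not value.strip():
        raise ValueError("Path must not be empty")
    if "\\" in value or "\x00" in value:
        raise ValueError("Path contains a forbidden character")
    # PurePosixPath never touches the filesystem; it only parses the string.
    path = pathlib.PurePosixPath(value)
    if path.is_absolute():
        raise ValueError("Absolute paths are not allowed")
    if ".." in path.parts:
        raise ValueError("Path traversal sequences are not allowed")
    if not path.parts:
        raise ValueError("Path must name a file or directory")
    return path
```

Checking parts rather than searching the raw string for ".." avoids rejecting legitimate names such as "notes..txt".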

For cases outside of form validation (e.g., URL route parameters), use form.to_relpath() directly, or validate manually:

import pathlib

base = pathlib.Path("/allowed/directory")
user_path = base / user_filename
# Verify the resolved path is still under base
if not user_path.resolve().is_relative_to(base.resolve()):
    raise ValueError("Path traversal detected")

Command injection

ATR avoids invoking external commands where it can. Where external commands are necessary (e.g., GPG operations), arguments are passed as lists, never as shell strings:

import subprocess

# Safe: arguments as a list, so no shell parses them; check=True raises
# CalledProcessError if gpg exits with a non-zero status
subprocess.run(["gpg", "--verify", signature_file, data_file], check=True)

# Unsafe: never do this; the shell interprets metacharacters in the file names
subprocess.run(f"gpg --verify {signature_file} {data_file}", shell=True)
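
The difference can be observed directly. In the sketch below, the child process is the current Python interpreter rather than gpg, chosen only so the example runs anywhere, and the hostile file name is made up. With a list of arguments, shell metacharacters reach the child as literal text.

```python
import subprocess
import sys

hostile = "release.asc; echo INJECTED"

# The child process prints its first argument exactly as received.
completed = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", hostile],
    capture_output=True,
    text=True,
    check=True,
)

# No shell was involved, so the semicolon was never interpreted as a
# command separator; the argument arrives unchanged.
received = completed.stdout.strip()
assert received == hostile
```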

Implementation references