3.14. Input validation

Sections:

Overview
Defense in depth
Form validation with Pydantic
CSRF protection
Validation rules by input type
Data integrity validation
Output encoding
File upload security
Injection prevention
Implementation references

Overview

Input validation is critical for ATR's security posture. As a system that handles cryptographic signatures and release artifacts, ATR must ensure that all user input is properly validated before processing. This page documents the validation strategies and patterns used throughout the codebase.

Defense in depth

ATR employs multiple layers of validation:

Transport layer: HTTPS required, enforced by httpd
Request layer: Size limits enforced by httpd (MAX_CONTENT_LENGTH)
Form layer: Pydantic models validate structure and types
Application layer: Business logic validation in route handlers
Database layer: SQLAlchemy ORM with parameterized queries, plus constraints
Markdown layer: Markdown is rendered to HTML via cmarkgfm
Output layer: Jinja2 auto-escaping for HTML output

Each layer provides independent protection, so a failure in one layer does not by itself compromise the system.

Form validation with Pydantic

All form inputs in ATR are validated through Pydantic models defined in form.py. The base class for forms is Form, which extends Pydantic's BaseModel.

Defining form fields

Form fields are defined using Python type annotations and the label function:

class ExampleForm(Form):
    name: str = label("Project name", "Enter the project name")
    count: int = label("Count", widget=Widget.NUMBER)
    email: EmailStr = label("Contact email", widget=Widget.EMAIL)

The label function accepts a description (shown to users), optional documentation, and an optional widget hint for rendering.

Validation process

When a form is submitted, ATR:

Extracts form data from the request via quart_request
Passes the data to the Pydantic model for validation
If validation fails, collects errors via flash_error_data
Displays errors to the user with flash_error_summary
If validation succeeds, proceeds with the validated data

Pydantic provides built-in validators for common types (strings, integers, emails, URLs) and supports custom validators via decorators.
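
The validate-then-report flow above can be sketched with plain Pydantic v2. This is a minimal illustration, not ATR's code: ExampleForm here derives directly from BaseModel rather than ATR's Form, validate_form is a hypothetical helper, and the flash_error_data and flash_error_summary steps are reduced to returning a list of messages.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class ExampleForm(BaseModel):
    """Hypothetical stand-in for an ATR form; not the real Form base class."""

    name: str
    count: int


def validate_form(raw: dict) -> tuple[Optional[ExampleForm], list[str]]:
    """Return (form, []) on success or (None, messages) on failure."""
    try:
        # Form values arrive as strings; Pydantic coerces "3" to the int 3.
        return ExampleForm.model_validate(raw), []
    except ValidationError as exc:
        # One message per failing field, prefixed with the field name.
        messages = [
            f"{'.'.join(str(part) for part in err['loc'])}: {err['msg']}"
            for err in exc.errors()
        ]
        return None, messages
```

A route handler following this pattern would render the messages next to the form when the first element is None and continue with the typed form object otherwise.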

Custom validators

For complex validation logic, use Pydantic's @model_validator decorator:

import re

from pydantic import model_validator

class ReleaseForm(Form):
    version: str = label("Version")

    @model_validator(mode="after")
    def validate_version_format(self):
        # Require a leading MAJOR.MINOR.PATCH; any suffix (e.g. "-RC1") is allowed
        if not re.match(r"^\d+\.\d+\.\d+", self.version):
            raise ValueError("Version must start with X.Y.Z")
        return self

CSRF protection

All POST forms must include a CSRF token. The token is generated by csrf_input and validated automatically by Quart-WTF:

def csrf_input() -> htm.VoidElement:
    csrf_token = utils.generate_csrf()
    return htpy.input(type="hidden", name="csrf_token", value=csrf_token)

In templates, include the CSRF token in every form:

<form method="post">
    {{ csrf_input() }}
    <!-- other form fields -->
</form>

The CSRF token is tied to the user's session and validated on form submission. Requests without a valid CSRF token are rejected. When using the form module renderer, the CSRF token is added automatically.
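
To illustrate what "tied to the user's session" means, the following is a rough sketch of a session-bound token check using only the standard library. It is not how Quart-WTF implements CSRF protection; issue_csrf_token, is_valid_csrf, and the dict used as a session are all hypothetical.

```python
import hmac
import secrets
from typing import Optional


def issue_csrf_token(session: dict) -> str:
    """Create a random token and remember it in the (hypothetical) session dict."""
    token = secrets.token_urlsafe(32)
    session["csrf_token"] = token
    return token


def is_valid_csrf(session: dict, submitted: Optional[str]) -> bool:
    """Accept a request only if its token matches the one stored in the session."""
    expected = session.get("csrf_token")
    if not expected or not submitted:
        return False
    # compare_digest takes time independent of where the values differ.
    return hmac.compare_digest(expected.encode(), submitted.encode())
```

A forged cross-site request cannot read the victim's session, so it cannot supply a matching token and is rejected.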

Validation rules by input type

ASF User IDs

User IDs are validated against a strict pattern in principal.py:

if not re.match(r"^[-_a-z0-9]+$", user):
    raise CommitterError("Invalid characters in User ID")

Only lowercase alphanumeric characters, hyphens, and underscores are permitted.
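
As a self-contained illustration of the same character rule, the sketch below wraps it in a hypothetical helper, is_valid_user_id. It uses fullmatch instead of match with "$", because "$" also matches just before a trailing newline; that difference is an observation about Python's re module, not a statement about how principal.py is called.

```python
import re

# Same character class as the check above, applied to the whole string.
USER_ID_PATTERN = re.compile(r"[-_a-z0-9]+")


def is_valid_user_id(user: str) -> bool:
    """Return True only if every character is a-z, 0-9, hyphen, or underscore."""
    return USER_ID_PATTERN.fullmatch(user) is not None
```

With this helper, "jdoe-2" is accepted, while "JDoe", "j doe", the empty string, and "jdoe" followed by a newline are all rejected.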

Email addresses

Email validation uses Pydantic's EmailStr type, which checks address syntax using the email-validator package:

from pydantic import EmailStr

class ContactForm(Form):
    email: EmailStr = label("Email address")

URLs

URL validation uses Pydantic's HttpUrl type:

from pydantic import HttpUrl

class LinkForm(Form):
    website: HttpUrl = label("Website URL")

Version strings

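Version strings are validated according to project-specific patterns. The general pattern is permissive: it requires a numeric major and minor component separated by a dot, followed by any suffix, so it accepts semantic versions as well as forms such as 1.0 or 2.3.4-RC1: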

VERSION_PATTERN = re.compile(r"^[0-9]+\.[0-9]+.*$")
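
To make the behaviour of this pattern concrete, the short check below runs it against a few candidate strings. The candidate list is invented for illustration.

```python
import re

VERSION_PATTERN = re.compile(r"^[0-9]+\.[0-9]+.*$")

# Invented sample inputs: the first four match, the last three do not.
candidates = ["1.0", "2.3.4", "2.3.4-RC1", "1.0 (draft)", "1", "v1.0", "1.x"]
accepted = [c for c in candidates if VERSION_PATTERN.match(c)]
rejected = [c for c in candidates if not VERSION_PATTERN.match(c)]
```

Because the suffix is unconstrained, a string such as "1.0 (draft)" passes; projects that need stricter rules have to layer their own pattern on top.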

Committee and project names

Committee and project names are validated against the set of known committees and projects from LDAP and the ASF project database. Unknown names are rejected.

File names

File names in uploads are sanitized to prevent path traversal:

Directory separators (/, \) and the path token .. are rejected or stripped
Null bytes are rejected
Only expected extensions are permitted per upload type
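
The three rules above could be combined roughly as follows. This is a sketch under assumptions: check_upload_name is a hypothetical helper, and the extension allowlist is invented, since the real per-upload-type lists are not shown on this page.

```python
# Hypothetical allowlist; ATR's actual lists may differ.
ALLOWED_EXTENSIONS = (".asc", ".sha512", ".tar.gz", ".zip")


def check_upload_name(name: str) -> str:
    """Return the name unchanged if it is acceptable, else raise ValueError."""
    if not name or "\x00" in name:
        raise ValueError("empty file name or null byte")
    # With separators rejected, ".." can only appear as the whole name.
    if "/" in name or "\\" in name or name in {".", ".."}:
        raise ValueError("path separators and dot segments are not allowed")
    if not name.lower().endswith(ALLOWED_EXTENSIONS):
        raise ValueError("unexpected file extension")
    return name
```

This variant rejects rather than strips, which avoids silently turning a hostile name into a different, valid one.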

Data integrity validation

Beyond input validation, ATR performs data integrity validation on database records using validate.py. This catches inconsistencies that may have been introduced by bugs, migrations, or manual database edits.

Committee validation

The committee function checks:

child_committees must be empty (not used)
full_name must be set, trimmed, and not prefixed with "Apache "

Project validation

The project function checks:

category must use comma-separated labels without colons
committee_name must be set (project must be linked to a committee)
created timestamp must be in the past
full_name must be set and start with "Apache "
programming_languages must use comma-separated labels without colons
release_policy_id must be None (not used)
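
A few of the project checks above might look like this when written as a generator of divergence messages. ProjectRecord is a hypothetical stand-in for the ORM model, and project_divergences is not the actual function in validate.py; only the rules themselves come from the list above.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class ProjectRecord:
    """Hypothetical record; field names follow the list above."""

    full_name: Optional[str]
    committee_name: Optional[str]
    category: Optional[str] = None


def project_divergences(project: ProjectRecord) -> Iterator[str]:
    """Yield one message per rule that the record violates."""
    if not project.full_name or not project.full_name.startswith("Apache "):
        yield 'full_name must be set and start with "Apache "'
    if not project.committee_name:
        yield "committee_name must be set"
    if project.category and ":" in project.category:
        yield "category labels must not contain colons"
```

Yielding messages instead of raising lets a single pass report every problem with a record.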

Release validation

The release function checks:

created timestamp must be in the past
name must match the expected pattern for project and version
Release directory must exist on disk and contain files
package_managers must be empty (not used)
released timestamp must be in the past or None
sboms must be empty (not used)
Vote logic must be consistent (cannot have vote_resolved without vote_started)
votes must be empty (not used)
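
The timestamp and vote-consistency rules can be sketched in the same style. release_divergences and its parameters are hypothetical; the sketch assumes timezone-aware datetimes and covers only two of the checks listed above.

```python
import datetime
from typing import Iterator, Optional


def release_divergences(
    released: Optional[datetime.datetime],
    vote_started: Optional[datetime.datetime],
    vote_resolved: Optional[datetime.datetime],
    now: Optional[datetime.datetime] = None,
) -> Iterator[str]:
    """Yield one message per inconsistency; timestamps must be timezone-aware."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    # released may be None (not yet released) but must not lie in the future.
    if released is not None and released > now:
        yield "released must be in the past or None"
    # A vote cannot be resolved if it was never started.
    if vote_resolved is not None and vote_started is None:
        yield "vote_resolved is set but vote_started is not"
```

Passing now explicitly keeps the check deterministic in tests.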

Running validation

Data integrity validation can be run via the admin interface or programmatically:

async for divergence in validate.everything(data):
    print(f"{divergence.source}: {divergence.divergence}")

Output encoding

ATR uses Jinja2 for templating with auto-escaping enabled by default. All variables rendered in templates are automatically HTML-escaped:

<!-- This is safe; user_input is escaped -->
<p>Hello, {{ user_input }}</p>

When HTML output is intentionally generated (e.g., via htpy), it must be explicitly marked safe using markupsafe.Markup:

import markupsafe
safe_html = markupsafe.Markup("<strong>Bold</strong>")

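For Markdown rendering, ATR uses markupsafe.Markup(cmarkgfm.github_flavored_markdown_to_html(markdown_text)). Marking the result as safe is acceptable only because cmarkgfm, in its default mode, omits raw HTML and links with unsafe schemes from its output instead of passing them through.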

Never mark user-controlled data as safe without proper sanitization.
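
When a user-supplied value has to be combined with trusted markup outside a template, the value should be escaped first and only the combined result treated as HTML. The sketch below shows the idea with the standard library's html.escape as a stand-in for markupsafe's escaping; greeting_html is a hypothetical helper.

```python
import html


def greeting_html(user_input: str) -> str:
    """Escape the untrusted value, then place it inside trusted markup."""
    return f"<p>Hello, {html.escape(user_input)}</p>"
```

For the input "<script>alert(1)</script>", the angle brackets are emitted as &lt; and &gt;, so the browser displays the text instead of executing it.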

File upload security

File uploads are handled with several security measures:

Size limits

Maximum upload size is enforced at the httpd layer via MAX_CONTENT_LENGTH. This limits the impact of denial-of-service attempts that rely on oversized uploads.

Extension validation

Each upload type has an allowlist of permitted file extensions. Files with unexpected extensions are rejected.

Storage location

Uploaded files are stored outside the application in configured directories (e.g., state/unfinished/). They are not directly accessible via HTTP.

File handling

Files are processed via quart.datastructures.FileStorage and validated before being written to disk. Empty files (where the browser sends a file input with no selection) are filtered out.

Injection prevention

SQL injection

ATR uses SQLAlchemy ORM exclusively for database access. All queries use parameterized statements:

# Safe: parameterized query
result = await session.exec(
    select(Project).where(Project.name == project_name)
)

Direct SQL string concatenation is never used.
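
The effect of parameter binding can be demonstrated without ATR's models. The sketch below uses the standard library's sqlite3 module rather than SQLAlchemy, with an invented project table, to show that a classic injection payload is treated as an ordinary string value.

```python
import sqlite3

# In-memory database with an invented table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE project (name TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO project (name) VALUES (?)", [("alpha",), ("beta",)])


def find_projects(name: str) -> list[str]:
    """Look up projects by exact name using a bound parameter."""
    rows = conn.execute(
        "SELECT name FROM project WHERE name = ?", (name,)
    ).fetchall()
    return [row[0] for row in rows]
```

Calling find_projects with "alpha' OR '1'='1" returns no rows, because the whole payload is compared against the name column as data and never parsed as SQL.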

Cross-site scripting (XSS)

XSS is prevented through:

Jinja2 auto-escaping (enabled by default)
markupsafe.Markup for trusted HTML only
Content Security Policy headers (configured in httpd)

Path traversal

Path traversal is prevented by:

Using pathlib.Path for all file operations
Validating that paths remain within expected directories
Rejecting file names containing path separators

import pathlib

base = pathlib.Path("/allowed/directory")
user_path = base / user_filename
# Verify the resolved path is still under base
if not user_path.resolve().is_relative_to(base.resolve()):
    raise ValueError("Path traversal detected")
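
The same check can be wrapped in a reusable helper, sketched below under the hypothetical name resolve_under. One point worth noting is that joining a base directory with an absolute user_filename discards the base entirely, which the containment check also catches. Path.is_relative_to requires Python 3.9 or later.

```python
import pathlib


def resolve_under(base: pathlib.Path, user_filename: str) -> pathlib.Path:
    """Return the resolved path, or raise if it escapes the base directory."""
    base_resolved = base.resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    candidate = (base_resolved / user_filename).resolve()
    if not candidate.is_relative_to(base_resolved):
        raise ValueError("Path traversal detected")
    return candidate
```

Inputs such as "../outside.txt", "a/../../b", and "/etc/passwd" all raise ValueError, while a plain file name resolves to a path inside the base directory.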

Command injection

ATR avoids invoking external commands wherever possible. Where they are necessary (e.g., GPG operations), arguments are passed as lists, never as shell strings, so that no shell interprets user-supplied values:

import subprocess

# Safe: arguments as list
subprocess.run(["gpg", "--verify", signature_file, data_file])

# Unsafe: never do this
subprocess.run(f"gpg --verify {signature_file} {data_file}", shell=True)
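
The following sketch shows why the list form is safe. It uses the current Python interpreter as a stand-in for gpg, since gpg may not be installed, and passes an invented argument containing shell metacharacters; the child process receives it as a single literal string.

```python
import subprocess
import sys

# Invented hostile value: a shell would treat ";" as a command separator.
hostile = "file.asc; echo INJECTED"

# The child prints its first argument; sys.executable stands in for gpg.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", hostile],
    capture_output=True,
    text=True,
    check=True,
)
received = result.stdout.strip()
```

Here received equals hostile exactly: no shell is involved, so the semicolon and the echo command are never interpreted.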

Implementation references

form.py - Form definitions, validation, and rendering
validate.py - Data integrity validators
util.py - Utility functions including path handling
htm.py - HTML generation utilities