Robots.txt — Reference

The Robots.txt surface is the platform's crawler-instruction plane. It owns the contents of the robots.txt file served at the site root, the allow / disallow rules per URL path, the sitemap pointer, the crawler-specific overrides keyed by user-agent, and the environment-aware behavior that automatically blocks indexing of non-production environments.

This page is a reference for platform engineers and integrators who need to understand the surface before extending it, scripting against it, or auditing the site's crawl posture. Customer-facing how-tos live in the customer docs set; this page describes the shape of the surface, not the steps to drive it.


Overview

Robots.txt settings live under the SEO → Robots view in SG-Admin, paired with the Sitemap settings on the same screen. The view renders a single configuration document, edited as a form: a default rule block, an optional per-agent override list, and a sitemap-URL field that points the crawler at the platform's sitemap index.

Unlike file-on-disk implementations elsewhere in the industry, SGEN's robots.txt is rendered on demand from this configuration document at request time. There is no file to upload, no FTP step, no template to override at the theme layer. The file served at /robots.txt is the live projection of the saved configuration plus the platform's automatic environment-aware preamble.

The surface holds no list of records; it holds one configuration document. Edits take effect on the live site as soon as the form is saved, with no separate publish step. Because well-behaved crawlers fetch this file before crawling and apply it site-wide, the guards described under Actions and Permissions below are designed to make accidental disallow-all changes hard to ship.

Where it lives in SG-Admin:

  • Sidebar: SG-Admin → SEO → Robots & Sitemap
  • URL prefix: /sg-admin/seo/robots
  • View template: application/views/Admin/SEO/robots-form.php

The robots configuration is one of the layered SEO controls. Sitemap configuration, meta-tag defaults, schema.org defaults, and redirect rules live on adjacent SEO views; see Related references at the bottom of this page.

┌──────────────────────────────────────────────────────────────────────┐
│ SG-Admin → SEO → Robots & Sitemap                             [Save] │
├──────────────────────────────────────────────────────────────────────┤
│ Default rule block                                                   │
│ ┌──────────────────────────────────────────────────────────────┐     │
│ │ User-agent: *                                                │     │
│ │ Allow: /                                                     │     │
│ │ Disallow: /sg-admin/                                         │     │
│ │ Disallow: /cart/                                             │     │
│ │ Disallow: /checkout/                                         │     │
│ └──────────────────────────────────────────────────────────────┘     │
│                                                                      │
│ Per-agent overrides                                    [+ Add agent] │
│ ┌──────────────────────────────────────────────────────────────┐     │
│ │ GPTBot    Disallow: /                                        │     │
│ │ CCBot     Disallow: /                                        │     │
│ └──────────────────────────────────────────────────────────────┘     │
│                                                                      │
│ Sitemap URL    https://example.sgen.com/sitemap.xml                  │
│                                                                      │
│ Environment block (auto)    ☑ Block indexing on staging              │
└──────────────────────────────────────────────────────────────────────┘

Actions

The Robots.txt surface exposes a small set of operations: two reads (Render and Preview) and a handful of writes. Each is described by what it does to the served file, not by its internal method name.

Render

Returns the current robots.txt content as it would be served at the public URL. Used both by external crawlers (as the live response on /robots.txt) and by the form view (as a read-only preview pane). The render path resolves the environment-aware preamble, then the sitemap pointer, then the default rule block, then the per-agent overrides, in that order (the same fixed order given under Data model).

Edit default rule block

Loads the default rule block into the form, pre-populated with the saved allow / disallow lines. Submit replaces the stored value with the posted text after validation (syntax shape, no NUL bytes, no obviously malformed directives). Edits take effect immediately on the next /robots.txt request.
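
The validation step above can be sketched roughly as follows. This is a minimal illustration, not the platform's validator: the function name validate_rule_block, the accepted directive set, and the decision to flag comment lines as malformed are all assumptions made for the example.

```python
import re

# Directive names accepted in a rule block. This set is an assumption for the
# sketch; the platform's actual validator may accept more or fewer.
KNOWN_DIRECTIVES = {"user-agent", "allow", "disallow"}

# "Name: value" shape. Comment lines do not match and are reported as
# malformed here (an assumed reading of "comments are not user-configurable").
LINE_SHAPE = re.compile(r"^([A-Za-z-]+)\s*:\s*(.*)$")


def validate_rule_block(text: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    if "\x00" in text:
        problems.append("block contains a NUL byte")
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are harmless
        match = LINE_SHAPE.match(line)
        if match is None:
            problems.append(f"line {number}: expected 'Directive: value'")
            continue
        directive, value = match.group(1).lower(), match.group(2)
        if directive not in KNOWN_DIRECTIVES:
            problems.append(f"line {number}: unknown directive '{directive}'")
        elif directive != "user-agent" and value and not value.startswith(("/", "*")):
            problems.append(f"line {number}: path should start with '/'")
    return problems
```

A save handler built on this sketch would reject the post whenever the returned list is non-empty and show the messages next to the form field.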

Edit per-agent override

A repeating sub-form. Each entry pairs a user-agent string with its own allow / disallow lines. The surface enforces uniqueness of user-agent within the override list and merges the override at render time. An override block replaces — does not extend — the default block for that agent.
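
The "replaces, does not extend" behavior matches how a group-based robots.txt parser reads the file. The sketch below uses Python's standard urllib.robotparser as a stand-in crawler; ExampleBot and OtherBot are hypothetical agent names, and the file content is illustrative rather than platform output.

```python
from urllib.robotparser import RobotFileParser

# Illustrative file: a default group plus one per-agent group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/

User-agent: ExampleBot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

BASE = "https://example.sgen.com"

# ExampleBot matches its own group, so the default group's /cart/ rule
# does not apply to it.
examplebot_cart = parser.can_fetch("ExampleBot", BASE + "/cart/")
examplebot_private = parser.can_fetch("ExampleBot", BASE + "/private/")

# Any other agent falls back to the default group.
other_cart = parser.can_fetch("OtherBot", BASE + "/cart/")
other_private = parser.can_fetch("OtherBot", BASE + "/private/")

print(examplebot_cart, examplebot_private, other_cart, other_private)
# prints: True False False True
```

The practical consequence is that an override meant to add one extra restriction for an agent must also restate any default disallow lines that should still apply to that agent.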

Set sitemap URL

A single text field. Validates that the URL is well-formed and points at the same host as the site. Cross-host sitemap pointers are rejected because they confuse crawlers more often than they help. An empty value is allowed; in that case the sitemap line is omitted from the rendered file.
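
A same-host check of this kind could look like the sketch below. The function name validate_sitemap_url and the choice to accept only http and https schemes are assumptions for illustration; the platform's actual rules may differ.

```python
from typing import Optional
from urllib.parse import urlsplit


def validate_sitemap_url(value: str, site_host: str) -> Optional[str]:
    """Return the cleaned URL, None for an empty value, or raise ValueError."""
    value = value.strip()
    if not value:
        return None  # empty is allowed; the Sitemap line is simply omitted
    parts = urlsplit(value)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("sitemap URL must be an absolute http(s) URL")
    # urlsplit lowercases the hostname, so compare case-insensitively.
    if parts.hostname != site_host.lower():
        raise ValueError("sitemap URL must point at the site's own host")
    return value
```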

Toggle environment block

A checkbox. When enabled (the default for non-production environments), the platform prepends a hardcoded User-agent: * / Disallow: / block to the rendered file regardless of the saved rules. This prevents staging sites from being indexed even if a careless save would otherwise allow it. The checkbox is locked to "on" on environments tagged staging; on production it can be toggled, but saving the change triggers a confirmation prompt.

Preview

Returns the rendered file as it would appear at /robots.txt, including the environment preamble. Used as a sanity check before save. The preview reflects unsaved form changes so the operator can confirm the output shape before committing.

Reset to defaults

Restores the configuration to the platform's ship-default block (allow all, disallow /sg-admin/ and the standard transactional paths). The per-agent override list is cleared. Sitemap URL is preserved.
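
As a rough sketch of which fields a reset touches, the example below models only the fields this page mentions. The RobotsConfig class is hypothetical, and the ship-default text assumes that the "standard transactional paths" are /cart/ and /checkout/ as shown in the form mock-up.

```python
from dataclasses import dataclass, replace

# Assumed ship-default block, based on the form mock-up on this page.
SHIP_DEFAULT_BLOCK = (
    "User-agent: *\n"
    "Allow: /\n"
    "Disallow: /sg-admin/\n"
    "Disallow: /cart/\n"
    "Disallow: /checkout/"
)


@dataclass(frozen=True)
class RobotsConfig:
    """Hypothetical in-memory shape of the configuration document."""
    default_block: str
    agent_overrides: tuple = ()
    sitemap_url: str = ""


def reset_to_defaults(config: RobotsConfig) -> RobotsConfig:
    """Restore the default block and clear overrides; keep the sitemap URL."""
    return replace(config, default_block=SHIP_DEFAULT_BLOCK, agent_overrides=())
```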

GET /robots.txt → 200 OK text/plain

# Managed by SGEN
Sitemap: https://example.sgen.com/sitemap.xml

User-agent: *
Allow: /
Disallow: /sg-admin/
Disallow: /cart/
Disallow: /checkout/

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

Data model

The robots configuration is stored as a single configuration document. Field names below are the conceptual shape — the on-disk column names match closely but are not contractually stable across releases.

Field            Type       Notes
default_block    text       The unscoped User-agent: * allow / disallow rules.
agent_overrides  list       Zero or more {user_agent, rules} entries.
sitemap_url      string     Optional. Must match the site host.
block_staging    boolean    When true, the environment preamble disallows all on non-production. Locked to true on staging.
last_edited_at   timestamp  Updated on save. Surfaced in the audit log.
last_edited_by   integer    User id who saved. Resolves to the Users surface.

Render-time composition: the served /robots.txt file is not stored as a string. It is computed at request time from the fields above plus the environment context. The order is fixed: environment preamble → sitemap line → default block → per-agent override blocks. Comments are not user-configurable; the platform inserts a # Managed by SGEN header automatically.

Sitemap pointer: the value is a URL, not a path. When the site is reachable on multiple hosts (custom domain + platform fallback), the pointer is rewritten at render time so the served /robots.txt always names the host that served it. This avoids stale-host pointers after a domain change.
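
The host rewrite can be illustrated with the standard urllib.parse helpers. The function name rewrite_sitemap_host is hypothetical, and shop.example.com stands in for an assumed custom domain.

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite_sitemap_host(sitemap_url: str, request_host: str) -> str:
    """Swap the host of the saved sitemap URL for the host that was requested."""
    parts = urlsplit(sitemap_url)
    # SplitResult is a named tuple, so _replace returns a copy with a new netloc.
    return urlunsplit(parts._replace(netloc=request_host))


print(rewrite_sitemap_host("https://example.sgen.com/sitemap.xml", "shop.example.com"))
# prints: https://shop.example.com/sitemap.xml
```

Scheme, path, and query are left untouched in this sketch; only the host portion follows the request.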

        RENDER REQUEST /robots.txt
                  │
                  ▼
┌──────────────────────────────────────┐
│ 1. Environment preamble              │   staging → Disallow: /
│    (auto, locked on non-production)  │   production → empty
└─────────────────┬────────────────────┘
                  ▼
┌──────────────────────────────────────┐
│ 2. Sitemap pointer                   │   rewritten to current host
│    Sitemap: <sitemap_url>            │
└─────────────────┬────────────────────┘
                  ▼
┌──────────────────────────────────────┐
│ 3. Default rule block                │   User-agent: *
│                                      │   Allow / Disallow lines
└─────────────────┬────────────────────┘
                  ▼
┌──────────────────────────────────────┐
│ 4. Per-agent override blocks         │   one per saved override
│                                      │   rendered in saved order
└──────────────────────────────────────┘
                  ▼
         served as text/plain
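
The four-stage composition above can be sketched as a pure function over the configuration document. The class and function names are hypothetical, the environment is assumed to be a plain tag string such as "staging" or "production", and the placement of the header ahead of the preamble is an assumption; only the stage order comes from this page.

```python
from dataclasses import dataclass, field


@dataclass
class AgentOverride:
    user_agent: str
    rules: list[str]


@dataclass
class RobotsConfig:
    default_block: str
    agent_overrides: list[AgentOverride] = field(default_factory=list)
    sitemap_url: str = ""
    block_staging: bool = True


def render_robots(config: RobotsConfig, environment: str) -> str:
    """Compose the served file in the fixed order the page describes."""
    groups: list[list[str]] = []

    # 1. Environment preamble: forced on staging, empty on production.
    blocked = environment == "staging" or (
        environment != "production" and config.block_staging
    )
    if blocked:
        groups.append(["User-agent: *", "Disallow: /"])

    # 2. Sitemap pointer, omitted when the field is empty.
    if config.sitemap_url:
        groups.append([f"Sitemap: {config.sitemap_url}"])

    # 3. Default rule block, stored as free text.
    groups.append(config.default_block.splitlines())

    # 4. Per-agent override blocks, in saved order.
    for override in config.agent_overrides:
        groups.append([f"User-agent: {override.user_agent}", *override.rules])

    body = "\n\n".join("\n".join(group) for group in groups)
    return "# Managed by SGEN\n" + body + "\n"
```

Because nothing is cached in this sketch, every call reflects the configuration and environment passed in, which mirrors the "computed at request time" behavior described above.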

Permissions

Access to the Robots.txt surface is gated at two layers.

Layer 1 — admin gate. Every action under SG-Admin passes through the platform's standard admin access check at request entry. An unauthenticated request never reaches the Robots.txt surface.

Layer 2 — per-action capability. Within SG-Admin, each Robots.txt action checks an SEO capability associated with the calling operator's role. The default role configuration ships with three roles — Administrator, Editor, Viewer — and the capability map is:

Capability                   Administrator   Editor   Viewer
View robots configuration    yes             yes      yes
Edit default rule block      yes             yes      no
Edit per-agent override      yes             yes      no
Set sitemap URL              yes             yes      no
Toggle environment block     yes             no       no
Reset to defaults            yes             no       no

Custom roles defined under Settings → Roles override the default map. The capability slugs are stable; the role slugs are configurable.
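
The default capability map can be expressed as a simple lookup. The capability and role slugs below are invented for the example; this page states that capability slugs are stable but does not list them.

```python
# Hypothetical slugs; only the yes/no grid is taken from the table above.
DEFAULT_CAPABILITY_MAP = {
    "robots.view": {"administrator", "editor", "viewer"},
    "robots.edit_default_block": {"administrator", "editor"},
    "robots.edit_agent_override": {"administrator", "editor"},
    "robots.set_sitemap_url": {"administrator", "editor"},
    "robots.toggle_environment_block": {"administrator"},
    "robots.reset_to_defaults": {"administrator"},
}


def role_can(role: str, capability: str, capability_map=DEFAULT_CAPABILITY_MAP) -> bool:
    """Return True when the role holds the capability; unknown slugs deny."""
    return role in capability_map.get(capability, set())
```

A custom role configuration would be modelled here by passing a different capability_map, leaving the capability slugs unchanged.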

Self-protection rules. The surface refuses to save a default rule block whose effective output is User-agent: * / Disallow: / on a production environment unless the operator confirms a second time. The confirmation includes a plain-language summary of what the rule will do to indexing.
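
One way to detect a block whose effective output is disallow-all is sketched below. It is a deliberately conservative heuristic, not the platform's check: any Allow line with a path is treated as evidence that the block is not a blanket disallow, and the function name is hypothetical.

```python
def is_disallow_all(block: str) -> bool:
    """Heuristic: True when the block disallows the root and allows nothing."""
    allows, disallows = [], []
    for raw in block.splitlines():
        name, sep, value = raw.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        name, value = name.strip().lower(), value.strip()
        if name == "allow" and value:
            allows.append(value)
        elif name == "disallow" and value:
            disallows.append(value)
    return any(path in ("/", "/*") for path in disallows) and not allows
```

A save handler on production could call this before committing and require the second confirmation only when it returns True.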

Audit. Every save emits an Activity Log entry. The log records the acting operator, a diff of the rendered file before and after the save, and the environment context at save time. Activity Log retention is governed by the site's general settings.
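
The before-and-after diff recorded in the log could be produced with Python's standard difflib, as in this sketch. The function name and the file labels are assumptions; the actual log format is not specified on this page.

```python
import difflib


def rendered_diff(before: str, after: str) -> str:
    """Return a unified diff of two rendered robots.txt bodies."""
    diff = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile="robots.txt (before)",
        tofile="robots.txt (after)",
        lineterm="",
    )
    return "\n".join(diff)
```

Diffing the rendered output rather than the stored fields means the log shows what crawlers actually saw change, including changes caused by the environment preamble.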


Related references

  • SEO — Reference. Owns global meta defaults, per-content-type defaults, and schema.org defaults. Robots is the crawler-control side of the same SEO module.
  • Sitemap — Reference. The sitemap URL field on Robots points at the sitemap surface. Edits to which records are included in the sitemap live there.
  • Settings — Reference. Owns role definitions, environment-tag configuration, and Activity Log retention.
  • Domains — Reference. The host the served robots.txt advertises depends on the active domain configuration. Host changes propagate into the rendered sitemap pointer.
  • Pages — Reference. Per-page noindex directives live on the page record and supplement (not replace) the robots configuration.
For the corresponding customer-facing walkthrough — opening robots to a new crawler, blocking AI-training agents, pointing search engines at a fresh sitemap — see the SEO section of the customer docs at /docs/seo/robots.