SEO didn't die. The runtime changed.

For ~fifteen years a webpage had basically one consumer: Googlebot, which spat out ten blue links. In 2026 that same page gets parsed by Google's headless browsers, OpenAI's crawlers, ClaudeBot, Perplexity's agents - each with its own budget, its own renderer, and its own criteria for what it'll cite back to a user. The old game was tricking the index. The new game is being the kind of source a model is willing to defend in front of a user. That's a different muscle, and most of what follows is about how to build it.

The way I think about this: a startup in 2026 has to serve two audiences with one system. Humans who need a solution, and machines who need structured, performant data to either index (SEO) or summarize (AEO). The post is organized around the five places where that two-audience problem actually bites, plus measurement and stage-aware prioritization at the end - because shipping the rest without those is how you waste a year. I'll skip most of the philosophy and go straight to information architecture, infrastructure, and the handful of things that move the needle.

Pages Are a Product

Stop thinking of SEO as marketing content. Think of every page as a public API endpoint for your value prop - it has consumers (humans, bots), a contract (does it solve the thing it claims to solve), and an SLA (how fast, how reliable, how indexable). Once that frame clicks, most of the strategy follows.

The first move is to stop chasing broad topics like "What is Sales?" Nobody is searching for that with intent to buy anything. High-intent queries - "JSON to CSV converter", "CRM for 3-person teams", "Stripe alternative for marketplaces" - are people with a problem and (often) a credit card. Map each query to a route, and make the route do the thing the user is asking for. If they want a calculation, ship a tool. A comparison, ship a dynamic matrix. A guide, ship a step-by-step framework. Walls of text optimized for keyword density were a 2015 strategy, and they lose to anything that actually does the work. If your page solves the problem better than the competition, both Google and the AI models will surface it. There's not much more to it than that.

The leverage play here is programmatic SEO - generating hundreds or thousands of landing pages from a single dataset. Instead of writing one article on "Best places to live", you build a database of 50 cities and programmatically generate 50 pages: "Living in Austin vs. Seattle", "Cost of living in Denver" and so on. This is high-risk, high-reward. Ship 1,000 thin pages with identical templates and swapped city names and Google's SpamBrain will catch you - entire domains get demoted overnight. The fix is unique data on each URL: proprietary stats, charts, user reviews that exist only on that page. If you wouldn't be embarrassed to send the page to a friend, it's probably good enough. If you would, the bot can tell too.
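
To make the "unique data per URL" requirement concrete, here's a minimal sketch of a pSEO build step, assuming a hypothetical cities dataset with per-city stats. The filenames, fields, and routes are illustrative, not a prescribed setup:

typescript
// Hypothetical pSEO build step: one dataset, many routes.
// Assumes a cities.json with per-city stats that are unique to each page.
import { mkdirSync, writeFileSync, readFileSync } from "node:fs";

type City = { slug: string; name: string; medianRent: number; commuteMins: number };

const cities: City[] = JSON.parse(readFileSync("data/cities.json", "utf8"));

for (const city of cities) {
  // The per-city data is what keeps these pages out of thin-content territory.
  const html = `<!doctype html>
<html lang="en">
<head><title>Cost of living in ${city.name}</title></head>
<body>
  <h1>Cost of living in ${city.name}</h1>
  <p>Median rent: $${city.medianRent}/mo. Average commute: ${city.commuteMins} min.</p>
</body>
</html>`;
  mkdirSync(`dist/cost-of-living/${city.slug}`, { recursive: true });
  writeFileSync(`dist/cost-of-living/${city.slug}/index.html`, html);
}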

Technical Foundation

You can't rank what can't be indexed. Technical SEO is basically performance engineering and resource management for crawlers - the same stuff you'd do for any production system, except the users are bots.

[Figure: a 2x2 quadrant of crawl frequency (x-axis, low to high) against performance per visit (y-axis, slow to fast). Top right, fast and frequently crawled: indexed and cited - the goal. Top left, fast but rarely crawled: fast but unseen. Bottom right, frequently crawled but slow: crawled, half-rendered - the render budget runs out mid-page. Bottom left, slow and rarely crawled: invisible to bots - the worst case.]

Crawl & Performance Budget

Bots don't have infinite resources. Googlebot, ClaudeBot, Perplexity's agents - each assigns your site a budget that's really two numbers: how often they visit, and how much work they can do once they're there. Both matter, and both come down to the same lever in the end: server speed.

Below ~1,000 pages, ignore this entirely. Focus on content quality and the bots will find everything eventually. Reading about crawl budget at this scale is a procrastination tactic. Above ~10,000 pages - pSEO sites, e-commerce inventories - efficiency becomes mandatory.

The optimization has two prongs.

The first is pruning. Block or delete low-value URLs (faceted navigation, sort orders, dead products) via robots.txt or noindex. Most sites have an embarrassing long tail of garbage URLs nobody knew existed.
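
A sketch of what pruning often looks like in robots.txt - the paths and parameters below are placeholders for your own faceted-navigation and sort-order patterns, not rules to copy verbatim:

robots.txt
# Illustrative pruning rules - adjust the paths to your own URL patterns.
User-agent: *
Disallow: /*?sort=      # sort-order duplicates
Disallow: /*?filter=    # faceted navigation
Disallow: /search       # internal search result pages
# Dead products are usually better handled with a 410 or noindex than robots.txt,
# since a crawler can't see a noindex on a URL it's blocked from fetching.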

The second is performance, and this is where startups consistently skimp. The numbers I actually look at: TTFB under 200ms is good, under 600ms is acceptable, above 1s and you're bleeding budget - and this is a server problem, not a frontend one, so fix your hosting, your DB queries, your CDN. HTML payload should be under 100KB gzipped because bots parse linearly and a 2MB document with everything inlined is a self-inflicted wound. The JavaScript critical-path bundle should be under 200KB compressed; anything heavier and you're gambling that the bot's render budget doesn't expire before hydration finishes. LCP under 2.5s on mobile - this is a Core Web Vital, it directly affects ranking, and it's also a decent proxy for whether AI crawlers can chew through your page efficiently.

A catch on TTFB: bots don't ping you from your office. Both Google and OpenAI publish the exact IP ranges their crawlers use, and both lean heavily on US data centers. If your server lives in a single region without a CDN, your "fast TTFB" from localhost is a fiction. A Warsaw-only origin can show 5ms locally and 600ms+ to Googlebot sitting in California. The fix is the boring one - put a CDN in front (Cloudflare, Fastly, CloudFront) - but the verification is precise: pull the published IP ranges, filter your server logs by them, and look at the actual p75/p95 TTFB the crawlers experience. Don't guess. Measure.
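
A rough sketch of that verification, assuming access logs with one request per line, the client IP as the first field, and a response-time-in-milliseconds field appended last - adjust the parsing to your actual log format, and swap in the IP prefixes the vendors actually publish:

typescript
// Rough log-based check of the TTFB crawlers actually experience.
// Assumptions: client IP is the first field, response time in ms is the last field.
import { readFileSync } from "node:fs";

// Illustrative prefixes - replace with the published Googlebot / GPTBot ranges.
const crawlerPrefixes = ["66.249.", "20.171."];

const lines = readFileSync("access.log", "utf8").split("\n").filter(Boolean);

const crawlerTimes = lines
  .map((line) => line.split(" "))
  .filter((fields) => crawlerPrefixes.some((p) => fields[0].startsWith(p)))
  .map((fields) => Number(fields[fields.length - 1]))
  .filter((ms) => Number.isFinite(ms))
  .sort((a, b) => a - b);

// Nearest-rank percentile; good enough for a sanity check.
const percentile = (q: number) =>
  crawlerTimes[Math.min(crawlerTimes.length - 1, Math.floor(q * crawlerTimes.length))];

console.log(`crawler requests: ${crawlerTimes.length}`);
console.log(`p75 TTFB: ${percentile(0.75)}ms, p95 TTFB: ${percentile(0.95)}ms`);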

One trap engineers fall into: optimizing only for desktop. Google switched to mobile-first indexing in ~2019, which means Googlebot Smartphone is the primary crawler for the open web. Your beautiful desktop Lighthouse score is basically irrelevant if mobile is slow. Test on a throttled 4G connection on a mid-range Android, not on your fancy MacBook Pro. The gap between those two environments is where most ranking losses hide - and where most engineers stop looking.

The rule of thumb: if your page takes longer than 3 seconds to render meaningful content in a mobile-emulated headless browser, assume both LLMs and Google are getting an incomplete picture. Test with curl (no JS), then with headless Chrome in mobile emulation mode (full render). The diff between those two outputs is your risk surface.
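
One way to run that comparison, sketched with Puppeteer - the URL and the "must appear" phrases are placeholders for whatever facts on the page you actually need bots to see:

typescript
// Compare what a no-JS fetch sees vs. a full mobile-emulated render.
// Sketch only: assumes Puppeteer is installed; URL and phrases are placeholders.
import puppeteer from "puppeteer";

const url = "https://example.com/pricing";
const mustAppear = ["$49/month", "14-day trial"]; // facts bots must be able to read

const rawHtml = await (await fetch(url)).text(); // roughly what GPTBot/ClaudeBot get

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setViewport({ width: 390, height: 844, isMobile: true });
await page.goto(url, { waitUntil: "networkidle0", timeout: 15_000 });
const renderedHtml = await page.content(); // roughly what Googlebot gets, budget permitting
await browser.close();

for (const phrase of mustAppear) {
  console.log(phrase, {
    inRawHtml: rawHtml.includes(phrase),           // false = invisible to AI crawlers
    inRenderedHtml: renderedHtml.includes(phrase), // false = invisible to everyone
  });
}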

And don't treat Lighthouse as the source of truth. Lighthouse is a synthetic test in a controlled lab - useful for catching regressions, basically useless for understanding what users actually experience. The signal you want is real-user monitoring (RUM) plus Chrome's CrUX data: what your customers on real devices and real networks are seeing right now. Set up RUM, watch the p75 and p95 (the median lies, especially on mobile), and improve proactively. Your dev machine running Lighthouse on localhost is the worst possible proxy for someone on a 3-year-old phone with two bars of LTE.
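
The RUM half is a few lines on the client. A minimal sketch using the open-source web-vitals package and a hypothetical /rum collection endpoint on your side; the aggregation into p75/p95 happens server-side:

typescript
// Minimal real-user monitoring: report Core Web Vitals from real sessions.
// Assumes the `web-vitals` package and a /rum endpoint you own (placeholder).
import { onLCP, onCLS, onINP, onTTFB, type Metric } from "web-vitals";

function report(metric: Metric) {
  const body = JSON.stringify({
    name: metric.name,   // "LCP", "CLS", "INP", "TTFB"
    value: metric.value,
    page: location.pathname,
  });
  // sendBeacon survives page unloads; aggregate into p75/p95 per page server-side.
  navigator.sendBeacon("/rum", body);
}

onLCP(report);
onCLS(report);
onINP(report);
onTTFB(report);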

The wins are usually unglamorous. Inline critical CSS, defer the rest. Ship <img> with explicit width and height to prevent layout shift. Audit third-party scripts ruthlessly: analytics, chat widgets, and A/B testing tools routinely add 500KB+ of JS, most of which isn't needed on first paint and can be deferred or loaded on interaction.
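
For the third-party script problem, the usual pattern is loading the widget on first interaction rather than first paint. A sketch - the vendor URL is a placeholder:

typescript
// Defer a heavy third-party widget until the user actually interacts.
// The script URL is a placeholder - swap in your chat/analytics vendor.
let widgetLoaded = false;

function loadChatWidget() {
  if (widgetLoaded) return;
  widgetLoaded = true;
  const s = document.createElement("script");
  s.src = "https://widget.example-vendor.com/loader.js";
  s.async = true;
  document.head.appendChild(s);
}

// First paint stays clean; the widget costs nothing until someone touches the page.
window.addEventListener("pointerdown", loadChatWidget, { once: true });
window.addEventListener("keydown", loadChatWidget, { once: true });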

One nuance worth its own paragraph: Cache-Control. Crawlers (Google, AI bots) respect Cache-Control directives and will hold their cached copy of your page for whatever max-age you set (Google documented exactly how Googlebot handles this in late 2024). That's great for static assets - images, CSS, JS bundles can have year-long caches without consequence. It's a foot-gun for anything that changes: pricing pages, product inventories, articles you update. Set max-age too high on a page you edited yesterday and crawlers will happily keep serving (and citing) the stale version for days. So: long max-age for things that don't change; short max-age plus ETag revalidation for things that do. The fastest request is the one that never hits your server - but only if the cached answer is still correct.
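
A sketch of that split in an Express-style handler - the specific max-age values are illustrative, not recommendations:

typescript
// Long cache for fingerprinted assets, short cache + revalidation for pages that change.
// Express-style sketch; values are illustrative.
import express from "express";

const app = express();

// Fingerprinted static assets: safe to cache for a year.
app.use("/assets", express.static("dist/assets", {
  immutable: true,
  maxAge: "365d",
}));

// Content that changes (pricing, articles): short max-age, ETag revalidation.
const renderPricingPage = () => "<html><body><h1>Pricing</h1></body></html>"; // stub
app.get("/pricing", (req, res) => {
  res.set("Cache-Control", "public, max-age=300, must-revalidate");
  res.send(renderPricingPage()); // Express attaches a weak ETag to the response by default
});

app.listen(3000);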

The math under all of this is simple: serve pages twice as fast and the bot gets through roughly twice as many per session - and a page that never gets crawled can't be indexed, ranked, or cited by anything.

The Rendering Layer (SSR vs. Client-Side)

Search crawlers like Googlebot and Bingbot run headless browsers, but their render budget - the time and memory they're willing to spend executing your JavaScript before giving up - is finite. Google doesn't publish exact numbers, but empirically you're talking seconds, not minutes. If your site relies entirely on client-side JavaScript to load content, you're rolling the dice on whether the bot waits long enough for hydration to finish. Prefer SSR (server-side rendering) or SSG (static site generation) for public pages - hydrate for interactivity, but ship HTML on the first byte. The business case is brutal: if the bot sees a blank page because the JS didn't load in time, you're invisible. Speed is safety.

There's a sharper version of this problem for AI crawlers specifically. Most of them - GPTBot, ClaudeBot, PerplexityBot - don't execute JavaScript at all. Not "they have a small render budget"; not "they sometimes time out". They fetch HTML and parse it, full stop. If your content materializes via client-side JS, what AI sees is your <div id="root"></div> shell - not your content. The cure is the same as for Google - ship real HTML on the first byte - but the stakes are higher: with Google you're gambling on render budget, with AI you've already lost.

One stack-level note that startups consistently overlook: for a static landing page or article, you probably don't need a framework at all. Next.js, Remix, and SvelteKit are powerful, but they're powerful for applications - auth, dashboards, dynamic data. For a blog post or a marketing page, raw HTML with a sprinkle of vanilla JS ships zero framework JS to the client, has nothing to hydrate, gives the bot nothing extra to render, and won't break in two years when the framework moves on. A typical Next.js setup ships ~80-250KB of JS even for a static article - that's pure overhead for content that doesn't need it. The default reach for a framework because "that's what we use" is one of the more expensive defaults in modern web dev. Pick the simpler stack when the page is genuinely simple.
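
To make "ship HTML on the first byte" concrete without any framework, here's a minimal Node handler that serves complete HTML - enough for Googlebot's renderer and for the JS-less AI crawlers alike. The route and copy are placeholders:

typescript
// SSR at its most minimal: a handler that returns complete HTML on the first byte.
// No framework, nothing to hydrate, nothing for a bot to render.
import { createServer } from "node:http";

const page = `<!doctype html>
<html lang="en">
<head>
  <title>Acme - CRM for 3-person teams</title>
  <meta name="description" content="Lightweight CRM for very small teams.">
</head>
<body>
  <h1>Pricing</h1>
  <p>$49/month, 14-day free trial.</p>
  <!-- A JS-less crawler already has everything above this line. -->
</body>
</html>`;

createServer((req, res) => {
  res.writeHead(200, { "content-type": "text/html; charset=utf-8" });
  res.end(page);
}).listen(3000);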

[Figure: three columns comparing what an AI crawler receives for the same page. Client-side rendering (React/Next.js without SSR): an empty div root plus a script tag - a bot that doesn't execute JavaScript sees nothing. Server-side rendering: full HTML, including the Pricing heading and price text, on the first byte. Raw HTML with no framework: the same complete content with no JavaScript at all, and the fastest possible parse.]

Architecture & Canonicals

Two engineering hygiene items that punch way above their weight. First, graph depth: your most valuable product pages should be within ~3 clicks of the homepage. Anything buried 6 levels deep barely gets crawled. Second, duplicate control. Startups generate duplicate routes constantly - /pricing, /pricing/, /pricing?utm=x, /Pricing - and without explicit canonical tags you're diluting ranking power across N variants of the same page. Tell the search engine which URL is the source of truth. Do it in HTML.
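
A small sketch of that canonical discipline: normalize the URL server-side and emit the tag in the head. The domain is a placeholder, and the lowercasing assumes your routes are all lowercase:

typescript
// Build one canonical URL per page: strip tracking params, trailing slashes, and casing drift.
// Domain is a placeholder; lowercasing assumes lowercase routes.
function canonicalUrl(requestUrl: string): string {
  const url = new URL(requestUrl, "https://example.com");
  url.hostname = url.hostname.toLowerCase();
  url.pathname = url.pathname.toLowerCase().replace(/\/+$/, "") || "/";
  url.search = ""; // drop ?utm=..., ?ref=..., sort orders, etc.
  url.hash = "";
  return url.toString();
}

// Emit in the <head> of every variant of the page:
const tag = `<link rel="canonical" href="${canonicalUrl("/Pricing/?utm_source=x")}">`;
console.log(tag); // <link rel="canonical" href="https://example.com/pricing">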

Schema as Translation Layer

Schema markup is how you translate human-readable pages into machine-readable facts. It's the single highest-leverage thing you can ship for AI citation, and it's grossly under-implemented at most startups I've looked at.

LLMs and search engines have the same problem: parsing intent from prose is expensive and error-prone. Structured data - specifically JSON-LD - hands them the answer directly. When ChatGPT cites a product price, a review score, or an author's credentials, it's almost always pulling from schema, not from paragraph text. Think of schema as the API contract between your site and every machine that visits.

For most B2B/B2C startups, the minimum viable set is four:

  • Organization - name, logo, social profiles, founding date. Establishes brand identity. Goes on every page.
  • Article / BlogPosting - author, datePublished, dateModified, headline. Critical for content cited by AI. Without this, your byline is invisible to machines.
  • Product - name, price, availability, aggregateRating. If you sell anything, non-negotiable.
  • BreadcrumbList - helps both Google and LLMs understand site hierarchy.

To make this concrete, here's the actual Article schema for the page you're reading right now:

json-ld
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "SEO and AEO for Startups in 2026",
  "description": "Search optimization guide in the age of AI.",
  "datePublished": "2026-05-07",
  "dateModified": "2026-05-07",
  "author": {
    "@type": "Person",
    "name": "Filip Zolyniak",
    "url": "https://zoltw.com"
  },
  "publisher": {
    "@type": "Organization",
    "name": "Filip Zolyniak",
    "url": "https://zoltw.com"
  },
  "mainEntityOfPage": {
    "@type": "WebPage",
    "@id": "https://zoltw.com/blog/seo-aeo-startups-2026"
  }
}

About twenty lines. That's the gap between an LLM having to guess this article's metadata from prose and getting it handed over as structured fact. Update dateModified when you edit. Keep author linked to a real person with a real URL. The rest is plumbing.

Add tier two when relevant: FAQPage for Q&A pages (still surfaces in AI answers more often than prose), HowTo for step-by-step guides (AI agents preferentially cite structured procedures), SoftwareApplication for SaaS products, Person for author pages.

Implementation is short. Use JSON-LD in a <script type="application/ld+json"> tag in the <head>. Skip microdata and RDFa - harder to maintain, no advantage. Validate every template through Google's Rich Results Test and Schema.org Validator, because broken schema is worse than no schema; it can suppress rich results entirely. And the schema must match visible content. If your Product schema lists $99 but the page shows $149, Google treats it as deceptive and demotes you. The mental model that works for me: if a fact about your business appears on the page, it should also appear in schema.
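
One way to make the "schema matches visible content" rule structural rather than something you remember: render both the visible copy and the JSON-LD from the same object, so they can't drift. A sketch - the product data is made up:

typescript
// Render the visible price and the Product schema from one object so they can't diverge.
// Product values are illustrative.
const product = {
  name: "Acme CRM",
  price: "49.00",
  currency: "USD",
  ratingValue: "4.7",
  reviewCount: 132,
};

const schema = {
  "@context": "https://schema.org",
  "@type": "Product",
  name: product.name,
  offers: {
    "@type": "Offer",
    price: product.price,
    priceCurrency: product.currency,
    availability: "https://schema.org/InStock",
  },
  aggregateRating: {
    "@type": "AggregateRating",
    ratingValue: product.ratingValue,
    reviewCount: product.reviewCount,
  },
};

const html = `
  <h1>${product.name}</h1>
  <p>$${product.price}/month</p>
  <script type="application/ld+json">${JSON.stringify(schema)}</script>
`;
console.log(html);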

The AI Shift

This is the new frontier. You're not just optimizing for keywords anymore - you're optimizing for retrieval-augmented generation. You want AI systems to find your data, trust it, and cite it back to users. That's a different objective function, and a few things follow from it.

AEO (Answer Engine Optimization)

To get cited by ChatGPT or Perplexity, your content has to be machine-readable in a fairly specific way. Three ingredients matter. Semantic density first - AI models have limited context windows, so cut the fluff. Structure second - clear H2/H3 headers, because AI parsers use headers to chunk information, and bad headers mean bad chunks mean worse retrieval. Primary source data third, and this is the big one: AI massively over-indexes on unique data. Original benchmarks, API docs, proprietary statistics. If you have data nobody else has, put it on a public URL. This is the highest-value currency in the AI age, and almost nobody is minting it.

Bot Governance

You have to decide what AI crawlers are allowed to do with your pages - both training (GPTBot, ClaudeBot) and retrieval for AI answers (OAI-SearchBot, PerplexityBot). Two stances. Growth: allow them and increase the chance your brand surfaces in AI answers. Protection: block them via robots.txt if you have proprietary data - pricing intelligence, scraped competitor data - that you don't want a competitor's model ingesting.

For most early-stage startups this is a fake choice. You don't have proprietary data worth protecting yet, and you do need every ounce of distribution you can get. Default to growth. Revisit when you actually have a moat to defend.
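
Either stance is a few lines of robots.txt. The growth default is shown below, with the protective variant sketched in comments - the user-agent strings are the ones these vendors currently publish, but double-check them before shipping:

robots.txt
# Growth stance: let AI crawlers in (an absent rule has the same effect).
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Protection stance: flip Allow to Disallow for the paths you actually care about, e.g.
# User-agent: GPTBot
# Disallow: /pricing-intelligence/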

The Optional Standard: llms.txt

There's been a lot of chatter about /llms.txt - a proposed standard that acts like a sitemap specifically for AI agents. The theory: you host a file at yourdomain.com/llms.txt linking to simplified, markdown-only versions of your docs, giving LLMs a clean, noise-free path to ingest your documentation without HTML bloat.

Personally I suspect this stays optional for a while. As of early 2026 it's still experimental, and the major crawlers (Google, OpenAI, Anthropic) are already excellent at parsing HTML. The verdict: nice-to-have, priority 3. Do it if you have complex technical documentation (APIs, SDKs) and want LLMs to write code accurately based on your docs. Skip it if you're a standard B2B/B2C marketing site - almost certainly unnecessary engineering overhead.
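
If you do ship one, the proposed format is a plain markdown file at the site root, roughly this shape - the project name, summary, and links below are illustrative:

llms.txt
# Acme CRM

> Lightweight CRM for very small teams. The docs below are plain-markdown mirrors of the HTML docs.

## Docs

- [Quickstart](https://example.com/docs/quickstart.md): install, auth, first request
- [API reference](https://example.com/docs/api.md): REST endpoints and error codes

## Optional

- [Changelog](https://example.com/changelog.md): release notes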

Trust Signals

Google's E-E-A-T framework - Experience, Expertise, Authoritativeness, Trustworthiness - used to be a ranking heuristic. In 2026 it's the bare minimum for both Google and LLMs, because both face the same problem: the open web is increasingly AI-generated, and they need signals to tell primary sources apart from generated noise.

When an LLM decides whether to cite your page, it weighs source credibility heavily - often more heavily than Google does. A well-written but anonymous post will lose to a mediocre post written by a named expert with a verifiable track record. The reasoning is simple: the cost of being wrong is uneven, so models lean toward sources they can defend. Translation: being a real, identifiable person on the internet is now a ranking factor.

What actually moves the needle is unglamorous and hard to fake.

  • Named authors with real bios - every article should have a byline that links to an author page with credentials, a photo, links to LinkedIn / GitHub / published work. "Staff Writer" is invisible to E-E-A-T.
  • An About page that isn't generic - real team photos, real names, real office address (or an explicit "remote, registered in X"). Stock photos and "we're a passionate team of experts" actively hurt you.
  • Original research and data - surveys, benchmarks, internal usage stats, customer outcome data. This is what gets cited, linked, and quoted, and it's the single biggest signal that you're a primary source rather than a content mill.
  • Citations and outbound links to credible sources - academic papers, government data, well-known industry reports. Counterintuitively, these help you rank rather than "leaking authority."
  • Customer evidence - logos, named case studies (with full company names, not "a Fortune 500 client"), reviews on third-party platforms (G2, Trustpilot, Capterra). These are off-site signals you don't fully control, which is exactly why they carry weight.

The half you don't own is, in some ways, the bigger half. LLMs lean heavily on what other people say about you - Reddit threads, Hacker News comments, industry newsletters, Wikipedia. You can't fake any of this, but you can earn it. Ship original frameworks, naming, or strong opinions; "programmatic SEO" got a name because somebody wrote about it that way. Show up where your audience already is - a founder posting genuine technical detail in relevant subreddits will produce more LLM-citation-worthy text than a year of corporate blog posts. Get on Wikipedia if you legitimately qualify, because LLM training data over-indexes on it.

One filter that's served me well: for every page, ask whether a skeptical reader could verify who wrote this, when, and why they're qualified. If the answer is no, you're competing with AI-generated content on its own terms. You'll lose.

How You Know It's Working

Everything above is a build playbook. None of it matters without a feedback loop. The pattern I see most often is teams shipping SEO and AEO work for ~six months, getting nothing visible from the system, and quietly winding the effort down. Almost always the failure isn't the work itself - it's the absence of measurement. You can't improve what you can't see.

The good news: a decent measurement stack is mostly free.

Start with Google Search Console. It's free, it's authoritative, and it's the only place where Google tells you directly which queries you rank for, what your CTR looks like, and where your index coverage is leaking. Day one. Wire alerts into Slack so you find out about index drops before your CEO does. Add Bing Webmaster Tools while you're at it - same shape of data, smaller volume, but Bing is what powers ChatGPT's web search now, so it matters more than it used to.

For AI specifically, a new category of tooling emerged in ~2024: AI citation tracking (Profound, Otterly, Peec.ai, and others). What they do is run a corpus of representative queries against ChatGPT, Perplexity, Claude, and Gemini on a schedule, parse the answers, and report which sources got cited. Think of it as the AEO equivalent of rank tracking. Pricing runs from a few hundred to a few thousand dollars a month; for an early-stage startup, even running 50 prompts manually against the major models once a month gives you a free baseline. The point is to have a number, not necessarily an expensive one.
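
The manual baseline is scriptable, too. A rough sketch that runs a prompt list through one provider's API and checks whether your brand shows up in the answer text - note this is a proxy (plain API completions, no live web search; the paid tools query the consumer products instead), and the endpoint, model name, and brand markers are assumptions to adapt per provider:

typescript
// Rough monthly baseline: do model answers to your target queries mention your brand?
// Endpoint/shape follow OpenAI's public chat completions API; adapt per provider.
const prompts = [
  "What's a good CRM for a 3-person sales team?",
  "Best Stripe alternatives for marketplaces?",
];
const brandMarkers = ["Acme CRM", "acme.com"]; // placeholders: your brand and domain

for (const prompt of prompts) {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "gpt-4o", messages: [{ role: "user", content: prompt }] }),
  });
  const answer: string = (await res.json()).choices[0].message.content;
  const mentioned = brandMarkers.some((m) => answer.toLowerCase().includes(m.toLowerCase()));
  console.log(`${mentioned ? "MENTIONED" : "ABSENT   "} | ${prompt}`);
}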

For performance you should already have real-user monitoring in place from the previous section. Watch p75 and p95 over time, alert on regressions. Synthetic tools (Lighthouse-CI, WebPageTest) are useful for catching regressions in PRs - useful, not source of truth.

Cadence is what separates a measurement stack from measurement theater. Roughly: daily for index-coverage alerts and crawler 4xx/5xx spikes; weekly for top-page traffic, primary-keyword rankings, and AI citation deltas; monthly for share-of-voice and content-refresh queue; quarterly for a full audit (broken links, redirect chains, slow pages, abandoned URLs). Most teams check Search Console "occasionally" and call that monitoring. That's not monitoring. That's hope.

One thing teams consistently miss: track which pages get cited by AI separately from which pages get traffic from Google. The lists overlap less than you'd expect. Pages that win Google rankings tend to be long and keyword-dense. Pages that win AI citations tend to be shorter, more data-dense, and look "underdeveloped" by classic SEO standards. If your dashboard only shows Google traffic, you're optimizing blind for half the new game.

The rule: if you can't draw a graph of your traffic plus AI citations over the last ~90 days, you don't have measurement. Fix that before optimizing anything else.

Stage-Appropriate Playbook

The previous six sections describe what mature SEO/AEO infrastructure looks like. Most readers don't have mature anything. This section is the prioritization view: what to actually do at your stage, and what to skip.

Two facts to accept upfront. First, SEO compounds slowly - 6 to 18 months is a realistic horizon for a new domain to start ranking for anything competitive. Second, the system rewards consistency more than intensity. A startup that ships one good page per week for a year crushes one that drops 50 pages in a sprint and goes silent. That timeline reality drives everything below.

Pre-PMF (first ~6 months). Honestly, ignore the marketing side of SEO. Your job is to find product-market fit, and the conversion paths that prove fit run through HN, Twitter, Reddit, Slack/Discord communities, and direct outreach - not search. Content writing, keyword research, rank tracking - all of that is procrastination dressed as work at this stage.

But the technical foundations are different. They're decisions you should make on day one because the cost of getting them wrong scales with every page you ship. Specifically: SSR or SSG from the start (not "we'll add SSR later" - migrating an entire CSR app is painful), JSON-LD schema on every page from page one, canonical tags configured, clean URL structure without query-string sprawl, a framework choice that matches the page type (don't default to Next.js for a marketing site if raw HTML would do). These take a few hours to set up correctly and weeks to retrofit later. The exception that overrides everything: if your product is itself search-driven - a directory, comparison tool, calculator - then SEO is your distribution channel and the marketing side starts day one too.

Seed / early traction. You have customers, you know who you're selling to, you've started writing. The right moves are foundational, not creative. Search Console alerts wired into Slack. JSON-LD schema for Organization, Article, Product. Named authors with real bios. Basic real-user monitoring. Pick ~10 high-intent queries and write the best page on the internet for each - slowly. Don't even think about pSEO yet. Budget at this stage: $0-$200/month, maybe a half-day per week of someone's time.

Series A / scale. Now SEO can be a meaningful channel and the leverage plays start making sense. Programmatic SEO if your data supports it. Content velocity with editorial process. AI citation tracking ($200-$1k/month) to treat AEO as a real channel. Internal linking infrastructure (pillar pages, topic clusters). Hire your first SEO specialist or sign with a serious agency - the gap between a competent SEO and an incompetent one is roughly two orders of magnitude in this domain, so the hire matters more than the title.

Growth and beyond. Most of this article applies. Share-of-voice tracking, content refresh cycles, investment in original research and proprietary data (the only durable AEO moat), migration planning every time your product or brand changes. At this stage SEO/AEO is a function with headcount, not a side project.

When to skip SEO entirely. Three honest scenarios. (1) Your TAM doesn't search - they're discovered via referral, marketplace, or sales-led motion. (2) Your sales cycle is so long and considered that paid + outbound + warm intros dominate, and SEO traffic is irrelevant on the margin. (3) You're competing against established competitors (DR 80+ on Ahrefs) on queries where you'd need ~5 years to break through. Some markets are locked. Don't fight battles you can't win inside your runway.

The AI content question is the elephant. Every founder reading this is using ChatGPT or Claude to draft content. Fine. But there's a catch: AI-drafted content edited by a domain expert ranks well; pure AI output dropped at scale is exactly what Google's Helpful Content updates were built to suppress. The economic question isn't "should I use AI" - it's "how much expert time goes into each piece." My rough rule: 2-3 hours of subject-matter-expert review per piece is the right ratio for thought leadership. Pure AI for templated landing pages is fine only if the underlying data is genuinely unique to your URL.

Hiring. When in doubt, hire a fractional SEO consultant for 3 months before committing to a full-time hire. Good fractional people will tell you within ~30 days whether you have an SEO opportunity worth the investment. Bad ones recommend a "content strategy" without ever opening your Search Console.

The Checklist

Reference version of everything above, split by who owns each item.

For the Founder (Strategy)

  • Are we targeting high-intent problems, not generic search volume?
  • Can we use programmatic SEO to generate hundreds of high-value pages from one dataset (with unique data per URL)?
  • Does each page actually solve the user's problem (tool, template, answer) or is it just text?
  • Have we decided whether to allow or block AI training bots? (Default: allow.)
  • Are our authors real, named, and credentialed? Does our About page look like a real company?
  • Are we publishing anything proprietary that's worth citing?

For Engineering (Execution)

  • Critical pages are SSR/SSG (view-source contains content).
  • If >10k pages, server logs audited so bots aren't burning budget on 404s or infinite loops.
  • TTFB under 600ms measured from US East, US West, and EU - not from localhost. CDN in front of single-region origins.
  • HTML under ~100KB gzipped, JS critical path under ~200KB compressed.
  • LCP under 2.5s, CLS near zero - measured on mobile (throttled 4G, mid-range Android), not just desktop.
  • Real-user monitoring (RUM) in place. p75 and p95 tracked, not just synthetic Lighthouse scores.
  • JSON-LD shipped for Organization, Article, Product, BreadcrumbList, validated.
  • robots.txt with explicit rules for GPTBot, ClaudeBot, CCBot, PerplexityBot.
  • Decision made on whether llms.txt is worth the effort for your specific docs.

TLDR. The era of tricking search engines is over. Pages are products, and they have to serve two runtimes at once: humans who want a solution and machines who want structured, performant data. The five things that matter most to build are: solving the user's intent (often by shipping a tool, not text), engineering for crawl and render budget the way you would for any production system, shipping JSON-LD schema as if it were the API contract for your business, picking an AI bot stance (probably allow), and proving you're a real, named entity worth citing - both on your own pages and across the web you don't control.

Two things matter around the build. Measure ruthlessly - if you can't draw a graph of your traffic plus AI citations over the last 90 days, you're flying blind regardless of what you ship. And match the work to your stage; pre-PMF startups should ignore most of this and focus on direct distribution, while leverage plays only make sense once you have customers and a thesis worth scaling.

Most teams won't do all of this. The ones that do will compound for years, while everyone else slowly drifts off the first page of both Google and the AI answers. That's the opportunity.