Firecrawl · Google Sheets Guide

Crawl a Documentation Site Into Google Sheets for LLM Training Data

2026-05-14
5 min read

The Scenario

You're three weeks into building a fine-tuning dataset for an internal LLM. The plan was to use your company's own API documentation as the training corpus: 200 pages of endpoint descriptions, parameter tables, and example responses. The problem: nobody exported it before the documentation migration started, and the old site is going dark in ten days.

You could scrape it manually. You open the docs homepage and start counting pages. There are sub-sections you haven't even clicked into yet.

The bad version:

  • Click into each documentation section, copy the page content into a text file, paste it into the sheet one row at a time
  • Lose track of which subsections you've covered — the sidebar expands into nested categories you didn't know existed
  • Realize on day six that you have 40 pages and need 160 more before the site goes down

Your dataset pipeline is waiting on this input. The model training slot is scheduled. You cannot afford a six-day manual crawl.

The Easy Way: One Prompt in SheetXAI

SheetXAI is an AI agent that lives inside your Google Sheet. It reads your data, and through its built-in Firecrawl integration it can crawl an entire documentation site, following internal links and respecting crawl limits, then write each page's URL, title, and markdown content into your sheet as individual rows. One prompt, one crawl, one write-back.

Crawl https://docs.example.com up to 200 pages and write each page's URL into column A, the page title into column B, and the full markdown content into column C of my "Raw Corpus" sheet. Start a new row for each page. Flag any pages that returned a non-200 status in column D.

What You Get

  • Column A with the full URL of each crawled page
  • Column B with the page title, extracted from the HTML title tag or the H1
  • Column C with the full markdown content of the page, ready for tokenization or chunking
  • Column D with the HTTP status code for any page that didn't return cleanly — 404s, redirects, and errors all visible rather than silently excluded
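If you want to sanity-check the layout outside the sheet, the four columns above can be modeled in a few lines of Python. This is only an illustrative sketch: the `Page` record and `to_rows` helper are hypothetical names, not part of SheetXAI or Firecrawl, and the sketch assumes you already have the crawl results in memory.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical record for one crawled page (not a SheetXAI or Firecrawl type)."""
    url: str
    title: str
    markdown: str
    status: int


def to_rows(pages):
    """Map pages to [A, B, C, D] rows; column D stays empty for clean 200 responses."""
    rows = []
    for page in pages:
        flag = "" if page.status == 200 else str(page.status)
        rows.append([page.url, page.title, page.markdown, flag])
    return rows


pages = [
    Page("https://docs.example.com/auth", "Authentication", "# Authentication", 200),
    Page("https://docs.example.com/old", "Gone", "", 404),
]
rows = to_rows(pages)
```

Each inner list corresponds to one sheet row, so the second page here would show "404" in column D while the first leaves it blank.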

What If the Data Is Not Quite Ready

Some crawled pages are too thin to be useful as training data

In the "Raw Corpus" sheet, flag any row in column C where the markdown content is shorter than 150 words by writing "thin" into column E. Then delete all rows marked "thin" so the final corpus only contains substantive pages.
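For reference, the "thin page" rule in that prompt amounts to a word count against a threshold. The sketch below is one plausible reading, assuming words are whitespace-separated tokens and rows are `[url, title, markdown]` lists; `word_count` and `drop_thin_rows` are hypothetical helper names, and SheetXAI may count words differently.

```python
THIN_WORD_LIMIT = 150  # threshold taken from the prompt above


def word_count(markdown: str) -> int:
    """Count whitespace-separated tokens; markdown symbols count as part of words."""
    return len(markdown.split())


def drop_thin_rows(rows, limit=THIN_WORD_LIMIT):
    """Keep rows whose column C (index 2) has at least `limit` words."""
    return [row for row in rows if word_count(row[2]) >= limit]


rows = [
    ["https://docs.example.com/a", "A", "word " * 149],  # 149 words: thin
    ["https://docs.example.com/b", "B", "word " * 150],  # 150 words: kept
]
kept = drop_thin_rows(rows)
```

A page with exactly 150 words survives here because the prompt says "shorter than 150 words"; adjust the comparison if you want a stricter cut.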

The documentation site has a changelog section I want to exclude — all URLs under /changelog/

Crawl https://docs.example.com up to 200 pages, but skip any URL whose path starts with /changelog/. Write URL, title, and markdown into columns A, B, C of the "Raw Corpus" sheet.
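If you want to double-check the sheet afterwards, the exclusion rule is a prefix test on the URL path. The following is a minimal sketch using Python's standard `urllib.parse`; `is_excluded` is a hypothetical helper, and it assumes the rule applies to the path only, not the query string or host.

```python
from urllib.parse import urlsplit


def is_excluded(url: str, prefix: str = "/changelog/") -> bool:
    """Return True when the URL's path starts with the excluded prefix."""
    return urlsplit(url).path.startswith(prefix)


urls = [
    "https://docs.example.com/changelog/2024-01",
    "https://docs.example.com/api/changelog/",  # prefix is not at the start of the path
    "https://docs.example.com/changelog",       # no trailing slash, so not matched
]
leaked = [u for u in urls if is_excluded(u)]
```

Note that the bare `/changelog` index page is not matched by the `/changelog/` prefix, so mention it separately in the prompt if you want it skipped as well.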

The content needs deduplication — some pages appear at two URLs (canonical and legacy)

In the "Raw Corpus" sheet, compare the content in column C across all rows. If two rows have markdown content that is more than 90% similar, keep the row with the shorter URL and delete the duplicate. Write the count of removed duplicates into cell F1.
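The prompt leaves "90% similar" up to the agent. One way to make it concrete, offered here as an assumption rather than a description of how SheetXAI measures similarity, is the character-level ratio from Python's standard `difflib`. The `dedupe` helper below is hypothetical and expects rows shaped as `[url, title, markdown]`.

```python
from difflib import SequenceMatcher


def dedupe(rows, threshold=0.9):
    """Drop near-duplicate rows, keeping the shorter URL; returns (kept_rows, removed_count)."""
    # Visit rows from shortest URL to longest so the shorter URL always wins.
    order = sorted(range(len(rows)), key=lambda i: len(rows[i][0]))
    kept_idx = []
    for i in order:
        is_dup = any(
            SequenceMatcher(None, rows[i][2], rows[k][2], autojunk=False).ratio() > threshold
            for k in kept_idx
        )
        if not is_dup:
            kept_idx.append(i)
    # Restore the original sheet order for the surviving rows.
    kept = [rows[i] for i in sorted(kept_idx)]
    return kept, len(rows) - len(kept)


text = "Authenticate with an API key in the request header. " * 5
rows = [
    ["https://docs.example.com/legacy/auth", "Auth", text],
    ["https://docs.example.com/auth", "Auth", text],
    ["https://docs.example.com/x", "X", "completely different words here"],
]
kept, removed = dedupe(rows)
```

The pairwise comparison is quadratic in the number of rows, which is fine for a 200-page corpus but would need a cheaper approach (hashing or shingling, for example) on much larger crawls. The `removed` value corresponds to the count the prompt writes into cell F1.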

The full pipeline: crawl, remove short pages, deduplicate, and classify in one shot

Crawl https://docs.example.com up to 200 pages. Write URL, title, and markdown into columns A, B, C. Then remove any row where column C is under 150 words. Then remove duplicate rows where the content in column C is more than 90% similar, keeping the shorter URL. Finally, classify each remaining row as "Reference", "Tutorial", or "Conceptual" based on the markdown content and write the classification into column D.

One instruction handles the full pipeline — crawl, filter, dedup, classify.

Try It

Start the 7-day free trial of SheetXAI, open your corpus-prep sheet, and ask it to crawl the docs site and write each page as a row with the content ready for chunking. For the hub guide, see How to Connect Firecrawl to Google Sheets. Also see: Run Web Search Queries From a Google Sheet and Write Back Results.
