The Scenario
A technical writer at a software company has just been handed a project: migrate 30 pages of a third-party vendor's documentation into the company's internal knowledge base. The URLs are already in column A of a sheet called "Vendor Docs Import." Column B is labeled "Markdown Content" and is completely empty.
The vendor's site has no export button. There's no API for the docs. The only option the writer has been given is "just go through them one by one."
One by one is 30 browser tabs. Thirty copy-paste operations. Thirty formatting cleanups because pasted HTML never comes out clean. The writer's job is documentation, not web scraping.
The bad version:
- Open URL #1, select all the body content, paste into a text editor, manually strip the HTML tags, fix the headings, remove the navigation elements that got pasted in
- Repeat for URL #2 — this one has an embedded table that pastes as a garbled line of text
- Realize at URL #8 that the vendor recently updated their nav structure and the HTML being cleaned up now has five levels of wrapper divs around the actual content
This is not what the writer was hired to do, and the knowledge base migration has a deadline of end of sprint.
The Easy Way: One Prompt in SheetXAI
SheetXAI is an AI agent embedded in your Google Sheet. It reads the sheet, sees the URL list, and uses Hyperbrowser's Markdown fetch capability to retrieve each page's content as clean, structured Markdown — then writes it back into the column you specify, ready for ingestion.
For each URL in column A of the "Vendor Docs Import" tab (rows 2–31), use Hyperbrowser to fetch the page and return the content as Markdown. Write the page title to column B and the full Markdown text to column C. Truncate any Markdown body over 3000 characters and note "truncated" in column D for those rows.
What You Get
- Column B: page title from the HTML <title> tag
- Column C: clean Markdown body, with headings, code blocks, lists, and tables preserved and navigation chrome and sidebar content stripped
- Column D: "truncated" flag for any page whose content exceeded the limit, so you know which ones need a closer look
- All 30 rows processed in one operation, with no tab-switching
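The truncation rule in the prompt is simple enough to picture in a few lines of Python. This is only an illustrative sketch, not how SheetXAI implements it: the function name `truncate_markdown` is hypothetical, and the 3000-character limit and the "truncated" flag are taken from the prompt.

```python
from typing import Tuple

LIMIT = 3000  # character limit taken from the prompt


def truncate_markdown(body: str, limit: int = LIMIT) -> Tuple[str, str]:
    """Return (text for column C, flag for column D).

    Hypothetical helper: bodies at or under the limit pass through
    unchanged with an empty flag; longer bodies are cut at the limit.
    """
    if len(body) <= limit:
        return body, ""
    return body[:limit], "truncated"
```

A body of exactly 3000 characters is left alone, since the prompt only asks to cut bodies over the limit.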
What If the Data Is Not Quite Ready
Some pages in the vendor docs require authentication to view — they return a login redirect instead of content
For each row in "Vendor Docs Import" where column C is blank or contains only a login prompt, note "auth required" in column D and flag the row in column E so the team can identify which pages need manual handling.
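For a sense of what "blank or contains only a login prompt" might mean in practice, here is a rough Python sketch. The marker phrases, the 400-character threshold, and the name `needs_auth` are all assumptions made for illustration; a real vendor login page may look different.

```python
# Assumed phrases that suggest a login wall; adjust for the actual vendor site.
LOGIN_MARKERS = ("sign in", "log in", "login", "password")


def needs_auth(markdown: str, max_len: int = 400) -> bool:
    """Guess whether a fetched page is empty or just a login prompt.

    Hypothetical heuristic: an empty body counts as auth-required, and so
    does a short body that mentions one of the login phrases.
    """
    text = markdown.strip().lower()
    if not text:
        return True
    return len(text) <= max_len and any(marker in text for marker in LOGIN_MARKERS)
```

The length check keeps a long, legitimate page that merely mentions passwords from being flagged.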
The Markdown in column C has inconsistent heading levels — some pages use H1 for section headers, others use H2 or H3
For each row in "Vendor Docs Import," normalize the Markdown in column C so that the top-level section headings are H2 (##) and subsections are H3 (###). Overwrite column C with the corrected Markdown.
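The normalization described in this prompt amounts to shifting every heading by the same amount so that the shallowest one lands on H2. The Python sketch below shows one way to do that, assuming ATX-style headings (`#`, `##`, and so on) and backtick or tilde code fences; `normalize_headings` is a hypothetical name, and this is not a description of SheetXAI's internals.

```python
import re

HEADING = re.compile(r"^(#{1,6})(\s+.*)$")
FENCE = re.compile(r"^\s*(```|~~~)")


def normalize_headings(markdown: str, top_level: int = 2) -> str:
    """Shift all headings so the shallowest one becomes `top_level`."""
    lines = markdown.split("\n")

    # First pass: find the shallowest heading level outside code fences.
    levels = []
    in_fence = False
    for line in lines:
        if FENCE.match(line):
            in_fence = not in_fence
        elif not in_fence:
            match = HEADING.match(line)
            if match:
                levels.append(len(match.group(1)))
    if not levels:
        return markdown

    # Second pass: apply the same shift to every heading, clamped to 1..6.
    shift = top_level - min(levels)
    out = []
    in_fence = False
    for line in lines:
        if FENCE.match(line):
            in_fence = not in_fence
        elif not in_fence:
            match = HEADING.match(line)
            if match:
                level = min(6, max(1, len(match.group(1)) + shift))
                line = "#" * level + match.group(2)
        out.append(line)
    return "\n".join(out)
```

Tracking the fence state matters here: without it, a `# comment` line inside a bash or Python code block would be rewritten as if it were a heading.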
Several pages contain code blocks in three different languages — Python, JavaScript, and bash — and the knowledge base needs them tagged correctly
In "Vendor Docs Import," scan column C for fenced code blocks without a language tag and attempt to identify the language based on syntax. Write the detected language to column F for each row that had untagged code blocks.
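To make the idea of "identify the language based on syntax" concrete, here is a deliberately crude Python sketch. It finds fenced blocks with no language tag and guesses between the three languages named above using a handful of keyword patterns. The function names and the patterns themselves are assumptions for illustration; an AI agent would presumably weigh far more signals than this.

```python
import re
from typing import List


def untagged_blocks(markdown: str) -> List[str]:
    """Collect the bodies of backtick-fenced blocks that have no language tag."""
    blocks: List[str] = []
    current: List[str] = []
    in_fence = False
    tagged = False
    for line in markdown.split("\n"):
        stripped = line.strip()
        if stripped.startswith("```"):
            if not in_fence:
                in_fence = True
                tagged = stripped != "```"  # text after the backticks is a tag
                current = []
            else:
                if not tagged:
                    blocks.append("\n".join(current))
                in_fence = False
        elif in_fence:
            current.append(line)
    return blocks


def guess_language(code: str) -> str:
    """Very rough keyword heuristics; returns 'unknown' when nothing matches."""
    if re.search(r"\b(const|let|var|function)\b|=>|console\.log", code):
        return "javascript"
    if re.search(r"^\s*(def |class |import |from \w+ import |print\()", code, re.MULTILINE):
        return "python"
    if re.search(r"^\s*(\$ |sudo |cd |echo |export |curl |npm |pip )", code, re.MULTILINE):
        return "bash"
    return "unknown"


def detect_languages(markdown: str) -> List[str]:
    """Guess a language for each untagged block, in document order."""
    return [guess_language(block) for block in untagged_blocks(markdown)]
```

Joining the result with commas gives a value suitable for a single cell, for example `", ".join(detect_languages(cell_text))` for column F.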
The knowledge base system requires each doc to have a one-sentence summary in addition to the full content — and the team needs this before import
For each row in "Vendor Docs Import" where column C is populated, generate a one-sentence summary of the Markdown content and write it to column G. Then produce a combined import-ready format in column H with the title, summary, and full Markdown concatenated in the structure the knowledge base expects.
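The "structure the knowledge base expects" is not specified here, so the following Python sketch simply assumes one plausible layout: the title as an H1, the summary as a blockquote, then the Markdown body. Both the layout and the name `build_import_doc` are hypothetical; swap in whatever format your knowledge base actually requires.

```python
def build_import_doc(title: str, summary: str, markdown: str) -> str:
    """Concatenate title, summary, and body into one import-ready string.

    Assumed layout (not taken from any real knowledge base spec):
    H1 title, blockquoted summary, then the body, separated by blank lines.
    """
    parts = [f"# {title.strip()}", f"> {summary.strip()}", markdown.strip()]
    return "\n\n".join(parts) + "\n"
```

Putting the title at H1 fits with bodies whose section headings have already been normalized to start at H2.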
Running the fetch, the normalization, and the summary generation in one prompt means the sheet goes from a URL list to import-ready documentation without a manual step in between.
Try It
Get the 7-day free trial of SheetXAI and open your documentation URL list, then ask it to fetch every page as Markdown using Hyperbrowser and write the content back into your sheet. Also see bulk scraping structured data from URLs or the Hyperbrowser overview.
