ScrapeGraph AI · Excel Guide

Dedup and Normalize Scraped Output in an Excel Workbook

2026-05-14
5 min read

The Scenario

You're a data analyst. You ran a bulk SmartScraper job yesterday and the output landed in your Excel workbook: 400 rows of product data from supplier sites. The crawl finished at midnight. You woke up to the results.

The results are a mess.

Duplicate SKUs appear across multiple supplier pages — some rows appear three times. The price column has values in five different formats. The description column has 80 blank cells where the scraper hit a page that loaded descriptions via JavaScript after the initial render.

Your supply chain manager is running a purchase order review in three hours and this data is supposed to feed it.

The bad version:

  • Sort by SKU, scan for duplicates, delete them manually row by row — there are 47 duplicate groups
  • Select the price column, write a formula to detect the format of each cell and convert it, realize the formula handles four of the five formats and breaks on the fifth
  • Filter for blank descriptions, type "MISSING" in each blank cell, wonder how many you missed

By the time you've finished the duplicates, 45 minutes are gone and you haven't touched price normalization. The review now starts in just over two hours.

The Easy Way: One Prompt in SheetXAI

SheetXAI is an AI agent inside your Excel workbook. It reads your scraped product data, identifies the data quality problems, and cleans them in one pass.

Paste this into the SheetXAI sidebar:

Remove duplicate rows from my scraped product workbook based on matching SKU in column B, normalize all price values in column D to a plain two-decimal number, and flag rows with blank descriptions in column E with the text MISSING

What You Get

  • Duplicate rows identified by SKU in column B are removed, keeping the row with the most complete data
  • Column D values are all converted to a plain numeric value with two decimal places
  • Blank cells in column E are replaced with the text MISSING
  • A note in cell H1 reports how many rows were deduplicated and how many MISSING flags were written
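If you want to sanity-check the result, or see what this kind of cleanup involves, here is a minimal Python sketch of the same three steps. It is not how SheetXAI works internally. The row keys (`sku`, `price`, `description`), the function names, and the price formats handled (currency symbols or codes, with either "." or "," as the decimal mark) are all assumptions, since the five formats in your scrape may differ.

```python
import re
from decimal import Decimal, ROUND_HALF_UP


def is_blank(value):
    """True for None, empty strings, and whitespace-only strings."""
    return value is None or not str(value).strip()


def normalize_price(raw):
    """Convert a scraped price string to a two-decimal Decimal, or None if it has no digits."""
    text = re.sub(r"[^\d.,]", "", str(raw))  # drop currency symbols, codes, spaces
    if not any(ch.isdigit() for ch in text):
        return None
    last_dot, last_comma = text.rfind("."), text.rfind(",")
    if last_dot >= 0 and last_comma >= 0:
        # Both separators present: whichever comes last is the decimal mark.
        decimal_sep = "." if last_dot > last_comma else ","
    elif last_comma >= 0:
        # Comma only: decimal mark if it is a single comma followed by 1-2 digits.
        digits_after = len(text) - last_comma - 1
        decimal_sep = "," if text.count(",") == 1 and digits_after in (1, 2) else ""
    else:
        # Dot only (or no separator): a single dot is treated as the decimal mark.
        decimal_sep = "." if text.count(".") == 1 else ""
    if decimal_sep:
        whole, _, frac = text.rpartition(decimal_sep)
    else:
        whole, frac = text, ""
    whole = whole.replace(".", "").replace(",", "")  # strip thousands separators
    number = Decimal(f"{whole or '0'}.{frac or '0'}")
    return number.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def clean_rows(rows):
    """Dedup by SKU (most complete row wins), normalize prices, flag blank descriptions."""
    def filled(row):
        return sum(1 for value in row.values() if not is_blank(value))

    best = {}
    for row in rows:
        sku = row["sku"]
        if sku not in best or filled(row) > filled(best[sku]):
            best[sku] = row

    cleaned, missing = [], 0
    for row in best.values():
        out = dict(row)
        out["price"] = normalize_price(row["price"])
        if is_blank(row.get("description")):
            out["description"] = "MISSING"
            missing += 1
        cleaned.append(out)
    summary = {"deduplicated": len(rows) - len(cleaned), "missing_flags": missing}
    return cleaned, summary
```

The `summary` dictionary plays the role of the note in cell H1. Ambiguous values such as "1.299" (which could be 1.299 or 1,299) are read with the dot as the decimal mark here, so spot-check a few rows before trusting any automated pass.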

What If the Data Is Not Quite Ready

Duplicate SKUs have different prices and you need to keep the lowest

Deduplicate rows by SKU in column B keeping only the row with the lowest numeric price value in column D; write the number of removed duplicates to cell I1
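For reference, the "keep the lowest price" rule can be sketched in a few lines of Python. This assumes each row is a dictionary with hypothetical `sku` and `price` keys, and that prices are already numeric; comparing raw strings would put "9.99" after "10.00".

```python
from decimal import Decimal


def dedup_lowest_price(rows):
    """Keep one row per SKU, choosing the row with the lowest numeric price."""
    best = {}
    for row in rows:
        sku = row["sku"]
        if sku not in best or row["price"] < best[sku]["price"]:
            best[sku] = row
    kept = list(best.values())
    removed = len(rows) - len(kept)  # the count the prompt writes to cell I1
    return kept, removed
```

When two rows tie on price, this sketch keeps the one that appears first.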

Some prices include tax and some don't, and you need to normalize to pre-tax

For rows in column F where the value indicates tax-inclusive pricing, divide the price in column D by 1.2 and write the result back; add a note in column G indicating the price was adjusted
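The divisor of 1.2 in that prompt only makes sense for a 20% tax rate: a tax-inclusive price equals the pre-tax price times (1 + rate), so dividing by 1.2 undoes a 20% markup. A small Python sketch of the arithmetic, with the rate as an explicit parameter so you can change it; the function name and the note text are illustrative, not SheetXAI output.

```python
from decimal import Decimal, ROUND_HALF_UP

TAX_RATE = Decimal("0.20")  # assumption: the prompt's 1.2 divisor implies a 20% rate


def to_pre_tax(price, tax_inclusive, rate=TAX_RATE):
    """Return (pre-tax price, note). Prices that are already pre-tax pass through unchanged."""
    if not tax_inclusive:
        return price, ""
    net = (price / (1 + rate)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return net, "price adjusted to pre-tax"
```

If your suppliers charge a different rate, change the divisor in the prompt to match (for example, 1.08 for an 8% rate).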

Blank descriptions came from pages that load content via JavaScript

For all rows where column E contains "MISSING", write the source URL from column A into column H labeled "needs manual review" so I can re-run the scrape with a JavaScript-rendering option

Full cleanup in one pass: dedup, normalize prices, flag missing, and summarize

For the 400-row scraped product workbook: deduplicate by SKU in column B keeping the row with the lowest price; normalize all price values in column D to two-decimal numeric format; replace blank description cells in column E with MISSING; write a summary to cell H1 showing rows deduped, prices normalized, and MISSING flags added; sort the cleaned workbook by SKU ascending

The supply chain manager gets clean data. You don't spend the morning doing manual data janitorial work.

Try It

If you have a bulk scrape result sitting in your Excel workbook with duplicates, inconsistent formats, and blank fields, get the 7-day free trial of SheetXAI and clean it in one prompt. For related tasks, see how to crawl supplier category pages into an Excel workbook or apply a consistent schema across a URL batch.
