The Scenario
A research librarian scraped 200 paper titles from three conference proceedings into an Excel column. The goal is to standardize the bibliography before import into the institution's reference manager — which means each title needs a canonical Semantic Scholar paper ID, DOI, publication year, and venue. The person who set up the original scrape left last month. The titles are inconsistent: some are all-caps, some have HTML entities baked in, some are truncated at 80 characters.
The bad version:
- Copy title 1 into the Semantic Scholar search bar, find the closest match among several candidates, verify by checking the year and first author, copy the paper ID and DOI from the URL and metadata panel, switch back to the sheet, paste into columns B through E.
- Title 23 is truncated and returns three plausible matches. You spend 10 minutes verifying which one is correct by cross-referencing another database.
- After an hour you've processed 18 titles and your wrist hurts.
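At 18 titles an hour, two hundred titles take roughly eleven hours of nonstop lookups, and the ambiguous ones take longer. That is not one afternoon's work. Spread across a normal schedule it is closer to a week, and every manual disambiguation judgment is a potential error in your reference database.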
The Easy Way: One Prompt in SheetXAI
SheetXAI is an AI agent that lives inside your Google Sheet. It reads the titles in column A, runs each one through Semantic Scholar's title-match endpoint, and writes the canonical paper ID, DOI, year, and venue into columns B through E.
Here is the prompt for this task:
For each paper title in column A, find the closest Semantic Scholar match and return its paper ID, DOI, year, and venue in columns B through E
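For readers curious what a single title lookup involves, here is a minimal sketch. It is not SheetXAI's implementation. It assumes the public Semantic Scholar Graph API's title-match endpoint (`paper/search/match`) and a response whose `data` list carries `paperId`, `externalIds`, `year`, and `venue`. The helper names `build_match_url` and `to_row` are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint of the public Semantic Scholar Graph API (title match).
MATCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search/match"
# Fields requested for columns B through E.
FIELDS = "paperId,title,externalIds,year,venue"


def build_match_url(title: str) -> str:
    """Return the request URL for one title lookup (no network call here)."""
    return MATCH_URL + "?" + urlencode({"query": title, "fields": FIELDS})


def to_row(response: dict) -> list:
    """Map an assumed response payload to [paper ID, DOI, year, venue]."""
    data = response.get("data") or []
    if not data:
        # No match: leave columns B through E empty.
        return ["", "", "", ""]
    best = data[0]
    doi = (best.get("externalIds") or {}).get("DOI") or ""
    return [
        best.get("paperId") or "",
        doi,
        best.get("year") or "",
        best.get("venue") or "",
    ]
```

Keeping URL construction and response parsing separate means the parsing logic can be checked against a saved response without calling the API.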
What You Get
- Columns B through E filled for each title: Paper ID, DOI, Year, Venue.
- Rows where the title match confidence is low are flagged with a note in column F rather than silently assigned to a wrong paper.
- DOI values arrive in standard format (10.xxxx/...) — ready for direct import into your reference manager.
- Year arrives as a four-digit number, not a string.
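If you want to spot-check the DOI and year columns yourself after export, the sketch below shows one way to do it. The DOI pattern is a common heuristic rather than an official validation rule, and `normalize_doi` and `normalize_year` are hypothetical helper names.

```python
import re
from typing import Optional

# Heuristic: "10.", a 4-9 digit registrant code, "/", then a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
# Resolver prefixes that sometimes come along when a DOI is copied from a URL.
PREFIXES = ("https://doi.org/", "http://doi.org/",
            "https://dx.doi.org/", "http://dx.doi.org/", "doi:")


def normalize_doi(raw: str) -> str:
    """Return the bare DOI (10.xxxx/...), or "" if it does not look like one."""
    doi = raw.strip()
    for prefix in PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi if DOI_PATTERN.match(doi) else ""


def normalize_year(raw) -> Optional[int]:
    """Return the year as a four-digit integer, or None if it is not one."""
    try:
        year = int(str(raw).strip())
    except ValueError:
        return None
    return year if 1000 <= year <= 9999 else None
```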
What If the Data Is Not Quite Ready
Titles contain HTML entities or encoding artifacts from the scrape
Before matching, clean each title in column A by decoding HTML entities and stripping non-ASCII artifacts, then match each cleaned title to Semantic Scholar and write paper ID, DOI, year, and venue into columns B through E
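To make the cleaning step concrete, here is a rough sketch using only the Python standard library. It is an illustration of the idea, not what SheetXAI runs, and `clean_title` is a hypothetical name. Note one deliberate choice: it removes control characters and the Unicode replacement character but keeps legitimate accented letters, since many real paper titles contain them.

```python
import html
import unicodedata


def clean_title(raw: str) -> str:
    """Decode HTML entities, drop control debris, and collapse whitespace."""
    # "&amp;" becomes "&"; NFKC folds compatibility forms such as
    # the non-breaking space into their plain equivalents.
    text = unicodedata.normalize("NFKC", html.unescape(raw))
    kept = []
    for ch in text:
        if ch.isspace():
            kept.append(" ")          # tabs and newlines become plain spaces
        elif unicodedata.category(ch).startswith("C") or ch == "\ufffd":
            continue                  # control/format characters and U+FFFD
        else:
            kept.append(ch)
    # Collapse runs of spaces and trim both ends.
    return " ".join("".join(kept).split())
```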
Some titles are clearly truncated at 80 characters and need a fuzzy match strategy
For each title in column A, attempt an exact Semantic Scholar title match first; if the result confidence is low, flag it in column F as Needs Review rather than writing a match — and for all high-confidence matches, write paper ID, DOI, year, and venue into columns B through E
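One way to reason about truncated titles is to compare the scraped text against only the same-length prefix of the candidate title. The sketch below uses the standard library's `difflib` for the comparison. The 0.9 threshold and the function names are assumptions for illustration, and Semantic Scholar's own match scoring works differently.

```python
from difflib import SequenceMatcher

TRUNCATION_LENGTH = 80   # from the scenario: the scrape cut titles at 80 characters
REVIEW_THRESHOLD = 0.9   # hypothetical cut-off; tune it on your own data


def looks_truncated(title: str) -> bool:
    """A title at or beyond the cut-off length was probably clipped."""
    return len(title) >= TRUNCATION_LENGTH


def title_similarity(scraped: str, candidate: str) -> float:
    """Case-insensitive similarity between 0.0 and 1.0."""
    a = scraped.casefold()
    b = candidate.casefold()
    if looks_truncated(scraped):
        # Only the prefix of the candidate can be expected to match.
        b = b[:len(a)]
    return SequenceMatcher(None, a, b).ratio()


def review_flag(scraped: str, candidate: str) -> str:
    """Return the column F value for this row."""
    if title_similarity(scraped, candidate) >= REVIEW_THRESHOLD:
        return ""
    return "Needs Review"
```

A prefix comparison cannot tell apart several papers that share the same first 80 characters, so a cautious setup would also send every truncated row with more than one candidate to review.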
You need to cross-reference matched DOIs against an existing reference list to find duplicates
After matching each title in column A to a Semantic Scholar record and writing paper ID, DOI, year, and venue into columns B through E, check each DOI in column C against the ExistingRefs sheet column A and mark matches as Duplicate in column F
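The duplicate check itself is a set lookup. The sketch below assumes both DOI columns have been read into plain Python lists and compares them case-insensitively, since DOIs are not case-sensitive. `mark_duplicates` is a hypothetical helper name.

```python
def doi_key(doi: str) -> str:
    """Normalize a DOI for comparison: trim whitespace, ignore case."""
    return doi.strip().casefold()


def mark_duplicates(matched_dois: list, existing_dois: list) -> list:
    """Return one column F value per matched DOI: "Duplicate" or ""."""
    existing = {doi_key(d) for d in existing_dois if d and d.strip()}
    return [
        "Duplicate" if d and doi_key(d) in existing else ""
        for d in matched_dois
    ]
```

Rows with an empty DOI are never marked, so an unmatched title is not mistaken for a duplicate of another unmatched title.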
Clean titles, match, flag low-confidence rows, and check against a master list in one pass
Decode HTML entities in column A titles, run each through Semantic Scholar title-matching, write paper ID, DOI, year, and venue into columns B through E, mark low-confidence matches as Needs Review in column F, and flag any DOI that already appears in the MasterBib sheet as Already Imported in column G
Try It
Get the 7-day free trial of SheetXAI and open any sheet with a column of scraped or inconsistent paper titles. Ask SheetXAI to resolve every title to its canonical Semantic Scholar record — and deliver a clean bibliography ready for import.
See also: Batch Enrich Paper IDs With Metadata and the Semantic Scholar hub overview.
