2026-05-14 · 9 min read
How to Convert PDF to Excel: 4 Reliable Methods
Tables in PDFs are notoriously fiddly. They look like spreadsheets, but they’re really just text positioned in a grid: the file stores no row or column boundaries the way Excel does. A good PDF-to-Excel conversion has to reverse-engineer the grid from the page coordinates of each piece of text. Done well, you get clean data ready to filter and pivot. Done badly, you get a single column of squashed text. Below are four methods, ranked by how reliable they are and when each wins.
First: is it a real PDF or a scan?
Same fundamental question as any PDF task. Open the file and try to select a number with your cursor. If you can highlight a single cell value, the PDF has real text and any of the methods below will work. If the text won’t select, the PDF is an image of a printed page — you need to OCR it before any conversion will produce meaningful cells. Our guide to OCR walks through that step.
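The selection test can also be scripted when you have many files to triage. This is a minimal sketch, assuming the pdfplumber library is installed; the function names and the 20-character threshold are illustrative choices, not a standard.

```python
def page_texts(pdf_path):
    """Return the extractable text of each page (empty string if none)."""
    import pdfplumber  # third-party; assumed installed via `pip install pdfplumber`

    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_text() or "" for page in pdf.pages]


def looks_scanned(texts, min_chars=20):
    """Guess that a PDF is a scan if no page yields at least min_chars of text.

    min_chars is an arbitrary illustrative threshold.
    """
    return all(len(t.strip()) < min_chars for t in texts)


# Usage (hypothetical file name):
#   if looks_scanned(page_texts("statement.pdf")):
#       print("run OCR first")
```

A PDF that mixes scanned and real-text pages will pass this check, so for mixed files it is safer to inspect the per-page list directly.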
For real-text PDFs, there’s a second consideration: are the tables ruled (with visible borders) or unruled (just whitespace between columns)? Ruled tables are easier — converters can use the rule lines as ground truth. Unruled tables require the converter to infer columns from the gaps, which is harder and where most quality differences show up.
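To make the unruled case concrete, here is a deliberately simplified sketch of gap-based column inference. It is not how any particular converter works internally. It assumes you already have the horizontal extent (x0, x1) of every word in the table, and the `min_gap` value is an illustrative guess.

```python
def infer_columns(word_spans, min_gap=5.0):
    """Merge horizontal word extents into column ranges.

    word_spans: iterable of (x0, x1) pairs, one per word, taken from
    every row of the table, in page coordinates.
    min_gap: gaps narrower than this count as ordinary word spacing.
    """
    columns = []
    for x0, x1 in sorted(word_spans):
        if columns and x0 - columns[-1][1] < min_gap:
            # Overlaps or nearly touches the current column: extend it.
            columns[-1][1] = max(columns[-1][1], x1)
        else:
            # A wide empty gap: start a new column.
            columns.append([x0, x1])
    return [tuple(c) for c in columns]


def column_index(x0, x1, columns):
    """Return the index of the column containing the word's midpoint, or None."""
    mid = (x0 + x1) / 2
    for i, (c0, c1) in enumerate(columns):
        if c0 <= mid <= c1:
            return i
    return None
```

A heading that spans several columns bridges the gaps and collapses them into one column under this approach, which is one reason unruled tables produce such varied results across tools.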
Method 1 — Excel’s built-in PDF import (best for one-off jobs)
Microsoft 365 Excel has had native PDF import since 2020 and it’s surprisingly good for most documents. From a blank workbook: Data → Get Data → From File → From PDF. Pick the file. Excel scans it and lists every table it can identify, page by page. Preview each, pick the ones you want, click Load.
Strengths: handles ruled tables almost perfectly, preserves the original column structure, and loads data via Power Query so you can re-run the import when the source PDF changes (think monthly statements). Weaknesses: struggles with merged cells, multi-line cell content, and unruled tables. For an annual report or a bank statement, this is usually the fastest workflow.
Available on Mac (Microsoft 365), Windows, and the web app since late 2022. LibreOffice Calc’s equivalent is more limited; Google Sheets has no native PDF import as of writing.
Method 2 — Convert via Word, then paste to Excel
A trick that beats most direct PDF-to-Excel converters for messy layouts: convert the PDF to Word first (preserves table structure as Word tables), then copy the table from Word and paste-special into Excel. Word’s table model maps cleanly onto Excel cells, and Word’s PDF importer is mature.
Use a converter that runs on your device — like our free PDF to Word converter — for sensitive financial PDFs you’d rather not upload. For the workflow:
- Convert the PDF to DOCX in your browser.
- Open the DOCX in Word or LibreOffice Writer.
- Click anywhere in the table, click the move handle in the top-left corner to select the whole table, copy.
- In Excel, click your target cell, paste-special → HTML or paste → match destination formatting.
For more on the conversion step itself, our PDF to Word guide compares three approaches.
Method 3 — Tabula (free, offline, open source)
Tabula is a desktop app (Mac, Windows, Linux) built specifically for extracting tables from PDFs. It’s the tool that powers a lot of investigative journalism — ProPublica, the Wall Street Journal, and others use it for parsing scraped government PDFs.
The workflow is interactive: load the PDF, draw a box around each table, hit Preview & Export. Tabula offers two extraction algorithms: “Lattice” for ruled tables (uses border lines) and “Stream” for unruled ones (uses whitespace clustering). It’s often worth running both and picking the one with cleaner output. Export as CSV, which Excel opens directly.
Where Tabula wins: stubborn unruled tables that Excel’s importer mangles, multi-page tables with consistent headers, and anything where you need to script the extraction (Tabula’s underlying engine, tabula-java, is scriptable from Python through the tabula-py wrapper).
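If you want to script the "run both and compare" step, the following sketch uses tabula-py's `read_pdf` with its `lattice` and `stream` options. It assumes tabula-py and a Java runtime are installed. The scoring rule (fewest empty cells) and the helper names are illustrative assumptions, not something Tabula provides.

```python
def empty_fraction(rows):
    """Fraction of cells that are None, blank, or NaN in a list of row lists."""
    cells = [c for row in rows for c in row]
    if not cells:
        return 1.0  # no cells at all counts as the worst possible result
    empty = sum(
        1 for c in cells
        if c is None or (isinstance(c, float) and c != c) or str(c).strip() == ""
    )
    return empty / len(cells)


def extract_both(pdf_path, pages="all"):
    """Run tabula-py in lattice and stream mode; return {mode: list of tables}."""
    import tabula  # third-party (tabula-py); needs a Java runtime

    results = {}
    for mode in ("lattice", "stream"):
        frames = tabula.read_pdf(pdf_path, pages=pages, **{mode: True})
        # Keep the inferred header row together with the data rows.
        results[mode] = [[df.columns.tolist()] + df.values.tolist() for df in frames]
    return results


def pick_mode(results):
    """Pick the mode whose tables have the fewest empty cells (a crude heuristic)."""
    def score(tables):
        return empty_fraction([row for table in tables for row in table])
    return min(results, key=lambda mode: score(results[mode]))
```

Treat the chosen mode as a starting point and still eyeball the output: a table with few empty cells can be wrong in other ways, such as two columns fused into one.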
Method 4 — Python (Camelot, pdfplumber, tabula-py)
For scripted, repeatable extraction — you have 200 PDFs of the same format, or you want to wire this into a pipeline — go straight to Python. Three libraries dominate:
- Camelot — high quality on ruled tables, two extraction modes (lattice/stream), best when you control the source format.
- pdfplumber — lower-level, gives you character-by-character coordinates. Better for unusual layouts where you need to write custom logic.
- tabula-py — Python wrapper around the same Java engine Tabula uses. Solid all-rounder.
A typical Camelot extraction:
```python
import camelot
tables = camelot.read_pdf("statement.pdf", pages="all", flavor="lattice")
for i, t in enumerate(tables):
    t.df.to_excel(f"table_{i}.xlsx", index=False)
```

Three lines of meaningful code and you have one Excel file per table. For 200 PDFs in a directory, wrap it in a loop. Most production data pipelines that ingest PDFs use one of these libraries.
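The directory loop could look something like the sketch below. It assumes Camelot, pandas, and openpyxl are installed. The folder layout, the helper names, and the choice to write one workbook per PDF with one sheet per table are all illustrative.

```python
from pathlib import Path


def output_path(pdf_path, out_dir):
    """Map e.g. statements/jan.pdf to <out_dir>/jan.xlsx."""
    return Path(out_dir) / (Path(pdf_path).stem + ".xlsx")


def convert_folder(in_dir, out_dir, flavor="lattice"):
    """Write one workbook per PDF, one sheet per detected table.

    Returns a list of files that failed or contained no tables.
    """
    import camelot  # third-party; assumed installed
    import pandas as pd  # writing .xlsx also needs openpyxl

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    problems = []
    for pdf in sorted(Path(in_dir).glob("*.pdf")):
        try:
            tables = camelot.read_pdf(str(pdf), pages="all", flavor=flavor)
            if len(tables) == 0:
                problems.append(f"{pdf.name}: no tables found")
                continue
            with pd.ExcelWriter(output_path(pdf, out_dir)) as writer:
                for i, t in enumerate(tables, start=1):
                    t.df.to_excel(writer, sheet_name=f"table_{i}",
                                  index=False, header=False)
        except Exception as exc:  # one bad file should not stop the batch
            problems.append(f"{pdf.name}: {exc}")
    return problems
```

Collecting the failures instead of raising means a single malformed PDF does not abort the run, and you end up with a short list to check by hand.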
Method comparison
| Method | Best for | Setup | Quality |
|---|---|---|---|
| Excel PDF import | One-off, ruled tables | None | Good |
| Word round-trip | Mixed text + tables | None | Very good |
| Tabula | Stubborn unruled tables | App install | Excellent |
| Python (Camelot) | Many files, automation | pip install | Excellent |
Worked example: a 30-page financial statement
Annual report from a public company, 30 pages, six tables of interest scattered across them. Tables are ruled, multi-page, with repeating headers. Goal: get all six into one Excel workbook with each table on its own sheet.
- Open Excel → Data → Get Data → From File → From PDF. Select file.
- In the navigator, browse the auto-detected tables. Tick the six you want.
- Click Transform Data instead of Load. Power Query opens.
- For each query, fix headers and remove the header rows repeated at page breaks (filter the first column and untick the header text, or use Home → Remove Rows).
- Close & Load. By default each query lands on its own worksheet. Save.
Total time: about ten minutes. When next year’s report arrives, save it over the old file (or update the query’s source path) and click Refresh; Power Query replays the recorded steps.
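If you extract the same report with one of the Python libraries instead, the repeated-header cleanup can be sketched in a few lines. This assumes each page's table is already a list of rows of strings and that every page repeats the same header as its first row. The function name is hypothetical.

```python
def stitch_pages(page_tables):
    """Concatenate per-page row lists, keeping the header row only once.

    page_tables: list of tables, one per page; each table is a list of
    rows, and each row is a list of strings.
    """
    page_tables = [t for t in page_tables if t]  # skip pages with no rows
    if not page_tables:
        return []
    header = [c.strip() for c in page_tables[0][0]]
    stitched = [header]
    for rows in page_tables:
        for row in rows:
            if [c.strip() for c in row] == header:
                continue  # repeated header from a page break
            stitched.append(row)
    return stitched
```

Comparing after stripping whitespace catches headers that differ only by stray spaces, which is a common extraction artifact.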
Scanned PDFs: OCR first
If the source PDF is a scan, none of the methods above will produce numeric cells. You need to run OCR first. Tesseract (free, offline) is the open-source standard; ABBYY FineReader is the paid gold standard for tabular accuracy. Once OCR has produced a text-layered PDF, the methods above work normally.
One important caveat: OCR confuses similar characters in numeric tables — 0 vs O, 1 vs l vs I, 5 vs S, 8 vs B. Always check totals after OCR-extracting financial data. If a column doesn’t sum to the printed total, you have OCR errors to chase.
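The totals check is easy to automate. The sketch below assumes US-style number formatting (comma thousands separator, dot decimal) and takes the column's cells and the printed total as strings. The function name and return shape are illustrative.

```python
from decimal import Decimal, InvalidOperation


def check_column(cells, printed_total):
    """Return (unparseable_cells, difference) for one OCR-extracted column.

    difference is the sum of the parseable cells minus the printed total,
    or None if the printed total itself cannot be parsed.
    """
    def parse(text):
        try:
            return Decimal(text.replace(",", "").strip())
        except InvalidOperation:
            return None  # e.g. a letter O where a zero should be

    parsed = [parse(c) for c in cells]
    bad = [c for c, value in zip(cells, parsed) if value is None]
    total = sum((v for v in parsed if v is not None), Decimal("0"))
    expected = parse(printed_total)
    return bad, (None if expected is None else total - expected)
```

An empty `bad` list with a zero difference is a good sign but not proof: two OCR errors can cancel out, so spot-check a few cells against the source as well.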
Common quality issues
- Merged cells. Most converters either duplicate the value into every covered cell or drop it from all but one. Fix manually after import.
- Multi-line cell content. Where a single logical cell wraps across multiple lines, converters often split it into separate rows. Easiest fix: a quick Power Query step to fill down the empty key columns.
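- Currency and date formatting. “1,234.56” may import as text rather than number. Changing the cell format alone does not convert values already stored as text; use Data → Text to Columns, the VALUE() function, or change the column type in Power Query.
- Footnote markers. Superscript (a), (b), etc. in financial tables get inlined into the cell value, breaking numeric parsing. Strip them before converting: Find & Replace with the wildcard pattern (?) in Excel, or a regex replace in a script.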
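For scripted cleanup, the last two fixes can be combined into one parsing step. This is a sketch under stated assumptions: US-style numbers, footnote markers that appear as a single lowercase letter in parentheses at the end of the cell, and accounting-style negatives written as (500.00). Adjust the patterns to match your documents.

```python
import re
from decimal import Decimal, InvalidOperation

# Trailing footnote marker such as "(a)" or "(b)".
FOOTNOTE = re.compile(r"\s*\([a-z]\)\s*$")


def to_number(cell):
    """Convert text like '1,234.56 (a)' or '(500.00)' to a Decimal, else None."""
    text = FOOTNOTE.sub("", cell.strip())
    # Treat a fully parenthesised value as negative (accounting convention).
    negative = text.startswith("(") and text.endswith(")")
    text = re.sub(r"[$€£,()\s]", "", text)
    try:
        value = Decimal(text)
    except InvalidOperation:
        return None  # leave non-numeric cells for manual review
    return -value if negative else value
```

Returning `None` instead of guessing keeps labels and damaged cells visible, so they can be reviewed rather than silently turned into zeros.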
Privacy considerations
PDFs you’re converting to Excel are often financial — bank statements, payroll, broker confirms. The same advice as for any sensitive PDF applies: prefer methods that run on your device (Excel import, the Word round-trip with an in-browser converter, Tabula, Python) over cloud-hosted “PDF to Excel” services. Several free hosted services in this category have been documented retaining uploaded files, indexing them, or showing them to other users in race conditions.
If your data is mostly text with embedded tables — a research report, a contract appendix — going through Word first usually beats direct PDF-to-Excel tools. Start with our free in-browser PDF to Word converter, edit in Word, then copy each table into Excel. For a wider tour of the conversion landscape, see our PDF to Word methods comparison.