Convert a Scanned PDF to Excel: Japanese Tables Into CSV, With No Garbled Text

A scanned PDF is just an image — you can't copy it, Japanese turns into garbled characters when you paste it into Excel, and every line item collapses into a single cell. Drop the PDF into the space-ocr app and each page is rasterized automatically; define your columns once and you get one row per page, exported as a UTF-8 BOM CSV. It opens cleanly in Excel with no garbled text, and imports straight into freee, Money Forward, and Yayoi. Failed reads aren't charged.

8 min read · 2026-06-25

Open a scanned invoice or receipt PDF and the first thing you notice is this — you can't copy it. Try to select the text and nothing drags; you can't grab even a single number. The reason is simple: a scanned PDF isn't table data, it's a picture of a table. Your eyes see rows, columns, and totals lined up, but to a computer it's nothing but pixels.

So most people end up doing the same thing. Open Excel, look at the PDF, and retype it one row at a time. Or force a copy-paste and watch the result — Japanese turns into garbled characters like 譁�蟄怜喧, every line item collapses into a single cell, and the layout falls apart completely. If you searched "convert scanned PDF to Excel," "PDF table to CSV," or "fix garbled text in Excel" and landed here, that's exactly the pain — when all you really want is clean, column-aligned data you can import into your accounting software.

The short version: no retyping, no copy-paste

All you do is drop the PDF straight into the space-ocr app. The app rasterizes each page automatically and reads the values on the page as named fields. Decide your columns once (vendor, date, total, line items…) and each page stacks up as one row in a sheet, which you can finally export as a CSV.

That CSV is the whole point: it's written with a UTF-8 BOM, so Japanese opens without garbling even when you just double-click it in Excel. Line items don't collapse into one cell — they expand into their own proper rows and columns. The engine detects Japanese automatically, so there's no language to specify.

Try it in 10 seconds, no upload required

The sample below is interactive with nothing to upload. It's a real result from parsing an actual scanned receipt. Hover over any field and a box lights up showing where on the image that value was read from. Each value also carries a match ratio — what fraction of its characters were actually found on the page — so you can spot-check right there before exporting.

Interactive sample: source receipts with extracted-field bounding boxes. Verified fields: KINSHO · 合計 2,045; ライフ · 合計 4,286.

Every value carries a verified on-page location — bbox + 4-point vertices + match_ratio — on a 0–1000 normalized grid (0,0 top-left → 1000,1000 bottom-right), the same shape the live API returns. Hover a field to trace it back to the pixels it came from.
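If you want to draw one of those boxes on your own copy of the image, the normalized grid has to be scaled back to pixels. Here is a minimal sketch of that conversion. It assumes the box arrives as a dict with xmin / ymin / xmax / ymax keys (the key layout is an assumption, and `to_pixels` and `PixelBox` are hypothetical names); check the API docs for the real response shape.

```python
from typing import NamedTuple


class PixelBox(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int


def to_pixels(bbox: dict, image_width: int, image_height: int) -> PixelBox:
    """Scale a 0-1000 normalized box to pixel coordinates for one image."""
    return PixelBox(
        left=round(bbox["xmin"] * image_width / 1000),
        top=round(bbox["ymin"] * image_height / 1000),
        right=round(bbox["xmax"] * image_width / 1000),
        bottom=round(bbox["ymax"] * image_height / 1000),
    )


# Hypothetical box for a "total" field on a 1200 x 1600 px scan.
box = to_pixels({"xmin": 620, "ymin": 845, "xmax": 910, "ymax": 880}, 1200, 1600)
print(box)  # PixelBox(left=744, top=1352, right=1092, bottom=1408)
```

Because the grid is relative, the same box works for any rendering of the page: pass the width and height of whichever image you are drawing on.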

Why a scan won't go straight into Excel

The moment you scan paper, the file is a raster image, same as a photo. There's no embedded text, so Excel's "Get Data" and copy-paste can't pick it up as a table. That's why it takes two steps:

1. Read the image and structure it (each page → a row of named fields)
2. Write those rows out to CSV (in a format Excel opens cleanly)

In the space-ocr app, when you throw in a PDF it rasterizes each page to PNG and then runs OCR. So there's no manual step on your end to slice pages into images — you just drop the PDF. The hard part is step 1, the reading, and getting that right is the shortest path to wiping out garbled text and collapsed line items for good.

From source image to extraction sheet: a scan becomes "columns"

The values from the document you dropped come out not as a wall of text but as named columns. You can let the engine propose a schema, pick a built-in template (invoice, receipt, purchase order, delivery note, business card, and more), or define the columns yourself. Watch a single scan turn into a labeled row.

Demo: Drop a document and the values land in named fields. These become the "rows" that line up in Excel.

For documents with repeating rows — line items on an invoice, products on a receipt — you declare child columns under an array-type field. Then each row on the page becomes its own separate row, in a shape you can total in Excel. That's the answer to the "line items collapse into one cell" problem. There's more on building line items in Read invoices and delivery notes into Excel.

POST /ocr/fields → request body (declaring columns = your schema)

{
  "image": "https://example.com/scan-invoice-p01.png",
  "imageType": "url",
  "fields": [
    { "name": "vendor",       "type": "string" },
    { "name": "invoice_date", "type": "string" },
    { "name": "total",        "type": "string" },
    {
      "name": "line_items", "type": "array",
      "children": [
        { "name": "description", "type": "string" },
        { "name": "unit_price",  "type": "string" },
        { "name": "qty",         "type": "string" }
      ]
    }
  ]
}
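The same request body can be assembled in code, which helps once the schema grows. The sketch below builds the request with Python's standard library but does not send it. Two things are assumptions, not taken from the docs: that /ocr/fields lives on the same api.space-ocr.com host as the upload example, and that it accepts the same Bearer header. The `field` and `build_fields_request` helpers are hypothetical names.

```python
import json
import os
import urllib.request

# Assumption: /ocr/fields is served from the same host as the upload example.
API_BASE = "https://api.space-ocr.com"


def field(name, type_="string", children=None):
    """Describe one column; array columns carry child columns."""
    spec = {"name": name, "type": type_}
    if children:
        spec["children"] = children
    return spec


def build_fields_request(image_url, fields, api_key):
    """Build (but do not send) a POST /ocr/fields request."""
    body = {"image": image_url, "imageType": "url", "fields": fields}
    return urllib.request.Request(
        f"{API_BASE}/ocr/fields",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


schema = [
    field("vendor"),
    field("invoice_date"),
    field("total"),
    field("line_items", "array", children=[
        field("description"), field("unit_price"), field("qty"),
    ]),
]

request = build_fields_request(
    "https://example.com/scan-invoice-p01.png",
    schema,
    os.environ.get("SPACE_OCR_API_KEY", "test-key"),
)
# To actually call the API: urllib.request.urlopen(request)
```

The response shape isn't shown here on purpose; take it from the API docs rather than from this sketch.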

✓ Verified

The coordinates aren't invented by the AI. The language model returns only the value text and a hint about which word tokens it used, never the box itself. The engine first matches each value's characters one by one against the symbols the vision OCR actually detected on the page (character-level matching is the primary path) and places the box at the real pixel location. It then attaches a match ratio to every value: the fraction of its characters that were actually found on the page. When the model does return token hints, they can override the position of some fields, but in repeating rows the hints can be mismatched, so they aren't taken at face value; they're validated and corrected with column clustering and row-consistency checks. The point isn't that the AI never makes mistakes. It's that every value is matched against the page and a score of how well it matched is kept. Coordinates come back as xmin / ymin / xmax / ymax, 0–1000 normalized (not pixels).
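To make "fraction of characters found on the page" concrete, here is a toy illustration of a match ratio. It is not the engine's algorithm: it ignores position, boxes, column clustering, and row checks, and only counts how many of a value's characters exist among the symbols detected on the page. The `match_ratio` function and the sample symbols are hypothetical.

```python
from collections import Counter


def match_ratio(value: str, page_symbols: list[str]) -> float:
    """Fraction of the value's non-space characters present among OCR symbols."""
    wanted = [ch for ch in value if not ch.isspace()]
    if not wanted:
        return 0.0
    available = Counter(page_symbols)
    found = 0
    for ch in wanted:
        if available[ch] > 0:
            available[ch] -= 1  # each detected symbol can be matched only once
            found += 1
    return found / len(wanted)


# Hypothetical symbols detected on one line of a receipt.
page = list("合計¥2,045")
print(match_ratio("2,045", page))  # 1.0: every character is on the page
print(match_ratio("2,845", page))  # 0.8: the "8" was never detected
```

A misread digit drags the ratio down even though the value still looks plausible, which is exactly the kind of error a score like this is meant to surface.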

Spot-check before you import

Before anything goes into Excel or your accounting software, you can verify the read on the spot. Hover over a value and the matching spot lights up on the original scan, so your eye jumps straight there without re-reading the whole document. A match ratio of 1.0 means every character was found on the page; anything below 0.85 is a sign to take a second look just in case.
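If you pull results through the API rather than the app, the same rule of thumb is easy to automate. The sketch below lists the fields that fall under the 0.85 line. The field dicts are hypothetical sample data, and the `match_ratio` key name is assumed from the description above.

```python
REVIEW_THRESHOLD = 0.85  # from the article: below this, take a second look


def needs_review(fields: list[dict], threshold: float = REVIEW_THRESHOLD) -> list[str]:
    """Return the names of fields whose match ratio is below the threshold."""
    return [f["name"] for f in fields if f["match_ratio"] < threshold]


# Hypothetical extracted fields for one page.
extracted = [
    {"name": "vendor", "value": "ライフ", "match_ratio": 1.0},
    {"name": "total", "value": "4,286", "match_ratio": 0.8},
    {"name": "invoice_date", "value": "2026-06-25", "match_ratio": 0.9},
]
print(needs_review(extracted))  # ['total']
```

Treat the threshold as a starting point. For amounts that feed accounting entries you may prefer to review anything below 1.0.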

Demo: Hover a field to check it against the original scan and catch a bad read before it lands in the sheet.

Export a CSV that opens in Excel

Once the values look right, export the sheet as CSV. The header row carries the column names, array fields expand into columnName.childName, and the repeating line items open out into sub-rows. The file is written with a UTF-8 BOM — and that's the single thing that lets Excel open Japanese, Chinese, and Korean correctly on a plain double-click. Any cells you fixed by hand override the original OCR value on export.
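If you are assembling the CSV yourself from API results, the two ideas above (dotted child columns and the BOM) can be sketched with the standard csv module. This is an illustration, not the app's exporter: the exact sub-row layout is an assumption (here, parent values appear only on the first sub-row), and the record is made-up sample data.

```python
import csv
import io


def flatten(record: dict, array_field: str) -> list[dict]:
    """Expand one page's record into one CSV row per line item."""
    parent = {k: v for k, v in record.items() if k != array_field}
    items = record.get(array_field) or [{}]
    rows = []
    for i, item in enumerate(items):
        # Assumption: parent values are written on the first sub-row only.
        row = dict(parent) if i == 0 else {k: "" for k in parent}
        row.update({f"{array_field}.{k}": v for k, v in item.items()})
        rows.append(row)
    return rows


def to_csv_bytes(rows: list[dict]) -> bytes:
    """Serialize rows to CSV bytes that start with a UTF-8 BOM."""
    header = list(dict.fromkeys(k for row in rows for k in row))
    text = io.StringIO(newline="")
    writer = csv.DictWriter(text, fieldnames=header, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return text.getvalue().encode("utf-8-sig")  # utf-8-sig prepends the BOM


# Hypothetical record for one receipt page.
record = {
    "vendor": "サンプル商店",
    "total": "554",
    "line_items": [
        {"description": "牛乳", "unit_price": "198", "qty": "2"},
        {"description": "食パン", "unit_price": "158", "qty": "1"},
    ],
}
data = to_csv_bytes(flatten(record, "line_items"))
print(data[:3])  # b'\xef\xbb\xbf'
```

Writing `data` to a file in binary mode gives a CSV whose header reads vendor, total, line_items.description, line_items.unit_price, line_items.qty.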

Demo: Export a UTF-8 BOM CSV in one click. Double-click it and Excel opens it as clean, column-aligned rows.

Opening it in Excel is simple — just double-click the .csv. Thanks to the BOM, Excel auto-detects it as UTF-8, with no text import wizard and no garbled characters. If you need a native workbook, just Save As → .xlsx from there.
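When a CSV from some other tool does garble in Excel, checking for the BOM is a quick first diagnostic. The sketch below tests whether a file starts with the three BOM bytes; `has_utf8_bom` is a hypothetical helper and the two temporary files exist only to show both outcomes.

```python
import codecs
import tempfile
from pathlib import Path


def has_utf8_bom(path: Path) -> bool:
    """True if the file starts with the three-byte UTF-8 BOM (EF BB BF)."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


with tempfile.TemporaryDirectory() as tmp:
    with_bom = Path(tmp) / "with_bom.csv"
    without_bom = Path(tmp) / "without_bom.csv"
    with_bom.write_text("vendor,total\nライフ,4286\n", encoding="utf-8-sig")
    without_bom.write_text("vendor,total\nライフ,4286\n", encoding="utf-8")
    results = (has_utf8_bom(with_bom), has_utf8_bom(without_bom))

print(results)  # (True, False)
```

If you later read the file back in Python, open it with `encoding="utf-8-sig"` so the BOM isn't glued onto the first column name.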

Taking it into your accounting software works the same way: the output is still a CSV. With freee, Money Forward, and Yayoi, you load it through each app's own CSV import feature (this is importing a CSV file, not automatic syncing through an official API). The receipts-only path to CSV is covered separately in Turn receipts into CSV.

Process in bulk: run a stack of scans through the API

When you want to process a whole folder of scans, create a sheet with your column schema once and upload the page images into it. Each image is read against that schema, appended as a row, and finally exported as a single CSV. The full request/response shapes are in the API docs.

Upload scanned page images to a sheet

curl -X POST https://api.space-ocr.com/upload \
  -H "Authorization: Bearer $SPACE_OCR_API_KEY" \
  -F "path=/請求書 2026" \
  -F "files=@scan-p01.png" \
  -F "files=@scan-p02.png" \
  -F "wait=true"
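For a whole folder, typing one -F line per page gets old quickly. Here is a sketch that builds the same curl command for every PNG in a directory. It reuses only the flags shown above; whether the endpoint limits the number of files per request is not stated here, so treat one-call-per-folder as an assumption. `build_upload_command` and `upload_folder` are hypothetical helper names.

```python
import os
import subprocess
from pathlib import Path


def build_upload_command(folder: Path, sheet_path: str) -> list[str]:
    """Build the curl argument list for every PNG in a folder, sorted by name."""
    cmd = [
        "curl", "-X", "POST", "https://api.space-ocr.com/upload",
        "-H", f"Authorization: Bearer {os.environ.get('SPACE_OCR_API_KEY', '')}",
        "-F", f"path={sheet_path}",
    ]
    for png in sorted(folder.glob("*.png")):
        cmd += ["-F", f"files=@{png}"]
    cmd += ["-F", "wait=true"]
    return cmd


def upload_folder(folder: Path, sheet_path: str) -> str:
    """Run the upload and return the raw response body as text."""
    result = subprocess.run(
        build_upload_command(folder, sheet_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

A call would look like `upload_folder(Path("scans"), "/請求書 2026")`. Passing the arguments as a list avoids shell quoting problems with the space and the Japanese characters in the sheet path. Note that the API key ends up on the command line, where other local users may be able to see it in the process list.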

Why it matters

Pricing is straightforward. Pay as you go at ¥10 per page, a 100-page free tier every month (no credit card), and a flat Pro plan at $39/month for teams. And failed reads aren't charged — if no result comes out, you don't pay. Because the cost is easy to predict, you can try it on your own scans first and then decide.
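For a rough monthly estimate on pay as you go, the arithmetic can be sketched as below. It assumes the 100 free pages are deducted from your usage before the ¥10 rate applies and that failed reads never count; both are readings of the sentence above, not confirmed billing rules, and the function name is hypothetical.

```python
def estimate_monthly_cost_yen(pages_submitted: int, failed_reads: int = 0,
                              free_pages: int = 100, yen_per_page: int = 10) -> int:
    """Estimate a pay-as-you-go bill in yen under the stated assumptions."""
    billable = max(0, pages_submitted - failed_reads - free_pages)
    return billable * yen_per_page


print(estimate_monthly_cost_yen(350, failed_reads=10))  # 2400
print(estimate_monthly_cost_yen(80))                    # 0
```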

How to convert a scanned PDF to Excel

1. Drop the PDF into the app
   Drop your scanned PDF into the space-ocr app. The app rasterizes each page to PNG automatically and then runs OCR, so you don't have to slice pages into images yourself.
2. Decide your columns (the schema)
   Set columns like vendor, date, and total using a built-in template, your own field definitions, or an automatic suggestion. For repeating line items such as invoice products, declare child columns under an array type so each row becomes its own separate row.
3. Spot-check the values
   Hover a field and the matching spot lights up on the original scan. A match ratio of 1.0 means every character was found on the page; below 0.85 is a sign of a value to check or fix.
4. Export the CSV
   Export the sheet as CSV. It's written with a UTF-8 BOM, array line items expand into sub-rows, and any manual fixes override the original OCR value.
5. Open in Excel or your accounting software
   Double-click the CSV and Excel recognizes the BOM as UTF-8, opening it as clean, column-aligned rows with no garbled Japanese. If you need a native format, use Save As to get an .xlsx. For freee, Money Forward, and Yayoi, load it through each app's CSV import.

Does it handle Japanese scans too?

Yes. The engine detects the language automatically, so there's nothing to specify. It handles Japanese, Korean, Chinese, and English with one engine, and normalizes full-width/half-width characters, hyphen variants, brackets, CJK whitespace, vertical kanji, and mixed scripts. Skew and rotation (EXIF) from phone photos are corrected as well.
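To see what width normalization means in practice, Unicode's NFKC form is a handy stand-in. This is not the engine's own rule set (which isn't documented here, and NFKC does not cover every hyphen variant), only a standard-library illustration of folding full-width and half-width forms together.

```python
import unicodedata


def normalize_width(text: str) -> str:
    """Fold full-width ASCII and half-width katakana to their canonical forms."""
    return unicodedata.normalize("NFKC", text)


print(normalize_width("２，０４５"))  # 2,045  (full-width digits and comma)
print(normalize_width("ﾗｲﾌ"))         # ライフ  (half-width katakana)
```

Folding like this is what lets "２，０４５" on the page and "2,045" in a value count as the same characters when they are compared.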

Does it support PDFs directly, or do I need to convert them to images?

The app supports PDFs. When you drop a PDF, each page is rasterized to PNG automatically and then OCR'd, so you don't have to slice pages into images by hand. Note that if you send to the public API directly, the input must be a raster image (JPEG, PNG, GIF, BMP, TIFF, WebP).

Will the exported CSV open in Excel without garbling the Japanese?

Yes. The CSV is written with a UTF-8 BOM. That's the key that lets Excel auto-detect the encoding, so Japanese, Chinese, and Korean land in the right columns on a simple double-click. You don't have to go through the text import wizard.

Can I import it into freee, Money Forward, or Yayoi?

The output is a CSV that opens in Excel, so you load it through each accounting app's CSV import feature. This is importing a CSV file, not automatic syncing through an official API. You can clean up the column names and the line-item expansion on the sheet side before exporting.

How can I check accuracy? Is my personal data safe?

Every value carries a verified position on the page (a bounding box) and a match ratio. Hover a field and the matching spot lights up on the original scan; a match ratio of 1.0 means every character was found, and below 0.85 is a sign to verify. In other words, you can audit each value against the original before importing. On pricing, failed reads aren't charged.

Turn scanned PDFs into Excel rows

100 pages free every month, no credit card. Drop a PDF, read it into columns, and get a CSV (UTF-8 BOM) that opens straight in Excel. Failed reads aren't charged.

Receipt OCR to CSV: Convert Receipts and Import Into freee, Money Forward & Yayoi

Invoice & Delivery Note OCR API: Extract Invoice Data to CSV (Developer Guide)