Tesseract.js
Pure JavaScript OCR for more than 100 Languages
Tesseract.js is a JavaScript library that gets words in almost any language out of images. (Demo)
Image Recognition
Video Real-time Recognition
Tesseract.js works in the browser, using webpack or plain script tags with a CDN, and on the server with Node.js.
After you install it, using it is as simple as:
```js
import Tesseract from 'tesseract.js';

// Recognize English text in a remote image and log progress messages.
Tesseract.recognize(
  'https://tesseract.projectnaptha.com/img/eng_bw.png',
  'eng',
  { logger: m => console.log(m) }
).then(({ data: { text } }) => {
  // `text` holds the recognized text as a single string.
  console.log(text);
});
```
Or, in a more imperative style:
```js
import { createWorker } from 'tesseract.js';

(async () => {
  // createWorker is async in v4, so it is awaited inside the async function.
  const worker = await createWorker({
    logger: m => console.log(m)
  });
  // Load and initialize the English language data before recognizing.
  await worker.loadLanguage('eng');
  await worker.initialize('eng');
  const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
  console.log(text);
  // Release the worker once it is no longer needed.
  await worker.terminate();
})();
```
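The `logger` callback in both snippets above receives progress messages as objects. As a minimal sketch, assuming each message carries a `status` string and a `progress` number between 0 and 1, a logger could print one compact line per message instead of the raw object. The helper name `formatProgress` is hypothetical and not part of Tesseract.js.

```js
// Hypothetical helper, not part of the Tesseract.js API.
// Assumes `m` looks like { status: 'recognizing text', progress: 0.5 }.
function formatProgress(m) {
  const status = typeof m.status === 'string' ? m.status : 'unknown';
  // Treat a missing or non-numeric progress as 0, then clamp to [0, 1].
  const raw = typeof m.progress === 'number' && !Number.isNaN(m.progress) ? m.progress : 0;
  const percent = Math.round(Math.min(1, Math.max(0, raw)) * 100);
  return `${status}: ${percent}%`;
}

// Drop-in replacement for `m => console.log(m)` in the snippets above.
const logger = m => console.log(formatProgress(m));
```

Passing `{ logger }` where the snippets pass `{ logger: m => console.log(m) }` should then log lines such as `recognizing text: 50%`.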
Check out the docs for a full explanation of the API.
Major changes in v4
Version 4 includes many new features and bug fixes; see this issue for a full list. Several highlights are listed below.
- Added rotation preprocessing options (including auto-rotate) for significantly better accuracy
- Processed images (rotated, grayscale, binary) can now be retrieved
- Improved support for parallel processing (schedulers)
- Breaking changes:
  - createWorker is now async
  - getPDF function replaced by pdf recognize option
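The scheduler support and the now-async `createWorker` from the list above can be combined roughly as follows. This is a sketch under stated assumptions, not the library's reference usage: the function name `recognizeAll` and its parameters are hypothetical, and the Tesseract.js module is passed in as an argument rather than imported so that the function can also be run against a stub. It relies on `createScheduler`, `scheduler.addWorker`, `scheduler.addJob('recognize', ...)` and `scheduler.terminate` behaving as described in the library's docs.

```js
// Hypothetical helper: recognize several images in parallel with a scheduler.
// `tesseract` is the module object, e.g. the result of require('tesseract.js').
async function recognizeAll(tesseract, images, options = {}) {
  const { workerCount = 2, lang = 'eng', recognizeOptions = {} } = options;
  const { createScheduler, createWorker } = tesseract;
  const scheduler = createScheduler();
  try {
    for (let i = 0; i < workerCount; i++) {
      // createWorker returns a promise in v4, so it must be awaited.
      const worker = await createWorker();
      await worker.loadLanguage(lang);
      await worker.initialize(lang);
      scheduler.addWorker(worker);
    }
    // Queue one job per image; the scheduler hands jobs to idle workers.
    const results = await Promise.all(
      images.map(image => scheduler.addJob('recognize', image, recognizeOptions))
    );
    return results.map(({ data: { text } }) => text);
  } finally {
    // Terminates every worker that was added to the scheduler.
    await scheduler.terminate();
  }
}
```

With the real library this could be called as `recognizeAll(require('tesseract.js'), [urlA, urlB])`, where `urlA` and `urlB` are placeholder image URLs. Passing `{ recognizeOptions: { rotateAuto: true } }` is meant to request the auto-rotate preprocessing mentioned above, assuming that is the option name in the installed version.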
Major changes in v3
- Significantly faster performance
  - Runtime reduction of 84% for Browser and 96% for Node.js when recognizing the example images
- Upgrade to Tesseract v5.1.0 (using emscripten 3.1.18)
- Added SIMD-enabled build for supported devices
- Added support:
  - Node.js version 18
- Removed support:
  - ASM.js version, any other old versions of Tesseract.js-core (<3.0.0)
  - Node.js versions 10 and 12
Major changes in v2
- Upgrade to Tesseract v4.1.1 (using emscripten 1.39.10 upstream)
- Support multiple languages at the same time, e.g. eng+chi_tra for English and Traditional Chinese
- Supported image formats: png, jpg, bmp, pbm
- Support WebAssembly (falls back to ASM.js when the browser doesn't support it)
- Support TypeScript
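The multi-language form noted above is a single string of language codes joined with `+`. A minimal sketch of building that string from a list follows; the helper name `joinLanguages` is hypothetical and not part of Tesseract.js.

```js
// Hypothetical helper, not part of the Tesseract.js API.
// Builds a combined language string such as 'eng+chi_tra'.
function joinLanguages(codes) {
  // Trim whitespace, drop empty entries, and remove duplicates while keeping order.
  const unique = [...new Set(codes.map(c => c.trim()).filter(c => c.length > 0))];
  if (unique.length === 0) {
    throw new Error('at least one language code is required');
  }
  return unique.join('+');
}

const langs = joinLanguages(['eng', 'chi_tra']);
console.log(langs); // eng+chi_tra
```

The resulting string can then be used wherever the earlier examples pass `'eng'`, for instance as the language argument to `worker.loadLanguage` and `worker.initialize`.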
Check the support/1.x branch for version 1