OCR Glossary

OCR jargon got you stumped? Our handy glossary explains common terms. Then optimize your document processing easily with our smart OCR tool.

Accuracy

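The measure of how correctly an OCR system recognizes characters, usually reported as the percentage of characters that match the ground truth. Modern neural-network systems can exceed 97% on clean printed text; results on noisy scans or handwriting are typically lower.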

Activation Function

Mathematical functions (ReLU, Sigmoid, Tanh) that introduce non-linearity into neural networks.
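
As a rough illustration, the three functions named above can be written in a few lines of Python. This is only a scalar sketch using the standard library; deep learning frameworks apply the same formulas element-wise to whole tensors.

```python
import math


def relu(x: float) -> float:
    # Passes positive values through unchanged and clamps negatives to zero.
    return max(0.0, x)


def sigmoid(x: float) -> float:
    # Squashes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x: float) -> float:
    # Squashes any real number into the open interval (-1, 1).
    return math.tanh(x)


print(relu(-2.0), relu(3.0))   # 0.0 3.0
print(sigmoid(0.0))            # 0.5
print(tanh(0.0))               # 0.0
```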

Alignment

The process of matching extracted text with its corresponding visual position in the original document.

Anchor Boxes

Predefined bounding boxes of various shapes used in object detection models like YOLO and Faster R-CNN.

ANPR

Automatic Number Plate Recognition: specialized OCR technology for reading vehicle license plates in traffic and security applications.

API

Application Programming Interface: a set of protocols enabling software integration. OCR APIs let developers add text extraction capabilities to their applications.

ASCII

American Standard Code for Information Interchange: a 7-bit character encoding standard for representing text in computers, commonly used in OCR output.

Attention Mechanism

A neural network component allowing models to focus on relevant parts of the input, computed from Query, Key, and Value vectors.

Augmentation

Techniques to artificially expand training datasets through transformations like rotation, scaling, and noise addition.

Autoencoder

A neural network that learns efficient data representations through encoding and decoding, used for compression and feature extraction.

Backpropagation

Algorithm for training neural networks that propagates the loss gradient backward through the layers so each weight can be updated.

Baseline Detection

Identifying the imaginary line on which characters sit, crucial for accurate text line segmentation.

Batch Processing

Automated processing of multiple documents without manual intervention. Modern systems can process thousands of pages per day.

Beam Search

Decoding algorithm that maintains multiple hypotheses during sequence generation for better accuracy.
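
A small sketch may make the idea concrete. The toy "model" below is a hypothetical lookup table (`TOY_MODEL`) standing in for a real recognizer, and `beam_search` is an illustrative name, not a library function. With a beam width of 1 the search is greedy and commits to the locally best first symbol; with a width of 2 it keeps a second hypothesis alive and finds a more probable sequence.

```python
import math

# Hypothetical next-symbol probabilities, keyed by the prefix decoded so far.
TOY_MODEL = {
    "": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"x": 0.9, "y": 0.1},
}


def beam_search(next_probs, steps, beam_width):
    """Return the top `beam_width` (sequence, log-probability) pairs."""
    beams = [("", 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            # Extend every surviving hypothesis by every possible symbol.
            for sym, p in next_probs(seq).items():
                candidates.append((seq + sym, score + math.log(p)))
        # Keep only the best `beam_width` hypotheses for the next step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


greedy = beam_search(lambda s: TOY_MODEL[s], steps=2, beam_width=1)
wider = beam_search(lambda s: TOY_MODEL[s], steps=2, beam_width=2)
print(greedy[0][0])  # ax  (probability 0.6 * 0.55 = 0.33)
print(wider[0][0])   # bx  (probability 0.4 * 0.9  = 0.36)
```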

Benchmark Dataset

Standard datasets such as MNIST, COCO-Text, and ICDAR, used for evaluating OCR model performance.

BERT

Bidirectional Encoder Representations from Transformers: transformer-based model for natural language understanding, used in document AI for semantic comprehension.

BiLSTM

Bidirectional LSTM: RNN variant processing sequences in both forward and backward directions, used in handwriting recognition.

Binarization

Converting images to black and white (binary) format to simplify OCR processing and improve character detection.
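
The simplest form of this is a fixed global threshold, sketched below on a grayscale image stored as nested lists. The function name and the default threshold of 128 are assumptions for illustration; production pipelines usually choose the threshold adaptively, for example with Otsu's method.

```python
def binarize(gray, threshold=128):
    """Map each grayscale pixel (0-255) to pure black (0) or pure white (255)."""
    return [[255 if px >= threshold else 0 for px in row] for row in gray]


page = [
    [12, 200, 130],
    [127, 128, 255],
]
print(binarize(page))  # [[0, 255, 255], [0, 255, 255]]
```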

Bounding Box

Rectangular regions enclosing detected text or objects, defined by (x, y, width, height) coordinates.

Byte Pair Encoding

Tokenization method that builds vocabulary by merging frequent character sequences, used in modern NLP.
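
One merge step can be sketched as follows. The helper names and the tiny corpus are hypothetical; real tokenizers repeat this step thousands of times and weight pairs by word frequency.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the commonest."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]


def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


corpus = [list(w) for w in ["hug", "pug", "hugs", "bun"]]
pair = most_frequent_pair(corpus)
print(pair)                                  # ('u', 'g')
print([merge_pair(w, pair) for w in corpus])
# [['h', 'ug'], ['p', 'ug'], ['h', 'ug', 's'], ['b', 'u', 'n']]
```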

CER

Character Error Rate: metric measuring OCR accuracy = (Substitutions + Deletions + Insertions) / Total Characters in the reference text. Lower is better.
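
The numerator of this formula is the Levenshtein edit distance between the reference text and the OCR output. A minimal sketch, with illustrative function names and made-up sample strings:

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Minimum number of substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Edit distance divided by the number of reference characters."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


print(cer("total", "tota"))  # 0.2 (one deletion out of five characters)
```

Because insertions count as errors, the result can exceed 1.0 when the OCR output is much longer than the reference.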

Character Segmentation

Process of separating individual characters from connected text for recognition.

CLIP

Contrastive Language-Image Pre-training: OpenAI model learning visual-textual relationships through contrastive learning, used in DeepSeek-OCR's encoder.

CNN

Convolutional Neural Network: deep learning architecture using convolutional layers to extract spatial features from images, fundamental to OCR.

Confidence Score

Probability value indicating how certain the OCR system is about a recognition result.

Context Optical Compression

DeepSeek's technique for compressing document text into a small number of visual tokens. Reported to reach about 97% decoding precision at compression ratios below 10×.

Contrastive Learning

Training method using positive/negative pairs to learn discriminative representations.

CRNN

Convolutional Recurrent Neural Network: architecture combining CNN feature extraction with RNN sequence modeling for text recognition.

Cross-Attention

Attention mechanism relating two different sequences, crucial for vision-language model integration.

CTC

Connectionist Temporal Classification: loss function for training sequence models without requiring a pre-computed alignment between input frames and output characters.
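
Alignment is unnecessary because many frame-level paths map to the same text: repeated symbols are merged and a special blank symbol is dropped. The sketch below shows only that collapsing rule, as used in greedy decoding; it does not compute the loss itself, and the `-` blank symbol is an arbitrary choice for illustration.

```python
BLANK = "-"  # assumed blank symbol for this sketch


def ctc_collapse(path: str) -> str:
    """Merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)


print(ctc_collapse("hh-e-ll-lo"))  # hello
```

The blank between the two runs of `l` is what allows a genuine double letter to survive the merge.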

Data Augmentation

Techniques expanding training data through rotation, scaling, noise, perspective transformation.

Deep Learning

Machine learning using multi-layer neural networks to learn hierarchical representations.

DeepEncoder

DeepSeek-OCR's 380M parameter vision encoder combining SAM (local) and CLIP (global) with 16× compression.

Denoising

Removing noise and artifacts from scanned images to improve OCR accuracy.

Deskewing

Correcting rotated or tilted document images to horizontal alignment before OCR processing.

Detectron2

Object detection framework from Meta AI (formerly Facebook AI Research), used for document layout analysis and table detection.

Document AI

AI technologies for understanding, extracting, and processing information from documents.

Document Layout Analysis

Identifying and categorizing document regions (text, tables, images) with geometric and logical analysis.

DPI

Dots per inch: image resolution metric. OCR typically requires 300 DPI or higher for optimal accuracy.
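
To see what a given DPI means in practice, multiply the physical page size in inches by the resolution. The helper below is a hypothetical one-liner for that arithmetic.

```python
def pixels_at_dpi(width_in: float, height_in: float, dpi: int):
    """Pixel dimensions of a scan of the given physical size."""
    return round(width_in * dpi), round(height_in * dpi)


# A US Letter page (8.5 x 11 inches) scanned at 300 DPI.
print(pixels_at_dpi(8.5, 11, 300))  # (2550, 3300)
```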

Dropout

Regularization technique randomly dropping neurons during training to prevent overfitting.

Embedding

Dense vector representations of tokens capturing semantic meaning in continuous space.

Encoder-Decoder

Architecture where encoder processes input and decoder generates output, used in seq2seq tasks.

End-to-End Learning

Training models directly from raw input to final output without manual feature engineering.

F1 Score

Harmonic mean of precision and recall, comprehensive metric for classification performance.
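
A short sketch of the calculation from raw counts. The function name and the example counts are made up for illustration (say, a text detector that found 8 correct regions, 2 spurious ones, and missed 8).

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(p, r, round(f1, 3))  # 0.8 0.5 0.615
```

The harmonic mean stays close to the weaker of the two values, so a model cannot score well by maximizing only precision or only recall.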

Feature Extraction

Process of identifying and extracting relevant patterns from input data for model processing.

Feature Map

Output of convolutional layers showing detected features at different abstraction levels.

Fine-tuning

Adapting pre-trained models to specific tasks by training on domain-specific data.

Font Recognition

Identifying font types and styles in document images to improve OCR accuracy.

Form Recognition

Specialized OCR for structured documents like invoices, receipts, and forms with key-value extraction.

Fusion Model

Architecture merging representations from multiple modalities (vision + text) for unified understanding.

GPU

Graphics Processing Unit: specialized hardware for parallel processing, essential for deep learning model training and inference.

Gradient Descent

Optimization algorithm minimizing loss by iteratively adjusting model parameters.
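
As a minimal sketch, the loop below minimizes the one-parameter loss (w - 3)^2, whose gradient is 2(w - 3). The function name, starting point, learning rate, and step count are illustrative choices, not recommended settings.

```python
def gradient_descent(grad, w0: float, learning_rate: float = 0.1, steps: int = 100) -> float:
    """Repeatedly step against the gradient, scaled by the learning rate."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w


# Loss (w - 3)^2 has its minimum at w = 3.
w = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(round(w, 6))  # 3.0
```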

hOCR

File format embedding OCR output in HTML with layout info, bounding boxes, and confidence scores.

HTR

Handwritten Text Recognition: machine learning technology recognizing handwritten text, a harder problem than OCR of printed text.

Hyperparameter

Configuration settings (learning rate, batch size, layers) set before model training.

ICR

Intelligent Character Recognition: advanced OCR using ML to recognize both typed and handwritten text across various fonts.

Image Preprocessing

Techniques like binarization, deskewing, denoising applied before OCR to improve quality.

Inference

Using trained models to make predictions on new, unseen data.

IoU

Intersection over Union: metric measuring bounding box overlap, calculated as intersection area / union area. Ranges from 0 (no overlap) to 1 (identical boxes).
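
A minimal sketch of the calculation for axis-aligned boxes given as (x, y, width, height), the coordinate form used in the Bounding Box entry. The function name is illustrative.

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping rectangle (zero if boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # 1.0
print(iou((0, 0, 1, 1), (5, 5, 1, 1)))      # 0.0
```

Two 10 x 10 boxes offset horizontally by 5 pixels overlap in a 5 x 10 strip, giving 50 / 150, or one third.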

IWR

Intelligent Word Recognition: AI technology recognizing whole words by matching OCR/ICR character outputs against user-defined dictionaries.

JPEG

Lossy image compression format commonly used for scanned documents. Lossless formats such as PNG are preferred for OCR because they avoid compression artifacts around characters.

Kernel

Small matrix used in convolutional layers to detect specific features like edges or textures.
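
The sketch below slides a 3 x 3 vertical-edge kernel over a tiny image stored as nested lists. It is a plain-Python illustration with assumed names; like most deep learning frameworks, it computes cross-correlation (the kernel is not flipped), with no padding and a stride of 1.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` and sum the element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Dark region on the left, bright region on the right.
image = [
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
# Responds where brightness increases from left to right.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
print(convolve2d(image, vertical_edge))  # [[0, 3, 3]]
```

The output is zero over the flat dark region and positive where the window straddles the edge.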

Knowledge Distillation

Model compression technique in which a smaller student model is trained to reproduce the outputs of a larger teacher model.

Language Model

Statistical model predicting word sequences, used in OCR for spell-checking and context understanding.

LayoutLM

Microsoft's transformer model combining text, layout, and image information for document understanding.

Learning Rate

Hyperparameter controlling step size in gradient descent optimization.

Line Segmentation

Separating document pages into individual text lines for sequential processing.

LoRA

Low-Rank Adaptation: PEFT method that fine-tunes a model by training small low-rank adapter matrices while the original weights stay frozen.
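
The saving comes from simple arithmetic: for a weight matrix of shape d x k, the adapter is a pair of matrices of shapes d x r and r x k with a small rank r. The sketch below compares parameter counts; the 4096 x 4096 layer and rank 8 are example values, not taken from any particular model.

```python
def full_params(d: int, k: int) -> int:
    """Trainable parameters when the whole d x k weight matrix is updated."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for a rank-r adapter: a d x r and an r x k matrix."""
    return r * (d + k)


d, k, r = 4096, 4096, 8
print(full_params(d, k))                          # 16777216
print(lora_params(d, k, r))                       # 65536
print(full_params(d, k) // lora_params(d, k, r))  # 256
```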

Loss Function

Mathematical function measuring difference between predictions and ground truth, guides training.

LSTM

Long Short-Term Memory: RNN variant with gates to handle long-range dependencies, used in sequence-to-sequence OCR tasks.

MoE

Mixture of Experts: architecture with multiple specialized sub-networks (experts), of which only a few are activated for each token. DeepSeek uses 64 experts with ~6 active per token.

Multi-Head Attention

Attention mechanism running multiple parallel attention operations for richer representation learning.

Named Entity Recognition

NLP task identifying and classifying entities (names, dates, locations) in extracted text.

Neural Architecture Search

Automated method for discovering optimal neural network architectures.

Normalization

Scaling input data to standard range, improving model training stability and convergence.

OCR

Optical Character Recognition: technology converting images of text into machine-readable digital text using computer vision and ML.

OLR

Document layout analysis segmenting text zones from non-text zones before OCR processing.

OMR

Optical Mark Recognition: technology detecting marks made on paper documents, such as ticked checkboxes and filled-in bubbles on forms and answer sheets.

Optimization

Process of adjusting model parameters to minimize loss and improve performance.

Overfitting

A model memorizing its training data instead of learning generalizable patterns, so that it performs poorly on new data.

PDF Parsing

Extracting text, layout, and structure from PDF files, combining text extraction with OCR for scanned PDFs.

PEFT

Parameter-Efficient Fine-Tuning: techniques like LoRA enabling model adaptation with minimal trainable parameters and memory.

Positional Encoding

Adding position information to token embeddings in transformers to maintain sequence order.
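
One common scheme, though not the only one, is the fixed sinusoidal encoding from the original Transformer design, where even dimensions use a sine and odd dimensions a cosine of the position at geometrically spaced frequencies. A minimal sketch with an assumed function name:

```python
import math


def positional_encoding(position: int, d_model: int):
    """Sinusoidal encoding vector of length `d_model` for one position."""
    encoding = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        if i + 1 < d_model:
            encoding.append(math.cos(angle))
    return encoding


print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```

The resulting vector is added element-wise to the token embedding at that position. Many models learn their position embeddings instead of using this fixed formula.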

Precision

Ratio of true positives to all predicted positives, measuring prediction correctness.

Preprocessing

Preparing raw images through binarization, deskewing, noise removal before OCR.

QLoRA

Memory-efficient fine-tuning loading pretrained models as 4-bit quantized weights with LoRA adapters.

Quantization

Reducing model precision (32-bit to 8-bit/4-bit) to decrease memory and increase inference speed.
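
A simple symmetric 8-bit scheme can be sketched as follows: weights are divided by a scale factor, rounded to integers in [-127, 127], and multiplied back by the scale at inference time. The function names and sample weights are illustrative, and real toolkits use more elaborate per-channel or per-block schemes.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [-1.0, 0.0, 0.25, 1.0]
q, scale = quantize_int8(weights)
print(q)  # [-127, 0, 32, 127]
print(dequantize(q, scale))
```

The recovered values differ from the originals by at most half the scale, which is the precision given up in exchange for storing each weight in one byte.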

Query-Key-Value

Three components of attention mechanism: Query (what to find), Key (compare against), Value (retrieve).
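
The three roles fit together in scaled dot-product attention: each query is compared with every key, the scores are turned into weights by a softmax, and the output is the weighted average of the values. Below is a plain-Python sketch with assumed function names; real implementations operate on batched tensors and use learned projections to produce the queries, keys, and values.

```python
import math


def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Compare the query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Retrieve a weighted average of the values.
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs


# Two identical keys receive equal weight, so the output is the mean value.
print(attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[0.0], [2.0]]))
# [[1.0]]
```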

Recall

Ratio of true positives to all actual positives, measuring how many positives are found.

Regularization

Techniques (dropout, weight decay, L1/L2) preventing overfitting by constraining model complexity.

SAM

Segment Anything Model: Meta's vision model for image segmentation, used in DeepSeek-OCR's encoder for local detail capture.

Scene Text Recognition

OCR for text in natural images (street signs, products), more challenging than document OCR.

Self-Attention

Attention mechanism where each token attends to every token in the same sequence, including itself.

Semantic Segmentation

Pixel-level classification assigning each pixel to a category, used in layout analysis.

Table Detection

Identifying and extracting structured table data from documents with cell segmentation.

Tesseract

Open-source OCR engine supporting 100+ languages, widely used baseline system.

Text-Line Extraction

Segmenting document pages into individual text lines for processing.

Token

Basic unit of text (word, subword, character) processed by NLP models.

Transfer Learning

Using knowledge from pretrained models to improve performance on new tasks with less data.

Transformer

Architecture using self-attention mechanisms, foundation of modern NLP and vision models.