pdf

package module
v0.2.0 Latest
Published: Jan 3, 2026 License: BSD-3-Clause Imports: 19 Imported by: 20

README

PDF Parser for Go

A high-performance, lightweight PDF parsing library for Go, forked from rsc/pdf.

This library has been extensively refactored to support modern PDF standards and high-throughput production environments with a focus on memory efficiency and security.

Key Improvements

1. High-Performance Zero-Allocation AST

The internal Abstract Syntax Tree (AST) has been rewritten to use a tagged-union Object struct instead of interface{}. This removes the interface boxing that every PDF object (integers, names, strings, etc.) previously required, which sharply reduces memory allocations and GC pressure.
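
As a rough illustration of the tagged-union idea (not the library's actual code), the sketch below keeps an integer or a name in one struct with a kind tag, so no value has to be boxed in an interface{}. The names `kind`, `object`, and `describe` are hypothetical stand-ins chosen for this example.

```go
package main

import "strconv"

// kind tags which field of object is meaningful (hypothetical stand-in for Kind).
type kind int

const (
	kindNull kind = iota
	kindInteger
	kindName
)

// object is a small tagged union: one struct, no interface{} boxing.
type object struct {
	kind     kind
	int64Val int64
	nameVal  string
}

// describe switches on the tag instead of type-switching on an interface{}.
func describe(o object) string {
	switch o.kind {
	case kindInteger:
		return "integer " + strconv.FormatInt(o.int64Val, 10)
	case kindName:
		return "name /" + o.nameVal
	default:
		return "null"
	}
}
```

The trade-off is that every object carries all fields, used or not, in exchange for avoiding a heap allocation per boxed value.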

2. Modern Security Support

Added comprehensive support for encrypted PDFs:

  • AES-128 (v4): Full implementation of AES-CBC decryption for strings and streams.
  • AES-256 (v5): Support for PDF 2.0 / Extension Level 3 security handlers, including SHA-256 based Key Derivation (KDK) and File Encryption Key (FEK) retrieval.
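
To make the AES-CBC step concrete, here is a minimal sketch using only the Go standard library. It assumes the layout commonly described for PDF AES encryption: a 16-byte IV at the start of the data, followed by CBC ciphertext whose last block carries padding bytes equal to the padding length. The function names are hypothetical, key derivation is out of scope, and `encryptCBC` exists only so the sketch can be round-tripped.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"errors"
)

// decryptCBC decrypts data laid out as a 16-byte IV followed by AES-CBC
// ciphertext, then strips the trailing padding. key must be 16, 24, or 32 bytes.
func decryptCBC(key, data []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	if len(data) < 2*aes.BlockSize || len(data)%aes.BlockSize != 0 {
		return nil, errors.New("ciphertext is not an IV plus whole blocks")
	}
	iv, body := data[:aes.BlockSize], data[aes.BlockSize:]
	out := make([]byte, len(body))
	cipher.NewCBCDecrypter(block, iv).CryptBlocks(out, body)

	// The last byte gives the padding length (1..16).
	pad := int(out[len(out)-1])
	if pad == 0 || pad > aes.BlockSize {
		return nil, errors.New("invalid padding")
	}
	return out[:len(out)-pad], nil
}

// encryptCBC is the inverse operation, included only for round-trip testing.
func encryptCBC(key, iv, plain []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	if len(iv) != aes.BlockSize {
		return nil, errors.New("IV must be 16 bytes")
	}
	pad := aes.BlockSize - len(plain)%aes.BlockSize
	padded := make([]byte, len(plain)+pad)
	copy(padded, plain)
	for i := len(plain); i < len(padded); i++ {
		padded[i] = byte(pad)
	}
	out := make([]byte, aes.BlockSize+len(padded))
	copy(out, iv)
	cipher.NewCBCEncrypter(block, iv).CryptBlocks(out[aes.BlockSize:], padded)
	return out, nil
}
```

A production decoder would also validate every padding byte and use a random IV when encrypting; both are omitted here for brevity.
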

3. Stability & Error Handling

  • Panic-Free Design: Removed legacy panic calls in favor of proper Go error propagation.
  • Safe Method Chaining: The Value struct now carries error state, allowing safe nested calls like doc.Trailer().Key("Root").Key("Pages").Count().
  • Robustness: Improved recovery from malformed PDF structures and strict parsing errors.
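
The error-carrying pattern behind safe method chaining can be sketched as follows. This is an illustration of the general technique, not the library's Value implementation; the `value` type and its fields are hypothetical.

```go
package main

import "errors"

// value is a hypothetical stand-in for an error-carrying Value.
type value struct {
	dict map[string]value
	n    int
	err  error
}

// Key returns the entry for name. If v already holds an error, or has no
// entry called name, the returned value carries an error instead.
func (v value) Key(name string) value {
	if v.err != nil {
		return v // propagate the first error unchanged
	}
	child, ok := v.dict[name]
	if !ok {
		return value{err: errors.New("missing key " + name)}
	}
	return child
}

// Int returns the integer payload, or 0 if v carries an error.
func (v value) Int() int {
	if v.err != nil {
		return 0
	}
	return v.n
}

// Err reports the first error hit anywhere along the chain.
func (v value) Err() error { return v.err }
```

Because each step returns a value rather than panicking, a caller can write the whole chain first and check Err() once at the end.
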

4. Memory Efficiency

  • Buffer Pooling: Implemented sync.Pool for parsing buffers.
  • Bulk Scanning: Optimized lex.go with specialized bulk scanners for Names, Keywords, and Strings, drastically reducing per-byte overhead.
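
For readers unfamiliar with buffer pooling, the following sketch shows the usual sync.Pool pattern with a toy function. It is not taken from the library; `bufPool` and `upperName` are hypothetical names.

```go
package main

import (
	"bytes"
	"sync"
)

// bufPool hands out reusable buffers so each call need not allocate one.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// upperName is a toy "parser" that borrows a buffer, uses it, and returns it.
func upperName(raw string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold old data
	defer bufPool.Put(buf)

	for i := 0; i < len(raw); i++ {
		c := raw[i]
		if 'a' <= c && c <= 'z' {
			c -= 'a' - 'A'
		}
		buf.WriteByte(c)
	}
	return buf.String() // String copies, so the result outlives the buffer
}
```

The important details are resetting the buffer after Get and never retaining a reference to pooled memory after Put.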

Benchmarks

Throughput comparison against the original library (parsing standard documents):

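| Metric        | Upstream Library | This Version  | Change        |
|---------------|------------------|---------------|---------------|
| Parsing Speed | 79,526 ns/op     | 66,925 ns/op  | ~16% Faster   |
| Allocations   | 2,517 allocs/op  | 97 allocs/op  | 96% Reduction |
| Memory usage  | 113,712 B/op     | 87,226 B/op   | 23% Lower     |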

Usage

import "github.com/digitorus/pdf"

// file is an io.ReaderAt and size is its total length in bytes.
r, err := pdf.NewReader(file, size)
if err != nil {
    return err
}

// Fluent, error-safe access
root := r.Trailer().Key("Root")
if err := root.Err(); err != nil {
    return err
}

Documentation

Index

Constants

This section is empty.

Variables

var ErrInvalidPassword = fmt.Errorf("encrypted PDF: invalid password")

Functions

func Interpret

func Interpret(strm Value, do func(stk *Stack, op string))

Interpret interprets the content in a stream as a basic PostScript program, pushing values onto a stack and then calling the do function to execute operators. The do function may push or pop values from the stack as needed to implement op.

Interpret handles the operators "dict", "currentdict", "begin", "end", "def", and "pop" itself.

Interpret is not a full-blown PostScript interpreter. Its job is to handle the very limited PostScript found in certain supporting file formats embedded in PDF files, such as cmap files that describe the mapping from font code points to Unicode code points.

There is no support for executable blocks, among other limitations.
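
The division of labour between the interpreter and the do callback can be illustrated with a much smaller stack machine. This sketch is an assumption-laden simplification: the hypothetical `stack` holds only ints, tokens are whitespace-separated, only "pop" is handled internally, and `interpret` returns the stack so the result can be inspected (the real Interpret returns nothing).

```go
package main

import (
	"strconv"
	"strings"
)

// stack is a hypothetical stand-in for the package's Stack, holding ints only.
type stack struct{ vals []int }

func (s *stack) Push(v int) { s.vals = append(s.vals, v) }
func (s *stack) Len() int   { return len(s.vals) }

// Pop removes and returns the top value, or 0 if the stack is empty.
func (s *stack) Pop() int {
	if len(s.vals) == 0 {
		return 0
	}
	v := s.vals[len(s.vals)-1]
	s.vals = s.vals[:len(s.vals)-1]
	return v
}

// interpret pushes integer tokens and hands every other token to do,
// except "pop", which it handles itself.
func interpret(src string, do func(stk *stack, op string)) *stack {
	stk := &stack{}
	for _, tok := range strings.Fields(src) {
		if n, err := strconv.Atoi(tok); err == nil {
			stk.Push(n)
			continue
		}
		if tok == "pop" {
			stk.Pop()
			continue
		}
		do(stk, tok)
	}
	return stk
}
```

A caller supplies the operators it cares about, for example a do function that implements "add" by popping two values and pushing their sum.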

Types

type Content

type Content struct {
	Text []Text
	Rect []Rect
}

Content describes the basic content on a page: the text and any drawn rectangles.

type Font

type Font struct {
	V Value
}

A Font represents a font in a PDF file. The methods interpret a Font dictionary stored in V.

func (Font) BaseFont

func (f Font) BaseFont() string

BaseFont returns the font's name (BaseFont property).

func (Font) Encoder

func (f Font) Encoder() TextEncoding

Encoder returns the encoding between font code point sequences and UTF-8.

func (Font) FirstChar

func (f Font) FirstChar() int

FirstChar returns the code point of the first character in the font.

func (Font) LastChar

func (f Font) LastChar() int

LastChar returns the code point of the last character in the font.

func (Font) Width

func (f Font) Width(code int) float64

Width returns the width of the given code point.

func (Font) Widths

func (f Font) Widths() []float64

Widths returns the widths of the glyphs in the font. In a well-formed PDF, len(f.Widths()) == f.LastChar()+1 - f.FirstChar().
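
The relationship between FirstChar, LastChar, and Widths implies a simple offset lookup, sketched below with a hypothetical `font` type. Returning 0 for out-of-range codes is an assumption made for this example, not documented library behaviour.

```go
package main

// font is a hypothetical stand-in holding only the fields the lookup needs.
type font struct {
	firstChar int
	widths    []float64 // widths[i] belongs to code firstChar+i
}

// width returns the width for code, or 0 when code lies outside
// [firstChar, firstChar+len(widths)-1].
func (f font) width(code int) float64 {
	i := code - f.firstChar
	if i < 0 || i >= len(f.widths) {
		return 0
	}
	return f.widths[i]
}
```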

type Kind added in v0.2.0

type Kind int

Kind represents the kind of value stored in an Object.

const (
	Null Kind = iota
	Bool
	Integer
	Real
	String
	Name
	Dict
	Array
	Stream
	Indirect // Reference: 1 0 R; renamed from Ptr to avoid collision with Ptr struct
	Keyword  // Internal: obj, endobj, etc.
)

type Object added in v0.2.0

type Object struct {
	Kind         Kind
	BoolVal      bool
	Int64Val     int64
	Float64Val   float64
	NameVal      string
	StringVal    string
	KeywordVal   string
	ArrayVal     []Object
	DictVal      map[string]Object
	PtrVal       objptr
	StreamOffset int64 // For Stream, DictVal holds the header
}

Object represents a PDF object using a tagged union approach to avoid interface{} boxing.

func GetDict added in v0.1.2

func GetDict() Object

type Outline

type Outline struct {
	Title string    // title for this element
	Child []Outline // child elements
}

An Outline is a tree describing the outline (also known as the table of contents) of a document.

type Page

type Page struct {
	V Value
}

A Page represents a single page in a PDF file. The methods interpret a Page dictionary stored in V.

func (Page) Content

func (p Page) Content() (result Content)

Content returns the page's content. It recovers from panics caused by malformed content streams and returns an empty Content in such cases for security and robustness.

func (Page) Font

func (p Page) Font(name string) Font

Font returns the font with the given name associated with the page.

func (Page) Fonts

func (p Page) Fonts() []string

Fonts returns a list of the fonts associated with the page.

func (Page) Resources

func (p Page) Resources() Value

Resources returns the resources dictionary associated with the page.

type Point

type Point struct {
	X float64
	Y float64
}

A Point represents an X, Y pair.

type Ptr added in v0.2.0

type Ptr struct {
	// contains filtered or unexported fields
}

Ptr represents a PDF object reference (indirect object). This is the public API struct.

func (Ptr) GetGen added in v0.2.0

func (p Ptr) GetGen() uint16

GetGen returns the generation number.

func (Ptr) GetID added in v0.2.0

func (p Ptr) GetID() uint32

GetID returns the object number.

type Reader

type Reader struct {
	XrefInformation ReaderXrefInformation
	PDFVersion      string
	// contains filtered or unexported fields
}

A Reader is a single PDF file open for reading.

func NewReader

func NewReader(f io.ReaderAt, size int64) (*Reader, error)

NewReader opens a file for reading, using the data in f with the given total size.

func NewReaderEncrypted

func NewReaderEncrypted(f io.ReaderAt, size int64, pw func() string) (*Reader, error)

NewReaderEncrypted opens a file for reading, using the data in f with the given total size. If the PDF is encrypted, NewReaderEncrypted calls pw repeatedly to obtain passwords to try. If pw returns the empty string, NewReaderEncrypted stops trying to decrypt the file and returns an error.
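
One way to build a pw callback that satisfies this contract is a closure over a fixed candidate list, as in the sketch below. `passwordsFrom` is a hypothetical helper, not part of the package.

```go
package main

// passwordsFrom returns a callback in the shape NewReaderEncrypted expects:
// each call yields the next candidate, and "" once the list is exhausted.
func passwordsFrom(candidates ...string) func() string {
	i := 0
	return func() string {
		if i >= len(candidates) {
			return "" // signals "give up"
		}
		pw := candidates[i]
		i++
		return pw
	}
}
```

The result can be passed as the third argument, for example pdf.NewReaderEncrypted(f, size, passwordsFrom("secret")). Note that an empty string inside the candidate list would end the attempts early, since the empty string is the stop signal.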

func Open

func Open(file string) (*Reader, error)

Open opens a file for reading.

func (*Reader) Close added in v0.2.0

func (r *Reader) Close() error

Close closes the Reader and the underlying file if it implements io.Closer.

func (*Reader) GetObject added in v0.2.0

func (r *Reader) GetObject(id uint32) (Value, error)

GetObject reads and returns the object with the given ID. It resolves the object from the XRef table, using the cache if available.

func (*Reader) NumPage

func (r *Reader) NumPage() int

NumPage returns the number of pages in the PDF file.

func (*Reader) Outline

func (r *Reader) Outline() Outline

Outline returns the document outline. The Outline returned is the root of the outline tree and typically has no Title itself. That is, the children of the returned root are the top-level entries in the outline.
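
Because the returned root usually has no title, code that prints a table of contents typically starts at its children. The sketch below uses a local `outline` type that mirrors the documented Title and Child fields; `flatten` is a hypothetical helper.

```go
package main

import "strings"

// outline mirrors the documented Outline shape (Title plus Child slice).
type outline struct {
	Title string
	Child []outline
}

// flatten walks the tree depth-first and returns one indented line per entry.
// The root is skipped because it typically has no Title of its own.
func flatten(root outline) []string {
	var lines []string
	var walk func(o outline, depth int)
	walk = func(o outline, depth int) {
		lines = append(lines, strings.Repeat("  ", depth)+o.Title)
		for _, c := range o.Child {
			walk(c, depth+1)
		}
	}
	for _, c := range root.Child {
		walk(c, 0)
	}
	return lines
}
```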

func (*Reader) Page

func (r *Reader) Page(num int) Page

Page returns the page for the given page number. Page numbers are indexed starting at 1, not 0. If the page is not found, Page returns a Page with p.V.IsNull().

func (*Reader) Trailer

func (r *Reader) Trailer() Value

Trailer returns the file's Trailer value.

func (*Reader) Xref added in v0.1.2

func (r *Reader) Xref() []xref

type ReaderXrefInformation added in v0.1.2

type ReaderXrefInformation struct {
	StartPos               int64
	EndPos                 int64
	Length                 int64
	PositionLength         int64
	PositionStartPos       int64
	PositionEndPos         int64
	ItemCount              int64
	Type                   string
	IncludingTrailerEndPos int64
	IncludingTrailerLength int64
}

func (*ReaderXrefInformation) PrintDebug added in v0.1.2

func (info *ReaderXrefInformation) PrintDebug()

type Rect

type Rect struct {
	Min, Max Point
}

A Rect represents a rectangle.

type Stack

type Stack struct {
	// contains filtered or unexported fields
}

A Stack represents a stack of values.

func (*Stack) Len

func (stk *Stack) Len() int

func (*Stack) Pop

func (stk *Stack) Pop() Value

func (*Stack) Push

func (stk *Stack) Push(v Value)

type Text

type Text struct {
	Font     string  // the font used
	FontSize float64 // the font size, in points (1/72 of an inch)
	X        float64 // the X coordinate, in points, increasing left to right
	Y        float64 // the Y coordinate, in points, increasing bottom to top
	W        float64 // the width of the text, in points
	S        string  // the actual UTF-8 text
}

A Text represents a single piece of text drawn on a page.

type TextEncoding

type TextEncoding interface {
	// Decode returns the UTF-8 text corresponding to
	// the sequence of code points in raw.
	Decode(raw string) (text string)
}

A TextEncoding represents a mapping between font code points and UTF-8 text.

type TextHorizontal

type TextHorizontal []Text

TextHorizontal implements sort.Interface for sorting a slice of Text values in horizontal order, left to right, and then top to bottom within a column.

func (TextHorizontal) Len

func (x TextHorizontal) Len() int

func (TextHorizontal) Less

func (x TextHorizontal) Less(i, j int) bool

func (TextHorizontal) Swap

func (x TextHorizontal) Swap(i, j int)

type TextVertical

type TextVertical []Text

TextVertical implements sort.Interface for sorting a slice of Text values in vertical order, top to bottom, and then left to right within a line.
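
The ordering described here can be sketched with a local type. This is an illustration under assumptions: `text` keeps only the fields needed for sorting, lines are detected by exact Y equality, and the library's own comparison may differ.

```go
package main

import "sort"

// text keeps only what ordering needs (hypothetical stand-in for Text).
type text struct {
	X, Y float64
	S    string
}

// byVertical orders top to bottom, then left to right within a line.
// Y grows upward, so a larger Y sorts first.
type byVertical []text

func (x byVertical) Len() int      { return len(x) }
func (x byVertical) Swap(i, j int) { x[i], x[j] = x[j], x[i] }
func (x byVertical) Less(i, j int) bool {
	if x[i].Y != x[j].Y {
		return x[i].Y > x[j].Y
	}
	return x[i].X < x[j].X
}

// readingOrder sorts a copy of ts and joins the pieces with spaces.
func readingOrder(ts []text) string {
	sorted := append(byVertical(nil), ts...)
	sort.Sort(sorted)
	out := ""
	for i, t := range sorted {
		if i > 0 {
			out += " "
		}
		out += t.S
	}
	return out
}
```

With the real package, the equivalent call would be sort.Sort(pdf.TextVertical(content.Text)) on the slice returned by Page.Content.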

func (TextVertical) Len

func (x TextVertical) Len() int

func (TextVertical) Less

func (x TextVertical) Less(i, j int) bool

func (TextVertical) Swap

func (x TextVertical) Swap(i, j int)

type Value

type Value struct {
	// contains filtered or unexported fields
}

A Value represents a value in a PDF file.

func (Value) Bool

func (v Value) Bool() bool

Bool returns v's boolean value.

func (Value) Data added in v0.2.0

func (v Value) Data() []byte

Data returns the raw data of the stream v.

func (Value) Err added in v0.2.0

func (v Value) Err() error

Err returns the error associated with the value, if any.

func (Value) Float64

func (v Value) Float64() float64

Float64 returns v's float value.

func (Value) GetPtr added in v0.1.2

func (v Value) GetPtr() Ptr

GetPtr returns the object reference for the value.

func (Value) Header added in v0.2.0

func (v Value) Header() Value

Header returns the header dictionary for the stream v.

func (Value) Index

func (v Value) Index(i int) Value

Index returns the i'th element of the array v.

func (Value) Int64

func (v Value) Int64() int64

Int64 returns v's integer value.

func (Value) IsNull

func (v Value) IsNull() bool

IsNull reports whether v is a null value.

func (Value) Key

func (v Value) Key(key string) Value

Key returns the value associated with the given key in the dictionary v.

func (Value) Keys

func (v Value) Keys() []string

Keys returns the keys of the dictionary v, sorted alphabetically.

func (Value) Kind

func (v Value) Kind() Kind

Kind returns the kind of value v is.

func (Value) Len

func (v Value) Len() int

Len returns the number of elements in the array v.

func (Value) Name

func (v Value) Name() string

Name returns v's name value.

func (Value) RawString

func (v Value) RawString() string

RawString returns v's string value.

func (Value) Reader

func (v Value) Reader() io.ReadCloser

Reader returns a reader for the stream v.

func (Value) String

func (v Value) String() string

String returns a textual representation of the value v.

func (Value) Text

func (v Value) Text() string

Text returns v's string value interpreted as a “text string” (defined in the PDF spec) and converted to UTF-8.
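
A PDF text string is commonly either PDFDocEncoding or UTF-16BE introduced by the byte order mark FE FF. The sketch below covers only the UTF-16BE branch and leaves the PDFDocEncoding fallback to the caller. `decodeUTF16Text` is a hypothetical helper, not the library's implementation.

```go
package main

import "unicode/utf16"

// decodeUTF16Text converts a text string that starts with the UTF-16BE
// byte order mark (0xFE 0xFF) to UTF-8. ok is false when the mark is absent,
// in which case the caller would fall back to PDFDocEncoding (not shown).
func decodeUTF16Text(raw string) (s string, ok bool) {
	if len(raw) < 2 || raw[0] != 0xFE || raw[1] != 0xFF {
		return "", false
	}
	body := raw[2:]
	units := make([]uint16, 0, len(body)/2)
	for i := 0; i+1 < len(body); i += 2 {
		// Combine each big-endian byte pair into one UTF-16 code unit.
		units = append(units, uint16(body[i])<<8|uint16(body[i+1]))
	}
	return string(utf16.Decode(units)), true
}
```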

Directories

Path Synopsis
Pdfpasswd searches for the password for an encrypted PDF by trying all strings over a given alphabet up to a given length.