float16

Version: v0.2.0 (latest) | Published: Mar 3, 2026 | License: Apache-2.0 | Imports: 4 | Imported by: 6

Repository

github.com/zerfoo/float16

README ¶

float16

A comprehensive Go implementation of IEEE 754-2008 16-bit floating-point (half-precision) arithmetic with full support for special values, multiple rounding modes, and high-performance operations.

Features

Full IEEE 754-2008 compliance for 16-bit floating-point arithmetic
Complete special value support: ±0, ±∞, NaN (with payload), normalized and subnormal numbers
Multiple rounding modes: nearest-even, toward zero, toward ±∞, nearest-away
Flexible conversion modes: IEEE standard, strict error handling, fast approximations
High-performance operations with optional fast math optimizations
Comprehensive test suite with extensive edge case coverage
Zero dependencies - pure Go implementation

Installation

go get github.com/zerfoo/float16

Quick Start

package main

import (
    "fmt"
    "github.com/zerfoo/float16"
)

func main() {
    // Create float16 values
    a := float16.FromFloat32(3.14159)
    b := float16.FromFloat64(2.71828)
    
    // Basic arithmetic (Add and Mul are package-level functions)
    sum := float16.Add(a, b)
    product := float16.Mul(a, b)
    
    // Convert back to other types
    fmt.Printf("Sum: %v (float32: %f)\n", sum, sum.ToFloat32())
    fmt.Printf("Product: %v (float64: %f)\n", product, product.ToFloat64())
    
    // Work with special values
    inf := float16.Inf(1)  // positive infinity
    nan := float16.NaN()   // quiet NaN
    zero := float16.Zero() // positive zero
    
    fmt.Printf("Infinity: %v\n", inf)
    fmt.Printf("NaN: %v\n", nan)
    fmt.Printf("Zero: %v\n", zero)
}

Core Types and Constants

Float16 Type

The Float16 type represents a 16-bit IEEE 754 half-precision floating-point value:

type Float16 uint16

Special Values

const (
    PositiveZero     Float16 = 0x0000 // +0.0
    NegativeZero     Float16 = 0x8000 // -0.0
    PositiveInfinity Float16 = 0x7C00 // +∞
    NegativeInfinity Float16 = 0xFC00 // -∞
    MaxValue         Float16 = 0x7BFF // 65504
    MinValue         Float16 = 0xFBFF // -65504
)

Conversion Functions

From Other Types

// From float32/float64
fromF32 := float16.FromFloat32(3.14159)
fromF64 := float16.FromFloat64(2.71828)

// From bit representation
fromBits := float16.FromBits(0x4200) // 3.0

// From string
f16, err := float16.ParseFloat("3.14159", 32)

To Other Types

f32 := f16.ToFloat32()
f64 := f16.ToFloat64()
bits := f16.Bits()
str := f16.String()

Arithmetic Operations

a := float16.FromFloat32(5.0)
b := float16.FromFloat32(3.0)

// Basic arithmetic (package-level functions)
sum := float16.Add(a, b)      // 8.0
diff := float16.Sub(a, b)     // 2.0
product := float16.Mul(a, b)  // 15.0
quotient := float16.Div(a, b) // 1.666...

// Mathematical functions
sqrt := float16.Sqrt(a) // √5 (package-level function)
abs := a.Abs()          // |a|
neg := a.Neg()          // -a

Rounding Modes

Configure rounding behavior for conversions:

import "github.com/zerfoo/float16"

// Set global rounding mode
config := float16.GetConfig()
config.DefaultRoundingMode = float16.RoundTowardZero
float16.Configure(config)

// Available rounding modes:
// - RoundNearestEven (default)
// - RoundTowardZero
// - RoundTowardPositive  
// - RoundTowardNegative
// - RoundNearestAway

Conversion Modes

Control conversion behavior and error handling:

config := float16.GetConfig()
config.DefaultConversionMode = float16.ModeStrict
float16.Configure(config)

// Available modes (the ConversionMode constants documented for v0.2.0):
// - ModeIEEE: Standard IEEE 754 behavior; saturates to Inf/0 without errors
// - ModeStrict: Returns errors for NaN, Inf, overflow, and underflow

Special Value Handling

f := float16.FromFloat64(math.Inf(1)) // math.Inf returns a float64

// Check value types
if f.IsInf(0) {
    fmt.Println("Value is infinity")
}
if f.IsNaN() {
    fmt.Println("Value is NaN")
}
if f.IsFinite() {
    fmt.Println("Value is finite")
}
if f.IsNormal() {
    fmt.Println("Value is normalized")
}
if f.IsSubnormal() {
    fmt.Println("Value is subnormal")
}

// IEEE 754 classification
class := f.Class()
switch class {
case float16.ClassPositiveInfinity:
    fmt.Println("Positive infinity")
case float16.ClassQuietNaN:
    fmt.Println("Quiet NaN")
// ... other classes
}
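
The checks above all reduce to inspecting the exponent and mantissa fields. The following is a minimal sketch of that idea on raw bit patterns; classify and its string results are hypothetical and do not reproduce the package's Class method or its FloatClass constants.

```go
package main

import "fmt"

// Field masks for the binary16 layout (1 sign, 5 exponent, 10 mantissa bits).
const (
	exponentMask = 0x7C00
	mantissaMask = 0x03FF
)

// classify names the category of a half-precision bit pattern.
// Hypothetical helper for illustration only.
func classify(bits uint16) string {
	exp := bits & exponentMask
	mant := bits & mantissaMask
	switch {
	case exp == exponentMask && mant != 0:
		return "NaN"
	case exp == exponentMask:
		return "infinity"
	case exp == 0 && mant == 0:
		return "zero"
	case exp == 0:
		return "subnormal"
	default:
		return "normal"
	}
}

func main() {
	for _, b := range []uint16{0x0000, 0x0001, 0x3C00, 0x7C00, 0x7E00} {
		fmt.Printf("0x%04X is %s\n", b, classify(b))
	}
}
```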

Performance Features

Fast Math Operations

// Enable fast math for better performance (may sacrifice precision)
config := float16.GetConfig()
config.EnableFastMath = true
float16.Configure(config)

// Use fast operations
fastSum := float16.FastAdd(a, b)
fastProduct := float16.FastMul(a, b)

Vectorized Operations

// Vectorized operations (optimized for SIMD when available)
a := []float16.Float16{...}
b := []float16.Float16{...}

sum := float16.VectorAdd(a, b)
product := float16.VectorMul(a, b)

Error Handling

// Strict mode returns errors for exceptional conditions
config := float16.GetConfig()
config.DefaultConversionMode = float16.ModeStrict
float16.Configure(config)

f16, err := float16.FromFloat64WithMode(1e10, float16.ModeStrict, float16.RoundNearestEven)
if err != nil {
    if float16Err, ok := err.(*float16.Float16Error); ok {
        switch float16Err.Code {
        case float16.ErrOverflow:
            fmt.Println("Value too large for float16")
        case float16.ErrUnderflow:
            fmt.Println("Value too small for float16")
        }
    }
}

Utilities

Statistics for Slices

values := []float16.Float16{
    float16.FromFloat32(1.0),
    float16.FromFloat32(2.0),
    float16.FromFloat32(3.0),
}

stats := float16.ComputeSliceStats(values)
fmt.Printf("Min: %v, Max: %v, Mean: %v\n", stats.Min, stats.Max, stats.Mean)

Debugging and Monitoring

// Get memory usage
usage := float16.GetMemoryUsage()
fmt.Printf("Memory usage: %d bytes\n", usage)

// Get debug information
debug := float16.DebugInfo()
fmt.Printf("Debug info: %+v\n", debug)

Benchmarking

The package includes built-in benchmarking utilities:

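ops := float16.GetBenchmarkOperations()
x, y := float16.FromFloat32(1.5), float16.FromFloat32(2.5)
for name, op := range ops {
    // Each operation takes two Float16 operands and returns a Float16
    fmt.Printf("Benchmarking %s: %v\n", name, op(x, y))
}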

Range and Precision

Float16 has the following characteristics:

Range: ±65,504 (about ±6.55×10⁴)
Precision: ~3-4 decimal digits
Smallest positive normal: ~6.10×10⁻⁵
Smallest positive subnormal: ~5.96×10⁻⁸
Machine epsilon: ~9.77×10⁻⁴

Use Cases

Float16 is ideal for:

Machine Learning: Reduced memory usage and faster training
Graphics Programming: Color values, texture coordinates
Scientific Computing: Large datasets where precision can be traded for memory
Embedded Systems: Memory-constrained environments
Data Compression: Storing floating-point data more efficiently

Performance Considerations

Conversions between float16 and float32/float64 have computational overhead
Native float16 arithmetic is generally faster than conversion-based approaches
Enable fast math mode for performance-critical applications where precision can be sacrificed
Use vectorized operations for bulk processing

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Documentation ¶

Overview ¶

Package float16 implements two 16-bit floating-point data types:

Float16: IEEE 754-2008 half-precision (1 sign, 5 exponent, 10 mantissa bits)
BFloat16: "Brain Floating Point" format (1 sign, 8 exponent, 7 mantissa bits)

This implementation provides conversion between both types and other floating-point types (float32 and float64) with support for various rounding modes and error handling.

Special Values ¶

The float16 type supports all IEEE 754-2008 special values:

Positive and negative zero
Positive and negative infinity
Not-a-Number (NaN) values with payload
Normalized numbers
Subnormal (denormal) numbers

Subnormal Numbers ¶

When converting to higher-precision types (float32/float64), subnormal float16 values are preserved. However, when converting back from higher-precision types to float16, subnormal values may be rounded to the nearest representable normal float16 value. This behavior is consistent with many hardware implementations that handle subnormals in a similar way for performance reasons.

Rounding Modes ¶

The following rounding modes are supported for conversions:

RoundNearestEven: Round to nearest, ties to even (default)
RoundTowardZero: Round toward zero (truncate)
RoundTowardPositive: Round toward positive infinity
RoundTowardNegative: Round toward negative infinity
RoundNearestAway: Round to nearest, ties away from zero
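
The five rules differ only in how they resolve values that fall between two representable neighbours. The sketch below shows the same rules at integer granularity using the standard math package; the roundWith helper and its string mode names are hypothetical stand-ins for the package's RoundingMode constants, and the package applies the rules at half-precision granularity rather than to integers.

```go
package main

import (
	"fmt"
	"math"
)

// roundWith applies one of the five rounding rules at integer granularity.
// Hypothetical helper for illustration only.
func roundWith(mode string, x float64) float64 {
	switch mode {
	case "nearest-even":
		return math.RoundToEven(x)
	case "toward-zero":
		return math.Trunc(x)
	case "toward-positive":
		return math.Ceil(x)
	case "toward-negative":
		return math.Floor(x)
	case "nearest-away":
		return math.Round(x) // math.Round rounds halfway cases away from zero
	}
	return math.NaN()
}

func main() {
	modes := []string{"nearest-even", "toward-zero", "toward-positive", "toward-negative", "nearest-away"}
	for _, m := range modes {
		fmt.Printf("%-16s 2.5 -> %v, -2.5 -> %v\n", m, roundWith(m, 2.5), roundWith(m, -2.5))
	}
}
```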

Error Handling ¶

Conversion functions with a ConversionMode parameter can return errors for:

Overflow: When a value is too large to be represented
Underflow: When a value is too small to be represented (in strict mode)
Inexact: When rounding occurs (in strict mode)

See: http://en.wikipedia.org/wiki/Half-precision_floating-point_format

Index ¶

Constants
Variables
func BFloat16Equal(a, b BFloat16) bool
func BFloat16Greater(a, b BFloat16) bool
func BFloat16GreaterEqual(a, b BFloat16) bool
func BFloat16Less(a, b BFloat16) bool
func BFloat16LessEqual(a, b BFloat16) bool
func Configure(cfg *Config)
func DebugInfo() map[string]interface{}
func Equal(a, b Float16) bool
func GetBenchmarkOperations() map[string]BenchmarkOperation
func GetMemoryUsage() int
func GetVersion() string
func Greater(a, b Float16) bool
func GreaterEqual(a, b Float16) bool
func IsFinite(f Float16) bool
func IsInf(f Float16, sign int) bool
func IsNaN(f Float16) bool
func IsNormal(f Float16) bool
func IsSubnormal(f Float16) bool
func Less(a, b Float16) bool
func LessEqual(a, b Float16) bool
func Signbit(f Float16) bool
func ToSlice32(s []Float16) []float32
func ToSlice64(s []Float16) []float64
func ValidateSliceLength(a, b []Float16) error
type ArithmeticMode
type BFloat16
- func BFloat16Abs(b BFloat16) BFloat16
- func BFloat16Add(a, b BFloat16) BFloat16
- func BFloat16Div(a, b BFloat16) BFloat16
- func BFloat16FromBits(bits uint16) BFloat16
- func BFloat16FromFloat16(f Float16) BFloat16
- func BFloat16FromFloat32(f float32) BFloat16
- func BFloat16FromFloat32WithMode(f32 float32, convMode ConversionMode, roundMode RoundingMode) (BFloat16, error)
- func BFloat16FromFloat32WithRounding(f float32, mode RoundingMode) BFloat16
- func BFloat16FromFloat64(f float64) BFloat16
- func BFloat16FromFloat64WithMode(f64 float64, convMode ConversionMode, roundMode RoundingMode) (BFloat16, error)
- func BFloat16FromFloat64WithRounding(f float64, mode RoundingMode) BFloat16
- func BFloat16Max(a, b BFloat16) BFloat16
- func BFloat16Min(a, b BFloat16) BFloat16
- func BFloat16Mul(a, b BFloat16) BFloat16
- func BFloat16Neg(b BFloat16) BFloat16
- func BFloat16Sub(a, b BFloat16) BFloat16
- func (b BFloat16) Bits() uint16
- func (b BFloat16) Class() FloatClass
- func (b BFloat16) CopySign(s BFloat16) BFloat16
- func (b BFloat16) IsFinite() bool
- func (b BFloat16) IsInf(sign int) bool
- func (b BFloat16) IsNaN() bool
- func (b BFloat16) IsNormal() bool
- func (b BFloat16) IsSubnormal() bool
- func (b BFloat16) IsZero() bool
- func (b BFloat16) Signbit() bool
- func (b BFloat16) String() string
- func (b BFloat16) ToFloat16() Float16
- func (b BFloat16) ToFloat32() float32
type BenchmarkOperation
type Config
- func DefaultConfig() *Config
- func GetConfig() *Config
type ConversionMode
type ErrorCode
type Float16
- func Abs(f Float16) Float16
- func Acos(f Float16) Float16
- func Add(a, b Float16) Float16
- func AddSlice(a, b []Float16) []Float16
- func AddWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)
- func Asin(f Float16) Float16
- func Atan(f Float16) Float16
- func Atan2(y, x Float16) Float16
- func Cbrt(f Float16) Float16
- func Ceil(f Float16) Float16
- func Clamp(f, min, max Float16) Float16
- func CopySign(f, sign Float16) Float16
- func Cos(f Float16) Float16
- func Cosh(f Float16) Float16
- func Dim(f, g Float16) Float16
- func Div(a, b Float16) Float16
- func DivSlice(a, b []Float16) []Float16
- func DivWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)
- func DotProduct(a, b []Float16) Float16
- func Erf(f Float16) Float16
- func Erfc(f Float16) Float16
- func Exp(f Float16) Float16
- func Exp2(f Float16) Float16
- func Exp10(f Float16) Float16
- func FastAdd(a, b Float16) Float16
- func FastMul(a, b Float16) Float16
- func Float16FromBFloat16(b BFloat16) Float16
- func Floor(f Float16) Float16
- func Frexp(f Float16) (frac Float16, exp int)
- func FromBits(b uint16) Float16
- func FromFloat32(f32 float32) Float16
- func FromFloat32WithRounding(f32 float32, mode RoundingMode) Float16
- func FromFloat64(f64 float64) Float16
- func FromFloat64WithMode(f64 float64, convMode ConversionMode, roundMode RoundingMode) (Float16, error)
- func FromInt(i int) Float16
- func FromInt32(i int32) Float16
- func FromInt64(i int64) Float16
- func FromSlice64(s []float64) []Float16
- func Gamma(f Float16) Float16
- func Hypot(f, g Float16) Float16
- func Inf(sign int) Float16
- func J0(f Float16) Float16
- func J1(f Float16) Float16
- func Ldexp(frac Float16, exp int) Float16
- func Lerp(a, b, t Float16) Float16
- func Lgamma(f Float16) (Float16, int)
- func Log(f Float16) Float16
- func Log2(f Float16) Float16
- func Log10(f Float16) Float16
- func Max(a, b Float16) Float16
- func Min(a, b Float16) Float16
- func Mod(f, divisor Float16) Float16
- func Modf(f Float16) (integer, frac Float16)
- func Mul(a, b Float16) Float16
- func MulSlice(a, b []Float16) []Float16
- func MulWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)
- func NaN() Float16
- func NextAfter(f, g Float16) Float16
- func Norm2(s []Float16) Float16
- func One() Float16
- func Parse(s string) (Float16, error)
- func ParseFloat(s string, precision int) (Float16, error)
- func Pow(f, exp Float16) Float16
- func Remainder(f, divisor Float16) Float16
- func Round(f Float16) Float16
- func RoundToEven(f Float16) Float16
- func ScaleSlice(s []Float16, scalar Float16) []Float16
- func Sign(f Float16) Float16
- func Sin(f Float16) Float16
- func Sinh(f Float16) Float16
- func Sqrt(f Float16) Float16
- func Sub(a, b Float16) Float16
- func SubSlice(a, b []Float16) []Float16
- func SubWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)
- func SumSlice(s []Float16) Float16
- func Tan(f Float16) Float16
- func Tanh(f Float16) Float16
- func ToFloat16(f64 float64) Float16
- func ToSlice16(s []float32) []Float16
- func ToSlice16WithMode(s []float32, convMode ConversionMode, roundMode RoundingMode) ([]Float16, []error)
- func Trunc(f Float16) Float16
- func VectorAdd(a, b []Float16) []Float16
- func VectorMul(a, b []Float16) []Float16
- func Y0(f Float16) Float16
- func Y1(f Float16) Float16
- func Zero() Float16
- func (f Float16) Abs() Float16
- func (f Float16) Bits() uint16
- func (f Float16) Class() FloatClass
- func (f Float16) CopySign(s Float16) Float16
- func (f Float16) GoString() string
- func (f Float16) IsFinite() bool
- func (f Float16) IsInf(sign int) bool
- func (f Float16) IsNaN() bool
- func (f Float16) IsNormal() bool
- func (f Float16) IsSubnormal() bool
- func (f Float16) IsZero() bool
- func (f Float16) Neg() Float16
- func (f Float16) Sign() int
- func (f Float16) Signbit() bool
- func (f Float16) String() string
- func (f Float16) ToBFloat16() BFloat16
- func (f Float16) ToFloat32() float32
- func (f Float16) ToFloat64() float64
- func (f Float16) ToInt() int
- func (f Float16) ToInt32() int32
- func (f Float16) ToInt64() int64
type Float16Error
- func (e *Float16Error) Error() string
type FloatClass
- func FpClassify(f Float16) FloatClass
type RoundingMode
type SliceStats
- func ComputeSliceStats(s []Float16) SliceStats

Constants ¶

const (
	BFloat16SignMask     = 0x8000 // 0b1000000000000000 - Sign bit mask
	BFloat16ExponentMask = 0x7F80 // 0b0111111110000000 - Exponent bits mask
	BFloat16MantissaMask = 0x007F // 0b0000000001111111 - Mantissa bits mask
	BFloat16MantissaLen  = 7      // Number of mantissa bits
	BFloat16ExponentLen  = 8      // Number of exponent bits

	// Exponent bias and limits for BFloat16
	// bias = 2^(exponent_bits-1) - 1 = 2^7 - 1 = 127 (same as Float32)
	BFloat16ExponentBias = 127 // Bias for 8-bit exponent
	BFloat16ExponentMax  = 255 // Maximum exponent value
	BFloat16ExponentMin  = 0   // Minimum exponent value

	// Normalized exponent range
	BFloat16ExponentNormalMin = 1   // Minimum normalized exponent
	BFloat16ExponentNormalMax = 254 // Maximum normalized exponent (infinity at 255)

	// Special exponent values
	BFloat16ExponentZero     = 0   // Zero and subnormal numbers
	BFloat16ExponentInfinity = 255 // Infinity and NaN
)

BFloat16 format constants

const (
	Version      = "1.0.0"
	VersionMajor = 1
	VersionMinor = 0
	VersionPatch = 0
)

Package version information

const (
	SignMask     = 0x8000 // 0b1000000000000000 - Sign bit mask
	ExponentMask = 0x7C00 // 0b0111110000000000 - Exponent bits mask
	MantissaMask = 0x03FF // 0b0000001111111111 - Mantissa bits mask
	MantissaLen  = 10     // Number of mantissa bits
	ExponentLen  = 5      // Number of exponent bits

	// Exponent bias and limits for IEEE 754 half-precision
	// bias = 2^(exponent_bits-1) - 1 = 2^4 - 1 = 15
	ExponentBias = 15 // Bias for 5-bit exponent
	ExponentMax  = 31 // Maximum exponent value (11111 binary)
	ExponentMin  = 0  // Minimum exponent value

	// Normalized exponent range
	ExponentNormalMin = 1  // Minimum normalized exponent
	ExponentNormalMax = 30 // Maximum normalized exponent (infinity at 31)

	// Float32 constants for conversion
	Float32ExponentBias = 127 // IEEE 754 single precision bias
	Float32ExponentLen  = 8   // Float32 exponent bits
	Float32MantissaLen  = 23  // Float32 mantissa bits

	// Special exponent values
	ExponentZero     = 0  // Zero and subnormal numbers
	ExponentInfinity = 31 // Infinity and NaN
)

IEEE 754 half-precision format constants

Variables ¶

var (
	DefaultArithmeticMode = ModeIEEEArithmetic
	DefaultRounding       = DefaultRoundingMode
)

Global arithmetic settings

var (
	BFloat16Zero  = BFloat16PositiveZero
	BFloat16One   = BFloat16FromFloat32(1.0)
	BFloat16Two   = BFloat16FromFloat32(2.0)
	BFloat16Half  = BFloat16FromFloat32(0.5)
	BFloat16E     = BFloat16FromFloat32(float32(math.E))
	BFloat16Pi    = BFloat16FromFloat32(float32(math.Pi))
	BFloat16Sqrt2 = BFloat16FromFloat32(float32(math.Sqrt2))
)

Convenience constants for common BFloat16 values

var (
	DefaultConversionMode ConversionMode = ModeIEEE
	DefaultRoundingMode   RoundingMode   = RoundNearestEven
)

var (
	// Common integer values
	Zero16  = PositiveZero
	One16   = FromFloat32(1.0)
	Two16   = FromFloat32(2.0)
	Three16 = FromFloat32(3.0)
	Four16  = FromFloat32(4.0)
	Five16  = FromFloat32(5.0)
	Ten16   = FromFloat32(10.0)

	// Common fractional values
	Half16    = FromFloat32(0.5)
	Quarter16 = FromFloat32(0.25)
	Third16   = FromFloat32(1.0 / 3.0)

	// Special mathematical values
	NaN16  = QuietNaN
	PosInf = PositiveInfinity
	NegInf = NegativeInfinity

	// Commonly used constants
	Deg2Rad = FromFloat32(float32(math.Pi / 180.0)) // Degrees to radians
	Rad2Deg = FromFloat32(float32(180.0 / math.Pi)) // Radians to degrees
)

Constants for common values

var (
	E       = FromFloat32(float32(math.E))       // Euler's number
	Pi      = FromFloat32(float32(math.Pi))      // Pi
	Phi     = FromFloat32(float32(math.Phi))     // Golden ratio
	Sqrt2   = FromFloat32(float32(math.Sqrt2))   // Square root of 2
	SqrtE   = FromFloat32(float32(math.SqrtE))   // Square root of E
	SqrtPi  = FromFloat32(float32(math.SqrtPi))  // Square root of Pi
	SqrtPhi = FromFloat32(float32(math.SqrtPhi)) // Square root of Phi
	Ln2     = FromFloat32(float32(math.Ln2))     // Natural logarithm of 2
	Log2E   = FromFloat32(float32(math.Log2E))   // Base-2 logarithm of E
	Ln10    = FromFloat32(float32(math.Ln10))    // Natural logarithm of 10
	Log10E  = FromFloat32(float32(math.Log10E))  // Base-10 logarithm of E
)

Mathematical constants as Float16 values

Functions ¶

func BFloat16Equal ¶ added in v0.2.0

func BFloat16Equal(a, b BFloat16) bool

BFloat16Equal returns true if a equals b

func BFloat16Greater ¶ added in v0.2.0

func BFloat16Greater(a, b BFloat16) bool

BFloat16Greater returns true if a > b

func BFloat16GreaterEqual ¶ added in v0.2.0

func BFloat16GreaterEqual(a, b BFloat16) bool

BFloat16GreaterEqual returns true if a >= b

func BFloat16Less ¶ added in v0.2.0

func BFloat16Less(a, b BFloat16) bool

BFloat16Less returns true if a < b

func BFloat16LessEqual ¶ added in v0.2.0

func BFloat16LessEqual(a, b BFloat16) bool

BFloat16LessEqual returns true if a <= b

func Configure ¶

func Configure(cfg *Config)

Configure applies the given configuration to the package

func DebugInfo ¶

func DebugInfo() map[string]interface{}

DebugInfo returns debugging information about the package state

func Equal ¶

func Equal(a, b Float16) bool

Equal returns true if two Float16 values are equal

func GetBenchmarkOperations ¶

func GetBenchmarkOperations() map[string]BenchmarkOperation

GetBenchmarkOperations returns a map of operations suitable for benchmarking

func GetMemoryUsage ¶

func GetMemoryUsage() int

GetMemoryUsage returns the current memory usage of the package in bytes

func GetVersion ¶

func GetVersion() string

GetVersion returns the package version string

func Greater ¶

func Greater(a, b Float16) bool

Greater returns true if a > b

func GreaterEqual ¶

func GreaterEqual(a, b Float16) bool

GreaterEqual returns true if a >= b

func IsFinite ¶

func IsFinite(f Float16) bool

IsFinite reports whether f is neither infinite nor NaN

func IsInf ¶

func IsInf(f Float16, sign int) bool

IsInf reports whether f is an infinity, according to sign. If sign > 0, IsInf reports whether f is positive infinity. If sign < 0, IsInf reports whether f is negative infinity. If sign == 0, IsInf reports whether f is either infinity.
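
The sign convention mirrors math.IsInf from the standard library. On raw half-precision bit patterns the three cases can be sketched as follows; isInfBits is a hypothetical helper, not the package's implementation.

```go
package main

import "fmt"

const (
	posInfBits = 0x7C00 // +Inf: exponent all ones, mantissa zero
	negInfBits = 0xFC00 // -Inf: same pattern with the sign bit set
)

// isInfBits applies the sign convention to a binary16 bit pattern.
// Hypothetical helper for illustration only.
func isInfBits(bits uint16, sign int) bool {
	switch {
	case sign > 0:
		return bits == posInfBits
	case sign < 0:
		return bits == negInfBits
	default:
		return bits&0x7FFF == posInfBits // ignore the sign bit
	}
}

func main() {
	fmt.Println(isInfBits(0x7C00, 1), isInfBits(0x7C00, -1), isInfBits(0xFC00, 0))
}
```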

func IsNaN ¶

func IsNaN(f Float16) bool

IsNaN reports whether f is an IEEE 754 "not-a-number" value

func IsNormal ¶

func IsNormal(f Float16) bool

IsNormal reports whether f is a normal number (not zero, subnormal, infinite, or NaN)

func IsSubnormal ¶

func IsSubnormal(f Float16) bool

IsSubnormal reports whether f is a subnormal number

func Less ¶

func Less(a, b Float16) bool

Less returns true if a < b

func LessEqual ¶

func LessEqual(a, b Float16) bool

LessEqual returns true if a <= b

func Signbit ¶

func Signbit(f Float16) bool

Signbit reports whether f is negative or negative zero

func ToSlice32 ¶

func ToSlice32(s []Float16) []float32

ToSlice32 converts a slice of Float16 to a slice of float32

func ToSlice64 ¶

func ToSlice64(s []Float16) []float64

ToSlice64 converts a slice of Float16 to a slice of float64

func ValidateSliceLength ¶

func ValidateSliceLength(a, b []Float16) error

ValidateSliceLength checks if two slices have the same length

Types ¶

type ArithmeticMode ¶

type ArithmeticMode int

ArithmeticMode defines the precision/performance trade-off for arithmetic operations

const (
	// ModeIEEEArithmetic provides full IEEE 754 compliance with proper rounding
	ModeIEEEArithmetic ArithmeticMode = iota
	// ModeFastArithmetic optimizes for speed, may sacrifice some precision
	ModeFastArithmetic
	// ModeExactArithmetic provides exact results when possible, errors on precision loss
	ModeExactArithmetic
)

type BFloat16 ¶ added in v0.2.0

type BFloat16 uint16

BFloat16 represents a 16-bit "Brain Floating Point" format value, used by Google Brain, TensorFlow, and various ML frameworks. Format: 1 sign bit, 8 exponent bits, 7 mantissa bits.

const (
	BFloat16PositiveZero     BFloat16 = 0x0000 // +0.0
	BFloat16NegativeZero     BFloat16 = 0x8000 // -0.0
	BFloat16PositiveInfinity BFloat16 = 0x7F80 // +∞
	BFloat16NegativeInfinity BFloat16 = 0xFF80 // -∞
	BFloat16QuietNaN         BFloat16 = 0x7FC0 // Quiet NaN
	BFloat16SignalingNaN     BFloat16 = 0x7F81 // Signaling NaN

	// Largest finite values
	BFloat16MaxValue    BFloat16 = 0x7F7F // Largest positive normal
	BFloat16MinValue    BFloat16 = 0xFF7F // Largest negative normal (most negative)
	BFloat16SmallestPos BFloat16 = 0x0080 // Smallest positive normal
	BFloat16SmallestNeg BFloat16 = 0x8080 // Smallest negative normal

	// Smallest subnormal values
	BFloat16SmallestPosSubnormal BFloat16 = 0x0001 // Smallest positive subnormal
	BFloat16SmallestNegSubnormal BFloat16 = 0x8001 // Smallest negative subnormal
)

Special BFloat16 values

func BFloat16Abs ¶ added in v0.2.0

func BFloat16Abs(b BFloat16) BFloat16

BFloat16Abs returns the absolute value of b

func BFloat16Add ¶ added in v0.2.0

func BFloat16Add(a, b BFloat16) BFloat16

BFloat16Add adds two BFloat16 values

func BFloat16Div ¶ added in v0.2.0

func BFloat16Div(a, b BFloat16) BFloat16

BFloat16Div divides two BFloat16 values

func BFloat16FromBits ¶ added in v0.2.0

func BFloat16FromBits(bits uint16) BFloat16

BFloat16FromBits creates a BFloat16 from its bit representation

func BFloat16FromFloat16 ¶ added in v0.2.0

func BFloat16FromFloat16(f Float16) BFloat16

BFloat16FromFloat16 converts a Float16 to BFloat16

func BFloat16FromFloat32 ¶ added in v0.2.0

func BFloat16FromFloat32(f float32) BFloat16

BFloat16FromFloat32 converts a float32 to BFloat16 using round-to-nearest-even. BFloat16 is essentially a truncated float32, so the conversion is straightforward.

func BFloat16FromFloat32WithMode ¶ added in v0.2.0

func BFloat16FromFloat32WithMode(f32 float32, convMode ConversionMode, roundMode RoundingMode) (BFloat16, error)

BFloat16FromFloat32WithMode converts a float32 to BFloat16 with specified conversion and rounding modes.

func BFloat16FromFloat32WithRounding ¶ added in v0.2.0

func BFloat16FromFloat32WithRounding(f float32, mode RoundingMode) BFloat16

BFloat16FromFloat32WithRounding converts a float32 to BFloat16 with the specified rounding mode.

func BFloat16FromFloat64 ¶ added in v0.2.0

func BFloat16FromFloat64(f float64) BFloat16

BFloat16FromFloat64 converts a float64 to BFloat16

func BFloat16FromFloat64WithMode ¶ added in v0.2.0

func BFloat16FromFloat64WithMode(f64 float64, convMode ConversionMode, roundMode RoundingMode) (BFloat16, error)

BFloat16FromFloat64WithMode converts a float64 to BFloat16 with specified conversion and rounding modes.

func BFloat16FromFloat64WithRounding ¶ added in v0.2.0

func BFloat16FromFloat64WithRounding(f float64, mode RoundingMode) BFloat16

BFloat16FromFloat64WithRounding converts a float64 to BFloat16 with the specified rounding mode.

func BFloat16Max ¶ added in v0.2.0

func BFloat16Max(a, b BFloat16) BFloat16

BFloat16Max returns the larger of a or b

func BFloat16Min ¶ added in v0.2.0

func BFloat16Min(a, b BFloat16) BFloat16

BFloat16Min returns the smaller of a or b

func BFloat16Mul ¶ added in v0.2.0

func BFloat16Mul(a, b BFloat16) BFloat16

BFloat16Mul multiplies two BFloat16 values

func BFloat16Neg ¶ added in v0.2.0

func BFloat16Neg(b BFloat16) BFloat16

BFloat16Neg returns the negation of b

func BFloat16Sub ¶ added in v0.2.0

func BFloat16Sub(a, b BFloat16) BFloat16

BFloat16Sub subtracts two BFloat16 values

func (BFloat16) Bits ¶ added in v0.2.0

func (b BFloat16) Bits() uint16

Bits returns the bit representation of the BFloat16

func (BFloat16) Class ¶ added in v0.2.0

func (b BFloat16) Class() FloatClass

Class returns the IEEE 754 classification of the BFloat16 value

func (BFloat16) CopySign ¶ added in v0.2.0

func (b BFloat16) CopySign(s BFloat16) BFloat16

CopySign returns a value with the magnitude of b and the sign of s

func (BFloat16) IsFinite ¶ added in v0.2.0

func (b BFloat16) IsFinite() bool

IsFinite reports whether b is neither infinite nor NaN

func (BFloat16) IsInf ¶ added in v0.2.0

func (b BFloat16) IsInf(sign int) bool

IsInf reports whether b is an infinity, according to sign

func (BFloat16) IsNaN ¶ added in v0.2.0

func (b BFloat16) IsNaN() bool

IsNaN reports whether b is an IEEE 754 "not-a-number" value

func (BFloat16) IsNormal ¶ added in v0.2.0

func (b BFloat16) IsNormal() bool

IsNormal reports whether b is a normal number

func (BFloat16) IsSubnormal ¶ added in v0.2.0

func (b BFloat16) IsSubnormal() bool

IsSubnormal reports whether b is a subnormal number

func (BFloat16) IsZero ¶ added in v0.2.0

func (b BFloat16) IsZero() bool

IsZero returns true if the BFloat16 is zero (positive or negative)

func (BFloat16) Signbit ¶ added in v0.2.0

func (b BFloat16) Signbit() bool

Signbit reports whether b is negative or negative zero

func (BFloat16) String ¶ added in v0.2.0

func (b BFloat16) String() string

String returns a string representation of the BFloat16

func (BFloat16) ToFloat16 ¶ added in v0.2.0

func (b BFloat16) ToFloat16() Float16

ToFloat16 converts a BFloat16 to Float16

func (BFloat16) ToFloat32 ¶ added in v0.2.0

func (b BFloat16) ToFloat32() float32

ToFloat32 converts BFloat16 to float32

type BenchmarkOperation ¶

type BenchmarkOperation func(Float16, Float16) Float16

BenchmarkOperation represents a benchmarkable operation

type Config ¶

type Config struct {
	DefaultConversionMode ConversionMode
	DefaultRoundingMode   RoundingMode
	DefaultArithmeticMode ArithmeticMode
	EnableFastMath        bool
}

Package configuration

func DefaultConfig ¶

func DefaultConfig() *Config

DefaultConfig returns the default package configuration

func GetConfig ¶

func GetConfig() *Config

GetConfig returns the current package configuration

type ConversionMode ¶

type ConversionMode int

ConversionMode controls error reporting behavior for conversions

const (
	// ModeIEEE performs IEEE-style conversion, saturating to Inf/0 with no errors
	ModeIEEE ConversionMode = iota
	// ModeStrict reports errors for NaN, Inf, overflow, and underflow
	ModeStrict
)

type ErrorCode ¶

type ErrorCode int

ErrorCode represents specific error categories for float16 operations

const (
	ErrInvalidOperation ErrorCode = iota
	ErrNaN
	ErrInfinity
	ErrOverflow
	ErrUnderflow
	ErrDivisionByZero
)

type Float16 ¶

type Float16 uint16

Float16 represents a 16-bit IEEE 754 half-precision floating-point value

const (
	PositiveZero     Float16 = 0x0000 // +0.0
	NegativeZero     Float16 = 0x8000 // -0.0
	PositiveInfinity Float16 = 0x7C00 // +∞
	NegativeInfinity Float16 = 0xFC00 // -∞

	// Largest finite values
	MaxValue Float16 = 0x7BFF // Largest positive finite value (65504)
	MinValue Float16 = 0xFBFF // Most negative finite value (-65504)

	// Smallest normalized positive value
	SmallestNormal Float16 = 0x0400 // 2^-14 ≈ 6.103515625e-05

	// Largest subnormal value
	LargestSubnormal Float16 = 0x03FF // (1023/1024) * 2^-14 ≈ 6.097555161e-05

	// Smallest positive subnormal value
	SmallestSubnormal Float16 = 0x0001 // 2^-24 ≈ 5.960464478e-08

	// Common NaN representations
	QuietNaN     Float16 = 0x7E00 // Quiet NaN (most significant mantissa bit set)
	SignalingNaN Float16 = 0x7D00 // Signaling NaN
	NegativeQNaN Float16 = 0xFE00 // Negative quiet NaN
)

Special values following IEEE 754 half-precision standard
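
The quiet and signaling NaN constants above differ only in the most significant mantissa bit (0x0200). A minimal sketch of that distinction on raw bit patterns, with hypothetical helper names that are not part of the package:

```go
package main

import "fmt"

const (
	expMask  = 0x7C00 // exponent field
	mantMask = 0x03FF // mantissa field
	quietBit = 0x0200 // most significant mantissa bit
)

// isNaNBits reports whether the pattern encodes any NaN.
func isNaNBits(bits uint16) bool {
	return bits&expMask == expMask && bits&mantMask != 0
}

// isQuietNaN reports whether the pattern is a NaN with the quiet bit set.
func isQuietNaN(bits uint16) bool {
	return isNaNBits(bits) && bits&quietBit != 0
}

// isSignalingNaN reports whether the pattern is a NaN with the quiet bit clear.
func isSignalingNaN(bits uint16) bool {
	return isNaNBits(bits) && bits&quietBit == 0
}

func main() {
	for _, b := range []uint16{0x7E00, 0x7D00, 0xFE00, 0x7C00} {
		fmt.Printf("0x%04X quiet=%v signaling=%v\n", b, isQuietNaN(b), isSignalingNaN(b))
	}
}
```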

func Abs ¶

func Abs(f Float16) Float16

Abs returns the absolute value of f

func Acos ¶

func Acos(f Float16) Float16

Acos returns the arccosine of f

func Add ¶

func Add(a, b Float16) Float16

Add performs addition of two Float16 values

func AddSlice ¶

func AddSlice(a, b []Float16) []Float16

AddSlice performs element-wise addition of two Float16 slices

func AddWithMode ¶

func AddWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)

AddWithMode performs addition with specified arithmetic and rounding modes

func Asin ¶

func Asin(f Float16) Float16

Asin returns the arcsine of f

func Atan ¶

func Atan(f Float16) Float16

Atan returns the arctangent of f

func Atan2 ¶

func Atan2(y, x Float16) Float16

Atan2 returns the arctangent of y/x

func Cbrt ¶

func Cbrt(f Float16) Float16

Cbrt returns the cube root of the Float16 value

func Ceil ¶

func Ceil(f Float16) Float16

Ceil returns the smallest integer value greater than or equal to f

func Clamp ¶

func Clamp(f, min, max Float16) Float16

Clamp restricts f to the range [min, max]

func CopySign ¶

func CopySign(f, sign Float16) Float16

CopySign returns a Float16 with the magnitude of f and the sign of sign

func Cos ¶

func Cos(f Float16) Float16

Cos returns the cosine of f (in radians)

func Cosh ¶

func Cosh(f Float16) Float16

Cosh returns the hyperbolic cosine of f

func Dim ¶

func Dim(f, g Float16) Float16

Dim returns the positive difference between f and g: max(f-g, 0)

func Div ¶

func Div(a, b Float16) Float16

Div performs division of two Float16 values

func DivSlice ¶

func DivSlice(a, b []Float16) []Float16

DivSlice performs element-wise division of two Float16 slices

func DivWithMode ¶

func DivWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)

DivWithMode performs division with specified arithmetic and rounding modes

func DotProduct ¶

func DotProduct(a, b []Float16) Float16

DotProduct computes the dot product of two Float16 slices

func Erf ¶

func Erf(f Float16) Float16

Erf returns the error function of f

func Erfc ¶

func Erfc(f Float16) Float16

Erfc returns the complementary error function of f

func Exp ¶

func Exp(f Float16) Float16

Exp returns e^f

func Exp2 ¶

func Exp2(f Float16) Float16

Exp2 returns 2^f

func Exp10 ¶

func Exp10(f Float16) Float16

Exp10 returns 10^f

func FastAdd ¶

func FastAdd(a, b Float16) Float16

FastAdd performs addition optimized for speed (may sacrifice precision)

func FastMul ¶

func FastMul(a, b Float16) Float16

FastMul performs multiplication optimized for speed (may sacrifice precision)

func Float16FromBFloat16 ¶ added in v0.2.0

func Float16FromBFloat16(b BFloat16) Float16

Float16FromBFloat16 converts a BFloat16 to Float16

func Floor ¶

func Floor(f Float16) Float16

Floor returns the largest integer value less than or equal to f

func Frexp ¶

func Frexp(f Float16) (frac Float16, exp int)

Frexp breaks f into a normalized fraction and an integral power of two. It returns frac and exp satisfying f == frac × 2^exp, with the absolute value of frac in the interval [0.5, 1), or zero when f is zero.

func FromBits ¶

func FromBits(b uint16) Float16

FromBits constructs a Float16 from its IEEE 754 half-precision bit pattern

func FromFloat32 ¶

func FromFloat32(f32 float32) Float16

FromFloat32 converts a float32 value to a Float16 value. It handles special cases like NaN, infinities, and zeros. The conversion follows IEEE 754-2008 rules for half-precision.

func FromFloat32WithRounding ¶ added in v0.2.0

func FromFloat32WithRounding(f32 float32, mode RoundingMode) Float16

FromFloat32WithRounding converts a float32 to Float16 using the provided rounding mode. It mirrors fromFloat32New but respects the explicit rounding mode instead of always rounding to nearest-even.

func FromFloat64 ¶

func FromFloat64(f64 float64) Float16

FromFloat64 converts a float64 value to a Float16 value. It handles special cases like NaN, infinities, and zeros.

func FromFloat64WithMode ¶

func FromFloat64WithMode(f64 float64, convMode ConversionMode, roundMode RoundingMode) (Float16, error)

FromFloat64WithMode converts a float64 to Float16 with specified conversion and rounding modes

func FromInt ¶

func FromInt(i int) Float16

FromInt converts an integer to Float16

func FromInt32 ¶

func FromInt32(i int32) Float16

FromInt32 converts an int32 to Float16

func FromInt64 ¶

func FromInt64(i int64) Float16

FromInt64 converts an int64 to Float16

func FromSlice64 ¶

func FromSlice64(s []float64) []Float16

FromSlice64 converts a slice of float64 to a slice of Float16

func Gamma ¶

func Gamma(f Float16) Float16

Gamma returns the Gamma function of f

func Hypot ¶

func Hypot(f, g Float16) Float16

Hypot returns sqrt(f*f + g*g), taking care to avoid overflow and underflow

func Inf ¶

func Inf(sign int) Float16

Inf returns a Float16 infinity value. If sign >= 0, it returns positive infinity. If sign < 0, it returns negative infinity.

func J0 ¶

func J0(f Float16) Float16

J0 returns the order-zero Bessel function of the first kind

func J1 ¶

func J1(f Float16) Float16

J1 returns the order-one Bessel function of the first kind

func Ldexp ¶

func Ldexp(frac Float16, exp int) Float16

Ldexp returns frac × 2^exp

func Lerp ¶

func Lerp(a, b, t Float16) Float16

Lerp performs linear interpolation between a and b by factor t

func Lgamma ¶

func Lgamma(f Float16) (Float16, int)

Lgamma returns the natural logarithm and sign of Gamma(f)

func Log ¶

func Log(f Float16) Float16

Log returns the natural logarithm of f

func Log2 ¶

func Log2(f Float16) Float16

Log2 returns the base-2 logarithm of f

func Log10 ¶

func Log10(f Float16) Float16

Log10 returns the base-10 logarithm of f

func Max ¶

func Max(a, b Float16) Float16

Max returns the larger of two Float16 values

func Min ¶

func Min(a, b Float16) Float16

Min returns the smaller of two Float16 values

func Mod ¶

func Mod(f, divisor Float16) Float16

Mod returns the floating-point remainder of f/divisor

func Modf ¶

func Modf(f Float16) (integer, frac Float16)

Modf returns integer and fractional floating-point numbers that sum to f. Both values have the same sign as f.

func Mul ¶

func Mul(a, b Float16) Float16

Mul performs multiplication of two Float16 values

func MulSlice ¶

func MulSlice(a, b []Float16) []Float16

MulSlice performs element-wise multiplication of two Float16 slices

func MulWithMode ¶

func MulWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)

MulWithMode performs multiplication with specified arithmetic and rounding modes

func NaN ¶

func NaN() Float16

NaN returns a Float16 quiet NaN value

func NextAfter ¶

func NextAfter(f, g Float16) Float16

NextAfter returns the next representable Float16 value after f in the direction of g
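
One common way to find a neighbouring value is to step the bit pattern, because consecutive half-precision patterns of the same sign encode adjacent magnitudes. The following sketch shows only the upward direction on raw uint16 patterns. The helper nextUp is hypothetical, and this is not necessarily how NextAfter is implemented in this package.

```go
package main

import "fmt"

// nextUp returns the bit pattern of the next representable half-precision
// value toward +Inf. Hypothetical helper for illustration only.
func nextUp(bits uint16) uint16 {
	switch {
	case bits&0x7C00 == 0x7C00 && bits&0x03FF != 0:
		return bits // NaN stays NaN
	case bits == 0x7C00:
		return bits // +Inf has no larger neighbour
	case bits&0x7FFF == 0:
		return 0x0001 // either zero steps to the smallest positive subnormal
	case bits&0x8000 != 0:
		return bits - 1 // negative: moving up shrinks the magnitude
	default:
		return bits + 1 // positive: moving up grows the magnitude
	}
}

func main() {
	for _, b := range []uint16{0x0000, 0x03FF, 0x7BFF, 0xFC00, 0x8001} {
		fmt.Printf("0x%04X -> 0x%04X\n", b, nextUp(b))
	}
}
```

Note that stepping up from the largest subnormal (0x03FF) lands on the smallest normal (0x0400), and stepping up from MaxValue (0x7BFF) lands on positive infinity (0x7C00).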

func Norm2 ¶

func Norm2(s []Float16) Float16

Norm2 computes the L2 norm (Euclidean norm) of a Float16 slice

func One ¶

func One() Float16

One returns a Float16 value representing 1.0

func Parse ¶

func Parse(s string) (Float16, error)

Parse converts a string to a Float16 value. This is a simplified implementation for testing.

func ParseFloat ¶ added in v0.2.0

func ParseFloat(s string, precision int) (Float16, error)

ParseFloat converts a string to a Float16 value. The precision parameter is ignored for Float16. It returns the Float16 value and an error if the string cannot be parsed.

func Pow ¶

func Pow(f, exp Float16) Float16

Pow returns f raised to the power of exp

func Remainder ¶

func Remainder(f, divisor Float16) Float16

Remainder returns the IEEE 754 floating-point remainder of f/divisor
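
Mod and Remainder differ in how the quotient is rounded. In the standard library, math.Mod truncates the quotient toward zero, while math.Remainder rounds it to the nearest integer as IEEE 754 specifies. The sketch below uses those float64 functions to show the difference; that this package's Mod and Remainder follow the same conventions is an assumption based on their descriptions, and the helper modAndRemainder is hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// modAndRemainder returns both kinds of remainder of x/y using the
// standard library's float64 functions.
func modAndRemainder(x, y float64) (mod, rem float64) {
	return math.Mod(x, y), math.Remainder(x, y)
}

func main() {
	// 5/3 ≈ 1.67: truncating gives quotient 1 (5 - 3 = 2),
	// rounding to nearest gives quotient 2 (5 - 6 = -1).
	mod, rem := modAndRemainder(5, 3)
	fmt.Println(mod, rem) // 2 -1
}
```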

func Round ¶

func Round(f Float16) Float16

Round returns the nearest integer value to f

func RoundToEven ¶

func RoundToEven(f Float16) Float16

RoundToEven returns the nearest integer value to f, rounding ties to even

func ScaleSlice ¶

func ScaleSlice(s []Float16, scalar Float16) []Float16

ScaleSlice multiplies each element in the slice by a scalar

func Sign ¶

func Sign(f Float16) Float16

Sign returns -1, 0, or 1 depending on the sign of f

func Sin ¶

func Sin(f Float16) Float16

Sin returns the sine of f (in radians)

func Sinh ¶

func Sinh(f Float16) Float16

Sinh returns the hyperbolic sine of f

func Sqrt ¶

func Sqrt(f Float16) Float16

Sqrt returns the square root of the Float16 value

func Sub ¶

func Sub(a, b Float16) Float16

Sub performs subtraction of two Float16 values

func SubSlice ¶

func SubSlice(a, b []Float16) []Float16

SubSlice performs element-wise subtraction of two Float16 slices

func SubWithMode ¶

func SubWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)

SubWithMode performs subtraction with specified arithmetic and rounding modes

func SumSlice ¶

func SumSlice(s []Float16) Float16

SumSlice returns the sum of all elements in the slice

func Tan ¶

func Tan(f Float16) Float16

Tan returns the tangent of f (in radians)

func Tanh ¶

func Tanh(f Float16) Float16

Tanh returns the hyperbolic tangent of f

func ToFloat16 ¶

func ToFloat16(f64 float64) Float16

ToFloat16 converts a float64 to a Float16 value. This is a convenience wrapper used in tests and utilities.

func ToSlice16 ¶

func ToSlice16(s []float32) []Float16

ToSlice16 converts a slice of float32 to a slice of Float16. This is a convenience wrapper used in tests and utilities.

func ToSlice16WithMode ¶

func ToSlice16WithMode(s []float32, convMode ConversionMode, roundMode RoundingMode) ([]Float16, []error)

ToSlice16WithMode converts a slice of float32 to Float16 with specified modes

func Trunc ¶

func Trunc(f Float16) Float16

Trunc returns the integer part of f (truncated towards zero)

func VectorAdd ¶

func VectorAdd(a, b []Float16) []Float16

VectorAdd performs vectorized addition (placeholder for future SIMD implementation)

func VectorMul ¶

func VectorMul(a, b []Float16) []Float16

VectorMul performs vectorized multiplication (placeholder for future SIMD implementation)

func Y0 ¶

func Y0(f Float16) Float16

Y0 returns the order-zero Bessel function of the second kind

func Y1 ¶

func Y1(f Float16) Float16

Y1 returns the order-one Bessel function of the second kind

func Zero ¶

func Zero() Float16

Zero returns a Float16 zero value

func (Float16) Abs ¶

func (f Float16) Abs() Float16

Abs returns the absolute value of the Float16

func (Float16) Bits ¶

func (f Float16) Bits() uint16

Bits returns the IEEE 754 half-precision bit pattern of f

func (Float16) Class ¶

func (f Float16) Class() FloatClass

Class returns the IEEE 754 classification of the value

func (Float16) CopySign ¶

func (f Float16) CopySign(s Float16) Float16

CopySign returns a value with the magnitude of f and the sign of s

func (Float16) GoString ¶

func (f Float16) GoString() string

GoString returns a Go syntax representation of the Float16 value

func (Float16) IsFinite ¶

func (f Float16) IsFinite() bool

IsFinite returns true if the Float16 value is finite (not infinity or NaN)

func (Float16) IsInf ¶

func (f Float16) IsInf(sign int) bool

IsInf returns true if the Float16 value represents infinity. If sign > 0, it returns true only for positive infinity. If sign < 0, it returns true only for negative infinity. If sign == 0, it returns true for either infinity.
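
The sign convention can be expressed directly on the bit patterns from the constants section (0x7C00 for +∞, 0xFC00 for -∞). This is a minimal sketch with a hypothetical helper isInfBits operating on a raw uint16; it is not the package's implementation.

```go
package main

import "fmt"

// isInfBits applies the IsInf sign convention to a raw half-precision
// bit pattern. Hypothetical helper for illustration only.
func isInfBits(bits uint16, sign int) bool {
	switch {
	case sign > 0:
		return bits == 0x7C00 // positive infinity only
	case sign < 0:
		return bits == 0xFC00 // negative infinity only
	default:
		return bits&0x7FFF == 0x7C00 // either infinity: ignore the sign bit
	}
}

func main() {
	fmt.Println(isInfBits(0x7C00, 1))  // true
	fmt.Println(isInfBits(0xFC00, 1))  // false
	fmt.Println(isInfBits(0xFC00, 0))  // true
	fmt.Println(isInfBits(0x7E00, 0))  // false
}
```

The last call returns false because 0x7E00 has a non-zero mantissa and is therefore a NaN, not an infinity.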

func (Float16) IsNaN ¶

func (f Float16) IsNaN() bool

IsNaN returns true if the Float16 value represents NaN (Not a Number)

func (Float16) IsNormal ¶

func (f Float16) IsNormal() bool

IsNormal returns true if the Float16 value is normalized (not zero, subnormal, infinite, or NaN)

func (Float16) IsSubnormal ¶

func (f Float16) IsSubnormal() bool

IsSubnormal returns true if the Float16 value is subnormal (denormalized)

func (Float16) IsZero ¶

func (f Float16) IsZero() bool

IsZero returns true if the Float16 value represents zero (positive or negative)

func (Float16) Neg ¶

func (f Float16) Neg() Float16

Neg returns the negation of the Float16

func (Float16) Sign ¶

func (f Float16) Sign() int

Sign returns the sign of the Float16 value: 1 for positive, -1 for negative, 0 for zero

func (Float16) Signbit ¶

func (f Float16) Signbit() bool

Signbit returns true if the Float16 value has a negative sign bit

func (Float16) String ¶

func (f Float16) String() string

String returns a string representation of the Float16 value

func (Float16) ToBFloat16 ¶ added in v0.2.0

func (f Float16) ToBFloat16() BFloat16

ToBFloat16 converts a Float16 to BFloat16

func (Float16) ToFloat32 ¶

func (f Float16) ToFloat32() float32

ToFloat32 converts a Float16 value to a float32 value. It handles special cases like NaN, infinities, and zeros.

func (Float16) ToFloat64 ¶

func (f Float16) ToFloat64() float64

ToFloat64 converts a Float16 value to a float64 value. It handles special cases like NaN, infinities, and zeros.

func (Float16) ToInt ¶

func (f Float16) ToInt() int

ToInt converts Float16 to int (truncates toward zero)

func (Float16) ToInt32 ¶

func (f Float16) ToInt32() int32

ToInt32 converts Float16 to int32 (truncates toward zero)

func (Float16) ToInt64 ¶

func (f Float16) ToInt64() int64

ToInt64 converts Float16 to int64 (truncates toward zero)

type Float16Error ¶

type Float16Error struct {
	Op   string
	Msg  string
	Code ErrorCode
}

Float16Error provides detailed error information for float16 operations

func (*Float16Error) Error ¶

func (e *Float16Error) Error() string

type FloatClass ¶

type FloatClass int

FloatClass enumerates the IEEE 754 classification of a Float16 value

const (
	ClassPositiveZero FloatClass = iota
	ClassNegativeZero
	ClassPositiveSubnormal
	ClassNegativeSubnormal
	ClassPositiveNormal
	ClassNegativeNormal
	ClassPositiveInfinity
	ClassNegativeInfinity
	ClassQuietNaN
	ClassSignalingNaN
)

func FpClassify ¶

func FpClassify(f Float16) FloatClass

FpClassify returns the IEEE 754 class of f
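
The ten classes follow from three fields of the bit pattern: the sign bit, the 5-bit exponent, and the 10-bit mantissa. The following sketch classifies a raw uint16 and returns a descriptive string. The helper classifyBits is hypothetical, and it is an illustration of the IEEE 754 rules rather than the package's FpClassify.

```go
package main

import "fmt"

// classifyBits names the IEEE 754 class of a half-precision bit pattern.
// Hypothetical helper for illustration only.
func classifyBits(bits uint16) string {
	negative := bits&0x8000 != 0
	exp := (bits >> 10) & 0x1F
	frac := bits & 0x03FF

	var kind string
	switch {
	case exp == 0x1F && frac != 0:
		// NaN: the most significant mantissa bit marks a quiet NaN.
		if frac&0x0200 != 0 {
			return "quiet NaN"
		}
		return "signaling NaN"
	case exp == 0x1F:
		kind = "infinity"
	case exp == 0 && frac == 0:
		kind = "zero"
	case exp == 0:
		kind = "subnormal"
	default:
		kind = "normal"
	}
	if negative {
		return "negative " + kind
	}
	return "positive " + kind
}

func main() {
	for _, b := range []uint16{0x0000, 0x8000, 0x03FF, 0x3C00, 0xFC00, 0x7E00, 0x7D00} {
		fmt.Printf("0x%04X: %s\n", b, classifyBits(b))
	}
}
```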

type RoundingMode ¶

type RoundingMode int

RoundingMode controls how results are rounded during conversion/arithmetic

const (
	// Round to nearest, ties to even
	RoundNearestEven RoundingMode = iota
	// Round toward zero (truncate)
	RoundTowardZero
	// Round toward +Inf
	RoundTowardPositive
	// Round toward -Inf
	RoundTowardNegative
	// Round to nearest, ties away from zero
	RoundNearestAway
)
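
The five modes differ only in how they resolve values that fall between two representable results. As a rough illustration, the sketch below applies the same five rules when rounding a float64 to an integer, using standard library functions. The type mode, its constants, and roundToInteger are hypothetical local names; the package applies these rules at Float16 precision rather than at integer boundaries.

```go
package main

import (
	"fmt"
	"math"
)

// mode is a local stand-in for the five rounding rules.
type mode int

const (
	nearestEven mode = iota
	towardZero
	towardPositive
	towardNegative
	nearestAway
)

// roundToInteger rounds x to an integer value under the given rule.
func roundToInteger(x float64, m mode) float64 {
	switch m {
	case towardZero:
		return math.Trunc(x)
	case towardPositive:
		return math.Ceil(x)
	case towardNegative:
		return math.Floor(x)
	case nearestAway:
		return math.Round(x) // ties away from zero
	default:
		return math.RoundToEven(x) // ties to even
	}
}

func main() {
	names := []string{"nearest-even", "toward zero", "toward +Inf", "toward -Inf", "nearest-away"}
	for m, name := range names {
		fmt.Printf("%-12s  2.5 -> %g   -2.5 -> %g\n",
			name, roundToInteger(2.5, mode(m)), roundToInteger(-2.5, mode(m)))
	}
}
```

The tie values 2.5 and -2.5 are chosen because they are the inputs on which the two "nearest" modes disagree.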

type SliceStats ¶

type SliceStats struct {
	Min    Float16
	Max    Float16
	Sum    Float16
	Mean   Float16
	Length int
}

SliceStats holds basic statistics for a Float16 slice

func ComputeSliceStats ¶

func ComputeSliceStats(s []Float16) SliceStats

ComputeSliceStats calculates statistics for a Float16 slice

Directories ¶

Path	Synopsis
temp
