Documentation ¶
Overview ¶
Package float16 implements both 16-bit floating point data types:
- Float16: IEEE 754-2008 half-precision (1 sign, 5 exponent, 10 mantissa bits)
- BFloat16: "Brain Floating Point" format (1 sign, 8 exponent, 7 mantissa bits)
This implementation provides conversion between both types and other floating-point types (float32 and float64) with support for various rounding modes and error handling.
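The two layouts differ only in where the 15 non-sign bits are split between exponent and mantissa. A minimal sketch of that split follows; `fields` is a hypothetical helper written for this example (it is not part of the package) and works on raw `uint16` patterns.

```go
package main

import "fmt"

// fields splits a 16-bit pattern into sign, biased exponent, and mantissa.
// mantissaLen is 10 for the Float16 layout and 7 for the BFloat16 layout.
// Illustrative helper only; not part of the package API.
func fields(bits uint16, mantissaLen uint) (sign, exponent, mantissa uint16) {
	sign = bits >> 15
	exponent = (bits &^ 0x8000) >> mantissaLen
	mantissa = bits & (1<<mantissaLen - 1)
	return
}

func main() {
	// 0x3C00 encodes 1.0 in half precision; 0x3F80 encodes 1.0 in bfloat16.
	fmt.Println(fields(0x3C00, 10)) // 0 15 0
	fmt.Println(fields(0x3F80, 7))  // 0 127 0
}
```

Both results carry a biased exponent equal to the format's bias (15 and 127), which is how 1.0 is encoded in either layout.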
Special Values ¶
The Float16 type supports all IEEE 754-2008 value classes, including the special values:
- Positive and negative zero
- Positive and negative infinity
- Not-a-Number (NaN) values with payload
- Normalized numbers
- Subnormal (denormal) numbers
Subnormal Numbers ¶
When converting to higher-precision types (float32/float64), subnormal float16 values are preserved. However, when converting back from higher-precision types to float16, subnormal values may be rounded to the nearest representable normal float16 value. This behavior is consistent with many hardware implementations that handle subnormals in a similar way for performance reasons.
Rounding Modes ¶
The following rounding modes are supported for conversions:
- RoundNearestEven: Round to nearest, ties to even (default)
- RoundTowardZero: Round toward zero (truncate)
- RoundTowardPositive: Round toward positive infinity
- RoundTowardNegative: Round toward negative infinity
- RoundNearestAway: Round to nearest, ties away from zero
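The five modes differ only in what happens to the bits that are discarded when a wider mantissa is shortened. The following sketch shows that decision on an integer magnitude; the type `roundingMode`, its constants, and `roundShift` are local stand-ins for illustration and are not the package's RoundingMode or its internal rounding code.

```go
package main

import "fmt"

// roundingMode is a local stand-in for the package's RoundingMode.
type roundingMode int

const (
	nearestEven roundingMode = iota
	towardZero
	towardPositive
	towardNegative
	nearestAway
)

// roundShift drops the low `shift` bits of magnitude m and rounds the
// remaining bits according to mode. negative is the sign of the value
// whose magnitude is being rounded.
func roundShift(m uint32, shift uint, negative bool, mode roundingMode) uint32 {
	q := m >> shift
	rem := m & (1<<shift - 1)
	if rem == 0 {
		return q // exact: nothing to round
	}
	half := uint32(1) << (shift - 1)
	up := false
	switch mode {
	case nearestEven:
		up = rem > half || (rem == half && q&1 == 1)
	case nearestAway:
		up = rem >= half
	case towardPositive:
		up = !negative
	case towardNegative:
		up = negative
	case towardZero:
		up = false
	}
	if up {
		q++
	}
	return q
}

func main() {
	// 0b1010 with two bits dropped is the tie value 2.5.
	for mode := nearestEven; mode <= nearestAway; mode++ {
		fmt.Println(roundShift(0b1010, 2, false, mode))
	}
	// prints 2, 2, 3, 2, 3 (one per line)
}
```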
Error Handling ¶
Conversion functions with a ConversionMode parameter can return errors for:
- Overflow: When a value is too large to be represented
- Underflow: When a value is too small to be represented (in strict mode)
- Inexact: When rounding occurs (in strict mode)
See: http://en.wikipedia.org/wiki/Half-precision_floating-point_format
Index ¶
- Constants
- Variables
- func BFloat16Equal(a, b BFloat16) bool
- func BFloat16Greater(a, b BFloat16) bool
- func BFloat16GreaterEqual(a, b BFloat16) bool
- func BFloat16Less(a, b BFloat16) bool
- func BFloat16LessEqual(a, b BFloat16) bool
- func Configure(cfg *Config)
- func DebugInfo() map[string]interface{}
- func Equal(a, b Float16) bool
- func GetBenchmarkOperations() map[string]BenchmarkOperation
- func GetMemoryUsage() int
- func GetVersion() string
- func Greater(a, b Float16) bool
- func GreaterEqual(a, b Float16) bool
- func IsFinite(f Float16) bool
- func IsInf(f Float16, sign int) bool
- func IsNaN(f Float16) bool
- func IsNormal(f Float16) bool
- func IsSubnormal(f Float16) bool
- func Less(a, b Float16) bool
- func LessEqual(a, b Float16) bool
- func Signbit(f Float16) bool
- func ToSlice32(s []Float16) []float32
- func ToSlice64(s []Float16) []float64
- func ValidateSliceLength(a, b []Float16) error
- type ArithmeticMode
- type BFloat16
- func BFloat16Abs(b BFloat16) BFloat16
- func BFloat16Add(a, b BFloat16) BFloat16
- func BFloat16Div(a, b BFloat16) BFloat16
- func BFloat16FromBits(bits uint16) BFloat16
- func BFloat16FromFloat16(f Float16) BFloat16
- func BFloat16FromFloat32(f float32) BFloat16
- func BFloat16FromFloat32WithMode(f32 float32, convMode ConversionMode, roundMode RoundingMode) (BFloat16, error)
- func BFloat16FromFloat32WithRounding(f float32, mode RoundingMode) BFloat16
- func BFloat16FromFloat64(f float64) BFloat16
- func BFloat16FromFloat64WithMode(f64 float64, convMode ConversionMode, roundMode RoundingMode) (BFloat16, error)
- func BFloat16FromFloat64WithRounding(f float64, mode RoundingMode) BFloat16
- func BFloat16Max(a, b BFloat16) BFloat16
- func BFloat16Min(a, b BFloat16) BFloat16
- func BFloat16Mul(a, b BFloat16) BFloat16
- func BFloat16Neg(b BFloat16) BFloat16
- func BFloat16Sub(a, b BFloat16) BFloat16
- func (b BFloat16) Bits() uint16
- func (b BFloat16) Class() FloatClass
- func (b BFloat16) CopySign(s BFloat16) BFloat16
- func (b BFloat16) IsFinite() bool
- func (b BFloat16) IsInf(sign int) bool
- func (b BFloat16) IsNaN() bool
- func (b BFloat16) IsNormal() bool
- func (b BFloat16) IsSubnormal() bool
- func (b BFloat16) IsZero() bool
- func (b BFloat16) Signbit() bool
- func (b BFloat16) String() string
- func (b BFloat16) ToFloat16() Float16
- func (b BFloat16) ToFloat32() float32
- type BenchmarkOperation
- type Config
- type ConversionMode
- type ErrorCode
- type Float16
- func Abs(f Float16) Float16
- func Acos(f Float16) Float16
- func Add(a, b Float16) Float16
- func AddSlice(a, b []Float16) []Float16
- func AddWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)
- func Asin(f Float16) Float16
- func Atan(f Float16) Float16
- func Atan2(y, x Float16) Float16
- func Cbrt(f Float16) Float16
- func Ceil(f Float16) Float16
- func Clamp(f, min, max Float16) Float16
- func CopySign(f, sign Float16) Float16
- func Cos(f Float16) Float16
- func Cosh(f Float16) Float16
- func Dim(f, g Float16) Float16
- func Div(a, b Float16) Float16
- func DivSlice(a, b []Float16) []Float16
- func DivWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)
- func DotProduct(a, b []Float16) Float16
- func Erf(f Float16) Float16
- func Erfc(f Float16) Float16
- func Exp(f Float16) Float16
- func Exp2(f Float16) Float16
- func Exp10(f Float16) Float16
- func FastAdd(a, b Float16) Float16
- func FastMul(a, b Float16) Float16
- func Float16FromBFloat16(b BFloat16) Float16
- func Floor(f Float16) Float16
- func Frexp(f Float16) (frac Float16, exp int)
- func FromBits(b uint16) Float16
- func FromFloat32(f32 float32) Float16
- func FromFloat32WithRounding(f32 float32, mode RoundingMode) Float16
- func FromFloat64(f64 float64) Float16
- func FromFloat64WithMode(f64 float64, convMode ConversionMode, roundMode RoundingMode) (Float16, error)
- func FromInt(i int) Float16
- func FromInt32(i int32) Float16
- func FromInt64(i int64) Float16
- func FromSlice64(s []float64) []Float16
- func Gamma(f Float16) Float16
- func Hypot(f, g Float16) Float16
- func Inf(sign int) Float16
- func J0(f Float16) Float16
- func J1(f Float16) Float16
- func Ldexp(frac Float16, exp int) Float16
- func Lerp(a, b, t Float16) Float16
- func Lgamma(f Float16) (Float16, int)
- func Log(f Float16) Float16
- func Log2(f Float16) Float16
- func Log10(f Float16) Float16
- func Max(a, b Float16) Float16
- func Min(a, b Float16) Float16
- func Mod(f, divisor Float16) Float16
- func Modf(f Float16) (integer, frac Float16)
- func Mul(a, b Float16) Float16
- func MulSlice(a, b []Float16) []Float16
- func MulWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)
- func NaN() Float16
- func NextAfter(f, g Float16) Float16
- func Norm2(s []Float16) Float16
- func One() Float16
- func Parse(s string) (Float16, error)
- func ParseFloat(s string, precision int) (Float16, error)
- func Pow(f, exp Float16) Float16
- func Remainder(f, divisor Float16) Float16
- func Round(f Float16) Float16
- func RoundToEven(f Float16) Float16
- func ScaleSlice(s []Float16, scalar Float16) []Float16
- func Sign(f Float16) Float16
- func Sin(f Float16) Float16
- func Sinh(f Float16) Float16
- func Sqrt(f Float16) Float16
- func Sub(a, b Float16) Float16
- func SubSlice(a, b []Float16) []Float16
- func SubWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)
- func SumSlice(s []Float16) Float16
- func Tan(f Float16) Float16
- func Tanh(f Float16) Float16
- func ToFloat16(f64 float64) Float16
- func ToSlice16(s []float32) []Float16
- func ToSlice16WithMode(s []float32, convMode ConversionMode, roundMode RoundingMode) ([]Float16, []error)
- func Trunc(f Float16) Float16
- func VectorAdd(a, b []Float16) []Float16
- func VectorMul(a, b []Float16) []Float16
- func Y0(f Float16) Float16
- func Y1(f Float16) Float16
- func Zero() Float16
- func (f Float16) Abs() Float16
- func (f Float16) Bits() uint16
- func (f Float16) Class() FloatClass
- func (f Float16) CopySign(s Float16) Float16
- func (f Float16) GoString() string
- func (f Float16) IsFinite() bool
- func (f Float16) IsInf(sign int) bool
- func (f Float16) IsNaN() bool
- func (f Float16) IsNormal() bool
- func (f Float16) IsSubnormal() bool
- func (f Float16) IsZero() bool
- func (f Float16) Neg() Float16
- func (f Float16) Sign() int
- func (f Float16) Signbit() bool
- func (f Float16) String() string
- func (f Float16) ToBFloat16() BFloat16
- func (f Float16) ToFloat32() float32
- func (f Float16) ToFloat64() float64
- func (f Float16) ToInt() int
- func (f Float16) ToInt32() int32
- func (f Float16) ToInt64() int64
- type Float16Error
- type FloatClass
- type RoundingMode
- type SliceStats
Constants ¶
const (
	BFloat16SignMask     = 0x8000 // 0b1000000000000000 - Sign bit mask
	BFloat16ExponentMask = 0x7F80 // 0b0111111110000000 - Exponent bits mask
	BFloat16MantissaMask = 0x007F // 0b0000000001111111 - Mantissa bits mask
	BFloat16MantissaLen  = 7      // Number of mantissa bits
	BFloat16ExponentLen  = 8      // Number of exponent bits
	// Exponent bias and limits for BFloat16
	// bias = 2^(exponent_bits-1) - 1 = 2^7 - 1 = 127 (same as Float32)
	BFloat16ExponentBias = 127 // Bias for 8-bit exponent
	BFloat16ExponentMax  = 255 // Maximum exponent value
	BFloat16ExponentMin  = 0   // Minimum exponent value
	// Normalized exponent range
	BFloat16ExponentNormalMin = 1   // Minimum normalized exponent
	BFloat16ExponentNormalMax = 254 // Maximum normalized exponent (infinity at 255)
	// Special exponent values
	BFloat16ExponentZero     = 0   // Zero and subnormal numbers
	BFloat16ExponentInfinity = 255 // Infinity and NaN
)
BFloat16 format constants
const (
	Version      = "1.0.0"
	VersionMajor = 1
	VersionMinor = 0
	VersionPatch = 0
)
Package version information
const (
	SignMask     = 0x8000 // 0b1000000000000000 - Sign bit mask
	ExponentMask = 0x7C00 // 0b0111110000000000 - Exponent bits mask
	MantissaMask = 0x03FF // 0b0000001111111111 - Mantissa bits mask
	MantissaLen  = 10     // Number of mantissa bits
	ExponentLen  = 5      // Number of exponent bits
	// Exponent bias and limits for IEEE 754 half-precision
	// bias = 2^(exponent_bits-1) - 1 = 2^4 - 1 = 15
	ExponentBias = 15 // Bias for 5-bit exponent
	ExponentMax  = 31 // Maximum exponent value (11111 binary)
	ExponentMin  = 0  // Minimum exponent value
	// Normalized exponent range
	ExponentNormalMin = 1  // Minimum normalized exponent
	ExponentNormalMax = 30 // Maximum normalized exponent (infinity at 31)
	// Float32 constants for conversion
	Float32ExponentBias = 127 // IEEE 754 single precision bias
	Float32ExponentLen  = 8   // Float32 exponent bits
	Float32MantissaLen  = 23  // Float32 mantissa bits
	// Special exponent values
	ExponentZero     = 0  // Zero and subnormal numbers
	ExponentInfinity = 31 // Infinity and NaN
)
IEEE 754 half-precision format constants
Variables ¶
var (
	DefaultArithmeticMode = ModeIEEEArithmetic
	DefaultRounding       = DefaultRoundingMode
)
Global arithmetic settings
var (
	BFloat16Zero  = BFloat16PositiveZero
	BFloat16One   = BFloat16FromFloat32(1.0)
	BFloat16Two   = BFloat16FromFloat32(2.0)
	BFloat16Half  = BFloat16FromFloat32(0.5)
	BFloat16E     = BFloat16FromFloat32(float32(math.E))
	BFloat16Pi    = BFloat16FromFloat32(float32(math.Pi))
	BFloat16Sqrt2 = BFloat16FromFloat32(float32(math.Sqrt2))
)
Convenience constants for common BFloat16 values
var (
	DefaultConversionMode ConversionMode = ModeIEEE
	DefaultRoundingMode   RoundingMode   = RoundNearestEven
)
var (
	// Common integer values
	Zero16  = PositiveZero
	One16   = FromFloat32(1.0)
	Two16   = FromFloat32(2.0)
	Three16 = FromFloat32(3.0)
	Four16  = FromFloat32(4.0)
	Five16  = FromFloat32(5.0)
	Ten16   = FromFloat32(10.0)
	// Common fractional values
	Half16    = FromFloat32(0.5)
	Quarter16 = FromFloat32(0.25)
	Third16   = FromFloat32(1.0 / 3.0)
	// Special mathematical values
	NaN16  = QuietNaN
	PosInf = PositiveInfinity
	NegInf = NegativeInfinity
	// Commonly used constants
	Deg2Rad = FromFloat32(float32(math.Pi / 180.0)) // Degrees to radians
	Rad2Deg = FromFloat32(float32(180.0 / math.Pi)) // Radians to degrees
)
Constants for common values
var (
	E       = FromFloat32(float32(math.E))       // Euler's number
	Pi      = FromFloat32(float32(math.Pi))      // Pi
	Phi     = FromFloat32(float32(math.Phi))     // Golden ratio
	Sqrt2   = FromFloat32(float32(math.Sqrt2))   // Square root of 2
	SqrtE   = FromFloat32(float32(math.SqrtE))   // Square root of E
	SqrtPi  = FromFloat32(float32(math.SqrtPi))  // Square root of Pi
	SqrtPhi = FromFloat32(float32(math.SqrtPhi)) // Square root of Phi
	Ln2     = FromFloat32(float32(math.Ln2))     // Natural logarithm of 2
	Log2E   = FromFloat32(float32(math.Log2E))   // Base-2 logarithm of E
	Ln10    = FromFloat32(float32(math.Ln10))    // Natural logarithm of 10
	Log10E  = FromFloat32(float32(math.Log10E))  // Base-10 logarithm of E
)
Mathematical constants as Float16 values
Functions ¶
func BFloat16Equal ¶ added in v0.2.0
BFloat16Equal returns true if a equals b
func BFloat16Greater ¶ added in v0.2.0
BFloat16Greater returns true if a > b
func BFloat16GreaterEqual ¶ added in v0.2.0
BFloat16GreaterEqual returns true if a >= b
func BFloat16Less ¶ added in v0.2.0
BFloat16Less returns true if a < b
func BFloat16LessEqual ¶ added in v0.2.0
BFloat16LessEqual returns true if a <= b
func Configure ¶
func Configure(cfg *Config)
Configure applies the given configuration to the package
func DebugInfo ¶
func DebugInfo() map[string]interface{}
DebugInfo returns debugging information about the package state
func GetBenchmarkOperations ¶
func GetBenchmarkOperations() map[string]BenchmarkOperation
GetBenchmarkOperations returns a map of operations suitable for benchmarking
func GetMemoryUsage ¶
func GetMemoryUsage() int
GetMemoryUsage returns the current memory usage of the package in bytes
func IsInf ¶
IsInf reports whether f is an infinity, according to sign. If sign > 0, IsInf reports whether f is positive infinity. If sign < 0, IsInf reports whether f is negative infinity. If sign == 0, IsInf reports whether f is either infinity.
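On the half-precision bit patterns, the three cases reduce to simple comparisons against 0x7C00 (PositiveInfinity) and 0xFC00 (NegativeInfinity). A minimal sketch, using a hypothetical `isInfBits` helper on raw `uint16` values rather than the package's own IsInf:

```go
package main

import "fmt"

// isInfBits mirrors the documented sign convention on a raw
// half-precision bit pattern. Illustrative only.
func isInfBits(bits uint16, sign int) bool {
	switch {
	case sign > 0:
		return bits == 0x7C00 // +Inf only
	case sign < 0:
		return bits == 0xFC00 // -Inf only
	default:
		return bits&0x7FFF == 0x7C00 // either infinity: ignore the sign bit
	}
}

func main() {
	fmt.Println(isInfBits(0xFC00, 0))  // true
	fmt.Println(isInfBits(0xFC00, 1))  // false
	fmt.Println(isInfBits(0x7E00, 0))  // false: NaN has a non-zero mantissa
}
```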
func IsNormal ¶
IsNormal reports whether f is a normal number (not zero, subnormal, infinite, or NaN)
func IsSubnormal ¶
IsSubnormal reports whether f is a subnormal number
func ValidateSliceLength ¶
ValidateSliceLength checks if two slices have the same length
Types ¶
type ArithmeticMode ¶
type ArithmeticMode int
ArithmeticMode defines the precision/performance trade-off for arithmetic operations
const (
	// ModeIEEEArithmetic provides full IEEE 754 compliance with proper rounding
	ModeIEEEArithmetic ArithmeticMode = iota
	// ModeFastArithmetic optimizes for speed, may sacrifice some precision
	ModeFastArithmetic
	// ModeExactArithmetic provides exact results when possible, errors on precision loss
	ModeExactArithmetic
)
type BFloat16 ¶ added in v0.2.0
type BFloat16 uint16
BFloat16 represents a 16-bit "Brain Floating Point" format value. It is used by Google Brain, TensorFlow, and various ML frameworks. Format: 1 sign bit, 8 exponent bits, 7 mantissa bits.
const (
	BFloat16PositiveZero     BFloat16 = 0x0000 // +0.0
	BFloat16NegativeZero     BFloat16 = 0x8000 // -0.0
	BFloat16PositiveInfinity BFloat16 = 0x7F80 // +∞
	BFloat16NegativeInfinity BFloat16 = 0xFF80 // -∞
	BFloat16QuietNaN         BFloat16 = 0x7FC0 // Quiet NaN
	BFloat16SignalingNaN     BFloat16 = 0x7F81 // Signaling NaN
	// Largest finite and smallest normal values
	BFloat16MaxValue    BFloat16 = 0x7F7F // Largest positive normal
	BFloat16MinValue    BFloat16 = 0xFF7F // Largest negative normal (most negative)
	BFloat16SmallestPos BFloat16 = 0x0080 // Smallest positive normal
	BFloat16SmallestNeg BFloat16 = 0x8080 // Smallest negative normal
	// Smallest subnormal values
	BFloat16SmallestPosSubnormal BFloat16 = 0x0001 // Smallest positive subnormal
	BFloat16SmallestNegSubnormal BFloat16 = 0x8001 // Smallest negative subnormal
)
Special BFloat16 values
func BFloat16Abs ¶ added in v0.2.0
BFloat16Abs returns the absolute value of b
func BFloat16Add ¶ added in v0.2.0
BFloat16Add adds two BFloat16 values
func BFloat16Div ¶ added in v0.2.0
BFloat16Div divides two BFloat16 values
func BFloat16FromBits ¶ added in v0.2.0
BFloat16FromBits creates a BFloat16 from its bit representation
func BFloat16FromFloat16 ¶ added in v0.2.0
BFloat16FromFloat16 converts a Float16 to BFloat16
func BFloat16FromFloat32 ¶ added in v0.2.0
BFloat16FromFloat32 converts a float32 to BFloat16 using round-to-nearest-even. BFloat16 is essentially a truncated float32 (its upper 16 bits), so conversion is straightforward.
func BFloat16FromFloat32WithMode ¶ added in v0.2.0
func BFloat16FromFloat32WithMode(f32 float32, convMode ConversionMode, roundMode RoundingMode) (BFloat16, error)
BFloat16FromFloat32WithMode converts a float32 to BFloat16 with specified conversion and rounding modes.
func BFloat16FromFloat32WithRounding ¶ added in v0.2.0
func BFloat16FromFloat32WithRounding(f float32, mode RoundingMode) BFloat16
BFloat16FromFloat32WithRounding converts a float32 to BFloat16 with the specified rounding mode.
func BFloat16FromFloat64 ¶ added in v0.2.0
BFloat16FromFloat64 converts a float64 to BFloat16
func BFloat16FromFloat64WithMode ¶ added in v0.2.0
func BFloat16FromFloat64WithMode(f64 float64, convMode ConversionMode, roundMode RoundingMode) (BFloat16, error)
BFloat16FromFloat64WithMode converts a float64 to BFloat16 with specified conversion and rounding modes.
func BFloat16FromFloat64WithRounding ¶ added in v0.2.0
func BFloat16FromFloat64WithRounding(f float64, mode RoundingMode) BFloat16
BFloat16FromFloat64WithRounding converts a float64 to BFloat16 with the specified rounding mode.
func BFloat16Max ¶ added in v0.2.0
BFloat16Max returns the larger of a or b
func BFloat16Min ¶ added in v0.2.0
BFloat16Min returns the smaller of a or b
func BFloat16Mul ¶ added in v0.2.0
BFloat16Mul multiplies two BFloat16 values
func BFloat16Neg ¶ added in v0.2.0
BFloat16Neg returns the negation of b
func BFloat16Sub ¶ added in v0.2.0
BFloat16Sub subtracts two BFloat16 values
func (BFloat16) Class ¶ added in v0.2.0
func (b BFloat16) Class() FloatClass
Class returns the IEEE 754 classification of the BFloat16 value
func (BFloat16) CopySign ¶ added in v0.2.0
CopySign returns a value with the magnitude of b and the sign of s
func (BFloat16) IsSubnormal ¶ added in v0.2.0
IsSubnormal reports whether b is a subnormal number
func (BFloat16) IsZero ¶ added in v0.2.0
IsZero returns true if the BFloat16 is zero (positive or negative)
type BenchmarkOperation ¶
BenchmarkOperation represents a benchmarkable operation
type Config ¶
type Config struct {
	DefaultConversionMode ConversionMode
	DefaultRoundingMode   RoundingMode
	DefaultArithmeticMode ArithmeticMode
	EnableFastMath        bool
}
Package configuration
func DefaultConfig ¶
func DefaultConfig() *Config
DefaultConfig returns the default package configuration
type ConversionMode ¶
type ConversionMode int
ConversionMode controls error reporting behavior for conversions
const (
	// ModeIEEE performs IEEE-style conversion, saturating to Inf/0 with no errors
	ModeIEEE ConversionMode = iota
	// ModeStrict reports errors for NaN, Inf, overflow, and underflow
	ModeStrict
)
type ErrorCode ¶
type ErrorCode int
ErrorCode represents specific error categories for float16 operations
type Float16 ¶
type Float16 uint16
Float16 represents a 16-bit IEEE 754 half-precision floating-point value
const (
	PositiveZero     Float16 = 0x0000 // +0.0
	NegativeZero     Float16 = 0x8000 // -0.0
	PositiveInfinity Float16 = 0x7C00 // +∞
	NegativeInfinity Float16 = 0xFC00 // -∞
	// Largest finite values
	MaxValue Float16 = 0x7BFF // Largest positive finite value (65504)
	MinValue Float16 = 0xFBFF // Most negative finite value (-65504)
	// Smallest normalized positive value
	SmallestNormal Float16 = 0x0400 // 2^-14 = 6.103515625e-05
	// Largest subnormal value
	LargestSubnormal Float16 = 0x03FF // (1023/1024) * 2^-14 ≈ 6.097555161e-05
	// Smallest positive subnormal value
	SmallestSubnormal Float16 = 0x0001 // 2^-24 ≈ 5.960464478e-08
	// Common NaN representations
	QuietNaN     Float16 = 0x7E00 // Quiet NaN (most significant mantissa bit set)
	SignalingNaN Float16 = 0x7D00 // Signaling NaN
	NegativeQNaN Float16 = 0xFE00 // Negative quiet NaN
)
Special values following IEEE 754 half-precision standard
func AddWithMode ¶
func AddWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)
AddWithMode performs addition with specified arithmetic and rounding modes
func DivWithMode ¶
func DivWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)
DivWithMode performs division with specified arithmetic and rounding modes
func DotProduct ¶
DotProduct computes the dot product of two Float16 slices
func Float16FromBFloat16 ¶ added in v0.2.0
Float16FromBFloat16 converts a BFloat16 to Float16
func Frexp ¶
Frexp breaks f into a normalized fraction and an integral power of two. It returns frac and exp satisfying f == frac × 2^exp, with the absolute value of frac in the interval [0.5, 1) or zero.
func FromFloat32 ¶
FromFloat32 converts a float32 value to a Float16 value. It handles special cases like NaN, infinities, and zeros. The conversion follows IEEE 754-2008 rules for half-precision.
func FromFloat32WithRounding ¶ added in v0.2.0
func FromFloat32WithRounding(f32 float32, mode RoundingMode) Float16
FromFloat32WithRounding converts a float32 to Float16 using the provided rounding mode. It mirrors fromFloat32New but respects the explicit rounding mode instead of always rounding to nearest-even.
func FromFloat64 ¶
FromFloat64 converts a float64 value to a Float16 value. It handles special cases like NaN, infinities, and zeros.
func FromFloat64WithMode ¶
func FromFloat64WithMode(f64 float64, convMode ConversionMode, roundMode RoundingMode) (Float16, error)
FromFloat64WithMode converts a float64 to Float16 with specified conversion and rounding modes
func FromSlice64 ¶
FromSlice64 converts a slice of float64 to a slice of Float16
func Inf ¶
Inf returns a Float16 infinity value. If sign >= 0, it returns positive infinity. If sign < 0, it returns negative infinity.
func Modf ¶
Modf returns integer and fractional floating-point numbers that sum to f. Both values have the same sign as f.
func MulWithMode ¶
func MulWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)
MulWithMode performs multiplication with specified arithmetic and rounding modes
func NextAfter ¶
NextAfter returns the next representable Float16 value after f in the direction of g
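For finite values, adjacent half-precision numbers have adjacent bit patterns, so stepping to a neighbour is an increment or decrement of the raw bits. The sketch below covers only the upward direction (toward +Inf) and is written against raw `uint16` patterns; `nextUp` is a hypothetical helper, and the package's NextAfter may handle edge cases differently.

```go
package main

import "fmt"

// nextUp returns the bit pattern of the next half-precision value toward
// +Inf. NaN and +Inf are returned unchanged.
func nextUp(bits uint16) uint16 {
	switch {
	case bits&0x7FFF > 0x7C00, bits == 0x7C00: // NaN or +Inf
		return bits
	case bits&0x7FFF == 0: // +0 or -0: step to the smallest positive subnormal
		return 0x0001
	case bits&0x8000 == 0: // positive: the magnitude grows
		return bits + 1
	default: // negative: the magnitude shrinks
		return bits - 1
	}
}

func main() {
	fmt.Printf("%#x\n", nextUp(0x3C00)) // 0x3c01: first value above 1.0
	fmt.Printf("%#x\n", nextUp(0x7BFF)) // 0x7c00: MaxValue steps to +Inf
}
```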
func Parse ¶
Parse converts a string to a Float16 value. This is a simplified implementation for testing.
func ParseFloat ¶ added in v0.2.0
ParseFloat converts a string to a Float16 value. The precision parameter is ignored for Float16. It returns the Float16 value and an error if the string cannot be parsed.
func RoundToEven ¶
RoundToEven returns the nearest integer value to f, rounding ties to even
func ScaleSlice ¶
ScaleSlice multiplies each element in the slice by a scalar
func SubWithMode ¶
func SubWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)
SubWithMode performs subtraction with specified arithmetic and rounding modes
func ToFloat16 ¶
ToFloat16 converts a float64 to a Float16 value. This is a convenience wrapper used in tests and utilities.
func ToSlice16 ¶
ToSlice16 converts a slice of float32 to a slice of Float16. This is a convenience wrapper used in tests and utilities.
func ToSlice16WithMode ¶
func ToSlice16WithMode(s []float32, convMode ConversionMode, roundMode RoundingMode) ([]Float16, []error)
ToSlice16WithMode converts a slice of float32 to Float16 with specified modes
func VectorAdd ¶
VectorAdd performs vectorized addition (placeholder for future SIMD implementation)
func VectorMul ¶
VectorMul performs vectorized multiplication (placeholder for future SIMD implementation)
func (Float16) Class ¶
func (f Float16) Class() FloatClass
Class returns the IEEE 754 classification of the value
func (Float16) IsFinite ¶
IsFinite returns true if the Float16 value is finite (not infinity or NaN)
func (Float16) IsInf ¶
IsInf returns true if the Float16 value represents infinity. If sign > 0, it returns true only for positive infinity. If sign < 0, it returns true only for negative infinity. If sign == 0, it returns true for either infinity.
func (Float16) IsNormal ¶
IsNormal returns true if the Float16 value is normalized (not zero, subnormal, infinite, or NaN)
func (Float16) IsSubnormal ¶
IsSubnormal returns true if the Float16 value is subnormal (denormalized)
func (Float16) IsZero ¶
IsZero returns true if the Float16 value represents zero (positive or negative)
func (Float16) Sign ¶
Sign returns the sign of the Float16 value: 1 for positive, -1 for negative, 0 for zero
func (Float16) ToBFloat16 ¶ added in v0.2.0
ToBFloat16 converts a Float16 to BFloat16
func (Float16) ToFloat32 ¶
ToFloat32 converts a Float16 value to a float32 value. It handles special cases like NaN, infinities, and zeros.
func (Float16) ToFloat64 ¶
ToFloat64 converts a Float16 value to a float64 value. It handles special cases like NaN, infinities, and zeros.
type Float16Error ¶
Float16Error provides detailed error information for float16 operations
func (*Float16Error) Error ¶
func (e *Float16Error) Error() string
type FloatClass ¶
type FloatClass int
FloatClass enumerates the IEEE 754 classification of a Float16 value
const (
	ClassPositiveZero FloatClass = iota
	ClassNegativeZero
	ClassPositiveSubnormal
	ClassNegativeSubnormal
	ClassPositiveNormal
	ClassNegativeNormal
	ClassPositiveInfinity
	ClassNegativeInfinity
	ClassQuietNaN
	ClassSignalingNaN
)
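Each class follows from the exponent and mantissa fields of the bit pattern. As a rough illustration, the sketch below classifies a raw half-precision pattern and returns a descriptive string instead of a FloatClass; `classify` is a hypothetical helper, and treating the top mantissa bit as the quiet/signaling flag is the usual IEEE 754-2008 convention, assumed here to match the package.

```go
package main

import "fmt"

// classify names the IEEE 754 class of a raw half-precision bit pattern.
func classify(bits uint16) string {
	exp := (bits & 0x7C00) >> 10
	man := bits & 0x03FF
	var name string
	switch {
	case exp == 31 && man != 0:
		// NaNs are reported without a sign, as in the class list above.
		if man&0x0200 != 0 {
			return "quiet NaN"
		}
		return "signaling NaN"
	case exp == 31:
		name = "infinity"
	case exp == 0 && man == 0:
		name = "zero"
	case exp == 0:
		name = "subnormal"
	default:
		name = "normal"
	}
	if bits&0x8000 != 0 {
		return "negative " + name
	}
	return "positive " + name
}

func main() {
	fmt.Println(classify(0x3C00)) // positive normal
	fmt.Println(classify(0x8001)) // negative subnormal
	fmt.Println(classify(0x7D00)) // signaling NaN
}
```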
type RoundingMode ¶
type RoundingMode int
RoundingMode controls how results are rounded during conversion/arithmetic
const (
	// Round to nearest, ties to even
	RoundNearestEven RoundingMode = iota
	// Round toward zero (truncate)
	RoundTowardZero
	// Round toward +Inf
	RoundTowardPositive
	// Round toward -Inf
	RoundTowardNegative
	// Round to nearest, ties away from zero
	RoundNearestAway
)
type SliceStats ¶
SliceStats holds basic statistics for a Float16 slice
func ComputeSliceStats ¶
func ComputeSliceStats(s []Float16) SliceStats
ComputeSliceStats calculates statistics for a Float16 slice