Everyone uses math libraries (i.e., libm), which provide approximations for elementary functions. Surprisingly (or not), mainstream math libraries (i.e., Intel’s and GCC’s libm) do not produce correct results for several thousands of inputs! Developers of modern software systems are seldom aware of them, which affects the reproducibility and portability of software systems. This blog post describes our new method for synthesizing elementary functions that produces correct results for all inputs but is also significantly faster than the state of the art libraries, which have been optimized for decades.
Elementary functions are functions of a single variable that are defined as performing addition, multiplication, and composition of finitely many polynomials, logarithmic functions, and trigonometric functions. The result of an elementary function is typically an irrational value. Hence, there are two main challenges in approximating them. First, developing polynomial approximations, which when evaluated with real numbers, minimize the error with respect to an elementary function is challenging (i.e., there is inevitably some approximation error). Second, software and hardware implementations use finite-precision representations (e.g., floating point (FP)) and evaluating any polynomial approximation will experience additional rounding error. It is challenging to determine the amount of precision required to produce correct results for all inputs (i.e., the table-maker’s dilemma). Although there is a large body of seminal work on approximating elementary functions (see Jean-Michel Muller’s book on Elementary Functions for a survey), the standards still recommend correct results for elementary functions but do not require them.
The overall goal of our RLIBM project is to make correctly rounded elementary functions mandatory rather than a recommendation by the standards (at least for 32 bits or lower). In our RLIBM project, we have been making a case for generating polynomials that approximate the correctly rounded result of f(x) rather than the real value of f(x) because our goal is to generate correct, yet efficient implementations. By approximating the correctly rounded result, we consider both approximation errors (i.e., polynomial approximation) and rounding errors (i.e., with polynomial evaluation with finite precision representations), which enables the generation of correct results. One can use existing high-precision libraries, which are slow, to generate oracle correctly rounded results. Now our goal is to generate efficient polynomials that produce these correctly rounded results for all inputs. Given a correctly rounded result, there is an interval of real values around the correct result that the polynomial can produce, which still rounds to the correct result. This interval is the amount of freedom available that our polynomial generator has for a given input. Using this interval, our idea is to structure the problem of polynomial generation that produces correct results for all inputs as a linear programming (LP) problem. To scale to representations with a large number of inputs, we propose counterexample guided polynomial generation, generate piecewise polynomials rather than a single polynomial, and develop techniques that support various complex range reductions. More detailed description of our techniques can be found in our PLDI 2021 distinguished paper.
What is the Correctly Rounded Result of f(x)?
The 32-bit float type (or any finite precision representation) can only represent a finite number of real values. The result of an elementary function (e.g., ln(x)) is typically an irrational value. If the real value of f(x) cannot be exactly represented as a float value, it is rounded to one of the adjacent float values. Consider the real number line below:
The values v1, v2, and v3 represent three adjacent float values. When the result of an elementary function (i.e., ln(x1) is the blue circle) is not exactly representable in float, it is rounded to a float value according to the rounding mode. Using the default IEEE-754 rounding mode, round-to-nearest-tie-goes-to-even, ln(x1) rounds to v1. Hence, v1 is called the correctly rounded result of ln(x1). An approximation function that produces the correctly rounded results of an elementary function f(x) for all inputs in the target representation is called a correctly rounded implementation.
Prior Approaches for Approximating f(x)
Mainstream math libraries approximate the real value of f(x) using mini-max polynomials, which minimize the maximum error of the approximating polynomial (i.e., when evaluated with real numbers) with respect to the real valued elementary function. The Remez algorithm is a widely used algorithm to generate such mini-max polynomials. The minimum error of such a mini-max polynomial needs to be small enough to produce correctly rounded results for all inputs. This is especially challenging when the real value of f(x1) is close to the rounding boundary of two float values:
In the figure above, the grey region represents the minimum error that the mini-max polynomial can exhibit while still producing the correct result. Even a small amount of error either in polynomial approximation or in evaluation of the polynomial (i.e., P(x1)) can lead to an incorrect result (i.e., v2).
In our RLIBM project, we have been making a case for approximating the correctly rounded result of f(x) rather than the real value of f(x). The key insight is that there is a range of values in the vicinity of the correctly rounded result (i.e., v1) where all values in the interval round to v1:
As long as our polynomial approximation produces a value (i.e., the white circle) in this interval, the result of the polynomial will produce the correctly rounded result. We call this interval the rounding interval. As you can see from the above figure, the amount of freedom provided by approximating the correctly rounded result is much larger than mini-max polynomials that approximate the real value of f(x).
Using this insight, the RLIBM approach generates polynomial approximations by first computing the correctly rounded result of f(x) for each input x (black dots in the above graph). Next, it identifies the rounding interval for each correctly rounded result. The rounding interval [li, hi] for each input xi defines a constraint on the output of the polynomial P(x) that we wish to generate. If P(xi) for a given input xi produces a value in [li, hi], then the result will round to the correctly rounded result of f(xi). Now, we are interested in identifying a polynomial of degree d that satisfies the following constraint for each input xi:
We encode these constraints into a system of linear inequalities and use an LP solver to solve for the coefficients of the polynomial. This basic RLIBM approach has been successful in generating correctly rounded approximation functions for 16-bit representations (i.e., bfloat16 and posit16).
Scaling the RLIBM Approach to 32-bit Representations
Widely used 32-bit representations (i.e., float) have four billion inputs. LP solvers cannot handle billions of constraints. Additionally, even if it is possible to generate a polynomial that produces the correctly rounded results for all 32-bit inputs, this polynomial will likely have a high degree, which may not be ideal for performance. To address these challenges, we generate piecewise polynomials and use counterexample guided polynomial generation. Our PLDI 2021 paper provides more detail on these techniques.
Instead of trying to generate a high degree polynomial, we generate piecewise polynomials. Each component of the piecewise polynomial will have a lower degree resulting in faster performance with polynomial evaluation.
To generate piecewise polynomials, we split the input domain into multiple subdomains and generate a polynomial that produces the correctly rounded results for all inputs in each subdomain. To efficiently identify subdomains and choose the appropriate polynomial to evaluate given an input, we split the domain based on the bit-pattern of the input. This strategy enables us to identify the appropriate polynomial to use with an and and a shift operation on the input in double precision.
Even after splitting the input domain into smaller subdomains, each subdomain may still contain millions of inputs. To scale the search for polynomials, we sample a small fraction of the inputs and their corresponding rounding intervals. We generate a candidate polynomial that satisfies the constraints in the sample. Next, we check whether the candidate polynomial satisfies all constraints in the subdomain, adding any counterexample to the sample. We repeat this process until the candidate polynomial satisfies all constraints in the subdomain or the LP solver determines that it is not possible to generate a polynomial that satisfies all constraints in the sample. We call this technique counterexample guided polynomial generation, an allusion to counterexample guided program synthesis. Our intuition is that it is not necessary to reason about every input and its corresponding rounding interval. Rather, only the most constrained rounding intervals are necessary to generate correctly rounded polynomial approximations.
Using this approach, we have created RLIBM-32, a library containing implementations of several elementary functions that produce correctly rounded results for all inputs for the 32-bit float type and the 32-bit posit type. The table below reports the list of elementary functions for the 32-bit float available in RLIBM-32. The tick mark indicates that the particular implementation produces the correct result for all inputs. The x mark indicates that a particular library does not produce correct results for all inputs. In such cases, we also provide the number of wrong results.
All functions in RLIBM-32 produce the correctly rounded results for all float inputs. In contrast, mainstream math libraries (i.e., Intel’s and glibc’s libm) do not produce correctly rounded results for all inputs. Even repurposing a correctly rounded math library for the double type (i.e., CR-LIBM) does not produce correctly rounded float results for all inputs, due to double rounding errors.
The functions in RLIBM-32 are significantly faster than the state of the art. The above graphs show the speedup of RLIBM-32’s float functions compared to glibc’s and Intel’s libm. On average, RLIBM-32 has 1.1x and 1.2x speedup over glibc’s float and double functions. RLIBM-32 has 1.5x and 1.6x speedup over Intel’s float and double functions. Overall, RLIBM-32 not only produces correctly rounded results for all inputs but it is faster than mainstream math libraries, which have been optimized for decades.
The RLIBM project advocates approximating the correctly rounded result rather than the real value of f(x), which enables us to structure the problem of generating correctly rounded results as a linear programming problem. To make the LP formulation feasible for 32-bit representations, we generate piecewise polynomials and use counterexample guided polynomial generation. The resulting elementary functions are fast and produce correctly rounded results for all 32-bit inputs. The overall focus of the RLIBM project is to make correctly rounded elementary functions mandatory rather than a recommendation by the standards. The RLIBM project page provides links to papers, talks, and implementations for various representations.
Acknowledgments: We thank Adrian Sampson for his feedback on the drafts of this article. We also thank collaborators Mridul Aanjaneya and John Gustafson for their inputs to the the RLIBM project.
Disclaimer: These posts are written by individual contributors to share their thoughts on the SIGPLAN blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGPLAN or its parent organization, ACM.