<p>Bernardt Duvenhage’s Blog: saving the world one line of code at a time, a blog on ‘efficiency with algorithms, performance with data structures’ and things I don’t (didn’t?) know.</p>
<h1 id="generating-equidistant-points-on-a-sphere">Generating Equidistant Points on a Sphere (2019-07-31)</h1>
<p>In this post I’ll revive work I did during my PhD to generate 3D points that are equally spaced on the unit sphere. Such equidistant points are useful for many operations over the sphere, as well as for properly tessellating it. The method is based on a spiral walk of the spherical surface in angular increments equal to the golden angle, which is related to the golden ratio.</p>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-MML-AM_CHTML" async=""></script>
<p>Two quantities are in the golden ratio, <script type="math/tex">\varphi</script>, if their ratio is the same as the ratio of their sum to the larger of the two quantities. <script type="math/tex">\frac{a}{b} = \frac{a+b}{a} := \varphi</script> which is approximately 1.6180339887… The golden angle <script type="math/tex">\vartheta</script> is the angle subtended by the small arc <script type="math/tex">b</script> which is approximately 2.3999632297… radians or 137.5077640500… degrees.</p>
<p><img src="/assets/images/golden_angle.jpg" width="240" /></p>
<p>The ratios between consecutive Fibonacci numbers approach the golden ratio. Also, an alternative way of expressing the Fibonacci sequence is <script type="math/tex">F\left(n\right)=\frac{\varphi^{n} - (1-\varphi)^{n}}{\sqrt{5}}</script>. #mindblown 🤯. The spiral walk discussed here is therefore often referred to as a spherical Fibonacci lattice or a Fibonacci spiral sphere.</p>
<p>I originally implemented the method presented here for a paper on <a href="https://dl.acm.org/citation.cfm?id=2513499">Numerical Verification of Bidirectional Reflectance Distribution Functions for Physical Plausibility</a>. A pre-print of the paper is available from <a href="https://www.researchgate.net/publication/259885429_Numerical_Verification_of_Bidirectional_Reflectance_Distribution_Functions_for_Physical_Plausibility">ResearchGate</a> and via my <a href="https://scholar.google.com/citations?user=jqhH0o4AAAAJ">Google Scholar page</a>. The paper also discusses an alternative method based on subdivision of a 20-sided regular icosahedron.</p>
<h2 id="the-fibonacci-spiral-disc">The Fibonacci Spiral Disc</h2>
<p>The Fibonacci spiral is a way of stepping around a circle to generate angular positions with limited repeated structure in the sequence. The step size is equal to the golden angle. Due to the properties of the golden angle, if one were to create a histogram of the angles generated by this method, the angle bins would always be approximately equally filled.</p>
<p>Using <a href="https://www.sciencedirect.com/science/article/abs/pii/0025556479900804?via%3Dihub">Vogel’s method</a>, one can combine this property with an increasing radius <script type="math/tex">r</script> to distribute points on a 2D disc. An even distribution of points over the different radii of the disc is ensured by setting <script type="math/tex">r = k\sqrt{i}</script>, for <script type="math/tex">i</script> the index of the point being generated and <script type="math/tex">k</script> inversely proportional to the overall density of the points. Due to this relationship between the radius and the point’s index, the disc’s surface area correctly grows in proportion to the number of points.</p>
<p>Putting this together, <script type="math/tex">P_i = (k\sqrt{i}, i\vartheta)</script> is the (radius, angle) polar coordinate of point <script type="math/tex">i</script>, for <script type="math/tex">i > 0</script>.</p>
<p><img src="/assets/images/fibonacci_spiral_disc_10.jpg" width="320" />
<img src="/assets/images/fibonacci_spiral_disc_5.jpg" width="320" /></p>
<p>Shown above is a spiral disc with 500 points for <script type="math/tex">k = 10.0</script> (left) and <script type="math/tex">k = 5.0</script> (right). A characteristic of this method is that there seems to be a space in the centre for another point.</p>
<p>The C++ code to generate the points on the disc is:</p>
<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">Vec2</span><span class="o">></span> <span class="n">fibonacci_spiral_disc</span><span class="p">(</span><span class="k">const</span> <span class="kt">int</span> <span class="n">num_points</span><span class="p">,</span> <span class="k">const</span> <span class="kt">double</span> <span class="n">k</span><span class="p">)</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">Vec2</span><span class="o">></span> <span class="n">vectors</span><span class="p">;</span>
<span class="n">vectors</span><span class="p">.</span><span class="n">reserve</span><span class="p">(</span><span class="n">num_points</span><span class="p">);</span>
<span class="k">const</span> <span class="kt">double</span> <span class="n">gr</span><span class="o">=</span><span class="p">(</span><span class="n">sqrt</span><span class="p">(</span><span class="mf">5.0</span><span class="p">)</span> <span class="o">+</span> <span class="mf">1.0</span><span class="p">)</span> <span class="o">/</span> <span class="mf">2.0</span><span class="p">;</span> <span class="c1">// golden ratio = 1.6180339887498948482
</span> <span class="k">const</span> <span class="kt">double</span> <span class="n">ga</span><span class="o">=</span><span class="p">(</span><span class="mf">2.0</span> <span class="o">-</span> <span class="n">gr</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="mf">2.0</span><span class="o">*</span><span class="n">M_PI</span><span class="p">);</span> <span class="c1">// golden angle = 2.39996322972865332
</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o"><=</span> <span class="n">num_points</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="k">const</span> <span class="kt">double</span> <span class="n">r</span> <span class="o">=</span> <span class="n">sqrt</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="o">*</span> <span class="n">k</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">double</span> <span class="n">theta</span> <span class="o">=</span> <span class="n">ga</span> <span class="o">*</span> <span class="n">i</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">double</span> <span class="n">x</span> <span class="o">=</span> <span class="n">cos</span><span class="p">(</span><span class="n">theta</span><span class="p">)</span> <span class="o">*</span> <span class="n">r</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">double</span> <span class="n">y</span> <span class="o">=</span> <span class="n">sin</span><span class="p">(</span><span class="n">theta</span><span class="p">)</span> <span class="o">*</span> <span class="n">r</span><span class="p">;</span>
<span class="n">vectors</span><span class="p">.</span><span class="n">emplace_back</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">vectors</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
<h2 id="the-fibonacci-spiral-sphere">The Fibonacci Spiral Sphere</h2>
<p>One can use a similar spiral method to also distribute points on a sphere. To evenly distribute the points, proportionally more turns are allocated to the larger circles on the sphere. If the spiral starts at a sphere’s pole then the radius and circumference of the spiral at any point are proportional to <script type="math/tex">\cos(lat)</script> for <script type="math/tex">lat</script> in <script type="math/tex">[-\frac{\pi}{2}, \frac{\pi}{2}]</script>. The density of the turns of the spiral is therefore also proportional to <script type="math/tex">\cos(lat)</script>, and the cumulative distribution function (CDF) of the turns is proportional to <script type="math/tex">\sin(lat)+1</script>.</p>
<p>Given this CDF, the point index <script type="math/tex">i</script> of a latitude can be calculated with <script type="math/tex">i = \frac{N+1}{2} (\sin(lat)+1)</script> for <script type="math/tex">N</script> the total number of points required. Then taking the inverse gives, <script type="math/tex">lat = \arcsin(i\frac{2}{N+1} - 1)</script>.</p>
<p>Wrapping this up, <script type="math/tex">P_i = (\arcsin(i\frac{2}{N+1} - 1), i\vartheta)</script> is the latitude and longitude spherical polar coordinate of point <script type="math/tex">i</script> for <script type="math/tex">i</script> in <script type="math/tex">[1,N]</script>. Notice that the longitude component of the coordinate is the same as for the disc, but the disc’s radius component has been adapted to a latitude component for the sphere of <script type="math/tex">N</script> points.</p>
<p><img src="/assets/images/fibomesh.jpg" width="320" />
<img src="/assets/images/fibogeodual.jpg" width="320" /></p>
<p>Shown above is a tessellated Fibonacci spiral sphere with 162 points (on the left) and its geometric dual (on the right). The geometric dual shows the shapes of the spaces around each of the equidistant points. Note that the points are generated in latitude order. To tessellate the sphere one still needs to apply a Delaunay or similar triangulation algorithm.</p>
<p>The C++ code to generate the points on the sphere is:</p>
<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">Vec3</span><span class="o">></span> <span class="n">fibonacci_spiral_sphere</span><span class="p">(</span><span class="k">const</span> <span class="kt">int</span> <span class="n">num_points</span><span class="p">)</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">Vec3</span><span class="o">></span> <span class="n">vectors</span><span class="p">;</span>
<span class="n">vectors</span><span class="p">.</span><span class="n">reserve</span><span class="p">(</span><span class="n">num_points</span><span class="p">);</span>
<span class="k">const</span> <span class="kt">double</span> <span class="n">gr</span><span class="o">=</span><span class="p">(</span><span class="n">sqrt</span><span class="p">(</span><span class="mf">5.0</span><span class="p">)</span> <span class="o">+</span> <span class="mf">1.0</span><span class="p">)</span> <span class="o">/</span> <span class="mf">2.0</span><span class="p">;</span> <span class="c1">// golden ratio = 1.6180339887498948482
</span> <span class="k">const</span> <span class="kt">double</span> <span class="n">ga</span><span class="o">=</span><span class="p">(</span><span class="mf">2.0</span> <span class="o">-</span> <span class="n">gr</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="mf">2.0</span><span class="o">*</span><span class="n">M_PI</span><span class="p">);</span> <span class="c1">// golden angle = 2.39996322972865332
</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o"><=</span> <span class="n">num_points</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="k">const</span> <span class="kt">double</span> <span class="n">lat</span> <span class="o">=</span> <span class="n">asin</span><span class="p">(</span><span class="o">-</span><span class="mf">1.0</span> <span class="o">+</span> <span class="mf">2.0</span> <span class="o">*</span> <span class="kt">double</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="n">num_points</span><span class="o">+</span><span class="mi">1</span><span class="p">));</span>
<span class="k">const</span> <span class="kt">double</span> <span class="n">lon</span> <span class="o">=</span> <span class="n">ga</span> <span class="o">*</span> <span class="n">i</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">double</span> <span class="n">x</span> <span class="o">=</span> <span class="n">cos</span><span class="p">(</span><span class="n">lon</span><span class="p">)</span><span class="o">*</span><span class="n">cos</span><span class="p">(</span><span class="n">lat</span><span class="p">);</span>
<span class="k">const</span> <span class="kt">double</span> <span class="n">y</span> <span class="o">=</span> <span class="n">sin</span><span class="p">(</span><span class="n">lon</span><span class="p">)</span><span class="o">*</span><span class="n">cos</span><span class="p">(</span><span class="n">lat</span><span class="p">);</span>
<span class="k">const</span> <span class="kt">double</span> <span class="n">z</span> <span class="o">=</span> <span class="n">sin</span><span class="p">(</span><span class="n">lat</span><span class="p">);</span>
<span class="n">vectors</span><span class="p">.</span><span class="n">emplace_back</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">z</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">vectors</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
<h2 id="summary">Summary</h2>
<p>The Fibonacci lattice/spiral is a simple and efficient method for generating equidistant points on the unit sphere. The golden angle is approximately 2.400 radians or 137.5 degrees, so each turn of the spiral walk adds two or three points to the sphere (<script type="math/tex">2\pi / 2.4 \approx 2.6</script> points per turn). To tessellate the sphere one still needs to apply a Delaunay or similar triangulation algorithm.</p>
<p>Full <a href="https://github.com/bduvenhage/Bits-O-Cpp/tree/master/geometry">source</a> for generating the equidistant points is available in my <a href="https://github.com/bduvenhage/Bits-O-Cpp">Bits-O-Cpp</a> GitHub repo.</p>
<h1 id="the-high-performance-time-stamp-counter">The High Performance Time-Stamp Counter (2019-06-22)</h1>
<p>Your computer has a high performance Time-Stamp Counter (TSC) that increments at a rate similar to the CPU clock. On modern processors this counter increments at a constant rate and may be used as a wall clock timer. The benefit of the TSC compared to the Linux system timeofday or clock_gettime functions, for example, is that the TSC takes only a few clock cycles to read.</p>
<p>From Section 17.15 <em>Time Stamp Counter</em> of the ‘Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B’: “Constant TSC behaviour ensures that the duration of each clock tick is uniform and supports the use of the TSC as a wall clock timer even if the processor core changes frequency. This is the architectural behaviour moving forward.”</p>
<p>So the architectural behaviour, now and moving forward, is that the TSC increments at a constant rate. This holds even when the actual CPU clock rate drops to save power or increases during turbo boost. Note that on certain processors the TSC frequency may not be the same as the CPU frequency reported in the brand string.</p>
<p>On virtual hosts, modern processors also have your back with two features called <em>TSC offsetting</em> and <em>TSC scaling</em>. Virtualisation software can appropriately set the scale and offset of the TSC as read by guest software, so that the guest wouldn’t notice being migrated from one platform to another. Intel has a ‘Timestamp-Counter Scaling for Virtualization White Paper’ that you can read for more info. I’m not sure how widely modern processors have adopted scaling yet, but offsetting seems pretty standard.</p>
<p>I originally investigated and used the TSC to reduce my timing overhead on TopCoder’s marathon match platform. The timeofday operation was extremely slow on the platform and would take 130ms compared to around 30ns locally. This was probably due to the anti-cheating measures they put in place. In these situations having an alternative timing operation that takes less than 10ns is useful.</p>
<h2 id="the-rdtsc-and-rdtscp-instructions">The RDTSC and RDTSCP instructions</h2>
<p>The RDTSC and RDTSCP instructions can be used by user-mode / guest code to read the TSC. These instructions read the 64-bit TSC value into EDX:EAX. The high-order 32 bits of each of RAX and RDX are cleared.</p>
<p>The RDTSC instruction can be called as shown below:</p>
<figure class="highlight"><pre><code class="language-c--" data-lang="c++"> <span class="n">ALWAYS_INLINE</span> <span class="kt">uint64_t</span> <span class="nf">_get_tsc_ticks_since_reset</span><span class="p">()</span> <span class="p">{</span>
<span class="kt">uint32_t</span> <span class="n">countlo</span><span class="p">,</span> <span class="n">counthi</span><span class="p">;</span>
<span class="n">__asm__</span> <span class="k">volatile</span> <span class="p">(</span><span class="s">"RDTSC"</span> <span class="o">:</span> <span class="s">"=a"</span> <span class="p">(</span><span class="n">countlo</span><span class="p">),</span> <span class="s">"=d"</span> <span class="p">(</span><span class="n">counthi</span><span class="p">));</span>
<span class="k">return</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">(</span><span class="n">counthi</span><span class="p">)</span> <span class="o"><<</span> <span class="mi">32</span><span class="p">)</span> <span class="o">|</span> <span class="n">countlo</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
<p>The RDTSC instruction is not a serializing instruction: it does not wait until all previous instructions have been executed before reading the counter, and subsequent instructions may begin execution before the read operation is performed. This means that without adding fences the measurement might be contaminated by instructions that fall before and after RDTSC within the out-of-order window of the processor, which can bias the measurement and add variance of a few cycles. Adding fences improves the accuracy of the timing, but at the cost of some measurement overhead.</p>
<p>The RDTSCP instruction can be called as shown below:</p>
<figure class="highlight"><pre><code class="language-c--" data-lang="c++"> <span class="n">ALWAYS_INLINE</span> <span class="kt">uint64_t</span> <span class="nf">_get_tsc_ticks_since_reset_p</span><span class="p">(</span><span class="kt">int</span> <span class="o">&</span><span class="n">chip</span><span class="p">,</span> <span class="kt">int</span> <span class="o">&</span><span class="n">core</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">uint32_t</span> <span class="n">countlo</span><span class="p">,</span> <span class="n">counthi</span><span class="p">;</span>
<span class="kt">uint32_t</span> <span class="n">chx</span><span class="p">;</span> <span class="c1">// Set to processor signature register - set to chip/socket & core ID by recent Linux kernels.
</span>
<span class="n">__asm__</span> <span class="k">volatile</span><span class="p">(</span><span class="s">"RDTSCP"</span> <span class="o">:</span> <span class="s">"=a"</span> <span class="p">(</span><span class="n">countlo</span><span class="p">),</span> <span class="s">"=d"</span> <span class="p">(</span><span class="n">counthi</span><span class="p">),</span> <span class="s">"=c"</span> <span class="p">(</span><span class="n">chx</span><span class="p">));</span>
<span class="n">chip</span> <span class="o">=</span> <span class="p">(</span><span class="n">chx</span> <span class="o">&</span> <span class="mh">0xFFF000</span><span class="p">)</span><span class="o">>></span><span class="mi">12</span><span class="p">;</span>
<span class="n">core</span> <span class="o">=</span> <span class="p">(</span><span class="n">chx</span> <span class="o">&</span> <span class="mh">0xFFF</span><span class="p">);</span>
<span class="k">return</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">(</span><span class="n">counthi</span><span class="p">)</span> <span class="o"><<</span> <span class="mi">32</span><span class="p">)</span> <span class="o">|</span> <span class="n">countlo</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
<p>The RDTSCP instruction also returns the processor ID (chip and core) that the instruction was executed on. RDTSCP does wait until all previous instructions have executed and all previous loads are globally visible, but it still does not wait for previous stores to be globally visible and subsequent instructions may begin execution before the read operation is performed.</p>
<p>On a Macbook Intel Core i5-5287U (06_3D) @ 2.90GHz the RDTSC and RDTSCP instructions seem to take around 23 cycles or 8 ns. RDTSC(P) are therefore fairly long latency instructions. The ‘Intel 64 and IA-32 Architectures Optimization Reference Manual’ reports an instruction throughput of 10 cycles (for DisplayFamily_Display_Model = 06_3D) which is what one can probably expect if the timing instructions are spaced further apart than in the testing code I used. The point is that even RDTSC(P) timing instructions shouldn’t be used in the inner loop of your code if timing overhead is a concern.</p>
<h2 id="measuring-wall-clock-time">Measuring Wall Clock Time</h2>
<p>Given that the TSC frequency is invariant, if the frequency of the counter is known then it can be used to measure wall clock time in seconds.</p>
<p>If <code class="highlighter-rouge">init_tick</code> is the reference ‘zero’ tick and <code class="highlighter-rouge">seconds_per_tick_</code> is the timer period (reciprocal of the frequency) then the wall clock time in seconds since the reference init time is returned by:</p>
<figure class="highlight"><pre><code class="language-c--" data-lang="c++"> <span class="n">ALWAYS_INLINE</span> <span class="kt">double</span> <span class="nf">get_tsc_time</span><span class="p">()</span> <span class="p">{</span>
<span class="k">return</span> <span class="p">(</span><span class="n">_get_tsc_ticks_since_reset</span><span class="p">()</span> <span class="o">-</span> <span class="n">init_tick_</span><span class="p">)</span> <span class="o">*</span> <span class="n">seconds_per_tick_</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
<p>As mentioned, the TSC frequency can be different from the clock frequency in the CPU’s brand string returned by the CPUID instruction. One solution for finding the TSC frequency is to count the change in the TSC between two known times and then calculate the TSC’s <code class="highlighter-rouge">seconds_per_tick</code>.</p>
<p>If <code class="highlighter-rouge">init_time_</code> is the known reference time and <code class="highlighter-rouge">_get_tod_seconds()</code> returns the wall clock time using timeofday, for example, then <code class="highlighter-rouge">seconds_per_tick</code> may be updated with:</p>
<figure class="highlight"><pre><code class="language-c--" data-lang="c++"> <span class="kt">void</span> <span class="nf">sync_tsc_time</span><span class="p">()</span> <span class="p">{</span>
<span class="k">const</span> <span class="kt">double</span> <span class="n">dTime</span> <span class="o">=</span> <span class="n">_get_tod_seconds</span><span class="p">()</span> <span class="o">-</span> <span class="n">init_time_</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">uint64_t</span> <span class="n">dTicks</span> <span class="o">=</span> <span class="n">_get_tsc_ticks_since_reset_p</span><span class="p">(</span><span class="n">chip_</span><span class="p">,</span> <span class="n">core_</span><span class="p">)</span> <span class="o">-</span> <span class="n">init_tick_</span><span class="p">;</span>
<span class="n">seconds_per_tick_</span> <span class="o">=</span> <span class="n">dTime</span> <span class="o">/</span> <span class="n">dTicks</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
<p>I usually init my timer then do the setup of my app and by the time I need to start using the timer enough time has passed to get an accurate estimate of <code class="highlighter-rouge">seconds_per_tick_</code>. The full <a href="https://github.com/bduvenhage/Bits-O-Cpp/tree/master/time">source</a> with execution timing is available in my <a href="https://github.com/bduvenhage/Bits-O-Cpp">Bits-O-Cpp</a> GitHub repo.</p>
<h2 id="multi-socket-behaviour">Multi-Socket Behaviour</h2>
<p>On modern platforms the TSC is synchronised between cores of the same socket and is reset with the processor reset signal. The processor reset signal is, similar to the reference clock, synchronised between multiple processor sockets on the same motherboard. It is therefore reasonable to assume that the TSC is at least approximately synchronised between sockets.</p>
<p>The code I’ve implemented assumes that the TSC is synchronised between sockets, but I don’t have enough experience with multi-socket systems to confirm this behaviour. I’ll update this post if I find that the TSC is not adequately synchronised across sockets.</p>
<p>To accommodate any TSC skew between sockets, the code could be adapted to maintain <code class="highlighter-rouge">init_tick_</code> values and perhaps also <code class="highlighter-rouge">seconds_per_tick_</code> values for each socket using a list or an unordered map indexed by the socket ID returned by <code class="highlighter-rouge">RDTSCP</code>.</p>
<h2 id="the-c-stdchronohigh_resolution_timer">The C++ std::chrono::high_resolution_clock</h2>
<p>A good alternative to directly using the TSC is to use C++’s high resolution std::chrono timer:</p>
<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">high_resolution_clock</span><span class="o">::</span><span class="n">time_point</span> <span class="n">time_point</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">high_resolution_clock</span><span class="o">::</span><span class="n">now</span><span class="p">();</span></code></pre></figure>
<p>The high_resolution_clock is aliased to the highest resolution counter provided by the compiler implementation. On a Macbook Intel Core i5-5287U (06_3D) @ 2.90GHz platform running Apple LLVM (clang) 10.0.1 the high resolution clock is aliased to <code class="highlighter-rouge">std::chrono::steady_clock</code>.</p>
<p>On this platform the steady_clock counter has a frequency of 1GHz and a throughput of around 45 ns. It seems to be the same clock as <code class="highlighter-rouge">clock_gettime(CLOCK_UPTIME_RAW, ...)</code>, which in turn seems from the man pages to be the same clock that <code class="highlighter-rouge">mach_absolute_time()</code> uses. Therefore, <code class="highlighter-rouge">std::chrono::high_resolution_clock</code> would likely suffer from the same long latencies on platforms like TopCoder.</p>
<p>A <a href="https://github.com/bduvenhage/Bits-O-Cpp/blob/master/time/main_cpp11_chrono.cpp">simple example</a> of std::chrono is available in my <a href="https://github.com/bduvenhage/Bits-O-Cpp">Bits-O-Cpp</a> GitHub repo.</p>
<h2 id="summary">Summary</h2>
<p>The TSC can be used as a high performance timer. Moving forward, the architectural behaviour of the TSC is to be invariant and available for wall clock time measurements. This is also true for guest code running on virtualisation software.</p>
<p>The full <a href="https://github.com/bduvenhage/Bits-O-Cpp/tree/master/time">source</a> with execution timing is available in my <a href="https://github.com/bduvenhage/Bits-O-Cpp">Bits-O-Cpp</a> GitHub repo.</p>
<h1 id="a-fast-compact-approximation-of-the-exponential-function">A Fast, Compact Approximation of the Exponential Function (2019-06-04)</h1>
<p>The exponential function, and specifically the natural exponential function <code class="highlighter-rouge">exp</code>, is used often in machine learning, simulated annealing and many calculations within the sciences. Part of the reason the natural exponential is so useful is that it is quite unique in being its own derivative: the slope of the function is equal to its height. A fast non-table-lookup approximation of the natural exponential function would be very useful to accelerate many applications of <code class="highlighter-rouge">exp</code>.</p>
<p>In 1999 Nicol N. Schraudolph published a note in Neural Computation 11, 853–862 (1999) on <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.9.4508&rep=rep1&type=pdf">“A Fast, Compact Approximation of the Exponential Function”</a>. The note presents an approximation to the exponential that exploits the format of the IEEE-754 double precision floating point format.</p>
<h2 id="the-algorithm">The Algorithm</h2>
<p>IEEE-754 floating point numbers are represented in the form <script type="math/tex">(-1)^s (1 + m) 2^{x-x_0}</script>. <script type="math/tex">x_0</script> is the exponent bias. The 64-bit double precision format has 52 mantissa (m) bits, 11 exponent (x) bits, one sign (s) bit, <script type="math/tex">x_0=1023</script> and exponents range from −1022 to 1023. The figure below shows the bit format.</p>
<p><img src="/assets/images/64_bit_float.png" width="600" /></p>
<p>The bits may also be manipulated by accessing the memory of a 64-bit float as two 32-bit integers. One can, for example, raise the exponent of a number by directly setting the <code class="highlighter-rouge">x</code> bits in the high 32-bits of the 64-bit float. So setting the high integer to <script type="math/tex">2^{20} (y + 1023)</script> sets the floating point value to <script type="math/tex">2^y</script> for integer <script type="math/tex">y</script>.</p>
<p>What is cool about the exponent in IEEE-754 is that the fractional part of an exponent may be allowed to flow over to the mantissa. The fractional part then naturally linearly ‘interpolates’ between <script type="math/tex">2^y</script> and <script type="math/tex">2^{y+1}</script> for example.</p>
<p>Now the final pieces of the solution are that <script type="math/tex">e^x = 2^\frac{x}{\ln 2} = 2^y</script> for <script type="math/tex">y=\frac{x}{\ln 2}</script> <em>and</em> the <script type="math/tex">\ln 2</script> constant can be pre-factored into the integer exponent scale. See in the code below that <script type="math/tex">\frac{2^{20}}{\ln 2}</script> is used instead of <script type="math/tex">2^{20}</script>. Note that the same method can be used to approximate exponential functions with other bases.</p>
<h2 id="some-code-and-performance-results">Some Code and Performance Results.</h2>
<p>The code for the double precision floating point version of the algorithm is shown below. The ‘c’ value has been set to zero so that fast_exp(0.0) = 1.0. Schraudolph’s note has more details on how to set this value.</p>
<figure class="highlight"><pre><code class="language-c--" data-lang="c++"> <span class="c1">//! Approximate exp by Schraudolph, 1999 - double precision floating point version.
</span> <span class="n">ALWAYS_INLINE</span> <span class="kt">double</span> <span class="n">fast_exp</span><span class="p">(</span><span class="k">const</span> <span class="kt">double</span> <span class="n">x</span><span class="p">)</span> <span class="k">noexcept</span> <span class="p">{</span>
<span class="c1">// Based on Schraudolph 1999, A Fast, Compact Approximation of the Exponential Function.
</span> <span class="c1">// - See the improved fast_exp_64 implementation below!
</span> <span class="c1">// - Valid for x in approx range (-700, 700).
</span> <span class="k">union</span><span class="p">{</span><span class="kt">double</span> <span class="n">d_</span><span class="p">;</span> <span class="kt">int32_t</span> <span class="n">i_</span><span class="p">[</span><span class="mi">2</span><span class="p">];}</span> <span class="n">uid</span><span class="p">;</span> <span class="c1">//This could be moved to the thread scope.
</span> <span class="c1">//BBBD(sizeof(uid)!=8)
</span> <span class="n">uid</span><span class="p">.</span><span class="n">i_</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">uid</span><span class="p">.</span><span class="n">i_</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="kt">int32_t</span><span class="p">(</span><span class="kt">double</span><span class="p">((</span><span class="mi">1</span><span class="o"><<</span><span class="mi">20</span><span class="p">)</span> <span class="o">/</span> <span class="n">log</span><span class="p">(</span><span class="mf">2.0</span><span class="p">))</span> <span class="o">*</span> <span class="n">x</span> <span class="o">+</span> <span class="kt">double</span><span class="p">((</span><span class="mi">1</span><span class="o"><<</span><span class="mi">20</span><span class="p">)</span> <span class="o">*</span> <span class="mi">1023</span> <span class="o">-</span> <span class="mi">0</span><span class="p">));</span> <span class="c1">//c=0 for 1.0 at zero.
</span> <span class="k">return</span> <span class="n">uid</span><span class="p">.</span><span class="n">d_</span><span class="p">;</span>
<span class="p">}</span> </code></pre></figure>
<!-- Intel(R) Core(TM) i5-5287U CPU @ 2.90GHz (turbo boosted to 3.30GHz)-->
<!-- exp_perf = 5.99225e-09 s/call -->
<!-- fast_exp_perf = 4.80012e-10 s/call -->
<p>The performance of the <code class="highlighter-rouge">fast_exp</code> function was measured on an Intel(R) Core(TM) i5-5287U CPU @ 2.90GHz (turbo boosted to 3.30GHz). The normal double precision exponential from math.h takes around 6ns per call while the fast exponential takes around 0.5ns or approximately two clock cycles on this platform.</p>
<p>The images below compare two sigmoid <script type="math/tex">s(x) = \frac{1}{1 + e^{-x}}</script> curves at different scales. The green curve is generated with the fast exponential. As explained in the note, the global fit is reasonably good. The implicit linear interpolation of the algorithm is evident when one looks closer, while the staircase effect is evident when looking at the sixth decimal scale.</p>
<p><img src="/assets/images/fast_sigmoid_global_fit.png" width="230" />
<img src="/assets/images/fast_sigmoid_lin_interpol.png" width="230" />
<img src="/assets/images/fast_sigmoid_staircase.png" width="230" /></p>
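<p>For reference, a self-contained sketch of the sigmoid used for these plots. The fast_exp below is the 32-bit Schraudolph approximation from the listing above (little-endian layout assumed), and the exact std::exp version is included for comparison:</p>

```cpp
#include <cmath>
#include <cstdint>

// The 32-bit Schraudolph approximation from the listing above (little-endian
// layout assumed), wrapped into the sigmoid s(x) = 1 / (1 + e^-x).
inline double fast_exp(const double x) {
    union { double d_; int32_t i_[2]; } uid;
    uid.i_[0] = 0;
    uid.i_[1] = int32_t(double(1 << 20) / std::log(2.0) * x + double(1 << 20) * 1023.0);
    return uid.d_;
}

inline double fast_sigmoid(const double x) { return 1.0 / (1.0 + fast_exp(-x)); }

// Exact reference used for the comparison plots.
inline double sigmoid(const double x) { return 1.0 / (1.0 + std::exp(-x)); }
```

<p>Note that fast_sigmoid(0.0) is exactly 0.5 because fast_exp(0.0) is exactly 1.0 with c set to zero.</p>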
<!-- See https://twitter.com/bernardt_d/status/1010176425884901377 -->
<p>It would be interesting to use a single 64-bit integer instead of two 32-bit integers so that all the mantissa bits may be used for the fractional exponent. This should reduce the staircase effect, and 64-bit integer arithmetic is quite fast on modern CPUs.</p>
<p>[Edit: I tried the above and it works pretty well. The below adaptation uses a 64-bit integer. The performance and accuracy are equivalent <em>and</em> the staircase effect is no longer present :-) ]</p>
<figure class="highlight"><pre><code class="language-c--" data-lang="c++"> <span class="c1">//! Approximate exp adapted from Schraudolph, 1999 - double precision floating point version.
</span> <span class="n">ALWAYS_INLINE</span> <span class="kt">double</span> <span class="n">fast_exp_64</span><span class="p">(</span><span class="k">const</span> <span class="kt">double</span> <span class="n">x</span><span class="p">)</span> <span class="k">noexcept</span> <span class="p">{</span>
<span class="c1">// Based on Schraudolph 1999, A Fast, Compact Approximation of the Exponential Function.
</span> <span class="c1">// - Adapted to use 64-bit integer; reduces staircase effect.
</span> <span class="c1">// - Valid for x in approx range (-700, 700).
</span> <span class="k">union</span><span class="p">{</span><span class="kt">double</span> <span class="n">d_</span><span class="p">;</span> <span class="kt">int64_t</span> <span class="n">i_</span><span class="p">;}</span> <span class="n">uid</span><span class="p">;</span> <span class="c1">//This could be moved to the thread scope.
</span> <span class="c1">//BBBD(sizeof(uid)!=8)
</span> <span class="n">uid</span><span class="p">.</span><span class="n">i_</span> <span class="o">=</span> <span class="kt">int64_t</span><span class="p">(</span><span class="kt">double</span><span class="p">((</span><span class="kt">int64_t</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="o"><<</span> <span class="mi">52</span><span class="p">)</span> <span class="o">/</span> <span class="n">log</span><span class="p">(</span><span class="mf">2.0</span><span class="p">))</span> <span class="o">*</span> <span class="n">x</span> <span class="o">+</span> <span class="kt">double</span><span class="p">((</span><span class="kt">int64_t</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="o"><<</span> <span class="mi">52</span><span class="p">)</span> <span class="o">*</span> <span class="mi">1023</span> <span class="o">-</span> <span class="mi">0</span><span class="p">));</span> <span class="c1">//c=0 for 1.0 at zero.
</span> <span class="k">return</span> <span class="n">uid</span><span class="p">.</span><span class="n">d_</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
<h2 id="summary">Summary</h2>
<p>Schraudolph’s note presents a detailed analysis of the accuracy of the approximation and is quite an interesting read. In summary, the accuracy is reasonable for most applications and my tests show that it is about 10x quicker than the standard C++ math library’s version.</p>
<p>Given the reduced accuracy of the approximation it is likely a good idea to test the impact of the approximation in your solution. The basic recommendation is to start with <code class="highlighter-rouge">exp</code> and then to apply <code class="highlighter-rouge">fast_exp</code> where it makes sense. The full <a href="https://github.com/bduvenhage/Bits-O-Cpp/tree/master/math">source</a> with execution timing
is available in my <a href="https://github.com/bduvenhage/Bits-O-Cpp">Bits-O-Cpp</a> GitHub repo.</p>Making Fire and Water2019-04-30T00:00:00+00:002019-04-30T00:00:00+00:00http://bduvenhage.me/memory_bank/2019/04/30/memory-bank_making-fire-and-water<p>Future Crew’s 1993 PC demo called <a href="https://en.wikipedia.org/wiki/Second_Reality">Second Reality</a> had a Brobdingnagian impact on my interest in programming. Much of what I researched and played with at a young age was due to this demo. The demo won a couple of prizes and eventually made it onto <a href="https://slashdot.org/story/99/12/13/0943241/slashdots-top-10-hacks-of-all-time">Slashdot’s Top 10 Hacks of All Time</a>. The <a href="https://github.com/mtuomi/SecondReality">source code</a>, which contains lots of PAS, POV and ASM files, was released to celebrate the 20th anniversary of the demo. A <a href="https://www.youtube.com/watch?v=iw17c70uJes">high quality video</a> of the demo is also available.</p>
<p>This post is my first “memory bank” post. In these posts I’ll dig up some of my old code, make it run and preserve the code by committing it to github. In this post I’ll revive some of my old code for making fire and water.</p>
<h2 id="coding-like-its-1998">Coding like it’s 1998.</h2>
<p><a href="https://en.wikipedia.org/wiki/Mode_13h">Mode 13h</a> was a standard 320x200 VGA graphics mode with 256 colours. Mode 13h provided a linear 320x200 block of video memory at 0xA000:0000, where each byte represents one pixel.</p>
<p>To set the RGB value of each of the 256 colours in the palette one would first write the colour number to the DAC Write Index register at 0x3C8 and then write three 6-bit RGB components to the DAC Data register at 0x3C9, using the outp() function from conio.h in MS-DOS. <a href="http://www.jagregory.com/abrash-black-book/">Michael Abrash’s Graphics Programming Black Book</a> is a good reference for more info on how this all worked.</p>
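<p>On modern systems there is no DAC to poke, so a port has to emulate the palette. A minimal sketch of such an emulation (my own illustration, not the original code) that expands the 6-bit DAC components to 8-bit ARGB could look like this:</p>

```cpp
#include <array>
#include <cstdint>

// Emulates the 256-entry Mode 13h palette: each set() call mirrors the three
// 6-bit DAC writes (values 0..63) and stores a 32-bit ARGB colour that an
// indexed framebuffer can be blitted through.
struct Palette {
    std::array<uint32_t, 256> argb_{};

    void set(const uint8_t index, const uint8_t r6, const uint8_t g6, const uint8_t b6) {
        // Expand 6-bit to 8-bit as (v << 2) | (v >> 4) so that 63 maps to 255.
        const uint32_t r8 = uint32_t(r6 << 2) | (r6 >> 4);
        const uint32_t g8 = uint32_t(g6 << 2) | (g6 >> 4);
        const uint32_t b8 = uint32_t(b6 << 2) | (b6 >> 4);
        argb_[index] = 0xFF000000u | (r8 << 16) | (g8 << 8) | b8;
    }
};
```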
<p>For interest’s sake, I looked up the compiler from the binaries I had compiled back then. It was Borland Turbo C++ 3.2, Copyright 1991 Borland Intl. The IDE looked like this:</p>
<p><img src="/assets/images/turbo_cpp_3.2.jpg" width="400" /></p>
<p>To get the code running I used an SDL texture as a framebuffer and added some code to emulate a 256 colour palette. The <a href="https://github.com/bduvenhage/Bits-O-Cpp/tree/master/demoscene">source code</a> is available in my <a href="https://github.com/bduvenhage/Bits-O-Cpp">Bits-O-Cpp GitHub repo</a>.</p>
<h2 id="simple-fire">Simple Fire</h2>
<p>Simple fire can be created by having a heat source with random noise that is continually convolved with an asymmetrical filter kernel. The asymmetrical filter creates a flow effect. On a modern processor this effect runs at almost 1000 FPS.</p>
<p><img src="/assets/images/demoscene_fire.gif" width="600" /></p>
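<p>The effect can be sketched roughly as follows (an illustrative reconstruction, not the original code): random heat is injected along the bottom row, and every pixel is replaced by the average of itself and three pixels from the row below — an asymmetric kernel that makes the heat flow upward while cooling:</p>

```cpp
#include <array>
#include <cstdint>
#include <cstdlib>

constexpr int W = 320, H = 200;

// One animation step of the simple fire: seed random heat along the bottom
// row, then convolve each pixel with an asymmetric kernel that averages it
// with three neighbours from the row below so that heat 'flows' upward.
void fire_step(std::array<uint8_t, W * H> &heat) {
    for (int x = 0; x < W; ++x) {
        heat[(H - 1) * W + x] = uint8_t(std::rand() & 0xFF);  // Random heat source.
    }
    for (int y = 0; y < H - 1; ++y) {
        for (int x = 1; x < W - 1; ++x) {
            const int below = (y + 1) * W + x;
            heat[y * W + x] = uint8_t((heat[below - 1] + heat[below] +
                                       heat[below + 1] + heat[y * W + x]) / 4);
        }
    }
}
```

<p>The heat values would then be mapped through a black-red-yellow-white palette for display.</p>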
<h2 id="doom-fire">Doom Fire</h2>
<p>Doom style fire can be created by having a heat source that is propagated up while being randomly extinguished and scattered. I generate one more random number than the <a href="http://fabiensanglard.net/doom_fire_psx/">method recently documented by Fabien Sanglard</a>, but it is almost as fast as the simple fire above.</p>
<p><img src="/assets/images/demoscene_doom_fire.gif" width="600" /></p>
<h2 id="water">Water</h2>
<p>This water effect is quite cool. A water heightmap is maintained between two buffers. An <a href="https://web.archive.org/web/20160418004149/http://freespace.virgin.net/hugo.elias/graphics/x_water.htm">archived explanation</a> contains more details on how this works. Here I render the water height directly, but one should really use the height to create a refractive offset into a texture map. A demo of this effect is available at http://www.onlinetutorialsweb.com/demo/javascript-water-ripple/.</p>
<p><img src="/assets/images/demoscene_water.gif" width="600" /></p>
<h2 id="summary">Summary</h2>
<p>The <a href="https://github.com/bduvenhage/Bits-O-Cpp/tree/master/demoscene">source code</a> is available in my <a href="https://github.com/bduvenhage/Bits-O-Cpp">Bits-O-Cpp GitHub repo</a>. Thanks for the inspiration <a href="https://en.wikipedia.org/wiki/Future_Crew">Future Crew</a>.</p>The Heap Allocation Behaviour of the C++ std::unordered_set2019-04-22T00:00:00+00:002019-04-22T00:00:00+00:00http://bduvenhage.me/performance/2019/04/22/size-of-hash-table<p>Hash tables are very useful whenever a constant time lookup of a key or key-value pair is required. It is therefore good to know how these containers are implemented and used. The C++ std::unordered_map and std::unordered_set are good examples of a hash table and hash set respectively. This post will investigate the heap allocation behaviour of std::unordered_set. One can expect the behaviour of std::unordered_map to be similar.</p>
<h2 id="the-stdunordered_set-implementation">The std::unordered_set Implementation</h2>
<p>std::unordered_set is a set container implemented as a hash set. The container maintains a number of ‘buckets’ that can each hold more than one item. When an item needs to be inserted into the container a bucket index is first calculated from a hash of the item. The item is then added to the bucket which is typically implemented as a linked list.</p>
<p>The C++ std::unordered_set keeps track of its load factor i.e. the average number of items per bucket. Once the load factor goes beyond the maximum load factor (default 1.0) the container is resized to double its number of buckets which halves its load factor. Keeping the load factor low means that the chance of having multiple items per bucket would be relatively small which in turn keeps the cost of the linked list implementation of the bucket low.</p>
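<p>This resizing is easy to observe (a small illustration of my own, not from the post's test code). The standard guarantees load_factor() <= max_load_factor() after insertion; the exact bucket growth is implementation defined (libc++ roughly doubles the bucket count, libstdc++ steps through primes):</p>

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_set>

// Inserts n scrambled keys and prints a line every time the container rehashes
// itself to keep load_factor() below max_load_factor().
std::unordered_set<uint32_t> fill_and_report(const uint32_t n) {
    std::unordered_set<uint32_t> s;
    std::size_t last_buckets = s.bucket_count();
    for (uint32_t i = 0; i < n; ++i) {
        s.insert(i * 2654435761u);  // Knuth's multiplicative constant as a scrambler.
        if (s.bucket_count() != last_buckets) {
            std::printf("size=%zu buckets=%zu load=%.3f\n",
                        s.size(), s.bucket_count(), s.load_factor());
            last_buckets = s.bucket_count();
        }
    }
    return s;
}
```

<p>Multiplying by an odd constant is a bijection on uint32_t, so no duplicate keys are generated and the final set size equals n.</p>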
<h2 id="the-heap-allocation-behaviour">The Heap Allocation Behaviour</h2>
<p>The results below show the behaviour of the unordered set container when inserting 20 million random 32-bit unsigned integers. There is a small probability of randomly generating the same number more than once which resulted in an actual set size of 19.95 million items. The <a href="https://github.com/bduvenhage/Bits-O-Cpp/blob/master/containers/main_hash_table.cpp">source code</a> for the tests is available in my <a href="https://github.com/bduvenhage/Bits-O-Cpp">Bits-O-Cpp GitHub repo</a>. The tests were compiled on Apple LLVM (clang) compiler version 10.0.1 with -O3 (default Xcode release flags).</p>
<p>The container was allowed to automatically resize to keep its load factor from exceeding the default max load factor of 1.0. The figure below shows how the heap size of the container grows as more items are inserted. The ‘jumps’ in the container size happen every time the container resizes itself.</p>
<p><img src="/assets/images/unordered_set_heap_size.png" width="600" /></p>
<p>I would have liked to use valgrind’s massif tool (i.e. valgrind --tool=massif …) to analyse the memory usage of the std::unordered_set, but it seems that valgrind does not yet support macOS Mojave. So I wrote a minimal allocator to track the number of bytes allocated and the breakdown of allocations on the heap. The tracking allocator simply calls the default allocator’s allocate and deallocate functions.</p>
<p>As random numbers are inserted into the set the load factor increases. When the load factor exceeds the max load factor of 1.0 the number of buckets is doubled, which halves the load factor. This behaviour of the load factor and number of buckets is shown below.</p>
<p><img src="/assets/images/unordered_set_load_factor.png" width="600" /></p>
<p><img src="/assets/images/unordered_set_buckets.png" width="600" /></p>
<p>Increasing the number of buckets requires a reallocation of buckets and a redistribution of items. The figure below shows the time lost whenever the number of buckets is doubled.</p>
<p><img src="/assets/images/unordered_set_running_time.png" width="600" /></p>
<p>Storing 20 million uint32_t values in an unordered_set required 657 MB of heap memory. After adding all the items, the allocation breakdown looked like this:</p>
<table>
<thead>
<tr>
<th>Element Size</th>
<th>Total Num. Elements Allocated</th>
<th>Num. Currently on Heap</th>
</tr>
</thead>
<tbody>
<tr>
<td>size = 8</td>
<td>52679739</td>
<td>26339969</td>
</tr>
<tr>
<td>size = 24</td>
<td>19953544</td>
<td>19953544</td>
</tr>
</tbody>
</table>
<p>The reserved buckets are stored as an array of 64-bit (8 byte) pointers. Each linked list bucket entry is stored as the item’s hash, its uint32_t value (with 4 bytes of padding) and a pointer to the next item in the bucket. A bucket entry in this specific instance therefore requires 24 bytes and would look similar to this struct:</p>
<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="k">struct</span> <span class="n">BucketItem</span>
<span class="p">{</span>
<span class="kt">size_t</span> <span class="n">hash_</span><span class="p">;</span>
<span class="kt">uint32_t</span> <span class="n">item_</span><span class="p">;</span>
    <span class="c1">//4 bytes of padding lie here.
</span> <span class="n">BucketItem</span> <span class="o">*</span><span class="n">next_</span><span class="p">;</span>
<span class="p">};</span></code></pre></figure>
<p>It is interesting to note that the number of 24-byte bucket-entry allocations is the same as the number of items in the set. This shows that allocated bucket entries are sensibly reused when the hash set is resized. In contrast, the total number of buckets allocated is about twice the number of buckets eventually used, which is as expected for an array that is resized by doubling.</p>
<h2 id="summary">Summary</h2>
<p>This post has hopefully clarified how the heap size and number of buckets of the C++ unordered_set (and similarly the unordered_map) are related to the number of items via the max load factor. A good rule of thumb for the container overhead on a 64-bit system with a max load factor of 1.0 would be between 8+16 and 16+16 bytes per item depending on the load factor of the container. A future investigation could perhaps explore the performance-to-memory tradeoff of different container max load factors.</p>The Performance of C++ std::deque vs. std::vector2019-04-12T00:00:00+00:002019-04-12T00:00:00+00:00http://bduvenhage.me/performance/2019/04/12/performance-of-deque<p>Many algorithms make use of a double ended queue (a.k.a. a deque). For example, in breadth first search a deque is used to hold the search frontier. Having a fast implementation of a deque is very useful.</p>
<p>If the size of the deque is bounded then one can implement it as a ring buffer over a pre-allocated array. If the size of the deque is not bounded then one would have to trade some performance for the ability to dynamically grow the capacity of the deque.</p>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-MML-AM_CHTML" async=""></script>
<p>I’ve been considering implementing a ring buffer STL container or container adaptor for doing breadth first search faster than would perhaps be possible with an STL deque. However, an STL deque is a much cooler (and faster) container than I initially thought. The deque stores its elements in cache friendly chunks and still allows constant time random access (although with one more dereference than required with a vector).</p>
<p>Given its design, the std::deque has the potential to be very efficient. This post discusses std::deque’s internal workings and compares its performance to that of std::vector.</p>
<h2 id="the-stl-deque-implementation">The STL Deque Implementation</h2>
<p>The image below shows the memory layout of a std::deque. The data is stored in blocks or chunks which are referenced from a list of chunks called a map. The beginning and end of the map as well as the beginning of the first chunk and the end of the last chunk can have empty slots. begin() points to the first unused element in the first chunk and end() points to one past the last element in the last chunk.</p>
<p><img src="/assets/images/deque.png" width="600" />
<img src="/assets/images/64_bit_float.png" width="600" /></p>
<p>The push_back operation adds an element to the next available slot of the last chunk. If no space is available in the last chunk then a new chunk is added to the next unused slot in the map. If no space is available at the end of the map then the map is first resized by a reallocation and move similar to how a vector is resized. A similar mechanism is used to add elements to the front via push_front. Popping elements might make a chunk unused at which point the chunk may be freed or returned to a chunk pool.</p>
<p>It is worth noting that the deque can keep growing in size without having to reallocate and move any of the data elements. However, a drawback of the container is that to access an element one first needs to reference its chunk and then the element within the chunk.</p>
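<p>One consequence of never moving the chunks is worth demonstrating (a sketch of my own, relying on the standard's invalidation guarantees): push_back on a deque invalidates iterators but never references to existing elements, whereas a vector may reallocate and invalidate both:</p>

```cpp
#include <deque>

// Grows a deque by n elements while holding a reference to the first element.
// Because existing chunks are never moved, the reference stays valid.
bool deque_keeps_references(const int n) {
    std::deque<int> d;
    d.push_back(42);
    const int &first = d.front();                 // Reference into the first chunk.
    for (int i = 0; i < n; ++i) d.push_back(i);   // Forces many new chunk allocations.
    return &first == &d.front() && first == 42;   // Same object, same value.
}
```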
<h2 id="the-performance-of-stddeque-vs-stdvector">The Performance of std::deque vs. std::vector</h2>
<p>I compared the performance of std::deque to std::vector on Apple LLVM (clang) compiler version 10.0.1. The code was compiled with -O3 (default Xcode release flags).</p>
<p>I’m specifically interested in the relative performance of adding to the end of the container and also the performance of iterating over elements. One can expect adding to and removing from the front of the container to be much faster for a deque than for a vector. The test does a push_back of 50 million random integers, then inserts 50 million random integers at the front, then sorts the container and finally iterates over and sets all the elements to a constant value.</p>
<p>The total times these operations take are shown below:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Vector</th>
<th>Deque</th>
</tr>
</thead>
<tbody>
<tr>
<td>push_back (50M)</td>
<td>0.36s</td>
<td>0.33s</td>
</tr>
<tr>
<td>insert front (50M)</td>
<td><script type="math/tex">\approx \infty</script></td>
<td>0.30s</td>
</tr>
<tr>
<td>sort (100M)</td>
<td>8.80s</td>
<td>10.0s</td>
</tr>
<tr>
<td>iterate (100M)</td>
<td>0.041s</td>
<td>0.063s</td>
</tr>
<tr>
<td>pop_back (50M)</td>
<td>0.013s</td>
<td>0.13s</td>
</tr>
</tbody>
</table>
<p>Note that some of the time spent during push back and insert at front is probably spent in making the allocated memory pages available to the process. The vector space was reserved before doing the tests. Without reserving the vector space the push_back vector operations take about double the above time. The time taken to generate 50M random numbers is around 0.06 seconds.</p>
<p>The <a href="https://github.com/bduvenhage/Bits-O-Cpp/blob/master/containers/main_vector_vs_deque.cpp">source code</a> for the tests is available in my <a href="https://github.com/bduvenhage/Bits-O-Cpp">Bits-O-Cpp GitHub repo</a>.</p>
<h2 id="summary">Summary</h2>
<p>The push_back operations on a vector and deque require similar time if the vector space is already reserved. If the vector space is not already reserved then growing the vector would require it to be moved every time the vector runs out of space. As expected, inserting elements at the front of a deque is much quicker than inserting at the front of a vector.</p>
<p>Sorting a deque is almost 15% slower than sorting a vector. This is likely due to the additional dereference required when accessing a random element of a deque. Iterating over the deque is about 50% slower than iterating over the vector. Popping from a vector just decrements the size variable and doesn’t free any memory so it takes almost no time.</p>
<p>The std::deque is slower than a ring buffer based deque would be, but probably not by more than 50%. std::deque is likely a good implementation choice when the maximum size of the deque is unknown. In a future post I’ll show what a ring buffer based deque looks like and how it performs relative to std::vector and std::deque.</p>
<p>From the information in this post it seems that the best way to implement a ring buffer that is faster than a std::deque would be to implement it as a container adaptor over a vector.</p>The Intel DRNG2019-04-06T00:00:00+00:002019-04-06T00:00:00+00:00http://bduvenhage.me/rng/2019/04/06/the-intel-drng<p>I recently got around to testing Intel’s Secure Key Digital Random Number Generator (DRNG). Intel Secure Key (code-named Bull Mountain Technology) is the name for the Intel® 64 and IA-32 Architectures instructions RDRAND and RDSEED and the underlying hardware implementation.</p>
<p>The RDRAND and RDSEED instructions address the need for a fast source of entropy. From Intel’s <a href="https://software.intel.com/en-us/articles/intel-digital-random-number-generator-drng-software-implementation-guide">Software Implementation Guide</a>: “The DRNG using the RDRAND instruction is useful for generating high-quality keys for cryptographic protocols, and the RDSEED instruction is provided for seeding software-based pseudorandom number generators (PRNGs).”</p>
<p>Section 2 of the software implementation guide gives an overview of Random Number Generators (RNGs). For another overview of RNGs watch the excellent <a href="http://www.pcg-random.org/posts/stanford-colloquium-talk.html">talk by Melissa O’Neill</a>, or watch <a href="https://www.youtube.com/watch?v=jWXZ07YBsPM&feature=youtu.be">my talk</a>. I’ve also found <a href="https://lemire.me/blog/?s=random">Daniel Lemire’s blog</a> to be an excellent resource on implementing RNGs.</p>
<p>Section 3 of the software implementation guide gives an overview of how Intel’s DRNG works. Thermal noise is the fundamental source of entropy. A hardware CSPRNG (Cryptographically secure PRNG) digital random bit generator feeds the RDRAND instructions over all cores while an ENRNG (Enhanced Non-deterministic Random Number Generator) feeds the RDSEED instructions over all cores. The RDRAND DRNG is continuously reseeded from the hardware entropy source while the RDSEED generator makes conditioned entropy samples directly available.</p>
<h2 id="determining-support-for-intels-drng">Determining Support for Intel’s DRNG</h2>
<p>Support for RDRAND can be determined by examining bit 30 of the ECX register returned by CPUID leaf 1, and support for RDSEED can be determined by examining bit 18 of the EBX register returned by CPUID leaf 7.</p>
<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="kt">void</span> <span class="nf">get_cpuid</span><span class="p">(</span><span class="n">cpuid_t</span> <span class="o">*</span><span class="n">info</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">leaf</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">subleaf</span><span class="p">)</span> <span class="p">{</span>
<span class="k">asm</span> <span class="k">volatile</span><span class="p">(</span><span class="s">"cpuid"</span>
<span class="o">:</span> <span class="s">"=a"</span> <span class="p">(</span><span class="n">info</span><span class="o">-></span><span class="n">eax</span><span class="p">),</span> <span class="s">"=b"</span> <span class="p">(</span><span class="n">info</span><span class="o">-></span><span class="n">ebx</span><span class="p">),</span> <span class="s">"=c"</span> <span class="p">(</span><span class="n">info</span><span class="o">-></span><span class="n">ecx</span><span class="p">),</span> <span class="s">"=d"</span> <span class="p">(</span><span class="n">info</span><span class="o">-></span><span class="n">edx</span><span class="p">)</span>
<span class="o">:</span> <span class="s">"a"</span> <span class="p">(</span><span class="n">leaf</span><span class="p">),</span> <span class="s">"c"</span> <span class="p">(</span><span class="n">subleaf</span><span class="p">));</span>
<span class="p">}</span>
<span class="kt">bool</span> <span class="nf">is_intel_cpu</span><span class="p">()</span> <span class="p">{</span>
<span class="n">cpuid_t</span> <span class="n">info</span><span class="p">;</span>
<span class="n">get_cpuid</span><span class="p">(</span><span class="o">&</span><span class="n">info</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">memcmp</span><span class="p">((</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span> <span class="o">&</span><span class="n">info</span><span class="p">.</span><span class="n">ebx</span><span class="p">,</span> <span class="s">"Genu"</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span> <span class="o">||</span>
<span class="n">memcmp</span><span class="p">((</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span> <span class="o">&</span><span class="n">info</span><span class="p">.</span><span class="n">edx</span><span class="p">,</span> <span class="s">"ineI"</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span> <span class="o">||</span>
<span class="n">memcmp</span><span class="p">((</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span> <span class="o">&</span><span class="n">info</span><span class="p">.</span><span class="n">ecx</span><span class="p">,</span> <span class="s">"ntel"</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span> <span class="p">{</span>
<span class="k">return</span> <span class="nb">false</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="k">return</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">bool</span> <span class="nf">is_drng_supported</span><span class="p">()</span> <span class="p">{</span>
<span class="kt">bool</span> <span class="n">rdrand_supported</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
<span class="kt">bool</span> <span class="n">rdseed_supported</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">is_intel_cpu</span><span class="p">())</span> <span class="p">{</span>
<span class="n">cpuid_t</span> <span class="n">info</span><span class="p">;</span>
<span class="n">get_cpuid</span><span class="p">(</span><span class="o">&</span><span class="n">info</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="k">if</span> <span class="p">((</span><span class="n">info</span><span class="p">.</span><span class="n">ecx</span> <span class="o">&</span> <span class="mh">0x40000000</span><span class="p">)</span> <span class="o">==</span> <span class="mh">0x40000000</span><span class="p">)</span> <span class="p">{</span>
<span class="n">rdrand_supported</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">get_cpuid</span><span class="p">(</span><span class="o">&</span><span class="n">info</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span> <span class="p">(</span><span class="n">info</span><span class="p">.</span><span class="n">ebx</span> <span class="o">&</span> <span class="mh">0x40000</span><span class="p">)</span> <span class="o">==</span> <span class="mh">0x40000</span> <span class="p">)</span> <span class="p">{</span>
<span class="n">rdseed_supported</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">rdrand_supported</span> <span class="o">&&</span> <span class="n">rdseed_supported</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
<h2 id="using-rdrand-and-rdseed">Using RDRAND and RDSEED</h2>
<p>The RDRAND and RDSEED instructions may be called as shown below. The size of the operand register determines whether 16-, 32- or 64-bit random numbers are returned. If the carry flag is zero after a DRNG instruction, a random number was not yet available and the software should retry if one is still required.</p>
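<p>As a sketch of that retry pattern: the snippet below bounds the number of retries rather than spinning forever. The <code>try_rdrand32</code> stand-in is hypothetical; it simulates the carry-flag behaviour in software so the control flow can be shown without requiring DRNG hardware. A real build would issue the instruction via inline asm, as later in this post, or via the <code>_rdrand32_step</code> intrinsic.</p>

```cpp
#include <cstdint>
#include <optional>

static int fail_count = 2;

// Hypothetical stand-in for the hardware step: returns true and writes a
// value when a number is ready, the way RDRAND sets the carry flag. Here it
// simulates two transient failures before succeeding.
bool try_rdrand32(uint32_t *out) {
    if (fail_count > 0) { --fail_count; return false; }  // CF == 0: retry later.
    *out = 0xDEADBEEFu;                                  // CF == 1: number valid.
    return true;
}

// Bounded retry: give up after max_retries failed attempts and let the
// caller decide how to handle exhaustion, instead of looping forever.
std::optional<uint32_t> rdrand32_retry(const int max_retries) {
    uint32_t r;
    for (int attempt = 0; attempt <= max_retries; ++attempt) {
        if (try_rdrand32(&r)) return r;
    }
    return std::nullopt;
}
```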
<p>Like the splitmix64_stateless generator (shown for reference), the rdseed64 generator below may be used to seed other RNGs:</p>
<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="c1">//! Stateless [0,2^64) splitmix64 by Daniel Lemire https://github.com/lemire/testingRNG . Useful for seeding RNGs.
</span><span class="n">ALWAYS_INLINE</span> <span class="kt">uint64_t</span> <span class="nf">splitmix64_stateless</span><span class="p">(</span><span class="k">const</span> <span class="kt">uint64_t</span> <span class="n">index</span><span class="p">)</span> <span class="p">{</span><span class="c1">// 1.3 ns on local.
</span> <span class="kt">uint64_t</span> <span class="n">z</span> <span class="o">=</span> <span class="n">index</span> <span class="o">+</span> <span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x9E3779B97F4A7C15</span><span class="p">);</span>
<span class="n">z</span> <span class="o">=</span> <span class="p">(</span><span class="n">z</span> <span class="o">^</span> <span class="p">(</span><span class="n">z</span> <span class="o">>></span> <span class="mi">30</span><span class="p">))</span> <span class="o">*</span> <span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0xBF58476D1CE4E5B9</span><span class="p">);</span>
<span class="n">z</span> <span class="o">=</span> <span class="p">(</span><span class="n">z</span> <span class="o">^</span> <span class="p">(</span><span class="n">z</span> <span class="o">>></span> <span class="mi">27</span><span class="p">))</span> <span class="o">*</span> <span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x94D049BB133111EB</span><span class="p">);</span>
<span class="k">return</span> <span class="n">z</span> <span class="o">^</span> <span class="p">(</span><span class="n">z</span> <span class="o">>></span> <span class="mi">31</span><span class="p">);</span>
<span class="p">}</span>
<span class="c1">//! 64-bit Intel RDSEED. Useful for seeding RNGs.
</span><span class="n">ALWAYS_INLINE</span> <span class="kt">uint64_t</span> <span class="nf">rdseed64</span><span class="p">()</span> <span class="p">{</span> <span class="c1">// 450 ns on local.
</span> <span class="kt">uint64_t</span> <span class="n">rand</span><span class="p">;</span>
<span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">ok</span><span class="p">;</span>
<span class="k">asm</span> <span class="k">volatile</span> <span class="p">(</span><span class="s">"rdseed %0; setc %1"</span>
<span class="o">:</span> <span class="s">"=r"</span> <span class="p">(</span><span class="n">rand</span><span class="p">),</span> <span class="s">"=qm"</span> <span class="p">(</span><span class="n">ok</span><span class="p">));</span>
<span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">ok</span><span class="p">)</span> <span class="p">{</span>
<span class="k">asm</span> <span class="k">volatile</span> <span class="p">(</span><span class="s">"pause"</span> <span class="o">:</span> <span class="p">);</span>
<span class="k">asm</span> <span class="k">volatile</span> <span class="p">(</span><span class="s">"rdseed %0; setc %1"</span>
<span class="o">:</span> <span class="s">"=r"</span> <span class="p">(</span><span class="n">rand</span><span class="p">),</span> <span class="s">"=qm"</span> <span class="p">(</span><span class="n">ok</span><span class="p">));</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">rand</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
<p>The assembler generated for the above rdseed64 generator would look similar to the snippet below. Notice the ‘pause’ instruction, which Intel recommends so that the core can still do other work while waiting for a random number.</p>
<figure class="highlight"><pre><code class="language-nasm" data-lang="nasm"><span class="n">rdseed64</span><span class="p">()</span><span class="o">:</span>
<span class="k">jmp</span> <span class="p">.</span><span class="n">L6</span>
<span class="p">.</span><span class="n">L3</span><span class="o">:</span>
<span class="k">pause</span>
<span class="p">.</span><span class="n">L6</span><span class="o">:</span>
<span class="n">rdseed</span> <span class="n">rax</span>
<span class="k">setc</span> <span class="n">dl</span>
<span class="k">test</span> <span class="n">dl</span><span class="p">,</span> <span class="n">dl</span>
<span class="k">je</span> <span class="p">.</span><span class="n">L3</span>
<span class="k">ret</span></code></pre></figure>
<p>Below is a Lehmer RNG class (shown for reference) and an Intel 32-bit DRNG class using RDRAND:</p>
<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="c1">//! Lehmer RNG with 64bit multiplier, derived from https://github.com/lemire/testingRNG.
</span><span class="k">class</span> <span class="nc">TC_MCG_Lehmer_RandFunc32</span> <span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">TC_MCG_Lehmer_RandFunc32</span><span class="p">(</span><span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">seed</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span><span class="n">init</span><span class="p">(</span><span class="n">seed</span><span class="p">);}</span>
<span class="c1">//!Calc LCG random number in [0,2^32)
</span> <span class="n">ALWAYS_INLINE</span> <span class="kt">uint32_t</span> <span class="k">operator</span><span class="p">()()</span> <span class="p">{</span><span class="c1">// 1.0 ns on local.
</span> <span class="n">state_</span><span class="p">.</span><span class="n">s128_</span> <span class="o">*=</span> <span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0xda942042e4dd58b5</span><span class="p">);</span>
<span class="k">return</span> <span class="kt">uint32_t</span><span class="p">(</span><span class="n">state_</span><span class="p">.</span><span class="n">s64_</span><span class="p">[</span><span class="mi">1</span><span class="p">]);</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">init</span><span class="p">(</span><span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">seed</span><span class="p">)</span> <span class="p">{</span><span class="n">state_</span><span class="p">.</span><span class="n">s128_</span> <span class="o">=</span> <span class="p">(</span><span class="n">__uint128_t</span><span class="p">(</span><span class="n">splitmix64_stateless</span><span class="p">(</span><span class="n">seed</span><span class="p">))</span> <span class="o"><<</span> <span class="mi">64</span><span class="p">)</span> <span class="o">+</span> <span class="n">splitmix64_stateless</span><span class="p">(</span><span class="n">seed</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);}</span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="kt">double</span> <span class="n">max_plus_one</span><span class="p">()</span> <span class="p">{</span><span class="k">return</span> <span class="mf">4294967296.0</span><span class="p">;}</span> <span class="c1">//0x1p32
</span> <span class="k">static</span> <span class="k">constexpr</span> <span class="kt">double</span> <span class="n">recip_max_plus_one</span><span class="p">()</span> <span class="p">{</span><span class="k">return</span> <span class="p">(</span><span class="mf">1.0</span> <span class="o">/</span> <span class="mf">4294967296.0</span><span class="p">);}</span> <span class="c1">//1.0/0x1p32
</span> <span class="k">static</span> <span class="k">constexpr</span> <span class="kt">int</span> <span class="n">num_bits</span><span class="p">()</span> <span class="p">{</span><span class="k">return</span> <span class="mi">32</span><span class="p">;}</span>
<span class="k">private</span><span class="o">:</span>
<span class="k">union</span><span class="p">{</span><span class="n">__uint128_t</span> <span class="n">s128_</span><span class="p">;</span> <span class="kt">uint64_t</span> <span class="n">s64_</span><span class="p">[</span><span class="mi">2</span><span class="p">];}</span> <span class="n">state_</span><span class="p">;</span> <span class="c1">//Assumes little endian so that s64[0] is the low 64 bits of s128_.
</span><span class="p">};</span>
<span class="c1">// 32-bit RNG using Intel's DRNG CPU instructions. Warning: It is slow! 100x slower than PCG!
</span><span class="k">class</span> <span class="nc">TC_IntelDRNG_RandFunc32</span> <span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">TC_IntelDRNG_RandFunc32</span><span class="p">(</span><span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">seed</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span><span class="n">init</span><span class="p">(</span><span class="n">seed</span><span class="p">);}</span>
<span class="c1">//!Intel DRNG random number in [0,2^32)
</span> <span class="n">ALWAYS_INLINE</span> <span class="kt">uint32_t</span> <span class="k">operator</span><span class="p">()()</span> <span class="p">{</span><span class="c1">//??ns on TC's EC2! 120ns on local!!!
</span> <span class="kt">uint32_t</span> <span class="n">rand</span><span class="p">;</span>
<span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">ok</span><span class="p">;</span>
<span class="k">do</span> <span class="p">{</span>
<span class="k">asm</span> <span class="k">volatile</span> <span class="p">(</span><span class="s">"rdrand %0; setc %1"</span>
<span class="o">:</span> <span class="s">"=r"</span> <span class="p">(</span><span class="n">rand</span><span class="p">),</span> <span class="s">"=qm"</span> <span class="p">(</span><span class="n">ok</span><span class="p">));</span>
<span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">ok</span><span class="p">);</span>
<span class="k">return</span> <span class="n">rand</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">init</span><span class="p">(</span><span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">seed</span><span class="p">)</span> <span class="p">{}</span> <span class="c1">//No seeding required.
</span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="kt">double</span> <span class="n">max_plus_one</span><span class="p">()</span> <span class="p">{</span><span class="k">return</span> <span class="mf">4294967296.0</span><span class="p">;}</span> <span class="c1">//0x1p32
</span> <span class="k">static</span> <span class="k">constexpr</span> <span class="kt">double</span> <span class="n">recip_max_plus_one</span><span class="p">()</span> <span class="p">{</span><span class="k">return</span> <span class="p">(</span><span class="mf">1.0</span> <span class="o">/</span> <span class="mf">4294967296.0</span><span class="p">);}</span> <span class="c1">//1.0/0x1p32
</span> <span class="k">static</span> <span class="k">constexpr</span> <span class="kt">int</span> <span class="n">num_bits</span><span class="p">()</span> <span class="p">{</span><span class="k">return</span> <span class="mi">32</span><span class="p">;}</span>
<span class="p">};</span></code></pre></figure>
<h2 id="performance-results">Performance Results:</h2>
<p>The Intel software implementation guide states that “On real-world systems, a single thread executing RDRAND continuously may see throughputs ranging from 70 to 200 MB/sec, depending on the CPU architecture.”</p>
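<p>To relate that quoted throughput range to per-number cost, a small back-of-the-envelope helper (taking MB as 10^6 bytes; the 70 and 200 MB/sec figures are from the quote above, not my own measurements):</p>

```cpp
// Convert a DRNG throughput in MB/s (MB taken as 1e6 bytes) into random
// numbers per second and the implied nanoseconds per number.
double numbers_per_second(const double mb_per_sec, const int bytes_per_number) {
    return mb_per_sec * 1.0e6 / bytes_per_number;
}

double ns_per_number(const double mb_per_sec, const int bytes_per_number) {
    return 1.0e9 / numbers_per_second(mb_per_sec, bytes_per_number);
}
// At the quoted 200 MB/s, 64-bit draws arrive at 25 million per second,
// i.e. one every 40 ns; at 70 MB/s, roughly one every 114 ns.
```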
<p>I also ran some performance measurements on my laptop (which is a 2.9 GHz Intel Core i5 and has cpu_ticks_per_ns = 2.89991):</p>
<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
<span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">rng_seed_</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">TC_MCG_Lehmer_RandFunc32</span> <span class="n">lehmer_rng</span><span class="p">(</span><span class="n">rng_seed_</span><span class="p">);</span>
<span class="c1">//Should check is_intel_cpu()
</span> <span class="c1">//Should check is_drng_supported()
</span> <span class="n">TC_IntelDRNG_RandFunc32</span> <span class="n">intel_rng_</span><span class="p">(</span><span class="n">rng_seed_</span><span class="p">);</span>
<span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o"><<</span> <span class="s">"Generating some random numbers..."</span><span class="p">;</span>
<span class="n">TCTimer</span><span class="o">::</span><span class="n">init_timer</span><span class="p">(</span><span class="mf">2.89992e+09</span><span class="p">);</span> <span class="c1">// The param is the initial guess of your CPU's clock rate in Hz.
</span>
<span class="k">const</span> <span class="kt">uint64_t</span> <span class="n">num_iterations</span> <span class="o">=</span> <span class="kt">uint64_t</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="o"><<</span> <span class="mi">27</span><span class="p">;</span>
<span class="kt">uint32_t</span> <span class="n">ri</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">double</span> <span class="n">start_time</span> <span class="o">=</span> <span class="n">TCTimer</span><span class="o">::</span><span class="n">get_time</span><span class="p">();</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">uint64_t</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><=</span> <span class="n">num_iterations</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">//ri += lehmer_rng();
</span> <span class="n">ri</span> <span class="o">+=</span> <span class="n">intel_rng_</span><span class="p">();</span>
<span class="c1">//ri += rdseed64();
</span> <span class="p">}</span>
<span class="k">const</span> <span class="kt">double</span> <span class="n">end_time</span> <span class="o">=</span> <span class="n">TCTimer</span><span class="o">::</span><span class="n">sync_tsc_time</span><span class="p">();</span> <span class="c1">// Same as get_time(), but also estimates CPU's seconds_per_tick_!
</span> <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o"><<</span> <span class="s">"done.</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o"><<</span> <span class="n">ri</span> <span class="o"><<</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span> <span class="c1">// Print the sum so that the RNG doesn't get optimised out.
</span>
<span class="k">const</span> <span class="kt">double</span> <span class="n">cpu_ticks_per_ns</span> <span class="o">=</span> <span class="n">TCTimer</span><span class="o">::</span><span class="n">get_clock_freq</span><span class="p">()</span> <span class="o">*</span> <span class="mf">0.000000001</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">double</span> <span class="n">ns_per_iteration</span> <span class="o">=</span> <span class="p">(</span><span class="n">end_time</span><span class="o">-</span><span class="n">start_time</span><span class="p">)</span> <span class="o">/</span> <span class="n">num_iterations</span> <span class="o">*</span> <span class="mf">1000000000.0</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">double</span> <span class="n">cpu_ticks_per_iteration</span> <span class="o">=</span> <span class="n">cpu_ticks_per_ns</span> <span class="o">*</span> <span class="n">ns_per_iteration</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">double</span> <span class="n">millions_numbers_per_second</span> <span class="o">=</span> <span class="n">num_iterations</span> <span class="o">/</span> <span class="p">(</span><span class="n">end_time</span><span class="o">-</span><span class="n">start_time</span><span class="p">)</span> <span class="o">*</span> <span class="mf">0.000001</span><span class="p">;</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
<p>Result for lehmer64:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ns_per_number = 0.90284
cpu_ticks_per_number = 2.61816
mbits_per_second = 70887.4
</code></pre></div></div>
<p>Result for rdrand32:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ns_per_number = 112.152
cpu_ticks_per_number = 325.231
mbits_per_second = 285.327
</code></pre></div></div>
<p>Result for rdseed64:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ns_per_number = 445.784
cpu_ticks_per_number = 1292.73
mbits_per_second = 143.5672
</code></pre></div></div>
<p>RDRAND and RDSEED are slower than, for example, the Lehmer generator. However, they provide cryptographically secure, hardware-entropy-based random numbers significantly faster than seems to be otherwise possible.</p>
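<p>If hardware entropy is only needed for seeding, there is also a portable route: <code>std::random_device</code>, which on many platforms is backed by an OS entropy pool or by RDRAND itself (exactly which source you get is implementation-defined). A minimal sketch of seeding a fast user-space PRNG from it:</p>

```cpp
#include <random>

// Seed a fast user-space PRNG from a slow entropy source: draw several
// 32-bit words from std::random_device and mix them with std::seed_seq.
std::mt19937_64 make_seeded_engine() {
    std::random_device rd;                      // Slow; use only for seeding.
    std::seed_seq seq{rd(), rd(), rd(), rd()};  // 128 bits of seed material.
    return std::mt19937_64(seq);
}
```

The pattern is the same as using rdseed64 above to seed a cheap generator: pay the expensive entropy cost once, then generate at full speed.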
<h2 id="the-code">The Code</h2>
<p>The full code is available in my <a href="https://github.com/bduvenhage/Bits-O-Cpp/tree/master/random">Bits-O-Cpp GitHub repo</a>. That code uses some headers for timing and platform info from the repo, but the Bits-O-Cpp/random/README.md file contains info on how to compile the example.</p>I recently got around to testing Intel’s Secure Key Digital Random Number Generator (DRNG). Intel Secure Key (code-named Bull Mountain Technology) is the name for the Intel® 64 and IA-32 Architectures instructions RDRAND and RDSEED and the underlying hardware implementation.The Knapsack Problem2019-04-04T00:00:00+00:002019-04-04T00:00:00+00:00http://bduvenhage.me/algorithms/dynamic%20programming/2019/04/04/the-knapsack-problem<p>The knapsack problem comes up quite often and it is important to know how
to solve it. For example, given a certain material budget and the cost
vs. perceived value of building various edges in a potential road network,
which edges should one build? The goal is to optimise the perceived value
of the built roads within the fixed material budget.</p>
<p>I recently encountered this problem within a TopCoder marathon
match. This post discusses two solutions and the code that I’ll likely reuse for
this problem in future.</p>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-MML-AM_CHTML" async=""></script>
<p>In terms of a knapsack, given a collection of <script type="math/tex">n</script> objects (each with a weight
<script type="math/tex">w_i</script> and value <script type="math/tex">v_i</script>) as well as a knapsack that can carry a certain weight
<script type="math/tex">W</script>, which objects would you choose to pack? The goal is to optimise the total
value of the objects that one can fit into the weight budget of the knapsack.</p>
<p>The below solutions are for the 0-1 knapsack problem which restricts the
number of ‘copies’ of each of the n objects to zero or one.</p>
<p>More formally, the goal is to maximize <script type="math/tex">\sum _{i=1}^{n}v_{i}x_{i}</script> subject to <script type="math/tex">\sum _{i=1}^{n}w_{i}x_{i}\leqslant W</script> and <script type="math/tex">x_{i}\in \{0,1\}</script>.</p>
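<p>For tiny instances the objective above can be evaluated directly by enumerating all <script type="math/tex">2^n</script> assignments of <script type="math/tex">x_i</script>. This brute force is hopeless for real problems but makes a handy reference when testing faster solvers (a sketch, with weights and values passed as plain vectors):</p>

```cpp
#include <cstdint>
#include <vector>

// Exhaustively maximise sum(v[i]*x[i]) subject to sum(w[i]*x[i]) <= W with
// x[i] in {0,1}. O(2^n), so only suitable for small n or for checking answers.
int knapsack_brute_force(const std::vector<int> &w, const std::vector<int> &v,
                         const int W) {
    const std::size_t n = w.size();
    int best = 0;
    for (uint32_t mask = 0; mask < (uint32_t(1) << n); ++mask) {
        int weight = 0, value = 0;
        for (std::size_t i = 0; i < n; ++i) {
            if (mask & (uint32_t(1) << i)) { weight += w[i]; value += v[i]; }
        }
        if (weight <= W && value > best) best = value;  // Keep feasible best.
    }
    return best;
}
```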
<h2 id="a-greedy-approximate-solution">A Greedy Approximate Solution</h2>
<p>The ‘greedy’ approach to solving the problem is to repeatedly choose the
object with the best value-per-weight score until no more objects can be added to the knapsack. This
solution has a complexity of <script type="math/tex">O(nm)</script>, where <script type="math/tex">m</script> is the number of items that typically fit
<p>For example, given three objects with weights <script type="math/tex">w_1=3,\,w_2=4,\,w_3=2</script> and
values <script type="math/tex">v_1=8,\,v_2=12,\,v_3=5</script> and <script type="math/tex">W = 5</script>. The value per weight scores
for these objects are <script type="math/tex">s_1=2\frac{2}{3}</script>, <script type="math/tex">s_2=3</script>, <script type="math/tex">s_3=2\frac{1}{2}</script>.</p>
<p>For a greedy approach one would try to first
choose object two, then object one and then object three. However, one can
only fit object two (<script type="math/tex">w_2=4</script>) in the knapsack. Adding either object one or
three (<script type="math/tex">w_1=3,\,w_3=2</script>) would make the knapsack heavier than <script type="math/tex">W = 5</script>.
Therefore, the greedy solution is a knapsack of value <script type="math/tex">\{v_2\} = 12</script>.</p>
<p>A greedy solution is however often not optimal. A better solution would have been
to rather pack objects one and three with weight <script type="math/tex">\{w_1=3,\,w_3=2\} = 5</script> and value
<script type="math/tex">\{v_1=8,\,v_3=5\} = 13</script>.</p>
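<p>That greedy strategy can be sketched as sorting object indices by value-per-weight and taking whatever still fits (the function name and vector-based interface are my own for illustration). On the three-object example above it returns 12, demonstrating the sub-optimality:</p>

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Greedy 0-1 knapsack: take objects in decreasing value-per-weight order as
// long as they fit. Fast, but not optimal in general.
int knapsack_greedy(const std::vector<int> &w, const std::vector<int> &v,
                    const int W) {
    std::vector<std::size_t> idx(w.size());
    std::iota(idx.begin(), idx.end(), std::size_t(0));
    std::sort(idx.begin(), idx.end(), [&](std::size_t a, std::size_t b) {
        return double(v[a]) / w[a] > double(v[b]) / w[b];  // Best score first.
    });
    int weight = 0, value = 0;
    for (const std::size_t i : idx) {
        if (weight + w[i] <= W) { weight += w[i]; value += v[i]; }
    }
    return value;
}
```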
<h2 id="a-dynamic-programming-solution">A Dynamic Programming Solution</h2>
<p>A solution that uses <a href="https://en.wikipedia.org/wiki/Dynamic_programming">dynamic programming</a>
exists that can find the optimal solution. Dynamic
programming refers to simplifying a complicated problem by breaking it down
into simpler sub-problems. Another
well known example of such a solution is
<a href="https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm">Dijkstra’s algorithm</a>
for the shortest path problem.</p>
<p>For the 0-1 knapsack problem define <script type="math/tex">m[n,W]</script> to be the maximum value that
can be attained with weight less than or equal to <script type="math/tex">W</script> using the set of <script type="math/tex">n</script>
objects.</p>
<p>We can define <script type="math/tex">m[i,w]</script> recursively as follows:</p>
<ul>
<li><script type="math/tex">m[0,\,w] = 0</script>,</li>
<li><script type="math/tex">m[i,\,w] = m[i-1,\,w]</script> if <script type="math/tex">w_{i} > w</script>,</li>
<li><script type="math/tex">m[i,\,w] = \max(m[i-1,\,w],\,m[i-1,w-w_{i}] + v_{i})</script> if <script type="math/tex">w_{i} \leqslant w</script>.</li>
</ul>
<p>The maximum value of the objects that can be packed in the knapsack may then
be found by calculating <script type="math/tex">m[n,W]</script>. The C++ code for this would look like:</p>
<figure class="highlight"><pre><code class="language-c--" data-lang="c++"> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">w</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">w</span> <span class="o"><=</span> <span class="n">W</span><span class="p">;</span> <span class="o">++</span><span class="n">w</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">m</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="n">w</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
  <span class="p">}</span>
  <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o"><=</span> <span class="n">n</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">w</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">w</span> <span class="o"><=</span> <span class="n">W</span><span class="p">;</span> <span class="o">++</span><span class="n">w</span><span class="p">)</span> <span class="p">{</span>
      <span class="k">if</span> <span class="p">(</span><span class="n">wt</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">></span> <span class="n">w</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// wt[] holds the object weights; named so it doesn't shadow the loop variable w.</span>
        <span class="n">m</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="n">w</span><span class="p">]</span> <span class="o">=</span> <span class="n">m</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">][</span><span class="n">w</span><span class="p">];</span>
      <span class="p">}</span>
      <span class="k">else</span> <span class="p">{</span>
        <span class="n">m</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="n">w</span><span class="p">]</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">max</span><span class="p">(</span><span class="n">m</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">][</span><span class="n">w</span><span class="p">],</span> <span class="n">m</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">][</span><span class="n">w</span><span class="o">-</span><span class="n">wt</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">]]</span> <span class="o">+</span> <span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">]);</span>
      <span class="p">}</span>
    <span class="p">}</span>
  <span class="p">}</span></code></pre></figure>
<p>For the above three object example, <script type="math/tex">m</script> ends up as a table of <script type="math/tex">n+1</script> rows by <script type="math/tex">W+1</script>
columns:</p>
<table>
<thead>
<tr>
<th> </th>
<th>w=0</th>
<th>w=1</th>
<th>w=2</th>
<th>w=3</th>
<th>w=4</th>
<th>w=5</th>
</tr>
</thead>
<tbody>
<tr>
<td>i=0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>i=1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>8</td>
<td>8</td>
<td>8</td>
</tr>
<tr>
<td>i=2</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>8</td>
<td>12</td>
<td>12</td>
</tr>
<tr>
<td>i=3</td>
<td>0</td>
<td>0</td>
<td>5</td>
<td>8</td>
<td>12</td>
<td>13</td>
</tr>
</tbody>
</table>
<p>From the table one can see that <script type="math/tex">m[3,5]</script> is a knapsack of value 13 and building the table
has a time complexity of <script type="math/tex">O(nW)</script>.</p>
<p>The code to find the objects used in the solution is interesting and looks like:</p>
<figure class="highlight"><pre><code class="language-c--" data-lang="c++"> <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">n</span><span class="p">;</span>
  <span class="kt">int</span> <span class="n">w</span> <span class="o">=</span> <span class="n">W</span><span class="p">;</span>
  <span class="k">do</span> <span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">m</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="n">w</span><span class="p">]</span> <span class="o">!=</span> <span class="n">m</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">][</span><span class="n">w</span><span class="p">])</span>
    <span class="p">{</span><span class="c1">//Object i (1 indexed) contributed to the value and
</span>     <span class="c1">//must therefore be part of the solution.
</span>      <span class="n">object_used</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span> <span class="c1">// i-1 gives zero indexed object!
</span>      <span class="n">w</span> <span class="o">-=</span> <span class="n">wt</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">];</span> <span class="c1">// Subtract the weight of this object (wt[] as before) so that
</span>      <span class="c1">// traversal continues from m[i-1][w - wt[i-1]].
</span>    <span class="p">}</span>
    <span class="n">i</span> <span class="o">-=</span> <span class="mi">1</span><span class="p">;</span>
  <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">i</span> <span class="o">></span> <span class="mi">0</span><span class="p">);</span></code></pre></figure>
<p>Note that the dynamic programming solution is a lot slower than the greedy solution and
uses LOTS more memory. To solve the problem with knapsack weight of <script type="math/tex">W</script> and <script type="math/tex">n</script> objects
requires a table of size <script type="math/tex">(W+1) \times (n+1)</script>. The table size can quickly become
prohibitive. I’ll do a future post on using a search strategy like simulated annealing to
find good solutions to very big knapsack problems.</p>
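<p>When only the optimal value is needed, and not the list of chosen objects, the memory cost can be cut from <script type="math/tex">O(nW)</script> to <script type="math/tex">O(W)</script> by keeping a single table row and iterating the capacity downwards, so each object is counted at most once. A sketch of this standard refinement of the DP above (vector-based interface is my own):</p>

```cpp
#include <algorithm>
#include <vector>

// 0-1 knapsack DP with one row: m[w] is the best value achievable with
// capacity w over the objects processed so far. Iterating w downwards means
// m[w - wt[i]] still refers to the previous object's row, so each object is
// used at most once. Trade-off: the chosen objects can no longer be
// recovered by backtracking through the full table.
int knapsack_dp_1d(const std::vector<int> &wt, const std::vector<int> &v,
                   const int W) {
    std::vector<int> m(W + 1, 0);
    for (std::size_t i = 0; i < wt.size(); ++i) {
        for (int w = W; w >= wt[i]; --w) {
            m[w] = std::max(m[w], m[w - wt[i]] + v[i]);
        }
    }
    return m[W];
}
```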
<p>The full <a href="https://github.com/bduvenhage/Bits-O-Cpp/tree/master/knapsack">source</a> with execution timing
is available in my Bits-O-Cpp repo. I use this repo as a reference for myself, but I aim
to also maintain examples of how anyone can use those bits of sweet C++ code.</p>The knapsack problem comes up quite often and it is important to know how to solve it. For example, given a certain material budget and the cost vs. perceived value of building various edges in a potential road network, which edges should one build? The goal is to optimise the perceived value of the built roads within the fixed material budget.First post with Jekyll2019-03-31T00:00:00+00:002019-03-31T00:00:00+00:00http://bduvenhage.me/jekyll/2019/03/31/first-post-with-jekyll<p>Jekyll is pretty cool. I followed the <a href="https://jekyllrb.com/docs/">Quickstart</a> guide which generates a basic blog site using the minima theme. Once you push the source to your <code class="highlighter-rouge">https://github.com/<username>/<username>.github.io</code> repo, Github Pages will build your site and make it available online at <code class="highlighter-rouge"><username>.github.io</code>. You can see what the site looks like locally before pushing by running <code class="highlighter-rouge">bundle exec jekyll serve --drafts</code> and pointing your browser at <code class="highlighter-rouge">127.0.0.1:4000</code>.</p>
<p>To add a new post, one simply pushes a markdown-formatted post to the <code class="highlighter-rouge">_posts</code> directory, named following the convention <code class="highlighter-rouge">YYYY-MM-DD-name-of-post.md</code> and including the necessary front matter. Take a look at the <a href="https://raw.githubusercontent.com/bduvenhage/bduvenhage.github.io/master/_posts/2019-03-31-first-post-with-jekyll.md">source</a> for this post to get an idea of what the front matter looks like.</p>
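<p>As an illustration, the front matter for a post like this one might look as follows. These are the standard keys the default minima templates read; the values shown are a sketch inferred from this post’s URL rather than copied from its source:</p>

```yaml
---
layout: post
title:  "First post with Jekyll"
date:   2019-03-31 00:00:00 +0000
categories: jekyll
---
```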
<p>Jekyll also offers powerful support for code snippets:</p>
<figure class="highlight"><pre><code class="language-c++" data-lang="c++"><span class="cp">#include <iostream>
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o"><<</span> <span class="s">"Hi, Tom.</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="o">//</span> <span class="n">prints</span> <span class="s">"Hi, Tom."</span> <span class="n">to</span> <span class="n">stdout</span><span class="p">.</span></code></pre></figure>
<p>Inline tables look like so:</p>
<table>
<thead>
<tr>
<th>Priority apples</th>
<th>Second priority</th>
<th>Third priority</th>
</tr>
</thead>
<tbody>
<tr>
<td>ambrosia</td>
<td>gala</td>
<td>red delicious</td>
</tr>
<tr>
<td>pink lady</td>
<td>jazz</td>
<td>macintosh</td>
</tr>
<tr>
<td>honeycrisp</td>
<td>granny smith</td>
<td>fuji</td>
</tr>
</tbody>
</table>
<p>Inline figures look like so:</p>
<!--- - ![Logo Jekyll](http://bduvenhage.me/assets/images/jekyll-logo.png ) -->
<!--- - ![Logo Jekyll](http://bduvenhage.me/assets/images/jekyll-logo.png) -->
<ul>
<li>
<p><img src="http://memofil.github.io/assets/images/categories/jekyll-logo.png" width="80" /></p>
</li>
<li>
<p><img src="/assets/images/jekyll-logo.png" width="80" /></p>
</li>
</ul>
<!--- - ![Logo Jekyll](/jekyll-logo.png) -->
<p>Inline math looks like so:
<!--- Put the below script somewhere. -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-MML-AM_CHTML" async=""></script></p>
<script type="math/tex; mode=display">\sum_{i=1}^m y^{(i)}</script>
<script type="math/tex; mode=display">x = {-b \pm \sqrt{b^2-4ac} \over 2a}.</script>
<p>The GitHub help page on <a href="https://help.github.com/en/articles/setting-up-your-github-pages-site-locally-with-jekyll">setting up your github pages site locally with jekyll</a> might be useful. Check out the <a href="https://jekyllrb.com/docs/home">Jekyll docs</a> for more info on how to get the most out of Jekyll and the <a href="https://github.com/jekyll/minima">source</a> of the minima theme to see what a complete Jekyll site looks like. If you have questions, you can ask them on <a href="https://talk.jekyllrb.com/">Jekyll Talk</a>.</p>