webdump_tests

Testfiles for webdump
git clone git://git.codemadness.org/webdump_tests
Log | Files | Refs | README

commit 6d7678a78907c999dc004785d718347ccc01d0b3
parent 667920b3abc123846ebb61af422fdeba4282a43c
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Tue, 19 Sep 2023 20:03:22 +0200

add catb.org page

Diffstat:
Arealworld/catb.org.html | 639+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 639 insertions(+), 0 deletions(-)

diff --git a/realworld/catb.org.html b/realworld/catb.org.html @@ -0,0 +1,638 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>The Lost Art of Structure Packing</title><link rel="stylesheet" type="text/css" href="docbook-xsl.css" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /></head><body><div xml:lang="en" class="article" lang="en"><div class="titlepage"><div><div><h2 class="title"><a id="idm1"></a>The Lost Art of Structure Packing</h2></div><div><div class="author"><h3 class="author"><span class="firstname">Eric</span> <span class="othername">S.</span> <span class="surname">Raymond</span></h3><code class="email">&lt;<a class="email" href="mailto:esr@thyrsus.com">esr@thyrsus.com</a>&gt;</code></div></div></div><hr /></div><div class="toc"><p><strong>Table of Contents</strong></p><dl class="toc"><dt><span class="section"><a href="#_who_should_read_this">1. Who should read this</a></span></dt><dt><span class="section"><a href="#_why_i_wrote_it">2. Why I wrote it</a></span></dt><dt><span class="section"><a href="#_alignment_requirements">3. Alignment requirements</a></span></dt><dt><span class="section"><a href="#_padding">4. Padding</a></span></dt><dt><span class="section"><a href="#_structure_alignment_and_padding">5. Structure alignment and padding</a></span></dt><dt><span class="section"><a href="#_bitfields">6. Bitfields</a></span></dt><dt><span class="section"><a href="#_structure_reordering">7. Structure reordering</a></span></dt><dt><span class="section"><a href="#_awkward_scalar_cases">8. Awkward scalar cases</a></span></dt><dt><span class="section"><a href="#_readability_and_cache_locality">9. Readability and cache locality</a></span></dt><dt><span class="section"><a href="#_other_packing_techniques">10. Other packing techniques</a></span></dt><dt><span class="section"><a href="#_overriding_alignment_rules">11. Overriding alignment rules</a></span></dt><dt><span class="section"><a href="#_tools">12. Tools</a></span></dt><dt><span class="section"><a href="#_proof_and_exceptional_cases">13. Proof and exceptional cases</a></span></dt><dt><span class="section"><a href="#C-like">14. Other languages</a></span></dt><dd><dl><dt><span class="section"><a href="#_c">14.1. C++</a></span></dt><dt><span class="section"><a href="#_go">14.2. Go</a></span></dt><dt><span class="section"><a href="#_rust">14.3. Rust</a></span></dt><dt><span class="section"><a href="#_java">14.4. Java</a></span></dt><dt><span class="section"><a href="#_swift">14.5. Swift</a></span></dt><dt><span class="section"><a href="#_c_2">14.6. C#</a></span></dt></dl></dd><dt><span class="section"><a href="#_supporting_this_work">15. Supporting this work</a></span></dt><dt><span class="section"><a href="#_related_reading">16. Related Reading</a></span></dt><dt><span class="section"><a href="#_version_history">17. Version history</a></span></dt></dl></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_who_should_read_this"></a>1. Who should read this</h2></div></div></div><p>This page is about a technique for reducing the memory footprint of +programs in compiled languages with C-like structures - manually +repacking these declarations for reduced size. To read it, you will +require basic knowledge of the C programming language.</p><p>You need to know this technique if you intend to write code for +memory-constrained embedded systems, or operating-system kernels. It +is useful if you are working with application data sets so large that +your programs routinely hit memory limits. It is good to know in any +application where you really, <span class="emphasis"><em>really</em></span> care about optimizing your use +of memory bandwidth and minimizing cache-line misses.</p><p>Finally, knowing this technique is a gateway to other esoteric C +topics. You are not an advanced C programmer until you have grasped +these rules. You are not a master of C until you could have written +this document yourself and can criticize it intelligently.</p><p>This document originated with "C" in the title, but almost everything +in it applies to C++ as well. Many of the techniques discussed here +also apply to the Go language, and should generalize to any compiled +language with C-like structures. There is a note discussing C++, Go, +Rust, Java, Swift, and C# towards the end.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_why_i_wrote_it"></a>2. Why I wrote it</h2></div></div></div><p>This webpage exists because in late 2013 I found myself heavily +applying an optimization technique that I had learned more than two +decades previously and not used much since.</p><p>I needed to reduce the memory footprint of a program that used thousands - +sometimes hundreds of thousands - of C struct instances. The program +was <a class="ulink" href="http://www.catb.org/~esr/cvs-fast-export" target="_top">cvs-fast-export</a> and +the problem was that it was dying with out-of-memory errors on large +repositories.</p><p>There are ways to reduce memory usage significantly in situations like +this, by rearranging the order of structure members in careful ways. +This can lead to dramatic gains - in my case I was able to cut the +working-set size by around 40%, enabling the program to handle +much larger repositories without dying.</p><p>But as I worked, and thought about what I was doing, it began to dawn +on me that the technique I was using has been more than half forgotten +in these latter days. A little web research confirmed that +programmers don’t seem to talk about it much any more, at least not +where a search engine can see them. A couple of Wikipedia entries +touch the topic, but I found nobody who covered it comprehensively.</p><p>There are actually reasons for this that aren’t stupid. CS courses +(rightly) steer people away from micro-optimization towards finding better +algorithms. The plunging price of machine resources has made squeezing +memory usage less necessary. And the way hackers used to learn how to +do it back in the day was by bumping their noses on strange hardware +architectures - a less common experience now.</p><p>But the technique still has value in important situations, and will +as long as memory is finite. This document is intended to save +programmers from having to rediscover the technique, so they can +concentrate effort on more important things.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_alignment_requirements"></a>3. Alignment requirements</h2></div></div></div><p>The first thing to understand is that, on modern processors, the way +your compiler lays out basic datatypes in memory is constrained +in order to make memory accesses faster. Our examples are in C, but +any compiled language generates code under the same constraints.</p><p>There is a large class of modern ISAs (Instruction Set Architectures) +for which these constraints lead to identical layouts. These ISAs +include Intel, ARM, and RISC-V; I will refer to these as "vanilla" +ISAs.</p><p>Storage for the basic C datatypes on a vanilla ISA doesn’t +normally start at arbitrary byte addresses in memory. Rather, each +type except char has an <span class="emphasis"><em>alignment requirement</em></span>; chars can start on +any byte address, but 2-byte shorts must start on an even address, +4-byte ints or floats must start on an address divisible by 4, and +8-byte longs or doubles must start on an address divisible by 8. +Signed or unsigned makes no difference.</p><p>The jargon for this is that basic C types on a vanilla ISA are +<span class="emphasis"><em>self-aligned</em></span>. Pointers, whether 32-bit (4-byte) or 64-bit (8-byte) +are self-aligned too.</p><p>Self-alignment makes access faster because it facilitates generating +single-instruction fetches and puts of the typed data. Without +alignment constraints, on the other hand, the code might end up having +to do two or more accesses spanning machine-word boundaries. +Characters are a special case; they’re equally expensive from anywhere +they live inside a single machine word. That’s why they don’t have a +preferred alignment.</p><p>I said "on modern processors" because on some older ones forcing your +C program to violate alignment rules (say, by casting an odd address +into an int pointer and trying to use it) didn’t just slow your code +down, it caused an illegal instruction fault. This was the behavior, for +example, on Sun SPARC chips. In fact, with sufficient determination +and the right (e18) hardware flag set on the processor, you can still +trigger this on x86.</p><p>Also, self-alignment is not the only possible rule. Historically, some +processors (especially those lacking +<a class="ulink" href="https://en.wikipedia.org/wiki/Barrel_shifter" target="_top">barrel shifters</a>) have had +more restrictive ones. If you do embedded systems, you might trip over one +of these lurking in the underbrush. Be aware this is possible.</p><p>One curious and illustrative exception is the Motorola 68020 and its +successors. These are word-oriented 32-bit machines - that is, the +underlying granularity of fast access is 16 bits. Compilers can start +structs on 16-bit boundaries without a speed penalty, even if the +first member was a 32-bit scalar. Therefore, only character fields +with odd byte lengths can ever cause padding.</p><p>From when it was first written at the beginning of 2014 until late +2016, this section ended with other caveats about odd +architectures. During that period I’ve learned something rather +reassuring from working with the source code for the reference +implementation of NTP. It does packet analysis by reading packets off +the wire directly into memory that the rest of the code sees as a +struct, relying on the assumption of minimal self-aligned padding - or +zero padding in odd cases like 690x0.</p><p>The interesting news is that NTP has apparently being getting away +with this for <span class="strong"><strong>decades</strong></span> across a very wide span of hardware, operating +systems, and compilers, including not just Unixes but under Windows +variants as well. This suggests that platforms with padding rules +other than self-alignment are either nonexistent or confined to +such specialized niches that they’re never either NTP servers or +clients.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_padding"></a>4. Padding</h2></div></div></div><p>Now we’ll look at a simple example of variable layout in +memory. Consider the following series of variable declarations +in the top level of a C module:</p><pre class="screen">char *p; +char c; +int x;</pre><p>If you didn’t know anything about data alignment, you might assume +that these three variables would occupy a continuous span of bytes in +memory. That is, on a 32-bit machine 4 bytes of pointer would be +immediately followed by 1 byte of char and that immediately followed +by 4 bytes of int. And a 64-bit machine would be different only +in that the pointer would be 8 bytes.</p><p>In fact, the hidden assumption that the allocated order of static +variables is their source order is not necessarily valid; the C +standards don’t mandate it. I’m going to ignore this detail because +(a) that hidden assumption is usually correct anyway, and (b) the +actual purpose of talking about padding and packing outside structures +is to prepare you for what happens inside them.</p><p>Here’s what actually happens (on an x86 or ARM or anything else with +self-aligned types). The storage for <code class="literal">p</code> starts on a self-aligned 4- +or 8-byte boundary depending on the machine word size. This is +<span class="emphasis"><em>pointer alignment</em></span> - the strictest possible.</p><p>The storage for <code class="literal">c</code> follows immediately. But the 4-byte alignment +requirement of <code class="literal">x</code> forces a gap in the layout; it comes out as though +there were a fourth intervening variable, like this:</p><pre class="screen">char *p; /* 4 or 8 bytes */ +char c; /* 1 byte */ +char pad[3]; /* 3 bytes */ +int x; /* 4 bytes */</pre><p>The <code class="literal">pad[3]</code> character array represents the fact that there are three +bytes of waste space in the structure. The old-school term for this +was "slop". The value of the padding bits is undefined; in particular +it is not guaranteed that they will be zeroed.</p><p>Compare what happens if <code class="literal">x</code> is a 2-byte short:</p><pre class="screen">char *p; +char c; +short x;</pre><p>In that case, the actual layout will be this:</p><pre class="screen">char *p; /* 4 or 8 bytes */ +char c; /* 1 byte */ +char pad[1]; /* 1 byte */ +short x; /* 2 bytes */</pre><p>On the other hand, if <code class="literal">x</code> is a long on a 64-bit machine</p><pre class="screen">char *p; +char c; +long x;</pre><p>we end up with this:</p><pre class="screen">char *p; /* 8 bytes */ +char c; /* 1 byte +char pad[7]; /* 7 bytes */ +long x; /* 8 bytes */</pre><p>If you have been following carefully, you are probably now wondering about +the case where the shorter variable declaration comes first:</p><pre class="screen">char c; +char *p; +int x;</pre><p>If the actual memory layout were written like this</p><pre class="screen">char c; +char pad1[M]; +char *p; +char pad2[N]; +int x;</pre><p>what can we say about <code class="literal">M</code> and <code class="literal">N</code>?</p><p>First, in this case <code class="literal">N</code> will be zero. The address of <code class="literal">x</code>, coming +right after <code class="literal">p</code>, is guaranteed to be pointer-aligned, which is +never less strict than int-aligned.</p><p>The value of <code class="literal">M</code> is less predictable. If the compiler +happened to map <code class="literal">c</code> to the last byte of a machine word, the next +byte (the first of <code class="literal">p</code>) would be the first byte of the next one +and properly pointer-aligned. M would be zero.</p><p>It is more likely that <code class="literal">c</code> will be mapped to the <span class="emphasis"><em>first</em></span> byte of a +machine word. In that case M will be whatever padding is needed to +ensure that <code class="literal">p</code> has pointer alignment - 3 on a 32-bit machine, +7 on a 64-bit machine.</p><p>Intermediate cases are possible. M can be anything from 0 to 7 +(0 to 3 on 32-bit) because a char can start on any byte boundary +in a machine word.</p><p>If you wanted to make those variables take up less space, +you could get that effect by swapping <code class="literal">x</code> with <code class="literal">c</code> in the +original sequence.</p><pre class="screen">char *p; /* 8 bytes */ +long x; /* 8 bytes */ +char c; /* 1 byte</pre><p>Usually, for the small number of scalar variables in your C programs, +bumming out the few bytes you can get by changing the order of +declaration won’t save you enough to be significant. The technique +becomes more interesting when applied to nonscalar variables - +especially structs.</p><p>Before we get to those, let’s dispose of arrays of scalars. On a +platform with self-aligned types, arrays of +char/short/int/long/pointer have no internal padding; each +member is automatically self-aligned at the end of the next one.</p><p>All these rules and examples map over to Go structs, and to Rust +structs with the "repr(C)" attribute, with only syntactic changes.</p><p>In the next section we will see that the same is <span class="strong"><strong>not</strong></span> necessarily true of +structure arrays.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_structure_alignment_and_padding"></a>5. Structure alignment and padding</h2></div></div></div><p>In general, a struct instance will have the alignment of its widest scalar +member. Compilers do this as the easiest way to ensure that all the +members are self-aligned for fast access.</p><p>Also, in C (and Go, and Rust) the address of a struct is the same as +the address of its first member - there is no leading padding. In +C++ this may not be true; see <a class="xref" href="#C-like" title="14. Other languages">Section 14, “Other languages”</a>.</p><p>(When you’re in doubt about this sort of thing, ANSI C provides an +offsetof() macro which can be used to read out structure member +offsets.)</p><p>Consider this struct:</p><pre class="screen">struct foo1 { + char *p; + char c; + long x; +};</pre><p>Assuming a 64-bit machine, any instance of <code class="literal">struct foo1</code> will have +8-byte alignment. The memory layout of one of these looks +unsurprising, like this:</p><pre class="screen">struct foo1 { + char *p; /* 8 bytes */ + char c; /* 1 byte + char pad[7]; /* 7 bytes */ + long x; /* 8 bytes */ +};</pre><p>It’s laid out exactly as though variables of these types has +been separately declared. But if we put <code class="literal">c</code> first, that’s +no longer true.</p><pre class="screen">struct foo2 { + char c; /* 1 byte */ + char pad[7]; /* 7 bytes */ + char *p; /* 8 bytes */ + long x; /* 8 bytes */ +};</pre><p>If the members were separate variables, <code class="literal">c</code> could start at any byte +boundary and the size of <code class="literal">pad</code> might vary. Because <code class="literal">struct foo2</code> has +the pointer alignment of its widest member, that’s no longer possible. +Now <code class="literal">c</code> has to be pointer-aligned, and following padding of 7 bytes is +locked in.</p><p>Now let’s talk about trailing padding on structures. To explain this, +I need to introduce a basic concept which I’ll call the <span class="emphasis"><em>stride +address</em></span> of a structure. It is the first address following the +structure data that has the <span class="strong"><strong>same alignment as the structure</strong></span>.</p><p>The general rule of trailing structure padding is this: the compiler +will behave as though the structure has trailing padding out to +its stride address. This rule controls what <code class="literal">sizeof()</code> will return.</p><p>Consider this example on a 64-bit x86 or ARM machine:</p><pre class="screen">struct foo3 { + char *p; /* 8 bytes */ + char c; /* 1 byte */ +}; + +struct foo3 singleton; +struct foo3 quad[4];</pre><p>You might think that <code class="literal">sizeof(struct foo3)</code> should be 9, but it’s +actually 16. The stride address is that of <code class="literal">(&amp;p)[2]</code>. Thus, in the <code class="literal">quad</code> +array, each member has 7 bytes of trailing padding, because the first +member of each following struct wants to be self-aligned on an 8-byte +boundary. The memory layout is as though the structure had been +declared like this:</p><pre class="screen">struct foo3 { + char *p; /* 8 bytes */ + char c; /* 1 byte */ + char pad[7]; +};</pre><p>For contrast, consider the following example:</p><pre class="screen">struct foo4 { + short s; /* 2 bytes */ + char c; /* 1 byte */ +};</pre><p>Because <code class="literal">s</code> only needs to be 2-byte aligned, the stride address is +just one byte after <code class="literal">c</code>, and <code class="literal">struct foo4</code> as a whole only needs one +byte of trailing padding. It will be laid out like this:</p><pre class="screen">struct foo4 { + short s; /* 2 bytes */ + char c; /* 1 byte */ + char pad[1]; +};</pre><p>and <code class="literal">sizeof(struct foo4)</code> will return 4.</p><p>Here’s a last important detail: If your structure has structure +members, the inner structs want to have the alignment of +longest scalar too. Suppose you write this:</p><pre class="screen">struct foo5 { + char c; + struct foo5_inner { + char *p; + short x; + } inner; +};</pre><p>The <code class="literal">char *p</code> member in the inner struct forces the outer struct to be +pointer-aligned as well as the inner. Actual layout will be like +this on a 64-bit machine:</p><pre class="screen">struct foo5 { + char c; /* 1 byte*/ + char pad1[7]; /* 7 bytes */ + struct foo5_inner { + char *p; /* 8 bytes */ + short x; /* 2 bytes */ + char pad2[6]; /* 6 bytes */ + } inner; +};</pre><p>This structure gives us a hint of the savings that might be possible +from repacking structures. Of 24 bytes, 13 of them are padding. +That’s more than 50% waste space!</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_bitfields"></a>6. Bitfields</h2></div></div></div><p>Now let’s consider C bitfields. What they give you the ability to do is +declare structure fields of smaller than character width, down to 1 +bit, like this:</p><pre class="screen">struct foo6 { + short s; + char c; + int flip:1; + int nybble:4; + int septet:7; +};</pre><p>The thing to know about bitfields is that they are implemented with +word- and byte-level mask and rotate instructions operating on machine +words, and cannot cross word boundaries. C99 guarentees that +bit-fields will be packed as tightly as possible, provided they don’t +cross storage unit boundaries (6.7.2.1 #10).</p><p>This restriction is relaxed in C11 (6.7.2.1p11) and C++14 +([class.bit]p1); these revisions do not actually require <code class="literal">struct foo9</code> +to be 64 bits instead of 32; a bit-field can span multiple allocation +units instead of starting a new one. It’s up to the implementation to +decide; GCC leaves it up to the ABI, which for x64 does prevent them +from sharing an allocation unit.</p><p>Assuming we’re on a 32-bit machine, the C99 rules imply that the layout may +look like this:</p><pre class="screen">struct foo6 { + short s; /* 2 bytes */ + char c; /* 1 byte */ + int flip:1; /* total 1 bit */ + int nybble:4; /* total 5 bits */ + int pad1:3; /* pad to an 8-bit boundary */ + int septet:7; /* 7 bits */ + int pad2:25; /* pad to 32 bits */ +};</pre><p>But this isn’t the only possibility, because the C standard does not +specify that bits are allocated low-to-high. So the layout could look +like this:</p><pre class="screen">struct foo6 { + short s; /* 2 bytes */ + char c; /* 1 byte */ + int pad1:3; /* pad to an 8-bit boundary */ + int flip:1; /* total 1 bit */ + int nybble:4; /* total 5 bits */ + int pad2:25; /* pad to 32 bits */ + int septet:7; /* 7 bits */ +};</pre><p>That is, the padding could precede rather than following the +payload bits.</p><p>Note also that, as with normal structure padding, the padding bits are +not guaranteed to be zero; C99 mentions this.</p><p>Note that the base type of a bit field is interpreted for signedness +but not necessarily for size. It is up to implementors whether +"short flip:1" or "long flip:1" are supported, and whether those +base types change the size of the storage unit the field is packed into.</p><p>Proceed with caution and check with -Wpadded if you have it available +(e.g. under clang). Compilers on exotic hardware might interpret the +C99 rules in surprising ways; older compilers might not quite follow +them.</p><p>The restriction that bitfields cannot cross machine word boundaries +means that, while the first two of the following structures pack into +one and two 32-bit words as you’d expect, the third (<code class="literal">struct foo9</code>) takes +up <span class="strong"><strong>three</strong></span> 32-bit words in C99, in the last of which only one bit is used.</p><pre class="screen">struct foo7 { + int bigfield:31; /* 32-bit word 1 begins */ + int littlefield:1; +}; + +struct foo8 { + int bigfield1:31; /* 32-bit word 1 begins /* + int littlefield1:1; + int bigfield2:31; /* 32-bit word 2 begins */ + int littlefield2:1; +}; + +struct foo9 { + int bigfield1:31; /* 32-bit word 1 begins */ + int bigfield2:31; /* 32-bit word 2 begins */ + int littlefield1:1; + int littlefield2:1; /* 32-bit word 3 begins */ +};</pre><p>Again, C11 and C++14 may pack foo9 tighter, but it would perhaps be +unwise to count on this.</p><p>On the other hand, <code class="literal">struct foo8</code> would fit into a single 64-bit word +if the machine has those.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_structure_reordering"></a>7. Structure reordering</h2></div></div></div><p>Now that you know how and why compilers insert padding in and after your +structures we’ll examine what you can do to squeeze out the slop. +This is the art of structure packing.</p><p>The first thing to notice is that slop only happens in two places. +One is where storage bound to a larger data type (with stricter +alignment requirements) follows storage bound to a smaller one. The +other is where a struct naturally ends before its stride address, +requiring padding so the next one will be properly aligned.</p><p>The simplest way to eliminate slop is to reorder the structure members +by decreasing alignment. That is: make all the pointer-aligned +subfields come first, because on a 64-bit machine they will be 8 +bytes. Then the 4-byte ints; then the 2-byte shorts; then the +character fields.</p><p>So, for example, consider this simple linked-list structure:</p><pre class="screen">struct foo10 { + char c; + struct foo10 *p; + short x; +};</pre><p>With the implied slop made explicit, here it is:</p><pre class="screen">struct foo10 { + char c; /* 1 byte */ + char pad1[7]; /* 7 bytes */ + struct foo10 *p; /* 8 bytes */ + short x; /* 2 bytes */ + char pad2[6]; /* 6 bytes */ +};</pre><p>That’s 24 bytes. If we reorder by size, we get this:</p><pre class="screen">struct foo11 { + struct foo11 *p; + short x; + char c; +};</pre><p>Considering self-alignment, we see that none of the data fields need +padding. This is because the stride address for a (longer) field with +stricter alignment is always a validly-aligned start address for a +(shorter) field with less strict requirements. All the repacked struct +actually requires is trailing padding:</p><pre class="screen">struct foo11 { + struct foo11 *p; /* 8 bytes */ + short x; /* 2 bytes */ + char c; /* 1 byte */ + char pad[5]; /* 5 bytes */ +};</pre><p>Our repack transformation drops the size from 24 to 16 bytes. This +might not seem like a lot, but suppose you have a linked list of 200K +of these? The savings add up fast - especially on memory-constrained +embedded systems or in the core part of an OS kernel that has to stay +resident.</p><p>Note that reordering is not guaranteed to produce savings. Applying this +technique to an earlier example, <code class="literal">struct foo5</code>, we get this:</p><pre class="screen">struct foo12 { + struct foo5 { + char *p; /* 8 bytes */ + short x; /* 2 bytes */ + } inner; + char c; /* 1 byte*/ +};</pre><p>With padding written out, this is</p><pre class="screen">struct foo12 { + struct foo5 { + char *p; /* 8 bytes */ + short x; /* 2 bytes */ + char pad[6]; /* 6 bytes */ + } inner; + char c; /* 1 byte*/ + char pad[7]; /* 7 bytes */ +};</pre><p>It’s still 24 bytes because <code class="literal">c</code> cannot back into the inner struct’s +trailing padding. To collect that gain you would need to redesign +your data structures.</p><p>Curiously, strictly ordering your structure fields by <span class="emphasis"><em>increasing</em></span> +size also works to mimimize padding. You can minimize padding with +any order in which (a) all fields of any one size are in a continuous +span (completely eliminating padding between them), and (b) the gaps +between those spans are such that the sizes on either side have as few +doubling steps of difference from each other as possible. Usually +this means no padding at all on one side.</p><p>Even more general minimal-padding orders are possible. Example:</p><pre class="screen">struct foo13 { + int32_t i; + int32_t i2; + char octet[8]; + int32_t i3; + int32_t i4; + int64_t l; + int32_t i5; + int32_t i6; +};</pre><p>This struct has zero padding under self-alignment rules. Working out +why is a useful exercise to develop your understanding.</p><p>Since shipping the first version of this guide I have been asked why, +if reordering for minimal slop is so simple, C compilers don’t do it +automatically. The answer: C is a language originally designed for +writing operating systems and other code close to the +hardware. Automatic reordering would interfere with a systems +programmer’s ability to lay out structures that exactly match the +byte and bit-level layout of memory-mapped device control blocks.</p><p>Go hews to the C philosophy and does not reorder fields. Rust makes +the opposite choice; by default, its compiler <span class="emphasis"><em>may</em></span> reorder structure +fields.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_awkward_scalar_cases"></a>8. Awkward scalar cases</h2></div></div></div><p>Using enumerated types instead of #defines is a good idea, if only +because symbolic debuggers have those symbols available and can +show them rather than raw integers. But, while enums are guaranteed +to be compatible with an integral type, the C standard does +not specify <span class="emphasis"><em>which</em></span> underlying integral type is to be used for them.</p><p>Be aware when repacking your structs that while enumerated-type +variables are usually ints, this is compiler-dependent; they could be +shorts, longs, or even chars by default. Your compiler may have a +pragma or command-line option to force the size.</p><p>The <code class="literal">long double</code> type is a similar trouble spot. Some C platforms +implement this in 80 bits, some in 128, and some of the 80-bit +platforms pad it to 96 or 128 bits.</p><p>In both cases it’s best to use <code class="literal">sizeof()</code> to check the storage size.</p><p>Finally, under x86 Linux doubles are sometimes an exception to the +self-alignment rule; an 8-byte double may require only 4-byte +alignment within a struct even though standalone doubles variables +have 8-byte self-alignment. This depends on compiler and options.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_readability_and_cache_locality"></a>9. Readability and cache locality</h2></div></div></div><p>While reordering by size is the simplest way to eliminate slop, it’s +not necessarily the right thing. There are two more issues: +readability and cache locality.</p><p>Programs are not just communications to a computer, they are +communications to other human beings. Code readability is important +even (or especially!) when the audience of the communication is only +your future self.</p><p>A clumsy, mechanical reordering of your structure can harm +readability. When possible, it is better to reorder fields so they +remain in coherent groups with semantically related pieces of data +kept close together. Ideally, the design of your structure should +communicate the design of your program.</p><p>When your program frequently accesses a structure, or parts of a +structure, it is helpful for performance if the accesses tend to fit +within a <span class="emphasis"><em>cache line</em></span> - the memory block fetched by your processor +when it is told to get any single address within the block. On 64-bit +x86 a cache line is 64 bytes beginning on a self-aligned address; on +other platforms it is often 32 bytes.</p><p>The things you should do to preserve readability - grouping related +and co-accessed data in adjacent fields - also improve cache-line +locality. These are both reasons to reorder intelligently, with +awareness of your code’s data-access patterns.</p><p>If your code does concurrent access to a structure from multiple +threads, there’s a third issue: cache line bouncing. To minimize +expensive bus traffic, you should arrange your data so that reads +come from one cache line and writes go to another in your tighter +loops.</p><p>And yes, this sometimes contradicts the previous guidance about grouping +related data in the same cache-line-sized block. Multithreading is hard. +Cache-line bouncing and other multithread optimization issues are very +advanced topics which deserve an entire tutorial of their own. The +best I can do here is make you aware that these issues exist.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_other_packing_techniques"></a>10. Other packing techniques</h2></div></div></div><p>Reordering works best when combined with other techniques for slimming +your structures. If you have several boolean flags in a struct, for example, +consider reducing them to 1-bit bitfields and packing them into a +place in the structure that would otherwise be slop.</p><p>You’ll take a small access-time penalty for this - but if it squeezes +the working set enough smaller, that penalty will be swamped by your +gains from avoided cache misses.</p><p>More generally, look for ways to shorten data field sizes. In +cvs-fast-export, for example, one squeeze I applied was to use the +knowledge that RCS and CVS repositories didn’t exist before 1982. I +dropped a 64-bit Unix <code class="literal">time_t</code> (zero date at the beginning of 1970) +for a 32-bit time offset from 1982-01-01T00:00:00; this will cover +dates to 2118. (Note: if you pull a trick like this, <span class="emphasis"><em>do a bounds +check whenever you set the field</em></span> to prevent nasty bugs!)</p><p>Each such field shortening not only decreases the explicit size +of your structure, it may remove slop and/or create additional +opportunities for gains from field reordering. Virtuous cascades +of such effects are not very hard to trigger.</p><p>The riskiest form of packing is to use unions. If you know that +certain fields in your structure are never used in combination with +certain other fields, consider using a union to make them share +storage. But be extra careful and verify your work with regression +testing, because if your lifetime analysis is even slightly wrong you +will get bugs ranging from crashes to (much worse) subtle data +corruption.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_overriding_alignment_rules"></a>11. Overriding alignment rules</h2></div></div></div><p>Sometimes you can coerce your compiler into not using the processor’s +normal alignment rules by using a pragma, usually <code class="literal">#pragma pack</code>. +GCC and clang have a "packed" attribute you can attach to +individual structure declarations; GCC has an -fpack-struct option +for entire compilations.</p><p>Do not do this casually, as it forces the generation of more expensive +and slower code. Usually you can save as much memory, or almost as +much, with the techniques I describe here.</p><p>The only really compelling reason for <code class="literal">#pragma pack</code> is if you have to +exactly match your C data layout to some kind of bit-level hardware or +protocol requirement, like a memory-mapped hardware port, and +violating normal alignment is required for that to work. If you’re in +that situation, and you don’t already know everything else I’m writing +about here, you’re in deep trouble and I wish you luck.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_tools"></a>12. Tools</h2></div></div></div><p>The clang compiler has a -Wpadded option that causes it to generate +messages about alignment holes and padding. Some versions also have +an undocumented -fdump-record-layouts option that yields +<a class="ulink" href="http://lists.cs.uiuc.edu/pipermail/cfe-dev/2014-July/037778.html" target="_top">more +information</a>.</p><p>If you’re using C11, you can deploy static_assert to check your +assumptions about type and structure sizes. Example:</p><pre class="screen">#include &lt;assert.h&gt; +struct foo4 { + short s; /* 2 bytes */ + char c; /* 1 byte */ +}; +static_assert(sizeof(struct foo4) == 4, “Check your assumptions");</pre><p>I have not used it myself, but several respondents speak well of a +program called <code class="literal">pahole</code>. This tool cooperates with a compiler to produce +reports on your structures that describe padding, alignment, and +cache line boundaries. This was at one time a standalone C program, +but that is now unmaintained; a script with the name pahole now ships +with gdb and that is what you should use.</p><p>I’ve received a report that a proprietary code auditing tool called +"PVS Studio" can detect structure-packing opportunities.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_proof_and_exceptional_cases"></a>13. Proof and exceptional cases</h2></div></div></div><p>You can download sourcecode for a little program that demonstrates the +assertions about scalar and structure sizes made above. It is +<a class="ulink" href="packtest.c" target="_top">packtest.c</a>.</p><p>If you look through enough strange combinations of compilers, options, +and unusual hardware, you will find exceptions to some of the rules I +have described. They get more common as you go back in time to older +processor designs.</p><p>The next level beyond knowing these rules is knowing how and when to +expect that they will be broken. In the years when I learned them +(the early 1980s) we spoke of people who didn’t get this as victims of +"all-the-world’s-a-VAX" syndrome. Remember that not all the world is +vanilla.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="C-like"></a>14. Other languages</h2></div></div></div><p>In this section we’ll call a language "C-like" if structure and array +members are self-aligned, are not reordered by the compiler, the address +of a struct is the address of its first member, and structs have +trailing pdding to their stride length.</p><p>If you know the implications of self-aligment in C, you can apply them +directly to calculating sizes and offsets in any language that is +C-like in this sense, and to space-optimizing in the language’s +structures.</p><div class="section"><div class="titlepage"><div><div><h3 class="title"><a id="_c"></a>14.1. C++</h3></div></div></div><p>C++ is C-like, except that classes that look like structs may ignore +the rule that the address of a struct is the address of its first +member! Whether they do or not depends on how base classes and virtual +member functions are implemented, and varies by compiler. Otherwise +everything we’ve observed about C applies.</p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a id="_go"></a>14.2. Go</h3></div></div></div><p>The Go language is in many respects similar to C. It has structures +and arrays, though not bitfields or unions. Go compilers have the +same optimization and alignment issues as C compilers. As in C, array +elements are padded up to the following stride address.</p><p>There is no Go equivalent of the C #pack pragma.</p><p>Variables and struct fields will normally be self-aligned for the same +reasons rgis is a rule in C. However, there is one peculiat +exception; on 32—bit platforms, 64-bit struct fields only require +akignment on a machine word boundary, e.g. 32 bits. There has been +discussion of +<a class="ulink" href="https://go.googlesource.com/proposal/+/master/design/36606-64-bit-field-alignment.md[a" target="_top">https://go.googlesource.com/proposal/+/master/design/36606-64-bit-field-alignment.md[a</a> +proposal to change this, but it stands.</p><p>On the other hand, the Go specification makes no guarantees about how +structure fields are ordered. Unlike C, it would be legal for a Go +compiler to lay out fields in an order different from their +specification in the source. As of 2022 no Go compiler that +actually does this has beenbn sughted in the wild.</p><p>Go has one odd really odd quirk. Since Go 1.5, a zero-length field at +the end of a struct (that is, a zero-length array or empty struct) is +sized and aligned as though it is one byte. The reasons for this are +discussed in an essay +<a class="ulink" href="https://dave.cheney.net/2015/10/09/padding-is-hard" target="_top">Padding is Hard</a> by +one of the Go developers.</p><p>There’s a +<a class="ulink" href="https://itnext.io/structure-size-optimization-in-golang-alignment-padding-more-effective-memory-layout-linters-fffdcba27c61" target="_top">specific +discussion of Go alignment rules</a> that includes pointers tp some tools +that can automatically tweak your structures to optimal alignment. +Another such tool is +<a class="ulink" href="https://github.com/orijtech/structslop" target="_top">strucrslop</a>. I have not used +any of these, try at your own risk.</p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a id="_rust"></a>14.3. Rust</h3></div></div></div><p>Rust follows C-like packing rules if a structure is annotated with +"repr(C)". Otherwise (by default) all bets are off: padding rules are +(deliberately) unspecified and the compiler may even reorder structure +members. It is probably best to let the Rust compiler do space +optimization rather than forcing it.</p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a id="_java"></a>14.4. Java</h3></div></div></div><p>Java’s JNI (Java Native Interface) supports C-like packing rules for +structure members so that JNI Java structures can map exactly to C +equivalents. There is a pack pragma.</p><p>Packing in the JVM itself, however, is not well defined. The JVM +spec simply says "The Java Virtual Machine does not mandate any +particular internal structure for objects.", making choices about +this implementation-dependent.</p><p>That said, many JVM implementations are word-oriented in an even +stricter way than Motorola processors - structure and array members +can start at any 32-bit boundary but not at a 16- or 8-bit one. This +will create internal padding after char and 16-bit short members where +it wouldn’t be expected under C-like rules.</p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a id="_swift"></a>14.5. Swift</h3></div></div></div><p>Swift is +<a class="ulink" href="https://swiftunboxed.com/internals/size-stride-alignment/" target="_top">exactly +C-like</a>. There is no equivalent of a pack pragma.</p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a id="_c_2"></a>14.6. C#</h3></div></div></div><p>C# is C-like with the default structure layout attribute +LayoutKind.Sequential. LayoutKind.Auto allows the compiler to +reorder, and LayoutKind.Explicit allows the programmmer to specify +field sizes explicitly. There’s a Pack modifier which is equivalent to +a C pack pragma.</p></div></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_supporting_this_work"></a>15. Supporting this work</h2></div></div></div><p>If you were educated or entertained by this document, please sign up +for my <a class="ulink" href="https://www.patreon.com/esr" target="_top">Patreon feed</a>. The time needed to +write and maintain documents like this one is not free, and while +I enjoying giving them to the world my bills won’t pay themselves. +Even a few dollars a month - from enough of you - helps a lot.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_related_reading"></a>16. Related Reading</h2></div></div></div><p>This section exists to collect pointers to essays which I judge to be +good companions to this one.</p><p><a class="ulink" href="http://blog.regehr.org/archives/213" target="_top">A Guide to Undefined Behavior in C and C++</a></p><p><a class="ulink" href="http://www.catb.org/esr/time-programming/" target="_top">Time, Clock, and Calendar Programming In C</a></p><p><a class="ulink" href="http://www.catb.org/esr/faqs/things-every-hacker-once-knew/" target="_top">Things Every Hacker Once Knew</a></p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_version_history"></a>17. Version history</h2></div></div></div><div class="variablelist"><dl class="variablelist"><dt><span class="term"> +2.5 @ 2020-04-30 +</span></dt><dd> + Revision and expansion of the section on Go packing/alignment rules. +</dd><dt><span class="term"> +2.4 @ 2020-04-28 +</span></dt><dd> + Added coverage of Java, Swift, and C#. +</dd><dt><span class="term"> +2.3 @ 2020-04-20 +</span></dt><dd> + Describe exceptional alignment rules on Motorola 680x0. + Sightly improve coverage of Rust alignment rules. +</dd><dt><span class="term"> +2.2 @ 2019-12-19 +</span></dt><dd> + Minor markup fix. +</dd><dt><span class="term"> +2.1 @ 2019-02-14 +</span></dt><dd> + Correct a minor error in one of the examples pointed out by + Luis Emilio Moreno Durán. +</dd><dt><span class="term"> +2.0 @ 2018-08-06 +</span></dt><dd> + Drop "C" out of the title as these techniques are applicable to Go + and Rust as well. More coverage of Go and Rust. The pahole tool + has been replaced by a script in the gdb distribution. +</dd><dt><span class="term"> +1.19 @ 2018-02-12 +</span></dt><dd> + Describe usefulness of static_assert() in C11. Added Go and Rust coverage. +</dd><dt><span class="term"> +1.18 @ 2017-06-01 +</span></dt><dd> + More general zero-padding orders. C11 and C14 relax a constraint + on bitfield packing. +</dd><dt><span class="term"> +1.17 @ 2016-11-14 +</span></dt><dd> + Typo fixes. +</dd><dt><span class="term"> +1.16 @ 2016-10-21 +</span></dt><dd> + Answer an objection about allocation order being unrelated to + source order. +</dd><dt><span class="term"> +1.15 @ 2016-10-20 +</span></dt><dd> + Note the field evidence from NTP. +</dd><dt><span class="term"> +1.14 @ 2015-12-19 +</span></dt><dd> + Typo correction: -Wpadding → -Wpadded. +</dd><dt><span class="term"> +1.13 @ 2015-11-23 +</span></dt><dd> + Be explicit about padding bits being undefined. More about bitfields. +</dd><dt><span class="term"> +1.12 @ 2015-11-11 +</span></dt><dd> + Major revision of section on bitfields reflecting C99 rules. +</dd><dt><span class="term"> +1.11 @ 2015-07-23 +</span></dt><dd> + Mention the clang -fdump-record-layouts option. +</dd><dt><span class="term"> +1.10 @ 2015-02-20 +</span></dt><dd> + Mention <span class="emphasis"><em>attribute</em></span><a id="idm359" class="indexterm"></a><span class="emphasis"><em>packed</em></span>, -fpack-struct, and PVS Studio. +</dd><dt><span class="term"> +1.9 @ 2014-10-01 +</span></dt><dd> + Added link to "Time, Clock, and Calendar Programming In C". +</dd><dt><span class="term"> +1.8 @ 2014-05-20 +</span></dt><dd> + Improved explanation for the bitfield examples, +</dd><dt><span class="term"> +1.7 @ 2014-05-17 +</span></dt><dd> + Correct a minor error in the description of the layout of <code class="literal">struct foo8</code>. +</dd><dt><span class="term"> +1.6 @ 2014-05-14 +</span></dt><dd> + Emphasize that bitfields cannot cross word boundaries. Idea from + Dale Gulledge. +</dd><dt><span class="term"> +1.5 @ 2014-01-13 +</span></dt><dd> + Explain why structure member reordering is not done automatically. +</dd><dt><span class="term"> +1.4 @ 2014-01-04 +</span></dt><dd> + A note about double under x86 Linux. +</dd><dt><span class="term"> +1.3 @ 2014-01-03 +</span></dt><dd> + New sections on awkward scalar cases, readability and cache + locality, and tools. +</dd><dt><span class="term"> +1.2 @ 2014-01-02 +</span></dt><dd> + Correct an erroneous address calculation. +</dd><dt><span class="term"> +1.1 @ 2014-01-01 +</span></dt><dd> + Explain why aligned accesses are faster. Mention offsetof. + Various minor fixes, including the packtest.c download link. +</dd><dt><span class="term"> +1.0 @ 2014-01-01 +</span></dt><dd> + Initial release. +</dd></dl></div></div></div></body></html> +\ No newline at end of file