commit 6d7678a78907c999dc004785d718347ccc01d0b3
parent 667920b3abc123846ebb61af422fdeba4282a43c
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Tue, 19 Sep 2023 20:03:22 +0200
add catb.org page
Diffstat:
1 file changed, 639 insertions(+), 0 deletions(-)
diff --git a/realworld/catb.org.html b/realworld/catb.org.html
@@ -0,0 +1,638 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>The Lost Art of Structure Packing</title><link rel="stylesheet" type="text/css" href="docbook-xsl.css" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /></head><body><div xml:lang="en" class="article" lang="en"><div class="titlepage"><div><div><h2 class="title"><a id="idm1"></a>The Lost Art of Structure Packing</h2></div><div><div class="author"><h3 class="author"><span class="firstname">Eric</span> <span class="othername">S.</span> <span class="surname">Raymond</span></h3><code class="email"><<a class="email" href="mailto:esr@thyrsus.com">esr@thyrsus.com</a>></code></div></div></div><hr /></div><div class="toc"><p><strong>Table of Contents</strong></p><dl class="toc"><dt><span class="section"><a href="#_who_should_read_this">1. Who should read this</a></span></dt><dt><span class="section"><a href="#_why_i_wrote_it">2. Why I wrote it</a></span></dt><dt><span class="section"><a href="#_alignment_requirements">3. Alignment requirements</a></span></dt><dt><span class="section"><a href="#_padding">4. Padding</a></span></dt><dt><span class="section"><a href="#_structure_alignment_and_padding">5. Structure alignment and padding</a></span></dt><dt><span class="section"><a href="#_bitfields">6. Bitfields</a></span></dt><dt><span class="section"><a href="#_structure_reordering">7. Structure reordering</a></span></dt><dt><span class="section"><a href="#_awkward_scalar_cases">8. Awkward scalar cases</a></span></dt><dt><span class="section"><a href="#_readability_and_cache_locality">9. Readability and cache locality</a></span></dt><dt><span class="section"><a href="#_other_packing_techniques">10. Other packing techniques</a></span></dt><dt><span class="section"><a href="#_overriding_alignment_rules">11. Overriding alignment rules</a></span></dt><dt><span class="section"><a href="#_tools">12. Tools</a></span></dt><dt><span class="section"><a href="#_proof_and_exceptional_cases">13. Proof and exceptional cases</a></span></dt><dt><span class="section"><a href="#C-like">14. Other languages</a></span></dt><dd><dl><dt><span class="section"><a href="#_c">14.1. C++</a></span></dt><dt><span class="section"><a href="#_go">14.2. Go</a></span></dt><dt><span class="section"><a href="#_rust">14.3. Rust</a></span></dt><dt><span class="section"><a href="#_java">14.4. Java</a></span></dt><dt><span class="section"><a href="#_swift">14.5. Swift</a></span></dt><dt><span class="section"><a href="#_c_2">14.6. C#</a></span></dt></dl></dd><dt><span class="section"><a href="#_supporting_this_work">15. Supporting this work</a></span></dt><dt><span class="section"><a href="#_related_reading">16. Related Reading</a></span></dt><dt><span class="section"><a href="#_version_history">17. Version history</a></span></dt></dl></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_who_should_read_this"></a>1. Who should read this</h2></div></div></div><p>This page is about a technique for reducing the memory footprint of
+programs in compiled languages with C-like structures - manually
+repacking these declarations for reduced size. To read it, you will
+require basic knowledge of the C programming language.</p><p>You need to know this technique if you intend to write code for
+memory-constrained embedded systems, or operating-system kernels. It
+is useful if you are working with application data sets so large that
+your programs routinely hit memory limits. It is good to know in any
+application where you really, <span class="emphasis"><em>really</em></span> care about optimizing your use
+of memory bandwidth and minimizing cache-line misses.</p><p>Finally, knowing this technique is a gateway to other esoteric C
+topics. You are not an advanced C programmer until you have grasped
+these rules. You are not a master of C until you could have written
+this document yourself and can criticize it intelligently.</p><p>This document originated with "C" in the title, but almost everything
+in it applies to C++ as well. Many of the techniques discussed here
+also apply to the Go language, and should generalize to any compiled
+language with C-like structures. There is a note discussing C++, Go,
+Rust, Java, Swift, and C# towards the end.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_why_i_wrote_it"></a>2. Why I wrote it</h2></div></div></div><p>This webpage exists because in late 2013 I found myself heavily
+applying an optimization technique that I had learned more than two
+decades previously and not used much since.</p><p>I needed to reduce the memory footprint of a program that used thousands -
+sometimes hundreds of thousands - of C struct instances. The program
+was <a class="ulink" href="http://www.catb.org/~esr/cvs-fast-export" target="_top">cvs-fast-export</a> and
+the problem was that it was dying with out-of-memory errors on large
+repositories.</p><p>There are ways to reduce memory usage significantly in situations like
+this, by rearranging the order of structure members in careful ways.
+This can lead to dramatic gains - in my case I was able to cut the
+working-set size by around 40%, enabling the program to handle
+much larger repositories without dying.</p><p>But as I worked, and thought about what I was doing, it began to dawn
+on me that the technique I was using has been more than half forgotten
+in these latter days. A little web research confirmed that
+programmers don’t seem to talk about it much any more, at least not
+where a search engine can see them. A couple of Wikipedia entries
+touch the topic, but I found nobody who covered it comprehensively.</p><p>There are actually reasons for this that aren’t stupid. CS courses
+(rightly) steer people away from micro-optimization towards finding better
+algorithms. The plunging price of machine resources has made squeezing
+memory usage less necessary. And the way hackers used to learn how to
+do it back in the day was by bumping their noses on strange hardware
+architectures - a less common experience now.</p><p>But the technique still has value in important situations, and will
+as long as memory is finite. This document is intended to save
+programmers from having to rediscover the technique, so they can
+concentrate effort on more important things.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_alignment_requirements"></a>3. Alignment requirements</h2></div></div></div><p>The first thing to understand is that, on modern processors, the way
+your compiler lays out basic datatypes in memory is constrained
+in order to make memory accesses faster. Our examples are in C, but
+any compiled language generates code under the same constraints.</p><p>There is a large class of modern ISAs (Instruction Set Architectures)
+for which these constraints lead to identical layouts. These ISAs
+include Intel, ARM, and RISC-V; I will refer to these as "vanilla"
+ISAs.</p><p>Storage for the basic C datatypes on a vanilla ISA doesn’t
+normally start at arbitrary byte addresses in memory. Rather, each
+type except char has an <span class="emphasis"><em>alignment requirement</em></span>; chars can start on
+any byte address, but 2-byte shorts must start on an even address,
+4-byte ints or floats must start on an address divisible by 4, and
+8-byte longs or doubles must start on an address divisible by 8.
+Signed or unsigned makes no difference.</p><p>The jargon for this is that basic C types on a vanilla ISA are
+<span class="emphasis"><em>self-aligned</em></span>. Pointers, whether 32-bit (4-byte) or 64-bit (8-byte)
+are self-aligned too.</p><p>Self-alignment makes access faster because it facilitates generating
+single-instruction fetches and puts of the typed data. Without
+alignment constraints, on the other hand, the code might end up having
+to do two or more accesses spanning machine-word boundaries.
+Characters are a special case; they’re equally expensive from anywhere
+they live inside a single machine word. That’s why they don’t have a
+preferred alignment.</p><p>I said "on modern processors" because on some older ones forcing your
+C program to violate alignment rules (say, by casting an odd address
+into an int pointer and trying to use it) didn’t just slow your code
+down, it caused an illegal instruction fault. This was the behavior, for
+example, on Sun SPARC chips. In fact, with sufficient determination
+and the right (e18) hardware flag set on the processor, you can still
+trigger this on x86.</p><p>Also, self-alignment is not the only possible rule. Historically, some
+processors (especially those lacking
+<a class="ulink" href="https://en.wikipedia.org/wiki/Barrel_shifter" target="_top">barrel shifters</a>) have had
+more restrictive ones. If you do embedded systems, you might trip over one
+of these lurking in the underbrush. Be aware this is possible.</p><p>One curious and illustrative exception is the Motorola 68020 and its
+successors. These are word-oriented 32-bit machines - that is, the
+underlying granularity of fast access is 16 bits. Compilers can start
+structs on 16-bit boundaries without a speed penalty, even if the
+first member was a 32-bit scalar. Therefore, only character fields
+with odd byte lengths can ever cause padding.</p><p>From when it was first written at the beginning of 2014 until late
+2016, this section ended with other caveats about odd
+architectures. During that period I’ve learned something rather
+reassuring from working with the source code for the reference
+implementation of NTP. It does packet analysis by reading packets off
+the wire directly into memory that the rest of the code sees as a
+struct, relying on the assumption of minimal self-aligned padding - or
+zero padding in odd cases like 690x0.</p><p>The interesting news is that NTP has apparently being getting away
+with this for <span class="strong"><strong>decades</strong></span> across a very wide span of hardware, operating
+systems, and compilers, including not just Unixes but under Windows
+variants as well. This suggests that platforms with padding rules
+other than self-alignment are either nonexistent or confined to
+such specialized niches that they’re never either NTP servers or
+clients.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_padding"></a>4. Padding</h2></div></div></div><p>Now we’ll look at a simple example of variable layout in
+memory. Consider the following series of variable declarations
+in the top level of a C module:</p><pre class="screen">char *p;
+char c;
+int x;</pre><p>If you didn’t know anything about data alignment, you might assume
+that these three variables would occupy a continuous span of bytes in
+memory. That is, on a 32-bit machine 4 bytes of pointer would be
+immediately followed by 1 byte of char and that immediately followed
+by 4 bytes of int. And a 64-bit machine would be different only
+in that the pointer would be 8 bytes.</p><p>In fact, the hidden assumption that the allocated order of static
+variables is their source order is not necessarily valid; the C
+standards don’t mandate it. I’m going to ignore this detail because
+(a) that hidden assumption is usually correct anyway, and (b) the
+actual purpose of talking about padding and packing outside structures
+is to prepare you for what happens inside them.</p><p>Here’s what actually happens (on an x86 or ARM or anything else with
+self-aligned types). The storage for <code class="literal">p</code> starts on a self-aligned 4-
+or 8-byte boundary depending on the machine word size. This is
+<span class="emphasis"><em>pointer alignment</em></span> - the strictest possible.</p><p>The storage for <code class="literal">c</code> follows immediately. But the 4-byte alignment
+requirement of <code class="literal">x</code> forces a gap in the layout; it comes out as though
+there were a fourth intervening variable, like this:</p><pre class="screen">char *p; /* 4 or 8 bytes */
+char c; /* 1 byte */
+char pad[3]; /* 3 bytes */
+int x; /* 4 bytes */</pre><p>The <code class="literal">pad[3]</code> character array represents the fact that there are three
+bytes of waste space in the structure. The old-school term for this
+was "slop". The value of the padding bits is undefined; in particular
+it is not guaranteed that they will be zeroed.</p><p>Compare what happens if <code class="literal">x</code> is a 2-byte short:</p><pre class="screen">char *p;
+char c;
+short x;</pre><p>In that case, the actual layout will be this:</p><pre class="screen">char *p; /* 4 or 8 bytes */
+char c; /* 1 byte */
+char pad[1]; /* 1 byte */
+short x; /* 2 bytes */</pre><p>On the other hand, if <code class="literal">x</code> is a long on a 64-bit machine</p><pre class="screen">char *p;
+char c;
+long x;</pre><p>we end up with this:</p><pre class="screen">char *p; /* 8 bytes */
+char c; /* 1 byte
+char pad[7]; /* 7 bytes */
+long x; /* 8 bytes */</pre><p>If you have been following carefully, you are probably now wondering about
+the case where the shorter variable declaration comes first:</p><pre class="screen">char c;
+char *p;
+int x;</pre><p>If the actual memory layout were written like this</p><pre class="screen">char c;
+char pad1[M];
+char *p;
+char pad2[N];
+int x;</pre><p>what can we say about <code class="literal">M</code> and <code class="literal">N</code>?</p><p>First, in this case <code class="literal">N</code> will be zero. The address of <code class="literal">x</code>, coming
+right after <code class="literal">p</code>, is guaranteed to be pointer-aligned, which is
+never less strict than int-aligned.</p><p>The value of <code class="literal">M</code> is less predictable. If the compiler
+happened to map <code class="literal">c</code> to the last byte of a machine word, the next
+byte (the first of <code class="literal">p</code>) would be the first byte of the next one
+and properly pointer-aligned. M would be zero.</p><p>It is more likely that <code class="literal">c</code> will be mapped to the <span class="emphasis"><em>first</em></span> byte of a
+machine word. In that case M will be whatever padding is needed to
+ensure that <code class="literal">p</code> has pointer alignment - 3 on a 32-bit machine,
+7 on a 64-bit machine.</p><p>Intermediate cases are possible. M can be anything from 0 to 7
+(0 to 3 on 32-bit) because a char can start on any byte boundary
+in a machine word.</p><p>If you wanted to make those variables take up less space,
+you could get that effect by swapping <code class="literal">x</code> with <code class="literal">c</code> in the
+original sequence.</p><pre class="screen">char *p; /* 8 bytes */
+long x; /* 8 bytes */
+char c; /* 1 byte</pre><p>Usually, for the small number of scalar variables in your C programs,
+bumming out the few bytes you can get by changing the order of
+declaration won’t save you enough to be significant. The technique
+becomes more interesting when applied to nonscalar variables -
+especially structs.</p><p>Before we get to those, let’s dispose of arrays of scalars. On a
+platform with self-aligned types, arrays of
+char/short/int/long/pointer have no internal padding; each
+member is automatically self-aligned at the end of the next one.</p><p>All these rules and examples map over to Go structs, and to Rust
+structs with the "repr(C)" attribute, with only syntactic changes.</p><p>In the next section we will see that the same is <span class="strong"><strong>not</strong></span> necessarily true of
+structure arrays.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_structure_alignment_and_padding"></a>5. Structure alignment and padding</h2></div></div></div><p>In general, a struct instance will have the alignment of its widest scalar
+member. Compilers do this as the easiest way to ensure that all the
+members are self-aligned for fast access.</p><p>Also, in C (and Go, and Rust) the address of a struct is the same as
+the address of its first member - there is no leading padding. In
+C++ this may not be true; see <a class="xref" href="#C-like" title="14. Other languages">Section 14, “Other languages”</a>.</p><p>(When you’re in doubt about this sort of thing, ANSI C provides an
+offsetof() macro which can be used to read out structure member
+offsets.)</p><p>Consider this struct:</p><pre class="screen">struct foo1 {
+ char *p;
+ char c;
+ long x;
+};</pre><p>Assuming a 64-bit machine, any instance of <code class="literal">struct foo1</code> will have
+8-byte alignment. The memory layout of one of these looks
+unsurprising, like this:</p><pre class="screen">struct foo1 {
+ char *p; /* 8 bytes */
+ char c; /* 1 byte
+ char pad[7]; /* 7 bytes */
+ long x; /* 8 bytes */
+};</pre><p>It’s laid out exactly as though variables of these types has
+been separately declared. But if we put <code class="literal">c</code> first, that’s
+no longer true.</p><pre class="screen">struct foo2 {
+ char c; /* 1 byte */
+ char pad[7]; /* 7 bytes */
+ char *p; /* 8 bytes */
+ long x; /* 8 bytes */
+};</pre><p>If the members were separate variables, <code class="literal">c</code> could start at any byte
+boundary and the size of <code class="literal">pad</code> might vary. Because <code class="literal">struct foo2</code> has
+the pointer alignment of its widest member, that’s no longer possible.
+Now <code class="literal">c</code> has to be pointer-aligned, and following padding of 7 bytes is
+locked in.</p><p>Now let’s talk about trailing padding on structures. To explain this,
+I need to introduce a basic concept which I’ll call the <span class="emphasis"><em>stride
+address</em></span> of a structure. It is the first address following the
+structure data that has the <span class="strong"><strong>same alignment as the structure</strong></span>.</p><p>The general rule of trailing structure padding is this: the compiler
+will behave as though the structure has trailing padding out to
+its stride address. This rule controls what <code class="literal">sizeof()</code> will return.</p><p>Consider this example on a 64-bit x86 or ARM machine:</p><pre class="screen">struct foo3 {
+ char *p; /* 8 bytes */
+ char c; /* 1 byte */
+};
+
+struct foo3 singleton;
+struct foo3 quad[4];</pre><p>You might think that <code class="literal">sizeof(struct foo3)</code> should be 9, but it’s
+actually 16. The stride address is that of <code class="literal">(&p)[2]</code>. Thus, in the <code class="literal">quad</code>
+array, each member has 7 bytes of trailing padding, because the first
+member of each following struct wants to be self-aligned on an 8-byte
+boundary. The memory layout is as though the structure had been
+declared like this:</p><pre class="screen">struct foo3 {
+ char *p; /* 8 bytes */
+ char c; /* 1 byte */
+ char pad[7];
+};</pre><p>For contrast, consider the following example:</p><pre class="screen">struct foo4 {
+ short s; /* 2 bytes */
+ char c; /* 1 byte */
+};</pre><p>Because <code class="literal">s</code> only needs to be 2-byte aligned, the stride address is
+just one byte after <code class="literal">c</code>, and <code class="literal">struct foo4</code> as a whole only needs one
+byte of trailing padding. It will be laid out like this:</p><pre class="screen">struct foo4 {
+ short s; /* 2 bytes */
+ char c; /* 1 byte */
+ char pad[1];
+};</pre><p>and <code class="literal">sizeof(struct foo4)</code> will return 4.</p><p>Here’s a last important detail: If your structure has structure
+members, the inner structs want to have the alignment of
+longest scalar too. Suppose you write this:</p><pre class="screen">struct foo5 {
+ char c;
+ struct foo5_inner {
+ char *p;
+ short x;
+ } inner;
+};</pre><p>The <code class="literal">char *p</code> member in the inner struct forces the outer struct to be
+pointer-aligned as well as the inner. Actual layout will be like
+this on a 64-bit machine:</p><pre class="screen">struct foo5 {
+ char c; /* 1 byte*/
+ char pad1[7]; /* 7 bytes */
+ struct foo5_inner {
+ char *p; /* 8 bytes */
+ short x; /* 2 bytes */
+ char pad2[6]; /* 6 bytes */
+ } inner;
+};</pre><p>This structure gives us a hint of the savings that might be possible
+from repacking structures. Of 24 bytes, 13 of them are padding.
+That’s more than 50% waste space!</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_bitfields"></a>6. Bitfields</h2></div></div></div><p>Now let’s consider C bitfields. What they give you the ability to do is
+declare structure fields of smaller than character width, down to 1
+bit, like this:</p><pre class="screen">struct foo6 {
+ short s;
+ char c;
+ int flip:1;
+ int nybble:4;
+ int septet:7;
+};</pre><p>The thing to know about bitfields is that they are implemented with
+word- and byte-level mask and rotate instructions operating on machine
+words, and cannot cross word boundaries. C99 guarentees that
+bit-fields will be packed as tightly as possible, provided they don’t
+cross storage unit boundaries (6.7.2.1 #10).</p><p>This restriction is relaxed in C11 (6.7.2.1p11) and C++14
+([class.bit]p1); these revisions do not actually require <code class="literal">struct foo9</code>
+to be 64 bits instead of 32; a bit-field can span multiple allocation
+units instead of starting a new one. It’s up to the implementation to
+decide; GCC leaves it up to the ABI, which for x64 does prevent them
+from sharing an allocation unit.</p><p>Assuming we’re on a 32-bit machine, the C99 rules imply that the layout may
+look like this:</p><pre class="screen">struct foo6 {
+ short s; /* 2 bytes */
+ char c; /* 1 byte */
+ int flip:1; /* total 1 bit */
+ int nybble:4; /* total 5 bits */
+ int pad1:3; /* pad to an 8-bit boundary */
+ int septet:7; /* 7 bits */
+ int pad2:25; /* pad to 32 bits */
+};</pre><p>But this isn’t the only possibility, because the C standard does not
+specify that bits are allocated low-to-high. So the layout could look
+like this:</p><pre class="screen">struct foo6 {
+ short s; /* 2 bytes */
+ char c; /* 1 byte */
+ int pad1:3; /* pad to an 8-bit boundary */
+ int flip:1; /* total 1 bit */
+ int nybble:4; /* total 5 bits */
+ int pad2:25; /* pad to 32 bits */
+ int septet:7; /* 7 bits */
+};</pre><p>That is, the padding could precede rather than following the
+payload bits.</p><p>Note also that, as with normal structure padding, the padding bits are
+not guaranteed to be zero; C99 mentions this.</p><p>Note that the base type of a bit field is interpreted for signedness
+but not necessarily for size. It is up to implementors whether
+"short flip:1" or "long flip:1" are supported, and whether those
+base types change the size of the storage unit the field is packed into.</p><p>Proceed with caution and check with -Wpadded if you have it available
+(e.g. under clang). Compilers on exotic hardware might interpret the
+C99 rules in surprising ways; older compilers might not quite follow
+them.</p><p>The restriction that bitfields cannot cross machine word boundaries
+means that, while the first two of the following structures pack into
+one and two 32-bit words as you’d expect, the third (<code class="literal">struct foo9</code>) takes
+up <span class="strong"><strong>three</strong></span> 32-bit words in C99, in the last of which only one bit is used.</p><pre class="screen">struct foo7 {
+ int bigfield:31; /* 32-bit word 1 begins */
+ int littlefield:1;
+};
+
+struct foo8 {
+ int bigfield1:31; /* 32-bit word 1 begins /*
+ int littlefield1:1;
+ int bigfield2:31; /* 32-bit word 2 begins */
+ int littlefield2:1;
+};
+
+struct foo9 {
+ int bigfield1:31; /* 32-bit word 1 begins */
+ int bigfield2:31; /* 32-bit word 2 begins */
+ int littlefield1:1;
+ int littlefield2:1; /* 32-bit word 3 begins */
+};</pre><p>Again, C11 and C++14 may pack foo9 tighter, but it would perhaps be
+unwise to count on this.</p><p>On the other hand, <code class="literal">struct foo8</code> would fit into a single 64-bit word
+if the machine has those.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_structure_reordering"></a>7. Structure reordering</h2></div></div></div><p>Now that you know how and why compilers insert padding in and after your
+structures we’ll examine what you can do to squeeze out the slop.
+This is the art of structure packing.</p><p>The first thing to notice is that slop only happens in two places.
+One is where storage bound to a larger data type (with stricter
+alignment requirements) follows storage bound to a smaller one. The
+other is where a struct naturally ends before its stride address,
+requiring padding so the next one will be properly aligned.</p><p>The simplest way to eliminate slop is to reorder the structure members
+by decreasing alignment. That is: make all the pointer-aligned
+subfields come first, because on a 64-bit machine they will be 8
+bytes. Then the 4-byte ints; then the 2-byte shorts; then the
+character fields.</p><p>So, for example, consider this simple linked-list structure:</p><pre class="screen">struct foo10 {
+ char c;
+ struct foo10 *p;
+ short x;
+};</pre><p>With the implied slop made explicit, here it is:</p><pre class="screen">struct foo10 {
+ char c; /* 1 byte */
+ char pad1[7]; /* 7 bytes */
+ struct foo10 *p; /* 8 bytes */
+ short x; /* 2 bytes */
+ char pad2[6]; /* 6 bytes */
+};</pre><p>That’s 24 bytes. If we reorder by size, we get this:</p><pre class="screen">struct foo11 {
+ struct foo11 *p;
+ short x;
+ char c;
+};</pre><p>Considering self-alignment, we see that none of the data fields need
+padding. This is because the stride address for a (longer) field with
+stricter alignment is always a validly-aligned start address for a
+(shorter) field with less strict requirements. All the repacked struct
+actually requires is trailing padding:</p><pre class="screen">struct foo11 {
+ struct foo11 *p; /* 8 bytes */
+ short x; /* 2 bytes */
+ char c; /* 1 byte */
+ char pad[5]; /* 5 bytes */
+};</pre><p>Our repack transformation drops the size from 24 to 16 bytes. This
+might not seem like a lot, but suppose you have a linked list of 200K
+of these? The savings add up fast - especially on memory-constrained
+embedded systems or in the core part of an OS kernel that has to stay
+resident.</p><p>Note that reordering is not guaranteed to produce savings. Applying this
+technique to an earlier example, <code class="literal">struct foo5</code>, we get this:</p><pre class="screen">struct foo12 {
+ struct foo5 {
+ char *p; /* 8 bytes */
+ short x; /* 2 bytes */
+ } inner;
+ char c; /* 1 byte*/
+};</pre><p>With padding written out, this is</p><pre class="screen">struct foo12 {
+ struct foo5 {
+ char *p; /* 8 bytes */
+ short x; /* 2 bytes */
+ char pad[6]; /* 6 bytes */
+ } inner;
+ char c; /* 1 byte*/
+ char pad[7]; /* 7 bytes */
+};</pre><p>It’s still 24 bytes because <code class="literal">c</code> cannot back into the inner struct’s
+trailing padding. To collect that gain you would need to redesign
+your data structures.</p><p>Curiously, strictly ordering your structure fields by <span class="emphasis"><em>increasing</em></span>
+size also works to mimimize padding. You can minimize padding with
+any order in which (a) all fields of any one size are in a continuous
+span (completely eliminating padding between them), and (b) the gaps
+between those spans are such that the sizes on either side have as few
+doubling steps of difference from each other as possible. Usually
+this means no padding at all on one side.</p><p>Even more general minimal-padding orders are possible. Example:</p><pre class="screen">struct foo13 {
+ int32_t i;
+ int32_t i2;
+ char octet[8];
+ int32_t i3;
+ int32_t i4;
+ int64_t l;
+ int32_t i5;
+ int32_t i6;
+};</pre><p>This struct has zero padding under self-alignment rules. Working out
+why is a useful exercise to develop your understanding.</p><p>Since shipping the first version of this guide I have been asked why,
+if reordering for minimal slop is so simple, C compilers don’t do it
+automatically. The answer: C is a language originally designed for
+writing operating systems and other code close to the
+hardware. Automatic reordering would interfere with a systems
+programmer’s ability to lay out structures that exactly match the
+byte and bit-level layout of memory-mapped device control blocks.</p><p>Go hews to the C philosophy and does not reorder fields. Rust makes
+the opposite choice; by default, its compiler <span class="emphasis"><em>may</em></span> reorder structure
+fields.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_awkward_scalar_cases"></a>8. Awkward scalar cases</h2></div></div></div><p>Using enumerated types instead of #defines is a good idea, if only
+because symbolic debuggers have those symbols available and can
+show them rather than raw integers. But, while enums are guaranteed
+to be compatible with an integral type, the C standard does
+not specify <span class="emphasis"><em>which</em></span> underlying integral type is to be used for them.</p><p>Be aware when repacking your structs that while enumerated-type
+variables are usually ints, this is compiler-dependent; they could be
+shorts, longs, or even chars by default. Your compiler may have a
+pragma or command-line option to force the size.</p><p>The <code class="literal">long double</code> type is a similar trouble spot. Some C platforms
+implement this in 80 bits, some in 128, and some of the 80-bit
+platforms pad it to 96 or 128 bits.</p><p>In both cases it’s best to use <code class="literal">sizeof()</code> to check the storage size.</p><p>Finally, under x86 Linux doubles are sometimes an exception to the
+self-alignment rule; an 8-byte double may require only 4-byte
+alignment within a struct even though standalone doubles variables
+have 8-byte self-alignment. This depends on compiler and options.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_readability_and_cache_locality"></a>9. Readability and cache locality</h2></div></div></div><p>While reordering by size is the simplest way to eliminate slop, it’s
+not necessarily the right thing. There are two more issues:
+readability and cache locality.</p><p>Programs are not just communications to a computer, they are
+communications to other human beings. Code readability is important
+even (or especially!) when the audience of the communication is only
+your future self.</p><p>A clumsy, mechanical reordering of your structure can harm
+readability. When possible, it is better to reorder fields so they
+remain in coherent groups with semantically related pieces of data
+kept close together. Ideally, the design of your structure should
+communicate the design of your program.</p><p>When your program frequently accesses a structure, or parts of a
+structure, it is helpful for performance if the accesses tend to fit
+within a <span class="emphasis"><em>cache line</em></span> - the memory block fetched by your processor
+when it is told to get any single address within the block. On 64-bit
+x86 a cache line is 64 bytes beginning on a self-aligned address; on
+other platforms it is often 32 bytes.</p><p>The things you should do to preserve readability - grouping related
+and co-accessed data in adjacent fields - also improve cache-line
+locality. These are both reasons to reorder intelligently, with
+awareness of your code’s data-access patterns.</p><p>If your code does concurrent access to a structure from multiple
+threads, there’s a third issue: cache line bouncing. To minimize
+expensive bus traffic, you should arrange your data so that reads
+come from one cache line and writes go to another in your tighter
+loops.</p><p>And yes, this sometimes contradicts the previous guidance about grouping
+related data in the same cache-line-sized block. Multithreading is hard.
+Cache-line bouncing and other multithread optimization issues are very
+advanced topics which deserve an entire tutorial of their own. The
+best I can do here is make you aware that these issues exist.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_other_packing_techniques"></a>10. Other packing techniques</h2></div></div></div><p>Reordering works best when combined with other techniques for slimming
+your structures. If you have several boolean flags in a struct, for example,
+consider reducing them to 1-bit bitfields and packing them into a
+place in the structure that would otherwise be slop.</p><p>You’ll take a small access-time penalty for this - but if it squeezes
+the working set enough smaller, that penalty will be swamped by your
+gains from avoided cache misses.</p><p>More generally, look for ways to shorten data field sizes. In
+cvs-fast-export, for example, one squeeze I applied was to use the
+knowledge that RCS and CVS repositories didn’t exist before 1982. I
+dropped a 64-bit Unix <code class="literal">time_t</code> (zero date at the beginning of 1970)
+for a 32-bit time offset from 1982-01-01T00:00:00; this will cover
+dates to 2118. (Note: if you pull a trick like this, <span class="emphasis"><em>do a bounds
+check whenever you set the field</em></span> to prevent nasty bugs!)</p><p>Each such field shortening not only decreases the explicit size
+of your structure, it may remove slop and/or create additional
+opportunities for gains from field reordering. Virtuous cascades
+of such effects are not very hard to trigger.</p><p>The riskiest form of packing is to use unions. If you know that
+certain fields in your structure are never used in combination with
+certain other fields, consider using a union to make them share
+storage. But be extra careful and verify your work with regression
+testing, because if your lifetime analysis is even slightly wrong you
+will get bugs ranging from crashes to (much worse) subtle data
+corruption.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_overriding_alignment_rules"></a>11. Overriding alignment rules</h2></div></div></div><p>Sometimes you can coerce your compiler into not using the processor’s
+normal alignment rules by using a pragma, usually <code class="literal">#pragma pack</code>.
+GCC and clang have a "packed" attribute you can attach to
+individual structure declarations; GCC has an -fpack-struct option
+for entire compilations.</p><p>Do not do this casually, as it forces the generation of more expensive
+and slower code. Usually you can save as much memory, or almost as
+much, with the techniques I describe here.</p><p>The only really compelling reason for <code class="literal">#pragma pack</code> is if you have to
+exactly match your C data layout to some kind of bit-level hardware or
+protocol requirement, like a memory-mapped hardware port, and
+violating normal alignment is required for that to work. If you’re in
+that situation, and you don’t already know everything else I’m writing
+about here, you’re in deep trouble and I wish you luck.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_tools"></a>12. Tools</h2></div></div></div><p>The clang compiler has a -Wpadded option that causes it to generate
+messages about alignment holes and padding. Some versions also have
+an undocumented -fdump-record-layouts option that yields
+<a class="ulink" href="http://lists.cs.uiuc.edu/pipermail/cfe-dev/2014-July/037778.html" target="_top">more
+information</a>.</p><p>If you’re using C11, you can deploy static_assert to check your
+assumptions about type and structure sizes. Example:</p><pre class="screen">#include <assert.h>
+struct foo4 {
+ short s; /* 2 bytes */
+ char c; /* 1 byte */
+};
+static_assert(sizeof(struct foo4) == 4, “Check your assumptions");</pre><p>I have not used it myself, but several respondents speak well of a
+program called <code class="literal">pahole</code>. This tool cooperates with a compiler to produce
+reports on your structures that describe padding, alignment, and
+cache line boundaries. This was at one time a standalone C program,
+but that is now unmaintained; a script with the name pahole now ships
+with gdb and that is what you should use.</p><p>I’ve received a report that a proprietary code auditing tool called
+"PVS Studio" can detect structure-packing opportunities.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_proof_and_exceptional_cases"></a>13. Proof and exceptional cases</h2></div></div></div><p>You can download sourcecode for a little program that demonstrates the
+assertions about scalar and structure sizes made above. It is
+<a class="ulink" href="packtest.c" target="_top">packtest.c</a>.</p><p>If you look through enough strange combinations of compilers, options,
+and unusual hardware, you will find exceptions to some of the rules I
+have described. They get more common as you go back in time to older
+processor designs.</p><p>The next level beyond knowing these rules is knowing how and when to
+expect that they will be broken. In the years when I learned them
+(the early 1980s) we spoke of people who didn’t get this as victims of
+"all-the-world’s-a-VAX" syndrome. Remember that not all the world is
+vanilla.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="C-like"></a>14. Other languages</h2></div></div></div><p>In this section we’ll call a language "C-like" if structure and array
+members are self-aligned, are not reordered by the compiler, the address
+of a struct is the address of its first member, and structs have
+trailing pdding to their stride length.</p><p>If you know the implications of self-aligment in C, you can apply them
+directly to calculating sizes and offsets in any language that is
+C-like in this sense, and to space-optimizing in the language’s
+structures.</p><div class="section"><div class="titlepage"><div><div><h3 class="title"><a id="_c"></a>14.1. C++</h3></div></div></div><p>C++ is C-like, except that classes that look like structs may ignore
+the rule that the address of a struct is the address of its first
+member! Whether they do or not depends on how base classes and virtual
+member functions are implemented, and varies by compiler. Otherwise
+everything we’ve observed about C applies.</p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a id="_go"></a>14.2. Go</h3></div></div></div><p>The Go language is in many respects similar to C. It has structures
+and arrays, though not bitfields or unions. Go compilers have the
+same optimization and alignment issues as C compilers. As in C, array
+elements are padded up to the following stride address.</p><p>There is no Go equivalent of the C #pack pragma.</p><p>Variables and struct fields will normally be self-aligned for the same
+reasons rgis is a rule in C. However, there is one peculiat
+exception; on 32—bit platforms, 64-bit struct fields only require
+akignment on a machine word boundary, e.g. 32 bits. There has been
+discussion of
+<a class="ulink" href="https://go.googlesource.com/proposal/+/master/design/36606-64-bit-field-alignment.md[a" target="_top">https://go.googlesource.com/proposal/+/master/design/36606-64-bit-field-alignment.md[a</a>
+proposal to change this, but it stands.</p><p>On the other hand, the Go specification makes no guarantees about how
+structure fields are ordered. Unlike C, it would be legal for a Go
+compiler to lay out fields in an order different from their
+specification in the source. As of 2022 no Go compiler that
+actually does this has beenbn sughted in the wild.</p><p>Go has one odd really odd quirk. Since Go 1.5, a zero-length field at
+the end of a struct (that is, a zero-length array or empty struct) is
+sized and aligned as though it is one byte. The reasons for this are
+discussed in an essay
+<a class="ulink" href="https://dave.cheney.net/2015/10/09/padding-is-hard" target="_top">Padding is Hard</a> by
+one of the Go developers.</p><p>There’s a
+<a class="ulink" href="https://itnext.io/structure-size-optimization-in-golang-alignment-padding-more-effective-memory-layout-linters-fffdcba27c61" target="_top">specific
+discussion of Go alignment rules</a> that includes pointers tp some tools
+that can automatically tweak your structures to optimal alignment.
+Another such tool is
+<a class="ulink" href="https://github.com/orijtech/structslop" target="_top">strucrslop</a>. I have not used
+any of these, try at your own risk.</p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a id="_rust"></a>14.3. Rust</h3></div></div></div><p>Rust follows C-like packing rules if a structure is annotated with
+"repr(C)". Otherwise (by default) all bets are off: padding rules are
+(deliberately) unspecified and the compiler may even reorder structure
+members. It is probably best to let the Rust compiler do space
+optimization rather than forcing it.</p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a id="_java"></a>14.4. Java</h3></div></div></div><p>Java’s JNI (Java Native Interface) supports C-like packing rules for
+structure members so that JNI Java structures can map exactly to C
+equivalents. There is a pack pragma.</p><p>Packing in the JVM itself, however, is not well defined. The JVM
+spec simply says "The Java Virtual Machine does not mandate any
+particular internal structure for objects.", making choices about
+this implementation-dependent.</p><p>That said, many JVM implementations are word-oriented in an even
+stricter way than Motorola processors - structure and array members
+can start at any 32-bit boundary but not at a 16- or 8-bit one. This
+will create internal padding after char and 16-bit short members where
+it wouldn’t be expected under C-like rules.</p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a id="_swift"></a>14.5. Swift</h3></div></div></div><p>Swift is
+<a class="ulink" href="https://swiftunboxed.com/internals/size-stride-alignment/" target="_top">exactly
+C-like</a>. There is no equivalent of a pack pragma.</p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a id="_c_2"></a>14.6. C#</h3></div></div></div><p>C# is C-like with the default structure layout attribute
+LayoutKind.Sequential. LayoutKind.Auto allows the compiler to
+reorder, and LayoutKind.Explicit allows the programmmer to specify
+field sizes explicitly. There’s a Pack modifier which is equivalent to
+a C pack pragma.</p></div></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_supporting_this_work"></a>15. Supporting this work</h2></div></div></div><p>If you were educated or entertained by this document, please sign up
+for my <a class="ulink" href="https://www.patreon.com/esr" target="_top">Patreon feed</a>. The time needed to
+write and maintain documents like this one is not free, and while
+I enjoying giving them to the world my bills won’t pay themselves.
+Even a few dollars a month - from enough of you - helps a lot.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_related_reading"></a>16. Related Reading</h2></div></div></div><p>This section exists to collect pointers to essays which I judge to be
+good companions to this one.</p><p><a class="ulink" href="http://blog.regehr.org/archives/213" target="_top">A Guide to Undefined Behavior in C and C++</a></p><p><a class="ulink" href="http://www.catb.org/esr/time-programming/" target="_top">Time, Clock, and Calendar Programming In C</a></p><p><a class="ulink" href="http://www.catb.org/esr/faqs/things-every-hacker-once-knew/" target="_top">Things Every Hacker Once Knew</a></p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_version_history"></a>17. Version history</h2></div></div></div><div class="variablelist"><dl class="variablelist"><dt><span class="term">
+2.5 @ 2020-04-30
+</span></dt><dd>
+ Revision and expansion of the section on Go packing/alignment rules.
+</dd><dt><span class="term">
+2.4 @ 2020-04-28
+</span></dt><dd>
+ Added coverage of Java, Swift, and C#.
+</dd><dt><span class="term">
+2.3 @ 2020-04-20
+</span></dt><dd>
+ Describe exceptional alignment rules on Motorola 680x0.
+ Sightly improve coverage of Rust alignment rules.
+</dd><dt><span class="term">
+2.2 @ 2019-12-19
+</span></dt><dd>
+ Minor markup fix.
+</dd><dt><span class="term">
+2.1 @ 2019-02-14
+</span></dt><dd>
+ Correct a minor error in one of the examples pointed out by
+ Luis Emilio Moreno Durán.
+</dd><dt><span class="term">
+2.0 @ 2018-08-06
+</span></dt><dd>
+ Drop "C" out of the title as these techniques are applicable to Go
+ and Rust as well. More coverage of Go and Rust. The pahole tool
+ has been replaced by a script in the gdb distribution.
+</dd><dt><span class="term">
+1.19 @ 2018-02-12
+</span></dt><dd>
+ Describe usefulness of static_assert() in C11. Added Go and Rust coverage.
+</dd><dt><span class="term">
+1.18 @ 2017-06-01
+</span></dt><dd>
+ More general zero-padding orders. C11 and C14 relax a constraint
+ on bitfield packing.
+</dd><dt><span class="term">
+1.17 @ 2016-11-14
+</span></dt><dd>
+ Typo fixes.
+</dd><dt><span class="term">
+1.16 @ 2016-10-21
+</span></dt><dd>
+ Answer an objection about allocation order being unrelated to
+ source order.
+</dd><dt><span class="term">
+1.15 @ 2016-10-20
+</span></dt><dd>
+ Note the field evidence from NTP.
+</dd><dt><span class="term">
+1.14 @ 2015-12-19
+</span></dt><dd>
+ Typo correction: -Wpadding → -Wpadded.
+</dd><dt><span class="term">
+1.13 @ 2015-11-23
+</span></dt><dd>
+ Be explicit about padding bits being undefined. More about bitfields.
+</dd><dt><span class="term">
+1.12 @ 2015-11-11
+</span></dt><dd>
+ Major revision of section on bitfields reflecting C99 rules.
+</dd><dt><span class="term">
+1.11 @ 2015-07-23
+</span></dt><dd>
+ Mention the clang -fdump-record-layouts option.
+</dd><dt><span class="term">
+1.10 @ 2015-02-20
+</span></dt><dd>
+ Mention <span class="emphasis"><em>attribute</em></span><a id="idm359" class="indexterm"></a><span class="emphasis"><em>packed</em></span>, -fpack-struct, and PVS Studio.
+</dd><dt><span class="term">
+1.9 @ 2014-10-01
+</span></dt><dd>
+ Added link to "Time, Clock, and Calendar Programming In C".
+</dd><dt><span class="term">
+1.8 @ 2014-05-20
+</span></dt><dd>
+ Improved explanation for the bitfield examples,
+</dd><dt><span class="term">
+1.7 @ 2014-05-17
+</span></dt><dd>
+ Correct a minor error in the description of the layout of <code class="literal">struct foo8</code>.
+</dd><dt><span class="term">
+1.6 @ 2014-05-14
+</span></dt><dd>
+ Emphasize that bitfields cannot cross word boundaries. Idea from
+ Dale Gulledge.
+</dd><dt><span class="term">
+1.5 @ 2014-01-13
+</span></dt><dd>
+ Explain why structure member reordering is not done automatically.
+</dd><dt><span class="term">
+1.4 @ 2014-01-04
+</span></dt><dd>
+ A note about double under x86 Linux.
+</dd><dt><span class="term">
+1.3 @ 2014-01-03
+</span></dt><dd>
+ New sections on awkward scalar cases, readability and cache
+ locality, and tools.
+</dd><dt><span class="term">
+1.2 @ 2014-01-02
+</span></dt><dd>
+ Correct an erroneous address calculation.
+</dd><dt><span class="term">
+1.1 @ 2014-01-01
+</span></dt><dd>
+ Explain why aligned accesses are faster. Mention offsetof.
+ Various minor fixes, including the packtest.c download link.
+</dd><dt><span class="term">
+1.0 @ 2014-01-01
+</span></dt><dd>
+ Initial release.
+</dd></dl></div></div></div></body></html>
+\ No newline at end of file