The Significance of the x86 LFENCE Instruction


Note: SFENCE is discussed in another blog post. This post is about LFENCE.

The x86 ISA currently offers three “fence” instructions: MFENCE, SFENCE, and LFENCE. Sometimes they are described as “memory fence” instructions. Other architectures and the literature on memory ordering models use terms such as memory fences, store fences, and load fences. The terms “memory fence” and “load fence” are not used in the Intel Manual Volume 3, but they do appear a couple of times in the Intel Manual Volume 2 and in the AMD manuals. This article focuses on “load fences”. Throughout, I’ll be referring to the latest Intel and AMD manuals available at the time of writing.

The fact that the term “load fence” has been used in different ISAs, textbooks, and research papers has resulted in a critical misunderstanding of the x86 LFENCE instruction and confusion regarding what it does and how to use it. Calling it a “load fence” gives the impression that it serializes load operations. And since the x86 memory ordering model already guarantees that loads (those explicitly issued by the instructions being executed as specified in the ISA; see the comments for discussion) will not be reordered according to the observable behavior, it appears that LFENCE is rather useless. However, LFENCE is actually not a load fence in the traditional meaning of the term, even though it’s sometimes called that.

LFENCE was first introduced in the Pentium 4 processor in 2001 as part of the SSE2 instruction set extension, and it is supported by all later Intel x86 processors. It is also supported by the AMD Opteron and Athlon 64 processors and all later AMD x86 processors.

If you read the Intel manuals for processors that precede the Pentium 4 (P5 and P6, but not the 486 and earlier, whose manuals did not explicitly use the term “serialization”), you’ll see that there might be a need for using serializing instructions or I/O instructions in certain situations. However, Intel decided to introduce the fence instructions (SFENCE in the Pentium III, and LFENCE and MFENCE in the Pentium 4) to provide some ordering guarantees without being fully serializing instructions, so that they potentially have less impact on performance.

Intel Manual Volume 3 Section 8.2.5:

The SFENCE, LFENCE, and MFENCE instructions provide a performance-efficient way of ensuring load and store memory ordering between routines that produce weakly-ordered results and routines that consume that data.

Intel Manual Volume 3 Section 8.2.5:

Note that the SFENCE, LFENCE, and MFENCE instructions provide a more efficient method of controlling memory ordering than the CPUID instruction.

One important phrase here is “weakly-ordered results”. But what does it mean? Is the x86 memory model a binary thing (weak vs. strong) or is it more complicated than that? Unfortunately, Intel did not define precisely what that phrase means. Anyway, although it’s very important and relevant to the LFENCE instruction, I’ll not directly address this issue in this article.

The fence instructions including LFENCE can be executed at any privilege level, in any operating mode, and in any architectural state. Their behavior is the same in all Intel processors that support them, except for LFENCE in AMD processors as I’ll explain later.

Intel Manual Volume 3 Section 8.2.5:

LFENCE — Serializes all load (read) operations that occurred prior to the LFENCE instruction in the program instruction stream, but does not affect store operations.

This sounds like the definition of a load fence, but there is more in a footnote in the same section, which is also repeated in Volume 2 (the sentences are numbered by me).

(1) Specifically, LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes. (2) As a result, an instruction that loads from memory and that precedes an LFENCE receives data from memory prior to completion of the LFENCE. (3) An LFENCE that follows an instruction that stores to memory might complete before the data being stored have become globally visible. (4) Instructions following an LFENCE may be fetched from memory before the LFENCE, but they will not execute until the LFENCE completes.

This part indicates that LFENCE is more than just a load fence, but it’s a little vague. What does “completed locally” mean? Well, the second sentence says that LFENCE does not complete until all previous loads in program order have received their data. This means that other agents in the system might be able to determine that the logical processor has read a particular memory location, in case that location was globally updated (the update has reached the coherence domain) at least once by an agent other than the logical processor. This is possible because the logical processor may produce different results based on the fetched data. But can we say that the other agents can definitely observe these loads? Yes we can. That’s because Intel uses the terms “retire” and “complete” interchangeably throughout the manual. When an instruction retires, it means that all of its side effects are either globally visible or will become globally visible at some later time. In other words, the logical processor cannot retract a retired instruction, but it might be able to retire other subsequent instructions that override its effects before they become globally visible. The third sentence clarifies this situation. Writes to memory from retired instructions may not yet be globally visible even though they are visible to the logical processor that retired the instructions. Such writes can be held in buffers known as write buffers and are only guaranteed to become globally visible when they leave these write buffers. Note that this behavior is only specified by the ISA; the implementation can actually be such that retirement means global visibility.

(Note how Intel uses the term “load from memory” and “load” interchangeably. It’s important to know that when reading the Intel manuals to avoid confusion.)
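
To make this concrete, here is a minimal sketch of the classic store-buffer litmus test, written in the same loose style as the demo programs later in this article (with the same caveat that it is not strictly language-compliant; the names and the iteration count are arbitrary). Each thread stores to one variable and then loads the other. Both stores can retire while still sitting in the store buffers, so both threads may read 0. Replacing the compiler barrier with MFENCE in each thread forbids that outcome; LFENCE would not, because it does not force earlier writes to become globally visible.

#include <pthread.h>
#include <stdio.h>

volatile int x, y, r0, r1, go;

void *thread0(void *unused) {
 while (!go) { } // crude start synchronization
 x = 1; // this store can retire while still sitting in the store buffer
 asm volatile ("" ::: "memory"); // compiler barrier only; try "mfence" here instead
 r0 = y;
 return NULL;
}
void *thread1(void *unused) {
 while (!go) { }
 y = 1;
 asm volatile ("" ::: "memory");
 r1 = x;
 return NULL;
}
int main(void) {
 int i, reordered = 0;
 for (i = 0; i < 100000; i++) {
  pthread_t t0, t1;
  x = y = r0 = r1 = 0;
  go = 0;
  pthread_create(&t0, NULL, thread0, NULL);
  pthread_create(&t1, NULL, thread1, NULL);
  go = 1;
  pthread_join(t0, NULL);
  pthread_join(t1, NULL);
  if (r0 == 0 && r1 == 0) { ++reordered; } // both loads saw the old values
 }
 printf("store buffering observed %d times out of 100000\n", reordered);
 return 0;
}

Compile it with something like `gcc litmus.c -pthread`. Depending on timing, the compiler-barrier version may report a nonzero count; with MFENCE the count should always be zero.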

The first and fourth sentences and the following quote show that LFENCE has some serialization properties (in contrast to SFENCE and MFENCE, which are purely memory fences).

Intel Manual Volume 3 Section 8.3:

LFENCE does provide some guarantees on instruction ordering. It does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes.

In particular, LFENCE does not prevent the processor from fetching and decoding instructions, but it does prevent the processor from executing (dispatching) almost any later instructions (I’ll discuss the exceptions later) until LFENCE itself retires, which happens only when all prior instructions retire. This means that later memory loads and stores will not get issued until all earlier instructions retire. This applies to memory accesses from memory regions of all types.

But doesn’t that make it a serializing instruction? Not really. The two most important differences are the following:

  • LFENCE does not prevent the processor from fetching and decoding later instructions.
  • LFENCE does not ensure that earlier memory writes become globally visible (in contrast to SFENCE, MFENCE, and serializing instructions).

Again, this is important: LFENCE is not guaranteed to be a serializing instruction.

The Intel manual shows, but does not explain, three use cases of the LFENCE instruction. I’ll discuss them here.

Intel Manual Volume 2:

The RDTSC instruction is not a serializing instruction. It does not necessarily wait until all previous instructions have been executed before reading the counter. Similarly, subsequent instructions may begin execution before the read operation is performed. The following items may guide software seeking to order executions of RDTSC:
• If software requires RDTSC to be executed only after all previous instructions have executed and all previous loads are globally visible, it can execute LFENCE immediately before RDTSC.
• If software requires RDTSC to be executed only after all previous instructions have executed and all previous loads and stores are globally visible, it can execute the sequence MFENCE;LFENCE immediately before RDTSC.
• If software requires RDTSC to be executed prior to execution of any subsequent instruction (including any memory accesses), it can execute the sequence LFENCE immediately after RDTSC. This instruction was introduced by the Pentium processor.

The RDTSC instruction can be used to measure execution time. However, it’s not a serializing instruction, so it might be executed speculatively and out-of-order with respect to other instructions, which may jeopardize the accuracy of the measurement. Let’s discuss each of the three points from the manual.

inst1
load1
inst2
store1
LFENCE
RDTSC
inst3
load2
store2

What’s the impact of LFENCE at that location? Well, LFENCE will allow the processor to fetch and decode all later instructions, including RDTSC, but not execute any of them until LFENCE and all earlier instructions retire. This does not necessarily mean, though, that all earlier stores have become globally visible. So the value captured by RDTSC will account for the execution of all earlier instructions. However, it may also include the execution time of later instructions, because there are no guarantees regarding the order in which instructions after RDTSC will execute (including other RDTSCs, which may lead to non-monotonic measurements on the same logical processor).

load1
inst2
store1
MFENCE
LFENCE
RDTSC
...

MFENCE here enables RDTSC to capture the time it takes to make all stores globally visible. This is the only difference that it makes in this code. Now consider this.

load1
inst2
store1
LFENCE
MFENCE
RDTSC
...

Interesting, right? I changed the order of LFENCE and MFENCE. What do you think? If you already know what MFENCE does, then you should be able to figure it out by yourself. Otherwise, you can just skip it.

The sequence LFENCE;RDTSC enabled us to order RDTSC with respect to all previous instructions (with a few exceptions discussed later). We can also do something similar so that RDTSC is ordered with respect to all later instructions (with a few exceptions discussed later). It’s not hard to see that this can be achieved using the RDTSC;LFENCE sequence. So an accurate (low-variance) measurement that relies only on documented guarantees requires sandwiching RDTSC between two LFENCE instructions. It’s worth noting that multiple executions of RDTSC in the same software thread may still result in non-monotonic samples when the thread gets rescheduled to run on different logical processors.
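
As a concrete illustration, here is a minimal sketch of that sandwich pattern using GCC/Clang inline assembly on x86-64 (the function name read_tsc_ordered is just an illustrative choice):

#include <stdint.h>

static inline uint64_t read_tsc_ordered(void)
{
 uint32_t lo, hi;
 asm volatile ("lfence\n\t" // don't read the TSC until all earlier instructions retire
               "rdtsc\n\t"  // EDX:EAX = time-stamp counter
               "lfence"     // don't let later instructions start until RDTSC is done
               : "=a" (lo), "=d" (hi)
               :
               : "memory");
 return ((uint64_t)hi << 32) | lo;
}

A region of code can then be timed by taking the difference between two calls to this helper, with the caveat mentioned above about the thread being rescheduled to a different logical processor.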

The Intel Manual Volume 2 also shows how LFENCE can be useful when using RDTSCP. It’s basically the same thing, so I’ll just skip it.

The third example is shown in the following quote.

Intel Manual Volume 3 Section 10.12.3:

To allow for efficient access to the APIC registers in x2APIC mode, the serializing semantics of WRMSR are relaxed when writing to the APIC registers. Thus, system software should not use “WRMSR to APIC registers in x2APIC mode” as a serializing instruction. Read and write accesses to the APIC registers will occur in program order. A WRMSR to an APIC register may complete before all preceding stores are globally visible; software can prevent this by inserting a serializing instruction or the sequence MFENCE;LFENCE before the WRMSR.

The sequence MFENCE;LFENCE has the same effect as before and is used for a similar purpose.
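
As an illustration of how that sequence might be used, here is a kernel-style sketch (this is not actual Linux code; the function name is made up, and MSR 0x830 is IA32_X2APIC_ICR, the x2APIC interrupt command register):

/* Illustrative sketch only: WRMSR requires CPL 0, so this can only run in
 * kernel (or similarly privileged) context. */
static inline void x2apic_write_icr_ordered(unsigned long long icr)
{
 /* Make earlier stores globally visible, then stop dispatch until that is
  * done, before the non-serializing WRMSR to the APIC register. */
 asm volatile ("mfence; lfence" ::: "memory");
 asm volatile ("wrmsr"
               : /* no outputs */
               : "c" (0x830U),
                 "a" ((unsigned int)icr),
                 "d" ((unsigned int)(icr >> 32))
               : "memory");
}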

Now something very important from the AMD manual.

AMD Manual Volume 3

(1) LFENCE acts as a barrier to force strong memory ordering (serialization) between load instructions preceding the LFENCE and load instructions that follow the LFENCE. (2) Loads from differing memory types may be performed out of order, in particular between WC/WC+ and other memory types. (3) The LFENCE instruction assures that the system completes all previous loads before executing subsequent loads.

AMD has always described its implementation of LFENCE in its manuals as a load serializing instruction. The original use case for LFENCE was ordering loads from WC memory: WC loads may be performed out of order with respect to all other loads. This is also mentioned (although it is difficult to find) in the Intel manual.

Intel Manual Volume 3 Section 8.1.2.2:

Load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.

Intel Manual Volume 3 Section 11.3.1:

The WC memory type is weakly ordered by definition.

Even though it says “weakly ordered memory types”, there is currently only one weakly ordered memory type, namely WC.

However, after the speculative execution vulnerabilities were discovered, AMD released a document in January 2018 entitled “Software techniques for managing speculation on AMD processors” that discusses when and how to make LFENCE behave similar to Intel’s LFENCE (giving it stronger serializing properties). This is what they say in that document.

Description: Set an MSR in the processor so that LFENCE is a dispatch serializing instruction and then use LFENCE in code streams to serialize dispatch (LFENCE is faster than RDTSCP which is also dispatch serializing). This mode of LFENCE may be enabled by setting MSR C001_1029[1]=1.
Effect: Upon encountering an LFENCE when the MSR bit is set, dispatch will stop until the LFENCE instruction becomes the oldest instruction in the machine.
Applicability: All AMD family 10h/12h/14h/15h/16h/17h processors support this MSR. LFENCE support is indicated by CPUID function1 EDX bit 26, SSE2. AMD family 0Fh/11h processors support LFENCE as serializing always but do not support this MSR. AMD plans support for this MSR and access to this bit for all future processors.

This is the first and only document in which MSR C001_1029[1] is mentioned (other bits of C001_1029 are discussed in some AMD documents, but not bit 1). When C001_1029[1] is set to 1, LFENCE behaves as a dispatch serializing instruction (which is more expensive than merely load serializing). Since this MSR is available on most older AMD processors, it seems that it has almost always been supported, maybe because AMD thought they might need it in the future to maintain compatibility with Intel processors regarding the behavior of LFENCE.

One thing not clear to me is the part regarding AMD families 0Fh and 11h processors. That statement is vague because it doesn’t clearly say whether LFENCE on AMD families 0Fh and 11h is fully serializing (in AMD terminology) or dispatch serializing (in AMD terminology). To be safe, it should be interpreted as dispatch serializing only. The AMD family-specific manuals don’t mention LFENCE or MSR C001_1029.

There are exceptions to the ordering rules of the fence instructions, the serializing instructions, and the instructions that have serializing properties. These exceptions are subtly different between Intel and AMD processors; one example is the CLFLUSH instruction. So AMD and Intel mean slightly different things when they talk about instructions with serializing properties.

The last sentence of the quote indicates that MSR C001_1029[1] is not part of the AMD x86 architecture.

LFENCE with this serializing behavior has been used on both Intel and AMD processors to control speculative execution as a mitigation for the Spectre vulnerabilities.
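
For example, the widely used bounds-check-bypass (Spectre variant 1) mitigation places an LFENCE between a bounds check and the dependent load, so that the load cannot be dispatched before the branch has actually retired. Here is a minimal sketch (the names table, table_size, and load_checked are illustrative):

#include <stddef.h>

unsigned char table[256]; // illustrative data
size_t table_size = 256;

unsigned char load_checked(size_t index)
{
 unsigned char value = 0;
 if (index < table_size) {
  // Block dispatch of the load below until the bounds check has retired,
  // so the load cannot execute speculatively on a mispredicted path.
  asm volatile ("lfence" ::: "memory");
  value = table[index];
 }
 return value;
}

On AMD processors, this pattern relies on LFENCE being configured as dispatch serializing via the MSR discussed above (or on a family where it is always serializing).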

Now I’ll discuss the weak-ordering characteristics of LFENCE on both Intel and AMD processors.

First, as already discussed, LFENCE does not prevent the processor from fetching and decoding instructions, but only from dispatching instructions. This means that LFENCE is concurrent with instruction fetches.

LFENCE is not ordered with SFENCE, the global visibility of earlier writes, software prefetching instructions, hardware prefetching, and page table walks, as specified in the following quotes and in other locations in the manuals. This means that LFENCE is concurrent with these operations.

Intel Manual Volume 2

Processors are free to fetch and cache data speculatively from regions of system memory that use the WB, WC, and WT memory types. This speculative fetching can occur at any time and is not tied to instruction execution. Thus, it is not ordered with respect to executions of the LFENCE instruction; data can be brought into the caches speculatively just before, during, or after the execution of an LFENCE instruction.

AMD Manual Volume 3

The LFENCE instruction is weakly-ordered with respect to store instructions, data and instruction prefetches, and the SFENCE instruction. Speculative loads initiated by the processor, or specified explicitly using cache-prefetch instructions, can be reordered around an LFENCE.

In the following quote from the Intel manual, it says that writes, CLFLUSH, and CLFLUSHOPT cannot pass earlier LFENCE. We already know this about the writes.

Intel Manual Volume 3 Section 8.2.2:

Writes and executions of CLFLUSH and CLFLUSHOPT cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.

This quote alone might give the impression that CLFLUSH and CLFLUSHOPT can pass a later LFENCE. However, other places in the manual specify that they are fully ordered with LFENCE.

Intel Manual Volume 2:

Executions of the CLFLUSH instruction are ordered with respect to each other and with respect to writes, locked read-modify-write instructions, fence instructions, and executions of CLFLUSHOPT to the same cache line. They are not ordered with respect to executions of CLFLUSHOPT to different cache lines.

Intel Manual Volume 2:

Executions of the CLFLUSHOPT instruction are ordered with respect to fence instructions and to locked read-modify-write instructions; they are also ordered with respect to the following accesses to the cache line being invalidated: writes, executions of CLFLUSH, and executions of CLFLUSHOPT. They are not ordered with respect to writes, executions of CLFLUSH, or executions of CLFLUSHOPT that access other cache lines; to enforce ordering with such an operation, software can insert an SFENCE instruction between CFLUSHOPT and that operation.

(Note the typo “CFLUSHOPT” at the end of the second quote. LOL. I demand an explanation.)
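
To illustrate the last sentence of the second quote, here is a sketch of flushing a buffer with CLFLUSHOPT and then using SFENCE so that the flushes are ordered before a flag store to a different cache line, a common pattern with persistent memory (the names are illustrative, and this assumes an assembler that knows the CLFLUSHOPT mnemonic):

#include <stddef.h>

/* Illustrative names only; this is a sketch, not code from any library. */
void flush_then_publish(char *buf, size_t len, volatile int *ready)
{
 size_t i;
 for (i = 0; i < len; i += 64) // one CLFLUSHOPT per 64-byte cache line
  asm volatile ("clflushopt %0" : : "m" (buf[i]) : "memory");
 asm volatile ("sfence" ::: "memory"); // order the flushes before the store below
 *ready = 1; // flag store to a different cache line
}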

AMD is slightly different on the ordering between LFENCE and CLFLUSH and CLFLUSHOPT. In particular, on AMD processors, LFENCE is not ordered with respect to CLFLUSH.

AMD Manual Volume 3:

The LFENCE, SFENCE, and serializing instructions are not ordered with respect to CLFLUSH.

AMD Manual Volume 3:

The CLFLUSHOPT instruction is ordered with respect to fence instructions and locked operations.

Another relevant case is the MONITOR instruction, which is treated as a demand WB load operation.

Intel Manual Volume 2:

The MONITOR instruction is ordered as a load operation with respect to other memory transactions. The instruction is subject to the permission checking and faults associated with a byte load. Like a load, MONITOR sets the A-bit but not the D-bit in page tables.

Note that Intel uses the terms “memory transactions” and “memory operations” interchangeably. I didn’t find any such statement in the AMD manual, but I expect that it’s treated the same way there too.

By the way, all of these ordering rules regarding LFENCE are important. Don’t get the false impression that they are arbitrary or irrelevant. In the future, I might write more about this. Also I might write similar articles for the other fence instructions and other related instructions.

LFENCE in the Linux kernel

Linux defines a list of all x86 CPU features in /arch/x86/include/asm/cpufeatures.h. The X86_FEATURE_LFENCE_RDTSC flag represents support for a dispatch serializing LFENCE. The X86_FEATURE_XMM2 flag represents support for SSE2. On Intel processors, X86_FEATURE_XMM2 implies X86_FEATURE_LFENCE_RDTSC. On AMD processors, X86_FEATURE_LFENCE_RDTSC requires X86_FEATURE_XMM2 and an extra check needs to be performed. On AMD processors that support SSE2, there are three cases to be considered:

  1. The MSR is supported. MSR C001_1029[1] must be set to enable the dispatch serializing behavior of LFENCE.
  2. The MSR is not supported (AMD 0Fh/11h). LFENCE is by default at least dispatch serializing. Nothing needs to be done.
  3. The MSR is supported, but we are running under a hypervisor that does not allow writing that MSR (perhaps because the hypervisor has not been updated yet). In this case, resort to the slower MFENCE (which is fully serializing on AMD processors and represented by the X86_FEATURE_MFENCE_RDTSC macro) for serializing RDTSC and use a Spectre mitigation that does not require LFENCE (i.e., generic retpoline).

The code in the kernel’s init_amd() function handles these cases:
static void init_amd(struct cpuinfo_x86 *c)
{
        ...
        
	if (cpu_has(c, X86_FEATURE_XMM2)) {
		unsigned long long val;
		int ret;

		/*
		 * A serializing LFENCE has less overhead than MFENCE, so
		 * use it for execution serialization.  On families which
		 * don't have that MSR, LFENCE is already serializing.
		 * msr_set_bit() uses the safe accessors, too, even if the MSR
		 * is not present.
		 */
		msr_set_bit(MSR_F10H_DECFG,
			    MSR_F10H_DECFG_LFENCE_SERIALIZE_BIT);

		/*
		 * Verify that the MSR write was successful (could be running
		 * under a hypervisor) and only then assume that LFENCE is
		 * serializing.
		 */
		ret = rdmsrl_safe(MSR_F10H_DECFG, &val);
		if (!ret && (val & MSR_F10H_DECFG_LFENCE_SERIALIZE)) {
			/* A serializing LFENCE stops RDTSC speculation */
			set_cpu_cap(c, X86_FEATURE_LFENCE_RDTSC);
		} else {
			/* MFENCE stops RDTSC speculation */
			set_cpu_cap(c, X86_FEATURE_MFENCE_RDTSC);
		}
	}

        ...
}

The following relevant macros are defined in msr-index.h and they are only used on AMD processors:

#define MSR_F10H_DECFG			0xc0011029
#define MSR_F10H_DECFG_LFENCE_SERIALIZE_BIT	1

X86_FEATURE_LFENCE_RDTSC is used elsewhere in the code to choose a Spectre mitigation.

Linux only uses LFENCE when it has the dispatch serializing properties. If Linux is running on an Intel or AMD processor that doesn’t support SSE2 (typically, 32-bit processors), it resorts to the lock prefix.

The Linux kernel defines the x86 memory fences that it uses in /arch/x86/include/asm/barrier.h as follows:

/*
 * Force strict CPU ordering.
 * And yes, this might be required on UP too when we're talking
 * to devices.
 */

#ifdef CONFIG_X86_32
#define mb() asm volatile(ALTERNATIVE("lock; addl $0,-4(%%esp)", "mfence", \
				      X86_FEATURE_XMM2) ::: "memory", "cc")
#define rmb() asm volatile(ALTERNATIVE("lock; addl $0,-4(%%esp)", "lfence", \
				       X86_FEATURE_XMM2) ::: "memory", "cc")
#define wmb() asm volatile(ALTERNATIVE("lock; addl $0,-4(%%esp)", "sfence", \
				       X86_FEATURE_XMM2) ::: "memory", "cc")
#else
#define mb() 	asm volatile("mfence":::"memory")
#define rmb()	asm volatile("lfence":::"memory")
#define wmb()	asm volatile("sfence" ::: "memory")
#endif

On 32-bit x86 CPUs that don’t support LFENCE, Linux resorts to the lock prefix (which is supported by all Intel and AMD x86 processors since the Intel 8086). On both Intel and AMD processors, X86_FEATURE_XMM2 represents support for SSE2, but not necessarily a dispatch serializing LFENCE. The rmb barrier is used many times in the kernel. Note how volatile and the “memory” clobber are used in all the barrier definitions so that they act as compiler barriers as well.

Potential Uses of LFENCE

There are a few ways to use LFENCE to improve performance and potentially reduce the energy consumption and heat emission of the CPU.

Consider the following C code.

#include <unistd.h> // sleep
#include <pthread.h> // pthread

unsigned var = 0;
void *writer(void *unused) {
 sleep(2); // let the reader loop for a while.
 var = 1;
 asm volatile ("sfence" ::: "memory");
 return NULL;
}
void *reader(void *unused) {
 while(var == 0) { } // spinwait
 return NULL;
}
int main(void) {
 pthread_t thread1, thread2;
 void *status;
 pthread_create(&thread1, NULL, reader, NULL);
 pthread_create(&thread2, NULL, writer, NULL);
 // wait for the threads to terminate.
 pthread_join(thread2, &status);
 pthread_join(thread1, &status);
 return 0;
}

I know that this is not 100% language-compliant code, but it serves the purpose of demonstrating a use case of LFENCE.

In this code, there are three threads: the main thread, a reader thread, and a writer thread. The writer thread sleeps for 2 seconds and then writes to a shared variable. I’ve used SFENCE at the end of the writer thread to force the write to become globally visible so that the reader thread can see it. The reader thread simply iterates in an empty loop until the writer thread updates the shared variable. Compile the code using the command `gcc main.c -pthread` and run the generated executable. Sure enough, after about 2 seconds, all threads terminate.

The assembly code of the reader loop looks like this:

.L4:
 movl var(%rip), %eax
 testl %eax, %eax
 je .L4

Let’s measure a couple of important hardware performance counters using a command similar to this one: `perf stat -r 5 -e r1D1,r10E,r1C2,r0C0 ./a.out`. You should specify the raw events supported by your CPU. On my Intel Haswell processor, r1D1 represents the event of hitting the L1 data cache (MEM_LOAD_UOPS_RETIRED.L1_HIT), r10E represents the event of issuing (in Intel terminology) any uop (UOPS_ISSUED.ANY), r1C2 represents the event of retiring any uop (UOPS_RETIRED.ALL), and r0C0 represents the event of retiring any instruction (INST_RETIRED.ANY_P). If your CPU supports hyperthreading, disable it. On my system, I got the following results:

Performance counter stats for './a.out' (5 runs):

  7,37,90,18,318 r1D1 ( +- 0.25% )
 14,76,48,32,136 r10E ( +- 0.25% )
 14,76,35,29,766 r1C2 ( +- 0.25% )
 22,13,74,78,876 r0C0 ( +- 0.25% )

2.000678814 seconds time elapsed ( +- 0.00% )

Most of the executed instructions come from the reader loop, which contains three instructions. The first one is translated to a single uop and the other two are macro-fused into a single uop. Therefore, the number of retired instructions should be about 50% larger than the number of retired uops, which matches the measurements (note the locale-specific digit grouping in the output: 22,13,74,78,876 retired instructions is about 22.1 billion, versus about 14.8 billion retired uops, a ratio of about 1.5). The number of issued uops should be close to the number of retired uops because there is very little branch misprediction and because the number of uops in this loop is the same in the fused (UOPS_ISSUED.ANY) and unfused (UOPS_RETIRED.ALL) domains. The number of L1 data cache hits is close to the number of iterations of the reader loop. It’s nice to see that the standard deviation is very low.

Now let’s use LFENCE inside the loop.

void *reader(void *unused) {
 while(var == 0) // efficient spinwait
 { 
   asm volatile ("lfence" ::: "memory");
 }
 return NULL;
}

The assembly code of the reader loop looks like this:

.L5:
 lfence
 movl var(%rip), %eax
 testl %eax, %eax
 je .L5

LFENCE prevents the logical processor from issuing instances of instructions that belong to later iterations of the loop until the value of the memory load of the current iteration has been determined. This basically has the effect of slowing down the loop, but in an intelligent manner: there is no point in rapidly issuing load requests. Compile the code using the command `gcc main.c -pthread` and use perf on the generated executable.

Performance counter stats for './a.out' (5 runs):

   46,53,77,643 r1D1 ( +- 0.25% )
 3,72,30,83,120 r10E ( +- 0.25% )
 3,72,28,10,960 r1C2 ( +- 0.25% )
 1,86,11,02,392 r0C0 ( +- 0.25% )

2.000734895 seconds time elapsed ( +- 0.00% )

Nice! The number of load requests (which is very close to the number of iterations of the reader loop) has been reduced by more than 10x. The number of retired/issued uops has been reduced by about 4x. The number of retired instructions is now much smaller than the number of retired uops. The execution time increased only marginally (by roughly 56 microseconds).

As expected, the number of retired instructions is about 4 times the number of L1 data cache hits (the number of iterations), since the loop now contains four instructions. The remaining uop counts suggest that LFENCE decodes to about 6 uops on this Haswell (about 8 retired uops per iteration, minus the 2 uops for the rest of the loop).

This technique is particularly useful when hyperthreading is enabled. LFENCE prevents the reader from unnecessarily consuming execution resources, making them available more often to the other threads. However, the PAUSE instruction might be more suitable for that purpose.
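
For comparison, here is a sketch of the same reader using the PAUSE instruction, which is the hint Intel recommends for spin-wait loops (only the reader function changes; the rest of the program is as before):

void *reader(void *unused) {
 while(var == 0) // spinwait with the PAUSE hint
 {
   asm volatile ("pause" ::: "memory");
 }
 return NULL;
}

PAUSE also slows the loop down and reduces resource consumption, but unlike LFENCE it does not have the dispatch serializing properties discussed above.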

Finally, I’ll discuss how LFENCE can be used to control speculative execution using the following program:

#include <stdlib.h>

int main(void) {
 register unsigned count = 100000000;
 while(count>0)
 {
  if(rand()%2 == 0){ --count; }
  //asm volatile ("lfence" ::: "memory");
 }
 return 0;
}

I’m using the `rand` function to generate random values used to evaluate the condition of a branch. The goal here is to basically defeat the branch predictor no matter how sophisticated it is or how it works. The random number generator has not been seeded to make sure that all runs exhibit the same branching decisions.

Let’s use perf to measure the number of issued uops, retired uops, and retired instructions.

Performance counter stats for './a.out' (5 runs):

 18,17,68,22,733 r10E ( +- 0.06% )
 15,98,76,66,390 r1C2 ( +- 0.01% )
 10,89,74,98,843 r0C0 ( +- 0.01% )

2.221222619 seconds time elapsed ( +- 0.26% )

The number of issued uops (fused domain) is significantly larger than the number of retired uops, which indicates that the CPU experienced a lot of branch mispredictions: uops issued on mispredicted paths are cancelled and never retire. By using LFENCE in the loop, this speculative execution can be eliminated, although LFENCE does not prevent the CPU from speculatively fetching instructions.

Performance counter stats for './a.out' (5 runs):

 15,43,65,94,751 r10E ( +- 0.00% )
 17,29,00,37,902 r1C2 ( +- 0.00% )
 11,09,63,98,795 r0C0 ( +- 0.00% )

2.735851398 seconds time elapsed ( +- 0.32% )

For more information on the impact of LFENCE on performance and on how it is implemented in Intel processors, refer to  the following Stack Overflow post: Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths.

 


22 thoughts on “The Significance of the x86 LFENCE Instruction”

  1. Good article.

    I didn’t understand the remainder of the paragraph starting at the text “This means that other agents in the system might …”. You seem to be talking about global observability of loads from one processor to other agents? IMO it doesn’t make sense to talk about the “global observability” of loads. Only stores have a global effect, and loads have no side effect at all – they only return data from some globally observable store or a local store or something else depending on the memory model. So I didn’t follow what you are describing in that section. Perhaps an example would help?

    You mentioned that you believe that non-temporal loads or loads on memory types other than WB might be re-ordered on Intel but you can’t find the exact section in the manual (unlike AMD where you have the explicit note). I believe you are correct here, but I didn’t find a smoking gun either. Pretty close is 8.2.3.1 which says:

    “As noted above, the examples in this section are limited to accesses to memory regions defined as write-back cacheable (WB). They apply only to ordinary loads stores and to locked read-modify-write instructions. They do not necessarily apply to any of the following: out-of-order stores for string instructions (see Section 8.2.4); accesses with a non-temporal hint; reads from memory by the processor as part of address translation (e.g., page walks); and updates to segmentation and paging structures by the processor (e.g., to update “accessed” bits).”

    At least here you have a pretty explicit warning that ordering is not guaranteed for other memory types and also instructions with non-temporal hints (which would seem to include in principle non-temporal loads like MOVNTDQA although such loads don’t seem to be non-temporal on WB memory regions in practice today).

    The talk (in the Intel manual) of how the CLFLUSH[OPT] instructions are ordered is a bit unclear to me. Unlike loads and stores, these instructions have no visible effect in normal user-mode code as far as I can tell? That is, unlike say store order, you can’t really tell the order in which two cache lines were flushed, can you? That is, other coherent agents will still see stores in an order that respects the global store order of the individual stores, and not really related to the CLFLUSH operations, at least as I understand it. Of course there are some non-functional things you can observe, such as whether a line remains cached and maybe there are some more obscure cases (like non-coherent agents) where it matters…

    So given that I understand that when the document talks about stores not passing other stores, it really means that stores aren’t visibly reordered with other stores. When it talks about CLFLUSH not passing other CLFLUSH operations, it is less clear. I guess ordering is at least partly important from a performance perspective since you want to finish working on a line before you flush it. If your flush moves up before some memory operations on the line, your flush is going to really suck (the flush will happen, then you immediately access the line again, taking hundreds of cycles to bring it back, and then you are left with the line in the cache which you didn’t want in the first place).

    • Thanks for the feedback.

      Global visibility of loads is defined in only two places in the Intel manual, once in Volume 1 Section 11.4.4.3 as follows:

      “A load is considered to become globally visible when the value to be loaded is determined.”

      and in Volume 2:

      “A load instruction is considered to become globally visible when the value to be loaded into its destination register is determined.”

      It’s used in Volume 3 Section 8.2.5, but not defined anywhere in that volume. Also I don’t think there is any explanation anywhere regarding why it matters.

      This concept might seem strange at first, but I think it matters at the architectural level and the microarchitectural level.

      (Now I realize I used the term “observable” rather than “visible”. I’m using them interchangeably).

      Let’s start with the architectural level. Assume one thread is executing the following two loads in program order:

      load [x];
      load [y];
      // Compute result based on the load data. The result will be read by the other thread.

      and a second thread executing the following two stores in program order:

      store [y], 1;
      store [x], 1;

      Assume that the initial values stored at x and y are both zero. What are the possible interactions between these two threads?

      When the loads are described to be strongly-ordered, it means that they can only be globally-observable in program order. (All ordering rules are defined in terms of observability). That is, the data to be fetched from x must be determined before the data to be fetched from y is determined. The second thread knows that these locations are initialized to zero and it’s going to write 1’s to them. If the first thread read the value zero from location x, then the load has become globally observed before the write of the second thread takes place.

      For correctness, when loading data from WB memory, the loads can only be globally observed (although it’s not guaranteed that they will ever be observable, but in case that happens) either at the same time or in program order.

      This implies that if the value loaded from location y is zero, then (assuming the stores are strongly ordered) the value loaded from x must also be zero.

      So global observability of loads is an essential concept that gives meaning to the ordering rules of loads.

      I think we can also define local observability based on the definition from Intel. A load is locally observable (locally visible) (but not globally observable?) when the value to be loaded has not been determined.

      I think there is one important aspect of global visibility that must be understood. Global visibility does not mean that the memory request has crossed the coherence domain. That is, the loaded value might have come from the logical processor itself rather than from the coherence domain. This is important for understanding how concurrent threads that read and write the same memory locations may interact with each other and what would happen when a fence instruction is used.

      At the microarchitectural level, global visibility can be exploited to improve performance of the processor even in the presence of fences. Consider the following piece of code:

      load [x]
      lfence
      // other code

      The LFENCE prevents the processor from dispatching the later instructions until the load becomes globally visible. Here is the important part. The processor can begin dispatching later instructions as soon as it gets an acknowledgement that the load has become globally visible; it does not have to wait for the data to actually arrive. I think this is pretty cool. So even if the load has not actually retired yet, later instructions can still be executed. This internal behavior would not violate the guarantees of LFENCE.

      Regarding the CLFLUSH[OPT] instructions, like you said, the ordering matters for performance. I can also think of cases where it matters for correctness. But it might better to write a separate article for that in the future.

      Regarding the ordering of WC loads (and WC stores), I remember reading a very clear paragraph on this. I’ll go through the Intel manual again more carefully and try to find it.

      • Well I find it confusing to talk about global observability of loads since by definition loads have no global side-effect and aren’t “observed” by other agents per se – the result of the load is available locally to the CPU only: if it wants to communicate that to another CPU, it needs to store something, just like your example does!

        Stores, on the other hand, have a very definite concept of global visibility and that’s probably where this term originated. They start out as speculative in the store buffer, then after retirement they are “senior” and must eventually become visible but aren’t yet and then at some point become globally visible when they join their places in the “total store order”. Not only micro-architectural curiosities – these things lead directly to the ordering rules you see on Intel. There aren’t really analogs for these things for loads… but sure I can agree to use “globally visible” as longhand for “occurs” or shorthand “takes its value at this moment in the total store order”.

        Anyways, it’s mostly just a discussion of semantics. It seems clear that Intel is using “global observability” of loads simply to mean the point at which the load “occurs” or “receives its value” from some earlier store. In most of the ordering specific discussions they don’t use this terminology at all in 8.2.2 which is the most comprehensive/general list of ordering rules: there they just talk about “reordering” and “passing”. As far as I can tell something like “stores are not reordered with other stores” is equivalent to a statement like “stores become globally visible in program order”. I guess there could be some small differences: e.g., if I say “loads aren’t reordered across an LFENCE” it has a subtly different meaning than “all loads before an LFENCE become globally visible before the LFENCE retires and no subsequent loads execute before the LFENCE retires”. Both statements have the same semantics with respect to load memory ordering, so I don’t think there are observable differences for a typical user-mode program, but the latter statement is a bit more specific in how things actually happen, in particular how it serializes the instruction stream.

        So you see global visibility as the core concept, but I could see “ordering rules” as the core concept and I think they are equivalent.

        All that out of the way, I still really don’t get the paragraph I indicated above. It’s not that I disagree with it, it’s that I’m not sure what you are saying. Your example above didn’t shed any light on it, it’s just a normal memory ordering type scenario (and it doesn’t involve an LFENCE). Some comments inline about things i don’t understand.

        > This means that other agents in the system might be able to determine that the logical processor have read a particular memory location in case it was globally updated (the update has reached the coherence domain) at least once by an agent other than the logical processor.

        What is “This” referred to at the start of the sentence?

        > This is possible because the logical processor may produce different results based on the fetched data. But can we say that the other agents can definitely observe these loads?

        I’m still lost here as to the overall scenario. What are these loads? How is LFENCE involved? Is your claim that LFENCE allows something to occur or prevents something from occuring that wouldn’t happen without the LFENCE?

        > Yes we can. That’s because Intel uses the terms “retire” and “complete” interchangeably throughout the manual. When an instruction retires, it means that all of its side effects are either globally visible or will become globally visible at some time later. In other words, the logical processor cannot retract a retired instruction, but it might be able to retire other subsequent instructions that override its effects before they become globally visible. The third sentence clarifies this situation. Writes to memory from retired instructions may not yet be globally visible even though they are visible from the logical processor that retired the instructions. Such writes can be held in buffers known as write buffers and are only guaranteed to become globally visible when they leave these write buffers. Note that this behavior is only specified by the ISA, but the implementation can actually be such that retirement means global visibility.

        All this part I understand and it seems right, but I can’t really put it in the context of the earlier stuff since I don’t understand it.

        BTW, I don’t find Intel’s 4-part note about LFENCE at all confusing. They are just describing one obvious implementation of a load fence: you block later loads from executing by blocking all instructions from executing, and you don’t retire the fence until all previous instructions have finished executing. Since there is no buffering of loads as there is with stores, this is enough for a load fence. It’s a bit more conservative since you could just replace “all instructions” with “all loads” and still have the load-fence effect, I guess Intel went out of their way to document the more conservative behavior since I guess it’s useful e.g., for pairing with rdtsc as an “OoO” barrier to measure more deterministically certain code sections, and it was certainly helpful to have lying around as a speculation barrier when Spectre and Meltdown bit.

        It doesn’t affect the store buffer, which leads directly to the comment about loads.

        My theory? These fences were added when the future vision for the memory model wasn’t fully nailed down and before Intel multi-core CPUs existed. sfence was added as an efficient fence for weakly-ordered stores, followed a short time later by lfence which was probably targeted towards weakly ordered loads, e.g., with NT hints, or perhaps even with the idea of operating the chip or certain regions in weakly-ordered memory regions which allow load-load reordering (kind of a combination of the caching performance of WB plus the weakness of WC). For whatever reason this never panned out (there are no weakly ordered loads on mainstream Intel in the most popular WB mode), and they kept the strong load-load ordering and decided it wasn’t going to be a big barrier to performance through the magic of the MOB.

        So lfence was left a bit orphaned and useless, so they decided to highlight (document) the instruction serializing behavior too. You can find some evidence for this in an old (2004) manual for LFENCE around its introduction that says:

        > Performs a serializing operation on all load-from-memory instructions that were issued prior to
        > the LFENCE instruction. This serializing operation guarantees that every
        > load instruction that precedes in program order the LFENCE instruction is globally visible
        > before any load instruction that follows the LFENCE instruction is globally visible. The
        > LFENCE instruction is ordered with respect to load instructions, other LFENCE instructions,
        > any MFENCE instructions, and any serializing instructions (such as the CPUID
        > instruction). It is not ordered with respect to store instructions or the SFENCE instruction.

        The remainder of the paragraphs are kind of similar today, except that it says that it doesn’t serialize PREFETCH instructions (removed because in the new doc it explicitly serializes instructions so it must serialize these since they are instructions just like any other, although of course hardware prefetch can still kick in).

        (Note that they use the “globally visible” language back then but today have removed to talk simply above “receiving data before”)

        They don’t mention anything about serializing the instruction stream, just purely about serializing loads. Later on they changed it to the current text mentioning specifically that it serializes the instruction stream, maybe because it gives a useful “second life” for lfence in scenarios where you want to serialize the instruction stream cheaply? Just speculation…

        Finally, are you going to support your first sentence? You say that *all* of LFENCE, SFENCE and MFENCE are not memory barrier instructions. You make a case that LFENCE isn’t really here (IMO it’s debatable) but what about the other two? I think they are clearly memory fences, although SFENCE is applicable only when weakly-ordered instructions or memory types are involved.

    • > The talk (in the Intel manual) of how the CLFLUSH[OPT] instructions are ordered is a bit unclear to me. Unlike loads and stores, these instructions have no visible effect in normal user-mode code as far as I can tell?

      One use-case for CLFLUSH[OPT], and one of the reasons an OPT version was introduced at all, is cacheable non-volatile storage hooked up to the memory bus, for use by databases running in user-space. Battery-backed DIMMs exist now, and Intel Optane DIMMs will exist in the future.

      Ordering is visible if you pull the power cord and then check for filesystem / database corruption after rebooting.

      SFENCE becomes the equivalent of a SATA write barrier for a filesystem using journaling or some other ordered-update sequence to allow crash recovery.

      > which would seem to include in principle non-temporal loads like MOVNTDQA although such loads don’t seem to be non-temporal on WB memory regions in practice today).

      The manual says MOVNTDQA is required to obey the ordering rules of the memory region. Options for implementing the NT hint include minimizing pollution, but they don’t include being weakly ordered.

      (Current CPUs ignore the NT hint entirely except on WC memory, because they don’t have NT HW prefetchers. Demand-miss every line sucks, and triggering the non-NT-aware HW prefetchers would defeat the purpose.)

      > IMO it doesn’t make sense to talk about the “global observability” of loads.

      When reasoning about what order different threads saw, it makes sense to consider their loads as part of a global order of loads / stores by all threads. It’s not observable *directly* by other threads, but sometimes you can infer it from what a thread later stores. (Or use a debugger to look at internal thread state.)

      At least on a relatively strongly ordered ISA like x86, loads becoming “globally visible” is equivalent to asking at which point in the global store order they read data. I haven’t done much with weakly ordered ISAs where there isn’t necessarily a single total store order that all cores agree on, and IDK if one flavour of terminology doesn’t work as well there.

      • > One use-case for CLFLUSH[OPT], and one of the reasons an OPT version was introduced at all, is cacheable non-volatile storage hooked up to the memory bus…

        Yes. This is exactly what I was thinking of when I said in my earlier comment that CLFLUSH[OPT] may matter for correctness, not just performance.

        > When reasoning about what order different threads saw, it makes sense to consider their loads as part of a global order of loads / stores by all threads.

        This is a good explanation for the global visibility of loads. See my comment below as well for more discussion.

  2. @BeeOnRope (I’m posting the reply to your comment here because there seems to be a limit on the number of nested comments).

    > if it wants to communicate that to another CPU, it needs to store something, just like your example does!

    Other than loading from MMIO, the load ordering rules only make sense if there is at least one other agent that is storing to a shared memory location. That’s because automatic updates to segmentation and paging structures in x86 are not subject to these rules as they may occur concurrently. But even these updates are technically stores.

    > So you see global visibility as the core concept, but I could see “ordering rules” as the core concept and I think they are equivalent.

    I don’t think they are equivalent. Intel says in Section 8.2.2 that loads are not reordered with other loads. If someone asked you what does it mean, how would you explain it? There is no other way other than talking about the global visibility of loads. Now if someone asked you what does the global visibility of loads mean, how would you explain it? You can define it the way Intel did; there is no need to even mention ordering. You see, global visibility is a concept that applies to a single memory operation (not more), while ordering is a concept that applies to at least two operations. Therefore, global visibility constitutes the basis or foundation for all ordering rules. They are not equivalent or interchangeable terms. That said, I think global visibility by itself is useless outside the context of load (memory in general) ordering.

    > What is “This” referred to at the start of the sentence?
    > I’m still lost here as to the overall scenario.

    It refers to the second sentence from the quote, which is really global visibility of loads in disguise. The scenario I discussed in my first comment is basically what that part of the paragraph is about. Note that the second sentence says that loads complete when the data is received. While this is not strictly incorrect, it’s not accurate either. It’d be more accurate to say that the data has been determined, but not necessarily received yet. But this is just an implementation detail and does not make a difference in functionality at the ISA level.

    > My theory?…

    The Intel patent on LFENCE (https://patents.google.com/patent/US6651151B2/en) also discusses its serializing properties. So I think it was designed like that from the beginning. But they decided to only offer the guarantee of ordering loads at that time. I agree to a certain extent with your theory. But the ability to reorder loads can result in significant performance improvements. There is research on this. I think that the problem with a weakly-ordered WB memory type was potentially that software companies were not ready, not willing, or not motivated enough (or it was too difficult for them) to take advantage of the memory type. If the world turned out to be ready for a true x86 load fence, then Intel could just change the implementation of LFENCE to make it a true load fence without breaking compatibility. Otherwise, they could leave it as it is and expose the serializing properties when needed without breaking compatibility as well. And if it turns out to be completely useless, then they could just reuse it for other functionality and add a control bit to switch between the old and new functionality without breaking compatibility. So giving LFENCE the serializing properties from the beginning makes sense and it was a very good decision at that time. The person/people who invented LFENCE really knew what they were doing.

    > Finally, are you going to support your first sentence?…

    This requires a discussion of what these terms mean or used to mean in the literature and what they mean in x86. Another tale for another day. Until then, I’ll remove the “but they are not” part. You can call LFENCE a load fence or SFENCE a store fence (Intel does that). That’s fine as long as the person you’re talking to or the target audience understands that you’re talking about x86 fences, rather than fences in theory or in some other architecture. Context matters. When I first started learning about LFENCE, I was very confused because I was coming from an academic/theoretic background.

  3. > I don’t think they are equivalent. Intel says in Section 8.2.2 that loads are not reordered with other loads. If someone asked you what does it mean, how would you explain it?

    I would say that the loads take their values from the total store order in program order.

    Or just read formalisms of the x86 memory model, the best one which seems to be:

    Click to access cacm.pdf

    which certainly don’t need to use that term.

    > There is no other way other than talking about the global visibility of loads. Now if someone asked you what does the global visibility of loads mean, how would you explain it? You can define it the way Intel did; there is no need to even mention ordering. You see, global visibility is a concept that applies to a single memory operation (not more), while ordering is a concept that applies to at least two operations. Therefore, global visibility constitutes the basis or foundation for all ordering rules. They are not equivalent or interchangeable terms. That said, I think global visibility by itself is useless outside the context of load (memory in general) ordering.

    You have mostly lost me on this global visibility stuff. Many memory models and formalisms get by without even using this terminology at all. Certainly many use “global visibility” only to refer to stores and not loads. It’s not at all clear what being “globally visible” for a load even means to me. What is the opposite of global visibility for loads? For stores it is clear, and the visibility effect is obvious: other threads can see the stores. There is no analog for loads.

    IMO global visibility is some term that was introduced back in the single core days where the concerns were simpler and mostly about other agents in the systems, not a coherent memory model. So global visibility was shorthand for “the access appeared on the bus” or “the value has been flushed/read from memory” or “the access can be observed by external agents”. Mostly only a few low level hardware and OS people had to care. Then later we had multiple-CPU and multiple-core and defining everything more precisely became important, and everything became about ordering – because that’s what really matters: that everything is observed in some relative order consistent with the ordering model (and also that it appears “soon enough” – i.e., not indefinitely postponed inside the core: not a problem in practice but a bit of a loophole in both hardware and software memory models).

    In various places you still find the global visibility terminology, but IMO it often adds nothing, and perhaps always nothing for loads.

    I still think it’s partly just a semantics problem. Sure Intel defines it (in one place) very simply as “when the value to load is to be determined”, but why do they use the term “globally visible”? They could use any other term. Maybe they don’t even need a term, since that definition, IMO, adds nothing: the only thing a load *ever* does is have a value determined (so why do they introduce the GV concept for loads at all?). Just because they chose to use that term for loads in their manual doesn’t make it a good choice and it certainly doesn’t automatically help a user understand its meaning in some fundamental way.

    > The Intel patent on LFENCE (https://patents.google.com/patent/US6651151B2/en) also discusses its serializing properties.

    Sure, I think it was serializing right from the beginning. That they described the mechanism, including serialization isn’t weird since that’s how all of their hardware patents work: they are generally describing the detailed hardware mechanism, and often (as it appears to me) the exact mechanism used in some actual Intel CPU (often down to the size of various buffers, etc).

    > The scenario I discussed in my first comment is basically what that part of the paragraph is about.

    The scenario in your comment is simple and doesn’t involve LFENCE. Does the paragraph in the original post involve LFENCE? I’m just trying to see if there is something interesting there since it’s the one paragraph I didn’t understand and you seem to be saying something deep about what LFENCE can do. IMO LFENCE can do nothing here, so I want to get it.

    Maybe if you had a one sentence summary of what you are trying to describe at a high level there, because I’m really not even getting it. Something like “Here I’m explaining that the Intel LFENCE text implies that it can be used to ensure that one agent XYZ”. I wish I could fill in XYZ, but I can’t.

    > You can call LFENCE a load fence or SFENCE a store fence (Intel does that).

    Well by definition they are. They are like SPARC’s #LoadLoad or #StoreStore fences, in that they explicitly don’t allow load-load reordering and store-store reordering, respectively. They are fairly usual in that sense, except for the glaring point that such re-orderings are already disallowed in “normal” (WB) memory for x86, so their real uses in practice are more niche. MFENCE doesn’t have that problem though.

    Actually in academia on the concurrent algorithms side you often just find descriptions as if everything was seq cst with a caveat to “insert appropriate barriers” to make it seq cst on your architecture. The second most popular approach is just a global memory fence which blocks all re-orderings. Some other work uses the 4 SPARC style barriers [1]. So it’s not as if there is a large body of work with consistent barrier names and semantics there anyways. Maybe you are talking about the hardware side, it could be different there.

    [1] The 4 SPARC-style barriers aren’t really enough though, since it leaves some things undefined, such as if there is a global store order. For example, you can’t really describe the basic x86 memory model in terms of which of the 4 barriers are no-ops, since store-forwarding in particular doesn’t fit that pattern.

    • I made some changes to the article. First, I found a statement in the Intel manual that confirms that WC loads may be reordered (finally). Second, I added a little more discussion on using LFENCE with RDTSC (although this discussion can span a whole article by itself). Third, I discussed how MONITOR relates to LFENCE. Fourth, I added some discussion on the use of LFENCE in the Linux kernel.

      I agree that the phrase “global visibility of loads” might sound counterintuitive. It seems to me that AMD only uses the notion of global visibility on stores.

      Note that I changed the limit on nested comments to 10, which seems to be the maximum.

  4. You can get more human-readable perf output by using the ocperf.py wrapper (https://github.com/andikleen/pmu-tools). Instead of people having to remember what `r1D1` is on Haswell, you’ll get output that uses Intel’s symbolic names for the events. (Sample output: https://stackoverflow.com/questions/44169342/can-x86s-mov-really-be-free-why-cant-i-reproduce-this-at-all).

    It even prints out the perf command it actually runs, e.g. on Skylake:

    perf stat -etask-clock,context-switches,page-faults,cycles,instructions,branches,cpu/event=0xe,umask=0x1,name=uops_issued_any/,cpu/event=0xb1,umask=0x1,name=uops_executed_thread/ -r2 ./mov-elimination

    Use a command like `ocperf.py stat -etask-clock,context-switches,page-faults,cycles,instructions,branches,uops_issued.any,uops_executed.thread` to get some of the fixed counters (cycles, instructions) along with programmable counters. Then you don’t have to reason about the clock-speed and nanoseconds, you just have cycle counts directly. (I usually leave `branches` in as part of the count so I don’t have to look elsewhere to remember how many loop iterations I was benchmarking, i.e. what to divide the other counts by to get per-iteration costs. But take it out if you’re running out of counters and you don’t want perf to multiplex.)

    • Thanks. The ocperf.py wrapper is pretty cool. I’ll try to use it in the future.

      Yea I measured no more than 4 events in any single run to avoid multiplexing.

  5. perf will use the fixed counters for events that have them, like cycles and instructions, so you can use those + 4 programmable events without triggering multiplexing.

    (If perf does need to multiplex, the user-space code doesn’t take into account which HW events have fixed counters. But it does correctly figure out if it needs to multiplex at all by asking the kernel if it can program all of these events at once. That does use fixed + programmable counters correctly.)

    Using alternatives to the fixed counters, e.g. precise event for cycles, is good for perf record; the fixed counters use the legacy mechanism of interrupting on every overflow. But for perf stat I think it’s fine (hopefully perf sets the roll-over interval higher for less-frequent interrupts when doing a stat run where only the totals matter, not statistical sampling).

  6. What’s the impact of LFENCE at that location? Well, LFENCE will allow the processor to fetch and decode all later instructions including RDTSC, but not execute any of them until “RDTSC” and all earlier instructions retire.

    I think you intended to write ‘but not execute any of them until LFENCE and all earlier instructions retire’.


  7. load1
    inst2
    store1
    MFENCE
    LFENCE
    RDTSC

    MFENCE here enables RDTSC to capture the time it takes to make all stores globally visible. ”

    To drain the store buffer, don’t you think SFENCE can also help instead of MFENCE ?

      • Intel SDM v3 sec 11.10 says that

        contents of the store buffer are always drained to memory in the following situations

        (Pentium III, and more recent processor families only) When using an SFENCE instruction to order stores

        And clubbing it with the following

        Specifically, LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes

        It kind of makes me believe that SFENCE is sufficient

