## AMD64 Technology

## AMD64 Architecture Programmer's Manual Volume 6: 128-Bit and 256-Bit XOP, FMA4 and CVT16 Instructions

| Publication No. | Revision | Date |
| :--- | :--- | :--- |
| 43479 | 3.03 | May 2009 |

© 2009 Advanced Micro Devices, Inc. All rights reserved.
The contents of this document are provided in connection with Advanced Micro Devices, Inc. ("AMD") products. AMD makes no representations or warranties with respect to the accuracy or completeness of the contents of this publication and reserves the right to make changes to specifications and product descriptions at any time without notice. The information contained herein may be of a preliminary or advance nature and is subject to change without notice. No license, whether express, implied, arising by estoppel or otherwise, to any intellectual property rights is granted by this publication. Except as set forth in AMD's Standard Terms and Conditions of Sale, AMD assumes no liability whatsoever, and disclaims any express or implied warranty, relating to its products including, but not limited to, the implied warranty of merchantability, fitness for a particular purpose, or infringement of any intellectual property right.

AMD's products are not designed, intended, authorized or warranted for use as components in systems intended for surgical implant into the body, or in other applications intended to support or sustain life, or in any other application in which the failure of AMD's product could create a situation where personal injury, death, or severe property or environmental damage may occur. AMD reserves the right to discontinue or make changes to its products at any time without notice.

## Trademarks

AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. MMX is a trademark of Intel Corporation.
Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

## Contents

Preface
1 New 128-Bit and 256-Bit Instructions
1.1 New Instruction Format
1.2 Opcode Byte
1.3 Destination XMM registers
1.4 Four-Operand Instructions
1.5 Three-Operand Instructions
1.6 Two Operand Instructions
1.7 16-Bit Floating-Point Data Type
1.8 XOP Integer Multiply (Add) and Accumulate Instructions
1.9 Packed Integer Horizontal Add and Subtract
1.10 Vector Conditional Moves
1.11 Packed Integer Rotates and Shifts
1.12 Packed Integer Comparison and Predicate Generation
1.13 Fraction Extract
1.14 Convert
2 AMD XOP, FMA4 and CVT16 Instructions
2.1 Notation
2.2 Operand Specification
2.3 Instruction Reference
VCVTPH2PS
VCVTPS2PH
VFMADDPD
VFMADDPS
VFMADDSD
VFMADDSS
VFMADDSUBPD
VFMADDSUBPS
VFMSUBADDPD
VFMSUBADDPS
VFMSUBPD
VFMSUBPS
VFMSUBSD
VFMSUBSS
VFNMADDPD
VFNMADDPS
VFNMADDSD
VFNMADDSS
VFNMSUBPD
VFNMSUBPS
VFNMSUBSD
VFNMSUBSS
VFRCZPD
VFRCZPS
VFRCZSD
VFRCZSS
VPCMOV
VPCOMB
VPCOMD
VPCOMQ
VPCOMUB
VPCOMUD
VPCOMUQ
VPCOMUW
VPCOMW
VPHADDBD
VPHADDBQ
VPHADDBW
VPHADDDQ
VPHADDUBD
VPHADDUBQ
VPHADDUBW
VPHADDUDQ
VPHADDUWD
VPHADDUWQ
VPHADDWD
VPHADDWQ
VPHSUBBW
VPHSUBDQ
VPHSUBWD
VPMACSDD
VPMACSDQH
VPMACSDQL
VPMACSSDD
VPMACSSDQH
VPMACSSDQL
VPMACSSWD
VPMACSSWW
VPMACSWD
VPMACSWW
VPMADCSSWD
VPMADCSWD
VPPERM
VPROTB
VPROTD
VPROTQ
VPROTW
VPSHAB
VPSHAD
VPSHAQ
VPSHAW
VPSHLB
VPSHLD
VPSHLQ
VPSHLW

AMD64 Technology Documentation Updates
43479—Rev. 3.03—May 2009

## Tables

Table 1-1. Operand Element Size-OES
Table 1-2. Operand Configurations for PCMOV and PPERM Instructions
Table 1-3. Four Operand Instruction Opcode Map
Table 1-4. Operand Configurations for Three Operand Instructions
Table 1-5. Three Operand Instruction Opcode Map
Table 1-6. Two Operand Instruction Opcode Map
Table 1-7. Supported 16-Bit Floating-Point Encodings
Table 1-8. Immediate Operand Values for Unsigned Vector Comparison Operations
Table 2-1. Denormal and Rounding Control with Immediate Byte Operand
Table 2-2. Denormal and Rounding Control with Immediate Byte Operand
Table 1. VPCOMB Comparison Operations
Table 2. VPCOMD Comparison Operations
Table 3. VPCOMQ Comparison Operations
Table 4. VPCOMUB Comparison Operations
Table 5. VPCOMUD Comparison Operations
Table 6. VPCOMUQ Comparison Operations
Table 7. VPCOMUW Comparison Operations
Table 8. VPCOMW Comparison Operations
Table 2-3. VPPERM Control Byte

## Preface

## About This Book

The instructions described in this book are part of a multivolume work entitled the AMD64 Architecture Programmer's Manual. The following table lists each volume and its order number.

| Title | Order No. |
| :--- | :--- |
| Volume 1: Application Programming | 24592 |
| Volume 2: System Programming | 24593 |
| Volume 3: General-Purpose and System Instructions | 24594 |
| Volume 4: 128-Bit Media Instructions | 26568 |
| Volume 5: 64-Bit Media and x87 Floating-Point Instructions | 26569 |
| Volume 6: 128-Bit and 256-Bit XOP, FMA4 and CVT16 Instructions | 43479 |

## Audience

This document is intended for all programmers writing application or system software for a processor that implements the AMD64 architecture.

## Organization

Volumes 3 through 6 describe the AMD64 architecture's instruction set in detail. Together, they cover each instruction's mnemonic syntax, opcodes, functions, affected flags, and possible exceptions.

The AMD64 instruction set is divided into six subsets:

- General-purpose instructions
- System instructions
- 128-bit media instructions
- 64-bit media instructions
- x87 floating-point instructions
- 128-bit and 256-bit XOP media instructions

Several instructions belong to multiple instruction subsets and are described identically in each.

This volume describes the 128-bit and 256-bit XOP, FMA4 and CVT16 instruction extensions. The index at the end cross-references topics within this volume. For other topics relating to the AMD64 architecture, and for information on instructions in other subsets, see the tables of contents and indexes of the other volumes.

## Definitions

Many of the following definitions assume an in-depth knowledge of the legacy x86 architecture. See "Related Documents" on page 20 for descriptions of the legacy x86 architecture.

## Terms and Notation

In addition to the notation described below, "Opcode-Syntax Notation" in Volume 3 describes notation relating specifically to opcodes.
1011b
A binary value; in this example, a 4-bit value.
F0EAh
A hexadecimal value; in this example, a 2-byte value.
[1,2)
A range that includes the left-most value (in this case, 1) but excludes the right-most value (in this case, 2).

7-4
A bit range, from bit 7 to 4, inclusive. The high-order bit is shown first.

## 128-bit media instructions

Instructions that use the 128-bit XMM registers. These are a combination of the SSE and SSE2 instruction sets.

## 64-bit media instructions

Instructions that use the 64-bit MMX registers. These are primarily a combination of MMX™ and 3DNow!™ instruction sets, with some additional instructions from the SSE and SSE2 instruction sets.

## 16-bit mode

Legacy mode or compatibility mode in which a 16-bit address size is active. See legacy mode and compatibility mode.

## 32-bit mode

Legacy mode or compatibility mode in which a 32-bit address size is active. See legacy mode and compatibility mode.

## 64-bit mode

A submode of long mode. In 64-bit mode, the default address size is 64 bits and new features, such as register extensions, are supported for system and application software.

## \#GP(0)

Notation indicating a general-protection exception (\#GP) with error code of 0.

## absolute

Said of a displacement that references the base of a code segment rather than an instruction pointer.
Contrast with relative.
ASID
Address space identifier.

## biased exponent

The sum of a floating-point value's exponent and a constant bias for a particular floating-point data type. The bias makes the range of the biased exponent always positive, which allows reciprocation without overflow.
byte
Eight bits.
clear
To write a bit value of 0. Compare set.
compatibility mode
A submode of long mode. In compatibility mode, the default address size is 32 bits, and legacy 16-bit and 32-bit applications run without modification.

## commit

To irreversibly write, in program order, an instruction's result to software-visible storage, such as a register (including flags), the data cache, an internal write buffer, or memory.

CPL
Current privilege level.

## CR0-CR4

A register range, from register CR0 through CR4, inclusive, with the low-order register first.
CR0.PE = 1
Notation indicating that the PE bit of the CR0 register has a value of 1.

## direct

Referencing a memory location whose address is included in the instruction's syntax as an immediate operand. The address may be an absolute or relative address. Compare indirect.

## dirty data

Data held in the processor's caches or internal buffers that is more recent than the copy held in main memory.

## displacement

A signed value that is added to the base of a segment (absolute addressing) or an instruction pointer (relative addressing). Same as offset.

## doubleword

Two words, or four bytes, or 32 bits.
double quadword
Eight words, or 16 bytes, or 128 bits. Also called octword.
DS:rSI
The contents of a memory location whose segment address is in the DS register and whose offset relative to that segment is in the rSI register.
EFER.LME = 0
Notation indicating that the LME bit of the EFER register has a value of 0.

## effective address size

The address size for the current instruction after accounting for the default address size and any address-size override prefix.

## effective operand size

The operand size for the current instruction after accounting for the default operand size and any operand-size override prefix.

## element

See vector.
exception
An abnormal condition that occurs as the result of executing an instruction. The processor's response to an exception depends on the type of the exception. For all exceptions except 128-bit media SIMD floating-point exceptions and x87 floating-point exceptions, control is transferred to the handler (or service routine) for that exception, as defined by the exception's vector. For floating-point exceptions defined by the IEEE 754 standard, there are both masked and unmasked responses. When unmasked, the exception handler is called, and when masked, a default response is provided instead of calling the handler.

FF /0
Notation indicating that FF is the first byte of an opcode, and a subfield in the second byte has a value of 0.
flush
An often ambiguous term meaning (1) writeback, if modified, and invalidate, as in "flush the cache line," or (2) invalidate, as in "flush the pipeline," or (3) change a value, as in "flush to zero."

GDT
Global descriptor table.

## GIF

Global interrupt flag.
IDT
Interrupt descriptor table.
IGN
Ignore. Field is ignored.
indirect
Referencing a memory location whose address is in a register or other memory location. The address may be an absolute or relative address. Compare direct.

IRB
The virtual-8086 mode interrupt-redirection bitmap.
IST
The long-mode interrupt-stack table.
IVT
The real-address mode interrupt-vector table.
LDT
Local descriptor table.
legacy x86
The legacy x86 architecture. See "Related Documents" on page 20 for descriptions of the legacy x86 architecture.
legacy mode
An operating mode of the AMD64 architecture in which existing 16-bit and 32-bit applications and operating systems run without modification. A processor implementation of the AMD64 architecture can run in either long mode or legacy mode. Legacy mode has three submodes, real mode, protected mode, and virtual-8086 mode.
long mode
An operating mode unique to the AMD64 architecture. A processor implementation of the AMD64 architecture can run in either long mode or legacy mode. Long mode has two submodes, 64-bit mode and compatibility mode.

## lsb

Least-significant bit.
LSB
Least-significant byte.
main memory
Physical memory, such as RAM and ROM (but not cache memory) that is installed in a particular computer system.
mask
(1) A control bit that prevents the occurrence of a floating-point exception from invoking an exception-handling routine. (2) A field of bits used for a control purpose.

MBZ
Must be zero. If software attempts to set an MBZ bit to 1, a general-protection exception (\#GP) occurs.
memory
Unless otherwise specified, main memory.
ModRM
A byte following an instruction opcode that specifies address calculation based on mode (Mod), register (R), and memory (M) variables.
moffset
A 16-, 32-, or 64-bit offset that specifies a memory operand directly, without using a ModRM or SIB byte.
msb
Most-significant bit.
MSB
Most-significant byte.
multimedia instructions
A combination of 128-bit media instructions and 64-bit media instructions.
octword
Same as double quadword.
offset
Same as displacement.
overflow
The condition in which a floating-point number is larger in magnitude than the largest, finite, positive or negative number that can be represented in the data-type format being used.
packed
See vector.
PAE
Physical-address extensions.
physical memory
Actual memory, consisting of main memory and cache.
probe
A check for an address in a processor's caches or internal buffers. External probes originate outside the processor, and internal probes originate within the processor.

## protected mode

A submode of legacy mode.
quadword
Four words, or eight bytes, or 64 bits.
reserved
Fields marked as reserved may be used at some future time.
To preserve compatibility with future processors, reserved fields require special handling when read or written by software.
Reserved fields may be further qualified as MBZ, RAZ, SBZ or IGN (see definitions).
Software must not depend on the state of a reserved field, nor upon the ability of such fields to return to a previously written state.
If a reserved field is not marked with one of the above qualifiers, software must not change the state of that field; it must reload that field with the same values returned from a prior read.

RAZ
Read as zero (0), regardless of what is written.
real-address mode
A submode of legacy mode with 16-bit addressing and operand size and a simple form of segmentation, lacking the segment and privilege protection mechanisms of protected mode. See real mode.

## real mode

A short name for real-address mode, a submode of legacy mode.
relative
Referencing with a displacement (also called offset) from an instruction pointer rather than the base of a code segment. Contrast with absolute.

REX
An instruction prefix that specifies a 64-bit operand size and provides access to additional registers.

RIP-relative addressing
Addressing relative to the 64-bit RIP instruction pointer.
set
To write a bit value of 1. Compare clear.
SIB
A byte following an instruction opcode that specifies address calculation based on scale (S), index (I), and base (B).

## SIMD

Single instruction, multiple data. See vector.

## SSEn and SSSEn

Various extensions to the SSE instruction set. See 128-bit media instructions and 64-bit media instructions.
sticky bit
A bit that is set or cleared by hardware and that remains in that state until explicitly changed by software.

TOP
The x87 top-of-stack pointer.
TSS
Task-state segment.
underflow
The condition in which a floating-point number is smaller in magnitude than the smallest nonzero, positive or negative number that can be represented in the data-type format being used.
vector
(1) A set of integer or floating-point values, called elements, that are packed into a single operand. Most of the 128-bit and 64-bit media instructions use vectors as operands. Vectors are also called packed or SIMD (single-instruction multiple-data) operands.
(2) An index into an interrupt descriptor table (IDT), used to access exception handlers. Compare exception.
virtual-8086 mode
A submode of legacy mode.
VMCB
Virtual machine control block.
VMM
Virtual machine monitor.
word
Two bytes, or 16 bits.
x86
See legacy x86.

## Registers

In the following list of registers, the names are used to refer either to a given register or to the contents of that register:

AH-DH
The high 8-bit AH, BH, CH, and DH registers. Compare AL-DL.

## AL-DL

The low 8-bit AL, BL, CL, and DL registers. Compare AH-DH.
AL-r15B
The low 8-bit AL, BL, CL, DL, SIL, DIL, BPL, SPL, and R8B-R15B registers, available in 64-bit mode.

BP
Base pointer register.
CRn
Control register number $n$.
CS
Code segment register.
eAX-eSP
The 16-bit AX, BX, CX, DX, DI, SI, BP, and SP registers or the 32-bit EAX, EBX, ECX, EDX, EDI, ESI, EBP, and ESP registers. Compare rAX-rSP.
EBP
Extended base pointer register.

EFER
Extended features enable register.
eFLAGS
16-bit or 32-bit flags register. Compare rFLAGS.

## EFLAGS

32-bit (extended) flags register.
eIP
16-bit or 32-bit instruction-pointer register. Compare rIP.
EIP
32-bit (extended) instruction-pointer register.
FLAGS
16-bit flags register.

## GDTR

Global descriptor table register.

## GPRs

General-purpose registers. For the 16-bit data size, these are AX, BX, CX, DX, DI, SI, BP, and SP. For the 32-bit data size, these are EAX, EBX, ECX, EDX, EDI, ESI, EBP, and ESP. For the 64-bit data size, these include RAX, RBX, RCX, RDX, RDI, RSI, RBP, RSP, and R8-R15.

IDTR
Interrupt descriptor table register.

IP
16-bit instruction-pointer register.
LDTR
Local descriptor table register.
MSR
Model-specific register.
r8-r15
The 8-bit R8B-R15B registers, or the 16-bit R8W-R15W registers, or the 32-bit R8D-R15D registers, or the 64-bit R8-R15 registers.
rAX-rSP
The 16-bit AX, BX, CX, DX, DI, SI, BP, and SP registers, or the 32-bit EAX, EBX, ECX, EDX, EDI, ESI, EBP, and ESP registers, or the 64-bit RAX, RBX, RCX, RDX, RDI, RSI, RBP, and RSP registers. Replace the placeholder r with nothing for 16-bit size, "E" for 32-bit size, or "R" for 64-bit size.

RAX
64-bit version of the EAX register.
RBP
64-bit version of the EBP register.
RBX
64-bit version of the EBX register.
RCX
64-bit version of the ECX register.
RDI
64-bit version of the EDI register.
RDX
64-bit version of the EDX register.
rFLAGS
16-bit, 32-bit, or 64-bit flags register. Compare RFLAGS.

## RFLAGS

64-bit flags register. Compare rFLAGS.
rIP
16-bit, 32-bit, or 64-bit instruction-pointer register. Compare RIP.
RIP
64-bit instruction-pointer register.
RSI
64-bit version of the ESI register.
RSP
64-bit version of the ESP register.
SP
Stack pointer register.
SS
Stack segment register.

## TPR

Task priority register (CR8), a new register introduced in the AMD64 architecture to speed interrupt management.

TR
Task register.
XMM0-XMM15
The 128-bit XMM registers; each is the lower half of a corresponding 256-bit YMM register.

## YMM0-YMM15

The 256-bit YMM registers; the lower half of each of these is the corresponding 128-bit XMM register.

## Endian Order

The x86 and AMD64 architectures address memory using little-endian byte-ordering. Multibyte values are stored with their least-significant byte at the lowest byte address, and they are illustrated with their least significant byte at the right side. Strings are illustrated in reverse order, because the addresses of their bytes increase from right to left.
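
As a minimal illustration of the byte ordering described above, the following Python sketch lays out a 32-bit value the way it would appear in memory, lowest address first. The helper name `to_little_endian` is an assumption made for this example and is not part of the architecture.

```python
def to_little_endian(value: int, size: int) -> bytes:
    """Return the bytes of an unsigned integer in memory order,
    lowest byte address first (little-endian)."""
    return value.to_bytes(size, byteorder="little")


mem = to_little_endian(0x12345678, 4)

# mem[0] is the byte at the lowest address, which holds the
# least-significant byte of the value.
print([f"{b:02X}" for b in mem])  # ['78', '56', '34', '12']
```

Reading the bytes back with `int.from_bytes(mem, "little")` recovers the original value.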

## Related Documents

- Peter Abel, IBM PC Assembly Language and Programming, Prentice-Hall, Englewood Cliffs, NJ, 1995.
- Rakesh Agarwal, 80x86 Architecture & Programming: Volume II, Prentice-Hall, Englewood Cliffs, NJ, 1991.
- AMD, AMD-K6™ MMX™ Enhanced Processor Multimedia Technology, Sunnyvale, CA, 2000.
- AMD, 3DNow!™ Technology Manual, Sunnyvale, CA, 2000.
- AMD, AMD Extensions to the 3DNow!™ and MMX™ Instruction Sets, Sunnyvale, CA, 2000.
- Don Anderson and Tom Shanley, Pentium Processor System Architecture, Addison-Wesley, New York, 1995.
- Nabajyoti Barkakati and Randall Hyde, Microsoft Macro Assembler Bible, Sams, Carmel, Indiana, 1992.
- Barry B. Brey, 8086/8088, 80286, 80386, and 80486 Assembly Language Programming, Macmillan Publishing Co., New York, 1994.
- Barry B. Brey, Programming the 80286, 80386, 80486, and Pentium Based Personal Computer, Prentice-Hall, Englewood Cliffs, NJ, 1995.
- Ralf Brown and Jim Kyle, PC Interrupts, Addison-Wesley, New York, 1994.
- Penn Brumm and Don Brumm, 80386/80486 Assembly Language Programming, Windcrest McGraw-Hill, 1993.
- Geoff Chappell, DOS Internals, Addison-Wesley, New York, 1994.
- Chips and Technologies, Inc. Super386 DX Programmer's Reference Manual, Chips and Technologies, Inc., San Jose, 1992.
- John Crawford and Patrick Gelsinger, Programming the 80386, Sybex, San Francisco, 1987.
- Cyrix Corporation, 5x86 Processor BIOS Writer's Guide, Cyrix Corporation, Richardson, TX, 1995.
- Cyrix Corporation, M1 Processor Data Book, Cyrix Corporation, Richardson, TX, 1996.
- Cyrix Corporation, MX Processor MMX Extension Opcode Table, Cyrix Corporation, Richardson, TX, 1996.
- Cyrix Corporation, MX Processor Data Book, Cyrix Corporation, Richardson, TX, 1997.
- Ray Duncan, Extending DOS: A Programmer's Guide to Protected-Mode DOS, Addison Wesley, NY, 1991.
- William B. Giles, Assembly Language Programming for the Intel 80xxx Family, Macmillan, New York, 1991.
- Frank van Gilluwe, The Undocumented PC, Addison-Wesley, New York, 1994.
- John L. Hennessy and David A. Patterson, Computer Architecture, Morgan Kaufmann Publishers, San Mateo, CA, 1996.
- Thom Hogan, The Programmer's PC Sourcebook, Microsoft Press, Redmond, WA, 1991.
- Hal Katircioglu, Inside the 486, Pentium, and Pentium Pro, Peer-to-Peer Communications, Menlo Park, CA, 1997.
- IBM Corporation, 486SLC Microprocessor Data Sheet, IBM Corporation, Essex Junction, VT, 1993.
- IBM Corporation, 486SLC2 Microprocessor Data Sheet, IBM Corporation, Essex Junction, VT, 1993.
- IBM Corporation, 80486DX2 Processor Floating Point Instructions, IBM Corporation, Essex Junction, VT, 1995.
- IBM Corporation, 80486DX2 Processor BIOS Writer's Guide, IBM Corporation, Essex Junction, VT, 1995.
- IBM Corporation, Blue Lightning 486DX2 Data Book, IBM Corporation, Essex Junction, VT, 1994.
- Institute of Electrical and Electronics Engineers, IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Std 754-1985.
- Institute of Electrical and Electronics Engineers, IEEE Standard for Radix-Independent Floating-Point Arithmetic, ANSI/IEEE Std 854-1987.
- Muhammad Ali Mazidi and Janice Gillispie Mazidi, 80X86 IBM PC and Compatible Computers, Prentice-Hall, Englewood Cliffs, NJ, 1997.
- Hans-Peter Messmer, The Indispensable Pentium Book, Addison-Wesley, New York, 1995.
- Karen Miller, An Assembly Language Introduction to Computer Architecture: Using the Intel Pentium, Oxford University Press, New York, 1999.
- Stephen Morse, Eric Isaacson, and Douglas Albert, The 80386/387 Architecture, John Wiley & Sons, New York, 1987.
- NexGen Inc., Nx586 Processor Data Book, NexGen Inc., Milpitas, CA, 1993.
- NexGen Inc., Nx686 Processor Data Book, NexGen Inc., Milpitas, CA, 1994.
- Bipin Patwardhan, Introduction to the Streaming SIMD Extensions in the Pentium III, www.x86.org/articles/sse_pt1/simd1.htm, June, 2000.
- Peter Norton, Peter Aitken, and Richard Wilton, PC Programmer's Bible, Microsoft Press, Redmond, WA, 1993.
- PharLap 386|ASM Reference Manual, Pharlap, Cambridge MA, 1993.
- PharLap TNT DOS-Extender Reference Manual, Pharlap, Cambridge MA, 1995.
- Sen-Cuo Ro and Sheau-Chuen Her, i386/i486 Advanced Programming, Van Nostrand Reinhold, New York, 1993.
- Jeffrey P. Royer, Introduction to Protected Mode Programming, course materials for an onsite class, 1992.
- Tom Shanley, Protected Mode System Architecture, Addison Wesley, NY, 1996.
- SGS-Thomson Corporation, 80486DX Processor SMM Programming Manual, SGS-Thomson Corporation, 1995.
- Walter A. Triebel, The 80386DX Microprocessor, Prentice-Hall, Englewood Cliffs, NJ, 1992.
- John Wharton, The Complete x86, MicroDesign Resources, Sebastopol, California, 1994.
- Web sites and newsgroups:
  - www.amd.com
  - news.comp.arch
  - news.comp.lang.asm.x86
  - news.intel.microprocessors
  - news.microsoft


## 1 New 128-Bit and 256-Bit Instructions

This release of the AMD64 architecture introduces the XOP, CVT16, and FMA4 instruction set extensions. These 128-bit and 256-bit instructions complement the AMD64 128-bit media instructions described in detail in the AMD64 Architecture Programmer's Manual Volume 4: 128-Bit Media Instructions, order# 26568. This document describes new instructions that are designed to:

- Improve performance by increasing the work per instruction.
- Reduce the need to copy and move around register operands.

These instruction set extensions include:

- Floating-point multiply accumulate instructions
- Floating-point fraction extract and half-precision conversion instructions
- Integer horizontal add instructions
- Integer multiply accumulate instructions
- Byte permutation and bit granularity conditional move instructions
- Packed integer compare and individual-partition shift/rotate instructions

These instructions all use the new XOP instruction format, which takes advantage of the three- and four-operand non-destructive capability, 256-bit operand size, and instruction length efficiency provided by this encoding. These instructions operate on either the lower 128 bits or the full 256 bits of the new YMM registers. Context handling of the YMM register set is supported by the new XSAVE/XRSTOR instructions in conjunction with the XSETBV and XGETBV instructions. Support for YMM context handling must be provided by the operating system and must be indicated by setting CR4.OSXSAVE to 1.

Support for the new instructions is indicated by use of the CPUID instruction:

- XOP: ECX bit 11 as returned by CPUID function 8000_0001h.
- FMA4: ECX bit 16 as returned by CPUID function 8000_0001h.
- CVT16: ECX bit 18 as returned by CPUID function 8000_0001h.

Attempting to execute these instructions causes a \#UD exception either if they are not present in the hardware or if operating system support for YMM context switching is not indicated by setting CR4.OSXSAVE to 1.
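
The feature check described above can be sketched as follows. This is an illustrative Python model only: it assumes the caller has already obtained the ECX value returned by CPUID function 8000_0001h and the state of CR4.OSXSAVE by some platform-specific means that is not shown. The function and constant names are hypothetical.

```python
# Bit positions in ECX returned by CPUID function 8000_0001h,
# taken from the list above.
XOP_BIT = 11
FMA4_BIT = 16
CVT16_BIT = 18


def decode_extended_features(ecx: int) -> dict:
    """Report which of the three extensions the hardware advertises."""
    return {
        "XOP": bool((ecx >> XOP_BIT) & 1),
        "FMA4": bool((ecx >> FMA4_BIT) & 1),
        "CVT16": bool((ecx >> CVT16_BIT) & 1),
    }


def usable_features(ecx: int, cr4_osxsave: bool) -> dict:
    """An extension is usable only if the hardware reports it AND the
    operating system has set CR4.OSXSAVE; otherwise #UD is raised."""
    return {name: present and cr4_osxsave
            for name, present in decode_extended_features(ecx).items()}


# Hypothetical ECX value with only the XOP and FMA4 bits set.
example_ecx = (1 << XOP_BIT) | (1 << FMA4_BIT)
print(usable_features(example_ecx, cr4_osxsave=True))
```

With `cr4_osxsave=False` every entry is reported as unusable, which mirrors the #UD condition in the text.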

### 1.1 New Instruction Format

The XOP and CVT16 instructions utilize a new three-byte XOP prefix preceding the opcode byte. This prefix replaces the use of the 0F, 66, F2 and F3 prefix bytes and the REX prefix and encodes additional information as well. The FMA4 instructions utilize the new AVX VEX prefix which provides similar encoding capabilities.

Figure 1-1 shows the byte order of the instruction format.


Figure 1-1. Instruction Byte-Order

### 1.1.1 Legacy Prefix

The optional legacy prefixes include operand-size override, address-size override, segment override, Lock and REP prefixes. For additional information, see section 1.2, "Instruction Prefixes" in the AMD64 Architecture Programmer's Manual Volume 3: General Purpose and System Instructions, order\#24594.

### 1.1.2 Three-byte Prefix Format

The format of the three-byte form of the XOP, FMA4 and CVT16 instruction prefixes is shown in Figure 1-2. A description of the fields is provided in Table 1-2 below.

| Byte | Bit | Mnemonic | Description |
| :---: | :---: | :---: | :---: |
| 0 | 7-0 | 8Fh | XOP Prefix Byte for 3-byte XOP Prefix |
| 1 | 7 | R | Inverted one bit extension to ModRM.reg field |
|  | 6 | X | Inverted one bit extension of the SIB index field |
|  | 5 | B | Inverted one bit extension of the ModRM r/m field or the SIB base field |
|  | 4-0 | mmmmm | XOP opcode map select: 08h: instructions with an immediate byte; 09h: instructions with no immediate |
| 2 | 7 | W | Default operand size override for a general purpose register to 64-bit size in 64-bit mode; operand configuration specifier for certain XMM/YMM-based operations. |
|  | 6-3 | vvvv | Source or destination register specifier |
|  | 2 | L | Vector length for XMM/YMM-based operations. |
|  | 1-0 | pp | Specifies whether there's an implied 66, F2, or F3 opcode extension |

Figure 1-2. Three-byte XOP Format

## Prefix Byte 0

Byte 0 of the XOP prefix is set to 8Fh. This signifies an XOP prefix only in conjunction with the mmmmm field of the following byte being greater than or equal to 8; if the mmmmm field is less than 8 then these two bytes are a form of the POP instruction rather than an XOP prefix.

## Prefix Byte 1

Byte 1 of the XOP prefix has four fields.

R Bit (Prefix Byte 1, Bit 7). This bit provides a one bit extension of the ModRM.reg field in 64-bit mode, permitting access to all 16 YMM/XMM and GPR registers. In 32-bit protected and compatibility modes, this bit must be set to 1. This bit is the bit-inverted equivalent of the REX.R bit.

X Bit (Prefix Byte 1, Bit 6). This bit provides a one bit extension of the SIB.index field in 64-bit mode, permitting access to 16 YMM/XMM and GPR registers. In 32-bit protected and compatibility modes, this bit must be set to 1. This bit is the bit-inverted equivalent of the REX.X bit.

B Bit (Prefix Byte 1, Bit 5). This bit provides a one-bit extension of either the ModRM.r/m field to specify a GPR or XMM register or to the SIB base field to specify a GPR. This permits access to 16 registers. In 32-bit protected and compatibility modes, this bit is ignored. This bit is the bit-inverted equivalent of the REX.B bit and is available only in the 3-byte prefix format.

## Prefix Byte 2

Byte 2 of the three-byte prefix has four fields.

W Bit (Prefix Byte 2, Bit 7). The meaning of the W bit is opcode specific. This bit toggles source operand order or is ignored, depending upon the opcode.

vvvv (Prefix Byte 2, Bits 6-3). Encodes a source XMM or YMM register in inverted 1s complement form.

L (Prefix Byte 2, Bit 2). If L is 0, encodes a vector length of 128 bits or indicates scalar operands; if L is 1, the vector length is 256 bits. The register operands for a given instruction are either all 128-bit XMM registers or all 256-bit YMM registers.

pp (Prefix Byte 2, Bits 1-0). The pp field in the XOP prefix is reserved for future use.
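
The field layout of the three prefix bytes can be sketched as a small decoder. This is a hedged illustration rather than a complete instruction decoder: the names `XopPrefix` and `decode_xop_prefix` are invented for the example, and the R, X, B and vvvv fields are returned after undoing the bit inversion described above.

```python
from typing import NamedTuple, Optional


class XopPrefix(NamedTuple):
    r: int       # logical (de-inverted) extension of ModRM.reg
    x: int       # logical (de-inverted) extension of SIB.index
    b: int       # logical (de-inverted) extension of ModRM.r/m or SIB.base
    mmmmm: int   # opcode map select
    w: int       # opcode-specific W bit
    vvvv: int    # register number after undoing the 1s-complement encoding
    l: int       # 0 = 128-bit or scalar, 1 = 256-bit
    pp: int      # reserved in the XOP prefix


def decode_xop_prefix(byte0: int, byte1: int, byte2: int) -> Optional[XopPrefix]:
    """Decode a three-byte XOP prefix, or return None if the bytes do not
    form one (byte 0 is not 8Fh, or mmmmm < 8, which is a form of POP)."""
    if byte0 != 0x8F:
        return None
    mmmmm = byte1 & 0x1F
    if mmmmm < 8:
        return None
    return XopPrefix(
        r=((byte1 >> 7) & 1) ^ 1,
        x=((byte1 >> 6) & 1) ^ 1,
        b=((byte1 >> 5) & 1) ^ 1,
        mmmmm=mmmmm,
        w=(byte2 >> 7) & 1,
        vvvv=(~(byte2 >> 3)) & 0xF,
        l=(byte2 >> 2) & 1,
        pp=byte2 & 0x3,
    )


# Hand-assembled example bytes: map 08h, no register extensions,
# W = 0, vvvv selecting register 0, 128-bit vector length.
print(decode_xop_prefix(0x8F, 0xE8, 0x78))
```

Returning `None` for an mmmmm value below 8 reflects the rule that such a byte pair is a form of POP rather than an XOP prefix.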

### 1.2 Opcode Byte

The format of the opcode byte is shown in Figure 1-3. For most instructions, the operand element size (OES) is specified by the two least-significant opcode bits, as shown in Table 1-1.


Figure 1-3. Opcode Byte Format

Table 1-1. Operand Element Size-OES

| Opcode.OES | Integer Operation | Floating-Point <br> Operation |
| :---: | :---: | :---: |
| 00 | Byte | PS |
| 01 | Word | PD |
| 10 | Doubleword | SS |
| 11 | Quadword | SD |
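
Table 1-1 amounts to a lookup on the two least-significant opcode bits. The sketch below assumes an instruction that follows this convention (the text says "most", not all) and uses the hypothetical helper name `operand_element_size`.

```python
# Element sizes indexed by Opcode.OES (opcode bits 1-0), per Table 1-1.
INTEGER_OES = ("Byte", "Word", "Doubleword", "Quadword")
FLOAT_OES = ("PS", "PD", "SS", "SD")


def operand_element_size(opcode: int, floating_point: bool = False) -> str:
    """Look up the operand element size from the low two opcode bits."""
    oes = opcode & 0b11
    table = FLOAT_OES if floating_point else INTEGER_OES
    return table[oes]


# Low bits 10b select doubleword for an integer operation
# and SS for a floating-point operation.
print(operand_element_size(0b10), operand_element_size(0b10, floating_point=True))
```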

### 1.3 Destination XMM registers

The destination of XOP, FMA4 and CVT16 instructions may be a 128-bit XMM register or a 256-bit YMM register. When a 128-bit result is written to a destination XMM register, the upper 128 bits of the corresponding YMM register are cleared.

### 1.4 Four-Operand Instructions

Some new instructions require three input operands and one destination register. This is accomplished by using the Prefix.vvvv field and Imm8[7:4] along with the ModRM.reg and ModRM.r/m fields.

VPCMOV is an example of a four-operand instruction:

VPCMOV dest, src1, src2, src3 ; dest = (src1 & src3) | (src2 & ~src3)

The first operand is the destination operand and is an XMM or YMM register addressed by the ModRM.reg field.

The second, third and fourth operands are sources. The first source operand is an XMM or YMM register specified by the vvvv field. The second and third source operands are specified by the ModRM.r/m and Imm8[7:4] fields, respectively, when the W bit (XOP.W or VEX.W) is set to 0. The FMA4, VPCMOV and VPPERM instructions provide the option of swapping the second and third source operands by setting W to 1, as shown in Table 1-2. This allows either the second data operand or the control operand to be memory based.

Table 1-2. Operand Configurations for FMA4, VPCMOV and VPPERM Instructions

| XOP.W | dest | src1 | src2 | src3 |
| :---: | :---: | :---: | :---: | :---: |
| 0 | ModRM.reg | VEX/XOP.vvvv | ModRM.r/m | imm8[7:4] |
| 1 | ModRM.reg | VEX/XOP.vvvv | imm8[7:4] | ModRM.r/m |
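
The operand swap in Table 1-2 can be sketched as a lookup keyed on the W bit. The helper name `four_operand_fields` and its string labels are hypothetical; the only arithmetic shown is extracting the register number from imm8[7:4].

```python
def four_operand_fields(w: int, imm8: int) -> dict:
    """Map each operand to the encoding field that names it, per Table 1-2."""
    is4_register = (imm8 >> 4) & 0xF  # register number carried in imm8[7:4]
    from_imm8 = f"imm8[7:4] = register {is4_register}"
    if w == 0:
        src2, src3 = "ModRM.r/m", from_imm8
    else:
        src2, src3 = from_imm8, "ModRM.r/m"
    return {"dest": "ModRM.reg", "src1": "vvvv", "src2": src2, "src3": src3}


print(four_operand_fields(0, 0x30))
print(four_operand_fields(1, 0x30))
```

Only the ModRM.r/m field can name a memory operand, so the W bit effectively chooses which of the last two sources may come from memory.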

The XOP four-operand instructions have opcodes in the XOP 08h opcode page and the FMA4 instructions have opcodes in the VEX C4h opcode page, as shown in Table 1-3 and Table 1-4, respectively.

Table 1-3. Four Operand XOP Instruction Opcode Map

| Operation | Opcode | XOP.mmmmm | Opcode[1:0] <br> OES | Operand Size |
| :---: | :---: | :---: | :---: | :---: |
| VPCMOV | A2h | 01000b | 10b | 128/256 |
| VPPERM | A3h | 01000b | 11b | 128 |
| VPMACSSWW | 85h | 01000b | 01b | 128 |
| VPMACSWW | 95h | 01000b | 01b | 128 |
| VPMACSSWD | 86h | 01000b | 10b | 128 |
| VPMACSWD | 96h | 01000b | 10b | 128 |
| VPMACSSDD | 8Eh | 01000b | 10b | 128 |
| VPMACSDD | 9Eh | 01000b | 10b | 128 |
| VPMACSSDQL | 87h | 01000b | 11b | 128 |
| VPMACSDQL | 97h | 01000b | 11b | 128 |
| VPMACSSDQH | 8Fh | 01000b | 11b | 128 |
| VPMACSDQH | 9Fh | 01000b | 11b | 128 |
| VPMADCSSWD | A6h | 01000b | 10b | 128 |
| VPMADCSWD | B6h | 01000b | 10b | 128 |

Table 1-4. Four Operand FMA4 Instruction Opcode Map

| Operation | Opcode | VEX.mmmmm | Opcode[1:0] <br> OES | Operand Size |
| :---: | :---: | :---: | :---: | :---: |
| VFMADDPD | 69h | 00011b | 01b | 128/256 |
| VFMADDPS | 68h | 00011b | 00b | 128/256 |
| VFMADDSD | 6Bh | 00011b | 11b | 128 |
| VFMADDSS | 6Ah | 00011b | 10b | 128 |
| VFMADDSUBPD | 5Dh | 00011b | 01b | 128/256 |
| VFMADDSUBPS | 5Ch | 00011b | 00b | 128/256 |
| VFMSUBADDPD | 5Fh | 00011b | 01b | 128/256 |
| VFMSUBADDPS | 5Eh | 00011b | 00b | 128/256 |
| VFMSUBPD | 6Dh | 00011b | 01b | 128/256 |
| VFMSUBPS | 6Ch | 00011b | 00b | 128/256 |
| VFMSUBSD | 6Fh | 00011b | 11b | 128 |
| VFMSUBSS | 6Eh | 00011b | 10b | 128 |
| VFNMADDPD | 79h | 00011b | 01b | 128/256 |
| VFNMADDPS | 78h | 00011b | 00b | 128/256 |
| VFNMADDSD | 7Bh | 00011b | 11b | 128 |
| VFNMADDSS | 7Ah | 00011b | 10b | 128 |
| VFNMSUBPD | 7Dh | 00011b | 01b | 128/256 |
| VFNMSUBPS | 7Ch | 00011b | 00b | 128/256 |
| VFNMSUBSD | 7Fh | 00011b | 11b | 128 |
| VFNMSUBSS | 7Eh | 00011b | 10b | 128 |

### 1.5 Three-Operand Instructions

Some instructions have two source operands and a destination operand.

VPROTB is an example of a three-operand instruction:

VPROTB dest, src, count ; dest = src <</>> count

The first operand is the destination operand, and is an XMM or YMM register addressed by the ModRM.reg field. The second and third operands are source operands. One source operand is an XMM or YMM register addressed by the XOP.vvvv field, the other source operand is an XMM or YMM register or memory operand addressed by the ModRM.r/m field.

For certain instructions, in the three-operand format the XOP.W bit determines which source operand is specified by which operand field, as shown in Table 1-5.

Table 1-5. Operand Configurations for Three Operand Instructions

| XOP.W | dest | src | count |
| :---: | :---: | :---: | :---: |
| 0 | ModRM.reg | ModRM.r/m | XOP.vvvv |
| 1 | ModRM.reg | XOP.vvvv | ModRM.r/m |

The three-operand instructions have opcodes in the XOP.mmmmm 08h or 09h page. See Table 1-6.

Table 1-6. Three Operand Instruction Opcode Map

| Operation | Opcode | XOP.mmmmm | Opcode[1:0] <br> OES | Operand Size |
| :--- | :---: | :---: | :---: | :---: |
| VPCOM $^{\text{a}}$ | CC-CFh | 01000b | OES | 128 |
| VPCOMU $^{\text{a}}$ | EC-EFh | 01000b | OES | 128 |
| VPROT $^{\text{a}}$ | 90-93h | 01001b | OES | 128 |
| VPSHL $^{\text{a}}$ | 94-97h | 01001b | OES | 128 |
| VPSHA $^{\text{a}}$ | 98-9Bh | 01001b | OES | 128 |

a. Indicates four instruction variants (_B, _W, _D and _Q) specified by the operand element size field.

### 1.6 Two Operand Instructions

Two-operand instructions use the normal ModRM-based operand assignment. For most instructions, the first operand is the destination, addressed by the ModRM.reg field, and the second operand is either an XMM or YMM register or a memory operand, as determined by the ModRM.mod field. For the VCVTPS2PH instruction, the destination operand (which may be memory-based) is specified by the ModRM.r/m field and the source register is specified by the ModRM.reg field. For all of these instructions, the XOP.vvvv field is not applicable and must be set to 1111b.

VCVTPH2PS is an example of a two-operand instruction:

VCVTPH2PS xmm1, xmm2/mem64

All new two-operand instructions are assigned to the XOP.mmmmm 09h page except for VPROTx, VCVTPS2PH and VCVTPH2PS, which are assigned to the XOP.mmmmm 08h page. See Table 1-7, below.

Table 1-7. Two Operand Instruction Opcode Map

| Operation | Opcode | XOP.mmmmm | Opcode[1:0] <br> OES | Operand Size |
| :--- | :--- | :---: | :---: | :---: |
| VFRCZ $^{\text{b}}$ | 80-83h | 01001b | OES |  |
| VCVTPH2PS | A0h | 01000b | 00b | 128/256 |
| VCVTPS2PH | A1h | 01000b | 01b | 128/256 |
| VPHADDBW | C1h | 01001b | 01b | 128 |
| VPHADDBD | C2h | 01001b | 10b | 128 |
| VPHADDBQ | C3h | 01001b | 11b | 128 |
| VPHADDWD | C6h | 01001b | 10b | 128 |
| VPHADDWQ | C7h | 01001b | 11b | 128 |
| VPHADDDQ | CBh | 01001b | 11b | 128 |
| VPHADDUBW | D1h | 01001b | 01b | 128 |
| VPHADDUBD | D2h | 01001b | 10b | 128 |
| VPHADDUBQ | D3h | 01001b | 11b | 128 |
| VPHADDUWD | D6h | 01001b | 10b | 128 |
| VPHADDUWQ | D7h | 01001b | 11b | 128 |
| VPHADDUDQ | DBh | 01001b | 11b | 128 |
| VPHSUBBW | E1h | 01001b | 01b | 128 |
| VPHSUBWD | E2h | 01001b | 10b | 128 |
| VPHSUBDQ | E3h | 01001b | 11b | 128 |
| VPROT $^{\text{a}}$ | C0-C3h | 01000b | OES | 128 |

a. Indicates four instruction variants (_B, _W, _D and _Q) specified by the OES field.
b. Indicates four instruction variants (_PS, _PD, _SS and _SD) specified by the OES field.

### 1.7 16-Bit Floating-Point Data Type

CVT16 instruction extensions introduce a new 16-bit floating-point data type and two instructions (VCVTPS2PH and VCVTPH2PS) to convert 16-bit floating-point values to and from single-precision format.

The 16-bit floating-point data type, shown in Figure 1-4 on page 31, includes a 1-bit sign, a 5-bit exponent with a bias of 15 and a 10-bit significand. The integer bit is implied, making a total of 11 bits in the significand. The value of the integer bit can be inferred from the number encoding. Table 1-8 on page 31 shows the floating-point encodings of supported numbers and non-numbers.

| 15 | 14:10 | 9:0 |
| :---: | :---: | :---: |
| S | Biased Exponent | Significand |

Figure 1-4. 16-Bit Floating-Point Data Type

Table 1-8. Supported 16-Bit Floating-Point Encodings

| Sign | Biased Exponent | Significand $^{\text{a}}$ | Classification |  |
| :---: | :---: | :---: | :---: | :---: |
| 0 | 11111 | 1.0000000000 | Positive Floating-Point Numbers | Positive Infinity |
| 0 | 11110 to 00001 | 1.1111111111 to 1.0000000000 |  | Positive Normal |
| 0 | 00000 | 0.1111111111 to 0.0000000001 |  | Positive Denormal |
| 0 | 00000 | 0.0000000000 |  | Positive Zero |
| 1 | 00000 | 0.0000000000 | Negative Floating-Point Numbers | Negative Zero |
| 1 | 00000 | 0.0000000001 to 0.1111111111 |  | Negative Denormal |
| 1 | 00001 to 11110 | 1.0000000000 to 1.1111111111 |  | Negative Normal |
| 1 | 11111 | 1.0000000000 |  | Negative Infinity |
| X | 11111 | 1.0000000001 to 1.0111111111 | Non-Number | SNaN |
| X | 11111 | 1.1000000001 to 1.1111111111 |  | QNaN |

a. The "1." and "0." prefixes represent the implicit integer bit.
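
The encodings in Table 1-8 can be sketched as a software decoder. This is a minimal illustration, not the hardware conversion: the function name `decode_half` is hypothetical, and the denormal scaling (an effective exponent of 1 minus the bias) is assumed to follow the usual IEEE convention, which the table does not state explicitly.

```python
def decode_half(bits: int) -> tuple:
    """Classify a 16-bit floating-point encoding and return (class, value)."""
    sign = (bits >> 15) & 1
    exponent = (bits >> 10) & 0x1F   # 5-bit biased exponent, bias 15
    fraction = bits & 0x3FF          # 10 explicit significand bits
    s = -1.0 if sign else 1.0
    if exponent == 0x1F:
        if fraction == 0:
            return ("infinity", s * float("inf"))
        # The most-significant fraction bit separates QNaN from SNaN.
        return ("QNaN" if fraction & 0x200 else "SNaN", float("nan"))
    if exponent == 0:
        if fraction == 0:
            return ("zero", s * 0.0)
        # Integer bit is 0; assumed scale is 2**(1 - 15) applied to 0.fraction.
        return ("denormal", s * fraction * 2.0 ** -24)
    # Integer bit is 1 for normal numbers.
    return ("normal", s * (1 + fraction / 1024) * 2.0 ** (exponent - 15))


for encoding in (0x3C00, 0xC000, 0x0001, 0x7C00, 0x7D00, 0x7E01):
    print(hex(encoding), decode_half(encoding))
```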

### 1.8 XOP Integer Multiply (Add) and Accumulate Instructions

The multiply and accumulate instructions and the multiply, add and accumulate instructions operate on and produce packed signed integer values. These instructions allow the accumulation of results from (possibly) many iterations of similar operations without a separate intermediate addition operation to update the accumulator register.

### 1.8.1 Saturation

Some instructions limit the result of an operation to the maximum or minimum value representable by the data type of the destination, an operation known as saturation. Many of the integer multiply and accumulate instructions saturate the cumulative results of the multiplication and addition (accumulation) operations before writing the final results to the destination (accumulator) register.

Note, however, that not all multiply and accumulate instructions saturate results. (For further discussion of saturation, see the AMD64 Architecture Programmer's Manual Volume 1: Application Programming, order\# 24592.)

### 1.8.2 Multiply and Accumulate Instructions

The operation of a typical XOP integer multiply and accumulate instruction is shown in Figure 1-5 on page 33.

The multiply and accumulate instructions operate on and produce packed signed integer values. These instructions first multiply the value in the first source operand by the corresponding value in the second source operand. Each signed integer product is then added to the corresponding value in the third source operand, which is the accumulator and is identical to the destination operand. The results may or may not be saturated prior to being written to the destination register, depending on the instruction.


Figure 1-5. Operation of Multiply and Accumulate Instructions

The XOP instruction extensions provide the following integer multiply and accumulate instructions.

- VPMACSSWW—Packed Multiply Accumulate Signed Word to Signed Word with Saturation
- VPMACSWW—Packed Multiply Accumulate Signed Word to Signed Word
- VPMACSSWD—Packed Multiply Accumulate Signed Word to Signed Doubleword with Saturation
- VPMACSWD—Packed Multiply Accumulate Signed Word to Signed Doubleword
- VPMACSSDD—Packed Multiply Accumulate Signed Doubleword to Signed Doubleword with Saturation
- VPMACSDD—Packed Multiply Accumulate Signed Doubleword to Signed Doubleword
- VPMACSSDQL—Packed Multiply Accumulate Signed Low Doubleword to Signed Quadword with Saturation
- VPMACSSDQH—Packed Multiply Accumulate Signed High Doubleword to Signed Quadword with Saturation
- VPMACSDQL—Packed Multiply Accumulate Signed Low Doubleword to Signed Quadword
- VPMACSDQH—Packed Multiply Accumulate Signed High Doubleword to Signed Quadword
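
The difference between the saturating and non-saturating forms can be sketched for the word-to-word case. This is an illustrative model under stated assumptions: the helper names are hypothetical, vectors are plain Python lists of signed values, and the non-saturating form is assumed to keep only the low 16 bits of each sum.

```python
def to_signed(value: int, bits: int) -> int:
    """Wrap an integer to a signed two's complement value of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def saturate(value: int, bits: int) -> int:
    """Clamp an integer to the signed range of the given width."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def mac_words(src1, src2, acc, saturating: bool):
    """Per element: acc + src1 * src2, narrowed to a signed 16-bit result."""
    narrow = saturate if saturating else to_signed
    return [narrow(a * b + c, 16) for a, b, c in zip(src1, src2, acc)]


src1, src2, acc = [300, -2], [200, 3], [1000, 10]
print(mac_words(src1, src2, acc, saturating=True))
print(mac_words(src1, src2, acc, saturating=False))
```

In the first element the exact sum (61000) exceeds the signed 16-bit range, so the saturating form clamps it to 32767 while the wrapping form yields a negative value.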


### 1.8.3 Integer Multiply, Add and Accumulate Instructions

The operation of the multiply, add and accumulate instructions is illustrated in Figure 1-6.
The multiply, add and accumulate instructions first multiply each packed signed integer value in the first source operand by the corresponding packed signed integer value in the second source operand. The odd and even adjacent resulting products are then added. Each resulting sum is then added to the corresponding packed signed integer value in the third source operand.


Figure 1-6. Operation of Multiply, Add and Accumulate Instructions

The XOP instruction set provides the following integer multiply, add and accumulate instructions.

- VPMADCSSWD—Packed Multiply Add and Accumulate Signed Word to Signed Doubleword with Saturation
- VPMADCSWD—Packed Multiply Add and Accumulate Signed Word to Signed Doubleword
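
The multiply, add and accumulate flow can be sketched for the word-to-doubleword case. The function names are hypothetical, vectors are modelled as Python lists, and the non-saturating form is assumed to wrap to 32 bits.

```python
def narrow32(value: int, saturating: bool) -> int:
    """Reduce an exact sum to a signed 32-bit result by clamping or wrapping."""
    if saturating:
        return max(-(1 << 31), min((1 << 31) - 1, value))
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >> 31 else value


def madc_words(src1, src2, acc, saturating: bool):
    """Multiply word pairs, add adjacent products, then add the accumulator."""
    out = []
    for i, c in enumerate(acc):
        even = src1[2 * i] * src2[2 * i]          # product of even-numbered words
        odd = src1[2 * i + 1] * src2[2 * i + 1]   # product of odd-numbered words
        out.append(narrow32(even + odd + c, saturating))
    return out


src1 = [32767, 32767, 1, 2, 0, 0, 0, 0]
src2 = [32767, 32767, 3, 4, 0, 0, 0, 0]
acc = [2**31 - 1, 5, 0, -1]
print(madc_words(src1, src2, acc, saturating=True))
print(madc_words(src1, src2, acc, saturating=False))
```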


### 1.9 Packed Integer Horizontal Add and Subtract

The packed horizontal add and subtract instructions successively add (or subtract) adjacent pairs of integer values from the source XMM register or 128-bit memory operand and pack the (sign-extended) integer result of each operation in the destination.

- VPHADDBW—Packed Horizontal Add Signed Byte to Signed Word
- VPHADDBD—Packed Horizontal Add Signed Byte to Signed Doubleword
- VPHADDBQ—Packed Horizontal Add Signed Byte to Signed Quadword
- VPHADDDQ—Packed Horizontal Add Signed Doubleword to Signed Quadword
- VPHADDUBW—Packed Horizontal Add Unsigned Byte to Word
- VPHADDUBD—Packed Horizontal Add Unsigned Byte to Doubleword
- VPHADDUBQ—Packed Horizontal Add Unsigned Byte to Quadword
- VPHADDUWD—Packed Horizontal Add Unsigned Word to Doubleword
- VPHADDUWQ—Packed Horizontal Add Unsigned Word to Quadword
- VPHADDUDQ—Packed Horizontal Add Unsigned Doubleword to Quadword
- VPHADDWD—Packed Horizontal Add Signed Word to Signed Doubleword
- VPHADDWQ—Packed Horizontal Add Signed Word to Signed Quadword
- VPHSUBBW—Packed Horizontal Subtract Signed Byte to Signed Word
- VPHSUBWD—Packed Horizontal Subtract Signed Word to Signed Doubleword
- VPHSUBDQ—Packed Horizontal Subtract Signed Doubleword to Signed Quadword
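
A horizontal add can be sketched as summing runs of adjacent elements into wider results. The function name `horizontal_add` is hypothetical, and treating the byte-to-doubleword form as the sum of four adjacent bytes is an assumption based on the "successively add adjacent pairs" description. Because each result is wider than its inputs, no overflow handling is needed in this model.

```python
def horizontal_add(values, group: int):
    """Sum each run of `group` adjacent elements into one wider element."""
    return [sum(values[i:i + group]) for i in range(0, len(values), group)]


signed_bytes = [127, 127, -128, -128, 1, 2, 3, 4, 0, 0, 0, 0, -1, -1, -1, -1]
print(horizontal_add(signed_bytes, 2))  # byte to word: 8 results
print(horizontal_add(signed_bytes, 4))  # byte to doubleword: 4 results
```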


### 1.10 Vector Conditional Moves

XOP instructions include vector conditional move instructions:

- VPCMOV—Vector Conditional Moves
- VPPERM—Packed Permute Bytes

The VPCMOV instruction implements the C/C++ language ternary '?:' operator at the bit level. This instruction operates on individual bits and requires a bitwise predicate in one XMM or YMM register and the two source operands in two more XMM or YMM registers.

The VPPERM instruction performs vector permutation on a packed array of 32 bytes composed of two 16 -byte input operands. The VPPERM instruction replaces each destination byte with $00 \mathrm{~h}, \mathrm{FFh}$, or one of the 32 bytes of the packed array. A byte selected from the array may have an additional operation such as NOT or bit reversal applied to it, before it is written to the destination. The action for each destination byte is determined by a corresponding control byte. The VPPERM instruction allows either the second 16-byte input array or the control array to be memory based, per the XOP.W bit.
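
The bitwise select performed by VPCMOV, dest = (src1 & src3) | (src2 & ~src3), can be sketched with Python integers standing in for registers. The function name `vpcmov` and the `width` parameter are hypothetical; the example uses an 8-bit width only to keep the values readable.

```python
def vpcmov(src1: int, src2: int, selector: int, width: int = 128) -> int:
    """For each bit: take src1 where selector is 1, src2 where it is 0."""
    mask = (1 << width) - 1
    return ((src1 & selector) | (src2 & ~selector)) & mask


result = vpcmov(0b11110000, 0b00001111, 0b10101010, width=8)
print(f"{result:08b}")
```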

### 1.11 Packed Integer Rotates and Shifts

These instructions rotate or shift the elements of the vector in the first source XMM register or 128-bit memory operand by the amount specified by a control byte. The rotates and shifts differ in the way they handle the control byte.

### 1.11.1 Packed Integer Shifts

The packed integer shift instructions shift each element of the vector in the first source XMM register or 128-bit memory operand by the amount specified by a control byte contained in the least-significant byte of the corresponding element of the second source operand. The result of each shift operation is returned in the destination XMM register. This allows load-and-shift from memory operations, with either the source operand or the shift-count operand being memory-based, as indicated by the XOP.W bit. The XOP instruction set provides the following packed integer shift instructions:

- VPSHLB—Packed Shift Logical Bytes
- VPSHLW—Packed Shift Logical Words
- VPSHLD—Packed Shift Logical Doublewords
- VPSHLQ—Packed Shift Logical Quadwords
- VPSHAB—Packed Shift Arithmetic Bytes
- VPSHAW—Packed Shift Arithmetic Words
- VPSHAD—Packed Shift Arithmetic Doublewords
- VPSHAQ—Packed Shift Arithmetic Quadwords


### 1.11.2 Packed Integer Rotate

There are two variants of the packed integer rotate instructions. The first is identical to that described above (see "Packed Integer Shifts"). In the second variant, the control byte is supplied as an 8-bit immediate operand that specifies a single rotate amount for every element in the first source operand. The XOP instruction set provides the following packed integer rotate instructions:

- VPROTB—Packed Rotate Bytes
- VPROTW—Packed Rotate Words
- VPROTD—Packed Rotate Doublewords
- VPROTQ—Packed Rotate Quadwords
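
The immediate-count variant can be sketched as one rotate amount applied to every element. This is a simplified model: the function names are hypothetical, and it assumes a non-negative count that rotates left, which the text above does not specify. Handling of negative counts is not modelled.

```python
def rotate_left(value: int, count: int, bits: int) -> int:
    """Rotate an unsigned `bits`-wide value left by `count` positions."""
    count %= bits
    mask = (1 << bits) - 1
    return ((value << count) | (value >> (bits - count))) & mask


def rotate_elements(elements, count: int, bits: int):
    """Apply the same rotate amount to every element of the vector."""
    return [rotate_left(e, count, bits) for e in elements]


print([hex(v) for v in rotate_elements([0x81, 0x01, 0xF0], 1, 8)])
```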


### 1.12 Packed Integer Comparison and Predicate Generation

The XOP comparison instructions compare packed integer values in the first source XMM register with corresponding packed integer values in the second source XMM register or 128-bit memory. The type of comparison is specified by the immediate-byte operand. The resulting predicate is placed in the destination XMM register. If the condition is true, all bits in the corresponding field in the destination register are set to 1s; otherwise all bits in the field are set to 0s.

Table 1-9. Immediate Operand Values for Unsigned Vector Comparison Operations

| Immediate Byte Bits 7:3 | Immediate Byte Bits 2:0 | Comparison Operation |
| :---: | :---: | :---: |
| 00000b | 000b | Less Than |
|  | 001b | Less Than or Equal |
|  | 010b | Greater Than |
|  | 011b | Greater Than or Equal |
|  | 100b | Equal |
|  | 101b | Not Equal |
|  | 110b | False |
|  | 111b | True |

The integer comparison and predicate generation instructions compare corresponding packed signed or unsigned elements in the first and second source operands and write the result of each comparison in the corresponding element of the destination. The result of each comparison is a value of all 1s (TRUE) or all 0s (FALSE). The type of comparison is specified by the three low-order bits of the immediate-byte operand. The XOP instruction set provides the following integer comparison instructions.

- VPCOMUB—Compare Vector Unsigned Bytes
- VPCOMUW—Compare Vector Unsigned Words
- VPCOMUD—Compare Vector Unsigned Doublewords
- VPCOMUQ—Compare Vector Unsigned Quadwords
- VPCOMB—Compare Vector Signed Bytes
- VPCOMW—Compare Vector Signed Words
- VPCOMD—Compare Vector Signed Doublewords
- VPCOMQ—Compare Vector Signed Quadwords
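
Predicate generation per Table 1-9 can be sketched as a lookup on the three low-order immediate bits. The names `COMPARISONS` and `compare_elements` are hypothetical, and elements are modelled as Python integers that already carry the intended signed or unsigned interpretation.

```python
import operator

# Immediate byte bits 2:0 -> comparison, following Table 1-9.
COMPARISONS = {
    0b000: operator.lt,
    0b001: operator.le,
    0b010: operator.gt,
    0b011: operator.ge,
    0b100: operator.eq,
    0b101: operator.ne,
    0b110: lambda a, b: False,
    0b111: lambda a, b: True,
}


def compare_elements(src1, src2, imm8: int, bits: int):
    """Return all 1s for each element pair that satisfies the comparison, else 0."""
    compare = COMPARISONS[imm8 & 0b111]
    all_ones = (1 << bits) - 1
    return [all_ones if compare(a, b) else 0 for a, b in zip(src1, src2)]


print(compare_elements([1, 5, 7], [3, 5, 2], 0b000, 8))  # Less Than
print(compare_elements([1, 5, 7], [3, 5, 2], 0b011, 8))  # Greater Than or Equal
```

The resulting masks are in the form a bitwise select such as VPCMOV expects as its predicate operand.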


### 1.13 Fraction Extract

The fraction extract instructions isolate the fractional portions of vector or scalar floating-point operands. The result of the _PD and _PS instructions is a vector of floating-point fractions. The result of the _SD and _SS instructions is always a single scalar fraction. XOP provides the following fraction extract instructions:

- VFRCZPD—Extract Fraction Packed Double-Precision Floating-Point
- VFRCZPS—Extract Fraction Packed Single-Precision Floating-Point
- VFRCZSD—Extract Fraction Scalar Double-Precision Floating-Point
- VFRCZSS—Extract Fraction Scalar Single-Precision Floating-Point

The VFRCZPD and VFRCZPS instructions extract the fractional portions of a vector of double-/single-precision floating-point values in an XMM or YMM register or a 128- or 256-bit memory location and write the results in the corresponding field in the destination register.

The VFRCZSS and VFRCZSD instructions extract the fractional portion of the single-/double-precision scalar floating-point value in an XMM register or 32- or 64-bit memory location and write the result in the lower element of the destination register. The upper elements of the destination XMM register are unaffected by the operation, while the upper 128 bits of the corresponding YMM register are cleared to zeros.
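
Fraction extraction can be sketched with the standard library. This is an illustrative model only: the function names are hypothetical, and it assumes the fraction keeps the sign of the source value (truncation toward zero), which the description above does not state. Special values such as NaN and infinity are not modelled.

```python
import math


def extract_fraction(x: float) -> float:
    """Return the fractional part of x, with the same sign as x."""
    fractional, _integral = math.modf(x)
    return fractional


def extract_fraction_packed(values):
    """Apply fraction extraction to each element of a vector."""
    return [extract_fraction(v) for v in values]


print(extract_fraction_packed([2.75, -1.5, 4.0, 0.125]))
```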

### 1.14 Convert

The two CVT16 instructions are provided to move data from/to memory and convert a single-precision floating point operand to a half-precision floating-point operand or vice versa in one instruction. (See Section 1.7, "16-Bit Floating-Point Data Type," on page 30.) These instructions allow floating point data to be maintained in memory in half-precision format, conserving memory space.

- VCVTPH2PS—Convert Half-Precision Floating-Point to Single-Precision Floating-Point
- VCVTPS2PH—Convert Single-Precision Floating-Point to Half-Precision Floating-Point


## 2 AMD XOP, FMA4 and CVT16 Instructions

The following section describes the complete set of XOP 128-media instructions. Instructions are listed alphabetically by mnemonic.

### 2.1 Notation

The notation used to denote the size and type of source and destination operands in both mnemonics and opcodes is discussed in detail in Section 2.5, "Notation," on page 37 in the AMD64 Architecture Programmer's Manual Volume 3: General Purpose and System Instructions. Mnemonic conventions that are idiosyncratic to the XOP instruction set have been included in Chapter 1, "New 128-Bit Instructions", in this document.

### 2.1.1 Opcode Syntax

Opcode specification for the XOP, FMA4, and CVT16 instruction sets, with their two, three and four operand syntax, requires a slightly different approach from that used to specify the opcodes for previous generation 64- and 128-bit instructions (documented in the AMD64 Architecture Programmer's Manual Volume 4: 128-Bit Media Instructions, order\# 26568, and AMD64 Architecture Programmer's Manual Volume 5: 64-Bit Media and x87 Floating-Point Instructions, order\# 26569). In the following pages, opcodes are specified using the order of fields and bits as they occur in a complete opcode specification as outlined in Section 1.1, "New Instruction Format," on page 23. The following opcode specification is typical:


Most of the terms and symbols used in the following pages are defined in Section 1.1, "New Instruction Format," on page 23. The following notation and conventions are used in this volume, in addition to the opcode notational conventions specified in Section 2.5.2, "Opcode Syntax," on page 39 in the AMD64 Architecture Programmer's Manual Volume 3: General Purpose and System Instructions:

- cntr: Control bits (for comparison instructions); immediate byte bits 3:0.
- is4: Destination register specifier; immediate byte bits 7:4.
- $\overline{\mathrm{RXB}}$: Bit field specifying the R, X and B bit values. Specified in ones' complement form.
- VEX.W: The meaning of the W bit is opcode specific. This bit toggles source operand order or is ignored, depending upon the opcode.
- VEX.L: Vector length specifier.
- VEX.vvvv: Additional operand register specifier.
- XOP: Indicates the XOP prefix byte (8Fh).

### 2.2 Operand Specification

The packed values in an operand are numbered starting with 0, which is considered to be even-numbered.

### 2.3 Instruction Reference

## VCVTPH2PS Convert Packed 16-Bit Floating-Point to Single-Precision Floating-Point

Converts packed 16-bit floating-point values to single-precision floating-point values. Rounding is performed as specified by settings in an 8-bit immediate operand.

The 128-bit version converts four 16-bit floating-point values in the low-order 64 bits of an XMM register or 64-bit memory location to four packed single-precision floating-point values and writes the converted values to the destination XMM register. When the result operand is written to the destination register, the upper 128 bits of the corresponding YMM register are zeroed.

The 256-bit version converts eight packed 16-bit floating-point values in the low-order 128 bits of a YMM register or 128-bit memory location to eight packed single-precision floating-point values and writes the converted values to a destination YMM register.

The handling of denormals and rounding is controlled by fields in the immediate byte, as shown in Table 2-1.

Table 2-1. Denormal and Rounding Control with Immediate Byte Operand

| Mnemonic | DAZ |  |  | FTZ |  |  | \#PE detected |  | RC |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Bit | 7 | 6 | Method | 5 | 4 | Method | 3 | Method | 2 | 1 | 0 | Method |
| Value | 0 | 0 | Denormal | 0 | 0 | Denormal | 0 | \#PE if inexact | 0 | 0 | 0 | Nearest |
|  | 0 | 1 | DAZ | 0 | 1 | FTZ | 1 | No \#PE | 0 | 0 | 1 | Down |
|  | 1 | x | MXCSR.DAZ | 1 | x | MXCSR.FTZ |  |  | 0 | 1 | 0 | Up |
|  |  |  |  |  |  |  |  |  | 0 | 1 | 1 | Truncate |
|  |  |  |  |  |  |  |  |  | 1 | x | x | MXCSR.RC |
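
The immediate-byte fields can be sketched as a decoder. The function name `decode_control` and the dictionary keys are hypothetical. The sketch reads the RC rows as 000 (Nearest), 001 (Down), 010 (Up), 011 (Truncate) and 1xx (MXCSR.RC), which is an interpretation of the table rather than something confirmed elsewhere in this section.

```python
ROUNDING = ["Nearest", "Down", "Up", "Truncate"]


def decode_control(imm8: int) -> dict:
    """Decode the DAZ, FTZ, #PE and RC fields of the immediate byte."""
    daz = (imm8 >> 6) & 0b11   # bits 7:6
    ftz = (imm8 >> 4) & 0b11   # bits 5:4
    rc = imm8 & 0b111          # bits 2:0
    return {
        "daz": "MXCSR.DAZ" if daz & 0b10 else ("DAZ" if daz else "Denormal"),
        "ftz": "MXCSR.FTZ" if ftz & 0b10 else ("FTZ" if ftz else "Denormal"),
        "pe": "No #PE" if imm8 & 0b1000 else "#PE if inexact",
        "rc": "MXCSR.RC" if rc & 0b100 else ROUNDING[rc],
    }


print(decode_control(0x4B))
print(decode_control(0xA4))
```

Setting the high bit of a field defers that control to the corresponding MXCSR setting, so a single immediate can mix per-instruction and global behavior.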

The format of a 16-bit floating-point value is described in Section 1.7, "16-Bit Floating-Point Data Type," on page 30.

The VCVTPH2PS instruction is a CVT16 instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :--- | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VCVTPH2PS xmm1, xmm2/mem64, imm8 | 8F | $\overline{\mathrm{RXB}}$.08 | 0.1111.0.00 | A0 /r /imm8 |
| VCVTPH2PS ymm1, xmm2/mem128, imm8 | 8F | $\overline{\mathrm{RXB}}$.08 | 0.1111.1.00 | A0 /r /imm8 |



## Related Instructions

VCVTPS2PH

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :--- | :---: | :---: | :---: | :--- |
|  | X | X |  | CVT16 instructions are only recognized in protected mode. |
|  |  |  | X | The CVT16 instructions are not supported, as indicated by ECX bit 18 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |

## VCVTPS2PH Convert Packed Single-Precision Floating-Point to 16-Bit Floating-Point

Converts packed single-precision floating-point values to packed 16-bit floating-point values and writes the converted values to the destination register or to memory. Rounding is performed as specified by settings in an 8-bit immediate operand.

The 128-bit version converts four packed single-precision floating-point values in an XMM register to four packed 16-bit floating-point values and writes the converted values to the low-order 64 bits of the destination XMM register or to a 64-bit memory location. When the result is written to the destination XMM register, the high-order 64 bits in the destination XMM register and the upper 128 bits of the corresponding YMM register are cleared to 0s.

The 256-bit version converts eight packed single-precision floating-point values in a YMM register to eight packed 16-bit floating-point values and writes the converted values to the low-order 128 bits of another YMM register or to a 128-bit memory location. When the result is written to the destination YMM register, the high-order 128 bits in the register are cleared to 0s.

Table 1-8 on page 31 shows the floating-point encodings of supported numbers and non-numbers. The format of a 16-bit floating-point value is described in Section 1.7, "16-Bit Floating-Point Data Type," on page 30.

The handling of denormals and rounding is controlled by fields in the immediate byte, as shown in Table 2-2.

Table 2-2. Denormal and Rounding Control with Immediate Byte Operand

| Mnemonic | DAZ |  |  | FTZ |  |  | \#PE detected |  | RC |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Bit | 7 | 6 | Method | 5 | 4 | Method | 3 | Method | 2 | 1 | 0 | Method |
| Value | 0 | 0 | Denormal | 0 | 0 | Denormal | 0 | \#PE if inexact | 0 | 0 | 0 | Nearest |
|  | 0 | 1 | DAZ | 0 | 1 | FTZ | 1 | No \#PE | 0 | 0 | 1 | Down |
|  | 1 | x | MXCSR.DAZ | 1 | x | MXCSR.FTZ |  |  | 0 | 1 | 0 | Up |
|  |  |  |  |  |  |  |  |  | 0 | 1 | 1 | Truncate |
|  |  |  |  |  |  |  |  |  | 1 | x | x | MXCSR.RC |

The VCVTPS2PH instruction is a CVT16 instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :--- | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VCVTPS2PH xmm1/mem64, xmm2, imm8 | 8F | $\overline{\mathrm{RXB}}$.08 | 0.1111.0.00 | A1 /r /imm8 |
| VCVTPS2PH xmm1/mem128, ymm2, imm8 | 8F | $\overline{\mathrm{RXB}}$.08 | 0.1111.1.00 | A1 /r /imm8 |




## Related Instructions

VCVTPH2PS

## rFLAGS Affected

None

## MXCSR Flags Affected

| MM | FZ | RC |  | PM | UM | OM | ZM | DM | IM | DAZ | PE | UE | OE | ZE | DE | IE |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  |  |  |  |  |  |  | M |  |  |  |  |  |
| 17 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |

Note: A flag that may be set to one or cleared to zero is M (modified). Unaffected flags are blank.

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :--- | :---: | :---: | :---: | :--- |
|  | X | X |  | CVT16 instructions are only recognized in protected mode. |
|  |  |  | X | The CVT16 instructions are not supported, as indicated by ECX bit 18 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |

## VFMADDPD Multiply and Add Packed Double-Precision Floating-Point

Multiplies each packed double-precision floating-point value in the first source by the corresponding packed double-precision floating-point value in the second source, then adds each product to the corresponding packed double-precision floating-point value in the third source and writes the rounded results to the destination register.

The VFMADDPD instruction requires four operands:

VFMADDPD dest, src1, src2, src3 ; dest = (src1 * src2) + src3

The 128-bit version multiplies each of the two double-precision values in the first source XMM register by its corresponding double-precision value in the second source. It then adds each intermediate product to the corresponding double-precision value in the third source and places the result in the destination XMM register.

The 256-bit version multiplies each of the four double-precision values in the first source YMM register by its corresponding double-precision value in the second source. It then adds each product to the corresponding double-precision value in the third source and places the results in the destination YMM register.

If VEX.W is 0, the second source is either a register or a memory location and the third source is a register. If VEX.W is 1, the second source is a register and the third source is either a register or a memory location.

The destination is always either an XMM register or a YMM register, depending on the vector size, as determined by the value of VEX.L. When the destination is a 128-bit XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The intermediate products are not rounded; the infinitely precise products are used in the addition. The results of the addition are rounded, as specified by the rounding mode in MXCSR.
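
The effect of leaving the intermediate product unrounded can be sketched with exact rational arithmetic. This is a software model, not the instruction itself: the function name `fused_multiply_add` is hypothetical, only round-to-nearest is modelled (not the other MXCSR rounding modes), and NaN and infinity inputs are not handled.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Compute a * b + c exactly, then round once to double precision."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # single rounding, to nearest


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

separate = a * b + c               # the product is rounded before the addition
fused = fused_multiply_add(a, b, c)
print(separate, fused)
```

Here the exact product is 1 - 2^-60, which rounds to 1.0 as a double, so the separately rounded sequence returns 0.0 while the fused form preserves the small negative result.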

The VFMADDPD instruction is an FMA4 instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | VEX | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VFMADDPD xmm1, xmm2, xmm3/mem128, xmm4 | C4 | $\overline{\text{RXB}}$.03 | 0.$\overline{\text{xsrc1}}$.0.01 | 69 /r /is4 |
| VFMADDPD ymm1, ymm2, ymm3/mem256, ymm4 | C4 | $\overline{\text{RXB}}$.03 | 0.$\overline{\text{ysrc1}}$.1.01 | 69 /r /is4 |
| VFMADDPD xmm1, xmm2, xmm3, xmm4/mem128 | C4 | $\overline{\text{RXB}}$.03 | 1.$\overline{\text{xsrc1}}$.0.01 | 69 /r /is4 |
| VFMADDPD ymm1, ymm2, ymm3, ymm4/mem256 | C4 | $\overline{\text{RXB}}$.03 | 1.$\overline{\text{ysrc1}}$.1.01 | 69 /r /is4 |
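As an illustration of how the fields in the encoding table combine into bytes, the following Python sketch assembles a register-only form. The field layout (C4, inverted RXB with map 03, W, inverted vvvv holding the first source, L, pp = 01) is taken from the table. The placement of the destination in ModRM.reg, and of the second and third sources in ModRM.r/m and the upper four bits of the is4 byte, is an assumption based on the usual VEX conventions and is not stated in this section. The function name `encode_fma4_rr` is hypothetical.

```python
def encode_fma4_rr(opcode, dest, src1, src2, src3, w=0, l=0):
    """Assemble a register-only FMA4 instruction; register numbers are 0-15."""
    # Assumption: W selects which of src2/src3 sits in ModRM.r/m versus is4[7:4].
    rm, is4_reg = (src2, src3) if w == 0 else (src3, src2)
    r_bar = (~dest >> 3) & 1      # inverted high bit of ModRM.reg
    x_bar = 1                     # no index register in a register-only form
    b_bar = (~rm >> 3) & 1        # inverted high bit of ModRM.r/m
    byte1 = (r_bar << 7) | (x_bar << 6) | (b_bar << 5) | 0x03    # mmmmm = 00011b
    byte2 = (w << 7) | ((~src1 & 0xF) << 3) | (l << 2) | 0x01    # pp = 01b
    modrm = 0xC0 | ((dest & 7) << 3) | (rm & 7)                  # mod = 11b
    return bytes([0xC4, byte1, byte2, opcode, modrm, (is4_reg & 0xF) << 4])

# VFMADDPD xmm0, xmm1, xmm2, xmm3 in both W forms (opcode 69h from the table).
w0 = encode_fma4_rr(0x69, dest=0, src1=1, src2=2, src3=3, w=0)
w1 = encode_fma4_rr(0x69, dest=0, src1=1, src2=2, src3=3, w=1)
print(w0.hex(), w1.hex())
```

Under these assumptions both byte sequences describe the same register-only operation; the two W forms differ only in which source could be replaced by a memory operand.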

## Related Instructions

VFMADDPS, VFMADDSD, VFMADDSS

## rFLAGS Affected

None

## MXCSR Flags Affected

| MM | FZ | RC |  | PM | UM | OM | ZM | DM | IM | DAZ | PE | UE | OE | ZE | DE | IE |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  |  |  |  |  |  |  | $M$ | $M$ | $M$ |  | $M$ | $M$ |
| 17 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |

Note: A flag that may be set to one or cleared to zero is $M$ (modified). Unaffected flags are blank.
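The flag columns of the MXCSR table can be read back in software. The sketch below is an illustration only: the bit positions come from the bottom row of the table, the value 1F80h used in the example is the architectural reset value of MXCSR (all exceptions masked, no flags set), and the helper names are hypothetical. Obtaining the MXCSR value itself (for example with STMXCSR) is outside the sketch.

```python
# Exception-flag bit positions from the bottom row of the MXCSR table.
FLAG_BITS = {"IE": 0, "DE": 1, "ZE": 2, "OE": 3, "UE": 4, "PE": 5}
# Columns marked M for this instruction.
FMA_MODIFIED = {"PE", "UE", "OE", "DE", "IE"}

def flags_set(mxcsr):
    """Names of the exception flags that are 1 in an MXCSR value."""
    return {name for name, bit in FLAG_BITS.items() if (mxcsr >> bit) & 1}

def raised_between(before, after):
    """Flags that went from 0 to 1 between two MXCSR snapshots."""
    return flags_set(after) - flags_set(before)

# Example: only the precision flag (bit 5) became set.
print(raised_between(0x1F80, 0x1FA0))
```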

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :--- | :---: | :---: | :---: | :--- |
| Denormalized-operand exception (DE) |  |  | X | A source operand was a denormal value. |
| Overflow exception (OE) |  |  | X | A rounded result was too large to fit into the format of the destination operand. |
| Underflow exception (UE) |  |  | X | A rounded result was too small to fit into the format of the destination operand. |
| Precision exception (PE) |  |  | X | A result could not be represented exactly in the destination format. |

## VFMADDPS

## Multiply and Add Packed Single-Precision Floating-Point

Multiplies each packed single-precision floating-point value in the first source by the corresponding single-precision floating-point value in the second source, then adds each product to the corresponding packed single-precision floating-point value in the third source and writes the rounded results to the destination register.

The VFMADDPS instruction requires four operands:

$$
\text{VFMADDPS dest, src1, src2, src3} \qquad \text{dest} = \text{src1} * \text{src2} + \text{src3}
$$

The 128-bit version multiplies each of the four single-precision values in the first source XMM register by its corresponding single-precision value in the second source. It then adds each product to the corresponding single-precision value in the third source and places the results in the destination XMM register.

The 256-bit version multiplies each of the eight single-precision values in the first source YMM register by its corresponding single-precision value in the second source. It then adds each product to the corresponding single-precision value in the third source and places the results in the destination YMM register.

If VEX.W is 0, the second source is either a register or memory location and the third source is a register. If VEX.W is 1, the second source is a register and the third source is a register or memory location.

The destination is always either an XMM register or a YMM register, depending on the vector size, as determined by the value of VEX.L. When the destination is a 128-bit XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The intermediate products are not rounded; the infinitely precise products are used in the addition. The results of the addition are rounded, as specified by the rounding mode in MXCSR.
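For single precision, the single rounding step can be modeled by rounding an exact rational result directly to the 24-bit significand of the binary32 format. The following Python sketch is an illustration under stated assumptions: round-to-nearest-even only, finite inputs that are already representable in single precision, no overflow handling, and no signed zeros. The names `round_to_binary32` and `vfmaddps` are hypothetical.

```python
from fractions import Fraction

def round_to_binary32(q):
    """Round an exact rational to the nearest binary32 value, ties to even."""
    if q == 0:
        return 0.0
    sign = -1 if q < 0 else 1
    q = abs(q)
    # Find e with 2**e <= q < 2**(e + 1).
    e = q.numerator.bit_length() - q.denominator.bit_length()
    if Fraction(2) ** e > q:
        e -= 1
    e = max(e, -126)                    # denormals share the minimum exponent
    quantum = Fraction(2) ** (e - 23)   # spacing of binary32 values near q
    n = round(q / quantum)              # Fraction rounds halves to even
    return sign * float(n * quantum)

def vfmaddps(src1, src2, src3):
    """Lane-wise model for 4 (XMM) or 8 (YMM) single-precision values."""
    return [round_to_binary32(Fraction(a) * Fraction(b) + Fraction(c))
            for a, b, c in zip(src1, src2, src3)]

a = 1.0 + 2.0**-13
b = 1.0 - 2.0**-13
# The exact product 1 - 2**-26 would round to 1.0 in single precision.
print(vfmaddps([a] * 4, [b] * 4, [-1.0] * 4))
```

Rounding the exact result once, rather than rounding to double and then to single, avoids double-rounding effects in the model.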

The VFMADDPS instruction is an FMA4 instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | VEX | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VFMADDPS xmm1, xmm2, xmm3/mem128, xmm4 | C4 | $\overline{\text{RXB}}$.03 | 0.$\overline{\text{xsrc1}}$.0.01 | 68 /r /is4 |
| VFMADDPS ymm1, ymm2, ymm3/mem256, ymm4 | C4 | $\overline{\text{RXB}}$.03 | 0.$\overline{\text{ysrc1}}$.1.01 | 68 /r /is4 |
| VFMADDPS xmm1, xmm2, xmm3, xmm4/mem128 | C4 | $\overline{\text{RXB}}$.03 | 1.$\overline{\text{xsrc1}}$.0.01 | 68 /r /is4 |
| VFMADDPS ymm1, ymm2, ymm3, ymm4/mem256 | C4 | $\overline{\text{RXB}}$.03 | 1.$\overline{\text{ysrc1}}$.1.01 | 68 /r /is4 |

## Related Instructions

VFMADDPD, VFMADDSD, VFMADDSS

## rFLAGS Affected

None

## MXCSR Flags Affected

| MM | FZ | RC |  | PM | UM | OM | ZM | DM | IM | DAZ | PE | UE | OE | ZE | DE | IE |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  |  |  |  |  |  |  | $M$ | $M$ | $M$ |  | $M$ | $M$ |
| 17 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |

Note: A flag that may be set to one or cleared to zero is $M$ (modified). Unaffected flags are blank.

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :--- | :---: | :---: | :---: | :--- |
| Denormalized-operand exception (DE) |  |  | X | A source operand was a denormal value. |
| Overflow exception (OE) |  |  | X | A rounded result was too large to fit into the format of the destination operand. |
| Underflow exception (UE) |  |  | X | A rounded result was too small to fit into the format of the destination operand. |
| Precision exception (PE) |  |  | X | A result could not be represented exactly in the destination format. |

## VFMADDSD

## Multiply and Add Scalar Double-Precision Floating-Point

Multiplies the double-precision floating-point value in the low-order quadword of the first source by the double-precision floating-point value in the low-order quadword of the second source, then adds the product to the double-precision floating-point value in the low-order quadword of the third source. The low-order quadword result is written to the destination.

The VFMADDSD instruction requires four operands:

$$
\text{VFMADDSD dest, src1, src2, src3} \qquad \text{dest} = \text{src1} * \text{src2} + \text{src3}
$$

If VEX.W is 0, the second source is either a register or 64-bit memory location and the third source is a register. If VEX.W is 1, the second source is a register and the third source is a register or 64-bit memory location.

The destination is an XMM register. When the result is written to the destination XMM register, the upper quadword of the destination register (bits 64-127) and the upper 128 bits of the corresponding YMM register are cleared to zeros.
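The scalar behavior, including the clearing of the upper bits, can be sketched as follows. This is an illustrative model only: a register is represented as a list of four doubles standing for a full YMM register with lane 0 as the low-order quadword, finite inputs and round-to-nearest-even are assumed, and the name `vfmaddsd` is used here for a hypothetical Python helper.

```python
from fractions import Fraction

def vfmaddsd(src1, src2, src3):
    """Scalar model: only lane 0 of each 4-lane source takes part."""
    low = float(Fraction(src1[0]) * Fraction(src2[0]) + Fraction(src3[0]))
    # Bits 64-127 of the XMM destination and the upper 128 bits of the
    # corresponding YMM register are cleared.
    return [low, 0.0, 0.0, 0.0]

# The 9.0 values in the upper lanes do not influence the result.
ymm = vfmaddsd([2.0, 9.0, 9.0, 9.0], [3.0, 9.0, 9.0, 9.0], [4.0, 9.0, 9.0, 9.0])
print(ymm)
```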

The intermediate product is not rounded; the infinitely precise product is used in the addition. The result of the addition is rounded, as specified by the rounding mode in MXCSR.

The VFMADDSD instruction is an FMA4 instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | VEX | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VFMADDSD xmm1, xmm2, xmm3/mem64, xmm4 | C4 | $\overline{\text{RXB}}$.03 | 0.$\overline{\text{xsrc1}}$.0.01 | 6B /r /is4 |
| VFMADDSD xmm1, xmm2, xmm3, xmm4/mem64 | C4 | $\overline{\text{RXB}}$.03 | 1.$\overline{\text{xsrc1}}$.0.01 | 6B /r /is4 |

## Related Instructions

VFMADDPD, VFMADDPS, VFMADDSS

## rFLAGS Affected

None

## MXCSR Flags Affected

| MM | FZ | RC |  | PM | UM | OM | ZM | DM | IM | DAZ | PE | UE | OE | ZE | DE | IE |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  |  |  |  |  |  |  | $M$ | $M$ | $M$ |  | $M$ | $M$ |
| 17 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |

Note: A flag that may be set to one or cleared to zero is $M$ (modified). Unaffected flags are blank.

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | FMA4 instructions are only recognized in protected mode. |
|  |  |  | X | The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h. |
|  |  |  | X | The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set. |
|  |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=0. See SIMD Floating-Point Exceptions, below, for details. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |
| SIMD Floating-Point Exception, \#XF |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=1. See SIMD Floating-Point Exceptions, below, for details. |
| SIMD Floating-Point Exceptions |  |  |  |  |
| Invalid-operation exception (IE) |  |  | X | A source operand was an SNaN value. |
|  |  |  | X | +/-zero was multiplied by +/- infinity |
|  |  |  | X | +infinity was added to -infinity |
| Denormalized-operand exception (DE) |  |  | X | A source operand was a denormal value. |
| Overflow exception (OE) |  |  | X | A rounded result was too large to fit into the format of the destination operand. |
| Underflow exception (UE) |  |  | X | A rounded result was too small to fit into the format of the destination operand. |
| Precision exception (PE) |  |  | X | A result could not be represented exactly in the destination format. |
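The three \#UD conditions in the table that relate to feature support can be combined into one availability check. The Python sketch below is illustrative: it takes the register values as plain integers (reading them with CPUID and XGETBV is outside the sketch), the CPUID bit positions are those given in the table, and the positions of the XMM and YMM bits in XFEATURE_ENABLED_MASK (bits 1 and 2) are an assumption that this section does not state. The function name `fma4_usable` is hypothetical.

```python
def fma4_usable(cpuid_8000_0001_ecx, cpuid_0000_0001_ecx, xfeature_enabled_mask):
    """True if none of the feature-related #UD conditions applies."""
    fma4 = (cpuid_8000_0001_ecx >> 16) & 1       # FMA4 supported
    osxsave = (cpuid_0000_0001_ecx >> 27) & 1    # reflects CR4.OSXSAVE
    # Assumption: XMM state is bit 1 and YMM state is bit 2 of the mask.
    xmm_and_ymm = (xfeature_enabled_mask & 0b110) == 0b110
    return bool(fma4 and osxsave and xmm_and_ymm)

# Example values: every required bit set.
print(fma4_usable(1 << 16, 1 << 27, 0b111))
```

The check covers only feature support; the operating mode and the CR0.TS condition in the table are not modeled.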

## VFMADDSS

## Multiply and Add Scalar Single-Precision Floating-Point

Multiplies the single-precision floating-point value in the low-order doubleword of the first source by the low-order single-precision floating-point value in the second source, then adds the product to the low-order single-precision floating-point value in the third source. The low-order doubleword result is written to the destination.

The VFMADDSS instruction requires four operands:

$$
\text{VFMADDSS dest, src1, src2, src3} \qquad \text{dest} = \text{src1} * \text{src2} + \text{src3}
$$

If VEX.W is 0, the second source is either a register or 32-bit memory location and the third source is a register. If VEX.W is 1, the second source is a register and the third source is a register or 32-bit memory location.

The destination is an XMM register. When the result is written to the destination XMM register, the upper three doublewords of the destination register (bits 32-127) and the upper 128 bits of the corresponding YMM register are cleared to zeros.

The intermediate product is not rounded; the infinitely precise product is used in the addition. The result of the addition is rounded, as specified by the rounding mode in MXCSR.

The VFMADDSS instruction is an FMA4 instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | VEX | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VFMADDSS xmm1, xmm2, xmm3/mem32, xmm4 | C4 | $\overline{\text{RXB}}$.03 | 0.$\overline{\text{xsrc1}}$.0.01 | 6A /r /is4 |
| VFMADDSS xmm1, xmm2, xmm3, xmm4/mem32 | C4 | $\overline{\text{RXB}}$.03 | 1.$\overline{\text{xsrc1}}$.0.01 | 6A /r /is4 |

## Related Instructions

VFMADDPD, VFMADDPS, VFMADDSD

## rFLAGS Affected

None

## MXCSR Flags Affected

| MM | FZ | RC |  | PM | UM | OM | ZM | DM | IM | DAZ | PE | UE | OE | ZE | DE | IE |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  |  |  |  |  |  |  | M | M | M |  | $M$ | $M$ |
| 17 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |

Note: A flag that may be set to one or cleared to zero is $M$ (modified). Unaffected flags are blank.

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | FMA4 instructions are only recognized in protected mode. |
|  |  |  | X | The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h. |
|  |  |  | X | The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set. |
|  |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=0. See SIMD Floating-Point Exceptions, below, for details. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |
| SIMD Floating-Point Exception, \#XF |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=1. See SIMD Floating-Point Exceptions, below, for details. |
| SIMD Floating-Point Exceptions |  |  |  |  |
| Invalid-operation exception (IE) |  |  | X | A source operand was an SNaN value. |
|  |  |  | X | +/-zero was multiplied by +/- infinity |
|  |  |  | X | +infinity was added to -infinity |
| Denormalized-operand exception (DE) |  |  | X | A source operand was a denormal value. |
| Overflow exception (OE) |  |  | X | A rounded result was too large to fit into the format of the destination operand. |
| Underflow exception (UE) |  |  | X | A rounded result was too small to fit into the format of the destination operand. |
| Precision exception (PE) |  |  | X | A result could not be represented exactly in the destination format. |
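The table distinguishes two outcomes for an unmasked SIMD floating-point exception, depending on CR4.OSXMMEXCPT. The following Python sketch illustrates that decision. The flag and mask bit positions are those shown in the MXCSR table (flags in bits 0-5, masks in bits 7-12); the function name `fault_for` and the integer representation of the pending exceptions are assumptions made for the example.

```python
FLAG_MASK = 0x3F     # IE, DE, ZE, OE, UE, PE in bits 0-5
MASK_SHIFT = 7       # IM, DM, ZM, OM, UM, PM in bits 7-12

def fault_for(pending_flags, mxcsr, osxmmexcpt):
    """Fault raised for the pending exceptions, or None if all are masked."""
    masks = (mxcsr >> MASK_SHIFT) & FLAG_MASK
    unmasked = pending_flags & ~masks & FLAG_MASK
    if not unmasked:
        return None
    # With CR4.OSXMMEXCPT = 1 the exception is #XF, otherwise #UD.
    return "#XF" if osxmmexcpt else "#UD"

# Precision exception pending with PM (bit 12) cleared and OSXMMEXCPT = 1.
print(fault_for(0x20, 0x0F80, 1))
```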

## VFMADDSUBPD

## Multiply with Alternating Add/Subtract of Packed Double-Precision Floating-Point

Multiplies each packed double-precision floating-point value in the first source by the corresponding packed double-precision floating-point value in the second source. Adds each odd-numbered double-precision floating-point value in the third source to the corresponding infinite-precision intermediate product; subtracts each even-numbered double-precision floating-point value in the third source from its corresponding product. Finally, writes the results to the destination.

The 128-bit version multiplies each of the two double-precision floating-point values in the first source by its corresponding value in the second source. The low-order double-precision floating-point value in the third source is subtracted from its corresponding infinite-precision product and the high-order double-precision floating-point value in the third source is added to its corresponding product. The results of these operations are placed in their corresponding positions in the destination.

The 256-bit version multiplies each of the four double-precision floating-point values in the first source by its corresponding double-precision value in the second source. The even-numbered double-precision values in the third source are subtracted from their corresponding infinite-precision intermediate products and the odd-numbered double-precision values in the third source are added to their corresponding infinite-precision intermediate products. The results of these operations are placed in their corresponding positions in the destination.
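The alternating pattern described above can be sketched lane by lane in Python. The sketch assumes that elements are numbered from zero starting at the low-order position, which matches the 128-bit description (the low-order value is subtracted). It models finite inputs with round-to-nearest-even only, and the Python function `vfmaddsubpd` is a hypothetical helper.

```python
from fractions import Fraction

def vfmaddsubpd(src1, src2, src3):
    """Even lanes: a*b - c. Odd lanes: a*b + c. One rounding per lane."""
    out = []
    for i, (a, b, c) in enumerate(zip(src1, src2, src3)):
        exact = Fraction(a) * Fraction(b)
        if i % 2 == 0:
            exact -= Fraction(c)    # lanes 0 and 2 subtract
        else:
            exact += Fraction(c)    # lanes 1 and 3 add
        out.append(float(exact))
    return out

# 256-bit example with four lanes.
print(vfmaddsubpd([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [1.0, 1.0, 1.0, 1.0]))
```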

The first source is an XMM register or a YMM register, depending on the vector size, as determined by VEX.L.

If VEX.W is 0, the second source is either a register or memory location and the third source is a register. If VEX.W is 1, the second source is a register and the third source is a register or memory location.

The destination is always either an XMM register or a YMM register, depending on the vector size, as determined by the value of VEX.L. When writing to a 128-bit XMM destination register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The intermediate products are not rounded; the infinitely precise products are used in the final addition and subtraction operation(s). The results of the addition and subtraction operations are rounded, as specified by the rounding mode in MXCSR.

The VFMADDSUBPD instruction is an FMA4 instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | VEX | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VFMADDSUBPD xmm1, xmm2, xmm3/mem128, xmm4 | C4 | $\overline{\text{RXB}}$.03 | 0.$\overline{\text{xsrc1}}$.0.01 | 5D /r /is4 |
| VFMADDSUBPD ymm1, ymm2, ymm3/mem256, ymm4 | C4 | $\overline{\text{RXB}}$.03 | 0.$\overline{\text{ysrc1}}$.1.01 | 5D /r /is4 |
| VFMADDSUBPD xmm1, xmm2, xmm3, xmm4/mem128 | C4 | $\overline{\text{RXB}}$.03 | 1.$\overline{\text{xsrc1}}$.0.01 | 5D /r /is4 |
| VFMADDSUBPD ymm1, ymm2, ymm3, ymm4/mem256 | C4 | $\overline{\text{RXB}}$.03 | 1.$\overline{\text{ysrc1}}$.1.01 | 5D /r /is4 |

## Related Instructions

VFMADDSUBPS, VFMSUBADDPD, VFMSUBADDPS

## rFLAGS Affected

None

## MXCSR Flags Affected

| MM | FZ | RC |  | PM | UM | OM | ZM | DM | IM | DAZ | PE | UE | OE | ZE | DE | IE |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  |  |  |  |  |  |  | $M$ | $M$ | $M$ |  | $M$ | $M$ |
| 17 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |

Note: A flag that may be set to one or cleared to zero is $M$ (modified). Unaffected flags are blank.

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | FMA4 instructions are only recognized in protected mode. |
|  |  |  | X | The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h. |
|  |  |  | X | The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set. |
|  |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=0. See SIMD Floating-Point Exceptions, below, for details. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |
| SIMD Floating-Point Exception, \#XF |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=1. See SIMD Floating-Point Exceptions, below, for details. |
| SIMD Floating-Point Exceptions |  |  |  |  |
| Invalid-operation exception (IE) |  |  | X | A source operand was an SNaN value. |
|  |  |  | X | +/-zero was multiplied by +/-infinity |
|  |  |  | X | +infinity was added to -infinity |
|  |  |  | X | +infinity was subtracted from +infinity |
|  |  |  | X | -infinity was subtracted from -infinity |
| Denormalized-operand exception (DE) |  |  | X | A source operand was a denormal value. |
| Overflow exception (OE) |  |  | X | A rounded result was too large to fit into the format of the destination operand. |
| Underflow exception (UE) |  |  | X | A rounded result was too small to fit into the format of the destination operand. |
| Precision exception (PE) |  |  | X | A result could not be represented exactly in the destination format. |
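The invalid-operation causes listed in the table can be expressed as a small predicate over one lane. The Python sketch below is illustrative and covers only the zero-times-infinity and infinity-cancellation cases. SNaN operands are not modeled because Python floats do not distinguish signaling NaNs, and quiet NaN inputs are treated as not raising IE. The function name `invalid_operation` and the `subtract` parameter are hypothetical.

```python
import math

def invalid_operation(a, b, c, subtract):
    """True if a*b + c (or a*b - c when subtract is set) signals IE."""
    if any(math.isnan(v) for v in (a, b, c)):
        return False                      # quiet NaNs are not modeled as IE
    if (a == 0 and math.isinf(b)) or (math.isinf(a) and b == 0):
        return True                       # zero multiplied by infinity
    if (math.isinf(a) or math.isinf(b)) and math.isinf(c):
        product_sign = math.copysign(1.0, a) * math.copysign(1.0, b)
        addend_sign = math.copysign(1.0, c) * (-1.0 if subtract else 1.0)
        return product_sign != addend_sign    # infinities of opposite sign
    return False

# +infinity product with +infinity subtracted from it.
print(invalid_operation(math.inf, 1.0, math.inf, subtract=True))
```

For the alternating instructions, `subtract` would be set for the lanes that subtract the third source and cleared for the lanes that add it.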

## VFMADDSUBPS

## Multiply with Alternating Add/Subtract of Packed Single-Precision Floating-Point

Multiplies each packed single-precision floating-point value in the first source by the corresponding packed single-precision floating-point value in the second source. Adds each odd-numbered single-precision floating-point value in the third source to the corresponding infinite-precision intermediate product; subtracts each even-numbered single-precision floating-point value in the third source from its corresponding product. Finally, writes the results to the destination.

The 128-bit version multiplies each of the four single-precision floating-point values in the first source by its corresponding single-precision value in the second source. The even-numbered single-precision values in the third source are subtracted from their corresponding infinite-precision intermediate products and the odd-numbered single-precision values in the third source are added to their corresponding infinite-precision intermediate products. The results of these operations are placed in their corresponding positions in the destination.

The 256-bit version multiplies each of the eight single-precision floating-point values in the first source by its corresponding single-precision value in the second source. The even-numbered single-precision values in the third source are subtracted from their corresponding infinite-precision intermediate products and the odd-numbered single-precision values in the third source are added to their corresponding infinite-precision intermediate products. The results of these operations are placed in their corresponding positions in the destination.

The first source is either an XMM register or a YMM register, depending on the vector size, as determined by VEX.L.

If VEX.W is 0, the second source is either a register or memory location and the third source is a register. If VEX.W is 1, the second source is a register and the third source is a register or memory location.

The destination is always either an XMM register or a YMM register, depending on the vector size, as determined by the value of VEX.L. When writing to a 128-bit XMM destination register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The intermediate products are not rounded; the infinitely precise intermediate products are used in the addition and subtraction operations. The results of the addition and subtraction operations are rounded, as specified by the rounding mode in MXCSR.

The VFMADDSUBPS instruction is an FMA4 instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | VEX | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VFMADDSUBPS xmm1, xmm2, xmm3/mem128, xmm4 | C4 | $\overline{\text{RXB}}$.03 | 0.$\overline{\text{xsrc1}}$.0.01 | 5C /r /is4 |
| VFMADDSUBPS ymm1, ymm2, ymm3/mem256, ymm4 | C4 | $\overline{\text{RXB}}$.03 | 0.$\overline{\text{ysrc1}}$.1.01 | 5C /r /is4 |
| VFMADDSUBPS xmm1, xmm2, xmm3, xmm4/mem128 | C4 | $\overline{\text{RXB}}$.03 | 1.$\overline{\text{xsrc1}}$.0.01 | 5C /r /is4 |
| VFMADDSUBPS ymm1, ymm2, ymm3, ymm4/mem256 | C4 | $\overline{\text{RXB}}$.03 | 1.$\overline{\text{ysrc1}}$.1.01 | 5C /r /is4 |

## Related Instructions

VFMADDSUBPD, VFMSUBADDPD, VFMSUBADDPS

## rFLAGS Affected

None

## MXCSR Flags Affected

| MM | FZ | RC |  | PM | UM | OM | ZM | DM | IM | DAZ | PE | UE | OE | ZE | DE | IE |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  |  |  |  |  |  |  | $M$ | $M$ | $M$ |  | $M$ | $M$ |
| 17 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |

Note: A flag that may be set to one or cleared to zero is $M$ (modified). Unaffected flags are blank.

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | FMA4 instructions are only recognized in protected mode. |
|  |  |  | X | The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h. |
|  |  |  | X | The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set. |
|  |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=0. See SIMD Floating-Point Exceptions, below, for details. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |
| SIMD Floating-Point Exception, \#XF |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=1. See SIMD Floating-Point Exceptions, below, for details. |
| SIMD Floating-Point Exceptions |  |  |  |  |
| Invalid-operation exception (IE) |  |  | X | A source operand was an SNaN value. |
|  |  |  | X | +/-zero was multiplied by +/-infinity |
|  |  |  | X | +infinity was added to -infinity |
|  |  |  | X | +infinity was subtracted from +infinity |
|  |  |  | X | -infinity was subtracted from -infinity |
| Denormalized-operand exception (DE) |  |  | X | A source operand was a denormal value. |
| Overflow exception (OE) |  |  | X | A rounded result was too large to fit into the format of the destination operand. |
| Underflow exception (UE) |  |  | X | A rounded result was too small to fit into the format of the destination operand. |
| Precision exception (PE) |  |  | X | A result could not be represented exactly in the destination format. |

## VFMSUBADDPD

## Multiply with Alternating Subtract/Add of Packed Double-Precision Floating-Point

Multiplies each packed double-precision floating-point value in the first source by the corresponding packed double-precision floating-point value in the second source. Adds each even-numbered double-precision floating-point value in the third source to the corresponding infinite-precision intermediate product; subtracts each odd-numbered double-precision floating-point value in the third source from its corresponding product. Finally, writes the results to the destination.

The 128-bit version multiplies each of the two double-precision floating-point values in the first source by its corresponding value in the second source. The high-order double-precision floating-point value in the third source is subtracted from its corresponding infinite-precision product and the low-order double-precision floating-point value in the third source is added to its corresponding product. The results of these operations are placed in their corresponding positions in the destination.

The 256-bit version multiplies each of the four double-precision floating-point values in the first source by its corresponding double-precision value in the second source. The odd-numbered double-precision values in the third source are subtracted from their corresponding infinite-precision intermediate products and the even-numbered double-precision values in the third source are added to their corresponding infinite-precision intermediate products. The results of these operations are placed in their corresponding positions in the destination.
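The relationship between the subtract/add and add/subtract patterns can be illustrated with a small Python model. It assumes zero-based element numbering from the low-order position, finite inputs and round-to-nearest-even, and it does not model signed zeros or NaNs. The helper names are hypothetical. Within this model, VFMSUBADDPD gives the same lanes as VFMADDSUBPD applied to a negated third source.

```python
from fractions import Fraction

def alternate(src1, src2, src3, add_on_even):
    """Lane-wise a*b +/- c with one rounding; add_on_even picks the adding lanes."""
    out = []
    for i, (a, b, c) in enumerate(zip(src1, src2, src3)):
        add = (i % 2 == 0) == add_on_even
        exact = Fraction(a) * Fraction(b) + (Fraction(c) if add else -Fraction(c))
        out.append(float(exact))
    return out

def vfmsubaddpd(s1, s2, s3):
    return alternate(s1, s2, s3, add_on_even=True)    # even add, odd subtract

def vfmaddsubpd(s1, s2, s3):
    return alternate(s1, s2, s3, add_on_even=False)   # even subtract, odd add

x, y, z = [1.5, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 0.5], [1.0, 2.0, 3.0, 4.0]
negated = [-v for v in z]
print(vfmsubaddpd(x, y, z), vfmaddsubpd(x, y, negated))
```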

The first source is either an XMM register or a YMM register, depending on the vector size, as determined by VEX.L.

If VEX.W is 0, the second source is either a register or memory location and the third source is a register. If VEX.W is 1, the second source is a register and the third source is a register or memory location.

The destination is always either an XMM register or a YMM register, depending on the vector size, as determined by the value of VEX.L. When writing to a 128-bit XMM destination register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The intermediate products are not rounded; the infinitely precise intermediate products are used in the addition and subtraction operations. The results of the addition and subtraction operations are rounded, as specified by the rounding mode in MXCSR.

The VFMSUBADDPD instruction is an FMA4 instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | VEX | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VFMSUBADDPD xmm1, xmm2, xmm3/mem128, xmm4 | C4 | $\overline{\text{RXB}}$.03 | 0.$\overline{\text{xsrc1}}$.0.01 | 5F /r /is4 |
| VFMSUBADDPD ymm1, ymm2, ymm3/mem256, ymm4 | C4 | $\overline{\text{RXB}}$.03 | 0.$\overline{\text{ysrc1}}$.1.01 | 5F /r /is4 |
| VFMSUBADDPD xmm1, xmm2, xmm3, xmm4/mem128 | C4 | $\overline{\text{RXB}}$.03 | 1.$\overline{\text{xsrc1}}$.0.01 | 5F /r /is4 |
| VFMSUBADDPD ymm1, ymm2, ymm3, ymm4/mem256 | C4 | $\overline{\text{RXB}}$.03 | 1.$\overline{\text{ysrc1}}$.1.01 | 5F /r /is4 |

## Related Instructions

VFMADDSUBPD, VFMADDSUBPS, VFMSUBADDPS

## rFLAGS Affected

None

## MXCSR Flags Affected

| MM | FZ | RC |  | PM | UM | OM | ZM | DM | IM | DAZ | PE | UE | OE | ZE | DE | IE |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  |  |  |  |  |  |  | $M$ | $M$ | $M$ |  | $M$ | $M$ |
| 17 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |

Note: A flag that may be set to one or cleared to zero is $M$ (modified). Unaffected flags are blank.

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | FMA4 instructions are only recognized in protected mode. |
|  |  |  | X | The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h. |
|  |  |  | X | The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set. |
|  |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT = 0. See SIMD Floating-Point Exceptions, below, for details. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |
| SIMD Floating-Point Exception, \#XF |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT = 1. See SIMD Floating-Point Exceptions, below, for details. |
| SIMD Floating-Point Exceptions |  |  |  |  |
| Invalid-operation exception (IE) |  |  | X | A source operand was an SNaN value. |
|  |  |  | X | +/-zero was multiplied by +/-infinity. |
|  |  |  | X | +infinity was added to -infinity. |
|  |  |  | X | +infinity was subtracted from +infinity. |
|  |  |  | X | -infinity was subtracted from -infinity. |
| Denormalized-operand exception (DE) |  |  | X | A source operand was a denormal value. |
| Overflow exception (OE) |  |  | X | A rounded result was too large to fit into the format of the destination operand. |
| Underflow exception (UE) |  |  | X | A rounded result was too small to fit into the format of the destination operand. |
| Precision exception (PE) |  |  | X | A result could not be represented exactly in the destination format. |
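The three \#UD conditions tied to feature and operating-system support can be combined into a single availability check. The sketch below takes the raw register values as plain integers, because reading CPUID and XFEATURE_ENABLED_MASK is platform specific and outside its scope. The function name is hypothetical, and the XMM and YMM bit positions (1 and 2) are assumptions that this section does not state.

```python
FMA4_BIT = 1 << 16       # ECX bit 16 of CPUID function 8000_0001h
OSXSAVE_BIT = 1 << 27    # ECX bit 27 of CPUID function 0000_0001h
XMM_STATE_BIT = 1 << 1   # assumed position of the XMM bit in XFEATURE_ENABLED_MASK
YMM_STATE_BIT = 1 << 2   # assumed position of the YMM bit in XFEATURE_ENABLED_MASK


def fma4_usable(cpuid_8000_0001_ecx, cpuid_0000_0001_ecx, xfeature_enabled_mask):
    """Return True if none of the support-related #UD causes applies."""
    if not (cpuid_8000_0001_ecx & FMA4_BIT):
        return False     # FMA4 instructions are not supported
    if not (cpuid_0000_0001_ecx & OSXSAVE_BIT):
        return False     # CR4.OSXSAVE is cleared
    wanted = XMM_STATE_BIT | YMM_STATE_BIT
    return (xfeature_enabled_mask & wanted) == wanted   # both bits must be set


print(fma4_usable(1 << 16, 1 << 27, 0b110))   # True
print(fma4_usable(1 << 16, 1 << 27, 0b010))   # False: YMM state not enabled
```

The check does not cover the processor-mode requirement or the state of CR0.TS, which are also listed in the table.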

## VFMSUBADDPS

## Multiply with Alternating Subtract/Add of Packed Single-Precision Floating-Point

Multiplies each packed single-precision floating-point value in the first source by the corresponding packed single-precision floating-point value in the second source. Adds each even-numbered single-precision floating-point value in the third source to the corresponding infinite-precision intermediate product; subtracts each odd-numbered single-precision floating-point value in the third source from its corresponding product. Finally, writes the results to the destination.

The 128-bit version multiplies each of the four single-precision floating-point values in the first source by its corresponding single-precision value in the second source. The odd-numbered single-precision values in the third source are subtracted from their corresponding infinite-precision intermediate products, and the even-numbered single-precision values in the third source are added to their corresponding infinite-precision intermediate products. The results of these operations are placed in their corresponding positions in the destination.

The 256-bit version multiplies each of the eight single-precision floating-point values in the first source by its corresponding single-precision value in the second source. The odd-numbered single-precision values in the third source are subtracted from their corresponding infinite-precision intermediate products, and the even-numbered single-precision values in the third source are added to their corresponding infinite-precision intermediate products. The results of these operations are placed in their corresponding positions in the destination.

The first source is either an XMM register or a YMM register, depending on the vector size, as determined by VEX.L.

If VEX.W is 0, the second source is either a register or a memory location and the third source is a register. If VEX.W is 1, the second source is a register and the third source is either a register or a memory location.

The destination is always either an XMM register or a YMM register, depending on the vector size, as determined by the value of VEX.L. When writing to a 128-bit XMM destination register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The intermediate products are not rounded; the infinitely precise products are used in the addition and subtraction. The results of the additions and subtractions are rounded, as specified by the rounding mode in MXCSR.

The VFMSUBADDPS instruction is an FMA4 instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | VEX | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VFMSUBADDPS xmm1, xmm2, xmm3/mem128, xmm4 | C4 | $\overline{\text{RXB}}$.03 | 0.$\overline{\text{xsrc1}}$.0.01 | 5E /r /is4 |
| VFMSUBADDPS ymm1, ymm2, ymm3/mem256, ymm4 | C4 | $\overline{\text{RXB}}$.03 | 0.$\overline{\text{ysrc1}}$.1.01 | 5E /r /is4 |
| VFMSUBADDPS xmm1, xmm2, xmm3, xmm4/mem128 | C4 | $\overline{\text{RXB}}$.03 | 1.$\overline{\text{xsrc1}}$.0.01 | 5E /r /is4 |
| VFMSUBADDPS ymm1, ymm2, ymm3, ymm4/mem256 | C4 | $\overline{\text{RXB}}$.03 | 1.$\overline{\text{ysrc1}}$.1.01 | 5E /r /is4 |

## Related Instructions

VFMADDSUBPD, VFMADDSUBPS, VFMSUBADDPD

## rFLAGS Affected

None

## MXCSR Flags Affected

| MM | FZ | RC |  | PM | UM | OM | ZM | DM | IM | DAZ | PE | UE | OE | ZE | DE | IE |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  |  |  |  |  |  |  | $M$ | $M$ | $M$ |  | $M$ | $M$ |
| 17 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |

Note: A flag that may be set to one or cleared to zero is $M$ (modified). Unaffected flags are blank.

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | FMA4 instructions are only recognized in protected mode. |
|  |  |  | X | The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h. |
|  |  |  | X | The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set. |
|  |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT = 0. See SIMD Floating-Point Exceptions, below, for details. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |
| SIMD Floating-Point Exception, \#XF |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT = 1. See SIMD Floating-Point Exceptions, below, for details. |
| SIMD Floating-Point Exceptions |  |  |  |  |
| Invalid-operation exception (IE) |  |  | X | A source operand was an SNaN value. |
|  |  |  | X | +/-zero was multiplied by +/-infinity. |
|  |  |  | X | +infinity was added to -infinity. |
|  |  |  | X | +infinity was subtracted from +infinity. |
|  |  |  | X | -infinity was subtracted from -infinity. |
| Denormalized-operand exception (DE) |  |  | X | A source operand was a denormal value. |
| Overflow exception (OE) |  |  | X | A rounded result was too large to fit into the format of the destination operand. |
| Underflow exception (UE) |  |  | X | A rounded result was too small to fit into the format of the destination operand. |
| Precision exception (PE) |  |  | X | A result could not be represented exactly in the destination format. |

## VFMSUBPD

## Multiply and Subtract Packed Double-Precision Floating-Point

Multiplies each of the packed double-precision floating-point values in the first source by its corresponding packed double-precision floating-point value in the second source, then subtracts the corresponding packed double-precision floating-point values in the third source from the intermediate products of the multiplication. The results are written to the destination register.

The VFMSUBPD instruction requires four operands:

$$
\text{VFMSUBPD } \mathit{dest}, \mathit{src1}, \mathit{src2}, \mathit{src3} \qquad \mathit{dest} = \mathit{src1} * \mathit{src2} - \mathit{src3}
$$

The 128-bit version multiplies two packed double-precision floating-point values in the first source by their corresponding packed double-precision floating-point values in the second source, producing two intermediate products. The two double-precision floating-point values in the third source are subtracted from the intermediate products of the multiplication and the differences are placed in the destination XMM register.

The 256-bit version multiplies four packed double-precision floating-point values in the first source by their corresponding packed double-precision floating-point values in the second source, producing four intermediate products. The four double-precision floating-point values in the third source are subtracted from the intermediate products of the multiplication and the differences are placed in the destination YMM register.

If VEX.W is 0, the second source is either a register or a memory location and the third source is a register. If VEX.W is 1, the second source is a register and the third source is either a register or a memory location.

The destination is always either an XMM register or a YMM register, depending on the vector size, as determined by the value of VEX.L. When writing to a 128-bit XMM destination register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The intermediate products are not rounded; the infinitely precise products are used in the subtraction. The results of the subtraction are rounded, as specified by the rounding mode in MXCSR.
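The practical effect of the single rounding can be seen by comparing a fused multiply-subtract with a multiply followed by a separate subtract. The following Python sketch is illustrative only: the helper names are hypothetical, exact rational arithmetic stands in for the unrounded product, and only finite inputs under round-to-nearest are modeled.

```python
from fractions import Fraction


def fused_msub(a, b, c):
    """a*b - c with an exact intermediate product and a single final rounding."""
    return float(Fraction(a) * Fraction(b) - Fraction(c))


def separate_msub(a, b, c):
    """a*b - c where the product is rounded to double before the subtraction."""
    return a * b - c


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = 1.0

# The exact product is 1 - 2**-60, which rounds to 1.0 in double precision.
print(separate_msub(a, b, c))                  # 0.0
print(fused_msub(a, b, c) == -(2.0 ** -60))    # True
```

In this case the separately rounded sequence loses the entire result, while the fused form returns the exact difference.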

The VFMSUBPD instruction is an FMA4 instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic |
| :---: |
| VFMSUBPD xmm1, xmm2, xmm3/mem128, xmm4 |
| VFMSUBPD ymm1, ymm2, ymm3/mem256, ymm4 |
| VFMSUBPD xmm1, xmm2, xmm3, xmm4/mem128 |
| VFMSUBPD ymm1, ymm2, ymm3, ymm4/mem256 |

## Related Instructions

VFMSUBPS, VFMSUBSD, VFMSUBSS

## rFLAGS Affected

None

## MXCSR Flags Affected

| MM | FZ | RC |  | PM | UM | OM | ZM | DM | IM | DAZ | PE | UE | OE | ZE | DE | IE |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  |  |  |  |  |  |  | $M$ | $M$ | $M$ |  | $M$ | $M$ |
| 17 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |

Note: A flag that may be set to one or cleared to zero is $M$ (modified). Unaffected flags are blank.
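As a reading aid for the table, the sketch below decodes the exception-flag bits of an MXCSR value using the bit positions from the bottom row (IE at bit 0 through PE at bit 5) and reports the flags that correspond to the exceptions listed for this instruction. The function name is hypothetical, and the sample value 1F80h (all exceptions masked, no flags set) is an assumed starting state used only for the example.

```python
# Bit positions of the MXCSR exception flags, taken from the table's bottom row.
MXCSR_FLAG_BITS = {"IE": 0, "DE": 1, "ZE": 2, "OE": 3, "UE": 4, "PE": 5}

# Flags matching the SIMD floating-point exceptions listed for this instruction.
FMA4_MODIFIED_FLAGS = ("PE", "UE", "OE", "DE", "IE")


def raised_flags(mxcsr):
    """Return the names of the FMA4-relevant exception flags set in mxcsr."""
    return [name for name in FMA4_MODIFIED_FLAGS
            if (mxcsr >> MXCSR_FLAG_BITS[name]) & 1]


print(raised_flags(0x1F80))             # []
print(raised_flags(0x1F80 | 0b100001))  # ['PE', 'IE']
```

ZE is left out of the reported set because a divide-by-zero cannot arise from a multiply-subtract.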

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | FMA4 instructions are only recognized in protected mode. |
|  |  |  | X | The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h. |
|  |  |  | X | The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set. |
|  |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT = 0. See SIMD Floating-Point Exceptions, below, for details. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |
| SIMD Floating-Point Exception, \#XF |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT = 1. See SIMD Floating-Point Exceptions, below, for details. |
| SIMD Floating-Point Exceptions |  |  |  |  |
| Invalid-operation exception (IE) |  |  | X | A source operand was an SNaN value. |
|  |  |  | X | +/-zero was multiplied by +/-infinity. |
|  |  |  | X | +infinity was added to -infinity. |
|  |  |  | X | -infinity was subtracted from -infinity. |
| Denormalized-operand exception (DE) |  |  | X | A source operand was a denormal value. |
| Overflow exception (OE) |  |  | X | A rounded result was too large to fit into the format of the destination operand. |
| Underflow exception (UE) |  |  | X | A rounded result was too small to fit into the format of the destination operand. |
| Precision exception (PE) |  |  | X | A result could not be represented exactly in the destination format. |

## VFMSUBPS

## Multiply and Subtract Packed Single-Precision Floating-Point

Multiplies each of the packed single-precision floating-point values in the first source by its corresponding packed single-precision floating-point value in the second source, then subtracts the corresponding packed single-precision floating-point values in the third source from the products. The results are written to the destination register.

The VFMSUBPS instruction requires four operands:

$$
\text{VFMSUBPS } \mathit{dest}, \mathit{src1}, \mathit{src2}, \mathit{src3} \qquad \mathit{dest} = \mathit{src1} * \mathit{src2} - \mathit{src3}
$$

The 128-bit version multiplies four packed single-precision floating-point values in the first source by their corresponding packed single-precision floating-point values in the second source, producing four intermediate products. The four single-precision floating-point values in the third source are subtracted from the intermediate products of the multiplication and the differences are placed in the destination XMM register.

The 256-bit version multiplies eight packed single-precision floating-point values in the first source by their corresponding packed single-precision floating-point values in the second source, producing eight intermediate products. The eight single-precision floating-point values in the third source are subtracted from the intermediate products of the multiplication and the differences are placed in the destination YMM register.

If VEX.W is 0, the second source is either a register or a memory location and the third source is a register. If VEX.W is 1, the second source is a register and the third source is either a register or a memory location.

The destination is always either an XMM register or a YMM register, depending on the vector size, as determined by the value of VEX.L. When writing to a 128-bit XMM destination register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The intermediate products are not rounded; the infinitely precise products are used in the subtraction. The results of the subtraction are rounded, as specified by the rounding mode in MXCSR.

The VFMSUBPS instruction is an FMA4 instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | VEX | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VFMSUBPS xmm1, xmm2, xmm3/mem128, xmm4 | C4 | $\overline{\text{RXB}}$.03 | 0.$\overline{\text{xsrc1}}$.0.01 | 6C /r /is4 |
| VFMSUBPS ymm1, ymm2, ymm3/mem256, ymm4 | C4 | $\overline{\text{RXB}}$.03 | 0.$\overline{\text{ysrc1}}$.1.01 | 6C /r /is4 |
| VFMSUBPS xmm1, xmm2, xmm3, xmm4/mem128 | C4 | $\overline{\text{RXB}}$.03 | 1.$\overline{\text{xsrc1}}$.0.01 | 6C /r /is4 |
| VFMSUBPS ymm1, ymm2, ymm3, ymm4/mem256 | C4 | $\overline{\text{RXB}}$.03 | 1.$\overline{\text{ysrc1}}$.1.01 | 6C /r /is4 |
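To make the field layout in the table concrete, the sketch below assembles the bytes for the register-only forms of VFMSUBPS. It is an illustration under stated assumptions rather than a reference encoder: the function name is hypothetical, and the placement of operands (destination in ModRM.reg, the memory-capable source in ModRM.r/m, the remaining source register in bits 7:4 of the `/is4` immediate, overlined fields stored in inverted form) follows the usual VEX conventions, which this section does not spell out. Memory operands and mode-specific register restrictions are not modeled.

```python
OPCODE_VFMSUBPS = 0x6C   # opcode byte from the table above


def encode_vfmsubps_rr(dest, src1, src2, src3, ymm=False, vex_w=0):
    """Encode VFMSUBPS dest, src1, src2, src3 with register numbers 0..15."""
    if any(not 0 <= r <= 15 for r in (dest, src1, src2, src3)):
        raise ValueError("register numbers must be in the range 0..15")
    if vex_w not in (0, 1):
        raise ValueError("vex_w must be 0 or 1")
    # VEX.W selects which source sits in ModRM.r/m; the other one goes in is4.
    rm, is4 = (src2, src3) if vex_w == 0 else (src3, src2)
    r_bar = ((dest >> 3) & 1) ^ 1        # inverted high bit of ModRM.reg
    x_bar = 1                            # no index register in this form
    b_bar = ((rm >> 3) & 1) ^ 1          # inverted high bit of ModRM.r/m
    byte1 = (r_bar << 7) | (x_bar << 6) | (b_bar << 5) | 0x03   # RXB.mmmmm
    vvvv = ~src1 & 0xF                   # first source, stored inverted
    byte2 = (vex_w << 7) | (vvvv << 3) | (int(ymm) << 2) | 0b01  # W.vvvv.L.pp
    modrm = 0xC0 | ((dest & 7) << 3) | (rm & 7)                  # mod = 11b
    return bytes([0xC4, byte1, byte2, OPCODE_VFMSUBPS, modrm, is4 << 4])


print(encode_vfmsubps_rr(0, 1, 2, 3).hex())            # c4e3716cc230
print(encode_vfmsubps_rr(0, 1, 2, 3, vex_w=1).hex())   # c4e3f16cc320
```

Both calls describe the same operation on xmm0 through xmm3; only the VEX.W bit and the positions of the second and third sources differ.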

## Related Instructions

VFMSUBPD, VFMSUBSD, VFMSUBSS

## rFLAGS Affected

None

## MXCSR Flags Affected

| MM | FZ | RC |  | PM | UM | OM | ZM | DM | IM | DAZ | PE | UE | OE | ZE | DE | IE |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  |  |  |  |  |  |  | $M$ | $M$ | $M$ |  | $M$ | $M$ |
| 17 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |

Note: A flag that may be set to one or cleared to zero is $M$ (modified). Unaffected flags are blank.

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | FMA4 instructions are only recognized in protected mode. |
|  |  |  | X | The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h. |
|  |  |  | X | The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set. |
|  |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT = 0. See SIMD Floating-Point Exceptions, below, for details. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |
| SIMD Floating-Point Exception, \#XF |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT = 1. See SIMD Floating-Point Exceptions, below, for details. |
| SIMD Floating-Point Exceptions |  |  |  |  |
| Invalid-operation exception (IE) |  |  | X | A source operand was an SNaN value. |
|  |  |  | X | +/-zero was multiplied by +/-infinity. |
|  |  |  | X | +infinity was added to -infinity. |
|  |  |  | X | -infinity was subtracted from -infinity. |
| Denormalized-operand exception (DE) |  |  | X | A source operand was a denormal value. |
| Overflow exception (OE) |  |  | X | A rounded result was too large to fit into the format of the destination operand. |
| Underflow exception (UE) |  |  | X | A rounded result was too small to fit into the format of the destination operand. |
| Precision exception (PE) |  |  | X | A result could not be represented exactly in the destination format. |
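The invalid-operation causes in the table can be read as a predicate on the three inputs of one element. The sketch below is one interpretation, not a definitive model: the function name is hypothetical, the infinity rows are taken to mean that the product and the third source are infinities of the same sign (so the subtraction has no defined result), and NaN inputs, including the SNaN case, are rejected rather than modeled.

```python
import math


def msub_is_invalid(a, b, c):
    """Return True if a*b - c matches a listed invalid-operation cause.

    NaN inputs are outside this sketch and raise ValueError.
    """
    if any(math.isnan(v) for v in (a, b, c)):
        raise ValueError("NaN inputs are not modeled in this sketch")
    # +/-zero multiplied by +/-infinity
    if (a == 0.0 and math.isinf(b)) or (math.isinf(a) and b == 0.0):
        return True
    # The exact product is infinite only when one of the factors is infinite.
    if (math.isinf(a) or math.isinf(b)) and math.isinf(c):
        product_sign = math.copysign(1.0, a) * math.copysign(1.0, b)
        # Subtracting an infinity from an infinity of the same sign is invalid.
        return product_sign == math.copysign(1.0, c)
    return False


inf = math.inf
print(msub_is_invalid(0.0, inf, 1.0))    # True
print(msub_is_invalid(inf, 1.0, inf))    # True
print(msub_is_invalid(inf, 1.0, -inf))   # False
```

The last call is valid because subtracting -infinity from +infinity yields +infinity.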

## VFMSUBSD

## Multiply and Subtract Scalar Double-Precision Floating-Point

Multiplies the double-precision floating-point value in the low-order quadword of the first source by the double-precision floating-point value in the low-order quadword of the second source, then subtracts the double-precision floating-point value in the low-order quadword of the third source from the intermediate product. The low-order quadword result is written to the destination.

The VFMSUBSD instruction requires four operands:

$$
\text{VFMSUBSD } \mathit{dest}, \mathit{src1}, \mathit{src2}, \mathit{src3} \qquad \mathit{dest} = \mathit{src1} * \mathit{src2} - \mathit{src3}
$$

If VEX.W is 0, the second source is either a register or a 64-bit memory location and the third source is a register. If VEX.W is 1, the second source is a register and the third source is either a register or a 64-bit memory location.

The destination is an XMM register. When the result is written to the destination XMM register, the upper quadword of the destination register (bits 64-127) and the upper 128 bits of the corresponding YMM register are cleared to zeros.

The intermediate product is not rounded; the infinitely precise product is used in the subtraction. The result of the subtraction is rounded, as specified by the rounding mode in MXCSR.
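The scalar behavior, including the zeroing of the upper bits, can be sketched as follows. The model is illustrative only: the function name and the representation of a YMM register as four quadword lanes (lowest first) are assumptions made for the example, and only finite inputs under round-to-nearest are handled.

```python
from fractions import Fraction


def vfmsubsd(src1, src2, src3):
    """Return the four quadword lanes of the destination YMM after VFMSUBSD.

    Each source is a sequence whose element 0 is the low-order quadword.
    """
    exact = Fraction(src1[0]) * Fraction(src2[0]) - Fraction(src3[0])
    low = float(exact)              # single rounding of the exact difference
    return [low, 0.0, 0.0, 0.0]     # bits 64-127 and the upper 128 bits are zero


# The upper source elements (9.0 here) do not affect the result.
print(vfmsubsd([3.0, 9.0], [4.0, 9.0], [2.0, 9.0]))   # [10.0, 0.0, 0.0, 0.0]
```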

The VFMSUBSD instruction is an FMA4 instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | VEX | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VFMSUBSD xmm1, xmm2, xmm3/mem64, xmm4 | C4 | $\overline{\text{RXB}}$.03 | 0.$\overline{\text{xsrc1}}$.0.01 | 6F /r /is4 |
| VFMSUBSD xmm1, xmm2, xmm3, xmm4/mem64 | C4 | $\overline{\text{RXB}}$.03 | 1.$\overline{\text{xsrc1}}$.0.01 | 6F /r /is4 |

## Related Instructions

VFMSUBPD, VFMSUBPS, VFMSUBSS

## rFLAGS Affected

None

## MXCSR Flags Affected

| MM | FZ | RC |  | PM | UM | OM | ZM | DM | IM | DAZ | PE | UE | OE | ZE | DE | IE |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  |  |  |  |  |  |  | $M$ | $M$ | $M$ |  | $M$ | $M$ |
| 17 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |

Note: A flag that may be set to one or cleared to zero is $M$ (modified). Unaffected flags are blank.

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | FMA4 instructions are only recognized in protected mode. |
|  |  |  | X | The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h. |
|  |  |  | X | The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set. |
|  |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT = 0. See SIMD Floating-Point Exceptions, below, for details. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |
| SIMD Floating-Point Exception, \#XF |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT = 1. See SIMD Floating-Point Exceptions, below, for details. |
| SIMD Floating-Point Exceptions |  |  |  |  |
| Invalid-operation exception (IE) |  |  | X | A source operand was an SNaN value. |
|  |  |  | X | +/-zero was multiplied by +/-infinity. |
|  |  |  | X | +infinity was added to -infinity. |
|  |  |  | X | -infinity was subtracted from -infinity. |
| Denormalized-operand exception (DE) |  |  | X | A source operand was a denormal value. |
| Overflow exception (OE) |  |  | X | A rounded result was too large to fit into the format of the destination operand. |
| Underflow exception (UE) |  |  | X | A rounded result was too small to fit into the format of the destination operand. |
| Precision exception (PE) |  |  | X | A result could not be represented exactly in the destination format. |

## VFMSUBSS

## Multiply and Subtract Scalar Single-Precision Floating-Point

Multiplies the single-precision floating-point value in the low-order doubleword of the first source by the single-precision floating-point value in the low-order doubleword of the second source, then subtracts the single-precision floating-point value in the low-order doubleword of the third source from the product. The low-order doubleword result is written to the destination.

The VFMSUBSS instruction requires four operands:

$$
\text{VFMSUBSS } \mathit{dest}, \mathit{src1}, \mathit{src2}, \mathit{src3} \qquad \mathit{dest} = \mathit{src1} * \mathit{src2} - \mathit{src3}
$$

If VEX.W is 0, the second source is either a register or a 32-bit memory location and the third source is a register. If VEX.W is 1, the second source is a register and the third source is either a register or a 32-bit memory location.

The destination is an XMM register. When the result is written to the destination XMM register, the upper three doublewords of the destination register (bits 32-127) and the upper 128 bits of the corresponding YMM register are cleared to zeros.

The intermediate product is not rounded; the infinitely precise product is used in the subtraction. The result of the subtraction is rounded, as specified by the rounding mode in MXCSR.

The VFMSUBSS instruction is an FMA4 instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | VEX | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VFMSUBSS xmm1, xmm2, xmm3/mem32, xmm4 | C4 | $\overline{\text{RXB}}$.03 | 0.$\overline{\text{xsrc1}}$.0.01 | 6E /r /is4 |
| VFMSUBSS xmm1, xmm2, xmm3, xmm4/mem32 | C4 | $\overline{\text{RXB}}$.03 | 1.$\overline{\text{xsrc1}}$.0.01 | 6E /r /is4 |

## Related Instructions

VFMSUBPD, VFMSUBPS, VFMSUBSD

## rFLAGS Affected

None

## MXCSR Flags Affected

| MM | FZ | RC |  | PM | UM | OM | ZM | DM | IM | DAZ | PE | UE | OE | ZE | DE | IE |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  |  |  |  |  |  |  | $M$ | $M$ | $M$ |  | $M$ | $M$ |
| 17 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |

Note: A flag that may be set to one or cleared to zero is $M$ (modified). Unaffected flags are blank.

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | FMA4 instructions are only recognized in protected mode. |
|  |  |  | X | The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h. |
|  |  |  | X | The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set. |
|  |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT = 0. See SIMD Floating-Point Exceptions, below, for details. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |
| SIMD Floating-Point Exception, \#XF |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT = 1. See SIMD Floating-Point Exceptions, below, for details. |
| SIMD Floating-Point Exceptions |  |  |  |  |
| Invalid-operation exception (IE) |  |  | X | A source operand was an SNaN value. |
|  |  |  | X | +/-zero was multiplied by +/-infinity. |
|  |  |  | X | +infinity was added to -infinity. |
|  |  |  | X | -infinity was subtracted from -infinity. |
| Denormalized-operand exception (DE) |  |  | X | A source operand was a denormal value. |
| Overflow exception (OE) |  |  | X | A rounded result was too large to fit into the format of the destination operand. |
| Underflow exception (UE) |  |  | X | A rounded result was too small to fit into the format of the destination operand. |
| Precision exception (PE) |  |  | X | A result could not be represented exactly in the destination format. |

## VFNMADDPD

## Negative Multiply and Add Packed Double-Precision Floating-Point

Multiplies each of the packed double-precision floating-point values in the first source by the corresponding packed double-precision floating-point values in the second source, then negates the products and adds them to the corresponding packed double-precision floating-point values in the third source. The results are written to the destination register.

The VFNMADDPD instruction requires four operands:

$$
\text{VFNMADDPD } \mathit{dest}, \mathit{src1}, \mathit{src2}, \mathit{src3} \qquad \mathit{dest} = -(\mathit{src1} * \mathit{src2}) + \mathit{src3}
$$

The 128-bit version multiplies the two double-precision values in the first source XMM register by the corresponding double-precision values in the second source, which can be either an XMM register or a 128-bit memory location. It then negates each product and adds it to the corresponding double-precision value in the third source. The results are then placed in the destination XMM register.

The 256-bit version multiplies the four double-precision values in the first source YMM register by the four double-precision values in the second source, which can be either a YMM register or a 256-bit memory location. It then negates each product and adds it to the corresponding double-precision value in the third source. The results are then placed in the destination YMM register.

If VEX.W is 0, the second source is either a register or a memory location and the third source is a register. If VEX.W is 1, the second source is a register and the third source is either a register or a memory location.

The destination is always either an XMM register or a YMM register, depending on the vector size, as determined by the value of VEX.L. When the destination is a 128-bit XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The intermediate products are not rounded; the infinitely precise products are used in the addition. The results of the addition are rounded, as specified by the rounding mode in MXCSR.
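A minimal Python sketch of the negated multiply-add follows. It is illustrative rather than hardware-accurate: the function name and the list-of-floats register model are assumptions, exact rational arithmetic stands in for the unrounded product, and infinities, NaNs, signed zeros, and rounding modes other than round-to-nearest are not modeled.

```python
from fractions import Fraction


def vfnmaddpd(src1, src2, src3):
    """Model dest = -(src1 * src2) + src3 on 2 or 4 finite double lanes."""
    if not (len(src1) == len(src2) == len(src3)) or len(src1) not in (2, 4):
        raise ValueError("expected 2 or 4 double-precision lanes per source")
    dest = []
    for a, b, c in zip(src1, src2, src3):
        exact = -(Fraction(a) * Fraction(b)) + Fraction(c)   # no intermediate rounding
        dest.append(float(exact))                            # rounded once
    return dest


print(vfnmaddpd([1.0, 2.0], [3.0, 4.0], [10.0, 10.0]))   # [7.0, 2.0]
```

Negating the exact product before the addition gives the same value as subtracting the product from the third source, so no extra rounding step is introduced.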

The VFNMADDPD instruction is an FMA4 instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | VEX | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VFNMADDPD xmm1, xmm2, xmm3/mem128, xmm4 | C4 | $\overline{\text{RXB}}$.03 | 0.$\overline{\text{xsrc1}}$.0.01 | 79 /r /is4 |
| VFNMADDPD ymm1, ymm2, ymm3/mem256, ymm4 | C4 | $\overline{\text{RXB}}$.03 | 0.$\overline{\text{ysrc1}}$.1.01 | 79 /r /is4 |
| VFNMADDPD xmm1, xmm2, xmm3, xmm4/mem128 | C4 | $\overline{\text{RXB}}$.03 | 1.$\overline{\text{xsrc1}}$.0.01 | 79 /r /is4 |
| VFNMADDPD ymm1, ymm2, ymm3, ymm4/mem256 | C4 | $\overline{\text{RXB}}$.03 | 1.$\overline{\text{ysrc1}}$.1.01 | 79 /r /is4 |
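
To make the field layout in the table concrete, the sketch below assembles the six bytes of a register-only VFNMADDPD. The helper name is hypothetical. Placing the destination in ModRM.reg and the register-only source in bits 7:4 of the is4 byte is an assumption based on the `/r /is4` notation and should be checked against the encoding chapter of this manual.

```c
#include <stdint.h>

/* Hypothetical helper: encodes a register-only VFNMADDPD (opcode 79h) into
 * six bytes. Register numbers are 0-15. vex_w selects which source occupies
 * ModRM.r/m (the slot that could also be memory) and which occupies is4. */
void encode_vfnmaddpd_rr(uint8_t out[6], unsigned dest, unsigned src1,
                         unsigned src2, unsigned src3,
                         unsigned vex_w, unsigned vex_l)
{
    unsigned rm  = (vex_w & 1u) ? src3 : src2;  /* register/memory slot */
    unsigned imm = (vex_w & 1u) ? src2 : src3;  /* register-only slot   */
    unsigned r = (dest >> 3) & 1u;              /* extends ModRM.reg    */
    unsigned b = (rm >> 3) & 1u;                /* extends ModRM.r/m    */

    out[0] = 0xC4;                              /* three-byte VEX escape */
    /* R, X, B are stored inverted; X is unused without a SIB byte, so it
     * is stored as 1. mmmmm = 03h. */
    out[1] = (uint8_t)(((r ^ 1u) << 7) | (1u << 6) | ((b ^ 1u) << 5) | 0x03u);
    /* W, inverted src1 in vvvv, L, pp = 01b */
    out[2] = (uint8_t)(((vex_w & 1u) << 7) | ((~src1 & 0xFu) << 3) |
                       ((vex_l & 1u) << 2) | 0x01u);
    out[3] = 0x79;                              /* opcode */
    out[4] = (uint8_t)(0xC0u | ((dest & 7u) << 3) | (rm & 7u)); /* mod = 11b */
    out[5] = (uint8_t)((imm & 0xFu) << 4);      /* is4: register in bits 7:4 */
}
```

Under these assumptions, `VFNMADDPD xmm0, xmm1, xmm2, xmm3` with VEX.W = 0 and VEX.L = 0 yields the bytes C4 E3 71 79 C2 30.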

## Related Instructions

VFNMADDPS, VFNMADDSD, VFNMADDSS

## rFLAGS Affected

None

## MXCSR Flags Affected

| MM | FZ | RC |  | PM | UM | OM | ZM | DM | IM | DAZ | PE | UE | OE | ZE | DE | IE |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  |  |  |  |  |  |  | $M$ | $M$ | $M$ |  | $M$ | $M$ |
| 17 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |

Note: A flag that may be set to one or cleared to zero is $M$ (modified). Unaffected flags are blank.
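
For software that inspects MXCSR after executing the instruction, the status bits can be isolated with a mask. The sketch below assumes the standard MXCSR layout, with IE in bit 0 through PE in bit 5. The enum and function names are hypothetical. The set of modified flags matches the SIMD floating-point exceptions this instruction can raise (IE, DE, OE, UE, PE).

```c
#include <stdint.h>

/* MXCSR status-flag bit positions (standard layout). */
enum {
    MXCSR_IE = 1u << 0,  /* invalid operation */
    MXCSR_DE = 1u << 1,  /* denormalized operand */
    MXCSR_ZE = 1u << 2,  /* zero divide (not modified by this instruction) */
    MXCSR_OE = 1u << 3,  /* overflow */
    MXCSR_UE = 1u << 4,  /* underflow */
    MXCSR_PE = 1u << 5   /* precision */
};

/* Hypothetical helper: keeps only the status flags that this instruction
 * may modify. */
uint32_t fma4_status_flags(uint32_t mxcsr)
{
    return mxcsr & (MXCSR_IE | MXCSR_DE | MXCSR_OE | MXCSR_UE | MXCSR_PE);
}
```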

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | FMA4 instructions are only recognized in protected mode. |
|  |  |  | X | The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h. |
|  |  |  | X | The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set. |
|  |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=0. See SIMD Floating-Point Exceptions, below, for details. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |
| SIMD Floating-Point Exception, \#XF |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=1. See SIMD Floating-Point Exceptions, below, for details. |
| SIMD Floating-Point Exceptions |  |  |  |  |
| Invalid-operation exception (IE) |  |  | X | A source operand was an SNaN value. |
|  |  |  | X | +/-zero was multiplied by +/-infinity. |
|  |  |  | X | +infinity was added to -infinity. |
| Denormalized-operand exception (DE) |  |  | X | A source operand was a denormal value. |
| Overflow exception (OE) |  |  | X | A rounded result was too large to fit into the format of the destination operand. |
| Underflow exception (UE) |  |  | X | A rounded result was too small to fit into the format of the destination operand. |
| Precision exception (PE) |  |  | X | A result could not be represented exactly in the destination format. |
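
The three feature-related causes of \#UD listed in the table can be combined into a single availability check. The sketch below is a pure function over register values that the caller has already read. The function name is hypothetical, and the positions of the XMM and YMM bits within XFEATURE_ENABLED_MASK (bits 1 and 2) are stated as an assumption because the table does not give them.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper. The caller supplies raw register values:
 *   ext_ecx : ECX returned by CPUID function 8000_0001h
 *   std_ecx : ECX returned by CPUID function 0000_0001h
 *   xfem    : XFEATURE_ENABLED_MASK; pass 0 if OSXSAVE is not reported,
 *             since the mask cannot be read in that case */
bool fma4_usable(uint32_t ext_ecx, uint32_t std_ecx, uint64_t xfem)
{
    bool fma4    = ((ext_ecx >> 16) & 1u) != 0;  /* FMA4 feature bit */
    bool osxsave = ((std_ecx >> 27) & 1u) != 0;  /* OSXSAVE reported */
    /* Assumption: XMM state is bit 1 and YMM state is bit 2 of the mask. */
    bool xmm_ymm = (xfem & 0x6u) == 0x6u;
    return fma4 && osxsave && xmm_ymm;
}
```

Keeping the check free of inline assembly or compiler intrinsics leaves the choice of how to execute CPUID and read the mask to the caller.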

## VFNMADDPS

## Negative Multiply and Add Packed Single-Precision Floating-Point

Multiplies each of the packed single-precision floating-point values in the first source by the corresponding packed single-precision floating-point value in the second source, then negates the products and adds them to the corresponding packed single-precision floating-point values in the third source. The results are written to the destination register.

The VFNMADDPS instruction requires four operands:

$$
\text{VFNMADDPS dest, src1, src2, src3} \qquad \text{dest} = -(\text{src1} * \text{src2}) + \text{src3}
$$

The 128-bit version multiplies the four single-precision values in the first source XMM register by the corresponding single-precision values in the second source, which can be either an XMM register or a 128-bit memory location. It then negates each product and adds it to the corresponding single-precision value in the third source. The results are then placed in the destination XMM register.

The 256-bit version multiplies the eight single-precision values in the first source YMM register by the eight single-precision values in the second source, which can be either a YMM register or a 256-bit memory location. It then negates each product and adds it to the corresponding single-precision value in the third source. The results are then placed in the destination YMM register.

If VEX.W is 0, the second source is either a register or memory location and the third source is a register. If VEX.W is 1, the second source is a register and the third source is a register or memory location.

The destination is always either an XMM register or a YMM register, depending on the vector size, as determined by the value of VEX.L. When the destination is a 128-bit XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The intermediate products are not rounded; the infinitely precise products are used in the addition. The results of the addition are rounded, as specified by the rounding mode in MXCSR.

The VFNMADDPS instruction is an FMA4 instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | VEX | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VFNMADDPS xmm1, xmm2, xmm3/mem128, xmm4 | C4 | $\overline{\text{RXB}}$.03 | 0.$\overline{\text{xsrc1}}$.0.01 | 78 /r /is4 |
| VFNMADDPS ymm1, ymm2, ymm3/mem256, ymm4 | C4 | $\overline{\text{RXB}}$.03 | 0.$\overline{\text{ysrc1}}$.1.01 | 78 /r /is4 |
| VFNMADDPS xmm1, xmm2, xmm3, xmm4/mem128 | C4 | $\overline{\text{RXB}}$.03 | 1.$\overline{\text{xsrc1}}$.0.01 | 78 /r /is4 |
| VFNMADDPS ymm1, ymm2, ymm3, ymm4/mem256 | C4 | $\overline{\text{RXB}}$.03 | 1.$\overline{\text{ysrc1}}$.1.01 | 78 /r /is4 |



## Related Instructions

VFNMADDPD, VFNMADDSD, VFNMADDSS

## rFLAGS Affected

None

## MXCSR Flags Affected

| MM | FZ | RC |  | PM | UM | OM | ZM | DM | IM | DAZ | PE | UE | OE | ZE | DE | IE |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  |  |  |  |  |  |  | $M$ | $M$ | $M$ |  | $M$ | $M$ |
| 17 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |

Note: A flag that may be set to one or cleared to zero is $M$ (modified). Unaffected flags are blank.

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | FMA4 instructions are only recognized in protected mode. |
|  |  |  | X | The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h. |
|  |  |  | X | The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set. |
|  |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=0. See SIMD Floating-Point Exceptions, below, for details. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |
| SIMD Floating-Point Exception, \#XF |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=1. See SIMD Floating-Point Exceptions, below, for details. |
| SIMD Floating-Point Exceptions |  |  |  |  |
| Invalid-operation exception (IE) |  |  | X | A source operand was an SNaN value. |
|  |  |  | X | +/-zero was multiplied by +/-infinity. |
|  |  |  | X | +infinity was added to -infinity. |
| Denormalized-operand exception (DE) |  |  | X | A source operand was a denormal value. |
| Overflow exception (OE) |  |  | X | A rounded result was too large to fit into the format of the destination operand. |
| Underflow exception (UE) |  |  | X | A rounded result was too small to fit into the format of the destination operand. |
| Precision exception (PE) |  |  | X | A result could not be represented exactly in the destination format. |

## VFNMADDSD

## Negative Multiply and Add Scalar Double-Precision Floating-Point

Multiplies the double-precision floating-point value in the low-order quadword of the first source by the double-precision floating-point value in the low-order quadword of the second source, then negates the product and adds it to the double-precision floating-point value in the low-order quadword of the third source. The low-order quadword result is written to the destination register.

The VFNMADDSD instruction requires four operands:

$$
\text{VFNMADDSD dest, src1, src2, src3} \qquad \text{dest} = -(\text{src1} * \text{src2}) + \text{src3}
$$

The first source is an XMM register indicated by VEX.vvvv.

If VEX.W is 0, the second source is either a register or 64-bit memory location and the third source is a register. If VEX.W is 1, the second source is a register and the third source is a register or 64-bit memory location.

The destination is always an XMM register. When the result is written to the destination XMM register, the high quadword of the destination register (bits 64-127) and the upper 128 bits of the corresponding YMM register are cleared to zeros.

The intermediate products are not rounded; the infinitely precise products are used in the addition. The results of the addition are rounded, as specified by the rounding mode in MXCSR.

The VFNMADDSD instruction is an FMA4 instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | VEX | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VFNMADDSD xmm1, xmm2, xmm3/mem64, xmm4 | C4 | $\overline{\text{RXB}}$.03 | 0.$\overline{\text{xsrc1}}$.0.01 | 7B /r /is4 |
| VFNMADDSD xmm1, xmm2, xmm3, xmm4/mem64 | C4 | $\overline{\text{RXB}}$.03 | 1.$\overline{\text{xsrc1}}$.0.01 | 7B /r /is4 |

## Related Instructions

VFNMADDPD, VFNMADDPS, VFNMADDSS

## rFLAGS Affected

None

## MXCSR Flags Affected

| MM | FZ | RC |  | PM | UM | OM | ZM | DM | IM | DAZ | PE | UE | OE | ZE | DE | IE |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  |  |  |  |  |  |  | $M$ | $M$ | $M$ |  | $M$ | $M$ |
| 17 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |

Note: A flag that may be set to one or cleared to zero is $M$ (modified). Unaffected flags are blank.

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | FMA4 instructions are only recognized in protected mode. |
|  |  |  | X | The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h. |
|  |  |  | X | The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set. |
|  |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=0. See SIMD Floating-Point Exceptions, below, for details. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |
| SIMD Floating-Point Exception, \#XF |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=1. See SIMD Floating-Point Exceptions, below, for details. |
| SIMD Floating-Point Exceptions |  |  |  |  |
| Invalid-operation exception (IE) |  |  | X | A source operand was an SNaN value. |
|  |  |  | X | +/-zero was multiplied by +/-infinity. |
|  |  |  | X | +infinity was added to -infinity. |
| Denormalized-operand exception (DE) |  |  | X | A source operand was a denormal value. |
| Overflow exception (OE) |  |  | X | A rounded result was too large to fit into the format of the destination operand. |
| Underflow exception (UE) |  |  | X | A rounded result was too small to fit into the format of the destination operand. |
| Precision exception (PE) |  |  | X | A result could not be represented exactly in the destination format. |

## VFNMADDSS

## Negative Multiply and Add Scalar Single-Precision Floating-Point

Multiplies the single-precision floating-point value in the low-order doubleword of the first source by the single-precision floating-point value in the low-order doubleword of the second source, then negates the product and adds it to the single-precision floating-point value in the low-order doubleword of the third source. The low-order doubleword result is written to the destination.

The VFNMADDSS instruction requires four operands:

$$
\text{VFNMADDSS dest, src1, src2, src3} \qquad \text{dest} = -(\text{src1} * \text{src2}) + \text{src3}
$$

If VEX.W is 0, the second source is either a register or 32-bit memory location and the third source is a register. If VEX.W is 1, the second source is a register and the third source is a register or 32-bit memory location.

The destination is always an XMM register. When the result is written to the destination XMM register, the upper three doublewords of the destination register (bits 32-127) and the upper 128 bits of the corresponding YMM register are cleared to zeros.

The intermediate products are not rounded; the infinitely precise products are used in the addition. The results of the addition are rounded, as specified by the rounding mode in MXCSR.

The VFNMADDSS instruction is an FMA4 instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | VEX | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VFNMADDSS xmm1, xmm2, xmm3/mem32, xmm4 | C4 | $\overline{\text{RXB}}$.03 | 0.$\overline{\text{xsrc1}}$.0.01 | 7A /r /is4 |
| VFNMADDSS xmm1, xmm2, xmm3, xmm4/mem32 | C4 | $\overline{\text{RXB}}$.03 | 1.$\overline{\text{xsrc1}}$.0.01 | 7A /r /is4 |

## Related Instructions

VFNMADDPD, VFNMADDPS, VFNMADDSD

## rFLAGS Affected

None

## MXCSR Flags Affected

| MM | FZ | RC |  | PM | UM | OM | ZM | DM | IM | DAZ | PE | UE | OE | ZE | DE | IE |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  |  |  |  |  |  |  | $M$ | $M$ | $M$ |  | $M$ | $M$ |
| 17 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |

Note: A flag that may be set to one or cleared to zero is $M$ (modified). Unaffected flags are blank.

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | FMA4 instructions are only recognized in protected mode. |
|  |  |  | X | The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h. |
|  |  |  | X | The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set. |
|  |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=0. See SIMD Floating-Point Exceptions, below, for details. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |
| SIMD Floating-Point Exception, \#XF |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=1. See SIMD Floating-Point Exceptions, below, for details. |
| SIMD Floating-Point Exceptions |  |  |  |  |
| Invalid-operation exception (IE) |  |  | X | A source operand was an SNaN value. |
|  |  |  | X | +/-zero was multiplied by +/-infinity. |
|  |  |  | X | +infinity was added to -infinity. |
| Denormalized-operand exception (DE) |  |  | X | A source operand was a denormal value. |
| Overflow exception (OE) |  |  | X | A rounded result was too large to fit into the format of the destination operand. |
| Underflow exception (UE) |  |  | X | A rounded result was too small to fit into the format of the destination operand. |
| Precision exception (PE) |  |  | X | A result could not be represented exactly in the destination format. |

## VFNMSUBPD

## Negative Multiply and Subtract Packed Double-Precision Floating-Point

Multiplies each of the packed double-precision floating-point values in the first source by the corresponding packed double-precision floating-point value in the second source, then subtracts the corresponding packed double-precision floating-point value in the third source from the negated interim products. The results are written to the destination register.

The VFNMSUBPD instruction requires four operands:

$$
\text{VFNMSUBPD dest, src1, src2, src3} \qquad \text{dest} = -(\text{src1} * \text{src2}) - \text{src3}
$$

The 128-bit version multiplies each of the two double-precision values in the first source XMM register by its corresponding double-precision value in the second source, which can be either an XMM register or a 128-bit memory location. It then subtracts the corresponding double-precision value in the third source from the negated interim product. The results are then placed in the destination XMM register.

The 256-bit version multiplies each of the four double-precision values in the first source YMM register by its corresponding double-precision value in the second source, which can be either a YMM register or a 256-bit memory location. It then subtracts the corresponding double-precision value in the third source from the negated interim product. The results are then placed in the destination YMM register.

If VEX.W is 0, the second source is either a register or memory location and the third source is a register. If VEX.W is 1, the second source is a register and the third source is a register or memory location.

The destination is always either an XMM register or a YMM register, depending on the vector size, as determined by the value of VEX.L. When the destination is a 128-bit XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The intermediate products are not rounded; the infinitely precise products are used in the subtraction. The results of the subtraction are rounded, as specified by the rounding mode in MXCSR.

The VFNMSUBPD instruction is an FMA4 instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | VEX | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VFNMSUBPD xmm1, xmm2, xmm3/mem128, xmm4 | C4 | $\overline{\text{RXB}}$.03 | 0.$\overline{\text{xsrc1}}$.0.01 | 7D /r /is4 |
| VFNMSUBPD ymm1, ymm2, ymm3/mem256, ymm4 | C4 | $\overline{\text{RXB}}$.03 | 0.$\overline{\text{ysrc1}}$.1.01 | 7D /r /is4 |
| VFNMSUBPD xmm1, xmm2, xmm3, xmm4/mem128 | C4 | $\overline{\text{RXB}}$.03 | 1.$\overline{\text{xsrc1}}$.0.01 | 7D /r /is4 |
| VFNMSUBPD ymm1, ymm2, ymm3, ymm4/mem256 | C4 | $\overline{\text{RXB}}$.03 | 1.$\overline{\text{ysrc1}}$.1.01 | 7D /r /is4 |

## Related Instructions

VFNMSUBPS, VFNMSUBSD, VFNMSUBSS

## rFLAGS Affected

None

## MXCSR Flags Affected

| MM | FZ | RC |  | PM | UM | OM | ZM | DM | IM | DAZ | PE | UE | OE | ZE | DE | IE |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  |  |  |  |  |  |  | $M$ | $M$ | $M$ |  | $M$ | $M$ |
| 17 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |

Note: A flag that may be set to one or cleared to zero is $M$ (modified). Unaffected flags are blank.

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | FMA4 instructions are only recognized in protected mode. |
|  |  |  | X | The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h. |
|  |  |  | X | The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set. |
|  |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=0. See SIMD Floating-Point Exceptions, below, for details. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |
| SIMD Floating-Point Exception, \#XF |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=1. See SIMD Floating-Point Exceptions, below, for details. |
| SIMD Floating-Point Exceptions |  |  |  |  |
| Invalid-operation exception (IE) |  |  | X | A source operand was an SNaN value. |
|  |  |  | X | +/-zero was multiplied by +/-infinity. |
|  |  |  | X | +infinity was added to -infinity. |
|  |  |  | X | -infinity was subtracted from -infinity. |
| Denormalized-operand exception (DE) |  |  | X | A source operand was a denormal value. |
| Overflow exception (OE) |  |  | X | A rounded result was too large to fit into the format of the destination operand. |
| Underflow exception (UE) |  |  | X | A rounded result was too small to fit into the format of the destination operand. |
| Precision exception (PE) |  |  | X | A result could not be represented exactly in the destination format. |

## VFNMSUBPS

## Negative Multiply and Subtract Packed Single-Precision Floating-Point

Multiplies each of the packed single-precision floating-point values in the first source by the corresponding packed single-precision floating-point value in the second source, then subtracts the corresponding packed single-precision floating-point values in the third source from the negated products. The results are written to the destination register.

The VFNMSUBPS instruction requires four operands:

$$
\text{VFNMSUBPS dest, src1, src2, src3} \qquad \text{dest} = -(\text{src1} * \text{src2}) - \text{src3}
$$

The 128-bit version multiplies each of the four single-precision values in the first source XMM register by its corresponding single-precision value in the second source, which can be either an XMM register or a 128-bit memory location. It then subtracts the corresponding single-precision value in the third source from the negated interim product. The results are then placed in the destination XMM register.

The 256-bit version multiplies each of the eight single-precision values in the first source YMM register by its corresponding single-precision value in the second source, which can be either a YMM register or a 256-bit memory location. It then subtracts the corresponding single-precision value in the third source from the negated interim product. The results are then placed in the destination YMM register.

If VEX.W is 0, the second source is either a register or memory location and the third source is a register. If VEX.W is 1, the second source is a register and the third source is a register or memory location.

The destination is always either an XMM register or a YMM register, depending on the vector size, as determined by the value of VEX.L. When the destination is a 128-bit XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The intermediate products are not rounded; the infinitely precise products are used in the subtraction. The results of the subtraction are rounded, as specified by the rounding mode in MXCSR.

The VFNMSUBPS instruction is an FMA4 instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | VEX | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VFNMSUBPS xmm1, xmm2, xmm3/mem128, xmm4 | C4 | $\overline{\text{RXB}}$.03 | 0.$\overline{\text{xsrc1}}$.0.01 | 7C /r /is4 |
| VFNMSUBPS ymm1, ymm2, ymm3/mem256, ymm4 | C4 | $\overline{\text{RXB}}$.03 | 0.$\overline{\text{ysrc1}}$.1.01 | 7C /r /is4 |
| VFNMSUBPS xmm1, xmm2, xmm3, xmm4/mem128 | C4 | $\overline{\text{RXB}}$.03 | 1.$\overline{\text{xsrc1}}$.0.01 | 7C /r /is4 |
| VFNMSUBPS ymm1, ymm2, ymm3, ymm4/mem256 | C4 | $\overline{\text{RXB}}$.03 | 1.$\overline{\text{ysrc1}}$.1.01 | 7C /r /is4 |

## Related Instructions

VFNMSUBPD, VFNMSUBSD, VFNMSUBSS

## rFLAGS Affected

None

## MXCSR Flags Affected

| MM | FZ | RC |  | PM | UM | OM | ZM | DM | IM | DAZ | PE | UE | OE | ZE | DE | IE |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  |  |  |  |  |  |  | $M$ | $M$ | $M$ |  | $M$ | $M$ |
| 17 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |

Note: A flag that may be set to one or cleared to zero is $M$ (modified). Unaffected flags are blank.

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | FMA4 instructions are only recognized in protected mode. |
|  |  |  | X | The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h. |
|  |  |  | X | The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set. |
|  |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=0. See SIMD Floating-Point Exceptions, below, for details. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |
| SIMD Floating-Point Exception, \#XF |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=1. See SIMD Floating-Point Exceptions, below, for details. |
| SIMD Floating-Point Exceptions |  |  |  |  |
| Invalid-operation exception (IE) |  |  | X | A source operand was an SNaN value. |
|  |  |  | X | +/-zero was multiplied by +/-infinity. |
|  |  |  | X | +infinity was added to -infinity. |
|  |  |  | X | -infinity was subtracted from -infinity. |
| Denormalized-operand exception (DE) |  |  | X | A source operand was a denormal value. |
| Overflow exception (OE) |  |  | X | A rounded result was too large to fit into the format of the destination operand. |
| Underflow exception (UE) |  |  | X | A rounded result was too small to fit into the format of the destination operand. |
| Precision exception (PE) |  |  | X | A result could not be represented exactly in the destination format. |

## VFNMSUBSD

## Negative Multiply and Subtract Scalar Double-Precision Floating-Point

Multiplies the double-precision floating-point value in the low-order quadword of the first source by the double-precision floating-point value in the low-order quadword of the second source, then subtracts the double-precision floating-point value in the low-order quadword of the third source from the negated interim product. The low-order quadword result is written to the destination.

The VFNMSUBSD instruction requires four operands:

$$
\text{VFNMSUBSD dest, src1, src2, src3} \qquad \text{dest} = -(\text{src1} * \text{src2}) - \text{src3}
$$

The first source is an XMM register indicated by VEX.vvvv.

If VEX.W is 0, the second source is either a register or 64-bit memory location and the third source is a register. If VEX.W is 1, the second source is a register and the third source is a register or 64-bit memory location.

The destination is always an XMM register. All unaffected bits of the destination XMM register (bits 64-127) and its corresponding YMM register (bits 128-255) are cleared to zeros.

The intermediate products are not rounded; the infinitely precise products are used in the subtraction. The results of the subtraction are rounded, as specified by the rounding mode in MXCSR.

The VFNMSUBSD instruction is an FMA4 instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | VEX | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VFNMSUBSD xmm1, xmm2, xmm3/mem64, xmm4 | C4 | $\overline{\text{RXB}}.03$ | 0.$\overline{\text{src1}}$.0.01 | 7F /r /is4 |
| VFNMSUBSD xmm1, xmm2, xmm3, xmm4/mem64 | C4 | $\overline{\text{RXB}}.03$ | 1.$\overline{\text{src1}}$.0.01 | 7F /r /is4 |



## Related Instructions

VFNMSUBPD, VFNMSUBPS, VFNMSUBSS

## rFLAGS Affected

None
MXCSR Flags Affected

| MM | FZ | RC | PM | UM | OM | ZM | DM | IM | DAZ | PE | UE | OE | ZE | DE | IE |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  |  |  |  |  |  |  |  | M | M | M |  | $M$ |
| 17 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |

Note: A flag that may be set to one or cleared to zero is $M$ (modified). Unaffected flags are blank.

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | FMA4 instructions are only recognized in protected mode. |
|  |  |  | X | The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h. |
|  |  |  | X | The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set. |
|  |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=0. See SIMD Floating-Point Exceptions, below, for details. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |
| SIMD Floating-Point Exception, \#XF |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=1. See SIMD Floating-Point Exceptions, below, for details. |
| SIMD Floating-Point Exceptions |  |  |  |  |
| Invalid-operation exception (IE) |  |  | X | A source operand was an SNaN value. |
|  |  |  | X | +/-zero was multiplied by +/-infinity |
|  |  |  | X | +infinity was added to -infinity |
|  |  |  | X | -infinity was subtracted from -infinity |
| Denormalized-operand exception (DE) |  |  | X | A source operand was a denormal value. |
| Overflow exception (OE) |  |  | X | A rounded result was too large to fit into the format of the destination operand. |
| Underflow exception (UE) |  |  | X | A rounded result was too small to fit into the format of the destination operand. |
| Precision exception (PE) |  |  | X | A result could not be represented exactly in the destination format. |

## VFNMSUBSS

## Negative Multiply and Subtract Scalar Single-Precision Floating-Point

Multiplies the single-precision floating-point value in the low-order doubleword of the first source by the single-precision floating-point value in the low-order doubleword of the second source, then subtracts the single-precision floating-point value in the low-order doubleword of the third source from the negated product. The low-order doubleword result is written to the destination.

The VFNMSUBSS instruction requires four operands:

$$
\text{VFNMSUBSS } \mathit{dest}, \mathit{src1}, \mathit{src2}, \mathit{src3} \qquad \mathit{dest} = -(\mathit{src1} * \mathit{src2}) - \mathit{src3}
$$

If VEX.W is 0, the second source is either a register or 32-bit memory location and the third source is a register. If VEX.W is 1, the second source is a register and the third source is a register or 32-bit memory location.

The destination is always an XMM register indicated by VEX.vvvv. All unaffected bits of the destination XMM register (bits 32-127) and its corresponding YMM register (bits 128-255) are cleared to zeros.
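
The destination update can be pictured with a small model in which the 256-bit YMM register is a Python integer. This is an illustrative sketch; the helper names are hypothetical and only the register-image behavior described above is modeled.

```python
import struct

def write_scalar_single(old_ymm, result):
    """Return the YMM image after a scalar single-precision result is written.

    old_ymm is accepted only to show that none of its bits survive:
    bits 0-31 receive the result, bits 32-255 are cleared.
    """
    low32 = struct.unpack("<I", struct.pack("<f", result))[0]
    return low32  # upper 224 bits are zero

def doubleword_lanes(ymm):
    """Split a 256-bit image into eight 32-bit lanes, lowest first."""
    return [(ymm >> (32 * i)) & 0xFFFFFFFF for i in range(8)]

all_ones = (1 << 256) - 1
ymm_after = write_scalar_single(all_ones, 1.5)
print([hex(lane) for lane in doubleword_lanes(ymm_after)])
```

Starting from a register of all ones makes the clearing visible: only the lowest lane is nonzero afterward.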

The intermediate products are not rounded; the infinitely precise products are used in the subtraction. The results of the subtraction are rounded, as specified by the rounding mode in MXCSR.

The VFNMSUBSS instruction is an FMA4 instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | VEX | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VFNMSUBSS xmm1, xmm2, xmm3/mem32, xmm4 | C4 | $\overline{\text{RXB}}.03$ | 0.$\overline{\text{src1}}$.0.01 | 7E /r /is4 |
| VFNMSUBSS xmm1, xmm2, xmm3, xmm4/mem32 | C4 | $\overline{\text{RXB}}.03$ | 1.$\overline{\text{src1}}$.0.01 | 7E /r /is4 |

## Related Instructions

VFNMSUBPD, VFNMSUBPS, VFNMSUBSD
rFLAGS Affected
None
MXCSR Flags Affected

| MM | FZ | RC | PM | UM | OM | ZM | DM | IM | DAZ | PE | UE | OE | ZE | DE | IE |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  |  |  |  |  |  |  | $M$ | $M$ | $M$ |  | $M$ | $M$ |
| 17 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |

Note: A flag that may be set to one or cleared to zero is $M$ (modified). Unaffected flags are blank.

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | FMA4 instructions are only recognized in protected mode. |
|  |  |  | X | The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h. |
|  |  |  | X | The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set. |
|  |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=0. See SIMD Floating-Point Exceptions, below, for details. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |
| SIMD Floating-Point Exception, \#XF |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=1. See SIMD Floating-Point Exceptions, below, for details. |
| SIMD Floating-Point Exceptions |  |  |  |  |
| Invalid-operation exception (IE) |  |  | X | A source operand was an SNaN value. |
|  |  |  | X | +/-zero was multiplied by +/-infinity |
|  |  |  | X | +infinity was added to -infinity |
|  |  |  | X | -infinity was subtracted from -infinity |
| Denormalized-operand exception (DE) |  |  | X | A source operand was a denormal value. |
| Overflow exception (OE) |  |  | X | A rounded result was too large to fit into the format of the destination operand. |
| Underflow exception (UE) |  |  | X | A rounded result was too small to fit into the format of the destination operand. |
| Precision exception (PE) |  |  | X | A result could not be represented exactly in the destination format. |
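
The feature and operating-system checks listed under \#UD above can be combined into a single availability test. The sketch below decodes register values that the caller has already obtained (Python cannot execute CPUID or XGETBV directly). The function name is hypothetical, and placing the XMM and YMM enable bits at XFEATURE_ENABLED_MASK[2:1] follows the wording used for the XOP instructions in this volume; treat it as an assumption for FMA4.

```python
FMA4_BIT = 16          # ECX bit 16 of CPUID function 8000_0001h
OSXSAVE_BIT = 27       # ECX bit 27 of CPUID function 0000_0001h
XMM_YMM_BITS = 0b110   # XFEATURE_ENABLED_MASK[2:1] (assumed bit positions)

def fma4_usable(ecx_8000_0001, ecx_0000_0001, xfeature_enabled_mask):
    """Return True when none of the feature-related #UD causes apply."""
    fma4_supported = bool((ecx_8000_0001 >> FMA4_BIT) & 1)
    osxsave_enabled = bool((ecx_0000_0001 >> OSXSAVE_BIT) & 1)
    ymm_state_enabled = (xfeature_enabled_mask & XMM_YMM_BITS) == XMM_YMM_BITS
    return fma4_supported and osxsave_enabled and ymm_state_enabled

print(fma4_usable(1 << 16, 1 << 27, 0b111))
```

A real program would obtain the three inputs from CPUID and XGETBV through a compiler intrinsic or inline assembly before calling a check of this kind.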

## VFRCZPD

## Extract Fraction Packed Double-Precision Floating-Point

Extracts the fractional portion of each double-precision floating-point value in a source register or memory location and writes the resulting values in the corresponding elements of the destination register. The fractional results are precise.

If XOP.L is 0, the source is an XMM register or 128-bit memory location; if XOP.L is 1, the source is a YMM register or 256-bit memory location.

The destination is always either an XMM register or a YMM register, depending on the vector size, as determined by the value of XOP.L. When the destination is a 128-bit XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The rounding mode indicated in the MXCSR is ignored unless the input is an integer, a zero, or a denormal value that is coerced to zero by MXCSR.DAZ, in which case the sign of the resultant zero is a function of MXCSR.RC:

| MXCSR.RC | Result |
| :--- | :---: |
| Round down | -0 |
| Round to nearest | +0 |
| Round up | +0 |
| Round toward zero | +0 |

If the source value is QNaN, it is written to the destination with no exception generated. If the source value is infinity, the instruction returns an indefinite value when the invalid-operation exception (IE) is masked. If the source value is an integer, the instruction returns zero. The sign of the instruction result is the same as the input.
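
The per-element behavior described above can be summarized in a short software model. This is a sketch under stated assumptions: the function name and the string labels for the rounding mode are hypothetical, the indefinite value is represented by an ordinary NaN, exception flags and SNaN signaling are not modeled, and zero results take their sign from the rounding-mode table.

```python
import math
import sys

def frcz_model(x, rc="nearest", daz=False):
    """Model one double-precision element of a fraction-extract operation."""
    if math.isnan(x):
        return x                      # QNaN is passed through unchanged
    if math.isinf(x):
        return math.nan               # stand-in for the indefinite value (IE masked)
    if daz and x != 0.0 and abs(x) < sys.float_info.min:
        x = 0.0                       # denormal coerced to zero by MXCSR.DAZ
    frac = math.modf(x)[0]            # exact fraction, carries the sign of x
    if frac == 0.0:
        # Integer or zero input: the sign of the zero depends on MXCSR.RC.
        return -0.0 if rc == "down" else 0.0
    return frac

print(frcz_model(2.75), frcz_model(-2.75), frcz_model(3.0, rc="down"))
```

The fraction of a finite double is always exactly representable, which is why no rounding is needed for nonzero results.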

The VFRCZPD instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VFRCZPD xmm1, xmm2/mem128 | 8F | $\overline{\text{RXB}}.09$ | 0.1111.0.00 | 81 /r |
| VFRCZPD ymm1, ymm2/mem256 | 8F | $\overline{\text{RXB}}.09$ | 0.1111.1.00 | 81 /r |
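
As a reading aid for the encoding columns, the sketch below assembles the bytes of a register-to-register form. It assumes the usual three-byte prefix layout (inverted R, X, B in bits 7:5 and mmmmm in bits 4:0 of the second byte; W, vvvv, L, pp in the third byte) and handles only registers 0 through 7, so the inverted R, X and B bits are all 1. The function name is hypothetical.

```python
def encode_xop_rr(map_select, w, vvvv, l, pp, opcode, reg, rm):
    """Encode an XOP instruction with a register source (registers 0-7 only)."""
    # Second byte: inverted R, X, B are all 1 when no register extension is used.
    byte1 = (0b111 << 5) | (map_select & 0x1F)
    # Third byte: W, vvvv exactly as listed in the table (1111 = unused), L, pp.
    byte2 = ((w & 1) << 7) | ((vvvv & 0xF) << 3) | ((l & 1) << 2) | (pp & 0b11)
    # ModRM with mod = 11b: reg is the destination, rm is the source register.
    modrm = (0b11 << 6) | ((reg & 7) << 3) | (rm & 7)
    return bytes([0x8F, byte1, byte2, opcode, modrm])

# VFRCZPD xmm0, xmm1 (L = 0) and VFRCZPD ymm2, ymm3 (L = 1)
xmm_form = encode_xop_rr(0x09, 0, 0b1111, 0, 0b00, 0x81, reg=0, rm=1)
ymm_form = encode_xop_rr(0x09, 0, 0b1111, 1, 0b00, 0x81, reg=2, rm=3)
print(xmm_form.hex(), ymm_form.hex())
```

Memory operands, SIB bytes and displacements are outside the scope of this sketch.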

## Related Instructions

ROUNDPD, ROUNDPS, ROUNDSD, ROUNDSS, VFRCZPS, VFRCZSS, VFRCZSD

## rFLAGS Affected

None

## MXCSR Flags Affected

| MM | FZ | RC |  | PM | UM | OM | ZM | DM | IM | DAZ | PE | UE | OE | ZE | DE | IE |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  |  |  |  |  |  |  |  | M |  |  | M | M |
| 17 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |

Note: A flag that may be set to one or cleared to zero is $M$ (modified). Unaffected flags are blank.

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=0. See SIMD Floating-Point Exceptions, below, for details. |
|  |  |  | X | XOP.W was set to 1. |
|  |  |  | X | XOP.vvvv was not 1111b. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled while MXCSR.MM=1. |
| SIMD Floating-Point Exception, \#XF |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=1. See SIMD Floating-Point Exceptions, below, for details. |
| SIMD Floating-Point Exceptions |  |  |  |  |
| Invalid-operation exception (IE) |  |  | X | A source operand was an SNaN value or infinity. |
| Denormalized-operand exception (DE) |  |  | X | A source operand was a denormal value. |
| Underflow exception (UE) |  |  | X | A rounded result was too small to fit into the format of the destination operand. |


## VFRCZPS

## Extract Fraction Packed Single-Precision Floating-Point


Extracts the fractional portion of each of the single-precision floating-point values in a source register or memory location and writes the resulting values to the corresponding elements of the destination register. The fractional results are exact.

If XOP.L is 0, the source is an XMM register or 128-bit memory location; if XOP.L is 1, the source is a YMM register or 256-bit memory location.

The destination is always an XMM register or a YMM register, depending on the vector size, as determined by the value of XOP.L. When the destination is a 128-bit XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The rounding mode indicated in the MXCSR is ignored unless the input is an integer, a zero, or a denormal value that is coerced to zero by MXCSR.DAZ, in which case the sign of the resultant zero is a function of MXCSR.RC:

| MXCSR.RC | Result |
| :--- | :---: |
| Round down | -0 |
| Round to nearest | +0 |
| Round up | +0 |
| Round toward zero | +0 |

If the source value is QNaN, it is written to the destination with no exception generated. If the source value is infinity, the instruction returns an indefinite value when the invalid-operation exception (IE) is masked. If the source value is an integer, the instruction returns zero. The sign of the instruction result is the same as the input.

The VFRCZPS instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VFRCZPS xmm1, xmm2/mem128 | 8F | $\overline{\text{RXB}}.09$ | 0.1111.0.00 | 80 /r |
| VFRCZPS ymm1, ymm2/mem256 | 8F | $\overline{\text{RXB}}.09$ | 0.1111.1.00 | 80 /r |

## Related Instructions

ROUNDPD, ROUNDPS, ROUNDSD, ROUNDSS, VFRCZPD, VFRCZSS, VFRCZSD

## rFLAGS Affected

None

## MXCSR Flags Affected

| MM | FZ | RC |  | PM | UM | OM | ZM | DM | IM | DAZ | PE | UE | OE | ZE | DE | IE |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  |  |  |  |  |  |  |  | M |  |  | M | M |
| 17 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |

Note: A flag that may be set to one or cleared to zero is $M$ (modified). Unaffected flags are blank.

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=0. See SIMD Floating-Point Exceptions, below, for details. |
|  |  |  | X | XOP.W was set to 1. |
|  |  |  | X | XOP.vvvv was not 1111b. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |
| SIMD Floating-Point Exception, \#XF |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=1. See SIMD Floating-Point Exceptions, below, for details. |
| SIMD Floating-Point Exceptions |  |  |  |  |
| Invalid-operation exception (IE) |  |  | X | A source operand was an SNaN value or infinity. |
| Denormalized-operand exception (DE) |  |  | X | A source operand was a denormal value. |
| Underflow exception (UE) |  |  | X | A rounded result was too small to fit into the format of the destination operand. |

## VFRCZSD

## Extract Fraction Scalar Double-Precision Floating-Point

Extracts the fractional portion of the double-precision floating-point value in the low-order quadword of an XMM register or 64-bit memory location and writes the result in the low-order quadword of the destination XMM register. The fractional results are precise.

When the result is written to the destination XMM register, the upper quadword of the destination register and the upper 128 bits of the corresponding YMM register are cleared to zeros.

The rounding mode indicated in the MXCSR is ignored unless the input is an integer, a zero, or a denormal value that is coerced to zero by MXCSR.DAZ, in which case the sign of the resultant zero is a function of MXCSR.RC:

| MXCSR.RC | Result |
| :--- | :---: |
| Round down | -0 |
| Round to nearest | +0 |
| Round up | +0 |
| Round toward zero | +0 |

If the source value is QNaN, it is written to the destination with no exception generated. If the source value is infinity, the instruction returns an indefinite value when the invalid-operation exception (IE) is masked. If the source value is an integer, the instruction returns zero. The sign of the instruction result is the same as the input.

The VFRCZSD instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VFRCZSD xmm1, xmm2/mem64 | 8F | $\overline{\text{RXB}}.09$ | 0.1111.0.00 | 83 /r |

## Related Instructions

ROUNDPD, ROUNDPS, ROUNDSD, ROUNDSS, VFRCZPS, VFRCZPD, VFRCZSS

## rFLAGS Affected

None

## MXCSR Flags Affected

| MM | FZ | RC |  | PM | UM | OM | ZM | DM | IM | DAZ | PE | UE | OE | ZE | DE | IE |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  |  |  |  |  |  |  |  | M |  |  | M | M |
| 17 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |

Note: A flag that may be set to one or cleared to zero is $M$ (modified). Unaffected flags are blank.

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Denormalized-operand exception (DE) |  |  | X | A source operand was a denormal value. |
| Underflow exception (UE) |  |  | X | A rounded result was too small to fit into the format of the destination operand. |

## VFRCZSS

## Extract Fraction Scalar Single-Precision Floating-Point

Extracts the fractional portion of the single-precision floating-point value in the low-order doubleword of an XMM register or 32-bit memory location and writes the result in the low-order doubleword in the destination XMM register. The fractional results are precise.

When the result is written to the destination XMM register, the upper three doublewords of the destination register and the upper 128 bits of the corresponding YMM register are cleared to zeros. That is, the upper 224 bits of the YMM destination register are cleared to zeros.

The rounding mode indicated in the MXCSR is ignored unless the input is an integer, a zero, or a denormal value that is coerced to zero by MXCSR.DAZ, in which case the sign of the resultant zero is a function of MXCSR.RC:

| MXCSR.RC | Result |
| :--- | :---: |
| Round down | -0 |
| Round to nearest | +0 |
| Round up | +0 |
| Round toward zero | +0 |

If the source value is QNaN, it is written to the destination with no exception generated. If the source value is infinity, the instruction returns an indefinite value when the invalid-operation exception (IE) is masked. If the source value is an integer, the instruction returns zero. The sign of the instruction result is the same as the input.

The VFRCZSS instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VFRCZSS xmm1, xmm2/mem32 | 8F | $\overline{\text{RXB}}.09$ | 0.1111.0.00 | 82 /r |



## Related Instructions

ROUNDPD, ROUNDPS, ROUNDSD, ROUNDSS, VFRCZPS, VFRCZPD, VFRCZSD

## rFLAGS Affected

None

## MXCSR Flags Affected

| MM | FZ | RC |  | PM | UM | OM | ZM | DM | IM | DAZ | PE | UE | OE | ZE | DE | IE |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  |  |  |  |  |  |  |  | M |  |  | M | M |
| 17 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |

Note: A flag that may be set to one or cleared to zero is $M$ (modified). Unaffected flags are blank.

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=0. See SIMD Floating-Point Exceptions, below, for details. |
|  |  |  | X | XOP.W was set to 1. |
|  |  |  | X | XOP.vvvv was not 1111b. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |
| SIMD Floating-Point Exception, \#XF |  |  | X | There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT=1. See SIMD Floating-Point Exceptions, below, for details. |
| SIMD Floating-Point Exceptions |  |  |  |  |
| Invalid-operation exception (IE) |  |  | X | A source operand was an SNaN value or infinity. |
| Denormalized-operand exception (DE) |  |  | X | A source operand was a denormal value. |
| Underflow exception (UE) |  |  | X | A rounded result was too small to fit into the format of the destination operand. |

## VPCMOV

## Vector Conditional Moves

Moves bits of either the first source or the second source into their corresponding positions in the destination, depending on the value of the corresponding selector bit in the selector. If the selector bit is set to 1, the corresponding bit in the first source is moved to the destination; otherwise, the corresponding bit from the second source is moved to the destination.

This instruction directly implements the C-language ternary "?:" operation on each of the source bits.

Arbitrary bit-granular predicates can be constructed by any number of methods, or loaded as constants from memory. The VPCMOV instruction may use the results of any SSE instructions as the predicate in the selector. VPCMPEQB (VPCMPGTB), VPCMPEQW (VPCMPGTW), VPCMPEQD (VPCMPGTD) and VPCMPEQQ (VPCMPGTQ) compare byte, word, doubleword and quadword integers, respectively, and set the predicate in the destination register to masks of 1s and 0s accordingly. VCMPPS (VCMPSS) and VCMPPD (VCMPSD) compare single-precision and double-precision floating-point source values, respectively, and provide the predicate for the floating-point instructions.
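
The bit-level selection can be written as a single expression on integers. The sketch below models 128-bit or 256-bit registers as Python integers; the function name is hypothetical.

```python
def vpcmov_model(src1, src2, selector, bits=128):
    """Bitwise select: take src1 where the selector bit is 1, else src2."""
    mask = (1 << bits) - 1
    return ((src1 & selector) | (src2 & ~selector)) & mask

def select_bit_by_bit(src1, src2, selector, bits=128):
    """Reference version that applies the ternary operator to every bit."""
    out = 0
    for i in range(bits):
        s1, s2, sel = (src1 >> i) & 1, (src2 >> i) & 1, (selector >> i) & 1
        out |= (s1 if sel else s2) << i
    return out

result = vpcmov_model(0xFFFF0000, 0x0000FFFF, 0xFF00FF00)
print(hex(result))
```

Both functions compute the same value; the loop only spells out the per-bit "selector ? src1 : src2" reading of the operation.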

The VPCMOV instruction requires four operands:

VPCMOV dest, src1, src2, selector

The vector size is determined by the value of XOP.L. All moves are 128 bits in length if XOP.L is cleared to 0 and 256 bits in length if XOP.L is set to 1. The sources are the same size as the destination.

The first source (src1) is always an XMM or YMM register specified by XOP.vvvv.

This instruction supports operand configuration using XOP.W. When XOP.W is 0, the second source (src2) is an XMM or YMM register or 128- or 256-bit memory location specified by MODRM.rm and the selector is an XMM or YMM register specified by imm8[7:4]. When XOP.W is 1, the second source (src2) is an XMM or YMM register specified by imm8[7:4] and the selector is an XMM or YMM register or 128- or 256-bit memory location specified by MODRM.rm.

The destination (dest) is always either an XMM register or a YMM register, depending on the vector size, as determined by the value of XOP.L. When the destination is a 128-bit XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The VPCMOV instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPCMOV xmm1, xmm2, xmm3/mem128, xmm4 | 8F | $\overline{\text{RXB}}.08$ | 0.$\overline{\text{src1}}$.0.00 | A2 /r imm[7:4] |
| VPCMOV ymm1, ymm2, ymm3/mem256, ymm4 | 8F | $\overline{\text{RXB}}.08$ | 0.$\overline{\text{src1}}$.1.00 | A2 /r imm[7:4] |
| VPCMOV xmm1, xmm2, xmm3, xmm4/mem128 | 8F | $\overline{\text{RXB}}.08$ | 1.$\overline{\text{src1}}$.0.00 | A2 /r imm[7:4] |
| VPCMOV ymm1, ymm2, ymm3, ymm4/mem256 | 8F | $\overline{\text{RXB}}.08$ | 1.$\overline{\text{src1}}$.1.00 | A2 /r imm[7:4] |

## Related Instructions

VPCOMUB, VPCOMUD, VPCOMUQ, VPCOMUW, VCMPPD, VCMPPS

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |

## VPCOMB

## Compare Vector Signed Bytes

Compares corresponding packed signed bytes in the first and second sources and writes the result of each comparison in the corresponding byte of the destination. The result of each comparison is an 8-bit value of all 1s (TRUE) or all 0s (FALSE).

The VPCOMB instruction requires four operands:

VPCOMB dest, src1, src2, comp

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the comparison results are written to the destination XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv field and the second source (src2) is an XMM register or 128-bit memory location specified by the MODRM.rm field.

The comp type is specified by the three low-order bits of an immediate byte, as shown in Table 1. The VPCOMB instruction with an appropriate value of imm8 is aliased to the following mnemonics to facilitate coding.

Table 1. VPCOMB Comparison Operations

| Mnemonic | Implied Value of imm8 | Comparison Operation |
| :--- | :---: | :---: |
| VPCOMLTB | 0 | Less Than |
| VPCOMLEB | 1 | Less Than or Equal |
| VPCOMGTB | 2 | Greater Than |
| VPCOMGEB | 3 | Greater Than or Equal |
| VPCOMEQB | 4 | Equal |
| VPCOMNEQB | 5 | Not Equal |
| VPCOMFALSEB | 6 | False |
| VPCOMTRUEB | 7 | True |
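
The comparison can be modeled in a few lines. The sketch below assumes the three low-order bits of imm8 select the predicates in the order 0 = less than, 1 = less than or equal, 2 = greater than, 3 = greater than or equal, 4 = equal, 5 = not equal, 6 = false, 7 = true. Registers are modeled as 16-byte `bytes` objects, and all names are hypothetical.

```python
import operator

# Predicate selected by imm8[2:0] (assumed assignment, see lead-in).
PREDICATES = {
    0: operator.lt, 1: operator.le, 2: operator.gt, 3: operator.ge,
    4: operator.eq, 5: operator.ne,
    6: lambda a, b: False, 7: lambda a, b: True,
}

def to_signed8(value):
    """Reinterpret an unsigned byte (0-255) as a signed byte (-128 to 127)."""
    return value - 256 if value >= 128 else value

def vpcomb_model(src1, src2, imm8):
    """Compare signed bytes lane by lane; each result byte is 0xFF or 0x00."""
    pred = PREDICATES[imm8 & 0b111]
    return bytes(0xFF if pred(to_signed8(a), to_signed8(b)) else 0x00
                 for a, b in zip(src1, src2))

src1 = bytes([0x80, 0x7F, 0x00, 0x05] + [0] * 12)   # -128, 127, 0, 5, ...
src2 = bytes([0x01, 0xFF, 0x00, 0x05] + [0] * 12)   #    1,  -1, 0, 5, ...
less_than = vpcomb_model(src1, src2, 0)
equal = vpcomb_model(src1, src2, 4)
print(less_than.hex(), equal.hex())
```

The first lane shows why the signed interpretation matters: 0x80 is -128, so it compares less than 0x01 even though its unsigned value is larger.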

The VPCOMB instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPCOMB xmm1, xmm2, xmm3/mem128, imm8 | 8F | $\overline{\text{RXB}}.08$ | 0.$\overline{\text{src1}}$.0.00 | CC /r imm8 |



## Related Instructions

VPCOMUB, VPCOMUW, VPCOMUD, VPCOMUQ, VPCOMW, VPCOMD, VPCOMQ

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.W was set to 1. |
|  |  |  | X | XOP.L was set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |
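The feature-related \#UD causes in the table reduce to a few bit tests. The following sketch works on register values that the caller has already obtained; it does not execute CPUID or read control registers itself. The function name `xop_ud_causes` and its parameters are hypothetical, and only the bit positions named in the table are used.

```python
XOP_BIT = 1 << 11        # ECX bit 11 of CPUID function 8000_0001h
OSXSAVE_BIT = 1 << 27    # ECX bit 27 of CPUID function 0000_0001h
YMM_STATE_BITS = 0b110   # XFEATURE_ENABLED_MASK[2:1]


def xop_ud_causes(ext_ecx, std_ecx, xfeature_enabled_mask, cr0_em=False):
    """List the feature-related reasons an XOP instruction would raise #UD.

    ext_ecx: ECX returned by CPUID function 8000_0001h (supplied by caller).
    std_ecx: ECX returned by CPUID function 0000_0001h (supplied by caller).
    """
    causes = []
    if not (ext_ecx & XOP_BIT):
        causes.append("XOP not supported")
    if cr0_em:
        causes.append("CR0.EM set")
    if not (std_ecx & OSXSAVE_BIT):
        causes.append("CR4.OSXSAVE cleared")
    if (xfeature_enabled_mask & YMM_STATE_BITS) != YMM_STATE_BITS:
        causes.append("YMM state not enabled")
    return causes
```

An empty list means none of these conditions applies; encoding-related causes such as XOP.W or XOP.L being set are outside the scope of this sketch.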

## VPCOMD

## Compare Vector Signed Doublewords

Compares corresponding packed signed doublewords in the first and second sources and writes the result of each comparison in the corresponding doubleword of the destination. The result of each comparison is a 32-bit value of all 1s (TRUE) or all 0s (FALSE).

The VPCOMD instruction requires four operands:

VPCOMD dest, src1, src2, comp

The destination is an XMM register addressed by the MODRM.reg field. When the results of the comparisons are written to the destination XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv field and the second source (src2) is an XMM register or 128-bit memory location specified by the MODRM.rm field.

The comp type is specified by the three low-order bits of an immediate-byte, as shown in Table 2. The VPCOMD instruction with an appropriate value of imm8 is aliased to the following mnemonics to facilitate coding.

Table 2. VPCOMD Comparison Operations

| Mnemonic | Implied Value of imm8 | Comparison Operation |
| :--- | :---: | :---: |
| VPCOMLTD | 0 | Less Than |
| VPCOMLED | 1 | Less Than or Equal |
| VPCOMGTD | 2 | Greater Than |
| VPCOMGED | 3 | Greater Than or Equal |
| VPCOMEQD | 4 | Equal |
| VPCOMNEQD | 5 | Not Equal |
| VPCOMFALSED | 6 | False |
| VPCOMTRUED | 7 | True |
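The doubleword form can also be modeled on raw register contents rather than element lists. The sketch below treats each source as a 128-bit Python integer and returns the value the full 256-bit YMM register would hold after the write. It is an assumption-laden illustration: `vpcomltd` and `to_signed` are hypothetical names, lane 0 is assumed to occupy bits 31:0, and only the Less Than case (imm8 = 0) is shown.

```python
def to_signed(value, bits):
    """Interpret an unsigned bit pattern as a two's-complement integer."""
    return value - (1 << bits) if value >> (bits - 1) else value


def vpcomltd(src1, src2):
    """Model VPCOMLTD on 128-bit integers; returns the 256-bit YMM contents."""
    lane_mask = 0xFFFFFFFF
    result = 0
    for lane in range(4):
        shift = 32 * lane
        a = to_signed((src1 >> shift) & lane_mask, 32)
        b = to_signed((src2 >> shift) & lane_mask, 32)
        if a < b:
            result |= lane_mask << shift   # all 1s marks TRUE
    # Only bits 127:0 can be set, so bits 255:128 of the result are zero,
    # matching the clearing of the upper half of the YMM register.
    return result
```

Because the comparison is signed, the pattern FFFFFFFFh in a lane is treated as -1 and compares below zero.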

The VPCOMD instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)


## Related Instructions

VPCOMUB, VPCOMUW, VPCOMUD, VPCOMUQ, VPCOMB, VPCOMW, VPCOMQ

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :--- | :---: | :---: | :---: | :--- |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.W was set to 1. |
|  |  |  | X | XOP.L was set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled and MXCSR.MM=1. |




## VPCOMQ

## Compare Vector Signed Quadwords

Compares corresponding packed signed quadwords in the first and second sources and writes the result of each comparison in the corresponding quadword of the destination. The result of each comparison is a 64-bit value of all 1s (TRUE) or all 0s (FALSE).

The VPCOMQ instruction requires four operands:
VPCOMQ dest, src1, src2, comp

The destination is an XMM register addressed by the MODRM.reg field. When the result is written to the destination XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source is an XMM register specified by the XOP.vvvv field and the second source is an XMM register or 128-bit memory location specified by the MODRM.rm field.

The comp type is specified by the three low-order bits of an immediate byte, as shown in Table 3. The VPCOMQ instruction with an appropriate value of imm8 is aliased to the following mnemonics to facilitate coding.

Table 3. VPCOMQ Comparison Operations

| Mnemonic | Implied Value of imm8 | Comparison Operation |
| :--- | :---: | :---: |
| VPCOMLTQ | 0 | Less Than |
| VPCOMLEQ | 1 | Less Than or Equal |
| VPCOMGTQ | 2 | Greater Than |
| VPCOMGEQ | 3 | Greater Than or Equal |
| VPCOMEQQ | 4 | Equal |
| VPCOMNEQQ | 5 | Not Equal |
| VPCOMFALSEQ | 6 | False |
| VPCOMTRUEQ | 7 | True |
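An assembler-style view of the aliasing is a lookup from alias mnemonic to base instruction and implied imm8. The sketch below is hypothetical: `resolve_alias` is not part of any real tool, the condition names are assumed to be LT, LE, GT, GE, EQ, NEQ, FALSE and TRUE for values 0 through 7, and the element suffixes are taken from the VPCOM instruction names in this section.

```python
# Assumed condition spellings; the list index is the implied imm8 value.
CONDITIONS = ["LT", "LE", "GT", "GE", "EQ", "NEQ", "FALSE", "TRUE"]
ELEMENT_SUFFIXES = {"B", "W", "D", "Q", "UB", "UW", "UD", "UQ"}


def resolve_alias(mnemonic):
    """Map an alias such as VPCOMGEQ to its base mnemonic and imm8 value."""
    name = mnemonic.upper()
    if not name.startswith("VPCOM"):
        raise ValueError("not a VPCOM alias: " + mnemonic)
    tail = name[len("VPCOM"):]
    for imm8, condition in enumerate(CONDITIONS):
        suffix = tail[len(condition):]
        # Accept only a condition followed by a known element suffix.
        if tail.startswith(condition) and suffix in ELEMENT_SUFFIXES:
            return "VPCOM" + suffix, imm8
    raise ValueError("unrecognized VPCOM alias: " + mnemonic)
```

Requiring a valid element suffix after the condition keeps names such as VPCOMLEQ unambiguous: it parses as LE applied to quadwords, not as an incomplete EQ form.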

The VPCOMQ instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :--- | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPCOMQ xmm1, xmm2, xmm3/mem128, imm8 | 8F | $\overline{\text{RXB}}.08$ | $0.\overline{\text{src1}}.0.00$ | CF /r imm8 |



## Related Instructions

VPCOMUB, VPCOMUW, VPCOMUD, VPCOMUQ, VPCOMB, VPCOMW, VPCOMD

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.W was set to 1. |
|  |  |  | X | XOP.L was set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled and MXCSR.MM=1. |

## VPCOMUB

## Compare Vector Unsigned Bytes

Compares corresponding packed unsigned bytes in the first and second sources and writes the result of each comparison in the corresponding byte of the destination. The result of each comparison is an 8-bit value of all 1s (TRUE) or all 0s (FALSE).

The VPCOMUB instruction requires four operands:

> VPCOMUB dest, src1, src2, comp

The destination is an XMM register addressed by the MODRM.reg field. When the result is written to the destination XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source is an XMM register specified by the XOP.vvvv field and the second source is an XMM register or 128-bit memory location specified by the MODRM.rm field.

The comp type is specified by the three low-order bits of an immediate-byte, as shown in Table 4. The VPCOMUB instruction with an appropriate value of imm8 is aliased to the following mnemonics to facilitate coding.

Table 4. VPCOMUB Comparison Operations

| Mnemonic | Implied Value of imm8 | Comparison Operation |
| :--- | :---: | :---: |
| VPCOMLTUB | 0 | Less Than |
| VPCOMLEUB | 1 | Less Than or Equal |
| VPCOMGTUB | 2 | Greater Than |
| VPCOMGEUB | 3 | Greater Than or Equal |
| VPCOMEQUB | 4 | Equal |
| VPCOMNEQUB | 5 | Not Equal |
| VPCOMFALSEUB | 6 | False |
| VPCOMTRUEUB | 7 | True |
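The practical difference between the unsigned and signed byte forms shows up when the same bit patterns are compared both ways. The following sketch is illustrative only: `compare_lt_bytes` is a hypothetical helper, only the Less Than comparison is modeled, and four elements are shown instead of the full sixteen to keep the example short.

```python
def compare_lt_bytes(src1, src2, signed):
    """Per-byte Less Than on raw byte patterns (0..255), giving 0xFF or 0x00."""
    def value(raw):
        # Signed view: patterns 80h..FFh represent -128..-1.
        return raw - 256 if signed and raw >= 0x80 else raw
    return [0xFF if value(a) < value(b) else 0x00 for a, b in zip(src1, src2)]


src1 = [0xFF, 0x01, 0x80, 0x7F]
src2 = [0x01, 0xFF, 0x7F, 0x80]

# Unsigned interpretation, as VPCOMLTUB would apply it.
unsigned_result = compare_lt_bytes(src1, src2, signed=False)
# Signed interpretation, as VPCOMLTB would apply it.
signed_result = compare_lt_bytes(src1, src2, signed=True)
```

For these inputs the two results are exact complements, because every pair straddles the 7Fh/80h boundary where the signed and unsigned orderings disagree.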

The VPCOMUB instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :--- | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPCOMUB xmm1, xmm2, xmm3/mem128, imm8 | 8F | $\overline{\text{RXB}}.08$ | $0.\overline{\text{src1}}.0.00$ | EC /r imm8 |



## Related Instructions

VPCOMUW, VPCOMUD, VPCOMUQ, VPCOMB, VPCOMW, VPCOMD, VPCOMQ

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.W was set to 1. |
|  |  |  | X | XOP.L was set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled and MXCSR.MM=1. |

## VPCOMUD

## Compare Vector Unsigned Doublewords

Compares corresponding packed unsigned doublewords in the first and second sources and writes the result of each comparison in the corresponding doubleword of the destination. The result of each comparison is a 32-bit value of all 1s (TRUE) or all 0s (FALSE).

The VPCOMUD instruction requires four operands:

VPCOMUD dest, src1, src2, comp

The destination is an XMM register addressed by the MODRM.reg field. When the results are written to the destination XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source is an XMM register specified by the XOP.vvvv field and the second source is an XMM register or 128-bit memory location specified by the MODRM.rm field.

The comp type is specified by the three low-order bits of an immediate byte, as shown in Table 5. The VPCOMUD instruction with an appropriate value of imm8 is aliased to the following mnemonics to facilitate coding.

Table 5. VPCOMUD Comparison Operations

| Mnemonic | Implied Value of imm8 | Comparison Operation |
| :--- | :---: | :---: |
| VPCOMLTUD | 0 | Less Than |
| VPCOMLEUD | 1 | Less Than or Equal |
| VPCOMGTUD | 2 | Greater Than |
| VPCOMGEUD | 3 | Greater Than or Equal |
| VPCOMEQUD | 4 | Equal |
| VPCOMNEQUD | 5 | Not Equal |
| VPCOMFALSEUD | 6 | False |
| VPCOMTRUEUD | 7 | True |

The VPCOMUD instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :--- | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPCOMUD xmm1, xmm2, xmm3/mem128, imm8 | 8F | $\overline{\text{RXB}}.08$ | $0.\overline{\text{src1}}.0.00$ | EE /r imm8 |



## Related Instructions

VPCOMUB, VPCOMUW, VPCOMUQ, VPCOMB, VPCOMW, VPCOMD, VPCOMQ

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.W was set to 1. |
|  |  |  | X | XOP.L was set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled and MXCSR.MM=1. |

## VPCOMUQ

## Compare Vector Unsigned Quadwords

Compares corresponding packed unsigned quadwords in the first and second sources and writes the result of each comparison in the corresponding quadword of the destination. The result of each comparison is a 64-bit value of all 1s (TRUE) or all 0s (FALSE).

The VPCOMUQ instruction requires four operands:

> VPCOMUQ dest, src1, src2, comp

The destination is an XMM register addressed by the MODRM.reg field. When the results are written to the destination XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source is an XMM register specified by the XOP.vvvv field and the second source is an XMM register or 128-bit memory location specified by the MODRM.rm field.

The comp type is specified by the three low-order bits of an immediate-byte, as shown in Table 6. The VPCOMUQ instruction with an appropriate value of imm8 is aliased to the following mnemonics to facilitate coding.

Table 6. VPCOMUQ Comparison Operations

| Mnemonic | Implied Value of imm8 | Comparison Operation |
| :--- | :---: | :---: |
| VPCOMLTUQ | 0 | Less Than |
| VPCOMLEUQ | 1 | Less Than or Equal |
| VPCOMGTUQ | 2 | Greater Than |
| VPCOMGEUQ | 3 | Greater Than or Equal |
| VPCOMEQUQ | 4 | Equal |
| VPCOMNEQUQ | 5 | Not Equal |
| VPCOMFALSEUQ | 6 | False |
| VPCOMTRUEUQ | 7 | True |

The VPCOMUQ instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :--- | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPCOMUQ xmm1, xmm2, xmm3/mem128, imm8 | 8F | $\overline{\text{RXB}}.08$ | $0.\overline{\text{src1}}.0.00$ | EF /r imm8 |



## Related Instructions

VPCOMUB, VPCOMUW, VPCOMUD, VPCOMB, VPCOMW, VPCOMD, VPCOMQ

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.W was set to 1. |
|  |  |  | X | XOP.L was set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled and MXCSR.MM=1. |

## VPCOMUW

## Compare Vector Unsigned Words

Compares corresponding packed unsigned words in the first and second sources and writes the result of each comparison in the corresponding word of the destination. The result of each comparison is a 16-bit value of all 1s (TRUE) or all 0s (FALSE).

The VPCOMUW instruction requires four operands:

VPCOMUW dest, src1, src2, comp

The destination is an XMM register addressed by the MODRM.reg field. When the results are written to the destination XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source is an XMM register specified by the XOP.vvvv field and the second source is an XMM register or 128-bit memory location specified by the MODRM.rm field.

The comp type is specified by the three low-order bits of an immediate-byte, as defined in Table 7. The VPCOMUW instruction with an appropriate value of imm8 is aliased to the following mnemonics to facilitate coding.

Table 7. VPCOMUW Comparison Operations

| Mnemonic | Implied Value of imm8 | Comparison Operation |
| :--- | :---: | :---: |
| VPCOMLTUW | 0 | Less Than |
| VPCOMLEUW | 1 | Less Than or Equal |
| VPCOMGTUW | 2 | Greater Than |
| VPCOMGEUW | 3 | Greater Than or Equal |
| VPCOMEQUW | 4 | Equal |
| VPCOMNEQUW | 5 | Not Equal |
| VPCOMFALSEUW | 6 | False |
| VPCOMTRUEUW | 7 | True |

The VPCOMUW instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :--- | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPCOMUW xmm1, xmm2, xmm3/mem128, imm8 | 8F | $\overline{\text{RXB}}.08$ | $0.\overline{\text{src1}}.0.00$ | ED /r imm8 |



## Related Instructions

VPCOMUB, VPCOMUD, VPCOMUQ, VPCOMB, VPCOMW, VPCOMD, VPCOMQ

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.W was set to 1. |
|  |  |  | X | XOP.L was set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled and MXCSR.MM=1. |

## VPCOMW

## Compare Vector Signed Words

Compares corresponding packed signed words in the first and second sources and writes the result of each comparison in the corresponding word of the destination. The result of each comparison is a 16-bit value of all 1s (TRUE) or all 0s (FALSE).

The VPCOMW instruction requires four operands:

VPCOMW dest, src1, src2, comp

The destination is an XMM register addressed by the MODRM.reg field. When the results are written to the destination XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source is an XMM register specified by the XOP.vvvv field and the second source is an XMM register or 128-bit memory location specified by the MODRM.rm field.

The comp type is specified by the three low-order bits of an immediate-byte, as defined in Table 8. The VPCOMW instruction with an appropriate value of imm8 is aliased to the following mnemonics to facilitate coding.

Table 8. VPCOMW Comparison Operations

| Mnemonic | Implied Value of imm8 | Comparison Operation |
| :--- | :---: | :---: |
| VPCOMLTW | 0 | Less Than |
| VPCOMLEW | 1 | Less Than or Equal |
| VPCOMGTW | 2 | Greater Than |
| VPCOMGEW | 3 | Greater Than or Equal |
| VPCOMEQW | 4 | Equal |
| VPCOMNEQW | 5 | Not Equal |
| VPCOMFALSEW | 6 | False |
| VPCOMTRUEW | 7 | True |

The VPCOMW instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :--- | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPCOMW xmm1, xmm2, xmm3/mem128, imm8 | 8F | $\overline{\text{RXB}}.08$ | $0.\overline{\text{src1}}.0.00$ | CD /r imm8 |



## Related Instructions

VPCOMUB, VPCOMUW, VPCOMUD, VPCOMUQ, VPCOMB, VPCOMD, VPCOMQ

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.W was set to 1. |
|  |  |  | X | XOP.L was set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled and MXCSR.MM=1. |

## VPHADDBD

## Packed Horizontal Add Signed Byte to Signed Doubleword

Adds four successive 8-bit signed integer values from the source and packs the sign-extended results of the additions in the corresponding doubleword in the destination.

This instruction takes two operands:

VPHADDBD dest, src

The destination is an XMM register and the source is an XMM register or 128-bit memory location. When the destination register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.
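The grouping of bytes into doubleword sums can be sketched on element lists. This is a minimal model under stated assumptions: `vphaddbd` is a hypothetical name, the source is given as 16 signed byte values with element 0 first, and each group of four consecutive elements feeds the doubleword at the same position.

```python
def vphaddbd(src):
    """Model VPHADDBD: 16 signed bytes in, 4 signed doubleword sums out."""
    if len(src) != 16:
        raise ValueError("expected 16 byte elements")
    if any(not -128 <= value <= 127 for value in src):
        raise ValueError("elements must be signed bytes")
    # Sum each run of four successive bytes; Python integers do not wrap,
    # which matches a sum that is widened before it is stored.
    return [sum(src[i:i + 4]) for i in range(0, 16, 4)]
```

The largest magnitude any sum can reach is 4 x 128 = 512, far inside the 32-bit destination range, so the widened result never overflows.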

The VPHADDBD instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :--- | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPHADDBD xmm1, xmm2/mem128 | 8F | $\overline{\text{RXB}}.09$ | 0.1111.0.00 | C2 /r |



## Related Instructions

VPHADDBW, VPHADDBQ, VPHADDWD, VPHADDWQ, VPHADDDQ

## rFLAGS Affected

None

## MXCSR FLAGS Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :--- | :---: | :---: | :---: | :--- |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |

## VPHADDBQ

## Packed Horizontal Add Signed Byte to Signed Quadword

Adds eight successive 8-bit signed integer values from the source and packs the sign-extended results of the additions in the corresponding quadword in the destination.

This instruction takes two operands:

VPHADDBQ dest, src

The destination is an XMM register and the source is an XMM register or 128-bit memory location. When the destination register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.
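Working on raw bit patterns makes the sign extension visible. The sketch below treats the source as a 128-bit Python integer and returns a 128-bit integer holding two two's-complement quadwords. The name `vphaddbq` is hypothetical, and byte 0 is assumed to occupy bits 7:0 of the register.

```python
QWORD_MASK = 0xFFFFFFFFFFFFFFFF


def vphaddbq(src):
    """Model VPHADDBQ on a 128-bit integer; returns a 128-bit integer."""
    result = 0
    for qword in range(2):
        total = 0
        for byte in range(8):
            raw = (src >> (64 * qword + 8 * byte)) & 0xFF
            # Sign-extend the byte before adding it to the running sum.
            total += raw - 256 if raw & 0x80 else raw
        # Store the sum as a 64-bit two's-complement pattern.
        result |= (total & QWORD_MASK) << (64 * qword)
    return result
```

With every source byte equal to FFh (that is, -1), each quadword sum is -8, which is stored as FFFFFFFFFFFFFFF8h.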

The VPHADDBQ instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :--- | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPHADDBQ xmm1, xmm2/mem128 | 8F | $\overline{\text{RXB}}.09$ | 0.1111.0.00 | C3 /r |




## Related Instructions

VPHADDBW, VPHADDBD, VPHADDWD, VPHADDWQ, VPHADDDQ

## rFLAGS Affected

None

## MXCSR FLAGS Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.W was set to 1. |
|  |  |  | X | XOP.L was set to 1. |
|  |  |  | X | XOP.vvvv was not 1111b. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled and MXCSR.MM=1. |

## VPHADDBW

## Packed Horizontal Add Signed Byte to Signed Word

Adds each adjacent pair of 8-bit signed integer values from the source and packs the sign-extended 16-bit integer result of each addition in the corresponding word element of the destination.

This instruction takes two operands:

VPHADDBW dest, src

The destination is an XMM register and the source is an XMM register or 128-bit memory location. When the destination XMM register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.
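Widening to 16 bits matters because a pair sum can leave the signed byte range. The sketch below models the pairwise add and then enumerates every possible pair to find the extreme sums. `vphaddbw` is a hypothetical name and the list-of-elements representation is an assumption made for readability.

```python
def vphaddbw(src):
    """Model VPHADDBW: 16 signed bytes in, 8 signed word sums out."""
    if len(src) != 16:
        raise ValueError("expected 16 byte elements")
    # Add each adjacent pair without wrapping to 8 bits.
    return [src[i] + src[i + 1] for i in range(0, 16, 2)]


# Exhaustively check the range a single pair sum can take.
pair_sums = [a + b for a in range(-128, 128) for b in range(-128, 128)]
lowest, highest = min(pair_sums), max(pair_sums)
```

The extremes are -256 and 254, both outside the signed byte range of -128 to 127 yet comfortably inside a signed word, so the instruction needs no saturation or wraparound.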

The VPHADDBW instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :--- | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPHADDBW xmm1, xmm2/mem128 | 8F | $\overline{\text{RXB}}.09$ | 0.1111.0.00 | C1 /r |



## Related Instructions

VPHADDBD, VPHADDBQ, VPHADDWD, VPHADDWQ, VPHADDDQ

## rFLAGS Affected

None

## MXCSR FLAGS Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :--- | :---: | :---: | :---: | :--- |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |

## VPHADDDQ

## Packed Horizontal Add Signed Doubleword to Signed Quadword

Adds each adjacent pair of signed doubleword integer values in the source and packs the sign-extended sum of each addition in the corresponding quadword in the destination register.

This instruction takes two operands:

VPHADDDQ dest, src

The source is an XMM register or 128-bit memory location and the destination is an XMM register. When the destination XMM register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.
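For the memory-operand form, the operation can be modeled directly on 16 bytes of little-endian data using Python's `struct` module. This is a sketch under assumptions: `vphadddq` is a hypothetical name, and the source bytes are laid out as they would be in x86 memory, lowest-addressed doubleword first.

```python
import struct


def vphadddq(mem128):
    """Model VPHADDDQ on 16 little-endian bytes; returns 16 result bytes."""
    if len(mem128) != 16:
        raise ValueError("expected a 128-bit (16-byte) operand")
    # Four signed doublewords, lowest address first.
    d0, d1, d2, d3 = struct.unpack("<4i", mem128)
    # Two signed quadword sums; a sum of two 32-bit values always fits.
    return struct.pack("<2q", d0 + d1, d2 + d3)
```

A caller could build a test operand with `struct.pack("<4i", ...)` and decode the result with `struct.unpack("<2q", ...)`.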

The VPHADDDQ instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :--- | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPHADDDQ xmm1, xmm2/mem128 | 8F | $\overline{\text{RXB}}.09$ | 0.1111.0.00 | CB /r |



## Related Instructions

VPHADDBW, VPHADDBD, VPHADDBQ, VPHADDWD, VPHADDWQ

## rFLAGS Affected

None

## MXCSR FLAGS Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :--- | :---: | :---: | :---: | :--- |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |

## VPHADDUBD

## Packed Horizontal Add Unsigned Byte to Doubleword

Adds four successive 8-bit unsigned integer values from the source and packs the results of the additions in the corresponding doubleword in the destination.

This instruction takes two operands:

VPHADDUBD dest, src

The destination is an XMM register and the source is an XMM register or 128-bit memory location. When the destination register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.
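The unsigned form differs from its signed counterpart only in how each source byte is interpreted before the add. The sketch below applies both interpretations to the same raw bytes. `hadd_bytes_to_dwords` is a hypothetical helper, with `signed=False` standing in for VPHADDUBD and `signed=True` for VPHADDBD.

```python
def hadd_bytes_to_dwords(raw, signed):
    """Sum each group of four raw bytes (0..255) into one doubleword value."""
    if len(raw) != 16:
        raise ValueError("expected 16 byte elements")
    # Unsigned view keeps 0..255; signed view maps 80h..FFh to -128..-1.
    values = [b - 256 if signed and b >= 0x80 else b for b in raw]
    return [sum(values[i:i + 4]) for i in range(0, 16, 4)]


raw = bytes([0xFF] * 4 + [0x80, 0x80, 0x01, 0x01] + [0] * 8)
unsigned_sums = hadd_bytes_to_dwords(raw, signed=False)
signed_sums = hadd_bytes_to_dwords(raw, signed=True)
```

The unsigned sums are never negative and reach at most 4 x 255 = 1020 per doubleword, so the upper bits of each result doubleword are always zero.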

The VPHADDUBD instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :--- | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPHADDUBD xmm1, xmm2/mem128 | 8F | $\overline{\text{RXB}}.09$ | 0.1111.0.00 | D2 /r |




## Related Instructions

VPHADDUBW, VPHADDUBQ, VPHADDUWD, VPHADDUWQ, VPHADDUDQ

## rFLAGS Affected

None

## MXCSR FLAGS Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :--- | :---: | :---: | :---: | :--- |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |

## VPHADDUBQ

## Packed Horizontal Add Unsigned Byte to Quadword

Adds eight successive 8-bit unsigned integer values from the source and packs the results of the additions in the corresponding quadword in the destination.

This instruction takes two operands:

VPHADDUBQ dest, src

The destination is an XMM register and the source is an XMM register or 128-bit memory location. When the destination XMM register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The VPHADDUBQ instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPHADDUBQ xmm1, xmm2/mem128 | 8F | $\overline{\text{RXB}}.09$ | 0.1111.0.00 | D3 /r |




## Related Instructions

VPHADDUBW, VPHADDUBD, VPHADDUWD, VPHADDUWQ, VPHADDUDQ

## rFLAGS Affected

None

## MXCSR FLAGS Affected

None

## Exceptions

| Exception | Real | Virtual <br> 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.W was set to 1. |
|  |  |  | X | XOP.L was set to 1. |
|  |  |  | X | XOP.vvvv was not 1111b. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled and MXCSR.MM=1. |




## VPHADDUBW Packed Horizontal Add Unsigned Byte to Word

Adds each adjacent pair of 8-bit unsigned integer values from the source and packs the 16-bit integer results of each addition in the corresponding word in the destination.

This instruction takes two operands:

VPHADDUBW dest, src

The destination is an XMM register and the source is an XMM register or 128-bit memory location. When the destination XMM register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The VPHADDUBW instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)
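As a rough illustration of the pairwise byte addition, the sketch below models the 128-bit register as a Python integer. The name `vphaddubw` and the integer representation are assumptions for the example, not part of the specification.

```python
def vphaddubw(src: int) -> int:
    """Sketch of VPHADDUBW on a 128-bit value held in a Python int."""
    dest = 0
    for i in range(8):                        # eight 16-bit result words
        lo = (src >> (16 * i)) & 0xFF         # lower byte of the pair
        hi = (src >> (16 * i + 8)) & 0xFF     # upper byte of the pair
        dest |= (lo + hi) << (16 * i)         # at most 510, fits in 16 bits
    return dest


# Bytes 0 and 1 are both FFh, so word 0 becomes 01FEh.
print(hex(vphaddubw(0xFFFF)))  # 0x1fe
```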

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPHADDUBW xmm1, xmm2/mem128 | 8F | $\overline{\text{RXB}}.09$ | 0.1111.0.00 | D1 /r |



## Related Instructions

VPHADDUBD, VPHADDUBQ, VPHADDUWD, VPHADDUWQ, VPHADDUDQ

## rFLAGS Affected

None

## MXCSR FLAGS Affected

None

## Exceptions

| Exception | Real | Virtual <br> 8086 | Protected | Cause of Exception |
| :--- | :---: | :---: | :---: | :--- |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |




## VPHADDUDQ Packed Horizontal Add Unsigned Doubleword to Quadword

Adds each adjacent pair of 32-bit unsigned integer values from the source and packs the results of each addition in the corresponding quadword in the destination.

This instruction takes two operands:

VPHADDUDQ dest, src

The destination is an XMM register and the source is an XMM register or 128-bit memory location. When the destination register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The VPHADDUDQ instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)
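A short sketch of the doubleword-to-quadword case follows, using the standard `struct` module to unpack the lanes. The function name and the little-endian byte-string representation are assumptions for illustration. The example input is chosen so that the sum exceeds 32 bits, which shows why the result lane is a quadword.

```python
import struct


def vphaddudq(src: bytes) -> bytes:
    """Sketch of VPHADDUDQ: 16 source bytes in, 16 result bytes out."""
    d0, d1, d2, d3 = struct.unpack("<4I", src)    # four unsigned doublewords
    # Each sum is at most 2 * (2**32 - 1), which always fits in 64 bits.
    return struct.pack("<2Q", d0 + d1, d2 + d3)   # two unsigned quadwords


src = struct.pack("<4I", 0xFFFFFFFF, 1, 2, 3)
print(struct.unpack("<2Q", vphaddudq(src)))  # (4294967296, 5)
```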

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPHADDUDQ xmm1, xmm2/mem128 | 8F | $\overline{\text{RXB}}.09$ | 0.1111.0.00 | DB /r |



## Related Instructions

VPHADDUBW, VPHADDUBD, VPHADDUBQ, VPHADDUWD, VPHADDUWQ

## rFLAGS Affected

None

## MXCSR FLAGS Affected

None

## Exceptions

| Exception | Real | Virtual <br> 8086 | Protected | Cause of Exception |
| :--- | :---: | :---: | :---: | :--- |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |

## VPHADDUWD

## Packed Horizontal Add Unsigned Word to Doubleword

Adds each adjacent pair of 16-bit unsigned integer values from the source and packs the results of each addition in the corresponding doubleword in the destination.

This instruction takes two operands:

VPHADDUWD dest, src

The destination is an XMM register and the source is an XMM register or 128-bit memory location. When the destination register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The VPHADDUWD instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPHADDUWD xmm1, xmm2/mem128 | 8F | $\overline{\text{RXB}}.09$ | 0.1111.0.00 | D6 /r |



## Related Instructions

VPHADDUBW, VPHADDUBD, VPHADDUBQ, VPHADDUWQ, VPHADDUDQ

## rFLAGS Affected

None

## MXCSR FLAGS Affected

None

## Exceptions

| Exception | Real | Virtual <br> 8086 | Protected | Cause of Exception |
| :--- | :---: | :---: | :---: | :--- |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |

## VPHADDUWQ

## Packed Horizontal Add Unsigned Word to Quadword

Adds four successive 16-bit unsigned integer values from the source and packs the results of the additions in the corresponding quadword element in the destination.

This instruction takes two operands:

VPHADDUWQ dest, src

The destination is an XMM register and the source is an XMM register or 128-bit memory location. When the destination register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The VPHADDUWQ instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)
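The unsigned horizontal adds differ only in source and destination element widths, so one parameterized helper can model the whole group. The sketch below is hypothetical: `phadd_unsigned` and `vphadduwq` are names invented for this example, and registers are modeled as 128-bit Python integers.

```python
def phadd_unsigned(src: int, src_bits: int, dst_bits: int) -> int:
    """Sum groups of unsigned src_bits-wide elements into dst_bits-wide lanes."""
    group = dst_bits // src_bits               # source elements per result lane
    mask = (1 << src_bits) - 1
    dest = 0
    for lane in range(128 // dst_bits):
        total = 0
        for k in range(group):
            shift = (lane * group + k) * src_bits
            total += (src >> shift) & mask
        dest |= total << (lane * dst_bits)     # total always fits in dst_bits
    return dest


def vphadduwq(src: int) -> int:
    """VPHADDUWQ: four 16-bit words per 64-bit result."""
    return phadd_unsigned(src, 16, 64)


words = [1, 2, 3, 4, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF]
src = sum(w << (16 * i) for i, w in enumerate(words))
print(hex(vphadduwq(src)))  # 0x3fffc000000000000000a
```

With the same helper, `phadd_unsigned(src, 8, 32)` would model VPHADDUBD and `phadd_unsigned(src, 32, 64)` would model VPHADDUDQ under the same assumptions.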

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPHADDUWQ xmm1, xmm2/mem128 | 8F | $\overline{\mathrm{RXB}} .09$ | 0.1111.0.00 | D7 /r |



## Related Instructions

VPHADDUBW, VPHADDUBD, VPHADDUBQ, VPHADDUWD, VPHADDUDQ

## rFLAGS Affected

None

## MXCSR FLAGS Affected

None

## Exceptions

| Exception | Real | Virtual <br> 8086 | Protected | Cause of Exception |
| :--- | :---: | :---: | :---: | :--- |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |

## VPHADDWD

## Packed Horizontal Add Signed Word to Signed Doubleword

Adds each adjacent pair of 16-bit signed integer values from the source and packs the sign-extended results of the addition in the corresponding doubleword in the destination.

This instruction takes two operands:

VPHADDWD dest, src

The destination is an XMM register and the source is an XMM register or 128-bit memory location. When the destination XMM register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The VPHADDWD instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)
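For the signed variant, each word must be interpreted as a two's-complement value before the addition, and the sum is stored as a 32-bit two's-complement doubleword. The sketch below illustrates this under the assumption that the register is modeled as a 128-bit Python integer; `to_signed` and `vphaddwd` are hypothetical helper names.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret an unsigned bit pattern as a two's-complement integer."""
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def vphaddwd(src: int) -> int:
    """Sketch of VPHADDWD on a 128-bit value held in a Python int."""
    dest = 0
    for i in range(4):                                   # four 32-bit results
        a = to_signed((src >> (32 * i)) & 0xFFFF, 16)
        b = to_signed((src >> (32 * i + 16)) & 0xFFFF, 16)
        # The sum lies in [-65536, 65534]; masking yields its 32-bit encoding.
        dest |= ((a + b) & 0xFFFFFFFF) << (32 * i)
    return dest


# Words FFFFh (-1) and FFFEh (-2) sum to -3, encoded as FFFF_FFFDh.
print(hex(vphaddwd(0xFFFE_FFFF)))  # 0xfffffffd
```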

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPHADDWD xmm1, xmm2/mem128 | 8F | $\overline{\text{RXB}}.09$ | 0.1111.0.00 | C6 /r |



## Related Instructions

VPHADDBW, VPHADDBD, VPHADDBQ, VPHADDWQ, VPHADDDQ

## rFLAGS Affected

None

## MXCSR FLAGS Affected

None

## Exceptions

| Exception | Real | Virtual <br> 8086 | Protected | Cause of Exception |
| :--- | :---: | :---: | :---: | :--- |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |




## VPHADDWQ

## Packed Horizontal Add Signed Word to Signed Quadword

Adds four successive 16-bit signed integer values from the source and packs the sign-extended results of each addition in the corresponding quadword in the destination.

The destination is an XMM register and the source is an XMM register or 128-bit memory location. When the destination XMM register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The VPHADDWQ instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPHADDWQ xmm1, xmm2/mem128 | 8F | $\overline{\text{RXB}}.09$ | 0.1111.0.00 | C7 /r |



## Related Instructions

VPHADDBW, VPHADDBD, VPHADDBQ, VPHADDWD, VPHADDDQ

## rFLAGS Affected

None

## MXCSR FLAGS Affected

None

## Exceptions

| Exception | Real | Virtual <br> 8086 | Protected | Cause of Exception |
| :--- | :---: | :---: | :---: | :--- |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |

## VPHSUBBW

## Packed Horizontal Subtract Signed Byte to Signed Word

Subtracts the most significant signed integer byte from the least significant signed integer byte of each word element in the source and packs the sign-extended 16-bit integer results of each subtraction in the destination.

This instruction takes two operands:

VPHSUBBW dest, src

The destination is an XMM register and the source is an XMM register or 128-bit memory location. When the destination register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.
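The operand order of the subtraction is easy to get backwards, so a small sketch may help: for each word, the result is the low (even-indexed) byte minus the high (odd-indexed) byte. The helper names and the use of a 16-byte string, least-significant byte first, are assumptions for this example; results are returned as signed Python integers rather than packed words.

```python
def s8(b: int) -> int:
    """Interpret a byte value 0..255 as a signed 8-bit integer."""
    return b - 256 if b >= 128 else b


def vphsubbw(src: bytes) -> list[int]:
    """Sketch of VPHSUBBW: eight signed word results from 16 source bytes."""
    if len(src) != 16:
        raise ValueError("source must be exactly 16 bytes (128 bits)")
    # Word i = (least significant byte of word i) - (most significant byte).
    # Each difference lies in [-255, 255], so it always fits in 16 bits.
    return [s8(src[2 * i]) - s8(src[2 * i + 1]) for i in range(8)]


src = bytes([5, 3, 0x80, 0x7F, 0x7F, 0x80, 0, 1] + [0] * 8)
print(vphsubbw(src))  # [2, -255, 255, -1, 0, 0, 0, 0]
```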

The VPHSUBBW instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPHSUBBW xmm1, xmm2/mem128 | 8F | $\overline{\text{RXB}}.09$ | 0.1111.0.00 | E1 /r |


## Related Instructions

VPHSUBWD, VPHSUBDQ

## rFLAGS Affected

None

## MXCSR FLAGS Affected

None

## Exceptions

| Exception | Real | Virtual <br> 8086 | Protected | Cause of Exception |
| :--- | :---: | :---: | :---: | :--- |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |

## VPHSUBDQ Packed Horizontal Subtract Signed Doubleword to Signed Quadword

Subtracts the most significant signed integer doubleword from the least significant signed integer doubleword of each quadword in the source and packs the sign-extended 64-bit integer result of each subtraction in the corresponding quadword element of the destination.

This instruction takes two operands:

VPHSUBDQ dest, src

The destination is an XMM register and the source is an XMM register or 128-bit memory location. When the destination register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.
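A compact way to model the doubleword subtraction is to unpack the source as four signed 32-bit values and pack the two differences as signed 64-bit values. The sketch below uses the standard `struct` module; the function name and little-endian byte-string representation are assumptions for illustration.

```python
import struct


def vphsubdq(src: bytes) -> bytes:
    """Sketch of VPHSUBDQ: low doubleword minus high doubleword per quadword."""
    d0, d1, d2, d3 = struct.unpack("<4i", src)    # four signed doublewords
    # Each difference lies within the signed 64-bit range, so packing is safe.
    return struct.pack("<2q", d0 - d1, d2 - d3)   # two signed quadwords


src = struct.pack("<4i", -2147483648, 2147483647, 10, -5)
print(struct.unpack("<2q", vphsubdq(src)))  # (-4294967295, 15)
```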

The VPHSUBDQ instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)


## Related Instructions

VPHSUBBW, VPHSUBWD

## rFLAGS Affected

None

## MXCSR FLAGS Affected

None

## Exceptions

| Exception | Real | Virtual <br> 8086 | Protected | Cause of Exception |
| :--- | :---: | :---: | :---: | :--- |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |

## VPHSUBWD

## Packed Horizontal Subtract Signed Word to Signed Doubleword

Subtracts the most significant signed integer word from the least significant signed integer word of each doubleword from the source and packs the sign-extended 32-bit integer result of each subtraction in the destination.

This instruction takes two operands:

VPHSUBWD dest, src

The destination is an XMM register and the source is an XMM register or 128-bit memory location. When the destination register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The VPHSUBWD instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPHSUBWD xmm1, xmm2/mem128 | 8F | $\overline{\text{RXB}}.09$ | 0.1111.0.00 | E2 /r |



## Related Instructions

VPHSUBBW, VPHSUBDQ

## rFLAGS Affected

None

## MXCSR FLAGS Affected

None

## Exceptions

| Exception | Real | Virtual <br> 8086 | Protected | Cause of Exception |
| :--- | :---: | :---: | :---: | :--- |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |

## VPMACSDD

## Packed Multiply Accumulate Signed Doubleword to Signed Doubleword

Multiplies each packed 32-bit signed integer value in the first source by the corresponding packed 32-bit signed integer value in the second source, then adds the 64-bit signed integer product to the corresponding packed 32-bit signed integer value in the third source. The four resulting 32-bit sums are stored in the destination.

The VPMACSDD instruction requires four operands:

$$
\text{VPMACSDD } \mathit{dest},\ \mathit{src1},\ \mathit{src2},\ \mathit{src3} \qquad \mathit{dest} = \mathit{src1} * \mathit{src2} + \mathit{src3}
$$

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the destination register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv field; the second source (src2) is an XMM register or 128-bit memory location specified by the MODRM.rm field; and the third source (src3) is an XMM register specified by imm8[7:4].

When the third source designates the same XMM register as the destination, the XMM register behaves as an accumulator.

No saturation is performed on the sum. If the result of the multiplication causes non-zero values to be set in the upper 32 bits of the 64-bit product, they are ignored. If the result of the add overflows, the carry is ignored (neither the overflow nor carry bit in rFLAGS is set). In both cases, only the signed low-order 32 bits of the result are written to the destination.
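The truncating behavior can be illustrated with a small model. This is a sketch under stated assumptions, not reference code: each operand is a list of four signed 32-bit Python integers, and `s32` and `vpmacsdd` are hypothetical helper names.

```python
def s32(value: int) -> int:
    """Keep the low-order 32 bits and reinterpret them as a signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def vpmacsdd(src1: list[int], src2: list[int], src3: list[int]) -> list[int]:
    """Sketch of VPMACSDD: dest = src1 * src2 + src3, truncated per lane."""
    out = []
    for a, b, c in zip(src1, src2, src3):
        product = a * b                 # exact product, may need up to 64 bits
        out.append(s32(product + c))    # upper bits and carries are discarded
    return out


print(vpmacsdd([2, 0x7FFFFFFF, -3, 65536], [3, 2, 4, 65536], [1, 0, 5, 7]))
# [7, -2, -7, 7]
```

In the second lane the product FFFF_FFFEh wraps to -2, and in the fourth lane the product 2^32 has all-zero low-order bits, so only the accumulator value 7 survives.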

The VPMACSDD instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPMACSDD xmm1, xmm2, xmm3/mem128, xmm4 | 8F | $\overline{\text{RXB}}.08$ | 0.$\overline{\text{src1}}$.0.00 | 9E /r /is4 |
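To show how the fields in the encoding table combine into instruction bytes, the following sketch assembles the register-direct form of VPMACSDD. It is a hedged illustration: the function `encode_vpmacsdd` is hypothetical, and the bit positions within the second and third bytes (inverted R, X, B in bits 7:5 with mmmmm in bits 4:0, then W in bit 7, inverted vvvv in bits 6:3, L in bit 2, and pp in bits 1:0) are assumed from the field names in the table rather than stated in this section. Memory operands, which need displacement and SIB handling, are not covered.

```python
def encode_vpmacsdd(dest: int, src1: int, src2: int, src3: int) -> bytes:
    """Encode VPMACSDD xmm[dest], xmm[src1], xmm[src2], xmm[src3] (registers only)."""
    for reg in (dest, src1, src2, src3):
        if not 0 <= reg <= 15:
            raise ValueError("XMM register numbers must be in 0..15")
    r_bit = (dest >> 3) & 1          # extends MODRM.reg
    x_bit = 0                        # no SIB index in register-direct form
    b_bit = (src2 >> 3) & 1          # extends MODRM.rm
    # R, X, and B are stored inverted; mmmmm = 08h selects the opcode map.
    byte1 = ((r_bit ^ 1) << 7) | ((x_bit ^ 1) << 6) | ((b_bit ^ 1) << 5) | 0x08
    # W = 0, vvvv = inverted src1, L = 0, pp = 00.
    byte2 = ((~src1 & 0xF) << 3)
    modrm = (0b11 << 6) | ((dest & 7) << 3) | (src2 & 7)
    is4 = (src3 & 0xF) << 4          # src3 register number in imm8[7:4]
    return bytes([0x8F, byte1, byte2, 0x9E, modrm, is4])


print(encode_vpmacsdd(1, 2, 3, 4).hex())  # 8fe8689ecb40
```

The low four bits of the immediate byte are left at zero here, which is an assumption of the sketch.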



## Related Instructions

VPMACSSWW, VPMACSWW, VPMACSSWD, VPMACSWD, VPMACSSDD, VPMACSSDQL, VPMACSSDQH, VPMACSDQL, VPMACSDQH, VPMADCSSWD, VPMADCSWD

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual <br> 8086 | Protected | Cause of Exception |
| :--- | :---: | :---: | :---: | :--- |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |

## VPMACSDQH

## Packed Multiply Accumulate Signed High Doubleword to Signed Quadword

Multiplies the second 32-bit signed integer value of the first source by the second 32-bit signed integer value in the second source, then adds the 64-bit signed integer product to the low-order 64-bit signed integer value in the third source. Simultaneously, multiplies the fourth 32-bit signed integer value of the first source by the fourth 32-bit signed integer value in the second source, then adds the 64-bit signed integer product to the second 64-bit signed integer value in the third source. The results are written to the destination register.

The VPMACSDQH instruction requires four operands:

$$
\text{VPMACSDQH } \mathit{dest},\ \mathit{src1},\ \mathit{src2},\ \mathit{src3} \qquad \mathit{dest} = \mathit{src1} * \mathit{src2} + \mathit{src3}
$$

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the destination register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source ( $\operatorname{src} 1$ ) is an XMM register specified by the XOP.vvvv field; the second source $(\operatorname{src} 2)$ is an XMM register or 128-bit memory location specified by the MODRM.rm field; and the third source (src3) is an XMM register specified by imm8[7:4].

When the third source designates the same XMM register as the destination register, the XMM register behaves as an accumulator.

No saturation is performed on the sum. If the result of the add overflows, the carry is ignored (neither the overflow nor carry bit in rFLAGS is set).
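The lane selection and wraparound can be sketched as follows. The model assumes that list index 0 is the least-significant element, so the "second" and "fourth" doublewords are indices 1 and 3; `s64` and `vpmacsdqh` are hypothetical names.

```python
def s64(value: int) -> int:
    """Keep the low-order 64 bits and reinterpret them as a signed integer."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >> 63 else value


def vpmacsdqh(src1: list[int], src2: list[int], src3: list[int]) -> list[int]:
    """Sketch of VPMACSDQH.

    src1 and src2 hold four signed doublewords; src3 holds two signed quadwords.
    """
    return [
        s64(src1[1] * src2[1] + src3[0]),   # odd doubleword 1 -> low quadword
        s64(src1[3] * src2[3] + src3[1]),   # odd doubleword 3 -> high quadword
    ]


# The second lane overflows the signed 64-bit range and wraps around.
print(vpmacsdqh([9, 3, 9, 1], [9, 5, 9, 1], [100, 2**63 - 1]))
# [115, -9223372036854775808]
```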

The VPMACSDQH instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPMACSDQH xmm1, xmm2, xmm3/mem128, xmm4 | 8F | $\overline{\text{RXB}}.08$ | 0.$\overline{\text{src1}}$.0.00 |  |



## Related Instructions

VPMACSSWW, VPMACSWW, VPMACSSWD, VPMACSWD, VPMACSSDD, VPMACSDD, VPMACSSDQL, VPMACSSDQH, VPMACSDQL, VPMADCSSWD, VPMADCSWD

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual <br> 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.W was set to 1. |
|  |  |  | X | XOP.L was set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled and MXCSR.MM=1. |

## VPMACSDQL

## Packed Multiply Accumulate Signed Low Doubleword to Signed Quadword

Multiplies the low-order 32-bit signed integer value of the first source by the low-order 32-bit signed integer value in the second source, then adds the 64-bit signed integer product to the low-order 64-bit signed integer value in the third source. Simultaneously, multiplies the third 32-bit signed integer value of the first source by the corresponding 32-bit signed integer value in the second source, then adds the 64-bit signed integer product to the second 64-bit signed integer value in the third source. The results are written to the destination register.

The VPMACSDQL instruction requires four operands:

$$
\text{VPMACSDQL } \mathit{dest},\ \mathit{src1},\ \mathit{src2},\ \mathit{src3} \qquad \mathit{dest} = \mathit{src1} * \mathit{src2} + \mathit{src3}
$$

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the destination register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv field; the second source (src2) is an XMM register or 128-bit memory location specified by the MODRM.rm field; and the third source (src3) is an XMM register specified by imm8[7:4].

When $\operatorname{src} 3$ designates the same XMM register as the dest register, the XMM register behaves as an accumulator.

No saturation is performed on the sum. If the result of the add overflows, the carry is ignored (neither the overflow nor carry bit in rFLAGS is set). Only the low-order 64 bits of each result are written in the destination.
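The accumulator usage mentioned above can be sketched by feeding each result back in as the third source, which models the case where src3 and dest name the same register. The helper names, the list-based register model (index 0 least significant), and the `dot_even_lanes` loop are assumptions made for this example.

```python
def s64(value: int) -> int:
    """Keep the low-order 64 bits and reinterpret them as a signed integer."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >> 63 else value


def vpmacsdql(src1: list[int], src2: list[int], src3: list[int]) -> list[int]:
    """Sketch of VPMACSDQL: even doublewords 0 and 2 feed the two quadwords."""
    return [
        s64(src1[0] * src2[0] + src3[0]),
        s64(src1[2] * src2[2] + src3[1]),
    ]


def dot_even_lanes(a_vectors, b_vectors):
    """Accumulate products over several vectors, reusing acc as src3 and dest."""
    acc = [0, 0]
    for a, b in zip(a_vectors, b_vectors):
        acc = vpmacsdql(a, b, acc)
    return acc


a = [[1, 0, 2, 0], [3, 0, 4, 0]]
b = [[10, 0, 20, 0], [30, 0, 40, 0]]
print(dot_even_lanes(a, b))  # [100, 200]
```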

The VPMACSDQL instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPMACSDQL xmm1, xmm2, xmm3/mem128, xmm4 | 8F | $\overline{\text{RXB}}.08$ | 0.$\overline{\text{src1}}$.0.00 |  |



## Related Instructions

VPMACSSWW, VPMACSWW, VPMACSSWD, VPMACSWD, VPMACSSDD, VPMACSDD, VPMACSSDQL, VPMACSSDQH, VPMACSDQH, VPMADCSSWD, VPMADCSWD

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual <br> 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.W was set to 1. |
|  |  |  | X | XOP.L was set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled and MXCSR.MM=1. |

## VPMACSSDD Packed Multiply Accumulate Signed Doubleword to Signed Doubleword with Saturation

Multiplies each packed 32-bit signed integer value in the first source by the corresponding packed 32-bit signed integer value in the second source, then adds each 64-bit signed integer product to the corresponding packed 32-bit signed integer value in the third source. The saturated results are written to the destination register.

The VPMACSSDD instruction requires four operands:

$$
\text{VPMACSSDD } \mathit{dest},\ \mathit{src1},\ \mathit{src2},\ \mathit{src3} \qquad \mathit{dest} = \mathit{src1} * \mathit{src2} + \mathit{src3}
$$

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the destination register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv field; the second source (src2) is an XMM register or 128-bit memory location specified by the MODRM.rm field; and the third source (src3) is an XMM register specified by imm8[7:4].

When $\operatorname{src} 3$ designates the same XMM register as the dest register, the XMM register behaves as an accumulator.

Out of range results of the addition are saturated to fit into a signed 32-bit integer. For each packed value in the destination, if the value is larger than the largest signed 32-bit integer, it is saturated to 7FFF_FFFFh, and if the value is smaller than the smallest signed 32-bit integer, it is saturated to 8000_0000h.
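The saturation rule can be sketched by computing each sum at full precision and then clamping it to the signed 32-bit range. This ordering (exact sum first, single clamp afterwards) is an assumption of the model, as are the names `vpmacssdd`, `INT32_MAX`, and `INT32_MIN`.

```python
INT32_MAX = 0x7FFF_FFFF
INT32_MIN = -0x8000_0000


def vpmacssdd(src1: list[int], src2: list[int], src3: list[int]) -> list[int]:
    """Sketch of VPMACSSDD: dest = saturate32(src1 * src2 + src3) per lane."""
    out = []
    for a, b, c in zip(src1, src2, src3):
        total = a * b + c                               # exact intermediate sum
        out.append(max(INT32_MIN, min(INT32_MAX, total)))
    return out


result = vpmacssdd([65536, -65536, 2, 1], [65536, 65536, 3, INT32_MAX], [0, 0, 4, 1])
print([hex(v & 0xFFFFFFFF) for v in result])
# ['0x7fffffff', '0x80000000', '0xa', '0x7fffffff']
```

Compared with the non-saturating VPMACSDD, the first lane here clamps to 7FFF_FFFFh instead of wrapping to zero.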

The VPMACSSDD instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPMACSSDD xmm1, xmm2, xmm3/mem128, xmm4 | 8F | $\overline{\text{RXB}}.08$ | 0.$\overline{\text{src1}}$.0.00 |  |



## Related Instructions

VPMACSSWW, VPMACSWW, VPMACSSWD, VPMACSWD, VPMACSDD, VPMACSSDQL, VPMACSSDQH, VPMACSDQL, VPMACSDQH, VPMADCSSWD, VPMADCSWD

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual <br> 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.W was set to 1. |
|  |  |  | X | XOP.L was set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled and MXCSR.MM=1. |

## VPMACSSDQH Packed Multiply Accumulate Signed High Doubleword to Signed Quadword with Saturation

Multiplies the second 32-bit signed integer value of the first source by the second 32-bit signed integer value in the second source, then adds the 64-bit signed integer product to the low-order 64-bit signed integer value in the third source. Simultaneously, multiplies the fourth 32-bit signed integer value of the first source by the fourth 32-bit signed integer value in the second source, then adds the 64-bit signed integer product to the high-order 64-bit signed integer value in the third source. The saturated results are written to the destination register.

The VPMACSSDQH instruction requires four operands:

$$
\text{VPMACSSDQH } \mathit{dest},\ \mathit{src1},\ \mathit{src2},\ \mathit{src3} \qquad \mathit{dest} = \mathit{src1} * \mathit{src2} + \mathit{src3}
$$

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the destination XMM register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv field; the second source (src2) is an XMM register or 128-bit memory location specified by the MODRM.rm field; and the third source (src3) is an XMM register specified by imm8[7:4].

When $\operatorname{src} 3$ designates the same XMM register as the dest register, the XMM register behaves as an accumulator.

Out of range results of the addition are saturated to fit into a signed 64-bit integer. For each packed value in the destination, if the value is larger than the largest signed 64-bit integer, it is saturated to 7FFF_FFFF_FFFF_FFFFh, and if the value is smaller than the smallest signed 64-bit integer, it is saturated to 8000_0000_0000_0000h.
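A sketch of the 64-bit saturating form follows. It assumes that list index 0 is the least-significant element and that the sum is formed exactly before a single clamp is applied; `sat64` and `vpmacssdqh` are hypothetical names.

```python
INT64_MAX = (1 << 63) - 1      # 7FFF_FFFF_FFFF_FFFFh
INT64_MIN = -(1 << 63)         # 8000_0000_0000_0000h as a signed value


def sat64(value: int) -> int:
    """Clamp an exact integer to the signed 64-bit range."""
    return max(INT64_MIN, min(INT64_MAX, value))


def vpmacssdqh(src1: list[int], src2: list[int], src3: list[int]) -> list[int]:
    """Sketch of VPMACSSDQH: odd doublewords multiply, quadwords accumulate."""
    return [
        sat64(src1[1] * src2[1] + src3[0]),
        sat64(src1[3] * src2[3] + src3[1]),
    ]


# Both lanes would overflow, so each is pinned to the nearest limit.
result = vpmacssdqh([0, 1, 0, -1], [0, 1, 0, 1], [INT64_MAX, INT64_MIN])
print(result == [INT64_MAX, INT64_MIN])  # True
```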

The VPMACSSDQH instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPMACSSDQH xmm1, xmm2, xmm3/mem128, xmm4 | 8F | $\overline{\text{RXB}}.08$ | 0.$\overline{\text{src1}}$.0.00 | 8F /r /is4 |



## Related Instructions

VPMACSSWW, VPMACSWW, VPMACSSWD, VPMACSWD, VPMACSSDD, VPMACSDD, VPMACSSDQL, VPMACSDQL, VPMACSDQH, VPMADCSSWD, VPMADCSWD

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.W was set to 1. |
|  |  |  | X | XOP.L was set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled and MXCSR.MM = 1. |
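
The CPUID and XFEATURE_ENABLED_MASK conditions in the table above can be read as a checklist that software runs before issuing any XOP instruction. The following C sketch is illustrative only: the function name `xop_usable` and its parameters are hypothetical, and the caller is assumed to have already obtained the register values (for example with CPUID and XGETBV). It does not cover the CR0.EM, CR0.TS, XOP.W, or XOP.L conditions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of any AMD-provided API.
 *   ecx_8000_0001: ECX returned by CPUID function 8000_0001h
 *   ecx_0000_0001: ECX returned by CPUID function 0000_0001h
 *   xfeature_mask: value of XFEATURE_ENABLED_MASK as read by the caller */
bool xop_usable(uint32_t ecx_8000_0001, uint32_t ecx_0000_0001,
                uint64_t xfeature_mask)
{
    bool xop_present = ((ecx_8000_0001 >> 11) & 1u) != 0; /* XOP feature bit */
    bool osxsave     = ((ecx_0000_0001 >> 27) & 1u) != 0; /* OSXSAVE bit */
    bool ymm_enabled = (xfeature_mask & 0x6u) == 0x6u;    /* bits [2:1] both 1 */
    return xop_present && osxsave && ymm_enabled;
}
```

A caller would typically test the OSXSAVE bit first and read XFEATURE_ENABLED_MASK only when that bit is set.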

## VPMACSSDQL

## Packed Multiply Accumulate Signed Low Doubleword to Signed Quadword with Saturation

Multiplies the low-order 32-bit signed integer value of the first source by the low-order 32-bit signed integer value in the second source, then adds the 64-bit signed integer product to the low-order 64-bit signed integer value in the third source. Simultaneously, multiplies the third 32-bit signed integer value of the first source by the third 32-bit signed integer value in the second source, then adds the 64-bit signed integer product to the high-order 64-bit signed integer value in the third source. The saturated results are written to the destination register.

The VPMACSSDQL instruction requires four operands:

$$
\text{VPMACSSDQL dest, src1, src2, src3} \quad \text{dest} = \text{src1} * \text{src2} + \text{src3}
$$

The destination (dest) register is an XMM register addressed by the MODRM.reg field. When the destination register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv field; the second source (src2) is an XMM register or 128-bit memory location specified by the MODRM.rm field; and the third source (src3) is an XMM register specified by imm8[7:4].

When src3 designates the same XMM register as the dest register, the XMM register behaves as an accumulator.

Out of range results of the addition are saturated to fit into a signed 64-bit integer. For each packed value in the destination, if the value is larger than the largest signed 64-bit integer, it is saturated to 7FFF_FFFF_FFFF_FFFFh, and if the value is smaller than the smallest signed 64-bit integer, it is saturated to 8000_0000_0000_0000h.

The VPMACSSDQL instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)
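
The operation described above can be sketched as a scalar reference model in C. This is an illustration, not AMD's definition: the function names are hypothetical, and elements are assumed to be numbered from 0 at the least-significant end, so the "low-order" doubleword is element 0 and the "third" doubleword is element 2.

```c
#include <stdint.h>

/* Saturating signed 64-bit add: clamps instead of wrapping. */
int64_t sat_add64(int64_t a, int64_t b)
{
    if (b > 0 && a > INT64_MAX - b) return INT64_MAX; /* 7FFF_FFFF_FFFF_FFFFh */
    if (b < 0 && a < INT64_MIN - b) return INT64_MIN; /* 8000_0000_0000_0000h */
    return a + b;
}

/* Hypothetical scalar model of VPMACSSDQL.
 * src1/src2 hold four doublewords; src3/dest hold two quadwords. */
void vpmacssdql_model(int64_t dest[2], const int32_t src1[4],
                      const int32_t src2[4], const int64_t src3[2])
{
    /* A 32x32-bit signed product always fits in 64 bits. */
    dest[0] = sat_add64((int64_t)src1[0] * src2[0], src3[0]);
    dest[1] = sat_add64((int64_t)src1[2] * src2[2], src3[1]);
}
```

Passing the same array as `dest` and `src3` mirrors the accumulator usage described above.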

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPMACSSDQL xmm1, xmm2, xmm3/mem128, xmm4 | 8F | $\overline{\mathrm{RXB}}$.08 | 0.$\overline{\mathrm{src1}}$.0.00 | 87 /r /is4 |



## Related Instructions

VPMACSSWW, VPMACSWW, VPMACSSWD, VPMACSWD, VPMACSSDD, VPMACSDD, VPMACSSDQH, VPMACSDQL, VPMACSDQH, VPMADCSSWD, VPMADCSWD

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.W was set to 1. |
|  |  |  | X | XOP.L was set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled and MXCSR.MM = 1. |

## VPMACSSWD

## Packed Multiply Accumulate Signed Word to Signed Doubleword with Saturation

Multiplies the odd-numbered packed 16-bit signed integer values in the first source by the corresponding packed 16-bit signed integer values in the second source, then adds the 32-bit signed integer products to the corresponding packed 32-bit signed integer values in the third source. The saturated results are written to the destination register.

The VPMACSSWD instruction requires four operands:

$$
\text{VPMACSSWD dest, src1, src2, src3} \quad \text{dest} = \text{src1} * \text{src2} + \text{src3}
$$

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the destination XMM register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv field; the second source (src2) is an XMM register or 128-bit memory location specified by the MODRM.rm field; and the third source (src3) is an XMM register specified by imm8[7:4].

When src3 designates the same XMM register as the dest register, the XMM register behaves as an accumulator.

Out of range results of the addition are saturated to fit into a signed 32-bit integer. For each packed value in the destination, if the value is larger than the largest signed 32-bit integer, it is saturated to 7FFF_FFFFh, and if the value is smaller than the smallest signed 32-bit integer, it is saturated to 8000_0000h.

The VPMACSSWD instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)
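
A scalar C sketch of this behavior follows. It is a hypothetical reference model rather than an official definition, and it assumes that words are numbered from 0 at the least-significant end, so that the "odd-numbered" words are elements 1, 3, 5, and 7.

```c
#include <stdint.h>

/* Clamp a wide intermediate to the signed 32-bit range. */
int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX; /* 7FFF_FFFFh */
    if (v < INT32_MIN) return INT32_MIN; /* 8000_0000h */
    return (int32_t)v;
}

/* Hypothetical scalar model of VPMACSSWD. */
void vpmacsswd_model(int32_t dest[4], const int16_t src1[8],
                     const int16_t src2[8], const int32_t src3[4])
{
    for (int i = 0; i < 4; i++) {
        /* Assumption: odd-numbered word 2*i+1 pairs with doubleword i. */
        int64_t product = (int64_t)src1[2 * i + 1] * src2[2 * i + 1];
        dest[i] = sat32(product + src3[i]);
    }
}
```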

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPMACSSWD xmm1, xmm2, xmm3/mem128, xmm4 | 8F | $\overline{\mathrm{RXB}}$.08 | 0.$\overline{\mathrm{src1}}$.0.00 | 86 /r /is4 |



## Related Instructions

VPMACSSWW, VPMACSWW, VPMACSWD, VPMACSSDD, VPMACSDD, VPMACSSDQL, VPMACSSDQH, VPMACSDQL, VPMACSDQH, VPMADCSSWD, VPMADCSWD

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.W was set to 1. |
|  |  |  | X | XOP.L was set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled and MXCSR.MM = 1. |

## VPMACSSWW

## Packed Multiply Accumulate Signed Word to Signed Word with Saturation

Multiplies each packed 16-bit signed integer value in the first source by its corresponding packed 16-bit signed integer value in the second source, then adds the 32-bit signed integer products to the corresponding packed 16-bit signed integer value in the third source. The saturated results are written to the destination register.

The VPMACSSWW instruction requires four operands:

$$
\text{VPMACSSWW dest, src1, src2, src3} \quad \text{dest} = \text{src1} * \text{src2} + \text{src3}
$$

The destination register is an XMM register addressed by the MODRM.reg field. When the destination register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv field; the second source (src2) is an XMM register or 128-bit memory location specified by the MODRM.rm field; and the third source (src3) is an XMM register specified by imm8[7:4].

When src3 and dest designate the same XMM register, this register behaves as an accumulator.

Out of range results of the addition are saturated to fit into a signed 16-bit integer. For each packed value in the destination, if the value is larger than the largest signed 16-bit integer, it is saturated to 7FFFh, and if the value is smaller than the smallest signed 16-bit integer, it is saturated to 8000h.

The VPMACSSWW instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)
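
As an illustration of the 16-bit saturation rule, the following C sketch models the instruction one word at a time. The function names are hypothetical, and the model assumes the full 32-bit product takes part in the addition before the single saturation step, as the description above suggests.

```c
#include <stdint.h>

/* Clamp a 32-bit intermediate to the signed 16-bit range. */
int16_t sat16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX; /* 7FFFh */
    if (v < INT16_MIN) return INT16_MIN; /* 8000h */
    return (int16_t)v;
}

/* Hypothetical scalar model of VPMACSSWW. */
void vpmacssww_model(int16_t dest[8], const int16_t src1[8],
                     const int16_t src2[8], const int16_t src3[8])
{
    for (int i = 0; i < 8; i++) {
        /* |product| <= 2^30, so product + src3 cannot overflow int32_t. */
        int32_t sum = (int32_t)src1[i] * src2[i] + src3[i];
        dest[i] = sat16(sum);
    }
}
```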

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPMACSSWW xmm1, xmm2, xmm3/mem128, xmm4 | 8F | $\overline{\mathrm{RXB}}$.08 | 0.$\overline{\mathrm{src1}}$.0.00 | 85 /r /is4 |



## Related Instructions

VPMACSWW, VPMACSSWD, VPMACSWD, VPMACSSDD, VPMACSDD, VPMACSSDQL, VPMACSSDQH, VPMACSDQL, VPMACSDQH, VPMADCSSWD, VPMADCSWD

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.W was set to 1. |
|  |  |  | X | XOP.L was set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled and MXCSR.MM = 1. |

## VPMACSWD

## Packed Multiply Accumulate Signed Word to Signed Doubleword

Multiplies each odd-numbered packed 16-bit signed integer value in the first source by the corresponding packed 16-bit signed integer value in the second source, then adds the 32-bit signed integer products to the corresponding packed 32-bit signed integer value in the third source. The four results are written to the destination register.

The VPMACSWD instruction requires four operands:

$$
\text{VPMACSWD dest, src1, src2, src3} \quad \text{dest} = \text{src1} * \text{src2} + \text{src3}
$$

The destination (dest) register is an XMM register addressed by the MODRM.reg field. When the destination XMM register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv field; the second source (src2) is an XMM register or 128-bit memory location specified by the MODRM.rm field; and the third source (src3) is an XMM register specified by imm8[7:4].

When src3 designates the same XMM register as the dest register, the XMM register behaves as an accumulator.

If the result of the add overflows, the carry is ignored (neither the overflow nor carry bit in rFLAGS is set). Only the low-order 32 bits of the result are written in the destination.

The VPMACSWD instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)
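
The wrap-around behavior described above (no saturation, carry discarded) can be sketched in C as follows. The function name is hypothetical, and the sketch assumes that "odd-numbered" words are elements 1, 3, 5, and 7 counting from 0 at the least-significant end. The addition is done in unsigned arithmetic because signed overflow is undefined in C.

```c
#include <stdint.h>

/* Hypothetical scalar model of VPMACSWD. */
void vpmacswd_model(int32_t dest[4], const int16_t src1[8],
                    const int16_t src2[8], const int32_t src3[4])
{
    for (int i = 0; i < 4; i++) {
        /* A 16x16-bit signed product always fits in 32 bits. */
        int32_t product = (int32_t)src1[2 * i + 1] * src2[2 * i + 1];
        /* Unsigned addition wraps modulo 2^32, discarding the carry. */
        uint32_t sum = (uint32_t)product + (uint32_t)src3[i];
        /* Reinterpreting as signed assumes two's complement conversion. */
        dest[i] = (int32_t)sum;
    }
}
```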

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPMACSWD xmm1, xmm2, xmm3/mem128, xmm4 | 8F | $\overline{\mathrm{RXB}}$.08 | 0.$\overline{\mathrm{src1}}$.0.00 | 96 /r /is4 |



## Related Instructions

VPMACSSWW, VPMACSWW, VPMACSSWD, VPMACSSDD, VPMACSDD, VPMACSSDQL, VPMACSSDQH, VPMACSDQL, VPMACSDQH, VPMADCSSWD, VPMADCSWD

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.W was set to 1. |
|  |  |  | X | XOP.L was set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled and MXCSR.MM = 1. |

## VPMACSWW

## Packed Multiply Accumulate Signed Word to Signed Word

Multiplies each packed 16-bit signed integer value in the first source by the corresponding packed 16-bit signed integer value in the second source, then adds each 32-bit signed integer product to the corresponding packed 16-bit signed integer value in the third source. The eight results are written to the destination register.

The VPMACSWW instruction requires four operands:

$$
\text{VPMACSWW dest, src1, src2, src3} \quad \text{dest} = \text{src1} * \text{src2} + \text{src3}
$$

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the destination XMM register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv field; the second source (src2) is an XMM register or 128-bit memory location specified by the MODRM.rm field; and the third source (src3) is an XMM register specified by imm8[7:4].

When src3 designates the same XMM register as the dest register, the XMM register behaves as an accumulator.

No saturation is performed on the sum. If the result of the multiplication causes non-zero values to be set in the upper 16 bits of the 32-bit result, they are ignored. If the result of the add overflows, the carry is ignored (neither the overflow nor carry bit in rFLAGS is set). In both cases, only the signed low-order 16 bits of the result are written in the destination.

The VPMACSWW instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)
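
The truncation rule above (keep only the low-order 16 bits of product plus addend) can be sketched in C as follows. This is a hypothetical reference model; the function name is illustrative, and the final cast assumes two's complement conversion from unsigned to signed.

```c
#include <stdint.h>

/* Hypothetical scalar model of VPMACSWW. */
void vpmacsww_model(int16_t dest[8], const int16_t src1[8],
                    const int16_t src2[8], const int16_t src3[8])
{
    for (int i = 0; i < 8; i++) {
        int32_t product = (int32_t)src1[i] * src2[i];
        /* Unsigned arithmetic gives well-defined wrap-around. */
        uint32_t sum = (uint32_t)product + (uint32_t)(int32_t)src3[i];
        /* Upper 16 bits of the sum are discarded. */
        dest[i] = (int16_t)(uint16_t)sum;
    }
}
```

For example, 200 * 200 = 40000 = 9C40h does not fit in a signed word, so the stored result is the bit pattern 9C40h, which reads back as -25536.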

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPMACSWW xmm1, xmm2, xmm3/mem128, xmm4 | 8F | $\overline{\mathrm{RXB}}$.08 | 0.$\overline{\mathrm{src1}}$.0.00 | 95 /r /is4 |



## Related Instructions

VPMACSSWW, VPMACSSWD, VPMACSWD, VPMACSSDD, VPMACSDD, VPMACSSDQL, VPMACSSDQH, VPMACSDQL, VPMACSDQH, VPMADCSSWD, VPMADCSWD

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.W was set to 1. |
|  |  |  | X | XOP.L was set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled and MXCSR.MM = 1. |

## VPMADCSSWD

## Packed Multiply, Add and Accumulate Signed Word to Signed Doubleword with Saturation

Multiplies each packed 16-bit signed integer value in the first source by the corresponding packed 16-bit signed integer value in the second source, then adds the 32-bit signed integer products of the even-odd adjacent words. Each resulting sum is then added to the corresponding packed 32-bit signed integer value in the third source. The four results are written to the destination (accumulator) register.

The VPMADCSSWD instruction requires four operands:

$$
\text{VPMADCSSWD dest, src1, src2, src3} \quad \text{dest} = \text{src1} * \text{src2} + \text{src3}
$$

The destination register is an XMM register addressed by the MODRM.reg field. When the destination register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv field; the second source (src2) is an XMM register or 128-bit memory location specified by the MODRM.rm field; and the third source (src3) is an XMM register specified by imm8[7:4].

When src3 designates the same XMM register as the dest register, the XMM register behaves as an accumulator.

Out of range results of the addition are saturated to fit into a signed 32-bit integer. For each packed value in the destination, if the value is larger than the largest signed 32-bit integer, it is saturated to 7FFF_FFFFh, and if the value is smaller than the smallest signed 32-bit integer, it is saturated to 8000_0000h.

The VPMADCSSWD instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)
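
A scalar C sketch of the multiply, pairwise add, and saturating accumulate follows. It is a hypothetical model, not an official definition. In particular, the text above does not say whether the sum of the two products is itself saturated before the accumulate; this sketch assumes a single saturation step applied to the exact three-term sum, computed in a 64-bit intermediate.

```c
#include <stdint.h>

/* Clamp a wide intermediate to the signed 32-bit range. */
int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX; /* 7FFF_FFFFh */
    if (v < INT32_MIN) return INT32_MIN; /* 8000_0000h */
    return (int32_t)v;
}

/* Hypothetical scalar model of VPMADCSSWD. */
void vpmadcsswd_model(int32_t dest[4], const int16_t src1[8],
                      const int16_t src2[8], const int32_t src3[4])
{
    for (int i = 0; i < 4; i++) {
        int64_t even = (int64_t)src1[2 * i] * src2[2 * i];
        int64_t odd  = (int64_t)src1[2 * i + 1] * src2[2 * i + 1];
        /* Assumption: saturate once, after all three terms are summed. */
        dest[i] = sat32(even + odd + src3[i]);
    }
}
```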

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPMADCSSWD xmm1, xmm2, xmm3/mem128, xmm4 | 8F | $\overline{\mathrm{RXB}}$.08 | 0.$\overline{\mathrm{src1}}$.0.00 | A6 /r /is4 |



## Related Instructions

VPMACSSWW, VPMACSWW, VPMACSSWD, VPMACSWD, VPMACSSDD, VPMACSDD, VPMACSSDQL, VPMACSSDQH, VPMACSDQL, VPMACSDQH, VPMADCSWD

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.W was set to 1. |
|  |  |  | X | XOP.L was set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled and MXCSR.MM = 1. |

## VPMADCSWD

## Packed Multiply Add and Accumulate Signed Word to Signed Doubleword

Multiplies each packed 16-bit signed integer value in the first source by the corresponding packed 16-bit signed integer value in the second source, then adds the 32-bit signed integer products of the even-odd adjacent words together and adds their sum to the corresponding packed 32-bit signed integer values in the third source. The four results are written to the destination register.

The VPMADCSWD instruction requires four operands:

$$
\text{VPMADCSWD dest, src1, src2, src3} \quad \text{dest} = \text{src1} * \text{src2} + \text{src3}
$$

The destination register is an XMM register addressed by the MODRM.reg field. When the destination register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv field; the second source (src2) is an XMM register or 128-bit memory location specified by the MODRM.rm field; and the third source (src3) is an XMM register specified by imm8[7:4].

When src3 designates the same XMM register as the dest register, the XMM register behaves as an accumulator.

No saturation is performed on the sum. If the result of the addition overflows, the carry is ignored (neither the overflow nor carry bit in rFLAGS is set). Only the low-order 32 bits of the result are written to the destination.

The VPMADCSWD instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)
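
The non-saturating form can be sketched in C by doing all additions modulo 2^32. This is a hypothetical reference model with an illustrative function name; it uses unsigned arithmetic because signed overflow is undefined in C, and the final cast assumes two's complement conversion.

```c
#include <stdint.h>

/* Hypothetical scalar model of VPMADCSWD. */
void vpmadcswd_model(int32_t dest[4], const int16_t src1[8],
                     const int16_t src2[8], const int32_t src3[4])
{
    for (int i = 0; i < 4; i++) {
        /* Each 16x16-bit signed product fits in 32 bits. */
        int32_t even = (int32_t)src1[2 * i] * src2[2 * i];
        int32_t odd  = (int32_t)src1[2 * i + 1] * src2[2 * i + 1];
        /* Carries out of bit 31 are discarded by the unsigned wrap. */
        uint32_t sum = (uint32_t)even + (uint32_t)odd + (uint32_t)src3[i];
        dest[i] = (int32_t)sum;
    }
}
```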

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPMADCSWD xmm1, xmm2, xmm3/mem128, xmm4 | 8F | $\overline{\mathrm{RXB}}$.08 | 0.$\overline{\mathrm{src1}}$.0.00 | B6 /r /is4 |



## Related Instructions

VPMACSSWW, VPMACSWW, VPMACSSWD, VPMACSWD, VPMACSSDD, VPMACSDD, VPMACSSDQL, VPMACSSDQH, VPMACSDQL, VPMACSDQH, VPMADCSSWD

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.W was set to 1. |
|  |  |  | X | XOP.L was set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled and MXCSR.MM = 1. |

## VPPERM

## Packed Permute Bytes

Selects 16 of the 32 packed bytes in the two sources and optionally applies a logical transformation to each selected byte before it is stored to its specified position in the destination XMM register.

The VPPERM instruction requires four operands:

VPPERM dest, src1, src2, selector

The 32-byte source consists of the concatenation of the second source (src2) and the first source (src1). The third source operand (selector) contains control bytes specifying the source byte and the logical operation to perform on each destination byte.

The src1 operand is always an XMM register specified by XOP.vvvv.

This instruction supports operand source configuration using XOP.W. When XOP.W is 0, src2 is an XMM register or 128-bit memory location specified by MODRM.rm and selector is an XMM register specified by imm8[7:4]. When XOP.W is 1, src2 is an XMM register specified by imm8[7:4] and selector is an XMM register or 128-bit memory location specified by MODRM.rm.

The destination (dest) is always an XMM register specified by MODRM.reg. When the result operand is written to the dest XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

For each byte of the 16-byte result, the corresponding selector byte is used as follows:

- Bits 4:0 of the selector byte select the source byte to move from the 32 bytes of src2:src1.
- Bits 7:5 of the selector byte select the logical operation to perform on the selected byte.

Table 2-3. VPPERM Control Byte

| Bits | Description |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
| 7:5 | Op - Defines the logical operation performed on the selected operand. |  |  |  |
|  | OP | Operation |  |  |
|  | 000 | Source byte (no logical operation) |  |  |
|  | 001 | Invert source byte |  |  |
|  | 010 | Bit reverse of source byte |  |  |
|  | 011 | Bit reverse of inverted source byte |  |  |
|  | 100 | 00h |  |  |
|  | 101 | FFh |  |  |
|  | 110 | Most significant bit of source byte replicated in all bit positions. |  |  |
|  | 111 | Invert most significant bit of source byte and replicate in all bit positions. |  |  |
| 4:0 | Source Selector |  |  |  |
|  | Selector | Source Selected | Selector | Source Selected |
|  | 00000 | src1[7:0] | 10000 | src2[7:0] |
|  | 00001 | src1[15:8] | 10001 | src2[15:8] |
|  | 00010 | src1[23:16] | 10010 | src2[23:16] |
|  | 00011 | src1[31:24] | 10011 | src2[31:24] |
|  | 00100 | src1[39:32] | 10100 | src2[39:32] |
|  | 00101 | src1[47:40] | 10101 | src2[47:40] |
|  | 00110 | src1[55:48] | 10110 | src2[55:48] |
|  | 00111 | src1[63:56] | 10111 | src2[63:56] |
|  | 01000 | src1[71:64] | 11000 | src2[71:64] |
|  | 01001 | src1[79:72] | 11001 | src2[79:72] |
|  | 01010 | src1[87:80] | 11010 | src2[87:80] |
|  | 01011 | src1[95:88] | 11011 | src2[95:88] |
|  | 01100 | src1[103:96] | 11100 | src2[103:96] |
|  | 01101 | src1[111:104] | 11101 | src2[111:104] |
|  | 01110 | src1[119:112] | 11110 | src2[119:112] |
|  | 01111 | src1[127:120] | 11111 | src2[127:120] |

The VPPERM instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPPERM xmm1, xmm2, xmm3, xmm4/mem128 | 8F | $\overline{\mathrm{RXB}}$.08 | 1.$\overline{\mathrm{src1}}$.0.00 | A3 /r /is4 |
| VPPERM xmm1, xmm2, xmm3/mem128, xmm4 | 8F | $\overline{\mathrm{RXB}}$.08 | 0.$\overline{\mathrm{src1}}$.0.00 | A3 /r /is4 |

## Action

```
for (i = 0; i < 16; i++)
    // control[i].op is selector bits 7:5; control[i].src_sel is selector bits 4:0
    dest[i] := control[i].op((src2:src1)[control[i].src_sel]);
```
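
As a worked illustration of Table 2-3, the following C sketch expands the per-byte selection and logical operation into a scalar reference model. It is hypothetical (the function names `bit_reverse8` and `vpperm_model` are not from this manual) and assumes byte 0 of each array holds bits [7:0] of the corresponding register.

```c
#include <stdint.h>

/* Reverse the bit order within one byte. */
uint8_t bit_reverse8(uint8_t b)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (uint8_t)((r << 1) | (b & 1u));
        b >>= 1;
    }
    return r;
}

/* Hypothetical scalar model of VPPERM. */
void vpperm_model(uint8_t dest[16], const uint8_t src1[16],
                  const uint8_t src2[16], const uint8_t selector[16])
{
    uint8_t result[16]; /* temporary, so dest may alias any source */
    for (int i = 0; i < 16; i++) {
        unsigned sel = selector[i] & 0x1Fu; /* bits 4:0: source selector */
        unsigned op  = selector[i] >> 5;    /* bits 7:5: logical operation */
        /* Selectors 0-15 pick from src1, 16-31 from src2. */
        uint8_t byte = (sel < 16) ? src1[sel] : src2[sel - 16];
        switch (op) {
        case 0: break;                                          /* unchanged */
        case 1: byte = (uint8_t)~byte; break;                   /* invert */
        case 2: byte = bit_reverse8(byte); break;               /* bit reverse */
        case 3: byte = bit_reverse8((uint8_t)~byte); break;     /* reverse of inverted */
        case 4: byte = 0x00; break;
        case 5: byte = 0xFF; break;
        case 6: byte = (byte & 0x80u) ? 0xFF : 0x00; break;     /* replicate MSB */
        case 7: byte = (byte & 0x80u) ? 0x00 : 0xFF; break;     /* replicate inverted MSB */
        }
        result[i] = byte;
    }
    for (int i = 0; i < 16; i++)
        dest[i] = result[i];
}
```

For instance, a selector byte of 41h (op 010, source 00001) takes src1[15:8] and writes its bit-reversed value to the destination byte.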



## Related Instructions

VPSHUFHW, VPSHUFD, VPSHUFLW, VPSHUFW, VPERMPS, VPERMPD

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.L was set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled and MXCSR.MM = 1. |

## VPROTB

## Packed Rotate Bytes

Rotates each byte of the source by the amount specified in the signed value of the corresponding count byte and writes the result in the corresponding byte of the destination.

There are two versions of the instruction, depending on the source of the count byte used for each 8-bit rotation:

- VPROTB dest, src, fixed-count
- VPROTB dest, src, variable-count-src

The destination (dest) operand of both versions of this instruction is an XMM register addressed by the MODRM.reg field. When the result of the rotation is written to the destination XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The fixed-count version of this instruction rotates each byte element of the source (src) by the number of bits specified by the immediate fixed-count byte. All byte elements of the source are rotated by the same number of bits. The source is a 128-bit XMM register or memory location addressed by the MODRM.rm field.

The variable-count-src version of this instruction rotates each byte of the source by the amount specified in the corresponding byte element in the variable-count-src, which is an XMM register or 128-bit memory location.

The src and variable-count-src are configurable through XOP.W. If XOP.W is 0, the variable-count-src is an XMM register specified by XOP.vvvv and the src operand is an XMM register or 128-bit memory location specified by MODRM.rm. If XOP.W is 1, the variable-count-src is an XMM register or 128-bit memory location specified by MODRM.rm and the src operand is an XMM register specified by XOP.vvvv.

If the count value is positive, bits are rotated to the left (toward the more significant bit positions). The bits rotated out left of the most significant bit are rotated back in at the right end (least-significant bit) of the byte.

If the count value is negative, bits are rotated to the right (toward the least significant bit positions). The bits rotated to the right out of the least significant bit are rotated back in at the left end (most-significant bit) of the byte.
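
The direction rule above can be modeled in software. The following Python sketch is illustrative only and is not part of the architectural definition. The helper names `to_signed8`, `rotate_element` and `vprotb` are hypothetical, operands are modeled as 16-byte sequences, and the sketch assumes that a count is reduced modulo the element width, which is consistent with rotation but is not stated in this description.

```python
def to_signed8(b):
    """Interpret an unsigned byte (0..255) as a two's-complement value."""
    return b - 256 if b & 0x80 else b


def rotate_element(value, count, width):
    """Rotate a width-bit value left for positive counts, right for negative ones."""
    mask = (1 << width) - 1
    # Python's % always returns a non-negative result, so a count of -1
    # becomes width - 1: rotating left by width - 1 equals rotating right by 1.
    r = count % width
    return ((value << r) | (value >> (width - r))) & mask


def vprotb(src, counts):
    """Model the variable-count form on two 16-byte operands."""
    return bytes(rotate_element(s, to_signed8(c), 8) for s, c in zip(src, counts))
```

In this model, rotating the byte 81h by a count of 01h yields 03h, and rotating it by a count of FFh (-1) yields C0h.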

The VPROTB instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)



## Related Instructions

VPROTW, VPROTD, VPROTQ, VPSHLB, VPSHLW, VPSHLD, VPSHLQ, VPSHAB, VPSHAW, VPSHAD, VPSHAQ

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.L was set to 1. |
|  |  |  | X | XOP.vvvv was not 1111b for immediate count form of instruction (opcode C0h). |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |
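
The protected-mode \#UD causes in the table that depend on CPUID and operating-system state can be summarized as a single predicate. The following Python sketch is an illustration, not a detection routine: it takes register values that software has already read (Python cannot execute CPUID or read XFEATURE_ENABLED_MASK directly), and the function and parameter names are hypothetical.

```python
def xop_usable(cpuid_8000_0001_ecx, cpuid_0000_0001_ecx, xfeature_enabled_mask):
    """Return True when none of the CPUID- or OS-related #UD causes applies."""
    xop_supported = (cpuid_8000_0001_ecx >> 11) & 1  # ECX bit 11, function 8000_0001h
    osxsave = (cpuid_0000_0001_ecx >> 27) & 1        # ECX bit 27, function 0000_0001h
    ymm_bits = (xfeature_enabled_mask >> 1) & 0b11   # XFEATURE_ENABLED_MASK[2:1]
    return bool(xop_supported and osxsave and ymm_bits == 0b11)
```

The predicate does not cover CR0.EM, CR0.TS, XOP.L or XOP.vvvv, which depend on control-register state or on the encoding of the individual instruction.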

## VPROTD

## Packed Rotate Doublewords

Rotates each of the four doublewords of the source operand by the amount specified in the signed value of the corresponding count byte and writes the result in the corresponding doubleword of the destination.

There are two variants of this instruction, depending on the source of the count byte used for each doubleword rotate:

- VPROTD dest, src, fixed-count
- VPROTD dest, src, variable-count

The dest operand of both versions of this instruction is an XMM register addressed by the MODRM.reg field. When the 128-bit result operand is written to the dest register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The fixed count version of this instruction rotates each doubleword of the source operand by the number of bits specified by the immediate fixed-count byte operand. All doubleword elements of the source operand are rotated by the same number of bits. The src is an XMM register or memory location addressed by the MODRM.rm field.

The variable count version of this instruction rotates each doubleword of the source by the amount specified in the low order byte of the corresponding doubleword of the variable-count operand vector.

The src and variable-count operand vector are configurable through XOP.W. If XOP.W is 0, the src is an XMM register or 128-bit memory location specified by the MODRM.rm field and the variable-count operand vector is an XMM register specified by XOP.vvvv. If XOP.W is 1, the src operand is an XMM register specified by XOP.vvvv and the variable-count operand is an XMM register or 128-bit memory location specified by the MODRM.rm field.

If the count value is positive, bits are rotated to the left (toward the more significant bit positions). The bits rotated out to the left of the most significant bit of each source doubleword operand are rotated back in at the right end (least-significant bit) of the doubleword.

If the count value is negative, bits are rotated to the right (toward the least significant bit positions). The bits rotated to the right out of the least significant bit of each source doubleword operand are rotated back in at the left end (most-significant bit) of the doubleword.
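
As an informal illustration of the variable-count form, the Python sketch below models a 128-bit operand as 16 little-endian bytes and takes each count from the low-order byte of the corresponding doubleword. The function name `vprotd_variable` and the byte-string representation are assumptions made for the example, as is the reduction of the count modulo 32.

```python
import struct


def vprotd_variable(src128, cnt128):
    """Rotate four doublewords by the signed low-order byte of each count doubleword."""
    src = struct.unpack("<4I", src128)
    cnt = struct.unpack("<4I", cnt128)
    out = []
    for s, c in zip(src, cnt):
        count = c & 0xFF              # only the low-order byte holds the count
        if count & 0x80:              # reinterpret as two's-complement
            count -= 256
        r = count % 32                # negative counts become right rotations
        out.append(((s << r) | (s >> (32 - r))) & 0xFFFFFFFF)
    return struct.pack("<4I", *out)
```

The upper three bytes of each count doubleword are ignored by this model, so a count doubleword of 12345604h rotates the corresponding element left by 4 bits.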

The VPROTD instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPROTD xmm1, xmm2/mem128, xmm3 | 8F | $\overline{\text{RXB}}$.09 | 0.$\overline{\text{cnt}}$.0.00 | 92 /r |
| VPROTD xmm1, xmm2, xmm3/mem128 | 8F | $\overline{\text{RXB}}$.09 | 1.$\overline{\text{src}}$.0.00 | 92 /r |
| VPROTD xmm1, xmm2/mem128, imm8 | 8F | $\overline{\text{RXB}}$.08 | 0.1111.0.00 | C2 /ib |
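
To make the field layout in the encoding table concrete, the following Python sketch assembles the bytes of a register-to-register XOP instruction. It is a sketch under stated assumptions rather than a reference encoder: the bit positions follow the column headings (RXB.mmmmm and W.vvvv.L.pp, most-significant field first), the R, X, B and vvvv fields are assumed to be stored inverted (the overbar notation in the encoding tables), and the function name `encode_xop_reg_reg` and its parameters are hypothetical.

```python
def encode_xop_reg_reg(opcode, mmmmm, w, reg, rm, vvvv=0, l=0, pp=0, imm8=None):
    """Assemble an XOP instruction whose MODRM.rm operand is a register.

    reg, rm and vvvv are register numbers 0..15. Leave vvvv at 0 for forms
    that require the encoded field to be 1111b.
    """
    r, b = reg >> 3, rm >> 3   # high register-number bits go in R and B
    x = 0                      # no index register in register-direct form
    byte1 = ((r ^ 1) << 7) | ((x ^ 1) << 6) | ((b ^ 1) << 5) | mmmmm
    byte2 = (w << 7) | ((~vvvv & 0xF) << 3) | (l << 2) | pp
    modrm = 0xC0 | ((reg & 7) << 3) | (rm & 7)   # mod = 11b selects a register
    encoded = bytes([0x8F, byte1, byte2, opcode, modrm])
    if imm8 is not None:
        encoded += bytes([imm8 & 0xFF])
    return encoded
```

Under these assumptions, `encode_xop_reg_reg(0x92, 0x09, 0, reg=1, rm=2, vvvv=3)` (VPROTD xmm1, xmm2, xmm3 with XOP.W = 0) produces the bytes 8F E9 60 92 CA.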



## Related Instructions

VPROTB, VPROTW, VPROTQ, VPSHLB, VPSHLW, VPSHLD, VPSHLQ, VPSHAB, VPSHAW, VPSHAD, VPSHAQ

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |

## VPROTQ

## Packed Rotate Quadwords

Rotates each of the quadwords of the source operand by the amount specified in the signed value of the corresponding count byte and writes the result in the corresponding quadword of the destination.

There are two variants of this instruction, depending on the source of the count byte used for each quadword rotate:

- VPROTQ dest, src, fixed-count
- VPROTQ dest, src, variable-count

The dest operand of both versions of this instruction is an XMM register addressed by the MODRM.reg field. When the 128-bit result is written to the dest XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The fixed count version of this instruction rotates each quadword in the source by the number of bits specified by the immediate fixed-count byte operand. All quadword elements of the source are rotated by the same number of bits. The src is a 128-bit XMM register or memory location addressed by the MODRM.rm field.

The variable count version of this instruction rotates each quadword of the source by the amount specified in the low order byte of the corresponding quadword of the variable-count operand.

The src and variable-count are configurable through XOP.W. If XOP.W is 0, the src is an XMM register or 128-bit memory location specified by MODRM.rm and the count is an XMM register specified by XOP.vvvv. If XOP.W is 1, the src is an XMM register specified by XOP.vvvv and the variable-count is an XMM register or 128-bit memory location specified by MODRM.rm.

If the count value is positive, bits are rotated to the left (toward the more significant bit positions) of the operand element. The bits rotated out to the left of the most significant bit of the quadword element are rotated back in at the right end (least-significant bit).

If the count value is negative, operand element bits are rotated to the right (toward the least significant bit positions). The bits rotated to the right out of the least significant bit are rotated back in at the left end (most-significant bit) of the quadword element.

The VPROTQ instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPROTQ xmm1, xmm2/mem128, xmm3 | 8F | $\overline{\text{RXB}}$.09 | 0.$\overline{\text{cnt}}$.0.00 | 93 /r |
| VPROTQ xmm1, xmm2, xmm3/mem128 | 8F | $\overline{\text{RXB}}$.09 | 1.$\overline{\text{src}}$.0.00 | 93 /r |
| VPROTQ xmm1, xmm2/mem128, imm8 | 8F | $\overline{\text{RXB}}$.08 | 0.1111.0.00 | C3 /ib |



## Related Instructions

VPROTB, VPROTW, VPROTD, VPSHLB, VPSHLW, VPSHLD, VPSHLQ, VPSHAB, VPSHAW, VPSHAD, VPSHAQ

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |

## VPROTW

## Packed Rotate Words

Rotates each of the eight words of the source operand by the amount specified in the signed value of the corresponding count byte and writes the result in the corresponding word of the destination.

There are two variants of this instruction, depending on the source of the count byte used for each word rotate:

- VPROTW dest, src, fixed-count
- VPROTW dest, src, variable-count

The dest operand of both versions of this instruction is an XMM register addressed by the MODRM.reg field. When the 128-bit result operand is written to the dest XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The fixed count version of this instruction rotates each word of the source operand by the number of bits specified by the immediate fixed-count byte operand. All word elements of the source operand are rotated by the same number of bits. The src operand is a 128-bit XMM register or memory location addressed by the MODRM.rm field.

The variable count version of this instruction rotates each word of the source operand by the amount specified in the low order byte of the corresponding word of the variable-count operand.

The src and count operands are configurable through XOP.W. If XOP.W is 0, the src operand is an XMM register or 128-bit memory location specified by MODRM.rm and the count operand is an XMM register specified by XOP.vvvv. If XOP.W is 1, the src operand is an XMM register specified by XOP.vvvv and the variable-count operand is an XMM register or 128-bit memory location specified by MODRM.rm.

If the count value is positive, bits are rotated to the left (toward the more significant bit positions). The bits rotated out to the left of the most significant bit of an element are rotated back in at the right end (least-significant bit) of the word element.

If the count value is negative, bits are rotated to the right (toward the least significant bit positions) of the element. The bits rotated to the right out of the least significant bit of an element are rotated back in at the left end (most-significant bit) of the word element.

The VPROTW instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPROTW xmm1, xmm2/mem128, xmm3 | 8F | $\overline{\text{RXB}}$.09 | 0.$\overline{\text{cnt}}$.0.00 | 91 /r |
| VPROTW xmm1, xmm2, xmm3/mem128 | 8F | $\overline{\text{RXB}}$.09 | 1.$\overline{\text{src}}$.0.00 | 91 /r |
| VPROTW xmm1, xmm2/mem128, imm8 | 8F | $\overline{\text{RXB}}$.08 | 0.1111.0.00 | C1 /r ib |



## Related Instructions

VPROTB, VPROTD, VPROTQ, VPSHLB, VPSHLW, VPSHLD, VPSHLQ, VPSHAB, VPSHAW, VPSHAD, VPSHAQ

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.L was set to 1. |
|  |  |  | X | XOP.vvvv was not 1111b for immediate count form of instruction (opcode C1h). |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |



## VPSHAB

## Packed Shift Arithmetic Bytes

Shifts each signed byte of the source operand by the amount specified in the signed value of the corresponding count byte and writes the result in the corresponding byte of the destination.

The count byte for each 8-bit shift is an 8-bit signed two's-complement value in the corresponding byte element of the count operand.

If the count value is positive, bits are shifted to the left (toward the more significant bit positions). Zeros are shifted in at the right end (least-significant bit) of the byte.

If the count value is negative, bits are shifted to the right (toward the least significant bit positions). The most significant bit (sign bit) is replicated and shifted in at the left end (most-significant bit) of the byte.
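
The two shift directions can be illustrated with a short software model. The Python sketch below is an illustration under stated assumptions, not a definition of the instruction: the names `arithmetic_shift` and `vpshab` are hypothetical, operands are modeled as 16-byte sequences, and counts with a magnitude of 8 or more are rejected because this description does not say how they are handled.

```python
def arithmetic_shift(value, count, width):
    """Shift a width-bit signed element left (count > 0) or right with sign fill (count < 0)."""
    if not -width < count < width:
        raise ValueError("count is outside the range modeled by this sketch")
    mask = (1 << width) - 1
    if count >= 0:
        return (value << count) & mask           # zeros enter at the right end
    # Convert to a signed Python int so that >> replicates the sign bit.
    signed = value - (1 << width) if value >> (width - 1) else value
    return (signed >> -count) & mask


def vpshab(src, counts):
    """Model VPSHAB on two 16-byte operands; each count byte is two's-complement."""
    return bytes(
        arithmetic_shift(s, c - 256 if c & 0x80 else c, 8)
        for s, c in zip(src, counts)
    )
```

In this model, shifting the byte F0h by a count of FCh (-4) yields FFh, because the sign bit is replicated into the vacated positions.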

The VPSHAB instruction requires three operands:

VPSHAB dest, src, count

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the results are written to the destination XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

If XOP.W is 0, the count is an XMM register specified by XOP.vvvv and the src is an XMM register or 128-bit memory location specified by MODRM.rm. If XOP.W is 1, the count is an XMM register or 128-bit memory location specified by MODRM.rm and the src operand is an XMM register specified by XOP.vvvv.

The VPSHAB instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPSHAB xmm1, xmm2/mem128, xmm3 | 8F | $\overline{\text{RXB}}$.09 | 0.$\overline{\text{cnt}}$.0.00 | 98 /r |
| VPSHAB xmm1, xmm2, xmm3/mem128 | 8F | $\overline{\text{RXB}}$.09 | 1.$\overline{\text{src}}$.0.00 | 98 /r |



## Related Instructions

VPROTB, VPROTW, VPROTD, VPROTQ, VPSHLB, VPSHLW, VPSHLD, VPSHLQ, VPSHAW, VPSHAD, VPSHAQ

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.L was set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |

## VPSHAD

## Packed Shift Arithmetic Doublewords

Shifts each of the four signed doublewords of the source operand by the amount specified in the signed value of the corresponding count byte and writes the result in the corresponding doubleword of the destination.

The count byte for each doubleword shift is an 8-bit signed two's-complement value located in the low-order byte of the corresponding doubleword element of the count operand.

If the count value is positive, bits are shifted to the left (toward the more significant bit positions). Zeros are shifted in at the right end (least-significant bit) of the doubleword.

If the count value is negative, bits are shifted to the right (toward the least significant bit positions). The most significant bit (sign bit) is replicated and shifted in at the left end (most-significant bit) of the doubleword.

The VPSHAD instruction requires three operands:

VPSHAD dest, src, count

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the 128-bit result is written to the dest XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The src and count are configurable through XOP.W. If XOP.W is 0, the count is an XMM register specified by XOP.vvvv and the src is an XMM register or memory location specified by MODRM.rm. If XOP.W is 1, the count is an XMM register or memory location specified by MODRM.rm and the src is an XMM register specified by XOP.vvvv.
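
The operand roles selected by XOP.W can be summarized in a few lines. The Python sketch below is illustrative only. `select_operands` and its parameter names are hypothetical, and the operands are represented by plain labels rather than decoded register state.

```python
def select_operands(xop_w, vvvv_operand, rm_operand):
    """Map the XOP.vvvv and MODRM.rm operands to the src and count roles."""
    if xop_w == 0:
        # count comes from XOP.vvvv (register only); src may be in memory
        return {"src": rm_operand, "count": vvvv_operand}
    # roles swap: src comes from XOP.vvvv; count may be in memory
    return {"src": vvvv_operand, "count": rm_operand}
```

Because only the MODRM.rm operand can reference memory, XOP.W determines whether the source data or the count vector can be taken directly from memory.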

The VPSHAD instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

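| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPSHAD xmm1, xmm2/mem128, xmm3 | 8F | $\overline{\text{RXB}}$.09 | 0.$\overline{\text{cnt}}$.0.00 | 9A /r |
| VPSHAD xmm1, xmm2, xmm3/mem128 | 8F | $\overline{\text{RXB}}$.09 | 1.$\overline{\text{src}}$.0.00 | 9A /r |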



## Related Instructions

VPROTB, VPROTW, VPROTD, VPROTQ, VPSHLB, VPSHLW, VPSHLD, VPSHLQ, VPSHAB, VPSHAW, VPSHAQ

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.L was set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |

## VPSHAQ

## Packed Shift Arithmetic Quadwords

Shifts the two quadwords of the source operand by the amount specified in the signed value of the corresponding count byte and writes the result in the corresponding quadword of the destination.

The count byte for each quadword shift is an 8-bit signed two's-complement value located in the low-order byte of the corresponding quadword element of the count operand.

If the count value is positive, bits are shifted to the left (toward the more significant bit positions). Zeros are shifted in at the right end (least-significant bit) of the quadword.

If the count value is negative, bits are shifted to the right (toward the least significant bit positions). The most significant bit is replicated and shifted in at the left end (most-significant bit) of the quadword.

The shift amount is stored in two's-complement form. The count is modulo 64.

The VPSHAQ instruction requires three operands:

VPSHAQ dest, src, count

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the 128-bit result operand is written to the dest XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The src and count are configurable through XOP.W. If XOP.W is 0, the count is a 128-bit XMM register specified by XOP.vvvv and the src is a 128-bit XMM register or memory location specified by MODRM.rm. If XOP.W is 1, the count is a 128-bit XMM register or memory location specified by MODRM.rm and the src is a 128-bit XMM register specified by XOP.vvvv.

The VPSHAQ instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPSHAQ xmm1, xmm2/mem128, xmm3 | 8F | $\overline{\text{RXB}}$.09 | 0.$\overline{\text{cnt}}$.0.00 | 9B /r |
| VPSHAQ xmm1, xmm2, xmm3/mem128 | 8F | $\overline{\text{RXB}}$.09 | 1.$\overline{\text{src}}$.0.00 | 9B /r |



## Related Instructions

VPROTB, VPROTW, VPROTD, VPROTQ, VPSHLB, VPSHLW, VPSHLD, VPSHLQ, VPSHAB, VPSHAW, VPSHAD

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.L was set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |

## VPSHAW

## Packed Shift Arithmetic Words

Shifts each of the eight signed words of the source operand by the amount specified in the signed value of the corresponding count byte and writes the result in the corresponding word of the destination.

The count byte for each word shift is an 8-bit signed two's-complement value located in the low-order byte of the corresponding word element of the count operand.

If the count value is positive, bits are shifted to the left (toward the more significant bit positions). Zeros are shifted in at the right end (least-significant bit) of the word.

If the count value is negative, bits are shifted to the right (toward the least significant bit positions). The most significant bit (sign bit) is replicated and shifted in at the left end (most-significant bit) of the word.

The shift amount is stored in two's-complement form. The count is modulo 16.

The VPSHAW instruction requires three operands:

VPSHAW dest, src, count

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the 128-bit result operand is written to the destination XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The src and count are configurable through XOP.W. If XOP.W is 0, the count is a 128-bit XMM register specified by XOP.vvvv and the src operand is a 128-bit XMM register or memory location specified by MODRM.rm. If XOP.W is 1, the count operand is a 128-bit XMM register or memory location specified by MODRM.rm and the src operand is a 128-bit XMM register specified by XOP.vvvv.

The VPSHAW instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPSHAW xmm1, xmm2/mem128, xmm3 | 8F | $\overline{\text{RXB}}$.09 | 0.$\overline{\text{cnt}}$.0.00 | 99 /r |
| VPSHAW xmm1, xmm2, xmm3/mem128 | 8F | $\overline{\text{RXB}}$.09 | 1.$\overline{\text{src}}$.0.00 | 99 /r |



## Related Instructions

VPROTB, VPROTW, VPROTD, VPROTQ, VPSHLB, VPSHLW, VPSHLD, VPSHLQ, VPSHAB, VPSHAD, VPSHAQ

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.L was set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |

## VPSHLB

## Packed Shift Logical Bytes

Shifts each byte of the source operand by the amount specified in the signed value of the corresponding count byte and writes the result in the corresponding byte of the destination.

The count byte for each byte shift is an 8-bit signed two's-complement value located in the corresponding byte element of the count operand.

If the count value is positive, bits are shifted to the left (toward the more significant bit positions). Zeros are shifted in at the right end (least-significant bit) of the byte.

If the count value is negative, bits are shifted to the right (toward the least significant bit positions). Zeros are shifted in at the left end (most-significant bit) of the byte.
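
For comparison with the arithmetic shift instructions, the logical shift rule can be modeled as follows. This Python sketch is illustrative and rests on assumptions: the names `logical_shift` and `vpshlb` are hypothetical, operands are modeled as 16-byte sequences, and counts with a magnitude of 8 or more are rejected because their handling is not described here.

```python
def logical_shift(value, count, width=8):
    """Shift a width-bit element left (count > 0) or right (count < 0), filling with zeros."""
    if not -width < count < width:
        raise ValueError("count is outside the range modeled by this sketch")
    mask = (1 << width) - 1
    value &= mask
    if count >= 0:
        return (value << count) & mask   # zeros enter at the right end
    return value >> -count               # value is non-negative, so zeros enter at the left


def vpshlb(src, counts):
    """Model VPSHLB on two 16-byte operands; each count byte is two's-complement."""
    signed_counts = (c - 256 if c & 0x80 else c for c in counts)
    return bytes(logical_shift(s, c) for s, c in zip(src, signed_counts))
```

In this model, shifting the byte F0h by a count of FCh (-4) yields 0Fh, whereas an arithmetic right shift of the same byte would replicate the sign bit and yield FFh.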

The VPSHLB instruction requires three operands:

VPSHLB dest, src, count

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the 128-bit result is written to the destination XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The src and count are configurable through XOP.W. If XOP.W is 0, the count is a 128-bit XMM register specified by XOP.vvvv and the src is a 128-bit XMM register or memory location specified by MODRM.rm. If XOP.W is 1, the count is a 128-bit XMM register or memory location specified by MODRM.rm and the src is a 128-bit XMM register specified by XOP.vvvv.

The VPSHLB instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPSHLB xmm1, xmm2/mem128, xmm3 | 8F | $\overline{\text{RXB}}$.09 | 0.$\overline{\text{cnt}}$.0.00 | 94 /r |
| VPSHLB xmm1, xmm2, xmm3/mem128 | 8F | $\overline{\text{RXB}}$.09 | 1.$\overline{\text{src}}$.0.00 | 94 /r |



## Related Instructions

VPROTB, VPROTW, VPROTD, VPROTQ, VPSHLW, VPSHLD, VPSHLQ, VPSHAB, VPSHAW, VPSHAD, VPSHAQ

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.L was set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |

## VPSHLD

## Packed Shift Logical Doublewords

Shifts each doubleword of the source operand by the amount specified in the signed value of the corresponding count byte and writes the result in the corresponding doubleword of the destination.

The count byte for each doubleword shift is an 8-bit signed two's-complement value located in the low-order byte of the corresponding doubleword element of the count operand.

If the count value is positive, bits are shifted to the left (toward the more significant bit positions). Zeros are shifted in at the right end (least-significant bit) of the doubleword.

If the count value is negative, bits are shifted to the right (toward the least significant bit positions). Zeros are shifted in at the left end (most-significant bit) of the doubleword.

The shift amount is stored in two's-complement form. The count is modulo 32.

The VPSHLD instruction requires three operands:

VPSHLD dest, src, count

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the 128-bit result is written to the destination XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The src and count are configurable through XOP.W. If XOP.W is 0, the count is a 128-bit XMM register specified by XOP.vvvv and the src is a 128-bit XMM register or memory location specified by MODRM.rm. If XOP.W is 1, the count is a 128-bit XMM register or memory location specified by MODRM.rm and the src operand is a 128-bit XMM register specified by XOP.vvvv.

The VPSHLD instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)
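As an illustration only, the per-doubleword behavior described in this entry can be modeled in a few lines of Python. The names `signed_count` and `vpshld` are hypothetical helpers, not part of any AMD tool. The handling of counts outside -32..31 (sign taken from bit 7, remaining value from the low five bits) is one assumed reading of "the count is modulo 32"; other readings are possible.

```python
LANE_BITS = 32
LANE_MASK = (1 << LANE_BITS) - 1


def signed_count(count_lane):
    """Decode the low-order byte of a count lane as a signed shift amount.

    Assumption: "modulo 32" is read as keeping the sign bit (bit 7) and the
    low five bits of the byte, which gives a range of -32..31.
    """
    byte = count_lane & 0xFF
    return (byte & 0x1F) - (32 if byte & 0x80 else 0)


def vpshld(src_lanes, count_lanes):
    """Shift each 32-bit lane left (count > 0) or logically right (count < 0)."""
    result = []
    for value, count_lane in zip(src_lanes, count_lanes):
        count = signed_count(count_lane)
        if count >= 0:
            # Left shift: zeros enter at the least-significant end.
            result.append((value << count) & LANE_MASK)
        else:
            # Logical right shift: zeros enter at the most-significant end.
            result.append((value & LANE_MASK) >> -count)
    return result


# Count bytes 0x04, 0xFF, 0xF8 and 0x00 decode to +4, -1, -8 and 0.
lanes = vpshld([0x00000001, 0x80000000, 0xFFFFFFFF, 0x12345678],
               [0x04, 0xFF, 0xF8, 0x00])
print([f"{lane:08X}" for lane in lanes])
# prints ['00000010', '40000000', '00FFFFFF', '12345678']
```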

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPSHLD xmm1, xmm3/mem128, xmm2 | 8F | RXB.09 | $0.\overline{\mathrm{cnt}}.0.00$ | 96 /r |
| VPSHLD xmm1, xmm2, xmm3/mem128 | 8F | RXB.09 | $1.\overline{\mathrm{src}}.0.00$ | 96 /r |
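For the register-only forms, the two encodings in the table can be assembled by hand. The following Python sketch does so under stated assumptions: `encode_vpshld_reg` is a hypothetical helper, and the inverted storage of the R, X, and B bits is assumed from the general XOP prefix layout rather than taken from this entry. Memory operands are not handled.

```python
XOP_ESCAPE = 0x8F      # first byte of the XOP prefix
MAP_SELECT = 0x09      # mmmmm field from the encoding table
OPCODE_VPSHLD = 0x96


def encode_vpshld_reg(dest, src, count, w=0):
    """Encode VPSHLD dest, src, count with all three operands in registers.

    Register numbers are 0..15. With w=0 the count travels in XOP.vvvv and
    the source in MODRM.rm; with w=1 the two roles are swapped.
    """
    rm_reg, vvvv_reg = (src, count) if w == 0 else (count, src)
    r = (dest >> 3) & 1
    b = (rm_reg >> 3) & 1
    # Assumption: R, X and B are stored inverted. X is unused by a register
    # operand, so it is stored as 1 (the inverted form of 0).
    byte1 = ((r ^ 1) << 7) | (1 << 6) | ((b ^ 1) << 5) | MAP_SELECT
    # vvvv is stored inverted (the overbar in the table); L = 0 and pp = 00.
    byte2 = (w << 7) | ((~vvvv_reg & 0xF) << 3)
    modrm = 0xC0 | ((dest & 7) << 3) | (rm_reg & 7)
    return bytes([XOP_ESCAPE, byte1, byte2, OPCODE_VPSHLD, modrm])


print(encode_vpshld_reg(dest=1, src=2, count=3, w=0).hex(" "))
print(encode_vpshld_reg(dest=1, src=2, count=3, w=1).hex(" "))
```

Both calls describe the same operation, VPSHLD xmm1, xmm2, xmm3. Only the W bit, the vvvv field, and the MODRM.rm field differ between the two byte sequences.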



## Related Instructions

VPROTB, VPROTW, VPROTD, VPROTQ, VPSHLB, VPSHLW, VPSHLQ, VPSHAB, VPSHAW, VPSHAD, VPSHAQ

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :--- | :---: | :---: | :---: | :--- |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.L was set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |
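The rows of the table that do not involve a memory access can be read as a simple decision procedure. The Python sketch below is a rough model only: `CpuState` and `pre_execution_fault` are hypothetical names, and testing the \#UD conditions before the \#NM condition is an assumption, since the table itself does not rank its rows.

```python
from dataclasses import dataclass


@dataclass
class CpuState:
    protected_mode: bool = True        # False models real or virtual-8086 mode
    xop_supported: bool = True         # CPUID function 8000_0001h, ECX bit 11
    cr0_em: bool = False               # CR0.EM
    cr4_osxsave: bool = True           # CR4.OSXSAVE
    xfeature_enabled_mask: int = 0b111
    xop_l: int = 0                     # XOP.L field of the instruction
    cr0_ts: bool = False               # CR0.TS


def pre_execution_fault(state):
    """Return '#UD', '#NM' or None for the non-memory rows of the table.

    Assumption: every #UD condition is tested before the #NM condition.
    """
    invalid_opcode = (
        not state.protected_mode
        or not state.xop_supported
        or state.cr0_em
        or not state.cr4_osxsave
        or (state.xfeature_enabled_mask & 0b110) != 0b110   # bits [2:1]
        or state.xop_l == 1
    )
    if invalid_opcode:
        return "#UD"
    return "#NM" if state.cr0_ts else None


print(pre_execution_fault(CpuState()))
print(pre_execution_fault(CpuState(cr0_ts=True)))
print(pre_execution_fault(CpuState(xop_l=1)))
```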

## VPSHLQ

## Packed Shift Logical Quadwords

Shifts the two quadwords of the source operand by the amount specified in the signed value of the corresponding count byte and writes the result in the corresponding quadword of the destination.

The count byte for each quadword shift is an 8-bit signed two's-complement value located in the low-order byte of the corresponding quadword element of the count operand.

Bit 6 of the count byte is ignored.

If the count value is positive, bits are shifted to the left (toward the more significant bit positions). Zeros are shifted in at the right end (least-significant bit) of the quadword.

If the count value is negative, bits are shifted to the right (toward the least significant bit positions). Zeros are shifted in at the left end (most-significant bit) of the quadword.
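The effect of ignoring bit 6 is easiest to see by decoding a few count bytes. The Python sketch below is illustrative: `qword_shift_count` and `vpshlq` are hypothetical helpers, and the decoding rule (sign from bit 7, remaining value from bits 5:0, for a range of -64..63) is an assumed literal reading of the sentence about bit 6.

```python
QWORD_MASK = (1 << 64) - 1


def qword_shift_count(count_lane):
    """Decode a quadword count lane into a signed shift amount.

    Only the low-order byte is examined. Assumption: with bit 6 ignored,
    bit 7 supplies the sign and bits 5:0 supply the remaining value.
    """
    byte = count_lane & 0xFF
    return (byte & 0x3F) - (64 if byte & 0x80 else 0)


def vpshlq(src_lanes, count_lanes):
    """Shift each 64-bit lane left (count > 0) or logically right (count < 0)."""
    result = []
    for value, count_lane in zip(src_lanes, count_lanes):
        count = qword_shift_count(count_lane)
        if count >= 0:
            result.append((value << count) & QWORD_MASK)
        else:
            result.append((value & QWORD_MASK) >> -count)
    return result


# 0x01 and 0x41 differ only in bit 6, as do 0xFF and 0xBF.
for byte in (0x01, 0x41, 0xFF, 0xBF):
    print(f"{byte:#04x} -> {qword_shift_count(byte):+d}")

print([f"{q:016X}" for q in vpshlq([1, 1 << 63], [63, 0xFF])])
```

Under this reading, 0x01 and 0x41 both decode to +1, and 0xFF and 0xBF both decode to -1.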

The VPSHLQ instruction requires three operands:
VPSHLQ dest, src, count
The destination (dest) is an XMM register addressed by the MODRM.reg field. When the 128-bit result is written to the destination XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The src and count are configurable through XOP.W. If XOP.W is 0, the count operand is a 128-bit XMM register specified by XOP.vvvv and the src is a 128-bit XMM register or memory location specified by MODRM.rm. If XOP.W is 1, the count is a 128-bit XMM register or memory location specified by MODRM.rm and the src is a 128-bit XMM register specified by XOP.vvvv.

The VPSHLQ instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPSHLQ xmm1, xmm3/mem128, xmm2 | 8F | RXB.09 | $0.\overline{\mathrm{cnt}}.0.00$ | 97 /r |
| VPSHLQ xmm1, xmm2, xmm3/mem128 | 8F | RXB.09 | $1.\overline{\mathrm{src}}.0.00$ | 97 /r |
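Read in the decoding direction, XOP.W decides which encoded field names the source and which names the count. A minimal Python sketch of that mapping follows. `vpshlq_operand_roles` is a hypothetical helper; it assumes a register-form MODRM byte and ignores the R and B extension bits, so MODRM.reg and MODRM.rm are limited to registers 0 through 7.

```python
def vpshlq_operand_roles(xop_w, vvvv_field, modrm):
    """Map the encoded fields of a register-form VPSHLQ to operand roles.

    vvvv_field is the raw 4-bit field from the XOP prefix, which holds the
    register number in inverted form (the overbar in the table).
    """
    if (modrm >> 6) != 0b11:
        raise ValueError("memory forms are outside this sketch")
    vvvv_reg = ~vvvv_field & 0xF       # undo the inversion
    dest = (modrm >> 3) & 7            # MODRM.reg
    rm_reg = modrm & 7                 # MODRM.rm
    if xop_w == 0:
        return {"dest": dest, "src": rm_reg, "count": vvvv_reg}
    return {"dest": dest, "src": vvvv_reg, "count": rm_reg}


# Two different encodings that name the same operands.
print(vpshlq_operand_roles(0, 0b1100, 0xCA))
print(vpshlq_operand_roles(1, 0b1101, 0xCB))
```

Both calls return destination register 1, source register 2, and count register 3, which shows that the two rows of the table are alternative encodings of the same operation when every operand is a register.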



## Related Instructions

VPROTB, VPROTW, VPROTD, VPROTQ, VPSHLB, VPSHLW, VPSHLD, VPSHAB, VPSHAW, VPSHAD, VPSHAQ

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :--- | :---: | :---: | :---: | :--- |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.L was set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |

## VPSHLW

## Packed Shift Logical Words

Shifts each of the eight words of the source operand by the amount specified in the signed value of the corresponding count byte and writes the result in the corresponding word of the destination.

The count byte for each word shift is an 8-bit signed two's-complement value located in the low-order byte of the corresponding word element of the count operand.

If the count value is positive, bits are shifted to the left (toward the more significant bit positions). Zeros are shifted in at the right end (least-significant bit) of the word.

If the count value is negative, bits are shifted to the right (toward the least significant bit positions). Zeros are shifted in at the left end (most-significant bit) of the word.

The VPSHLW instruction requires three operands:
VPSHLW dest, src, count
The destination (dest) is an XMM register addressed by the MODRM.reg field. When the 128-bit result is written to the destination XMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The src and count are configurable through XOP.W. If XOP.W is 0, the count operand is a 128-bit XMM register specified by XOP.vvvv and the src is a 128-bit XMM register or memory location specified by MODRM.rm. If XOP.W is 1, the count is a 128-bit XMM register or memory location specified by MODRM.rm and the src is a 128-bit XMM register specified by XOP.vvvv.

The VPSHLW instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order\# 25481.)
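To show where each count byte sits inside the 128-bit count operand, the following Python sketch models VPSHLW on whole operands held as integers. The helper names `pack_words`, `unpack_words`, and `vpshlw` are hypothetical. This entry does not say how counts beyond the word width behave, so the sketch simply lets such counts shift every bit out, which is an assumption and not a statement about the hardware.

```python
WORD_BITS = 16
WORD_MASK = 0xFFFF
WORDS_PER_XMM = 8


def pack_words(words):
    """Pack eight 16-bit values into one 128-bit integer, word 0 lowest."""
    return sum((w & WORD_MASK) << (i * WORD_BITS) for i, w in enumerate(words))


def unpack_words(value):
    """Split a 128-bit integer into eight 16-bit values, word 0 first."""
    return [(value >> (i * WORD_BITS)) & WORD_MASK for i in range(WORDS_PER_XMM)]


def vpshlw(src, count):
    """Model VPSHLW on 128-bit operands held as Python integers."""
    result = 0
    for lane in range(WORDS_PER_XMM):
        offset = lane * WORD_BITS
        value = (src >> offset) & WORD_MASK
        # The count is the low-order byte of the word; the high byte is unused.
        count_byte = (count >> offset) & 0xFF
        shift = count_byte - 256 if count_byte & 0x80 else count_byte
        if shift >= 0:
            shifted = (value << shift) & WORD_MASK
        else:
            shifted = value >> -shift
        result |= shifted << offset
    # The result fits in 128 bits, so the upper half of a YMM view is zero.
    return result


src = pack_words([0x0001, 0x0002, 0x0004, 0x0008, 0x8000, 0x4000, 0xFFFF, 0x00F0])
# Word 0 carries 0xAB in its high byte to show that only the low byte counts.
cnt = pack_words([0xAB01, 2, 3, 4, -15 & 0xFF, -14 & 0xFF, -8 & 0xFF, -4 & 0xFF])
print([f"{w:04X}" for w in unpack_words(vpshlw(src, cnt))])
# prints ['0002', '0008', '0020', '0080', '0001', '0001', '00FF', '000F']
```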

| Mnemonic | Encoding |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | XOP | RXB.mmmmm | W.vvvv.L.pp | Opcode |
| VPSHLW xmm1, xmm3/mem128, xmm2 | 8F | RXB.09 | $0.\overline{\mathrm{cnt}}.0.00$ | 95 /r |
| VPSHLW xmm1, xmm2, xmm3/mem128 | 8F | RXB.09 | $1.\overline{\mathrm{src}}.0.00$ | 95 /r |



## Related Instructions

VPROTB, VPROTW, VPROTD, VPROTQ, VPSHLB, VPSHLD, VPSHLQ, VPSHAB, VPSHAW, VPSHAD, VPSHAQ

## rFLAGS Affected

None

## MXCSR Flags Affected

None

## Exceptions

| Exception | Real | Virtual 8086 | Protected | Cause of Exception |
| :---: | :---: | :---: | :---: | :---: |
| Invalid opcode, \#UD | X | X |  | XOP instructions are only recognized in protected mode. |
|  |  |  | X | The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h. |
|  |  |  | X | The emulate bit (EM) of CR0 was set to 1. |
|  |  |  | X | The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h. |
|  |  |  | X | The operating-system YMM support bits XFEATURE_ENABLED_MASK[2:1] were not both set to 1. |
|  |  |  | X | XOP.L was set to 1. |
| Device not available, \#NM |  |  | X | The task-switch bit (TS) of CR0 was set to 1. |
| Stack, \#SS |  |  | X | A memory address exceeded the stack segment limit or was non-canonical. |
| General protection, \#GP |  |  | X | A memory address exceeded a data segment limit or was non-canonical. |
|  |  |  | X | A null data segment was used to reference memory. |
| Page fault, \#PF |  |  | X | A page fault resulted from the execution of the instruction. |
| Alignment Check, \#AC |  |  | X | An unaligned memory reference was performed while alignment checking was enabled. |


