Improving the Performance of the Secure Hash Algorithm (SHA-1)

Maxim Loktyukhin

Improving the performance of the Secure Hash Algorithm (SHA-1) with Intel® Supplemental SSE3 instructions.

Introduction

Nowadays we are seeing more and more services in the Internet and personal computing requiring secure data communications and storage. HTTPS with the SSL and TLS protocols are replacing non-secured HTTP traffic in online commerce, banking, secure web mail, and generally what is now being called The Internet Cloud. File system encryption is becoming ubiquitous on client operating systems as well as in enterprise solutions. In addition to security applications, we are seeing the usage of cryptographic hash algorithms in data de-duplication applications for storage and networking. These trends are very well reflected by the feedback we receive from our customers and partners who are seeing a continuous shift in their workload distribution towards more cryptographic computations. We are doing our best to address these compute intensive workloads. In 2010 Intel introduced AES-NI - new Intel Architecture (IA) instruction set extension for the Advanced Encryption Standards (AES) - first implemented in processors produced with 32-nm technology like Intel® Xeon® 5600 and 2010 Intel® Core^TM processor families, utilizing u-architecture codenamed ‘Westemere', with dramatic performance improvements [1]. We also found that it is possible to leverage previously introduced IA instruction set extensions with some algorithmic innovations to increase the performance of the widely used Secure Hash Algorithm (SHA-1) [2]. This is the subject of this article.

Brief history and prior implementations of SHA-1

SHA-1 has been implemented many times in various software and hardware products, including implementations for the Intel Architecture instruction set. One must note that virtually all widely used software implementations of SHA-1 are using only scalar ALU operations on the general purpose register set, while Intel has introduced several generations of vector (or SIMD) instruction set extensions to IA32 and Intel64. Some of them, namely Intel SSE2, Intel Supplemental SSE3 (SSSE3) and Intel SSE4 are particularly useful for integer algorithms.

SHA-1 received its share of researchers' attention. Besides research devoted to the security aspects such as strength and potential weaknesses, there have also been explorations paying attention to implementation efficiency with regard to modern superscalar and parallel computer architectures, like [3] or [4]. The summary is that although generally SHA-1 is a vastly parallel algorithm in terms of instruction level parallelism (it would have been able to efficiently utilize up to 7 ALU pipelines), it does not suit itself easily for an efficient implementation with SIMD instruction set architectures. Our work, however, demonstrates that SHA-1 can efficiently use SIMD to better utilize the capabilities of IA for enhanced performance.

There have been attempts to implement SHA-1 with SSE2. One good example is the implementation made by Dean Gaudet [5]. We built on some of Dean's ideas with several new important improvements.

It would be appropriate for the reader who is not familiar with SHA-1 to give a brief introduction. Certainly more details can be found in [2] or [6].

SHA-1 produces a 160 bit (20 byte) hash value (digest), taking as an input a sequence of 1 or more 512 bit (64 byte) data blocks. The original source data also requires some padding according to the standard. The data is treated as an array of big-endian 32-bit values. Processing each of the 64-byte input blocks consists of 80 iterations also known as rounds.

Schematically SHA-1 looks like this using C-like syntax:





	int F1(int B, int C, int D) { return (D ^ ( B & (C ^ D)); }

		int F2(int B, int C, int D) { return (D ^ B ^ C); }

		int F3(int B, int C, int D) { return (B & C) | (D & (B ^ C)); }

		

		#define K1 0x5A827999

		#define K2 0x6ED9EBA1

		#define K3 0x8F1BBCDC

		#define K4 0xCA62C1D6

		

		int K( int i ) { return

		i < 20 ? K1 : 

		i < 40 ? K2 :

		i < 60 ? K3 : K4;

		}

		

		int F( int i, int B, int C, int D ) { return

		i < 20 ? F1(B, C, D) : 

		i < 40 ? F2(B, C, D) : 

		i < 60 ? F3(B, C, D) : 

		F2(B, C, D) ;

		}

		

		// Update HASH[] by processing a one 64-byte block in MESSAGE[]

		void SHA1( int HASH[], int MESSAGE[] )

		{

		// these arrays are not necessary but used to better highlight dependencies

		int A[81], B[81], C[81], D[81], E[81];

		int W[80];

		

		int i, FN;

		

		A[0] = HASH[0]; 

		B[0] = HASH[1];

		C[0] = HASH[2];

		D[0] = HASH[3];

		E[0] = HASH[4];

		

		for ( i=0; i<80; ++i )

		{

		// W[i] calculation, also known as 'Message Scheduling'

		if ( i < 16 )

		// reverse the order of bytes on little-endian

		W[i] = BIG_ENDIAN_LOAD( MESSAGE[i] );

		else

		W[i] = ROTATE_LEFT( W[i-3] ^ W[i-8] ^ W[i-14] ^ W[i-16], 1 );

		

		FN = F( i, B[i], C[i], D[i] );

		

		A[i+1] = FN + E[i] + ROTATE_LEFT( A[i], 5 ) + W[i] + K(i);

		B[i+1] = A[i];

		C[i+1] = ROTATE_LEFT( B[i], 30 );

		D[i+1] = C[i];

		E[i+1] = D[i];

		}

		

		HASH[0] += A[80];

		HASH[1] += B[80];

		HASH[2] += C[80];

		HASH[3] += D[80];

		HASH[4] += E[80];

		}

Employing SIMD to vectorize SHA-1 computation

Intel SSE2 allows parallel operations on four 32-bit integers, stored in 128-bit XMM registers. The known SSE2 vectorization attempts of SHA-1 are rightfully focused on the message scheduling part for rounds 17-80, as a relatively isolated compute chain of W values which can be allocated in SIMD registers.

In the following statements:




W[i] = (W[i-3] ^ W[i-8] ^ W[i-14] ^ W[i-16]) rol 1;        (**)

	...

	W[i] + K(i);

It is obvious, that the dependency W[i] -> W[i-3] prevents straightforward vectorization of this part for the architectures with vector lengths more than 3. There are two approaches to have it vectorized on the machine with registers of 4 elements:

1) The vectorization of only part of the expression: (W[i-8] ^ W[i-14] ^ W[i-16])




W'[i  ] = W[i-8] ^ W[i-14] ^ W[i-16]

	W'[i+1] = W[i-7] ^ W[i-13] ^ W[i-15]

	W'[i+2] = W[i-6] ^ W[i-12] ^ W[i-14]

	W'[i+3] = W[i-5] ^ W[i-11] ^ W[i-13]

It takes 4 XMM registers to keep the last 16 W values. Vectors W[i-8:i-5] and W[i-16:i-13] are naturally aligned with XMM registers, while vector W[i-14:i-11] requires data from two different XMM registers. This can be most efficiently computed with the Intel® Supplemental SSE3 (SSSE3) instruction PALIGNR.

And then the scalar code does the rest of the computation steps:




W[i] = (W'[i] ^ W'[i-3]) rol 1; ... W[i] + K;

Note that the completed computation of W[i] needs to be reflected back to the vector side of the processor in time to be used in the next round of calculations.

2) Vectorize the whole expression plus more, overcoming dependency with additional iteration (approach by Dean Gaudet):




// done on 4 consequtive W[i] values in a single XMM register

	W[i  ] = (W[i-3] ^ W[i-8] ^ W[i-14] ^ W[i-16]) rol 1

	W[i+1] = (W[i-2] ^ W[i-7] ^ W[i-13] ^ W[i-15]) rol 1

	W[i+2] = (W[i-1] ^ W[i-6] ^ W[i-12] ^ W[i-14]) rol 1

	W[i+3] = (   0   ^ W[i-5] ^ W[i-11] ^ W[i-13]) rol 1

	

	// this additional calculation unfortunately requires many additional operations

	W[i+3] ^= W[i] rol 1

	

	// once we have 4 W[i] values in XMM we can also add four K values with one instruction

	W[i:i+3] += {K,K,K,K}

Both require either extensive amount of data shuffling between general purpose and XMM registers (1) or in XMM registers (2), and if applied for all rounds 17-80 end up being not much faster than the best available scalar code.

Our implementation uses 2^nd approach for the rounds 17-32 (4 vectorized iterations) only. But for the rounds 33 to 80 we came up with the improved vectorization presented below.

Equal transformation of SHA-1 algorithm to remove vectorization overhead

The 2^nd approach relies on the following distributive identity:




(A ^ B) rol C = (A rol C) ^ (B rol C)                      (1)

Let's also employ another one. If W defined as:




W[i] = W[i-A] ^ W[i-B]

Then, as can be shown with basic substitution (see below), it is equivalent to:




W[i] = W[i-2*A] ^ W[i-2*B]                                 (2)

if values (i-2*A) and (i-2*B) are still in the defined range for i

For those curious to see the proof of it:

If (i-2*A) and (i-2*B) are in the defined range for i, then:

By substitution:


 W[i-A] = W[i-A-A] ^ W[i-A-B];

	W[i-B] = W[i-B-A] ^ W[i-B-B];

	

	W[i] = (W[i-A-A] ^ W[i-A-B]) ^ (W[i-B-A] ^ W[i-B-B]);

Considering that: W[i-A-B] = W[i-B-A];




	W[i] = W[i-A-A] ^ W[i-B-B];

	

	W[i] = W[i-2*A] ^ W[i-2*B]       QED.

We need one more basic identity:




(X rol 1) rol 1 = X rol 2                                  (3)

Now, if we apply (2) in an extended form for four elements being XOR'ed instead of two, with rotates and XOR distribution identities (1) and (3) to the statement (**) calculating W[i] in SHA-1, it allows the following transformation:




W[i] = (W[i-3] ^ W[i- 8] ^ W[i-14] ^ W[i-16]) rol 1

for i>32, is equivalent to:




W[i] = (W[i-6] ^ W[i-16] ^ W[i-28] ^ W[i-32]) rol 2 (4)

The outcome is that the dependency W[i]->W[i-3] was transformed into W[i]->W[i-6], without increasing the amount of calculations, thanks to mutual cancelling of equal values with XOR. It can now be vectorized with 4 element SIMD in a very straightforward way for rounds 33-80. Also quite fortunately vectors W[i-16], W[i-28] and W[i-32] align themselves perfectly with XMM registers, each of which keeps 4 elements of array W[], and only the one starting at W[i-6] requires forming the vector from the two "neighbour" XMM registers. This still can be handled very efficiently with a single SSSE3 PALIGNR instruction. The snippet of the vectorized code to compute 4 W values for the rounds 33-80 is below (all W's below are XMM registers, each keeping 4 consecutive values as defined by algorithm):





	movdqa  W_TMP, W_minus_04

		pxor    W, W_minus_28            ; W equals W[i-32:i-29] before XOR

		; W = W[i-32:i-29] ^ W[i-28:i-25]

		palignr W_TMP, W_minus_08, 8     ; W_TMP = W[i-6:i-3], combined from

		; W[i-4:i-1] and W[i-8:i-5] vectors

		pxor    W, W_minus_16            ; W = (W[i-32:i-29] ^ W[i-28:i-25]) ^ W[i-16:i-13]

		pxor    W, W_TMP                 ; W = (W[i-32:i-29] ^ W[i-28:i-25]  ^ W[i-16:i-13]) ^ W[i-6:i-3])

		movdqa  W_TMP, W                 ; 4 dwords in W are rotated left by 2

		psrld   W, 30                    ; rotate left by 2     W = (W >> 30) | (W << 2)

		pslld   W_TMP, 2

		por     W_TMP, W

		movdqa  W, W_TMP                 ; four new W values W[i:i+3] are now calculated

		paddd   W_TMP, [K_XMM]           ; adding 4 current round's values of K

		movdqa  [WK(i)], W_TMP           ; storing for downstream GPR instructions to read

In the end four values of W[i]+K are calculated in W_TMP XMM register, then we use memory instructions (every 128-bit store is later read by four 32-bit loads) as a way to move precalculated data from XMM to general purpose registers domain. It allows offloading of these operations to memory unit where stores and loads can be executed independently from ALU operations. Latency of the L1 cache store plus load is either completely hidden since W pre-computation can be scheduled early or served by store to load forwarding mechanism in the microarchitecture.

There are a few other improvements worth mentioning. The use of SSSE3 instruction PSHUFB allows efficient conversion between big- and little-endian data formats for rounds 1 to 16, where values of W[i] are read from the message data, 4 values are converted by a single PSHUFB instruction. The other optimization is software pipelining: if an application uses the interface to SHA-1 update function such that it allows to process several continuous 64-byte blocks at one call, then applying software pipelining (also used by Dean in his implementation) may give about 5% to 10% of additional performance improvement.

Performance benefits and integration

The overall performance improvement of this implementation over the best known scalar implementations ranges from ~1.2X to ~1.5X, achieving as low as 5.8 cycles per byte on 1024-byte buffer being hashed on the latest generations of Intel processors. Also, here at Intel, we have a unique opportunity to use performance simulators to measure performance and tune the code for the several future generations of processors in a development pipeline. This code was very well tuned and will maintain and improve its performance advantage over the scalar implementations going forward.

We are providing 64-bit assembly implementation of SHA-1 using platform independent NASM[7] syntax (it was also tested with YASM[8]). We wanted the final performance to be the best and independent of the platform and compiler being used. Assembler, in the case of SHA-1, has a solid performance advantage over C/C++ code with all the compilers we tried. The speedups ranged from large to moderate. Coding in assembly language allowed perfect control over the instruction scheduling as well as register allocation, which compilers cannot yet achieve. The assembly code employs macros efficiently to remain high-level. The code should be easy to integrate into existing applications. Please refer to the instructions in the beginning of the source file. We took care of the CPUID SSSE3 feature flag check and dispatch. A 32-bit version may show up later too, although it must be noted that the smaller number of registers available to 32-bit apps will impact performance compared to 64-bit code.

Finale

This implementation notably advances the performance of SHA-1 algorithm compared to existing implementations. We are encouraging all projects utilizing SHA-1 to integrate this new fast implementation and are ready to help if issues or concerns arise (you are welcome to leave a comment or write an email to the authors). It is provided 'as is' and free for either commercial or non-commercial use. We are already working with several interested customers and open source projects to integrate it. Hopefully you will see this SHA-1 implementation as a part of OpenSSL library, and, of course, Intel® Integrated Performance Primitives will have it soon too.

I would like to thank and give credit to Dean Gaudet for his SHA-1 vectorization research, with several key ideas used in this implementation, as well as to a few exceptional engineers from various corners of Intel who contributed to this arguably world-best SHA-1 implementation for IA at the moment. They are: Ronen Zohar, Jim Guilford, Vinodh Gopal, Wajdi Feghali, Mark Buxton, Shay Gueron, Zino Benaissa and Anu Suryanarayanan. Thank you very much.

Appendix A.





	;---------------------

		;

		;   This code implements two interfaces of SHA-1 update function: 1) working on a single

		;   64-byte block and 2) working on a buffer of multiple 64-bit blocks. Multiple blocks

		;   version of code is software pipelined and faster overall, it is a default. Assemble

		;   with -DINTEL_SHA1_SINGLEBLOCK to select single 64-byte block function interface.

		;

		;   C++ prototypes of implemented functions are below:

		;

		;   #ifndef INTEL_SHA1_SINGLEBLOCK

		;      // Updates 20-byte SHA-1 record in 'hash' for 'num_blocks' consequtive 64-byte blocks

		;      extern "C" void sha1_update_intel(int *hash, const char* input, size_t num_blocks );

		;   #else

		;      // Updates 20-byte SHA-1 record in 'hash' for one 64-byte block pointed by 'input'

		;      extern "C" void sha1_update_intel(int *hash, const char* input);

		;   #endif

		;

		;   Function name 'sha1_update_intel' can be changed in the source or via macro:

		;     -DINTEL_SHA1_UPDATE_FUNCNAME=my_sha1_update_func_name

		;

		;   It implements both UNIX(default) and Windows ABIs, use -DWIN_ABI on Windows

		;

		;   Code checks CPU for SSSE3 support via CPUID feature flag (CPUID.1.ECX.SSSE3[bit 9]==1),

		;   and performs dispatch. Since in most cases the functionality on non-SSSE3 supporting CPUs

		;   is also required, the default (e.g. one being replaced) function can be provided for

		;   dispatch on such CPUs, the name of old function can be changed in the source or via macro:

		;      -DINTEL_SHA1_UPDATE_DEFAULT_DISPATCH=default_sha1_update_function_name

		;

		;   Authors: Maxim Locktyukhin and Ronen Zohar at Intel.com

		;

		

		%ifndef INTEL_SHA1_UPDATE_DEFAULT_DISPATCH

		;; can be replaced with a default SHA-1 update function name

		%define INTEL_SHA1_UPDATE_DEFAULT_DISPATCH  sha1_intel_non_ssse3_cpu_stub_  

		%else

		extern  INTEL_SHA1_UPDATE_DEFAULT_DISPATCH

		%endif

		

		;; provide alternative SHA-1 update function's name here

		%ifndef INTEL_SHA1_UPDATE_FUNCNAME

		%define INTEL_SHA1_UPDATE_FUNCNAME     sha1_update_intel

		%endif

		

		global INTEL_SHA1_UPDATE_FUNCNAME

		

		

		%ifndef INTEL_SHA1_SINGLEBLOCK

		%assign multiblock 1

		%else

		%assign multiblock 0

		%endif

		

		

		bits 64

		default rel

		

		%ifdef WIN_ABI

		%xdefine arg1 rcx

		%xdefine arg2 rdx

		%xdefine arg3 r8

		%else

		%xdefine arg1 rdi

		%xdefine arg2 rsi

		%xdefine arg3 rdx

		%endif

		

		%xdefine ctx arg1

		%xdefine buf arg2

		%xdefine cnt arg3

		

		%macro REGALLOC 0

		%xdefine A ecx

		%xdefine B esi

		%xdefine C edi

		%xdefine D ebp

		%xdefine E edx

		

		%xdefine T1 eax

		%xdefine T2 ebx

		%endmacro

		

		%xdefine K_BASE     r8

		%xdefine HASH_PTR   r9

		%xdefine BUFFER_PTR r10

		%xdefine BUFFER_END r11

		

		%xdefine W_TMP  xmm0

		%xdefine W_TMP2 xmm9

		

		%xdefine W0  xmm1

		%xdefine W4  xmm2

		%xdefine W8  xmm3

		%xdefine W12 xmm4

		%xdefine W16 xmm5

		%xdefine W20 xmm6

		%xdefine W24 xmm7

		%xdefine W28 xmm8

		

		%xdefine XMM_SHUFB_BSWAP xmm10

		

		;; we keep window of 64 w[i]+K pre-calculated values in a circular buffer

		%xdefine WK(t) (rsp + (t & 15)*4)

		

		;------------------------------------------------------------------------------

		;

		; macro implements SHA-1 function's body for single or several 64-byte blocks

		; first param: function's name

		; second param: =0 - function implements single 64-byte block hash

		;               =1 - function implements multiple64-byte blocks hash

		;                    3rd function's argument is a number, greater 0, of 64-byte blocks to calc hash for

		;

		%macro  SHA1_VECTOR_ASM  2

		align 4096

		%1:

		push rbx

		push rbp

		

		%ifdef WIN_ABI

		push rdi

		push rsi

		

		%xdefine stack_size (16*4 + 16*5 + 8)

		%else

		%xdefine stack_size (16*4 + 8)

		%endif    

		

		sub     rsp, stack_size

		

		%ifdef WIN_ABI

		%xdefine xmm_save_base (rsp + 16*4)

		

		xmm_mov [xmm_save_base + 0*16], xmm6

		xmm_mov [xmm_save_base + 1*16], xmm7

		xmm_mov [xmm_save_base + 2*16], xmm8

		xmm_mov [xmm_save_base + 3*16], xmm9

		xmm_mov [xmm_save_base + 4*16], xmm10

		%endif

		

		mov     HASH_PTR, ctx

		mov     BUFFER_PTR, buf

		

		%if (%2 == 1)

		shl     cnt, 6           ;; mul by 64

		add     cnt, buf

		mov     BUFFER_END, cnt

		%endif

		

		lea     K_BASE, [K_XMM_AR]

		xmm_mov XMM_SHUFB_BSWAP, [bswap_shufb_ctl]

		

		SHA1_PIPELINED_MAIN_BODY %2

		

		%ifdef WIN_ABI

		xmm_mov xmm6, [xmm_save_base + 0*16]

		xmm_mov xmm7, [xmm_save_base + 1*16]

		xmm_mov xmm8, [xmm_save_base + 2*16]

		xmm_mov xmm9, [xmm_save_base + 3*16]

		xmm_mov xmm10,[xmm_save_base + 4*16]

		%endif

		

		add rsp, stack_size

		

		%ifdef WIN_ABI

		pop rsi

		pop rdi

		%endif

		

		pop rbp

		pop rbx

		

		ret

		%endmacro

		

		;--------------------------------------------

		; macro implements 80 rounds of SHA-1, for one 64-byte block or multiple blocks with s/w pipelining

		; macro param: =0 - process single 64-byte block

		;              =1 - multiple blocks

		;

		%macro SHA1_PIPELINED_MAIN_BODY 1

		

		REGALLOC

		

		mov A, [HASH_PTR   ]

		mov B, [HASH_PTR+ 4]

		mov C, [HASH_PTR+ 8]

		mov D, [HASH_PTR+12]

		

		mov E, [HASH_PTR+16]

		

		%assign i 0

		%rep    W_PRECALC_AHEAD

		W_PRECALC i

		%assign i i+1

		%endrep

		

		%xdefine F F1

		

		%if (%1 == 1)                         ;; code loops through more than one block

		%%_loop:

		cmp BUFFER_PTR, K_BASE          ;; we use K_BASE value as a signal of a last block,

		jne %%_begin                    ;; it is set below by: cmovae BUFFER_PTR, K_BASE

		jmp %%_end

		

		align 32

		%%_begin:

		%endif

		RR A,B,C,D,E,0

		RR D,E,A,B,C,2

		RR B,C,D,E,A,4

		RR E,A,B,C,D,6

		RR C,D,E,A,B,8

		

		RR A,B,C,D,E,10

		RR D,E,A,B,C,12

		RR B,C,D,E,A,14

		RR E,A,B,C,D,16

		RR C,D,E,A,B,18

		

		%xdefine F F2

		

		RR A,B,C,D,E,20

		RR D,E,A,B,C,22

		RR B,C,D,E,A,24

		RR E,A,B,C,D,26

		RR C,D,E,A,B,28

		

		RR A,B,C,D,E,30

		RR D,E,A,B,C,32

		RR B,C,D,E,A,34

		RR E,A,B,C,D,36

		RR C,D,E,A,B,38

		

		%xdefine F F3

		

		RR A,B,C,D,E,40

		RR D,E,A,B,C,42

		RR B,C,D,E,A,44

		RR E,A,B,C,D,46

		RR C,D,E,A,B,48

		

		RR A,B,C,D,E,50

		RR D,E,A,B,C,52

		RR B,C,D,E,A,54

		RR E,A,B,C,D,56

		RR C,D,E,A,B,58

		

		%xdefine F F4

		

		%if (%1 == 1)                         ;; if code loops through more than one block

		add   BUFFER_PTR, 64            ;; move to next 64-byte block

		cmp   BUFFER_PTR, BUFFER_END    ;; check if current block is the last one

		cmovae BUFFER_PTR, K_BASE       ;; smart way to signal the last iteration

		%else

		%xdefine W_NO_TAIL_PRECALC 1    ;; no software pipelining for single block interface

		%endif

		

		RR A,B,C,D,E,60

		RR D,E,A,B,C,62

		RR B,C,D,E,A,64

		RR E,A,B,C,D,66

		RR C,D,E,A,B,68

		

		RR A,B,C,D,E,70

		RR D,E,A,B,C,72

		RR B,C,D,E,A,74

		RR E,A,B,C,D,76

		RR C,D,E,A,B,78

		

		UPDATE_HASH [HASH_PTR   ],A

		UPDATE_HASH [HASH_PTR+ 4],B

		UPDATE_HASH [HASH_PTR+ 8],C

		UPDATE_HASH [HASH_PTR+12],D

		UPDATE_HASH [HASH_PTR+16],E

		

		%if (%1 == 1)

		jmp %%_loop

		

		align 32  

		%%_end:

		%endif

		

		

		%xdefine W_NO_TAIL_PRECALC 0

		%xdefine F %error

		

		%endmacro

		

		

		%macro F1 3

		mov T1,%2

		xor T1,%3

		and T1,%1

		xor T1,%3

		%endmacro

		

		%macro F2 3

		mov T1,%3

		xor T1,%2

		xor T1,%1

		%endmacro

		

		%macro F3 3

		mov T1,%2

		mov T2,%1

		or  T1,%1

		and T2,%2

		and T1,%3

		or  T1,T2

		%endmacro

		

		%define F4 F2

		

		%macro UPDATE_HASH 2

		add %2, %1

		mov %1, %2

		%endmacro

		

		

		%macro W_PRECALC 1

		%xdefine i (%1)

		

		%if (i < 20)

		%xdefine K_XMM  0

		%elif (i < 40)

		%xdefine K_XMM  16

		%elif (i < 60)

		%xdefine K_XMM  32

		%else

		%xdefine K_XMM  48

		%endif

		

		%if (i<16 || (i>=80 && i<(80 + W_PRECALC_AHEAD)))

		

		%if (W_NO_TAIL_PRECALC == 0)

		

		%xdefine i ((%1) % 80)        ;; pre-compute for the next iteration

		

		%if (i == 0)

		W_PRECALC_RESET

		%endif

		

		

		W_PRECALC_00_15

		%endif

		

		%elif (i < 32)

		W_PRECALC_16_31

		%elif (i < 80)   ;; rounds 32-79

		W_PRECALC_32_79

		%endif

		%endmacro

		

		%macro W_PRECALC_RESET 0

		%xdefine    W             W0

		%xdefine    W_minus_04    W4

		%xdefine    W_minus_08    W8

		%xdefine    W_minus_12    W12

		%xdefine    W_minus_16    W16

		%xdefine    W_minus_20    W20

		%xdefine    W_minus_24    W24

		%xdefine    W_minus_28    W28

		%xdefine    W_minus_32    W

		%endmacro

		

		%macro W_PRECALC_ROTATE 0

		%xdefine    W_minus_32    W_minus_28

		%xdefine    W_minus_28    W_minus_24

		%xdefine    W_minus_24    W_minus_20

		%xdefine    W_minus_20    W_minus_16

		%xdefine    W_minus_16    W_minus_12

		%xdefine    W_minus_12    W_minus_08

		%xdefine    W_minus_08    W_minus_04

		%xdefine    W_minus_04    W

		%xdefine    W             W_minus_32

		%endmacro

		

		%xdefine W_PRECALC_AHEAD   16

		%xdefine W_NO_TAIL_PRECALC 0

		

		

		%xdefine xmm_mov            movdqa

		

		%macro W_PRECALC_00_15 0

		;; message scheduling pre-compute for rounds 0-15

		%if ((i & 3) == 0)       ;; blended SSE and ALU instruction scheduling, 1 vector iteration per 4 rounds

		movdqu W_TMP, [BUFFER_PTR + (i * 4)]

		%elif ((i & 3) == 1)

		pshufb W_TMP, XMM_SHUFB_BSWAP

		movdqa W, W_TMP

		%elif ((i & 3) == 2)

		paddd  W_TMP, [K_BASE]

		%elif ((i & 3) == 3)

		movdqa  [WK(i&~3)], W_TMP

		

		W_PRECALC_ROTATE

		%endif

		%endmacro

		

		%macro W_PRECALC_16_31 0

		;; message scheduling pre-compute for rounds 16-31

		;; calculating last 32 w[i] values in 8 XMM registers

		;; pre-calculate K+w[i] values and store to mem, for later load by ALU add instruction

		;;

		;; "brute force" vectorization for rounds 16-31 only due to w[i]->w[i-3] dependency

		;;

		%if ((i & 3) == 0)    ;; blended SSE and ALU instruction scheduling, 1 vector iteration per 4 rounds

		movdqa  W, W_minus_12

		palignr W, W_minus_16, 8       ;; w[i-14]

		movdqa  W_TMP, W_minus_04

		psrldq  W_TMP, 4               ;; w[i-3]

		pxor    W, W_minus_08

		%elif ((i & 3) == 1)

		pxor    W_TMP, W_minus_16

		pxor    W, W_TMP

		movdqa  W_TMP2, W

		movdqa  W_TMP, W

		pslldq  W_TMP2, 12

		%elif ((i & 3) == 2)

		psrld   W, 31

		pslld   W_TMP, 1

		por     W_TMP, W

		movdqa  W, W_TMP2

		psrld   W_TMP2, 30

		pslld   W, 2

		%elif ((i & 3) == 3)

		pxor    W_TMP, W

		pxor    W_TMP, W_TMP2

		movdqa  W, W_TMP

		paddd   W_TMP, [K_BASE + K_XMM]

		movdqa  [WK(i&~3)],W_TMP

		

		W_PRECALC_ROTATE

		%endif

		%endmacro

		

		%macro W_PRECALC_32_79 0

		;; in SHA-1 specification: w[i] = (w[i-3] ^ w[i-8]  ^ w[i-14] ^ w[i-16]) rol 1

		;; instead we do equal:    w[i] = (w[i-6] ^ w[i-16] ^ w[i-28] ^ w[i-32]) rol 2

		;; allows more efficient vectorization since w[i]=>w[i-3] dependency is broken

		;;

		%if ((i & 3) == 0)    ;; blended SSE and ALU instruction scheduling, 1 vector iteration per 4 rounds

		movdqa  W_TMP, W_minus_04

		pxor    W, W_minus_28         ;; W is W_minus_32 before xor

		palignr W_TMP, W_minus_08, 8

		%elif ((i & 3) == 1)

		pxor    W, W_minus_16

		pxor    W, W_TMP

		movdqa  W_TMP, W

		%elif ((i & 3) == 2)

		psrld   W, 30

		pslld   W_TMP, 2

		por     W_TMP, W

		%elif ((i & 3) == 3)

		movdqa  W, W_TMP

		paddd   W_TMP, [K_BASE + K_XMM]

		movdqa  [WK(i&~3)],W_TMP

		

		W_PRECALC_ROTATE

		%endif

		%endmacro

		

		%macro RR 6             ;; RR does two rounds of SHA-1 back to back with W pre-calculation

		

		;;     TEMP = A

		;;     A = F( i, B, C, D ) + E + ROTATE_LEFT( A, 5 ) + W[i] + K(i)

		;;     C = ROTATE_LEFT( B, 30 )

		;;     D = C

		;;     E = D

		;;     B = TEMP

		

		W_PRECALC (%6 + W_PRECALC_AHEAD)

		F    %2, %3, %4     ;; F returns result in T1

		add  %5, [WK(%6)]

		rol  %2, 30

		mov  T2, %1

		add  %4, [WK(%6 + 1)]

		rol  T2, 5

		add  %5, T1

		

		W_PRECALC (%6 + W_PRECALC_AHEAD + 1)

		add  T2, %5

		mov  %5, T2

		rol  T2, 5

		add  %4, T2

		F    %1, %2, %3    ;; F returns result in T1

		add  %4, T1

		rol  %1, 30

		

		;; write:  %1, %2

		;; rotate: %1<=%4, %2<=%5, %3<=%1, %4<=%2, %5<=%3

		%endmacro

		

		

		

		;;----------------------

		section .data align=128

		

		%xdefine K1 0x5a827999

		%xdefine K2 0x6ed9eba1

		%xdefine K3 0x8f1bbcdc

		%xdefine K4 0xca62c1d6

		

		align 128

		K_XMM_AR:

		DD K1, K1, K1, K1

		DD K2, K2, K2, K2

		DD K3, K3, K3, K3

		DD K4, K4, K4, K4

		

		align 16

		bswap_shufb_ctl:

		DD 00010203h

		DD 04050607h

		DD 08090a0bh

		DD 0c0d0e0fh

		

		;; dispatch pointer, points to the init routine for the first invocation

		sha1_update_intel_dispatched:

		DQ  sha1_update_intel_init_

		

		;;----------------------

		section .text align=4096

		

		SHA1_VECTOR_ASM     sha1_update_intel_ssse3_, multiblock

		

		align 32

		sha1_update_intel_init_:       ;; we get here with the first time invocation

		call    sha1_update_intel_dispacth_init_

		INTEL_SHA1_UPDATE_FUNCNAME:    ;; we get here after init

		jmp     qword [sha1_update_intel_dispatched]

		

		;; CPUID feature flag based dispatch

		sha1_update_intel_dispacth_init_:

		push    rax

		push    rbx

		push    rcx

		push    rdx

		push    rsi

		

		lea     rsi, [INTEL_SHA1_UPDATE_DEFAULT_DISPATCH]

		

		mov     eax, 1

		cpuid

		

		test    ecx, 0200h          ;; SSSE3 support, CPUID.1.ECX[bit 9]

		jz      _done

		

		lea     rsi, [sha1_update_intel_ssse3_]

		

		_done:

		mov     [sha1_update_intel_dispatched], rsi

		

		pop     rsi

		pop     rdx

		pop     rcx

		pop     rbx

		pop     rax

		ret

		

		;;----------------------

		;; in the case a default SHA-1 update function implementation was not provided

		;; and code was invoked on a non-SSSE3 supporting CPU, dispatch handles this

		;; failure in a safest way - jumps to the stub function with UD2 instruction below

		sha1_intel_non_ssse3_cpu_stub_:

		ud2     ;; in the case no default SHA-1 was provided non-SSSE3 CPUs safely fail here

		ret

		

		; END

		;----------------------