I am using VTune on a numerically intensive Fortran code whose input parameters JD and KD control the problem size. When I run with JD=41 and KD=41, VTune highlights "4K Aliasing". This was new to me, so I read up on it: as I understand it, a load that follows a store can be falsely treated as dependent on that store when the two addresses differ by a multiple of 4096 bytes. So far, so good. In VTune, two subroutines show a 4K aliasing metric of 1.000. One of them is essentially this:
SUBROUTINE DECJ ( JPER,B,D,H,XSC,JD,KD )
  LOGICAL, INTENT (IN) :: JPER
  INTEGER, INTENT (IN) :: JD,KD
  REAL*8, DIMENSION(JD,KD), INTENT (INOUT) :: B,D
  REAL*8, DIMENSION(JD,KD), INTENT (IN) :: H,XSC
  INTEGER :: J,JP,JM,K
  DO K = 1,KD
    DO J = 2,JD-1
      JP = J+1
      JM = J-1
      ! Each iteration updates the neighbors J+1 (in B) and J-1 (in D)
      B(JP,K) = B(JP,K) - H(JP,K)*(0.5*XSC(J,K))
      D(JM,K) = D(JM,K) + H(JM,K)*(0.5*XSC(J,K))
    ENDDO
  ENDDO
END SUBROUTINE DECJ
This is called twice:
CALL DECJ ( JPER,B,D,H,XSCP,JD,KD )
CALL DECJ ( JPER,BT,DT,H,XSCM,JD,KD )
The arguments are automatic arrays in the calling routine, which declares several of them like this:
REAL*8, DIMENSION(JD,KD) :: A,B,C,D,E
REAL*8, DIMENSION(JD,KD) :: AT,BT,CT,DT,ET
REAL*8, DIMENSION(JD,KD,5) :: G
REAL*8, DIMENSION(JD,KD) :: H,UU,XSCP,XSCM
My basic question is: what specifically triggers 4K aliasing in the case JD=41, KD=41 but not in the case JD=41, KD=40? Experimentally, with JD=41 and KD=40, VTune shows minimal 4K aliasing in subroutine DECJ (the metric is 0.109).
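To put numbers on the two cases, the sketch below computes the size in bytes of one REAL*8 array of shape (JD,KD) and its remainder modulo 4096 for KD=40 and KD=41. It only does the arithmetic and assumes 8 bytes per element. It says nothing about where the compiler actually places the automatic arrays, which -pad, -align and the stack layout can all change. The program name ARRAY_SIZES is just an illustrative choice.

```fortran
PROGRAM ARRAY_SIZES
  IMPLICIT NONE
  INTEGER, PARAMETER :: JD = 41
  INTEGER :: KD, NBYTES

  ! Column stride is JD*8 bytes; one whole array is JD*KD*8 bytes.
  PRINT '(A,I0)', 'column stride in bytes = ', JD*8
  DO KD = 40, 41
    NBYTES = JD*KD*8   ! bytes in one REAL*8 array of shape (JD,KD)
    PRINT '(A,I0,A,I0,A,I0)', 'KD=', KD, ': bytes per array = ', NBYTES, &
          ', remainder mod 4096 = ', MODULO(NBYTES, 4096)
  END DO
END PROGRAM ARRAY_SIZES
```

For these sizes the column stride is 328 bytes, and one array occupies 13120 bytes (remainder 832) for KD=40 and 13448 bytes (remainder 1160) for KD=41.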
I compiled with ifort 2015.3.187 using these options:
-O3 -axCORE-AVX2,AVX -xSSE4.2 -g -ip -pad -align -auto -fpe0 -ftz -traceback
The loop in decj is unrolled 4 times by the compiler, so presumably after unrolling it looks something like this:
B(J+1,K) = B(J+1,K) - H(J+1,K)*(0.5*XSC(J, K))
D(J-1,K) = D(J-1,K) + H(J-1,K)*(0.5*XSC(J, K))
B(J+2,K) = B(J+2,K) - H(J+2,K)*(0.5*XSC(J+1,K))
D(J, K) = D(J, K) + H(J, K)*(0.5*XSC(J+1,K))
B(J+3,K) = B(J+3,K) - H(J+3,K)*(0.5*XSC(J+2,K))
D(J+1,K) = D(J+1,K) + H(J+1,K)*(0.5*XSC(J+2,K))
B(J+4,K) = B(J+4,K) - H(J+4,K)*(0.5*XSC(J+3,K))
D(J+2,K) = D(J+2,K) + H(J+2,K)*(0.5*XSC(J+3,K))
I did some testing and couldn't find any addresses that differed by a multiple of 4096. The worst I could find was some addresses that differed by a multiple of 256.
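A minimal sketch of this kind of address check is given below. It is an illustration, not the exact test that was run. It assumes that what matters is whether a store address and a later load address agree in their low 12 bits, and that with vector accesses a near miss (closer than the access width to a multiple of 4096) may also count. The 32-byte width, the helper name DIST4K and the choice of B(3,1) and D(1,1) as an example pair are all assumptions made for the sketch. The arrays here are ordinary program variables, so their addresses will not match those of the automatic arrays in the real code.

```fortran
PROGRAM CHECK4K
  USE, INTRINSIC :: ISO_C_BINDING, ONLY: C_LOC, C_INTPTR_T
  IMPLICIT NONE
  INTEGER, PARAMETER :: JD = 41, KD = 41
  ! Stand-ins for two of the arrays; C_LOC requires the TARGET attribute.
  REAL*8, DIMENSION(JD,KD), TARGET :: B, D
  INTEGER(C_INTPTR_T) :: ADDR_ST, ADDR_LD, DIST

  B = 0.0D0
  D = 0.0D0
  ! Example pair taken from iteration J=2: store to B(3,1), load of D(1,1).
  ADDR_ST = TRANSFER(C_LOC(B(3,1)), 0_C_INTPTR_T)
  ADDR_LD = TRANSFER(C_LOC(D(1,1)), 0_C_INTPTR_T)
  DIST = DIST4K(ADDR_ST, ADDR_LD)
  PRINT '(A,I0)', 'bytes from nearest multiple of 4096: ', DIST
  ! Assumed access width of 32 bytes (one AVX register).
  PRINT '(A,L1)', 'possible 4K alias: ', DIST < 32
CONTAINS
  ! Distance in bytes between (A2 - A1) and the nearest multiple of 4096.
  ! A result of 0 for identical addresses is a true dependence, not aliasing.
  PURE FUNCTION DIST4K(A1, A2) RESULT(DIST)
    INTEGER(C_INTPTR_T), INTENT(IN) :: A1, A2
    INTEGER(C_INTPTR_T) :: DIST, R
    R = MODULO(A2 - A1, 4096_C_INTPTR_T)
    DIST = MIN(R, 4096_C_INTPTR_T - R)
  END FUNCTION DIST4K
END PROGRAM CHECK4K
```

Calling DIST4K for every store/load pair in the unrolled loop body, rather than the single pair shown, would cover accesses that land just short of or just past a 4096-byte offset as well as exact multiples.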