I am using vtune on a numerically intensive Fortran code with input parameters JD and KD which control the problem size. When I run with input parameters JD=41 and KD=41, vtune highlighted "4K Aliasing". This was new to me so I educated myself a bit about write-after-read hazards. So far, so good. Inside vtune, there are two subroutines which show 4K aliasing numbers of 1.000. One of the subroutines is essentially this:
    SUBROUTINE DECJ ( JPER,B,D,H,XSC,JD,KD )
    LOGICAL, INTENT (IN) :: JPER
    INTEGER, INTENT (IN) :: JD,KD
    REAL*8, DIMENSION(JD,KD), INTENT (INOUT) :: B,D
    REAL*8, DIMENSION(JD,KD), INTENT (IN) :: H,XSC
    INTEGER :: J,JP,JM,K

    DO K = 1,KD
       DO J = 2,JD-1
          JP = J+1
          JM = J-1
          B(JP,K) = B(JP,K) - H(JP,K)*(0.5*XSC(J,K))
          D(JM,K) = D(JM,K) + H(JM,K)*(0.5*XSC(J,K))
       ENDDO
    ENDDO
    END SUBROUTINE DECJ

This is called twice:

    CALL DECJ ( JPER,B,D,H,XSCP,JD,KD )
    CALL DECJ ( JPER,BT,DT,H,XSCM,JD,KD )
The arguments here are automatic arrays in the calling routine, which declares several of them like this:
    REAL*8, DIMENSION(JD,KD) :: A,B,C,D,E
    REAL*8, DIMENSION(JD,KD) :: AT,BT,CT,DT,ET
    REAL*8, DIMENSION(JD,KD,5) :: G
    REAL*8, DIMENSION(JD,KD) :: H,UU,XSCP,XSCM
My basic question is: what specifically triggers 4K aliasing in the case JD=41, KD=41 but not in the case JD=41, KD=40? (Experimentally, with JD=41 and KD=40, vtune shows minimal 4K aliasing in subroutine decj; the aliasing number is 0.109.)
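For reference, each REAL*8 array of shape (JD,KD) occupies 8*JD*KD bytes: 13,448 bytes for KD=41 and 13,120 bytes for KD=40, so changing KD by one shifts the spacing between consecutive arrays by 328 bytes. The sketch below computes that spacing modulo 4096. It assumes the automatic arrays are packed back to back with no padding, which is only a guess: `-pad`, `-align` and the stack placement may change the real layout. The module and function names (`array_layout`, `array_bytes`, `start_gap_mod4k`) are made up for illustration.

```fortran
module array_layout
  use iso_fortran_env, only: int64
  implicit none
contains

  ! Bytes occupied by one REAL*8 array of shape (jd,kd).
  pure function array_bytes(jd, kd) result(nbytes)
    integer, intent(in) :: jd, kd
    integer(int64) :: nbytes
    nbytes = 8_int64 * int(jd, int64) * int(kd, int64)
  end function array_bytes

  ! Distance, modulo 4096, between the first elements of two arrays
  ! that sit `slots` arrays apart.
  ! ASSUMPTION: arrays are contiguous in memory with no padding.
  pure function start_gap_mod4k(slots, jd, kd) result(gap)
    integer, intent(in) :: slots, jd, kd
    integer(int64) :: gap
    gap = modulo(int(slots, int64) * array_bytes(jd, kd), 4096_int64)
  end function start_gap_mod4k

end module array_layout
```

For example, `start_gap_mod4k(2, 41, 41)` gives the modulo-4096 spacing between B and D under that packing assumption. The actual addresses still have to be measured in the running program.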
Compilation was with ifort 2015.3.187 using the options:

    -O3 -axCORE-AVX2,AVX -xSSE4.2 -g -ip -pad -align -auto -fpe0 -ftz -traceback
The loop in decj is unrolled 4 times by the compiler, so presumably after unrolling it looks something like this:
    B(J+1,K) = B(J+1,K) - H(J+1,K)*(0.5*XSC(J,  K))
    D(J-1,K) = D(J-1,K) + H(J-1,K)*(0.5*XSC(J,  K))
    B(J+2,K) = B(J+2,K) - H(J+2,K)*(0.5*XSC(J+1,K))
    D(J,  K) = D(J,  K) + H(J,  K)*(0.5*XSC(J+1,K))
    B(J+3,K) = B(J+3,K) - H(J+3,K)*(0.5*XSC(J+2,K))
    D(J+1,K) = D(J+1,K) + H(J+1,K)*(0.5*XSC(J+2,K))
    B(J+4,K) = B(J+4,K) - H(J+4,K)*(0.5*XSC(J+3,K))
    D(J+2,K) = D(J+2,K) + H(J+2,K)*(0.5*XSC(J+3,K))

I did some testing and couldn't find any addresses that differed by a multiple of 4096. The worst I could find was some addresses that differed by a multiple of 256.
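One way such an address check could be written is sketched below. It rests on a rough model, assumed here rather than confirmed for this CPU: a load may be flagged when it follows a store whose address matches in the low 12 bits, within the access width, while the full addresses do not actually overlap. Under that model the array base addresses need not differ by an exact multiple of 4096; what matters is each store/load pair of elements inside the loop, for example the store to B(J+1,K) followed by the load of H(J-1,K). The names `alias4k`, `low12`, `gap4k` and `may_alias_4k` are hypothetical, and `width` is an assumption (8 bytes for scalar REAL*8, 32 for an AVX2 vector access).

```fortran
module alias4k
  use iso_fortran_env, only: int64
  implicit none
  private
  public :: low12, gap4k, may_alias_4k
  integer(int64), parameter :: page = 4096_int64
contains

  ! Low 12 bits of a byte address (its offset within a 4 KiB page).
  pure function low12(addr) result(r)
    integer(int64), intent(in) :: addr
    integer(int64) :: r
    r = iand(addr, page - 1_int64)
  end function low12

  ! Distance from a store address to a later load address, modulo 4096.
  pure function gap4k(store_addr, load_addr) result(g)
    integer(int64), intent(in) :: store_addr, load_addr
    integer(int64) :: g
    g = modulo(load_addr - store_addr, page)
  end function gap4k

  ! Rough model (an assumption, not a hardware specification):
  ! true when the two accesses overlap in their low 12 bits
  ! although the full addresses are at least `width` bytes apart.
  pure function may_alias_4k(store_addr, load_addr, width) result(hit)
    integer(int64), intent(in) :: store_addr, load_addr, width
    logical :: hit
    integer(int64) :: g
    g = gap4k(store_addr, load_addr)
    hit = abs(load_addr - store_addr) >= width .and. &
          (g < width .or. g > page - width)
  end function may_alias_4k

end module alias4k
```

Inside the calling routine the addresses could be taken with the nonstandard `LOC` extension that ifort provides, for instance `may_alias_4k(int(LOC(B(J+1,K)), int64), int(LOC(H(J-1,K)), int64), 8_int64)`, and evaluated for every store/load pair in the unrolled loop body.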