[gem5-users] CLFLUSH cause cache misses in stats but data reading time not effected [same data read time irrespective data comes from memory or cache]
Usman Ali
2018-10-24 11:50:36 UTC
Dear all,

I am in a scenario where I need the simulator to produce different data read
times for memory and cache, but I am getting a constant read time
irrespective of whether the data comes from memory or from the caches.

I used the clflush instruction to flush the cache line, so that the system
reads from memory.

clflush causes cache misses in the stats, but the read time measured within
the program suggests the data comes from the cache even though it is coming
from memory. [rdtsc is used for time measurement]

I simulated the program with the TimingSimple and DerivO3 CPUs, but the issue
is still there. Is this an issue with the 'clflush' implementation in the x86
arch in gem5? Any suggestions will be appreciated.

PS: On a real system, it works fine.

regards,

Usman Ali
MSEE Student, Information Technology University, Lahore
***@itu.edu.pk
Swapnil Haria
2018-10-24 13:06:16 UTC
Hey Usman,

Can you please provide a small program that I can use to reproduce this
issue?

The CLFLUSH instruction only flushes the cache line as far as the request
queue of the memory controller, not all the way to memory. Intel had proposed
and later deprecated a PCOMMIT instruction which drained the memory controller
queues. So I think the load after a CLFLUSH may simply read the value
from the memory controller queue itself, in which case the load latency would
be similar to cache latency rather than memory latency.

I will look into this further.

Cheers,
Swapnil Haria,
PhD Candidate,
Dept of Computer Sciences,
University of Wisconsin-Madison

http://pages.cs.wisc.edu/~swapnilh/
_______________________________________________
gem5-users mailing list
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
Usman Ali
2018-10-26 10:13:43 UTC
Hi Swapnil Haria,

Thanks for your kind reply. Below is a program which can be used to
reproduce the effect.

regards,
Usman Ali
MSEE Student, ITU, Lahore

#include <stdio.h>
#include <unistd.h>

int probe(void *addr) {
    volatile unsigned long time;
    asm __volatile__ (
        " mfence \n"
        " lfence \n"
        " rdtsc \n"
        " lfence \n"
        " movl %%eax, %%esi \n"
        " movl (%1), %%eax \n"
        " lfence \n"
        " rdtsc \n"
        " subl %%esi, %%eax \n"
        " clflush 0(%1) \n"
        : "=a" (time)
        : "c" (addr)
        : "%esi", "%edx");
    return time;
}

int readOnly(void *addr) {
    volatile unsigned long time;
    asm __volatile__ (
        " mfence \n"
        " lfence \n"
        " rdtsc \n"
        " lfence \n"
        " movl %%eax, %%esi \n"
        " movl (%1), %%eax \n"
        " lfence \n"
        " rdtsc \n"
        " subl %%esi, %%eax \n"
        : "=a" (time)
        : "c" (addr)
        : "%esi", "%edx");
    return time;
}

int main(void){
    int t=0, total = 0, n=100;

    int pO = 123456;
    void * p = &pO;

    for(int i=0; i < n; i++){
        if(i%2){
            t = probe(p);
            t = probe(p);
            t = probe(p);
        }else{
            t = readOnly(p);
            t = readOnly(p);
            t = readOnly(p);
        }

        printf("\n----Cycle: %d \n", t);
    }

    printf("Done --ud-- \n");
    return 0;

    sleep(1);
}

// END of PROGRAM
Swapnil Haria
2018-10-27 15:50:56 UTC
Hey Usman,

This is an example of false sharing in your code, because the target address
is on the stack.
Other stack variables such as t, total and n are on the same cacheline.
Those are referenced in the for loop, which brings the
entire cacheline (including the target of p) back into the cache.
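
To make the cacheline argument concrete, here is a minimal sketch in C that
checks whether two addresses fall on the same cacheline. The 64-byte line size
and the names CACHE_LINE_BYTES, line_index and same_cache_line are assumptions
for illustration only; they are not part of gem5 or of the original program,
and the line size should be matched to the simulated cache configuration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed cache line size in bytes; adjust to match the simulated system. */
#define CACHE_LINE_BYTES 64u

/* Index of the cache line that contains the given address. */
static uintptr_t line_index(const void *addr) {
    return (uintptr_t)addr / CACHE_LINE_BYTES;
}

/* True when both addresses fall on the same cache line. */
static bool same_cache_line(const void *a, const void *b) {
    return line_index(a) == line_index(b);
}
```

Calling same_cache_line(&pO, &t) inside the original main would show whether
the loop variable shares a line with the probed int. The answer depends on how
the compiler lays out the stack, so it can differ between builds.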

Based on your program, I created a working minimal example with the only
major change being allocating the addr in the heap.
With that I see:

Latency of cached load: 20

Latency of uncached load: 182

#include <cstdio>  // printf

int main() {
    volatile unsigned long time;
    int *addr = new int(); // WORKS
    *addr = 0x12221;

    // Does not work.
    // int pO = 123456;
    // void * p = &pO;

    // 1. Load to ensure addr gets cached.
    // 2. Time access to cached addr.
    // 3. Flush addr.
    asm __volatile__ (
        " mfence \n"
        " movl (%1), %%eax \n"
        " lfence \n"
        " rdtsc \n"
        " lfence \n"
        " movl %%eax, %%esi \n"
        " movl (%1), %%eax \n"
        " lfence \n"
        " rdtsc \n"
        " subl %%esi, %%eax \n"
        " clflush 0(%1) \n"
        : "=a" (time)
        : "c" (addr)
        : "%esi", "%edx");

    printf("\n Latency of cached load: %lu \n", time);

    // Time the load again; the line was flushed, so the value comes from memory.
    asm __volatile__ (
        " mfence \n"
        " lfence \n"
        " rdtsc \n"
        " lfence \n"
        " movl %%eax, %%esi \n"
        " movl (%1), %%eax \n"
        " lfence \n"
        " rdtsc \n"
        " subl %%esi, %%eax \n"
        : "=a" (time)
        : "c" (addr)
        : "%esi", "%edx");

    printf("\n Latency of uncached load: %lu \n", time);
}
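
A plain heap allocation moves the target away from the loop variables, but it
does not by itself guarantee that the target has a cacheline to itself. One
hedged way to get that in C is to request a whole aligned line with C11
aligned_alloc. This is only a sketch: the 64-byte line size and the helper name
alloc_isolated_int are assumptions for illustration, not something taken from
gem5 or from the program above.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed cache line size in bytes; adjust to match the simulated system. */
#define CACHE_LINE_BYTES 64u

/*
 * Allocate one int at the start of its own line-aligned, line-sized block,
 * so no other program variable shares its cache line.
 * Returns NULL on allocation failure; release the result with free().
 */
static int *alloc_isolated_int(int value) {
    void *line = aligned_alloc(CACHE_LINE_BYTES, CACHE_LINE_BYTES);
    if (line == NULL) {
        return NULL;
    }
    memset(line, 0, CACHE_LINE_BYTES);  /* touch and zero the whole line */
    int *target = line;
    *target = value;
    return target;
}
```

The returned pointer can be passed as addr to the timing code in place of the
stack variable, for example int *addr = alloc_isolated_int(0x12221);.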

Cheers,
Swapnil Haria,
PhD Candidate,
Dept of Computer Sciences,
University of Wisconsin-Madison

http://pages.cs.wisc.edu/~swapnilh/
Usman Ali
2018-10-30 10:22:21 UTC
Hi, thanks for your response.

This resolves the issue.

best regards,
Usman Ali
