Discussion:
Understanding of cache trace of ALPHA timing CPU
mengyu liang
2016-11-06 21:03:54 UTC
Hello everyone,

Recently I have been studying memory access time (i.e. the duration of memory loads and stores) in terms of CPU cycles in a multicore system. I settled on the Alpha timing CPU and have run several full-system simulations with PARSEC workloads. In order to look into the details of the memory access procedure, I turned on the Cache debug trace.

However, I was very disappointed to see that the entire memory access is treated "atomically". To illustrate my doubt, I paste the following Cache trace segment:


3587305218000: system.cpu3.dcache: ReadReq addr 0x6bcac8 size 8 (ns) miss
3587305218000: system.cpu3.dcache: createMissPacket created ReadSharedReq from ReadReq for addr 0x6bcac0 size 32
3587305218000: system.cpu3.dcache: Sending an atomic ReadSharedReq for 0x6bcac0 (ns)
3587305218000: system.cpu0.dcache: handleSnoop snoop hit for CleanEvict addr 0x8601c0 size 32, old state is state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 tag: 10c03
3587305218000: system.cpu0.dcache: Found addr 0x8601c0 in upper level cache for snoop CleanEvict from lower cache
3587305218000: system.cpu1.dcache: handleSnoop snoop hit for CleanEvict addr 0x8601c0 size 32, old state is state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 tag: 10c03
3587305218000: system.cpu1.dcache: Found addr 0x8601c0 in upper level cache for snoop CleanEvict from lower cache
3587305218000: system.cpu3.dcache: Receive response: ReadResp for addr 0x6bcac0 (ns) in state 0
3587305218000: system.cpu3.dcache: replacement: replacing 0x3f0d0040 (ns) with 0x6bcac0 (ns): writeback
3587305218000: system.cpu3.dcache: Create Writeback 0x3f0d0040 writable: 1, dirty: 1
3587305218000: system.cpu3.dcache: Block addr 0x6bcac0 (ns) moving from state 0 to state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 tag: d795


As you can see above, cpu3 initiates a read request at the very beginning but encounters a cache miss, which triggers a series of cache actions due to cache coherency. However, they ALL take place at the same time tick, as if every memory access, whether it is a cache miss or a hit, takes ZERO time!


As per the gem5 documentation, "The TimingSimpleCPU is the version of SimpleCPU that uses timing memory accesses. It stalls on cache accesses and waits for the memory system to respond prior to proceeding." Based on that, I did not expect atomic-like behavior from the timing CPU; it should have exhibited a non-zero duration for each memory access.
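For reference, the split-transaction behaviour that the documentation describes can be pictured with the simplified C++ sketch below. This is only an illustration of the idea, not gem5 code; every class, method, and latency value in it is hypothetical.

// Simplified illustration of a split timing access: the CPU issues a request,
// stalls, and only resumes when the memory system calls back with a response
// at a later tick. All names and latencies here are hypothetical, not gem5 API.
#include <cstdint>
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

using Tick = std::uint64_t;

struct Event { Tick when; std::function<void()> action; };
struct Later { bool operator()(const Event &a, const Event &b) const { return a.when > b.when; } };

static std::priority_queue<Event, std::vector<Event>, Later> eventQueue;
static Tick curTick = 0;

struct ToyTimingCpu {
    bool stalled = false;

    void issueLoad(std::uint64_t addr, Tick missLatency) {
        stalled = true;  // the CPU waits; it does not produce the result now
        std::printf("%llu: cpu: ReadReq addr %#llx sent\n",
                    (unsigned long long)curTick, (unsigned long long)addr);
        // The memory system schedules the response for a later tick.
        eventQueue.push({curTick + missLatency,
                         [this, addr] { recvResponse(addr); }});
    }

    void recvResponse(std::uint64_t addr) {
        std::printf("%llu: cpu: ReadResp addr %#llx received, resuming\n",
                    (unsigned long long)curTick, (unsigned long long)addr);
        stalled = false;  // execution continues only here
    }
};

int main() {
    ToyTimingCpu cpu;
    cpu.issueLoad(0x6bcac8, 16000);  // pretend the miss costs 16000 ticks
    while (!eventQueue.empty()) {
        Event e = eventQueue.top();
        eventQueue.pop();
        curTick = e.when;
        e.action();
    }
    return 0;
}

In such a model the ReadReq and the matching ReadResp are two distinct events separated by the miss latency; the surprise in the trace above is that the coherency actions in between are all logged at a single tick.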


Has anybody had the same experience, and can anyone explain the reason for this?


Or is there a CPU model that behaves non-atomically and can be used in a multicore system? As far as I know, only the O3 CPU does this job, but it is out of order. I need an in-order CPU.


Thanks and best regards,

Mengyu Liang
Rodrigo Cataldo
2016-11-07 11:52:26 UTC
Hello Mengyu Liang,
I would recommend that you check out the thesis of Uri Wiener (Modeling and Analysis of a Cache Coherent Interconnect); he describes the decisions made in the implementation of the CCI model in gem5.

Quoting page 25: "Snoop requests from the slave are handled and forwarded in zero time. This major inaccuracy is intended for avoiding race conditions in the memory system, and mostly the need to implement transition-states in the cache-controller."
Jason Lowe-Power
2016-11-07 14:30:25 UTC
In addition to what Rodrigo says, if you want to model a cache-coherent memory system in detail, you should be using the Ruby memory system, not the classic caches. Ruby performs all coherence actions in detailed timing mode.

Also, for an in-order CPU, you may want to try out the MinorCPU. It works well with ARM, and somewhat with x86. I'm not sure whether it will work with Alpha.

Cheers,
Jason
Andreas Hansson
2016-11-14 22:33:53 UTC
Hi all,

The classic memory system avoids a lot of complexity in the cache state machines by performing the state transitions in zero time. Note that it does not complete the packet transfer in zero time though, and it pays for the instant request propagation either in the downstream component, or on the response path. There are fields in the packet that accumulate the “unpaid” snoop latency. You can run a multi-core lmbench-like benchmark if you want to convince yourself it is doing the right thing.
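As a rough illustration of that "pay later" accounting, here is a simplified, hypothetical C++ sketch; it is not the real gem5 Packet API (gem5 keeps analogous per-packet delay fields), but it shows how latency that is not charged at the point of the snoop can still end up in the response time.

// Hypothetical sketch of deferring latency on a packet, not the gem5 API.
#include <cassert>
#include <cstdint>

using Tick = std::uint64_t;

struct ToyPacket {
    std::uint64_t addr = 0;
    Tick unpaidDelay = 0;   // latency accrued by zero-time hops, not yet applied
};

// A crossbar or snooping cache forwards the request instantly, but records
// the latency it should have charged.
void forwardInZeroTime(ToyPacket &pkt, Tick componentLatency) {
    pkt.unpaidDelay += componentLatency;
}

// Whoever finally responds folds the accumulated delay into the response time,
// so the requester still observes the full latency.
Tick responseTime(const ToyPacket &pkt, Tick now, Tick ownLatency) {
    return now + ownLatency + pkt.unpaidDelay;
}

int main() {
    ToyPacket pkt;
    pkt.addr = 0x6bcac0;
    forwardInZeroTime(pkt, 1000);   // e.g. the crossbar's forwarding latency
    forwardInZeroTime(pkt, 2000);   // e.g. a snooping cache's lookup latency
    assert(responseTime(pkt, 10000, 5000) == 18000);
    return 0;
}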

As a result of the aforementioned functionality, I would argue the classic memory system is actually a good representation of a hierarchical crossbar-based system with a MOESI protocol. It is also a lot faster than Ruby, and far more flexible. In the end it depends on what you want to accomplish. For most system-level performance exploration I would suggest classic. For detailed interconnect topologies or coherency protocols, go with Ruby.

I hope that helps.

Andreas

mengyu liang
2016-11-17 22:11:02 UTC
Dear all,

Thanks a lot for all your explanations below. I'm now sticking with the classic crossbar (Xbar) memory system, not the Ruby one, and I accept that the state transitions for cache coherency take zero time in this case.

However, today I studied the Exec debug trace again for the ALPHA FS simulation and found the following interesting entries:


3334580479000: system.switch_cpus02 T0 : 0x12000867c : ldq r2,29968(r1) : MemRead : A=0x1200adda8

......

3334580495000: system.switch_cpus02 T0 : 0x12000867c : ldq r2,29968(r1) : MemRead : D=0x00000001200adda8 A=0x1200adda8


As you can see, in the first entry cpu02 tries to read from address A=0x1200adda8, but no data is shown. Some time later, in the second entry, the same core at the same instruction address accesses the same data address with the same registers, but this time valid data is returned: D=0x00000001200adda8.

Can I interpret the first entry as the memory access request and the second as the data acknowledgement? Does it have something to do with a cache miss?

If you compare this with the Cache debug trace, you will find that the first entry is not recorded in the cache trace; only the second entry appears there.

So what happened at the first entry?

I should add that this kind of access makes up only a very small percentage of all memory accesses; most accesses already have their data at the first entry.


There are also other kinds of memory accesses in the exec trace that have neither a data address A=0x... nor returned data D=0x...; an example is below:


3334580433000: system.switch_cpus00 T0 : @iowrite8+36 : mb : MemRead :


How can this be explained?


PS: I still don't know how to reply to an existing topic on the gem5 mailing list so that my post is attached to it, instead of opening a new topic.

Thanks in advance.


Best regards,

Mengyu



Jason Lowe-Power
2016-11-20 20:24:39 UTC
Hello,

To reply to a post, you should just click "reply" in your email client.

For your question... I would look at the code that is executed with the
Exec debug flag. By reading the code you should be able to step through and
figure out what's going on.
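
For orientation, gem5's debug output comes from trace statements in the C++ sources that are gated by named debug flags. The hypothetical sketch below shows the general pattern (simplified invented names, not gem5's actual macros or flag definitions), so you know what kind of code to search for when following an Exec or Cache trace line back to its source.

// Hypothetical sketch of a debug-flag-gated trace statement, not gem5's
// actual macros. The idea: each trace line names the current tick and the
// object that printed it, and is emitted only when its flag is enabled.
#include <cstdarg>
#include <cstdint>
#include <cstdio>

static bool execFlagEnabled = true;     // imagine this is set from the command line
static std::uint64_t curTick = 3334580479000ULL;

static void tracePrintf(bool flagEnabled, const char *who, const char *fmt, ...) {
    if (!flagEnabled)
        return;                         // flag off: no output
    std::printf("%llu: %s: ", (unsigned long long)curTick, who);
    std::va_list args;
    va_start(args, fmt);
    std::vprintf(fmt, args);
    va_end(args);
}

int main() {
    // Something like this would sit at the point where an instruction
    // (or a cache access) is handled in the simulator's C++ code.
    tracePrintf(execFlagEnabled, "system.switch_cpus02",
                "T0 : %#llx : ldq r2,29968(r1) : MemRead : A=%#llx\n",
                0x12000867cULL, 0x1200adda8ULL);
    return 0;
}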

Jason