Discussion:
Access icache every cycle
Runjie Zhang
2012-10-22 20:58:27 UTC
Greetings,

I tried to write stressmarks in X86 assembly so that the simulated IPC of
the O3CPU can hit N for an N-way out-of-order core. However, no matter how I
modify the assembly, the IPC never reaches 4 for a 4-way OoO core.

According to the execution trace, icache stalls are the troublemaker. In
my case, even though the whole program fits in the icache, the fetch unit
still stalls for a few cycles after every 32 instructions, which it fetches
over 8 cycles (I assume 32 X86 ADD instructions fill one cache line?). With
the Gem5 memory system (no Ruby), this stall is 2 cycles. With Ruby memory,
it is 3 cycles.
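
As a rough check on these numbers, the fetch-limited IPC can be bounded with a little arithmetic. This is only a sketch: the 64-byte line size is an assumption, the 2-byte instruction size is inferred from the PC spacing in the trace later in this thread, and `ipc_bound` is a hypothetical helper, not a Gem5 function.

```python
def ipc_bound(insts_per_line, fetch_width, stall_cycles):
    """Upper bound on IPC when each cache line costs extra stall cycles."""
    # Cycles spent actually fetching one line's worth of instructions
    # (ceiling division without importing math).
    busy_cycles = -(-insts_per_line // fetch_width)
    return insts_per_line / (busy_cycles + stall_cycles)


LINE_BYTES = 64  # assumption: 64-byte icache line
INST_BYTES = 2   # assumption: 2-byte register-register ADD/SUB
INSTS_PER_LINE = LINE_BYTES // INST_BYTES  # 32 instructions per line

for stall in (0, 2, 3):
    bound = ipc_bound(INSTS_PER_LINE, 4, stall)
    print(f"stall={stall} cycles -> IPC <= {bound:.2f}")
```

Under these assumptions, a 2-cycle stall per line caps the IPC at 32/10 = 3.2 and a 3-cycle stall at 32/11, roughly 2.91, so a 4-wide core cannot reach 4 while the stall remains.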

So my questions are:

1. Since Gem5 does not accept a zero hit latency, is there a way to
access the icache every cycle without any stall? Let's assume there are no
icache misses.

2. The icache hit latency was 2 cycles for both the Ruby memory and the
Gem5 memory configurations, so why did the Ruby case experience an extra
stall cycle?

I was running full-system Gem5 (changeset: 9305:ac608464be80) with the X86
ISA and a single detailed CPU. For Ruby, I used the MOESI_hammer protocol.


Thanks!

Runjie Zhang
University of Virginia
Mahmood Naderan
2012-10-23 08:09:01 UTC
How about *not* pushing the cache latencies into the queue? I am not
quite sure whether this is correct, though.

Regards,
Mahmood
_______________________________________________
gem5-users mailing list
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
Nilay Vaish
2012-10-23 15:13:50 UTC
Post by Runjie Zhang
Greetings,
I tried to write stressmarks in X86 assembly so that the simulated IPC of
the O3CPU can hit N for an N-way out-of-order core. However, no matter how I
modify the assembly, the IPC never reaches 4 for a 4-way OoO core.
According to the execution trace, icache stalls are the troublemaker. In
my case, even though the whole program fits in the icache, the fetch unit
still stalls for a few cycles after every 32 instructions, which it fetches
over 8 cycles (I assume 32 X86 ADD instructions fill one cache line?). With
the Gem5 memory system (no Ruby), this stall is 2 cycles. With Ruby memory,
it is 3 cycles.
1. Since Gem5 does not accept a zero hit latency, is there a way to
access the icache every cycle without any stall? Let's assume there are no
icache misses.
Why do you think that only a cache with zero hit latency can be accessed
without stalls? I would expect a design that is pipelined enough to hit
its peak throughput once the pipeline is full. Here, pipeline means not
only the processor pipeline but also the path that connects the processor
to the caches. So, if the cache provides a throughput of four instructions
every cycle, its latency (whether it is one cycle or 100 cycles) would not
matter at all once the pipeline is full.
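
The latency-versus-throughput point can be illustrated with a back-of-the-envelope model. This is a minimal sketch under idealized assumptions (every access hits, each access returns `width` instructions, and a pipelined cache accepts a new access every cycle); `cycles_to_deliver` is a hypothetical helper and does not model Gem5's caches.

```python
def cycles_to_deliver(total_insts, width, latency, pipelined=True):
    """Cycles until the last instruction arrives from an idealized cache."""
    accesses = -(-total_insts // width)  # ceiling division
    if pipelined:
        # One access starts per cycle; the last one completes
        # `latency` cycles after it starts.
        return (accesses - 1) + latency
    # Without pipelining, each access waits for the previous one.
    return accesses * latency


for latency in (1, 2, 100):
    cycles = cycles_to_deliver(400_000, 4, latency)
    print(f"latency={latency:3d}: {400_000 / cycles:.3f} insts/cycle")
```

In this model the latency is paid only once while the pipeline fills, so the delivered rate approaches four instructions per cycle as the run gets longer. Without pipelining, the rate drops to `width / latency`.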

I think you should check why the fetch unit is stalling. You have not
stated that.
Post by Runjie Zhang
2. The icache hit latency was 2 cycles for both the Ruby memory and the
Gem5 memory configurations, so why did the Ruby case experience an extra
stall cycle?
You should be able to figure out from the trace where those cycles were
spent.

--
Nilay
Runjie Zhang
2012-10-25 01:58:39 UTC
Hi, Nilay

I agree with you that the hit latency doesn't have to be zero in order
to fetch from the icache every cycle.

Here is a snapshot from the exec trace (I deleted some detail to make
it clearer). The icache hit latency is 1 cycle and the fetch width is 4.

(Ticks)
...60..: ....fetch: Running stage.
...60..: ....fetch: Attempting to fetch from [tid:0]
...60..: ....fetch: [tid:0]: Adding instructions to queue to decode.
...60..: ....fetch: [tid:0]: Instruction PC 0x400ab7 (0) created [sn:5050].
...60..: ....fetch: [tid:0]: Instruction PC 0x400ab9 (0) created [sn:5051].
...60..: ....fetch: [tid:0]: Instruction PC 0x400abb (0) created [sn:5052].
...60..: ....fetch: [tid:0]: Instruction PC 0x400abd (0) created [sn:5053].
...60..: ....fetch: [tid:0]: Done fetching, reached fetch bandwidth
for this cycle.

...65..: ....fetch: Running stage.
...65..: ....fetch: Attempting to fetch from [tid:0]
...65..: ....fetch: [tid:0]: Adding instructions to queue to decode.
...65..: ....fetch: [tid:0]: Issuing a pipelined I-cache access,
starting at PC (0x400abf=>0x400ac7).(0=>1).
...65..: ....fetch: [tid:0] Fetching cache line 0x400ac0 for addr 0x400ac0
...65..: ....fetch: Fetch: Doing instruction read.
...65..: ....fetch: [tid:0]: Doing Icache access.

...70..: ....fetch: [tid:0] Waking up from cache miss.
...70..: ....fetch: Running stage.
...70..: ....fetch: Attempting to fetch from [tid:0]
...70..: ....fetch: [tid:0]: Icache miss is complete.
...70..: ....fetch: [tid:0]: Adding instructions to queue to decode.
...70..: ....fetch: [tid:0]: Instruction PC 0x400abf (0) created [sn:5054].
...70..: ....fetch: [tid:0]: Instruction PC 0x400ac1 (0) created [sn:5055].
...70..: ....fetch: [tid:0]: Instruction PC 0x400ac3 (0) created [sn:5056].
...70..: ....fetch: [tid:0]: Instruction PC 0x400ac5 (0) created [sn:5057].
...70..: ....fetch: [tid:0]: Done fetching, reached fetch bandwidth
for this cycle.

On entering cycle 65, the previous cache line has been consumed,
so the fetch unit launches a pipelined icache access. However, this
access has a latency of 1 cycle, so the fetch unit has to wait until
cycle 70 to start fetching again. This creates a one-cycle stall. If I
understand correctly, this latency could be covered if the pipelined
icache access were launched one cycle earlier (in cycle 60). Can I
configure that in Gem5?
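
The effect described above can be reproduced with a toy fetch model. This is only a sketch and not Gem5's fetch code: `simulate`, its parameters, and the 32-instruction line are illustrative assumptions. It compares requesting the next line in the cycle after the current line is drained (as the trace shows) with requesting it early enough to hide the hit latency.

```python
def simulate(num_lines, insts_per_line=32, width=4, latency=1,
             issue_ahead=False):
    """Return the average number of instructions fetched per cycle."""
    drain = -(-insts_per_line // width)  # cycles needed to consume one line
    cycle = 0
    ready_at = latency  # the first line is requested in cycle 0
    fetched = 0
    for _ in range(num_lines):
        cycle = max(cycle, ready_at)  # stall until the line has arrived
        last = cycle + drain - 1      # last cycle that fetches from this line
        if issue_ahead:
            # Request the next line so that it arrives right after `last`.
            request = max(cycle, last + 1 - latency)
        else:
            # Request the next line only once this one is used up.
            request = last + 1
        ready_at = request + latency
        fetched += insts_per_line
        cycle = last + 1
    return fetched / cycle


print(f"on demand:    {simulate(1000):.3f}")
print(f"issued ahead: {simulate(1000, issue_ahead=True):.3f}")
```

With a 1-cycle hit latency, the on-demand policy spends 9 cycles per 32 instructions (about 3.56 per cycle), while issuing the request one cycle earlier leaves only the initial fill latency exposed.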

I am not sure whether the Fetch flag is enough to study this
phenomenon. If not, please tell me which other flags I should use!

BTW, the O3CPUALL debug flag does not seem to work. I got the error
"invalid debug flag 'O3CPUALL'".

Thanks!
Runjie


Nilay Vaish
2012-10-25 14:42:53 UTC
Post by Runjie Zhang
Hi, Nilay
I agree with you that the hit latency doesn't have to be zero in order
to fetch from the icache every cycle.
Here is a snapshot from the exec trace (I deleted some detail to make
it clearer). The icache hit latency is 1 cycle and the fetch width is 4.
(Ticks)
...60..: ....fetch: Running stage.
...60..: ....fetch: Attempting to fetch from [tid:0]
...60..: ....fetch: [tid:0]: Adding instructions to queue to decode.
...60..: ....fetch: [tid:0]: Instruction PC 0x400ab7 (0) created [sn:5050].
...60..: ....fetch: [tid:0]: Instruction PC 0x400ab9 (0) created [sn:5051].
...60..: ....fetch: [tid:0]: Instruction PC 0x400abb (0) created [sn:5052].
...60..: ....fetch: [tid:0]: Instruction PC 0x400abd (0) created [sn:5053].
...60..: ....fetch: [tid:0]: Done fetching, reached fetch bandwidth
for this cycle.
What happened in cycles 61-64? Shouldn't the fetch unit try to create
four instructions every cycle?
Post by Runjie Zhang
...65..: ....fetch: Running stage.
...65..: ....fetch: Attempting to fetch from [tid:0]
...65..: ....fetch: [tid:0]: Adding instructions to queue to decode.
...65..: ....fetch: [tid:0]: Issuing a pipelined I-cache access,
starting at PC (0x400abf=>0x400ac7).(0=>1).
...65..: ....fetch: [tid:0] Fetching cache line 0x400ac0 for addr 0x400ac0
...65..: ....fetch: Fetch: Doing instruction read.
...65..: ....fetch: [tid:0]: Doing Icache access.
What happened in the in-between cycles?
Post by Runjie Zhang
...70..: ....fetch: [tid:0] Waking up from cache miss.
...70..: ....fetch: Running stage.
...70..: ....fetch: Attempting to fetch from [tid:0]
...70..: ....fetch: [tid:0]: Icache miss is complete.
...70..: ....fetch: [tid:0]: Adding instructions to queue to decode.
...70..: ....fetch: [tid:0]: Instruction PC 0x400abf (0) created [sn:5054].
...70..: ....fetch: [tid:0]: Instruction PC 0x400ac1 (0) created [sn:5055].
...70..: ....fetch: [tid:0]: Instruction PC 0x400ac3 (0) created [sn:5056].
...70..: ....fetch: [tid:0]: Instruction PC 0x400ac5 (0) created [sn:5057].
...70..: ....fetch: [tid:0]: Done fetching, reached fetch bandwidth
for this cycle.
On entering cycle 65, the previous cache line has been consumed,
so the fetch unit launches a pipelined icache access. However, this
access has a latency of 1 cycle, so the fetch unit has to wait until
cycle 70 to start fetching again. This creates a one-cycle stall. If I
understand correctly, this latency could be covered if the pipelined
icache access were launched one cycle earlier (in cycle 60). Can I
configure that in Gem5?
One cycle earlier would mean cycle 64, not cycle 60. You have
completely removed the trace for the in-between cycles, which is required
for understanding what was going on in the fetch unit during those cycles.
Post by Runjie Zhang
I am not sure whether the Fetch flag is enough to study this
phenomenon. If not, please tell me which other flags I should use!
BTW, the O3CPUALL debug flag does not seem to work. I got the error
"invalid debug flag 'O3CPUALL'".
It is not working because you are using the wrong flag. The correct flag
is O3CPUAll.

--
Nilay
Runjie Zhang
2012-10-25 15:25:06 UTC
Sorry for the confusion.

The numbers 60, 65 and 70 were part of the tick count at which each cycle
started. I removed some digits from the tick count to make each line
shorter, so they are three consecutive cycles.

The complete trace looks like this:

33922322296000: system.switch_cpus.fetch: Running stage.
33922322296000: system.switch_cpus.fetch: Attempting to fetch from [tid:0]
33922322296000: system.switch_cpus.fetch: [tid:0]: Adding instructions to queue to decode.
33922322296000: system.switch_cpus.fetch: [tid:0]: Instruction PC 0x400ab7 (0) created [sn:5050].
33922322296000: system.switch_cpus.fetch: [tid:0]: Instruction is: ADD_R_R : add ecx, ecx, esi
33922322296000: system.switch_cpus.fetch: [tid:0]: Instruction PC 0x400ab9 (0) created [sn:5051].
33922322296000: system.switch_cpus.fetch: [tid:0]: Instruction is: ADD_R_R : add edx, edx, esi
33922322296000: system.switch_cpus.fetch: [tid:0]: Instruction PC 0x400abb (0) created [sn:5052].
33922322296000: system.switch_cpus.fetch: [tid:0]: Instruction is: SUB_R_R : sub eax, eax, esi
33922322296000: system.switch_cpus.fetch: [tid:0]: Instruction PC 0x400abd (0) created [sn:5053].
33922322296000: system.switch_cpus.fetch: [tid:0]: Instruction is: SUB_R_R : sub ebx, ebx, esi
33922322296000: system.switch_cpus.fetch: [tid:0]: Done fetching, reached fetch bandwidth for this cycle.

33922322296500: system.switch_cpus.BPredUnit: BranchPred: [tid:0]: Committing branches until [sn:5025].
33922322296500: system.switch_cpus.fetch: Running stage.
33922322296500: system.switch_cpus.fetch: Attempting to fetch from [tid:0]
33922322296500: system.switch_cpus.fetch: [tid:0]: Adding instructions to queue to decode.
33922322296500: system.switch_cpus.fetch: [tid:0]: Issuing a pipelined I-cache access, starting at PC (0x400abf=>0x400ac7).(0=>1).
33922322296500: system.switch_cpus.fetch: [tid:0] Fetching cache line 0x400ac0 for addr 0x400ac0
33922322296500: system.switch_cpus.fetch: Fetch: Doing instruction read.
33922322296500: system.switch_cpus.fetch: [tid:0]: Doing Icache access.
33922322297000: system.switch_cpus.fetch: [tid:0] Waking up from cache miss.
33922322297000: system.switch_cpus.BPredUnit: BranchPred: [tid:0]: Committing branches until [sn:5029].
33922322297000: system.switch_cpus.fetch: Running stage.
33922322297000: system.switch_cpus.fetch: Attempting to fetch from [tid:0]
33922322297000: system.switch_cpus.fetch: [tid:0]: Icache miss is complete.
33922322297000: system.switch_cpus.fetch: [tid:0]: Adding instructions to queue to decode.
33922322297000: system.switch_cpus.fetch: [tid:0]: Instruction PC 0x400abf (0) created [sn:5054].
33922322297000: system.switch_cpus.fetch: [tid:0]: Instruction is: SUB_R_R : sub ecx, ecx, esi
33922322297000: system.switch_cpus.fetch: [tid:0]: Instruction PC 0x400ac1 (0) created [sn:5055].
33922322297000: system.switch_cpus.fetch: [tid:0]: Instruction is: SUB_R_R : sub edx, edx, esi
33922322297000: system.switch_cpus.fetch: [tid:0]: Instruction PC 0x400ac3 (0) created [sn:5056].
33922322297000: system.switch_cpus.fetch: [tid:0]: Instruction is: ADD_R_R : add eax, eax, esi
33922322297000: system.switch_cpus.fetch: [tid:0]: Instruction PC 0x400ac5 (0) created [sn:5057].
33922322297000: system.switch_cpus.fetch: [tid:0]: Instruction is: ADD_R_R : add ebx, ebx, esi
33922322297000: system.switch_cpus.fetch: [tid:0]: Done fetching, reached fetch bandwidth for this cycle.

33922322297500: system.switch_cpus.BPredUnit: BranchPred: [tid:0]: Committing branches until [sn:5033].
33922322297500: system.switch_cpus.fetch: Running stage.
33922322297500: system.switch_cpus.fetch: Attempting to fetch from [tid:0]
33922322297500: system.switch_cpus.fetch: [tid:0]: Adding instructions to queue to decode.
33922322297500: system.switch_cpus.fetch: [tid:0]: Instruction PC 0x400ac7 (0) created [sn:5058].
33922322297500: system.switch_cpus.fetch: [tid:0]: Instruction is: ADD_R_R : add ecx, ecx, esi
33922322297500: system.switch_cpus.fetch: [tid:0]: Instruction PC 0x400ac9 (0) created [sn:5059].
33922322297500: system.switch_cpus.fetch: [tid:0]: Instruction is: ADD_R_R : add edx, edx, esi
33922322297500: system.switch_cpus.fetch: [tid:0]: Instruction PC 0x400acb (0) created [sn:5060].
33922322297500: system.switch_cpus.fetch: [tid:0]: Instruction is: SUB_R_R : sub eax, eax, esi
33922322297500: system.switch_cpus.fetch: [tid:0]: Instruction PC 0x400acd (0) created [sn:5061].
33922322297500: system.switch_cpus.fetch: [tid:0]: Instruction is: SUB_R_R : sub ebx, ebx, esi
33922322297500: system.switch_cpus.fetch: [tid:0]: Done fetching, reached fetch bandwidth for this cycle.
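
To spot such stall cycles in a longer trace without reading it by eye, a small script can count how many instructions the fetch stage creates at each tick. This is a minimal sketch: it assumes every trace record sits on a single line (mail wrapping undone) and that the records look exactly like the ones in this thread. `insts_per_tick` and `SAMPLE` are illustrative names, and `SAMPLE` is a shortened copy of the trace.

```python
import re

# Patterns for the two Fetch-flag records used here, modeled on this trace.
RUNNING = re.compile(r"^(\d+): \S+\.fetch: Running stage\.")
CREATED = re.compile(
    r"^(\d+): \S+\.fetch: \[tid:\d+\]: Instruction PC 0x[0-9a-f]+ "
    r".*created \[sn:\d+\]"
)


def insts_per_tick(lines):
    """Map each tick where fetch ran to the number of instructions created."""
    counts = {}
    for line in lines:
        running = RUNNING.match(line)
        if running:
            counts.setdefault(int(running.group(1)), 0)
            continue
        created = CREATED.match(line)
        if created:
            tick = int(created.group(1))
            counts[tick] = counts.get(tick, 0) + 1
    return counts


SAMPLE = """\
33922322296000: system.switch_cpus.fetch: Running stage.
33922322296000: system.switch_cpus.fetch: [tid:0]: Instruction PC 0x400ab7 (0) created [sn:5050].
33922322296000: system.switch_cpus.fetch: [tid:0]: Instruction PC 0x400ab9 (0) created [sn:5051].
33922322296000: system.switch_cpus.fetch: [tid:0]: Instruction PC 0x400abb (0) created [sn:5052].
33922322296000: system.switch_cpus.fetch: [tid:0]: Instruction PC 0x400abd (0) created [sn:5053].
33922322296500: system.switch_cpus.fetch: Running stage.
33922322296500: system.switch_cpus.fetch: [tid:0]: Doing Icache access.
33922322297000: system.switch_cpus.fetch: Running stage.
33922322297000: system.switch_cpus.fetch: [tid:0]: Instruction PC 0x400abf (0) created [sn:5054].
33922322297000: system.switch_cpus.fetch: [tid:0]: Instruction PC 0x400ac1 (0) created [sn:5055].
33922322297000: system.switch_cpus.fetch: [tid:0]: Instruction PC 0x400ac3 (0) created [sn:5056].
33922322297000: system.switch_cpus.fetch: [tid:0]: Instruction PC 0x400ac5 (0) created [sn:5057].
"""

counts = insts_per_tick(SAMPLE.splitlines())
stalls = [tick for tick, n in counts.items() if n == 0]
print(counts)
print("ticks with no instructions created:", stalls)
```

On this excerpt the only tick with zero instructions created is 33922322296500, the cycle in which the pipelined icache access is issued.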

Sorry for the confusion.

Runjie
Nilay Vaish
2012-10-26 19:42:45 UTC
I now understand the problem that you are describing. I just checked
fetch_impl.hh. If you look at line 889, it does exactly what you have
suggested. There may be something wrong with this code, so that it does
not behave as expected. You might want to take a deeper dive into the
fetch stage's code and figure out why the icache access was not issued a
cycle earlier.

--
Nilay
Runjie Zhang
2012-10-26 21:09:21 UTC
Thanks for the confirmation. I'll look into it and discuss potential
solutions.

BTW, just curious: is there any particular reason for putting the fetch
code in a .hh file instead of a .cc file?

Thanks!
Runjie