Writeback buffer kills O3 performance, what is it meant to model?
Paul V. Gratz via gem5-users
2014-05-09 14:45:40 UTC
Hi All,
Doing some digging on performance issues in the O3 model, we and others have
run into the allocation of the writeback buffers having a big performance
impact. Basically, a writeback buffer is grabbed at issue time and held until
completion. With the default assumptions about the number of available
writeback buffers (x * issue width, where x is 1 by default), the buffers
often end up bottlenecking the effective issue width (particularly in the
face of long-latency loads grabbing all the WB buffers). What are these
structures trying to model? I can see limiting the number of instructions
allowed to complete and write back/bypass in a cycle, but this ends up being
much more conservative than that if that is the intent. If not, why does it
do this? We can easily make the number of WB buffers high, but we want to
understand what is going on here first...
Thanks!
Paul
--
-----------------------------------------
Paul V. Gratz
Assistant Professor
ECE Dept, Texas A&M University
Office: 333M WERC
Phone: 979-488-4551
http://cesg.tamu.edu/faculty/paul-gratz/
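
For concreteness, here is a minimal Python sketch of the accounting being
described; it is not the gem5 source, and the names simply mirror the
wbMax/wbOutstanding/incrWb()/decrWb() counters in the O3 IEW stage that come
up later in this thread:

    # Illustrative only: a writeback slot is reserved at issue and only
    # released at writeback, so a load miss ties its slot up for the
    # whole miss latency and can starve issue.
    class WritebackAccounting:
        def __init__(self, wb_width, wb_depth):
            self.wb_max = wb_width * wb_depth   # wbDepth defaults to 1
            self.wb_outstanding = 0

        def can_issue(self):
            # Issue stalls once every slot is held, even if an FU is free.
            return self.wb_outstanding < self.wb_max

        def on_issue(self):        # roughly incrWb()
            assert self.can_issue()
            self.wb_outstanding += 1

        def on_writeback(self):    # roughly decrWb()
            self.wb_outstanding -= 1

    # Two outstanding load misses on a 2-wide core with the defaults
    # already block all further issue:
    acct = WritebackAccounting(wb_width=2, wb_depth=1)
    acct.on_issue()
    acct.on_issue()
    print(acct.can_issue())        # False until one of them writes back
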
Steve Reinhardt via gem5-users
2014-05-12 17:39:04 UTC
Hi Paul,

I assume you're talking about the 'wbMax' variable? I don't recall it
specifically myself, but after looking at the code a bit, the best I can
come up with is that there's assumed to be a finite number of buffers
somewhere that hold results from the function units before they write back
to the reg file. Realistically, to me, it seems like those buffers would
be distributed among the function units anyway, not a global resource, so
having a global limit doesn't make a lot of sense. Does anyone else out
there agree or disagree?

It doesn't seem to relate to any structure that's directly modeled in the
code, i.e., I think you could rip the whole thing out (incrWb(), decrWb(),
wbOutstanding, wbMax) without breaking anything in the model... which would
be a good thing if in fact everyone else is either suffering unaware or
just working around it by setting a large value for wbDepth.

That said, we've done some internal performance correlation work, and I
don't recall this being an issue, for whatever that's worth. I know ARM
has done some correlation work too; have you run into this?

Steve



Arthur Perais via gem5-users
2014-05-12 19:08:18 UTC
Hi all,

I have no specific knowledge of what the buffers are modeling or what they
should be modeling, but I too encountered this issue some time ago. Setting a
high wbDepth is how I work around it (actually, 3 is sufficient for me),
because performance does suffer quite a lot in some cases (and even more so
for narrow-issue cores if wbWidth == issueWidth, I would expect).
--
Arthur Perais
INRIA Bretagne Atlantique
Bâtiment 12E, Bureau E303, Campus de Beaulieu
35042 Rennes, France
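
In configuration terms, the workaround is a one-line override in a gem5 config
script. The sketch below assumes the stock DerivO3CPU parameter names
(issueWidth, wbWidth, wbDepth); the widths shown are illustrative, not defaults:

    from m5.objects import DerivO3CPU

    cpu = DerivO3CPU()
    cpu.issueWidth = 2   # narrow core, where the bottleneck is most visible
    cpu.wbWidth = 2      # writeback width kept equal to issue width
    cpu.wbDepth = 3      # extra writeback slots so issue no longer stalls on them
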
Mitch Hayenga via gem5-users
2014-05-12 20:51:07 UTC
*"Realistically, to me, it seems like those buffers would be distributed
among the function units anyway, not a global resource, so having a global
limit doesn't make a lot of sense. Does anyone else out there agree or
disagree?"*

I believe that's more or less correct. With wbWidth probably meant to be
the # of write ports on the register file and wbDepth being the pipe stages
for a multi-cycle write back.

I don't fully agree that it should be distributed at the function unit
level, as you could imagine designs with higher issue width and functional
units than the number of register file write ports. Essentially allowing
more instructions to be issued on a given cycle, as long as they did not
all complete on the same cycle.

Going back to Paul's issue (loads holding write back slots on misses). The
"proper" way to do it would probably be to reserve a slot assuming an L1
cache hit latency. Give up the slot on a miss. Have an early signal that
a load-miss is coming back from the cache so that you could reserve a write
back slot in parallel with doing all the other necessary work for a load
(CAMing vs the store queue, etc). But this would likely be annoying to
implement.


*In general though, yes this seems like something not worth modeling in
gem5 as the potential negative impacts of its current implementation
outweigh the benefits. And the benefits of fully modeling it are likely
small.*
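
A rough sketch of that reservation policy (illustrative Python only, not gem5
code; the class, latencies, and cycle numbers are made up):

    L1_HIT_LATENCY = 2                          # cycles, hypothetical value

    class WbSlots:
        """Per-cycle writeback slots, wbWidth of them each cycle."""
        def __init__(self, width):
            self.width = width
            self.booked = {}                    # cycle -> slots taken

        def reserve(self, cycle):
            if self.booked.get(cycle, 0) >= self.width:
                return False                    # caller would stall issue instead
            self.booked[cycle] = self.booked.get(cycle, 0) + 1
            return True

        def release(self, cycle):
            self.booked[cycle] -= 1

    slots = WbSlots(width=2)

    # Issue: optimistically book the slot the load would need on an L1 hit.
    issue_cycle = 100
    hit_wb_cycle = issue_cycle + L1_HIT_LATENCY
    slots.reserve(hit_wb_cycle)

    # L1 miss detected: give the slot back instead of holding it for the
    # whole memory latency.
    slots.release(hit_wb_cycle)

    # The cache signals the returning fill a few cycles early, so a slot is
    # re-reserved in parallel with the rest of the load-completion work.
    fill_wb_cycle = issue_cycle + 80
    slots.reserve(fill_wb_cycle)
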



Paul V. Gratz via gem5-users
2014-05-13 02:07:59 UTC
Hi All,
Agreed, thanks for confirming we were not missing something. As a follow-up,
my student has some data on the performance impact he sees for this issue,
which he'll post here shortly, but it is quite large for a 2-wide OOO core. I
was thinking it might be something along those lines (or something about the
bypass network width), but it seems like grabbing the buffers at issue time
is probably too conservative (as opposed to grabbing them at completion and
stalling the functional unit if you can't get one).

I believe Karu Sankaralingam at Wisconsin also found this and a few other
issues; they have a related paper at WDDD this year.

We also found a problem where multiple outstanding loads to the same address
cause heavy flushing in O3 with Ruby, with a similarly large performance
impact; we'll start another thread on that shortly.
Thanks!
Paul



--
-----------------------------------------
Paul V. Gratz
Assistant Professor
ECE Dept, Texas A&M University
Office: 333M WERC
Phone: 979-488-4551
http://cesg.tamu.edu/faculty/paul-gratz/
Steve Reinhardt via gem5-users
2014-05-13 02:39:28 UTC
Paul,

Are you talking about the issue where multiple accesses to the same block
cause Ruby to tell the core to retry, which in turn causes a pipeline flush?
We've seen that too and have a patch that we've been intending to post...
this discussion (and the earlier one about store prefetching) has inspired me
to try to get that process started again.

Thanks for speaking up. I'd much rather have people point out problems, or
better yet post patches for them, than stockpile them for a WDDD paper ;-).

Steve



Vamsi Krishna via gem5-users
2014-05-13 08:09:48 UTC
Hello All,

As Paul mentioned, I put together a small analysis of how the number of
writeback buffers affects the performance of the PARSEC benchmarks when it is
increased to 5x the default size. I found that a 2-wide processor improved by
22%, a 4-wide processor by 7%, and an 8-wide processor by 0.6% on average.
This is mainly because the increased availability of buffers increases the
effective issue width. Clearly, if this were modeled correctly, only the
effective writeback width should be affected, not the effective issue width.
As it stands, long-latency instructions such as load misses reduce the
effective issue width until the load completes, and narrower processors seem
to suffer significantly because of this.

Regarding the issue where multiple accesses to the same block cause pipeline
flushes, I posted a question about this earlier
(http://comments.gmane.org/gmane.comp.emulators.m5.users/16657), but
unfortunately the thread did not go any further. It has a huge performance
impact on the PARSEC benchmarks: up to 40% on an 8-wide processor, 29% on a
4-wide processor, and 13% on a 2-wide processor on average. It would be great
to have a fix for this in gem5, because it causes unusually high flushing
activity in the pipeline and hurts speculation.

Thanks,
Vamsi Krishna


--
Regards,
Vamsi Krishna
Mitch Hayenga via gem5-users
2014-05-13 14:32:08 UTC
I actually wrote a patch a while back (apparently Feb 20) that fixed the
load squash issue. I kind of abandoned it, but it was able to run a few
benchmarks (I never ran the regression tests on it). I'll revive it and see
if it passes the regression tests.

All it did was force the load to be replayed repeatedly until it was no
longer blocked, rather than squashing the entire pipeline. I remember
incrWb() and decrWb() were the most annoying part of writing it.

As a side note, I've generally found that increasing tgts_per_mshr to
something unlikely to be hit largely eliminates the issue (this is why I
abandoned the patch). You still limit the number of outstanding cache lines
via the number of MSHRs, but you don't squash just because a bunch of loads
all accessed the same line. This is probably a good temporary solution.
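
For anyone who wants that stop-gap, it is a cache-parameter change in the
configuration script. The sketch below assumes gem5's classic BaseCache
parameters (mshrs, tgts_per_mshr); the sizes are illustrative:

    from m5.objects import BaseCache

    class L1DCache(BaseCache):
        size = '32kB'
        assoc = 2
        mshrs = 10           # still bounds the number of outstanding cache lines
        tgts_per_mshr = 64   # large enough that same-line loads no longer squash
        # latency parameters etc. as in your usual L1 config
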



Fernando Endo via gem5-users
2014-05-22 10:34:36 UTC
Hello,

To my understanding, wbDepth represents a kind of "average effective
execution stage depth": wbWidth * wbDepth is the maximum number of in-flight
instructions allowed in the EXE stage, i.e., instructions that have issued
but have not yet written back.

Given that, I agree that such buffers would be better modeled if they were
distributed among the FUs, because that would represent each FU's effective
depth.

I call it "effective depth" because in real hardware an FP functional unit
(FU) may have 4 stages (so its depth is 4), but in gem5 we might set up an
FMAC with a latency greater than 4, say 8 cycles, because in real hardware
there may be more than one pipe inside an FU (e.g., FP ADD, FP MUL, etc.). In
that case, we should set up the FP FU with an effective EXE depth of 8 to
simulate it correctly.

For now, I always set wbDepth = max(opLat / issueCycles), which means that
the issueWidth, issueCycles, and opLat generally dictate the maximum number
of in-flight instructions in the EXE stage (wbDepth should then have no major
influence).
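
As a concrete (hypothetical) example of that rule of thumb, with an FU pool
whose slowest fully pipelined op is the 8-cycle FMAC described above (the op
names and latencies are illustrative, not gem5 defaults):

    from math import ceil

    # Hypothetical (opLat, issueCycles) pairs per op class, in cycles.
    fu_ops = {
        'IntAlu':    (1, 1),
        'IntMult':   (3, 1),
        'FloatAdd':  (4, 1),
        'FloatMacc': (8, 1),   # the 8-cycle effective-depth FMAC
    }

    wb_depth = max(ceil(op_lat / issue_cycles)
                   for op_lat, issue_cycles in fu_ops.values())
    print(wb_depth)            # 8
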

Thanks for this discussion.

Regards,
--
Fernando A. Endo, PhD student and researcher

Université de Grenoble, UJF
France