Discussion:
A Patch for DRAMsim2 Integration
Dong, Xiangyu
2011-12-18 09:48:58 UTC
Hi all,

I have a gem5+DRAMSim2 integration patch. I've tested it in both SE and FS
modes, and I'd like to share it here.

If you need it, please go to my website, www.cse.psu.edu/~xydong, to
download the patch and test it. To enable DRAMSim2, use the se_dramsim2.py
script instead of se.py (for FS, you can create an equivalent script
yourself). The basic idea is to use the derived DRAMMemory class in place of
the PhysicalMemory class.

Please let me know if you find any bugs.

Thank you!

Best,
Xiangyu Dong
Andrew Cebulski
2011-12-18 14:37:55 UTC
Thanks for the integration patch, Xiangyu! I'll let you know if I come
across any bugs.

-Andrew



Dong, Xiangyu
2011-12-18 22:04:36 UTC
Thanks. Actually, I have only tested it on ARM_SE and ARM_FS. Let me know if
it also works for other processors.

In addition, I made further modifications on the DRAMSim2 side. Probably the
most important one changes how DRAMSim2 reports latency/bandwidth
statistics. DRAMSim2 reports all statistics after every EPOCH and then
resets the counters. gem5 users who are only interested in statistics over
the entire simulation may want to change the code in
DRAMsim2/MemoryController.cpp along the lines of what I've done (that change
is NOT in the patch, since it is more DRAMSim2-related).
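
One way to get whole-run numbers while keeping the per-epoch report is to fold each finished epoch into running totals before the reset. The following is a minimal sketch of that idea only. It is not the change described above and not DRAMSim2's actual code: MemStats, StatsAccumulator and every member name are hypothetical, and latencies are assumed to be counted in cycles.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical counters; these names are illustrative, not DRAMSim2's.
struct MemStats {
    std::uint64_t transactions = 0;
    std::uint64_t latencyCycles = 0;  // sum of per-transaction latencies
    std::uint64_t bytes = 0;

    // Add another set of counters into this one.
    void add(const MemStats &other) {
        transactions += other.transactions;
        latencyCycles += other.latencyCycles;
        bytes += other.bytes;
    }

    // Mean latency in cycles, or 0.0 if nothing has been recorded.
    double averageLatency() const {
        return transactions == 0
                   ? 0.0
                   : static_cast<double>(latencyCycles) / transactions;
    }
};

class StatsAccumulator {
  public:
    // Record one completed memory transaction in the current epoch.
    void record(std::uint64_t latency, std::uint64_t numBytes) {
        epoch_.transactions += 1;
        epoch_.latencyCycles += latency;
        epoch_.bytes += numBytes;
    }

    // Epoch boundary: fold the epoch into the running totals, then reset
    // only the per-epoch counters. Returns the finished epoch for printing.
    MemStats endEpoch() {
        MemStats finished = epoch_;
        total_.add(finished);
        epoch_ = MemStats{};
        return finished;
    }

    // Whole-run statistics, including the epoch still in progress.
    MemStats wholeRun() const {
        MemStats all = total_;
        all.add(epoch_);
        return all;
    }

  private:
    MemStats epoch_;  // counters since the last epoch boundary
    MemStats total_;  // counters from all finished epochs
};
```

In a memory controller, record() would be called as each transaction completes, endEpoch() where the per-epoch report is printed, and wholeRun() once at the end of the simulation.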



Best,

Xiangyu



Andrew Cebulski
2012-03-09 02:51:44 UTC
Xiangyu,

I've recently had an issue with the number of instructions I see committed
by the CPU (I have a separate thread on this). The issue seems to come from
the patch you created to integrate DRAMSim2 with gem5. Unfortunately, I've
been running gem5.fast rather than gem5.opt, so until now I haven't been
seeing assertions. I thought I had run it with gem5.opt or gem5.debug back
in December, but I must not have. My runs on the ARM O3 CPU fail with this
assertion:

build/ARM/cpu/base_dyn_inst_impl.hh:149: void BaseDynInst<Impl>::initVars()
[with Impl = O3CPUImpl]: Assertion `cpu->instcount <= 1500' failed.

Have you seen similar results? Does this count track how many instructions
are currently in flight in the CPU? My initial guess is that memory
instructions sent to DRAMSim2 are being counted as committed regardless of
whether they are mispredicted (and rerun). Any suggestions on where to
insert DPRINTFs, or which existing ones to use, to find out whether this is
what is happening?
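
For readers unfamiliar with this kind of check, here is a minimal sketch of a live-object counter with a sanity bound. It assumes (this should be verified against base_dyn_inst_impl.hh) that instcount counts dynamic instruction objects that currently exist, going up on construction and down on destruction, rather than committed instructions. ToyCpu, ToyDynInst and fetch are hypothetical names and are not gem5 code. Because assert() is compiled out when NDEBUG is defined, a build that defines it would never report the bound being exceeded, which would be consistent with the failure only appearing outside gem5.fast.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for a CPU model that tracks live instruction objects.
struct ToyCpu {
    int instcount = 0;  // number of ToyDynInst objects currently alive
};

class ToyDynInst {
  public:
    explicit ToyDynInst(ToyCpu &cpu) : cpu_(cpu) {
        ++cpu_.instcount;
        // Sanity bound in the spirit of the failing assertion. This line
        // disappears entirely when the code is compiled with NDEBUG.
        assert(cpu_.instcount <= 1500);
    }
    ~ToyDynInst() { --cpu_.instcount; }

    // Copying would break the one-object-one-count invariant.
    ToyDynInst(const ToyDynInst &) = delete;
    ToyDynInst &operator=(const ToyDynInst &) = delete;

  private:
    ToyCpu &cpu_;
};

// Create n in-flight instructions owned by the returned vector.
std::vector<std::unique_ptr<ToyDynInst>> fetch(ToyCpu &cpu, int n) {
    std::vector<std::unique_ptr<ToyDynInst>> insts;
    for (int i = 0; i < n; ++i)
        insts.push_back(std::unique_ptr<ToyDynInst>(new ToyDynInst(cpu)));
    return insts;
}
```

Under this reading, hitting the bound would point at instruction objects that are never released (for example, requests that never receive a response) rather than at instructions being counted twice at commit.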

Ali helped me earlier with getting the checker and some debug flags working
for tracing. I currently have traces with the debug flags Exec, ExecAsid and
DynInst; I just need to know what to search for in them to get useful info.

Thanks,
Andrew

Rio Xiangyu Dong
2012-03-13 17:37:45 UTC
Hi Andrew,



I didn't see this error in my simulations. May I ask which gem5 version you
are using? I've found that some of the latest code updates are not
compatible with my changes. I am still using the DRAMSim2 patch on gem5
revision 8643, and I have run all the runnable benchmarks in SPEC2006,
SPEC2000, EEMBC2, and PARSEC2 on ARM_SE.



Thank you!



Best,

Xiangyu



Andrew Cebulski
2012-03-13 17:56:42 UTC
Hi Xiangyu,

I just started looking into this some more. At first I thought it was
due to updating to a more recent revision, but then I went back to
revision 8643, applied your patch, built and ran it, and I now get the error
there too (when running ARM_FS/gem5.opt). I'm now testing whether an update
to SWIG might have caused this error; maybe someone on the mailing list
knows if that's possible. The difference is 1.3.40 vs. 2.0.3, both of which
are supported according to the dependencies wiki page.

Just for completeness, here's the error from revision 8643:
build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
BaseDynInst<Impl>::initVars() [with Impl = O3CPUImpl]: Assertion
`cpu->instcount <= 1500' failed.

I also removed all the changes I've added for my research from the test
with revision 8643 and your patch, aside from adding a new rcS file and the
cross-compiled binary to the disk image for Libquantum (SPEC CPU2006).
Note that I use CodeSourcery for cross-compiling.

I have not tried running with gem5.debug, so I will be doing that today.
Maybe this assertion is being triggered by an optimization. That would mean
it wouldn't be triggered in gem5.debug, since it runs without
optimizations. Have you run your tests with all of gem5.debug, gem5.opt and
gem5.fast?

Thanks,
Andrew

Andrew Cebulski
2012-03-13 19:12:14 UTC
Did you ever get this error building gem5.debug on revision 8643?

-----

cc1plus: warnings being treated as errors
In file included from build/dramsim2/SystemConfiguration.h:42:0,
                 from build/dramsim2/MemorySystem.h:42,
                 from build/ARM_FS/mem/dram.hh:42,
                 from build/ARM_FS/mem/dram.cc:39:
build/dramsim2/PrintMacros.h:49:0: error: "DEBUG" redefined
<command-line>:0:0: note: this is the location of the previous definition

-----

The DramSim2 repo has the same file:
https://github.com/dramninjasUMD/DRAMSim2/blob/master/PrintMacros.h

The solution seems to be adding "#undef DEBUG" right before it gets
defined. I can't find where it's initially defined, though. Grepping the
gem5 src and ext repos for a DEBUG define comes up empty, and commenting
out all defines of DEBUG in PrintMacros.h still results in the redefinition
message.
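
A note on the diagnostic: GCC reports "<command-line>" as the location of a macro that was defined with a -D option, so the first definition is most likely a -DDEBUG added by the build scripts for the debug target (an assumption here) and not a #define in any source file. That would explain the empty grep. Below is a minimal sketch of the "#undef before define" workaround. The macro body and the debugLog and logRow helpers are hypothetical and are not copied from PrintMacros.h.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Stand-in for a -DDEBUG passed on the compiler command line (assumption).
#ifndef DEBUG
#define DEBUG 1
#endif

// Hypothetical log sink so the example has observable output.
inline std::ostringstream &debugLog() {
    static std::ostringstream log;
    return log;
}

// What a header could do before defining its own function-like macro:
// drop any earlier definition so the compiler sees no redefinition.
#ifdef DEBUG
#undef DEBUG
#endif
#define DEBUG(str) do { debugLog() << str << '\n'; } while (0)

// Hypothetical caller that uses the function-like macro.
inline void logRow(int row) {
    DEBUG("row " << row);
}
```

One caveat with this design: after the header is included, DEBUG no longer has the value given on the command line, so later code that tests it with "#if DEBUG" may not behave as intended. Renaming the library's macro avoids that at the cost of a larger change.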

I have two versions of gcc readily available (4.5.0 and 4.5.3), and I get
it on both.

-Andrew

Rio Xiangyu Dong
2012-03-13 20:30:30 UTC
I use gem5.opt and have never seen this error.



Andrew Cebulski
2012-03-13 22:08:44 UTC
As far as I can tell, the instruction count problem is either restricted to
FS (I don't know whether it is restricted to ARM), varies with the
environment setup, or is caused by the cross-compiled binary. I doubt the
latter two, since I've run various benchmarks to completion in gem5.fast,
and their committed instruction counts have been within 10% of the counts
for the same binaries on physical hardware (as measured with valgrind).

At this point, it would be helpful to know if anyone else has run a
benchmark with your patch on ARM_FS (or other arch) successfully. Luckily,
you have made it very easy to test since it replaces the old DRAMMemory
object. Otherwise, I'll continue debugging this and emailing the list with
questions as I go.

Is it possible for this assertion to only occur in ARM_FS, not ARM_SE, with
the same benchmarks? I require full-system, so I haven't tested ARM_SE.
If my benchmarks pass with ARM_SE in gem5.opt, that should narrow the
source of the problem down.

Thanks,
Andrew

Andrew Cebulski
2012-03-14 03:33:03 UTC
Permalink
I figured out how to run my benchmark in ARM_SE. First, I had to recompile
my benchmark statically linked, since I run it dynamically linked in
ARM_FS. I was able to run the statically linked libquantum in ARM_SE with
gem5.opt without any errors, the same as your tests. I did look at the
committed instruction count within the stats file for the ARM_SE run with
the O3 CPU, and it does look accurate.

Next, I added this new statically linked libquantum to my disk image to run
in ARM_FS with gem5.opt. It failed with the same exact instcount error
in base_dyn_inst_impl.hh as with my dynamically linked libquantum
benchmark.

So it would seem that the problem only occurs in FS, not SE.

Now the question is how to go about fixing the issue preventing your
integration of DramSim2 from working in ARM_FS... Any ideas?

It looks like you have some debugging coded into your dram.cc, but
commented out (printf statements). I'll add them back and compare their
outputs between ARM_SE and ARM_FS. I might convert them to DPRINTFs too.

Do you know how to set up one of your ARM_SE benchmarks to run in ARM_FS?
It basically involves putting it into the Gem5 Ubuntu image (i.e. using
util/gem5img.py), creating an rcS file, and adding it in Benchmarks.py.
This way you can try to reproduce the error in your environment. That
would be a good start.

-Andrew

Andrew Cebulski
2012-03-14 04:41:42 UTC
Permalink
Here's a summary from running with the printfs showing the reads, writes
and receive timings reported by dram.cc for libquantum:

           reads    writes   receive timings
ARM_FS   1486578    115473           1605225
ARM_SE    347210      3002            350213
FS/SE       4.28     38.47              4.58

-Andrew

Iordan Alexandru
2012-03-14 12:23:48 UTC
Permalink
Hello

I added a pseudo instruction to GEM5 and I want to dump the function profile every time this pseudo-inst is called from my simulated application. The problem is that my FS simulation ends with a segmentation fault error. The backtrace in GDB shows the following:

#0  _Rb_tree_const_iterator (this=0x0, tc=0x94d8208, os=...) at build/ALPHA_FS/cpu/profile.cc:157
#1  end (this=0x0, tc=0x94d8208, os=...) at /usr/include/c++/4.5/bits/stl_tree.h:646
#2  end (this=0x0, tc=0x94d8208, os=...) at /usr/include/c++/4.5/bits/stl_map.h:335
#3  FunctionProfile::dump (this=0x0, tc=0x94d8208, os=...) at build/ALPHA_FS/cpu/profile.cc:126
#4  0x083b1240 in SimpleThread::dumpFuncProfile (this=0x94d7bf8) at build/ALPHA_FS/cpu/simple_thread.cc:227
#5  0x0818673f in PseudoInst::wool_func (tc=0x94d8208, sc_by=0, task_id=60)
    at build/ALPHA_FS/sim/pseudo_inst.cc:408
#6  0x080a344b in AlphaISAInst::M5reserved2::execute (this=0xb2ba728, xc=0x94d6f50, traceData=0x0)
    at build/ALPHA_FS/arch/alpha/atomic_simple_cpu_exec.cc:11641
#7  0x083d0006 in AtomicSimpleCPU::tick (this=0x94d6f50) at build/ALPHA_FS/cpu/simple/atomic.cc:619
#8  0x08141fb5 in EventQueue::serviceOne (this=0x8acdac8) at build/ALPHA_FS/sim/eventq.cc:204
#9  0x0817d72f in simulate (num_cycles=9223372036854775807) at build/ALPHA_FS/sim/simulate.cc:72
#10 0x0821f94d in _wrap_simulate__SWIG_0 (self=0x0, args=0xb79d60ac)
    at build/ALPHA_FS/python/swig/event_wrap.cc:4534
#11 _wrap_simulate (self=0x0, args=0xb79d60ac) at build/ALPHA_FS/python/swig/event_wrap.cc:4584
#12 0x001c531a in PyCFunction_Call () from /usr/lib/libpython2.7.so.1.0
#13 0x0022c3a9 in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#14 0x0022e4c8 in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#15 0x0022c722 in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#16 0x0022ccc3 in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#17 0x0022e4c8 in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#18 0x0022e623 in PyEval_EvalCode () from /usr/lib/libpython2.7.so.1.0
#19 0x0022caae in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#20 0x0022e4c8 in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#21 0x0022c722 in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#22 0x0022e4c8 in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#23 0x0022e623 in PyEval_EvalCode () from /usr/lib/libpython2.7.so.1.0
#24 0x0024f31e in PyRun_StringFlags () from /usr/lib/libpython2.7.so.1.0
#25 0x0814b8a9 in m5Main (argc=10, argv=0xbffff384) at build/ALPHA_FS/sim/init.cc:256
#26 0x0804e202 in main (argc=10, argv=0xbffff384) at build/ALPHA_FS/sim/main.cc:57


Is this a GCC version problem? Has anybody any idea how to solve this?

Thanks in advance!

Alexandru Iordan
Paul Rosenfeld
2012-03-14 15:21:45 UTC
Permalink
I think it isn't possible for anyone to give you a definitive answer with
only a GDB backtrace and no code/description. It looks like the 'this'
pointer is NULL for your map iterator. I kind of doubt that's a GCC error
though.

Ali Saidi
2012-03-14 18:27:11 UTC
Permalink
You have a null pointer.

Ali

On 14.03.2012 05:23, Iordan Alexandru wrote:

> #3 FunctionProfile::dump (this=0x0, tc=0x94d8208, os=...) at build/ALPHA_FS/cpu/profile.cc:126
Rio Xiangyu Dong
2012-03-14 16:23:57 UTC
Permalink
Hi Andrew,



I only used ARM_SE in my daily research work, and they all work fine. It now
seems that ARM_FS is the source of error. Your debug trace would be very
helpful and I will look into this issue when I have time to do it.



Thank you!



Best,

Xiangyu



From: gem5-users-***@gem5.org [mailto:gem5-users-***@gem5.org] On
Behalf Of Andrew Cebulski
Sent: Tuesday, March 13, 2012 9:42 PM
To: gem5 users mailing list
Subject: Re: [gem5-users] A Patch for DRAMsim2 Integration



Here's a summary from running with the printfs showing the reads, writes and
receive timings reported by dram.cc for libquantum:

         reads     writes   receive timings
ARM_FS   1486578   115473   1605225
ARM_SE   347210    3002     350213
FS/SE    4.28      38.47    4.58
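
The FS/SE row is each ARM_FS count divided by the matching ARM_SE count, rounded to two decimals. A small sketch of that arithmetic, in case anyone wants to repeat it for other benchmarks (the helper name ratio2 is made up for illustration):

```cpp
#include <cmath>

// Ratio of a full-system count to a syscall-emulation count,
// rounded to two decimal places.
double ratio2(double fsCount, double seCount) {
    return std::round(fsCount / seCount * 100.0) / 100.0;
}
```

For example, ratio2(115473, 3002) reproduces the 38.47 figure for writes.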

-Andrew

On Tue, Mar 13, 2012 at 11:33 PM, Andrew Cebulski <***@drexel.edu> wrote:

I figured out how to run my benchmark in ARM_SE. First, I had to recompile
my benchmark statically linked, since I run it dynamically linked in ARM_FS.
I was able to run the statically linked libquantum in ARM_SE with gem5.opt
without any errors, the same as your tests. I did look at the committed
instruction count within the stats file for the ARM_SE run with the O3 CPU,
and it does look accurate.



Next, I added this new statically linked libquantum to my disk image to run
in ARM_FS with gem5.opt. It failed with the same exact instcount error in
base_dyn_inst_impl.hh as with my dynamically linked libquantum benchmark.



So it would seem that the problem only occurs in FS, not SE.



Now the question is how to go about fixing the issue preventing your
integration of DramSim2 from working in ARM_FS... Any ideas?



It looks like you have some debugging coded into your dram.cc, but commented
out (printf statements). I'll add them back and compare their outputs
between ARM_SE and ARM_FS. I might convert them to DPRINTFs too.



Do you know how to set up one of your ARM_SE benchmarks to run in ARM_FS? It
basically involves putting it into the Gem5 Ubuntu image (i.e. using
util/gem5img.py), creating a rcS file, and adding it in Benchmarks.py. This
way you can try to reproduce the error in your environment. That would be a
good start.



-Andrew



On Tue, Mar 13, 2012 at 6:08 PM, Andrew Cebulski <***@drexel.edu> wrote:

As far as I can tell, the source of the instruction count problem either is
restricted to FS (don't know if restricted to ARM), varies with environment
setup or is caused by the cross-compiled binary file. I'm doubtful of the
latter two since I've run with various benchmarks in gem5.fast to
completion, and they have been within 10% of the committed instructions that
the benchmarks have run on physical hardware (as measured with
valgrind...same binary files).



At this point, it would be helpful to know if anyone else has run a
benchmark with your patch on ARM_FS (or other arch) successfully. Luckily,
you have made it very easy to test since it replaces the old DRAMMemory
object. Otherwise, I'll continue debugging this and emailing the list with
questions as I go.



Is it possible for this assertion to only occur in ARM_FS, not ARM_SE, with
the same benchmarks? I require full-system, so I haven't tested ARM_SE. If
my benchmarks pass with ARM_SE in gem5.opt, that should narrow the source of
the problem down.



Thanks,

Andrew



On Tue, Mar 13, 2012 at 4:30 PM, Rio Xiangyu Dong <***@gmail.com>
wrote:

I use gem5.opt, and never see this error.



From: gem5-users-***@gem5.org [mailto:gem5-users-***@gem5.org] On
Behalf Of Andrew Cebulski
Sent: Tuesday, March 13, 2012 12:12 PM


To: gem5 users mailing list

Subject: Re: [gem5-users] A Patch for DRAMsim2 Integration



Did you ever get this error building gem5.debug on revision 8643?

-----



cc1plus: warnings being treated as errors
In file included from build/dramsim2/SystemConfiguration.h:42:0,
from build/dramsim2/MemorySystem.h:42,
from build/ARM_FS/mem/dram.hh:42,
from build/ARM_FS/mem/dram.cc:39:
build/dramsim2/PrintMacros.h:49:0: error: "DEBUG" redefined
<command-line>:0:0: note: this is the location of the previous definition

-----

The DramSim2 repo has the same file:
https://github.com/dramninjasUMD/DRAMSim2/blob/master/PrintMacros.h


The solution seems to be adding "#undef DEBUG" right before it gets
defined. I can't find where it's getting defined initially though...
Doing a grep of the gem5 src and ext repos for a DEBUG define comes up
empty. Commenting out all defines of DEBUG in PrintMacros.h still results
in the redefine message...
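
A sketch of the "#undef DEBUG" workaround is below. The "<command-line>:0:0" note in the error suggests the earlier definition comes from a -D compiler flag rather than from a header, which would explain the empty grep; that reading is an assumption based on the message alone. The macro body here is hypothetical and is not the one in DRAMSim2's PrintMacros.h.

```cpp
#include <sstream>
#include <string>

// Stands in for a DEBUG symbol defined on the compiler command line
// (e.g. via a -DDEBUG flag); assumed here for illustration.
#define DEBUG 1

// Drop any earlier definition before introducing the function-like macro,
// so the compiler does not report "DEBUG" redefined.
#ifdef DEBUG
#undef DEBUG
#endif
#define DEBUG(os, msg) ((os) << "[debug] " << msg << '\n')

// Small helper showing that the redefined macro is usable.
std::string demoDebugMacro() {
    std::ostringstream os;
    DEBUG(os, "transaction queued");
    return os.str();
}
```

Note that undefining DEBUG this way also hides the original symbol from any code included afterwards, so it is safer to keep it local to the DRAMSim2 header.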



I have two versions of gcc readily available (4.5.0 and 4.5.3), and I get it
on both.



-Andrew

On Tue, Mar 13, 2012 at 1:56 PM, Andrew Cebulski <***@drexel.edu> wrote:

Hi Xiangyu,



I just started looking into this some more. So at first I thought it
was due to updating to a more recent revision, but then I went back to
revision 8643, added your patch, built and ran....and now get the error with
it too (when running ARM_FS/gem5.opt). I'm testing now to see if an update
to SWIG might have resulted in this error, maybe someone on the mailing list
would know if that's possible. The difference is 1.3.40 vs. 2.0.3, both of
which are supported according to the dependencies wiki page.



Just for completeness, here's the error from revision 8643:

build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
BaseDynInst<Impl>::initVars() [with Impl = O3CPUImpl]: Assertion
`cpu->instcount <= 1500' failed.



I also removed all the changes I've added for my research from the test
with revision 8643 and your patch, aside from adding a new rcS file and the
cross-compiled binary to the disk image for Libquantum (SPEC CPU2006). Note
that I use CodeSourcery for cross-compiling.



I have not tried running with gem5.debug, so I will be doing that today.
Maybe this is an assertion that is occurring due to an optimization. That
would mean it wouldn't be triggered in gem5.debug since it runs without
optimizations. Have you tested all debug, opt and fast with your tests?



Thanks,

Andrew



On Tue, Mar 13, 2012 at 1:37 PM, Rio Xiangyu Dong <***@gmail.com>
wrote:

Hi Andrew,



I didn't see this error in my simulations. May I ask which gem5 version you
are using? I find some of the latest code updates do not comply with my
changes. I am still using the DRAMsim2 patch on Gem5 repo8643, and have run
all the runnable benchmarks in SPEC2006, SPEC2000, EEMBC2, and PARSEC2 on
ARM_SE.



Thank you!



Best,

Xiangyu



From: Andrew Cebulski [mailto:***@drexel.edu]
Sent: Thursday, March 08, 2012 6:52 PM


To: gem5 users mailing list

Cc: ***@gmail.com; ***@umich.edu


Subject: Re: [gem5-users] A Patch for DRAMsim2 Integration



Xiangyu,



I've been having an issue recently with the number of instructions I've
been seeing committed to the CPU (I have a separate thread on this). It
turns out the issue seems to be coming from this patch you created to
integrate DramSim2 with Gem5. Unfortunately, I've been running with
gem5.fast, not gem5.opt. So up until now, I haven't been seeing assertions.
I thought I'd run it with gem5.opt or debug back in December, but I must not
have. My runs on the Arm O3 cpu fail with this assertion:



build/ARM/cpu/base_dyn_inst_impl.hh:149: void BaseDynInst<Impl>::initVars()
[with Impl = O3CPUImpl]: Assertion `cpu->instcount <= 1500' failed.



Have you seen similar results? Is this count how many instructions are
currently being processed by the cpu? My initial guess is that memory
instructions being sent to DramSim2 are getting counted as committed
regardless of whether they are mispredicted (and rerun). Any suggestions on
where to insert DPRINTFs, or use current ones, to find out if this is what
is happening?



Ali helped me earlier with getting the checker and some debug flags to
track earlier. I currently have traces with debug flags Exec, ExecAsid and
DynInst. I just need to know what to search for in them for useful info.



Thanks,

Andrew



On Sun, Dec 18, 2011 at 5:04 PM, Dong, Xiangyu <***@gmail.com> wrote:

Thanks. Actually I only tested it on ARM_SE and ARM_FS. Let me know if it
also works for other processors.



In addition, I actually made more modifications on the DRAMsim2 side. Maybe
the most important one is that I changed the way DRAMsim2 reports
latency/bandwidth statistics. DRAMsim2 reports all the statistics after
every EPOCH, and then resets all the numbers. For Gem5 users who are only
interested in the statistics over the entire simulation time, you might want
to change the code in DRAMsim2/MemoryController.cpp and make changes
similar to what I've done (that's NOT in the patch since it's more
DRAMsim2-related).



Best,

Xiangyu



From: gem5-users-***@gem5.org [mailto:gem5-users-***@gem5.org] On
Behalf Of Andrew Cebulski
Sent: Sunday, December 18, 2011 6:38 AM
To: gem5-***@gem5.org
Subject: Re: [gem5-users] A Patch for DRAMsim2 Integration



Thanks for the integration patch Xiangyu! I'll let you know if I come
across any bugs.



-Andrew





Date: Sun, 18 Dec 2011 01:48:58 -0800
From: "Dong, Xiangyu" <***@gmail.com>
To: "gem5 users mailing list" <gem5-***@gem5.org>
Subject: [gem5-users] A Patch for DRAMsim2 Integration
Message-ID: <001201ccbd6a$45102210$cf306630$@gmail.com>
Content-Type: text/plain; charset="us-ascii"

Hi all,



I have a Gem5+DRAMsim2 patch. I've tested it under both SE and FS modes.
I'm willing to share it here.



For those who have such needs, please go to my website
www.cse.psu.edu/~xydong to download the patch and test it. To enable
DRAMSim2, use se_dramsim2.py script instead of se.py (for FS, you can create
by yourself). The basic idea to enable the DRAMsim2 module is to use the
derived DRAMMemory class instead of PhysicalMemory class.



Please let me know if there are bugs.



Thank you!



Best,

Xiangyu Dong

Ali Saidi
2012-03-14 18:10:31 UTC
Permalink
The error is that there are more than 1500 instructions currently in
flight in the system. It could mean several things:

1. The value is somewhat arbitrarily defined and maybe there are more
than 1500 in your system at one time?

2. Instructions aren't being destroyed correctly

You could try to run a debug binary so you'll get a list of
instructions when it happens or increase the number which may be
appropriate for certain situations (but 1500 is quite a few inflight
instructions).
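
To make the two cases concrete, here is a toy model of an in-flight counter with the same kind of limit check. It is only a sketch: ToyCpu, ToyInst and the helper functions are invented names, and the only detail taken from the real code is the 1500 limit quoted in the assertion message.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Toy model of the counter behind `cpu->instcount <= 1500`.
struct ToyCpu {
    int instcount = 0;
};

const int kMaxInFlight = 1500;  // limit quoted in the assertion message

class ToyInst {
  public:
    explicit ToyInst(ToyCpu &cpu) : cpu_(cpu) {
        ++cpu_.instcount;
        // Compiled out when NDEBUG is defined, so the check is silent there.
        assert(cpu_.instcount <= kMaxInFlight);
    }
    ~ToyInst() { --cpu_.instcount; }
    ToyInst(const ToyInst &) = delete;
    ToyInst &operator=(const ToyInst &) = delete;

  private:
    ToyCpu &cpu_;
};

// Each instruction is destroyed before the next is created: count stays low.
int countAfterBalanced(int n) {
    ToyCpu cpu;
    for (int i = 0; i < n; ++i) {
        ToyInst inst(cpu);
    }
    return cpu.instcount;
}

// Instructions are created but kept alive: the count grows with n.
int countWhileHeld(int n) {
    ToyCpu cpu;
    std::vector<std::unique_ptr<ToyInst>> held;
    for (int i = 0; i < n; ++i)
        held.push_back(std::unique_ptr<ToyInst>(new ToyInst(cpu)));
    return cpu.instcount;  // read before `held` releases the instructions
}
```

In this model, holding more than 1500 objects at once trips the assertion, which corresponds to case 2 above when the objects should already have been released.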

Ali

Paul Rosenfeld
2012-03-14 18:29:43 UTC
Permalink
It looks like the transaction queue in the DRAMSim2's system.ini file is
quite large (512 entries). I'd try to crank down the number of entries in
this queue to something significantly smaller (say 32) and see if you still
hit this assertion.

Andrew Cebulski
2012-04-08 01:21:03 UTC
Permalink
Hi all,

I've looked into this problem some more, and have put together a couple
traces. I've been becoming more familiar with how gem5 handles dynamic
instructions, in particular how it destroys them. I have two traces to
compare, one with the physical memory, and the other with the integrated
dramsim2 dram memory. I also have two plots showing instruction counts
over time (sim ticks). All of these are linked at the end of the email.

First, I'm going to go into what I've been able to interpret regarding how
instructions are destroyed. In particular, comparing when DynInst's vs.
DynInstPtr's are deconstructed/removed from the cpu. I separate these
because I've seen a difference, as I discuss later. These explanations are
fairly non-existent on the wiki. There is a section header waiting to be
filled...

From what I have been able to gather from the code, there is a list of all
the instructions in flight in cpu/o3/cpu.cc called instList, with the type
DynInstPtr. There are three conditions to instructions being cleaned from
this list:

1.) The ROB retires its head instruction
2.) Fetch receives a rob squashing signal from the commit, resulting in
removing any instruction not in the ROB
3.) Decode detects an incorrect branch prediction, resulting in removal of
all instructions back to the bad seq num.

Once all five stages have completed, the CPU cleans up all the removed
in-flight instructions. This line in particular
in cleanUpRemovedInsts() in cpu/o3/cpu.cc deconstructs a DynInstPtr:

instList.erase(removeList.front());
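
As I understand it, the pattern is "mark now, erase later". Here is a stripped-down sketch of that idea, using plain sequence numbers instead of DynInstPtr objects. ToyInstList and its method names are my own and only approximate what cpu/o3/cpu.cc does.

```cpp
#include <cstddef>
#include <list>
#include <queue>

// Toy version of an in-flight instruction list with deferred removal.
class ToyInstList {
  public:
    using Iter = std::list<int>::iterator;

    // Append an instruction (represented by its sequence number).
    Iter add(int seqNum) { return insts_.insert(insts_.end(), seqNum); }

    // Called when a stage retires or squashes an instruction:
    // only record the entry, do not erase it yet.
    // Marking the same entry twice would be a bug in this sketch.
    void markForRemoval(Iter it) { removeList_.push(it); }

    // Called once after all stages have ticked: erase everything marked.
    void cleanUpRemoved() {
        while (!removeList_.empty()) {
            insts_.erase(removeList_.front());
            removeList_.pop();
        }
    }

    std::size_t size() const { return insts_.size(); }

  private:
    std::list<int> insts_;         // in-flight instructions
    std::queue<Iter> removeList_;  // entries waiting to be erased
};
```

std::list iterators stay valid when other elements are erased, which is what makes storing them in the removal queue safe.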

When I turn on the debug flag O3CPU, I see the message "Removing
instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum and pcState
after all 5 cpu stages have completed, and one of the conditions above is
met. I also see what tick it occurs on.

When I turn on the DynInst debug flag, I see when instructions are created
and destroyed (cpu/base_dyn_inst_impl.hh) and what tick. From analyzing
the trace files, I've gathered that this takes into account that
instructions have different execution lengths. So if a memory
instruction is removed from the instList (DynInstPtr) at one tick, the
DynInst for that memory instruction will be destroyed much later (i.e. 1M
ticks later). I have yet to determine how this is implemented.
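
One plausible mechanism, assuming DynInstPtr behaves like a reference-counted smart pointer, is that something other than instList still holds a reference when the list entry is erased. The sketch below uses std::shared_ptr as a stand-in. Counter, CountedInst and the "pending access" holder are hypothetical names, not gem5 code.

```cpp
#include <cstddef>
#include <list>
#include <memory>

// Counts live instruction objects, like cpu->instcount.
struct Counter {
    int live = 0;
};

struct CountedInst {
    explicit CountedInst(Counter &c) : counter(c) { ++counter.live; }
    ~CountedInst() { --counter.live; }
    Counter &counter;
};

using CountedInstPtr = std::shared_ptr<CountedInst>;

struct Snapshot {
    std::size_t listSize;  // analogous to instList.size()
    int liveObjects;       // analogous to cpu->instcount
};

// Erase the list entry while a second holder still references the object.
Snapshot eraseWhileReferenced() {
    Counter counter;
    std::list<CountedInstPtr> instList;
    instList.push_back(std::make_shared<CountedInst>(counter));
    CountedInstPtr pendingAccess = instList.front();  // extra reference
    instList.erase(instList.begin());
    return Snapshot{instList.size(), counter.live};
}

// Erase the list entry when the list holds the only reference.
Snapshot eraseLastReference() {
    Counter counter;
    std::list<CountedInstPtr> instList;
    instList.push_back(std::make_shared<CountedInst>(counter));
    instList.erase(instList.begin());
    return Snapshot{instList.size(), counter.live};
}
```

In the first case the list is empty but one object is still alive, which is the kind of gap between instList.size() and the live-object count described above. If such outside references were never released, the live count would keep growing.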

Now for the problem.

What I'm seeing when I run dramsim2 dram memory is a significant difference
between the size of the instList (of DynInstPtr objects), and the
dynamic instruction count (of DynInst objects). The benchmark I'm
running is libquantum from SPEC 2006. For the first roughly 130B ticks,
the dynamic instruction count kept in cpu/base_dyn_inst_impl.hh shadows the
instList size in o3/cpu.cc (figure linked below) very closely. Around tick
130B after libquantum started, it starts hitting what I'm assuming are
loops (therefore branch prediction), resulting in some behavior that seems
to imply improper instruction handling (i.e. more instructions in flight
than allowed by ROB).

I wasn't able to sync-up the physical and dramsim2 traces exactly by trace,
but they should represent roughly the same area of execution. They don't
execute the same due to the dramsim2 modeling the memory differently (i.e.
latency and other delays).

I've shared both traces on my public Dropbox here --
http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz

Here are a couple plots of tick versus instruction count, with respect to
cpu->instcount in cpu/base_dyn_inst_impl.hh and instList.size() in
cpu/o3/cpu.cc. --
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png

Note that I added the printout of the instList size to an existing O3CPU
DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.

Here are the commands I ran to parse the traces into data files to analyze
in MATLAB and create the plots:
zgrep DynInst dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz |
grep destroyed | awk '{print $1,$11}' > cpuinstcount.out
zgrep instList dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz |
awk '{print $1,$11}' > instlistsize.out

It seems to me like the problem might lie in gem5, but has just been
exposed by integrating this more detailed memory model, dramsim2, into
gem5. Either that, or there are some timing errors in how dramsim2 was
integrated. I doubt this, however, since those first 190B ticks executed
used the dramsim2 memory. I believe the problem is a combination of memory
instructions + complex loops (branch prediction), resulting in improper
destroying of instructions.

I've included the ROB, Commit, Fetch, DynInst and O3CPU debug flags. There
are 192 ROB entries, which is why the instList size generally has a max of
about 192 instructions. The dynamic instruction counts (seen in the
dramsim2 plot) seem to also imply that instructions are incorrectly being
removed from the ROB, and then from the cpu's instruction list in cpu.cc,
which allows more and more instructions to be added to the system (possibly
from a bad branch).

I appreciate any help in debugging this and further figuring out the root
problem, just let me know if you need anything else from me. I don't have
much more time at the moment to debug, but I can take any advice for quick
changes and/or additional traces, then send the results back to the list
for discussion.

Thanks,
Andrew

P.S. Paul - I did try decreasing the size of the dramsim2 transaction (and
even command) queue from 512 to 32. The same instructions problem
occurred. It basically just decreased the execution time.

On Wed, Mar 14, 2012 at 2:10 PM, Ali Saidi <***@umich.edu> wrote:

> **
>
> The error is that there are more that 1500 instructions currently in
> flight in the system. It could mean several things:
>
> 1. The value is somewhat arbitrarily defined and maybe there are more than
> 1500 in your system at one time?
>
> 2. Instructions aren't being destroyed correctly
>
>
>
> You could try to to run a debug binary so you'll get a list of
> instructions when it happens or increase the number which may
> be appropriate for certain situations (but 1500 is quite a few inflight
> instructions).
>
>
>
> Ali
>
> On 13.03.2012 10:56, Andrew Cebulski wrote:
>
> Hi Xiangyu,
> I just started looking into this some more. So at first I thought it
> was due to updating to a more recent revision, but then I went back to
> revision 8643, added your patch, built and ran....and now get the error
> with it too (when running ARM_FS/gem5.opt). I"m testing now to see if an
> update to SWIG might have resulted in this error, maybe someone on the
> mailing list would know if that's possible. The difference is 1.3.40 vs.
> 2.0.3, both of which are supported according to the dependencies wiki page.
> Just for completeness, here's the error from revision 8643:
> build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void BaseDynInst::initVars()
> [with Impl = O3CPUImpl]: Assertion `cpu->instcount
> I have not tried running with gem5.debug, so I will be doing that
> today. Maybe this is an assertion that is occurring due to an
> optimization. That would mean it wouldn't be triggered in gem5.debug since
> it runs without optimizations. Have you tested all debug, opt and fast
> with your tests?
> Thanks,
> Andrew
>
> On Tue, Mar 13, 2012 at 1:37 PM, Rio Xiangyu Dong <***@gmail.com>wrote:
>
>> Hi Andrew,
>>
>>
>>
>> I didn’t see this error in my simulations. May I ask which gem5 version
>> you are using? I find some of the latest code updates do not comply with my
>> changes. I am still using the DRAMsim2 patch on Gem5 repo8643, and have run
>> all the runnable benchmarks in SPEC2006, SPEC2000, EEMBC2, and PARSEC2 on
>> ARM_SE.
>>
>>
>>
>> Thank you!
>>
>>
>>
>> Best,
>>
>> Xiangyu
>>
>>
>>
>> *From:* Andrew Cebulski [mailto:***@drexel.edu]
>> *Sent:* Thursday, March 08, 2012 6:52 PM
>>
>> *To:* gem5 users mailing list
>> *Cc:* ***@gmail.com; ***@umich.edu
>>
>> *Subject:* Re: [gem5-users] A Patch for DRAMsim2 Integration
>>
>>
>>
>>
>>
>> Xiangyu,
>>
>>
>>
>> I've been having an issue recently with the number of instructions
>> I've been seeing committed to the CPU (I have a separate thread on this).
>> It turns out the issue seems to be coming from this patch you created to
>> integrate DramSim2 with Gem5. Unfortunately, I've been running with
>> gem5.fast, not gem5.opt. So up until now, I haven't been seeing
>> assertions. I thought I'd run it with gem5.opt or debug back in December,
>> but I must not have. My runs on the Arm O3 cpu fail with this assertion:
>>
>>
>>
>> build/ARM/cpu/base_dyn_inst_impl.hh:149: void BaseDynInst::initVars()
>> [with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>
>>
>>
>> -Andrew
>>
>>
>>
>>
>>
>> Date: Sun, 18 Dec 2011 01:48:58 -0800
>> From: "Dong, Xiangyu" <***@gmail.com>
>> To: "gem5 users mailing list" <gem5-***@gem5.org>
>> Subject: [gem5-users] A Patch for DRAMsim2 Integration
>> Message-ID: gmail.com>
>>
>> Content-Type: text/plain; charset="us-ascii"
>>
>> Hi all,
>>
>>
>>
>> I have a Gem5+DRAMsim2 patch. I've tested it under both SE and FS modes.
>> I'm willing to share it here.
>>
>>
>>
>> For those who have such needs, please go to my website
>> www.cse.psu.edu/~xydong to download the patch and test it. To enable
>> DRAMSim2, use se_dramsim2.py script instead of se.py (for FS, you can
>> create
>> by yourself). The basic idea to enable the DRAMsim2 module is to use the
>> derived DRAMMemory class instead of PhysicalMemory class.
>>
>>
>>
>> Please let me know if there are bugs.
>>
>>
>>
>> Thank you!
>>
>>
>>
>> Best,
>>
>> Xiangyu Dong
>>
Tao Zhang
2012-04-08 01:35:53 UTC
Permalink
Hi Andrew,

I just finished the integration of DRAMSim2 with Ruby. I think Xiangyu's patch only works with the classic memory system, so as a workaround I can share my code with you if you'd like to use Ruby.

Tao Zhang
Department of CSE
Penn State University
(from my iphone4)

On Apr 7, 2012, at 9:21 PM, Andrew Cebulski <***@drexel.edu> wrote:

> Hi all,
>
> I've looked into this problem some more, and have put together a couple traces. I've been becoming more familiar with how gem5 handles dynamic instructions, in particular how it destroys them. I have two traces to compare, one with the physical memory, and the other with the integrated dramsim2 dram memory. I also have two plots showing instruction counts over time (sim ticks). All of these are linked at the end of the email.
>
> First, I'm going to go into what I've been able to work out about how instructions are destroyed, in particular comparing when DynInsts and DynInstPtrs are destroyed/removed from the cpu. I separate these because I've seen a difference, as I discuss later. These explanations are largely missing from the wiki; there is a section header waiting to be filled...
>
> From what I have been able to gather from the code, there is a list of all the instructions in flight in cpu/o3/cpu.cc called instList, with the type DynInstPtr. There are three conditions to instructions being cleaned from this list:
>
> 1.) The ROB retires its head instruction
> 2.) Fetch receives a rob squashing signal from the commit, resulting in removing any instruction not in the ROB
> 3.) Decode detects an incorrect branch prediction, resulting in removal of all instructions back to the bad seq num.
>
> Once all five stages have completed, the CPU cleans up all the removed in-flight instructions. This line in particular, in cleanUpRemovedInsts() in cpu/o3/cpu.cc, destroys a DynInstPtr:
>
> instList.erase(removeList.front());
>
> When I turn on the debug flag O3CPU, I see the message "Removing instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum and pcState after all 5 cpu stages have completed, and one of the conditions above is met. I also see what tick it occurs on.
>
> When I turn on the DynInst debug flag, I see when instructions are created and destroyed (cpu/base_dyn_inst_impl.hh) and on what tick. From analyzing the trace files, I've gathered that this takes into account that instructions have different execution lengths. So if a memory instruction is removed from the instList (DynInstPtr) on one tick, the DynInst for that memory instruction will be destroyed much later (e.g. 1M ticks later). I have yet to determine how this is implemented.
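
One possible explanation for the delay described above, offered as a sketch and not a confirmed diagnosis: as far as I know, DynInstPtr is a reference-counting pointer to a DynInst, so erasing an entry from instList only drops one reference, and the DynInst itself is destroyed when the last holder lets go. The toy model below is plain Python, not gem5 code. The class, the `acquire`/`release` methods, and the `outstanding` list (standing in for whatever part of the memory path might hold a reference) are all hypothetical names chosen for illustration.

```python
class DynInst:
    """Toy stand-in for a dynamic instruction (not gem5 code)."""
    live = 0  # plays the role of cpu->instcount

    def __init__(self, seq_num):
        self.seq_num = seq_num
        self.refs = 0
        DynInst.live += 1  # "created"

    def acquire(self):
        """Take one more reference, as copying a counted pointer would."""
        self.refs += 1
        return self

    def release(self):
        """Drop one reference; the object is "destroyed" at zero."""
        self.refs -= 1
        if self.refs == 0:
            DynInst.live -= 1


def demo():
    DynInst.live = 0
    # The in-flight list holds one reference per instruction.
    inst_list = [DynInst(seq).acquire() for seq in range(4)]
    # Hypothetical: the memory side keeps its own reference to
    # instruction 2 until the response comes back.
    outstanding = [inst_list[2].acquire()]

    # The CPU retires everything and erases it from the in-flight list.
    while inst_list:
        inst_list.pop(0).release()
    after_retire = (len(inst_list), DynInst.live)

    # Much later, the memory response arrives and the last reference goes.
    outstanding.pop().release()
    return after_retire, DynInst.live


print(demo())  # ((0, 1), 0)
```

In this model the list is empty after retirement while one instruction is still alive. If some holder never released its reference, the live count would keep growing while the list size stayed bounded. Whether that is what happens with the dramsim2 patch is not established here.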
>
> Now for the problem.
>
> What I'm seeing when I run with the dramsim2 DRAM memory is a significant difference between the size of the instList (of DynInstPtr objects) and the dynamic instruction count (of DynInst objects). The benchmark I'm running is libquantum from SPEC 2006. For roughly the first 130B ticks, the dynamic instruction count kept in cpu/base_dyn_inst_impl.hh shadows the instList size in o3/cpu.cc very closely (figure linked below). Around tick 130B after libquantum started, it starts hitting what I'm assuming are loops (and therefore branch prediction), resulting in behavior that seems to imply improper instruction handling (i.e. more instructions in flight than the ROB allows).
>
> I wasn't able to line up the physical and dramsim2 traces exactly, but they should represent roughly the same region of execution. They don't execute identically because dramsim2 models the memory differently (e.g. latency and other delays).
>
> I've shared both traces on my public Dropbox here --
> http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
> http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
>
> Here are a couple of plots of tick versus instruction count, using cpu->instcount in cpu/base_dyn_inst_impl.hh and instList.size() in cpu/o3/cpu.cc --
> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>
> Note that I added the printout of the instList size to an existing O3CPU DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.
>
> Here are the commands I ran to parse the traces into data files to analyze in MATLAB and create the plots:
> zgrep DynInst dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | grep destroyed | awk '{print $1,$11}' > cpuinstcount.out
> zgrep instList dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | awk '{print $1,$11}' > instlistsize.out
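
For anyone who would rather do the same extraction without zgrep and awk, here is a rough Python equivalent of the two pipelines above. It is only a sketch: the function names are made up for illustration, and it assumes the trace lines are whitespace-separated with the wanted values in fields 1 and 11, exactly as the awk commands do.

```python
import gzip


def extract_fields(trace_path, keywords, fields=(1, 11)):
    """Yield the selected 1-based whitespace-separated fields from every
    line of a gzip-compressed trace that contains all of `keywords`."""
    with gzip.open(trace_path, "rt", errors="replace") as trace:
        for line in trace:
            if not all(word in line for word in keywords):
                continue
            parts = line.split()
            if len(parts) < max(fields):
                # awk would print an empty field here; this sketch
                # skips lines that are too short instead.
                continue
            yield tuple(parts[i - 1] for i in fields)


def write_pairs(pairs, out_path):
    """Write one space-separated record per line, like awk's print."""
    with open(out_path, "w") as out:
        for pair in pairs:
            out.write(" ".join(pair) + "\n")
```

With the file names from the commands above, the calls would look like `write_pairs(extract_fields(trace, ("DynInst", "destroyed")), "cpuinstcount.out")` and `write_pairs(extract_fields(trace, ("instList",)), "instlistsize.out")`, where `trace` is the path to the dramsim2 `.out.gz` file.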
>
> It seems to me like the problem might lie in gem5, but has just been exposed by integrating this more detailed memory model, dramsim2, into gem5. Either that, or there are some timing errors in how dramsim2 was integrated. I doubt this, however, since those first 190B ticks executed used the dramsim2 memory. I believe the problem is a combination of memory instructions + complex loops (branch prediction), resulting in improper destruction of instructions.
>
> I've included the ROB, Commit, Fetch, DynInst and O3CPU debug flags. There are 192 ROB entries, which is why the instList size generally has a max of about 192 instructions. The dynamic instruction counts (seen in the dramsim2 plot) also seem to imply that instructions are incorrectly being removed from the ROB, and then from the cpu's instruction list in cpu.cc, which allows more and more instructions to be added to the system (possibly from a bad branch).
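
To pin down where the two counts start to drift apart without eyeballing the plots, something like the following could be run over the two extracted data files. This is a sketch under assumptions: it expects two-column "tick value" lines as produced by the commands above, assumes the tick field may carry a trailing colon, and the function names and the `max_gap` threshold are hypothetical. Because the two series are sampled on different ticks, each instruction-count sample is compared against the most recent instList size at or before that tick.

```python
from bisect import bisect_right


def load_series(path):
    """Read 'tick value' lines into two parallel lists of ints.
    Lines that do not parse as two integers are skipped."""
    ticks, values = [], []
    with open(path) as data:
        for line in data:
            parts = line.split()
            if len(parts) != 2:
                continue
            try:
                tick = int(parts[0].rstrip(":"))
                value = int(parts[1])
            except ValueError:
                continue
            ticks.append(tick)
            values.append(value)
    return ticks, values


def first_divergence(inst_count, inst_list, max_gap):
    """Return (tick, gap) for the first instruction-count sample that
    exceeds the latest instList size by more than max_gap, else None.
    Both series are (ticks, values) pairs sorted by tick."""
    count_ticks, count_values = inst_count
    list_ticks, list_values = inst_list
    for tick, count in zip(count_ticks, count_values):
        i = bisect_right(list_ticks, tick) - 1
        if i < 0:
            continue  # no instList sample yet at this tick
        gap = count - list_values[i]
        if gap > max_gap:
            return tick, gap
    return None
```

A call such as `first_divergence(load_series("cpuinstcount.out"), load_series("instlistsize.out"), max_gap=192)` would report the first tick at which the live instruction count runs more than one ROB's worth ahead of the list size. The choice of 192 as the threshold is only a guess based on the ROB size mentioned above.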
>
> I appreciate any help in debugging this and further figuring out the root problem, just let me know if you need anything else from me. I don't have much more time at the moment to debug, but I can take any advice for quick changes and/or additional traces, then send the results back to the list for discussion.
>
> Thanks,
> Andrew
>
> P.S. Paul - I did try decreasing the size of the dramsim2 transaction (and even command) queue from 512 to 32. The same instructions problem occurred. It basically just decreased the execution time.
Andrew Cebulski
2012-04-08 01:48:33 UTC
Permalink
Hi Tao,

That would be great...thanks! Have you tested the integration with ARM?
I know that for a while the status matrix on the gem5 wiki has shown that Ruby
"might work" with ARM (last updated at the beginning of March).

-Andrew

Tao Zhang
2012-04-08 02:07:45 UTC
Permalink
I am using Alpha for now, and I have tested the wrapper code with SPEC2006. I am sure this patch can support any ISA as long as Ruby works with it. You can quickly check whether ARM with Ruby is feasible.

Tao Zhang
Department of CSE
Penn State University
(from my iphone4)

On Apr 7, 2012, at 9:48 PM, Andrew Cebulski <***@drexel.edu> wrote:

> Hi Tao,
>
> That would be great...thanks! Have you tested the integration with ARM? I know for awhile the status matrix on the gem5 wiki has shown that Ruby "might work" with ARM (last updated the beginning of March).
>
> -Andrew
>
> On Sat, Apr 7, 2012 at 9:35 PM, Tao Zhang <***@gmail.com> wrote:
> Hi Andrew,
>
> I just finished the integration of DRAMSim2 with Ruby. Since I think Xiangyu's patch only work with classic memory, as a workaround, I can share the code with you if you'd like to use Ruby.
>
> Tao Zhang
> Department of CSE
> Penn State University
> (from my iphone4)
>
> On Apr 7, 2012, at 9:21 PM, Andrew Cebulski <***@drexel.edu> wrote:
>
>> Hi all,
>>
>> I've looked into this problem some more, and have put together a couple traces. I've been becoming more familiar with how gem5 handles dynamic instructions, in particular how it destroys them. I have two traces to compare, one with the physical memory, and the other with the integrated dramsim2 dram memory. I also have two plots showing instruction counts over time (sim ticks). All of these are linked at the end of the email.
>>
>> First, I'm going to go into what I've been able to interpret regarding how instructions are destroyed. In particular, comparing when DynInst's vs. DynInstPtr's are deconstructed/removed from the cpu. I separate these because I've seen a difference, as I discuss later. These explanations are fairly non-existent on the wiki. There is a section header waiting to be filled...
>>
>> From what I have been able to gather from the code, there is a list of all the instructions in flight in cpu/o3/cpu.cc called instList, with the type DynInstPtr. There are three conditions to instructions being cleaned from this list:
>>
>> 1.) The ROB retires its head instruction
>> 2.) Fetch receives a rob squashing signal from the commit, resulting in removing any instruction not in the ROB
>> 3.) Decode detects an incorrect branch prediction, resulting in removal of all instructions back to the bad seq num.
>>
>> Once all five stages have completed, the CPU cleans up all the removed in-flight instructions. This line in particular in cleanUpRemovedInsts() in cpu/o3/cpu.cc deconstructs a DynInstPtr:
>>
>> instList.erase(removeList.front());
>>
>> When I turn on the debug flag O3CPU, I see the message "Removing instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum and pcState after all 5 cpu stages have completed, and one of the conditions above is met. I also see what tick it occurs on.
>>
>> When I turn on the DynInst debug flag, I see when instructions are created and destroyed (cpu/base_dyn_inst_impl.hh) and what tick. From analyzing the trace files, I've gathered that this takes into account that instructions have different execution lengths. So if one tick a memory instruction in the instList (DynInstPtr) is removed, the DynInst for that memory instruction will occur much later (i.e. 1M ticks later). I have yet to determine how this is implemented.
>>
>> Now for the problem.
>>
>> What I'm seeing when I run dramsim2 dram memory is a significant difference between the size of the instList vector (of DynInstPtr objects), and the size of dynamic instruction count (of DynInst objects). The benchmark I'm running is libquantum from SPEC 2006. For the first roughly 130B ticks, the dynamic instruction count kept in cpu/base_dyn_inst.impl.hh shadows the instList size in o3/cpu.cc (figure linked below) very closely. Around tick 130B after libquantum started, it starts hitting what I'm assuming are loops (therefore branch prediction), resulting in some behavior that seems to imply improper instruction handling (i.e. more instructions in flight than allowed by ROB).
>>
>> I wasn't able to sync-up the physical and dramsim2 traces exactly by trace, but they should represent roughly the same area of execution. They don't execute the same due to the dramsim2 modeling the memory differently (i.e. latency and other delays).
>>
>> I've shared both traces on my public Dropbox here --
>> http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
>> http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
>>
>> Here are a couple plots of tick versus instruction count, with respect to cpu->instcount in cpu/base_dyn_inst.impl.hh and instList.size() in cpu/o3/cpu.cc. --
>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>>
>> Note that I added the printout of the instList size to an existing O3CPU DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.
>>
>> Here are the commands I ran to parse the traces into data files to analyze in MATLAB and create the plots:
>> zgrep DynInst dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | grep destroyed | awk '{print $1,$11}' > cpuinstcount.out
>> zgrep instList dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | awk '{print $1,$11}' > instlistsize.out
>>
>> It seems to me like the problem might lie in gem5, but has just been exposed by integrating this more detailed memory model, dramsim2, into gem5. Either that, or their are some timing errors in how dramsim2 was integrated. I doubt this, however, since those first 190B ticks executed used the dramsim2 memory. I believe the problem is a combination of memory instructions + complex loops (branch prediction), resulting in improper destroying of instructions.
>>
>> I've included the ROB, Commit, Fetch, DynInst and O3CPU debug flags. Their are 192 ROB entries, which is why the instList size generally has a max of about 192 instructions. The dynamic instruction counts (seen in the dramsim2 plot) seem to also imply that instructions are incorrectly been removed from the ROB, and then from the cpu's instruction list in cpu.cc, which allows more and more instructions to be added to the system (possibly from a bad branch).
>>
>> I appreciate any help in debugging this and further figuring out the root problem, just let me know if you need anything else from me. I don't have much more time at the moment to debug, but I can take any advice for quick changes and/or additional traces, then send the results back to the list for discussion.
>>
>> Thanks,
>> Andrew
>>
>> P.S. Paul - I did try decreasing the size of the dramsim2 transaction (and even command) queue from 512 to 32. The same instructions problem occurred. It basically just decreased the execution time.
>>
Gabe Black
2012-04-08 01:48:42 UTC
Permalink
Without digging into things too deeply, it looks like you may be leaking
references to dynamic instructions. The CPU may think it's done with
one, but until that final reference is removed, the object will hang
around forever. I think I've had problems before where the reference
count ended up off by one somehow and instructions would start piling
up. It's also possible that a clog develops in O3's pipeline and some
internal structure stops letting instructions through and starts
accumulating them. Either of these problems will be annoying to track
down, but with enough digging I've been able to fix these sorts of things.

This may have more to do with O3 not handling the benchmark you're
running well rather than a problem with your new DRAM model. There may
be some interaction between the two, though, where the new memory makes
the timing line up to cause O3 to behave poorly. What you can do is
instrument dynamic instruction creation and destruction and reference
counting (try printing "this" for both the reference counting wrapper and
the dyn inst itself) and turn it on as close as you can to where things
go bad tick-wise. Then look for an instruction which gets lost, and look
for where its reference count is incremented and decremented. It should
be relatively easy to pair up where references are created and
destroyed, and you should be able to identify the reference which never
goes away. Then you need to figure out where that reference is being
created. After that, you should have enough information to identify why
the reference counting isn't being done correctly. It's arduous, but
that's the only way.
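
The pairing step described above can be sketched in Python. This is only an illustration: the trace line format (`<tick>: refcnt: incref|decref inst=<addr> holder=<addr>`) is hypothetical, standing in for whatever DPRINTFs one adds to the reference counting wrapper and the dyn inst. It is not existing gem5 output, and `find_unreleased` is a made-up helper name.

```python
import re
from collections import defaultdict

# Hypothetical trace format (an assumption, not real gem5 output):
#   <tick>: refcnt: incref inst=0x1f00 holder=0x7a10
#   <tick>: refcnt: decref inst=0x1f00 holder=0x7a10
# "inst" is the dyn inst's `this`; "holder" is the wrapper's `this`.
LINE_RE = re.compile(
    r"^(\d+): refcnt: (incref|decref) inst=(0x[0-9a-f]+) holder=(0x[0-9a-f]+)$"
)

def find_unreleased(lines):
    """Return {inst: [(holder, tick), ...]} for references never released."""
    live = defaultdict(dict)  # inst -> {holder: tick of the incref}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # ignore unrelated trace output
        tick, op, inst, holder = int(m.group(1)), m.group(2), m.group(3), m.group(4)
        if op == "incref":
            live[inst][holder] = tick
        else:
            # A decref cancels the matching incref from the same holder.
            live[inst].pop(holder, None)
    # Keep only instructions that still have at least one live holder.
    return {inst: sorted(h.items()) for inst, h in live.items() if h}
```

Any instruction left in the result still has a holder that never let go, and the recorded tick says where in the full trace to look for the code that created that reference.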

It's important to also make sure reference counts aren't decremented to
zero prematurely. I had a problem once where that happened and the
memory behind the object was updated by something that didn't know it
was dead. The memory had since been reallocated to another object of the
same type, so that other object reflected what happened to the phantom
one. If I remember correctly, that manifested as something weird like an
add causing a page fault or something.

Gabe

On 04/07/12 18:21, Andrew Cebulski wrote:
> Hi all,
>
> I've looked into this problem some more, and have put together a
> couple traces. I've been becoming more familiar with how gem5 handles
> dynamic instructions, in particular how it destroys them. I have two
> traces to compare, one with the physical memory, and the other with
> the integrated dramsim2 dram memory. I also have two plots showing
> instruction counts over time (sim ticks). All of these are linked at
> the end of the email.
>
> First, I'm going to go into what I've been able to interpret regarding
> how instructions are destroyed. In particular, comparing when
> DynInsts vs. DynInstPtrs are destroyed/removed from the cpu. I
> separate these because I've seen a difference, as I discuss later.
> These explanations are fairly non-existent on the wiki. There is a
> section header waiting to be filled...
>
> From what I have been able to gather from the code, there is a list of
> all the instructions in flight in cpu/o3/cpu.cc called instList, with
> the type DynInstPtr. There are three conditions to instructions being
> cleaned from this list:
>
> 1.) The ROB retires its head instruction
> 2.) Fetch receives a rob squashing signal from the commit, resulting
> in removing any instruction not in the ROB
> 3.) Decode detects an incorrect branch prediction, resulting in
> removal of all instructions back to the bad seq num.
>
> Once all five stages have completed, the CPU cleans up all the removed
> in-flight instructions. This line in particular
> in cleanUpRemovedInsts() in cpu/o3/cpu.cc destroys a DynInstPtr:
>
> instList.erase(removeList.front());
>
> When I turn on the debug flag O3CPU, I see the message "Removing
> instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum and
> pcState after all 5 cpu stages have completed, and one of the
> conditions above is met. I also see what tick it occurs on.
>
> When I turn on the DynInst debug flag, I see when instructions are
> created and destroyed (cpu/base_dyn_inst_impl.hh) and at what tick. From
> analyzing the trace files, I've gathered that this takes into account
> that instructions have different execution lengths. So if a memory
> instruction's entry in the instList (a DynInstPtr) is removed on one
> tick, the DynInst for that memory instruction may be destroyed much
> later (e.g. 1M ticks later). I have yet to determine how this is
> implemented.
>
> Now for the problem.
>
> What I'm seeing when I run dramsim2 dram memory is a significant
> difference between the size of the instList (of DynInstPtr
> objects) and the dynamic instruction count (of DynInst
> objects). The benchmark I'm running is libquantum from SPEC 2006.
> For the first roughly 130B ticks, the dynamic instruction count kept
> in cpu/base_dyn_inst_impl.hh shadows the instList size in o3/cpu.cc
> (figure linked below) very closely. Around tick 130B after libquantum
> started, it starts hitting what I'm assuming are loops (therefore
> branch prediction), resulting in some behavior that seems to imply
> improper instruction handling (i.e. more instructions in flight than
> allowed by the ROB).
>
> I wasn't able to sync-up the physical and dramsim2 traces exactly by
> trace, but they should represent roughly the same area of execution.
> They don't execute the same due to the dramsim2 modeling the memory
> differently (i.e. latency and other delays).
>
> I've shared both traces on my public Dropbox here --
> http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
> http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
>
> Here are a couple of plots of tick versus instruction count, with respect
> to cpu->instcount in cpu/base_dyn_inst_impl.hh and instList.size() in
> cpu/o3/cpu.cc. --
> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>
> Note that I added the printout of the instList size to an existing
> O3CPU DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.
>
> Here are the commands I ran to parse the traces into data files to
> analyze in MATLAB and create the plots:
> zgrep DynInst dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | grep destroyed | awk '{print $1,$11}' > cpuinstcount.out
> zgrep instList dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | awk '{print $1,$11}' > instlistsize.out
>
> It seems to me like the problem might lie in gem5, but has just been
> exposed by integrating this more detailed memory model, dramsim2, into
> gem5. Either that, or there are some timing errors in how dramsim2
> was integrated. I doubt this, however, since those first 190B ticks
> executed used the dramsim2 memory. I believe the problem is a
> combination of memory instructions + complex loops (branch
> prediction), resulting in improper destroying of instructions.
>
> I've included the ROB, Commit, Fetch, DynInst and O3CPU debug flags.
> There are 192 ROB entries, which is why the instList size generally
> has a max of about 192 instructions. The dynamic instruction counts
> (seen in the dramsim2 plot) seem to also imply that instructions are
> incorrectly being removed from the ROB, and then from the cpu's
> instruction list in cpu.cc, which allows more and more instructions to
> be added to the system (possibly from a bad branch).
>
> I appreciate any help in debugging this and further figuring out the
> root problem, just let me know if you need anything else from me. I
> don't have much more time at the moment to debug, but I can take any
> advice for quick changes and/or additional traces, then send the
> results back to the list for discussion.
>
> Thanks,
> Andrew
>
> P.S. Paul - I did try decreasing the size of the dramsim2 transaction
> (and even command) queue from 512 to 32. The same instructions
> problem occurred. It basically just decreased the execution time.
>
Andrew Cebulski
2012-05-01 08:27:51 UTC
Permalink
Hey Gabe,

Thanks for this...very helpful. I just recently got back into
debugging this problem. I made a small change in src/base/refcnt.hh to
allow me to return the current count of references to a DynInst object.

I then modified existing DPRINTFs to also print out reference counts,
then added some of my own when I needed extra visibility.

I've found one memory store instruction that seems to be getting lost.
What's happening is that it progresses as far as getting executed in the
IEW once, but a delayed translation occurs, deferring the store. By the
time it reenters the IEW, the IQ has marked the instruction as squashed.
Everything progresses as usual from here on out, with one exception. When
the instruction is removed from the CPU's instruction list, there is one
reference still hanging.

I've added in some additional debugging for my traces to help narrow
down where this reference is coming from. As far as I can tell, it's
because of a call to initiateAcc() within the executeStore function in the
lsq unit. Please see the following two traces. The first trace shows what
I just discussed. The second trace is another memory store instruction
that got squashed, however, it was squashed upon its first entry into the
IEW, therefore it never started execution.

http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out

Let me know if you have any ideas based on these two instruction
traces. I do not understand how the initiateAcc function results in
another reference, but maybe someone else does.... Since I don't see how
it makes a reference, it's hard to find out how to make sure it gets
dereferenced...

Unfortunately, I haven't been able to add a DPRINTF in
src/base/refcnt.hh ...this would make things clearer (i.e. exactly when
references/dereferences occur). Let me know if you have any advice on
this...if it's possible. I can't seem to get the right include files, and
likely right SConscript compile order...

Thanks,
Andrew

Andrew Cebulski
2012-05-02 03:34:25 UTC
Permalink
Okay, I'm positive now that the issue lies with delayed translations that
are squashed before finishing. It seems to me that speculative loads/stores
are being executed, rather than waiting for the instructions to commit.
Once an instruction begins (speculative) translation in the TLB, a
reference to it is left there, and that reference seems hard to track down
and release after the instruction ends up being squashed. At least, I have
not been able to find where that happens in the source code yet. Can anyone
clarify this?
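
To make the suspected leak concrete, here is a toy Python model of it. This
is only a sketch of the mechanism I think is happening, not gem5 code: the
names ToyInst, ToyTranslation and release_on_squash are made up for
illustration, and the real reference counting lives in src/base/refcnt.hh.
It assumes a delayed translation holds its own reference to the instruction
and never drops it when the instruction is squashed.

```python
# Toy model of the suspected leak; not gem5 code. All names are hypothetical.

class ToyInst:
    """Stands in for a DynInst whose lifetime is set by a reference count."""
    live = 0  # objects created but not yet destroyed (plays the role of cpu->instcount)

    def __init__(self, seq_num):
        self.seq_num = seq_num
        self.refs = 0
        self.squashed = False
        ToyInst.live += 1

    def incref(self):
        self.refs += 1

    def decref(self):
        self.refs -= 1
        if self.refs == 0:
            ToyInst.live -= 1  # destroyed only when the last reference goes away


class ToyTranslation:
    """A delayed translation that holds a reference to its instruction."""

    def __init__(self, inst):
        self.inst = inst
        inst.incref()

    def finish(self):
        self.inst.decref()


def run(release_on_squash):
    ToyInst.live = 0
    inst_list = []   # plays the role of the CPU's instruction list
    pending = []     # translations that were started but never completed
    for seq in range(100):
        inst = ToyInst(seq)
        inst.incref()                         # reference held by the instruction list
        inst_list.append(inst)
        pending.append(ToyTranslation(inst))  # TLB miss: translation is delayed
        inst.squashed = True                  # squashed before the translation finishes
        inst_list.remove(inst)
        inst.decref()                         # the instruction list drops its reference
        if release_on_squash:
            pending.pop().finish()            # the translation drops its reference too
    return len(inst_list), ToyInst.live


leaky = run(release_on_squash=False)
clean = run(release_on_squash=True)
print("leaky (list size, live objects):", leaky)
print("clean (list size, live objects):", clean)
```

In the leaky run the instruction list is empty while 100 objects are still
alive, which is the same shape as the gap between instList.size() and
cpu->instcount in the plot.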

Recall the following image that shows how the number of dynamic instruction
(DynInst) objects in-flight increases linearly for varying periods of time:

http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png

After enabling the TLB debug flag, I see that the linear increase in
instructions in flight is proportional to the number of TLB misses. These
TLB misses have a much larger delay (resulting in translation delays)
because DramSim2 models the memory system more accurately. It seems that
with the classic memory system, TLB misses often do not have translation
delays. For whatever reason, it would also seem that every instruction
that has a TLB miss is eventually squashed...

Here's a summary of outputs from my trace. These two DPRINTF messages
appear on the rising slopes (repeated up until the peak):

TLB Miss: Starting hardware table walker for 0(656)
TLB Miss: Starting hardware table walker for 0x4(656)

At the peak, the following message appears (from fetch) almost every tick
for (what I believe to be) every single one of the table walkers that were
squashed.

Fetch is waiting ITLB walk to finish!

The problem is that these ITLB table walks are for instructions that were
squashed as much as 0.3 billion cycles earlier, and have since been removed
from the CPU's instruction list.
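
For anyone who wants to check the same correlation in their own trace, here
is a rough Python sketch that counts the two messages above per window of
ticks. It assumes each trace line has the shape "<tick>: <object>: <message>";
the object names in the sample lines and the bucket size are made up, so
adjust them to whatever your trace actually contains.

```python
import re
from collections import Counter

# Assumed trace line shape: "<tick>: <object name>: <message>".
LINE_RE = re.compile(r"^\s*(\d+):\s*(\S+):\s*(.*)$")

# Message fragments quoted earlier in this email.
PATTERNS = {
    "walk_start": "TLB Miss: Starting hardware table walker",
    "fetch_wait": "Fetch is waiting ITLB walk to finish!",
}


def bucket_counts(lines, bucket_ticks):
    """Count each message of interest per window of bucket_ticks ticks."""
    counts = {name: Counter() for name in PATTERNS}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip anything that does not look like a trace line
        tick, msg = int(m.group(1)), m.group(3)
        for name, needle in PATTERNS.items():
            if needle in msg:
                counts[name][tick // bucket_ticks] += 1
    return counts


# Hypothetical sample lines, only to show the expected input shape.
sample = [
    "1000: system.cpu.itb: TLB Miss: Starting hardware table walker for 0(656)",
    "1500: system.cpu.itb: TLB Miss: Starting hardware table walker for 0x4(656)",
    "2500: system.cpu.fetch: Fetch is waiting ITLB walk to finish!",
    "not a trace line",
]
counts = bucket_counts(sample, bucket_ticks=1000)
for name, per_bucket in counts.items():
    print(name, sorted(per_bucket.items()))
```

For a compressed trace, the lines can be fed in with
gzip.open(path, "rt") instead of the sample list.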

Any help will be greatly appreciated in solving this problem. I've hit a
roadblock with getting Ruby working with ARM, most likely due to the fact
that ARM has disjoint memory (x86 and Alpha do not). There's the 256 MB
for physical memory, then the 64 MB for the boot loader. I brought this up
in my last email about trying to get Ruby working. Therefore, I'm trying
to get this DramSim2 integration fixed so I can start modeling FS with DRAM
memory.


Note that these problems also occur in Soplex from the SPEC CPU2006
benchmark suite (it also hits the 1500 in-flight instructions assertion).
Due to time constraints, I haven't tested other benchmarks.

Thanks,
Andrew

On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski <***@drexel.edu> wrote:

> Hey Gabe,
>
> Thanks for this...very helpful. I just recently got back into
> debugging this problem. I made a small change in src/base/refcnt.hh to
> allow me to return the current count of references to a DynInst object.
>
> I then modified existing DPRINTFs to also print out reference counts,
> then added some of my own when I needed extra visibility.
>
> I've found one memory store instruction that seems to be getting lost.
> What's happening is that it progresses as far as getting executed in the
> IEW once, but a delayed translation occurs, deferring the store. By the
> time it reenters the IEW, the IQ has marked the instruction as squashed.
> Everything progresses as usual from here on out, with one exception. When
> the instruction is removed from the CPU's instruction list, there is one
> reference count hanging.
>
> I've added in some additional debugging for my traces to help narrow
> down where this reference is coming from. As far as I can tell, it's
> because of a call to initiateAcc() within the executeStore function in the
> lsq unit. Please see the following two traces. The first trace shows what
> I just discussed. The second trace is another memory store instruction
> that got squashed, however, it was squashed upon its first entry into the
> IEW, therefore it never started execution.
>
> http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
> http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
>
> Let me know if you have any ideas based on these two instruction
> traces. I do not understand how the initiateAcc function results in
> another reference, but maybe someone else does.... Since I don't see how
> it makes a reference, it's hard to find out how to make sure it gets
> dereferenced...
>
> Unfortunately, I haven't been able to add a DPRINTF in
> src/base/refcnt.hh ...this would make things more clear (i.e. exactly when
> references/dereferences occur). Let me know if you have any advice on
> this...if it's possible. I can't seem to get the right include files, and
> likely the right SConscript compile order...
>
> Thanks,
> Andrew
>
>
> On Sat, Apr 7, 2012 at 9:48 PM, Gabe Black <***@eecs.umich.edu> wrote:
>
>> Without digging into things too deeply, it looks like you may be leaking
>> references to dynamic instructions. The CPU may think it's done with one,
>> but until that final reference is removed, the object will hang around
>> forever. I think I've had problems before where the reference count ended
>> up off by one somehow and instructions would start piling up. It's also
>> possible that a clog develops in O3's pipeline and some internal structure
>> stops letting instructions through and starts accumulating them. Either of
>> these problems will be annoying to track down, but with enough digging I've
>> been able to fix these sorts of things.
>>
>> This may have more to do with O3 not handling the benchmark you're
>> running well rather than a problem with your new DRAM model. There may be
>> some interaction between the two, though, where the new memory makes the
>> timing line up to cause O3 to behave poorly. What you can do is instrument
>> dynamic instruction creation and destruction and reference counting (try
>> printing "this" for both the reference counting wrapper and the dyn inst
>> itself) and turn it on as close as you can to where things go bad tick
>> wise. Then look for an instruction which gets lost, and look for where its
>> reference count is incremented and decremented. It should be relatively
>> easy to pair up where references are created and destroyed, and you should
>> be able to identify the reference which never goes away. Then you need to
>> figure out where that reference is being created. After that, you should
>> have enough information to identify why the reference counting isn't being
>> done correctly. It's arduous, but that's the only way.
>>
>> It's important to also make sure reference counts aren't decremented to
>> zero prematurely. I had a problem once where that happened and the memory
>> behind the object was updated by something that didn't know it was dead.
>> The memory had since been reallocated to another object of the same type,
>> so that other object reflected what happened to the phantom one. If I
>> remember that manifested as something weird like an add causing a page
>> fault or something.
>>
>> Gabe
>>
>>
>> On 04/07/12 18:21, Andrew Cebulski wrote:
>>
>> Hi all,
>>
>> I've looked into this problem some more, and have put together a couple
>> traces. I've been becoming more familiar with how gem5 handles dynamic
>> instructions, in particular how it destroys them. I have two traces to
>> compare, one with the physical memory, and the other with the integrated
>> dramsim2 dram memory. I also have two plots showing instruction counts
>> over time (sim ticks). All of these are linked at the end of the email.
>>
>> First, I'm going to go into what I've been able to interpret regarding
>> how instructions are destroyed. In particular, comparing when DynInst's
>> vs. DynInstPtr's are deconstructed/removed from the cpu. I separate these
>> because I've seen a difference, as I discuss later. These explanations are
>> fairly non-existent on the wiki. There is a section header waiting to be
>> filled...
>>
>> From what I have been able to gather from the code, there is a list of
>> all the instructions in flight in cpu/o3/cpu.cc called instList, with the
>> type DynInstPtr. There are three conditions to instructions being cleaned
>> from this list:
>>
>> 1.) The ROB retires its head instruction
>> 2.) Fetch receives a rob squashing signal from the commit, resulting in
>> removing any instruction not in the ROB
>> 3.) Decode detects an incorrect branch prediction, resulting in removal
>> of all instructions back to the bad seq num.
>>
>> Once all five stages have completed, the CPU cleans up all the removed
>> in-flight instructions. This line in particular
>> in cleanUpRemovedInsts() in cpu/o3/cpu.cc deconstructs a DynInstPtr:
>>
>> instList.erase(removeList.front());
>>
>> When I turn on the debug flag O3CPU, I see the message "Removing
>> instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum and pcState
>> after all 5 cpu stages have completed, and one of the conditions above is
>> met. I also see what tick it occurs on.
>>
>> When I turn on the DynInst debug flag, I see when instructions are
>> created and destroyed (cpu/base_dyn_inst_impl.hh) and on what tick. From
>> analyzing the trace files, I've gathered that this takes into account that
>> instructions have different execution lengths. So if on one tick a memory
>> instruction in the instList (DynInstPtr) is removed, the destruction of the
>> DynInst for that memory instruction will occur much later (e.g. 1M ticks
>> later). I have yet to determine how this is implemented.
>>
>> Now for the problem.
>>
>> What I'm seeing when I run dramsim2 dram memory is a significant
>> difference between the size of the instList vector (of DynInstPtr objects),
>> and the size of dynamic instruction count (of DynInst objects). The
>> benchmark I'm running is libquantum from SPEC 2006. For the first roughly
>> 130B ticks, the dynamic instruction count kept in cpu/base_dyn_inst_impl.hh
>> shadows the instList size in o3/cpu.cc (figure linked below) very closely.
>> Around tick 130B after libquantum started, it starts hitting what I'm
>> assuming are loops (therefore branch prediction), resulting in some
>> behavior that seems to imply improper instruction handling (i.e. more
>> instructions in flight than allowed by ROB).
>>
>> I wasn't able to sync-up the physical and dramsim2 traces exactly by
>> trace, but they should represent roughly the same area of execution. They
>> don't execute the same due to the dramsim2 modeling the memory differently
>> (i.e. latency and other delays).
>>
>> I've shared both traces on my public Dropbox here --
>>
>> http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
>>
>> http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
>>
>> Here are a couple plots of tick versus instruction count, with respect
>> to cpu->instcount in cpu/base_dyn_inst_impl.hh and instList.size() in
>> cpu/o3/cpu.cc. --
>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>>
>> Note that I added the printout of the instList size to an existing
>> O3CPU DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.
>>
>> Here are the commands I ran to parse the traces into data files to
>> analyze in MATLAB and create the plots:
>> zgrep DynInst dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
>> | grep destroyed | awk '{print $1,$11}' > cpuinstcount.out
>> zgrep instList dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
>> | awk '{print $1,$11}' > instlistsize.out
>>
>> It seems to me like the problem might lie in gem5, but has just been
>> exposed by integrating this more detailed memory model, dramsim2, into
>> gem5. Either that, or there are some timing errors in how dramsim2 was
>> integrated. I doubt this, however, since those first 190B ticks executed
>> used the dramsim2 memory. I believe the problem is a combination of memory
>> instructions + complex loops (branch prediction), resulting in improper
>> destroying of instructions.
>>
>> I've included the ROB, Commit, Fetch, DynInst and O3CPU debug flags.
>> There are 192 ROB entries, which is why the instList size generally has a
>> max of about 192 instructions. The dynamic instruction counts (seen in the
>> dramsim2 plot) seem to also imply that instructions are incorrectly being
>> removed from the ROB, and then from the cpu's instruction list in cpu.cc,
>> which allows more and more instructions to be added to the system (possibly
>> from a bad branch).
>>
>> I appreciate any help in debugging this and further figuring out the
>> root problem, just let me know if you need anything else from me. I don't
>> have much more time at the moment to debug, but I can take any advice for
>> quick changes and/or additional traces, then send the results back to the
>> list for discussion.
>>
>> Thanks,
>> Andrew
>>
>> P.S. Paul - I did try decreasing the size of the dramsim2 transaction
>> (and even command) queue from 512 to 32. The same instructions problem
>> occurred. It basically just decreased the execution time.
>>
>> On Wed, Mar 14, 2012 at 2:10 PM, Ali Saidi <***@umich.edu> wrote:
>>
>>> The error is that there are more than 1500 instructions currently in
>>> flight in the system. It could mean several things:
>>>
>>> 1. The value is somewhat arbitrarily defined and maybe there are more
>>> than 1500 in your system at one time?
>>>
>>> 2. Instructions aren't being destroyed correctly
>>>
>>>
>>>
>>> You could try to run a debug binary so you'll get a list of
>>> instructions when it happens or increase the number which may
>>> be appropriate for certain situations (but 1500 is quite a few inflight
>>> instructions).
>>>
>>>
>>>
>>> Ali
>>>
>>> On 13.03.2012 10:56, Andrew Cebulski wrote:
>>>
>>> Hi Xiangyu,
>>> I just started looking into this some more. So at first I thought
>>> it was due to updating to a more recent revision, but then I went back to
>>> revision 8643, added your patch, built and ran....and now get the error
>>> with it too (when running ARM_FS/gem5.opt). I'm testing now to see if an
>>> update to SWIG might have resulted in this error, maybe someone on the
>>> mailing list would know if that's possible. The difference is 1.3.40 vs.
>>> 2.0.3, both of which are supported according to the dependencies wiki page.
>>> Just for completeness, here's the error from revision 8643:
>>> build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>> I have not tried running with gem5.debug, so I will be doing that
>>> today. Maybe this is an assertion that is occurring due to an
>>> optimization. That would mean it wouldn't be triggered in gem5.debug since
>>> it runs without optimizations. Have you tested all debug, opt and fast
>>> with your tests?
>>> Thanks,
>>> Andrew
>>>
>>> On Tue, Mar 13, 2012 at 1:37 PM, Rio Xiangyu Dong <***@gmail.com
>>> > wrote:
>>>
>>>> Hi Andrew,
>>>>
>>>>
>>>>
>>>> I didn’t see this error in my simulations. May I ask which gem5 version
>>>> you are using? I find some of the latest code updates do not comply with my
>>>> changes. I am still using the DRAMsim2 patch on Gem5 repo8643, and have run
>>>> all the runnable benchmarks in SPEC2006, SPEC2000, EEMBC2, and PARSEC2 on
>>>> ARM_SE.
>>>>
>>>>
>>>>
>>>> Thank you!
>>>>
>>>>
>>>>
>>>> Best,
>>>>
>>>> Xiangyu
>>>>
>>>>
>>>>
>>>> *From:* Andrew Cebulski [mailto:***@drexel.edu]
>>>> *Sent:* Thursday, March 08, 2012 6:52 PM
>>>>
>>>> *To:* gem5 users mailing list
>>>> *Cc:* ***@gmail.com; ***@umich.edu
>>>>
>>>> *Subject:* Re: [gem5-users] A Patch for DRAMsim2 Integration
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Xiangyu,
>>>>
>>>>
>>>>
>>>> I've been having an issue recently with the number of instructions
>>>> I've been seeing committed to the CPU (I have a separate thread on this).
>>>> It turns out the issue seems to be coming from this patch you created to
>>>> integrate DramSim2 with Gem5. Unfortunately, I've been running with
>>>> gem5.fast, not gem5.opt. So up until now, I haven't been seeing
>>>> assertions. I thought I'd run it with gem5.opt or debug back in December,
>>>> but I must not have. My runs on the Arm O3 cpu fails with this assertion:
>>>>
>>>>
>>>>
>>>> build/ARM/cpu/base_dyn_inst_impl.hh:149: void BaseDynInst::initVars()
>>>> [with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>>
>>>>
>>>>
>>>> -Andrew
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Date: Sun, 18 Dec 2011 01:48:58 -0800
>>>> From: "Dong, Xiangyu" <***@gmail.com>
>>>> To: "gem5 users mailing list" <gem5-***@gem5.org>
>>>> Subject: [gem5-users] A Patch for DRAMsim2 Integration
>>>> Message-ID: <001201ccbd6a$45102210$cf306630$@gmail.com>
>>>>
>>>> Content-Type: text/plain; charset="us-ascii"
>>>>
>>>> Hi all,
>>>>
>>>>
>>>>
>>>> I have a Gem5+DRAMsim2 patch. I've tested it under both SE and FS
>>>> modes.
>>>> I'm willing to share it here.
>>>>
>>>>
>>>>
>>>> For those who have such needs, please go to my website
>>>> www.cse.psu.edu/~xydong <http://www.cse.psu.edu/%7Exydong> to download
>>>> the patch and test it. To enable
>>>> DRAMSim2, use se_dramsim2.py script instead of se.py (for FS, you can
>>>> create
>>>> by yourself). The basic idea to enable the DRAMsim2 module is to use
>>>> the
>>>> derived DRAMMemory class instead of PhysicalMemory class.
>>>>
>>>>
>>>>
>>>> Please let me know if there are bugs.
>>>>
>>>>
>>>>
>>>> Thank you!
>>>>
>>>>
>>>>
>>>> Best,
>>>>
>>>> Xiangyu Dong
>>>>
>>>> -------------- next part --------------
>>>> An HTML attachment was scrubbed...
>>>> URL: <
>>>> http://m5sim.org/cgi-bin/mailman/private/gem5-users/attachments/20111218/f3fdf5da/attachment.html
>>>> >
>>>>
>>>>
>>> _______________________________________________
>>> gem5-users mailing list
>>> gem5-***@gem5.org
>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
Ali Saidi
2012-05-02 15:10:07 UTC
Permalink
Hi Andrew,

Thanks for digging into this. I think there is an issue
somewhere, but I'm still not sure where.

Ali

On 01.05.2012 23:34,
Andrew Cebulski wrote:

> Okay, I'm positive now that the issue lies
with delayed translations that are squashed before finishing.

On the data or instruction side? You seem to allude to data in the
paragraph below, but then to instructions in the later text.

> It seems to me like
speculative load/stores are being executed, rather than waiting for the
instructions to commit. Once the instructions begin getting
(speculatively) executed in the TLB, a reference is left there, which
seems hard to root out and dereference after the instruction ends up
being squashed. At least, I have not been able to find that out in the
source code as of yet. Can anyone clarify on this?

There should only be
one translation outstanding from each instruction and data side walker.
Any nested transactions should be queued in the walker. Until one
finishes, I'm not sure how multiple would ever be outstanding.
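
As a rough illustration of that expectation, here is a toy Python model of
a walker that keeps one walk in flight and queues the rest. It is only a
sketch, not the actual gem5 table walker: the class name, the method names
and the addresses are made up for illustration.

```python
from collections import deque


class ToyWalker:
    """Hypothetical table walker: one walk in flight, the rest queued in order."""

    def __init__(self):
        self.current = None       # address of the walk in flight, if any
        self.queue = deque()      # walks waiting for the current one to finish
        self.max_outstanding = 0  # highest number of walks seen in flight at once

    def outstanding(self):
        return 0 if self.current is None else 1

    def start(self, vaddr):
        if self.current is None:
            self.current = vaddr
        else:
            self.queue.append(vaddr)  # nested request waits inside the walker
        self.max_outstanding = max(self.max_outstanding, self.outstanding())

    def finish(self):
        done = self.current
        self.current = self.queue.popleft() if self.queue else None
        return done


walker = ToyWalker()
for addr in (0x0, 0x4, 0x8):
    walker.start(addr)
order = [walker.finish() for _ in range(3)]
print("max walks in flight:", walker.max_outstanding)
print("completion order:", [hex(a) for a in order])
```

With this structure a second walk can only start once the first one
finishes, so two misses showing up as outstanding at the same time would
point at something outside this simple picture.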
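
> Recall the following image that shows how the number of dynamic
instruction (DynInst) objects in-flight increases
linearly for varying periods of time: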
>
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[1]
> After enabling the TLB debug flag, I see that the linear increase
in instructions in flight is proportional to the number of TLB misses.
These TLB misses have a much larger delay (resulting in translation
delays) due to the fact the DramSim2 models the memory system more
accurately. It seems that with the classic memory system, TLB misses
often do not have translation delays. For whatever reason, it would also
seem that every instruction that has a TLB miss also is eventually
squashed...
>
> From a data side perspective this is reasonable. While
a miss is outstanding at
structions will stop committing and thus the
instructions in flight will begin to rise until the miss is satisfied.


Here's a summary of outputs from my trace. These two DPRINTF messages
appear on the rising slopes (repeated up until the peak):
TLB Miss: Starting hardware table walker for 0(656)
TLB Miss: Starting hardware table walker for 0x4(656)

>
This is interesting/odd. I don't know a good reason why (1) a miss would
be outstanding to both address 0 and address 4 at the same time. In
almost all cases these pages are marked as no-access to detect
segfaults. Perhaps there is an issue where the
g into a loop faulting on
a bad access and then faulting again on the fault handler. I could
imagine this would happen if there was some corruption in the memory
system (for example the timings in dramsim exposing a bug in the cache
models or something).

At the peak, the following message appears
(from fetch) almost every tick for (what I believe to be) every single
one of the table walkers that were squashed.
Fetch is waiting ITLB walk
to finish!

There must be another walk in flight? The instruction side
will only have one fault outstanding at once. Successive branch
mispredicts will re-direct

> ht thing."
>
> The problem is that
these ITLB table walks are for instructions that were squashed as
much as 0.3 billion cycles earlier, and have since been removed from the CPU's
instruction list.

I'm not following here.

Any help will be greatly
appreciated in solving this problem. I've hit a roadblock with getting
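Ruby working with ARM, most likely due to the fact that ARM has disjoint
memory (x86 and Alpha do not). There's the 256 MB for physical memory,
then the 64 MB for the boot loader. I brought this up in my last email
about trying to get Ruby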
working. Therefore, I'm trying to get this DramSim2 integration fixed so
I can start modeling FS with DRAM memory.

Brad/Steve/Nilay anyone have
a suggestion on how to make this work?

Note that these problems also
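occur in Soplex from the SPEC CPU2006 benchmark suite (it also hits the
1500 in-flight instructions assertion). Due to time constraints, I
haven't tested other benchmarks.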
>
Thanks,
> Andrew
>
> On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski
<***@drexel.edu [2]> wrote:
> Hey Gabe,
> Thanks for this...very
helpful. I just recently got back into debugging this problem. I made a
small change in src/base/refcnt.hh to allow me to return the current count of
references to a DynInst object.
> I then modified existing DPRINTFs to
also print out reference counts, then added some of my own when I needed
extra visibility.
> I've found one memory store instruction that seems to be getting lost.
What's happening is that it progresses as far as getting
executed in the IEW once, but a delayed translation occurs, deferring
the store. By the time it reenters the IEW, the IQ has marked the
instruction as squashed. Everything progresses as usual from here on
out, with one exception. When the instruction is removed from the CPU's
instruction list, there is one reference count hanging.
> I've added in
some additional debugging for my traces to help narrow down where this
reference is coming from. As far as I can tell, it's because of a call
to initiateAcc() within the executeStore function in the lsq unit.
Please see the following two traces. The first trace shows what I just
discussed. The second trace is another memory store instruction that got
squashed, however, it was squashed upon its first entry into the IEW,
therefore it never started execution.
>
http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out [21]
>
http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out [22]
> Let
me know if you have any ideas based on these two instruction traces. I
do not understand how the initiateAcc function results in another
reference, but maybe someone else does.... Since I don't see how it
makes a reference, it's hard to find out how to make sure it gets
dereferenced...
> Unfortunately, I haven't been able to add a DPRINTF
in src/base/refcnt.hh ...this would make things more clear (i.e. exactly
when references/dereferences occur). Let me know if you have any advice on
this...if it's possible. I can't seem to get the right include files,
and likely right SConscript compile order...
> Thanks,
> Andrew
>
>
On Sat, Apr 7, 2012 at 9:48 PM, Gabe Black <***@eecs.umich.edu [23]>
wrote:
>
>> Without digging into things too deeply, it looks like you
may be leaking references to dynamic instructions. The CPU may think
it's done with one, but until that final reference is removed, the
object will hang around forever. I think I've had problems before where
the reference count ended up off by one somehow and instructions would
start piling up. It's also possible that a clog develops in O3's
pipeline and some internal structure stops letting instructions through
and starts accumulating them. Either of these problems will be annoying
to track down, but with enough digging I've been able to fix these sorts
of things.
>>
>> This may have more to do with O3 not handling the
benchmark you're running well rather than a problem with your new DRAM
model. There may be some interaction between the two, though, where the
new memory makes the timing line up to cause O3 to behave poorly. What
you can do is instrument dynamic instruction creation and destruction
and reference counting (try printing "this" for both the reference counting
wrapper and the dyn inst itself) and turn it on as close as you can to
where things go bad tick wise. Then look for an instruction which gets
lost, and look for where its reference count is incremented and
decremented. It should be relatively easy to pair up where references
are created and destroyed, and you should be able to identify the
reference which never goes away. Then you need to figure out where that
reference is being created. After that, you should have enough
information to identify why the reference counting isn't being done
correctly. It's arduous, but that's the only way.
>>
>> It's important
to also make sure reference counts aren't decremented to zero
prematurely. I had a problem once where that happened and the memory
behind the object was updated by something that didn't know it was dead.
The memory had since been reallocated to another object of the same
type, so that other object reflected what happened to the phantom one.
If I remember that manifested as something weird like an add causing a
page fault or something.
>>
>> Gabe
>>
>> On 04/07/12 18:21, Andrew
Cebulski wrote:
>>
>>> Hi all,
>>> I've looked into this problem some
more, and have put together a couple traces. I've been becoming more
familiar with how gem5 handles dynamic instructions, in particular how
it destroys them. I have two traces to compare, one with the physical
memory, and the other with the integrated dramsim2 dram memory. I also
have two plots showing instruction counts over time (sim ticks). All of
these are linked at the end of the email.
>>> First, I'm going to go
into what I've been able to interpret regarding how instructions are
destroyed. In particular, comparing when DynInst's vs. DynInstPtr's are
deconstructed/removed from the cpu. I separate these because I've seen a
difference, as I discuss later. These explanations are fairly
non-existent on the wiki. There is a section header waiting to be
filled...
>>> From what I have been able to gather from the code, there
is a list of all the instructions in flight in cpu/o3/cpu.cc called
instList, with the type DynInstPtr. There are three conditions to
instructions being cleaned from this list:
>>> 1.) The ROB retires its
head instruction
>>> 2.) Fetch receives a rob squashing signal from the
commit, resulting in removing any instruction not in the ROB
>>> 3.)
Decode detects an incorrect branch prediction, resulting in removal of
all instructions back to the bad seq num.
>>> Once all five stages have
completed, the CPU cleans up all the removed in-flight instructions.
This line in particular in cleanUpRemovedInsts() in cpu/o3/cpu.cc
deconstructs a DynInstPtr:
>>> instList.erase(removeList.front());
>>>
When I turn on the debug flag O3CPU, I see the message "Removing
instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum and
pcState after all 5 cpu stages have completed, and one of the conditions
above is met. I also see what tick it occurs on.
>>> When I turn on the
DynInst debug flag, I see when instructions are created and destroyed
(cpu/base_dyn_inst_impl.hh) and what tick. From analyzing the trace
files, I've gathered that this takes into account that instructions have
different execution lengths. So if one tick a memory instruction in the
instList (DynInstPtr) is removed, the DynInst for that memory
instruction will occur much later (i.e. 1M ticks later). I have yet to
determine how this is implemented.
>>> Now for the problem.
>>> What
I'm seeing when I run dramsim2 dram memory is a significant difference
between the size of the instList vector (of DynInstPtr objects), and the
size of dynamic instruction count (of DynInst objects). The benchmark
I'm running is libquantum from SPEC 2006. For the first roughly 130B
ticks, the dynamic instruction count kept in cpu/base_dyn_inst_impl.hh
shadows the instList size in o3/cpu.cc (figure linked below) very
closely. Around tick 130B after libquantum started, it starts hitting
what I'm assuming are loops (therefore branch prediction), resulting in
some behavior that seems to imply improper instruction handling (i.e.
more instructions in flight than allowed by ROB).
>>> I wasn't able to
sync-up the physical and dramsim2 traces exactly by trace, but they
should represent roughly the same area of execution. They don't execute
the same due to the dramsim2 modeling the memory differently (i.e.
latency and other delays).
>>> I've shared both traces on my public
Dropbox here --
>>>
http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
[14]
>>>
http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
[15]
>>> Here are a couple plots of tick versus instruction count,
with respect to cpu->instcount in cpu/base_dyn_inst_impl.hh and
instList.size() in cpu/o3/cpu.cc. --
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
[16]
>>>
>>>
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[17]
>>> Note that I added the printout of the instList size to an
existing O3CPU DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.
>>>
Here are the commands I ran to parse the traces into data files to
analyze in MATLAB and create the plots:
>>> zgrep DynInst
dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | grep
destroyed | awk '{print $1,$11}' > cpuinstcount.out
>>> zgrep instList
dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | awk '{print
$1,$11}' > instlistsize.out
>>> It seems to me like the problem might
lie in gem5, but has just been exposed by integrating this more detailed
memory model, dramsim2, into gem5. Either that, or there are some timing
errors in how dramsim2 was integrated. I doubt this, however, since
those first 190B ticks executed used the dramsim2 memory. I believe the
problem is a combination of memory instructions + complex loops (branch
prediction), resulting in improper destroying of instructions.
>>> I've
included the ROB, Commit, Fetch, DynInst and O3CPU debug flags. There
are 192 ROB entries, which is why the instList size generally has a max
of about 192 instructions. The dynamic instruction counts (seen in the
dramsim2 plot) seem to also imply that instructions are incorrectly being
removed from the ROB, and then from the cpu's instruction list in
cpu.cc, which allows more and more instructions to be added to the
system (possibly from a bad branch).
>>> I appreciate any help in
debugging this and further figuring out the root problem, just let me
know if you need anything else from me. I don't have much more time at
the moment to debug, but I can take any advice for quick changes and/or
additional traces, then send the results back to the list for
discussion.
>>> Thanks,
>>> Andrew
>>> P.S. Paul - I did try
decreasing the size of the dramsim2 transaction (and even command) queue
from 512 to 32. The same instructions problem occurred. It basically
just decreased the execution time.
>>>
>>> On Wed, Mar 14, 2012 at
2:10 PM, Ali Saidi <***@umich.edu [18]> wrote:
>>>
>>>> The error is
that there are more than 1500 instructions currently in flight in the
system. It could mean several things:
>>>>
>>>> 1. The value is
somewhat arbitrarily defined and maybe there are more than 1500 in your
system at one time?
>>>>
>>>> 2. Instructions aren't being destroyed
correctly
>>>>
>>>> You could try to run a debug binary so you'll
get a list of instructions when it happens or increase the number which
may be appropriate for certain situations (but 1500 is quite a few
inflight instructions).
>>>>
>>>> Ali
>>>>
>>>> On 13.03.2012 10:56,
Andrew Cebulski wrote:
>>>>
>>>>> Hi Xiangyu,
>>>>> I just started
looking into this some more. So at first I thought it was due to
updating to a more recent revision, but then I went back to revision
8643, added your patch, built and ran....and now get the error with it
too (when running ARM_FS/gem5.opt). I'm testing now to see if an update
to SWIG might have resulted in this error, maybe someone on the mailing
list would know if that's possible. The difference is 1.3.40 vs. 2.0.3,
both of which are supported according to the dependencies wiki page.

>>>>> Just for completeness, here's the error from revision 8643:

>>>>> build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion
`cpu->instcount
>>>>>
>>>>> I have not tried running with gem5.debug,
so I will be doing that today. Maybe this is an assertion that is
occurring due to an optimization. That would mean it wouldn't be
triggered in gem5.debug since it runs without optimizations. Have you
tested all debug, opt and fast with your tests?
>>>>> Thanks,
>>>>>
Andrew
>>>>>
>>>>> On Tue, Mar 13, 2012 at 1:37 PM, Rio Xiangyu Dong
<***@gmail.com [11]> wrote:
>>>>>
>>>>>> Hi Andrew,
>>>>>>

>>>>>> I didn't see this error in my simulations. May I ask which gem5
version you are using? I find some of the latest code updates do not
comply with my changes. I am still using the DRAMsim2 patch on Gem5
repo8643, and have run all the runnable benchmarks in SPEC2006,
SPEC2000, EEMBC2, and PARSEC2 on ARM_SE.
>>>>>>
>>>>>> Thank you!

>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Xiangyu
>>>>>>
>>>>>> FROM:
Andrew Cebulski [mailto:***@drexel.edu [8]]
>>>>>> SENT: Thursday,
March 08, 2012 6:52 PM
>>>>>>
>>>>>> TO: gem5 users mailing list
CC:***@gmail.com [9]; ***@umich.edu [10]
>>>>>>
>>>>>>
SUBJECT: Re: [gem5-users] A Patch for DRAMsim2 Integration
>>>>>>

>>>>>> Xiangyu,
>>>>>>
>>>>>> I've been having an issue recently with
the number of instructions I've been seeing committed to the CPU (I have
a separate thread on this). It turns out the issue seems to be coming
from this patch you created to integrate DramSim2 with Gem5.
Unfortunately, I've been running with gem5.fast, not gem5.opt. So up
until now, I haven't been seeing assertions. I thought I'd run it with
gem5.opt or debug back in December, but I must not have. My runs on the
Arm O3 cpu fail with this assertion:
>>>>>>
>>>>>>
build/ARM/cpu/base_dyn_inst_impl.hh:149: void BaseDynInst::initVars()
[with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>>>>
>>>>>>
-Andrew
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>>
gem5-users mailing list
>>>>> gem5-***@gem5.org [12]
>>>>>
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users [13]
>>>
>>>


Links:
------
[1] http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[2] mailto:***@drexel.edu
[3] mailto:***@gmail.com
[4] mailto:gem5-***@gem5.org
[5] http://gmail.com
[6] http://www.cse.psu.edu/%7Exydong
[7] http://m5sim.org/cgi-bin/mailman/private/gem5-users/attachments/20111218/f3fdf5da/attachment.html
[8] mailto:***@drexel.edu
[9] mailto:***@gmail.com
[10] mailto:***@umich.edu
[11] mailto:***@gmail.com
[12] mailto:gem5-***@gem5.org
[13] http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[14] http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
[15] http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
[16] http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
[17] http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[18] mailto:***@umich.edu
[19] mailto:gem5-***@gem5.org
[20] http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[21] http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
[22] http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
[23] mailto:***@eecs.umich.edu
Andrew Cebulski
2012-05-02 21:22:28 UTC
Permalink
They are data TLB misses (to 0x0 and 0x4) that occur as the in-flight
instruction count rises. The last TLB miss before the in-flight instruction
count finally starts its linear decrease is to 0x200. Also, at the start of
the rising slope, I see misses to 0x8 and 0x2508c.

Here's a trace file:

http://dl.dropbox.com/u/2953302/gem5/tlb.out

To reduce size, I just have lines that have either TLB or walker in them.
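
For anyone who wants to reproduce that filtering on their own trace, here is
a minimal sketch. It assumes the trace is plain text and that miss messages
have the shape quoted in this thread ("TLB Miss: Starting hardware table
walker for 0x4(656)"). The function names are hypothetical helpers, not part
of gem5.

```python
import re
from collections import Counter

# Assumed message shape, taken from the DPRINTF lines quoted in this thread:
#   "TLB Miss: Starting hardware table walker for 0x4(656)"
# The address is printed in hex, with address zero appearing as a bare "0".
MISS_RE = re.compile(r"table walker for (0x[0-9a-fA-F]+|[0-9a-fA-F]+)\(")


def keep_tlb_lines(lines):
    """Keep only the trace lines that mention the TLB or the table walker."""
    return [ln for ln in lines if "TLB" in ln or "walker" in ln]


def count_miss_addresses(lines):
    """Tally how often each address shows up in a table-walker miss message."""
    counts = Counter()
    for ln in lines:
        m = MISS_RE.search(ln)
        if m:
            counts[int(m.group(1), 16)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines; a real run would read the trace file instead.
    sample = [
        "100: system.cpu.dtb: TLB Miss: Starting hardware table walker for 0(656)",
        "101: system.cpu.fetch: unrelated message",
        "102: system.cpu.dtb: TLB Miss: Starting hardware table walker for 0x4(656)",
    ]
    for addr, n in sorted(count_miss_addresses(keep_tlb_lines(sample)).items()):
        print(hex(addr), n)
```

Tallying by address makes it easy to see whether the misses cluster on a few
low addresses such as 0x0 and 0x4.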

I see only a handful of instruction TLB misses.

-Andrew

On Wed, May 2, 2012 at 11:10 AM, Ali Saidi <***@umich.edu> wrote:

>
> Hi Andrew,
>
>
>
> Thanks for digging into this. I think there is an issue somewhere, but I'm
> still not sure where.
>
> Ali
>
> On 01.05.2012 23:34, Andrew Cebulski wrote:
>
> Okay, I'm positive now that the issue lies with delayed translations that
> are squashed before finishing.
>
> On the data or instruction side? You seem to allude to data in the
> paragraph below, but then instructions in the latter text.
>
> It seems to me like speculative load/stores are being executed, rather
> than waiting for the instructions to commit. Once the instructions begin
> getting (speculatively) executed in the TLB, a reference is left there,
> which seems hard to root out and dereference after the instruction ends up
> being squashed. At least, I have not been able to find that out in the
> source code as of yet. Can anyone clarify on this?
>
>
>
> There should only be one translation outstanding from each instruction and
> data side walker. Any nested transactions should be queued in the walker.
> Until one finishes, I'm not sure how multiple would ever be outstanding.
>
> Recall the following image that shows how the number of dynamic
> instruction (DynInst) objects in-flight increases linearly for varying
> periods of time:
> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
> After enabling the TLB debug flag, I see that the linear increase in
> instructions in flight is proportional to the number of TLB misses. These
> TLB misses have a much larger delay (resulting in translation delays) due
> to the fact that DramSim2 models the memory system more accurately. It
> seems that with the classic memory system, TLB misses often do not have
> translation delays. For whatever reason, it would also seem that every
> instruction that has a TLB miss also is eventually squashed...
>
> From a data side perspective this is reasonable. While a miss is
> outstanding at some point instructions will stop committing and thus the
> instructions in flight will begin to rise until the miss is satisfied.
>
> Here's a summary of outputs from my trace. These two DPRINTF messages
> appear on the rising slopes (repeated up until the peak):
> TLB Miss: Starting hardware table walker for 0(656)
> TLB Miss: Starting hardware table walker for 0x4(656)
>
> This is interesting/odd. I don't know a good reason why (1) a miss would
> be outstanding to both address 0 and address 4 at the same time. In almost
> all cases these pages are marked as no-access to detect segfaults. Perhaps
> there is an issue where the cpu is getting into a loop faulting on a bad
> access and then faulting again on the fault handler. I could imagine this
> would happen if there was some corruption in the memory system (for example
> the timings in dramsim exposing a bug in the cache models or something).
>
>
> At the peak, the following message appears (from fetch) almost every tick
> for (what I believe to be) every single one of the table walkers that were
> squashed.
> Fetch is waiting ITLB walk to finish!
>
> There must be another walk in flight? The instruction side will only
> have one fault outstanding at once. Successive branch mispredicts will
> re-direct fetch but there is code that catches the fact that a different
> walk completed than expected and "does the right thing."
>
> The problem is that these ITLB table walks are for instructions that
> were squashed as much as 0.3 billion cycles earlier, and have since been
> removed from the CPU's instruction list.
>
> I'm not following here.
>
> Any help will be greatly appreciated in solving this problem. I've hit
> a roadblock with getting Ruby working with ARM, most likely due to the fact
> that ARM has disjoint memory (x86 and Alpha do not). There's the 256 MB
> for physical memory, then the 64 MB for the boot loader. I brought this up
> in my last email about trying to get Ruby working. Therefore, I'm trying
> to get this DramSim2 integration fixed so I can start modeling FS with DRAM
> memory.
>
> Brad/Steve/Nilay anyone have a suggestion on how to make this work?
>
>
> Note that these problems also occur in Soplex from the Spec CPU2006
> benchmark suite (also hits 1500 in-flight instructions assertion). Due to
> time constraints, I haven't tested on other benchmarks.
> Thanks,
> Andrew
> On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski <***@drexel.edu> wrote:
>
>> Hey Gabe,
>> Thanks for this...very helpful. I just recently got back into
>> debugging this problem. I made a small change in src/base/refcnt.hh to
>> allow me to return the current count of references to a DynInst object.
>> I then modified existing DPRINTFs to also print out reference counts,
>> then added some of my own when I needed extra visibility.
>> I've found one memory store instruction that seems to be getting
>> lost. What's happening is that it progresses as far as getting executed in
>> the IEW once, but a delayed translation occurs, deferring the store. By
>> the time it reenters the IEW, the IQ has marked the instruction as
>> squashed. Everything progresses as usual from here on out, with one
>> exception. When the instruction is removed from the CPU's instruction list,
>> there is one reference count hanging.
>> I've added in some additional debugging for my traces to help narrow
>> down where this reference is coming from. As far as I can tell, it's
>> because of a call to initiateAcc() within the executeStore function in the
>> lsq unit. Please see the following two traces. The first trace shows what
>> I just discussed. The second trace is another memory store instruction
>> that got squashed, however, it was squashed upon its first entry into the
>> IEW, therefore it never started execution.
>> http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
>> http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
>> Let me know if you have any ideas based on these two instruction
>> traces. I do not understand how the initiateAcc function results in
>> another reference, but maybe someone else does.... Since I don't see how
>> it makes a reference, it's hard to find out how to make sure it gets
>> dereferenced...
>> Unfortunately, I haven't been able to add a DPRINTF in
>> src/base/refcnt.hh ...this would make things more clear (i.e. exactly when
>> references/dereferences occur). Let me know if you have any advice on
>> this...if it's possible. I can't seem to get the right include files, and
>> likely right SConscript compile order...
>> Thanks,
>> Andrew
>>
>>
>> On Sat, Apr 7, 2012 at 9:48 PM, Gabe Black <***@eecs.umich.edu> wrote:
>>
>>> Without digging into things too deeply, it looks like you may be leaking
>>> references to dynamic instructions. The CPU may think it's done with one,
>>> but until that final reference is removed, the object will hang around
>>> forever. I think I've had problems before where the reference count ended
>>> up off by one somehow and instructions would start piling up. It's also
>>> possible that a clog develops in O3's pipeline and some internal structure
>>> stops letting instructions through and starts accumulating them. Either of
>>> these problems will be annoying to track down, but with enough digging I've
>>> been able to fix these sorts of things.
>>>
>>> This may have more to do with O3 not handling the benchmark you're
>>> running well rather than a problem with your new DRAM model. There may be
>>> some interaction between the two, though, where the new memory makes the
>>> timing line up to cause O3 to behave poorly. What you can do is instrument
>>> dynamic instruction creation and destruction and reference counting (try
>>> printing "this" for both the reference counting wrapper and the dyn inst
>>> itself) and turn it on as close as you can to where things go bad tick
>>> wise. Then look for an instruction which gets lost, and look for where its
>>> reference count is incremented and decremented. It should be relatively
>>> easy to pair up where references are created and destroyed, and you should
>>> be able to identify the reference which never goes away. Then you need to
>>> figure out where that reference is being created. After that, you should
>>> have enough information to identify why the reference counting isn't being
>>> done correctly. It's arduous, but that's the only way.
>>>
>>> It's important to also make sure reference counts aren't decremented to
>>> zero prematurely. I had a problem once where that happened and the memory
>>> behind the object was updated by something that didn't know it was dead.
>>> The memory had since been reallocated to another object of the same type,
>>> so that other object reflected what happened to the phantom one. If I
>>> remember that manifested as something weird like an add causing a page
>>> fault or something.
>>>
>>> Gabe
>>>
>>>
>>> On 04/07/12 18:21, Andrew Cebulski wrote:
>>>
>>> Hi all,
>>> I've looked into this problem some more, and have put together a couple
>>> traces. I've been becoming more familiar with how gem5 handles dynamic
>>> instructions, in particular how it destroys them. I have two traces to
>>> compare, one with the physical memory, and the other with the integrated
>>> dramsim2 dram memory. I also have two plots showing instruction counts
>>> over time (sim ticks). All of these are linked at the end of the email.
>>> First, I'm going to go into what I've been able to interpret regarding
>>> how instructions are destroyed. In particular, comparing when DynInst's
>>> vs. DynInstPtr's are deconstructed/removed from the cpu. I separate these
>>> because I've seen a difference, as I discuss later. These explanations are
>>> fairly non-existent on the wiki. There is a section header waiting to be
>>> filled...
>>> From what I have been able to gather from the code, there is a list of
>>> all the instructions in flight in cpu/o3/cpu.cc called instList, with the
>>> type DynInstPtr. There are three conditions to instructions being cleaned
>>> from this list:
>>> 1.) The ROB retires its head instruction
>>> 2.) Fetch receives a rob squashing signal from the commit, resulting in
>>> removing any instruction not in the ROB
>>> 3.) Decode detects an incorrect branch prediction, resulting in removal
>>> of all instructions back to the bad seq num.
>>> Once all five stages have completed, the CPU cleans up all the removed
>>> in-flight instructions. This line in particular
>>> in cleanUpRemovedInsts() in cpu/o3/cpu.cc deconstructs a DynInstPtr:
>>> instList.erase(removeList.front());
>>> When I turn on the debug flag O3CPU, I see the message "Removing
>>> instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum and pcState
>>> after all 5 cpu stages have completed, and one of the conditions above is
>>> met. I also see what tick it occurs on.
>>> When I turn on the DynInst debug flag, I see when instructions are
>>> created and destroyed (cpu/base_dyn_inst_impl.hh) and what tick. From
>>> analyzing the trace files, I've gathered that this takes into account that
>>> instructions have different execution lengths. So if a memory
>>> instruction in the instList (DynInstPtr) is removed at one tick, the
>>> DynInst for that memory instruction may be destroyed much later (e.g. 1M
>>> ticks later). I have yet to determine how this is implemented.
>>> Now for the problem.
>>> What I'm seeing when I run dramsim2 dram memory is a significant
>>> difference between the size of the instList vector (of DynInstPtr objects),
>>> and the size of dynamic instruction count (of DynInst objects). The
>>> benchmark I'm running is libquantum from SPEC 2006. For the first roughly
>>> 130B ticks, the dynamic instruction count kept in cpu/base_dyn_inst_impl.hh
>>> shadows the instList size in o3/cpu.cc (figure linked below) very closely.
>>> Around tick 130B after libquantum started, it starts hitting what I'm
>>> assuming are loops (therefore branch prediction), resulting in some
>>> behavior that seems to imply improper instruction handling (i.e. more
>>> instructions in flight than allowed by ROB).
>>> I wasn't able to sync-up the physical and dramsim2 traces exactly by
>>> trace, but they should represent roughly the same area of execution. They
>>> don't execute the same due to the dramsim2 modeling the memory differently
>>> (i.e. latency and other delays).
>>> I've shared both traces on my public Dropbox here --
>>>
>>> http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
>>>
>>> http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
>>> Here are a couple plots of tick versus instruction count, with respect
>>> to cpu->instcount in cpu/base_dyn_inst_impl.hh and instList.size() in
>>> cpu/o3/cpu.cc. --
>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>>> Note that I added the printout of the instList size to an existing O3CPU
>>> DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.
>>> Here are the commands I ran to parse the traces into data files to
>>> analyze in MATLAB and create the plots:
>>> zgrep DynInst dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
>>> | grep destroyed | awk '{print $1,$11}' > cpuinstcount.out
>>> zgrep instList
>>> dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | awk '{print
>>> $1,$11}' > instlistsize.out
>>> It seems to me like the problem might lie in gem5, but has just been
>>> exposed by integrating this more detailed memory model, dramsim2, into
>>> gem5. Either that, or there are some timing errors in how dramsim2 was
>>> integrated. I doubt this, however, since those first 190B ticks executed
>>> used the dramsim2 memory. I believe the problem is a combination of memory
>>> instructions + complex loops (branch prediction), resulting in improper
>>> destruction of instructions.
>>> I've included the ROB, Commit, Fetch, DynInst and O3CPU debug flags.
>>> There are 192 ROB entries, which is why the instList size generally has a
>>> max of about 192 instructions. The dynamic instruction counts (seen in the
>>> dramsim2 plot) seem to also imply that instructions are incorrectly being
>>> removed from the ROB, and then from the cpu's instruction list in cpu.cc,
>>> which allows more and more instructions to be added to the system (possibly
>>> from a bad branch).
>>> I appreciate any help in debugging this and further figuring out the
>>> root problem, just let me know if you need anything else from me. I don't
>>> have much more time at the moment to debug, but I can take any advice for
>>> quick changes and/or additional traces, then send the results back to the
>>> list for discussion.
>>> Thanks,
>>> Andrew
>>> P.S. Paul - I did try decreasing the size of the dramsim2 transaction
>>> (and even command) queue from 512 to 32. The same instructions problem
>>> occurred. It basically just decreased the execution time.
>>>
>>> On Wed, Mar 14, 2012 at 2:10 PM, Ali Saidi <***@umich.edu> wrote:
>>>
>>>> The error is that there are more than 1500 instructions currently in
>>>> flight in the system. It could mean several things:
>>>>
>>>> 1. The value is somewhat arbitrarily defined and maybe there are more
>>>> than 1500 in your system at one time?
>>>>
>>>> 2. Instructions aren't being destroyed correctly
>>>>
>>>> You could try to run a debug binary so you'll get a list of
>>>> instructions when it happens or increase the number which may
>>>> be appropriate for certain situations (but 1500 is quite a few inflight
>>>> instructions).
>>>>
>>>> Ali
>>>>
>>>> On 13.03.2012 10:56, Andrew Cebulski wrote:
>>>>
>>>> Hi Xiangyu,
>>>> I just started looking into this some more. So at first I thought
>>>> it was due to updating to a more recent revision, but then I went back to
>>>> revision 8643, added your patch, built and ran....and now get the error
>>>> with it too (when running ARM_FS/gem5.opt). I'm testing now to see if an
>>>> update to SWIG might have resulted in this error, maybe someone on the
>>>> mailing list would know if that's possible. The difference is 1.3.40 vs.
>>>> 2.0.3, both of which are supported according to the dependencies wiki page.
>>>> Just for completeness, here's the error from revision 8643:
>>>> build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>> I have not tried running with gem5.debug, so I will be doing that
>>>> today. Maybe this is an assertion that is occurring due to an
>>>> optimization. That would mean it wouldn't be triggered in gem5.debug since
>>>> it runs without optimizations. Have you tested all debug, opt and fast
>>>> with your tests?
>>>> Thanks,
>>>> Andrew
>>>>
>>>> On Tue, Mar 13, 2012 at 1:37 PM, Rio Xiangyu Dong <
>>>> ***@gmail.com> wrote:
>>>>
>>>>> Hi Andrew,
>>>>>
>>>>>
>>>>>
>>>>> I didn’t see this error in my simulations. May I ask which gem5
>>>>> version you are using? I find some of the latest code updates do not comply
>>>>> with my changes. I am still using the DRAMsim2 patch on Gem5 repo8643, and
>>>>> have run all the runnable benchmarks in SPEC2006, SPEC2000, EEMBC2, and
>>>>> PARSEC2 on ARM_SE.
>>>>>
>>>>>
>>>>>
>>>>> Thank you!
>>>>>
>>>>>
>>>>>
>>>>> Best,
>>>>>
>>>>> Xiangyu
>>>>>
>>>>>
>>>>>
>>>>> *From:* Andrew Cebulski [mailto:***@drexel.edu]
>>>>> *Sent:* Thursday, March 08, 2012 6:52 PM
>>>>>
>>>>> *To:* gem5 users mailing list
>>>>> *Cc:****@gmail.com; ***@umich.edu
>>>>>
>>>>> *Subject:* Re: [gem5-users] A Patch for DRAMsim2 Integration
>>>>>
>>>>> Xiangyu,
>>>>>
>>>>> I've been having an issue recently with the number of instructions
>>>>> I've been seeing committed to the CPU (I have a separate thread on this).
>>>>> It turns out the issue seems to be coming from this patch you created to
>>>>> integrate DramSim2 with Gem5. Unfortunately, I've been running with
>>>>> gem5.fast, not gem5.opt. So up until now, I haven't been seeing
>>>>> assertions. I thought I'd run it with gem5.opt or debug back in December,
>>>>> but I must not have. My runs on the Arm O3 cpu fail with this assertion:
>>>>>
>>>>> build/ARM/cpu/base_dyn_inst_impl.hh:149: void BaseDynInst::initVars()
>>>>> [with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>>>
>>>>> -Andrew
>>>>>
>>>> _______________________________________________
>>>> gem5-users mailing list
>>>> gem5-***@gem5.org
>>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
>>>>
>>>>
>>>
>>>
>>> _______________________________________________
>>> gem5-users mailing list
>>> gem5-***@gem5.org
>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
>>>
>>
>
>
> _______________________________________________
> gem5-users mailing list
> gem5-***@gem5.org
> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
>
Ali Saidi
2012-05-02 21:28:45 UTC
Permalink
Something is wrong well before this point. There is no reason that
address 0x0 or 0x4 should be translated.

Did you happen to create a checkpoint when caches were in the system?

Have you tried to run with the checker cpu and see if it detects any errors?

Ali

On 02.05.2012 17:22, Andrew Cebulski wrote:

> They are data TLB misses that occur as
the in-flight instruction count rises (at 0x0 and 0x4). The last TLB
miss before the in-flight instruction count finally linearly decreases
is to 0x200. Also, at the start of the rising slope, I see a miss to 0x8
and 0x2508c.
>
> Here's a trace file:
>
>
http://dl.dropbox.com/u/2953302/gem5/tlb.out [26]
> To reduce size, I
just have lines that have either TLB or walker in them.
> I do see only
a handful of instruction TLB misses.
>
> -Andrew
>
> On Wed, May 2,
2012 at 11:10 AM, Ali Saidi <***@umich.edu [27]> wrote:
>
>> Hi
Andrew,
>>
>> Thanks for digging into this. I think there is an issue
somewhere, but I'm still not sure where.
>>
>> Ali
>>
>> On
01.05.2012 23:34, Andrew Cebulski wrote:
>>
>>> Okay, I'm positive now
that the issue lies with delayed translations that are squashed before
finishing.
>>
>> On the data or instruction side? You seem to allude to
data in the paragraph below, but then instructions in the latter text.

>>
>>> It seems to me like speculative load/stores are being executed,
rather than waiting for the instructions to commit. Once the
instructions begin getting (speculatively) executed in the TLB, a
reference is left there, which seems hard to root out and dereference
after the instruction ends up being squashed. At least, I have not been
able to find that out in the source code as of yet. Can anyone clarify
on this?
>>
>> There should only be one translation outstanding from
each instruction and data side walker. Any nested transactions should be
queued in the walker. Until one finishes, I'm not sure how multiple
would ever be outstanding.
>>
>>> Recall the following image that shows how the number of dynamic
instruction (DynInst) objects in-flight increases linearly for
varying periods of time:
>>>
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[1]
>>> After enabling the TLB debug flag, I see that the linear
increase in instructions in flight is proportional to the number of TLB
misses. These TLB misses have a much larger delay (resulting in
translation delays) due to the fact the DramSim2 models the memory
system more accurately. It seems that with the classic memory system,
TLB misses often do not have translation delays. For whatever reason, it
would also seem that every instruction that has a TLB miss also is
eventually squashed...
>>>
>> From a data side perspective this is
reasonable. While a miss is outstanding at some point instructions will stop
committing and thus the instructions in flight will begin to rise until
the miss is satisfied.
>>
>>> Here's a summary of outputs from my
trace. These two DPRINTF messages appear on the rising slopes (repeated
up until the peak):
>>> TLB Miss: Starting hardware table walker for 0(656)
>>> TLB Miss: Starting hardware table walker for 0x4(656)
>>>
>> This is
interesting/odd. I don't know a good reason why (1) a miss would be
outstanding to both address 0 and address 4 at the same time. In almost
all cases these pages are marked as no-access to detect segfaults.
Perhaps there is an issue where the cpu is getting into a loop faulting on
a bad access and then faulting again on the fault handler. I could imagine
this would happen if there was some corruption in the memory system (for
example the timings in dramsim exposing a bug in the cache models or
something).
>>
>> At the peak, the following message appears (from
fetch) almost every tick for (what I believe to be) every single one of
the table walkers that were squashed.
>> Fetch is waiting ITLB walk to
finish!
>>
>> There must be another walk in flight? The instruction
side will only have one fault outstanding at once. Successive branch
mispredicts will re-direct fetch but there is code that catches the fact
that a different walk completed than expected and "does the right thing."
>>>
>>> The
problem is that these ITLB table walks are for instructions that were
squashed as much as 0.3 billion cycles earlier, and have since been
removed from the CPU's instruction list.
>>
>> I'm not following here.
>>
>>> Any help
will be greatly appreciated in solving this problem. I've hit a
roadblock with getting Ruby working with ARM, most likely due to the
fact that ARM has disjoint memory (x86 and Alpha do not). There's the
256 MB for physical memory, then the 64 MB for the boot loader. I brought
this up in my last email about trying to get Ruby working. Therefore, I'm
trying to get this DramSim2 integration fixed so I can start modeling FS
with DRAM memory.
>>
>> Brad/Steve/Nilay anyone have a suggestion on how to make
this work?
>>
>>> Note that these problems also occur in Soplex from the Spec CPU2006
benchmark suite (also hits 1500 in-flight instructions assertion). Due to
time constraints, I haven't tested on other benchmarks.
>>> Thanks,
>>>
Andrew
>>>
>>> On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski
<***@drexel.edu [2]> wrote:
>>>
>>> Hey Gabe,
>>> Thanks for
this...very helpful. I just recently got back into debugging this problem.
I made a small change in src/base/refcnt.hh to allow me to return the
current count of references to a DynInst object.
>>> I then modified existing DPRINTFs to also print out reference counts,
then added some of my own when I needed extra
visibility.
>>> I've found one memory store instruction that seems to
be getting lost. What's happening is that it progresses as far as
getting executed in the IEW once, but a delayed translation occurs,
deferring the store. By the time it reenters the IEW, the IQ has marked
the instruction as squashed. Everything progresses as usual from here on
out, with one exception. When the instruction is removed from the CPU's
instruction list, there is one reference count hanging.
>>> I've added
in some additional debugging for my traces to help narrow down where
this reference is coming from. As far as I can tell, it's because of a
call to initiateAcc() within the executeStore function in the lsq unit.
Please see the following two traces. The first trace shows what I just
discussed. The second trace is another memory store instruction that got
squashed, however, it was squashed upon its first entry into the IEW,
therefore it never started execution.
>>>
http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out [21]
>>>
http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out [22]
>>>
Let me know if you have any ideas based on these two instruction traces.
I do not understand how the initiateAcc function results in another
reference, but maybe someone else does.... Since I don't see how it
makes a reference, it's hard to find out how to make sure it gets
dereferenced...
>>> Unfortunately, I haven't been able to add a DPRINTF
in src/base/refcnt.hh ...this would make things more clear (i.e. exactly
when references/dereferences occur). Let me know if you have any advice on
this...if it's possible. I can't seem to get the right include files,
and likely right SConscript compile order...
>>> Thanks,
>>> Andrew

>>>
>>> On Sat, Apr 7, 2012 at 9:48 PM, Gabe Black
<***@eecs.umich.edu [23]> wrote:
>>>
>>>> Without digging into
things too deeply, it looks like you may be leaking references to
dynamic instructions. The CPU may think it's done with one, but until
that final reference is removed, the object will hang around forever. I
think I've had problems before where the reference count ended up off
by one somehow and instructions would start piling up. It's also
possible that a clog develops in O3's pipeline and some internal
structure stops letting instructions through and starts accumulating
them. Either of these problems will be annoying to track down, but with
enough digging I've been able to fix these sorts of things.
>>>>
>>>>
This may have more to do with O3 not handling the benchmark you're
running well rather than a problem with your new DRAM model. There may
be some interaction between the two, though, where the new memory makes
the timing line up to cause O3 to behave poorly. What you can do is
instrument dynamic instruction creation and destruction and reference
counting (try printing "this" for both the reference counting wrapper and
the dyn inst itself) and turn it on as close as you can to where things
go bad tick-wise. Then look for an instruction which gets lost, and look
for where its reference count is incremented and decremented. It should
be relatively easy to pair up where references are created and
destroyed, and you should be able to identify the reference which never
goes away. Then you need to figure out where that reference is being
created. After that, you should have enough information to identify why
the reference counting isn't being done correctly. It's arduous, but
that's the only way.
>>>>
>>>> It's important to also make sure
reference counts aren't decremented to zero prematurely. I had a problem
once where that happened and the memory behind the object was updated by
something that didn't know it was dead. The memory had since been
reallocated to another object of the same type, so that other object
reflected what happened to the phantom one. If I remember that
manifested as something weird like an add causing a page fault or
something.
>>>>
>>>> Gabe
>>>>
>>>> On 04/07/12 18:21, Andrew
Cebulski wrote:
>>>>
>>>>> Hi all,
>>>>> I've looked into this
problem some more, and have put together a couple traces. I've been
becoming more familiar with how gem5 handles dynamic instructions, in
particular how it destroys them. I have two traces to compare, one with
the physical memory, and the other with the integrated dramsim2 dram
memory. I also have two plots showing instruction counts over time (sim
ticks). All of these are linked at the end of the email.
>>>>> First,
I'm going to go into what I've been able to interpret regarding how
instructions are destroyed. In particular, comparing when DynInst's vs.
DynInstPtr's are deconstructed/removed from the cpu. I separate these
because I've seen a difference, as I discuss later. These explanations
are fairly non-existent on the wiki. There is a section header waiting
to be filled...
>>>>> From what I have been able to gather from the
code, there is a list of all the instructions in flight in cpu/o3/cpu.cc
called instList, with the type DynInstPtr. There are three conditions to
instructions being cleaned from this list:
>>>>> 1.) The ROB retires
its head instruction
>>>>> 2.) Fetch receives a rob squashing signal
from the commit, resulting in removing any instruction not in the ROB

>>>>> 3.) Decode detects an incorrect branch prediction, resulting in
removal of all instructions back to the bad seq num.
>>>>> Once all
five stages have completed, the CPU cleans up all the removed in-flight
instructions. This line in particular in cleanUpRemovedInsts() in
cpu/o3/cpu.cc deconstructs a DynInstPtr:
>>>>>
instList.erase(removeList.front());
>>>>> When I turn on the debug flag
O3CPU, I see the message "Removing instruction, ..." (from o3/cpu.cc)
with the threadNum, seqNum and pcState after all 5 cpu stages have
completed, and one of the conditions above is met. I also see what tick
it occurs on.
>>>>> When I turn on the DynInst debug flag, I see when
instructions are created and destroyed (cpu/base_dyn_inst_impl.hh) and
what tick. From analyzing the trace files, I've gathered that this takes
into account that instructions have different execution lengths. So if
a memory instruction is removed from the instList (DynInstPtr) on one tick,
the DynInst for that memory instruction will be destroyed much later (e.g. 1M
ticks later). I have yet to determine how this is implemented.
>>>>>
Now for the problem.
>>>>> What I'm seeing when I run dramsim2 dram
memory is a significant difference between the size of the instList
vector (of DynInstPtr objects), and the size of dynamic instruction
count (of DynInst objects). The benchmark I'm running is libquantum from
SPEC 2006. For the first roughly 130B ticks, the dynamic instruction
count kept in cpu/base_dyn_inst_impl.hh shadows the instList size in
o3/cpu.cc (figure linked below) very closely. Around tick 130B after
libquantum started, it starts hitting what I'm assuming are loops
(therefore branch prediction), resulting in some behavior that seems to
imply improper instruction handling (i.e. more instructions in flight
than allowed by ROB).
>>>>> I wasn't able to sync-up the physical and
dramsim2 traces exactly by trace, but they should represent roughly the
same area of execution. They don't execute the same due to the dramsim2
modeling the memory differently (i.e. latency and other delays).
>>>>>
I've shared both traces on my public Dropbox here --
>>>>>
http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
[14]
>>>>>
http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
[15]
>>>>> Here are a couple plots of tick versus instruction count,
with respect to cpu->instcount in cpu/base_dyn_inst_impl.hh and
instList.size() in cpu/o3/cpu.cc. --
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
[16]
>>>>>
>>>>>
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[17]
>>>>> Note that I added the printout of the instList size to an
existing O3CPU DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.
>>>>>
Here are the commands I ran to parse the traces into data files to
analyze in MATLAB and create the plots:
>>>>> zgrep DynInst
dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | grep
destroyed | awk '{print $1,$11}' > cpuinstcount.out
>>>>> zgrep
instList dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz |
awk '{print $1,$11}' > instlistsize.out
>>>>> It seems to me like the
problem might lie in gem5, but has just been exposed by integrating this
more detailed memory model, dramsim2, into gem5. Either that, or there
are some timing errors in how dramsim2 was integrated. I doubt this,
however, since those first 190B ticks executed used the dramsim2 memory.
I believe the problem is a combination of memory instructions + complex
loops (branch prediction), resulting in improper destroying of
instructions.
>>>>> I've included the ROB, Commit, Fetch, DynInst and
O3CPU debug flags. There are 192 ROB entries, which is why the instList
size generally has a max of about 192 instructions. The dynamic
instruction counts (seen in the dramsim2 plot) seem to also imply that
instructions are incorrectly being removed from the ROB, and then from
the cpu's instruction list in cpu.cc, which allows more and more
instructions to be added to the system (possibly from a bad branch).

>>>>> I appreciate any help in debugging this and further figuring out
the root problem, just let me know if you need anything else from me. I
don't have much more time at the moment to debug, but I can take any
advice for quick changes and/or additional traces, then send the results
back to the list for discussion.
>>>>> Thanks,
>>>>> Andrew
>>>>>
P.S. Paul - I did try decreasing the size of the dramsim2 transaction
(and even command) queue from 512 to 32. The same instructions problem
occurred. It basically just decreased the execution time.
>>>>>
>>>>>
On Wed, Mar 14, 2012 at 2:10 PM, Ali Saidi <***@umich.edu [18]>
wrote:
>>>>>
>>>>>> The error is that there are more than 1500
instructions currently in flight in the system. It could mean several
things:
>>>>>>
>>>>>> 1. The value is somewhat arbitrarily defined and
maybe there are more than 1500 in your system at one time?
>>>>>>

>>>>>> 2. Instructions aren't being destroyed correctly
>>>>>>
>>>>>>
You could try to run a debug binary so you'll get a list of
instructions when it happens or increase the number which may be
appropriate for certain situations (but 1500 is quite a few inflight
instructions).
>>>>>>
>>>>>> Ali
>>>>>>
>>>>>> On 13.03.2012 10:56,
Andrew Cebulski wrote:
>>>>>>
>>>>>>> Hi Xiangyu,
>>>>>>> I just
started looking into this some more. So at first I thought it was due to
updating to a more recent revision, but then I went back to revision
8643, added your patch, built and ran....and now get the error with it
too (when running ARM_FS/gem5.opt). I'm testing now to see if an update
to SWIG might have resulted in this error, maybe someone on the mailing
list would know if that's possible. The difference is 1.3.40 vs. 2.0.3,
both of which are supported according to the dependencies wiki page.

>>>>>>> Just for completeness, here's the error from revision 8643:

>>>>>>> build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion
`cpu->instcount
>>>>>>>
>>>>>>> I have not tried running with
gem5.debug, so I will be doing that today. Maybe this is an assertion
that is occurring due to an optimization. That would mean it wouldn't be
triggered in gem5.debug since it runs without optimizations. Have you
tested all debug, opt and fast with your tests?
>>>>>>> Thanks,

>>>>>>> Andrew
>>>>>>>
>>>>>>> On Tue, Mar 13, 2012 at 1:37 PM, Rio
Xiangyu Dong <***@gmail.com [11]> wrote:
>>>>>>>
>>>>>>>> Hi
Andrew,
>>>>>>>>
>>>>>>>> I didn't see this error in my simulations.
May I ask which gem5 version you are using? I find some of the latest
code updates do not comply with my changes. I am still using the
DRAMsim2 patch on Gem5 repo8643, and have run all the runnable
benchmarks in SPEC2006, SPEC2000, EEMBC2, and PARSEC2 on ARM_SE.

>>>>>>>>
>>>>>>>> Thank you!
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>

>>>>>>>> Xiangyu
>>>>>>>>
>>>>>>>> FROM: Andrew Cebulski
[mailto:***@drexel.edu [8]]
>>>>>>>> SENT: Thursday, March 08, 2012
6:52 PM
>>>>>>>>
>>>>>>>> TO: gem5 users mailing list
CC:***@gmail.com [9]; ***@umich.edu [10]
>>>>>>>>
>>>>>>>>
SUBJECT: Re: [gem5-users] A Patch for DRAMsim2 Integration
>>>>>>>>

>>>>>>>> Xiangyu,
>>>>>>>>
>>>>>>>> I've been having an issue
recently with the number of instructions I've been seeing committed to
the CPU (I have a separate thread on this). It turns out the issue seems
to be coming from this patch you created to integrate DramSim2 with
Gem5. Unfortunately, I've been running with gem5.fast, not gem5.opt. So
up until now, I haven't been seeing assertions. I thought I'd run it
with gem5.opt or debug back in December, but I must not have. My runs on
the Arm O3 cpu fails with this assertion:
>>>>>>>>
>>>>>>>>
build/ARM/cpu/base_dyn_inst_impl.hh:149: void BaseDynInst::initVars()
[with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>>>>>>
>>>>>>>>
-Andrew
>>>>>>>>
>>>>>>>>> Date: Sun, 18 Dec 2011 01:48:58
-0800
>>>>>>>>> From: "Dong, Xiangyu" <***@gmail.com
[3]>
>>>>>>>>> To: "gem5 users mailing list" <gem5-***@gem5.org
[4]>
>>>>>>>>> Subject: [gem5-users] A Patch for DRAMsim2 Integration
Message-ID: gmail.com [5]>
>>>>>>>>>
>>>>>>>>> Content-Type:
text/plain; charset="us-ascii"
>>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>

>>>>>>>>> I have a Gem5+DRAMsim2 patch. I've tested it under both SE
and FS modes.
>>>>>>>>> I'm willing to share it here.
>>>>>>>>>

>>>>>>>>> For those who have such needs, please go to my
website
>>>>>>>>> www.cse.psu.edu/~xydong [6] to download the patch and
test it. To enable
>>>>>>>>> DRAMSim2, use se_dramsim2.py script instead
of se.py (for FS, you can create
>>>>>>>>> by yourself). The basic idea
to enable the DRAMsim2 module is to use the
>>>>>>>>> derived DRAMMemory
class instead of PhysicalMemory class.
>>>>>>>>>
>>>>>>>>> Please let
me know if there are bugs.
>>>>>>>>>
>>>>>>>>> Thank you!
>>>>>>>>>

>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> Xiangyu Dong
>>>>>>>>>
>>>>>>>>>
-------------- next part --------------
>>>>>>>>> An HTML attachment was
scrubbed...
>>>>>>>>> URL:
<http://m5sim.org/cgi-bin/mailman/private/gem5-users/attachments/20111218/f3fdf5da/attachment.html
[7]>
>>>>>>>
>>>>>>>
_______________________________________________
>>>>>>> gem5-users
mailing list
>>>>>>> gem5-***@gem5.org [12]
>>>>>>>
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users [13]
>>>>>
>>>>>
_______________________________________________
>>>>> gem5-users mailing
list
>>>>>
gem5-***@gem5.org
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
>>>>

>>>> _______________________________________________
>>>> gem5-users
mailing list
>>>> gem5-***@gem5.org [19]
>>>>
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users [20]
>>>
>>>
_______________________________________________
>>> gem5-users mailing
list
>>> gem5-***@gem5.org [24]
>>>
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users [25]




Links:
------
[1] http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[2] mailto:***@drexel.edu
[3] mailto:***@gmail.com
[4] mailto:gem5-***@gem5.org
[5] http://gmail.com
[6] http://www.cse.psu.edu/%7Exydong
[7] http://m5sim.org/cgi-bin/mailman/private/gem5-users/attachments/20111218/f3fdf5da/attachment.html
[8] mailto:***@drexel.edu
[9] mailto:***@gmail.com
[10] mailto:***@umich.edu
[11] mailto:***@gmail.com
[12] mailto:gem5-***@gem5.org
[13] http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[14] http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
[15] http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
[16] http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
[17] http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[18] mailto:***@umich.edu
[19] mailto:gem5-***@gem5.org
[20] http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[21] http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
[22] http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
[23] mailto:***@eecs.umich.edu
[24] mailto:gem5-***@gem5.org
[25] http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[26] http://dl.dropbox.com/u/2953302/gem5/tlb.out
[27] mailto:***@umich.edu
Andrew Cebulski
2012-05-02 23:58:50 UTC
Permalink
I have not run with the checker CPU recently. Here's the stderr output
from a run I did a while back:

http://dl.dropbox.com/u/2953302/gem5/err.0

Note that the instruction match error is before my benchmark actually
starts running. The start of my boot script checks to see if my files
image is mounted (which it is), then continues on to run the benchmark. I
booted the system, mounted my files image, then took a checkpoint. I've
been running all my tests from that checkpoint. I found where my benchmark
started based on the ASID (from ExecAsid debug flag).

I delayed the start of gathering trace data until the second-to-last linear
increase in dynamic instructions in-flight. I'm running a new trace now.

-Andrew



On Wed, May 2, 2012 at 5:28 PM, Ali Saidi <***@umich.edu> wrote:

>
> Something is wrong well before this point. There is no reason that address
> 0x0 or 0x4 should be translated.
>
> Did you happen to create a checkpoint when caches were in the system?
>
> Have you tried to run with the checker cpu and see if it detects any
> errors?
>
>
>
> Ali
>
>
>
>
>
> On 02.05.2012 17:22, Andrew Cebulski wrote:
>
> They are data TLB misses that occur as the in-flight instruction count
> rises (at 0x0 and 0x4). The last TLB miss before the in-flight instruction
> count finally linearly decreases is to 0x200. Also, at the start of the
> rising slope, I see a miss to 0x8 and 0x2508c.
> Here's a trace file:
> http://dl.dropbox.com/u/2953302/gem5/tlb.out
> To reduce size, I just have lines that have either TLB or walker in them.
> I do see only a handful of instruction TLB misses.
> -Andrew
>
> On Wed, May 2, 2012 at 11:10 AM, Ali Saidi <***@umich.edu> wrote:
>
>> Hi Andrew,
>>
>>
>>
>> Thanks for digging into this. I think there is an issue somewhere, but
>> I'm still not sure where.
>>
>> Ali
>>
>> On 01.05.2012 23:34, Andrew Cebulski wrote:
>>
>> Okay, I'm positive now that the issue lies with delayed translations that
>> are squashed before finishing.
>>
>> On the data or instruction side? You seem to allude to data in the
>> paragraph below, but then to instructions in the later text.
>>
>> It seems to me like speculative load/stores are being executed, rather
>> than waiting for the instructions to commit. Once the instructions begin
>> getting (speculatively) executed in the TLB, a reference is left there,
>> which seems hard to root out and dereference after the instruction ends up
>> being squashed. At least, I have not been able to find that out in the
>> source code as of yet. Can anyone clarify on this?
>>
>>
>>
>> There should only be one translation outstanding from each instruction
>> and data side walker. Any nested transactions should be queued in the
>> walker. Until one finishes, I'm not sure how multiple would ever be
>> outstanding.
>>
>> Recall the following image that shows how the number of dynamic
>> instruction (DynInst) objects in-flight increases linearly for varying
>> periods of time:
>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>> After enabling the TLB debug flag, I see that the linear increase in
>> instructions in flight is proportional to the number of TLB misses. These
>> TLB misses have a much larger delay (resulting in translation delays)
>> because DramSim2 models the memory system more accurately. It
>> seems that with the classic memory system, TLB misses often do not have
>> translation delays. For whatever reason, it would also seem that every
>> instruction that has a TLB miss also is eventually squashed...
>>
>> From a data side perspective this is reasonable. While a miss is
>> outstanding at some point instructions will stop committing and thus the
>> instructions in flight will begin to rise until the miss is satisfied.
>>
>> Here's a summary of outputs from my trace. These two DPRINTF messages
>> appear on the rising slopes (repeated up until the peak):
>> TLB Miss: Starting hardware table walker for 0(656)
>> TLB Miss: Starting hardware table walker for 0x4(656)
>>
>> This is interesting/odd. I don't know a good reason why (1) a miss
>> would be outstanding to both address 0 and address 4 at the same time. In
>> almost all cases these pages are marked as no-access to detect segfaults.
>> Perhaps there is an issue where the cpu is getting into a loop faulting on
>> a bad access and then faulting again on the fault handler. I could imagine
>> this would happen if there was some corruption in the memory system (for
>> example the timings in dramsim exposing a bug in the cache models or
>> something).
>>
>>
>> At the peak, the following message appears (from fetch) almost every tick
>> for (what I believe to be) every single one of the table walkers that were
>> squashed.
>> Fetch is waiting ITLB walk to finish!
>>
>> There must be another walk in flight? The instruction side will only
>> have one fault outstanding at once. Successive branch mispredicts will
>> re-direct fetch but there is code that catches the fact that a different
>> walk completed than expected and "does the right thing."
>>
>> The problem is that these ITLB table walks are for instructions that
>> were squashed as much as 0.3 billion cycles earlier, and have since been
>> removed from the CPU's instruction list.
>>
>> I'm not following here.
>>
>> Any help will be greatly appreciated in solving this problem. I've hit
>> a roadblock with getting Ruby working with ARM, most likely due to the fact
>> that ARM has disjoint memory (x86 and Alpha do not). There's the 256 MB
>> for physical memory, then the 64 MB for the boot loader. I brought this up
>> in my last email about trying to get Ruby working. Therefore, I'm trying
>> to get this DramSim2 integration fixed so I can start modeling FS with DRAM
>> memory.
>>
>> Brad/Steve/Nilay anyone have a suggestion on how to make this work?
>>
>>
>> Note that these problems also occur in Soplex from the Spec CPU2006
>> benchmark suite (also hits 1500 in-flight instructions assertion). Due to
>> time constraints, I haven't tested on other benchmarks.
>> Thanks,
>> Andrew
>> On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski <***@drexel.edu> wrote:
>>
>>> Hey Gabe,
>>> Thanks for this...very helpful. I just recently got back into
>>> debugging this problem. I made a small change in src/base/refcnt.hh to
>>> allow me to return the current count of references to a DynInst object.
>>> I then modified existing DPRINTFs to also print out reference
>>> counts, then added some of my own when I needed extra visibility.
>>> I've found one memory store instruction that seems to be getting
>>> lost. It progresses as far as being executed in the IEW once, but a
>>> delayed translation occurs, deferring the store. By the time it
>>> reenters the IEW, the IQ has marked the instruction as squashed.
>>> Everything progresses as usual from here on out, with one exception:
>>> when the instruction is removed from the CPU's instruction list,
>>> one reference to it is still outstanding.
>>> I've added in some additional debugging for my traces to help narrow
>>> down where this reference is coming from. As far as I can tell, it's
>>> because of a call to initiateAcc() within the executeStore function in the
>>> lsq unit. Please see the following two traces. The first trace shows what
>>> I just discussed. The second trace is another memory store instruction
>>> that got squashed, however, it was squashed upon its first entry into the
>>> IEW, therefore it never started execution.
>>> http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
>>> http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
>>> Let me know if you have any ideas based on these two instruction
>>> traces. I do not understand how the initiateAcc function results in
>>> another reference, but maybe someone else does.... Since I don't see how
>>> it makes a reference, it's hard to find out how to make sure it gets
>>> dereferenced...
>>> Unfortunately, I haven't been able to add a DPRINTF in
>>> src/base/refcnt.hh, which would make things clearer (i.e. exactly when
>>> references/dereferences occur). Let me know if you have any advice on
>>> whether this is possible. I can't seem to get the right include files,
>>> and likely the right SConscript compile order...
>>> Thanks,
>>> Andrew
>>>
>>>
>>> On Sat, Apr 7, 2012 at 9:48 PM, Gabe Black <***@eecs.umich.edu> wrote:
>>>
>>>> Without digging into things too deeply, it looks like you may be
>>>> leaking references to dynamic instructions. The CPU may think it's done
>>>> with one, but until that final reference is removed, the object will hang
>>>> around forever. I think I've had problems before where the reference
>>>> count ended up off by one somehow and instructions would start piling up.
>>>> It's also possible that a clog develops in O3's pipeline and some internal
>>>> structure stops letting instructions through and starts accumulating them.
>>>> Either of these problems will be annoying to track down, but with enough
>>>> digging I've been able to fix these sorts of things.
>>>>
>>>> This may have more to do with O3 not handling the benchmark you're
>>>> running well rather than a problem with your new DRAM model. There may be
>>>> some interaction between the two, though, where the new memory makes the
>>>> timing line up to cause O3 to behave poorly. What you can do is instrument
>>>> dynamic instruction creation and destruction and reference counting (try
>>>> printing "this" for both the reference counting wrapper and the dyn inst
>>>> itself) and turn it on as close as you can, tick-wise, to where things go
>>>> bad. Then look for an instruction which gets lost, and look for where its
>>>> reference count is incremented and decremented. It should be relatively
>>>> easy to pair up where references are created and destroyed, and you should
>>>> be able to identify the reference which never goes away. Then you need to
>>>> figure out where that reference is being created. After that, you should
>>>> have enough information to identify why the reference counting isn't being
>>>> done correctly. It's arduous, but that's the only way.
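
The pairing step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `(kind, inst_id, holder)` event layout, the holder labels, and the name `find_leaked_refs` are hypothetical, and gem5's DPRINTF output does not come in this shape, so a real trace would need its own parsing first.

```python
from collections import defaultdict


def find_leaked_refs(events):
    """Return {inst_id: [holders]} for references taken but never dropped.

    `events` is an iterable of (kind, inst_id, holder) tuples, where kind is
    "inc" or "dec". This layout is an assumption made for the sketch.
    """
    open_refs = defaultdict(list)
    for kind, inst_id, holder in events:
        if kind == "inc":
            open_refs[inst_id].append(holder)
        elif kind == "dec":
            # A decrement with no matching increment points at the opposite
            # problem: a count that is dropped prematurely.
            if holder not in open_refs[inst_id]:
                raise ValueError(f"dec without matching inc: {inst_id} by {holder}")
            open_refs[inst_id].remove(holder)
        else:
            raise ValueError(f"unknown event kind: {kind}")
    # Keep only instructions that still have at least one open reference.
    return {inst: holders for inst, holders in open_refs.items() if holders}


# Made-up events for one instruction; the holder names are illustrative only.
events = [
    ("inc", 42, "instList"),
    ("inc", 42, "lsq"),
    ("inc", 42, "translation"),
    ("dec", 42, "lsq"),
    ("dec", 42, "instList"),
]
leaked = find_leaked_refs(events)
```

With these made-up events, the only reference left open is the one labeled "translation", which is the kind of result that would tell you where to look next.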
>>>>
>>>> It's important to also make sure reference counts aren't decremented to
>>>> zero prematurely. I had a problem once where that happened and the memory
>>>> behind the object was updated by something that didn't know it was dead.
>>>> The memory had since been reallocated to another object of the same type,
>>>> so that other object reflected what happened to the phantom one. If I
>>>> remember correctly, that manifested as something weird, like an add
>>>> causing a page fault.
>>>>
>>>> Gabe
>>>>
>>>>
>>>> On 04/07/12 18:21, Andrew Cebulski wrote:
>>>>
>>>> Hi all,
>>>> I've looked into this problem some more, and have put together a couple
>>>> traces. I've been becoming more familiar with how gem5 handles dynamic
>>>> instructions, in particular how it destroys them. I have two traces to
>>>> compare, one with the physical memory, and the other with the integrated
>>>> dramsim2 dram memory. I also have two plots showing instruction counts
>>>> over time (sim ticks). All of these are linked at the end of the email.
>>>> First, I'm going to go into what I've been able to interpret regarding
>>>> how instructions are destroyed, in particular comparing when DynInsts
>>>> vs. DynInstPtrs are destroyed/removed from the cpu. I separate these
>>>> because I've seen a difference, as I discuss later. These explanations are
>>>> mostly missing from the wiki; there is a section header waiting to be
>>>> filled...
>>>> From what I have been able to gather from the code, there is a list of
>>>> all the instructions in flight in cpu/o3/cpu.cc called instList, with the
>>>> type DynInstPtr. There are three conditions to instructions being cleaned
>>>> from this list:
>>>> 1.) The ROB retires its head instruction
>>>> 2.) Fetch receives a rob squashing signal from the commit, resulting
>>>> in removing any instruction not in the ROB
>>>> 3.) Decode detects an incorrect branch prediction, resulting in
>>>> removal of all instructions back to the bad seq num.
>>>> Once all five stages have completed, the CPU cleans up all the removed
>>>> in-flight instructions. This line in particular
>>>> in cleanUpRemovedInsts() in cpu/o3/cpu.cc deconstructs a DynInstPtr:
>>>> instList.erase(removeList.front());
>>>> When I turn on the debug flag O3CPU, I see the message "Removing
>>>> instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum and pcState
>>>> after all 5 cpu stages have completed, and one of the conditions above is
>>>> met. I also see what tick it occurs on.
>>>> When I turn on the DynInst debug flag, I see when instructions are
>>>> created and destroyed (cpu/base_dyn_inst_impl.hh) and what tick. From
>>>> analyzing the trace files, I've gathered that this takes into account that
>>>> instructions have different execution lengths. So if a memory
>>>> instruction is removed from the instList (DynInstPtr) on one tick, the
>>>> DynInst for that memory instruction will be destroyed much later (e.g. 1M
>>>> ticks later). I have yet to determine how this is implemented.
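
One plausible reading of that delay, in line with the reference-counting explanation given elsewhere in this thread: erasing the instList entry only drops one reference, and the DynInst object is destroyed when the last holder lets go. The toy model below is a sketch of that behavior, not gem5 code; the class name `RefCounted`, its methods, and the tick values are all made up for illustration.

```python
class RefCounted:
    """Toy stand-in for a reference-counted dynamic instruction."""

    def __init__(self, seq_num):
        self.seq_num = seq_num
        self.count = 0
        self.destroyed_at = None  # tick at which the last reference went away

    def incref(self):
        assert self.destroyed_at is None, "reference taken on a dead object"
        self.count += 1

    def decref(self, tick):
        assert self.count > 0, "reference count would go negative"
        self.count -= 1
        if self.count == 0:
            self.destroyed_at = tick


inst = RefCounted(seq_num=1)
inst.incref()                 # held by the in-flight instruction list
inst.incref()                 # held by a pending memory access
inst.decref(tick=1000)        # erased from the in-flight list
removed_but_alive = inst.destroyed_at is None
inst.decref(tick=1_001_000)   # pending access completes much later
```

In this model the object outlives its list entry by about a million ticks, and a holder that never calls `decref` would keep it alive indefinitely.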
>>>> Now for the problem.
>>>> What I'm seeing when I run dramsim2 dram memory is a significant
>>>> difference between the size of the instList (of DynInstPtr objects)
>>>> and the dynamic instruction count (of DynInst objects). The
>>>> benchmark I'm running is libquantum from SPEC 2006. For the first roughly
>>>> 130B ticks, the dynamic instruction count kept in cpu/base_dyn_inst_impl.hh
>>>> shadows the instList size in o3/cpu.cc (figure linked below) very closely.
>>>> Around tick 130B after libquantum started, it starts hitting what I'm
>>>> assuming are loops (therefore branch prediction), resulting in some
>>>> behavior that seems to imply improper instruction handling (i.e. more
>>>> instructions in flight than allowed by ROB).
>>>> I wasn't able to sync-up the physical and dramsim2 traces exactly by
>>>> trace, but they should represent roughly the same area of execution. They
>>>> don't execute the same due to the dramsim2 modeling the memory differently
>>>> (i.e. latency and other delays).
>>>> I've shared both traces on my public Dropbox here --
>>>>
>>>> http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
>>>>
>>>> http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
>>>> Here are a couple of plots of tick versus instruction count, with respect
>>>> to cpu->instcount in cpu/base_dyn_inst_impl.hh and instList.size() in
>>>> cpu/o3/cpu.cc:
>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>>>> Note that I added the printout of the instList size to an existing
>>>> O3CPU DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.
>>>> Here are the commands I ran to parse the traces into data files to
>>>> analyze in MATLAB and create the plots:
>>>> zgrep DynInst dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | grep destroyed | awk '{print $1,$11}' > cpuinstcount.out
>>>> zgrep instList dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | awk '{print $1,$11}' > instlistsize.out
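
For anyone who would rather do this extraction without zgrep and awk, the same filtering can be approximated in Python. This is a minimal sketch: the function names and the sample line are invented for illustration, and the field positions (1 and 11) are simply the ones used in the awk commands above, not a statement about gem5's trace format in general.

```python
import gzip


def extract_columns(lines, keyword, must_contain=None, fields=(1, 11)):
    """Mimic `grep keyword | grep must_contain | awk '{print $a,$b}'`.

    Fields are 1-based and whitespace-separated, as in awk. A field that is
    missing from a short line comes back as an empty string.
    """
    rows = []
    for line in lines:
        if keyword not in line:
            continue
        if must_contain is not None and must_contain not in line:
            continue
        parts = line.split()
        rows.append(tuple(parts[f - 1] if f <= len(parts) else "" for f in fields))
    return rows


def extract_from_gzip(path, keyword, must_contain=None, fields=(1, 11)):
    """Same as extract_columns, reading a gzip-compressed trace file."""
    with gzip.open(path, "rt", errors="replace") as fh:
        return extract_columns(fh, keyword, must_contain, fields)


# A made-up 11-field line, only to show which columns are picked up.
sample = ["100: DynInst b c d e f g h destroyed 7"]
rows = extract_columns(sample, "DynInst", "destroyed")
```

The first field keeps whatever punctuation the trace attaches to the tick, exactly as awk would, so it may need stripping before plotting.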
>>>> It seems to me like the problem might lie in gem5, and has just been
>>>> exposed by integrating this more detailed memory model, dramsim2, into
>>>> gem5. Either that, or there are some timing errors in how dramsim2 was
>>>> integrated. I doubt this, however, since those first 190B ticks executed
>>>> used the dramsim2 memory. I believe the problem is a combination of memory
>>>> instructions + complex loops (branch prediction), resulting in improper
>>>> destruction of instructions.
>>>> I've included the ROB, Commit, Fetch, DynInst and O3CPU debug flags.
>>>> Their are 192 ROB entries, which is why the instList size generally has a
>>>> max of about 192 instructions. The dynamic instruction counts (seen in the
>>>> dramsim2 plot) seem to also imply that instructions are incorrectly being
>>>> removed from the ROB, and then from the cpu's instruction list in cpu.cc,
>>>> which allows more and more instructions to be added to the system (possibly
>>>> from a bad branch).
>>>> I appreciate any help in debugging this and further figuring out the
>>>> root problem, just let me know if you need anything else from me. I don't
>>>> have much more time at the moment to debug, but I can take any advice for
>>>> quick changes and/or additional traces, then send the results back to the
>>>> list for discussion.
>>>> Thanks,
>>>> Andrew
>>>> P.S. Paul - I did try decreasing the size of the dramsim2 transaction
>>>> (and even command) queue from 512 to 32. The same instructions problem
>>>> occurred. It basically just decreased the execution time.
>>>>
>>>> On Wed, Mar 14, 2012 at 2:10 PM, Ali Saidi <***@umich.edu> wrote:
>>>>
>>>>> The error is that there are more than 1500 instructions currently in
>>>>> flight in the system. It could mean several things:
>>>>>
>>>>> 1. The value is somewhat arbitrarily defined and maybe there are more
>>>>> than 1500 in your system at one time?
>>>>>
>>>>> 2. Instructions aren't being destroyed correctly
>>>>>
>>>>> You could try to run a debug binary so you'll get a list of
>>>>> instructions when it happens or increase the number which may
>>>>> be appropriate for certain situations (but 1500 is quite a few inflight
>>>>> instructions).
>>>>>
>>>>> Ali
>>>>>
>>>>> On 13.03.2012 10:56, Andrew Cebulski wrote:
>>>>>
>>>>> Hi Xiangyu,
>>>>> I just started looking into this some more. So at first I thought
>>>>> it was due to updating to a more recent revision, but then I went back to
>>>>> revision 8643, added your patch, built and ran....and now get the error
>>>>> with it too (when running ARM_FS/gem5.opt). I'm testing now to see if an
>>>>> update to SWIG might have resulted in this error, maybe someone on the
>>>>> mailing list would know if that's possible. The difference is 1.3.40 vs.
>>>>> 2.0.3, both of which are supported according to the dependencies wiki page.
>>>>> Just for completeness, here's the error from revision 8643:
>>>>> build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>>> I have not tried running with gem5.debug, so I will be doing that
>>>>> today. Maybe this is an assertion that is occurring due to an
>>>>> optimization. That would mean it wouldn't be triggered in gem5.debug since
>>>>> it runs without optimizations. Have you tested all debug, opt and fast
>>>>> with your tests?
>>>>> Thanks,
>>>>> Andrew
>>>>>
>>>>> On Tue, Mar 13, 2012 at 1:37 PM, Rio Xiangyu Dong <
>>>>> ***@gmail.com> wrote:
>>>>>
>>>>>> Hi Andrew,
>>>>>>
>>>>>>
>>>>>>
>>>>>> I didn’t see this error in my simulations. May I ask which gem5
>>>>>> version you are using? I find some of the latest code updates do not comply
>>>>>> with my changes. I am still using the DRAMsim2 patch on Gem5 repo8643, and
>>>>>> have run all the runnable benchmarks in SPEC2006, SPEC2000, EEMBC2, and
>>>>>> PARSEC2 on ARM_SE.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Xiangyu
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* Andrew Cebulski [mailto:***@drexel.edu]
>>>>>> *Sent:* Thursday, March 08, 2012 6:52 PM
>>>>>>
>>>>>> *To:* gem5 users mailing list
>>>>>> *Cc:****@gmail.com; ***@umich.edu
>>>>>>
>>>>>> *Subject:* Re: [gem5-users] A Patch for DRAMsim2 Integration
>>>>>>
>>>>>> Xiangyu,
>>>>>>
>>>>>> I've been having an issue recently with the number of instructions
>>>>>> I've been seeing committed to the CPU (I have a separate thread on this).
>>>>>> It turns out the issue seems to be coming from this patch you created to
>>>>>> integrate DramSim2 with Gem5. Unfortunately, I've been running with
>>>>>> gem5.fast, not gem5.opt. So up until now, I haven't been seeing
>>>>>> assertions. I thought I'd run it with gem5.opt or debug back in December,
>>>>>> but I must not have. My runs on the Arm O3 cpu fails with this assertion:
>>>>>>
>>>>>> build/ARM/cpu/base_dyn_inst_impl.hh:149: void BaseDynInst::initVars()
>>>>>> [with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>>>>
>>>>>> -Andrew
>>>>>>
> _______________________________________________
> gem5-users mailing list
> gem5-***@gem5.org
> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
>
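For readers without zgrep and awk at hand, the two parsing commands quoted
above (filter the trace for DynInst/destroyed or instList lines, then keep
fields 1 and 11) can be approximated in Python. This is a minimal sketch, not
part of the original thread: the function names are made up for illustration,
and the field positions 1 and 11 are simply copied from the awk calls, so they
only make sense for trace lines laid out like the ones in Andrew's traces.

```python
import gzip


def extract_columns(lines, must_contain, first=1, second=11):
    """Mimic `grep A | grep B | awk '{print $first,$second}'` on text lines.

    `must_contain` is a list of substrings that all have to appear in a line
    (the grep stages). Field numbers are 1-based, as in awk.
    """
    rows = []
    for line in lines:
        if not all(word in line for word in must_contain):
            continue
        fields = line.split()
        if len(fields) < max(first, second):
            # awk would print an empty field here; this sketch skips the line.
            continue
        rows.append((fields[first - 1], fields[second - 1]))
    return rows


def extract_from_gz(path, must_contain):
    """Stream a gzip-compressed trace (hypothetical path) without unpacking it."""
    with gzip.open(path, "rt", errors="replace") as handle:
        return extract_columns(handle, must_contain)
```

Calling extract_from_gz(trace_path, ["DynInst", "destroyed"]) would then play
the role of the first pipeline, and extract_from_gz(trace_path, ["instList"])
the second; the returned pairs are strings and still carry whatever
punctuation the trace puts on the tick field.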
Ali Saidi
2012-05-03 01:07:57 UTC
Permalink
You haven't answered the question about if you created the
checkpoints with an atomic cpu without caches.

Ali

On 02.05.2012 19:58, Andrew Cebulski wrote:

> I have not run with the checker CPU recently. Here's the stderr output
> from a run I did awhile back:
> http://dl.dropbox.com/u/2953302/gem5/err.0 [30]
> Note that the instruction match error is before my benchmark actually
> starts running. The start of my boot script checks to see if my files
> image is mounted (which it is), then continues on to run the benchmark. I
> booted the system, mounted my files image, then took a checkpoint. I've
> been running all my tests from that checkpoint. I found where my benchmark
> started based on the ASID (from ExecAsid debug flag).
> I delayed the start of gathering trace data until the second-to-last
> linear increase in dynamic instructions in-flight. I'm running a new trace
> now.
> -Andrew
>
> On Wed, May 2, 2012 at 5:28 PM, Ali Saidi <***@umich.edu [31]> wrote:
>
>> Something is wrong well before this point. There is no reason that
>> address 0x0 or 0x4 should be translated.
>>
>> Did you happen to create a checkpoint when caches were in the system?
>>
>> Have you tried to run with the checker cpu and see if it detects any
>> errors?
>>
>> Ali
>>
>> On 02.05.2012 17:22, Andrew Cebulski wrote:
>>
>>> They are data TLB misses that occur as the in-flight instruction count
>>> rises (at 0x0 and 0x4). The last TLB miss before the in-flight instruction
>>> count finally linearly decreases is to 0x200. Also, at the start of the
>>> rising slope, I see a miss to 0x8 and 0x2508c.
>>>
>>> Here's a trace file:
>>> http://dl.dropbox.com/u/2953302/gem5/tlb.out [26]
>>> To reduce size, I just have lines that have either TLB or walker in them.
>>> I do see only a handful of instruction TLB misses.
>>>
>>> -Andrew
>>>
>>> On Wed, May 2, 2012 at 11:10 AM, Ali Saidi <***@umich.edu [27]> wrote:
>>>
>>>> Hi Andrew,
>>>>
>>>> Thanks for digging into this. I think there is an issue somewhere, but
>>>> I'm still not sure where.
>>>>
>>>> Ali
>>>>
>>>> On 01.05.2012 23:34, Andrew Cebulski wrote:
>>>>
>>>>> Okay, I'm positive now that the issue lies with delayed translations
>>>>> that are squashed before finishing.
>>>>
>>>> On the data or instruction side? You seem to allude to data in the
>>>> paragraph below, but then instructions in the latter text.
>>>>
>>>>> It seems to me like speculative load/stores are being executed, rather
>>>>> than waiting for the instructions to commit. Once the instructions begin
>>>>> getting (speculatively) executed in the TLB, a reference is left there,
>>>>> which seems hard to root out and dereference after the instruction ends up
>>>>> being squashed. At least, I have not been able to find that out in the
>>>>> source code as of yet. Can anyone clarify on this?
>>>>
>>>> There should only be one translation outstanding from each instruction
>>>> and data side walker. Any nested transactions should be queued in the
>>>> walker. Until one finishes, I'm not sure how multiple would ever be
>>>> outstanding.
>>>>
>>>>> Recall the following image that shows how the number of dynamic
>>>>> instruction (DynInst) objects in-flight increases linearly for varying
>>>>> periods of time:
>>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png [1]
>>>>> After enabling the TLB debug flag, I see that the linear increase in
>>>>> instructions in flight is proportional to the number of TLB misses. These
>>>>> TLB misses have a much larger delay (resulting in translation delays) due
>>>>> to the fact that DramSim2 models the memory system more accurately. It
>>>>> seems that with the classic memory system, TLB misses often do not have
>>>>> translation delays. For whatever reason, it would also seem that every
>>>>> instruction that has a TLB miss also is eventually squashed...
>>>>
>>>> From a data side perspective this is reasonable. While a miss is
>>>> outstanding at some point instructions will stop committing and thus the
>>>> instructions in flight will begin to rise until the miss is satisfied.
>>>>
>>>> Here's a summary of outputs from my
trace. These two DPRINTF messages appears on the rising slopes (repeated
up until the peak):
>>>> TLB
>>>>
>>>>> r 0x4(656)
>>>>>
>>>>> This
is interesting/odd. I don't know a good reason why (1) a miss would be
outstanding to both address 0 and address 4 at the same time. In almost
all cases these pages are marked as no-access to detect segfaults.
Perhaps there is a
>>>> the cpu is getting into a loop faulting on a bad
access and then faulting again on the fault handler. I could imagine
this would happen if there was some corruption in the memory system (for
example the timings in dramsim exposing a bug in the cache models or
something).
>>>>
>>>> At the peak, the following message appears (from
fetch) almost every tick for (what I believe to be) every single one of
the table walkers that were squashed.
>>>> Fetch is waiting ITLB walk
to finish!
>>>>
>>>> There must be another walk in flight? The
instruction side will only have one fault outstanding at once.
Successive branch mispredicts will
>>>>
>>>>> nd "does the right
thing."
>>>>>
>>>>> The problem is that these ITLB table walks are for
instructions that wer
>>>> much as 0.3 billion cycles earlier, and since
been removed from the CPU's instruction list.
>>>>
>>>> I'm not
following here.
>>>>
>>>> Any help will be greatly appreciated in
solving this problem. I've hit a roadblock with getting Ruby working
with ARM, most likely due to the fact
>>>>
>>>>> the 64 MB for the boot
loader. I brought this up in my last email about trying to get Ruby
working. Therefore, I'm trying to get this DramSim2 integration fixed so
I can start modeling FS with D
>>>> div>
>>>>
>>>> Brad/Steve/Nilay
anyone have a suggestion on how to make this work?
>>>>
>>>> Note that
these problem
>>>>
>>>>> rtion). Due to time constraints, I haven't
tested on other benchmarks.
>>>>> Thanks,
>>>>> Andrew
>>>>>
>>>>>
On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski <***@drexel.edu [2]>
wrote:
>>>>>
>>>>> Hey Gabe,
>>>>> Thanks for this.
>>>> l. I just
recently got back into debugging this problem. I made a small change in
src/base/refcnt.hh to allow me to return the current count of references
to a DynInst object.
>>>> I the
>>>>
>>>>> extra visibility.
>>>>>
I've found one memory store instruction that seems to be getting lost.
What's happening is that it progresses as far as getting executed in the
IEW once, but a delayed translation occurs, deferring the store. By the
time it reenters the IEW, the IQ has marked the instruction as squashed.
Everything progresses as usual from here on out, with one exception.
When the instruction is removed from the CPUs instruction list, there is
one reference count hanging.
>>>>> I've added in some additional
debugging for my traces to help narrow down where this reference is
coming from. As far as I can tell, it's because of a call to
initiateAcc() within the executeStore function in the lsq unit. Please
see the following two traces. The first trace shows what I just
discussed. The second trace is another memory store instruction that got
squashed, however, it was squashed upon its first entry into the IEW,
therefore it never started execution.
>>>>>
http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out [21]
>>>>>
http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out [22]
>>>>>
Let me know if you have any ideas based on these two instruction traces.
I do not understand how the initiateAcc function results in another
reference, but maybe someone else does.... Since I don't see how it
makes a reference, it's hard to find out how to make sure it gets
dereferenced...
>>>>> Unfortunately, I haven't been able to add a
DPRINTF in src/base/refcnt.hh ...this would make things more clear (i.e.
exactly when references/dereferences occur). Let me know if you have any
advice on this...if it's possible. I can't seem to get the right include
files, and likely right SConscript compile order...
>>>>> Thanks,

>>>>> Andrew
>>>>>
>>>>> On Sat, Apr 7, 2012 at 9:48 PM, Gabe Black
<***@eecs.umich.edu [23]> wrote:
>>>>>
>>>>>> Without digging into things too deeply, it looks like you may be
>>>>>> leaking references to dynamic instructions. The CPU may think it's done
>>>>>> with one, but until that final reference is removed, the object will hang
>>>>>> around forever. I think I've had problems before where the reference
>>>>>> count ended up off by one somehow and instructions would start piling up.
>>>>>> It's also possible that a clog develops in O3's pipeline and some internal
>>>>>> structure stops letting instructions through and starts accumulating
>>>>>> them. Either of these problems will be annoying to track down, but with
>>>>>> enough digging I've been able to fix these sorts of things.
>>>>>>
>>>>>> This may have more to do with O3 not handling the benchmark you're
>>>>>> running well rather than a problem with your new DRAM model. There may be
>>>>>> some interaction between the two, though, where the new memory makes the
>>>>>> timing line up to cause O3 to behave poorly. What you can do is instrument
>>>>>> dynamic instruction creation and destruction and reference counting (try
>>>>>> printing "this" for both the reference counting wrapper and the dyn inst
>>>>>> itself) and turn it on as close as you can to where things go bad tick
>>>>>> wise. Then look for an instruction which gets lost, and look for where its
>>>>>> reference count is incremented and decremented. It should be relatively
>>>>>> easy to pair up where references are created and destroyed, and you should
>>>>>> be able to identify the reference which never goes away. Then you need to
>>>>>> figure out where that reference is being created. After that, you should
>>>>>> have enough information to identify why the reference counting isn't being
>>>>>> done correctly. It's arduous, but that's the only way.
>>>>>>
>>>>>> It's important to also make sure reference counts aren't decremented to
>>>>>> zero prematurely. I had a problem once where that happened and the memory
>>>>>> behind the object was updated by something that didn't know it was dead.
>>>>>> The memory had since been reallocated to another object of the same type,
>>>>>> so that other object reflected what happened to the phantom one. If I
>>>>>> remember that manifested as something weird like an add causing a page
>>>>>> fault or something.
>>>>>>
>>>>>> Gabe
>>>>>>
>>>>>> On 04/07/12
18:21, Andrew Cebulski wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>> I've
looked into this problem some more, and have put together a couple
traces. I've been becoming more familiar with how gem5 handles dynamic
instructions, in particular how it destroys them. I have two traces to
compare, one with the physical memory, and the other with the integrated
dramsim2 dram memory. I also have two plots showing instruction counts
over time (sim ticks). All of these are linked at the end of the email.

>>>>>>> First, I'm going to go into what I've been able to interpret
regarding how instructions are destroyed. In particular, comparing when
DynInst's vs. DynInstPtr's are deconstructed/removed from the cpu. I
separate these because I've seen a difference, as I discuss later. These
explanations are fairly non-existent on the wiki. There is a section
header waiting to be filled...
>>>>>>> From what I have been able to
gather from the code, there is a list of all the instructions in flight
in cpu/o3/cpu.cc called instList, with the type DynInstPtr. There are
three conditions to instructions being cleaned from this list:
>>>>>>>
1.) The ROB retires its head instruction
>>>>>>> 2.) Fetch receives a
rob squashing signal from the commit, resulting in removing any
instruction not in the ROB
>>>>>>> 3.) Decode detects an incorrect
branch prediction, resulting in removal of all instructions back to the
bad seq num.
>>>>>>> Once all five stages have completed, the CPU
cleans up all the removed in-flight instructions. This line in
particular in cleanUpRemovedInsts() in cpu/o3/cpu.cc deconstructs a
DynInstPtr:
>>>>>>> instList.erase(removeList.front());
>>>>>>> When I
turn on the debug flag O3CPU, I see the message "Removing instruction,
..." (from o3/cpu.cc) with the threadNum, seqNum and pcState after all 5
cpu stages have completed, and one of the conditions above is met. I
also see what tick it occurs on.
>>>>>>> When I turn on the DynInst
debug flag, I see when instructions are created and destroyed
(cpu/base_dyn_inst_impl.hh) and what tick. From analyzing the trace
files, I've gathered that this takes into account that instructions have
different execution lengths. So if one tick a memory instruction in the
instList (DynInstPtr) is removed, the DynInst for that memory
instruction will occur much later (i.e. 1M ticks later). I have yet to
determine how this is implemented.
>>>>>>> Now for the problem.

>>>>>>> What I'm seeing when I run dramsim2 dram memory is a
significant difference between the size of the instList vector (of
DynInstPtr objects), and the size of dynamic instruction count (of
DynInst objects). The benchmark I'm running is libquantum from SPEC
2006. For the first roughly 130B ticks, the dynamic instruction count
kept in cpu/base_dyn_inst.impl.hh shadows the instList size in o3/cpu.cc
(figure linked below) very closely. Around tick 130B after libquantum
started, it starts hitting what I'm assuming are loops (therefore branch
prediction), resulting in some behavior that seems to imply improper
instruction handling (i.e. more instructions in flight than allowed by
ROB).
>>>>>>> I wasn't able to sync-up the physical and dramsim2 traces
exactly by trace, but they should represent roughly the same area of
execution. They don't execute the same due to the dramsim2 modeling the
memory differently (i.e. latency and other delays).
>>>>>>> I've shared
both traces on my public Dropbox here --
>>>>>>>
http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
[14]
>>>>>>>
http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
[15]
>>>>>>> Here are a couple plots of tick versus instruction count,
with respect to cpu->instcount in cpu/base_dyn_inst.impl.hh and
instList.size() in cpu/o3/cpu.cc. --
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
[16]
>>>>>>>
>>>>>>>
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[17]
>>>>>>> Note that I added the printout of the instList size to an
existing O3CPU DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.

>>>>>>> Here are the commands I ran to parse the traces into data files
to analyze in MATLAB and create the plots:
>>>>>>> zgrep DynInst dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | grep destroyed | awk '{print $1,$11}' > cpuinstcount.out
>>>>>>> zgrep instList dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | awk '{print $1,$11}' > instlistsize.out
>>>>>>> It seems to me like the
problem might lie in gem5, but has just been exposed by integrating this
more detailed memory model, dramsim2, into gem5. Either that, or there
are some timing errors in how dramsim2 was integrated. I doubt this,
however, since those first 190B ticks executed used the dramsim2 memory.
I believe the problem is a combination of memory instructions + complex
loops (branch prediction), resulting in improper destroying of
instructions.
>>>>>>> I've included the ROB, Commit, Fetch, DynInst and
O3CPU debug flags. There are 192 ROB entries, which is why the instList
size generally has a max of about 192 instructions. The dynamic
instruction counts (seen in the dramsim2 plot) seem to also imply that
instructions are incorrectly being removed from the ROB, and then from
the cpu's instruction list in cpu.cc, which allows more and more
instructions to be added to the system (possibly from a bad branch).

>>>>>>> I appreciate any help in debugging this and further figuring
out the root problem, just let me know if you need anything else from
me. I don't have much more time at the moment to debug, but I can take
any advice for quick changes and/or additional traces, then send the
results back to the list for discussion.
>>>>>>> Thanks,
>>>>>>>
Andrew
>>>>>>> P.S. Paul - I did try decreasing the size of the
dramsim2 transaction (and even command) queue from 512 to 32. The same
instructions problem occurred. It basically just decreased the execution
time.
>>>>>>>
>>>>>>> On Wed, Mar 14, 2012 at 2:10 PM, Ali Saidi
<***@umich.edu [18]> wrote:
>>>>>>>
>>>>>>>> The error is that there
are more than 1500 instructions currently in flight in the system. It
could mean several things:
>>>>>>>>
>>>>>>>> 1. The value is somewhat
arbitrarily defined and maybe there are more than 1500 in your system at
one time?
>>>>>>>>
>>>>>>>> 2. Instructions aren't being destroyed
correctly
>>>>>>>>
>>>>>>>> You could try to run a debug binary so
you'll get a list of instructions when it happens or increase the number
which may be appropriate for certain situations (but 1500 is quite a few
inflight instructions).
>>>>>>>>
>>>>>>>> Ali
>>>>>>>>
>>>>>>>> On
13.03.2012 10:56, Andrew Cebulski wrote:
>>>>>>>>
>>>>>>>>> Hi
Xiangyu,
>>>>>>>>> I just started looking into this some more. So at
first I thought it was due to updating to a more recent revision, but
then I went back to revision 8643, added your patch, built and
ran....and now get the error with it too (when running ARM_FS/gem5.opt).
I'm testing now to see if an update to SWIG might have resulted in this
error, maybe someone on the mailing list would know if that's possible.
The difference is 1.3.40 vs. 2.0.3, both of which are supported
according to the dependencies wiki page.
>>>>>>>>> Just for
completeness, here's the error from revision 8643:
>>>>>>>>>
build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void BaseDynInst::initVars()
[with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>>>>>>>
>>>>>>>>>
I have not tried running with gem5.debug, so I will be doing that today.
Maybe this is an assertion that is occurring due to an optimization.
That would mean it wouldn't be triggered in gem5.debug since it runs
without optimizations. Have you tested all debug, opt and fast with your
tests?
>>>>>>>>> Thanks,
>>>>>>>>> Andrew
>>>>>>>>>
>>>>>>>>> On Tue,
Mar 13, 2012 at 1:37 PM, Rio Xiangyu Dong <***@gmail.com [11]>
wrote:
>>>>>>>>>
>>>>>>>>>> Hi Andrew,
>>>>>>>>>>
>>>>>>>>>> I
didn't see this error in my simulations. May I ask which gem5 version
you are using? I find some of the latest code updates do not comply with
my changes. I am still using the DRAMsim2 patch on Gem5 repo8643, and
have run all the runnable benchmarks in SPEC2006, SPEC2000, EEMBC2, and
PARSEC2 on ARM_SE.
>>>>>>>>>>
>>>>>>>>>> Thank you!
>>>>>>>>>>

>>>>>>>>>> Best,
>>>>>>>>>>
>>>>>>>>>> Xiangyu
>>>>>>>>>>

>>>>>>>>>> FROM: Andrew Cebulski [mailto:***@drexel.edu [8]]

>>>>>>>>>> SENT: Thursday, March 08, 2012 6:52 PM
>>>>>>>>>>

>>>>>>>>>> TO: gem5 users mailing list CC:***@gmail.com [9];
***@umich.edu [10]
>>>>>>>>>>
>>>>>>>>>> SUBJECT: Re: [gem5-users] A
Patch for DRAMsim2 Integration
>>>>>>>>>>
>>>>>>>>>> Xiangyu,

>>>>>>>>>>
>>>>>>>>>> I've been having an issue recently with the
number of instructions I've been seeing committed to the CPU (I have a
separate thread on this). It turns out the issue seems to be coming from
this patch you created to integrate DramSim2 with Gem5. Unfortunately,
I've been running with gem5.fast, not gem5.opt. So up until now, I
haven't been seeing assertions. I thought I'd run it with gem5.opt or
debug back in December, but I must not have. My runs on the Arm O3 cpu
fails with this assertion:
>>>>>>>>>>
>>>>>>>>>>
build/ARM/cpu/base_dyn_inst_impl.hh:149: void BaseDynInst::initVars()
[with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>>>>>>>>

>>>>>>>>>> -Andrew
>>>>>>>>>>
>>>>>>>>>>> Date: Sun, 18 Dec 2011
01:48:58 -0800
>>>>>>>>>>> From: "Dong, Xiangyu" <***@gmail.com
[3]>
>>>>>>>>>>> To: "gem5 users mailing list" <gem5-***@gem5.org
[4]>
>>>>>>>>>>> Subject: [gem5-users] A Patch for DRAMsim2 Integration
Message-ID: gmail.com [5]>
>>>>>>>>>>>
>>>>>>>>>>> Content-Type:
text/plain; charset="us-ascii"
>>>>>>>>>>>
>>>>>>>>>>> Hi
all,
>>>>>>>>>>>
>>>>>>>>>>> I have a Gem5+DRAMsim2 patch. I've tested
it under both SE and FS modes.
>>>>>>>>>>> I'm willing to share it
here.
>>>>>>>>>>>
>>>>>>>>>>> For those who have such needs, please go
to my website
>>>>>>>>>>> www.cse.psu.edu/~xydong [6] to download the
patch and test it. To enable
>>>>>>>>>>> DRAMSim2, use se_dramsim2.py
script instead of se.py (for FS, you can create
>>>>>>>>>>> by
yourself). The basic idea to enable the DRAMsim2 module is to use
the
>>>>>>>>>>> derived DRAMMemory class instead of PhysicalMemory
class.
>>>>>>>>>>>
>>>>>>>>>>> Please let me know if there are
bugs.
>>>>>>>>>>>
>>>>>>>>>>> Thank you!
>>>>>>>>>>>
>>>>>>>>>>>
Best,
>>>>>>>>>>>
>>>>>>>>>>> Xiangyu Dong
>>>>>>>>>>>
>>>>>>>>>>>
-------------- next part --------------
>>>>>>>>>>> An HTML attachment
was scrubbed...
>>>>>>>>>>> URL:
<http://m5sim.org/cgi-bin/mailman/private/gem5-users/attachments/20111218/f3fdf5da/attachment.html
[7]>
>>>>>>>>>
>>>>>>>>>
_______________________________________________
>>>>>>>>> gem5-users
mailing list
>>>>>>>>> gem5-***@gem5.org [12]
>>>>>>>>>
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users [13]
>>>>>>>

>>>>>>> _______________________________________________
>>>>>>>
gem5-users mailing list
>>>>>>>
gem5-***@gem5.orghttp://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
>>>>>>

>>>>>> _______________________________________________
>>>>>>
gem5-users mailing list
>>>>>> gem5-***@gem5.org [19]
>>>>>>
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users [20]
>>>>>
>>>>>
_______________________________________________
>>>>> gem5-users mailing
list
>>>>> gem5-***@gem5.org [24]
>>>>>
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users [25]
>>
>>
_______________________________________________
>> gem5-users mailing
list
>> gem5-***@gem5.org [28]
>>
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users [29]




Links:
------
[1] http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[2] mailto:***@drexel.edu
[3] mailto:***@gmail.com
[4] mailto:gem5-***@gem5.org
[5] http://gmail.com
[6] http://www.cse.psu.edu/%7Exydong
[7] http://m5sim.org/cgi-bin/mailman/private/gem5-users/attachments/20111218/f3fdf5da/attachment.html
[8] mailto:***@drexel.edu
[9] mailto:***@gmail.com
[10] mailto:***@umich.edu
[11] mailto:***@gmail.com
[12] mailto:gem5-***@gem5.org
[13] http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[14] http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
[15] http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
[16] http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
[17] http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[18] mailto:***@umich.edu
[19] mailto:gem5-***@gem5.org
[20] http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[21] http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
[22] http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
[23] mailto:***@eecs.umich.edu
[24] mailto:gem5-***@gem5.org
[25] http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[26] http://dl.dropbox.com/u/2953302/gem5/tlb.out
[27] mailto:***@umich.edu
[28] mailto:gem5-***@gem5.org
[29] http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[30] http://dl.dropbox.com/u/2953302/gem5/err.0
[31] mailto:***@umich.edu
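Gabe's suggestion in the quoted history above (log every reference that is
taken and dropped, then pair them up to find the one that never goes away)
can be sketched as a small offline script. Everything about the input is an
assumption: gem5 does not emit these events by itself, and the tuple layout
(op, wrapper address, instruction address, call site) is a hypothetical
format one might produce after adding prints of "this" to the reference
counting wrapper and to the dyn inst, as Gabe describes.

```python
def find_unreleased(events):
    """Pair acquire/release events and report references never released.

    `events` is an iterable of (op, wrapper, inst, where) tuples, where op is
    "acquire" or "release", `wrapper` identifies the reference-counting
    pointer object, `inst` identifies the dynamic instruction, and `where`
    is a free-form label for the code location. All four are hypothetical
    fields of a hand-instrumented trace.
    """
    open_refs = {}  # (inst, wrapper) -> call site that took the reference
    for op, wrapper, inst, where in events:
        key = (inst, wrapper)
        if op == "acquire":
            open_refs[key] = where
        elif op == "release":
            # A release for an unknown pair is ignored in this sketch.
            open_refs.pop(key, None)

    # Group the leftovers by instruction so a lost instruction stands out.
    leaks = {}
    for (inst, wrapper), where in open_refs.items():
        leaks.setdefault(inst, []).append((wrapper, where))
    return leaks
```

An instruction that shows up in the result after it has been removed from
the CPU's instruction list is a candidate for the "one reference count
hanging" case Andrew describes, and the recorded call site points at where
that reference was created.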
Andrew Cebulski
2012-05-03 01:23:22 UTC
Permalink
Sorry, I created the checkpoint I referred to with an O3 CPU with caches.
From what I recall reading, caches don't get restored from checkpoints.
Since the checkpoint wasn't during the benchmark run, I assumed that was
okay.

-Andrew
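
One way to locate the episodes discussed in this thread, where the DynInst
count climbs past what the 192-entry ROB should allow, is to scan the
(tick, count) pairs for runs above that bound. A minimal sketch, assuming the
samples have already been parsed into integer pairs sorted by tick; the
function name is hypothetical and only the 192 figure comes from the thread.

```python
def runs_above_bound(samples, bound=192):
    """Return (start_tick, end_tick, peak) for each run of counts > bound.

    `samples` is a list of (tick, count) integer pairs in tick order.
    """
    runs = []
    current = None  # [start_tick, end_tick, peak] of the run being built
    for tick, count in samples:
        if count > bound:
            if current is None:
                current = [tick, tick, count]
            else:
                current[1] = tick
                current[2] = max(current[2], count)
        elif current is not None:
            runs.append(tuple(current))
            current = None
    if current is not None:
        runs.append(tuple(current))
    return runs
```

The start ticks of the reported runs are natural points at which to switch
on the more expensive debug flags, which keeps the trace files small.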

On Wed, May 2, 2012 at 9:07 PM, Ali Saidi <***@umich.edu> wrote:

> You haven't answered the question about if you created the checkpoints
> with an atomic cpu without caches.
>
> Ali
>
>
>
>
>
> On 02.05.2012 19:58, Andrew Cebulski wrote:
>
> I have not run with the checker CPU recently. Here's the stderr output
> from a run I did awhile back:
> http://dl.dropbox.com/u/2953302/gem5/err.0
> Note that the instruction match error is before my benchmark actually
> starts running. The start of my boot script checks to see if my files
> image is mounted (which it is), then continues on to run the benchmark. I
> booted the system, mounted my files image, then took a checkpoint. I've
> been running all my tests from that checkpoint. I found where my benchmark
> started based on the ASID (from ExecAsid debug flag).
> I delayed the start of gathering trace data until the second-to-last
> linear increase in dynamic instructions in-flight. I'm running a new trace
> now.
> -Andrew
>
>
> On Wed, May 2, 2012 at 5:28 PM, Ali Saidi <***@umich.edu> wrote:
>
>> Something is wrong well before this point. There is no reason that
>> address 0x0 or 0x4 should be translated.
>>
>> Did you happen to create a checkpoint when caches were in the system?
>>
>> Have you tried to run with the checker cpu and see if it detects any
>> errors?
>>
>>
>>
>> Ali
>>
>>
>>
>>
>>
>> On 02.05.2012 17:22, Andrew Cebulski wrote:
>>
>> They are data TLB misses that occur as the in-flight instruction count
>> rises (at 0x0 and 0x4). The last TLB miss before the in-flight instruction
>> count finally linearly decreases is to 0x200. Also, at the start of the
>> rising slope, I see a miss to 0x8 and 0x2508c.
>> Here's a trace file:
>> http://dl.dropbox.com/u/2953302/gem5/tlb.out
>> To reduce size, I just have lines that have either TLB or walker in them.
>> I do see only a handful of instruction TLB misses.
>> -Andrew
>>
>> On Wed, May 2, 2012 at 11:10 AM, Ali Saidi <***@umich.edu> wrote:
>>
>>> Hi Andrew,
>>>
>>>
>>>
>>> Thanks for digging into this. I think there is an issue somewhere, but
>>> I'm still not sure where.
>>>
>>> Ali
>>>
>>> On 01.05.2012 23:34, Andrew Cebulski wrote:
>>>
>>> Okay, I'm positive now that the issue lies with delayed translations
>>> that are squashed before finishing.
>>>
>>> On the data or instruction side? You seem to allude to data in the
>>> paragraph below, but then instructions in the latter text.
>>>
>>> It seems to me like speculative load/stores are being executed, rather
>>> than waiting for the instructions to commit. Once the instructions begin
>>> getting (speculatively) executed in the TLB, a reference is left there,
>>> which seems hard to root out and dereference after the instruction ends up
>>> being squashed. At least, I have not been able to find that out in the
>>> source code as of yet. Can anyone clarify on this?
>>>
>>>
>>>
>>> There should only be one translation outstanding from each instruction
>>> and data side walker. Any nested transactions should be queued in the
>>> walker. Until one finishes, I'm not sure how multiple would ever be
>>> outstanding.
>>>
>>> Recall the following image that shows how the number of dynamic
>>> instruction (DynInst) objects in-flight increases linearly for varying
>>> periods of time:
>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>>> After enabling the TLB debug flag, I see that the linear increase in
>>> instructions in flight is proportional to the number of TLB misses. These
>>> TLB misses have a much larger delay (resulting in translation delays)
>>> because DramSim2 models the memory system more accurately. It
>>> seems that with the classic memory system, TLB misses often do not have
>>> translation delays. For whatever reason, it would also seem that every
>>> instruction that has a TLB miss also is eventually squashed...
>>>
>>> From a data side perspective this is reasonable. While a miss is
>>> outstanding at some point instructions will stop committing and thus the
>>> instructions in flight will begin to rise until the miss is satisfied.
>>>
>>> Here's a summary of outputs from my trace. These two DPRINTF messages
>>> appear on the rising slopes (repeated up until the peak):
>>> TLB Miss: Starting hardware table walker for 0(656)
>>> TLB Miss: Starting hardware table walker for 0x4(656)
>>>
>>> This is interesting/odd. I don't know a good reason why (1) a miss
>>> would be outstanding to both address 0 and address 4 at the same time. In
>>> almost all cases these pages are marked as no-access to detect segfaults.
>>> Perhaps there is an issue where the cpu is getting into a loop faulting on
>>> a bad access and then faulting again on the fault handler. I could imagine
>>> this would happen if there was some corruption in the memory system (for
>>> example the timings in dramsim exposing a bug in the cache models or
>>> something).
>>>
>>>
>>> At the peak, the following message appears (from fetch) almost every
>>> tick for (what I believe to be) every single one of the table walkers that
>>> were squashed.
>>> Fetch is waiting ITLB walk to finish!
>>>
>>> There must be another walk in flight? The instruction side will only
>>> have one fault outstanding at once. Successive branch mispredicts will
>>> re-direct fetch but there is code that catches the fact that a different
>>> walk completed than expected and "does the right thing."
>>>
>>> The problem is that these ITLB table walks are for instructions that
>>> were squashed as much as 0.3 billion cycles earlier, and have since been
>>> removed from the CPU's instruction list.
>>>
>>> I'm not following here.
>>>
>>> Any help will be greatly appreciated in solving this problem. I've
>>> hit a roadblock with getting Ruby working with ARM, most likely due to the
>>> fact that ARM has disjoint memory (x86 and Alpha do not). There's the 256
>>> MB for physical memory, then the 64 MB for the boot loader. I brought this
>>> up in my last email about trying to get Ruby working. Therefore, I'm
>>> trying to get this DramSim2 integration fixed so I can start modeling FS
>>> with DRAM memory.
>>>
>>> Brad/Steve/Nilay anyone have a suggestion on how to make this work?
>>>
>>>
>>> Note that these problems also occur in Soplex from the Spec CPU2006
>>> benchmark suite (also hits 1500 in-flight instructions assertion). Due to
>>> time constraints, I haven't tested on other benchmarks.
>>> Thanks,
>>> Andrew
>>> On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski <***@drexel.edu> wrote:
>>>
>>>> Hey Gabe,
>>>> Thanks for this...very helpful. I just recently got back into
>>>> debugging this problem. I made a small change in src/base/refcnt.hh to
>>>> allow me to return the current count of references to a DynInst object.
>>>> I then modified existing DPRINTFs to also print out reference
>>>> counts, then added some of my own when I needed extra visibility.
>>>> I've found one memory store instruction that seems to be getting
>>>> lost. What's happening is that it progresses as far as getting executed in
>>>> the IEW once, but a delayed translation occurs, deferring the store. By
>>>> the time it reenters the IEW, the IQ has marked the instruction as
>>>> squashed. Everything progresses as usual from here on out, with one
>>>> exception. When the instruction is removed from the CPU's instruction list,
>>>> there is one reference count hanging.
>>>> I've added in some additional debugging for my traces to help
>>>> narrow down where this reference is coming from. As far as I can tell,
>>>> it's because of a call to initiateAcc() within the executeStore function in
>>>> the lsq unit. Please see the following two traces. The first trace shows
>>>> what I just discussed. The second trace is another memory store
>>>> instruction that got squashed, however, it was squashed upon its first
>>>> entry into the IEW, therefore it never started execution.
>>>> http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
>>>> http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
>>>> Let me know if you have any ideas based on these two instruction
>>>> traces. I do not understand how the initiateAcc function results in
>>>> another reference, but maybe someone else does.... Since I don't see how
>>>> it makes a reference, it's hard to find out how to make sure it gets
>>>> dereferenced...
>>>> Unfortunately, I haven't been able to add a DPRINTF in
>>>> src/base/refcnt.hh ...this would make things more clear (i.e. exactly when
>>>> references/dereferences occur). Let me know if you have any advice on
>>>> this...if it's possible. I can't seem to get the right include files, and
>>>> likely the right SConscript compile order...
>>>> Thanks,
>>>> Andrew
>>>>
>>>>
>>>> On Sat, Apr 7, 2012 at 9:48 PM, Gabe Black <***@eecs.umich.edu> wrote:
>>>>
>>>>> Without digging into things too deeply, it looks like you may be
>>>>> leaking references to dynamic instructions. The CPU may think it's done
>>>>> with one, but until that final reference is removed, the object will hang
>>>>> around forever. I think I've had problems before where the reference
>>>>> count ended up off by one somehow and instructions would start piling up.
>>>>> It's also possible that a clog develops in O3's pipeline and some internal
>>>>> structure stops letting instructions through and starts accumulating them.
>>>>> Either of these problems will be annoying to track down, but with enough
>>>>> digging I've been able to fix these sorts of things.
>>>>>
>>>>> This may have more to do with O3 not handling the benchmark you're
>>>>> running well rather than a problem with your new DRAM model. There may be
>>>>> some interaction between the two, though, where the new memory makes the
>>>>> timing line up to cause O3 to behave poorly. What you can do is instrument
>>>>> dynamic instruction creation and destruction and reference counting (try
>>>>> printing "this" for both the reference counting wrapper and the dyn inst
>>>>> itself) and turn it on as close as you can to where things go bad
>>>>> tick-wise. Then look for an instruction which gets lost, and look for where
>>>>> its reference count is incremented and decremented. It should be relatively
>>>>> easy to pair up where references are created and destroyed, and you should
>>>>> be able to identify the reference which never goes away. Then you need to
>>>>> figure out where that reference is being created. After that, you should
>>>>> have enough information to identify why the reference counting isn't being
>>>>> done correctly. It's arduous, but that's the only way.
>>>>>
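The pairing procedure described above can be sketched as a small script. This is only an illustration, not gem5 code: it assumes the debug trace has already been parsed into (op, seq_num, holder) tuples, and the holder names and the sequence number below are hypothetical.

```python
from collections import Counter

def find_leaked_refs(events):
    """Return (seq_num, holder) pairs whose increments were never matched.

    `events` is an iterable of (op, seq_num, holder) tuples where op is
    "inc" or "dec". This tuple layout is an assumption of the sketch; a
    real trace would have to be parsed into it first.
    """
    outstanding = Counter()
    for op, seq_num, holder in events:
        key = (seq_num, holder)
        if op == "inc":
            outstanding[key] += 1
        elif op == "dec":
            outstanding[key] -= 1
        else:
            raise ValueError(f"unknown op: {op!r}")
    # Anything still positive is a reference that was taken but never released.
    return {key: n for key, n in outstanding.items() if n > 0}

# Hypothetical event stream for one squashed store instruction.
events = [
    ("inc", 656, "instList"),
    ("inc", 656, "lsq"),
    ("inc", 656, "translation"),
    ("dec", 656, "lsq"),
    ("dec", 656, "instList"),
]
leaks = find_leaked_refs(events)
print(leaks)
```

Whatever holder is left in the result is the place to start reading the source for a missing release.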
>>>>> It's important to also make sure reference counts aren't decremented
>>>>> to zero prematurely. I had a problem once where that happened and the
>>>>> memory behind the object was updated by something that didn't know it was
>>>>> dead. The memory had since been reallocated to another object of the same
>>>>> type, so that other object reflected what happened to the phantom one. If I
>>>>> remember that manifested as something weird like an add causing a page
>>>>> fault or something.
>>>>>
>>>>> Gabe
>>>>>
>>>>>
>>>>> On 04/07/12 18:21, Andrew Cebulski wrote:
>>>>>
>>>>> Hi all,
>>>>> I've looked into this problem some more, and have put together a
>>>>> couple traces. I've been becoming more familiar with how gem5 handles
>>>>> dynamic instructions, in particular how it destroys them. I have two
>>>>> traces to compare, one with the physical memory, and the other with the
>>>>> integrated dramsim2 dram memory. I also have two plots showing instruction
>>>>> counts over time (sim ticks). All of these are linked at the end of the
>>>>> email.
>>>>> First, I'm going to go into what I've been able to interpret regarding
>>>>> how instructions are destroyed. In particular, comparing when DynInst's
>>>>> vs. DynInstPtr's are deconstructed/removed from the cpu. I separate these
>>>>> because I've seen a difference, as I discuss later. These explanations are
>>>>> fairly non-existent on the wiki. There is a section header waiting to be
>>>>> filled...
>>>>> From what I have been able to gather from the code, there is a list of
>>>>> all the instructions in flight in cpu/o3/cpu.cc called instList, with the
>>>>> type DynInstPtr. There are three conditions to instructions being cleaned
>>>>> from this list:
>>>>> 1.) The ROB retires its head instruction
>>>>> 2.) Fetch receives a rob squashing signal from the commit, resulting
>>>>> in removing any instruction not in the ROB
>>>>> 3.) Decode detects an incorrect branch prediction, resulting in
>>>>> removal of all instructions back to the bad seq num.
>>>>> Once all five stages have completed, the CPU cleans up all the removed
>>>>> in-flight instructions. This line in particular
>>>>> in cleanUpRemovedInsts() in cpu/o3/cpu.cc deconstructs a DynInstPtr:
>>>>> instList.erase(removeList.front());
>>>>> When I turn on the debug flag O3CPU, I see the message "Removing
>>>>> instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum and pcState
>>>>> after all 5 cpu stages have completed, and one of the conditions above is
>>>>> met. I also see what tick it occurs on.
>>>>> When I turn on the DynInst debug flag, I see when instructions are
>>>>> created and destroyed (cpu/base_dyn_inst_impl.hh) and what tick. From
>>>>> analyzing the trace files, I've gathered that this takes into account that
>>>>> instructions have different execution lengths. So if a memory instruction
>>>>> is removed from the instList (DynInstPtr) on one tick, the DynInst for that
>>>>> memory instruction is destroyed much later (e.g. 1M ticks later). I have yet
>>>>> to determine how this is implemented.
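The gap between removing a pointer from the list and the object actually being destroyed is what reference counting would produce whenever something else still holds the object. A minimal sketch of that effect, using Python's own reference counting rather than gem5's RefCountingPtr; the DynInst class and the pending_walk holder are stand-ins invented for this example, and the immediate destruction relies on CPython's behaviour:

```python
import weakref

class DynInst:
    """Stand-in for a dynamic instruction; only the sequence number matters."""
    def __init__(self, seq_num):
        self.seq_num = seq_num

inst_list = [DynInst(656)]         # the CPU's list of in-flight instructions
probe = weakref.ref(inst_list[0])  # observes destruction without owning the object
pending_walk = inst_list[0]        # a second owner (assumed: a still-pending table walk)

inst_list.clear()                  # the CPU removes the instruction from its list
alive_after_removal = probe() is not None

pending_walk = None                # the last owner lets go
alive_after_release = probe() is not None

print(len(inst_list), alive_after_removal, alive_after_release)
```

The list is empty after the first step, yet the object only goes away once the second owner drops it. If that owner never does, the object count keeps growing while the list size stays bounded.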
>>>>> Now for the problem.
>>>>> What I'm seeing when I run dramsim2 dram memory is a significant
>>>>> difference between the size of the instList vector (of DynInstPtr objects),
>>>>> and the size of dynamic instruction count (of DynInst objects). The
>>>>> benchmark I'm running is libquantum from SPEC 2006. For the first roughly
>>>>> 130B ticks, the dynamic instruction count kept in cpu/base_dyn_inst_impl.hh
>>>>> shadows the instList size in o3/cpu.cc (figure linked below) very closely.
>>>>> Around tick 130B after libquantum started, it starts hitting what I'm
>>>>> assuming are loops (therefore branch prediction), resulting in some
>>>>> behavior that seems to imply improper instruction handling (i.e. more
>>>>> instructions in flight than allowed by ROB).
>>>>> I wasn't able to sync-up the physical and dramsim2 traces exactly by
>>>>> trace, but they should represent roughly the same area of execution. They
>>>>> don't execute the same due to the dramsim2 modeling the memory differently
>>>>> (i.e. latency and other delays).
>>>>> I've shared both traces on my public Dropbox here --
>>>>>
>>>>> http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
>>>>>
>>>>> http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
>>>>> Here are a couple plots of tick versus instruction count, with respect
>>>>> to cpu->instcount in cpu/base_dyn_inst_impl.hh and instList.size() in
>>>>> cpu/o3/cpu.cc. --
>>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
>>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>>>>> Note that I added the printout of the instList size to an existing
>>>>> O3CPU DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.
>>>>> Here are the commands I ran to parse the traces into data files to
>>>>> analyze in MATLAB and create the plots:
>>>>> zgrep DynInst
>>>>> dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | grep destroyed
>>>>> | awk '{print $1,$11}' > cpuinstcount.out
>>>>> zgrep instList
>>>>> dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | awk '{print
>>>>> $1,$11}' > instlistsize.out
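For anyone without zgrep/awk at hand, the two pipelines above could be approximated in Python roughly as follows. This is a sketch under assumptions: the function name is made up, matching is by plain substring, and the demo trace lines are synthetic placeholders rather than real gem5 output.

```python
import gzip
import os
import tempfile

def extract_fields(path, must_contain, fields=(1, 11)):
    """Yield the selected 1-based whitespace-separated fields (awk-style)
    from every line of a gzip-compressed trace that contains all of the
    given substrings."""
    with gzip.open(path, "rt", errors="replace") as trace:
        for line in trace:
            if not all(s in line for s in must_contain):
                continue
            cols = line.split()
            # awk prints an empty string for a missing field; do the same.
            yield tuple(cols[i - 1] if i <= len(cols) else "" for i in fields)

# Synthetic three-line trace, only to show the call; not real gem5 output.
demo_lines = [
    "100: cpu DynInst: w4 w5 w6 w7 created w9 w10 41",
    "200: cpu DynInst: w4 w5 w6 w7 destroyed w9 w10 42",
    "300: cpu Fetch: short line",
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "trace.out.gz")
    with gzip.open(path, "wt") as f:
        f.write("\n".join(demo_lines) + "\n")
    rows = list(extract_fields(path, ("DynInst", "destroyed")))
print(rows)
```

Calling it with ("instList",) instead would correspond to the second pipeline.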
>>>>> It seems to me like the problem might lie in gem5, but has just been
>>>>> exposed by integrating this more detailed memory model, dramsim2, into
>>>>> gem5. Either that, or there are some timing errors in how dramsim2 was
>>>>> integrated. I doubt this, however, since those first 190B ticks executed
>>>>> used the dramsim2 memory. I believe the problem is a combination of memory
>>>>> instructions + complex loops (branch prediction), resulting in improper
>>>>> destroying of instructions.
>>>>> I've included the ROB, Commit, Fetch, DynInst and O3CPU debug flags.
>>>>> There are 192 ROB entries, which is why the instList size generally has a
>>>>> max of about 192 instructions. The dynamic instruction counts (seen in the
>>>>> dramsim2 plot) seem to also imply that instructions are incorrectly being
>>>>> removed from the ROB, and then from the cpu's instruction list in cpu.cc,
>>>>> which allows more and more instructions to be added to the system (possibly
>>>>> from a bad branch).
>>>>> I appreciate any help in debugging this and further figuring out the
>>>>> root problem, just let me know if you need anything else from me. I don't
>>>>> have much more time at the moment to debug, but I can take any advice for
>>>>> quick changes and/or additional traces, then send the results back to the
>>>>> list for discussion.
>>>>> Thanks,
>>>>> Andrew
>>>>> P.S. Paul - I did try decreasing the size of the dramsim2 transaction
>>>>> (and even command) queue from 512 to 32. The same instructions problem
>>>>> occurred. It basically just decreased the execution time.
>>>>>
>>>>> On Wed, Mar 14, 2012 at 2:10 PM, Ali Saidi <***@umich.edu> wrote:
>>>>>
>>>>>> The error is that there are more than 1500 instructions currently
>>>>>> in flight in the system. It could mean several things:
>>>>>>
>>>>>> 1. The value is somewhat arbitrarily defined and maybe there are more
>>>>>> than 1500 in your system at one time?
>>>>>>
>>>>>> 2. Instructions aren't being destroyed correctly
>>>>>>
>>>>>> You could try to run a debug binary so you'll get a list of
>>>>>> instructions when it happens or increase the number which may
>>>>>> be appropriate for certain situations (but 1500 is quite a few inflight
>>>>>> instructions).
>>>>>>
>>>>>> Ali
>>>>>>
>>>>>> On 13.03.2012 10:56, Andrew Cebulski wrote:
>>>>>>
>>>>>> Hi Xiangyu,
>>>>>> I just started looking into this some more. So at first I
>>>>>> thought it was due to updating to a more recent revision, but then I went
>>>>>> back to revision 8643, added your patch, built and ran....and now get the
>>>>>> error with it too (when running ARM_FS/gem5.opt). I'm testing now to see
>>>>>> if an update to SWIG might have resulted in this error, maybe someone on
>>>>>> the mailing list would know if that's possible. The difference is 1.3.40
>>>>>> vs. 2.0.3, both of which are supported according to the dependencies wiki
>>>>>> page.
>>>>>> Just for completeness, here's the error from revision 8643:
>>>>>> build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
>>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>>>> I have not tried running with gem5.debug, so I will be doing that
>>>>>> today. Maybe this is an assertion that is occurring due to an
>>>>>> optimization. That would mean it wouldn't be triggered in gem5.debug since
>>>>>> it runs without optimizations. Have you tested all debug, opt and fast
>>>>>> with your tests?
>>>>>> Thanks,
>>>>>> Andrew
>>>>>>
>>>>>> On Tue, Mar 13, 2012 at 1:37 PM, Rio Xiangyu Dong <
>>>>>> ***@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I didn’t see this error in my simulations. May I ask which gem5
>>>>>>> version you are using? I find some of the latest code updates do not comply
>>>>>>> with my changes. I am still using the DRAMsim2 patch on Gem5 repo8643, and
>>>>>>> have run all the runnable benchmarks in SPEC2006, SPEC2000, EEMBC2, and
>>>>>>> PARSEC2 on ARM_SE.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thank you!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Xiangyu
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *From:* Andrew Cebulski [mailto:***@drexel.edu]
>>>>>>> *Sent:* Thursday, March 08, 2012 6:52 PM
>>>>>>>
>>>>>>> *To:* gem5 users mailing list
>>>>>>> *Cc:* ***@gmail.com; ***@umich.edu
>>>>>>>
>>>>>>> *Subject:* Re: [gem5-users] A Patch for DRAMsim2 Integration
>>>>>>>
>>>>>>> Xiangyu,
>>>>>>>
>>>>>>> I've been having an issue recently with the number of
>>>>>>> instructions I've been seeing committed to the CPU (I have a separate
>>>>>>> thread on this). It turns out the issue seems to be coming from this patch
>>>>>>> you created to integrate DramSim2 with Gem5. Unfortunately, I've been
>>>>>>> running with gem5.fast, not gem5.opt. So up until now, I haven't been
>>>>>>> seeing assertions. I thought I'd run it with gem5.opt or debug back in
>>>>>>> December, but I must not have. My runs on the Arm O3 cpu fails with this
>>>>>>> assertion:
>>>>>>>
>>>>>>> build/ARM/cpu/base_dyn_inst_impl.hh:149: void
>>>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>>>>>
>>>>>>> -Andrew
>>>>>>>
>>>>>>> Date: Sun, 18 Dec 2011 01:48:58 -0800
>>>>>>> From: "Dong, Xiangyu" <***@gmail.com>
>>>>>>> To: "gem5 users mailing list" <gem5-***@gem5.org>
>>>>>>> Subject: [gem5-users] A Patch for DRAMsim2 Integration
>>>>>>> Message-ID: <001201ccbd6a$45102210$cf306630$@gmail.com>
>>>>>>>
>>>>>>> Content-Type: text/plain; charset="us-ascii"
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I have a Gem5+DRAMsim2 patch. I've tested it under both SE and FS
>>>>>>> modes.
>>>>>>> I'm willing to share it here.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> For those who have such needs, please go to my website
>>>>>>> www.cse.psu.edu/~xydong <http://www.cse.psu.edu/%7Exydong> to
>>>>>>> download the patch and test it. To enable
>>>>>>> DRAMSim2, use se_dramsim2.py script instead of se.py (for FS, you
>>>>>>> can create
>>>>>>> by yourself). The basic idea to enable the DRAMsim2 module is to
>>>>>>> use the
>>>>>>> derived DRAMMemory class instead of PhysicalMemory class.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Please let me know if there are bugs.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thank you!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Xiangyu Dong
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> _______________________________________________
>>>>>> gem5-users mailing list
>>>>>> gem5-***@gem5.org
>>>>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
Ali Saidi
2012-05-03 01:53:11 UTC
Permalink
It's likely the cause for all of your problems. Dirty data in the
caches doesn't get restored either. You should always create checkpoints
with an atomic cpu and without caches.

Ali
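
A toy model of why this matters, offered as a sketch only. It is not gem5's checkpoint code: it simply assumes, following the statement above, that a checkpoint captures backing memory but not dirty lines sitting in a write-back cache. All class and function names here are invented for the example.

```python
class ToyWriteBackCache:
    """Minimal write-back cache in front of a dict-based memory."""
    def __init__(self, memory):
        self.memory = memory   # backing store: address -> value
        self.dirty = {}        # lines written but not yet written back

    def write(self, addr, value):
        self.dirty[addr] = value

    def read(self, addr):
        return self.dirty.get(addr, self.memory.get(addr, 0))

def checkpoint(memory):
    """Toy checkpoint: saves only the backing memory, not cache contents."""
    return dict(memory)

# Case 1: checkpoint taken while the newest value lives only in the cache.
memory = {0x1000: 1}
cache = ToyWriteBackCache(memory)
cache.write(0x1000, 2)
snapshot = checkpoint(memory)
restored = ToyWriteBackCache(dict(snapshot))  # restore starts with a cold cache
stale = restored.read(0x1000)

# Case 2: no cache in the system, so every write reaches memory directly.
plain_memory = {0x1000: 1}
plain_memory[0x1000] = 2
fresh = ToyWriteBackCache(checkpoint(plain_memory)).read(0x1000)

print(stale, fresh)
```

In the first case the restored system reads the old value, which is the kind of silent corruption that could show up later as odd faults; in the second case nothing is lost.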

On 02.05.2012 21:23,
Andrew Cebulski wrote:

> Sorry, I created the checkpoint I referred to
with an O3 CPU with caches. From what I recall reading, caches don't get
restored from checkpoints. Since the checkpoint wasn't during the
benchmark run, I assumed that was okay.
> -Andrew
>
> On Wed, May 2,
2012 at 9:07 PM, Ali Saidi <***@umich.edu [34]> wrote:
>
>> You
haven't answered the question about if you created the checkpoints with
an atomic cpu without caches.
>>
>> Ali
>>
>> On 02.05.2012 19:58,
Andrew Cebulski wrote:
>>
>>> I have not run with the checker CPU
recently. Here's the stderr output from a run I did awhile back:
>>>
http://dl.dropbox.com/u/2953302/gem5/err.0 [30]
>>> Note that the
instruction match error is before my benchmark actually starts running.
The start of my boot script checks to see if my files image is mounted
(which it is), then continues on to run the benchmark. I booted the
system, mounted my files image, then took a checkpoint. I've been
running all my tests from that checkpoint. I found where my benchmark
started based on the ASID (from ExecAsid debug flag).
>>> I delayed the
start of gathering trace data until the second-to-last linear increase
in dynamic instructions in-flight. I'm running a new trace now.
>>>
-Andrew
>>>
>>> On Wed, May 2, 2012 at 5:28 PM, Ali Saidi
<***@umich.edu [31]> wrote:
>>>
>>>> Something is wrong well before
this point. There is no reason that address 0x0 or 0x4 should be
translated.
>>>>
>>>> Did you happen to create a checkpoint when
caches were in the system?
>>>>
>>>> Have you tried to run with the
checker cpu and see if it detects any errors?
>>>>
>>>> Ali
>>>>

>>>> On 02.05.2012 17:22, Andrew Cebulski wrote:
>>>>
>>>>> They are
data TLB misses that occur as the in-flight instruction count rises (at
0x0 and 0x4). The last TLB miss before the in-flight instruction count
finally linearly decreases is to 0x200. Also, at the start of the rising
slope, I see a miss to 0x8 and 0x2508c.
>>>>>
>>>>> Here's a trace
file:
>>>>>
>>>>> http://dl.dropbox.com/u/2953302/gem5/tlb.out [26]

>>>>> To reduce size, I just have lines that have either TLB or walker
in them.
>>>>> I do see only a handful of instruction TLB misses.

>>>>>
>>>>> -Andrew
>>>>>
>>>>> On Wed, May 2, 2012 at 11:10 AM, Ali
Saidi <***@umich.edu [27]> wrote:
>>>>>
>>>>>> Hi Andrew,
>>>>>>

>>>>>> Thanks for digging into this. I think there is an issue
somewhere, but I'm still not sure where.
>>>>>>
>>>>>> Ali
>>>>>>

>>>>>> On 01.05.2012 23:34, Andrew Cebulski wrote:
>>>>>>
>>>>>>>
Okay, I'm positive now that the issue lies with delayed translations
that are squashed before finishing.
>>>>>>
>>>>>> On the data or
instruction side? You seem to allude to data in the paragraph below, but
then instructions in the latter text.
>>>>>>
>>>>>>> It seems to me
like speculative load/stores are being executed, rather than waiting for
the instructions to commit. Once the instructions begin getting
(speculatively) executed in the TLB, a reference is left there, which
seems hard to root out and dereference after the instruction ends up
being squashed. At least, I have not been able to find that out in the
source code as of yet. Can anyone clarify on this?
>>>>>>
>>>>>> There
should only be one translation outstanding from each instruction and
data side walker. Any nested transactions should be queued in the
walker. Until one finishes, I'm not sure how multiple would ever be
outstanding.
>>>>>>
>>>>>>> Recall the following image that shows how the number of dynamic
instruction (DynInst) objects in-flight increases linearly for
varying periods of time:
>>>>>>>
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[1]
>>>>>>> After enabling the TLB debug flag, I see that the linear
increase in instructions in flight is proportional to the number of TLB
misses. These TLB misses have a much larger delay (resulting in
translation delays) because DramSim2 models the memory
system more accurately. It seems that with the classic memory system,
TLB misses often do not have translation delays. For whatever reason, it
would also seem that every instruction that has a TLB miss also is
eventually squashed...
>>>>>>>
>>>>>>> From a data side perspective
this is reasonable. While a miss is outstanding at some point instructions
will stop committing and thus the instructions in flight will begin to
rise until the miss is satisfied.
>>>>>>
>>>>>> Here's a summary of
outputs from my trace. These two DPRINTF messages appear on the rising
slopes (repeated up until the peak):
>>>>>> TLB Miss: Starting hardware table walker for 0(656)
>>>>>> TLB Miss: Starting hardware table walker for 0x4(656)
>>>>>>>
>>>>>>> This is interesting/odd. I don't know a good
reason why (1) a miss would be outstanding to both address 0 and address
4 at the same time. In almost all cases these pages are marked as
no-access to detect segfaults. Perhaps there is an issue where the cpu is
getting into a loop faulting on a bad access and then faulting again on
the fault handler. I could imagine this would happen if there was some
corruption in the memory system (for example the timings in dramsim
exposing a bug in the cache models or something).
>>>>>>
>>>>>> At the
peak, the following message appears (from fetch) almost every tick for
(what I believe to be) every single one of the table walkers that were
squashed.
>>>>>> Fetch is waiting ITLB walk to finish!
>>>>>>
>>>>>>
There must be another walk in flight? The instruction side will only
have one fault outstanding at once. Successive branch mispredicts
will re-direct fetch but there is code that catches the fact that a
different walk completed than expected and "does the right thing."
>>>>>>>
>>>>>>> The
problem is that these ITLB table walks are for instructions that
were squashed as much as 0.3 billion cycles earlier, and have since been
removed from the CPU's instruction list.
>>>>>>
>>>>>> I'm not following here.

>>>>>>
>>>>>> Any help will be greatly appreciated in solving this
problem. I've hit a roadblock with getting Ruby working with ARM, most
likely due to the fact that ARM has disjoint memory (x86 and Alpha do
not). There's the 256 MB for physical memory, then the 64 MB for the boot
loader. I brought this up in my last email about trying to get Ruby working.
Therefore, I'm trying to get this DramSim2 integration fixed so I can
start modeling FS with DRAM memory.
>>>>>>
>>>>>> Brad/Steve/Nilay
anyone have a suggestion on how to make this work?
>>>>>>
>>>>>> Note
that these problems also occur in Soplex from the Spec CPU2006
benchmark suite (also hits 1500 in-flight instructions assertion). Due to
time constraints, I haven't tested on other benchmarks.
>>>>>>> Thanks,
>>>>>>> Andrew

>>>>>>>
>>>>>>> On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski
<***@drexel.edu [2]> wrote:
>>>>>>>
>>>>>>> Hey Gabe,
>>>>>>>
Thanks for this...very helpful. I just recently got back into debugging this
problem. I made a small change in src/base/refcnt.hh to allow me to
return the current count of references to a DynInst object.
>>>>>>> I then modified existing DPRINTFs to also print out reference
counts, then added some of my own when I needed extra visibility.
>>>>>>> I've found one memory
store instruction that seems to be getting lost. What's happening is
that it progresses as far as getting executed in the IEW once, but a
delayed translation occurs, deferring the store. By the time it reenters
the IEW, the IQ has marked the instruction as squashed. Everything
progresses as usual from here on out, with one exception. When the
instruction is removed from the CPU's instruction list, there is one
reference count hanging.
>>>>>>> I've added in some additional
debugging for my traces to help narrow down where this reference is
coming from. As far as I can tell, it's because of a call to
initiateAcc() within the executeStore function in the lsq unit. Please
see the following two traces. The first trace shows what I just
discussed. The second trace is another memory store instruction that got
squashed, however, it was squashed upon its first entry into the IEW,
therefore it never started execution.
>>>>>>>
http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out [21]
>>>>>>>
http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out [22]

>>>>>>> Let me know if you have any ideas based on these two
instruction traces. I do not understand how the initiateAcc function
results in another reference, but maybe someone else does.... Since I
don't see how it makes a reference, it's hard to find out how to make
sure it gets dereferenced...
>>>>>>> Unfortunately, I haven't been able
to add a DPRINTF in src/base/refcnt.hh ...this would make things more
clear (i.e. exactly when references/dereferences occur). Let me know if
you have any advice on this...if it's possible. I can't seem to get the
right include files, and likely right SConscript compile order...

>>>>>>> Thanks,
>>>>>>> Andrew
>>>>>>>
>>>>>>> On Sat, Apr 7, 2012
at 9:48 PM, Gabe Black <***@eecs.umich.edu [23]> wrote:
>>>>>>>

>>>>>>>> Without digging into things too deeply, it looks like you may
be leaking references to dynamic instructions. The CPU may think it's
done with one, but until that final reference is removed, the object
will hang around forever. I think I've had problems before where the
reference count ended up off by one somehow and instructions would start
piling up. It's also possible that a clog develops in O3's pipeline and
some internal structure stops letting instructions through and starts
accumulating them. Either of these problems will be annoying to track
down, but with enough digging I've been able to fix these sorts of
things.
>>>>>>>>
>>>>>>>> This may have more to do with O3 not handling
the benchmark you're running well rather than a problem with your new
DRAM model. There may be some interaction between the two, though, where
the new memory makes the timing line up to cause O3 to behave poorly.
What you can do is instrument dynamic instruction creation and
destruction and reference counting (try printing "this" for both the
reference counting wrapper and the dyn inst itself) and turn it on as
close as you can to where things go bad tick-wise. Then look for an
instruction which gets lost, and look for where its reference count is
incremented and decremented. It should be relatively easy to pair up
where references are created and destroyed, and you should be able to
identify the reference which never goes away. Then you need to figure
out where that reference is being created. After that, you should have
enough information to identify why the reference counting isn't being
done correctly. It's arduous, but that's the only way.
>>>>>>>>

>>>>>>>> It's important to also make sure reference counts aren't
decremented to zero prematurely. I had a problem once where that
happened and the memory behind the object was updated by something that
didn't know it was dead. The memory had since been reallocated to
another object of the same type, so that other object reflected what
happened to the phantom one. If I remember that manifested as something
weird like an add causing a page fault or something.
>>>>>>>>
>>>>>>>>
Gabe
>>>>>>>>
>>>>>>>> On 04/07/12 18:21, Andrew Cebulski wrote:

>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>> I've looked into this problem some more, and have put together a
>>>>>>>>> couple traces. I've been becoming more familiar with how gem5 handles
>>>>>>>>> dynamic instructions, in particular how it destroys them. I have two
>>>>>>>>> traces to compare, one with the physical memory, and the other with the
>>>>>>>>> integrated dramsim2 dram memory. I also have two plots showing instruction
>>>>>>>>> counts over time (sim ticks). All of these are linked at the end of the
>>>>>>>>> email.
>>>>>>>>> First, I'm going to go into what I've been able to interpret
>>>>>>>>> regarding how instructions are destroyed. In particular, comparing when
>>>>>>>>> DynInst's vs. DynInstPtr's are deconstructed/removed from the cpu. I
>>>>>>>>> separate these because I've seen a difference, as I discuss later. These
>>>>>>>>> explanations are fairly non-existent on the wiki. There is a section
>>>>>>>>> header waiting to be filled...
>>>>>>>>> From what I have been able to gather from the code, there is a list
>>>>>>>>> of all the instructions in flight in cpu/o3/cpu.cc called instList, with
>>>>>>>>> the type DynInstPtr. There are three conditions to instructions being
>>>>>>>>> cleaned from this list:
>>>>>>>>> 1.) The ROB retires its head instruction
>>>>>>>>> 2.) Fetch receives a rob squashing signal from the commit, resulting
>>>>>>>>> in removing any instruction not in the ROB
>>>>>>>>> 3.) Decode detects an incorrect branch prediction, resulting in
>>>>>>>>> removal of all instructions back to the bad seq num.
>>>>>>>>> Once all five stages have completed, the CPU cleans up all the
>>>>>>>>> removed in-flight instructions. This line in particular
>>>>>>>>> in cleanUpRemovedInsts() in cpu/o3/cpu.cc deconstructs a DynInstPtr:
>>>>>>>>> instList.erase(removeList.front());
>>>>>>>>> When I turn on the debug flag O3CPU, I see the message "Removing
>>>>>>>>> instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum and pcState
>>>>>>>>> after all 5 cpu stages have completed, and one of the conditions above is
>>>>>>>>> met. I also see what tick it occurs on.
>>>>>>>>> When I turn on the DynInst debug flag, I see when instructions are
>>>>>>>>> created and destroyed (cpu/base_dyn_inst_impl.hh) and what tick. From
>>>>>>>>> analyzing the trace files, I've gathered that this takes into account that
>>>>>>>>> instructions have different execution lengths. So if a memory instruction
>>>>>>>>> in the instList (DynInstPtr) is removed on one tick, the DynInst for that
>>>>>>>>> memory instruction may be destroyed much later (e.g. 1M ticks later). I have
>>>>>>>>> yet to determine how this is implemented.
>>>>>>>>> Now for the problem.
>>>>>>>>> What I'm seeing when I run dramsim2 dram memory is a significant
>>>>>>>>> difference between the size of the instList (of DynInstPtr objects)
>>>>>>>>> and the dynamic instruction count (of DynInst objects). The
>>>>>>>>> benchmark I'm running is libquantum from SPEC 2006. For the first roughly
>>>>>>>>> 130B ticks, the dynamic instruction count kept in cpu/base_dyn_inst_impl.hh
>>>>>>>>> shadows the instList size in o3/cpu.cc (figure linked below) very closely.
>>>>>>>>> Around tick 130B after libquantum started, it starts hitting what I'm
>>>>>>>>> assuming are loops (therefore branch prediction), resulting in some
>>>>>>>>> behavior that seems to imply improper instruction handling (i.e. more
>>>>>>>>> instructions in flight than allowed by ROB).
>>>>>>>>> I wasn't able to sync-up the physical and dramsim2 traces exactly by
>>>>>>>>> trace, but they should represent roughly the same area of execution. They
>>>>>>>>> don't execute the same due to the dramsim2 modeling the memory differently
>>>>>>>>> (i.e. latency and other delays).
>>>>>>>>> I've shared both traces on my public Dropbox here --
>>>>>>>>> http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz [14]
>>>>>>>>> http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz [15]
>>>>>>>>> Here are a couple plots of tick versus instruction count, with
>>>>>>>>> respect to cpu->instcount in cpu/base_dyn_inst_impl.hh and instList.size()
>>>>>>>>> in cpu/o3/cpu.cc. --
>>>>>>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png [16]
>>>>>>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png [17]
>>>>>>>>> Note that I added the printout of the instList size to an existing
>>>>>>>>> O3CPU DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.
>>>>>>>>> Here are the commands I ran to parse the traces into data files to
>>>>>>>>> analyze in MATLAB and create the plots:
>>>>>>>>> zgrep DynInst
>>>>>>>>> dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | grep destroyed
>>>>>>>>> | awk '{print $1,$11}' > cpuinstcount.out
>>>>>>>>> zgrep instList
>>>>>>>>> dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | awk '{print
>>>>>>>>> $1,$11}' > instlistsize.out
>>>>>>>>> It seems to me like the problem might lie in gem5, but has just been
>>>>>>>>> exposed by integrating this more detailed memory model, dramsim2, into
>>>>>>>>> gem5. Either that, or there are some timing errors in how dramsim2 was
>>>>>>>>> integrated. I doubt this, however, since those first 190B ticks executed
>>>>>>>>> used the dramsim2 memory. I believe the problem is a combination of memory
>>>>>>>>> instructions + complex loops (branch prediction), resulting in improper
>>>>>>>>> destroying of instructions.
>>>>>>>>> I've included the ROB, Commit, Fetch, DynInst and O3CPU debug flags.
>>>>>>>>> There are 192 ROB entries, which is why the instList size generally has a
>>>>>>>>> max of about 192 instructions. The dynamic instruction counts (seen in the
>>>>>>>>> dramsim2 plot) seem to also imply that instructions are incorrectly being
>>>>>>>>> removed from the ROB, and then from the cpu's instruction list in cpu.cc,
>>>>>>>>> which allows more and more instructions to be added to the system (possibly
>>>>>>>>> from a bad branch).
>>>>>>>>> I appreciate any help in debugging this and further figuring out the
>>>>>>>>> root problem, just let me know if you need anything else from me. I don't
>>>>>>>>> have much more time at the moment to debug, but I can take any advice for
>>>>>>>>> quick changes and/or additional traces, then send the results back to the
>>>>>>>>> list for discussion.
>>>>>>>>> Thanks,
>>>>>>>>> Andrew
>>>>>>>>> P.S. Paul - I did try decreasing the size of the dramsim2 transaction
>>>>>>>>> (and even command) queue from 512 to 32. The same instructions problem
>>>>>>>>> occurred. It basically just decreased the execution time.
>>>>>>>>>
>>>>>>>>> On Wed, Mar 14, 2012 at 2:10 PM, Ali Saidi <***@umich.edu [18]> wrote:
>>>>>>>>>
>>>>>>>>>> The error is that there are more than 1500 instructions currently
>>>>>>>>>> in flight in the system. It could mean several things:
>>>>>>>>>>
>>>>>>>>>> 1. The value is somewhat arbitrarily defined and maybe there are
>>>>>>>>>> more than 1500 in your system at one time?
>>>>>>>>>>
>>>>>>>>>> 2. Instructions aren't being destroyed correctly
>>>>>>>>>>
>>>>>>>>>> You could try to run a debug binary so you'll get a list of
>>>>>>>>>> instructions when it happens or increase the number which may
>>>>>>>>>> be appropriate for certain situations (but 1500 is quite a few inflight
>>>>>>>>>> instructions).
>>>>>>>>>>
>>>>>>>>>> Ali
>>>>>>>>>>
>>>>>>>>>> On 13.03.2012 10:56, Andrew Cebulski wrote:

>>>>>>>>>>
>>>>>>>>>>> Hi Xiangyu,
>>>>>>>>>>> I just started looking into this some more. So at first I
>>>>>>>>>>> thought it was due to updating to a more recent revision, but then I went
>>>>>>>>>>> back to revision 8643, added your patch, built and ran....and now get the
>>>>>>>>>>> error with it too (when running ARM_FS/gem5.opt). I'm testing now to see
>>>>>>>>>>> if an update to SWIG might have resulted in this error, maybe someone on
>>>>>>>>>>> the mailing list would know if that's possible. The difference is 1.3.40
>>>>>>>>>>> vs. 2.0.3, both of which are supported according to the dependencies wiki
>>>>>>>>>>> page.
>>>>>>>>>>> Just for completeness, here's the error from revision 8643:
>>>>>>>>>>> build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void BaseDynInst::initVars()
>>>>>>>>>>> [with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>>>>>>>>>
>>>>>>>>>>> I have not tried running with gem5.debug, so I will be doing that
>>>>>>>>>>> today. Maybe this is an assertion that is occurring due to an
>>>>>>>>>>> optimization. That would mean it wouldn't be triggered in gem5.debug
>>>>>>>>>>> since it runs without optimizations. Have you tested all debug, opt and
>>>>>>>>>>> fast with your tests?
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Andrew
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 13, 2012 at 1:37 PM, Rio Xiangyu Dong <***@gmail.com [11]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>
>>>>>>>>>>>> I didn't see this error in my simulations. May I ask which gem5
>>>>>>>>>>>> version you are using? I find some of the latest code updates do not
>>>>>>>>>>>> comply with my changes. I am still using the DRAMsim2 patch on Gem5
>>>>>>>>>>>> repo8643, and have run all the runnable benchmarks in SPEC2006,
>>>>>>>>>>>> SPEC2000, EEMBC2, and PARSEC2 on ARM_SE.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you!
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>>
>>>>>>>>>>>> Xiangyu
>>>>>>>>>>>>

>>>>>>>>>>>> FROM: Andrew Cebulski [mailto:***@drexel.edu [8]]
>>>>>>>>>>>> SENT: Thursday, March 08, 2012 6:52 PM
>>>>>>>>>>>> TO: gem5 users mailing list
>>>>>>>>>>>> CC: ***@gmail.com [9]; ***@umich.edu [10]
>>>>>>>>>>>> SUBJECT: Re: [gem5-users] A Patch for DRAMsim2 Integration
>>>>>>>>>>>>
>>>>>>>>>>>> Xiangyu,
>>>>>>>>>>>>
>>>>>>>>>>>> I've been having an issue recently with the number of instructions
>>>>>>>>>>>> I've been seeing committed to the CPU (I have a separate thread on
>>>>>>>>>>>> this). It turns out the issue seems to be coming from this patch you
>>>>>>>>>>>> created to integrate DramSim2 with Gem5. Unfortunately, I've been
>>>>>>>>>>>> running with gem5.fast, not gem5.opt. So up until now, I haven't been
>>>>>>>>>>>> seeing assertions. I thought I'd run it with gem5.opt or debug back in
>>>>>>>>>>>> December, but I must not have. My runs on the ARM O3 cpu fail with
>>>>>>>>>>>> this assertion:
>>>>>>>>>>>>
>>>>>>>>>>>> build/ARM/cpu/base_dyn_inst_impl.hh:149: void
>>>>>>>>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion
>>>>>>>>>>>> `cpu->instcount
>>>>>>>>>>>>
>>>>>>>>>>>> -Andrew
>>>>>>>>>>>>

>>>>>>>>>>>>> Date: Sun, 18 Dec 2011 01:48:58 -0800
>>>>>>>>>>>>> From: "Dong, Xiangyu" <***@gmail.com [3]>
>>>>>>>>>>>>> To: "gem5 users mailing list" <gem5-***@gem5.org [4]>
>>>>>>>>>>>>> Subject: [gem5-users] A Patch for DRAMsim2 Integration
>>>>>>>>>>>>> Message-ID: <001201ccbd6a$45102210$cf306630$@gmail.com>
>>>>>>>>>>>>> Content-Type: text/plain; charset="us-ascii"
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have a Gem5+DRAMsim2 patch. I've tested it under both SE and FS modes.
>>>>>>>>>>>>> I'm willing to share it here.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For those who have such needs, please go to my website
>>>>>>>>>>>>> www.cse.psu.edu/~xydong [6] to download the patch and test it. To enable
>>>>>>>>>>>>> DRAMSim2, use se_dramsim2.py script instead of se.py (for FS, you can
>>>>>>>>>>>>> create by yourself). The basic idea to enable the DRAMsim2 module is to
>>>>>>>>>>>>> use the derived DRAMMemory class instead of PhysicalMemory class.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please let me know if there are bugs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Xiangyu Dong
>>>>>>>>>>>>>
>>>>>>>>>>>>> -------------- next part --------------
>>>>>>>>>>>>> An HTML attachment was scrubbed...
>>>>>>>>>>>>> URL: <http://m5sim.org/cgi-bin/mailman/private/gem5-users/attachments/20111218/f3fdf5da/attachment.html [7]>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> gem5-users mailing list
>>>>>>>>>>> gem5-***@gem5.org [12]
>>>>>>>>>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users [13]




Links:
------
[1] http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[2] mailto:***@drexel.edu
[3] mailto:***@gmail.com
[4] mailto:gem5-***@gem5.org
[5] http://gmail.com
[6] http://www.cse.psu.edu/%7Exydong
[7] http://m5sim.org/cgi-bin/mailman/private/gem5-users/attachments/20111218/f3fdf5da/attachment.html
[8] mailto:***@drexel.edu
[9] mailto:***@gmail.com
[10] mailto:***@umich.edu
[11] mailto:***@gmail.com
[12] mailto:gem5-***@gem5.org
[13] http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[14] http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
[15] http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
[16] http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
[17] http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[18] mailto:***@umich.edu
[19] mailto:gem5-***@gem5.org
[20] http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[21] http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
[22] http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
[23] mailto:***@eecs.umich.edu
[24] mailto:gem5-***@gem5.org
[25] http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[26] http://dl.dropbox.com/u/2953302/gem5/tlb.out
[27] mailto:***@umich.edu
[28] mailto:gem5-***@gem5.org
[29] http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[30] http://dl.dropbox.com/u/2953302/gem5/err.0
[31] mailto:***@umich.edu
[32] mailto:gem5-***@gem5.org
[33] http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[34] mailto:***@umich.edu
Andrew Cebulski
2012-05-03 02:12:57 UTC
Permalink
I started hitting this assertion (that the number of insts in flight was >
1500) before I started using a checkpoint. I created the checkpoint
afterwards to decrease the time needed to run simulations to debug this
problem. I'll create a new checkpoint, then send the new trace output.

-Andrew

On Wed, May 2, 2012 at 9:53 PM, Ali Saidi <***@umich.edu> wrote:

>
> It's likely the cause for all of your problems. Dirty data in the caches
> doesn't get restored either. You should always create checkpoints with an
> atomic cpu and without caches.
>
>
>
> Ali
>
>
>
> On 02.05.2012 21:23, Andrew Cebulski wrote:
>
> Sorry, I created the checkpoint I referred to with an O3 CPU with caches.
> From what I recall reading, caches don't get restored from checkpoints.
> Since the checkpoint wasn't during the benchmark run, I assumed that was
> okay.
> -Andrew
>
> On Wed, May 2, 2012 at 9:07 PM, Ali Saidi <***@umich.edu> wrote:
>
>> You haven't answered the question about if you created the checkpoints
>> with an atomic cpu without caches.
>>
>> Ali
>>
>>
>>
>>
>>
>> On 02.05.2012 19:58, Andrew Cebulski wrote:
>>
>> I have not run with the checker CPU recently. Here's the stderr output
>> from a run I did awhile back:
>> http://dl.dropbox.com/u/2953302/gem5/err.0
>> Note that the instruction match error is before my benchmark actually
>> starts running. The start of my boot script checks to see if my files
>> image is mounted (which it is), then continues on to run the benchmark. I
>> booted the system, mounted my files image, then took a checkpoint. I've
>> been running all my tests from that checkpoint. I found where my benchmark
>> started based on the ASID (from ExecAsid debug flag).
>> I delayed the start of gathering trace data until the second-to-last
>> linear increase in dynamic instructions in-flight. I'm running a new trace
>> now.
>> -Andrew
>>
>>
>> On Wed, May 2, 2012 at 5:28 PM, Ali Saidi <***@umich.edu> wrote:
>>
>>> Something is wrong well before this point. There is no reason that
>>> address 0x0 or 0x4 should be translated.
>>>
>>> Did you happen to create a checkpoint when caches were in the system?
>>>
>>> Have you tried to run with the checker cpu and see if it detects any
>>> errors?
>>>
>>>
>>>
>>> Ali
>>>
>>>
>>>
>>>
>>>
>>> On 02.05.2012 17:22, Andrew Cebulski wrote:
>>>
>>> They are data TLB misses that occur as the in-flight instruction count
>>> rises (at 0x0 and 0x4). The last TLB miss before the in-flight instruction
>>> count finally linearly decreases is to 0x200. Also, at the start of the
>>> rising slope, I see a miss to 0x8 and 0x2508c.
>>> Here's a trace file:
>>> http://dl.dropbox.com/u/2953302/gem5/tlb.out
>>> To reduce size, I just have lines that have either TLB or walker in them.
>>> I do see only a handful of instruction TLB misses.
>>> -Andrew
>>>
>>> On Wed, May 2, 2012 at 11:10 AM, Ali Saidi <***@umich.edu> wrote:
>>>
>>>> Hi Andrew,
>>>>
>>>>
>>>>
>>>> Thanks for digging into this. I think there is an issue somewhere, but
>>>> I'm still not sure where.
>>>>
>>>> Ali
>>>>
>>>> On 01.05.2012 23:34, Andrew Cebulski wrote:
>>>>
>>>> Okay, I'm positive now that the issue lies with delayed translations
>>>> that are squashed before finishing.
>>>>
>>>> On the data or instruction side? You seem to allude to data in the
>>>> paragraph below, but then instructions in the latter text.
>>>>
>>>> It seems to me like speculative load/stores are being executed, rather
>>>> than waiting for the instructions to commit. Once the instructions begin
>>>> getting (speculatively) executed in the TLB, a reference is left there,
>>>> which seems hard to root out and dereference after the instruction ends up
>>>> being squashed. At least, I have not been able to find that out in the
>>>> source code as of yet. Can anyone clarify on this?
>>>>
>>>>
>>>>
>>>> There should only be one translation outstanding from each
>>>> instruction and data side walker. Any nested transactions should be queued
>>>> in the walker. Until one finishes, I'm not sure how multiple would ever be
>>>> outstanding.
>>>>
>>>> Recall the following image that shows how the number of dynamic
>>>> instruction (DynInst) objects in-flight increases linearly for varying
>>>> periods of time:
>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>>>> After enabling the TLB debug flag, I see that the linear increase in
>>>> instructions in flight is proportional to the number of TLB misses. These
>>>> TLB misses have a much larger delay (resulting in translation delays) due
>>>> to the fact that DramSim2 models the memory system more accurately. It
>>>> seems that with the classic memory system, TLB misses often do not have
>>>> translation delays. For whatever reason, it would also seem that every
>>>> instruction that has a TLB miss also is eventually squashed...
>>>>
>>>> From a data side perspective this is reasonable. While a miss is
>>>> outstanding at some point instructions will stop committing and thus the
>>>> instructions in flight will begin to rise until the miss is satisfied.
>>>>
>>>> Here's a summary of outputs from my trace. These two DPRINTF
>>>> messages appear on the rising slopes (repeated up until the peak):
>>>> TLB Miss: Starting hardware table walker for 0(656)
>>>> TLB Miss: Starting hardware table walker for 0x4(656)
>>>>
>>>> This is interesting/odd. I don't know a good reason why (1) a miss
>>>> would be outstanding to both address 0 and address 4 at the same time. In
>>>> almost all cases these pages are marked as no-access to detect segfaults.
>>>> Perhaps there is an issue where the cpu is getting into a loop faulting on
>>>> a bad access and then faulting again on the fault handler. I could imagine
>>>> this would happen if there was some corruption in the memory system (for
>>>> example the timings in dramsim exposing a bug in the cache models or
>>>> something).
>>>>
>>>>
>>>> At the peak, the following message appears (from fetch) almost every
>>>> tick for (what I believe to be) every single one of the table walkers that
>>>> were squashed.
>>>> Fetch is waiting ITLB walk to finish!
>>>>
>>>> There must be another walk in flight? The instruction side will only
>>>> have one fault outstanding at once. Successive branch mispredicts will
>>>> re-direct fetch but there is code that catches the fact that a different
>>>> walk completed than expected and "does the right thing."
>>>>
>>>> The problem is that these ITLB table walks are for instructions that
>>>> were squashed as much as 0.3 billion cycles earlier, and have since been removed
>>>> from the CPU's instruction list.
>>>>
>>>> I'm not following here.
>>>>
>>>> Any help will be greatly appreciated in solving this problem. I've
>>>> hit a roadblock with getting Ruby working with ARM, most likely due to the
>>>> fact that ARM has disjoint memory (x86 and Alpha do not). There's the 256
>>>> MB for physical memory, then the 64 MB for the boot loader. I brought this
>>>> up in my last email about trying to get Ruby working. Therefore, I'm
>>>> trying to get this DramSim2 integration fixed so I can start modeling FS
>>>> with DRAM memory.
>>>>
>>>> Brad/Steve/Nilay anyone have a suggestion on how to make this work?
>>>>
>>>>
>>>> Note that these problems also occur in Soplex from the Spec CPU2006
>>>> benchmark suite (also hits 1500 in-flight instructions assertion). Due to
>>>> time constraints, I haven't tested on other benchmarks.
>>>> Thanks,
>>>> Andrew
>>>> On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski <***@drexel.edu> wrote:
>>>>
>>>>> Hey Gabe,
>>>>> Thanks for this...very helpful. I just recently got back into
>>>>> debugging this problem. I made a small change in src/base/refcnt.hh to
>>>>> allow me to return the current count of references to a DynInst object.
>>>>> I then modified existing DPRINTFs to also print out reference
>>>>> counts, then added some of my own when I needed extra visibility.
>>>>> I've found one memory store instruction that seems to be getting
>>>>> lost. What's happening is that it progresses as far as getting executed in
>>>>> the IEW once, but a delayed translation occurs, deferring the store. By
>>>>> the time it reenters the IEW, the IQ has marked the instruction as
>>>>> squashed. Everything progresses as usual from here on out, with one
>>>>> exception. When the instruction is removed from the CPUs instruction list,
>>>>> there is one reference count hanging.
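To make the suspected failure mode concrete, here is a minimal sketch (plain Python, not gem5 code) of how a second owner can keep an object alive after the CPU's instruction list has dropped it. The names DynInst, inst_list and pending_translations are illustrative assumptions only, and the immediate reclamation shown relies on CPython's reference counting.

```python
import weakref


class DynInst:
    """Stand-in for a dynamic instruction; only its lifetime matters here."""

    def __init__(self, seq_num: int) -> None:
        self.seq_num = seq_num


def simulate_squashed_store() -> list:
    """Return [alive after squash, alive after the translation lets go]."""
    inst = DynInst(seq_num=1)
    probe = weakref.ref(inst)          # observes the object without owning it
    inst_list = [inst]                 # hypothetical list of in-flight instructions
    pending_translations = [inst]      # hypothetical deferred translation, a second owner
    del inst                           # drop the local name so only the two lists own it

    alive = []
    inst_list.clear()                  # squash: the CPU drops its reference
    alive.append(probe() is not None)  # object survives, the translation still holds it
    pending_translations.clear()       # translation completes or is cancelled
    alive.append(probe() is not None)  # last owner gone, object is reclaimed
    return alive
```

If the second owner never lets go, the object count keeps growing even though the instruction list stays bounded, which is the shape of the plots discussed in this thread.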
>>>>> I've added in some additional debugging for my traces to help
>>>>> narrow down where this reference is coming from. As far as I can tell,
>>>>> it's because of a call to initiateAcc() within the executeStore function in
>>>>> the lsq unit. Please see the following two traces. The first trace shows
>>>>> what I just discussed. The second trace is another memory store
>>>>> instruction that got squashed, however, it was squashed upon its first
>>>>> entry into the IEW, therefore it never started execution.
>>>>> http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
>>>>> http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
>>>>> Let me know if you have any ideas based on these two instruction
>>>>> traces. I do not understand how the initiateAcc function results in
>>>>> another reference, but maybe someone else does.... Since I don't see how
>>>>> it makes a reference, it's hard to find out how to make sure it gets
>>>>> dereferenced...
>>>>> Unfortunately, I haven't been able to add a DPRINTF in
>>>>> src/base/refcnt.hh ...this would make things more clear (i.e. exactly when
>>>>> references/dereferences occur). Let me know if you have any advice on
>>>>> this...if it's possible. I can't seem to get the right include files, and
>>>>> likely right SConscript compile order...
>>>>> Thanks,
>>>>> Andrew
>>>>>
>>>>>
>>>>> On Sat, Apr 7, 2012 at 9:48 PM, Gabe Black <***@eecs.umich.edu> wrote:
>>>>>
>>>>>> Without digging into things too deeply, it looks like you may be
>>>>>> leaking references to dynamic instructions. The CPU may think it's done
>>>>>> with one, but until that final reference is removed, the object will hang
>>>>>> around forever. I think I've had problems before where the reference
>>>>>> count ended up off by one somehow and instructions would start piling up.
>>>>>> It's also possible that a clog develops in O3's pipeline and some internal
>>>>>> structure stops letting instructions through and starts accumulating them.
>>>>>> Either of these problems will be annoying to track down, but with enough
>>>>>> digging I've been able to fix these sorts of things.
>>>>>>
>>>>>> This may have more to do with O3 not handling the benchmark you're
>>>>>> running well rather than a problem with your new DRAM model. There may be
>>>>>> some interaction between the two, though, where the new memory makes the
>>>>>> timing line up to cause O3 to behave poorly. What you can do is instrument
>>>>>> dynamic instruction creation and destruction and reference counting (try
>>>>>> printing "this" for both the reference counting wrapper and the dyn inst
>>>>>> itself) and turn it on as close as you can to where things go bad tick
>>>>>> wise. Then look for an instruction which gets lost, and look for where its
>>>>>> reference count is incremented and decremented. It should be relatively
>>>>>> easy to pair up where references are created and destroyed, and you should
>>>>>> be able to identify the reference which never goes away. Then you need to
>>>>>> figure out where that reference is being created. After that, you should
>>>>>> have enough information to identify why the reference counting isn't being
>>>>>> done correctly. It's arduous, but that's the only way.
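The pairing step described above can be sketched as a small post-processing script. This is only an illustration under stated assumptions: the event tuple layout (tick, instruction address, wrapper address, +1 or -1) is hypothetical and would have to match whatever the added debug prints actually emit.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical event: (tick, instruction "this", wrapper "this", +1 or -1)
Event = Tuple[int, str, str, int]


def unreleased_refs(events: Iterable[Event]) -> Dict[str, List[Tuple[str, int]]]:
    """Return, per instruction, the wrappers whose increment was never undone."""
    # instruction address -> {wrapper address: tick of the unmatched increment}
    live: Dict[str, Dict[str, int]] = defaultdict(dict)
    for tick, inst, wrapper, delta in events:
        if delta > 0:
            live[inst][wrapper] = tick
        else:
            live[inst].pop(wrapper, None)
    # Keep only instructions that still have at least one holder, oldest first.
    return {inst: sorted(refs.items(), key=lambda item: item[1])
            for inst, refs in live.items() if refs}
```

One caveat with this approach: wrapper addresses can be reused (for example by stack temporaries), so a surviving entry is a lead to inspect in the trace rather than proof of a leak.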
>>>>>>
>>>>>> It's important to also make sure reference counts aren't decremented
>>>>>> to zero prematurely. I had a problem once where that happened and the
>>>>>> memory behind the object was updated by something that didn't know it was
>>>>>> dead. The memory had since been reallocated to another object of the same
>>>>>> type, so that other object reflected what happened to the phantom one. If I
>>>>>> remember that manifested as something weird like an add causing a page
>>>>>> fault or something.
>>>>>>
>>>>>> Gabe
>>>>>>
>>>>>>
>>>>>> On 04/07/12 18:21, Andrew Cebulski wrote:
>>>>>>
>>>>>> Hi all,
>>>>>> I've looked into this problem some more, and have put together a
>>>>>> couple traces. I've been becoming more familiar with how gem5 handles
>>>>>> dynamic instructions, in particular how it destroys them. I have two
>>>>>> traces to compare, one with the physical memory, and the other with the
>>>>>> integrated dramsim2 dram memory. I also have two plots showing instruction
>>>>>> counts over time (sim ticks). All of these are linked at the end of the
>>>>>> email.
>>>>>> First, I'm going to go into what I've been able to interpret
>>>>>> regarding how instructions are destroyed. In particular, comparing when
>>>>>> DynInst's vs. DynInstPtr's are deconstructed/removed from the cpu. I
>>>>>> separate these because I've seen a difference, as I discuss later. These
>>>>>> explanations are fairly non-existent on the wiki. There is a section
>>>>>> header waiting to be filled...
>>>>>> From what I have been able to gather from the code, there is a list
>>>>>> of all the instructions in flight in cpu/o3/cpu.cc called instList, with
>>>>>> the type DynInstPtr. There are three conditions to instructions being
>>>>>> cleaned from this list:
>>>>>> 1.) The ROB retires its head instruction
>>>>>> 2.) Fetch receives a rob squashing signal from the commit, resulting
>>>>>> in removing any instruction not in the ROB
>>>>>> 3.) Decode detects an incorrect branch prediction, resulting in
>>>>>> removal of all instructions back to the bad seq num.
>>>>>> Once all five stages have completed, the CPU cleans up all the
>>>>>> removed in-flight instructions. This line in particular
>>>>>> in cleanUpRemovedInsts() in cpu/o3/cpu.cc deconstructs a DynInstPtr:
>>>>>> instList.erase(removeList.front());
>>>>>> When I turn on the debug flag O3CPU, I see the message "Removing
>>>>>> instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum and pcState
>>>>>> after all 5 cpu stages have completed, and one of the conditions above is
>>>>>> met. I also see what tick it occurs on.
>>>>>> When I turn on the DynInst debug flag, I see when instructions are
>>>>>> created and destroyed (cpu/base_dyn_inst_impl.hh) and what tick. From
>>>>>> analyzing the trace files, I've gathered that this takes into account that
>>>>>> instructions have different execution lengths. So if a memory instruction
>>>>>> in the instList (DynInstPtr) is removed on one tick, the DynInst for that
>>>>>> memory instruction may be destroyed much later (e.g. 1M ticks later). I have
>>>>>> yet to determine how this is implemented.
>>>>>> Now for the problem.
>>>>>> What I'm seeing when I run dramsim2 dram memory is a significant
>>>>>> difference between the size of the instList (of DynInstPtr objects)
>>>>>> and the dynamic instruction count (of DynInst objects). The
>>>>>> benchmark I'm running is libquantum from SPEC 2006. For the first roughly
>>>>>> 130B ticks, the dynamic instruction count kept in cpu/base_dyn_inst_impl.hh
>>>>>> shadows the instList size in o3/cpu.cc (figure linked below) very closely.
>>>>>> Around tick 130B after libquantum started, it starts hitting what I'm
>>>>>> assuming are loops (therefore branch prediction), resulting in some
>>>>>> behavior that seems to imply improper instruction handling (i.e. more
>>>>>> instructions in flight than allowed by ROB).
>>>>>> I wasn't able to sync-up the physical and dramsim2 traces exactly by
>>>>>> trace, but they should represent roughly the same area of execution. They
>>>>>> don't execute the same due to the dramsim2 modeling the memory differently
>>>>>> (i.e. latency and other delays).
>>>>>> I've shared both traces on my public Dropbox here --
>>>>>>
>>>>>> http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
>>>>>>
>>>>>> http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
>>>>>> Here are a couple plots of tick versus instruction count, with
>>>>>> respect to cpu->instcount in cpu/base_dyn_inst_impl.hh and instList.size()
>>>>>> in cpu/o3/cpu.cc. --
>>>>>>
>>>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
>>>>>>
>>>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>>>>>> Note that I added the printout of the instList size to an existing
>>>>>> O3CPU DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.
>>>>>> Here are the commands I ran to parse the traces into data files to
>>>>>> analyze in MATLAB and create the plots:
>>>>>> zgrep DynInst
>>>>>> dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | grep destroyed
>>>>>> | awk '{print $1,$11}' > cpuinstcount.out
>>>>>> zgrep instList
>>>>>> dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | awk '{print
>>>>>> $1,$11}' > instlistsize.out
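For readers without zgrep/awk at hand, the same extraction can be sketched in Python. This is a rough equivalent under the assumption that trace lines are whitespace-separated and that columns 1 and 11 hold the tick and the count, as in the awk commands above; the function names are made up for this example.

```python
import gzip
from typing import Iterable, Iterator, Tuple


def pick_fields(lines: Iterable[str], needle: str, extra: str = "",
                cols: Tuple[int, int] = (1, 11)) -> Iterator[Tuple[str, str]]:
    """Mimic `grep needle | grep extra | awk '{print $1,$11}'` (1-based columns)."""
    first, second = cols
    for line in lines:
        # An empty `extra` matches every line, like leaving out the second grep.
        if needle not in line or extra not in line:
            continue
        fields = line.split()
        # Unlike awk, skip lines too short to contain both columns.
        if len(fields) >= max(first, second):
            yield fields[first - 1], fields[second - 1]


def pick_fields_gz(path: str, needle: str, extra: str = "") -> Iterator[Tuple[str, str]]:
    """Same as pick_fields, but streams a gzip-compressed trace from disk."""
    with gzip.open(path, "rt", errors="replace") as trace:
        yield from pick_fields(trace, needle, extra)
```

For example, `pick_fields_gz(trace_path, "DynInst", "destroyed")` would correspond to the first pipeline and `pick_fields_gz(trace_path, "instList")` to the second, with `trace_path` pointing at the downloaded trace.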
>>>>>> It seems to me like the problem might lie in gem5, but has just been
>>>>>> exposed by integrating this more detailed memory model, dramsim2, into
>>>>>> gem5. Either that, or there are some timing errors in how dramsim2 was
>>>>>> integrated. I doubt this, however, since those first 190B ticks executed
>>>>>> used the dramsim2 memory. I believe the problem is a combination of memory
>>>>>> instructions + complex loops (branch prediction), resulting in improper
>>>>>> destroying of instructions.
>>>>>> I've included the ROB, Commit, Fetch, DynInst and O3CPU debug flags.
>>>>>> There are 192 ROB entries, which is why the instList size generally has a
>>>>>> max of about 192 instructions. The dynamic instruction counts (seen in the
>>>>>> dramsim2 plot) seem to also imply that instructions are incorrectly being
>>>>>> removed from the ROB, and then from the cpu's instruction list in cpu.cc,
>>>>>> which allows more and more instructions to be added to the system (possibly
>>>>>> from a bad branch).
>>>>>> I appreciate any help in debugging this and further figuring out the
>>>>>> root problem, just let me know if you need anything else from me. I don't
>>>>>> have much more time at the moment to debug, but I can take any advice for
>>>>>> quick changes and/or additional traces, then send the results back to the
>>>>>> list for discussion.
>>>>>> Thanks,
>>>>>> Andrew
>>>>>> P.S. Paul - I did try decreasing the size of the dramsim2 transaction
>>>>>> (and even command) queue from 512 to 32. The same instructions problem
>>>>>> occurred. It basically just decreased the execution time.
>>>>>>
>>>>>> On Wed, Mar 14, 2012 at 2:10 PM, Ali Saidi <***@umich.edu> wrote:
>>>>>>
>>>>>>> The error is that there are more than 1500 instructions currently
>>>>>>> in flight in the system. It could mean several things:
>>>>>>>
>>>>>>> 1. The value is somewhat arbitrarily defined and maybe there are
>>>>>>> more than 1500 in your system at one time?
>>>>>>>
>>>>>>> 2. Instructions aren't being destroyed correctly
>>>>>>>
>>>>>>> You could try to run a debug binary so you'll get a list of
>>>>>>> instructions when it happens or increase the number which may
>>>>>>> be appropriate for certain situations (but 1500 is quite a few inflight
>>>>>>> instructions).
>>>>>>>
>>>>>>> Ali
>>>>>>>
>>>>>>> On 13.03.2012 10:56, Andrew Cebulski wrote:
>>>>>>>
>>>>>>> Hi Xiangyu,
>>>>>>> I just started looking into this some more. So at first I
>>>>>>> thought it was due to updating to a more recent revision, but then I went
>>>>>>> back to revision 8643, added your patch, built and ran....and now get the
>>>>>>> error with it too (when running ARM_FS/gem5.opt). I'm testing now to see
>>>>>>> if an update to SWIG might have resulted in this error, maybe someone on
>>>>>>> the mailing list would know if that's possible. The difference is 1.3.40
>>>>>>> vs. 2.0.3, both of which are supported according to the dependencies wiki
>>>>>>> page.
>>>>>>> Just for completeness, here's the error from revision 8643:
>>>>>>> build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
>>>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>>>>> I have not tried running with gem5.debug, so I will be doing that
>>>>>>> today. Maybe this is an assertion that is occurring due to an
>>>>>>> optimization. That would mean it wouldn't be triggered in gem5.debug since
>>>>>>> it runs without optimizations. Have you tested all debug, opt and fast
>>>>>>> with your tests?
>>>>>>> Thanks,
>>>>>>> Andrew
>>>>>>>
>>>>>>> On Tue, Mar 13, 2012 at 1:37 PM, Rio Xiangyu Dong <
>>>>>>> ***@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Andrew,
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I didn’t see this error in my simulations. May I ask which gem5
>>>>>>>> version you are using? I find some of the latest code updates do not comply
>>>>>>>> with my changes. I am still using the DRAMsim2 patch on Gem5 repo8643, and
>>>>>>>> have run all the runnable benchmarks in SPEC2006, SPEC2000, EEMBC2, and
>>>>>>>> PARSEC2 on ARM_SE.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thank you!
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Xiangyu
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *From:* Andrew Cebulski [mailto:***@drexel.edu]
>>>>>>>> *Sent:* Thursday, March 08, 2012 6:52 PM
>>>>>>>>
>>>>>>>> *To:* gem5 users mailing list
>>>>>>>> *Cc:****@gmail.com; ***@umich.edu
>>>>>>>>
>>>>>>>> *Subject:* Re: [gem5-users] A Patch for DRAMsim2 Integration
>>>>>>>>
>>>>>>>> Xiangyu,
>>>>>>>>
>>>>>>>> I've been having an issue recently with the number of
>>>>>>>> instructions I've been seeing committed to the CPU (I have a separate
>>>>>>>> thread on this). It turns out the issue seems to be coming from this patch
>>>>>>>> you created to integrate DramSim2 with Gem5. Unfortunately, I've been
>>>>>>>> running with gem5.fast, not gem5.opt. So up until now, I haven't been
>>>>>>>> seeing assertions. I thought I'd run it with gem5.opt or debug back in
>>>>>>>> December, but I must not have. My runs on the Arm O3 cpu fail with this
>>>>>>>> assertion:
>>>>>>>>
>>>>>>>> build/ARM/cpu/base_dyn_inst_impl.hh:149: void
>>>>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>>>>>>
>>>>>>>> -Andrew
>>>>>>>>
Andrew Cebulski
2012-05-03 02:22:40 UTC
Permalink
I double-checked by looking at the config.ini file. It turns out I did
actually create the checkpoint with an Atomic CPU without caches. Sorry
for the confusion.
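
For anyone repeating this check, here is a minimal sketch of it. It assumes config.ini is an ordinary INI file in which each simulated object has its own section with a type= key. The section names and type strings in the sample (system.cpu, AtomicSimpleCPU, BaseCache and so on) are illustrative assumptions and may differ between gem5 revisions.

```python
import configparser

# Hypothetical excerpt of the config.ini written alongside a checkpoint run.
# Section names and type strings are assumptions for illustration only.
SAMPLE_CONFIG = """
[system]
type=LinuxArmSystem
children=cpu physmem

[system.cpu]
type=AtomicSimpleCPU

[system.physmem]
type=PhysicalMemory
"""


def summarize_checkpoint_config(ini_text):
    """Return (cpu_types, cache_sections) found in a config.ini string."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(ini_text)
    cpu_types = []
    cache_sections = []
    for section in parser.sections():
        obj_type = parser.get(section, "type", fallback="")
        # Any object whose type ends in "CPU" is treated as a CPU model.
        if obj_type.endswith("CPU"):
            cpu_types.append(obj_type)
        # Any object whose type mentions "cache" is treated as a cache.
        if "cache" in obj_type.lower():
            cache_sections.append(section)
    return cpu_types, cache_sections


cpu_types, cache_sections = summarize_checkpoint_config(SAMPLE_CONFIG)
print("CPU types:", cpu_types)
print("Cache sections:", cache_sections)
```

A checkpoint taken as Ali recommends would be expected to list only an atomic CPU type and no cache sections.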

-Andrew

On Wed, May 2, 2012 at 10:12 PM, Andrew Cebulski <***@drexel.edu> wrote:

> I started hitting this assertion (that the number of insts in flight was >
> 1500) before I started using a checkpoint. I created the checkpoint
> afterwards to decrease the time needed to run simulations to debug this
> problem. I'll create a new checkpoint, then send the new trace output.
>
> -Andrew
>
>
> On Wed, May 2, 2012 at 9:53 PM, Ali Saidi <***@umich.edu> wrote:
>
>>
>> It's likely the cause for all of your problems. Dirty data in the caches
>> doesn't get restored either. You should always create checkpoints with an
>> atomic cpu and without caches.
>>
>>
>>
>> Ali
>>
>>
>>
>> On 02.05.2012 21:23, Andrew Cebulski wrote:
>>
>> Sorry, I created the checkpoint I referred to with an O3 CPU with caches.
>> From what I recall reading, caches don't get restored from checkpoints.
>> Since the checkpoint wasn't during the benchmark run, I assumed that was
>> okay.
>> -Andrew
>>
>> On Wed, May 2, 2012 at 9:07 PM, Ali Saidi <***@umich.edu> wrote:
>>
>>> You haven't answered the question about if you created the checkpoints
>>> with an atomic cpu without caches.
>>>
>>> Ali
>>>
>>>
>>>
>>>
>>>
>>> On 02.05.2012 19:58, Andrew Cebulski wrote:
>>>
>>> I have not run with the checker CPU recently. Here's the stderr output
>>> from a run I did awhile back:
>>> http://dl.dropbox.com/u/2953302/gem5/err.0
>>> Note that the instruction match error is before my benchmark actually
>>> starts running. The start of my boot script checks to see if my files
>>> image is mounted (which it is), then continues on to run the benchmark. I
>>> booted the system, mounted my files image, then took a checkpoint. I've
>>> been running all my tests from that checkpoint. I found where my benchmark
>>> started based on the ASID (from ExecAsid debug flag).
>>> I delayed the start of gathering trace data until the second-to-last
>>> linear increase in dynamic instructions in-flight. I'm running a new trace
>>> now.
>>> -Andrew
>>>
>>>
>>> On Wed, May 2, 2012 at 5:28 PM, Ali Saidi <***@umich.edu> wrote:
>>>
>>>> Something is wrong well before this point. There is no reason that
>>>> address 0x0 or 0x4 should be translated.
>>>>
>>>> Did you happen to create a checkpoint when caches were in the system?
>>>>
>>>> Have you tried to run with the checker cpu and see if it detects any
>>>> errors?
>>>>
>>>>
>>>>
>>>> Ali
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 02.05.2012 17:22, Andrew Cebulski wrote:
>>>>
>>>> They are data TLB misses that occur as the in-flight instruction count
>>>> rises (at 0x0 and 0x4). The last TLB miss before the in-flight instruction
>>>> count finally linearly decreases is to 0x200. Also, at the start of the
>>>> rising slope, I see a miss to 0x8 and 0x2508c.
>>>> Here's a trace file:
>>>> http://dl.dropbox.com/u/2953302/gem5/tlb.out
>>>> To reduce size, I just have lines that have either TLB or walker in
>>>> them.
>>>> I do see only a handful of instruction TLB misses.
>>>> -Andrew
>>>>
>>>> On Wed, May 2, 2012 at 11:10 AM, Ali Saidi <***@umich.edu> wrote:
>>>>
>>>>> Hi Andrew,
>>>>>
>>>>>
>>>>>
>>>>> Thanks for digging into this. I think there is an issue somewhere, but
>>>>> I'm still not sure where.
>>>>>
>>>>> Ali
>>>>>
>>>>> On 01.05.2012 23:34, Andrew Cebulski wrote:
>>>>>
>>>>> Okay, I'm positive now that the issue lies with delayed translations
>>>>> that are squashed before finishing.
>>>>>
>>>>> On the data or instruction side? You seem to allude to data in the
>>>>> paragraph below, but then instructions in the latter text.
>>>>>
>>>>> It seems to me like speculative load/stores are being executed,
>>>>> rather than waiting for the instructions to commit. Once the instructions
>>>>> begin getting (speculatively) executed in the TLB, a reference is left
>>>>> there, which seems hard to root out and dereference after the instruction
>>>>> ends up being squashed. At least, I have not been able to find that out in
>>>>> the source code as of yet. Can anyone clarify on this?
>>>>>
>>>>>
>>>>>
>>>>> There should only be one translation outstanding from each
>>>>> instruction and data side walker. Any nested transactions should be queued
>>>>> in the walker. Until one finishes, I'm not sure how multiple would ever be
>>>>> outstanding.
>>>>>
>>>>> Recall the following image that shows how the number of dynamic
>>>>> instruction (DynInst) objects in-flight increases linearly for varying
>>>>> periods of time:
>>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>>>>> After enabling the TLB debug flag, I see that the linear increase in
>>>>> instructions in flight is proportional to the number of TLB misses. These
>>>>> TLB misses have a much larger delay (resulting in translation delays) due
>>>>> to the fact that DramSim2 models the memory system more accurately. It
>>>>> seems that with the classic memory system, TLB misses often do not have
>>>>> translation delays. For whatever reason, it would also seem that every
>>>>> instruction that has a TLB miss also is eventually squashed...
>>>>>
>>>>> From a data side perspective this is reasonable. While a miss is
>>>>> outstanding at some point instructions will stop committing and thus the
>>>>> instructions in flight will begin to rise until the miss is satisfied.
>>>>>
>>>>> Here's a summary of outputs from my trace. These two DPRINTF
>>>>> messages appear on the rising slopes (repeated up until the peak):
>>>>> TLB Miss: Starting hardware table walker for 0(656)
>>>>> TLB Miss: Starting hardware table walker for 0x4(656)
>>>>>
>>>>> This is interesting/odd. I don't know a good reason why (1) a miss
>>>>> would be outstanding to both address 0 and address 4 at the same time. In
>>>>> almost all cases these pages are marked as no-access to detect segfaults.
>>>>> Perhaps there is an issue where the cpu is getting into a loop faulting on
>>>>> a bad access and then faulting again on the fault handler. I could imagine
>>>>> this would happen if there was some corruption in the memory system (for
>>>>> example the timings in dramsim exposing a bug in the cache models or
>>>>> something).
>>>>>
>>>>>
>>>>> At the peak, the following message appears (from fetch) almost every
>>>>> tick for (what I believe to be) every single one of the table walkers that
>>>>> were squashed.
>>>>> Fetch is waiting ITLB walk to finish!
>>>>>
>>>>> There must be another walk in flight? The instruction side will only
>>>>> have one fault outstanding at once. Successive branch mispredicts will
>>>>> re-direct fetch but there is code that catches the fact that a different
>>>>> walk completed than expected and "does the right thing."
>>>>>
>>>>> The problem is that these ITLB table walks are for instructions that
>>>>> were squashed as much as 0.3 billion cycles earlier, and have since been removed
>>>>> from the CPU's instruction list.
>>>>>
>>>>> I'm not following here.
>>>>>
>>>>> Any help will be greatly appreciated in solving this problem. I've
>>>>> hit a roadblock with getting Ruby working with ARM, most likely due to the
>>>>> fact that ARM has disjoint memory (x86 and Alpha do not). There's the 256
>>>>> MB for physical memory, then the 64 MB for the boot loader. I brought this
>>>>> up in my last email about trying to get Ruby working. Therefore, I'm
>>>>> trying to get this DramSim2 integration fixed so I can start modeling FS
>>>>> with DRAM memory.
>>>>>
>>>>> Brad/Steve/Nilay anyone have a suggestion on how to make this work?
>>>>>
>>>>>
>>>>> Note that these problems also occur in Soplex from the Spec CPU2006
>>>>> benchmark suite (also hits 1500 in-flight instructions assertion). Due to
>>>>> time constraints, I haven't tested on other benchmarks.
>>>>> Thanks,
>>>>> Andrew
>>>>> On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski <***@drexel.edu> wrote:
>>>>>
>>>>>> Hey Gabe,
>>>>>> Thanks for this...very helpful. I just recently got back into
>>>>>> debugging this problem. I made a small change in src/base/refcnt.hh to
>>>>>> allow me to return the current count of references to a DynInst object.
>>>>>> I then modified existing DPRINTFs to also print out reference
>>>>>> counts, then added some of my own when I needed extra visibility.
>>>>>> I've found one memory store instruction that seems to be getting
>>>>>> lost. What's happening is that it progresses as far as getting executed in
>>>>>> the IEW once, but a delayed translation occurs, deferring the store. By
>>>>>> the time it reenters the IEW, the IQ has marked the instruction as
>>>>>> squashed. Everything progresses as usual from here on out, with one
>>>>>> exception. When the instruction is removed from the CPU's instruction list,
>>>>>> there is one reference count hanging.
>>>>>> I've added in some additional debugging for my traces to help
>>>>>> narrow down where this reference is coming from. As far as I can tell,
>>>>>> it's because of a call to initiateAcc() within the executeStore function in
>>>>>> the lsq unit. Please see the following two traces. The first trace shows
>>>>>> what I just discussed. The second trace is another memory store
>>>>>> instruction that got squashed, however, it was squashed upon its first
>>>>>> entry into the IEW, therefore it never started execution.
>>>>>> http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
>>>>>> http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
>>>>>> Let me know if you have any ideas based on these two instruction
>>>>>> traces. I do not understand how the initiateAcc function results in
>>>>>> another reference, but maybe someone else does.... Since I don't see how
>>>>>> it makes a reference, it's hard to find out how to make sure it gets
>>>>>> dereferenced...
>>>>>> Unfortunately, I haven't been able to add a DPRINTF in
>>>>>> src/base/refcnt.hh ...this would make things more clear (i.e. exactly when
>>>>>> references/dereferences occur). Let me know if you have any advice on
>>>>>> this...if it's possible. I can't seem to get the right include files, and
>>>>>> likely right SConscript compile order...
>>>>>> Thanks,
>>>>>> Andrew
>>>>>>
>>>>>>
>>>>>> On Sat, Apr 7, 2012 at 9:48 PM, Gabe Black <***@eecs.umich.edu> wrote:
>>>>>>
>>>>>>> Without digging into things too deeply, it looks like you may be
>>>>>>> leaking references to dynamic instructions. The CPU may think it's done
>>>>>>> with one, but until that final reference is removed, the object will hang
>>>>>>> around forever. I think I've had problems before where the reference
>>>>>>> count ended up off by one somehow and instructions would start piling up.
>>>>>>> It's also possible that a clog develops in O3's pipeline and some internal
>>>>>>> structure stops letting instructions through and starts accumulating them.
>>>>>>> Either of these problems will be annoying to track down, but with enough
>>>>>>> digging I've been able to fix these sorts of things.
>>>>>>>
>>>>>>> This may have more to do with O3 not handling the benchmark you're
>>>>>>> running well rather than a problem with your new DRAM model. There may be
>>>>>>> some interaction between the two, though, where the new memory makes the
>>>>>>> timing line up to cause O3 to behave poorly. What you can do is instrument
>>>>>>> dynamic instruction creation and destruction and reference counting (try
>>>>>>> printing "this" for both the reference counting wrapper and the dyn inst
>>>>>>> itself) and turn it on as close as you can to where things go bad tick
>>>>>>> wise. Then look for an instruction which gets lost, and look for where its
>>>>>>> reference count is incremented and decremented. It should be relatively
>>>>>>> easy to pair up where references are created and destroyed, and you should
>>>>>>> be able to identify the reference which never goes away. Then you need to
>>>>>>> figure out where that reference is being created. After that, you should
>>>>>>> have enough information to identify why the reference counting isn't being
>>>>>>> done correctly. It's arduous, but that's the only way.
>>>>>>>
>>>>>>> It's important to also make sure reference counts aren't decremented
>>>>>>> to zero prematurely. I had a problem once where that happened and the
>>>>>>> memory behind the object was updated by something that didn't know it was
>>>>>>> dead. The memory had since been reallocated to another object of the same
>>>>>>> type, so that other object reflected what happened to the phantom one. If I
>>>>>>> remember that manifested as something weird like an add causing a page
>>>>>>> fault or something.
>>>>>>>
>>>>>>> Gabe
>>>>>>>
>>>>>>>
>>>>>>> On 04/07/12 18:21, Andrew Cebulski wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>> I've looked into this problem some more, and have put together a
>>>>>>> couple traces. I've been becoming more familiar with how gem5 handles
>>>>>>> dynamic instructions, in particular how it destroys them. I have two
>>>>>>> traces to compare, one with the physical memory, and the other with the
>>>>>>> integrated dramsim2 dram memory. I also have two plots showing instruction
>>>>>>> counts over time (sim ticks). All of these are linked at the end of the
>>>>>>> email.
>>>>>>> First, I'm going to go into what I've been able to interpret
>>>>>>> regarding how instructions are destroyed. In particular, comparing when
>>>>>>> DynInst's vs. DynInstPtr's are deconstructed/removed from the cpu. I
>>>>>>> separate these because I've seen a difference, as I discuss later. These
>>>>>>> explanations are fairly non-existent on the wiki. There is a section
>>>>>>> header waiting to be filled...
>>>>>>> From what I have been able to gather from the code, there is a list
>>>>>>> of all the instructions in flight in cpu/o3/cpu.cc called instList, with
>>>>>>> the type DynInstPtr. There are three conditions to instructions being
>>>>>>> cleaned from this list:
>>>>>>> 1.) The ROB retires its head instruction
>>>>>>> 2.) Fetch receives a rob squashing signal from the commit,
>>>>>>> resulting in removing any instruction not in the ROB
>>>>>>> 3.) Decode detects an incorrect branch prediction, resulting in
>>>>>>> removal of all instructions back to the bad seq num.
>>>>>>> Once all five stages have completed, the CPU cleans up all the
>>>>>>> removed in-flight instructions. This line in particular
>>>>>>> in cleanUpRemovedInsts() in cpu/o3/cpu.cc deconstructs a DynInstPtr:
>>>>>>> instList.erase(removeList.front());
>>>>>>> When I turn on the debug flag O3CPU, I see the message "Removing
>>>>>>> instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum and pcState
>>>>>>> after all 5 cpu stages have completed, and one of the conditions above is
>>>>>>> met. I also see what tick it occurs on.
>>>>>>> When I turn on the DynInst debug flag, I see when instructions are
>>>>>>> created and destroyed (cpu/base_dyn_inst_impl.hh) and what tick. From
>>>>>>> analyzing the trace files, I've gathered that this takes into account that
>>>>>>> instructions have different execution lengths. So if a memory instruction
>>>>>>> is removed from the instList (DynInstPtr) at one tick, the DynInst for that
>>>>>>> memory instruction may be destroyed much later (e.g. 1M ticks later). I have yet
>>>>>>> to determine how this is implemented.
>>>>>>> Now for the problem.
>>>>>>> What I'm seeing when I run dramsim2 dram memory is a significant
>>>>>>> difference between the size of the instList vector (of DynInstPtr objects),
>>>>>>> and the size of dynamic instruction count (of DynInst objects). The
>>>>>>> benchmark I'm running is libquantum from SPEC 2006. For the first roughly
>>>>>>> 130B ticks, the dynamic instruction count kept in cpu/base_dyn_inst_impl.hh
>>>>>>> shadows the instList size in o3/cpu.cc (figure linked below) very closely.
>>>>>>> Around tick 130B after libquantum started, it starts hitting what I'm
>>>>>>> assuming are loops (therefore branch prediction), resulting in some
>>>>>>> behavior that seems to imply improper instruction handling (i.e. more
>>>>>>> instructions in flight than allowed by ROB).
>>>>>>> I wasn't able to sync-up the physical and dramsim2 traces exactly by
>>>>>>> trace, but they should represent roughly the same area of execution. They
>>>>>>> don't execute the same due to the dramsim2 modeling the memory differently
>>>>>>> (i.e. latency and other delays).
>>>>>>> I've shared both traces on my public Dropbox here --
>>>>>>>
>>>>>>> http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
>>>>>>>
>>>>>>> http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
>>>>>>> Here are a couple plots of tick versus instruction count, with
>>>>>>> respect to cpu->instcount in cpu/base_dyn_inst_impl.hh and instList.size()
>>>>>>> in cpu/o3/cpu.cc. --
>>>>>>>
>>>>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
>>>>>>>
>>>>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>>>>>>> Note that I added the printout of the instList size to an existing
>>>>>>> O3CPU DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.
>>>>>>> Here are the commands I ran to parse the traces into data files to
>>>>>>> analyze in MATLAB and create the plots:
>>>>>>> zgrep DynInst dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | grep destroyed | awk '{print $1,$11}' > cpuinstcount.out
>>>>>>> zgrep instList dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | awk '{print $1,$11}' > instlistsize.out
>>>>>>> It seems to me like the problem might lie in gem5, but has just been
>>>>>>> exposed by integrating this more detailed memory model, dramsim2, into
>>>>>>> gem5. Either that, or there are some timing errors in how dramsim2 was
>>>>>>> integrated. I doubt this, however, since those first 190B ticks executed
>>>>>>> used the dramsim2 memory. I believe the problem is a combination of memory
>>>>>>> instructions + complex loops (branch prediction), resulting in improper
>>>>>>> destroying of instructions.
>>>>>>> I've included the ROB, Commit, Fetch, DynInst and O3CPU debug flags.
>>>>>>> There are 192 ROB entries, which is why the instList size generally has a
>>>>>>> max of about 192 instructions. The dynamic instruction counts (seen in the
>>>>>>> dramsim2 plot) seem to also imply that instructions are incorrectly being
>>>>>>> removed from the ROB, and then from the cpu's instruction list in cpu.cc,
>>>>>>> which allows more and more instructions to be added to the system (possibly
>>>>>>> from a bad branch).
>>>>>>> I appreciate any help in debugging this and further figuring out the
>>>>>>> root problem, just let me know if you need anything else from me. I don't
>>>>>>> have much more time at the moment to debug, but I can take any advice for
>>>>>>> quick changes and/or additional traces, then send the results back to the
>>>>>>> list for discussion.
>>>>>>> Thanks,
>>>>>>> Andrew
>>>>>>> P.S. Paul - I did try decreasing the size of the dramsim2
>>>>>>> transaction (and even command) queue from 512 to 32. The same instructions
>>>>>>> problem occurred. It basically just decreased the execution time.
>>>>>>>
>>>>>>> On Wed, Mar 14, 2012 at 2:10 PM, Ali Saidi <***@umich.edu> wrote:
>>>>>>>
>>>>>>>> The error is that there are more than 1500 instructions currently
>>>>>>>> in flight in the system. It could mean several things:
>>>>>>>>
>>>>>>>> 1. The value is somewhat arbitrarily defined and maybe there are
>>>>>>>> more than 1500 in your system at one time?
>>>>>>>>
>>>>>>>> 2. Instructions aren't being destroyed correctly
>>>>>>>>
>>>>>>>> You could try to run a debug binary so you'll get a list of
>>>>>>>> instructions when it happens or increase the number which may
>>>>>>>> be appropriate for certain situations (but 1500 is quite a few inflight
>>>>>>>> instructions).
>>>>>>>>
>>>>>>>> Ali
>>>>>>>>
>>>>>>>> On 13.03.2012 10:56, Andrew Cebulski wrote:
>>>>>>>>
>>>>>>>> Hi Xiangyu,
>>>>>>>> I just started looking into this some more. So at first I
>>>>>>>> thought it was due to updating to a more recent revision, but then I went
>>>>>>>> back to revision 8643, added your patch, built and ran....and now get the
>>>>>>>> error with it too (when running ARM_FS/gem5.opt). I'm testing now to see
>>>>>>>> if an update to SWIG might have resulted in this error, maybe someone on
>>>>>>>> the mailing list would know if that's possible. The difference is 1.3.40
>>>>>>>> vs. 2.0.3, both of which are supported according to the dependencies wiki
>>>>>>>> page.
>>>>>>>> Just for completeness, here's the error from revision 8643:
>>>>>>>> build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
>>>>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>>>>>> I have not tried running with gem5.debug, so I will be doing
>>>>>>>> that today. Maybe this is an assertion that is occurring due to an
>>>>>>>> optimization. That would mean it wouldn't be triggered in gem5.debug since
>>>>>>>> it runs without optimizations. Have you tested all debug, opt and fast
>>>>>>>> with your tests?
>>>>>>>> Thanks,
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>> On Tue, Mar 13, 2012 at 1:37 PM, Rio Xiangyu Dong <
>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Andrew,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I didn’t see this error in my simulations. May I ask which gem5
>>>>>>>>> version you are using? I find some of the latest code updates do not comply
>>>>>>>>> with my changes. I am still using the DRAMsim2 patch on Gem5 repo8643, and
>>>>>>>>> have run all the runnable benchmarks in SPEC2006, SPEC2000, EEMBC2, and
>>>>>>>>> PARSEC2 on ARM_SE.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thank you!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> Xiangyu
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *From:* Andrew Cebulski [mailto:***@drexel.edu]
>>>>>>>>> *Sent:* Thursday, March 08, 2012 6:52 PM
>>>>>>>>>
>>>>>>>>> *To:* gem5 users mailing list
>>>>>>>>> *Cc:****@gmail.com; ***@umich.edu
>>>>>>>>>
>>>>>>>>> *Subject:* Re: [gem5-users] A Patch for DRAMsim2 Integration
>>>>>>>>>
>>>>>>>>> Xiangyu,
>>>>>>>>>
>>>>>>>>> I've been having an issue recently with the number of
>>>>>>>>> instructions I've been seeing committed to the CPU (I have a separate
>>>>>>>>> thread on this). It turns out the issue seems to be coming from this patch
>>>>>>>>> you created to integrate DramSim2 with Gem5. Unfortunately, I've been
>>>>>>>>> running with gem5.fast, not gem5.opt. So up until now, I haven't been
>>>>>>>>> seeing assertions. I thought I'd run it with gem5.opt or debug back in
>>>>>>>>> December, but I must not have. My runs on the Arm O3 cpu fail with this
>>>>>>>>> assertion:
>>>>>>>>>
>>>>>>>>> build/ARM/cpu/base_dyn_inst_impl.hh:149: void
>>>>>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>>>>>>>
>>>>>>>>> -Andrew
>>>>>>>>>
Andrew Cebulski
2012-05-03 13:13:04 UTC
Permalink
I have some new debug output from the beginning of the figure I linked back
to in one of the previous emails. Where this trace starts, the
instructions in flight roughly shadow the size of the ROB (and CPU's
instruction list size). The first file contains just TLB and TLB walker
messages. The second file contains all debug output around the first TLB
walker at 0(656). It doesn't show up until 30M ticks into the trace, which
I started at tick 3.472*10e12.

http://dl.dropbox.com/u/2953302/gem5/tlb_walk.out.gz (Note: 15MB
compressed, 300MB uncompressed)
http://dl.dropbox.com/u/2953302/gem5/areaaroundfault.out

Also, I've sorted the dynamic instructions for this trace, comparing their
creation and destruction tick numbers. No instruction is completely lost,
just delayed up to a few hundred million ticks from being destroyed.

http://dl.dropbox.com/u/2953302/gem5/instdelaysort.csv
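
For anyone who wants to reproduce this kind of comparison on their own trace, here is a minimal sketch. It is not the script that produced the CSV above. It assumes each DynInst debug line has the shape "<tick>: <object>: DynInst: [sn:<seq>] Instruction created. ..." (or "destroyed."), which is a guess at the trace format, and the name lifetime_by_seqnum is hypothetical.

```python
import re

# Assumed line shape (not confirmed anywhere in this thread):
#   "<tick>: <object>: DynInst: [sn:<seq>] Instruction created. ..."
#   "<tick>: <object>: DynInst: [sn:<seq>] Instruction destroyed. ..."
LINE_RE = re.compile(
    r"^\s*(\d+):.*\[sn:(\d+)\]\s+Instruction\s+(created|destroyed)"
)


def lifetime_by_seqnum(lines):
    """Pair creation and destruction ticks per sequence number.

    Returns (rows, never_destroyed):
      rows            -- (seq, created_tick, destroyed_tick, delay) tuples,
                         sorted with the longest delay first
      never_destroyed -- sorted sequence numbers that were created but
                         have no matching "destroyed" line in the input
    """
    created = {}
    rows = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip anything that is not a DynInst lifetime line
        tick, seq, event = int(m.group(1)), int(m.group(2)), m.group(3)
        if event == "created":
            created[seq] = tick
        elif seq in created:
            start = created.pop(seq)
            rows.append((seq, start, tick, tick - start))
    rows.sort(key=lambda r: r[3], reverse=True)
    return rows, sorted(created)
```

Feeding it an open trace file (for example, a file object from gzip.open(path, "rt")) gives the longest-lived instructions first, and the second return value lists any sequence numbers that never show a destruction within the captured window.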

Note that I am also running a trace that starts before this point in the
benchmark; however, it's a much larger trace. I should have some results from
it later today. At the moment, I just plan on searching for any table walks to
either 0(656) or 0x4(656). Please let me know if you have any other
suggestions for what to parse/look into in the trace to email to the list.
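
A search like that can be sketched in a few lines of Python. This is only an illustration under assumptions: the message text "Starting hardware table walker for <addr>" is taken from the TLB output quoted earlier in this thread, but the leading "<tick>: <object>:" prefix is assumed, and find_walks is a hypothetical helper name.

```python
import re

# Assumed line shape:
#   "<tick>: <object>: TLB Miss: Starting hardware table walker for <addr>"
WALK_RE = re.compile(r"^\s*(\d+):.*table walker for (\S+)")


def find_walks(lines, targets=("0(656)", "0x4(656)")):
    """Return (tick, addr) for every table walk whose address is in targets."""
    hits = []
    for line in lines:
        m = WALK_RE.match(line)
        # Compare the whole address token so 0x40(656) does not match 0x4(656).
        if m and m.group(2) in targets:
            hits.append((int(m.group(1)), m.group(2)))
    return hits
```

Matching on the complete address token, rather than on a substring, keeps unrelated walks such as 0x2508c(656) out of the results. A compressed trace can be passed in directly as the file object returned by gzip.open(path, "rt").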

Thanks,
Andrew

On Wed, May 2, 2012 at 10:22 PM, Andrew Cebulski <***@drexel.edu> wrote:

> I double-checked by looking at the config.ini file. It turns out I did
> actually create the checkpoint with an Atomic CPU without caches. Sorry
> for the confusion.
>
> -Andrew
>
>
> On Wed, May 2, 2012 at 10:12 PM, Andrew Cebulski <***@drexel.edu> wrote:
>
>> I started hitting this assertion (that the number of insts in flight was
>> > 1500) before I started using a checkpoint. I created the checkpoint
>> afterwards to decrease the time needed to run simulations to debug this
>> problem. I'll create a new checkpoint, then send the new trace output.
>>
>> -Andrew
>>
>>
>> On Wed, May 2, 2012 at 9:53 PM, Ali Saidi <***@umich.edu> wrote:
>>
>>>
>>> It's likely the cause for all of your problems. Dirty data in the caches
>>> doesn't get restored either. You should always create checkpoints with an
>>> atomic cpu and without caches.
>>>
>>>
>>>
>>> Ali
>>>
>>>
>>>
>>> On 02.05.2012 21:23, Andrew Cebulski wrote:
>>>
>>> Sorry, I created the checkpoint I referred to with an O3 CPU with
>>> caches. From what I recall reading, caches don't get restored from
>>> checkpoints. Since the checkpoint wasn't during the benchmark run, I
>>> assumed that was okay.
>>> -Andrew
>>>
>>> On Wed, May 2, 2012 at 9:07 PM, Ali Saidi <***@umich.edu> wrote:
>>>
>>>> You haven't answered the question about if you created the
>>>> checkpoints with an atomic cpu without caches.
>>>>
>>>> Ali
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 02.05.2012 19:58, Andrew Cebulski wrote:
>>>>
>>>> I have not run with the checker CPU recently. Here's the stderr output
>>>> from a run I did awhile back:
>>>> http://dl.dropbox.com/u/2953302/gem5/err.0
>>>> Note that the instruction match error is before my benchmark actually
>>>> starts running. The start of my boot script checks to see if my files
>>>> image is mounted (which it is), then continues on to run the benchmark. I
>>>> booted the system, mounted my files image, then took a checkpoint. I've
>>>> been running all my tests from that checkpoint. I found where my benchmark
>>>> started based on the ASID (from ExecAsid debug flag).
>>>> I delayed the start of gathering trace data until the second-to-last
>>>> linear increase in dynamic instructions in-flight. I'm running a new trace
>>>> now.
>>>> -Andrew
>>>>
>>>>
>>>> On Wed, May 2, 2012 at 5:28 PM, Ali Saidi <***@umich.edu> wrote:
>>>>
>>>>> Something is wrong well before this point. There is no reason that
>>>>> address 0x0 or 0x4 should be translated.
>>>>>
>>>>> Did you happen to create a checkpoint when caches were in the system?
>>>>>
>>>>> Have you tried to run with the checker cpu and see if it detects any
>>>>> errors?
>>>>>
>>>>>
>>>>>
>>>>> Ali
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 02.05.2012 17:22, Andrew Cebulski wrote:
>>>>>
>>>>> They are data TLB misses that occur as the in-flight instruction count
>>>>> rises (at 0x0 and 0x4). The last TLB miss before the in-flight instruction
>>>>> count finally linearly decreases is to 0x200. Also, at the start of the
>>>>> rising slope, I see a miss to 0x8 and 0x2508c.
>>>>> Here's a trace file:
>>>>> http://dl.dropbox.com/u/2953302/gem5/tlb.out
>>>>> To reduce size, I just have lines that have either TLB or walker in
>>>>> them.
>>>>> I do see only a handful of instruction TLB misses.
>>>>> -Andrew
>>>>>
>>>>> On Wed, May 2, 2012 at 11:10 AM, Ali Saidi <***@umich.edu> wrote:
>>>>>
>>>>>> Hi Andrew,
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks for digging into this. I think there is an issue somewhere,
>>>>>> but I'm still not sure where.
>>>>>>
>>>>>> Ali
>>>>>>
>>>>>> On 01.05.2012 23:34, Andrew Cebulski wrote:
>>>>>>
>>>>>> Okay, I'm positive now that the issue lies with delayed translations
>>>>>> that are squashed before finishing.
>>>>>>
>>>>>> On the data or instruction side? You seem to allude to data in the
>>>>>> paragraph below, but then instructions in the latter text.
>>>>>>
>>>>>> It seems to me like speculative load/stores are being executed,
>>>>>> rather than waiting for the instructions to commit. Once the instructions
>>>>>> begin getting (speculatively) executed in the TLB, a reference is left
>>>>>> there, which seems hard to root out and dereference after the instruction
>>>>>> ends up being squashed. At least, I have not been able to find that out in
>>>>>> the source code as of yet. Can anyone clarify on this?
>>>>>>
>>>>>>
>>>>>>
>>>>>> There should only be one translation outstanding from each
>>>>>> instruction and data side walker. Any nested transactions should be queued
>>>>>> in the walker. Until one finishes, I'm not sure how multiple would ever be
>>>>>> outstanding.
>>>>>>
>>>>>> Recall the following image that shows how the number of dynamic
>>>>>> instruction (DynInst) objects in-flight increases linearly for varying
>>>>>> periods of time:
>>>>>>
>>>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>>>>>> After enabling the TLB debug flag, I see that the linear increase in
>>>>>> instructions in flight is proportional to the number of TLB misses. These
>>>>>> TLB misses have a much larger delay (resulting in translation delays) due
>>>>>> to the fact that DramSim2 models the memory system more accurately. It
>>>>>> seems that with the classic memory system, TLB misses often do not have
>>>>>> translation delays. For whatever reason, it would also seem that every
>>>>>> instruction that has a TLB miss also is eventually squashed...
>>>>>>
>>>>>> From a data side perspective this is reasonable. While a miss is
>>>>>> outstanding at some point instructions will stop committing and thus the
>>>>>> instructions in flight will begin to rise until the miss is satisfied.
>>>>>>
>>>>>> Here's a summary of outputs from my trace. These two DPRINTF
>>>>>> messages appear on the rising slopes (repeated up until the peak):
>>>>>> TLB Miss: Starting hardware table walker for 0(656)
>>>>>> TLB Miss: Starting hardware table walker for 0x4(656)
>>>>>>
>>>>>> This is interesting/odd. I don't know a good reason why (1) a miss
>>>>>> would be outstanding to both address 0 and address 4 at the same time. In
>>>>>> almost all cases these pages are marked as no-access to detect segfaults.
>>>>>> Perhaps there is an issue where the cpu is getting into a loop faulting on
>>>>>> a bad access and then faulting again on the fault handler. I could imagine
>>>>>> this would happen if there was some corruption in the memory system (for
>>>>>> example the timings in dramsim exposing a bug in the cache models or
>>>>>> something).
>>>>>>
>>>>>>
>>>>>> At the peak, the following message appears (from fetch) almost every
>>>>>> tick for (what I believe to be) every single one of the table walkers that
>>>>>> were squashed.
>>>>>> Fetch is waiting ITLB walk to finish!
>>>>>>
>>>>>> There must be another walk in flight? The instruction side will
>>>>>> only have one fault outstanding at once. Successive branch mispredicts will
>>>>>> re-direct fetch but there is code that catches the fact that a different
>>>>>> walk completed than expected and "does the right thing."
>>>>>>
>>>>>> The problem is that these ITLB table walks are for instructions
>>>>>> that were squashed as much as 0.3 billion cycles earlier, and have since been
>>>>>> removed from the CPU's instruction list.
>>>>>>
>>>>>> I'm not following here.
>>>>>>
>>>>>> Any help will be greatly appreciated in solving this problem. I've
>>>>>> hit a roadblock with getting Ruby working with ARM, most likely due to the
>>>>>> fact that ARM has disjoint memory (x86 and Alpha do not). There's the 256
>>>>>> MB for physical memory, then the 64 MB for the boot loader. I brought this
>>>>>> up in my last email about trying to get Ruby working. Therefore, I'm
>>>>>> trying to get this DramSim2 integration fixed so I can start modeling FS
>>>>>> with DRAM memory.
>>>>>>
>>>>>> Brad/Steve/Nilay anyone have a suggestion on how to make this work?
>>>>>>
>>>>>>
>>>>>> Note that these problems also occur in Soplex from the Spec CPU2006
>>>>>> benchmark suite (also hits 1500 in-flight instructions assertion). Due to
>>>>>> time constraints, I haven't tested on other benchmarks.
>>>>>> Thanks,
>>>>>> Andrew
>>>>>> On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski <***@drexel.edu> wrote:
>>>>>>
>>>>>>> Hey Gabe,
>>>>>>> Thanks for this...very helpful. I just recently got back into
>>>>>>> debugging this problem. I made a small change in src/base/refcnt.hh to
>>>>>>> allow me to return the current count of references to a DynInst object.
>>>>>>> I then modified existing DPRINTFs to also print out reference
>>>>>>> counts, then added some of my own when I needed extra visibility.
>>>>>>> I've found one memory store instruction that seems to be getting
>>>>>>> lost. What's happening is that it progresses as far as getting executed in
>>>>>>> the IEW once, but a delayed translation occurs, deferring the store. By
>>>>>>> the time it reenters the IEW, the IQ has marked the instruction as
>>>>>>> squashed. Everything progresses as usual from here on out, with one
>>>>>>> exception. When the instruction is removed from the CPU's instruction list,
>>>>>>> there is one reference count hanging.
>>>>>>> I've added in some additional debugging for my traces to help
>>>>>>> narrow down where this reference is coming from. As far as I can tell,
>>>>>>> it's because of a call to initiateAcc() within the executeStore function in
>>>>>>> the lsq unit. Please see the following two traces. The first trace shows
>>>>>>> what I just discussed. The second trace is another memory store
>>>>>>> instruction that got squashed, however, it was squashed upon its first
>>>>>>> entry into the IEW, therefore it never started execution.
>>>>>>> http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
>>>>>>> http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
>>>>>>> Let me know if you have any ideas based on these two instruction
>>>>>>> traces. I do not understand how the initiateAcc function results in
>>>>>>> another reference, but maybe someone else does.... Since I don't see how
>>>>>>> it makes a reference, it's hard to find out how to make sure it gets
>>>>>>> dereferenced...
>>>>>>> Unfortunately, I haven't been able to add a DPRINTF in
>>>>>>> src/base/refcnt.hh ...this would make things more clear (i.e. exactly when
>>>>>>> references/dereferences occur). Let me know if you have any advice on
>>>>>>> this...if it's possible. I can't seem to get the right include files, and
>>>>>>> likely right SConscript compile order...
>>>>>>> Thanks,
>>>>>>> Andrew
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Apr 7, 2012 at 9:48 PM, Gabe Black <***@eecs.umich.edu> wrote:
>>>>>>>
>>>>>>>> Without digging into things too deeply, it looks like you may be
>>>>>>>> leaking references to dynamic instructions. The CPU may think it's done
>>>>>>>> with one, but until that final reference is removed, the object will hang
>>>>>>>> around forever. I think I've had problems before where the reference
>>>>>>>> count ended up off by one somehow and instructions would start piling up.
>>>>>>>> It's also possible that a clog develops in O3's pipeline and some internal
>>>>>>>> structure stops letting instructions through and starts accumulating them.
>>>>>>>> Either of these problems will be annoying to track down, but with enough
>>>>>>>> digging I've been able to fix these sorts of things.
>>>>>>>>
>>>>>>>> This may have more to do with O3 not handling the benchmark you're
>>>>>>>> running well rather than a problem with your new DRAM model. There may be
>>>>>>>> some interaction between the two, though, where the new memory makes the
>>>>>>>> timing line up to cause O3 to behave poorly. What you can do is instrument
>>>>>>>> dynamic instruction creation and destruction and reference counting (try
>>>>>>>> printing "this" for both the reference counting wrapper and the dyn inst
>>>>>>>> itself) and turn it on as close as you can to where things go bad tick
>>>>>>>> wise. Then look for an instruction which gets lost, and look for where its
>>>>>>>> reference count is incremented and decremented. It should be relatively
>>>>>>>> easy to pair up where references are created and destroyed, and you should
>>>>>>>> be able to identify the reference which never goes away. Then you need to
>>>>>>>> figure out where that reference is being created. After that, you should
>>>>>>>> have enough information to identify why the reference counting isn't being
>>>>>>>> done correctly. It's arduous, but that's the only way.
>>>>>>>>
>>>>>>>> It's important to also make sure reference counts aren't
>>>>>>>> decremented to zero prematurely. I had a problem once where that happened
>>>>>>>> and the memory behind the object was updated by something that didn't know
>>>>>>>> it was dead. The memory had since been reallocated to another object of the
>>>>>>>> same type, so that other object reflected what happened to the phantom one.
>>>>>>>> If I remember that manifested as something weird like an add causing a page
>>>>>>>> fault or something.
>>>>>>>>
>>>>>>>> Gabe
>>>>>>>>
>>>>>>>>
>>>>>>>> On 04/07/12 18:21, Andrew Cebulski wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>> I've looked into this problem some more, and have put together a
>>>>>>>> couple traces. I've been becoming more familiar with how gem5 handles
>>>>>>>> dynamic instructions, in particular how it destroys them. I have two
>>>>>>>> traces to compare, one with the physical memory, and the other with the
>>>>>>>> integrated dramsim2 dram memory. I also have two plots showing instruction
>>>>>>>> counts over time (sim ticks). All of these are linked at the end of the
>>>>>>>> email.
>>>>>>>> First, I'm going to go into what I've been able to interpret
>>>>>>>> regarding how instructions are destroyed. In particular, comparing when
>>>>>>>> DynInst's vs. DynInstPtr's are deconstructed/removed from the cpu. I
>>>>>>>> separate these because I've seen a difference, as I discuss later. These
>>>>>>>> explanations are fairly non-existent on the wiki. There is a section
>>>>>>>> header waiting to be filled...
>>>>>>>> From what I have been able to gather from the code, there is a list
>>>>>>>> of all the instructions in flight in cpu/o3/cpu.cc called instList, with
>>>>>>>> the type DynInstPtr. There are three conditions to instructions being
>>>>>>>> cleaned from this list:
>>>>>>>> 1.) The ROB retires its head instruction
>>>>>>>> 2.) Fetch receives a rob squashing signal from the commit,
>>>>>>>> resulting in removing any instruction not in the ROB
>>>>>>>> 3.) Decode detects an incorrect branch prediction, resulting in
>>>>>>>> removal of all instructions back to the bad seq num.
>>>>>>>> Once all five stages have completed, the CPU cleans up all the
>>>>>>>> removed in-flight instructions. This line in particular
>>>>>>>> in cleanUpRemovedInsts() in cpu/o3/cpu.cc deconstructs a DynInstPtr:
>>>>>>>> instList.erase(removeList.front());
>>>>>>>> When I turn on the debug flag O3CPU, I see the message "Removing
>>>>>>>> instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum and pcState
>>>>>>>> after all 5 cpu stages have completed, and one of the conditions above is
>>>>>>>> met. I also see what tick it occurs on.
>>>>>>>> When I turn on the DynInst debug flag, I see when instructions are
>>>>>>>> created and destroyed (cpu/base_dyn_inst_impl.hh) and what tick. From
>>>>>>>> analyzing the trace files, I've gathered that this takes into account that
>>>>>>>> instructions have different execution lengths. So if one tick a memory
>>>>>>>> instruction in the instList (DynInstPtr) is removed, the DynInst for that
>>>>>>>> memory instruction will occur much later (i.e. 1M ticks later). I have yet
>>>>>>>> to determine how this is implemented.
>>>>>>>> Now for the problem.
>>>>>>>> What I'm seeing when I run dramsim2 dram memory is a significant
>>>>>>>> difference between the size of the instList vector (of DynInstPtr objects),
>>>>>>>> and the size of dynamic instruction count (of DynInst objects). The
>>>>>>>> benchmark I'm running is libquantum from SPEC 2006. For the first roughly
>>>>>>>> 130B ticks, the dynamic instruction count kept in cpu/base_dyn_inst_impl.hh
>>>>>>>> shadows the instList size in o3/cpu.cc (figure linked below) very closely.
>>>>>>>> Around tick 130B after libquantum started, it starts hitting what I'm
>>>>>>>> assuming are loops (therefore branch prediction), resulting in some
>>>>>>>> behavior that seems to imply improper instruction handling (i.e. more
>>>>>>>> instructions in flight than allowed by ROB).
>>>>>>>> I wasn't able to sync-up the physical and dramsim2 traces exactly
>>>>>>>> by trace, but they should represent roughly the same area of execution.
>>>>>>>> They don't execute the same due to the dramsim2 modeling the memory
>>>>>>>> differently (i.e. latency and other delays).
>>>>>>>> I've shared both traces on my public Dropbox here --
>>>>>>>>
>>>>>>>> http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
>>>>>>>>
>>>>>>>> http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
>>>>>>>> Here are a couple plots of tick versus instruction count, with
>>>>>>>> respect to cpu->instcount in cpu/base_dyn_inst_impl.hh and instList.size()
>>>>>>>> in cpu/o3/cpu.cc. --
>>>>>>>>
>>>>>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
>>>>>>>>
>>>>>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>>>>>>>> Note that I added the printout of the instList size to an existing
>>>>>>>> O3CPU DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.
>>>>>>>> Here are the commands I ran to parse the traces into data files to
>>>>>>>> analyze in MATLAB and create the plots:
>>>>>>>> zgrep DynInst dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | grep destroyed | awk '{print $1,$11}' > cpuinstcount.out
>>>>>>>> zgrep instList dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | awk '{print $1,$11}' > instlistsize.out
>>>>>>>> It seems to me like the problem might lie in gem5, but has just
>>>>>>>> been exposed by integrating this more detailed memory model, dramsim2, into
>>>>>>>> gem5. Either that, or there are some timing errors in how dramsim2 was
>>>>>>>> integrated. I doubt this, however, since those first 190B ticks executed
>>>>>>>> used the dramsim2 memory. I believe the problem is a combination of memory
>>>>>>>> instructions + complex loops (branch prediction), resulting in improper
>>>>>>>> destroying of instructions.
>>>>>>>> I've included the ROB, Commit, Fetch, DynInst and O3CPU debug
>>>>>>>> flags. There are 192 ROB entries, which is why the instList size generally
>>>>>>>> has a max of about 192 instructions. The dynamic instruction counts (seen
>>>>>>>> in the dramsim2 plot) seem to also imply that instructions are incorrectly
>>>>>>>> being removed from the ROB, and then from the cpu's instruction list in
>>>>>>>> cpu.cc, which allows more and more instructions to be added to the system
>>>>>>>> (possibly from a bad branch).
>>>>>>>> I appreciate any help in debugging this and further figuring out
>>>>>>>> the root problem, just let me know if you need anything else from me. I
>>>>>>>> don't have much more time at the moment to debug, but I can take any advice
>>>>>>>> for quick changes and/or additional traces, then send the results back to
>>>>>>>> the list for discussion.
>>>>>>>> Thanks,
>>>>>>>> Andrew
>>>>>>>> P.S. Paul - I did try decreasing the size of the dramsim2
>>>>>>>> transaction (and even command) queue from 512 to 32. The same instructions
>>>>>>>> problem occurred. It basically just decreased the execution time.
>>>>>>>>
>>>>>>>> On Wed, Mar 14, 2012 at 2:10 PM, Ali Saidi <***@umich.edu> wrote:
>>>>>>>>
>>>>>>>>> The error is that there are more than 1500 instructions
>>>>>>>>> currently in flight in the system. It could mean several things:
>>>>>>>>>
>>>>>>>>> 1. The value is somewhat arbitrarily defined and maybe there are
>>>>>>>>> more than 1500 in your system at one time?
>>>>>>>>>
>>>>>>>>> 2. Instructions aren't being destroyed correctly
>>>>>>>>>
>>>>>>>>> You could try to run a debug binary so you'll get a list of
>>>>>>>>> instructions when it happens or increase the number which may
>>>>>>>>> be appropriate for certain situations (but 1500 is quite a few inflight
>>>>>>>>> instructions).
>>>>>>>>>
>>>>>>>>> Ali
>>>>>>>>>
>>>>>>>>> On 13.03.2012 10:56, Andrew Cebulski wrote:
>>>>>>>>>
>>>>>>>>> Hi Xiangyu,
>>>>>>>>> I just started looking into this some more. So at first I
>>>>>>>>> thought it was due to updating to a more recent revision, but then I went
>>>>>>>>> back to revision 8643, added your patch, built and ran....and now get the
>>>>>>>>> error with it too (when running ARM_FS/gem5.opt). I'm testing now to see
>>>>>>>>> if an update to SWIG might have resulted in this error, maybe someone on
>>>>>>>>> the mailing list would know if that's possible. The difference is 1.3.40
>>>>>>>>> vs. 2.0.3, both of which are supported according to the dependencies wiki
>>>>>>>>> page.
>>>>>>>>> Just for completeness, here's the error from revision 8643:
>>>>>>>>> build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
>>>>>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>>>>>>> I have not tried running with gem5.debug, so I will be doing
>>>>>>>>> that today. Maybe this is an assertion that is occurring due to an
>>>>>>>>> optimization. That would mean it wouldn't be triggered in gem5.debug since
>>>>>>>>> it runs without optimizations. Have you tested all debug, opt and fast
>>>>>>>>> with your tests?
>>>>>>>>> Thanks,
>>>>>>>>> Andrew
>>>>>>>>>
>>>>>>>>> On Tue, Mar 13, 2012 at 1:37 PM, Rio Xiangyu Dong <
>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Andrew,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I didn’t see this error in my simulations. May I ask which gem5
>>>>>>>>>> version you are using? I find some of the latest code updates do not comply
>>>>>>>>>> with my changes. I am still using the DRAMsim2 patch on Gem5 repo8643, and
>>>>>>>>>> have run all the runnable benchmarks in SPEC2006, SPEC2000, EEMBC2, and
>>>>>>>>>> PARSEC2 on ARM_SE.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thank you!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>>
>>>>>>>>>> Xiangyu
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *From:* Andrew Cebulski [mailto:***@drexel.edu]
>>>>>>>>>> *Sent:* Thursday, March 08, 2012 6:52 PM
>>>>>>>>>>
>>>>>>>>>> *To:* gem5 users mailing list
>>>>>>>>>> *Cc:* ***@gmail.com; ***@umich.edu
>>>>>>>>>>
>>>>>>>>>> *Subject:* Re: [gem5-users] A Patch for DRAMsim2 Integration
>>>>>>>>>>
>>>>>>>>>> Xiangyu,
>>>>>>>>>>
>>>>>>>>>> I've been having an issue recently with the number of
>>>>>>>>>> instructions I've been seeing committed to the CPU (I have a separate
>>>>>>>>>> thread on this). It turns out the issue seems to be coming from this patch
>>>>>>>>>> you created to integrate DramSim2 with Gem5. Unfortunately, I've been
>>>>>>>>>> running with gem5.fast, not gem5.opt. So up until now, I haven't been
>>>>>>>>>> seeing assertions. I thought I'd run it with gem5.opt or debug back in
>>>>>>>>>> December, but I must not have. My runs on the Arm O3 cpu fails with this
>>>>>>>>>> assertion:
>>>>>>>>>>
>>>>>>>>>> build/ARM/cpu/base_dyn_inst_impl.hh:149: void
>>>>>>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>>>>>>>>
>>>>>>>>>> -Andrew
>>>>>>>>>>
>>>>>>>>>> Date: Sun, 18 Dec 2011 01:48:58 -0800
>>>>>>>>>> From: "Dong, Xiangyu" <***@gmail.com>
>>>>>>>>>> To: "gem5 users mailing list" <gem5-***@gem5.org>
>>>>>>>>>> Subject: [gem5-users] A Patch for DRAMsim2 Integration
>>>>>>>>>> Message-ID: <001201ccbd6a$45102210$cf306630$@gmail.com>
>>>>>>>>>>
>>>>>>>>>> Content-Type: text/plain; charset="us-ascii"
>>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I have a Gem5+DRAMsim2 patch. I've tested it under both SE and
>>>>>>>>>> FS modes.
>>>>>>>>>> I'm willing to share it here.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> For those who have such needs, please go to my website
>>>>>>>>>> www.cse.psu.edu/~xydong <http://www.cse.psu.edu/%7Exydong> to
>>>>>>>>>> download the patch and test it. To enable
>>>>>>>>>> DRAMSim2, use se_dramsim2.py script instead of se.py (for FS, you
>>>>>>>>>> can create
>>>>>>>>>> by yourself). The basic idea to enable the DRAMsim2 module is to
>>>>>>>>>> use the
>>>>>>>>>> derived DRAMMemory class instead of PhysicalMemory class.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Please let me know if there are bugs.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thank you!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>>
>>>>>>>>>> Xiangyu Dong
>>>>>>>>>>
>>>>>>>>>> -------------- next part --------------
>>>>>>>>>> An HTML attachment was scrubbed...
>>>>>>>>>> URL: <
>>>>>>>>>> http://m5sim.org/cgi-bin/mailman/private/gem5-users/attachments/20111218/f3fdf5da/attachment.html
>>>>>>>>>> >
>>>>>>>>>>
>>>>>>>>>>
Seongil O
2012-05-04 11:29:55 UTC
Permalink
Hi,

Is there a DRAMsim2 patch that supports checkpointing?

If you know of one, please let me know where to find it.

Recently, one of my coworkers integrated DRAMsim2 into gem5-stable using the
DRAMsim2 patch.

However, the patch does not contain serialize() and unserialize() methods for
DRAMsim2, which means all internal stats of DRAMsim2 are not checkpointed.

Thanks.

2012/5/3 Andrew Cebulski <***@drexel.edu>

> I have some new debug output from the beginning of the figure I linked
> back to in one of the previous emails. Where this trace starts, the
> instructions in flight roughly shadow the size of the ROB (and CPU's
> instruction list size). The first file contains just TLB and TLB walker
> messages. The second file contains all debug output around the first TLB
> walker at 0(656). It doesn't show up until 30M ticks into the trace, which
> I started at tick 3.472*10e12.
>
> http://dl.dropbox.com/u/2953302/gem5/tlb_walk.out.gz (Note: 15MB
> compressed, 300MB uncompressed)
> http://dl.dropbox.com/u/2953302/gem5/areaaroundfault.out
>
> Also, I've sorted the dynamic instructions for this trace, comparing their
> creation and destruction tick numbers. No instruction is completely lost,
> just delayed up to a few hundred million ticks from being destroyed.
>
> http://dl.dropbox.com/u/2953302/gem5/instdelaysort.csv
>
> Note, I am running a trace from before this point in the benchmark,
> however, it's a much larger trace. I should have some results from it
> later today. At the moment, I just plan on searching for any table walks to
> either 0(656) or 0x4(656). Please let me know if you have any other
> suggestions for what to parse/look into in the trace to email to the list.
>
> Thanks,
> Andrew
>
> On Wed, May 2, 2012 at 10:22 PM, Andrew Cebulski <***@drexel.edu> wrote:
>
>> I double-checked by looking at the config.ini file. It turns out I did
>> actually create the checkpoint with an Atomic CPU without caches. Sorry
>> for the confusion.
>>
>> -Andrew
>>
>>
>> On Wed, May 2, 2012 at 10:12 PM, Andrew Cebulski <***@drexel.edu> wrote:
>>
>>> I started hitting this assertion (that the number of insts in flight was
>>> > 1500) before I started using a checkpoint. I created the checkpoint
>>> afterwards to decrease the time needed to run simulations to debug this
>>> problem. I'll create a new checkpoint, then send the new trace output.
>>>
>>> -Andrew
>>>
>>>
>>> On Wed, May 2, 2012 at 9:53 PM, Ali Saidi <***@umich.edu> wrote:
>>>
>>>>
>>>> It's likely the cause for all of your problems. Dirty data in the
>>>> caches doesn't get restored either. You should always create checkpoints
>>>> with an atomic cpu and without caches.
>>>>
>>>>
>>>>
>>>> Ali
>>>>
>>>>
>>>>
>>>> On 02.05.2012 21:23, Andrew Cebulski wrote:
>>>>
>>>> Sorry, I created the checkpoint I referred to with an O3 CPU with
>>>> caches. From what I recall reading, caches don't get restored from
>>>> checkpoints. Since the checkpoint wasn't during the benchmark run, I
>>>> assumed that was okay.
>>>> -Andrew
>>>>
>>>> On Wed, May 2, 2012 at 9:07 PM, Ali Saidi <***@umich.edu> wrote:
>>>>
>>>>> You haven't answered the question about if you created the
>>>>> checkpoints with an atomic cpu without caches.
>>>>>
>>>>> Ali
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 02.05.2012 19:58, Andrew Cebulski wrote:
>>>>>
>>>>> I have not run with the checker CPU recently. Here's the stderr
>>>>> output from a run I did awhile back:
>>>>> http://dl.dropbox.com/u/2953302/gem5/err.0
>>>>> Note that the instruction match error is before my benchmark actually
>>>>> starts running. The start of my boot script checks to see if my files
>>>>> image is mounted (which it is), then continues on to run the benchmark. I
>>>>> booted the system, mounted my files image, then took a checkpoint. I've
>>>>> been running all my tests from that checkpoint. I found where my benchmark
>>>>> started based on the ASID (from ExecAsid debug flag).
>>>>> I delayed the start of gathering trace data until the second-to-last
>>>>> linear increase in dynamic instructions in-flight. I'm running a new trace
>>>>> now.
>>>>> -Andrew
>>>>>
>>>>>
>>>>> On Wed, May 2, 2012 at 5:28 PM, Ali Saidi <***@umich.edu> wrote:
>>>>>
>>>>>> Something is wrong well before this point. There is no reason that
>>>>>> address 0x0 or 0x4 should be translated.
>>>>>>
>>>>>> Did you happen to create a checkpoint when caches were in the system?
>>>>>>
>>>>>> Have you tried to run with the checker cpu and see if it detects any
>>>>>> errors?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Ali
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 02.05.2012 17:22, Andrew Cebulski wrote:
>>>>>>
>>>>>> They are data TLB misses that occur as the in-flight instruction
>>>>>> count rises (at 0x0 and 0x4). The last TLB miss before the in-flight
>>>>>> instruction count finally linearly decreases is to 0x200. Also, at the
>>>>>> start of the rising slope, I see a miss to 0x8 and 0x2508c.
>>>>>> Here's a trace file:
>>>>>> http://dl.dropbox.com/u/2953302/gem5/tlb.out
>>>>>> To reduce size, I just have lines that have either TLB or walker in
>>>>>> them.
>>>>>> I do see only a handful of instruction TLB misses.
>>>>>> -Andrew
>>>>>>
>>>>>> On Wed, May 2, 2012 at 11:10 AM, Ali Saidi <***@umich.edu> wrote:
>>>>>>
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks for digging into this. I think there is an issue somewhere,
>>>>>>> but I'm still not sure where.
>>>>>>>
>>>>>>> Ali
>>>>>>>
>>>>>>> On 01.05.2012 23:34, Andrew Cebulski wrote:
>>>>>>>
>>>>>>> Okay, I'm positive now that the issue lies with delayed translations
>>>>>>> that are squashed before finishing.
>>>>>>>
>>>>>>> On the data or instruction side? You seem to allude to data in the
>>>>>>> paragraph below, but then instructions in the latter text.
>>>>>>>
>>>>>>> It seems to me like speculative load/stores are being executed,
>>>>>>> rather than waiting for the instructions to commit. Once the instructions
>>>>>>> begin getting (speculatively) executed in the TLB, a reference is left
>>>>>>> there, which seems hard to root out and dereference after the instruction
>>>>>>> ends up being squashed. At least, I have not been able to find that out in
>>>>>>> the source code as of yet. Can anyone clarify on this?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> There should only be one translation outstanding from each
>>>>>>> instruction and data side walker. Any nested transactions should be queued
>>>>>>> in the walker. Until one finishes, I'm not sure how multiple would ever be
>>>>>>> outstanding.
>>>>>>>
>>>>>>> Recall the following image that shows how the number of dynamic
>>>>>>> instruction (DynInst) objects in-flight increases linearly for varying
>>>>>>> periods of time:
>>>>>>>
>>>>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>>>>>>> After enabling the TLB debug flag, I see that the linear increase in
>>>>>>> instructions in flight is proportional to the number of TLB misses. These
>>>>>>> TLB misses have a much larger delay (resulting in translation delays) due
>>>>>>> to the fact that DramSim2 models the memory system more accurately. It
>>>>>>> seems that with the classic memory system, TLB misses often do not have
>>>>>>> translation delays. For whatever reason, it would also seem that every
>>>>>>> instruction that has a TLB miss also is eventually squashed...
>>>>>>>
>>>>>>> From a data side perspective this is reasonable. While a miss is
>>>>>>> outstanding at some point instructions will stop committing and thus the
>>>>>>> instructions in flight will begin to rise until the miss is satisfied.
>>>>>>>
>>>>>>> Here's a summary of outputs from my trace. These two DPRINTF
>>>>>>> messages appear on the rising slopes (repeated up until the peak):
>>>>>>> TLB Miss: Starting hardware table walker for 0(656)
>>>>>>> TLB Miss: Starting hardware table walker for 0x4(656)
>>>>>>>
>>>>>>> This is interesting/odd. I don't know a good reason why (1) a miss
>>>>>>> would be outstanding to both address 0 and address 4 at the same time. In
>>>>>>> almost all cases these pages are marked as no-access to detect segfaults.
>>>>>>> Perhaps there is an issue where the cpu is getting into a loop faulting on
>>>>>>> a bad access and then faulting again on the fault handler. I could imagine
>>>>>>> this would happen if there was some corruption in the memory system (for
>>>>>>> example the timings in dramsim exposing a bug in the cache models or
>>>>>>> something).
>>>>>>>
>>>>>>>
>>>>>>> At the peak, the following message appears (from fetch) almost every
>>>>>>> tick for (what I believe to be) every single one of the table walkers that
>>>>>>> were squashed.
>>>>>>> Fetch is waiting ITLB walk to finish!
>>>>>>>
>>>>>>> There must be another walk in flight? The instruction side will
>>>>>>> only have one fault outstanding at once. Successive branch mispredicts will
>>>>>>> re-direct fetch but there is code that catches the fact that a different
>>>>>>> walk completed than expected and "does the right thing."
>>>>>>>
>>>>>>> The problem is that these ITLB table walks are for instructions
>>>>>>> that were squashed as much as 0.3 billion cycles earlier, and have since been
>>>>>>> removed from the CPU's instruction list.
>>>>>>>
>>>>>>> I'm not following here.
>>>>>>>
>>>>>>> Any help will be greatly appreciated in solving this problem.
>>>>>>> I've hit a roadblock with getting Ruby working with ARM, most likely due
>>>>>>> to the fact that ARM has disjoint memory (x86 and Alpha do not). There's
>>>>>>> the 256 MB for physical memory, then the 64 MB for the boot loader. I
>>>>>>> brought this up in my last email about trying to get Ruby working.
>>>>>>> Therefore, I'm trying to get this DramSim2 integration fixed so I can
>>>>>>> start modeling FS with DRAM memory.
>>>>>>>
>>>>>>> Brad/Steve/Nilay anyone have a suggestion on how to make this work?
>>>>>>>
>>>>>>>
>>>>>>> Note that these problems also occur in Soplex from the SPEC CPU2006
>>>>>>> benchmark suite (also hits 1500 in-flight instructions assertion). Due to
>>>>>>> time constraints, I haven't tested on other benchmarks.
>>>>>>> Thanks,
>>>>>>> Andrew
>>>>>>> On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski <***@drexel.edu> wrote:
>>>>>>>
>>>>>>>> Hey Gabe,
>>>>>>>> Thanks for this...very helpful. I just recently got back into
>>>>>>>> debugging this problem. I made a small change in src/base/refcnt.hh to
>>>>>>>> allow me to return the current count of references to a DynInst object.
>>>>>>>> I then modified existing DPRINTFs to also print out reference
>>>>>>>> counts, then added some of my own when I needed extra visibility.
>>>>>>>> I've found one memory store instruction that seems to be
>>>>>>>> getting lost. What's happening is that it progresses as far as getting
>>>>>>>> executed in the IEW once, but a delayed translation occurs, deferring the
>>>>>>>> store. By the time it reenters the IEW, the IQ has marked the instruction
>>>>>>>> as squashed. Everything progresses as usual from here on out, with one
>>>>>>>> exception. When the instruction is removed from the CPU's instruction list,
>>>>>>>> there is one reference count hanging.
>>>>>>>> I've added in some additional debugging for my traces to help
>>>>>>>> narrow down where this reference is coming from. As far as I can tell,
>>>>>>>> it's because of a call to initiateAcc() within the executeStore function in
>>>>>>>> the lsq unit. Please see the following two traces. The first trace shows
>>>>>>>> what I just discussed. The second trace is another memory store
>>>>>>>> instruction that got squashed, however, it was squashed upon its first
>>>>>>>> entry into the IEW, therefore it never started execution.
>>>>>>>> http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
>>>>>>>> http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
>>>>>>>> Let me know if you have any ideas based on these two
>>>>>>>> instruction traces. I do not understand how the initiateAcc function
>>>>>>>> results in another reference, but maybe someone else does.... Since I
>>>>>>>> don't see how it makes a reference, it's hard to find out how to make sure
>>>>>>>> it gets dereferenced...
>>>>>>>> Unfortunately, I haven't been able to add a DPRINTF in
>>>>>>>> src/base/refcnt.hh ...this would make things more clear (i.e. exactly when
>>>>>>>> references/dereferences occur). Let me know if you have any advice on
>>>>>>>> this...if it's possible. I can't seem to get the right include files, and
>>>>>>>> likely right SConscript compile order...
>>>>>>>> Thanks,
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Apr 7, 2012 at 9:48 PM, Gabe Black <***@eecs.umich.edu> wrote:
>>>>>>>>
>>>>>>>>> Without digging into things too deeply, it looks like you may be
>>>>>>>>> leaking references to dynamic instructions. The CPU may think it's done
>>>>>>>>> with one, but until that final reference is removed, the object will hang
>>>>>>>>> around forever. I think I've had problems before where the reference
>>>>>>>>> count ended up off by one somehow and instructions would start piling up.
>>>>>>>>> It's also possible that a clog develops in O3's pipeline and some internal
>>>>>>>>> structure stops letting instructions through and starts accumulating them.
>>>>>>>>> Either of these problems will be annoying to track down, but with enough
>>>>>>>>> digging I've been able to fix these sorts of things.
>>>>>>>>>
>>>>>>>>> This may have more to do with O3 not handling the benchmark you're
>>>>>>>>> running well rather than a problem with your new DRAM model. There may be
>>>>>>>>> some interaction between the two, though, where the new memory makes the
>>>>>>>>> timing line up to cause O3 to behave poorly. What you can do is instrument
>>>>>>>>> dynamic instruction creation and destruction and reference counting (try
>>>>>>>>> printing "this" for both the reference counting wrapper and the dyn inst
>>>>>>>>> itself) and turn it on as close as you can to where things go bad tick
>>>>>>>>> wise. Then look for an instruction which gets lost, and look for where its
>>>>>>>>> reference count is incremented and decremented. It should be relatively
>>>>>>>>> easy to pair up where references are created and destroyed, and you should
>>>>>>>>> be able to identify the reference which never goes away. Then you need to
>>>>>>>>> figure out where that reference is being created. After that, you should
>>>>>>>>> have enough information to identify why the reference counting isn't being
>>>>>>>>> done correctly. It's arduous, but that's the only way.
>>>>>>>>>
>>>>>>>>> It's important to also make sure reference counts aren't
>>>>>>>>> decremented to zero prematurely. I had a problem once where that happened
>>>>>>>>> and the memory behind the object was updated by something that didn't know
>>>>>>>>> it was dead. The memory had since been reallocated to another object of the
>>>>>>>>> same type, so that other object reflected what happened to the phantom one.
>>>>>>>>> If I remember that manifested as something weird like an add causing a page
>>>>>>>>> fault or something.
>>>>>>>>>
>>>>>>>>> Gabe
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 04/07/12 18:21, Andrew Cebulski wrote:
>>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>> I've looked into this problem some more, and have put together a
>>>>>>>>> couple traces. I've been becoming more familiar with how gem5 handles
>>>>>>>>> dynamic instructions, in particular how it destroys them. I have two
>>>>>>>>> traces to compare, one with the physical memory, and the other with the
>>>>>>>>> integrated dramsim2 dram memory. I also have two plots showing instruction
>>>>>>>>> counts over time (sim ticks). All of these are linked at the end of the
>>>>>>>>> email.
>>>>>>>>> First, I'm going to go into what I've been able to interpret
>>>>>>>>> regarding how instructions are destroyed. In particular, comparing when
>>>>>>>>> DynInst's vs. DynInstPtr's are deconstructed/removed from the cpu. I
>>>>>>>>> separate these because I've seen a difference, as I discuss later. These
>>>>>>>>> explanations are fairly non-existent on the wiki. There is a section
>>>>>>>>> header waiting to be filled...
>>>>>>>>> From what I have been able to gather from the code, there is a
>>>>>>>>> list of all the instructions in flight in cpu/o3/cpu.cc called instList,
>>>>>>>>> with the type DynInstPtr. There are three conditions to instructions being
>>>>>>>>> cleaned from this list:
>>>>>>>>> 1.) The ROB retires its head instruction
>>>>>>>>> 2.) Fetch receives a rob squashing signal from the commit,
>>>>>>>>> resulting in removing any instruction not in the ROB
>>>>>>>>> 3.) Decode detects an incorrect branch prediction, resulting in
>>>>>>>>> removal of all instructions back to the bad seq num.
>>>>>>>>> Once all five stages have completed, the CPU cleans up all the
>>>>>>>>> removed in-flight instructions. This line in particular
>>>>>>>>> in cleanUpRemovedInsts() in cpu/o3/cpu.cc deconstructs a DynInstPtr:
>>>>>>>>> instList.erase(removeList.front());
>>>>>>>>> When I turn on the debug flag O3CPU, I see the message "Removing
>>>>>>>>> instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum and pcState
>>>>>>>>> after all 5 cpu stages have completed, and one of the conditions above is
>>>>>>>>> met. I also see what tick it occurs on.
>>>>>>>>> When I turn on the DynInst debug flag, I see when instructions are
>>>>>>>>> created and destroyed (cpu/base_dyn_inst_impl.hh) and what tick. From
>>>>>>>>> analyzing the trace files, I've gathered that this takes into account that
>>>>>>>>> instructions have different execution lengths. So if on one tick a memory
>>>>>>>>> instruction in the instList (DynInstPtr) is removed, the DynInst for that
>>>>>>>>> memory instruction will be destroyed much later (i.e. 1M ticks later). I have yet
>>>>>>>>> to determine how this is implemented.
>>>>>>>>> Now for the problem.
>>>>>>>>> What I'm seeing when I run dramsim2 dram memory is a significant
>>>>>>>>> difference between the size of the instList vector (of DynInstPtr objects),
>>>>>>>>> and the size of dynamic instruction count (of DynInst objects). The
>>>>>>>>> benchmark I'm running is libquantum from SPEC 2006. For the first roughly
>>>>>>>>> 130B ticks, the dynamic instruction count kept in cpu/base_dyn_inst_impl.hh
>>>>>>>>> shadows the instList size in o3/cpu.cc (figure linked below) very closely.
>>>>>>>>> Around tick 130B after libquantum started, it starts hitting what I'm
>>>>>>>>> assuming are loops (therefore branch prediction), resulting in some
>>>>>>>>> behavior that seems to imply improper instruction handling (i.e. more
>>>>>>>>> instructions in flight than allowed by ROB).
>>>>>>>>> I wasn't able to sync-up the physical and dramsim2 traces exactly
>>>>>>>>> by trace, but they should represent roughly the same area of execution.
>>>>>>>>> They don't execute the same due to the dramsim2 modeling the memory
>>>>>>>>> differently (i.e. latency and other delays).
>>>>>>>>> I've shared both traces on my public Dropbox here --
>>>>>>>>>
>>>>>>>>> http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
>>>>>>>>>
>>>>>>>>> http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
>>>>>>>>> Here are a couple plots of tick versus instruction count, with
>>>>>>>>> respect to cpu->instcount in cpu/base_dyn_inst_impl.hh and instList.size()
>>>>>>>>> in cpu/o3/cpu.cc. --
>>>>>>>>>
>>>>>>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
>>>>>>>>>
>>>>>>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>>>>>>>>> Note that I added the printout of the instList size to an existing
>>>>>>>>> O3CPU DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.
>>>>>>>>> Here are the commands I ran to parse the traces into data files to
>>>>>>>>> analyze in MATLAB and create the plots:
>>>>>>>>> zgrep DynInst dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | grep destroyed | awk '{print $1,$11}' > cpuinstcount.out
>>>>>>>>> zgrep instList dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | awk '{print $1,$11}' > instlistsize.out
>>>>>>>>> It seems to me like the problem might lie in gem5, but has just
>>>>>>>>> been exposed by integrating this more detailed memory model, dramsim2, into
>>>>>>>>> gem5. Either that, or there are some timing errors in how dramsim2 was
>>>>>>>>> integrated. I doubt this, however, since those first 190B ticks executed
>>>>>>>>> used the dramsim2 memory. I believe the problem is a combination of memory
>>>>>>>>> instructions + complex loops (branch prediction), resulting in improper
>>>>>>>>> destroying of instructions.
>>>>>>>>> I've included the ROB, Commit, Fetch, DynInst and O3CPU debug
>>>>>>>>> flags. There are 192 ROB entries, which is why the instList size generally
>>>>>>>>> has a max of about 192 instructions. The dynamic instruction counts (seen
>>>>>>>>> in the dramsim2 plot) seem to also imply that instructions are incorrectly
>>>>>>>>> being removed from the ROB, and then from the cpu's instruction list in
>>>>>>>>> cpu.cc, which allows more and more instructions to be added to the system
>>>>>>>>> (possibly from a bad branch).
>>>>>>>>> I appreciate any help in debugging this and further figuring out
>>>>>>>>> the root problem, just let me know if you need anything else from me. I
>>>>>>>>> don't have much more time at the moment to debug, but I can take any advice
>>>>>>>>> for quick changes and/or additional traces, then send the results back to
>>>>>>>>> the list for discussion.
>>>>>>>>> Thanks,
>>>>>>>>> Andrew
>>>>>>>>> P.S. Paul - I did try decreasing the size of the dramsim2
>>>>>>>>> transaction (and even command) queue from 512 to 32. The same instructions
>>>>>>>>> problem occurred. It basically just decreased the execution time.
>>>>>>>>>
>>>>>>>>> On Wed, Mar 14, 2012 at 2:10 PM, Ali Saidi <***@umich.edu> wrote:
>>>>>>>>>
>>>>>>>>>> The error is that there are more than 1500 instructions
>>>>>>>>>> currently in flight in the system. It could mean several things:
>>>>>>>>>>
>>>>>>>>>> 1. The value is somewhat arbitrarily defined and maybe there are
>>>>>>>>>> more than 1500 in your system at one time?
>>>>>>>>>>
>>>>>>>>>> 2. Instructions aren't being destroyed correctly
>>>>>>>>>>
>>>>>>>>>> You could try to run a debug binary so you'll get a list of
>>>>>>>>>> instructions when it happens or increase the number which may
>>>>>>>>>> be appropriate for certain situations (but 1500 is quite a few inflight
>>>>>>>>>> instructions).
>>>>>>>>>>
>>>>>>>>>> Ali
>>>>>>>>>>
>>>>>>>>>> On 13.03.2012 10:56, Andrew Cebulski wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Xiangyu,
>>>>>>>>>> I just started looking into this some more. So at first I
>>>>>>>>>> thought it was due to updating to a more recent revision, but then I went
>>>>>>>>>> back to revision 8643, added your patch, built and ran....and now get the
>>>>>>>>>> error with it too (when running ARM_FS/gem5.opt). I'm testing now to see
>>>>>>>>>> if an update to SWIG might have resulted in this error, maybe someone on
>>>>>>>>>> the mailing list would know if that's possible. The difference is 1.3.40
>>>>>>>>>> vs. 2.0.3, both of which are supported according to the dependencies wiki
>>>>>>>>>> page.
>>>>>>>>>> Just for completeness, here's the error from revision 8643:
>>>>>>>>>> build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
>>>>>>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>>>>>>>> I have not tried running with gem5.debug, so I will be doing
>>>>>>>>>> that today. Maybe this is an assertion that is occurring due to an
>>>>>>>>>> optimization. That would mean it wouldn't be triggered in gem5.debug since
>>>>>>>>>> it runs without optimizations. Have you tested all debug, opt and fast
>>>>>>>>>> with your tests?
>>>>>>>>>> Thanks,
>>>>>>>>>> Andrew
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 13, 2012 at 1:37 PM, Rio Xiangyu Dong <***@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I didn’t see this error in my simulations. May I ask which gem5
>>>>>>>>>>> version you are using? I find some of the latest code updates do not comply
>>>>>>>>>>> with my changes. I am still using the DRAMsim2 patch on Gem5 repo8643, and
>>>>>>>>>>> have run all the runnable benchmarks in SPEC2006, SPEC2000, EEMBC2, and
>>>>>>>>>>> PARSEC2 on ARM_SE.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thank you!
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>>
>>>>>>>>>>> Xiangyu
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> *From:* Andrew Cebulski [mailto:***@drexel.edu]
>>>>>>>>>>> *Sent:* Thursday, March 08, 2012 6:52 PM
>>>>>>>>>>>
>>>>>>>>>>> *To:* gem5 users mailing list
>>>>>>>>>>> *Cc:* ***@gmail.com; ***@umich.edu
>>>>>>>>>>>
>>>>>>>>>>> *Subject:* Re: [gem5-users] A Patch for DRAMsim2 Integration
>>>>>>>>>>>
>>>>>>>>>>> Xiangyu,
>>>>>>>>>>>
>>>>>>>>>>> I've been having an issue recently with the number of
>>>>>>>>>>> instructions I've been seeing committed to the CPU (I have a separate
>>>>>>>>>>> thread on this). It turns out the issue seems to be coming from this patch
>>>>>>>>>>> you created to integrate DramSim2 with Gem5. Unfortunately, I've been
>>>>>>>>>>> running with gem5.fast, not gem5.opt. So up until now, I haven't been
>>>>>>>>>>> seeing assertions. I thought I'd run it with gem5.opt or debug back in
>>>>>>>>>>> December, but I must not have. My runs on the Arm O3 cpu fails with this
>>>>>>>>>>> assertion:
>>>>>>>>>>>
>>>>>>>>>>> build/ARM/cpu/base_dyn_inst_impl.hh:149: void
>>>>>>>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>>>>>>>>>
>>>>>>>>>>> -Andrew
>>>>>>>>>>>
>>>>>>>>>>> Date: Sun, 18 Dec 2011 01:48:58 -0800
>>>>>>>>>>> From: "Dong, Xiangyu" <***@gmail.com>
>>>>>>>>>>> To: "gem5 users mailing list" <gem5-***@gem5.org>
>>>>>>>>>>> Subject: [gem5-users] A Patch for DRAMsim2 Integration
>>>>>>>>>>> Message-ID: gmail.com>
>>>>>>>>>>>
>>>>>>>>>>> Content-Type: text/plain; charset="us-ascii"
>>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I have a Gem5+DRAMsim2 patch. I've tested it under both SE and
>>>>>>>>>>> FS modes.
>>>>>>>>>>> I'm willing to share it here.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> For those who have such needs, please go to my website
>>>>>>>>>>> www.cse.psu.edu/~xydong <http://www.cse.psu.edu/%7Exydong> to
>>>>>>>>>>> download the patch and test it. To enable
>>>>>>>>>>> DRAMSim2, use se_dramsim2.py script instead of se.py (for FS,
>>>>>>>>>>> you can create
>>>>>>>>>>> by yourself). The basic idea to enable the DRAMsim2 module is
>>>>>>>>>>> to use the
>>>>>>>>>>> derived DRAMMemory class instead of PhysicalMemory class.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Please let me know if there are bugs.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thank you!
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>>
>>>>>>>>>>> Xiangyu Dong
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> gem5-users mailing list
>>>>>>>>>> gem5-***@gem5.org
>>>>>>>>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>
>
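For anyone who wants to redo the trace parsing quoted in this thread without zgrep/awk, here is a rough Python sketch of the same idea: keep lines that contain a keyword, then print two whitespace-separated fields, as `awk '{print $1,$11}'` does. The function names `extract_fields` and `read_trace` are hypothetical, and the sample line in the usage note is made up just to have 11 fields; the real DPRINTF line layout may differ, so check the field numbers against your own trace.

```python
import gzip


def read_trace(path):
    """Yield text lines from a gzip-compressed trace file."""
    with gzip.open(path, "rt") as handle:
        yield from handle


def extract_fields(lines, keyword, fields=(1, 11), also=None):
    """Rough equivalent of: grep keyword | [grep also] | awk '{print $a,$b}'.

    `fields` are 1-indexed like awk's $1, $11. Lines with too few fields
    are skipped here (awk would print empty strings instead).
    """
    rows = []
    for line in lines:
        if keyword not in line:
            continue
        if also is not None and also not in line:
            continue
        parts = line.split()
        if len(parts) < max(fields):
            continue
        rows.append(tuple(parts[i - 1] for i in fields))
    return rows
```

Usage would look something like `extract_fields(read_trace("trace.out.gz"), "DynInst", also="destroyed")`, where the file name is a placeholder. On a made-up line such as `1000: system.cpu: DynInst: [sn:42] Instruction destroyed. Instcount for system.cpu = 17`, fields 1 and 11 are the tick (with its trailing colon) and the count.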
Gabriel Michael Black
2012-05-04 12:53:39 UTC
Permalink
I haven't had a chance to study what's going on here, but could the
problem be that we don't have bandwidth limits/back pressure
implemented for the TLB and delayed translation? It could be that the
CPU is pumping instructions into translation which eventually drain
out/are squashed, and if too many accumulate they trip that assert.

That may not match what the code is actually doing, but it occurred
to me as a possibility and I thought I'd throw it out there.

Gabe
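
To illustrate the back-pressure idea above, here is a toy Python model. It is not gem5 code and makes strong assumptions: one instruction enters translation per cycle, every translation takes a fixed `walk_latency`, and `queue_limit` is a hypothetical knob standing in for whatever bandwidth limit would stall the front end. The only number taken from this thread is the 1500 in-flight threshold.

```python
from collections import deque

MAX_IN_FLIGHT = 1500  # threshold quoted in this thread for the instcount assertion


def peak_in_flight(cycles, walk_latency, queue_limit=None):
    """Return the peak number of instructions waiting on translation.

    One instruction enters translation per cycle and leaves walk_latency
    cycles later. With queue_limit set, issue stalls while the queue is
    full (back pressure); with None, nothing stops the pile-up.
    """
    pending = deque()  # completion cycle of each outstanding translation
    peak = 0
    for now in range(cycles):
        # Drain translations whose walk has finished by this cycle.
        while pending and pending[0] <= now:
            pending.popleft()
        # Issue a new instruction unless back pressure stalls the front end.
        if queue_limit is None or len(pending) < queue_limit:
            pending.append(now + walk_latency)
        peak = max(peak, len(pending))
    return peak
```

In this model the peak simply tracks the walk latency when there is no limit, so a slow enough memory model pushes it past `MAX_IN_FLIGHT`, while any finite `queue_limit` caps it. Whether the O3 model actually behaves this way is exactly the open question.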

Quoting Andrew Cebulski <***@drexel.edu>:

> I double-checked by looking at the config.ini file. It turns out I did
> actually create the checkpoint with an Atomic CPU without caches. Sorry
> for the confusion.
>
> -Andrew
>
> On Wed, May 2, 2012 at 10:12 PM, Andrew Cebulski <***@drexel.edu> wrote:
>
>> I started hitting this assertion (that the number of insts in flight exceeded
>> 1500) before I started using a checkpoint. I created the checkpoint
>> afterwards to decrease the time needed to run simulations to debug this
>> problem. I'll create a new checkpoint, then send the new trace output.
>>
>> -Andrew
>>
>>
>> On Wed, May 2, 2012 at 9:53 PM, Ali Saidi <***@umich.edu> wrote:
>>
>>>
>>> It's likely the cause for all of your problems. Dirty data in the caches
>>> doesn't get restored either. You should always create checkpoints with an
>>> atomic cpu and without caches.
>>>
>>>
>>>
>>> Ali
>>>
>>>
>>>
>>> On 02.05.2012 21:23, Andrew Cebulski wrote:
>>>
>>> Sorry, I created the checkpoint I referred to with an O3 CPU with caches.
>>> From what I recall reading, caches don't get restored from checkpoints.
>>> Since the checkpoint wasn't during the benchmark run, I assumed that was
>>> okay.
>>> -Andrew
>>>
>>> On Wed, May 2, 2012 at 9:07 PM, Ali Saidi <***@umich.edu> wrote:
>>>
>>>> You haven't answered the question about if you created the checkpoints
>>>> with an atomic cpu without caches.
>>>>
>>>> Ali
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 02.05.2012 19:58, Andrew Cebulski wrote:
>>>>
>>>> I have not run with the checker CPU recently. Here's the stderr output
>>>> from a run I did awhile back:
>>>> http://dl.dropbox.com/u/2953302/gem5/err.0
>>>> Note that the instruction match error is before my benchmark actually
>>>> starts running. The start of my boot script checks to see if my files
>>>> image is mounted (which it is), then continues on to run the benchmark. I
>>>> booted the system, mounted my files image, then took a checkpoint. I've
>>>> been running all my tests from that checkpoint. I found where my benchmark
>>>> started based on the ASID (from ExecAsid debug flag).
>>>> I delayed the start of gathering trace data until the second-to-last
>>>> linear increase in dynamic instructions in-flight. I'm running a new trace
>>>> now.
>>>> -Andrew
>>>>
>>>>
>>>> On Wed, May 2, 2012 at 5:28 PM, Ali Saidi <***@umich.edu> wrote:
>>>>
>>>>> Something is wrong well before this point. There is no reason that
>>>>> address 0x0 or 0x4 should be translated.
>>>>>
>>>>> Did you happen to create a checkpoint when caches were in the system?
>>>>>
>>>>> Have you tried to run with the checker cpu and see if it detects any
>>>>> errors?
>>>>>
>>>>>
>>>>>
>>>>> Ali
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 02.05.2012 17:22, Andrew Cebulski wrote:
>>>>>
>>>>> They are data TLB misses that occur as the in-flight instruction count
>>>>> rises (at 0x0 and 0x4). The last TLB miss before the in-flight instruction
>>>>> count finally linearly decreases is to 0x200. Also, at the start of the
>>>>> rising slope, I see a miss to 0x8 and 0x2508c.
>>>>> Here's a trace file:
>>>>> http://dl.dropbox.com/u/2953302/gem5/tlb.out
>>>>> To reduce size, I just have lines that have either TLB or walker in
>>>>> them.
>>>>> I do see only a handful of instruction TLB misses.
>>>>> -Andrew
>>>>>
>>>>> On Wed, May 2, 2012 at 11:10 AM, Ali Saidi <***@umich.edu> wrote:
>>>>>
>>>>>> Hi Andrew,
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks for digging into this. I think there is an issue somewhere, but
>>>>>> I'm still not sure where.
>>>>>>
>>>>>> Ali
>>>>>>
>>>>>> On 01.05.2012 23:34, Andrew Cebulski wrote:
>>>>>>
>>>>>> Okay, I'm positive now that the issue lies with delayed translations
>>>>>> that are squashed before finishing.
>>>>>>
>>>>>> On the data or instruction side? You seem to allude to data in the
>>>>>> paragraph below, but then instructions in the latter text.
>>>>>>
>>>>>> It seems to me like speculative load/stores are being executed,
>>>>>> rather than waiting for the instructions to commit. Once the instructions
>>>>>> begin getting (speculatively) executed in the TLB, a reference is left
>>>>>> there, which seems hard to root out and dereference after the instruction
>>>>>> ends up being squashed. At least, I have not been able to find that out in
>>>>>> the source code as of yet. Can anyone clarify on this?
>>>>>>
>>>>>>
>>>>>>
>>>>>> There should only be one translation outstanding from each
>>>>>> instruction and data side walker. Any nested transactions should be queued
>>>>>> in the walker. Until one finishes, I'm not sure how multiple would ever be
>>>>>> outstanding.
>>>>>>
>>>>>> Recall the following image that shows how the number of dynamic
>>>>>> instruction (DynInst) objects in-flight increases linearly for varying
>>>>>> periods of time:
>>>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>>>>>> After enabling the TLB debug flag, I see that the linear increase in
>>>>>> instructions in flight is proportional to the number of TLB misses. These
>>>>>> TLB misses have a much larger delay (resulting in translation delays) due
>>>>>> to the fact that DramSim2 models the memory system more accurately. It
>>>>>> seems that with the classic memory system, TLB misses often do not have
>>>>>> translation delays. For whatever reason, it would also seem that every
>>>>>> instruction that has a TLB miss also is eventually squashed...
>>>>>>
>>>>>> From a data side perspective this is reasonable. While a miss is
>>>>>> outstanding at some point instructions will stop committing and thus the
>>>>>> instructions in flight will begin to rise until the miss is satisfied.
>>>>>>
>>>>>> Here's a summary of outputs from my trace. These two DPRINTF
>>>>>> messages appear on the rising slopes (repeated up until the peak):
>>>>>> TLB Miss: Starting hardware table walker for 0(656)
>>>>>> TLB Miss: Starting hardware table walker for 0x4(656)
>>>>>>
>>>>>> This is interesting/odd. I don't know a good reason why (1) a miss
>>>>>> would be outstanding to both address 0 and address 4 at the same time. In
>>>>>> almost all cases these pages are marked as no-access to detect segfaults.
>>>>>> Perhaps there is an issue where the cpu is getting into a loop faulting on
>>>>>> a bad access and then faulting again on the fault handler. I could imagine
>>>>>> this would happen if there was some corruption in the memory system (for
>>>>>> example the timings in dramsim exposing a bug in the cache models or
>>>>>> something).
>>>>>>
>>>>>>
>>>>>> At the peak, the following message appears (from fetch) almost every
>>>>>> tick for (what I believe to be) every single one of the table
>>>>>> walkers that
>>>>>> were squashed.
>>>>>> Fetch is waiting ITLB walk to finish!
>>>>>>
>>>>>> There must be another walk in flight? The instruction side will only
>>>>>> have one fault outstanding at once. Successive branch mispredicts will
>>>>>> re-direct fetch but there is code that catches the fact that a different
>>>>>> walk completed than expected and "does the right thing."
>>>>>>
>>>>>> The problem is that these ITLB table walks are for instructions that
>>>>>> were squashed as much as 0.3 billion cycles earlier, and since
>>>>>> been removed
>>>>>> from the CPU's instruction list.
>>>>>>
>>>>>> I'm not following here.
>>>>>>
>>>>>> Any help will be greatly appreciated in solving this problem. I've
>>>>>> hit a roadblock with getting Ruby working with ARM, most likely
>>>>>> due to the
>>>>>> fact that ARM has disjoint memory (x86 and Alpha do not).
>>>>>> There's the 256
>>>>>> MB for physical memory, then the 64 MB for the boot loader. I
>>>>>> brought this
>>>>>> up in my last email about trying to get Ruby working. Therefore, I'm
>>>>>> trying to get this DramSim2 integration fixed so I can start modeling FS
>>>>>> with DRAM memory.
>>>>>>
>>>>>> Brad/Steve/Nilay anyone have a suggestion on how to make this work?
>>>>>>
>>>>>>
>>>>>> Note that these problems also occur in Soplex from the Spec CPU2006
>>>>>> benchmark suite (also hits 1500 in-flight instructions
>>>>>> assertion). Due to
>>>>>> time constraints, I haven't tested on other benchmarks.
>>>>>> Thanks,
>>>>>> Andrew
>>>>>> On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski
>>>>>> <***@drexel.edu>wrote:
>>>>>>
>>>>>>> Hey Gabe,
>>>>>>> Thanks for this...very helpful. I just recently got back into
>>>>>>> debugging this problem. I made a small change in src/base/refcnt.hh to
>>>>>>> allow me to return the current count of references to a DynInst object.
>>>>>>> I then modified existing DPRINTFs to also print out reference
>>>>>>> counts, then added some of my own when I needed extra visibility.
>>>>>>> I've found one memory store instruction that seems to be getting
>>>>>>> lost. What's happening is that it progresses as far as
>>>>>>> getting executed in
>>>>>>> the IEW once, but a delayed translation occurs, deferring the
>>>>>>> store. By
>>>>>>> the time it reenters the IEW, the IQ has marked the instruction as
>>>>>>> squashed. Everything progresses as usual from here on out, with one
>>>>>>> exception. When the instruction is removed from the CPU's
>>>>>>> instruction list,
>>>>>>> there is one reference count hanging.
>>>>>>> I've added in some additional debugging for my traces to help
>>>>>>> narrow down where this reference is coming from. As far as I can tell,
>>>>>>> it's because of a call to initiateAcc() within the
>>>>>>> executeStore function in
>>>>>>> the lsq unit. Please see the following two traces. The first
>>>>>>> trace shows
>>>>>>> what I just discussed. The second trace is another memory store
>>>>>>> instruction that got squashed, however, it was squashed upon its first
>>>>>>> entry into the IEW, therefore it never started execution.
>>>>>>> http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
>>>>>>> http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
>>>>>>> Let me know if you have any ideas based on these two instruction
>>>>>>> traces. I do not understand how the initiateAcc function results in
>>>>>>> another reference, but maybe someone else does.... Since I
>>>>>>> don't see how
>>>>>>> it makes a reference, it's hard to find out how to make sure it gets
>>>>>>> dereferenced...
>>>>>>> Unfortunately, I haven't been able to add a DPRINTF in
>>>>>>> src/base/refcnt.hh ...this would make things more clear (i.e.
>>>>>>> exactly when
>>>>>>> references/dereferences occur). Let me know if you have any advice on
>>>>>>> this...if it's possible. I can't seem to get the right
>>>>>>> include files, and
>>>>>>> likely right SConscript compile order...
>>>>>>> Thanks,
>>>>>>> Andrew
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Apr 7, 2012 at 9:48 PM, Gabe Black
>>>>>>> <***@eecs.umich.edu>wrote:
>>>>>>>
>>>>>>>> Without digging into things too deeply, it looks like you may be
>>>>>>>> leaking references to dynamic instructions. The CPU may think
>>>>>>>> it's done
>>>>>>>> with one, but until that final reference is removed, the
>>>>>>>> object will hang
>>>>>>>> around forever. I think I've had problems before where the reference
>>>>>>>> count ended up off by one somehow and instructions would
>>>>>>>> start piling up.
>>>>>>>> It's also possible that a clog develops in O3's pipeline and
>>>>>>>> some internal
>>>>>>>> structure stops letting instructions through and starts
>>>>>>>> accumulating them.
>>>>>>>> Either of these problems will be annoying to track down, but
>>>>>>>> with enough
>>>>>>>> digging I've been able to fix these sorts of things.
>>>>>>>>
>>>>>>>> This may have more to do with O3 not handling the benchmark you're
>>>>>>>> running well rather than a problem with your new DRAM model.
>>>>>>>> There may be
>>>>>>>> some interaction between the two, though, where the new
>>>>>>>> memory makes the
>>>>>>>> timing line up to cause O3 to behave poorly. What you can do
>>>>>>>> is instrument
>>>>>>>> dynamic instruction creation and destruction and reference
>>>>>>>> counting (try
>>>>>>>> printing "this" for both the reference counting wrapper and the dyn inst
>>>>>>>> itself) and turn it on as close as you can to where things go bad tick
>>>>>>>> wise. Then look for an instruction which gets lost, and look
>>>>>>>> for where its
>>>>>>>> reference count is incremented and decremented. It should be
>>>>>>>> relatively
>>>>>>>> easy to pair up where references are created and destroyed,
>>>>>>>> and you should
>>>>>>>> be able to identify the reference which never goes away. Then
>>>>>>>> you need to
>>>>>>>> figure out where that reference is being created. After that,
>>>>>>>> you should
>>>>>>>> have enough information to identify why the reference
>>>>>>>> counting isn't being
>>>>>>>> done correctly. It's arduous, but that's the only way.
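The pairing procedure described above can be sketched as a small ledger that records which holder took each reference and which holders have released theirs. This is only an illustrative sketch, not gem5 code: the `RefLedger` class, its method names, and the holder labels are all hypothetical, and in practice the calls would have to be placed by hand at the points where references are created and dropped.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical ledger (not part of gem5): tracks, per instruction sequence
// number, which named holders currently own a reference.
class RefLedger {
  public:
    // Record that `holder` took a reference to instruction `seqNum`.
    void incref(std::uint64_t seqNum, const std::string &holder) {
        ++open_[seqNum][holder];
    }

    // Record that `holder` released one reference to instruction `seqNum`.
    void decref(std::uint64_t seqNum, const std::string &holder) {
        auto &holders = open_[seqNum];
        auto it = holders.find(holder);
        if (it != holders.end() && --it->second == 0)
            holders.erase(it);
        if (holders.empty())
            open_.erase(seqNum);
    }

    // Holders that still own a reference; a non-empty result for an
    // instruction the CPU believes is finished points at the leak.
    std::vector<std::string> outstanding(std::uint64_t seqNum) const {
        std::vector<std::string> result;
        auto it = open_.find(seqNum);
        if (it == open_.end())
            return result;
        for (const auto &entry : it->second)
            result.push_back(entry.first);
        return result;
    }

  private:
    std::map<std::uint64_t, std::map<std::string, int>> open_;
};
```

Once an instruction has left the instruction list, any holder still reported by `outstanding()` is the reference that was never paired with a release.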
>>>>>>>>
>>>>>>>> It's important to also make sure reference counts aren't decremented
>>>>>>>> to zero prematurely. I had a problem once where that happened and the
>>>>>>>> memory behind the object was updated by something that didn't
>>>>>>>> know it was
>>>>>>>> dead. The memory had since been reallocated to another object
>>>>>>>> of the same
>>>>>>>> type, so that other object reflected what happened to the
>>>>>>>> phantom one. If I
>>>>>>>> remember that manifested as something weird like an add causing a page
>>>>>>>> fault or something.
>>>>>>>>
>>>>>>>> Gabe
>>>>>>>>
>>>>>>>>
>>>>>>>> On 04/07/12 18:21, Andrew Cebulski wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>> I've looked into this problem some more, and have put together a
>>>>>>>> couple traces. I've been becoming more familiar with how gem5 handles
>>>>>>>> dynamic instructions, in particular how it destroys them. I have two
>>>>>>>> traces to compare, one with the physical memory, and the
>>>>>>>> other with the
>>>>>>>> integrated dramsim2 dram memory. I also have two plots
>>>>>>>> showing instruction
>>>>>>>> counts over time (sim ticks). All of these are linked at the
>>>>>>>> end of the
>>>>>>>> email.
>>>>>>>> First, I'm going to go into what I've been able to interpret
>>>>>>>> regarding how instructions are destroyed. In particular,
>>>>>>>> comparing when
>>>>>>>> DynInst's vs. DynInstPtr's are deconstructed/removed from the cpu. I
>>>>>>>> separate these because I've seen a difference, as I discuss
>>>>>>>> later. These
>>>>>>>> explanations are fairly non-existent on the wiki. There is a section
>>>>>>>> header waiting to be filled...
>>>>>>>> From what I have been able to gather from the code, there is a list
>>>>>>>> of all the instructions in flight in cpu/o3/cpu.cc called
>>>>>>>> instList, with
>>>>>>>> the type DynInstPtr. There are three conditions to instructions being
>>>>>>>> cleaned from this list:
>>>>>>>> 1.) The ROB retires its head instruction
>>>>>>>> 2.) Fetch receives a rob squashing signal from the commit,
>>>>>>>> resulting in removing any instruction not in the ROB
>>>>>>>> 3.) Decode detects an incorrect branch prediction, resulting in
>>>>>>>> removal of all instructions back to the bad seq num.
>>>>>>>> Once all five stages have completed, the CPU cleans up all the
>>>>>>>> removed in-flight instructions. This line in particular
>>>>>>>> in cleanUpRemovedInsts() in cpu/o3/cpu.cc deconstructs a DynInstPtr:
>>>>>>>> instList.erase(removeList.front());
>>>>>>>> When I turn on the debug flag O3CPU, I see the message "Removing
>>>>>>>> instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum
>>>>>>>> and pcState
>>>>>>>> after all 5 cpu stages have completed, and one of the
>>>>>>>> conditions above is
>>>>>>>> met. I also see what tick it occurs on.
>>>>>>>> When I turn on the DynInst debug flag, I see when instructions are
>>>>>>>> created and destroyed (cpu/base_dyn_inst_impl.hh) and what tick. From
>>>>>>>> analyzing the trace files, I've gathered that this takes into
>>>>>>>> account that
>>>>>>>> instructions have different execution lengths. So if one
>>>>>>>> tick a memory
>>>>>>>> instruction in the instList (DynInstPtr) is removed, the
>>>>>>>> DynInst for that
>>>>>>>> memory instruction will occur much later (i.e. 1M ticks
>>>>>>>> later). I have yet
>>>>>>>> to determine how this is implemented.
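One plausible explanation for the gap between list removal and object destruction is reference counting: erasing a list entry drops only one reference, and the object is destroyed when the last holder lets go. The sketch below illustrates that behaviour with `std::shared_ptr` as a stand-in, assuming `DynInstPtr` behaves like a reference-counted pointer (as the discussion of src/base/refcnt.hh in this thread suggests). `ToyInst`, `ToyInstPtr`, and `liveAfterErase` are hypothetical names, not gem5 identifiers.

```cpp
#include <cassert>
#include <list>
#include <memory>

// Stand-in for a dynamic instruction. It counts live objects, similar in
// spirit to the cpu->instcount bookkeeping discussed in the thread.
struct ToyInst {
    static int liveCount;
    ToyInst() { ++liveCount; }
    ~ToyInst() { --liveCount; }
    ToyInst(const ToyInst &) = delete;
    ToyInst &operator=(const ToyInst &) = delete;
};
int ToyInst::liveCount = 0;

// Assumption: a reference-counted handle, modelled here with shared_ptr.
using ToyInstPtr = std::shared_ptr<ToyInst>;

// Creates one instruction, hands a second reference to `extraHolder`
// (for example an in-flight memory request), erases the list entry, and
// returns how many ToyInst objects are still alive at that point.
int liveAfterErase(ToyInstPtr &extraHolder) {
    std::list<ToyInstPtr> instList;
    instList.push_back(std::make_shared<ToyInst>());
    extraHolder = instList.front();
    instList.erase(instList.begin());  // the pointer leaves the list...
    return ToyInst::liveCount;         // ...but the object is still alive
}
```

If the extra holder never releases its reference, the live-object count keeps growing even though the list size stays bounded, which matches the shape of the dramsim2 plot.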
>>>>>>>> Now for the problem.
>>>>>>>> What I'm seeing when I run dramsim2 dram memory is a significant
>>>>>>>> difference between the size of the instList vector (of
>>>>>>>> DynInstPtr objects),
>>>>>>>> and the size of dynamic instruction count (of DynInst objects). The
>>>>>>>> benchmark I'm running is libquantum from SPEC 2006. For the
>>>>>>>> first roughly
>>>>>>>> 130B ticks, the dynamic instruction count kept in
>>>>>>>> cpu/base_dyn_inst_impl.hh
>>>>>>>> shadows the instList size in o3/cpu.cc (figure linked below)
>>>>>>>> very closely.
>>>>>>>> Around tick 130B after libquantum started, it starts hitting what I'm
>>>>>>>> assuming are loops (therefore branch prediction), resulting in some
>>>>>>>> behavior that seems to imply improper instruction handling (i.e. more
>>>>>>>> instructions in flight than allowed by ROB).
>>>>>>>> I wasn't able to sync-up the physical and dramsim2 traces exactly by
>>>>>>>> trace, but they should represent roughly the same area of
>>>>>>>> execution. They
>>>>>>>> don't execute the same due to the dramsim2 modeling the
>>>>>>>> memory differently
>>>>>>>> (i.e. latency and other delays).
>>>>>>>> I've shared both traces on my public Dropbox here --
>>>>>>>>
>>>>>>>> http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
>>>>>>>>
>>>>>>>> http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
>>>>>>>> Here are a couple plots of tick versus instruction count, with
>>>>>>>> respect to cpu->instcount in cpu/base_dyn_inst_impl.hh and
>>>>>>>> instList.size()
>>>>>>>> in cpu/o3/cpu.cc. --
>>>>>>>>
>>>>>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
>>>>>>>>
>>>>>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>>>>>>>> Note that I added the printout of the instList size to an existing
>>>>>>>> O3CPU DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.
>>>>>>>> Here are the commands I ran to parse the traces into data files to
>>>>>>>> analyze in MATLAB and create the plots:
>>>>>>>> zgrep DynInst dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | grep destroyed | awk '{print $1,$11}' > cpuinstcount.out
>>>>>>>> zgrep instList dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | awk '{print $1,$11}' > instlistsize.out
>>>>>>>> It seems to me like the problem might lie in gem5, but has just been
>>>>>>>> exposed by integrating this more detailed memory model, dramsim2, into
>>>>>>>> gem5. Either that, or there are some timing errors in how
>>>>>>>> dramsim2 was
>>>>>>>> integrated. I doubt this, however, since those first 190B
>>>>>>>> ticks executed
>>>>>>>> used the dramsim2 memory. I believe the problem is a
>>>>>>>> combination of memory
>>>>>>>> instructions + complex loops (branch prediction), resulting
>>>>>>>> in improper
>>>>>>>> destroying of instructions.
>>>>>>>> I've included the ROB, Commit, Fetch, DynInst and O3CPU debug flags.
>>>>>>>> There are 192 ROB entries, which is why the instList size
>>>>>>>> generally has a
>>>>>>>> max of about 192 instructions. The dynamic instruction
>>>>>>>> counts (seen in the
>>>>>>>> dramsim2 plot) seem to also imply that instructions are
>>>>>>>> incorrectly being
>>>>>>>> removed from the ROB, and then from the cpu's instruction
>>>>>>>> list in cpu.cc,
>>>>>>>> which allows more and more instructions to be added to the
>>>>>>>> system (possibly
>>>>>>>> from a bad branch).
>>>>>>>> I appreciate any help in debugging this and further figuring out the
>>>>>>>> root problem, just let me know if you need anything else from
>>>>>>>> me. I don't
>>>>>>>> have much more time at the moment to debug, but I can take
>>>>>>>> any advice for
>>>>>>>> quick changes and/or additional traces, then send the results
>>>>>>>> back to the
>>>>>>>> list for discussion.
>>>>>>>> Thanks,
>>>>>>>> Andrew
>>>>>>>> P.S. Paul - I did try decreasing the size of the dramsim2
>>>>>>>> transaction (and even command) queue from 512 to 32. The
>>>>>>>> same instructions
>>>>>>>> problem occurred. It basically just decreased the execution time.
>>>>>>>>
>>>>>>>> On Wed, Mar 14, 2012 at 2:10 PM, Ali Saidi <***@umich.edu> wrote:
>>>>>>>>
>>>>>>>>> The error is that there are more than 1500 instructions currently
>>>>>>>>> in flight in the system. It could mean several things:
>>>>>>>>>
>>>>>>>>> 1. The value is somewhat arbitrarily defined and maybe there are
>>>>>>>>> more than 1500 in your system at one time?
>>>>>>>>>
>>>>>>>>> 2. Instructions aren't being destroyed correctly
>>>>>>>>>
>>>>>>>>> You could try to run a debug binary so you'll get a list of
>>>>>>>>> instructions when it happens or increase the number which may
>>>>>>>>> be appropriate for certain situations (but 1500 is quite a
>>>>>>>>> few inflight
>>>>>>>>> instructions).
>>>>>>>>>
>>>>>>>>> Ali
>>>>>>>>>
>>>>>>>>> On 13.03.2012 10:56, Andrew Cebulski wrote:
>>>>>>>>>
>>>>>>>>> Hi Xiangyu,
>>>>>>>>> I just started looking into this some more. So at first I
>>>>>>>>> thought it was due to updating to a more recent revision,
>>>>>>>>> but then I went
>>>>>>>>> back to revision 8643, added your patch, built and
>>>>>>>>> ran....and now get the
>>>>>>>>> error with it too (when running ARM_FS/gem5.opt). I'm
>>>>>>>>> testing now to see
>>>>>>>>> if an update to SWIG might have resulted in this error,
>>>>>>>>> maybe someone on
>>>>>>>>> the mailing list would know if that's possible. The
>>>>>>>>> difference is 1.3.40
>>>>>>>>> vs. 2.0.3, both of which are supported according to the
>>>>>>>>> dependencies wiki
>>>>>>>>> page.
>>>>>>>>> Just for completeness, here's the error from revision 8643:
>>>>>>>>> build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
>>>>>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion
>>>>>>>>> `cpu->instcount
>>>>>>>>> I have not tried running with gem5.debug, so I will be doing
>>>>>>>>> that today. Maybe this is an assertion that is occurring due to an
>>>>>>>>> optimization. That would mean it wouldn't be triggered in
>>>>>>>>> gem5.debug since
>>>>>>>>> it runs without optimizations. Have you tested all debug,
>>>>>>>>> opt and fast
>>>>>>>>> with your tests?
>>>>>>>>> Thanks,
>>>>>>>>> Andrew
>>>>>>>>>
>>>>>>>>> On Tue, Mar 13, 2012 at 1:37 PM, Rio Xiangyu Dong <
>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Andrew,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I didn't see this error in my simulations. May I ask which gem5
>>>>>>>>>> version you are using? I find some of the latest code
>>>>>>>>>> updates do not comply
>>>>>>>>>> with my changes. I am still using the DRAMsim2 patch on
>>>>>>>>>> Gem5 repo8643, and
>>>>>>>>>> have run all the runnable benchmarks in SPEC2006, SPEC2000,
>>>>>>>>>> EEMBC2, and
>>>>>>>>>> PARSEC2 on ARM_SE.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thank you!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>>
>>>>>>>>>> Xiangyu
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *From:* Andrew Cebulski [mailto:***@drexel.edu]
>>>>>>>>>> *Sent:* Thursday, March 08, 2012 6:52 PM
>>>>>>>>>>
>>>>>>>>>> *To:* gem5 users mailing list
>>>>>>>>>> *Cc:****@gmail.com; ***@umich.edu
>>>>>>>>>>
>>>>>>>>>> *Subject:* Re: [gem5-users] A Patch for DRAMsim2 Integration
>>>>>>>>>>
>>>>>>>>>> Xiangyu,
>>>>>>>>>>
>>>>>>>>>> I've been having an issue recently with the number of
>>>>>>>>>> instructions I've been seeing committed to the CPU (I have
>>>>>>>>>> a separate
>>>>>>>>>> thread on this). It turns out the issue seems to be coming
>>>>>>>>>> from this patch
>>>>>>>>>> you created to integrate DramSim2 with Gem5.
>>>>>>>>>> Unfortunately, I've been
>>>>>>>>>> running with gem5.fast, not gem5.opt. So up until now, I
>>>>>>>>>> haven't been
>>>>>>>>>> seeing assertions. I thought I'd run it with gem5.opt or
>>>>>>>>>> debug back in
>>>>>>>>>> December, but I must not have. My runs on the Arm O3 cpu
>>>>>>>>>> fails with this
>>>>>>>>>> assertion:
>>>>>>>>>>
>>>>>>>>>> build/ARM/cpu/base_dyn_inst_impl.hh:149: void
>>>>>>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion
>>>>>>>>>> `cpu->instcount
>>>>>>>>>>
>>>>>>>>>> -Andrew
>>>>>>>>>>
>>>>>>>>>> Date: Sun, 18 Dec 2011 01:48:58 -0800
>>>>>>>>>> From: "Dong, Xiangyu" <***@gmail.com>
>>>>>>>>>> To: "gem5 users mailing list" <gem5-***@gem5.org>
>>>>>>>>>> Subject: [gem5-users] A Patch for DRAMsim2 Integration
>>>>>>>>>> Message-ID: <001201ccbd6a$45102210$cf306630$@gmail.com>
>>>>>>>>>>
>>>>>>>>>> Content-Type: text/plain; charset="us-ascii"
>>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I have a Gem5+DRAMsim2 patch. I've tested it under both SE and FS
>>>>>>>>>> modes.
>>>>>>>>>> I'm willing to share it here.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> For those who have such needs, please go to my website
>>>>>>>>>> www.cse.psu.edu/~xydong <http://www.cse.psu.edu/%7Exydong> to
>>>>>>>>>> download the patch and test it. To enable
>>>>>>>>>> DRAMSim2, use se_dramsim2.py script instead of se.py (for FS, you
>>>>>>>>>> can create
>>>>>>>>>> by yourself). The basic idea to enable the DRAMsim2 module is to
>>>>>>>>>> use the
>>>>>>>>>> derived DRAMMemory class instead of PhysicalMemory class.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Please let me know if there are bugs.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thank you!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>>
>>>>>>>>>> Xiangyu Dong
>>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> gem5-users mailing list
>>>>>>>>> gem5-***@gem5.org
>>>>>>>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
Ali Saidi
2012-05-06 16:01:41 UTC
Permalink
Hi Andrew,

Could you add some code to the table walker to see how big the following are getting:
stateQueueL1.size()
stateQueueL2.size()
pendingQueue.size()

Perhaps we're somehow getting into a loop where there are a lot of translations to invalid addresses that get squashed and they pile up in the table walker?
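
A lightweight way to watch how large those queues get is to record a high-water mark and print only when a new peak is reached, which keeps the trace small. The helper below is a hypothetical sketch, not table walker code: `HighWaterMark` is an invented name, and in gem5 the `sample()` call would be placed by hand next to each push onto `stateQueueL1`, `stateQueueL2`, or `pendingQueue`, with a DPRINTF issued when it returns true.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical helper (not part of gem5): remembers the largest size a
// queue has reached so far.
class HighWaterMark {
  public:
    // Call after every push or pop with the queue's current size. Returns
    // true only when a new peak is reached, a convenient moment to emit a
    // debug message.
    bool sample(std::size_t currentSize) {
        if (currentSize <= peak_)
            return false;
        peak_ = currentSize;
        return true;
    }

    std::size_t peak() const { return peak_; }

  private:
    std::size_t peak_ = 0;
};
```

A peak that keeps climbing during the rising slopes of the in-flight instruction plot would support the theory that squashed translations are accumulating in the walker.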

Thanks,
Ali



On May 4, 2012, at 7:53 AM, Gabriel Michael Black wrote:

> I haven't had a chance to study what's going on here, but could the problem be that we don't have bandwidth limits/back pressure implemented for the TLB and delayed translation? It could be that the CPU is pumping instructions into translation which eventually drain out/are squashed, and if too many accumulate they trip that assert.
>
> That may not actually make any sense as far as what the code is actually doing, but it occurred to me as a possibility and I thought I'd throw it out there.
>
> Gabe
>
> Quoting Andrew Cebulski <***@drexel.edu>:
>
>> I double-checked by looking at the config.ini file. It turns out I did
>> actually create the checkpoint with an Atomic CPU without caches. Sorry
>> for the confusion.
>>
>> -Andrew
>>
>> On Wed, May 2, 2012 at 10:12 PM, Andrew Cebulski <***@drexel.edu> wrote:
>>
>>> I started hitting this assertion (that the number of insts in flight was >
>>> 1500) before I started using a checkpoint. I created the checkpoint
>>> afterwards to decrease the time needed to run simulations to debug this
>>> problem. I'll create a new checkpoint, then send the new trace output.
>>>
>>> -Andrew
>>>
>>>
>>> On Wed, May 2, 2012 at 9:53 PM, Ali Saidi <***@umich.edu> wrote:
>>>
>>>> It's likely the cause for all of your problems. Dirty data in the caches
>>>> doesn't get restored either. You should always create checkpoints with an
>>>> atomic cpu and without caches.
>>>>
>>>>
>>>>
>>>> Ali
>>>>
>>>>
>>>>
>>>> On 02.05.2012 21:23, Andrew Cebulski wrote:
>>>>
>>>> Sorry, I created the checkpoint I referred to with an O3 CPU with caches.
>>>> From what I recall reading, caches don't get restored from checkpoints.
>>>> Since the checkpoint wasn't during the benchmark run, I assumed that was
>>>> okay.
>>>> -Andrew
>>>>
>>>> On Wed, May 2, 2012 at 9:07 PM, Ali Saidi <***@umich.edu> wrote:
>>>>
>>>>> You haven't answered the question about if you created the checkpoints
>>>>> with an atomic cpu without caches.
>>>>>
>>>>> Ali
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 02.05.2012 19:58, Andrew Cebulski wrote:
>>>>>
>>>>> I have not run with the checker CPU recently. Here's the stderr output
>>>>> from a run I did awhile back:
>>>>> http://dl.dropbox.com/u/2953302/gem5/err.0
>>>>> Note that the instruction match error is before my benchmark actually
>>>>> starts running. The start of my boot script checks to see if my files
>>>>> image is mounted (which it is), then continues on to run the benchmark. I
>>>>> booted the system, mounted my files image, then took a checkpoint. I've
>>>>> been running all my tests from that checkpoint. I found where my benchmark
>>>>> started based on the ASID (from ExecAsid debug flag).
>>>>> I delayed the start of gathering trace data until the second-to-last
>>>>> linear increase in dynamic instructions in-flight. I'm running a new trace
>>>>> now.
>>>>> -Andrew
>>>>>
>>>>>
>>>>> On Wed, May 2, 2012 at 5:28 PM, Ali Saidi <***@umich.edu> wrote:
>>>>>
>>>>>> Something is wrong well before this point. There is no reason that
>>>>>> address 0x0 or 0x4 should be translated.
>>>>>>
>>>>>> Did you happen to create a checkpoint when caches were in the system?
>>>>>>
>>>>>> Have you tried to run with the checker cpu and see if it detects any
>>>>>> errors?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Ali
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 02.05.2012 17:22, Andrew Cebulski wrote:
>>>>>>
>>>>>> They are data TLB misses that occur as the in-flight instruction count
>>>>>> rises (at 0x0 and 0x4). The last TLB miss before the in-flight instruction
>>>>>> count finally linearly decreases is to 0x200. Also, at the start of the
>>>>>> rising slope, I see a miss to 0x8 and 0x2508c.
>>>>>> Here's a trace file:
>>>>>> http://dl.dropbox.com/u/2953302/gem5/tlb.out
>>>>>> To reduce size, I just have lines that have either TLB or walker in
>>>>>> them.
>>>>>> I do see only a handful of instruction TLB misses.
>>>>>> -Andrew
>>>>>>
>>>>>> On Wed, May 2, 2012 at 11:10 AM, Ali Saidi <***@umich.edu> wrote:
>>>>>>
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks for digging into this. I think there is an issue somewhere, but
>>>>>>> I'm still not sure where.
>>>>>>>
>>>>>>> Ali
>>>>>>>
>>>>>>> On 01.05.2012 23:34, Andrew Cebulski wrote:
>>>>>>>
>>>>>>> Okay, I'm positive now that the issue lies with delayed translations
>>>>>>> that are squashed before finishing.
>>>>>>>
>>>>>>> On the data or instruction side? You seem to allude to data in the
>>>>>>> paragraph below, but then instructions in the latter text.
>>>>>>>
>>>>>>> It seems to me like speculative load/stores are being executed,
>>>>>>> rather than waiting for the instructions to commit. Once the instructions
>>>>>>> begin getting (speculatively) executed in the TLB, a reference is left
>>>>>>> there, which seems hard to root out and dereference after the instruction
>>>>>>> ends up being squashed. At least, I have not been able to find that out in
>>>>>>> the source code as of yet. Can anyone clarify on this?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> There should only be one translation outstanding from each
>>>>>>> instruction and data side walker. Any nested transactions should be queued
>>>>>>> in the walker. Until one finishes, I'm not sure how multiple would ever be
>>>>>>> outstanding.
>>>>>>>
>>>>>>> Recall the following image that shows how the number of dynamic
>>>>>>> instruction (DynInst) objects in-flight increases linearly for varying
>>>>>>> periods of time:
>>>>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>>>>>>> After enabling the TLB debug flag, I see that the linear increase in
>>>>>>> instructions in flight is proportional to the number of TLB misses. These
>>>>>>> TLB misses have a much larger delay (resulting in translation delays) due
>>>>>>> to the fact that DramSim2 models the memory system more accurately. It
>>>>>>> seems that with the classic memory system, TLB misses often do not have
>>>>>>> translation delays. For whatever reason, it would also seem that every
>>>>>>> instruction that has a TLB miss also is eventually squashed...
>>>>>>>
>>>>>>> From a data side perspective this is reasonable. While a miss is
>>>>>>> outstanding at some point instructions will stop committing and thus the
>>>>>>> instructions in flight will begin to rise until the miss is satisfied.
>>>>>>>
>>>>>>> Here's a summary of outputs from my trace. These two DPRINTF
>>>>>>> messages appear on the rising slopes (repeated up until the peak):
>>>>>>> TLB Miss: Starting hardware table walker for 0(656)
>>>>>>> TLB Miss: Starting hardware table walker for 0x4(656)
>>>>>>>
>>>>>>> This is interesting/odd. I don't know a good reason why (1) a miss
>>>>>>> would be outstanding to both address 0 and address 4 at the same time. In
>>>>>>> almost all cases these pages are marked as no-access to detect segfaults.
>>>>>>> Perhaps there is an issue where the cpu is getting into a loop faulting on
>>>>>>> a bad access and then faulting again on the fault handler. I could imagine
>>>>>>> this would happen if there was some corruption in the memory system (for
>>>>>>> example the timings in dramsim exposing a bug in the cache models or
>>>>>>> something).
>>>>>>>
>>>>>>>
>>>>>>> At the peak, the following message appears (from fetch) almost every
>>>>>>> tick for (what I believe to be) every single one of the table walkers that
>>>>>>> were squashed.
>>>>>>> Fetch is waiting ITLB walk to finish!
>>>>>>>
>>>>>>> There must be another walk in flight? The instruction side will only
>>>>>>> have one fault outstanding at once. Successive branch mispredicts will
>>>>>>> re-direct fetch but there is code that catches the fact that a different
>>>>>>> walk completed than expected and "does the right thing."
>>>>>>>
>>>>>>> The problem is that these ITLB table walks are for instructions that
>>>>>>> were squashed as much as 0.3 billion cycles earlier, and since been removed
>>>>>>> from the CPU's instruction list.
>>>>>>>
>>>>>>> I'm not following here.
>>>>>>>
>>>>>>> Any help will be greatly appreciated in solving this problem. I've
>>>>>>> hit a roadblock with getting Ruby working with ARM, most likely due to the
>>>>>>> fact that ARM has disjoint memory (x86 and Alpha do not). There's the 256
>>>>>>> MB for physical memory, then the 64 MB for the boot loader. I brought this
>>>>>>> up in my last email about trying to get Ruby working. Therefore, I'm
>>>>>>> trying to get this DramSim2 integration fixed so I can start modeling FS
>>>>>>> with DRAM memory.
>>>>>>>
>>>>>>> Brad/Steve/Nilay anyone have a suggestion on how to make this work?
>>>>>>>
>>>>>>>
>>>>>>> Note that these problems also occur in Soplex from the Spec CPU2006
>>>>>>> benchmark suite (also hits 1500 in-flight instructions assertion). Due to
>>>>>>> time constraints, I haven't tested on other benchmarks.
>>>>>>> Thanks,
>>>>>>> Andrew
>>>>>>> On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski <***@drexel.edu>wrote:
>>>>>>>
>>>>>>>> Hey Gabe,
>>>>>>>> Thanks for this...very helpful. I just recently got back into
>>>>>>>> debugging this problem. I made a small change in src/base/refcnt.hh to
>>>>>>>> allow me to return the current count of references to a DynInst object.
>>>>>>>> I then modified existing DPRINTFs to also print out reference
>>>>>>>> counts, then added some of my own when I needed extra visibility.
>>>>>>>> I've found one memory store instruction that seems to be getting
>>>>>>>> lost. What's happening is that it progresses as far as getting executed in
>>>>>>>> the IEW once, but a delayed translation occurs, deferring the store. By
>>>>>>>> the time it reenters the IEW, the IQ has marked the instruction as
>>>>>>>> squashed. Everything progresses as usual from here on out, with one
>>>>>>>> exception. When the instruction is removed from the CPU's instruction list,
>>>>>>>> there is one reference count hanging.
>>>>>>>> I've added in some additional debugging for my traces to help
>>>>>>>> narrow down where this reference is coming from. As far as I can tell,
>>>>>>>> it's because of a call to initiateAcc() within the executeStore function in
>>>>>>>> the lsq unit. Please see the following two traces. The first trace shows
>>>>>>>> what I just discussed. The second trace is another memory store
>>>>>>>> instruction that got squashed, however, it was squashed upon its first
>>>>>>>> entry into the IEW, therefore it never started execution.
>>>>>>>> http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
>>>>>>>> http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
>>>>>>>> Let me know if you have any ideas based on these two instruction
>>>>>>>> traces. I do not understand how the initiateAcc function results in
>>>>>>>> another reference, but maybe someone else does.... Since I don't see how
>>>>>>>> it makes a reference, it's hard to find out how to make sure it gets
>>>>>>>> dereferenced...
>>>>>>>> Unfortunately, I haven't been able to add a DPRINTF in
>>>>>>>> src/base/refcnt.hh ...this would make things more clear (i.e. exactly when
>>>>>>>> references/dereferences occur). Let me know if you have any advice on
>>>>>>>> this...if it's possible. I can't seem to get the right include files, and
>>>>>>>> likely right SConscript compile order...
>>>>>>>> Thanks,
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Apr 7, 2012 at 9:48 PM, Gabe Black <***@eecs.umich.edu>wrote:
>>>>>>>>
>>>>>>>>> Without digging into things too deeply, it looks like you may be
>>>>>>>>> leaking references to dynamic instructions. The CPU may think it's done
>>>>>>>>> with one, but until that final reference is removed, the object will hang
>>>>>>>>> around forever. I think I've had problems before where the reference
>>>>>>>>> count ended up off by one somehow and instructions would start piling up.
>>>>>>>>> It's also possible that a clog develops in O3's pipeline and some internal
>>>>>>>>> structure stops letting instructions through and starts accumulating them.
>>>>>>>>> Either of these problems will be annoying to track down, but with enough
>>>>>>>>> digging I've been able to fix these sorts of things.
>>>>>>>>>
>>>>>>>>> This may have more to do with O3 not handling the benchmark you're
>>>>>>>>> running well rather than a problem with your new DRAM model. There may be
>>>>>>>>> some interaction between the two, though, where the new memory makes the
>>>>>>>>> timing line up to cause O3 to behave poorly. What you can do is instrument
>>>>>>>>> dynamic instruction creation and destruction and reference counting (try
>>>>>>>>> printing "this" for both the reference counting wrapper and the dyn inst
>>>>>>>>> itself) and turn it on as close as you can to where things go bad
>>>>>>>>> tick-wise. Then look for an instruction which gets lost, and look for where its
>>>>>>>>> reference count is incremented and decremented. It should be relatively
>>>>>>>>> easy to pair up where references are created and destroyed, and you should
>>>>>>>>> be able to identify the reference which never goes away. Then you need to
>>>>>>>>> figure out where that reference is being created. After that, you should
>>>>>>>>> have enough information to identify why the reference counting isn't being
>>>>>>>>> done correctly. It's arduous, but that's the only way.
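
The pairing step described above can be sketched as a small script. This is only an illustration under stated assumptions: it supposes every reference-count change has been logged as a line of the hypothetical form `<tick>: refcnt <inc|dec> inst=<seqNum> holder=<address>`. That format, and the names `LINE_RE` and `unmatched_references`, are invented here; stock gem5 does not print such lines.

```python
import re
from collections import Counter

# Hypothetical log format (an assumption, not produced by stock gem5):
#   <tick>: refcnt <inc|dec> inst=<seqNum> holder=<address of the pointer wrapper>
LINE_RE = re.compile(r"^(\d+): refcnt (inc|dec) inst=(\d+) holder=(0x[0-9a-f]+)$")


def unmatched_references(lines):
    """Return {(seq_num, holder): balance} for every unbalanced reference.

    A positive balance is a reference that was taken but never released
    (a leak); a negative balance is a release without a matching take.
    """
    balance = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # ignore unrelated trace output
        _tick, op, seq_num, holder = m.groups()
        balance[(int(seq_num), holder)] += 1 if op == "inc" else -1
    return {key: n for key, n in balance.items() if n != 0}


sample = [
    "100: refcnt inc inst=42 holder=0x1000",
    "120: refcnt inc inst=42 holder=0x2000",
    "150: refcnt dec inst=42 holder=0x1000",
    "160: refcnt inc inst=43 holder=0x1000",
    "170: refcnt dec inst=43 holder=0x1000",
]
leaks = unmatched_references(sample)
print(leaks)
```

In the sample, instruction 42 still has one reference held by `0x2000`, so that is the holder to trace back in the source.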
>>>>>>>>>
>>>>>>>>> It's important to also make sure reference counts aren't decremented
>>>>>>>>> to zero prematurely. I had a problem once where that happened and the
>>>>>>>>> memory behind the object was updated by something that didn't know it was
>>>>>>>>> dead. The memory had since been reallocated to another object of the same
>>>>>>>>> type, so that other object reflected what happened to the phantom one. If I
>>>>>>>>> remember that manifested as something weird like an add causing a page
>>>>>>>>> fault or something.
>>>>>>>>>
>>>>>>>>> Gabe
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 04/07/12 18:21, Andrew Cebulski wrote:
>>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>> I've looked into this problem some more, and have put together a
>>>>>>>>> couple traces. I've been becoming more familiar with how gem5 handles
>>>>>>>>> dynamic instructions, in particular how it destroys them. I have two
>>>>>>>>> traces to compare, one with the physical memory, and the other with the
>>>>>>>>> integrated dramsim2 dram memory. I also have two plots showing instruction
>>>>>>>>> counts over time (sim ticks). All of these are linked at the end of the
>>>>>>>>> email.
>>>>>>>>> First, I'm going to go into what I've been able to interpret
>>>>>>>>> regarding how instructions are destroyed. In particular, comparing when
>>>>>>>>> DynInst's vs. DynInstPtr's are deconstructed/removed from the cpu. I
>>>>>>>>> separate these because I've seen a difference, as I discuss later. These
>>>>>>>>> explanations are fairly non-existent on the wiki. There is a section
>>>>>>>>> header waiting to be filled...
>>>>>>>>> From what I have been able to gather from the code, there is a list
>>>>>>>>> of all the instructions in flight in cpu/o3/cpu.cc called instList, with
>>>>>>>>> the type DynInstPtr. There are three conditions to instructions being
>>>>>>>>> cleaned from this list:
>>>>>>>>> 1.) The ROB retires its head instruction
>>>>>>>>> 2.) Fetch receives a rob squashing signal from the commit,
>>>>>>>>> resulting in removing any instruction not in the ROB
>>>>>>>>> 3.) Decode detects an incorrect branch prediction, resulting in
>>>>>>>>> removal of all instructions back to the bad seq num.
>>>>>>>>> Once all five stages have completed, the CPU cleans up all the
>>>>>>>>> removed in-flight instructions. This line in particular
>>>>>>>>> in cleanUpRemovedInsts() in cpu/o3/cpu.cc deconstructs a DynInstPtr:
>>>>>>>>> instList.erase(removeList.front());
>>>>>>>>> When I turn on the debug flag O3CPU, I see the message "Removing
>>>>>>>>> instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum and pcState
>>>>>>>>> after all 5 cpu stages have completed, and one of the conditions above is
>>>>>>>>> met. I also see what tick it occurs on.
>>>>>>>>> When I turn on the DynInst debug flag, I see when instructions are
>>>>>>>>> created and destroyed (cpu/base_dyn_inst_impl.hh) and what tick. From
>>>>>>>>> analyzing the trace files, I've gathered that this takes into account that
>>>>>>>>> instructions have different execution lengths. So if one tick a memory
>>>>>>>>> instruction in the instList (DynInstPtr) is removed, the DynInst for that
>>>>>>>>> memory instruction will occur much later (i.e. 1M ticks later). I have yet
>>>>>>>>> to determine how this is implemented.
>>>>>>>>> Now for the problem.
>>>>>>>>> What I'm seeing when I run dramsim2 dram memory is a significant
>>>>>>>>> difference between the size of the instList vector (of DynInstPtr objects),
>>>>>>>>> and the size of dynamic instruction count (of DynInst objects). The
>>>>>>>>> benchmark I'm running is libquantum from SPEC 2006. For the first roughly
>>>>>>>>> 130B ticks, the dynamic instruction count kept in cpu/base_dyn_inst_impl.hh
>>>>>>>>> shadows the instList size in o3/cpu.cc (figure linked below) very closely.
>>>>>>>>> Around tick 130B after libquantum started, it starts hitting what I'm
>>>>>>>>> assuming are loops (therefore branch prediction), resulting in some
>>>>>>>>> behavior that seems to imply improper instruction handling (i.e. more
>>>>>>>>> instructions in flight than allowed by ROB).
>>>>>>>>> I wasn't able to sync-up the physical and dramsim2 traces exactly by
>>>>>>>>> trace, but they should represent roughly the same area of execution. They
>>>>>>>>> don't execute the same due to the dramsim2 modeling the memory differently
>>>>>>>>> (i.e. latency and other delays).
>>>>>>>>> I've shared both traces on my public Dropbox here --
>>>>>>>>>
>>>>>>>>> http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
>>>>>>>>>
>>>>>>>>> http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
>>>>>>>>> Here are a couple plots of tick versus instruction count, with
>>>>>>>>> respect to cpu->instcount in cpu/base_dyn_inst_impl.hh and instList.size()
>>>>>>>>> in cpu/o3/cpu.cc. --
>>>>>>>>>
>>>>>>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
>>>>>>>>>
>>>>>>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>>>>>>>>> Note that I added the printout of the instList size to an existing
>>>>>>>>> O3CPU DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.
>>>>>>>>> Here are the commands I ran to parse the traces into data files to
>>>>>>>>> analyze in MATLAB and create the plots:
>>>>>>>>> zgrep DynInst dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | grep destroyed | awk '{print $1,$11}' > cpuinstcount.out
>>>>>>>>> zgrep instList dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | awk '{print $1,$11}' > instlistsize.out
>>>>>>>>> It seems to me like the problem might lie in gem5, but has just been
>>>>>>>>> exposed by integrating this more detailed memory model, dramsim2, into
>>>>>>>>> gem5. Either that, or there are some timing errors in how dramsim2 was
>>>>>>>>> integrated. I doubt this, however, since those first 190B ticks executed
>>>>>>>>> used the dramsim2 memory. I believe the problem is a combination of memory
>>>>>>>>> instructions + complex loops (branch prediction), resulting in improper
>>>>>>>>> destroying of instructions.
>>>>>>>>> I've included the ROB, Commit, Fetch, DynInst and O3CPU debug flags.
>>>>>>>>> There are 192 ROB entries, which is why the instList size generally has a
>>>>>>>>> max of about 192 instructions. The dynamic instruction counts (seen in the
>>>>>>>>> dramsim2 plot) seem to also imply that instructions are incorrectly being
>>>>>>>>> removed from the ROB, and then from the cpu's instruction list in cpu.cc,
>>>>>>>>> which allows more and more instructions to be added to the system (possibly
>>>>>>>>> from a bad branch).
>>>>>>>>> I appreciate any help in debugging this and further figuring out the
>>>>>>>>> root problem, just let me know if you need anything else from me. I don't
>>>>>>>>> have much more time at the moment to debug, but I can take any advice for
>>>>>>>>> quick changes and/or additional traces, then send the results back to the
>>>>>>>>> list for discussion.
>>>>>>>>> Thanks,
>>>>>>>>> Andrew
>>>>>>>>> P.S. Paul - I did try decreasing the size of the dramsim2
>>>>>>>>> transaction (and even command) queue from 512 to 32. The same instructions
>>>>>>>>> problem occurred. It basically just decreased the execution time.
>>>>>>>>>
>>>>>>>>> On Wed, Mar 14, 2012 at 2:10 PM, Ali Saidi <***@umich.edu> wrote:
>>>>>>>>>
>>>>>>>>>> The error is that there are more than 1500 instructions currently
>>>>>>>>>> in flight in the system. It could mean several things:
>>>>>>>>>>
>>>>>>>>>> 1. The value is somewhat arbitrarily defined and maybe there are
>>>>>>>>>> more than 1500 in your system at one time?
>>>>>>>>>>
>>>>>>>>>> 2. Instructions aren't being destroyed correctly
>>>>>>>>>>
>>>>>>>>>> You could try to run a debug binary so you'll get a list of
>>>>>>>>>> instructions when it happens or increase the number which may
>>>>>>>>>> be appropriate for certain situations (but 1500 is quite a few inflight
>>>>>>>>>> instructions).
>>>>>>>>>>
>>>>>>>>>> Ali
>>>>>>>>>>
>>>>>>>>>> On 13.03.2012 10:56, Andrew Cebulski wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Xiangyu,
>>>>>>>>>> I just started looking into this some more. So at first I
>>>>>>>>>> thought it was due to updating to a more recent revision, but then I went
>>>>>>>>>> back to revision 8643, added your patch, built and ran....and now get the
>>>>>>>>>> error with it too (when running ARM_FS/gem5.opt). I'm testing now to see
>>>>>>>>>> if an update to SWIG might have resulted in this error, maybe someone on
>>>>>>>>>> the mailing list would know if that's possible. The difference is 1.3.40
>>>>>>>>>> vs. 2.0.3, both of which are supported according to the dependencies wiki
>>>>>>>>>> page.
>>>>>>>>>> Just for completeness, here's the error from revision 8643:
>>>>>>>>>> build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
>>>>>>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>>>>>>>> I have not tried running with gem5.debug, so I will be doing
>>>>>>>>>> that today. Maybe this is an assertion that is occurring due to an
>>>>>>>>>> optimization. That would mean it wouldn't be triggered in gem5.debug since
>>>>>>>>>> it runs without optimizations. Have you tested all debug, opt and fast
>>>>>>>>>> with your tests?
>>>>>>>>>> Thanks,
>>>>>>>>>> Andrew
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 13, 2012 at 1:37 PM, Rio Xiangyu Dong <
>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I didn't see this error in my simulations. May I ask which gem5
>>>>>>>>>>> version you are using? I find some of the latest code updates do not comply
>>>>>>>>>>> with my changes. I am still using the DRAMsim2 patch on Gem5 repo8643, and
>>>>>>>>>>> have run all the runnable benchmarks in SPEC2006, SPEC2000, EEMBC2, and
>>>>>>>>>>> PARSEC2 on ARM_SE.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thank you!
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>>
>>>>>>>>>>> Xiangyu
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> *From:* Andrew Cebulski [mailto:***@drexel.edu]
>>>>>>>>>>> *Sent:* Thursday, March 08, 2012 6:52 PM
>>>>>>>>>>>
>>>>>>>>>>> *To:* gem5 users mailing list
>>>>>>>>>>> *Cc:****@gmail.com; ***@umich.edu
>>>>>>>>>>>
>>>>>>>>>>> *Subject:* Re: [gem5-users] A Patch for DRAMsim2 Integration
>>>>>>>>>>>
>>>>>>>>>>> Xiangyu,
>>>>>>>>>>>
>>>>>>>>>>> I've been having an issue recently with the number of
>>>>>>>>>>> instructions I've been seeing committed to the CPU (I have a separate
>>>>>>>>>>> thread on this). It turns out the issue seems to be coming from this patch
>>>>>>>>>>> you created to integrate DramSim2 with Gem5. Unfortunately, I've been
>>>>>>>>>>> running with gem5.fast, not gem5.opt. So up until now, I haven't been
>>>>>>>>>>> seeing assertions. I thought I'd run it with gem5.opt or debug back in
>>>>>>>>>>> December, but I must not have. My runs on the Arm O3 cpu fail with this
>>>>>>>>>>> assertion:
>>>>>>>>>>>
>>>>>>>>>>> build/ARM/cpu/base_dyn_inst_impl.hh:149: void
>>>>>>>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>>>>>>>>>
>>>>>>>>>>> -Andrew
>>>>>>>>>>>
>>>>>>>>>>> Date: Sun, 18 Dec 2011 01:48:58 -0800
>>>>>>>>>>> From: "Dong, Xiangyu" <***@gmail.com>
>>>>>>>>>>> To: "gem5 users mailing list" <gem5-***@gem5.org>
>>>>>>>>>>> Subject: [gem5-users] A Patch for DRAMsim2 Integration
>>>>>>>>>>> Message-ID: gmail.com>
>>>>>>>>>>>
>>>>>>>>>>> Content-Type: text/plain; charset="us-ascii"
>>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I have a Gem5+DRAMsim2 patch. I've tested it under both SE and FS
>>>>>>>>>>> modes.
>>>>>>>>>>> I'm willing to share it here.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> For those who have such needs, please go to my website
>>>>>>>>>>> www.cse.psu.edu/~xydong <http://www.cse.psu.edu/%7Exydong> to
>>>>>>>>>>> download the patch and test it. To enable
>>>>>>>>>>> DRAMSim2, use se_dramsim2.py script instead of se.py (for FS, you
>>>>>>>>>>> can create
>>>>>>>>>>> by yourself). The basic idea to enable the DRAMsim2 module is to
>>>>>>>>>>> use the
>>>>>>>>>>> derived DRAMMemory class instead of PhysicalMemory class.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Please let me know if there are bugs.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thank you!
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>>
>>>>>>>>>>> Xiangyu Dong
>>>>>>>>>>>
> _______________________________________________
> gem5-users mailing list
> gem5-***@gem5.org
> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
>
Andrew Cebulski
2012-05-08 01:53:39 UTC
Permalink
Hi Ali and Gabe,

Here's the trace file:
http://dl.dropbox.com/u/2953302/gem5/table_walker.out

The pending queue size in the table walker follows the shape of the
dynamic instruction curves. The L1 and L2 state queue sizes never go above 0.
Here is a plot comparing the DynInst count (cpu->instcount) with the
pendingQueue size:

http://dl.dropbox.com/u/2953302/gem5/pendingQueueSize.png
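
For anyone who would rather check the similarity numerically than by eye, here is a minimal sketch. It assumes each series is available as text lines of `tick value` pairs (the shape produced by the zgrep/awk commands earlier in the thread, where the tick may carry a trailing colon). The function names and the sample numbers are hypothetical and not taken from the actual traces.

```python
from bisect import bisect_right
from math import sqrt


def parse_series(lines):
    """Parse 'tick value' pairs; a trailing ':' on the tick is tolerated."""
    series = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip anything that is not a two-column line
        series.append((int(parts[0].rstrip(":")), float(parts[1])))
    series.sort()
    return series


def align(a, b):
    """Sample both series at every tick where either changes,
    holding the most recent value of each (step function)."""
    ticks_a = [t for t, _ in a]
    ticks_b = [t for t, _ in b]
    xs, ys = [], []
    for tick in sorted(set(ticks_a) | set(ticks_b)):
        i = bisect_right(ticks_a, tick)
        j = bisect_right(ticks_b, tick)
        if i and j:  # skip ticks before both series have started
            xs.append(a[i - 1][1])
            ys.append(b[j - 1][1])
    return xs, ys


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if a series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


# Made-up sample data standing in for the instcount and pendingQueue series.
inst = ["100: 10", "200: 20", "300: 30", "400: 15"]
pend = ["100: 1", "200: 2", "300: 3", "400: 1.5"]
xs, ys = align(parse_series(inst), parse_series(pend))
r = pearson(xs, ys)
print(r)
```

A correlation close to 1 would support the observation that the pending queue size tracks the in-flight DynInst count; it says nothing by itself about which one causes the other.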

-Andrew

On Sun, May 6, 2012 at 12:01 PM, Ali Saidi <***@umich.edu> wrote:

> Hi Andrew,
>
> Could you add some code to the table walker to see how big the following
> are getting:
> stateQueueL1.size()
> stateQueueL2.size()
> pendingQueue.size()
>
> Perhaps we're somehow getting into a loop where there are a lot of
> translations to invalid addresses that get squashed and they pile up in the
> table walker?
>
> Thanks,
> Ali
>
>
>
> On May 4, 2012, at 7:53 AM, Gabriel Michael Black wrote:
>
> > I haven't had a chance to study what's going on here, but could the
> > problem be that we don't have bandwidth limits/back pressure implemented
> > for the TLB and delayed translation? It could be that the CPU is pumping
> > instructions into translation which eventually drain out/are squashed, and
> > if too many accumulate they trip that assert.
> >
> > That may not actually make any sense as far as what the code is actually
> > doing, but it occurred to me as a possibility and I thought I'd throw it
> > out there.
> >
> > Gabe
> >
> > Quoting Andrew Cebulski <***@drexel.edu>:
> >
> >> I double-checked by looking at the config.ini file. It turns out I did
> >> actually create the checkpoint with an Atomic CPU without caches. Sorry
> >> for the confusion.
> >>
> >> -Andrew
> >>
> >> On Wed, May 2, 2012 at 10:12 PM, Andrew Cebulski <***@drexel.edu> wrote:
> >>
> >>> I started hitting this assertion (that the number of insts in flight
> >>> was > 1500) before I started using a checkpoint. I created the checkpoint
> >>> afterwards to decrease the time needed to run simulations to debug this
> >>> problem. I'll create a new checkpoint, then send the new trace output.
> >>>
> >>> -Andrew
> >>>
> >>>
> >>> On Wed, May 2, 2012 at 9:53 PM, Ali Saidi <***@umich.edu> wrote:
> >>>
> >>>> It's likely the cause for all of your problems. Dirty data in the caches
> >>>> doesn't get restored either. You should always create checkpoints with an
> >>>> atomic cpu and without caches.
> >>>>
> >>>>
> >>>>
> >>>> Ali
> >>>>
> >>>>
> >>>>
> >>>> On 02.05.2012 21:23, Andrew Cebulski wrote:
> >>>>
> >>>> Sorry, I created the checkpoint I referred to with an O3 CPU with caches.
> >>>> From what I recall reading, caches don't get restored from checkpoints.
> >>>> Since the checkpoint wasn't during the benchmark run, I assumed that was
> >>>> okay.
> >>>> -Andrew
> >>>>
> >>>> On Wed, May 2, 2012 at 9:07 PM, Ali Saidi <***@umich.edu> wrote:
> >>>>
> >>>>> You haven't answered the question about if you created the checkpoints
> >>>>> with an atomic cpu without caches.
> >>>>>
> >>>>> Ali
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 02.05.2012 19:58, Andrew Cebulski wrote:
> >>>>>
> >>>>> I have not run with the checker CPU recently. Here's the stderr output
> >>>>> from a run I did a while back:
> >>>>> http://dl.dropbox.com/u/2953302/gem5/err.0
> >>>>> Note that the instruction match error is before my benchmark actually
> >>>>> starts running. The start of my boot script checks to see if my files
> >>>>> image is mounted (which it is), then continues on to run the benchmark. I
> >>>>> booted the system, mounted my files image, then took a checkpoint. I've
> >>>>> been running all my tests from that checkpoint. I found where my benchmark
> >>>>> started based on the ASID (from ExecAsid debug flag).
> >>>>> I delayed the start of gathering trace data until the second-to-last
> >>>>> linear increase in dynamic instructions in-flight. I'm running a new trace
> >>>>> now.
> >>>>> -Andrew
> >>>>>
> >>>>>
> >>>>> On Wed, May 2, 2012 at 5:28 PM, Ali Saidi <***@umich.edu> wrote:
> >>>>>
> >>>>>> Something is wrong well before this point. There is no reason that
> >>>>>> address 0x0 or 0x4 should be translated.
> >>>>>>
> >>>>>> Did you happen to create a checkpoint when caches were in the system?
> >>>>>>
> >>>>>> Have you tried to run with the checker cpu and see if it detects any
> >>>>>> errors?
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Ali
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 02.05.2012 17:22, Andrew Cebulski wrote:
> >>>>>>
> >>>>>> They are data TLB misses that occur as the in-flight instruction count
> >>>>>> rises (at 0x0 and 0x4). The last TLB miss before the in-flight instruction
> >>>>>> count finally linearly decreases is to 0x200. Also, at the start of the
> >>>>>> rising slope, I see a miss to 0x8 and 0x2508c.
> >>>>>> Here's a trace file:
> >>>>>> http://dl.dropbox.com/u/2953302/gem5/tlb.out
> >>>>>> To reduce size, I just have lines that have either TLB or walker in
> >>>>>> them.
> >>>>>> I do see only a handful of instruction TLB misses.
> >>>>>> -Andrew
> >>>>>>
> >>>>>> On Wed, May 2, 2012 at 11:10 AM, Ali Saidi <***@umich.edu> wrote:
> >>>>>>
> >>>>>>> Hi Andrew,
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Thanks for digging into this. I think there is an issue somewhere, but
> >>>>>>> I'm still not sure where.
> >>>>>>>
> >>>>>>> Ali
> >>>>>>>
> >>>>>>> On 01.05.2012 23:34, Andrew Cebulski wrote:
> >>>>>>>
> >>>>>>> Okay, I'm positive now that the issue lies with delayed translations
> >>>>>>> that are squashed before finishing.
> >>>>>>>
> >>>>>>> On the data or instruction side? You seem to allude to data in the
> >>>>>>> paragraph below, but then instructions in the latter text.
> >>>>>>>
> >>>>>>> It seems to me like speculative load/stores are being executed,
> >>>>>>> rather than waiting for the instructions to commit. Once the instructions
> >>>>>>> begin getting (speculatively) executed in the TLB, a reference is left
> >>>>>>> there, which seems hard to root out and dereference after the instruction
> >>>>>>> ends up being squashed. At least, I have not been able to find that out in
> >>>>>>> the source code as of yet. Can anyone clarify on this?
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> There should only be one translation outstanding from each
> >>>>>>> instruction and data side walker. Any nested transactions should be queued
> >>>>>>> in the walker. Until one finishes, I'm not sure how multiple would ever be
> >>>>>>> outstanding.
> >>>>>>>
> >>>>>>> Recall the following image that shows how the number of dynamic
> >>>>>>> instruction (DynInst) objects in-flight increases linearly for varying
> >>>>>>> periods of time:
> >>>>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
> >>>>>>> After enabling the TLB debug flag, I see that the linear increase in
> >>>>>>> instructions in flight is proportional to the number of TLB misses. These
> >>>>>>> TLB misses have a much larger delay (resulting in translation delays) due
> >>>>>>> to the fact that DramSim2 models the memory system more accurately. It
> >>>>>>> seems that with the classic memory system, TLB misses often do not have
> >>>>>>> translation delays. For whatever reason, it would also seem that every
> >>>>>>> instruction that has a TLB miss also is eventually squashed...
> >>>>>>>
> >>>>>>> From a data side perspective this is reasonable. While a miss is
> >>>>>>> outstanding at some point instructions will stop committing and thus the
> >>>>>>> instructions in flight will begin to rise until the miss is satisfied.
> >>>>>>>
> >>>>>>> Here's a summary of outputs from my trace. These two DPRINTF
> >>>>>>> messages appear on the rising slopes (repeated up until the peak):
> >>>>>>> TLB Miss: Starting hardware table walker for 0(656)
> >>>>>>> TLB Miss: Starting hardware table walker for 0x4(656)
> >>>>>>>
> >>>>>>> This is interesting/odd. I don't know a good reason why (1) a miss
> >>>>>>> would be outstanding to both address 0 and address 4 at the same time. In
> >>>>>>> almost all cases these pages are marked as no-access to detect segfaults.
> >>>>>>> Perhaps there is an issue where the cpu is getting into a loop faulting on
> >>>>>>> a bad access and then faulting again on the fault handler. I could imagine
> >>>>>>> this would happen if there was some corruption in the memory system (for
> >>>>>>> example the timings in dramsim exposing a bug in the cache models or
> >>>>>>> something).
> >>>>>>>
> >>>>>>>
> >>>>>>> At the peak, the following message appears (from fetch) almost every
> >>>>>>> tick for (what I believe to be) every single one of the table walkers that
> >>>>>>> were squashed.
> >>>>>>> Fetch is waiting ITLB walk to finish!
> >>>>>>>
> >>>>>>> There must be another walk in flight? The instruction side will only
> >>>>>>> have one fault outstanding at once. Successive branch mispredicts will
> >>>>>>> re-direct fetch but there is code that catches the fact that a different
> >>>>>>> walk completed than expected and "does the right thing."
> >>>>>>>
> >>>>>>> The problem is that these ITLB table walks are for instructions that
> >>>>>>> were squashed as much as 0.3 billion cycles earlier, and have since been
> >>>>>>> removed from the CPU's instruction list.
> >>>>>>>
> >>>>>>> I'm not following here.
> >>>>>>>
> >>>>>>> Any help will be greatly appreciated in solving this problem. I've
> >>>>>>> hit a roadblock with getting Ruby working with ARM, most likely due to the
> >>>>>>> fact that ARM has disjoint memory (x86 and Alpha do not). There's the 256
> >>>>>>> MB for physical memory, then the 64 MB for the boot loader. I brought this
> >>>>>>> up in my last email about trying to get Ruby working. Therefore, I'm
> >>>>>>> trying to get this DramSim2 integration fixed so I can start modeling FS
> >>>>>>> with DRAM memory.
> >>>>>>>
> >>>>>>> Brad/Steve/Nilay anyone have a suggestion on how to make this work?
> >>>>>>>
> >>>>>>>
> >>>>>>> Note that these problems also occur in Soplex from the Spec CPU2006
> >>>>>>> benchmark suite (also hits 1500 in-flight instructions assertion). Due to
> >>>>>>> time constraints, I haven't tested on other benchmarks.
> >>>>>>> Thanks,
> >>>>>>> Andrew
> >>>>>>> On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski <***@drexel.edu> wrote:
> >>>>>>>
> >>>>>>>> Hey Gabe,
> >>>>>>>> Thanks for this...very helpful. I just recently got back into
> >>>>>>>> debugging this problem. I made a small change in src/base/refcnt.hh to
> >>>>>>>> allow me to return the current count of references to a DynInst object.
> >>>>>>>> I then modified existing DPRINTFs to also print out reference
> >>>>>>>> counts, then added some of my own when I needed extra visibility.
> >>>>>>>> I've found one memory store instruction that seems to be getting
> >>>>>>>> lost. What's happening is that it progresses as far as getting executed in
> >>>>>>>> the IEW once, but a delayed translation occurs, deferring the store. By
> >>>>>>>> the time it reenters the IEW, the IQ has marked the instruction as
> >>>>>>>> squashed. Everything progresses as usual from here on out, with one
> >>>>>>>> exception. When the instruction is removed from the CPU's instruction list,
> >>>>>>>> there is one reference count hanging.
> >>>>>>>> I've added in some additional debugging for my traces to help
> >>>>>>>> narrow down where this reference is coming from. As far as I can tell,
> >>>>>>>> it's because of a call to initiateAcc() within the executeStore function in
> >>>>>>>> the lsq unit. Please see the following two traces. The first trace shows
> >>>>>>>> what I just discussed. The second trace is another memory store
> >>>>>>>> instruction that got squashed, however, it was squashed upon its first
> >>>>>>>> entry into the IEW, therefore it never started execution.
> >>>>>>>> http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
> >>>>>>>> http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
> >>>>>>>> Let me know if you have any ideas based on these two instruction
> >>>>>>>> traces. I do not understand how the initiateAcc function results in
> >>>>>>>> another reference, but maybe someone else does.... Since I don't see how
> >>>>>>>> it makes a reference, it's hard to find out how to make sure it gets
> >>>>>>>> dereferenced...
> >>>>>>>> Unfortunately, I haven't been able to add a DPRINTF in
> >>>>>>>> src/base/refcnt.hh ...this would make things more clear (i.e. exactly when
> >>>>>>>> references/dereferences occur). Let me know if you have any advice on
> >>>>>>>> this...if it's possible. I can't seem to get the right include files, and
> >>>>>>>> likely right SConscript compile order...
> >>>>>>>> Thanks,
> >>>>>>>> Andrew
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Sat, Apr 7, 2012 at 9:48 PM, Gabe Black <***@eecs.umich.edu> wrote:
> >>>>>>>>
> >>>>>>>>> Without digging into things too deeply, it looks like you may be
> >>>>>>>>> leaking references to dynamic instructions. The CPU may think it's done
> >>>>>>>>> with one, but until that final reference is removed, the object will hang
> >>>>>>>>> around forever. I think I've had problems before where the reference
> >>>>>>>>> count ended up off by one somehow and instructions would start piling up.
> >>>>>>>>> It's also possible that a clog develops in O3's pipeline and some internal
> >>>>>>>>> structure stops letting instructions through and starts accumulating them.
> >>>>>>>>> Either of these problems will be annoying to track down, but with enough
> >>>>>>>>> digging I've been able to fix these sorts of things.
> >>>>>>>>>
> >>>>>>>>> This may have more to do with O3 not handling the benchmark you're
> >>>>>>>>> running well rather than a problem with your new DRAM model. There may be
> >>>>>>>>> some interaction between the two, though, where the new memory makes the
> >>>>>>>>> timing line up to cause O3 to behave poorly. What you can do is instrument
> >>>>>>>>> dynamic instruction creation and destruction and reference counting (try
> >>>>>>>>> printing "this" for both the reference counting wrapper and the dyn inst
> >>>>>>>>> itself) and turn it on as close as you can to where things go bad
> >>>>>>>>> tick-wise. Then look for an instruction which gets lost, and look for where its
> >>>>>>>>> reference count is incremented and decremented. It should be relatively
> >>>>>>>>> easy to pair up where references are created and destroyed, and you should
> >>>>>>>>> be able to identify the reference which never goes away. Then you need to
> >>>>>>>>> figure out where that reference is being created. After that, you should
> >>>>>>>>> have enough information to identify why the reference counting isn't being
> >>>>>>>>> done correctly. It's arduous, but that's the only way.
> >>>>>>>>>
> >>>>>>>>> It's important to also make sure reference counts aren't decremented
> >>>>>>>>> to zero prematurely. I had a problem once where that happened and the
> >>>>>>>>> memory behind the object was updated by something that didn't know it was
> >>>>>>>>> dead. The memory had since been reallocated to another object of the same
> >>>>>>>>> type, so that other object reflected what happened to the phantom one. If I
> >>>>>>>>> remember that manifested as something weird like an add causing a page
> >>>>>>>>> fault or something.
> >>>>>>>>>
> >>>>>>>>> Gabe
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On 04/07/12 18:21, Andrew Cebulski wrote:
> >>>>>>>>>
> >>>>>>>>> Hi all,
> >>>>>>>>> I've looked into this problem some more, and have put together a
> >>>>>>>>> couple traces. I've been becoming more familiar with how gem5
> handles
> >>>>>>>>> dynamic instructions, in particular how it destroys them. I
> have two
> >>>>>>>>> traces to compare, one with the physical memory, and the other
> with the
> >>>>>>>>> integrated dramsim2 dram memory. I also have two plots showing
> instruction
> >>>>>>>>> counts over time (sim ticks). All of these are linked at the
> end of the
> >>>>>>>>> email.
> >>>>>>>>> First, I'm going to go into what I've been able to interpret
> >>>>>>>>> regarding how instructions are destroyed. In particular,
> comparing when
> >>>>>>>>> DynInst's vs. DynInstPtr's are deconstructed/removed from the
> cpu. I
> >>>>>>>>> separate these because I've seen a difference, as I discuss
> later. These
> >>>>>>>>> explanations are fairly non-existent on the wiki. There is a
> section
> >>>>>>>>> header waiting to be filled...
> >>>>>>>>> From what I have been able to gather from the code, there is a
> list
> >>>>>>>>> of all the instructions in flight in cpu/o3/cpu.cc called
> instList, with
> >>>>>>>>> the type DynInstPtr. There are three conditions to instructions
> being
> >>>>>>>>> cleaned from this list:
> >>>>>>>>> 1.) The ROB retires its head instruction
> >>>>>>>>> 2.) Fetch receives a rob squashing signal from the commit,
> >>>>>>>>> resulting in removing any instruction not in the ROB
> >>>>>>>>> 3.) Decode detects an incorrect branch prediction, resulting in
> >>>>>>>>> removal of all instructions back to the bad seq num.
> >>>>>>>>> Once all five stages have completed, the CPU cleans up all the
> >>>>>>>>> removed in-flight instructions. This line in particular
> >>>>>>>>> in cleanUpRemovedInsts() in cpu/o3/cpu.cc deconstructs a
> DynInstPtr:
> >>>>>>>>> instList.erase(removeList.front());
> >>>>>>>>> When I turn on the debug flag O3CPU, I see the message "Removing
> >>>>>>>>> instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum
> and pcState
> >>>>>>>>> after all 5 cpu stages have completed, and one of the conditions
> above is
> >>>>>>>>> met. I also see what tick it occurs on.
> >>>>>>>>> When I turn on the DynInst debug flag, I see when instructions
> are
> >>>>>>>>> created and destroyed (cpu/base_dyn_inst_impl.hh) and what tick.
> From
> >>>>>>>>> analyzing the trace files, I've gathered that this takes into
> account that
> >>>>>>>>> instructions have different execution lengths. So if a memory
> >>>>>>>>> instruction is removed from the instList (DynInstPtr) on one tick, the
> >>>>>>>>> DynInst for that memory instruction may be destroyed much later (e.g. 1M
> >>>>>>>>> ticks later). I have yet
> >>>>>>>>> to determine how this is implemented.
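One plausible explanation for the gap, assuming DynInstPtr is a reference-counting smart pointer to a DynInst: erasing an entry from instList only drops one reference, and the object itself is destroyed when the last holder lets go. The toy below is a Python analogy, not gem5 code. The class and variable names are invented, and it relies on CPython destroying an object as soon as its reference count reaches zero.

```python
import weakref


class DynInst:
    """Stand-in for a dynamic instruction (not the gem5 class)."""

    def __init__(self, seq_num):
        self.seq_num = seq_num


def demo():
    log = []
    inst = DynInst(42)
    # Record the moment the object is actually destroyed.
    weakref.finalize(inst, log.append, "destroyed")

    inst_list = [inst]        # plays the role of the CPU's in-flight list
    pending = {"inst": inst}  # some other structure that still holds it
    del inst                  # drop the local name so only the two holders remain

    inst_list.clear()         # "removed from the list", object still alive
    log.append("removed from list")

    pending.clear()           # last reference gone, object destroyed here
    log.append("last holder released")
    return log
```

In this toy the "destroyed" entry lands between the other two, which is the same ordering as seeing the O3CPU removal message well before the DynInst destruction message.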
> >>>>>>>>> Now for the problem.
> >>>>>>>>> What I'm seeing when I run dramsim2 dram memory is a significant
> >>>>>>>>> difference between the size of the instList vector (of
> DynInstPtr objects),
> >>>>>>>>> and the size of dynamic instruction count (of DynInst objects).
> The
> >>>>>>>>> benchmark I'm running is libquantum from SPEC 2006. For the
> first roughly
> >>>>>>>>> 130B ticks, the dynamic instruction count kept in
> cpu/base_dyn_inst_impl.hh
> >>>>>>>>> shadows the instList size in o3/cpu.cc (figure linked below)
> very closely.
> >>>>>>>>> Around tick 130B after libquantum started, it starts hitting
> what I'm
> >>>>>>>>> assuming are loops (therefore branch prediction), resulting in
> some
> >>>>>>>>> behavior that seems to imply improper instruction handling (i.e.
> more
> >>>>>>>>> instructions in flight than allowed by ROB).
> >>>>>>>>> I wasn't able to sync-up the physical and dramsim2 traces
> exactly by
> >>>>>>>>> trace, but they should represent roughly the same area of
> execution. They
> >>>>>>>>> don't execute the same due to the dramsim2 modeling the memory
> differently
> >>>>>>>>> (i.e. latency and other delays).
> >>>>>>>>> I've shared both traces on my public Dropbox here --
> >>>>>>>>>
> >>>>>>>>>
> http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
> >>>>>>>>>
> >>>>>>>>>
> http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
> >>>>>>>>> Here are a couple plots of tick versus instruction count, with
> >>>>>>>>> respect to cpu->instcount in cpu/base_dyn_inst_impl.hh and
> instList.size()
> >>>>>>>>> in cpu/o3/cpu.cc. --
> >>>>>>>>>
> >>>>>>>>>
> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
> >>>>>>>>>
> >>>>>>>>>
> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
> >>>>>>>>> Note that I added the printout of the instList size to an
> existing
> >>>>>>>>> O3CPU DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.
> >>>>>>>>> Here are the commands I ran to parse the traces into data files
> to
> >>>>>>>>> analyze in MATLAB and create the plots:
> >>>>>>>>> zgrep DynInst
> >>>>>>>>> dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz |
> grep destroyed
> >>>>>>>>> | awk '{print $1,$11}' > cpuinstcount.out
> >>>>>>>>> zgrep instList
> >>>>>>>>> dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | awk
> '{print
> >>>>>>>>> $1,$11}' > instlistsize.out
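For anyone who would rather not use the shell pipeline, the two commands above could be approximated in Python roughly as follows. This is a sketch: the helper names are made up, plain substring matching stands in for zgrep/grep, and, like the awk invocation, it assumes the tick is whitespace-separated field 1 and the count is field 11 of each matching line.

```python
import gzip


def extract_fields(path, required, fields=(1, 11)):
    """Yield the chosen 1-based whitespace-separated fields of every line in
    a gzipped trace that contains all of the substrings in `required`.

    Roughly: zgrep A file.gz | grep B | awk '{print $1,$11}'
    """
    with gzip.open(path, "rt", errors="replace") as trace:
        for line in trace:
            if not all(word in line for word in required):
                continue
            parts = line.split()
            # awk prints an empty string for a missing field; do the same.
            yield tuple(parts[i - 1] if i <= len(parts) else "" for i in fields)


def write_series(path, required, out_path):
    """Write one 'field1 field11' pair per line and return the line count."""
    count = 0
    with open(out_path, "w") as out:
        for first, second in extract_fields(path, required):
            out.write(f"{first} {second}\n")
            count += 1
    return count
```

For example, write_series(trace, ["DynInst", "destroyed"], "cpuinstcount.out") and write_series(trace, ["instList"], "instlistsize.out") would correspond to the two commands.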
> >>>>>>>>> It seems to me like the problem might lie in gem5, but has just
> been
> >>>>>>>>> exposed by integrating this more detailed memory model,
> dramsim2, into
> >>>>>>>>> gem5. Either that, or there are some timing errors in how
> dramsim2 was
> >>>>>>>>> integrated. I doubt this, however, since those first 190B ticks
> executed
> >>>>>>>>> used the dramsim2 memory. I believe the problem is a
> combination of memory
> >>>>>>>>> instructions + complex loops (branch prediction), resulting in
> improper
> >>>>>>>>> destroying of instructions.
> >>>>>>>>> I've included the ROB, Commit, Fetch, DynInst and O3CPU debug
> flags.
> >>>>>>>>> There are 192 ROB entries, which is why the instList size
> generally has a
> >>>>>>>>> max of about 192 instructions. The dynamic instruction counts
> (seen in the
> >>>>>>>>> dramsim2 plot) seem to also imply that instructions are
> incorrectly being
> >>>>>>>>> removed from the ROB, and then from the cpu's instruction list
> in cpu.cc,
> >>>>>>>>> which allows more and more instructions to be added to the
> system (possibly
> >>>>>>>>> from a bad branch).
> >>>>>>>>> I appreciate any help in debugging this and further figuring out
> the
> >>>>>>>>> root problem, just let me know if you need anything else from
> me. I don't
> >>>>>>>>> have much more time at the moment to debug, but I can take any
> advice for
> >>>>>>>>> quick changes and/or additional traces, then send the results
> back to the
> >>>>>>>>> list for discussion.
> >>>>>>>>> Thanks,
> >>>>>>>>> Andrew
> >>>>>>>>> P.S. Paul - I did try decreasing the size of the dramsim2
> >>>>>>>>> transaction (and even command) queue from 512 to 32. The same
> instructions
> >>>>>>>>> problem occurred. It basically just decreased the execution
> time.
> >>>>>>>>>
> >>>>>>>>> On Wed, Mar 14, 2012 at 2:10 PM, Ali Saidi <***@umich.edu>
> wrote:
> >>>>>>>>>
> >>>>>>>>>> The error is that there are more than 1500 instructions
> currently
> >>>>>>>>>> in flight in the system. It could mean several things:
> >>>>>>>>>>
> >>>>>>>>>> 1. The value is somewhat arbitrarily defined and maybe there are
> >>>>>>>>>> more than 1500 in your system at one time?
> >>>>>>>>>>
> >>>>>>>>>> 2. Instructions aren't being destroyed correctly
> >>>>>>>>>>
> >>>>>>>>>> You could try to run a debug binary so you'll get a list of
> >>>>>>>>>> instructions when it happens or increase the number which may
> >>>>>>>>>> be appropriate for certain situations (but 1500 is quite a few
> inflight
> >>>>>>>>>> instructions).
> >>>>>>>>>>
> >>>>>>>>>> Ali
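As a rough illustration of the kind of check being described, here is a toy counter that goes up when an instruction object is created, down when it is destroyed, and complains once the count passes a cap. It is not the gem5 code: the class name is invented, and the exact comparison against 1500 is an assumption.

```python
class InFlightCounter:
    """Toy model of a cap on live instruction objects (not gem5 code)."""

    def __init__(self, limit=1500):
        self.limit = limit  # 1500 per the thread; exact comparison is assumed
        self.count = 0

    def created(self):
        self.count += 1
        if self.count > self.limit:
            raise RuntimeError(
                f"{self.count} instructions in flight, limit is {self.limit}"
            )

    def destroyed(self):
        self.count -= 1


def run(leak_every, steps, limit=1500):
    """Create and destroy one instruction per step, but skip the destroy on
    every `leak_every`-th step. Return the step at which the cap trips, or
    None if it never does."""
    counter = InFlightCounter(limit)
    for step in range(1, steps + 1):
        try:
            counter.created()
        except RuntimeError:
            return step
        if step % leak_every != 0:
            counter.destroyed()
    return None
```

The point of the toy is that even a slow leak, where only an occasional object is never destroyed, will eventually trip the cap, which matches the second possibility in the list above.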
> >>>>>>>>>>
> >>>>>>>>>> On 13.03.2012 10:56, Andrew Cebulski wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi Xiangyu,
> >>>>>>>>>> I just started looking into this some more. So at first I
> >>>>>>>>>> thought it was due to updating to a more recent revision, but
> then I went
> >>>>>>>>>> back to revision 8643, added your patch, built and ran....and
> now get the
> >>>>>>>>>> error with it too (when running ARM_FS/gem5.opt). I'm testing
> now to see
> >>>>>>>>>> if an update to SWIG might have resulted in this error, maybe
> someone on
> >>>>>>>>>> the mailing list would know if that's possible. The difference
> is 1.3.40
> >>>>>>>>>> vs. 2.0.3, both of which are supported according to the
> dependencies wiki
> >>>>>>>>>> page.
> >>>>>>>>>> Just for completeness, here's the error from revision 8643:
> >>>>>>>>>> build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
> >>>>>>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion
> `cpu->instcount
> >>>>>>>>>> I have not tried running with gem5.debug, so I will be doing
> >>>>>>>>>> that today. Maybe this is an assertion that is occurring due
> to an
> >>>>>>>>>> optimization. That would mean it wouldn't be triggered in
> gem5.debug since
> >>>>>>>>>> it runs without optimizations. Have you tested all debug, opt
> and fast
> >>>>>>>>>> with your tests?
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Andrew
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Mar 13, 2012 at 1:37 PM, Rio Xiangyu Dong <
> >>>>>>>>>> ***@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi Andrew,
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> I didn't see this error in my simulations. May I ask which gem5
> >>>>>>>>>>> version you are using? I find some of the latest code updates
> do not comply
> >>>>>>>>>>> with my changes. I am still using the DRAMsim2 patch on Gem5
> repo8643, and
> >>>>>>>>>>> have run all the runnable benchmarks in SPEC2006, SPEC2000,
> EEMBC2, and
> >>>>>>>>>>> PARSEC2 on ARM_SE.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Thank you!
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>>
> >>>>>>>>>>> Xiangyu
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> *From:* Andrew Cebulski [mailto:***@drexel.edu]
> >>>>>>>>>>> *Sent:* Thursday, March 08, 2012 6:52 PM
> >>>>>>>>>>>
> >>>>>>>>>>> *To:* gem5 users mailing list
> >>>>>>>>>>> *Cc:****@gmail.com; ***@umich.edu
> >>>>>>>>>>>
> >>>>>>>>>>> *Subject:* Re: [gem5-users] A Patch for DRAMsim2 Integration
> >>>>>>>>>>>
> >>>>>>>>>>> Xiangyu,
> >>>>>>>>>>>
> >>>>>>>>>>> I've been having an issue recently with the number of
> >>>>>>>>>>> instructions I've been seeing committed to the CPU (I have a
> separate
> >>>>>>>>>>> thread on this). It turns out the issue seems to be coming
> from this patch
> >>>>>>>>>>> you created to integrate DramSim2 with Gem5. Unfortunately,
> I've been
> >>>>>>>>>>> running with gem5.fast, not gem5.opt. So up until now, I
> haven't been
> >>>>>>>>>>> seeing assertions. I thought I'd run it with gem5.opt or
> debug back in
> >>>>>>>>>>> December, but I must not have. My runs on the Arm O3 cpu
> fail with this
> >>>>>>>>>>> assertion:
> >>>>>>>>>>>
> >>>>>>>>>>> build/ARM/cpu/base_dyn_inst_impl.hh:149: void
> >>>>>>>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion
> `cpu->instcount
> >>>>>>>>>>>
> >>>>>>>>>>> -Andrew
> >>>>>>>>>>>
> >>>>>>>>>> _______________________________________________
> >>>>>>>>>> gem5-users mailing list
> >>>>>>>>>> gem5-***@gem5.org
> >>>>>>>>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
Andrew Cebulski
2012-05-10 16:51:43 UTC
Permalink
Ali/Gabe,

Do either of you need anything else from me to debug on your end?
Should I send a more detailed trace?

Thanks,
Andrew

On Mon, May 7, 2012 at 9:53 PM, Andrew Cebulski <***@drexel.edu> wrote:

> Hi Ali and Gabe,
>
> Here's the trace file:
> http://dl.dropbox.com/u/2953302/gem5/table_walker.out
>
> The pending queue size in the table walker follows the shape of the
> dynamic instruction curves. The L1 and L2 queue sizes never go above 0.
> Comparing DynInst count in cpu->instcount with pendingQueue size:
>
> http://dl.dropbox.com/u/2953302/gem5/pendingQueueSize.png
>
> -Andrew
>
>
> On Sun, May 6, 2012 at 12:01 PM, Ali Saidi <***@umich.edu> wrote:
>
>> Hi Andrew,
>>
>> Could you add some code to the table walker to see how big the following
>> are getting:
>> stateQueueL1.size()
>> stateQueueL2.size()
>> pendingQueue.size()
>>
>> Perhaps we're somehow getting into a loop where there are a lot of
>> translations to invalid addresses that get squashed and they pile up in the
>> table walker?
>>
>> Thanks,
>> Ali
>>
>>
>>
>> On May 4, 2012, at 7:53 AM, Gabriel Michael Black wrote:
>>
>> > I haven't had a chance to study what's going on here, but could the
>> problem be that we don't have bandwidth limits/back pressure implemented
>> for the TLB and delayed translation? It could be that the CPU is pumping
>> instructions into translation which eventually drain out/are squashed, and
>> if too many accumulate they trip that assert.
>> >
>> > That may not actually make any sense as far as what the code is
>> actually doing, but it occurred to me as a possibility and I thought I'd
>> throw it out there.
>> >
>> > Gabe
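To make the back-pressure idea concrete in the abstract, here is a small sketch of a bounded queue that refuses new requests when full, so the requester has to stall instead of piling up work. It says nothing about how gem5's TLB or table walker are actually written. All names and numbers are invented for illustration.

```python
from collections import deque


class BoundedTranslationQueue:
    """Toy queue of outstanding translation requests with a hard capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = deque()

    def try_issue(self, request):
        """Accept the request, or return False so the requester must stall."""
        if len(self.pending) >= self.capacity:
            return False
        self.pending.append(request)
        return True

    def complete_one(self):
        """Finish the oldest outstanding request, if there is one."""
        return self.pending.popleft() if self.pending else None


def simulate(capacity, issue_per_cycle, complete_every, cycles):
    """Return the peak number of outstanding requests over the run."""
    queue = BoundedTranslationQueue(capacity)
    peak = 0
    next_id = 0
    for cycle in range(cycles):
        for _ in range(issue_per_cycle):
            if queue.try_issue(next_id):
                next_id += 1
        peak = max(peak, len(queue.pending))
        # Translations finish much more slowly than they are issued.
        if cycle % complete_every == complete_every - 1:
            queue.complete_one()
    return peak
```

With a small capacity the peak stays at the capacity; with an effectively unlimited one it keeps growing for as long as requests arrive faster than they complete, which is the accumulation being speculated about.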
>> >
>> > Quoting Andrew Cebulski <***@drexel.edu>:
>> >
>> >> I double-checked by looking at the config.ini file. It turns out I did
>> >> actually create the checkpoint with an Atomic CPU without caches.
>> Sorry
>> >> for the confusion.
>> >>
>> >> -Andrew
>> >>
>> >> On Wed, May 2, 2012 at 10:12 PM, Andrew Cebulski <***@drexel.edu>
>> wrote:
>> >>
>> >>> I started hitting this assertion (that the number of insts in flight was
>> >>> more than 1500) before I started using a checkpoint. I created the checkpoint
>> >>> afterwards to decrease the time needed to run simulations to debug
>> this
>> >>> problem. I'll create a new checkpoint, then send the new trace
>> output.
>> >>>
>> >>> -Andrew
>> >>>
>> >>>
>> >>> On Wed, May 2, 2012 at 9:53 PM, Ali Saidi <***@umich.edu> wrote:
>> >>>
>> >>>>
>> >>>> It's likely the cause for all of your problems. Dirty data in the
>> caches
>> >>>> doesn't get restored either. You should always create checkpoints
>> with an
>> >>>> atomic cpu and without caches.
>> >>>>
>> >>>>
>> >>>>
>> >>>> Ali
>> >>>>
>> >>>>
>> >>>>
>> >>>> On 02.05.2012 21:23, Andrew Cebulski wrote:
>> >>>>
>> >>>> Sorry, I created the checkpoint I referred to with an O3 CPU with
>> caches.
>> >>>> From what I recall reading, caches don't get restored from
>> checkpoints.
>> >>>> Since the checkpoint wasn't during the benchmark run, I assumed that
>> was
>> >>>> okay.
>> >>>> -Andrew
>> >>>>
>> >>>> On Wed, May 2, 2012 at 9:07 PM, Ali Saidi <***@umich.edu> wrote:
>> >>>>
>> >>>>> You haven't answered the question about whether you created the
>> checkpoints
>> >>>>> with an atomic cpu without caches.
>> >>>>>
>> >>>>> Ali
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On 02.05.2012 19:58, Andrew Cebulski wrote:
>> >>>>>
>> >>>>> I have not run with the checker CPU recently. Here's the stderr
>> output
>> >>>>> from a run I did awhile back:
>> >>>>> http://dl.dropbox.com/u/2953302/gem5/err.0
>> >>>>> Note that the instruction match error is before my benchmark
>> actually
>> >>>>> starts running. The start of my boot script checks to see if my
>> files
>> >>>>> image is mounted (which it is), then continues on to run the
>> benchmark. I
>> >>>>> booted the system, mounted my files image, then took a checkpoint.
>> I've
>> >>>>> been running all my tests from that checkpoint. I found where my
>> benchmark
>> >>>>> started based on the ASID (from ExecAsid debug flag).
>> >>>>> I delayed the start of gathering trace data until the second-to-last
>> >>>>> linear increase in dynamic instructions in-flight. I'm running a
>> new trace
>> >>>>> now.
>> >>>>> -Andrew
>> >>>>>
>> >>>>>
>> >>>>> On Wed, May 2, 2012 at 5:28 PM, Ali Saidi <***@umich.edu> wrote:
>> >>>>>
>> >>>>>> Something is wrong well before this point. There is no reason that
>> >>>>>> address 0x0 or 0x4 should be translated.
>> >>>>>>
>> >>>>>> Did you happen to create a checkpoint when caches were in the
>> system?
>> >>>>>>
>> >>>>>> Have you tried to run with the checker cpu and see if it detects
>> any
>> >>>>>> errors?
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> Ali
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On 02.05.2012 17:22, Andrew Cebulski wrote:
>> >>>>>>
>> >>>>>> They are data TLB misses that occur as the in-flight instruction
>> count
>> >>>>>> rises (at 0x0 and 0x4). The last TLB miss before the in-flight
>> instruction
>> >>>>>> count finally linearly decreases is to 0x200. Also, at the start
>> of the
>> >>>>>> rising slope, I see a miss to 0x8 and 0x2508c.
>> >>>>>> Here's a trace file:
>> >>>>>> http://dl.dropbox.com/u/2953302/gem5/tlb.out
>> >>>>>> To reduce size, I just have lines that have either TLB or walker in
>> >>>>>> them.
>> >>>>>> I do see only a handful of instruction TLB misses.
>> >>>>>> -Andrew
>> >>>>>>
>> >>>>>> On Wed, May 2, 2012 at 11:10 AM, Ali Saidi <***@umich.edu>
>> wrote:
>> >>>>>>
>> >>>>>>> Hi Andrew,
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Thanks for digging into this. I think there is an issue
>> somewhere, but
>> >>>>>>> I'm still not sure where.
>> >>>>>>>
>> >>>>>>> Ali
>> >>>>>>>
>> >>>>>>> On 01.05.2012 23:34, Andrew Cebulski wrote:
>> >>>>>>>
>> >>>>>>> Okay, I'm positive now that the issue lies with delayed
>> translations
>> >>>>>>> that are squashed before finishing.
>> >>>>>>>
>> >>>>>>> On the data or instruction side? You seem to allude to data in the
>> >>>>>>> paragraph below, but then instructions in the latter text.
>> >>>>>>>
>> >>>>>>> It seems to me like speculative load/stores are being executed,
>> >>>>>>> rather than waiting for the instructions to commit. Once the
>> instructions
>> >>>>>>> begin getting (speculatively) executed in the TLB, a reference is
>> left
>> >>>>>>> there, which seems hard to root out and dereference after the
>> instruction
>> >>>>>>> ends up being squashed. At least, I have not been able to find
>> that out in
>> >>>>>>> the source code as of yet. Can anyone clarify on this?
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> There should only be one translation outstanding from each
>> >>>>>>> instruction and data side walker. Any nested transactions should
>> be queued
>> >>>>>>> in the walker. Until one finishes, I'm not sure how multiple
>> would ever be
>> >>>>>>> outstanding.
>> >>>>>>>
>> >>>>>>> Recall the following image that shows how the number of dynamic
>> >>>>>>> instruction (DynInst) objects in-flight increases linearly for
>> varying
>> >>>>>>> periods of time:
>> >>>>>>>
>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>> >>>>>>> After enabling the TLB debug flag, I see that the linear increase
>> in
>> >>>>>>> instructions in flight is proportional to the number of TLB
>> misses. These
>> >>>>>>> TLB misses have a much larger delay (resulting in translation
>> delays) due
>> >>>>>>> to the fact that DramSim2 models the memory system more
>> accurately. It
>> >>>>>>> seems that with the classic memory system, TLB misses often do
>> not have
>> >>>>>>> translation delays. For whatever reason, it would also seem that
>> every
>> >>>>>>> instruction that has a TLB miss also is eventually squashed...
>> >>>>>>>
>> >>>>>>> From a data side perspective this is reasonable. While a miss is
>> >>>>>>> outstanding at some point instructions will stop committing and
>> thus the
>> >>>>>>> instructions in flight will begin to rise until the miss is
>> satisfied.
>> >>>>>>>
>> >>>>>>> Here's a summary of outputs from my trace. These two DPRINTF
>> >>>>>>> messages appear on the rising slopes (repeated up until the
>> peak):
>> >>>>>>> TLB Miss: Starting hardware table walker for 0(656)
>> >>>>>>> TLB Miss: Starting hardware table walker for 0x4(656)
>> >>>>>>>
>> >>>>>>> This is interesting/odd. I don't know a good reason why (1) a miss
>> >>>>>>> would be outstanding to both address 0 and address 4 at the same
>> time. In
>> >>>>>>> almost all cases these pages are marked as no-access to detect
>> segfaults.
>> >>>>>>> Perhaps there is an issue where the cpu is getting into a loop
>> faulting on
>> >>>>>>> a bad access and then faulting again on the fault handler. I
>> could imagine
>> >>>>>>> this would happen if there was some corruption in the memory
>> system (for
>> >>>>>>> example the timings in dramsim exposing a bug in the cache models
>> or
>> >>>>>>> something).
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> At the peak, the following message appears (from fetch) almost
>> every
>> >>>>>>> tick for (what I believe to be) every single one of the table
>> walkers that
>> >>>>>>> were squashed.
>> >>>>>>> Fetch is waiting ITLB walk to finish!
>> >>>>>>>
>> >>>>>>> There must be another walk in flight? The instruction side will
>> only
>> >>>>>>> have one fault outstanding at once. Successive branch mispredicts
>> will
>> >>>>>>> re-direct fetch but there is code that catches the fact that a
>> different
>> >>>>>>> walk completed than expected and "does the right thing."
>> >>>>>>>
>> >>>>>>> The problem is that these ITLB table walks are for instructions
>> that
>> >>>>>>> were squashed as much as 0.3 billion cycles earlier, and have since
>> been removed
>> >>>>>>> from the CPU's instruction list.
>> >>>>>>>
>> >>>>>>> I'm not following here.
>> >>>>>>>
>> >>>>>>> Any help will be greatly appreciated in solving this problem.
>> I've
>> >>>>>>> hit a roadblock with getting Ruby working with ARM, most likely
>> due to the
>> >>>>>>> fact that ARM has disjoint memory (x86 and Alpha do not).
>> There's the 256
>> >>>>>>> MB for physical memory, then the 64 MB for the boot loader. I
>> brought this
>> >>>>>>> up in my last email about trying to get Ruby working. Therefore,
>> I'm
>> >>>>>>> trying to get this DramSim2 integration fixed so I can start
>> modeling FS
>> >>>>>>> with DRAM memory.
>> >>>>>>>
>> >>>>>>> Brad/Steve/Nilay anyone have a suggestion on how to make this
>> work?
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Note that these problems also occur in Soplex from the Spec
>> CPU2006
>> >>>>>>> benchmark suite (also hits 1500 in-flight instructions
>> assertion). Due to
>> >>>>>>> time constraints, I haven't tested on other benchmarks.
>> >>>>>>> Thanks,
>> >>>>>>> Andrew
>> >>>>>>> On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski <
>> ***@drexel.edu>wrote:
>> >>>>>>>
>> >>>>>>>> Hey Gabe,
>> >>>>>>>> Thanks for this...very helpful. I just recently got back into
>> >>>>>>>> debugging this problem. I made a small change in
>> src/base/refcnt.hh to
>> >>>>>>>> allow me to return the current count of references to a DynInst
>> object.
>> >>>>>>>> I then modified existing DPRINTFs to also print out reference
>> >>>>>>>> counts, then added some of my own when I needed extra visibility.
>> >>>>>>>> I've found one memory store instruction that seems to be
>> getting
>> >>>>>>>> lost. What's happening is that it progresses as far as getting
>> executed in
>> >>>>>>>> the IEW once, but a delayed translation occurs, deferring the
>> store. By
>> >>>>>>>> the time it reenters the IEW, the IQ has marked the instruction
>> as
>> >>>>>>>> squashed. Everything progresses as usual from here on out, with
>> one
>> >>>>>>>> exception. When the instruction is removed from the CPU's
>> instruction list,
>> >>>>>>>> there is one reference count hanging.
>> >>>>>>>> I've added in some additional debugging for my traces to help
>> >>>>>>>> narrow down where this reference is coming from. As far as I
>> can tell,
>> >>>>>>>> it's because of a call to initiateAcc() within the executeStore
>> function in
>> >>>>>>>> the lsq unit. Please see the following two traces. The first
>> trace shows
>> >>>>>>>> what I just discussed. The second trace is another memory store
>> >>>>>>>> instruction that got squashed, however, it was squashed upon its
>> first
>> >>>>>>>> entry into the IEW, therefore it never started execution.
>> >>>>>>>> http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
>> >>>>>>>> http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
>> >>>>>>>> Let me know if you have any ideas based on these two
>> instruction
>> >>>>>>>> traces. I do not understand how the initiateAcc function
>> results in
>> >>>>>>>> another reference, but maybe someone else does.... Since I
>> don't see how
>> >>>>>>>> it makes a reference, it's hard to find out how to make sure it
>> gets
>> >>>>>>>> dereferenced...
>> >>>>>>>> Unfortunately, I haven't been able to add a DPRINTF in
>> >>>>>>>> src/base/refcnt.hh ...this would make things more clear (i.e.
>> exactly when
>> >>>>>>>> references/dereferences occur). Let me know if you have any
>> advice on
>> >>>>>>>> this...if it's possible. I can't seem to get the right include
>> files, and
>> >>>>>>>> likely right SConscript compile order...
>> >>>>>>>> Thanks,
>> >>>>>>>> Andrew
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Sat, Apr 7, 2012 at 9:48 PM, Gabe Black <
>> ***@eecs.umich.edu>wrote:
>> >>>>>>>>
>> >>>>>>>>> Without digging into things too deeply, it looks like you may be
>> >>>>>>>>> leaking references to dynamic instructions. The CPU may think
>> it's done
>> >>>>>>>>> with one, but until that final reference is removed, the object
>> will hang
>> >>>>>>>>> around forever. I think I've had problems before where the
>> reference
>> >>>>>>>>> count ended up off by one somehow and instructions would start
>> piling up.
>> >>>>>>>>> It's also possible that a clog develops in O3's pipeline and
>> some internal
>> >>>>>>>>> structure stops letting instructions through and starts
>> accumulating them.
>> >>>>>>>>> Either of these problems will be annoying to track down, but
>> with enough
>> >>>>>>>>> digging I've been able to fix these sorts of things.
>> >>>>>>>>>
>> >>>>>>>>> This may have more to do with O3 not handling the benchmark
>> you're
>> >>>>>>>>> running well rather than a problem with your new DRAM model.
>> There may be
>> >>>>>>>>> some interaction between the two, though, where the new memory
>> makes the
>> >>>>>>>>> timing line up to cause O3 to behave poorly. What you can do is
>> instrument
>> >>>>>>>>> dynamic instruction creation and destruction and reference
>> counting (try
>> >>>>>>>>> printing "this" for both the reference counting wrapper and the
>> dyn inst
>> >>>>>>>>> itself) and turn it on as close as you can to where things go
>> bad tick
>> >>>>>>>>> wise. Then look for an instruction which gets lost, and look
>> for where its
>> >>>>>>>>> reference count is incremented and decremented. It should be
>> relatively
>> >>>>>>>>> easy to pair up where references are created and destroyed, and
>> you should
>> >>>>>>>>> be able to identify the reference which never goes away. Then
>> you need to
>> >>>>>>>>> figure out where that reference is being created. After that,
>> you should
>> >>>>>>>>> have enough information to identify why the reference counting
>> isn't being
>> >>>>>>>>> done correctly. It's arduous, but that's the only way.
>> >>>>>>>>>
>> >>>>>>>>> It's important to also make sure reference counts aren't
>> decremented
>> >>>>>>>>> to zero prematurely. I had a problem once where that happened
>> and the
>> >>>>>>>>> memory behind the object was updated by something that didn't
>> know it was
>> >>>>>>>>> dead. The memory had since been reallocated to another object
>> of the same
>> >>>>>>>>> type, so that other object reflected what happened to the
>> phantom one. If I
>> >>>>>>>>> remember that manifested as something weird like an add causing
>> a page
>> >>>>>>>>> fault or something.
>> >>>>>>>>>
>> >>>>>>>>> Gabe
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> On 04/07/12 18:21, Andrew Cebulski wrote:
>> >>>>>>>>>
>> >>>>>>>>> Hi all,
>> >>>>>>>>> I've looked into this problem some more, and have put together a
>> >>>>>>>>> couple traces. I've been becoming more familiar with how gem5
>> handles
>> >>>>>>>>> dynamic instructions, in particular how it destroys them. I
>> have two
>> >>>>>>>>> traces to compare, one with the physical memory, and the other
>> with the
>> >>>>>>>>> integrated dramsim2 dram memory. I also have two plots showing
>> instruction
>> >>>>>>>>> counts over time (sim ticks). All of these are linked at the
>> end of the
>> >>>>>>>>> email.
>> >>>>>>>>> First, I'm going to go into what I've been able to interpret
>> >>>>>>>>> regarding how instructions are destroyed. In particular,
>> comparing when
>> >>>>>>>>> DynInst's vs. DynInstPtr's are deconstructed/removed from the
>> cpu. I
>> >>>>>>>>> separate these because I've seen a difference, as I discuss
>> later. These
>> >>>>>>>>> explanations are fairly non-existent on the wiki. There is a
>> section
>> >>>>>>>>> header waiting to be filled...
>> >>>>>>>>> From what I have been able to gather from the code, there is a
>> list
>> >>>>>>>>> of all the instructions in flight in cpu/o3/cpu.cc called
>> instList, with
>> >>>>>>>>> the type DynInstPtr. There are three conditions to
>> instructions being
>> >>>>>>>>> cleaned from this list:
>> >>>>>>>>> 1.) The ROB retires its head instruction
>> >>>>>>>>> 2.) Fetch receives a rob squashing signal from the commit,
>> >>>>>>>>> resulting in removing any instruction not in the ROB
>> >>>>>>>>> 3.) Decode detects an incorrect branch prediction, resulting in
>> >>>>>>>>> removal of all instructions back to the bad seq num.
>> >>>>>>>>> Once all five stages have completed, the CPU cleans up all the
>> >>>>>>>>> removed in-flight instructions. This line in particular
>> >>>>>>>>> in cleanUpRemovedInsts() in cpu/o3/cpu.cc deconstructs a
>> DynInstPtr:
>> >>>>>>>>> instList.erase(removeList.front());
>> >>>>>>>>> When I turn on the debug flag O3CPU, I see the message "Removing
>> >>>>>>>>> instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum
>> and pcState
>> >>>>>>>>> after all 5 cpu stages have completed, and one of the
>> conditions above is
>> >>>>>>>>> met. I also see what tick it occurs on.
>> >>>>>>>>> When I turn on the DynInst debug flag, I see when instructions
>> are
>> >>>>>>>>> created and destroyed (cpu/base_dyn_inst_impl.hh) and what
>> tick. From
>> >>>>>>>>> analyzing the trace files, I've gathered that this takes into
>> account that
>> >>>>>>>>> instructions have different execution lengths. So if one tick
>> a memory
>> >>>>>>>>> instruction in the instList (DynInstPtr) is removed, the
>> DynInst for that
>> >>>>>>>>> memory instruction will occur much later (i.e. 1M ticks later).
>> I have yet
>> >>>>>>>>> to determine how this is implemented.
>> >>>>>>>>> Now for the problem.
>> >>>>>>>>> What I'm seeing when I run dramsim2 dram memory is a significant
>> >>>>>>>>> difference between the size of the instList vector (of
>> DynInstPtr objects),
>> >>>>>>>>> and the size of dynamic instruction count (of DynInst objects).
>> The
>> >>>>>>>>> benchmark I'm running is libquantum from SPEC 2006. For the
>> first roughly
>> >>>>>>>>> 130B ticks, the dynamic instruction count kept in
>> cpu/base_dyn_inst_impl.hh
>> >>>>>>>>> shadows the instList size in o3/cpu.cc (figure linked below)
>> very closely.
>> >>>>>>>>> Around tick 130B after libquantum started, it starts hitting
>> what I'm
>> >>>>>>>>> assuming are loops (therefore branch prediction), resulting in
>> some
>> >>>>>>>>> behavior that seems to imply improper instruction handling
>> (i.e. more
>> >>>>>>>>> instructions in flight than allowed by ROB).
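One way to picture how the list size and the live object count can drift apart is a toy model in which some other holder keeps a reference after the instruction has left the in-flight list. The sketch below is only an illustration of that hypothesis. The names simulate, stale_holders, and leak_every are made up here and do not correspond to gem5 code.

```python
def simulate(num_insts, leak_every):
    """Toy model: return (len(inst_list), live_count) after every
    instruction has been removed from the in-flight list."""
    live_count = 0      # stands in for cpu->instcount
    inst_list = []      # stands in for instList
    stale_holders = []  # hypothetical holder that never drops its reference

    for seq_num in range(num_insts):
        inst = {"seq_num": seq_num, "refs": 0}
        live_count += 1

        # The in-flight list takes one reference.
        inst["refs"] += 1
        inst_list.append(inst)

        # Every leak_every-th instruction gains a second reference
        # that is never released.
        if seq_num % leak_every == leak_every - 1:
            inst["refs"] += 1
            stale_holders.append(inst)

    # Retire or squash everything: the list drops its reference, but an
    # object is only destroyed when its count reaches zero.
    while inst_list:
        inst = inst_list.pop()
        inst["refs"] -= 1
        if inst["refs"] == 0:
            live_count -= 1

    return len(inst_list), live_count
```

With simulate(10, 2) the list ends up empty while five objects are still counted as live, which is the same shape as a bounded instList next to a growing instcount.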
>> >>>>>>>>> I wasn't able to sync-up the physical and dramsim2 traces
>> exactly by
>> >>>>>>>>> trace, but they should represent roughly the same area of
>> execution. They
>> >>>>>>>>> don't execute the same due to the dramsim2 modeling the memory
>> differently
>> >>>>>>>>> (i.e. latency and other delays).
>> >>>>>>>>> I've shared both traces on my public Dropbox here --
>> >>>>>>>>>
>> >>>>>>>>>
>> http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
>> >>>>>>>>>
>> >>>>>>>>>
>> http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
>> >>>>>>>>> Here are a couple plots of tick versus instruction count, with
>> >>>>>>>>> respect to cpu->instcount in cpu/base_dyn_inst_impl.hh and
>> instList.size()
>> >>>>>>>>> in cpu/o3/cpu.cc. --
>> >>>>>>>>>
>> >>>>>>>>>
>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
>> >>>>>>>>>
>> >>>>>>>>>
>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>> >>>>>>>>> Note that I added the printout of the instList size to an
>> existing
>> >>>>>>>>> O3CPU DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.
>> >>>>>>>>> Here are the commands I ran to parse the traces into data files
>> to
>> >>>>>>>>> analyze in MATLAB and create the plots:
>> >>>>>>>>> zgrep DynInst
>> >>>>>>>>> dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz |
>> grep destroyed
>> >>>>>>>>> | awk '{print $1,$11}' > cpuinstcount.out
>> >>>>>>>>> zgrep instList
>> >>>>>>>>> dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz |
>> awk '{print
>> >>>>>>>>> $1,$11}' > instlistsize.out
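For anyone who would rather not go through zgrep and awk, the same filtering can be sketched in Python. This is a minimal sketch, assuming the trace is a gzip-compressed text file and that fields 1 and 11 (counted the way awk counts them) hold the values of interest. The function name extract_columns is made up for this example.

```python
import gzip


def extract_columns(path, must_contain, columns=(1, 11)):
    """Yield the selected whitespace-separated fields (1-based, as in awk)
    from each line of a gzip-compressed trace that contains every
    substring in must_contain.

    Lines with fewer fields than the largest requested column are
    skipped; awk would print an empty field instead.
    """
    with gzip.open(path, "rt", errors="replace") as trace:
        for line in trace:
            if all(token in line for token in must_contain):
                fields = line.split()
                if len(fields) >= max(columns):
                    yield tuple(fields[c - 1] for c in columns)
```

For example, list(extract_columns(trace_path, ("DynInst", "destroyed"))) plays the role of the first pipeline above, and passing ("instList",) plays the role of the second.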
>> >>>>>>>>> It seems to me like the problem might lie in gem5, but has just
>> been
>> >>>>>>>>> exposed by integrating this more detailed memory model,
>> dramsim2, into
>> >>>>>>>>> gem5. Either that, or there are some timing errors in how
>> dramsim2 was
>> >>>>>>>>> integrated. I doubt this, however, since those first 190B
>> ticks executed
>> >>>>>>>>> used the dramsim2 memory. I believe the problem is a
>> combination of memory
>> >>>>>>>>> instructions + complex loops (branch prediction), resulting in
>> improper
>> >>>>>>>>> destroying of instructions.
>> >>>>>>>>> I've included the ROB, Commit, Fetch, DynInst and O3CPU debug
>> flags.
>> >>>>>>>>> There are 192 ROB entries, which is why the instList size
>> generally has a
>> >>>>>>>>> max of about 192 instructions. The dynamic instruction counts
>> (seen in the
>> >>>>>>>>> dramsim2 plot) seem to also imply that instructions are
>> incorrectly being
>> >>>>>>>>> removed from the ROB, and then from the cpu's instruction list
>> in cpu.cc,
>> >>>>>>>>> which allows more and more instructions to be added to the
>> system (possibly
>> >>>>>>>>> from a bad branch).
>> >>>>>>>>> I appreciate any help in debugging this and further figuring
>> out the
>> >>>>>>>>> root problem, just let me know if you need anything else from
>> me. I don't
>> >>>>>>>>> have much more time at the moment to debug, but I can take any
>> advice for
>> >>>>>>>>> quick changes and/or additional traces, then send the results
>> back to the
>> >>>>>>>>> list for discussion.
>> >>>>>>>>> Thanks,
>> >>>>>>>>> Andrew
>> >>>>>>>>> P.S. Paul - I did try decreasing the size of the dramsim2
>> >>>>>>>>> transaction (and even command) queue from 512 to 32. The same
>> instructions
>> >>>>>>>>> problem occurred. It basically just decreased the execution
>> time.
>> >>>>>>>>>
>> >>>>>>>>> On Wed, Mar 14, 2012 at 2:10 PM, Ali Saidi <***@umich.edu>
>> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> The error is that there are more than 1500 instructions
>> currently
>> >>>>>>>>>> in flight in the system. It could mean several things:
>> >>>>>>>>>>
>> >>>>>>>>>> 1. The value is somewhat arbitrarily defined and maybe there
>> are
>> >>>>>>>>>> more than 1500 in your system at one time?
>> >>>>>>>>>>
>> >>>>>>>>>> 2. Instructions aren't being destroyed correctly
>> >>>>>>>>>>
>> >>>>>>>>>> You could try to run a debug binary so you'll get a list of
>> >>>>>>>>>> instructions when it happens or increase the number which may
>> >>>>>>>>>> be appropriate for certain situations (but 1500 is quite a few
>> inflight
>> >>>>>>>>>> instructions).
>> >>>>>>>>>>
>> >>>>>>>>>> Ali
>> >>>>>>>>>>
>> >>>>>>>>>> On 13.03.2012 10:56, Andrew Cebulski wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> Hi Xiangyu,
>> >>>>>>>>>> I just started looking into this some more. So at first I
>> >>>>>>>>>> thought it was due to updating to a more recent revision, but
>> then I went
>> >>>>>>>>>> back to revision 8643, added your patch, built and ran....and
>> now get the
>> >>>>>>>>>> error with it too (when running ARM_FS/gem5.opt). I'm testing
>> now to see
>> >>>>>>>>>> if an update to SWIG might have resulted in this error, maybe
>> someone on
>> >>>>>>>>>> the mailing list would know if that's possible. The
>> difference is 1.3.40
>> >>>>>>>>>> vs. 2.0.3, both of which are supported according to the
>> dependencies wiki
>> >>>>>>>>>> page.
>> >>>>>>>>>> Just for completeness, here's the error from revision 8643:
>> >>>>>>>>>> build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
>> >>>>>>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion
>> `cpu->instcount
>> >>>>>>>>>> I have not tried running with gem5.debug, so I will be doing
>> >>>>>>>>>> that today. Maybe this is an assertion that is occurring due
>> to an
>> >>>>>>>>>> optimization. That would mean it wouldn't be triggered in
>> gem5.debug since
>> >>>>>>>>>> it runs without optimizations. Have you tested all debug, opt
>> and fast
>> >>>>>>>>>> with your tests?
>> >>>>>>>>>> Thanks,
>> >>>>>>>>>> Andrew
>> >>>>>>>>>>
>> >>>>>>>>>> On Tue, Mar 13, 2012 at 1:37 PM, Rio Xiangyu Dong <
>> >>>>>>>>>> ***@gmail.com> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>> Hi Andrew,
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> I didn't see this error in my simulations. May I ask which
>> gem5
>> >>>>>>>>>>> version you are using? I find some of the latest code updates
>> do not comply
>> >>>>>>>>>>> with my changes. I am still using the DRAMsim2 patch on Gem5
>> repo8643, and
>> >>>>>>>>>>> have run all the runnable benchmarks in SPEC2006, SPEC2000,
>> EEMBC2, and
>> >>>>>>>>>>> PARSEC2 on ARM_SE.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thank you!
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> Best,
>> >>>>>>>>>>>
>> >>>>>>>>>>> Xiangyu
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> *From:* Andrew Cebulski [mailto:***@drexel.edu]
>> >>>>>>>>>>> *Sent:* Thursday, March 08, 2012 6:52 PM
>> >>>>>>>>>>>
>> >>>>>>>>>>> *To:* gem5 users mailing list
>> >>>>>>>>>>> *Cc:****@gmail.com; ***@umich.edu
>> >>>>>>>>>>>
>> >>>>>>>>>>> *Subject:* Re: [gem5-users] A Patch for DRAMsim2 Integration
>> >>>>>>>>>>>
>> >>>>>>>>>>> Xiangyu,
>> >>>>>>>>>>>
>> >>>>>>>>>>> I've been having an issue recently with the number of
>> >>>>>>>>>>> instructions I've been seeing committed to the CPU (I have a
>> separate
>> >>>>>>>>>>> thread on this). It turns out the issue seems to be coming
>> from this patch
>> >>>>>>>>>>> you created to integrate DramSim2 with Gem5. Unfortunately,
>> I've been
>> >>>>>>>>>>> running with gem5.fast, not gem5.opt. So up until now, I
>> haven't been
>> >>>>>>>>>>> seeing assertions. I thought I'd run it with gem5.opt or
>> debug back in
>> >>>>>>>>>>> December, but I must not have. My runs on the Arm O3 cpu
>> fails with this
>> >>>>>>>>>>> assertion:
>> >>>>>>>>>>>
>> >>>>>>>>>>> build/ARM/cpu/base_dyn_inst_impl.hh:149: void
>> >>>>>>>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion
>> `cpu->instcount
>> >>>>>>>>>>>
>> >>>>>>>>>>> -Andrew
>> >>>>>>>>>>>
>> >>>>>>>>>>> Date: Sun, 18 Dec 2011 01:48:58 -0800
>> >>>>>>>>>>> From: "Dong, Xiangyu" <***@gmail.com>
>> >>>>>>>>>>> To: "gem5 users mailing list" <gem5-***@gem5.org>
>> >>>>>>>>>>> Subject: [gem5-users] A Patch for DRAMsim2 Integration
>> >>>>>>>>>>> Message-ID: gmail.com>
>> >>>>>>>>>>>
>> >>>>>>>>>>> Content-Type: text/plain; charset="us-ascii"
>> >>>>>>>>>>>
>> >>>>>>>>>>> Hi all,
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> I have a Gem5+DRAMsim2 patch. I've tested it under both SE
>> and FS
>> >>>>>>>>>>> modes.
>> >>>>>>>>>>> I'm willing to share it here.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> For those who have such needs, please go to my website
>> >>>>>>>>>>> www.cse.psu.edu/~xydong <http://www.cse.psu.edu/%7Exydong> to
>> >>>>>>>>>>> download the patch and test it. To enable
>> >>>>>>>>>>> DRAMSim2, use se_dramsim2.py script instead of se.py (for FS,
>> you
>> >>>>>>>>>>> can create
>> >>>>>>>>>>> by yourself). The basic idea to enable the DRAMsim2 module
>> is to
>> >>>>>>>>>>> use the
>> >>>>>>>>>>> derived DRAMMemory class instead of PhysicalMemory class.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> Please let me know if there are bugs.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thank you!
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> Best,
>> >>>>>>>>>>>
>> >>>>>>>>>>> Xiangyu Dong
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>> _______________________________________________
>> >>>>>>>>>> gem5-users mailing list
>> >>>>>>>>>> gem5-***@gem5.org
>> >>>>>>>>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>
>
Ali Saidi
2012-05-10 18:03:46 UTC
Permalink
Hi Andrew,

Sorry, I haven't had a chance to look at this in depth
yet.

Ali

On 10.05.2012 11:51, Andrew Cebulski wrote:

> Ali/Gabe,
> Do either of you need anything else from me to debug on your end?
> Should I send a more detailed trace?
> Thanks,
> Andrew
>
> On Mon, May 7, 2012 at 9:53 PM, Andrew Cebulski <***@drexel.edu [49]> wrote:
>

>> Hi Ali and Gabe,
>> Here's the trace file:
>> http://dl.dropbox.com/u/2953302/gem5/table_walker.out [46]
>> The pending queue size in the table walker follows the shape of the
>> dynamic instruction curves. The L1 and L2 queue sizes never go above 0.
>> Comparing DynInst count in cpu->instcount with pendingQueue size:
>> http://dl.dropbox.com/u/2953302/gem5/pendingQueueSize.png [47]
>>
>> -Andrew
>>
>> On Sun, May 6, 2012 at 12:01 PM, Ali Saidi <***@umich.edu [48]> wrote:
>>
>>> Hi Andrew,
>>>
>>> Could you add some code to the table walker to see how big the
>>> following are getting:
>>> stateQueueL1.size()
>>> stateQueueL2.size()
>>> pendingQueue.size()
>>>
>>> Perhaps we're somehow getting into a loop where there are a lot of
>>> translations to invalid addresses that get squashed and they pile up
>>> in the table walker?
>>>
>>> Thanks,
>>> Ali
>>>
>>> On May 4, 2012, at 7:53 AM, Gabriel Michael Black wrote:
>>>
>>> > I haven't had a chance to study what's going on here, but could the
>>> > problem be that we don't have bandwidth limits/back pressure
>>> > implemented for the TLB and delayed translation? It could be that the
>>> > CPU is pumping instructions into translation which eventually drain
>>> > out/are squashed, and if too many accumulate they trip that assert.
>>> >
>>> > That may not actually make any sense as far as what the code is
>>> > actually doing, but it occurred to me as a possibility and I thought
>>> > I'd throw it out there.
>>> >
>>> > Gabe
>>> >
>>> > Quoting Andrew Cebulski <***@drexel.edu [1]>:
>>> >
>>> >> I double-checked by looking at the config.ini file. It turns out I did
>>> >> actually create the checkpoint with an Atomic CPU without caches. Sorry
>>> >> for the confusion.
>>> >>
>>> >> -Andrew
>>> >>
>>> >> On Wed, May 2, 2012 at 10:12 PM, Andrew Cebulski <***@drexel.edu [2]> wrote:
>>> >>
>>> >>> I started hitting this assertion (that the number of insts in flight
>>> >>> was > 1500) before I started using a checkpoint. I created the
>>> >>> checkpoint afterwards to decrease the time needed to run simulations
>>> >>> to debug this problem. I'll create a new checkpoint, then send the
>>> >>> new trace output.
>>> >>>
>>> >>> -Andrew
>>> >>>
>>> >>>
>>> >>> On Wed, May 2, 2012 at 9:53 PM, Ali Saidi <***@umich.edu [3]> wrote:
>>> >>>
>>> >>>> **
>>> >>>>
>>> >>>> It's
likely the cause for all of your problems. Dirty data in the caches
>>>
>>>> doesn't get restored either. You should always create checkpoints
with an
>>> >>>> atomic cpu and without caches.
>>> >>>>
>>> >>>>
>>>
>>>>
>>> >>>> Ali
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> On 02.05.2012
21:23, Andrew Cebulski wrote:
>>> >>>>
>>> >>>> Sorry, I created the
checkpoint I referred to with an O3 CPU with caches.
>>> >>>> From what
I recall reading, caches don't get restored from checkpoints.
>>> >>>>
Since the checkpoint wasn't during the benchmark run, I assumed that
was
>>> >>>> okay.
>>> >>>> -Andrew
>>> >>>>
>>> >>>> On Wed, May 2,
2012 at 9:07 PM, Ali Saidi <***@umich.edu [4]> wrote:
>>> >>>>
>>>
>>>>> You haven't answered the question about if you created the
checkpoints
>>> >>>>> with an atomic cpu without caches.
>>> >>>>>
>>>
>>>>> Ali
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> On
02.05.2012 19:58, Andrew Cebulski wrote:
>>> >>>>>
>>> >>>>> I have not
run with the checker CPU recently. Here's the stderr output
>>> >>>>>
from a run I did awhile back:
>>> >>>>>
http://dl.dropbox.com/u/2953302/gem5/err.0 [5]
>>> >>>>> Note that the
instruction match error is before my benchmark actually
>>> >>>>> starts
running. The start of my boot script checks to see if my files
>>> >>>>>
image is mounted (which it is), then continues on to run the benchmark.
I
>>> >>>>> booted the system, mounted my files image, then took a
checkpoint. I've
>>> >>>>> been running all my tests from that
checkpoint. I found where my benchmark
>>> >>>>> started based on the
ASID (from ExecAsid debug flag).
>>> >>>>> I delayed the start of
gathering trace data until the second-to-last
>>> >>>>> linear increase
in dynamic instructions in-flight. I'm running a new trace
>>> >>>>>
now.
>>> >>>>> -Andrew
>>> >>>>>
>>> >>>>>
>>> >>>>> On Wed, May 2, 2012
at 5:28 PM, Ali Saidi <***@umich.edu [6]> wrote:
>>> >>>>>
>>> >>>>>>
Something is wrong well before this point. There is no reason that
>>>
>>>>>> address 0x0 or 0x4 should be translated.
>>> >>>>>>
>>> >>>>>>
Did you happen to create a checkpoint when caches were in the
system?
>>> >>>>>>
>>> >>>>>> Have you tried to run with the checker cpu
and see if it detects any
>>> >>>>>> errors?
>>> >>>>>>
>>> >>>>>>
>>>
>>>>>>
>>> >>>>>> Ali
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>>
>>>>>>
>>> >>>>>> On 02.05.2012 17:22, Andrew Cebulski wrote:
>>>
>>>>>>
>>> >>>>>> They are data TLB misses that occur as the in-flight
instruction count
>>> >>>>>> rises (at 0x0 and 0x4). The last TLB miss
before the in-flight instruction
>>> >>>>>> count finally linearly
decreases is to 0x200. Also, at the start of the
>>> >>>>>> rising
slope, I see a miss to 0x8 and 0x2508c.
>>> >>>>>> Here's a trace
file:
>>> >>>>>> http://dl.dropbox.com/u/2953302/gem5/tlb.out [7]
>>>
>>>>>> To reduce size, I just have lines that have either TLB or walker
in
>>> >>>>>> them.
>>> >>>>>> I do see only a handful of instruction
TLB misses.
>>> >>>>>> -Andrew
>>> >>>>>>
>>> >>>>>> On Wed, May 2, 2012
at 11:10 AM, Ali Saidi <***@umich.edu [8]> wrote:
>>> >>>>>>
>>>
>>>>>>> Hi Andrew,
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
Thanks for digging into this. I think there is an issue somewhere,
but
>>> >>>>>>> I'm still not sure where.
>>> >>>>>>>
>>> >>>>>>>
Ali
>>> >>>>>>>
>>> >>>>>>> On 01.05.2012 23:34, Andrew Cebulski
wrote:
>>> >>>>>>>
>>> >>>>>>> Okay, I'm positive now that the issue
lies with delayed translations
>>> >>>>>>> that are squashed before
finishing.
>>> >>>>>>>
>>> >>>>>>> On the data or instruction side? You
seem to allude to data in the
>>> >>>>>>> paragraph below, but then
instructions in the latter text.
>>> >>>>>>>
>>> >>>>>>> It seems to me
like speculative load/stores are being executed,
>>> >>>>>>> rather than
waiting for the instructions to commit. Once the instructions
>>>
>>>>>>> begin getting (speculatively) executed in the TLB, a reference
is left
>>> >>>>>>> there, which seems hard to root out and dereference
after the instruction
>>> >>>>>>> ends up being squashed. At least, I
have not been able to find that out in
>>> >>>>>>> the source code as of
yet. Can anyone clarify on this?
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>>
>>>>>>> There should only be one translation outstanding from each
>>>
>>>>>>> instruction and data side walker. Any nested transactions should
be queued
>>> >>>>>>> in the walker. Until one finishes, I'm not sure
how multiple would ever be
>>> >>>>>>> outstanding.
>>> >>>>>>>
>>>
>>>>>>> Recall the following image that shows how the number of
dynamic
>>> >>>>>>> instruction (DynInst) objects in-flight increases
linearly for varying
>>> >>>>>>> periods of time:
>>> >>>>>>>
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[9]
>>> >>>>>>> After enabling the TLB debug flag, I see that the linear
increase in
>>> >>>>>>> instructions in flight is proportional to the
number of TLB misses. These
>>> >>>>>>> TLB misses have a much larger
delay (resulting in translation delays) due
>>> >>>>>>> to the fact that
DramSim2 models the memory system more accurately. It
>>> >>>>>>> seems
that with the classic memory system, TLB misses often do not have
>>>
>>>>>>> translation delays. For whatever reason, it would also seem that
every
>>> >>>>>>> instruction that has a TLB miss also is eventually
squashed...
>>> >>>>>>>
>>> >>>>>>> From a data side perspective this is
reasonable. While a miss is
>>> >>>>>>> outstanding at some point
instructions will stop committing and thus the
>>> >>>>>>> instructions
in flight will begin to rise until the miss is satisfied.
>>>
>>>>>>>
>>> >>>>>>> Here's a summary of outputs from my trace. These two
DPRINTF
>>> >>>>>>> messages appear on the rising slopes (repeated up
until the peak):
>>> >>>>>>> TLB Miss: Starting hardware table walker
for 0(656)
>>> >>>>>>> TLB Miss: Starting hardware table walker for
0x4(656)
>>> >>>>>>>
>>> >>>>>>> This is interesting/odd. I don't know a
good reason why (1) a miss
>>> >>>>>>> would be outstanding to both
address 0 and address 4 at the same time. In
>>> >>>>>>> almost all
cases these pages are marked as no-access to detect segfaults.
>>>
>>>>>>> Perhaps there is an issue where the cpu is getting into a loop
faulting on
>>> >>>>>>> a bad access and then faulting again on the
fault handler. I could imagine
>>> >>>>>>> this would happen if there
was some corruption in the memory system (for
>>> >>>>>>> example the
timings in dramsim exposing a bug in the cache models or
>>> >>>>>>>
something).
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> At the peak, the
following message appears (from fetch) almost every
>>> >>>>>>> tick for
(what I believe to be) every single one of the table walkers that
>>>
>>>>>>> were squashed.
>>> >>>>>>> Fetch is waiting ITLB walk to
finish!
>>> >>>>>>>
>>> >>>>>>> There must be another walk in flight?
The instruction side will only
>>> >>>>>>> have one fault outstanding at
once. Successive branch mispredicts will
>>> >>>>>>> re-direct fetch but
there is code that catches the fact that a different
>>> >>>>>>> walk
completed than expected and "does the right thing."
>>> >>>>>>>
>>>
>>>>>>> The problem is that these ITLB table walks are for instructions
that
>>> >>>>>>> were squashed as much as 0.3 billion cycles earlier,
and have since been removed
>>> >>>>>>> from the CPU's instruction list.
>>>
>>>>>>>
>>> >>>>>>> I'm not following here.
>>> >>>>>>>
>>> >>>>>>> Any
help will be greatly appreciated in solving this problem. I've
>>>
>>>>>>> hit a roadblock with getting Ruby working with ARM, most likely
due to the
>>> >>>>>>> fact that ARM has disjoint memory (x86 and Alpha
do not). There's the 256
>>> >>>>>>> MB for physical memory, then the 64
MB for the boot loader. I brought this
>>> >>>>>>> up in my last email
about trying to get Ruby working. Therefore, I'm
>>> >>>>>>> trying to
get this DramSim2 integration fixed so I can start modeling FS
>>>
>>>>>>> with DRAM memory.
>>> >>>>>>>
>>> >>>>>>> Brad/Steve/Nilay
anyone have a suggestion on how to make this work?
>>> >>>>>>>
>>>
>>>>>>>
>>> >>>>>>> Note that these problems also occur in Soplex from
the Spec CPU2006
>>> >>>>>>> benchmark suite (also hits 1500 in-flight
instructions assertion). Due to
>>> >>>>>>> time constraints, I haven't
tested on other benchmarks.
>>> >>>>>>> Thanks,
>>> >>>>>>> Andrew
>>>
>>>>>>> On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski
<***@drexel.edu [10]>wrote:
>>> >>>>>>>
>>> >>>>>>>> Hey Gabe,
>>>
>>>>>>>> Thanks for this...very helpful. I just recently got back
into
>>> >>>>>>>> debugging this problem. I made a small change in
src/base/refcnt.hh to
>>> >>>>>>>> allow me to return the current count
of references to a DynInst object.
>>> >>>>>>>> I then modified existing
DPRINTFs to also print out reference
>>> >>>>>>>> counts, then added
some of my own when I needed extra visibility.
>>> >>>>>>>> I've found
one memory store instruction that seems to be getting
>>> >>>>>>>> lost.
What's happening is that it progresses as far as getting executed in
>>>
>>>>>>>> the IEW once, but a delayed translation occurs, deferring the
store. By
>>> >>>>>>>> the time it reenters the IEW, the IQ has marked
the instruction as
>>> >>>>>>>> squashed. Everything progresses as usual
from here on out, with one
>>> >>>>>>>> exception. When the instruction
is removed from the CPU's instruction list,
>>> >>>>>>>> there is one
reference count hanging.
>>> >>>>>>>> I've added in some additional
debugging for my traces to help
>>> >>>>>>>> narrow down where this
reference is coming from. As far as I can tell,
>>> >>>>>>>> it's
because of a call to initiateAcc() within the executeStore function
in
>>> >>>>>>>> the lsq unit. Please see the following two traces. The
first trace shows
>>> >>>>>>>> what I just discussed. The second trace
is another memory store
>>> >>>>>>>> instruction that got squashed,
however, it was squashed upon its first
>>> >>>>>>>> entry into the IEW,
therefore it never started execution.
>>> >>>>>>>>
http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out [11]
>>>
>>>>>>>> http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
[12]
>>> >>>>>>>> Let me know if you have any ideas based on these two
instruction
>>> >>>>>>>> traces. I do not understand how the initiateAcc
function results in
>>> >>>>>>>> another reference, but maybe someone
else does.... Since I don't see how
>>> >>>>>>>> it makes a reference,
it's hard to find out how to make sure it gets
>>> >>>>>>>>
dereferenced...
>>> >>>>>>>> Unfortunately, I haven't been able to add a
DPRINTF in
>>> >>>>>>>> src/base/refcnt.hh ...this would make things
more clear (i.e. exactly when
>>> >>>>>>>> references/dereferences occur).
Let me know if you have any advice on
>>> >>>>>>>> this...if it's
possible. I can't seem to get the right include files, and
>>> >>>>>>>>
likely right SConscript compile order...
>>> >>>>>>>> Thanks,
>>>
>>>>>>>> Andrew
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> On Sat, Apr 7,
2012 at 9:48 PM, Gabe Black <***@eecs.umich.edu [13]>wrote:
>>>
>>>>>>>>
>>> >>>>>>>>> Without digging into things too deeply, it looks
like you may be
>>> >>>>>>>>> leaking references to dynamic
instructions. The CPU may think it's done
>>> >>>>>>>>> with one, but
until that final reference is removed, the object will hang
>>>
>>>>>>>>> around forever. I think I've had problems before where the
reference
>>> >>>>>>>>> count ended up off by one somehow and
instructions would start piling up.
>>> >>>>>>>>> It's also possible
that a clog develops in O3's pipeline and some internal
>>> >>>>>>>>>
structure stops letting instructions through and starts accumulating
them.
>>> >>>>>>>>> Either of these problems will be annoying to track
down, but with enough
>>> >>>>>>>>> digging I've been able to fix these
sorts of things.
>>> >>>>>>>>>
>>> >>>>>>>>> This may have more to do
with O3 not handling the benchmark you're
>>> >>>>>>>>> running well
rather than a problem with your new DRAM model. There may be
>>>
>>>>>>>>> some interaction between the two, though, where the new memory
makes the
>>> >>>>>>>>> timing line up to cause O3 to behave poorly.
What you can do is instrument
>>> >>>>>>>>> dynamic instruction creation
and destruction and reference counting (try
>>> >>>>>>>>> print "this"
for both the reference counting wrapper and the dyn inst
>>> >>>>>>>>>
itself) and turn it on as close as you can to where things go bad
tick
>>> >>>>>>>>> wise. Then look for an instruction which gets lost,
and look for where its
>>> >>>>>>>>> reference count is incremented and
decremented. It should be relatively
>>> >>>>>>>>> easy to pair up where
references are created and destroyed, and you should
>>> >>>>>>>>> be
able to identify the reference which never goes away. Then you need
to
>>> >>>>>>>>> figure out where that reference is being created. After
that, you should
>>> >>>>>>>>> have enough information to identify why
the reference counting isn't being
>>> >>>>>>>>> done correctly. It's
arduous, but that's the only way.
>>> >>>>>>>>>
>>> >>>>>>>>> It's
important to also make sure reference counts aren't decremented
>>>
>>>>>>>>> to zero prematurely. I had a problem once where that happened
and the
>>> >>>>>>>>> memory behind the object was updated by something
that didn't know it was
>>> >>>>>>>>> dead. The memory had since been
reallocated to another object of the same
>>> >>>>>>>>> type, so that
other object reflected what happened to the phantom one. If I
>>>
>>>>>>>>> remember that manifested as something weird like an add
causing a page
>>> >>>>>>>>> fault or something.
>>> >>>>>>>>>
>>>
>>>>>>>>> Gabe
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> On 04/07/12
18:21, Andrew Cebulski wrote:
>>> >>>>>>>>>
>>> >>>>>>>>> Hi all,
>>>
>>>>>>>>> I've looked into this problem some more, and have put together
a
>>> >>>>>>>>> couple traces. I've been becoming more familiar with how
gem5 handles
>>> >>>>>>>>> dynamic instructions, in particular how it
destroys them. I have two
>>> >>>>>>>>> traces to compare, one with the
physical memory, and the other with the
>>> >>>>>>>>> integrated
dramsim2 dram memory. I also have two plots showing instruction
>>>
>>>>>>>>> counts over time (sim ticks). All of these are linked at the
end of the
>>> >>>>>>>>> email.
>>> >>>>>>>>> First, I'm going to go
into what I've been able to interpret
>>> >>>>>>>>> regarding how
instructions are destroyed. In particular, comparing when
>>> >>>>>>>>>
DynInst's vs. DynInstPtr's are deconstructed/removed from the cpu. I
>>>
>>>>>>>>> separate these because I've seen a difference, as I discuss
later. These
>>> >>>>>>>>> explanations are fairly non-existent on the
wiki. There is a section
>>> >>>>>>>>> header waiting to be
filled...
>>> >>>>>>>>> From what I have been able to gather from the
code, there is a list
>>> >>>>>>>>> of all the instructions in flight in
cpu/o3/cpu.cc called instList, with
>>> >>>>>>>>> the type DynInstPtr.
There are three conditions to instructions being
>>> >>>>>>>>> cleaned
from this list:
>>> >>>>>>>>> 1.) The ROB retires its head
instruction
>>> >>>>>>>>> 2.) Fetch receives a rob squashing signal from
the commit,
>>> >>>>>>>>> resulting in removing any instruction not in
the ROB
>>> >>>>>>>>> 3.) Decode detects an incorrect branch prediction,
resulting in
>>> >>>>>>>>> removal of all instructions back to the bad
seq num.
>>> >>>>>>>>> Once all five stages have completed, the CPU
cleans up all the
>>> >>>>>>>>> removed in-flight instructions. This
line in particular
>>> >>>>>>>>> in cleanUpRemovedInsts() in
cpu/o3/cpu.cc deconstructs a DynInstPtr:
>>> >>>>>>>>>
instList.erase(removeList.front());
>>> >>>>>>>>> When I turn on the
debug flag O3CPU, I see the message "Removing
>>> >>>>>>>>> instruction,
..." (from o3/cpu.cc) with the threadNum, seqNum and pcState
>>>
>>>>>>>>> after all 5 cpu stages have completed, and one of the
conditions above is
>>> >>>>>>>>> met. I also see what tick it occurs
on.
>>> >>>>>>>>> When I turn on the DynInst debug flag, I see when
instructions are
>>> >>>>>>>>> created and destroyed
(cpu/base_dyn_inst_impl.hh) and what tick. From
>>> >>>>>>>>> analyzing
the trace files, I've gathered that this takes into account that
>>>
>>>>>>>>> instructions have different execution lengths. So if one tick
a memory
>>> >>>>>>>>> instruction in the instList (DynInstPtr) is
removed, the DynInst for that
>>> >>>>>>>>> memory instruction will
occur much later (i.e. 1M ticks later). I have yet
>>> >>>>>>>>> to
determine how this is implemented.
>>> >>>>>>>>> Now for the
problem.
>>> >>>>>>>>> What I'm seeing when I run dramsim2 dram memory
is a significant
>>> >>>>>>>>> difference between the size of the
instList vector (of DynInstPtr objects),
>>> >>>>>>>>> and the size of
dynamic instruction count (of DynInst objects). The
>>> >>>>>>>>>
benchmark I'm running is libquantum from SPEC 2006. For the first
roughly
>>> >>>>>>>>> 130B ticks, the dynamic instruction count kept in
cpu/base_dyn_inst_impl.hh
>>> >>>>>>>>> shadows the instList size in
o3/cpu.cc (figure linked below) very closely.
>>> >>>>>>>>> Around tick
130B after libquantum started, it starts hitting what I'm
>>> >>>>>>>>>
assuming are loops (therefore branch prediction), resulting in some
>>>
>>>>>>>>> behavior that seems to imply improper instruction handling
(i.e. more
>>> >>>>>>>>> instructions in flight than allowed by
ROB).
>>> >>>>>>>>> I wasn't able to sync-up the physical and dramsim2
traces exactly by
>>> >>>>>>>>> trace, but they should represent roughly
the same area of execution. They
>>> >>>>>>>>> don't execute the same
due to the dramsim2 modeling the memory differently
>>> >>>>>>>>> (i.e.
latency and other delays).
>>> >>>>>>>>> I've shared both traces on my
public Dropbox here --
>>> >>>>>>>>>
>>> >>>>>>>>>
http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
[14]
>>> >>>>>>>>>
>>> >>>>>>>>>
http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
[15]
>>> >>>>>>>>> Here are a couple plots of tick versus instruction
count, with
>>> >>>>>>>>> respect to cpu->instcount in
cpu/base_dyn_inst_impl.hh and instList.size()
>>> >>>>>>>>> in
cpu/o3/cpu.cc. --
>>> >>>>>>>>>
>>> >>>>>>>>>
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
[16]
>>> >>>>>>>>>
>>> >>>>>>>>>
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[17]
>>> >>>>>>>>> Note that I added the printout of the instList size
to an existing
>>> >>>>>>>>> O3CPU DPRINTF in cleanUpRemovedInsts() in
cpu/o3/cpu.cc.
>>> >>>>>>>>> Here are the commands I ran to parse the
traces into data files to
>>> >>>>>>>>> analyze in MATLAB and create the
plots:
>>> >>>>>>>>> zgrep DynInst
>>> >>>>>>>>>
dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | grep
destroyed
>>> >>>>>>>>> | awk '{print $1,$11}' > cpuinstcount.out
>>>
>>>>>>>>> zgrep instList
>>> >>>>>>>>>
dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | awk
'{print
>>> >>>>>>>>> $1,$11}' > instlistsize.out
>>> >>>>>>>>> It seems
to me like the problem might lie in gem5, but has just been
>>>
>>>>>>>>> exposed by integrating this more detailed memory model,
dramsim2, into
>>> >>>>>>>>> gem5. Either that, or there are some timing
errors in how dramsim2 was
>>> >>>>>>>>> integrated. I doubt this,
however, since those first 190B ticks executed
>>> >>>>>>>>> used the
dramsim2 memory. I believe the problem is a combination of memory
>>>
>>>>>>>>> instructions + complex loops (branch prediction), resulting in
improper
>>> >>>>>>>>> destroying of instructions.
>>> >>>>>>>>> I've
included the ROB, Commit, Fetch, DynInst and O3CPU debug flags.
>>>
>>>>>>>>> There are 192 ROB entries, which is why the instList size
generally has a
>>> >>>>>>>>> max of about 192 instructions. The dynamic
instruction counts (seen in the
>>> >>>>>>>>> dramsim2 plot) seem to
also imply that instructions are incorrectly being
>>> >>>>>>>>> removed
from the ROB, and then from the cpu's instruction list in cpu.cc,
>>>
>>>>>>>>> which allows more and more instructions to be added to the
system (possibly
>>> >>>>>>>>> from a bad branch).
>>> >>>>>>>>> I
appreciate any help in debugging this and further figuring out the
>>>
>>>>>>>>> root problem, just let me know if you need anything else from
me. I don't
>>> >>>>>>>>> have much more time at the moment to debug,
but I can take any advice for
>>> >>>>>>>>> quick changes and/or
additional traces, then send the results back to the
>>> >>>>>>>>> list
for discussion.
>>> >>>>>>>>> Thanks,
>>> >>>>>>>>> Andrew
>>> >>>>>>>>>
P.S. Paul - I did try decreasing the size of the dramsim2
>>> >>>>>>>>>
transaction (and even command) queue from 512 to 32. The same
instructions
>>> >>>>>>>>> problem occurred. It basically just decreased
the execution time.
>>> >>>>>>>>>
>>> >>>>>>>>> On Wed, Mar 14, 2012 at
2:10 PM, Ali Saidi <***@umich.edu [18]> wrote:
>>> >>>>>>>>>
>>>
>>>>>>>>>> The error is that there are more than 1500 instructions
currently
>>> >>>>>>>>>> in flight in the system. It could mean several
things:
>>> >>>>>>>>>>
>>> >>>>>>>>>> 1. The value is somewhat
arbitrarily defined and maybe there are
>>> >>>>>>>>>> more than 1500 in
your system at one time?
>>> >>>>>>>>>>
>>> >>>>>>>>>> 2. Instructions
aren't being destroyed correctly
>>> >>>>>>>>>>
>>> >>>>>>>>>> You could
try to run a debug binary so you'll get a list of
>>> >>>>>>>>>>
instructions when it happens or increase the number which may
>>>
>>>>>>>>>> be appropriate for certain situations (but 1500 is quite a
few inflight
>>> >>>>>>>>>> instructions).
>>> >>>>>>>>>>
>>> >>>>>>>>>>
Ali
>>> >>>>>>>>>>
>>> >>>>>>>>>> On 13.03.2012 10:56, Andrew Cebulski
wrote:
>>> >>>>>>>>>>
>>> >>>>>>>>>> Hi Xiangyu,
>>> >>>>>>>>>> I just
started looking into this some more. So at first I
>>> >>>>>>>>>>
thought it was due to updating to a more recent revision, but then I
went
>>> >>>>>>>>>> back to revision 8643, added your patch, built and
ran....and now get the
>>> >>>>>>>>>> error with it too (when running
ARM_FS/gem5.opt). I'm testing now to see
>>> >>>>>>>>>> if an update to
SWIG might have resulted in this error, maybe someone on
>>> >>>>>>>>>>
the mailing list would know if that's possible. The difference is
1.3.40
>>> >>>>>>>>>> vs. 2.0.3, both of which are supported according
to the dependencies wiki
>>> >>>>>>>>>> page.
>>> >>>>>>>>>> Just for
completeness, here's the error from revision 8643:
>>> >>>>>>>>>>
build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
>>> >>>>>>>>>>
BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion
`cpu->instcount
>>> >>>>>>>>>> I have not tried running with gem5.debug,
so I will be doing
>>> >>>>>>>>>> that today. Maybe this is an assertion
that is occurring due to an
>>> >>>>>>>>>> optimization. That would mean
it wouldn't be triggered in gem5.debug since
>>> >>>>>>>>>> it runs
without optimizations. Have you tested all debug, opt and fast
>>>
>>>>>>>>>> with your tests?
>>> >>>>>>>>>> Thanks,
>>> >>>>>>>>>>
Andrew
>>> >>>>>>>>>>
>>> >>>>>>>>>> On Tue, Mar 13, 2012 at 1:37 PM,
Rio Xiangyu Dong <
>>> >>>>>>>>>> ***@gmail.com [19]> wrote:
>>>
>>>>>>>>>>
>>> >>>>>>>>>>> Hi Andrew,
>>> >>>>>>>>>>>
>>>
>>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> I didn't see this error in
my simulations. May I ask which gem5
>>> >>>>>>>>>>> version you are
using? I find some of the latest code updates do not comply
>>>
>>>>>>>>>>> with my changes. I am still using the DRAMsim2 patch on Gem5
repo8643, and
>>> >>>>>>>>>>> have run all the runnable benchmarks in
SPEC2006, SPEC2000, EEMBC2, and
>>> >>>>>>>>>>> PARSEC2 on ARM_SE.
>>>
>>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Thank
you!
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
Best,
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Xiangyu
>>> >>>>>>>>>>>
>>>
>>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> *From:* Andrew Cebulski
[mailto:***@drexel.edu [20]]
>>> >>>>>>>>>>> *Sent:* Thursday, March
08, 2012 6:52 PM
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> *To:* gem5 users
mailing list
>>> >>>>>>>>>>> *Cc:****@gmail.com [21];
***@umich.edu [22]
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> *Subject:* Re:
[gem5-users] A Patch for DRAMsim2 Integration
>>> >>>>>>>>>>>
>>>
>>>>>>>>>>> Xiangyu,
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> I've been having an
issue recently with the number of
>>> >>>>>>>>>>> instructions I've been
seeing committed to the CPU (I have a separate
>>> >>>>>>>>>>> thread on
this). It turns out the issue seems to be coming from this patch
>>>
>>>>>>>>>>> you created to integrate DramSim2 with Gem5. Unfortunately,
I've been
>>> >>>>>>>>>>> running with gem5.fast, not gem5.opt. So up
until now, I haven't been
>>> >>>>>>>>>>> seeing assertions. I thought
I'd run it with gem5.opt or debug back in
>>> >>>>>>>>>>> December, but
I must not have. My runs on the Arm O3 cpu fails with this
>>>
>>>>>>>>>>> assertion:
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
build/ARM/cpu/base_dyn_inst_impl.hh:149: void
>>> >>>>>>>>>>>
BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion
`cpu->instcount
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> -Andrew
>>>
>>>>>>>>>>>
>>> >>>>>>>>>>> Date: Sun, 18 Dec 2011 01:48:58 -0800
>>>
>>>>>>>>>>> From: "Dong, Xiangyu" <***@gmail.com [23]>
>>>
>>>>>>>>>>> To: "gem5 users mailing list" <gem5-***@gem5.org [24]>
>>>
>>>>>>>>>>> Subject: [gem5-users] A Patch for DRAMsim2 Integration
>>>
>>>>>>>>>>> Message-ID: gmail.com [25]>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
Content-Type: text/plain; charset="us-ascii"
>>> >>>>>>>>>>>
>>>
>>>>>>>>>>> Hi all,
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>>
>>>>>>>>>>> I have a Gem5+DRAMsim2 patch. I've tested it under both SE
and FS
>>> >>>>>>>>>>> modes.
>>> >>>>>>>>>>> I'm willing to share it
here.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
For those who have such needs, please go to my website
>>> >>>>>>>>>>>
www.cse.psu.edu/~xydong [26] <http://www.cse.psu.edu/%7Exydong [27]>
to
>>> >>>>>>>>>>> download the patch and test it. To enable
>>>
>>>>>>>>>>> DRAMSim2, use se_dramsim2.py script instead of se.py (for
FS, you
>>> >>>>>>>>>>> can create
>>> >>>>>>>>>>> by yourself). The
basic idea to enable the DRAMsim2 module is to
>>> >>>>>>>>>>> use
the
>>> >>>>>>>>>>> derived DRAMMemory class instead of PhysicalMemory
class.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
Please let me know if there are bugs.
>>> >>>>>>>>>>>
>>>
>>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Thank you!
>>>
>>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Best,
>>>
>>>>>>>>>>>
>>> >>>>>>>>>>> Xiangyu Dong
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
-------------- next part --------------
>>> >>>>>>>>>>> An HTML
attachment was scrubbed...
>>> >>>>>>>>>>> URL: <
>>> >>>>>>>>>>>
http://m5sim.org/cgi-bin/mailman/private/gem5-users/attachments/20111218/f3fdf5da/attachment.html
[28]
>>> >>>>>>>>>>> >
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>
_______________________________________________
>>> >>>>>>>>>>
gem5-users mailing list
>>> >>>>>>>>>> gem5-***@gem5.org [29]
>>>
>>>>>>>>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users [30]




Links:
------
[1] mailto:***@drexel.edu
[2]
mailto:***@drexel.edu
[3] mailto:***@umich.edu
[4]
mailto:***@umich.edu
[5]
http://dl.dropbox.com/u/2953302/gem5/err.0
[6]
mailto:***@umich.edu
[7]
http://dl.dropbox.com/u/2953302/gem5/tlb.out
[8]
mailto:***@umich.edu
[9]
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[10]
mailto:***@drexel.edu
[11]
http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
[12]
http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
[13]
mailto:***@eecs.umich.edu
[14]
http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
[15]
http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
[16]
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
[17]
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[18]
mailto:***@umich.edu
[19] mailto:***@gmail.com
[20]
mailto:***@drexel.edu
[21] mailto:***@gmail.com
[22]
mailto:***@umich.edu
[23] mailto:***@gmail.com
[24]
mailto:gem5-***@gem5.org
[25] http://gmail.com
[26]
http://www.cse.psu.edu/~xydong
[27]
http://www.cse.psu.edu/%7Exydong
[28]
http://m5sim.org/cgi-bin/mailman/private/gem5-users/attachments/20111218/f3fdf5da/attachment.html
[29]
mailto:gem5-***@gem5.org
[30]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[31]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[32]
mailto:gem5-***@gem5.org
[33]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[34]
mailto:gem5-***@gem5.org
[35]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[36]
mailto:gem5-***@gem5.org
[37]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[38]
mailto:gem5-***@gem5.org
[39]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[40]
mailto:gem5-***@gem5.org
[41]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[42]
mailto:gem5-***@gem5.org
[43]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[44]
mailto:gem5-***@gem5.org
[45]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[46]
http://dl.dropbox.com/u/2953302/gem5/table_walker.out
[47]
http://dl.dropbox.com/u/2953302/gem5/pendingQueueSize.png
[48]
mailto:***@umich.edu
[49] mailto:***@drexel.edu
Ali Saidi
2012-06-05 22:57:41 UTC
Permalink
Hi Andrew,

I think I might know what is going on here. My best guess is:

1. The cache is out of MSHRs because of some long-latency miss.

2. The instructions behind it are wrong path.

3. The cpu keeps replaying the load because the cpu is blocked.

4. Each time it replays it, the thing goes through another round of
translation.

That is why you see the structures and translations filling up. It
wouldn't be so bad if the translations weren't going on in parallel as
well.

I don't know why the CPU model replays instructions like this instead of
just stalling, anyone? But it shouldn't do it continuously.
cacheBlocked() tells you if the cache is currently stalled, so one path
to fixing this might be to stall decode/rename/iew/translation if
iew.ldstQueue.cacheBlocked() is true. The fix might be as simple as:


diff -r cc47e11ccec1 src/cpu/o3/inst_queue_impl.hh
--- a/src/cpu/o3/inst_queue_impl.hh Tue Jun 05 14:20:13 2012 -0400
+++ b/src/cpu/o3/inst_queue_impl.hh Tue Jun 05 17:56:47 2012 -0500
@@ -1098,7 +1098,7 @@
 {
     for (ListIt it = deferredMemInsts.begin(); it != deferredMemInsts.end();
          ++it) {
-        if ((*it)->translationCompleted() || (*it)->isSquashed()) {
+        if (((*it)->translationCompleted() && !iewStage.ldstQueue.cacheBlocked()) || (*it)->isSquashed()) {
             DynInstPtr ret = *it;
             deferredMemInsts.erase(it);
             return ret;

Hopefully that will point you in the right direction.
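
To make the intent of the changed condition easier to see outside of gem5, here is a small standalone sketch. It is a toy model only: ToyInst, ToyInstPtr and pickDeferred are made-up names for illustration, and the cacheBlocked flag is passed in as a plain bool rather than queried from the LSQ. The point it shows is that squashed instructions always drain from the deferred list, while translated instructions only leave when the cache can accept a request.

```cpp
#include <cassert>
#include <list>
#include <memory>

// Toy stand-in for a dynamic instruction (hypothetical, not gem5's DynInst).
struct ToyInst {
    int seqNum;
    bool translationDone;
    bool squashed;
};
using ToyInstPtr = std::shared_ptr<ToyInst>;

// Return the first deferred instruction that may leave the queue, or nullptr.
// Squashed instructions always drain; translated ones drain only while the
// cache is not blocked, so they are not replayed into a stalled cache.
ToyInstPtr pickDeferred(std::list<ToyInstPtr> &deferred, bool cacheBlocked)
{
    for (auto it = deferred.begin(); it != deferred.end(); ++it) {
        const bool ready = (*it)->translationDone && !cacheBlocked;
        if (ready || (*it)->squashed) {
            ToyInstPtr ret = *it;
            deferred.erase(it);
            return ret;
        }
    }
    return nullptr;
}
```

With the cache blocked, a translated but unsquashed instruction stays in the list until the block clears, which is the behaviour the one-line change above is aiming for.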

Thanks,

Ali

On 10.05.2012 12:51, Andrew Cebulski wrote:


> Ali/Gabe,
> Do either of you need anything else from me to debug on
your end? Should I send a more detailed trace?
> Thanks,
> Andrew
>
>
On Mon, May 7, 2012 at 9:53 PM, Andrew Cebulski <***@drexel.edu [49]>
wrote:
>
>> Hi Ali and Gabe,
>> Here's the trace file:
http://dl.dropbox.com/u/2953302/gem5/table_walker.out [46]
>> The
pending queue size in the table walker follows the shape of the dynamic
instruction curves. The L1 and L2 queue size never go above 0. Comparing
DynInst count in cpu->instcount with pendingQueue size:
>>
http://dl.dropbox.com/u/2953302/gem5/pendingQueueSize.png [47]
>>
>>
-Andrew
>>
>> On Sun, May 6, 2012 at 12:01 PM, Ali Saidi
<***@umich.edu [48]> wrote:
>>
>>> Hi Andrew,
>>>
>>> Could you add
some code to the table walker to see how big the following are
getting:
>>> stateQueueL1.size()
>>> stateQueueL2.size()
>>>
pendingQueue.size()
>>>
>>> Perhaps we're somehow getting into a loop
where there are a lot of translations to invalid addresses that get
squashed and they pile up in the table walker?
>>>
>>> Thanks,
>>>
Ali
>>>
>>> On May 4, 2012, at 7:53 AM, Gabriel Michael Black
wrote:
>>>
>>> > I haven't had a chance to study what's going on here,
but could the problem be that we don't have bandwidth limits/back
pressure implemented for the TLB and delayed translation? It could be
that the CPU is pumping instructions into translation which eventually
drain out/are squashed, and if too many accumulate they trip that
assert.
>>> >
>>> > That may not actually make any sense as far as what
the code is actually doing, but it occurred to me as a possibility and I
thought I'd throw it out there.
>>> >
>>> > Gabe
>>> >
>>> > Quoting
Andrew Cebulski <***@drexel.edu [1]>:
>>> >
>>> >> I double-checked by
looking at the config.ini file. It turns out I did
>>> >> actually
create the checkpoint with an Atomic CPU without caches. Sorry
>>> >>
for the confusion.
>>> >>
>>> >> -Andrew
>>> >>
>>> >> On Wed, May 2,
2012 at 10:12 PM, Andrew Cebulski <***@drexel.edu [2]> wrote:
>>>
>>
>>> >>> I started hitting this assertion (that the number of insts in
flight was >
>>> >>> 1500) before I started using a checkpoint. I
created the checkpoint
>>> >>> afterwards to decrease the time needed to
run simulations to debug this
>>> >>> problem. I'll create a new
checkpoint, then send the new trace output.
>>> >>>
>>> >>> -Andrew
>>>
>>>
>>> >>>
>>> >>> On Wed, May 2, 2012 at 9:53 PM, Ali Saidi
<***@umich.edu [3]> wrote:
>>> >>>
>>> >>>> **
>>> >>>>
>>> >>>> It's
likely the cause for all of your problems. Dirty data in the caches
>>>
>>>> doesn't get restored either. You should always create checkpoints
with an
>>> >>>> atomic cpu and without caches.
>>> >>>>
>>> >>>>
>>>
>>>>
>>> >>>> Ali
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> On 02.05.2012
21:23, Andrew Cebulski wrote:
>>> >>>>
>>> >>>> Sorry, I created the
checkpoint I referred to with an O3 CPU with caches.
>>> >>>> From what
I recall reading, caches don't get restored from checkpoints.
>>> >>>>
Since the checkpoint wasn't during the benchmark run, I assumed that
was
>>> >>>> okay.
>>> >>>> -Andrew
>>> >>>>
>>> >>>> On Wed, May 2,
2012 at 9:07 PM, Ali Saidi <***@umich.edu [4]> wrote:
>>> >>>>
>>>
>>>>> You haven't answered the question about if you created the
checkpoints
>>> >>>>> with an atomic cpu without caches.
>>> >>>>>
>>>
>>>>> Ali
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> On
02.05.2012 19:58, Andrew Cebulski wrote:
>>> >>>>>
>>> >>>>> I have not
run with the checker CPU recently. Here's the stderr output
>>> >>>>>
from a run I did awhile back:
>>> >>>>>
http://dl.dropbox.com/u/2953302/gem5/err.0 [5]
>>> >>>>> Note that the
instruction match error is before my benchmark actually
>>> >>>>> starts
running. The start of my boot script checks to see if my files
>>> >>>>>
image is mounted (which it is), then continues on to run the benchmark.
I
>>> >>>>> booted the system, mounted my files image, then took a
checkpoint. I've
>>> >>>>> been running all my tests from that
checkpoint. I found where my benchmark
>>> >>>>> started based on the
ASID (from ExecAsid debug flag).
>>> >>>>> I delayed the start of
gathering trace data until the second-to-last
>>> >>>>> linear increase
in dynamic instructions in-flight. I'm running a new trace
>>> >>>>>
now.
>>> >>>>> -Andrew
>>> >>>>>
>>> >>>>>
>>> >>>>> On Wed, May 2, 2012
at 5:28 PM, Ali Saidi <***@umich.edu [6]> wrote:
>>> >>>>>
>>> >>>>>>
Something is wrong well before this point. There is no reason that
>>>
>>>>>> address 0x0 or 0x4 should be translated.
>>> >>>>>>
>>> >>>>>>
Did you happen to create a checkpoint when caches were in the
system?
>>> >>>>>>
>>> >>>>>> Have you tried to run with the checker cpu
and see if it detects any
>>> >>>>>> errors?
>>> >>>>>>
>>> >>>>>>
>>>
>>>>>>
>>> >>>>>> Ali
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>>
>>>>>>
>>> >>>>>> On 02.05.2012 17:22, Andrew Cebulski wrote:
>>>
>>>>>>
>>> >>>>>> They are data TLB misses that occur as the in-flight
instruction count
>>> >>>>>> rises (at 0x0 and 0x4). The last TLB miss
before the in-flight instruction
>>> >>>>>> count finally linearly
decreases is to 0x200. Also, at the start of the
>>> >>>>>> rising
slope, I see a miss to 0x8 and 0x2508c.
>>> >>>>>> Here's a trace
file:
>>> >>>>>> http://dl.dropbox.com/u/2953302/gem5/tlb.out [7]
>>>
>>>>>> To reduce size, I just have lines that have either TLB or walker
in
>>> >>>>>> them.
>>> >>>>>> I do see only a handful of instruction
TLB misses.
>>> >>>>>> -Andrew
>>> >>>>>>
>>> >>>>>> On Wed, May 2, 2012
at 11:10 AM, Ali Saidi <***@umich.edu [8]> wrote:
>>> >>>>>>
>>>
>>>>>>> Hi Andrew,
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
Thanks for digging into this. I think there is an issue somewhere,
but
>>> >>>>>>> I'm still not sure where.
>>> >>>>>>>
>>> >>>>>>>
Ali
>>> >>>>>>>
>>> >>>>>>> On 01.05.2012 23:34, Andrew Cebulski
wrote:
>>> >>>>>>>
>>> >>>>>>> Okay, I'm positive now that the issue
lies with delayed translations
>>> >>>>>>> that are squashed before
finishing.
>>> >>>>>>>
>>> >>>>>>> On the data or instruction side? You
seem to allude to data in the
>>> >>>>>>> paragraph below, but then
instructions in the latter text.
>>> >>>>>>>
>>> >>>>>>> It seems to me
like speculative load/stores are being executed,
>>> >>>>>>> rather than
waiting for the instructions to commit. Once the instructions
>>>
>>>>>>> begin getting (speculatively) executed in the TLB, a reference
is left
>>> >>>>>>> there, which seems hard to root out and dereference
after the instruction
>>> >>>>>>> ends up being squashed. At least, I
have not been able to find that out in
>>> >>>>>>> the source code as of
yet. Can anyone clarify on this?
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>>
>>>>>>> There should only be one translation outstanding from each
>>>
>>>>>>> instruction and data side walker. Any nested transactions should
be queued
>>> >>>>>>> in the walker. Until one finishes, I'm not sure
how multiple would ever be
>>> >>>>>>> outstanding.
>>> >>>>>>>
>>>
>>>>>>> Recall the following image that shows how the number of
dynamic
>>> >>>>>>> instruction (DynInst) objects in-flight increases
linearly for varying
>>> >>>>>>> periods of time:
>>> >>>>>>>
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[9]
>>> >>>>>>> After enabling the TLB debug flag, I see that the linear
increase in
>>> >>>>>>> instructions in flight is proportional to the
number of TLB misses. These
>>> >>>>>>> TLB misses have a much larger
delay (resulting in translation delays) due
>>> >>>>>>> to the fact the
DramSim2 models the memory system more accurately. It
>>> >>>>>>> seems
that with the classic memory system, TLB misses often do not have
>>>
>>>>>>> translation delays. For whatever reason, it would also seem that
every
>>> >>>>>>> instruction that has a TLB miss also is eventually
squashed...
>>> >>>>>>>
>>> >>>>>>> From a data side perspective this is
reasonable. While a miss is
>>> >>>>>>> outstanding at some point
instructions will stop committing and thus the
>>> >>>>>>> instructions
in flight will begin to rise until the miss is satisfied.
>>>
>>>>>>>
>>> >>>>>>> Here's a summary of outputs from my trace. These two
DPRINTF
>>> >>>>>>> messages appear on the rising slopes (repeated up
until the peak):
>>> >>>>>>> TLB Miss: Starting hardware table walker
for 0(656)
>>> >>>>>>> TLB Miss: Starting hardware table walker for
0x4(656)
>>> >>>>>>>
>>> >>>>>>> This is interesting/odd. I don't know a
good reason why (1) a miss
>>> >>>>>>> would be outstanding to both
address 0 and address 4 at the same time. In
>>> >>>>>>> almost all
cases these pages are marked as no-access to detect segfaults.
>>>
>>>>>>> Perhaps there is an issue where the cpu is getting into a loop
faulting on
>>> >>>>>>> a bad access and then faulting again on the
fault handler. I could imagine
>>> >>>>>>> this would happen if there
was some corruption in the memory system (for
>>> >>>>>>> example the
timings in dramsim exposing a bug in the cache models or
>>> >>>>>>>
something).
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> At the peak, the
following message appears (from fetch) almost every
>>> >>>>>>> tick for
(what I believe to be) every single one of the table walkers that
>>>
>>>>>>> were squashed.
>>> >>>>>>> Fetch is waiting ITLB walk to
finish!
>>> >>>>>>>
>>> >>>>>>> There must be another walk in flight?
The instruction side will only
>>> >>>>>>> have one fault outstanding at
once. Successive branch mispredicts will
>>> >>>>>>> re-direct fetch but
there is code that catches the fact that a different
>>> >>>>>>> walk
completed than expected and "does the right thing."
>>> >>>>>>>
>>>
>>>>>>> The problem is that these ITLB table walks are for instructions
that
>>> >>>>>>> were squashed as much as 0.3 billion cycles earlier,
and since been removed
>>> >>>>>>> from the CPU's instruction list.
>>>
>>>>>>>
>>> >>>>>>> I'm not following here.
>>> >>>>>>>
>>> >>>>>>> Any
help will be greatly appreciated in solving this problem. I've
>>>
>>>>>>> hit a roadblock with getting Ruby working with ARM, most likely
due to the
>>> >>>>>>> fact that ARM has disjoint memory (x86 and Alpha
do not). There's the 256
>>> >>>>>>> MB for physical memory, then the 64
MB for the boot loader. I brought this
>>> >>>>>>> up in my last email
about trying to get Ruby working. Therefore, I'm
>>> >>>>>>> trying to
get this DramSim2 integration fixed so I can start modeling FS
>>>
>>>>>>> with DRAM memory.
>>> >>>>>>>
>>> >>>>>>> Brad/Steve/Nilay
anyone have a suggestion on how to make this work?
>>> >>>>>>>
>>>
>>>>>>>
>>> >>>>>>> Note that these problems also occur in Soplex from
the Spec CPU2006
>>> >>>>>>> benchmark suite (also hits 1500 in-flight
instructions assertion). Due to
>>> >>>>>>> time constraints, I haven't
tested on other benchmarks.
>>> >>>>>>> Thanks,
>>> >>>>>>> Andrew
>>>
>>>>>>> On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski
<***@drexel.edu [10]>wrote:
>>> >>>>>>>
>>> >>>>>>>> Hey Gabe,
>>>
>>>>>>>> Thanks for this...very helpful. I just recently got back
into
>>> >>>>>>>> debugging this problem. I made a small change in
src/base/refcnt.hh to
>>> >>>>>>>> allow me to return the current count
of references to a DynInst object.
>>> >>>>>>>> I then modified existing
DPRINTFs to also print out reference
>>> >>>>>>>> counts, then added
some of my own when I needed extra visibility.
>>> >>>>>>>> I've found
one memory store instruction that seems to be getting
>>> >>>>>>>> lost.
What's happening is that it progresses as far as getting executed in
>>>
>>>>>>>> the IEW once, but a delayed translation occurs, deferring the
store. By
>>> >>>>>>>> the time it reenters the IEW, the IQ has marked
the instruction as
>>> >>>>>>>> squashed. Everything progresses as usual
from here on out, with one
>>> >>>>>>>> exception. When the instruction
is removed from the CPUs instruction list,
>>> >>>>>>>> there is one
reference count hanging.
>>> >>>>>>>> I've added in some additional
debugging for my traces to help
>>> >>>>>>>> narrow down where this
reference is coming from. As far as I can tell,
>>> >>>>>>>> it's
because of a call to initiateAcc() within the executeStore function
in
>>> >>>>>>>> the lsq unit. Please see the following two traces. The
first trace shows
>>> >>>>>>>> what I just discussed. The second trace
is another memory store
>>> >>>>>>>> instruction that got squashed,
however, it was squashed upon its first
>>> >>>>>>>> entry into the IEW,
therefore it never started execution.
>>> >>>>>>>>
http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out [11]
>>>
>>>>>>>> http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
[12]
>>> >>>>>>>> Let me know if you have any ideas based on these two
instruction
>>> >>>>>>>> traces. I do not understand how the initiateAcc
function results in
>>> >>>>>>>> another reference, but maybe someone
else does.... Since I don't see how
>>> >>>>>>>> it makes a reference,
it's hard to find out how to make sure it gets
>>> >>>>>>>>
dereferenced...
>>> >>>>>>>> Unfortunately, I haven't been able to add a
DPRINTF in
>>> >>>>>>>> src/base/refcnt.hh ...this would make things
more clear (i.e. exactly when
>>> >>>>>>>> references/dereferences occur).
Let me know if you have any advice on
>>> >>>>>>>> this...if it's
possible. I can't seem to get the right include files, and
>>> >>>>>>>>
likely right SConscript compile order...
>>> >>>>>>>> Thanks,
>>>
>>>>>>>> Andrew
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> On Sat, Apr 7,
2012 at 9:48 PM, Gabe Black <***@eecs.umich.edu [13]>wrote:
>>>
>>>>>>>>
>>> >>>>>>>>> Without digging into things too deeply, it looks
like you may be
>>> >>>>>>>>> leaking references to dynamic
instructions. The CPU may think it's done
>>> >>>>>>>>> with one, but
until that final reference is removed, the object will hang
>>>
>>>>>>>>> around forever. I think I've had problems before where the
reference
>>> >>>>>>>>> count ended up off by one somehow and
instructions would start piling up.
>>> >>>>>>>>> It's also possible
that a clog develops in O3's pipeline and some internal
>>> >>>>>>>>>
structure stops letting instructions through and starts accumulating
them.
>>> >>>>>>>>> Either of these problems will be annoying to track
down, but with enough
>>> >>>>>>>>> digging I've been able to fix these
sorts of things.
>>> >>>>>>>>>
>>> >>>>>>>>> This may have more to do
with O3 not handling the benchmark you're
>>> >>>>>>>>> running well
rather than a problem with your new DRAM model. There may be
>>>
>>>>>>>>> some interaction between the two, though, where the new memory
makes the
>>> >>>>>>>>> timing line up to cause O3 to behave poorly.
What you can do is instrument
>>> >>>>>>>>> dynamic instruction creation
and destruction and reference counting (try
>>> >>>>>>>>> print "this"
for both the reference counting wrapper and the dyn inst
>>> >>>>>>>>>
itself) and turn it on as close as you can to where things go bad
tick
>>> >>>>>>>>> wise. Then look for an instruction which gets lost,
and look for where it's
>>> >>>>>>>>> reference count is incremented and
decremented. It should be relatively
>>> >>>>>>>>> easy to pair up where
references are created and destroyed, and you should
>>> >>>>>>>>> be
able to identify the reference which never goes away. Then you need
to
>>> >>>>>>>>> figure out where that reference is being created. After
that, you should
>>> >>>>>>>>> have enough information to identify why
the reference counting isn't being
>>> >>>>>>>>> done correctly. It's
arduous, but that's the only way.
>>> >>>>>>>>>
>>> >>>>>>>>> It's
important to also make sure reference counts aren't decremented
>>>
>>>>>>>>> to zero prematurely. I had a problem once where that happened
and the
>>> >>>>>>>>> memory behind the object was updated by something
that didn't know it was
>>> >>>>>>>>> dead. The memory had since been
reallocated to another object of the same
>>> >>>>>>>>> type, so that
other object reflected what happened to the phantom one. If I
>>>
>>>>>>>>> remember that manifested as something weird like an add
causing a page
>>> >>>>>>>>> fault or something.
>>> >>>>>>>>>
>>>
>>>>>>>>> Gabe
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> On 04/07/12
18:21, Andrew Cebulski wrote:
>>> >>>>>>>>>
>>> >>>>>>>>> Hi all,
>>>
>>>>>>>>> I've looked into this problem some more, and have put together
a
>>> >>>>>>>>> couple traces. I've been becoming more familiar with how
gem5 handles
>>> >>>>>>>>> dynamic instructions, in particular how it
destroys them. I have two
>>> >>>>>>>>> traces to compare, one with the
physical memory, and the other with the
>>> >>>>>>>>> integrated
dramsim2 dram memory. I also have two plots showing instruction
>>>
>>>>>>>>> counts over time (sim ticks). All of these are linked at the
end of the
>>> >>>>>>>>> email.
>>> >>>>>>>>> First, I'm going to go
into what I've been able to interpret
>>> >>>>>>>>> regarding how
instructions are destroyed. In particular, comparing when
>>> >>>>>>>>>
DynInst's vs. DynInstPtr's are deconstructed/removed from the cpu. I
>>>
>>>>>>>>> separate these because I've seen a difference, as I discuss
later. These
>>> >>>>>>>>> explanations are fairly non-existent on the
wiki. There is a section
>>> >>>>>>>>> header waiting to be
filled...
>>> >>>>>>>>> From what I have been able to gather from the code, there is a list
>>> >>>>>>>>> of all the instructions in flight in cpu/o3/cpu.cc called instList, with
>>> >>>>>>>>> the type DynInstPtr. There are three conditions to instructions being
>>> >>>>>>>>> cleaned from this list:
>>> >>>>>>>>> 1.) The ROB retires its head instruction
>>> >>>>>>>>> 2.) Fetch receives a rob squashing signal from the commit,
>>> >>>>>>>>> resulting in removing any instruction not in the ROB
>>> >>>>>>>>> 3.) Decode detects an incorrect branch prediction, resulting in
>>> >>>>>>>>> removal of all instructions back to the bad seq num.
>>> >>>>>>>>> Once all five stages have completed, the CPU cleans up all the
>>> >>>>>>>>> removed in-flight instructions. This line in particular
>>> >>>>>>>>> in cleanUpRemovedInsts() in cpu/o3/cpu.cc deconstructs a DynInstPtr:
>>> >>>>>>>>> instList.erase(removeList.front());
>>> >>>>>>>>> When I turn on the
debug flag O3CPU, I see the message "Removing
>>> >>>>>>>>> instruction,
..." (from o3/cpu.cc) with the threadNum, seqNum and pcState
>>>
>>>>>>>>> after all 5 cpu stages have completed, and one of the
conditions above is
>>> >>>>>>>>> met. I also see what tick it occurs
on.
>>> >>>>>>>>> When I turn on the DynInst debug flag, I see when
instructions are
>>> >>>>>>>>> created and destroyed
(cpu/base_dyn_inst_impl.hh) and what tick. From
>>> >>>>>>>>> analyzing
the trace files, I've gathered that this takes into account that
>>>
>>>>>>>>> instructions have different execution lengths. So if one tick
a memory
>>> >>>>>>>>> instruction in the instList (DynInstPtr) is
removed, the DynInst for that
>>> >>>>>>>>> memory instruction will
occur much later (i.e. 1M ticks later). I have yet
>>> >>>>>>>>> to
determine how this is implemented.
>>> >>>>>>>>> Now for the
problem.
>>> >>>>>>>>> What I'm seeing when I run dramsim2 dram memory
is a significant
>>> >>>>>>>>> difference between the size of the
instList vector (of DynInstPtr objects),
>>> >>>>>>>>> and the size of
dynamic instruction count (of DynInst objects). The
>>> >>>>>>>>>
benchmark I'm running is libquantum from SPEC 2006. For the first
roughly
>>> >>>>>>>>> 130B ticks, the dynamic instruction count kept in
cpu/base_dyn_inst.impl.hh
>>> >>>>>>>>> shadows the instList size in
o3/cpu.cc (figure linked below) very closely.
>>> >>>>>>>>> Around tick
130B after libquantum started, it starts hitting what I'm
>>> >>>>>>>>>
assuming are loops (therefore branch prediction), resulting in some
>>>
>>>>>>>>> behavior that seems to imply improper instruction handling
(i.e. more
>>> >>>>>>>>> instructions in flight than allowed by
ROB).
>>> >>>>>>>>> I wasn't able to sync-up the physical and dramsim2
traces exactly by
>>> >>>>>>>>> trace, but they should represent roughly
the same area of execution. They
>>> >>>>>>>>> don't execute the same
due to the dramsim2 modeling the memory differently
>>> >>>>>>>>> (i.e.
latency and other delays).
>>> >>>>>>>>> I've shared both traces on my
public Dropbox here --
>>> >>>>>>>>>
>>> >>>>>>>>>
http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
[14]
>>> >>>>>>>>>
>>> >>>>>>>>>
http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
[15]
>>> >>>>>>>>> Here are a couple plots of tick versus instruction
count, with
>>> >>>>>>>>> respect to cpu->instcount in
cpu/base_dyn_inst.impl.hh and instList.size()
>>> >>>>>>>>> in
cpu/o3/cpu.cc. --
>>> >>>>>>>>>
>>> >>>>>>>>>
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
[16]
>>> >>>>>>>>>
>>> >>>>>>>>>
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[17]
>>> >>>>>>>>> Note that I added the printout of the instList size
to an existing
>>> >>>>>>>>> O3CPU DPRINTF in cleanUpRemovedInsts() in
cpu/o3/cpu.cc.
>>> >>>>>>>>> Here are the commands I ran to parse the
traces into data files to
>>> >>>>>>>>> analyze in MATLAB and create the
plots:
>>> >>>>>>>>> zgrep DynInst dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | grep destroyed | awk '{print $1,$11}' > cpuinstcount.out
>>> >>>>>>>>> zgrep instList dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | awk '{print $1,$11}' > instlistsize.out
>>> >>>>>>>>> It seems to me like the problem might lie in gem5, but has just been
>>> >>>>>>>>> exposed by integrating this more detailed memory model, dramsim2, into
>>> >>>>>>>>> gem5. Either that, or there are some timing errors in how dramsim2 was
>>> >>>>>>>>> integrated. I doubt this, however, since those first 190B ticks executed
>>> >>>>>>>>> used the dramsim2 memory. I believe the problem is a combination of memory
>>> >>>>>>>>> instructions + complex loops (branch prediction), resulting in improper
>>> >>>>>>>>> destroying of instructions.
>>> >>>>>>>>> I've included the ROB, Commit, Fetch, DynInst and O3CPU debug flags.
>>> >>>>>>>>> There are 192 ROB entries, which is why the instList size generally has a
>>> >>>>>>>>> max of about 192 instructions. The dynamic instruction counts (seen in the
>>> >>>>>>>>> dramsim2 plot) seem to also imply that instructions are incorrectly being
>>> >>>>>>>>> removed from the ROB, and then from the cpu's instruction list in cpu.cc,
>>> >>>>>>>>> which allows more and more instructions to be added to the system (possibly
>>> >>>>>>>>> from a bad branch).
>>> >>>>>>>>> I
appreciate any help in debugging this and further figuring out the
>>>
>>>>>>>>> root problem, just let me know if you need anything else from
me. I don't
>>> >>>>>>>>> have much more time at the moment to debug,
but I can take any advice for
>>> >>>>>>>>> quick changes and/or
additional traces, then send the results back to the
>>> >>>>>>>>> list
for discussion.
>>> >>>>>>>>> Thanks,
>>> >>>>>>>>> Andrew
>>> >>>>>>>>>
P.S. Paul - I did try decreasing the size of the dramsim2
>>> >>>>>>>>>
transaction (and even command) queue from 512 to 32. The same
instructions
>>> >>>>>>>>> problem occurred. It basically just decreased
the execution time.
>>> >>>>>>>>>
>>> >>>>>>>>> On Wed, Mar 14, 2012 at
2:10 PM, Ali Saidi <***@umich.edu [18]> wrote:
>>> >>>>>>>>>
>>> >>>>>>>>>> The error is that there are more than 1500 instructions currently
>>> >>>>>>>>>> in flight in the system. It could mean several things:
>>> >>>>>>>>>>
>>> >>>>>>>>>> 1. The value is somewhat arbitrarily defined, and maybe there are
>>> >>>>>>>>>> more than 1500 in your system at one time?
>>> >>>>>>>>>>
>>> >>>>>>>>>> 2. Instructions aren't being destroyed correctly.
>>> >>>>>>>>>>
>>> >>>>>>>>>> You could try to run a debug binary so you'll get a list of
>>> >>>>>>>>>> instructions when it happens, or increase the number, which may
>>> >>>>>>>>>> be appropriate for certain situations (but 1500 is quite a few
>>> >>>>>>>>>> in-flight instructions).
>>> >>>>>>>>>>
>>> >>>>>>>>>> Ali
>>> >>>>>>>>>>
>>> >>>>>>>>>> On 13.03.2012 10:56, Andrew Cebulski
wrote:
>>> >>>>>>>>>>
>>> >>>>>>>>>> Hi Xiangyu,
>>> >>>>>>>>>> I just
started looking into this some more. So at first I
>>> >>>>>>>>>>
thought it was due to updating to a more recent revision, but then I
went
>>> >>>>>>>>>> back to revision 8643, added your patch, built and
ran....and now get the
>>> >>>>>>>>>> error with it too (when running
ARM_FS/gem5.opt). I"m testing now to see
>>> >>>>>>>>>> if an update to
SWIG might have resulted in this error, maybe someone on
>>> >>>>>>>>>>
the mailing list would know if that's possible. The difference is
1.3.40
>>> >>>>>>>>>> vs. 2.0.3, both of which are supported according
to the dependencies wiki
>>> >>>>>>>>>> page.
>>> >>>>>>>>>> Just for
completeness, here's the error from revision 8643:
>>> >>>>>>>>>>
build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
>>> >>>>>>>>>>
BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion
`cpu->instcount
>>> >>>>>>>>>> I have not tried running with gem5.debug,
so I will be doing
>>> >>>>>>>>>> that today. Maybe this is an assertion
that is occurring due to an
>>> >>>>>>>>>> optimization. That would mean
it wouldn't be triggered in gem5.debug since
>>> >>>>>>>>>> it runs
without optimizations. Have you tested all debug, opt and fast
>>>
>>>>>>>>>> with your tests?
>>> >>>>>>>>>> Thanks,
>>> >>>>>>>>>>
Andrew
>>> >>>>>>>>>>
>>> >>>>>>>>>> On Tue, Mar 13, 2012 at 1:37 PM,
Rio Xiangyu Dong <
>>> >>>>>>>>>> ***@gmail.com [19]> wrote:
>>>
>>>>>>>>>>
>>> >>>>>>>>>>> Hi Andrew,
>>> >>>>>>>>>>>
>>>
>>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> I didn't see this error in my simulations. May I ask which gem5
>>> >>>>>>>>>>> version you are using? I find some of the latest code updates do not comply
>>> >>>>>>>>>>> with my changes. I am still using the DRAMsim2 patch on Gem5 repo8643, and
>>> >>>>>>>>>>> have run all the runnable benchmarks in SPEC2006, SPEC2000, EEMBC2, and
>>> >>>>>>>>>>> PARSEC2 on ARM_SE.
>>>
>>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Thank
you!
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
Best,
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Xiangyu
>>> >>>>>>>>>>>
>>>
>>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> *From:* Andrew Cebulski
[mailto:***@drexel.edu [20]]
>>> >>>>>>>>>>> *Sent:* Thursday, March
08, 2012 6:52 PM
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> *To:* gem5 users
mailing list
>>> >>>>>>>>>>> *Cc:****@gmail.com [21];
***@umich.edu [22]
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> *Subject:* Re:
[gem5-users] A Patch for DRAMsim2 Integration
>>> >>>>>>>>>>>
>>>
>>>>>>>>>>> Xiangyu,
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> I've been having an
issue recently with the number of
>>> >>>>>>>>>>> instructions I've been
seeing committed to the CPU (I have a separate
>>> >>>>>>>>>>> thread on
this). It turns out the issue seems to be coming from this patch
>>>
>>>>>>>>>>> you created to integrate DramSim2 with Gem5. Unfortunately,
I've been
>>> >>>>>>>>>>> running with gem5.fast, not gem5.opt. So up
until now, I haven't been
>>> >>>>>>>>>>> seeing assertions. I thought
I'd run it with gem5.opt or debug back in
>>> >>>>>>>>>>> December, but
I must not have. My runs on the Arm O3 cpu fails with this
>>>
>>>>>>>>>>> assertion:
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
build/ARM/cpu/base_dyn_inst_impl.hh:149: void
>>> >>>>>>>>>>>
BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion
`cpu->instcount
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> -Andrew
>>>
>>>>>>>>>>>
>>> >>>>>>>>>>> Date: Sun, 18 Dec 2011 01:48:58 -0800
>>>
>>>>>>>>>>> From: "Dong, Xiangyu" <***@gmail.com [23]>
>>>
>>>>>>>>>>> To: "gem5 users mailing list" <gem5-***@gem5.org [24]>
>>>
>>>>>>>>>>> Subject: [gem5-users] A Patch for DRAMsim2 Integration
>>>
>>>>>>>>>>> Message-ID: gmail.com [25]>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
Content-Type: text/plain; charset="us-ascii"
>>> >>>>>>>>>>>
>>>
>>>>>>>>>>> Hi all,
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>>
>>>>>>>>>>> I have a Gem5+DRAMsim2 patch. I've tested it under both SE
and FS
>>> >>>>>>>>>>> modes.
>>> >>>>>>>>>>> I'm willing to share it
here.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
For those who have such needs, please go to my website
>>> >>>>>>>>>>>
www.cse.psu.edu/~xydong [26] <http://www.cse.psu.edu/%7Exydong [27]>
to
>>> >>>>>>>>>>> download the patch and test it. To enable
>>>
>>>>>>>>>>> DRAMSim2, use se_dramsim2.py script instead of se.py (for
FS, you
>>> >>>>>>>>>>> can create
>>> >>>>>>>>>>> by yourself). The
basic idea to enable the DRAMsim2 module is to
>>> >>>>>>>>>>> use
the
>>> >>>>>>>>>>> derived DRAMMemory class instead of PhysicalMemory
class.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
Please let me know if there are bugs.
>>> >>>>>>>>>>>
>>>
>>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Thank you!
>>>
>>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Best,
>>>
>>>>>>>>>>>
>>> >>>>>>>>>>> Xiangyu Dong
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
-------------- next part --------------
>>> >>>>>>>>>>> An HTML
attachment was scrubbed...
>>> >>>>>>>>>>> URL: <
>>> >>>>>>>>>>>
http://m5sim.org/cgi-bin/mailman/private/gem5-users/attachments/20111218/f3fdf5da/attachment.html
[28]
>>> >>>>>>>>>>> >
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>
_______________________________________________
>>> >>>>>>>>>>
gem5-users mailing list
>>> >>>>>>>>>> gem5-***@gem5.org [29]
>>>
>>>>>>>>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users [30]
>>>
>>>>>>>>>>
>>> >>>>>>>>>>




Links:
------
[1] mailto:***@drexel.edu
[2]
mailto:***@drexel.edu
[3] mailto:***@umich.edu
[4]
mailto:***@umich.edu
[5]
http://dl.dropbox.com/u/2953302/gem5/err.0
[6]
mailto:***@umich.edu
[7]
http://dl.dropbox.com/u/2953302/gem5/tlb.out
[8]
mailto:***@umich.edu
[9]
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[10]
mailto:***@drexel.edu
[11]
http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
[12]
http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
[13]
mailto:***@eecs.umich.edu
[14]
http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
[15]
http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
[16]
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
[17]
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[18]
mailto:***@umich.edu
[19] mailto:***@gmail.com
[20]
mailto:***@drexel.edu
[21] mailto:***@gmail.com
[22]
mailto:***@umich.edu
[23] mailto:***@gmail.com
[24]
mailto:gem5-***@gem5.org
[25] http://gmail.com
[26]
http://www.cse.psu.edu/~xydong
[27]
http://www.cse.psu.edu/%7Exydong
[28]
http://m5sim.org/cgi-bin/mailman/private/gem5-users/attachments/20111218/f3fdf5da/attachment.html
[29]
mailto:gem5-***@gem5.org
[30]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[31]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[32]
mailto:gem5-***@gem5.org
[33]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[34]
mailto:gem5-***@gem5.org
[35]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[36]
mailto:gem5-***@gem5.org
[37]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[38]
mailto:gem5-***@gem5.org
[39]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[40]
mailto:gem5-***@gem5.org
[41]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[42]
mailto:gem5-***@gem5.org
[43]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[44]
mailto:gem5-***@gem5.org
[45]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[46]
http://dl.dropbox.com/u/2953302/gem5/table_walker.out
[47]
http://dl.dropbox.com/u/2953302/gem5/pendingQueueSize.png
[48]
mailto:***@umich.edu
[49] mailto:***@drexel.edu
Ali Saidi
2012-05-11 04:17:21 UTC
Permalink
Hi Andrew,

Looking at the trace, it seems that a lot of invalid translations are
occurring. Any access to an address below 0x1000 is likely invalid. An
invalid translation will return a fault (setting the fault pointer in
the dynamic instruction to something other than NoFault), and the
instruction will either be squashed by a mispredicted branch or redirect
fetch to a kernel handler. I'm wondering if that isn't happening for
some reason. You need to trace back some of these translations, find the
instruction sequence number for each of them, and then see what each
instruction's lifetime is like. Are they getting squashed? Looking at
your graph, when the in-flight instruction count falls to 0, what is the
cause? Does an interrupt occur right before? Something else?
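
One way to follow instruction lifetimes in a DynInst trace is sketched
below. It pairs "created" and "destroyed" events by sequence number and
reports the instructions that were never destroyed. The line format is an
assumption, inferred from the fields the awk commands elsewhere in this
thread pick out (tick in field 1, instcount in field 11); it has not been
checked against the gem5 source. The names find_unreleased and SAMPLE, and
the sample lines themselves, are hypothetical.

```python
import re

# Assumed shape of a DynInst debug line (not verified against gem5):
#   <tick>: system.cpu: DynInst: [sn:<N>] Instruction created. Instcount for system.cpu = <count>
EVENT_RE = re.compile(
    r"^\s*(\d+):.*\[sn:(\d+)\]\s+Instruction (created|destroyed)\."
)


def find_unreleased(lines):
    """Return {seq_num: creation_tick} for instructions never destroyed."""
    live = {}
    for line in lines:
        match = EVENT_RE.match(line)
        if match is None:
            continue  # not a DynInst create/destroy line
        tick = int(match.group(1))
        seq_num = int(match.group(2))
        if match.group(3) == "created":
            live[seq_num] = tick
        else:
            # Destroyed: this instruction is no longer a leak suspect.
            live.pop(seq_num, None)
    return live


# Made-up sample lines, for illustration only.
SAMPLE = [
    "   1000: system.cpu: DynInst: [sn:1] Instruction created. Instcount for system.cpu = 1",
    "   1500: system.cpu: DynInst: [sn:2] Instruction created. Instcount for system.cpu = 2",
    "   2000: system.cpu.dtb: TLB Miss: Starting hardware table walker for 0x4(656)",
    "   2500: system.cpu: DynInst: [sn:1] Instruction destroyed. Instcount for system.cpu = 1",
]

if __name__ == "__main__":
    for seq_num, tick in find_unreleased(SAMPLE).items():
        print(f"sn:{seq_num} created at tick {tick} was never destroyed")
```

Once a suspect sequence number is known, grepping the full trace for
"[sn:N]" should show whether that instruction was squashed and which stage
touched it last.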

Thanks,

Ali

On 07.05.2012 20:53, Andrew Cebulski wrote:

> Hi Ali and Gabe,
>
> Here's the trace file:
> http://dl.dropbox.com/u/2953302/gem5/table_walker.out [46]
>
> The pending queue size in the table walker follows the shape of the
> dynamic instruction curves. The L1 and L2 queue sizes never go above 0.
> Comparing the DynInst count in cpu->instcount with the pendingQueue size:
> http://dl.dropbox.com/u/2953302/gem5/pendingQueueSize.png [47]
>
> -Andrew
>
> On Sun, May 6, 2012 at 12:01 PM, Ali Saidi <***@umich.edu [48]> wrote:
>
>> Hi Andrew,
>>
>> Could you add some code to the table walker to see how big the
>> following are getting:
>>
>> stateQueueL1.size()
>> stateQueueL2.size()
>> pendingQueue.size()
>>
>> Perhaps we're somehow getting into a loop where there are a lot of
>> translations to invalid addresses that get squashed, and they pile up
>> in the table walker?
>>
>> Thanks,
>> Ali
>>
>> On May 4, 2012, at 7:53 AM, Gabriel Michael Black wrote:
>>
>> > I haven't had a chance to study what's going on here, but could the
>> > problem be that we don't have bandwidth limits/back pressure
>> > implemented for the TLB and delayed translation? It could be that the
>> > CPU is pumping instructions into translation which eventually drain
>> > out/are squashed, and if too many accumulate they trip that assert.
>> >
>> > That may not actually make any sense as far as what the code is
>> > actually doing, but it occurred to me as a possibility and I thought
>> > I'd throw it out there.
>> >
>> > Gabe
>> >
>> > Quoting Andrew Cebulski <***@drexel.edu [1]>:
>> >
>> >> I double-checked by looking at the config.ini file. It turns out I did
>> >> actually create the checkpoint with an Atomic CPU without caches. Sorry
>> >> for the confusion.
>> >>
>> >> -Andrew
>> >>
>> >> On Wed, May 2, 2012 at 10:12 PM, Andrew Cebulski <***@drexel.edu [2]> wrote:
>> >>
>> >>> I started hitting this assertion (that the number of insts in flight
>> >>> was > 1500) before I started using a checkpoint. I created the checkpoint
>> >>> afterwards to decrease the time needed to run simulations to debug this
>> >>> problem. I'll create a new checkpoint, then send the new trace output.
>> >>>
>> >>> -Andrew
>> >>>
>> >>> On Wed, May 2, 2012 at 9:53 PM, Ali Saidi <***@umich.edu [3]> wrote:
>> >>>
>> >>>> It's likely the cause for all of your problems. Dirty data in the caches
>> >>>> doesn't get restored either. You should always create checkpoints with an
>> >>>> atomic cpu and without caches.
>> >>>>
>> >>>> Ali
>> >>>>
>> >>>> On 02.05.2012 21:23, Andrew Cebulski wrote:
>> >>>>
>> >>>> Sorry, I created the checkpoint I referred to with an O3 CPU with caches.
>> >>>> From what I recall reading, caches don't get restored from checkpoints.
>> >>>> Since the checkpoint wasn't during the benchmark run, I assumed that was
>> >>>> okay.
>> >>>> -Andrew
>> >>>>
>> >>>> On Wed, May 2, 2012 at 9:07 PM, Ali Saidi <***@umich.edu [4]> wrote:
>> >>>>
>> >>>>> You haven't answered the question about if you created the checkpoints
>> >>>>> with an atomic cpu without caches.
>> >>>>>
>> >>>>> Ali
>> >>>>>
>> >>>>> On 02.05.2012 19:58, Andrew Cebulski wrote:
>> >>>>>
>> >>>>> I have not run with the checker CPU recently. Here's the stderr output
>> >>>>> from a run I did awhile back:
>> >>>>> http://dl.dropbox.com/u/2953302/gem5/err.0 [5]
>> >>>>> Note that the instruction match error is before my benchmark actually
>> >>>>> starts running. The start of my boot script checks to see if my files
>> >>>>> image is mounted (which it is), then continues on to run the benchmark. I
>> >>>>> booted the system, mounted my files image, then took a checkpoint. I've
>> >>>>> been running all my tests from that checkpoint. I found where my benchmark
>> >>>>> started based on the ASID (from ExecAsid debug flag).
>> >>>>> I delayed the start of gathering trace data until the second-to-last
>> >>>>> linear increase in dynamic instructions in-flight. I'm running a new trace
>> >>>>> now.
>> >>>>> -Andrew
>> >>>>>
>> >>>>> On Wed, May 2, 2012 at 5:28 PM, Ali Saidi <***@umich.edu [6]> wrote:
>> >>>>>
>> >>>>>> Something is wrong well before this point. There is no reason that
>> >>>>>> address 0x0 or 0x4 should be translated.
>> >>>>>>
>> >>>>>> Did you happen to create a checkpoint when caches were in the system?
>> >>>>>>
>> >>>>>> Have you tried to run with the checker cpu and see if it detects any
>> >>>>>> errors?
>> >>>>>>
>> >>>>>> Ali
>> >>>>>>
>> >>>>>> On 02.05.2012 17:22, Andrew Cebulski wrote:
>> >>>>>>
>> >>>>>> They are data TLB misses that occur as the in-flight instruction count
>> >>>>>> rises (at 0x0 and 0x4). The last TLB miss before the in-flight instruction
>> >>>>>> count finally linearly decreases is to 0x200. Also, at the start of the
>> >>>>>> rising slope, I see a miss to 0x8 and 0x2508c.
>> >>>>>> Here's a trace file:
>> >>>>>> http://dl.dropbox.com/u/2953302/gem5/tlb.out [7]
>> >>>>>> To reduce size, I just have lines that have either TLB or walker in
>> >>>>>> them.
>> >>>>>> I do see only a handful of instruction TLB misses.
>> >>>>>> -Andrew
>> >>>>>>
>> >>>>>> On Wed, May 2, 2012 at 11:10 AM, Ali
Saidi <***@umich.edu [8]> wrote:
>> >>>>>>
>> >>>>>>> Hi Andrew,
>>
>>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Thanks for digging into this. I
think there is an issue somewhere, but
>> >>>>>>> I'm still not sure
where.
>> >>>>>>>
>> >>>>>>> Ali
>> >>>>>>>
>> >>>>>>> On 01.05.2012
23:34, Andrew Cebulski wrote:
>> >>>>>>>
>> >>>>>>> Okay, I'm positive
now that the issue lies with delayed translations
>> >>>>>>> that are
squashed before finishing.
>> >>>>>>>
>> >>>>>>> On the data on
instruction side? You seem to allude to data in the
>> >>>>>>> paragraph
below, but then instructions in the latter text.
>> >>>>>>>
>> >>>>>>>
It seems to me like speculative load/stores are being executed,
>>
>>>>>>> rather than waiting for the instructions to commit. Once the
instructions
>> >>>>>>> begin getting (speculatively) executed in the
TLB, a reference is left
>> >>>>>>> there, which seems hard to root out
and dereference after the instruction
>> >>>>>>> ends up being squashed.
At least, I have not been able to find that out in
>> >>>>>>> the source
code as of yet. Can anyone clarify on this?
>> >>>>>>>
>> >>>>>>>
>>
>>>>>>>
>> >>>>>>> There should only be one translation outstanding from
each
>> >>>>>>> instruction and data side walker. Any nested
transactions should be queued
>> >>>>>>> in the walker. Until one
finishes, I'm not sure how multiple would ever be
>> >>>>>>>
outstanding.
>> >>>>>>>
>> >>>>>>> Recall the following image that shows
how the number of dynamic
>> >>>>>>> instruction (DynInst) objects
in-flight increases linearly for varying
>> >>>>>>> periods of time:
>>
>>>>>>>
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[9]
>> >>>>>>> After enabling the TLB debug flag, I see that the linear
increase in
>> >>>>>>> instructions in flight is proportional to the
number of TLB misses. These
>> >>>>>>> TLB misses have a much larger
delay (resulting in translation delays) due
>> >>>>>>> to the fact the
DramSim2 models the memory system more accurately. It
>> >>>>>>> seems
that with the classic memory system, TLB misses often do not have
>>
>>>>>>> translation delays. For whatever reason, it would also seem that
every
>> >>>>>>> instruction that has a TLB miss also is eventually
squashed...
>> >>>>>>>
>> >>>>>>> From a data side perspective this is
reasonable. While a miss is
>> >>>>>>> outstanding at some point
instructions will stop committing and thus the
>> >>>>>>> instructions
in flight will begin to rise until the miss is satisfied.
>> >>>>>>>
>>
>>>>>>> Here's a summary of outputs from my trace. These two DPRINTF
>>
>>>>>>> messages appears on the rising slopes (repeated up until the
peak):
>> >>>>>>> TLB Miss: Starting hardware table walker for 0(656)
>>
>>>>>>> TLB Miss: Starting hardware table walker for 0x4(656)
>>
>>>>>>>
>> >>>>>>> This is interesting/odd. I don't know a good reason
why (1) a miss
>> >>>>>>> would be outstanding to both address 0 and
address 4 at the same time. In
>> >>>>>>> almost all cases these pages
are marked as no-access to detect segfaults.
>> >>>>>>> Perhaps there is
an issue where the cpu is getting into a loop faulting on
>> >>>>>>> a
bad access and then faulting again on the fault handler. I could
imagine
>> >>>>>>> this would happen if there was some corruption in the
memory system (for
>> >>>>>>> example the timings in dramsim exposing a
bug in the cache models or
>> >>>>>>> something).
>> >>>>>>>
>>
>>>>>>>
>> >>>>>>> At the peak, the following message appears (from
fetch) almost every
>> >>>>>>> tick for (what I believe to be) every
single one of the table walkers that
>> >>>>>>> were squashed.
>>
>>>>>>> Fetch is waiting ITLB walk to finish!
>> >>>>>>>
>> >>>>>>>
There must be another walk in flight? The instruction side will only
>>
>>>>>>> have one fault outstanding at once. Successive branch
mispredicts will
>> >>>>>>> re-direct fetch but there is code that
catches the fact that a different
>> >>>>>>> walk completed then
expected and "does the right thing."
>> >>>>>>>
>> >>>>>>> The problem
is that these ITLB table walks are for instructions that
>> >>>>>>> were
squashed as much as 0.3 billion cycles earlier, and since been
removed
>> >>>>>>> from the CPU's instruction list.
>> >>>>>>>
>>
>>>>>>> I'm not following here.
>> >>>>>>>
>> >>>>>>> Any help will be
greatly appreciated in solving this problem. I've
>> >>>>>>> hit a
roadblock with getting Ruby working with ARM, most likely due to the
>>
>>>>>>> fact that ARM has disjoint memory (x86 and Alpha do not).
There's the 256
>> >>>>>>> MB for physical memory, then the 64 MB for
the boot loader. I brought this
>> >>>>>>> up in my last email about
trying to get Ruby working. Therefore, I'm
>> >>>>>>> trying to get this
DramSim2 integration fixed so I can start modeling FS
>> >>>>>>> with
DRAM memory.
>> >>>>>>>
>> >>>>>>> Brad/Steve/Nilay anyone have a
suggestion on how to make this work?
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
Note that these problems also occur in Soplex from the Spec CPU2006
>>
>>>>>>> benchmark suite (also hits 1500 in-flight instructions
assertion). Due to
>> >>>>>>> time constraints, I haven't tested on
other benchmarks.
>> >>>>>>> Thanks,
>> >>>>>>> Andrew
>> >>>>>>> On
Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski <***@drexel.edu
[10]>wrote:
>> >>>>>>>
>> >>>>>>>> Hey Gabe,
>> >>>>>>>> Thanks for
this...very helpful. I just recently got back into
>> >>>>>>>> debugging
this problem. I made a small change in src/base/refcnt.hh to
>> >>>>>>>>
allow me to return the current count of references to a DynInst
object.
>> >>>>>>>> I then modified existing DPRINTFs to also print out
reference
>> >>>>>>>> counts, then added some of my own when I needed
extra visibility.
>> >>>>>>>> I've found one memory store instruction
that seems to be getting
>> >>>>>>>> lost. What's happening is that is
progresses as far as getting executed in
>> >>>>>>>> the IEW once, but a
delayed translation occurs, deferring the store. By
>> >>>>>>>> the time
it reenters the IEW, the IQ has marked the instruction as
>> >>>>>>>>
squashed. Everything progresses as usual from here on out, with one
>>
>>>>>>>> exception. When the instruction is removed from the CPUs
instruction list,
>> >>>>>>>> there is one reference count hanging.
>>
>>>>>>>> I've added in some additional debugging for my traces to
help
>> >>>>>>>> narrow down where this reference is coming from. As far
as I can tell,
>> >>>>>>>> it's because of a call to initiateAcc()
within the executeStore function in
>> >>>>>>>> the lsq unit. Please see
the following two traces. The first trace shows
>> >>>>>>>> what I just
discussed. The second trace is another memory store
>> >>>>>>>>
instruction that got squashed, however, it was squashed upon its
first
>> >>>>>>>> entry into the IEW, therefore it never started
execution.
>> >>>>>>>>
http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out [11]
>>
>>>>>>>> http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
[12]
>> >>>>>>>> Let me know if you have any ideas based on these two
instruction
>> >>>>>>>> traces. I do not understand how the initiateAcc
function results in
>> >>>>>>>> another reference, but maybe someone
else does.... Since I don't see how
>> >>>>>>>> it makes a reference,
it's hard to find out how to make sure it gets
>> >>>>>>>>
dereferenced...
>> >>>>>>>> Unfortunately, I haven't been able to add a
DPRINTF in
>> >>>>>>>> src/base/refcnt.hh ...this would make things more
clear (i.e. exactly when
>> >>>>>>>> references/deferences occur). Let
me know if you have any advice on
>> >>>>>>>> this...if it's possible. I
can't seem to get the right include files, and
>> >>>>>>>> likely right
SConscript compile order...
>> >>>>>>>> Thanks,
>> >>>>>>>> Andrew
>>
>>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Sat, Apr 7, 2012 at 9:48 PM, Gabe
Black <***@eecs.umich.edu [13]>wrote:
>> >>>>>>>>
>> >>>>>>>>> Without digging into things too deeply, it looks like you may be
>> >>>>>>>>> leaking references to dynamic instructions. The CPU may think it's
>> >>>>>>>>> done with one, but until that final reference is removed, the object
>> >>>>>>>>> will hang around forever. I think I've had problems before where the
>> >>>>>>>>> reference count ended up off by one somehow and instructions would
>> >>>>>>>>> start piling up. It's also possible that a clog develops in O3's
>> >>>>>>>>> pipeline and some internal structure stops letting instructions
>> >>>>>>>>> through and starts accumulating them. Either of these problems will
>> >>>>>>>>> be annoying to track down, but with enough digging I've been able to
>> >>>>>>>>> fix these sorts of things.
>> >>>>>>>>>
>> >>>>>>>>> This may have more to do with O3 not handling the benchmark you're
>> >>>>>>>>> running well rather than a problem with your new DRAM model. There
>> >>>>>>>>> may be some interaction between the two, though, where the new
>> >>>>>>>>> memory makes the timing line up to cause O3 to behave poorly. What
>> >>>>>>>>> you can do is instrument dynamic instruction creation and
>> >>>>>>>>> destruction and reference counting (try printing "this" for both the
>> >>>>>>>>> reference counting wrapper and the dyn inst itself) and turn it on
>> >>>>>>>>> as close as you can to where things go bad tick-wise. Then look for
>> >>>>>>>>> an instruction which gets lost, and look for where its reference
>> >>>>>>>>> count is incremented and decremented. It should be relatively easy
>> >>>>>>>>> to pair up where references are created and destroyed, and you
>> >>>>>>>>> should be able to identify the reference which never goes away. Then
>> >>>>>>>>> you need to figure out where that reference is being created. After
>> >>>>>>>>> that, you should have enough information to identify why the
>> >>>>>>>>> reference counting isn't being done correctly. It's arduous, but
>> >>>>>>>>> that's the only way.
>> >>>>>>>>>
>> >>>>>>>>> It's important to also make sure reference counts aren't decremented
>> >>>>>>>>> to zero prematurely. I had a problem once where that happened and
>> >>>>>>>>> the memory behind the object was updated by something that didn't
>> >>>>>>>>> know it was dead. The memory had since been reallocated to another
>> >>>>>>>>> object of the same type, so that other object reflected what
>> >>>>>>>>> happened to the phantom one. If I remember correctly, that
>> >>>>>>>>> manifested as something weird like an add causing a page fault or
>> >>>>>>>>> something.
>> >>>>>>>>>
>> >>>>>>>>> Gabe
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> On 04/07/12 18:21,
Andrew Cebulski wrote:
>> >>>>>>>>>
>> >>>>>>>>> Hi all,
>> >>>>>>>>>
I've looked into this problem some more, and have put together a
>>
>>>>>>>>> couple traces. I've been becoming more familiar with how gem5
handles
>> >>>>>>>>> dynamic instructions, in particular how it destroys
them. I have two
>> >>>>>>>>> traces to compare, one with the physical
memory, and the other with the
>> >>>>>>>>> integrated dramsim2 dram
memory. I also have two plots showing instruction
>> >>>>>>>>> counts
over time (sim ticks). All of these are linked at the end of the
>>
>>>>>>>>> email.
>> >>>>>>>>> First, I'm going to go into what I've been
able to interpret
>> >>>>>>>>> regarding how instructions are destroyed.
In particular, comparing when
>> >>>>>>>>> DynInst's vs. DynInstPtr's
are deconstructed/removed from the cpu. I
>> >>>>>>>>> separate these
because I've seen a difference, as I discuss later. These
>> >>>>>>>>>
explanations are fairly non-existent on the wiki. There is a section
>>
>>>>>>>>> header waiting to be filled...
>> >>>>>>>>> From what I have been able to gather from the code, there is a list
>> >>>>>>>>> of all the instructions in flight in cpu/o3/cpu.cc called instList,
>> >>>>>>>>> with the type DynInstPtr. There are three conditions to instructions
>> >>>>>>>>> being cleaned from this list:
>> >>>>>>>>> 1.) The ROB retires its head instruction
>> >>>>>>>>> 2.) Fetch receives a rob squashing signal from the commit,
>> >>>>>>>>> resulting in removing any instruction not in the ROB
>> >>>>>>>>> 3.) Decode detects an incorrect branch prediction, resulting in
>> >>>>>>>>> removal of all instructions back to the bad seq num.
>> >>>>>>>>> Once all five stages have completed, the CPU cleans up all the
>> >>>>>>>>> removed in-flight instructions. This line in particular in
>> >>>>>>>>> cleanUpRemovedInsts() in cpu/o3/cpu.cc deconstructs a DynInstPtr:
>> >>>>>>>>> instList.erase(removeList.front());
>> >>>>>>>>> When I turn
on the debug flag O3CPU, I see the message "Removing
>> >>>>>>>>>
instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum and
pcState
>> >>>>>>>>> after all 5 cpu stages have completed, and one of
the conditions above is
>> >>>>>>>>> met. I also see what tick it occurs
on.
>> >>>>>>>>> When I turn on the DynInst debug flag, I see when
instructions are
>> >>>>>>>>> created and destroyed
(cpu/base_dyn_inst_impl.hh) and what tick. From
>> >>>>>>>>> analyzing
the trace files, I've gathered that this takes into account that
>>
>>>>>>>>> instructions have different execution lengths. So if one tick
a memory
>> >>>>>>>>> instruction in the instList (DynInstPtr) is
removed, the DynInst for that
>> >>>>>>>>> memory instruction will occur
much later (i.e. 1M ticks later). I have yet
>> >>>>>>>>> to determine
how this is implemented.
>> >>>>>>>>> Now for the problem.
>> >>>>>>>>>
What I'm seeing when I run dramsim2 dram memory is a significant
>>
>>>>>>>>> difference between the size of the instList vector (of
DynInstPtr objects),
>> >>>>>>>>> and the size of dynamic instruction
count (of DynInst objects). The
>> >>>>>>>>> benchmark I'm running is
libquantum from SPEC 2006. For the first roughly
>> >>>>>>>>> 130B
ticks, the dynamic instruction count kept in
cpu/base_dyn_inst.impl.hh
>> >>>>>>>>> shadows the instList size in
o3/cpu.cc (figure linked below) very closely.
>> >>>>>>>>> Around tick
130B after libquantum started, it starts hitting what I'm
>> >>>>>>>>>
assuming are loops (therefore branch prediction), resulting in some
>>
>>>>>>>>> behavior that seems to imply improper instruction handling
(i.e. more
>> >>>>>>>>> instructions in flight than allowed by ROB).
>>
>>>>>>>>> I wasn't able to sync-up the physical and dramsim2 traces
exactly by
>> >>>>>>>>> trace, but they should represent roughly the
same area of execution. They
>> >>>>>>>>> don't execute the same due to
the dramsim2 modeling the memory differently
>> >>>>>>>>> (i.e. latency
and other delays).
>> >>>>>>>>> I've shared both traces on my public
Dropbox here --
>> >>>>>>>>>
>> >>>>>>>>>
http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
[14]
>> >>>>>>>>>
>> >>>>>>>>>
http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
[15]
>> >>>>>>>>> Here are a couple plots of tick versus instruction
count, with
>> >>>>>>>>> respect to cpu->instcount in
cpu/base_dyn_inst.impl.hh and instList.size()
>> >>>>>>>>> in
cpu/o3/cpu.cc. --
>> >>>>>>>>>
>> >>>>>>>>>
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
[16]
>> >>>>>>>>>
>> >>>>>>>>>
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[17]
>> >>>>>>>>> Note that I added the printout of the instList size to
an existing
>> >>>>>>>>> O3CPU DPRINTF in cleanUpRemovedInsts() in
cpu/o3/cpu.cc.
>> >>>>>>>>> Here are the commands I ran to parse the traces into data files to
>> >>>>>>>>> analyze in MATLAB and create the plots:
>> >>>>>>>>> zgrep DynInst dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | grep destroyed | awk '{print $1,$11}' > cpuinstcount.out
>> >>>>>>>>> zgrep instList dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | awk '{print $1,$11}' > instlistsize.out
>> >>>>>>>>> It seems
to me like the problem might lie in gem5, but has just been
>> >>>>>>>>>
exposed by integrating this more detailed memory model, dramsim2,
into
>> >>>>>>>>> gem5. Either that, or their are some timing errors in
how dramsim2 was
>> >>>>>>>>> integrated. I doubt this, however, since
those first 190B ticks executed
>> >>>>>>>>> used the dramsim2 memory. I
believe the problem is a combination of memory
>> >>>>>>>>> instructions
+ complex loops (branch prediction), resulting in improper
>> >>>>>>>>>
destroying of instructions.
>> >>>>>>>>> I've included the ROB, Commit,
Fetch, DynInst and O3CPU debug flags.
>> >>>>>>>>> Their are 192 ROB
entries, which is why the instList size generally has a
>> >>>>>>>>> max
of about 192 instructions. The dynamic instruction counts (seen in
the
>> >>>>>>>>> dramsim2 plot) seem to also imply that instructions are
incorrectly been
>> >>>>>>>>> removed from the ROB, and then from the
cpu's instruction list in cpu.cc,
>> >>>>>>>>> which allows more and
more instructions to be added to the system (possibly
>> >>>>>>>>> from
a bad branch).
>> >>>>>>>>> I appreciate any help in debugging this and
further figuring out the
>> >>>>>>>>> root problem, just let me know if
you need anything else from me. I don't
>> >>>>>>>>> have much more time
at the moment to debug, but I can take any advice for
>> >>>>>>>>> quick
changes and/or additional traces, then send the results back to the
>>
>>>>>>>>> list for discussion.
>> >>>>>>>>> Thanks,
>> >>>>>>>>>
Andrew
>> >>>>>>>>> P.S. Paul - I did try decreasing the size of the
dramsim2
>> >>>>>>>>> transaction (and even command) queue from 512 to
32. The same instructions
>> >>>>>>>>> problem occurred. It basically
just decreased the execution time.
>> >>>>>>>>>
>> >>>>>>>>> On Wed, Mar
14, 2012 at 2:10 PM, Ali Saidi <***@umich.edu [18]> wrote:
>>
>>>>>>>>>
>> >>>>>>>>>> The error is that there are more that 1500
instructions currently
>> >>>>>>>>>> in flight in the system. It could
mean several things:
>> >>>>>>>>>>
>> >>>>>>>>>> 1. The value is
somewhat arbitrarily defined and maybe there are
>> >>>>>>>>>> more than
1500 in your system at one time?
>> >>>>>>>>>>
>> >>>>>>>>>> 2.
Instructions aren't being destroyed correctly
>> >>>>>>>>>>
>>
>>>>>>>>>> You could try to to run a debug binary so you'll get a list
of
>> >>>>>>>>>> instructions when it happens or increase the number
which may
>> >>>>>>>>>> be appropriate for certain situations (but 1500
is quite a few inflight
>> >>>>>>>>>> instructions).
>> >>>>>>>>>>
>>
>>>>>>>>>> Ali
>> >>>>>>>>>>
>> >>>>>>>>>> On 13.03.2012 10:56, Andrew
Cebulski wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> Hi Xiangyu,
>> >>>>>>>>>> I
just started looking into this some more. So at first I
>> >>>>>>>>>>
thought it was due to updating to a more recent revision, but then I
went
>> >>>>>>>>>> back to revision 8643, added your patch, built and
ran....and now get the
>> >>>>>>>>>> error with it too (when running
ARM_FS/gem5.opt). I"m testing now to see
>> >>>>>>>>>> if an update to
SWIG might have resulted in this error, maybe someone on
>> >>>>>>>>>>
the mailing list would know if that's possible. The difference is
1.3.40
>> >>>>>>>>>> vs. 2.0.3, both of which are supported according to
the dependencies wiki
>> >>>>>>>>>> page.
>> >>>>>>>>>> Just for
completeness, here's the error from revision 8643:
>> >>>>>>>>>>
build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
>> >>>>>>>>>>
BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion
`cpu->instcount
>> >>>>>>>>>> I have not tried running with gem5.debug,
so I will be doing
>> >>>>>>>>>> that today. Maybe this is an assertion
that is occurring due to an
>> >>>>>>>>>> optimization. That would mean
it wouldn't be triggered in gem5.debug since
>> >>>>>>>>>> it runs
without optimizations. Have you tested all debug, opt and fast
>>
>>>>>>>>>> with your tests?
>> >>>>>>>>>> Thanks,
>> >>>>>>>>>>
Andrew
>> >>>>>>>>>>
>> >>>>>>>>>> On Tue, Mar 13, 2012 at 1:37 PM, Rio
Xiangyu Dong <
>> >>>>>>>>>> ***@gmail.com [19]> wrote:
>>
>>>>>>>>>>
>> >>>>>>>>>>> Hi Andrew,
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>>
>>>>>>>>>>>
>> >>>>>>>>>>> I didn?t see this error in my simulations.
May I ask which gem5
>> >>>>>>>>>>> version you are using? I find some
of the latest code updates do not comply
>> >>>>>>>>>>> with my changes.
I am still using the DRAMsim2 patch on Gem5 repo8643, and
>> >>>>>>>>>>>
have run all the runnable benchmarks in SPEC2006, SPEC2000, EEMBC2,
and
>> >>>>>>>>>>> PARSEC2 on ARM_SE.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>>
>>>>>>>>>>>
>> >>>>>>>>>>> Thank you!
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>>
>>>>>>>>>>>
>> >>>>>>>>>>> Best,
>> >>>>>>>>>>>
>> >>>>>>>>>>>
Xiangyu
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
*From:* Andrew Cebulski [mailto:***@drexel.edu [20]]
>> >>>>>>>>>>>
*Sent:* Thursday, March 08, 2012 6:52 PM
>> >>>>>>>>>>>
>> >>>>>>>>>>>
*To:* gem5 users mailing list
>> >>>>>>>>>>> *Cc:****@gmail.com
[21]; ***@umich.edu [22]
>> >>>>>>>>>>>
>> >>>>>>>>>>> *Subject:* Re:
[gem5-users] A Patch for DRAMsim2 Integration
>> >>>>>>>>>>>
>>
>>>>>>>>>>> Xiangyu,
>> >>>>>>>>>>>
>> >>>>>>>>>>> I've been having an
issue recently with the number of
>> >>>>>>>>>>> instructions I've been
seeing committed to the CPU (I have a separate
>> >>>>>>>>>>> thread on
this). It turns out the issue seems to be coming from this patch
>>
>>>>>>>>>>> you created to integrate DramSim2 with Gem5. Unfortunately,
I've been
>> >>>>>>>>>>> running with gem5.fast, not gem5.opt. So up
until now, I haven't been
>> >>>>>>>>>>> seeing assertions. I thought
I'd run it with gem5.opt or debug back in
>> >>>>>>>>>>> December, but I
must not have. My runs on the Arm O3 cpu fails with this
>> >>>>>>>>>>>
assertion:
>> >>>>>>>>>>>
>> >>>>>>>>>>>
build/ARM/cpu/base_dyn_inst_impl.hh:149: void
>> >>>>>>>>>>>
BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion
`cpu->instcount
>> >>>>>>>>>>>
>> >>>>>>>>>>> -Andrew
>> >>>>>>>>>>>
>>
>>>>>>>>>>> Date: Sun, 18 Dec 2011 01:48:58 -0800
>> >>>>>>>>>>> From:
"Dong, Xiangyu" <***@gmail.com [23]>
>> >>>>>>>>>>> To: "gem5
users mailing list" <gem5-***@gem5.org [24]>
>> >>>>>>>>>>> Subject:
[gem5-users] A Patch for DRAMsim2 Integration
>> >>>>>>>>>>> Message-ID:
gmail.com [25]>
>> >>>>>>>>>>>
>> >>>>>>>>>>> Content-Type: text/plain;
charset="us-ascii"
>> >>>>>>>>>>>
>> >>>>>>>>>>> Hi all,
>>
>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> I have a
Gem5+DRAMsim2 patch. I've tested it under both SE and FS
>> >>>>>>>>>>>
modes.
>> >>>>>>>>>>> I'm willing to share it here.
>> >>>>>>>>>>>
>>
>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> For those who have such needs,
please go to my website
>> >>>>>>>>>>> www.cse.psu.edu/~xydong [26]
<http://www.cse.psu.edu/%7Exydong [27]> to
>> >>>>>>>>>>> download the
patch and test it. To enable
>> >>>>>>>>>>> DRAMSim2, use se_dramsim2.py
script instead of se.py (for FS, you
>> >>>>>>>>>>> can create
>>
>>>>>>>>>>> by yourself). The basic idea to enable the DRAMsim2 module
is to
>> >>>>>>>>>>> use the
>> >>>>>>>>>>> derived DRAMMemory class
instead of PhysicalMemory class.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>>
>>>>>>>>>>>
>> >>>>>>>>>>> Please let me know if there are bugs.
>>
>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thank you!
>>
>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> Best,
>>
>>>>>>>>>>>
>> >>>>>>>>>>> Xiangyu Dong
>> >>>>>>>>>>>
>> >>>>>>>>>>>
-------------- next part --------------
>> >>>>>>>>>>> An HTML
attachment was scrubbed...
>> >>>>>>>>>>> URL: <
>> >>>>>>>>>>>
http://m5sim.org/cgi-bin/mailman/private/gem5-users/attachments/20111218/f3fdf5da/attachment.html
[28]
>> >>>>>>>>>>> >
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>
_______________________________________________
>> >>>>>>>>>> gem5-users
mailing list
>> >>>>>>>>>> gem5-***@gem5.org [29]
>> >>>>>>>>>>
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users [30]




Links:
------
[1] mailto:***@drexel.edu
[2]
mailto:***@drexel.edu
[3] mailto:***@umich.edu
[4]
mailto:***@umich.edu
[5]
http://dl.dropbox.com/u/2953302/gem5/err.0
[6]
mailto:***@umich.edu
[7]
http://dl.dropbox.com/u/2953302/gem5/tlb.out
[8]
mailto:***@umich.edu
[9]
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[10]
mailto:***@drexel.edu
[11]
http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
[12]
http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
[13]
mailto:***@eecs.umich.edu
[14]
http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
[15]
http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
[16]
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
[17]
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[18]
mailto:***@umich.edu
[19] mailto:***@gmail.com
[20]
mailto:***@drexel.edu
[21] mailto:***@gmail.com
[22]
mailto:***@umich.edu
[23] mailto:***@gmail.com
[24]
mailto:gem5-***@gem5.org
[25] http://gmail.com
[26]
http://www.cse.psu.edu/~xydong
[27]
http://www.cse.psu.edu/%7Exydong
[28]
http://m5sim.org/cgi-bin/mailman/private/gem5-users/attachments/20111218/f3fdf5da/attachment.html
[29]
mailto:gem5-***@gem5.org
[30]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[31]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[32]
mailto:gem5-***@gem5.org
[33]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[34]
mailto:gem5-***@gem5.org
[35]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[36]
mailto:gem5-***@gem5.org
[37]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[38]
mailto:gem5-***@gem5.org
[39]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[40]
mailto:gem5-***@gem5.org
[41]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[42]
mailto:gem5-***@gem5.org
[43]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[44]
mailto:gem5-***@gem5.org
[45]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[46]
http://dl.dropbox.com/u/2953302/gem5/table_walker.out
[47]
http://dl.dropbox.com/u/2953302/gem5/pendingQueueSize.png
[48]
mailto:***@umich.edu
Andrew Cebulski
2012-05-14 15:15:32 UTC
Permalink
Ali,

Looking at the trace file for the TLB walker that I sent earlier, I see a
considerable number of these faults:

L2 descriptor invalid, causing fault

This is within the doL2Descriptor function in tablewalker.cc.

Here's a look at the frequency of these faults, with bins centered around
the base of each rise/fall of the pendingQueue size (see small arrows on
x-axis):

http://dl.dropbox.com/u/2953302/gem5/L2faults.png
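
A fault-frequency count like the one plotted above can be approximated with a short script. This is only a sketch, not the script used for the plot: it assumes each trace line has the usual gem5 shape "<tick>: <object name>: <message>", and the object name system.cpu.dtb.walker in the sample data is made up for illustration.

```python
import re
from collections import Counter

# Assumed trace line shape: "<tick>: <object name>: <message>".
TRACE_LINE = re.compile(r"^\s*(\d+):\s*([^:]+):\s*(.*)$")
FAULT_TEXT = "L2 descriptor invalid"


def fault_histogram(lines, bin_width):
    """Count fault messages per tick bin; keys are the bin start ticks."""
    bins = Counter()
    for line in lines:
        match = TRACE_LINE.match(line)
        if match is None:
            continue  # not a trace line (for example guest console output)
        tick, _obj, message = match.groups()
        if FAULT_TEXT in message:
            bins[(int(tick) // bin_width) * bin_width] += 1
    return dict(sorted(bins.items()))


# Hypothetical sample lines; real traces would be read from the .out file.
sample = [
    "1000: system.cpu.dtb.walker: L2 descriptor invalid, causing fault",
    "1500: system.cpu.dtb.walker: L2 descriptor invalid, causing fault",
    "2100: system.cpu.dtb.walker: Starting hardware table walker",
    "3200: system.cpu.dtb.walker: L2 descriptor invalid, causing fault",
]
histogram = fault_histogram(sample, bin_width=1000)
print(histogram)
```

Fixed-width bins are used here for simplicity; the plot above uses bins aligned to the rises and falls of the pendingQueue size, which would need the bin edges supplied by hand.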

I'm still looking into how this fault is handled, along with your other
questions. I probably won't have much of a chance to get into it more
until late today or tomorrow though. Let me know if you have any new ideas
based on these results.
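
One way to follow up on the suggestion to trace instruction lifetimes by serial number is to group trace lines by their [sn:N] tag and list the sequence numbers that never report destruction. The sketch below assumes the "[sn:N]" tag that O3 debug output carries and a "destroyed" keyword in the DynInst message; the sample lines and their message text are invented for illustration.

```python
import re
from collections import defaultdict

TICK = re.compile(r"^\s*(\d+):")
SEQ_NUM = re.compile(r"\[sn:(\d+)\]")


def group_by_seq_num(lines):
    """Map each sequence number to its (tick, line) events in trace order."""
    events = defaultdict(list)
    for line in lines:
        tick = TICK.match(line)
        if tick is None:
            continue  # skip lines that do not start with a tick
        for seq_num in SEQ_NUM.findall(line):
            events[int(seq_num)].append((int(tick.group(1)), line.strip()))
    return events


def never_destroyed(events, marker="destroyed"):
    """Return sequence numbers whose history never contains the marker."""
    return sorted(
        seq_num
        for seq_num, history in events.items()
        if not any(marker in text for _tick, text in history)
    )


# Hypothetical trace excerpt; message wording is an assumption.
sample = [
    "100: system.cpu: DynInst: [sn:42] Instruction created.",
    "100: system.cpu: DynInst: [sn:43] Instruction created.",
    "250: system.cpu.iew: [sn:43] Translation delayed, deferring store.",
    "300: system.cpu: DynInst: [sn:42] Instruction destroyed.",
]
events = group_by_seq_num(sample)
leaked = never_destroyed(events)
print(leaked)
```

The per-instruction history in events can then be printed for any suspicious sequence number to see whether it was squashed and where it last appeared.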

Thanks,
Andrew

On Fri, May 11, 2012 at 12:17 AM, Ali Saidi <***@umich.edu> wrote:

>
> Hi Andrew,
>
> Looking at the trace it seems like there are a lot of invalid translations
> that are occurring. Everything to an address less than 0x1000 is likely
> invalid. An invalid translation will return a fault (setting the fault
> pointer in the dynamic instruction to something other than NoFault), and the
> instruction will either be squashed by a mispredicted branch or redirect
> fetch to a kernel handler. I'm wondering if that isn't happening for some
> reason. You need to trace back some of these translations, find the
> instruction serial number for each, and then see what the instruction's
> lifetime is like. Are they getting squashed? Looking at your graph, when
> the instructions fall to 0, what is the cause? Does an interrupt occur
> right before? Something else?
>
>
>
> Thanks,
>
> Ali
>
>
>
>
>
> On 07.05.2012 20:53, Andrew Cebulski wrote:
>
> Hi Ali and Gabe,
> Here's the trace file:
> http://dl.dropbox.com/u/2953302/gem5/table_walker.out
> The pending queue size in the table walker follows the shape of the
> dynamic instruction curves. The L1 and L2 queue size never go above 0.
> Comparing DynInst count in cpu->instcount with pendingQueue size:
> http://dl.dropbox.com/u/2953302/gem5/pendingQueueSize.png
>
> -Andrew
>
> On Sun, May 6, 2012 at 12:01 PM, Ali Saidi <***@umich.edu> wrote:
>
>> Hi Andrew,
>>
>> Could you add some code to the table walker to see how big the following
>> are getting:
>> stateQueueL1.size()
>> stateQueueL2.size()
>> pendingQueue.size()
>>
>> Perhaps we're somehow getting into a loop where there are a lot of
>> translations to invalid addresses that get squashed and they pile up in the
>> table walker?
>>
>> Thanks,
>> Ali
>>
>>
>>
>> On May 4, 2012, at 7:53 AM, Gabriel Michael Black wrote:
>>
>> > I haven't had a chance to study what's going on here, but could the
>> > problem be that we don't have bandwidth limits/back pressure implemented
>> > for the TLB and delayed translation? It could be that the CPU is pumping
>> > instructions into translation which eventually drain out/are squashed,
>> > and if too many accumulate they trip that assert.
>> >
>> > That may not actually make any sense as far as what the code is
>> > actually doing, but it occurred to me as a possibility and I thought I'd
>> > throw it out there.
>> >
>> > Gabe
>> >
>> > Quoting Andrew Cebulski <***@drexel.edu>:
>> >
>> >> I double-checked by looking at the config.ini file. It turns out I did
>> >> actually create the checkpoint with an Atomic CPU without caches.
>> >> Sorry for the confusion.
>> >>
>> >> -Andrew
>> >>
>> >> On Wed, May 2, 2012 at 10:12 PM, Andrew Cebulski <***@drexel.edu> wrote:
>> >>
>> >>> I started hitting this assertion (that the number of insts in flight
>> >>> was > 1500) before I started using a checkpoint. I created the
>> >>> checkpoint afterwards to decrease the time needed to run simulations
>> >>> to debug this problem. I'll create a new checkpoint, then send the
>> >>> new trace output.
>> >>>
>> >>> -Andrew
>> >>>
>> >>>
>> >>> On Wed, May 2, 2012 at 9:53 PM, Ali Saidi <***@umich.edu> wrote:
>> >>>
>> >>>>
>> >>>> It's likely the cause for all of your problems. Dirty data in the
>> >>>> caches doesn't get restored either. You should always create
>> >>>> checkpoints with an atomic cpu and without caches.
>> >>>>
>> >>>>
>> >>>>
>> >>>> Ali
>> >>>>
>> >>>>
>> >>>>
>> >>>> On 02.05.2012 21:23, Andrew Cebulski wrote:
>> >>>>
>> >>>> Sorry, I created the checkpoint I referred to with an O3 CPU with
>> >>>> caches. From what I recall reading, caches don't get restored from
>> >>>> checkpoints. Since the checkpoint wasn't during the benchmark run, I
>> >>>> assumed that was okay.
>> >>>> -Andrew
>> >>>>
>> >>>> On Wed, May 2, 2012 at 9:07 PM, Ali Saidi <***@umich.edu> wrote:
>> >>>>
>> >>>>> You haven't answered the question about if you created the
>> >>>>> checkpoints with an atomic cpu without caches.
>> >>>>>
>> >>>>> Ali
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On 02.05.2012 19:58, Andrew Cebulski wrote:
>> >>>>>
>> >>>>> I have not run with the checker CPU recently. Here's the stderr
>> >>>>> output from a run I did awhile back:
>> >>>>> http://dl.dropbox.com/u/2953302/gem5/err.0
>> >>>>> Note that the instruction match error is before my benchmark
>> >>>>> actually starts running. The start of my boot script checks to see if
>> >>>>> my files image is mounted (which it is), then continues on to run the
>> >>>>> benchmark. I booted the system, mounted my files image, then took a
>> >>>>> checkpoint. I've been running all my tests from that checkpoint. I
>> >>>>> found where my benchmark started based on the ASID (from ExecAsid
>> >>>>> debug flag).
>> >>>>> I delayed the start of gathering trace data until the second-to-last
>> >>>>> linear increase in dynamic instructions in-flight. I'm running a new
>> >>>>> trace now.
>> >>>>> -Andrew
>> >>>>>
>> >>>>>
>> >>>>> On Wed, May 2, 2012 at 5:28 PM, Ali Saidi <***@umich.edu> wrote:
>> >>>>>
>> >>>>>> Something is wrong well before this point. There is no reason that
>> >>>>>> address 0x0 or 0x4 should be translated.
>> >>>>>>
>> >>>>>> Did you happen to create a checkpoint when caches were in the
>> >>>>>> system?
>> >>>>>>
>> >>>>>> Have you tried to run with the checker cpu and see if it detects
>> >>>>>> any errors?
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> Ali
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On 02.05.2012 17:22, Andrew Cebulski wrote:
>> >>>>>>
>> >>>>>> They are data TLB misses that occur as the in-flight instruction
>> >>>>>> count rises (at 0x0 and 0x4). The last TLB miss before the in-flight
>> >>>>>> instruction count finally linearly decreases is to 0x200. Also, at
>> >>>>>> the start of the rising slope, I see a miss to 0x8 and 0x2508c.
>> >>>>>> Here's a trace file:
>> >>>>>> http://dl.dropbox.com/u/2953302/gem5/tlb.out
>> >>>>>> To reduce size, I just have lines that have either TLB or walker in
>> >>>>>> them.
>> >>>>>> I do see only a handful of instruction TLB misses.
>> >>>>>> -Andrew
>> >>>>>>
>> >>>>>> On Wed, May 2, 2012 at 11:10 AM, Ali Saidi <***@umich.edu> wrote:
>> >>>>>>
>> >>>>>>> Hi Andrew,
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Thanks for digging into this. I think there is an issue somewhere,
>> >>>>>>> but I'm still not sure where.
>> >>>>>>>
>> >>>>>>> Ali
>> >>>>>>>
>> >>>>>>> On 01.05.2012 23:34, Andrew Cebulski wrote:
>> >>>>>>>
>> >>>>>>> Okay, I'm positive now that the issue lies with delayed translations
>> >>>>>>> that are squashed before finishing.
>> >>>>>>>
>> >>>>>>> On the data or instruction side? You seem to allude to data in the
>> >>>>>>> paragraph below, but then instructions in the latter text.
>> >>>>>>>
>> >>>>>>> It seems to me like speculative load/stores are being executed,
>> >>>>>>> rather than waiting for the instructions to commit. Once the
>> >>>>>>> instructions begin getting (speculatively) executed in the TLB, a
>> >>>>>>> reference is left there, which seems hard to root out and
>> >>>>>>> dereference after the instruction ends up being squashed. At least,
>> >>>>>>> I have not been able to find that out in the source code as of yet.
>> >>>>>>> Can anyone clarify on this?
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> There should only be one translation outstanding from each
>> >>>>>>> instruction and data side walker. Any nested transactions should be
>> >>>>>>> queued in the walker. Until one finishes, I'm not sure how multiple
>> >>>>>>> would ever be outstanding.
>> >>>>>>>
>> >>>>>>> Recall the following image that shows how the number of dynamic
>> >>>>>>> instruction (DynInst) objects in-flight increases linearly for
>> >>>>>>> varying periods of time:
>> >>>>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>> >>>>>>> After enabling the TLB debug flag, I see that the linear increase in
>> >>>>>>> instructions in flight is proportional to the number of TLB misses.
>> >>>>>>> These TLB misses have a much larger delay (resulting in translation
>> >>>>>>> delays) due to the fact that DramSim2 models the memory system more
>> >>>>>>> accurately. It seems that with the classic memory system, TLB misses
>> >>>>>>> often do not have translation delays. For whatever reason, it would
>> >>>>>>> also seem that every instruction that has a TLB miss also is
>> >>>>>>> eventually squashed...
>> >>>>>>>
>> >>>>>>> From a data side perspective this is reasonable. While a miss is
>> >>>>>>> outstanding at some point instructions will stop committing and thus
>> >>>>>>> the instructions in flight will begin to rise until the miss is
>> >>>>>>> satisfied.
>> >>>>>>>
>> >>>>>>> Here's a summary of outputs from my trace. These two DPRINTF
>> >>>>>>> messages appear on the rising slopes (repeated up until the peak):
>> >>>>>>> TLB Miss: Starting hardware table walker for 0(656)
>> >>>>>>> TLB Miss: Starting hardware table walker for 0x4(656)
>> >>>>>>>
>> >>>>>>> This is interesting/odd. I don't know a good reason why (1) a miss
>> >>>>>>> would be outstanding to both address 0 and address 4 at the same
>> >>>>>>> time. In almost all cases these pages are marked as no-access to
>> >>>>>>> detect segfaults. Perhaps there is an issue where the cpu is getting
>> >>>>>>> into a loop faulting on a bad access and then faulting again on the
>> >>>>>>> fault handler. I could imagine this would happen if there was some
>> >>>>>>> corruption in the memory system (for example the timings in dramsim
>> >>>>>>> exposing a bug in the cache models or something).
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> At the peak, the following message appears (from fetch) almost every
>> >>>>>>> tick for (what I believe to be) every single one of the table
>> >>>>>>> walkers that were squashed.
>> >>>>>>> Fetch is waiting ITLB walk to finish!
>> >>>>>>>
>> >>>>>>> There must be another walk in flight? The instruction side will only
>> >>>>>>> have one fault outstanding at once. Successive branch mispredicts
>> >>>>>>> will re-direct fetch but there is code that catches the fact that a
>> >>>>>>> different walk completed than expected and "does the right thing."
>> >>>>>>>
>> >>>>>>> The problem is that these ITLB table walks are for instructions that
>> >>>>>>> were squashed as much as 0.3 billion cycles earlier, and have since
>> >>>>>>> been removed from the CPU's instruction list.
>> >>>>>>>
>> >>>>>>> I'm not following here.
>> >>>>>>>
>> >>>>>>> Any help will be greatly appreciated in solving this problem.
>> I've
>> >>>>>>> hit a roadblock with getting Ruby working with ARM, most likely
>> due to the
>> >>>>>>> fact that ARM has disjoint memory (x86 and Alpha do not).
>> There's the 256
>> >>>>>>> MB for physical memory, then the 64 MB for the boot loader. I
>> brought this
>> >>>>>>> up in my last email about trying to get Ruby working. Therefore,
>> I'm
>> >>>>>>> trying to get this DramSim2 integration fixed so I can start
>> modeling FS
>> >>>>>>> with DRAM memory.
>> >>>>>>>
>> >>>>>>> Brad/Steve/Nilay anyone have a suggestion on how to make this
>> work?
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Note that these problems also occur in Soplex from the Spec
>> CPU2006
>> >>>>>>> benchmark suite (also hits 1500 in-flight instructions
>> assertion). Due to
>> >>>>>>> time constraints, I haven't tested on other benchmarks.
>> >>>>>>> Thanks,
>> >>>>>>> Andrew
>> >>>>>>> On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski <
>> ***@drexel.edu>wrote:
>> >>>>>>>
>> >>>>>>>> Hey Gabe,
>> >>>>>>>> Thanks for this...very helpful. I just recently got back into
>> >>>>>>>> debugging this problem. I made a small change in
>> src/base/refcnt.hh to
>> >>>>>>>> allow me to return the current count of references to a DynInst
>> object.
>> >>>>>>>> I then modified existing DPRINTFs to also print out reference
>> >>>>>>>> counts, then added some of my own when I needed extra visibility.
>> >>>>>>>> I've found one memory store instruction that seems to be
>> getting
>> >>>>>>>> lost. What's happening is that it progresses as far as getting
>> executed in
>> >>>>>>>> the IEW once, but a delayed translation occurs, deferring the
>> store. By
>> >>>>>>>> the time it reenters the IEW, the IQ has marked the instruction
>> as
>> >>>>>>>> squashed. Everything progresses as usual from here on out, with
>> one
>> >>>>>>>> exception. When the instruction is removed from the CPUs
>> instruction list,
>> >>>>>>>> there is one reference count hanging.
>> >>>>>>>> I've added in some additional debugging for my traces to help
>> >>>>>>>> narrow down where this reference is coming from. As far as I
>> can tell,
>> >>>>>>>> it's because of a call to initiateAcc() within the executeStore
>> function in
>> >>>>>>>> the lsq unit. Please see the following two traces. The first
>> trace shows
>> >>>>>>>> what I just discussed. The second trace is another memory store
>> >>>>>>>> instruction that got squashed, however, it was squashed upon its
>> first
>> >>>>>>>> entry into the IEW, therefore it never started execution.
>> >>>>>>>> http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
>> >>>>>>>> http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
>> >>>>>>>> Let me know if you have any ideas based on these two
>> instruction
>> >>>>>>>> traces. I do not understand how the initiateAcc function
>> results in
>> >>>>>>>> another reference, but maybe someone else does.... Since I
>> don't see how
>> >>>>>>>> it makes a reference, it's hard to find out how to make sure it
>> gets
>> >>>>>>>> dereferenced...
>> >>>>>>>> Unfortunately, I haven't been able to add a DPRINTF in
>> >>>>>>>> src/base/refcnt.hh ...this would make things more clear (i.e.
>> exactly when
>> >>>>>>>> references/dereferences occur). Let me know if you have any
>> advice on
>> >>>>>>>> this...if it's possible. I can't seem to get the right include
>> files, and
>> >>>>>>>> likely right SConscript compile order...
>> >>>>>>>> Thanks,
>> >>>>>>>> Andrew
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Sat, Apr 7, 2012 at 9:48 PM, Gabe Black <
>> ***@eecs.umich.edu>wrote:
>> >>>>>>>>
>> >>>>>>>>> Without digging into things too deeply, it looks like you may be
>> >>>>>>>>> leaking references to dynamic instructions. The CPU may think
>> it's done
>> >>>>>>>>> with one, but until that final reference is removed, the object
>> will hang
>> >>>>>>>>> around forever. I think I've had problems before where the
>> reference
>> >>>>>>>>> count ended up off by one somehow and instructions would start
>> piling up.
>> >>>>>>>>> It's also possible that a clog develops in O3's pipeline and
>> some internal
>> >>>>>>>>> structure stops letting instructions through and starts
>> accumulating them.
>> >>>>>>>>> Either of these problems will be annoying to track down, but
>> with enough
>> >>>>>>>>> digging I've been able to fix these sorts of things.
>> >>>>>>>>>
>> >>>>>>>>> This may have more to do with O3 not handling the benchmark
>> you're
>> >>>>>>>>> running well rather than a problem with your new DRAM model.
>> There may be
>> >>>>>>>>> some interaction between the two, though, where the new memory
>> makes the
>> >>>>>>>>> timing line up to cause O3 to behave poorly. What you can do is
>> instrument
>> >>>>>>>>> dynamic instruction creation and destruction and reference
>> counting (try
>> >>>>>>>>> print "this" for both the reference counting wrapper and the
>> dyn inst
>> >>>>>>>>> itself) and turn it on as close as you can to where things go
>> bad tick
>> >>>>>>>>> wise. Then look for an instruction which gets lost, and look
>> for where its
>> >>>>>>>>> reference count is incremented and decremented. It should be
>> relatively
>> >>>>>>>>> easy to pair up where references are created and destroyed, and
>> you should
>> >>>>>>>>> be able to identify the reference which never goes away. Then
>> you need to
>> >>>>>>>>> figure out where that reference is being created. After that,
>> you should
>> >>>>>>>>> have enough information to identify why the reference counting
>> isn't being
>> >>>>>>>>> done correctly. It's arduous, but that's the only way.
>> >>>>>>>>>
>> >>>>>>>>> It's important to also make sure reference counts aren't
>> decremented
>> >>>>>>>>> to zero prematurely. I had a problem once where that happened
>> and the
>> >>>>>>>>> memory behind the object was updated by something that didn't
>> know it was
>> >>>>>>>>> dead. The memory had since been reallocated to another object
>> of the same
>> >>>>>>>>> type, so that other object reflected what happened to the
>> phantom one. If I
>> >>>>>>>>> remember that manifested as something weird like an add causing
>> a page
>> >>>>>>>>> fault or something.
>> >>>>>>>>>
>> >>>>>>>>> Gabe
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> On 04/07/12 18:21, Andrew Cebulski wrote:
>> >>>>>>>>>
>> >>>>>>>>> Hi all,
>> >>>>>>>>> I've looked into this problem some more, and have put together a
>> >>>>>>>>> couple traces. I've been becoming more familiar with how gem5
>> handles
>> >>>>>>>>> dynamic instructions, in particular how it destroys them. I
>> have two
>> >>>>>>>>> traces to compare, one with the physical memory, and the other
>> with the
>> >>>>>>>>> integrated dramsim2 dram memory. I also have two plots showing
>> instruction
>> >>>>>>>>> counts over time (sim ticks). All of these are linked at the
>> end of the
>> >>>>>>>>> email.
>> >>>>>>>>> First, I'm going to go into what I've been able to interpret
>> >>>>>>>>> regarding how instructions are destroyed. In particular,
>> comparing when
>> >>>>>>>>> DynInst's vs. DynInstPtr's are deconstructed/removed from the
>> cpu. I
>> >>>>>>>>> separate these because I've seen a difference, as I discuss
>> later. These
>> >>>>>>>>> explanations are fairly non-existent on the wiki. There is a
>> section
>> >>>>>>>>> header waiting to be filled...
>> >>>>>>>>> From what I have been able to gather from the code, there is a
>> list
>> >>>>>>>>> of all the instructions in flight in cpu/o3/cpu.cc called
>> instList, with
>> >>>>>>>>> the type DynInstPtr. There are three conditions to
>> instructions being
>> >>>>>>>>> cleaned from this list:
>> >>>>>>>>> 1.) The ROB retires its head instruction
>> >>>>>>>>> 2.) Fetch receives a rob squashing signal from the commit,
>> >>>>>>>>> resulting in removing any instruction not in the ROB
>> >>>>>>>>> 3.) Decode detects an incorrect branch prediction, resulting in
>> >>>>>>>>> removal of all instructions back to the bad seq num.
>> >>>>>>>>> Once all five stages have completed, the CPU cleans up all the
>> >>>>>>>>> removed in-flight instructions. This line in particular
>> >>>>>>>>> in cleanUpRemovedInsts() in cpu/o3/cpu.cc deconstructs a
>> DynInstPtr:
>> >>>>>>>>> instList.erase(removeList.front());
>> >>>>>>>>> When I turn on the debug flag O3CPU, I see the message "Removing
>> >>>>>>>>> instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum
>> and pcState
>> >>>>>>>>> after all 5 cpu stages have completed, and one of the
>> conditions above is
>> >>>>>>>>> met. I also see what tick it occurs on.
>> >>>>>>>>> When I turn on the DynInst debug flag, I see when instructions
>> are
>> >>>>>>>>> created and destroyed (cpu/base_dyn_inst_impl.hh) and what
>> tick. From
>> >>>>>>>>> analyzing the trace files, I've gathered that this takes into
>> account that
>> >>>>>>>>> instructions have different execution lengths. So if one tick
>> a memory
>> >>>>>>>>> instruction in the instList (DynInstPtr) is removed, the
>> DynInst for that
>> >>>>>>>>> memory instruction will occur much later (i.e. 1M ticks later).
>> I have yet
>> >>>>>>>>> to determine how this is implemented.
>> >>>>>>>>> Now for the problem.
>> >>>>>>>>> What I'm seeing when I run dramsim2 dram memory is a significant
>> >>>>>>>>> difference between the size of the instList vector (of
>> DynInstPtr objects),
>> >>>>>>>>> and the size of dynamic instruction count (of DynInst objects).
>> The
>> >>>>>>>>> benchmark I'm running is libquantum from SPEC 2006. For the
>> first roughly
>> >>>>>>>>> 130B ticks, the dynamic instruction count kept in
>> cpu/base_dyn_inst_impl.hh
>> >>>>>>>>> shadows the instList size in o3/cpu.cc (figure linked below)
>> very closely.
>> >>>>>>>>> Around tick 130B after libquantum started, it starts hitting
>> what I'm
>> >>>>>>>>> assuming are loops (therefore branch prediction), resulting in
>> some
>> >>>>>>>>> behavior that seems to imply improper instruction handling
>> (i.e. more
>> >>>>>>>>> instructions in flight than allowed by ROB).
>> >>>>>>>>> I wasn't able to sync-up the physical and dramsim2 traces
>> exactly by
>> >>>>>>>>> trace, but they should represent roughly the same area of
>> execution. They
>> >>>>>>>>> don't execute the same due to the dramsim2 modeling the memory
>> differently
>> >>>>>>>>> (i.e. latency and other delays).
>> >>>>>>>>> I've shared both traces on my public Dropbox here --
>> >>>>>>>>>
>> >>>>>>>>>
>> http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
>> >>>>>>>>>
>> >>>>>>>>>
>> http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
>> >>>>>>>>> Here are a couple plots of tick versus instruction count, with
>> >>>>>>>>> respect to cpu->instcount in cpu/base_dyn_inst_impl.hh and
>> instList.size()
>> >>>>>>>>> in cpu/o3/cpu.cc. --
>> >>>>>>>>>
>> >>>>>>>>>
>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
>> >>>>>>>>>
>> >>>>>>>>>
>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>> >>>>>>>>> Note that I added the printout of the instList size to an
>> existing
>> >>>>>>>>> O3CPU DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.
>> >>>>>>>>> Here are the commands I ran to parse the traces into data files
>> to
>> >>>>>>>>> analyze in MATLAB and create the plots:
>> >>>>>>>>> zgrep DynInst
>> >>>>>>>>> dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz |
>> grep destroyed
>> >>>>>>>>> | awk '{print $1,$11}' > cpuinstcount.out
>> >>>>>>>>> zgrep instList
>> >>>>>>>>> dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz |
>> awk '{print
>> >>>>>>>>> $1,$11}' > instlistsize.out
>> >>>>>>>>> It seems to me like the problem might lie in gem5, but has just
>> been
>> >>>>>>>>> exposed by integrating this more detailed memory model,
>> dramsim2, into
>> >>>>>>>>> gem5. Either that, or there are some timing errors in how
>> dramsim2 was
>> >>>>>>>>> integrated. I doubt this, however, since those first 190B
>> ticks executed
>> >>>>>>>>> used the dramsim2 memory. I believe the problem is a
>> combination of memory
>> >>>>>>>>> instructions + complex loops (branch prediction), resulting in
>> improper
>> >>>>>>>>> destroying of instructions.
>> >>>>>>>>> I've included the ROB, Commit, Fetch, DynInst and O3CPU debug
>> flags.
>> >>>>>>>>> There are 192 ROB entries, which is why the instList size
>> generally has a
>> >>>>>>>>> max of about 192 instructions. The dynamic instruction counts
>> (seen in the
>> >>>>>>>>> dramsim2 plot) seem to also imply that instructions are
>> incorrectly being
>> >>>>>>>>> removed from the ROB, and then from the cpu's instruction list
>> in cpu.cc,
>> >>>>>>>>> which allows more and more instructions to be added to the
>> system (possibly
>> >>>>>>>>> from a bad branch).
>> >>>>>>>>> I appreciate any help in debugging this and further figuring
>> out the
>> >>>>>>>>> root problem, just let me know if you need anything else from
>> me. I don't
>> >>>>>>>>> have much more time at the moment to debug, but I can take any
>> advice for
>> >>>>>>>>> quick changes and/or additional traces, then send the results
>> back to the
>> >>>>>>>>> list for discussion.
>> >>>>>>>>> Thanks,
>> >>>>>>>>> Andrew
>> >>>>>>>>> P.S. Paul - I did try decreasing the size of the dramsim2
>> >>>>>>>>> transaction (and even command) queue from 512 to 32. The same
>> instructions
>> >>>>>>>>> problem occurred. It basically just decreased the execution
>> time.
>> >>>>>>>>>
>> >>>>>>>>> On Wed, Mar 14, 2012 at 2:10 PM, Ali Saidi <***@umich.edu>
>> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> The error is that there are more than 1500 instructions
>> currently
>> >>>>>>>>>> in flight in the system. It could mean several things:
>> >>>>>>>>>>
>> >>>>>>>>>> 1. The value is somewhat arbitrarily defined and maybe there
>> are
>> >>>>>>>>>> more than 1500 in your system at one time?
>> >>>>>>>>>>
>> >>>>>>>>>> 2. Instructions aren't being destroyed correctly
>> >>>>>>>>>>
>> >>>>>>>>>> You could try to run a debug binary so you'll get a list of
>> >>>>>>>>>> instructions when it happens or increase the number which may
>> >>>>>>>>>> be appropriate for certain situations (but 1500 is quite a few
>> inflight
>> >>>>>>>>>> instructions).
>> >>>>>>>>>>
>> >>>>>>>>>> Ali
>> >>>>>>>>>>
>> >>>>>>>>>> On 13.03.2012 10:56, Andrew Cebulski wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> Hi Xiangyu,
>> >>>>>>>>>> I just started looking into this some more. So at first I
>> >>>>>>>>>> thought it was due to updating to a more recent revision, but
>> then I went
>> >>>>>>>>>> back to revision 8643, added your patch, built and ran....and
>> now get the
>> >>>>>>>>>> error with it too (when running ARM_FS/gem5.opt). I'm testing
>> now to see
>> >>>>>>>>>> if an update to SWIG might have resulted in this error, maybe
>> someone on
>> >>>>>>>>>> the mailing list would know if that's possible. The
>> difference is 1.3.40
>> >>>>>>>>>> vs. 2.0.3, both of which are supported according to the
>> dependencies wiki
>> >>>>>>>>>> page.
>> >>>>>>>>>> Just for completeness, here's the error from revision 8643:
>> >>>>>>>>>> build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
>> >>>>>>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion
>> `cpu->instcount
>> >>>>>>>>>> I have not tried running with gem5.debug, so I will be doing
>> >>>>>>>>>> that today. Maybe this is an assertion that is occurring due
>> to an
>> >>>>>>>>>> optimization. That would mean it wouldn't be triggered in
>> gem5.debug since
>> >>>>>>>>>> it runs without optimizations. Have you tested all debug, opt
>> and fast
>> >>>>>>>>>> with your tests?
>> >>>>>>>>>> Thanks,
>> >>>>>>>>>> Andrew
>> >>>>>>>>>>
>> >>>>>>>>>> On Tue, Mar 13, 2012 at 1:37 PM, Rio Xiangyu Dong <
>> >>>>>>>>>> ***@gmail.com> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>> Hi Andrew,
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> I didn't see this error in my simulations. May I ask which
>> gem5
>> >>>>>>>>>>> version you are using? I find some of the latest code updates
>> do not comply
>> >>>>>>>>>>> with my changes. I am still using the DRAMsim2 patch on Gem5
>> repo8643, and
>> >>>>>>>>>>> have run all the runnable benchmarks in SPEC2006, SPEC2000,
>> EEMBC2, and
>> >>>>>>>>>>> PARSEC2 on ARM_SE.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thank you!
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> Best,
>> >>>>>>>>>>>
>> >>>>>>>>>>> Xiangyu
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> *From:* Andrew Cebulski [mailto:***@drexel.edu]
>> >>>>>>>>>>> *Sent:* Thursday, March 08, 2012 6:52 PM
>> >>>>>>>>>>>
>> >>>>>>>>>>> *To:* gem5 users mailing list
>> >>>>>>>>>>> *Cc:****@gmail.com; ***@umich.edu
>> >>>>>>>>>>>
>> >>>>>>>>>>> *Subject:* Re: [gem5-users] A Patch for DRAMsim2 Integration
>> >>>>>>>>>>>
>> >>>>>>>>>>> Xiangyu,
>> >>>>>>>>>>>
>> >>>>>>>>>>> I've been having an issue recently with the number of
>> >>>>>>>>>>> instructions I've been seeing committed to the CPU (I have a
>> separate
>> >>>>>>>>>>> thread on this). It turns out the issue seems to be coming
>> from this patch
>> >>>>>>>>>>> you created to integrate DramSim2 with Gem5. Unfortunately,
>> I've been
>> >>>>>>>>>>> running with gem5.fast, not gem5.opt. So up until now, I
>> haven't been
>> >>>>>>>>>>> seeing assertions. I thought I'd run it with gem5.opt or
>> debug back in
>> >>>>>>>>>>> December, but I must not have. My runs on the Arm O3 cpu
>> fails with this
>> >>>>>>>>>>> assertion:
>> >>>>>>>>>>>
>> >>>>>>>>>>> build/ARM/cpu/base_dyn_inst_impl.hh:149: void
>> >>>>>>>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion
>> `cpu->instcount
>> >>>>>>>>>>>
>> >>>>>>>>>>> -Andrew
>> >>>>>>>>>>>
>> >>>>>>>>>>> Date: Sun, 18 Dec 2011 01:48:58 -0800
>> >>>>>>>>>>> From: "Dong, Xiangyu" <***@gmail.com>
>> >>>>>>>>>>> To: "gem5 users mailing list" <gem5-***@gem5.org>
>> >>>>>>>>>>> Subject: [gem5-users] A Patch for DRAMsim2 Integration
>> >>>>>>>>>>> Message-ID: gmail.com>
>> >>>>>>>>>>>
>> >>>>>>>>>>> Content-Type: text/plain; charset="us-ascii"
>> >>>>>>>>>>>
>> >>>>>>>>>>> Hi all,
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> I have a Gem5+DRAMsim2 patch. I've tested it under both SE
>> and FS
>> >>>>>>>>>>> modes.
>> >>>>>>>>>>> I'm willing to share it here.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> For those who have such needs, please go to my website
>> >>>>>>>>>>> www.cse.psu.edu/~xydong <http://www.cse.psu.edu/%7Exydong> to
>> >>>>>>>>>>> download the patch and test it. To enable
>> >>>>>>>>>>> DRAMSim2, use se_dramsim2.py script instead of se.py (for FS,
>> you
>> >>>>>>>>>>> can create
>> >>>>>>>>>>> by yourself). The basic idea to enable the DRAMsim2 module
>> is to
>> >>>>>>>>>>> use the
>> >>>>>>>>>>> derived DRAMMemory class instead of PhysicalMemory class.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> Please let me know if there are bugs.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thank you!
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> Best,
>> >>>>>>>>>>>
>> >>>>>>>>>>> Xiangyu Dong
>> >>>>>>>>>>>
>> >>>>>>>>>> _______________________________________________
>> >>>>>>>>>> gem5-users mailing list
>> >>>>>>>>>> gem5-***@gem5.org
>> >>>>>>>>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
Andrew Cebulski
2012-05-15 07:44:07 UTC
Permalink
Here is the latest in my debugging:

http://dl.dropbox.com/u/2953302/gem5/pendingQueuePushPop.png

How often doL2DescriptorWrapper runs (the function where I was seeing the
invalid faults) actually controls the size of the pendingQueue. The plot
shows where the pendingQueue size is increased (with a push_back) and where
it is decreased (pop_front). I put my DPRINTF for the decrease at the end
of the doL2DescriptorWrapper function in table_walker.cc. That is right
after the call to nextWalk, which schedules a process event (doProcessEvent,
aka processWalkWrapper()) for the next tick; the pop of the pendingQueue
happens in that event.
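
A minimal sketch of how such push/pop counts could be binned from a trace.
It assumes each trace line starts with the tick followed by a colon (gem5's
usual trace prefix) and contains a marker string. The marker text, the
sample lines, and the name count_per_bin are made up for illustration; they
are not the actual DPRINTF output used for the plot.

```python
from bisect import bisect_right

def count_per_bin(lines, marker, bin_edges):
    """Count trace lines containing `marker`, grouped by tick into bins.

    bin_edges is a sorted list of ticks; bin i covers
    bin_edges[i] <= tick < bin_edges[i + 1].
    """
    counts = [0] * (len(bin_edges) - 1)
    for line in lines:
        if marker not in line:
            continue
        tick_text, _, _ = line.partition(":")
        try:
            tick = int(tick_text.strip())
        except ValueError:
            continue  # not a trace line with a leading tick
        i = bisect_right(bin_edges, tick) - 1
        if 0 <= i < len(counts):
            counts[i] += 1
    return counts

# Hypothetical trace lines, only to show the call.
trace = [
    "100: system.cpu.dtb.walker: pendingQueue push_back, size 1",
    "150: system.cpu.dtb.walker: pendingQueue push_back, size 2",
    "220: system.cpu.dtb.walker: pendingQueue pop_front, size 1",
    "380: system.cpu.dtb.walker: pendingQueue pop_front, size 0",
]
edges = [0, 200, 400]
pushes = count_per_bin(trace, "push_back", edges)
pops = count_per_bin(trace, "pop_front", edges)
```

Choosing the bin edges at the base of each rise/fall, as in the plot, makes
the per-bin counts directly comparable.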

My first bin is large, just to show how the push/pop rate roughly averages
out at the start of the plot (there is still an imbalance...just smaller
grained). Bins where no push_backs are visible contain fewer than 20 of
them. Note how the difference between the pushes and pops in each bin is
roughly the peak of each rise/fall. I'm still trying to debug why the
imbalance in the pendingQueue/L2 function calls is occurring...namely at the
changes from rise to fall in the size, but I seem to be narrowing in on it.

Basically, it looks like there isn't a limit in place for the size of the
TLB, therefore no stalls are being sent to stop more TLB transactions
from initiating. The invalid accesses are likely a result of this too.
Looking more closely in my traces, it looks like the L2 descriptor invalid
errors start occurring once the pendingQueue increases above roughly 8
entries.
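
For illustration, here is a toy model of the kind of limit that appears to
be missing: a pending queue with a fixed capacity that refuses new requests
when full, so the requester has to stall and retry. This is only a sketch
under assumptions. The class BoundedWalkQueue and its methods are
hypothetical and are not gem5 code, and the capacity of 8 is borrowed from
the rough threshold mentioned above purely as an example value.

```python
from collections import deque

class BoundedWalkQueue:
    """Toy model of a pending-walk queue that refuses work when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = deque()
        self.stalls = 0

    def try_push(self, vaddr):
        """Queue a walk; return False (caller must stall and retry) if full."""
        if len(self.pending) >= self.capacity:
            self.stalls += 1
            return False
        self.pending.append(vaddr)
        return True

    def pop(self):
        """Finish the oldest walk, freeing a slot; None if the queue is empty."""
        return self.pending.popleft() if self.pending else None

queue = BoundedWalkQueue(capacity=8)  # 8 is an illustrative value only
accepted = [queue.try_push(addr) for addr in range(0, 40, 4)]  # 10 requests
```

With a bound like this the queue cannot grow past its capacity; the cost is
that the requester needs a retry path for rejected translations.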

Here are the sizes of each bin (N1 is the push_back count, N2 the L2
function count):

Bin    N1     N2
  1   867    788
  2    11    205
  3   388     95
  4    11    300
  5   775    189
  6     3    588
  7  1535    376
  8    17   1184
  9  2127    751
 10     0      1
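
The per-bin difference between the two counts can be checked with a few
lines of Python. This is a small sketch that only subtracts the bin sizes
listed above; the variable names are arbitrary.

```python
# Bin counts copied from the bin sizes above.
n1 = [867, 11, 388, 11, 775, 3, 1535, 17, 2127, 0]      # push_back calls
n2 = [788, 205, 95, 300, 189, 588, 376, 1184, 751, 1]   # L2 function calls

# Positive values mean more pushes than L2 calls in that bin.
net = [pushes - l2_calls for pushes, l2_calls in zip(n1, n2)]

for i, value in enumerate(net, start=1):
    print(f"bin {i:2d}: {value:+d}")
```

The odd-numbered bins come out positive and the even-numbered bins negative,
which lines up with the alternating rise/fall of the pendingQueue size.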

-Andrew



On Mon, May 14, 2012 at 11:15 AM, Andrew Cebulski <***@drexel.edu> wrote:

> Ali,
>
> Looking at the trace file for the TLB walker that I sent earlier, I see a
> considerable number of these faults:
>
> L2 descriptor invalid, causing fault
>
> This is within the doL2Descriptor function in table_walker.cc.
>
> Here's a look at the frequency of these faults, with bins centered around
> the base of each rise/fall of the pendingQueue size (see small arrows on
> x-axis):
>
> http://dl.dropbox.com/u/2953302/gem5/L2faults.png
>
> I'm still looking into how this fault is handled, along with your other
> questions. I probably won't have much of a chance to get into it more
> until late today or tomorrow though. Let me know if you have any new ideas
> based on these results.
>
> Thanks,
> Andrew
>
> On Fri, May 11, 2012 at 12:17 AM, Ali Saidi <***@umich.edu> wrote:
>
>> Hi Andrew,
>>
>> Looking at the trace it seems like there are a lot of invalid
>> translations that are occurring. Everything to an address less than 0x1000
>> is likely invalid. An invalid translation will return a fault (setting the
>> fault pointer in the dynamic instruction to something other than NoFault)
>> and the instruction will either be squashed by a mispredicted branch or
>> redirect fetch to a kernel handler. I'm wondering if that isn't happening
>> for some reason. You need to trace back some of these translations and see
>> what the instruction serial number is for them and then see what the
>> instruction's lifetime is like. Are they getting squashed? Looking at your
>> graph, when the instructions fall to 0, what is the cause? Does an
>> interrupt occur right before? Something else?
>>
>>
>>
>> Thanks,
>>
>> Ali
>>
>>
>>
>>
>>
>> On 07.05.2012 20:53, Andrew Cebulski wrote:
>>
>> Hi Ali and Gabe,
>> Here's the trace file:
>> http://dl.dropbox.com/u/2953302/gem5/table_walker.out
>> The pending queue size in the table walker follows the shape of the
>> dynamic instruction curves. The L1 and L2 queue size never go above 0.
>> Comparing DynInst count in cpu->instcount with pendingQueue size:
>> http://dl.dropbox.com/u/2953302/gem5/pendingQueueSize.png
>>
>> -Andrew
>>
>> On Sun, May 6, 2012 at 12:01 PM, Ali Saidi <***@umich.edu> wrote:
>>
>>> Hi Andrew,
>>>
>>> Could you add some code to the table walker to see how big the following
>>> are getting:
>>> stateQueueL1.size()
>>> stateQueueL2.size()
>>> pendingQueue.size()
>>>
>>> Perhaps we're somehow getting into a loop where there are a lot of
>>> translations to invalid addresses that get squashed and they pile up in the
>>> table walker?
>>>
>>> Thanks,
>>> Ali
>>>
>>>
>>>
>>> On May 4, 2012, at 7:53 AM, Gabriel Michael Black wrote:
>>>
>>> > I haven't had a chance to study what's going on here, but could the
>>> problem be that we don't have bandwidth limits/back pressure implemented
>>> for the TLB and delayed translation? It could be that the CPU is pumping
>>> instructions into translation which eventually drain out/are squashed, and
>>> if too many accumulate they trip that assert.
>>> >
>>> > That may not actually make any sense as far as what the code is
>>> actually doing, but it occurred to me as a possibility and I thought I'd
>>> throw it out there.
>>> >
>>> > Gabe
>>> >
>>> > Quoting Andrew Cebulski <***@drexel.edu>:
>>> >
>>> >> I double-checked by looking at the config.ini file. It turns out I
>>> did
>>> >> actually create the checkpoint with an Atomic CPU without caches.
>>> Sorry
>>> >> for the confusion.
>>> >>
>>> >> -Andrew
>>> >>
>>> >> On Wed, May 2, 2012 at 10:12 PM, Andrew Cebulski <***@drexel.edu>
>>> wrote:
>>> >>
>>> >>> I started hitting this assertion (that the number of insts in flight
>>> was >
>>> >>> 1500) before I started using a checkpoint. I created the checkpoint
>>> >>> afterwards to decrease the time needed to run simulations to debug
>>> this
>>> >>> problem. I'll create a new checkpoint, then send the new trace
>>> output.
>>> >>>
>>> >>> -Andrew
>>> >>>
>>> >>>
>>> >>> On Wed, May 2, 2012 at 9:53 PM, Ali Saidi <***@umich.edu> wrote:
>>> >>>
>>> >>>> It's likely the cause for all of your problems. Dirty data in the
>>> caches
>>> >>>> doesn't get restored either. You should always create checkpoints
>>> with an
>>> >>>> atomic cpu and without caches.
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> Ali
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> On 02.05.2012 21:23, Andrew Cebulski wrote:
>>> >>>>
>>> >>>> Sorry, I created the checkpoint I referred to with an O3 CPU with
>>> caches.
>>> >>>> From what I recall reading, caches don't get restored from
>>> checkpoints.
>>> >>>> Since the checkpoint wasn't during the benchmark run, I assumed
>>> that was
>>> >>>> okay.
>>> >>>> -Andrew
>>> >>>>
>>> >>>> On Wed, May 2, 2012 at 9:07 PM, Ali Saidi <***@umich.edu> wrote:
>>> >>>>
>>> >>>>> You haven't answered the question about if you created the
>>> checkpoints
>>> >>>>> with an atomic cpu without caches.
>>> >>>>>
>>> >>>>> Ali
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> On 02.05.2012 19:58, Andrew Cebulski wrote:
>>> >>>>>
>>> >>>>> I have not run with the checker CPU recently. Here's the stderr
>>> output
>>> >>>>> from a run I did awhile back:
>>> >>>>> http://dl.dropbox.com/u/2953302/gem5/err.0
>>> >>>>> Note that the instruction match error is before my benchmark
>>> actually
>>> >>>>> starts running. The start of my boot script checks to see if my
>>> files
>>> >>>>> image is mounted (which it is), then continues on to run the
>>> benchmark. I
>>> >>>>> booted the system, mounted my files image, then took a checkpoint.
>>> I've
>>> >>>>> been running all my tests from that checkpoint. I found where my
>>> benchmark
>>> >>>>> started based on the ASID (from ExecAsid debug flag).
>>> >>>>> I delayed the start of gathering trace data until the
>>> second-to-last
>>> >>>>> linear increase in dynamic instructions in-flight. I'm running a
>>> new trace
>>> >>>>> now.
>>> >>>>> -Andrew
>>> >>>>>
>>> >>>>>
>>> >>>>> On Wed, May 2, 2012 at 5:28 PM, Ali Saidi <***@umich.edu> wrote:
>>> >>>>>
>>> >>>>>> Something is wrong well before this point. There is no reason that
>>> >>>>>> address 0x0 or 0x4 should be translated.
>>> >>>>>>
>>> >>>>>> Did you happen to create a checkpoint when caches were in the
>>> system?
>>> >>>>>>
>>> >>>>>> Have you tried to run with the checker cpu and see if it detects
>>> any
>>> >>>>>> errors?
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Ali
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On 02.05.2012 17:22, Andrew Cebulski wrote:
>>> >>>>>>
>>> >>>>>> They are data TLB misses that occur as the in-flight instruction
>>> count
>>> >>>>>> rises (at 0x0 and 0x4). The last TLB miss before the in-flight
>>> instruction
>>> >>>>>> count finally linearly decreases is to 0x200. Also, at the start
>>> of the
>>> >>>>>> rising slope, I see a miss to 0x8 and 0x2508c.
>>> >>>>>> Here's a trace file:
>>> >>>>>> http://dl.dropbox.com/u/2953302/gem5/tlb.out
>>> >>>>>> To reduce size, I just have lines that have either TLB or walker
>>> in
>>> >>>>>> them.
>>> >>>>>> I do see only a handful of instruction TLB misses.
>>> >>>>>> -Andrew
>>> >>>>>>
>>> >>>>>> On Wed, May 2, 2012 at 11:10 AM, Ali Saidi <***@umich.edu>
>>> wrote:
>>> >>>>>>
>>> >>>>>>> Hi Andrew,
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> Thanks for digging into this. I think there is an issue
>>> somewhere, but
>>> >>>>>>> I'm still not sure where.
>>> >>>>>>>
>>> >>>>>>> Ali
>>> >>>>>>>
>>> >>>>>>> On 01.05.2012 23:34, Andrew Cebulski wrote:
>>> >>>>>>>
>>> >>>>>>> Okay, I'm positive now that the issue lies with delayed
>>> translations
>>> >>>>>>> that are squashed before finishing.
>>> >>>>>>>
>>> >>>>>>> On the data or instruction side? You seem to allude to data in
>>> the
>>> >>>>>>> paragraph below, but then instructions in the latter text.
>>> >>>>>>>
>>> >>>>>>> It seems to me like speculative load/stores are being executed,
>>> >>>>>>> rather than waiting for the instructions to commit. Once the
>>> instructions
>>> >>>>>>> begin getting (speculatively) executed in the TLB, a reference
>>> is left
>>> >>>>>>> there, which seems hard to root out and dereference after the
>>> instruction
>>> >>>>>>> ends up being squashed. At least, I have not been able to find
>>> that out in
>>> >>>>>>> the source code as of yet. Can anyone clarify on this?
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> There should only be one translation outstanding from each
>>> >>>>>>> instruction and data side walker. Any nested transactions should
>>> be queued
>>> >>>>>>> in the walker. Until one finishes, I'm not sure how multiple
>>> would ever be
>>> >>>>>>> outstanding.
>>> >>>>>>>
>>> >>>>>>> Recall the following image that shows how the number of dynamic
>>> >>>>>>> instruction (DynInst) objects in-flight increases linearly for
>>> varying
>>> >>>>>>> periods of time:
>>> >>>>>>>
>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>>> >>>>>>> After enabling the TLB debug flag, I see that the linear
>>> increase in
>>> >>>>>>> instructions in flight is proportional to the number of TLB
>>> misses. These
>>> >>>>>>> TLB misses have a much larger delay (resulting in translation
>>> delays) due
>>> >>>>>>> to the fact that DramSim2 models the memory system more
>>> accurately. It
>>> >>>>>>> seems that with the classic memory system, TLB misses often do
>>> not have
>>> >>>>>>> translation delays. For whatever reason, it would also seem
>>> that every
>>> >>>>>>> instruction that has a TLB miss also is eventually squashed...
>>> >>>>>>>
>>> >>>>>>> From a data side perspective this is reasonable. While a miss is
>>> >>>>>>> outstanding at some point instructions will stop committing and
>>> thus the
>>> >>>>>>> instructions in flight will begin to rise until the miss is
>>> satisfied.
>>> >>>>>>>
>>> >>>>>>> Here's a summary of outputs from my trace. These two DPRINTF
>>> >>>>>>> messages appear on the rising slopes (repeated up until the
>>> peak):
>>> >>>>>>> TLB Miss: Starting hardware table walker for 0(656)
>>> >>>>>>> TLB Miss: Starting hardware table walker for 0x4(656)
>>> >>>>>>>
>>> >>>>>>> This is interesting/odd. I don't know a good reason why (1) a
>>> miss
>>> >>>>>>> would be outstanding to both address 0 and address 4 at the same
>>> time. In
>>> >>>>>>> almost all cases these pages are marked as no-access to detect
>>> segfaults.
>>> >>>>>>> Perhaps there is an issue where the cpu is getting into a loop
>>> faulting on
>>> >>>>>>> a bad access and then faulting again on the fault handler. I
>>> could imagine
>>> >>>>>>> this would happen if there was some corruption in the memory
>>> system (for
>>> >>>>>>> example the timings in dramsim exposing a bug in the cache
>>> models or
>>> >>>>>>> something).
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> At the peak, the following message appears (from fetch) almost
>>> every
>>> >>>>>>> tick for (what I believe to be) every single one of the table
>>> walkers that
>>> >>>>>>> were squashed.
>>> >>>>>>> Fetch is waiting ITLB walk to finish!
>>> >>>>>>>
>>> >>>>>>> There must be another walk in flight? The instruction side will
>>> only
>>> >>>>>>> have one fault outstanding at once. Successive branch
>>> mispredicts will
>>> >>>>>>> re-direct fetch but there is code that catches the fact that a
>>> different
>>> >>>>>>> walk completed than expected and "does the right thing."
>>> >>>>>>>
>>> >>>>>>> The problem is that these ITLB table walks are for instructions
>>> that
>>> >>>>>>> were squashed as much as 0.3 billion cycles earlier, and have since
>>> been removed
>>> >>>>>>> from the CPU's instruction list.
>>> >>>>>>>
>>> >>>>>>> I'm not following here.
>>> >>>>>>>
>>> >>>>>>> Any help will be greatly appreciated in solving this problem.
>>> I've
>>> >>>>>>> hit a roadblock with getting Ruby working with ARM, most likely
>>> due to the
>>> >>>>>>> fact that ARM has disjoint memory (x86 and Alpha do not).
>>> There's the 256
>>> >>>>>>> MB for physical memory, then the 64 MB for the boot loader. I
>>> brought this
>>> >>>>>>> up in my last email about trying to get Ruby working.
>>> Therefore, I'm
>>> >>>>>>> trying to get this DramSim2 integration fixed so I can start
>>> modeling FS
>>> >>>>>>> with DRAM memory.
>>> >>>>>>>
>>> >>>>>>> Brad/Steve/Nilay anyone have a suggestion on how to make this
>>> work?
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> Note that these problems also occur in Soplex from the Spec
>>> CPU2006
>>> >>>>>>> benchmark suite (also hits 1500 in-flight instructions
>>> assertion). Due to
>>> >>>>>>> time constraints, I haven't tested on other benchmarks.
>>> >>>>>>> Thanks,
>>> >>>>>>> Andrew
>>> >>>>>>> On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski <
>>> ***@drexel.edu>wrote:
>>> >>>>>>>
>>> >>>>>>>> Hey Gabe,
>>> >>>>>>>> Thanks for this...very helpful. I just recently got back
>>> into
>>> >>>>>>>> debugging this problem. I made a small change in
>>> src/base/refcnt.hh to
>>> >>>>>>>> allow me to return the current count of references to a DynInst
>>> object.
>>> >>>>>>>> I then modified existing DPRINTFs to also print out reference
>>> >>>>>>>> counts, then added some of my own when I needed extra
>>> visibility.
>>> >>>>>>>> I've found one memory store instruction that seems to be
>>> getting
>>> >>>>>>>> lost. What's happening is that it progresses as far as getting
>>> executed in
>>> >>>>>>>> the IEW once, but a delayed translation occurs, deferring the
>>> store. By
>>> >>>>>>>> the time it reenters the IEW, the IQ has marked the instruction
>>> as
>>> >>>>>>>> squashed. Everything progresses as usual from here on out,
>>> with one
>>> >>>>>>>> exception. When the instruction is removed from the CPUs
>>> instruction list,
>>> >>>>>>>> there is one reference count hanging.
>>> >>>>>>>> I've added in some additional debugging for my traces to help
>>> >>>>>>>> narrow down where this reference is coming from. As far as I
>>> can tell,
>>> >>>>>>>> it's because of a call to initiateAcc() within the executeStore
>>> function in
>>> >>>>>>>> the lsq unit. Please see the following two traces. The first
>>> trace shows
>>> >>>>>>>> what I just discussed. The second trace is another memory store
>>> >>>>>>>> instruction that got squashed, however, it was squashed upon
>>> its first
>>> >>>>>>>> entry into the IEW, therefore it never started execution.
>>> >>>>>>>> http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
>>> >>>>>>>> http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
>>> >>>>>>>> Let me know if you have any ideas based on these two
>>> instruction
>>> >>>>>>>> traces. I do not understand how the initiateAcc function
>>> results in
>>> >>>>>>>> another reference, but maybe someone else does.... Since I
>>> don't see how
>>> >>>>>>>> it makes a reference, it's hard to find out how to make sure it
>>> gets
>>> >>>>>>>> dereferenced...
>>> >>>>>>>> Unfortunately, I haven't been able to add a DPRINTF in
>>> >>>>>>>> src/base/refcnt.hh ...this would make things more clear (i.e.
>>> exactly when
>>> >>>>>>>> references/dereferences occur). Let me know if you have any
>>> advice on
>>> >>>>>>>> this...if it's possible. I can't seem to get the right include
>>> files, and
>>> >>>>>>>> likely right SConscript compile order...
>>> >>>>>>>> Thanks,
>>> >>>>>>>> Andrew
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> On Sat, Apr 7, 2012 at 9:48 PM, Gabe Black <
>>> ***@eecs.umich.edu>wrote:
>>> >>>>>>>>
>>> >>>>>>>>> Without digging into things too deeply, it looks like you may
>>> be
>>> >>>>>>>>> leaking references to dynamic instructions. The CPU may think
>>> it's done
>>> >>>>>>>>> with one, but until that final reference is removed, the
>>> object will hang
>>> >>>>>>>>> around forever. I think I've had problems before where the
>>> reference
>>> >>>>>>>>> count ended up off by one somehow and instructions would start
>>> piling up.
>>> >>>>>>>>> It's also possible that a clog develops in O3's pipeline and
>>> some internal
>>> >>>>>>>>> structure stops letting instructions through and starts
>>> accumulating them.
>>> >>>>>>>>> Either of these problems will be annoying to track down, but
>>> with enough
>>> >>>>>>>>> digging I've been able to fix these sorts of things.
>>> >>>>>>>>>
>>> >>>>>>>>> This may have more to do with O3 not handling the benchmark
>>> you're
>>> >>>>>>>>> running well rather than a problem with your new DRAM model.
>>> There may be
>>> >>>>>>>>> some interaction between the two, though, where the new memory
>>> makes the
>>> >>>>>>>>> timing line up to cause O3 to behave poorly. What you can do
>>> is instrument
>>> >>>>>>>>> dynamic instruction creation and destruction and reference
>>> counting (try
>>> >>>>>>>>> printing "this" for both the reference counting wrapper and the
>>> dyn inst
>>> >>>>>>>>> itself) and turn it on as close as you can to where things go
>>> bad tick
>>> >>>>>>>>> wise. Then look for an instruction which gets lost, and look
>>> for where its
>>> >>>>>>>>> reference count is incremented and decremented. It should be
>>> relatively
>>> >>>>>>>>> easy to pair up where references are created and destroyed,
>>> and you should
>>> >>>>>>>>> be able to identify the reference which never goes away. Then
>>> you need to
>>> >>>>>>>>> figure out where that reference is being created. After that,
>>> you should
>>> >>>>>>>>> have enough information to identify why the reference counting
>>> isn't being
>>> >>>>>>>>> done correctly. It's arduous, but that's the only way.
>>> >>>>>>>>>
>>> >>>>>>>>> It's important to also make sure reference counts aren't
>>> decremented
>>> >>>>>>>>> to zero prematurely. I had a problem once where that happened
>>> and the
>>> >>>>>>>>> memory behind the object was updated by something that didn't
>>> know it was
>>> >>>>>>>>> dead. The memory had since been reallocated to another object
>>> of the same
>>> >>>>>>>>> type, so that other object reflected what happened to the
>>> phantom one. If I
>>> >>>>>>>>> remember that manifested as something weird like an add
>>> causing a page
>>> >>>>>>>>> fault or something.
>>> >>>>>>>>>
>>> >>>>>>>>> Gabe
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> On 04/07/12 18:21, Andrew Cebulski wrote:
>>> >>>>>>>>>
>>> >>>>>>>>> Hi all,
>>> >>>>>>>>> I've looked into this problem some more, and have put together
>>> a
>>> >>>>>>>>> couple traces. I've been becoming more familiar with how gem5
>>> handles
>>> >>>>>>>>> dynamic instructions, in particular how it destroys them. I
>>> have two
>>> >>>>>>>>> traces to compare, one with the physical memory, and the other
>>> with the
>>> >>>>>>>>> integrated dramsim2 dram memory. I also have two plots
>>> showing instruction
>>> >>>>>>>>> counts over time (sim ticks). All of these are linked at the
>>> end of the
>>> >>>>>>>>> email.
>>> >>>>>>>>> First, I'm going to go into what I've been able to interpret
>>> >>>>>>>>> regarding how instructions are destroyed. In particular,
>>> comparing when
>>> >>>>>>>>> DynInst's vs. DynInstPtr's are deconstructed/removed from the
>>> cpu. I
>>> >>>>>>>>> separate these because I've seen a difference, as I discuss
>>> later. These
>>> >>>>>>>>> explanations are fairly non-existent on the wiki. There is a
>>> section
>>> >>>>>>>>> header waiting to be filled...
>>> >>>>>>>>> From what I have been able to gather from the code, there is a
>>> list
>>> >>>>>>>>> of all the instructions in flight in cpu/o3/cpu.cc called
>>> instList, with
>>> >>>>>>>>> the type DynInstPtr. There are three conditions to
>>> instructions being
>>> >>>>>>>>> cleaned from this list:
>>> >>>>>>>>> 1.) The ROB retires its head instruction
>>> >>>>>>>>> 2.) Fetch receives a rob squashing signal from the commit,
>>> >>>>>>>>> resulting in removing any instruction not in the ROB
>>> >>>>>>>>> 3.) Decode detects an incorrect branch prediction, resulting
>>> in
>>> >>>>>>>>> removal of all instructions back to the bad seq num.
>>> >>>>>>>>> Once all five stages have completed, the CPU cleans up all the
>>> >>>>>>>>> removed in-flight instructions. This line in particular
>>> >>>>>>>>> in cleanUpRemovedInsts() in cpu/o3/cpu.cc deconstructs a
>>> DynInstPtr:
>>> >>>>>>>>> instList.erase(removeList.front());
>>> >>>>>>>>> When I turn on the debug flag O3CPU, I see the message
>>> "Removing
>>> >>>>>>>>> instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum
>>> and pcState
>>> >>>>>>>>> after all 5 cpu stages have completed, and one of the
>>> conditions above is
>>> >>>>>>>>> met. I also see what tick it occurs on.
>>> >>>>>>>>> When I turn on the DynInst debug flag, I see when instructions
>>> are
>>> >>>>>>>>> created and destroyed (cpu/base_dyn_inst_impl.hh) and what
>>> tick. From
>>> >>>>>>>>> analyzing the trace files, I've gathered that this takes into
>>> account that
>>> >>>>>>>>> instructions have different execution lengths. So if one tick
>>> a memory
>>> >>>>>>>>> instruction in the instList (DynInstPtr) is removed, the
>>> DynInst for that
>>> >>>>>>>>> memory instruction will be destroyed much later (i.e. 1M ticks
>>> later). I have yet
>>> >>>>>>>>> to determine how this is implemented.
>>> >>>>>>>>> Now for the problem.
>>> >>>>>>>>> What I'm seeing when I run dramsim2 dram memory is a
>>> significant
>>> >>>>>>>>> difference between the size of the instList vector (of
>>> DynInstPtr objects),
>>> >>>>>>>>> and the size of dynamic instruction count (of DynInst
>>> objects). The
>>> >>>>>>>>> benchmark I'm running is libquantum from SPEC 2006. For the
>>> first roughly
>>> >>>>>>>>> 130B ticks, the dynamic instruction count kept in
>>> cpu/base_dyn_inst_impl.hh
>>> >>>>>>>>> shadows the instList size in o3/cpu.cc (figure linked below)
>>> very closely.
>>> >>>>>>>>> Around tick 130B after libquantum started, it starts hitting
>>> what I'm
>>> >>>>>>>>> assuming are loops (therefore branch prediction), resulting in
>>> some
>>> >>>>>>>>> behavior that seems to imply improper instruction handling
>>> (i.e. more
>>> >>>>>>>>> instructions in flight than allowed by ROB).
>>> >>>>>>>>> I wasn't able to sync-up the physical and dramsim2 traces
>>> exactly by
>>> >>>>>>>>> trace, but they should represent roughly the same area of
>>> execution. They
>>> >>>>>>>>> don't execute the same due to the dramsim2 modeling the memory
>>> differently
>>> >>>>>>>>> (i.e. latency and other delays).
>>> >>>>>>>>> I've shared both traces on my public Dropbox here --
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
>>> >>>>>>>>> Here are a couple plots of tick versus instruction count, with
>>> >>>>>>>>> respect to cpu->instcount in cpu/base_dyn_inst_impl.hh and
>>> instList.size()
>>> >>>>>>>>> in cpu/o3/cpu.cc. --
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>>> >>>>>>>>> Note that I added the printout of the instList size to an
>>> existing
>>> >>>>>>>>> O3CPU DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.
>>> >>>>>>>>> Here are the commands I ran to parse the traces into data
>>> files to
>>> >>>>>>>>> analyze in MATLAB and create the plots:
>>> >>>>>>>>> zgrep DynInst
>>> >>>>>>>>> dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz |
>>> grep destroyed
>>> >>>>>>>>> | awk '{print $1,$11}' > cpuinstcount.out
>>> >>>>>>>>> zgrep instList
>>> >>>>>>>>> dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz |
>>> awk '{print
>>> >>>>>>>>> $1,$11}' > instlistsize.out
>>> >>>>>>>>> It seems to me like the problem might lie in gem5, but has
>>> just been
>>> >>>>>>>>> exposed by integrating this more detailed memory model,
>>> dramsim2, into
>>> >>>>>>>>> gem5. Either that, or there are some timing errors in how
>>> dramsim2 was
>>> >>>>>>>>> integrated. I doubt this, however, since those first 190B
>>> ticks executed
>>> >>>>>>>>> used the dramsim2 memory. I believe the problem is a
>>> combination of memory
>>> >>>>>>>>> instructions + complex loops (branch prediction), resulting in
>>> improper
>>> >>>>>>>>> destroying of instructions.
>>> >>>>>>>>> I've included the ROB, Commit, Fetch, DynInst and O3CPU debug
>>> flags.
>>> >>>>>>>>> There are 192 ROB entries, which is why the instList size
>>> generally has a
>>> >>>>>>>>> max of about 192 instructions. The dynamic instruction counts
>>> (seen in the
>>> >>>>>>>>> dramsim2 plot) seem to also imply that instructions are
>>> incorrectly being
>>> >>>>>>>>> removed from the ROB, and then from the cpu's instruction list
>>> in cpu.cc,
>>> >>>>>>>>> which allows more and more instructions to be added to the
>>> system (possibly
>>> >>>>>>>>> from a bad branch).
>>> >>>>>>>>> I appreciate any help in debugging this and further figuring
>>> out the
>>> >>>>>>>>> root problem, just let me know if you need anything else from
>>> me. I don't
>>> >>>>>>>>> have much more time at the moment to debug, but I can take any
>>> advice for
>>> >>>>>>>>> quick changes and/or additional traces, then send the results
>>> back to the
>>> >>>>>>>>> list for discussion.
>>> >>>>>>>>> Thanks,
>>> >>>>>>>>> Andrew
>>> >>>>>>>>> P.S. Paul - I did try decreasing the size of the dramsim2
>>> >>>>>>>>> transaction (and even command) queue from 512 to 32. The same
>>> instructions
>>> >>>>>>>>> problem occurred. It basically just decreased the execution
>>> time.
>>> >>>>>>>>>
>>> >>>>>>>>> On Wed, Mar 14, 2012 at 2:10 PM, Ali Saidi <***@umich.edu>
>>> wrote:
>>> >>>>>>>>>
>>> >>>>>>>>>> The error is that there are more than 1500 instructions
>>> currently
>>> >>>>>>>>>> in flight in the system. It could mean several things:
>>> >>>>>>>>>>
>>> >>>>>>>>>> 1. The value is somewhat arbitrarily defined and maybe there
>>> are
>>> >>>>>>>>>> more than 1500 in your system at one time?
>>> >>>>>>>>>>
>>> >>>>>>>>>> 2. Instructions aren't being destroyed correctly
>>> >>>>>>>>>>
>>> >>>>>>>>>> You could try to run a debug binary so you'll get a list of
>>> >>>>>>>>>> instructions when it happens or increase the number which may
>>> >>>>>>>>>> be appropriate for certain situations (but 1500 is quite a
>>> few inflight
>>> >>>>>>>>>> instructions).
>>> >>>>>>>>>>
>>> >>>>>>>>>> Ali
>>> >>>>>>>>>>
>>> >>>>>>>>>> On 13.03.2012 10:56, Andrew Cebulski wrote:
>>> >>>>>>>>>>
>>> >>>>>>>>>> Hi Xiangyu,
>>> >>>>>>>>>> I just started looking into this some more. So at first I
>>> >>>>>>>>>> thought it was due to updating to a more recent revision, but
>>> then I went
>>> >>>>>>>>>> back to revision 8643, added your patch, built and ran....and
>>> now get the
>>> >>>>>>>>>> error with it too (when running ARM_FS/gem5.opt). I'm
>>> testing now to see
>>> >>>>>>>>>> if an update to SWIG might have resulted in this error, maybe
>>> someone on
>>> >>>>>>>>>> the mailing list would know if that's possible. The
>>> difference is 1.3.40
>>> >>>>>>>>>> vs. 2.0.3, both of which are supported according to the
>>> dependencies wiki
>>> >>>>>>>>>> page.
>>> >>>>>>>>>> Just for completeness, here's the error from revision 8643:
>>> >>>>>>>>>> build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
>>> >>>>>>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion
>>> `cpu->instcount
>>> >>>>>>>>>> I have not tried running with gem5.debug, so I will be doing
>>> >>>>>>>>>> that today. Maybe this is an assertion that is occurring due
>>> to an
>>> >>>>>>>>>> optimization. That would mean it wouldn't be triggered in
>>> gem5.debug since
>>> >>>>>>>>>> it runs without optimizations. Have you tested all debug,
>>> opt and fast
>>> >>>>>>>>>> with your tests?
>>> >>>>>>>>>> Thanks,
>>> >>>>>>>>>> Andrew
>>> >>>>>>>>>>
>>> >>>>>>>>>> On Tue, Mar 13, 2012 at 1:37 PM, Rio Xiangyu Dong <
>>> >>>>>>>>>> ***@gmail.com> wrote:
>>> >>>>>>>>>>
>>> >>>>>>>>>>> Hi Andrew,
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> I didn't see this error in my simulations. May I ask which
>>> gem5
>>> >>>>>>>>>>> version you are using? I find some of the latest code
>>> updates do not comply
>>> >>>>>>>>>>> with my changes. I am still using the DRAMsim2 patch on Gem5
>>> repo8643, and
>>> >>>>>>>>>>> have run all the runnable benchmarks in SPEC2006, SPEC2000,
>>> EEMBC2, and
>>> >>>>>>>>>>> PARSEC2 on ARM_SE.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Thank you!
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Best,
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Xiangyu
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> *From:* Andrew Cebulski [mailto:***@drexel.edu]
>>> >>>>>>>>>>> *Sent:* Thursday, March 08, 2012 6:52 PM
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> *To:* gem5 users mailing list
>>> >>>>>>>>>>> *Cc:****@gmail.com; ***@umich.edu
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> *Subject:* Re: [gem5-users] A Patch for DRAMsim2 Integration
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Xiangyu,
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> I've been having an issue recently with the number of
>>> >>>>>>>>>>> instructions I've been seeing committed to the CPU (I have a
>>> separate
>>> >>>>>>>>>>> thread on this). It turns out the issue seems to be coming
>>> from this patch
>>> >>>>>>>>>>> you created to integrate DramSim2 with Gem5. Unfortunately,
>>> I've been
>>> >>>>>>>>>>> running with gem5.fast, not gem5.opt. So up until now, I
>>> haven't been
>>> >>>>>>>>>>> seeing assertions. I thought I'd run it with gem5.opt or
>>> debug back in
>>> >>>>>>>>>>> December, but I must not have. My runs on the Arm O3 cpu
>>> fail with this
>>> >>>>>>>>>>> assertion:
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> build/ARM/cpu/base_dyn_inst_impl.hh:149: void
>>> >>>>>>>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion
>>> `cpu->instcount
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> -Andrew
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Date: Sun, 18 Dec 2011 01:48:58 -0800
>>> >>>>>>>>>>> From: "Dong, Xiangyu" <***@gmail.com>
>>> >>>>>>>>>>> To: "gem5 users mailing list" <gem5-***@gem5.org>
>>> >>>>>>>>>>> Subject: [gem5-users] A Patch for DRAMsim2 Integration
>>> >>>>>>>>>>> Message-ID: <001201ccbd6a$45102210$cf306630$@gmail.com>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Content-Type: text/plain; charset="us-ascii"
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Hi all,
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> I have a Gem5+DRAMsim2 patch. I've tested it under both SE
>>> and FS
>>> >>>>>>>>>>> modes.
>>> >>>>>>>>>>> I'm willing to share it here.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> For those who have such needs, please go to my website
>>> >>>>>>>>>>> www.cse.psu.edu/~xydong <http://www.cse.psu.edu/%7Exydong>
>>> to
>>> >>>>>>>>>>> download the patch and test it. To enable
>>> >>>>>>>>>>> DRAMSim2, use se_dramsim2.py script instead of se.py (for
>>> FS, you
>>> >>>>>>>>>>> can create
>>> >>>>>>>>>>> by yourself). The basic idea to enable the DRAMsim2 module
>>> is to
>>> >>>>>>>>>>> use the
>>> >>>>>>>>>>> derived DRAMMemory class instead of PhysicalMemory class.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Please let me know if there are bugs.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Thank you!
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Best,
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Xiangyu Dong
>>> >>>>>>>>>>>
>>> >>>>>>>>>> _______________________________________________
>>> >>>>>>>>>> gem5-users mailing list
>>> >>>>>>>>>> gem5-***@gem5.org
>>> >>>>>>>>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
Gabe Black
2012-05-15 12:48:55 UTC
Permalink
There's a limit on the size of the TLB itself, but there may not be a
limit on the number of translations it's doing at one time. I suspect
that's an important part of the problem.
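
To make the distinction concrete, here is a rough Python sketch (not gem5
code) of a walker queue with and without a cap on in-flight translations.
The class name, the cap of 8, the walk latency, and the one-miss-per-cycle
request rate are all made up for illustration. The only point is that
without a stall signal the queue depth follows the miss rate rather than
any hardware limit.

```python
from collections import deque


class WalkerQueue:
    """Toy model of a table walker's pending queue (hypothetical, not gem5)."""

    def __init__(self, max_pending=None):
        self.max_pending = max_pending  # None means no limit at all
        self.pending = deque()
        self.peak = 0
        self.stalls = 0

    def request(self, vaddr):
        """Try to start a translation; return False if the CPU must stall."""
        if self.max_pending is not None and len(self.pending) >= self.max_pending:
            self.stalls += 1
            return False
        self.pending.append(vaddr)
        self.peak = max(self.peak, len(self.pending))
        return True

    def complete_one(self):
        """Finish the oldest walk, as a pop_front would."""
        if self.pending:
            self.pending.popleft()


def simulate(queue, cycles=1000, walk_latency=10):
    """Issue one miss per cycle; one walk completes every walk_latency cycles."""
    for cycle in range(cycles):
        queue.request(0x1000 + 4 * cycle)
        if cycle % walk_latency == walk_latency - 1:
            queue.complete_one()
    return queue.peak, queue.stalls


unbounded_peak, unbounded_stalls = simulate(WalkerQueue())
bounded_peak, bounded_stalls = simulate(WalkerQueue(max_pending=8))
print("unbounded: peak", unbounded_peak, "stalls", unbounded_stalls)
print("bounded:   peak", bounded_peak, "stalls", bounded_stalls)
```

With the cap in place the requester sees stalls instead of the queue
growing, which is the back pressure that seems to be missing here.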

Gabe
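
One way to check where the pendingQueue growth starts in the quoted trace
below is to replay the push/pop DPRINTFs and rebuild the queue depth per
tick. This is only a sketch: it assumes each trace line starts with
"<tick>:", and the two marker strings are placeholders for whatever text
the added DPRINTFs actually print (they are not existing gem5 messages).

```python
def queue_depth_over_time(lines, push_marker="pendingQueue push",
                          pop_marker="pendingQueue pop"):
    """Return (tick, depth) after every push or pop seen in a trace.

    Assumes each line starts with "<tick>:" and that the marker strings
    match your own DPRINTF text (both markers are hypothetical).
    """
    depth = 0
    history = []
    for line in lines:
        tick_text, sep, rest = line.partition(":")
        if not sep or not tick_text.strip().isdigit():
            continue  # not a trace line
        if push_marker in rest:
            depth += 1
        elif pop_marker in rest:
            depth -= 1
        else:
            continue  # trace line, but not a queue event
        history.append((int(tick_text), depth))
    return history


def first_tick_above(history, threshold):
    """Tick at which the depth first exceeds threshold, or None."""
    for tick, depth in history:
        if depth > threshold:
            return tick
    return None


# Made-up sample lines, only to show the expected shape of the input.
sample = [
    "1000: system.cpu.dtb.walker: pendingQueue push",
    "1500: system.cpu.dtb.walker: pendingQueue push",
    "2000: system.cpu.dtb.walker: pendingQueue pop",
    "2500: system.cpu.dtb.walker: pendingQueue push",
    "info: unrelated line",
]
history = queue_depth_over_time(sample)
print(history)
print(first_tick_above(history, 1))
```

For a compressed trace, passing gzip.open(path, "rt") as the lines
argument should work, and first_tick_above(history, 8) would give the
tick to compare against the first "L2 descriptor invalid" message.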

On 05/15/12 00:44, Andrew Cebulski wrote:
> Here is the latest in my debugging:
>
> http://dl.dropbox.com/u/2953302/gem5/pendingQueuePushPop.png
>
> The frequency of occurrence of the doL2DescriptorWrapper function
> (where I was seeing invalid faults) actually controls the size of the
> pendingQueue. What I'm showing are where the pendingQueue size is
> increased (with a push_back) and where it is decreased (pop_front). I
> put my DPRINTF for the decrease at the end of the
> doL2DescriptorWrapper function in table_walker.cc. This is actually
> right after a function call to nextWalk, which schedules a process
> event (doProcessEvent aka processWalkWrapper()) for the next tick,
> which is where the pop of the pendingQueue occurs.
>
> My first bin is large, just to show how the push/pop rate roughly
> averages out at the start of the plot (there is still imbalance...just
> smaller grained). Push_backs aren't visible in some bins because
> there are fewer than 20 in those bins. Note how the difference between
> the push/pop is roughly the peak of each rise/fall. I'm still trying
> to debug why the imbalance in the pendingQueue/L2 function calls is
> occurring...namely at the changes from rise/fall in the size, but I
> seem to be narrowing down on it.
>
> Basically, it looks like there isn't a limit in place for the size of
> the TLB, therefore no stalls are being sent to stop more TLB
> transactions from initiating. The invalid accesses are likely a
> result of this too. Looking more closely in my traces, it looks like
> the L2 descriptor invalid errors start occurring once the pendingQueue
> increases above roughly 8 entries.
>
> Here are the sizes of each bin (N1 is the push_back count, N2 the L2
> function count):
>
> Bin     N1     N2
>   1    867    788
>   2     11    205
>   3    388     95
>   4     11    300
>   5    775    189
>   6      3    588
>   7   1535    376
>   8     17   1184
>   9   2127    751
>  10      0      1
>
> -Andrew
>
>
>
> On Mon, May 14, 2012 at 11:15 AM, Andrew Cebulski <***@drexel.edu
> <mailto:***@drexel.edu>> wrote:
>
> Ali,
>
> Looking at the trace file for the TLB walker that I sent earlier,
> I see a considerable number of these faults:
>
> L2 descriptor invalid, causing fault
>
> This is within the doL2Descriptor function in table_walker.cc.
>
> Here's a look at the frequency of these faults, with bins centered
> around the base of each rise/fall of the pendingQueue size (see
> small arrows on x-axis):
>
> http://dl.dropbox.com/u/2953302/gem5/L2faults.png
>
> I'm still looking into how this fault is handled, along with your
> other questions. I probably won't have much of a chance to get
> into it more until late today or tomorrow though. Let me know if
> you have any new ideas based on these results.
>
> Thanks,
> Andrew
>
> On Fri, May 11, 2012 at 12:17 AM, Ali Saidi <***@umich.edu
> <mailto:***@umich.edu>> wrote:
>
> Hi Andrew,
>
> Looking at the trace it seems like there are a lot of invalid
> translations that are occurring. Everything to an address less
> than 0x1000 is likely invalid. An invalid translation will
> return a fault (setting the fault pointer in the dynamic
> instruction to something other than NoFault) and the
> instruction will either be squashed by a mispredicted branch
> or redirect fetch to a kernel handler. I'm wondering if that
> isn't happening for some reason. You need to trace back some
> of these translations and see what the instruction serial
> number is for them and then see what the instruction's lifetime
> is like. Are they getting squashed? Looking at your graph,
> when the instructions fall to 0, what is the cause? Does an
> interrupt occur right before? Something else?
>
>
>
> Thanks,
>
> Ali
>
>
>
>
>
> On 07.05.2012 20:53, Andrew Cebulski wrote:
>
>> Hi Ali and Gabe,
>>
>> Here's the trace file:
>> http://dl.dropbox.com/u/2953302/gem5/table_walker.out
>> The pending queue size in the table walker follows the
>> shape of the dynamic instruction curves. The L1 and L2 queue
>> size never go above 0. Comparing DynInst count in
>> cpu->instcount with pendingQueue size:
>> http://dl.dropbox.com/u/2953302/gem5/pendingQueueSize.png
>>
>> -Andrew
>>
>> On Sun, May 6, 2012 at 12:01 PM, Ali Saidi <***@umich.edu
>> <mailto:***@umich.edu>> wrote:
>>
>> Hi Andrew,
>>
>> Could you add some code to the table walker to see how
>> big the following are getting:
>> stateQueueL1.size()
>> stateQueueL2.size()
>> pendingQueue.size()
>>
>> Perhaps we're somehow getting into a loop where there
>> are a lot of translations to invalid addresses that get
>> squashed and they pile up in the table walker?
>>
>> Thanks,
>> Ali
>>
>>
>>
>> On May 4, 2012, at 7:53 AM, Gabriel Michael Black wrote:
>>
>> > I haven't had a chance to study what's going on here,
>> but could the problem be that we don't have bandwidth
>> limits/back pressure implemented for the TLB and delayed
>> translation? It could be that the CPU is pumping
>> instructions into translation which eventually drain
>> out/are squashed, and if too many accumulate they trip
>> that assert.
>> >
>> > That may not actually make any sense as far as what the
>> code is actually doing, but it occurred to me as a
>> possibility and I thought I'd throw it out there.
>> >
>> > Gabe
>> >
>> > Quoting Andrew Cebulski <***@drexel.edu
>> <mailto:***@drexel.edu>>:
>> >
>> >> I double-checked by looking at the config.ini file.
>> It turns out I did
>> >> actually create the checkpoint with an Atomic CPU
>> without caches. Sorry
>> >> for the confusion.
>> >>
>> >> -Andrew
>> >>
>> >> On Wed, May 2, 2012 at 10:12 PM, Andrew Cebulski
>> <***@drexel.edu <mailto:***@drexel.edu>> wrote:
>> >>
>> >>> I started hitting this assertion (that the number of
>> insts in flight was >
>> >>> 1500) before I started using a checkpoint. I created
>> the checkpoint
>> >>> afterwards to decrease the time needed to run
>> simulations to debug this
>> >>> problem. I'll create a new checkpoint, then send the
>> new trace output.
>> >>>
>> >>> -Andrew
>> >>>
>> >>>
>> >>> On Wed, May 2, 2012 at 9:53 PM, Ali Saidi
>> <***@umich.edu <mailto:***@umich.edu>> wrote:
>> >>>
>> >>>>
>> >>>> It's likely the cause for all of your problems.
>> Dirty data in the caches
>> >>>> doesn't get restored either. You should always
>> create checkpoints with an
>> >>>> atomic cpu and without caches.
>> >>>>
>> >>>>
>> >>>>
>> >>>> Ali
>> >>>>
>> >>>>
>> >>>>
>> >>>> On 02.05.2012 21:23, Andrew Cebulski wrote:
>> >>>>
>> >>>> Sorry, I created the checkpoint I referred to with
>> an O3 CPU with caches.
>> >>>> From what I recall reading, caches don't get
>> restored from checkpoints.
>> >>>> Since the checkpoint wasn't during the benchmark
>> run, I assumed that was
>> >>>> okay.
>> >>>> -Andrew
>> >>>>
>> >>>> On Wed, May 2, 2012 at 9:07 PM, Ali Saidi
>> <***@umich.edu <mailto:***@umich.edu>> wrote:
>> >>>>
>> >>>>> You haven't answered the question about if you
>> created the checkpoints
>> >>>>> with an atomic cpu without caches.
>> >>>>>
>> >>>>> Ali
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On 02.05.2012 19:58, Andrew Cebulski wrote:
>> >>>>>
>> >>>>> I have not run with the checker CPU recently.
>> Here's the stderr output
>> >>>>> from a run I did awhile back:
>> >>>>> http://dl.dropbox.com/u/2953302/gem5/err.0
>> >>>>> Note that the instruction match error is before my
>> benchmark actually
>> >>>>> starts running. The start of my boot script checks
>> to see if my files
>> >>>>> image is mounted (which it is), then continues on
>> to run the benchmark. I
>> >>>>> booted the system, mounted my files image, then
>> took a checkpoint. I've
>> >>>>> been running all my tests from that checkpoint. I
>> found where my benchmark
>> >>>>> started based on the ASID (from ExecAsid debug flag).
>> >>>>> I delayed the start of gathering trace data until
>> the second-to-last
>> >>>>> linear increase in dynamic instructions in-flight.
>> I'm running a new trace
>> >>>>> now.
>> >>>>> -Andrew
>> >>>>>
>> >>>>>
>> >>>>> On Wed, May 2, 2012 at 5:28 PM, Ali Saidi
>> <***@umich.edu <mailto:***@umich.edu>> wrote:
>> >>>>>
>> >>>>>> Something is wrong well before this point. There
>> is no reason that
>> >>>>>> address 0x0 or 0x4 should be translated.
>> >>>>>>
>> >>>>>> Did you happen to create a checkpoint when caches
>> were in the system?
>> >>>>>>
>> >>>>>> Have you tried to run with the checker cpu and see
>> if it detects any
>> >>>>>> errors?
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> Ali
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On 02.05.2012 17:22, Andrew Cebulski wrote:
>> >>>>>>
>> >>>>>> They are data TLB misses that occur as the
>> in-flight instruction count
>> >>>>>> rises (at 0x0 and 0x4). The last TLB miss before
>> the in-flight instruction
>> >>>>>> count finally linearly decreases is to 0x200.
>> Also, at the start of the
>> >>>>>> rising slope, I see a miss to 0x8 and 0x2508c.
>> >>>>>> Here's a trace file:
>> >>>>>> http://dl.dropbox.com/u/2953302/gem5/tlb.out
>> >>>>>> To reduce size, I just have lines that have either
>> TLB or walker in
>> >>>>>> them.
>> >>>>>> I do see only a handful of instruction TLB misses.
>> >>>>>> -Andrew
>> >>>>>>
>> >>>>>> On Wed, May 2, 2012 at 11:10 AM, Ali Saidi
>> <***@umich.edu <mailto:***@umich.edu>> wrote:
>> >>>>>>
>> >>>>>>> Hi Andrew,
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Thanks for digging into this. I think there is an
>> issue somewhere, but
>> >>>>>>> I'm still not sure where.
>> >>>>>>>
>> >>>>>>> Ali
>> >>>>>>>
>> >>>>>>> On 01.05.2012 23:34, Andrew Cebulski wrote:
>> >>>>>>>
>> >>>>>>> Okay, I'm positive now that the issue lies with
>> delayed translations
>> >>>>>>> that are squashed before finishing.
>> >>>>>>>
>> >>>>>>> On the data or instruction side? You seem to
>> allude to data in the
>> >>>>>>> paragraph below, but then instructions in the
>> latter text.
>> >>>>>>>
>> >>>>>>> It seems to me like speculative load/stores are
>> being executed,
>> >>>>>>> rather than waiting for the instructions to
>> commit. Once the instructions
>> >>>>>>> begin getting (speculatively) executed in the
>> TLB, a reference is left
>> >>>>>>> there, which seems hard to root out and
>> dereference after the instruction
>> >>>>>>> ends up being squashed. At least, I have not
>> been able to find that out in
>> >>>>>>> the source code as of yet. Can anyone clarify on
>> this?
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> There should only be one translation outstanding
>> from each
>> >>>>>>> instruction and data side walker. Any nested
>> transactions should be queued
>> >>>>>>> in the walker. Until one finishes, I'm not sure
>> how multiple would ever be
>> >>>>>>> outstanding.
>> >>>>>>>
>> >>>>>>> Recall the following image that shows how the
>> number of dynamic
>> >>>>>>> instruction (DynInst) objects in-flight increases
>> linearly for varying
>> >>>>>>> periods of time:
>> >>>>>>>
>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>> >>>>>>> After enabling the TLB debug flag, I see that the
>> linear increase in
>> >>>>>>> instructions in flight is proportional to the
>> number of TLB misses. These
>> >>>>>>> TLB misses have a much larger delay (resulting in
>> translation delays) due
>> >>>>>>> to the fact that DramSim2 models the memory system
>> more accurately. It
>> >>>>>>> seems that with the classic memory system, TLB
>> misses often do not have
>> >>>>>>> translation delays. For whatever reason, it
>> would also seem that every
>> >>>>>>> instruction that has a TLB miss also is
>> eventually squashed...
>> >>>>>>>
>> >>>>>>> From a data side perspective this is reasonable.
>> While a miss is
>> >>>>>>> outstanding at some point instructions will stop
>> committing and thus the
>> >>>>>>> instructions in flight will begin to rise until
>> the miss is satisfied.
>> >>>>>>>
>> >>>>>>> Here's a summary of outputs from my trace. These
>> two DPRINTF
>> >>>>>>> messages appears on the rising slopes (repeated
>> up until the peak):
>> >>>>>>> TLB Miss: Starting hardware table walker for 0(656)
>> >>>>>>> TLB Miss: Starting hardware table walker for 0x4(656)
>> >>>>>>>
>> >>>>>>> This is interesting/odd. I don't know a good
>> reason why (1) a miss
>> >>>>>>> would be outstanding to both address 0 and
>> address 4 at the same time. In
>> >>>>>>> almost all cases these pages are marked as
>> no-access to detect segfaults.
>> >>>>>>> Perhaps there is an issue where the cpu is
>> getting into a loop faulting on
>> >>>>>>> a bad access and then faulting again on the fault
>> handler. I could imagine
>> >>>>>>> this would happen if there was some corruption in
>> the memory system (for
>> >>>>>>> example the timings in dramsim exposing a bug in
>> the cache models or
>> >>>>>>> something).
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> At the peak, the following message appears (from
>> fetch) almost every
>> >>>>>>> tick for (what I believe to be) every single one
>> of the table walkers that
>> >>>>>>> were squashed.
>> >>>>>>> Fetch is waiting ITLB walk to finish!
>> >>>>>>>
>> >>>>>>> There must be another walk in flight? The
>> instruction side will only
>> >>>>>>> have one fault outstanding at once. Successive
>> branch mispredicts will
>> >>>>>>> re-direct fetch but there is code that catches
>> the fact that a different
>> >>>>>>> walk completed than expected and "does the right
>> thing."
>> >>>>>>>
>> >>>>>>> The problem is that these ITLB table walks are
>> for instructions that
>> >>>>>>> were squashed as much as 0.3 billion cycles
>> earlier, and have since been removed
>> >>>>>>> from the CPU's instruction list.
>> >>>>>>>
>> >>>>>>> I'm not following here.
>> >>>>>>>
>> >>>>>>> Any help will be greatly appreciated in solving
>> this problem. I've
>> >>>>>>> hit a roadblock with getting Ruby working with
>> ARM, most likely due to the
>> >>>>>>> fact that ARM has disjoint memory (x86 and Alpha
>> do not). There's the 256
>> >>>>>>> MB for physical memory, then the 64 MB for the
>> boot loader. I brought this
>> >>>>>>> up in my last email about trying to get Ruby
>> working. Therefore, I'm
>> >>>>>>> trying to get this DramSim2 integration fixed so
>> I can start modeling FS
>> >>>>>>> with DRAM memory.
>> >>>>>>>
>> >>>>>>> Brad/Steve/Nilay anyone have a suggestion on how
>> to make this work?
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Note that these problems also occur in Soplex
>> from the Spec CPU2006
>> >>>>>>> benchmark suite (also hits 1500 in-flight
>> instructions assertion). Due to
>> >>>>>>> time constraints, I haven't tested on other
>> benchmarks.
>> >>>>>>> Thanks,
>> >>>>>>> Andrew
>> >>>>>>> On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski
>> <***@drexel.edu <mailto:***@drexel.edu>>wrote:
>> >>>>>>>
>> >>>>>>>> Hey Gabe,
>> >>>>>>>> Thanks for this...very helpful. I just
>> recently got back into
>> >>>>>>>> debugging this problem. I made a small change
>> in src/base/refcnt.hh to
>> >>>>>>>> allow me to return the current count of
>> references to a DynInst object.
>> >>>>>>>> I then modified existing DPRINTFs to also
>> print out reference
>> >>>>>>>> counts, then added some of my own when I needed
>> extra visibility.
>> >>>>>>>> I've found one memory store instruction that
>> seems to be getting
>> >>>>>>>> lost. What's happening is that it progresses as
>> far as getting executed in
>> >>>>>>>> the IEW once, but a delayed translation occurs,
>> deferring the store. By
>> >>>>>>>> the time it reenters the IEW, the IQ has marked
>> the instruction as
>> >>>>>>>> squashed. Everything progresses as usual from
>> here on out, with one
>> >>>>>>>> exception. When the instruction is removed from
>> the CPUs instruction list,
>> >>>>>>>> there is one reference count hanging.
>> >>>>>>>> I've added in some additional debugging for
>> my traces to help
>> >>>>>>>> narrow down where this reference is coming from.
>> As far as I can tell,
>> >>>>>>>> it's because of a call to initiateAcc() within
>> the executeStore function in
>> >>>>>>>> the lsq unit. Please see the following two
>> traces. The first trace shows
>> >>>>>>>> what I just discussed. The second trace is
>> another memory store
>> >>>>>>>> instruction that got squashed, however, it was
>> squashed upon its first
>> >>>>>>>> entry into the IEW, therefore it never started
>> execution.
>> >>>>>>>>
>> http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
>> >>>>>>>>
>> http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
>> >>>>>>>> Let me know if you have any ideas based on
>> these two instruction
>> >>>>>>>> traces. I do not understand how the initiateAcc
>> function results in
>> >>>>>>>> another reference, but maybe someone else
>> does.... Since I don't see how
>> >>>>>>>> it makes a reference, it's hard to find out how
>> to make sure it gets
>> >>>>>>>> dereferenced...
>> >>>>>>>> Unfortunately, I haven't been able to add a
>> DPRINTF in
>> >>>>>>>> src/base/refcnt.hh ...this would make things
>> more clear (i.e. exactly when
>> >>>>>>>> references/dereferences occur). Let me know if
>> you have any advice on
>> >>>>>>>> this...if it's possible. I can't seem to get
>> the right include files, and
>> >>>>>>>> likely right SConscript compile order...
>> >>>>>>>> Thanks,
>> >>>>>>>> Andrew
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Sat, Apr 7, 2012 at 9:48 PM, Gabe Black
>> <***@eecs.umich.edu <mailto:***@eecs.umich.edu>>wrote:
>> >>>>>>>>
>> >>>>>>>>> Without digging into things too deeply, it
>> looks like you may be
>> >>>>>>>>> leaking references to dynamic instructions. The
>> CPU may think it's done
>> >>>>>>>>> with one, but until that final reference is
>> removed, the object will hang
>> >>>>>>>>> around forever. I think I've had problems
>> before where the reference
>> >>>>>>>>> count ended up off by one somehow and
>> instructions would start piling up.
>> >>>>>>>>> It's also possible that a clog develops in O3's
>> pipeline and some internal
>> >>>>>>>>> structure stops letting instructions through
>> and starts accumulating them.
>> >>>>>>>>> Either of these problems will be annoying to
>> track down, but with enough
>> >>>>>>>>> digging I've been able to fix these sorts of
>> things.
>> >>>>>>>>>
>> >>>>>>>>> This may have more to do with O3 not handling
>> the benchmark you're
>> >>>>>>>>> running well rather than a problem with your
>> new DRAM model. There may be
>> >>>>>>>>> some interaction between the two, though, where
>> the new memory makes the
>> >>>>>>>>> timing line up to cause O3 to behave poorly.
>> What you can do is instrument
>> >>>>>>>>> dynamic instruction creation and destruction
>> and reference counting (try
>> >>>>>>>>> printing "this" for both the reference counting
>> wrapper and the dyn inst
>> >>>>>>>>> itself) and turn it on as close as you can to
>> where things go bad tick
>> >>>>>>>>> wise. Then look for an instruction which gets
>> lost, and look for where its
>> >>>>>>>>> reference count is incremented and decremented.
>> It should be relatively
>> >>>>>>>>> easy to pair up where references are created
>> and destroyed, and you should
>> >>>>>>>>> be able to identify the reference which never
>> goes away. Then you need to
>> >>>>>>>>> figure out where that reference is being
>> created. After that, you should
>> >>>>>>>>> have enough information to identify why the
>> reference counting isn't being
>> >>>>>>>>> done correctly. It's arduous, but that's the
>> only way.
>> >>>>>>>>>
>> >>>>>>>>> It's important to also make sure reference
>> counts aren't decremented
>> >>>>>>>>> to zero prematurely. I had a problem once where
>> that happened and the
>> >>>>>>>>> memory behind the object was updated by
>> something that didn't know it was
>> >>>>>>>>> dead. The memory had since been reallocated to
>> another object of the same
>> >>>>>>>>> type, so that other object reflected what
>> happened to the phantom one. If I
>> >>>>>>>>> remember that manifested as something weird
>> like an add causing a page
>> >>>>>>>>> fault or something.
>> >>>>>>>>>
>> >>>>>>>>> Gabe
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> On 04/07/12 18:21, Andrew Cebulski wrote:
>> >>>>>>>>>
>> >>>>>>>>> Hi all,
>> >>>>>>>>> I've looked into this problem some more, and
>> have put together a
>> >>>>>>>>> couple traces. I've been becoming more
>> familiar with how gem5 handles
>> >>>>>>>>> dynamic instructions, in particular how it
>> destroys them. I have two
>> >>>>>>>>> traces to compare, one with the physical
>> memory, and the other with the
>> >>>>>>>>> integrated dramsim2 dram memory. I also have
>> two plots showing instruction
>> >>>>>>>>> counts over time (sim ticks). All of these are
>> linked at the end of the
>> >>>>>>>>> email.
>> >>>>>>>>> First, I'm going to go into what I've been able
>> to interpret
>> >>>>>>>>> regarding how instructions are destroyed. In
>> particular, comparing when
>> >>>>>>>>> DynInst's vs. DynInstPtr's are
>> deconstructed/removed from the cpu. I
>> >>>>>>>>> separate these because I've seen a difference,
>> as I discuss later. These
>> >>>>>>>>> explanations are fairly non-existent on the
>> wiki. There is a section
>> >>>>>>>>> header waiting to be filled...
>> >>>>>>>>> From what I have been able to gather from the
>> code, there is a list
>> >>>>>>>>> of all the instructions in flight in
>> cpu/o3/cpu.cc called instList, with
>> >>>>>>>>> the type DynInstPtr. There are three
>> conditions to instructions being
>> >>>>>>>>> cleaned from this list:
>> >>>>>>>>> 1.) The ROB retires its head instruction
>> >>>>>>>>> 2.) Fetch receives a rob squashing signal from
>> the commit,
>> >>>>>>>>> resulting in removing any instruction not in
>> the ROB
>> >>>>>>>>> 3.) Decode detects an incorrect branch
>> prediction, resulting in
>> >>>>>>>>> removal of all instructions back to the bad seq
>> num.
>> >>>>>>>>> Once all five stages have completed, the CPU
>> cleans up all the
>> >>>>>>>>> removed in-flight instructions. This line in
>> particular
>> >>>>>>>>> in cleanUpRemovedInsts() in cpu/o3/cpu.cc
>> deconstructs a DynInstPtr:
>> >>>>>>>>> instList.erase(removeList.front());
>> >>>>>>>>> When I turn on the debug flag O3CPU, I see the
>> message "Removing
>> >>>>>>>>> instruction, ..." (from o3/cpu.cc) with the
>> threadNum, seqNum and pcState
>> >>>>>>>>> after all 5 cpu stages have completed, and one
>> of the conditions above is
>> >>>>>>>>> met. I also see what tick it occurs on.
>> >>>>>>>>> When I turn on the DynInst debug flag, I see
>> when instructions are
>> >>>>>>>>> created and destroyed
>> (cpu/base_dyn_inst_impl.hh) and what tick. From
>> >>>>>>>>> analyzing the trace files, I've gathered that
>> this takes into account that
>> >>>>>>>>> instructions have different execution lengths.
>> So if one tick a memory
>> >>>>>>>>> instruction in the instList (DynInstPtr) is
>> removed, the DynInst for that
>> >>>>>>>>> memory instruction will occur much later (i.e.
>> 1M ticks later). I have yet
>> >>>>>>>>> to determine how this is implemented.
>> >>>>>>>>> Now for the problem.
>> >>>>>>>>> What I'm seeing when I run dramsim2 dram memory
>> is a significant
>> >>>>>>>>> difference between the size of the instList
>> vector (of DynInstPtr objects),
>> >>>>>>>>> and the size of dynamic instruction count (of
>> DynInst objects). The
>> >>>>>>>>> benchmark I'm running is libquantum from SPEC
>> 2006. For the first roughly
>> >>>>>>>>> 130B ticks, the dynamic instruction count kept
>> in cpu/base_dyn_inst.impl.hh
>> >>>>>>>>> shadows the instList size in o3/cpu.cc (figure
>> linked below) very closely.
>> >>>>>>>>> Around tick 130B after libquantum started, it
>> starts hitting what I'm
>> >>>>>>>>> assuming are loops (therefore branch
>> prediction), resulting in some
>> >>>>>>>>> behavior that seems to imply improper
>> instruction handling (i.e. more
>> >>>>>>>>> instructions in flight than allowed by ROB).
>> >>>>>>>>> I wasn't able to sync-up the physical and
>> dramsim2 traces exactly by
>> >>>>>>>>> trace, but they should represent roughly the
>> same area of execution. They
>> >>>>>>>>> don't execute the same due to the dramsim2
>> modeling the memory differently
>> >>>>>>>>> (i.e. latency and other delays).
>> >>>>>>>>> I've shared both traces on my public Dropbox
>> here --
>> >>>>>>>>>
>> >>>>>>>>>
>> http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
>> >>>>>>>>>
>> >>>>>>>>>
>> http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
>> >>>>>>>>> Here are a couple plots of tick versus
>> instruction count, with
>> >>>>>>>>> respect to cpu->instcount in
>> cpu/base_dyn_inst.impl.hh and instList.size()
>> >>>>>>>>> in cpu/o3/cpu.cc. --
>> >>>>>>>>>
>> >>>>>>>>>
>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
>> >>>>>>>>>
>> >>>>>>>>>
>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>> >>>>>>>>> Note that I added the printout of the instList
>> size to an existing
>> >>>>>>>>> O3CPU DPRINTF in cleanUpRemovedInsts() in
>> cpu/o3/cpu.cc.
>> >>>>>>>>> Here are the commands I ran to parse the traces
>> into data files to
>> >>>>>>>>> analyze in MATLAB and create the plots:
>> >>>>>>>>> zgrep DynInst
>> >>>>>>>>>
>> dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz |
>> grep destroyed
>> >>>>>>>>> | awk '{print $1,$11}' > cpuinstcount.out
>> >>>>>>>>> zgrep instList
>> >>>>>>>>>
>> dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz |
>> awk '{print
>> >>>>>>>>> $1,$11}' > instlistsize.out
>> >>>>>>>>> It seems to me like the problem might lie in
>> gem5, but has just been
>> >>>>>>>>> exposed by integrating this more detailed
>> memory model, dramsim2, into
>> >>>>>>>>> gem5. Either that, or there are some timing
>> errors in how dramsim2 was
>> >>>>>>>>> integrated. I doubt this, however, since those
>> first 190B ticks executed
>> >>>>>>>>> used the dramsim2 memory. I believe the
>> problem is a combination of memory
>> >>>>>>>>> instructions + complex loops (branch
>> prediction), resulting in improper
>> >>>>>>>>> destroying of instructions.
>> >>>>>>>>> I've included the ROB, Commit, Fetch, DynInst
>> and O3CPU debug flags.
>> >>>>>>>>> There are 192 ROB entries, which is why the
>> instList size generally has a
>> >>>>>>>>> max of about 192 instructions. The dynamic
>> instruction counts (seen in the
>> >>>>>>>>> dramsim2 plot) seem to also imply that
>> instructions are incorrectly being
>> >>>>>>>>> removed from the ROB, and then from the cpu's
>> instruction list in cpu.cc,
>> >>>>>>>>> which allows more and more instructions to be
>> added to the system (possibly
>> >>>>>>>>> from a bad branch).
>> >>>>>>>>> I appreciate any help in debugging this and
>> further figuring out the
>> >>>>>>>>> root problem, just let me know if you need
>> anything else from me. I don't
>> >>>>>>>>> have much more time at the moment to debug, but
>> I can take any advice for
>> >>>>>>>>> quick changes and/or additional traces, then
>> send the results back to the
>> >>>>>>>>> list for discussion.
>> >>>>>>>>> Thanks,
>> >>>>>>>>> Andrew
>> >>>>>>>>> P.S. Paul - I did try decreasing the size of
>> the dramsim2
>> >>>>>>>>> transaction (and even command) queue from 512
>> to 32. The same instructions
>> >>>>>>>>> problem occurred. It basically just decreased
>> the execution time.
>> >>>>>>>>>
>> >>>>>>>>> On Wed, Mar 14, 2012 at 2:10 PM, Ali Saidi
>> <***@umich.edu <mailto:***@umich.edu>> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> The error is that there are more than 1500
>> instructions currently
>> >>>>>>>>>> in flight in the system. It could mean several
>> things:
>> >>>>>>>>>>
>> >>>>>>>>>> 1. The value is somewhat arbitrarily defined
>> and maybe there are
>> >>>>>>>>>> more than 1500 in your system at one time?
>> >>>>>>>>>>
>> >>>>>>>>>> 2. Instructions aren't being destroyed correctly
>> >>>>>>>>>>
>> >>>>>>>>>> You could try to run a debug binary so
>> you'll get a list of
>> >>>>>>>>>> instructions when it happens or increase the
>> number which may
>> >>>>>>>>>> be appropriate for certain situations (but
>> 1500 is quite a few inflight
>> >>>>>>>>>> instructions).
>> >>>>>>>>>>
>> >>>>>>>>>> Ali
>> >>>>>>>>>>
>> >>>>>>>>>> On 13.03.2012 10:56, Andrew Cebulski wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> Hi Xiangyu,
>> >>>>>>>>>> I just started looking into this some more.
>> So at first I
>> >>>>>>>>>> thought it was due to updating to a more
>> recent revision, but then I went
>> >>>>>>>>>> back to revision 8643, added your patch, built
>> and ran....and now get the
>> >>>>>>>>>> error with it too (when running
>> ARM_FS/gem5.opt). I'm testing now to see
>> >>>>>>>>>> if an update to SWIG might have resulted in
>> this error, maybe someone on
>> >>>>>>>>>> the mailing list would know if that's
>> possible. The difference is 1.3.40
>> >>>>>>>>>> vs. 2.0.3, both of which are supported
>> according to the dependencies wiki
>> >>>>>>>>>> page.
>> >>>>>>>>>> Just for completeness, here's the error from
>> revision 8643:
>> >>>>>>>>>> build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
>> >>>>>>>>>> BaseDynInst::initVars() [with Impl =
>> O3CPUImpl]: Assertion `cpu->instcount
>> >>>>>>>>>> I have not tried running with gem5.debug, so
>> I will be doing
>> >>>>>>>>>> that today. Maybe this is an assertion that
>> is occurring due to an
>> >>>>>>>>>> optimization. That would mean it wouldn't be
>> triggered in gem5.debug since
>> >>>>>>>>>> it runs without optimizations. Have you
>> tested all debug, opt and fast
>> >>>>>>>>>> with your tests?
>> >>>>>>>>>> Thanks,
>> >>>>>>>>>> Andrew
>> >>>>>>>>>>
>> >>>>>>>>>> On Tue, Mar 13, 2012 at 1:37 PM, Rio Xiangyu
>> Dong <
>> >>>>>>>>>> ***@gmail.com
>> <mailto:***@gmail.com>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>> Hi Andrew,
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> I didn't see this error in my simulations.
>> May I ask which gem5
>> >>>>>>>>>>> version you are using? I find some of the
>> latest code updates do not comply
>> >>>>>>>>>>> with my changes. I am still using the
>> DRAMsim2 patch on Gem5 repo8643, and
>> >>>>>>>>>>> have run all the runnable benchmarks in
>> SPEC2006, SPEC2000, EEMBC2, and
>> >>>>>>>>>>> PARSEC2 on ARM_SE.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thank you!
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> Best,
>> >>>>>>>>>>>
>> >>>>>>>>>>> Xiangyu
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> *From:* Andrew Cebulski
>> [mailto:***@drexel.edu <mailto:***@drexel.edu>]
>> >>>>>>>>>>> *Sent:* Thursday, March 08, 2012 6:52 PM
>> >>>>>>>>>>>
>> >>>>>>>>>>> *To:* gem5 users mailing list
>> >>>>>>>>>>> *Cc:****@gmail.com
>> <mailto:***@gmail.com>; ***@umich.edu
>> <mailto:***@umich.edu>
>> >>>>>>>>>>>
>> >>>>>>>>>>> *Subject:* Re: [gem5-users] A Patch for
>> DRAMsim2 Integration
>> >>>>>>>>>>>
>> >>>>>>>>>>> Xiangyu,
>> >>>>>>>>>>>
>> >>>>>>>>>>> I've been having an issue recently with the
>> number of
>> >>>>>>>>>>> instructions I've been seeing committed to
>> the CPU (I have a separate
>> >>>>>>>>>>> thread on this). It turns out the issue
>> seems to be coming from this patch
>> >>>>>>>>>>> you created to integrate DramSim2 with Gem5.
>> Unfortunately, I've been
>> >>>>>>>>>>> running with gem5.fast, not gem5.opt. So up
>> until now, I haven't been
>> >>>>>>>>>>> seeing assertions. I thought I'd run it with
>> gem5.opt or debug back in
>> >>>>>>>>>>> December, but I must not have. My runs on
>> the Arm O3 cpu fails with this
>> >>>>>>>>>>> assertion:
>> >>>>>>>>>>>
>> >>>>>>>>>>> build/ARM/cpu/base_dyn_inst_impl.hh:149: void
>> >>>>>>>>>>> BaseDynInst::initVars() [with Impl =
>> O3CPUImpl]: Assertion `cpu->instcount
>> >>>>>>>>>>>
>> >>>>>>>>>>> -Andrew
>> >>>>>>>>>>>
Ali Saidi
2012-05-20 14:36:51 UTC
Permalink
Yes, there isn't any back pressure here, and there probably should be. The reason we've got away with it before is that the pressure normally comes from the CPU itself: the number of outstanding translations is limited by the size of the LSQ. That is why I find this strange. There is some code that is executing and must be mispredicting constantly, filling up the TLB with translations that then get squashed, only to do the exact same thing again. It seems like the second time through, the branch should be predicted correctly. The fastest way forward is probably to add the ability to squash all pending TLB translations (except maybe the one currently in progress) as you throw those instructions out of the LSQ on a mispredict. Make sense?
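
To make the squash idea concrete, here is a rough toy model in Python. It is not gem5 code: the class and method names (ToyWalkerQueue, request, squash, finish_active) are made up for illustration, and it assumes each queued translation remembers the sequence number of the instruction that asked for it.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class PendingTranslation:
    seq_num: int  # sequence number of the requesting instruction
    vaddr: int    # virtual address being translated


class ToyWalkerQueue:
    """Toy model of a table walker's request queue (hypothetical, not gem5)."""

    def __init__(self):
        self.active = None      # the walk currently in progress, if any
        self.pending = deque()  # walks queued behind the active one

    def request(self, seq_num, vaddr):
        req = PendingTranslation(seq_num, vaddr)
        if self.active is None:
            self.active = req
        else:
            self.pending.append(req)

    def squash(self, squash_seq_num):
        """Drop queued walks for instructions younger than squash_seq_num.

        The active walk is left alone and allowed to finish.
        Returns the number of queued walks that were dropped.
        """
        before = len(self.pending)
        self.pending = deque(
            r for r in self.pending if r.seq_num <= squash_seq_num
        )
        return before - len(self.pending)

    def finish_active(self):
        """Complete the active walk and start the next queued one."""
        done = self.active
        self.active = self.pending.popleft() if self.pending else None
        return done
```

In this sketch the LSQ would call squash() with the sequence number of the mispredicted branch at the same point where it throws out the younger loads and stores, so the queue cannot keep growing across repeated mispredicts.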

Ali





On May 15, 2012, at 8:48 AM, Gabe Black wrote:

> There's a limit on the size of the TLB itself, but there may not be a limit on the number of translations it's doing at one time. I suspect that's an important part of the problem.
>
> Gabe
>
>
Steve Reinhardt
2012-05-20 20:20:52 UTC
Permalink
It occurred to me that this behavior could be related to the thread we had
before about overlapping page table walks for the same page... Gabe and I
had already agreed that if you have a TLB miss to an address that lies
within the same page (at the minimum page size) as an ongoing walk, you
really should not fire off another walk, but that's what we currently do.
Adding that suppression of redundant walks wouldn't directly fix this
problem, but it probably would mitigate the effects.
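
A minimal sketch of that suppression, written as a toy Python model rather than actual gem5 code. The names (CoalescingWalker, miss, walk_done) are hypothetical, and the 4 KiB minimum page size is an assumption.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB minimum page size


class CoalescingWalker:
    """Toy model: at most one walk per virtual page (hypothetical, not gem5)."""

    def __init__(self):
        # virtual page number -> addresses waiting on that page's walk
        self.walks = {}

    def miss(self, vaddr):
        """Record a TLB miss. Returns True only if a new walk was started."""
        vpn = vaddr >> PAGE_SHIFT
        if vpn in self.walks:
            # A walk for this page is already in flight: just wait on it.
            self.walks[vpn].append(vaddr)
            return False
        self.walks[vpn] = [vaddr]
        return True

    def walk_done(self, vpn):
        """Finish the walk for a page and return every address it satisfies."""
        return self.walks.pop(vpn, [])
```

With the addresses from Andrew's trace, misses to 0x0, 0x4 and 0x8 would share a single walk under this scheme, while 0x2508c would start its own.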

As far as the problem itself: even if the TLB itself has no bandwidth
limit, there should definitely be a limit on the number of concurrent page
walks, probably something in the low single digits. With a hardware table
walker, there is hardware state associated with each ongoing walk, and the
number of copies of this state to allow concurrent walks is very limited.
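
As a rough illustration of such a limit, here is a toy Python model (not gem5 code) in which the walker has a fixed number of slots and extra requests wait their turn. The class name and the default of two slots are assumptions picked only to match "low single digits".

```python
from collections import deque


class BoundedWalker:
    """Toy model of a walker with a fixed number of walk slots (hypothetical)."""

    def __init__(self, max_walks=2):
        self.max_walks = max_walks  # assumed small hardware limit
        self.in_flight = []         # walks that currently own a slot
        self.waiting = deque()      # requests waiting for a free slot

    def start(self, vaddr):
        if len(self.in_flight) < self.max_walks:
            self.in_flight.append(vaddr)
        else:
            self.waiting.append(vaddr)

    def finish(self, vaddr):
        """Free the slot held by vaddr and hand it to the oldest waiter."""
        self.in_flight.remove(vaddr)
        if self.waiting:
            self.in_flight.append(self.waiting.popleft())
```

Note that queueing inside the walker only bounds the walker's own state. Unless the CPU is also told to stop issuing, the waiting list here can still grow without limit.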

Steve

On Sun, May 20, 2012 at 7:36 AM, Ali Saidi <***@umich.edu> wrote:

> Yes, there isn't any back pressure here, and there probably should be.
> The reason we've got away with it before is that the pressure normally
> comes from the CPU itself: the number of outstanding translations is
> limited by the size of the LSQ. That is why I find this strange.
> There is some code that is executing and must be mispredicting constantly,
> filling up the TLB with translations that then get squashed, only to do the
> exact same thing again. It seems like the second time through, the branch
> should be predicted correctly. The fastest way forward is probably to add
> the ability to squash all pending TLB translations (except maybe the one
> currently in progress) as you throw those instructions out of the LSQ on a
> mispredict. Make sense?
>
> Ali
>
>
>
>
>
> On May 15, 2012, at 8:48 AM, Gabe Black wrote:
>
> > There's a limit on the size of the TLB itself, but there may not be a
> limit on the number of translations it's doing at one time. I suspect
> that's an important part of the problem.
> >
> > Gabe
> >
> >
>
> _______________________________________________
> gem5-users mailing list
> gem5-***@gem5.org
> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
>
Gabe Black
2012-05-21 05:25:36 UTC
Permalink
This sounds about right. The CPU interfaces with the TLB which uses the
page table walker behind the scenes. How can the TLB tell the CPU to
hold off because it's busy and to try again later? It would need to do
that on behalf of the table walker.
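
One possible shape for such a handshake, sketched under assumptions: the walker accepts at most a small fixed number of concurrent walks, refuses further requests, and later calls back the refused requester once a slot frees up. This is similar in spirit to a retry protocol, but BoundedWalker, start_walk, finish_walk and the on_retry callback are hypothetical names, not gem5 interfaces.

```python
class BoundedWalker:
    """Toy model of a walker with limited concurrency and a retry callback."""

    def __init__(self, max_walks=2):
        self.max_walks = max_walks    # assumed small, in the low single digits
        self.active = set()           # addresses with a walk in progress
        self.retry_callbacks = []     # requesters that were told to hold off

    def start_walk(self, vaddr, on_retry):
        """Try to start a walk. Returns False if the requester must wait."""
        if len(self.active) >= self.max_walks:
            # Busy: remember who to wake up when a slot becomes free.
            self.retry_callbacks.append(on_retry)
            return False
        self.active.add(vaddr)
        return True

    def finish_walk(self, vaddr):
        """Finish a walk and wake the oldest stalled requester, if any."""
        self.active.discard(vaddr)
        if self.retry_callbacks:
            callback = self.retry_callbacks.pop(0)
            callback()
```

In this sketch the TLB would forward the False return value to the CPU, which stalls the access and reissues it when its callback fires.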

Gabe

On 05/20/12 13:20, Steve Reinhardt wrote:
> It occurred to me that this behavior could be related to the thread we
> had before about overlapping page table walks for the same page...
> Gabe and I had already agreed that if you have a TLB miss to an
> address that lies within the same page (at the minimum page size) as
> an ongoing walk, you really should not fire off another walk, but
> that's what we currently do. Adding that suppression of redundant
> walks wouldn't directly fix this problem, but it probably would
> mitigate the effects.
>
> As far as the problem itself: even if the TLB itself has no bandwidth
> limit, there should definitely be a limit on the number of concurrent
> page walks, probably something in the low single digits. With a
> hardware table walker, there is hardware state associated with each
> ongoing walk, and the number of copies of this state to allow
> concurrent walks is very limited.
>
> Steve
>
> On Sun, May 20, 2012 at 7:36 AM, Ali Saidi <***@umich.edu> wrote:
>
> > Yes, there isn't any back pressure here and there probably/maybe
> > should be. The reason we've got away with it before is that normally
> > the pressure comes from the CPU itself: the number of outstanding
> > translations is limited by the size of the LSQ. That is why I find
> > this strange. There is some code that is executing and must be
> > mispredicting constantly to fill up the TLB with translations that
> > then get squashed, only to do the exact same thing again. It seems
> > like the second time through the branch should be correct. The
> > fastest way forward is probably to add the ability to squash all
> > pending TLB translations (except maybe the one that is currently
> > going on) as you throw those instructions out of the LSQ on a
> > mispredict. Make sense?
> >
> > Ali
> >
> > On May 15, 2012, at 8:48 AM, Gabe Black wrote:
> >
> > > There's a limit on the size of the TLB itself, but there may not
> > > be a limit on the number of translations it's doing at one time. I
> > > suspect that's an important part of the problem.
> > >
> > > Gabe
>
Ali Saidi
2012-09-07 20:03:29 UTC
Permalink
Hi Andrew,

I think that http://reviews.gem5.org/r/1402/ [55] will
fix the issue you were seeing. It's not a complete solution (we still
could use a mechanism to backpressure the CPU), but this solves part of
the issue.

Thanks,

Ali

On 15.05.2012 02:44, Andrew Cebulski wrote:


> Here is the latest in my debugging:
> http://dl.dropbox.com/u/2953302/gem5/pendingQueuePushPop.png [53]
> The frequency of occurrence of the doL2DescriptorWrapper function (where I
> was seeing invalid faults) actually controls the size of the pendingQueue.
> What I'm showing is where the pendingQueue size is increased (with a
> push_back) and where it is decreased (pop_front). I put my DPRINTF for the
> decrease at the end of the doL2DescriptorWrapper function in
> table_walker.cc. This is actually right after a function call to nextWalk,
> which schedules a process event (doProcessEvent aka processWalkWrapper())
> for the next tick, which is where the pop of the pendingQueue occurs.
> My first bin is large, just to show how the push/pop rate roughly averages
> out at the start of the plot (there is still imbalance...just smaller
> grained). Push_backs aren't visible in some bins because there are fewer
> than 20 in those bins. Note how the difference between the push/pop is
> roughly the peak of each rise/fall. I'm still trying to debug why the
> imbalance in the pendingQueue/L2 function calls is occurring...namely at
> the changes from rise/fall in the size, but I seem to be narrowing down on
> it.
>
> Basically, it looks like there isn't a limit in place for the size of the
> TLB, therefore no stalls are being sent to stop more TLB transactions from
> initiating. The invalid accesses are likely a result of this too. Looking
> more closely in my traces, it looks like the L2 descriptor invalid errors
> start occurring once the pendingQueue increases above roughly 8 entries.
>
> Here are the sizes of each bin (N1 is the push_back, N2 the L2 function):
>
> Bin    1    2    3    4    5    6     7     8     9   10
> N1   867   11  388   11  775    3  1535    17  2127    0
> N2   788  205   95  300  189  588   376  1184   751    1
>
> -Andrew
>
> On Mon, May 14, 2012 at 11:15 AM, Andrew Cebulski <***@drexel.edu [54]> wrote:
>
>> Ali,
>> Looking at the trace file for the TLB walker that I sent earlier, I see a
>> considerable number of these faults:
>> L2 descriptor invalid, causing fault
>> This is within the doL2Descriptor function in table_walker.cc.
>> Here's a look at the frequency of these faults, with bins centered
>> around the base of each rise/fall of the pendingQueue size (see small
>> arrows on x-axis):
>> http://dl.dropbox.com/u/2953302/gem5/L2faults.png [51]
>> I'm still looking into how this fault is handled, along with your other
>> questions. I probably won't have much of a chance to get into it more
>> until late today or tomorrow though. Let me know if you have any new
>> ideas based on these results.
>> Thanks,
>> Andrew
>>
>> On Fri, May 11, 2012 at 12:17 AM, Ali Saidi <***@umich.edu [52]> wrote:
>>
>>> Hi Andrew,
>>>
>>> Looking at the trace it seems like there are a lot of invalid
>>> translations that are occurring. Everything to an address less than
>>> 0x1000 is likely invalid. An invalid translation will return a fault
>>> (setting the fault pointer in the dynamic instruction to something
>>> other than NoFault), and the instruction will either be squashed by a
>>> mispredicted branch or redirect fetch to a kernel handler. I'm
>>> wondering if that isn't happening for some reason. You need to trace
>>> back some of these translations, see what the instruction serial
>>> number is for them, and then see what the instruction's lifetime is
>>> like. Are they getting squashed? Looking at your graph, when the
>>> instructions fall to 0, what is the cause? Does an interrupt occur
>>> right before? Something else?
>>>
>>> Thanks,
>>>
>>> Ali
>>>
>>> On 07.05.2012 20:53, Andrew Cebulski wrote:
>>>
>>>> Hi Ali and Gabe,
>>>> Here's the trace file:
>>>> http://dl.dropbox.com/u/2953302/gem5/table_walker.out [46]
>>>> The pending queue size in the table walker follows the shape of the
>>>> dynamic instruction curves. The L1 and L2 queue sizes never go above
>>>> 0. Comparing DynInst count in cpu->instcount with pendingQueue size:
>>>> http://dl.dropbox.com/u/2953302/gem5/pendingQueueSize.png [47]
>>>>
>>>> -Andrew
>>>>
>>>> On Sun, May 6, 2012 at 12:01 PM, Ali Saidi <***@umich.edu [48]> wrote:
>>>>
>>>>> Hi Andrew,
>>>>>
>>>>> Could you add some code to the table walker to see how big the
>>>>> following are getting:
>>>>> stateQueueL1.size()
>>>>> stateQueueL2.size()
>>>>> pendingQueue.size()
>>>>>
>>>>> Perhaps we're somehow getting into a loop where there are a lot of
>>>>> translations to invalid addresses that get squashed and they pile up
>>>>> in the table walker?
>>>>>
>>>>> Thanks,
>>>>> Ali
>>>>>
>>>>> On May 4, 2012, at 7:53 AM, Gabriel Michael Black wrote:
>>>>>
>>>>> > I haven't had a chance to study what's going on here, but could the
>>>>> > problem be that we don't have bandwidth limits/back pressure
>>>>> > implemented for the TLB and delayed translation? It could be that the
>>>>> > CPU is pumping instructions into translation which eventually drain
>>>>> > out/are squashed, and if too many accumulate they trip that assert.
>>>>> >
>>>>> > That may not actually make any sense as far as what the code is
>>>>> > actually doing, but it occurred to me as a possibility and I thought
>>>>> > I'd throw it out there.
>>>>> >
>>>>> > Gabe
>>>>> >
>>>>> > Quoting Andrew Cebulski <***@drexel.edu [1]>:
>>>>> >
>>>>> >> I double-checked by looking at the config.ini file. It turns out I did
>>>>> >> actually create the checkpoint with an Atomic CPU without caches. Sorry
>>>>> >> for the confusion.
>>>>> >>
>>>>> >> -Andrew
>>>>> >>
>>>>> >> On Wed, May 2, 2012 at 10:12 PM, Andrew Cebulski <***@drexel.edu [2]> wrote:
>>>>> >>
>>>>> >>> I started hitting this assertion (that the number of insts in flight was >
>>>>> >>> 1500) before I started using a checkpoint. I created the checkpoint
>>>>> >>> afterwards to decrease the time needed to run simulations to debug this
>>>>> >>> problem. I'll create a new checkpoint, then send the new trace output.
>>>>> >>>
>>>>> >>> -Andrew
>>>>> >>>
>>>>> >>>
>>>>> >>> On Wed, May 2, 2012 at 9:53 PM, Ali Saidi <***@umich.edu [3]> wrote:
>>>>> >>>
>>>>> >>>> It's likely the cause for all of your problems. Dirty data in the caches
>>>>> >>>> doesn't get restored either. You should always create checkpoints with an
>>>>> >>>> atomic cpu and without caches.
>>>>> >>>>
>>>>> >>>> Ali
>>>>> >>>>
>>>>>
>>>>
>>>>> >>>>
>>>>> >>>> On 02.05.2012 21:23, Andrew Cebulski wrote:
>>>>> >>>>
>>>>> >>>> Sorry, I created the checkpoint I referred to with an O3 CPU with caches.
>>>>> >>>> From what I recall reading, caches don't get restored from checkpoints.
>>>>> >>>> Since the checkpoint wasn't during the benchmark run, I assumed that was
>>>>> >>>> okay.
>>>>> >>>> -Andrew
>>>>> >>>>
>>>>> >>>> On Wed, May 2, 2012 at 9:07 PM, Ali Saidi <***@umich.edu [4]> wrote:
>>>>> >>>>
>>>>> >>>>> You haven't answered the question about if you created the checkpoints
>>>>> >>>>> with an atomic cpu without caches.
>>>>> >>>>>
>>>>> >>>>> Ali
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>>
>>>>>
>>>>> >>>>>
>>>>> >>>>> On 02.05.2012 19:58, Andrew Cebulski wrote:
>>>>> >>>>>
>>>>> >>>>> I have not run with the checker CPU recently. Here's the stderr output
>>>>> >>>>> from a run I did a while back:
>>>>> >>>>> http://dl.dropbox.com/u/2953302/gem5/err.0 [5]
>>>>> >>>>> Note that the instruction match error is before my benchmark actually
>>>>> >>>>> starts running. The start of my boot script checks to see if my files
>>>>> >>>>> image is mounted (which it is), then continues on to run the benchmark. I
>>>>> >>>>> booted the system, mounted my files image, then took a checkpoint. I've
>>>>> >>>>> been running all my tests from that checkpoint. I found where my benchmark
>>>>> >>>>> started based on the ASID (from ExecAsid debug flag).
>>>>> >>>>> I delayed the start of gathering trace data until the second-to-last
>>>>> >>>>> linear increase in dynamic instructions in-flight. I'm running a new trace
>>>>> >>>>> now.
>>>>> >>>>> -Andrew
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> On Wed, May 2, 2012 at 5:28 PM, Ali Saidi <***@umich.edu [6]> wrote:
>>>>> >>>>>
>>>>> >>>>>> Something is wrong well before this point. There is no reason that
>>>>> >>>>>> address 0x0 or 0x4 should be translated.
>>>>> >>>>>>
>>>>> >>>>>> Did you happen to create a checkpoint when caches were in the system?
>>>>> >>>>>>
>>>>> >>>>>> Have you tried to run with the checker cpu and see if it detects any
>>>>> >>>>>> errors?
>>>>> >>>>>>
>>>>> >>>>>> Ali
>>>>> >>>>>>
>>>>> >>>>>>
>>>>>
>>>>>>
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> >>>>>> On 02.05.2012 17:22, Andrew Cebulski wrote:
>>>>> >>>>>>
>>>>> >>>>>> They are data TLB misses that occur as the in-flight instruction count
>>>>> >>>>>> rises (at 0x0 and 0x4). The last TLB miss before the in-flight instruction
>>>>> >>>>>> count finally linearly decreases is to 0x200. Also, at the start of the
>>>>> >>>>>> rising slope, I see a miss to 0x8 and 0x2508c.
>>>>> >>>>>> Here's a trace file:
>>>>> >>>>>> http://dl.dropbox.com/u/2953302/gem5/tlb.out [7]
>>>>> >>>>>> To reduce size, I just have lines that have either TLB or walker in
>>>>> >>>>>> them.
>>>>> >>>>>> I do see only a handful of instruction TLB misses.
>>>>> >>>>>> -Andrew
>>>>> >>>>>>
>>>>> >>>>>> On Wed, May 2, 2012 at 11:10 AM, Ali Saidi <***@umich.edu [8]> wrote:
>>>>> >>>>>>
>>>>> >>>>>>> Hi Andrew,
>>>>> >>>>>>>
>>>>> >>>>>>> Thanks for digging into this. I think there is an issue somewhere, but
>>>>> >>>>>>> I'm still not sure where.
>>>>> >>>>>>>
>>>>> >>>>>>> Ali
>>>>> >>>>>>>
>>>>> >>>>>>> On 01.05.2012
23:34, Andrew Cebulski wrote:
>>>>> >>>>>>>
>>>>> >>>>>>> Okay, I'm
positive now that the issue lies with delayed translations
>>>>> >>>>>>>
that are squashed before finishing.
>>>>> >>>>>>>
>>>>> >>>>>>> On the
data on instruction side? You seem to allude to data in the
>>>>>
>>>>>>> paragraph below, but then instructions in the latter text.
>>>>>
>>>>>>>
>>>>> >>>>>>> It seems to me like speculative load/stores are
being executed,
>>>>> >>>>>>> rather than waiting for the instructions
to commit. Once the instructions
>>>>> >>>>>>> begin getting
(speculatively) executed in the TLB, a reference is left
>>>>> >>>>>>>
there, which seems hard to root out and dereference after the
instruction
>>>>> >>>>>>> ends up being squashed. At least, I have not
been able to find that out in
>>>>> >>>>>>> the source code as of yet.
Can anyone clarify on this?
>>>>> >>>>>>>
>>>>> >>>>>>>
>>>>>
>>>>>>>
>>>>> >>>>>>> There should only be one translation outstanding
from each
>>>>> >>>>>>> instruction and data side walker. Any nested
transactions should be queued
>>>>> >>>>>>> in the walker. Until one
finishes, I'm not sure how multiple would ever be
>>>>> >>>>>>>
outstanding.
>>>>> >>>>>>>
>>>>> >>>>>>> Recall the following image that
shows how the number of dynamic
>>>>> >>>>>>> instruction (DynInst)
objects in-flight increases linearly for varying
>>>>> >>>>>>> periods
of time:
>>>>> >>>>>>>
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[9]
>>>>> >>>>>>> After enabling the TLB debug flag, I see that the
linear increase in
>>>>> >>>>>>> instructions in flight is proportional
to the number of TLB misses. These
>>>>> >>>>>>> TLB misses have a much
larger delay (resulting in translation delays) due
>>>>> >>>>>>> to the
fact the DramSim2 models the memory system more accurately. It
>>>>>
>>>>>>> seems that with the classic memory system, TLB misses often do
not have
>>>>> >>>>>>> translation delays. For whatever reason, it would
also seem that every
>>>>> >>>>>>> instruction that has a TLB miss also
is eventually squashed...
>>>>> >>>>>>>
>>>>> >>>>>>> From a data side
perspective this is reasonable. While a miss is
>>>>> >>>>>>>
outstanding at some point instructions will stop committing and thus
the
>>>>> >>>>>>> instructions in flight will begin to rise until the
miss is satisfied.
>>>>> >>>>>>>
>>>>> >>>>>>> Here's a summary of
outputs from my trace. These two DPRINTF
>>>>> >>>>>>> messages appears
on the rising slopes (repeated up until the peak):
>>>>> >>>>>>> TLB
Miss: Starting hardware table walker for 0(656)
>>>>> >>>>>>> TLB Miss:
Starting hardware table walker for 0x4(656)
>>>>> >>>>>>>
>>>>> >>>>>>>
This is interesting/odd. I don't know a good reason why (1) a miss
>>>>>
>>>>>>> would be outstanding to both address 0 and address 4 at the same
time. In
>>>>> >>>>>>> almost all cases these pages are marked as
no-access to detect segfaults.
>>>>> >>>>>>> Perhaps there is an issue
where the cpu is getting into a loop faulting on
>>>>> >>>>>>> a bad
access and then faulting again on the fault handler. I could
imagine
>>>>> >>>>>>> this would happen if there was some corruption in
the memory system (for
>>>>> >>>>>>> example the timings in dramsim
exposing a bug in the cache models or
>>>>> >>>>>>> something).
>>>>>
>>>>>>>
>>>>> >>>>>>>
>>>>> >>>>>>> At the peak, the following message
appears (from fetch) almost every
>>>>> >>>>>>> tick for (what I believe
to be) every single one of the table walkers that
>>>>> >>>>>>> were
squashed.
>>>>> >>>>>>> Fetch is waiting ITLB walk to finish!
>>>>>
>>>>>>>
>>>>> >>>>>>> There must be another walk in flight? The
instruction side will only
>>>>> >>>>>>> have one fault outstanding at
once. Successive branch mispredicts will
>>>>> >>>>>>> re-direct fetch
but there is code that catches the fact that a different
>>>>> >>>>>>>
walk completed then expected and "does the right thing."
>>>>>
>>>>>>>
>>>>> >>>>>>> The problem is that these ITLB table walks are for
instructions that
>>>>> >>>>>>> were squashed as much as 0.3 billion
cycles earlier, and since been removed
>>>>> >>>>>>> from the CPU's
instruction list.
>>>>> >>>>>>>
>>>>> >>>>>>> I'm not following
here.
>>>>> >>>>>>>
>>>>> >>>>>>> Any help will be greatly appreciated
in solving this problem. I've
>>>>> >>>>>>> hit a roadblock with getting
Ruby working with ARM, most likely due to the
>>>>> >>>>>>> fact that
ARM has disjoint memory (x86 and Alpha do not). There's the 256
>>>>>
>>>>>>> MB for physical memory, then the 64 MB for the boot loader. I
brought this
>>>>> >>>>>>> up in my last email about trying to get Ruby
working. Therefore, I'm
>>>>> >>>>>>> trying to get this DramSim2
integration fixed so I can start modeling FS
>>>>> >>>>>>> with DRAM
memory.
>>>>> >>>>>>>
>>>>> >>>>>>> Brad/Steve/Nilay anyone have a
suggestion on how to make this work?
>>>>> >>>>>>>
>>>>> >>>>>>>
>>>>>
>>>>>>> Note that these problems also occur in Soplex from the Spec
CPU2006
>>>>> >>>>>>> benchmark suite (also hits 1500 in-flight
instructions assertion). Due to
>>>>> >>>>>>> time constraints, I
haven't tested on other benchmarks.
>>>>> >>>>>>> Thanks,
>>>>> >>>>>>>
Andrew
>>>>> >>>>>>> On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski
<***@drexel.edu [10]>wrote:
>>>>> >>>>>>>
>>>>> >>>>>>>> Hey
Gabe,
>>>>> >>>>>>>> Thanks for this...very helpful. I just recently got
back into
>>>>> >>>>>>>> debugging this problem. I made a small change
in src/base/refcnt.hh to
>>>>> >>>>>>>> allow me to return the current
count of references to a DynInst object.
>>>>> >>>>>>>> I then modified
existing DPRINTFs to also print out reference
>>>>> >>>>>>>> counts,
then added some of my own when I needed extra visibility.
>>>>> >>>>>>>>
I've found one memory store instruction that seems to be getting
>>>>>
>>>>>>>> lost. What's happening is that is progresses as far as getting
executed in
>>>>> >>>>>>>> the IEW once, but a delayed translation
occurs, deferring the store. By
>>>>> >>>>>>>> the time it reenters the
IEW, the IQ has marked the instruction as
>>>>> >>>>>>>> squashed.
Everything progresses as usual from here on out, with one
>>>>> >>>>>>>>
exception. When the instruction is removed from the CPUs instruction
list,
>>>>> >>>>>>>> there is one reference count hanging.
>>>>>
>>>>>>>> I've added in some additional debugging for my traces to
help
>>>>> >>>>>>>> narrow down where this reference is coming from. As
far as I can tell,
>>>>> >>>>>>>> it's because of a call to
initiateAcc() within the executeStore function in
>>>>> >>>>>>>> the lsq
unit. Please see the following two traces. The first trace shows
>>>>>
>>>>>>>> what I just discussed. The second trace is another memory
store
>>>>> >>>>>>>> instruction that got squashed, however, it was
squashed upon its first
>>>>> >>>>>>>> entry into the IEW, therefore it
never started execution.
>>>>> >>>>>>>>
http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out [11]
>>>>>
>>>>>>>> http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
[12]
>>>>> >>>>>>>> Let me know if you have any ideas based on these two
instruction
>>>>> >>>>>>>> traces. I do not understand how the
initiateAcc function results in
>>>>> >>>>>>>> another reference, but
maybe someone else does.... Since I don't see how
>>>>> >>>>>>>> it
makes a reference, it's hard to find out how to make sure it gets
>>>>>
>>>>>>>> dereferenced...
>>>>> >>>>>>>> Unfortunately, I haven't been
able to add a DPRINTF in
>>>>> >>>>>>>> src/base/refcnt.hh ...this would
make things more clear (i.e. exactly when
>>>>> >>>>>>>>
references/deferences occur). Let me know if you have any advice
on
>>>>> >>>>>>>> this...if it's possible. I can't seem to get the right
include files, and
>>>>> >>>>>>>> likely right SConscript compile
order...
>>>>> >>>>>>>> Thanks,
>>>>> >>>>>>>> Andrew
>>>>>
>>>>>>>>
>>>>> >>>>>>>>
>>>>> >>>>>>>> On Sat, Apr 7, 2012 at 9:48 PM, Gabe Black <***@eecs.umich.edu [13]> wrote:
>>>>> >>>>>>>>
>>>>> >>>>>>>>> Without digging into things too deeply, it looks like you may be
>>>>> >>>>>>>>> leaking references to dynamic instructions. The CPU may think it's done
>>>>> >>>>>>>>> with one, but until that final reference is removed, the object will hang
>>>>> >>>>>>>>> around forever. I think I've had problems before where the reference
>>>>> >>>>>>>>> count ended up off by one somehow and instructions would start piling up.
>>>>> >>>>>>>>> It's also possible that a clog develops in O3's pipeline and some internal
>>>>> >>>>>>>>> structure stops letting instructions through and starts accumulating them.
>>>>> >>>>>>>>> Either of these problems will be annoying to track down, but with enough
>>>>> >>>>>>>>> digging I've been able to fix these sorts of things.
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> This may have more to do with O3 not handling the benchmark you're
>>>>> >>>>>>>>> running well rather than a problem with your new DRAM model. There may be
>>>>> >>>>>>>>> some interaction between the two, though, where the new memory makes the
>>>>> >>>>>>>>> timing line up to cause O3 to behave poorly. What you can do is instrument
>>>>> >>>>>>>>> dynamic instruction creation and destruction and reference counting (try
>>>>> >>>>>>>>> printing "this" for both the reference counting wrapper and the dyn inst
>>>>> >>>>>>>>> itself) and turn it on as close as you can to where things go bad
>>>>> >>>>>>>>> tick-wise. Then look for an instruction which gets lost, and look for
>>>>> >>>>>>>>> where its reference count is incremented and decremented. It should be
>>>>> >>>>>>>>> relatively easy to pair up where references are created and destroyed,
>>>>> >>>>>>>>> and you should be able to identify the reference which never goes away.
>>>>> >>>>>>>>> Then you need to figure out where that reference is being created. After
>>>>> >>>>>>>>> that, you should have enough information to identify why the reference
>>>>> >>>>>>>>> counting isn't being done correctly. It's arduous, but that's the only
>>>>> >>>>>>>>> way.
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> It's important to also make sure reference counts aren't decremented
>>>>> >>>>>>>>> to zero prematurely. I had a problem once where that happened and the
>>>>> >>>>>>>>> memory behind the object was updated by something that didn't know it was
>>>>> >>>>>>>>> dead. The memory had since been reallocated to another object of the same
>>>>> >>>>>>>>> type, so that other object reflected what happened to the phantom one. If I
>>>>> >>>>>>>>> remember, that manifested as something weird like an add causing a page
>>>>> >>>>>>>>> fault or something.
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> Gabe
>>>>> >>>>>>>>>
>>>>>
>>>>>>>>>
>>>>> >>>>>>>>> On 04/07/12 18:21, Andrew Cebulski
wrote:
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> Hi all,
>>>>> >>>>>>>>> I've
looked into this problem some more, and have put together a
>>>>>
>>>>>>>>> couple traces. I've been becoming more familiar with how gem5
handles
>>>>> >>>>>>>>> dynamic instructions, in particular how it
destroys them. I have two
>>>>> >>>>>>>>> traces to compare, one with
the physical memory, and the other with the
>>>>> >>>>>>>>> integrated
dramsim2 dram memory. I also have two plots showing instruction
>>>>>
>>>>>>>>> counts over time (sim ticks). All of these are linked at the
end of the
>>>>> >>>>>>>>> email.
>>>>> >>>>>>>>> First, I'm going to go
into what I've been able to interpret
>>>>> >>>>>>>>> regarding how
instructions are destroyed. In particular, comparing when
>>>>>
>>>>>>>>> DynInst's vs. DynInstPtr's are deconstructed/removed from the
cpu. I
>>>>> >>>>>>>>> separate these because I've seen a difference, as
I discuss later. These
>>>>> >>>>>>>>> explanations are fairly
non-existent on the wiki. There is a section
>>>>> >>>>>>>>> header
waiting to be filled...
>>>>> >>>>>>>>> From what I have been able to gather from the code, there is a list
>>>>> >>>>>>>>> of all the instructions in flight in cpu/o3/cpu.cc called instList, with
>>>>> >>>>>>>>> the type DynInstPtr. There are three conditions to instructions being
>>>>> >>>>>>>>> cleaned from this list:
>>>>> >>>>>>>>> 1.) The ROB retires its head instruction
>>>>> >>>>>>>>> 2.) Fetch receives a rob squashing signal from the commit,
>>>>> >>>>>>>>> resulting in removing any instruction not in the ROB
>>>>> >>>>>>>>> 3.) Decode detects an incorrect branch prediction, resulting in
>>>>> >>>>>>>>> removal of all instructions back to the bad seq num.
>>>>> >>>>>>>>> Once all five stages have completed, the CPU cleans up all the
>>>>> >>>>>>>>> removed in-flight instructions. This line in particular
>>>>> >>>>>>>>> in cleanUpRemovedInsts() in cpu/o3/cpu.cc deconstructs a DynInstPtr:
>>>>> >>>>>>>>> instList.erase(removeList.front());
>>>>> >>>>>>>>> When I turn on the
debug flag O3CPU, I see the message "Removing
>>>>> >>>>>>>>>
instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum and
pcState
>>>>> >>>>>>>>> after all 5 cpu stages have completed, and one
of the conditions above is
>>>>> >>>>>>>>> met. I also see what tick it
occurs on.
>>>>> >>>>>>>>> When I turn on the DynInst debug flag, I see
when instructions are
>>>>> >>>>>>>>> created and destroyed
(cpu/base_dyn_inst_impl.hh) and what tick. From
>>>>> >>>>>>>>>
analyzing the trace files, I've gathered that this takes into account
that
>>>>> >>>>>>>>> instructions have different execution lengths. So
if one tick a memory
>>>>> >>>>>>>>> instruction in the instList
(DynInstPtr) is removed, the DynInst for that
>>>>> >>>>>>>>> memory
instruction will occur much later (i.e. 1M ticks later). I have
yet
>>>>> >>>>>>>>> to determine how this is implemented.
>>>>>
>>>>>>>>> Now for the problem.
>>>>> >>>>>>>>> What I'm seeing when I
run dramsim2 dram memory is a significant
>>>>> >>>>>>>>> difference
between the size of the instList vector (of DynInstPtr objects),
>>>>>
>>>>>>>>> and the size of dynamic instruction count (of DynInst
objects). The
>>>>> >>>>>>>>> benchmark I'm running is libquantum from
SPEC 2006. For the first roughly
>>>>> >>>>>>>>> 130B ticks, the dynamic
instruction count kept in cpu/base_dyn_inst.impl.hh
>>>>> >>>>>>>>>
shadows the instList size in o3/cpu.cc (figure linked below) very
closely.
>>>>> >>>>>>>>> Around tick 130B after libquantum started, it
starts hitting what I'm
>>>>> >>>>>>>>> assuming are loops (therefore
branch prediction), resulting in some
>>>>> >>>>>>>>> behavior that
seems to imply improper instruction handling (i.e. more
>>>>> >>>>>>>>>
instructions in flight than allowed by ROB).
>>>>> >>>>>>>>> I wasn't
able to sync-up the physical and dramsim2 traces exactly by
>>>>>
>>>>>>>>> trace, but they should represent roughly the same area of
execution. They
>>>>> >>>>>>>>> don't execute the same due to the
dramsim2 modeling the memory differently
>>>>> >>>>>>>>> (i.e. latency
and other delays).
>>>>> >>>>>>>>> I've shared both traces on my public
Dropbox here --
>>>>> >>>>>>>>>
>>>>> >>>>>>>>>
http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
[14]
>>>>> >>>>>>>>>
>>>>> >>>>>>>>>
http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
[15]
>>>>> >>>>>>>>> Here are a couple plots of tick versus instruction
count, with
>>>>> >>>>>>>>> respect to cpu->instcount in
cpu/base_dyn_inst.impl.hh and instList.size()
>>>>> >>>>>>>>> in
cpu/o3/cpu.cc. --
>>>>> >>>>>>>>>
>>>>> >>>>>>>>>
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
[16]
>>>>> >>>>>>>>>
>>>>> >>>>>>>>>
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[17]
>>>>> >>>>>>>>> Note that I added the printout of the instList size
to an existing
>>>>> >>>>>>>>> O3CPU DPRINTF in cleanUpRemovedInsts() in
cpu/o3/cpu.cc.
>>>>> >>>>>>>>> Here are the commands I ran to parse the traces into data files to
>>>>> >>>>>>>>> analyze in MATLAB and create the plots:
>>>>> >>>>>>>>> zgrep DynInst dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | grep destroyed | awk '{print $1,$11}' > cpuinstcount.out
>>>>> >>>>>>>>> zgrep instList dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | awk '{print $1,$11}' > instlistsize.out
>>>>> >>>>>>>>> It
seems to me like the problem might lie in gem5, but has just been
>>>>>
>>>>>>>>> exposed by integrating this more detailed memory model,
dramsim2, into
>>>>> >>>>>>>>> gem5. Either that, or their are some
timing errors in how dramsim2 was
>>>>> >>>>>>>>> integrated. I doubt
this, however, since those first 190B ticks executed
>>>>> >>>>>>>>>
used the dramsim2 memory. I believe the problem is a combination of
memory
>>>>> >>>>>>>>> instructions + complex loops (branch prediction),
resulting in improper
>>>>> >>>>>>>>> destroying of instructions.
>>>>>
>>>>>>>>> I've included the ROB, Commit, Fetch, DynInst and O3CPU debug
flags.
>>>>> >>>>>>>>> Their are 192 ROB entries, which is why the
instList size generally has a
>>>>> >>>>>>>>> max of about 192
instructions. The dynamic instruction counts (seen in the
>>>>>
>>>>>>>>> dramsim2 plot) seem to also imply that instructions are
incorrectly been
>>>>> >>>>>>>>> removed from the ROB, and then from the
cpu's instruction list in cpu.cc,
>>>>> >>>>>>>>> which allows more and
more instructions to be added to the system (possibly
>>>>> >>>>>>>>>
from a bad branch).
>>>>> >>>>>>>>> I appreciate any help in debugging
this and further figuring out the
>>>>> >>>>>>>>> root problem, just let
me know if you need anything else from me. I don't
>>>>> >>>>>>>>> have
much more time at the moment to debug, but I can take any advice
for
>>>>> >>>>>>>>> quick changes and/or additional traces, then send
the results back to the
>>>>> >>>>>>>>> list for discussion.
>>>>>
>>>>>>>>> Thanks,
>>>>> >>>>>>>>> Andrew
>>>>> >>>>>>>>> P.S. Paul - I
did try decreasing the size of the dramsim2
>>>>> >>>>>>>>> transaction
(and even command) queue from 512 to 32. The same instructions
>>>>>
>>>>>>>>> problem occurred. It basically just decreased the execution
time.
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> On Wed, Mar 14, 2012 at 2:10 PM, Ali Saidi <***@umich.edu [18]> wrote:
>>>>> >>>>>>>>>
>>>>> >>>>>>>>>> The error is that there are more than 1500 instructions currently
>>>>> >>>>>>>>>> in flight in the system. It could mean several things:
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> 1. The value is somewhat arbitrarily defined and maybe there are
>>>>> >>>>>>>>>> more than 1500 in your system at one time?
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> 2. Instructions aren't being destroyed correctly
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> You could try to run a debug binary so you'll get a list of
>>>>> >>>>>>>>>> instructions when it happens or increase the number, which may
>>>>> >>>>>>>>>> be appropriate for certain situations (but 1500 is quite a few in-flight
>>>>> >>>>>>>>>> instructions).
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> Ali
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> On 13.03.2012 10:56, Andrew Cebulski wrote:
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> Hi Xiangyu,
>>>>> >>>>>>>>>> I just started looking into this some more. So at first I
>>>>> >>>>>>>>>> thought it was due to updating to a more recent revision, but then I went
>>>>> >>>>>>>>>> back to revision 8643, added your patch, built and ran....and now get the
>>>>> >>>>>>>>>> error with it too (when running ARM_FS/gem5.opt). I'm testing now to see
>>>>> >>>>>>>>>> if an update to SWIG might have resulted in this error, maybe someone on
>>>>> >>>>>>>>>> the mailing list would know if that's possible. The difference is 1.3.40
>>>>> >>>>>>>>>> vs. 2.0.3, both of which are supported according to the dependencies wiki
>>>>> >>>>>>>>>> page.
>>>>> >>>>>>>>>> Just for completeness, here's the error from revision 8643:
>>>>> >>>>>>>>>> build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
>>>>> >>>>>>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>>> >>>>>>>>>> I have not tried running with gem5.debug, so I will be doing
>>>>> >>>>>>>>>> that today. Maybe this is an assertion that is occurring due to an
>>>>> >>>>>>>>>> optimization. That would mean it wouldn't be triggered in gem5.debug since
>>>>> >>>>>>>>>> it runs without optimizations. Have you tested all debug, opt and fast
>>>>> >>>>>>>>>> with your tests?
>>>>> >>>>>>>>>> Thanks,
>>>>> >>>>>>>>>> Andrew
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> On Tue, Mar 13, 2012 at 1:37 PM, Rio Xiangyu Dong <***@gmail.com [19]> wrote:
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>>> Hi Andrew,
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> I didn't see this error in my simulations. May I ask which
>>>>> >>>>>>>>>>> gem5 version you are using? I find some of the latest code
>>>>> >>>>>>>>>>> updates do not comply with my changes. I am still using the
>>>>> >>>>>>>>>>> DRAMsim2 patch on Gem5 repo8643, and have run all the
>>>>> >>>>>>>>>>> runnable benchmarks in SPEC2006, SPEC2000, EEMBC2, and
>>>>> >>>>>>>>>>> PARSEC2 on ARM_SE.
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> Thank you!
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> Best,
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> Xiangyu
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> *From:* Andrew Cebulski [mailto:***@drexel.edu [20]]
>>>>> >>>>>>>>>>> *Sent:* Thursday, March 08, 2012 6:52 PM
>>>>> >>>>>>>>>>> *To:* gem5 users mailing list
>>>>> >>>>>>>>>>> *Cc:* ***@gmail.com [21]; ***@umich.edu [22]
>>>>> >>>>>>>>>>> *Subject:* Re: [gem5-users] A Patch for DRAMsim2 Integration
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> Xiangyu,
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> I've been having an issue recently with the number of
>>>>> >>>>>>>>>>> instructions I've been seeing committed to the CPU (I have a
>>>>> >>>>>>>>>>> separate thread on this). It turns out the issue seems to be
>>>>> >>>>>>>>>>> coming from this patch you created to integrate DramSim2
>>>>> >>>>>>>>>>> with Gem5. Unfortunately, I've been running with gem5.fast,
>>>>> >>>>>>>>>>> not gem5.opt. So up until now, I haven't been seeing
>>>>> >>>>>>>>>>> assertions. I thought I'd run it with gem5.opt or debug back
>>>>> >>>>>>>>>>> in December, but I must not have. My runs on the Arm O3 cpu
>>>>> >>>>>>>>>>> fail with this assertion:
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> build/ARM/cpu/base_dyn_inst_impl.hh:149: void
>>>>> >>>>>>>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion
>>>>> >>>>>>>>>>> `cpu->instcount
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> -Andrew




Links:
------
[1] mailto:***@drexel.edu
[2] mailto:***@drexel.edu
[3] mailto:***@umich.edu
[4] mailto:***@umich.edu
[5] http://dl.dropbox.com/u/2953302/gem5/err.0
[6] mailto:***@umich.edu
[7] http://dl.dropbox.com/u/2953302/gem5/tlb.out
[8] mailto:***@umich.edu
[9] http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[10] mailto:***@drexel.edu
[11] http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
[12] http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
[13] mailto:***@eecs.umich.edu
[14] http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
[15] http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
[16] http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
[17] http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[18] mailto:***@umich.edu
[19] mailto:***@gmail.com
[20] mailto:***@drexel.edu
[21] mailto:***@gmail.com
[22] mailto:***@umich.edu
[23] mailto:***@gmail.com
[24] mailto:gem5-***@gem5.org
[25] http://gmail.com
[26] http://www.cse.psu.edu/~xydong
[27] http://www.cse.psu.edu/%7Exydong
[28] http://m5sim.org/cgi-bin/mailman/private/gem5-users/attachments/20111218/f3fdf5da/attachment.html
[29] mailto:gem5-***@gem5.org
[30] http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[31] http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[32] mailto:gem5-***@gem5.org
[33] http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[34] mailto:gem5-***@gem5.org
[35] http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[36] mailto:gem5-***@gem5.org
[37] http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[38] mailto:gem5-***@gem5.org
[39] http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[40] mailto:gem5-***@gem5.org
[41] http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[42] mailto:gem5-***@gem5.org
[43] http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[44] mailto:gem5-***@gem5.org
[45] http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[46] http://dl.dropbox.com/u/2953302/gem5/table_walker.out
[47] http://dl.dropbox.com/u/2953302/gem5/pendingQueueSize.png
[48] mailto:***@umich.edu
[49] mailto:gem5-***@gem5.org
[50] http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[51] http://dl.dropbox.com/u/2953302/gem5/L2faults.png
[52] mailto:***@umich.edu
[53] http://dl.dropbox.com/u/2953302/gem5/pendingQueuePushPop.png
[54] mailto:***@drexel.edu
[55] http://reviews.gem5.org/r/1402/
Andrew Cebulski
2012-10-11 12:09:22 UTC
Permalink
Hi Ali,

Just to confirm, this did fix the issue I was seeing. The committed
instruction count is now what I expect, and there are no assertions firing.

Thanks,
Andrew

On Fri, Sep 7, 2012 at 4:03 PM, Ali Saidi <***@umich.edu> wrote:

>
> Hi Andrew,
>
>
>
> I think that http://reviews.gem5.org/r/1402/ will fix the issue you were
> seeing. It's not a complete solution (we still could use a mechanism to
> backpressure the CPU), but this solves part of the issue.
>
>
>
> Thanks,
>
> Ali
>
>
>
> On 15.05.2012 02:44, Andrew Cebulski wrote:
>
> Here is the latest in my debugging:
> http://dl.dropbox.com/u/2953302/gem5/pendingQueuePushPop.png
> The frequency of occurrence of the doL2DescriptorWrapper function (where I
> was seeing invalid faults) actually controls the size of the pendingQueue.
> What I'm showing are where the pendingQueue size is increased (with a
> push_back) and where it is decreased (pop_front). I put my DPRINTF for the
> decrease at the end of the doL2DescriptorWrapper function in
> table_walker.cc. This is actually right after a function call to nextWalk,
> which schedules a process event (doProcessEvent aka processWalkWrapper())
> for the next tick, which is where the pop of the pendingQueue occurs.
> My first bin is large, just to show how the push/pop rate roughly averages
> out at the start of the plot (there is still imbalance...just smaller
> grained). The bins where push_backs appear to be missing actually contain
> fewer than 20 of them. Note how the difference between the push/pop counts
> is roughly the peak of each rise/fall. I'm still trying to debug why the
> imbalance in the pendingQueue/L2 function calls is occurring...namely at the
> changes from rise/fall in the size, but I seem to be narrowing in on it.
>
> Basically, it looks like there isn't a limit in place for the size of the
> TLB, therefore no stalls are being sent to stop more TLB transactions
> from initiating. The invalid accesses are likely a result of this too.
> Looking more closely in my traces, it looks like the L2 descriptor invalid
> errors start occurring once the pendingQueue increases above roughly 8
> entries.
> Here are the sizes of each bin (N1 is the push_back, N2 the L2 function):
> Bin     N1     N2
>   1    867    788
>   2     11    205
>   3    388     95
>   4     11    300
>   5    775    189
>   6      3    588
>   7   1535    376
>   8     17   1184
>   9   2127    751
>  10      0      1
> -Andrew
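
As a rough cross-check of the bin counts above, the net pendingQueue growth per bin can be worked out directly from N1 and N2. This is only an arithmetic sketch: the function and variable names are illustrative, and it assumes that each call counted in N2 corresponds to one eventual pop_front, as the message describes.

```python
# Per-bin counts copied from the message above.
# n1: push_back calls on pendingQueue; n2: doL2DescriptorWrapper calls.
n1 = [867, 11, 388, 11, 775, 3, 1535, 17, 2127, 0]
n2 = [788, 205, 95, 300, 189, 588, 376, 1184, 751, 1]


def net_growth(pushes, pops):
    """Return (per-bin net change, running total) for two equal-length lists."""
    if len(pushes) != len(pops):
        raise ValueError("bin lists must have the same length")
    deltas = [p - q for p, q in zip(pushes, pops)]
    running, total = [], 0
    for d in deltas:
        total += d
        running.append(total)
    return deltas, running


deltas, running = net_growth(n1, n2)
for i, (d, r) in enumerate(zip(deltas, running), start=1):
    print(f"bin {i:2d}: net {d:+5d}  cumulative {r:+5d}")
```

With these numbers the odd bins are net-positive and the even bins net-negative, which lines up with the rise/fall pattern in the plot, and the pushes end 1257 ahead of the pops over the whole window.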
>
>
> On Mon, May 14, 2012 at 11:15 AM, Andrew Cebulski <***@drexel.edu> wrote:
>
>> Ali,
>> Looking at the trace file for the TLB walker that I sent earlier, I see a
>> considerable number of these faults:
>> L2 descriptor invalid, causing fault
>> This is within the doL2Descriptor function in table_walker.cc.
>> Here's a look at the frequency of these faults, with bins centered around
>> the base of each rise/fall of the pendingQueue size (see small arrows on
>> x-axis):
>> http://dl.dropbox.com/u/2953302/gem5/L2faults.png
>> I'm still looking into how this fault is handled, along with your other
>> questions. I probably won't have much of a chance to get into it more
>> until late today or tomorrow though. Let me know if you have any new ideas
>> based on these results.
>> Thanks,
>> Andrew
>>
>> On Fri, May 11, 2012 at 12:17 AM, Ali Saidi <***@umich.edu> wrote:
>>
>>> Hi Andrew,
>>>
>>> Looking at the trace it seems like there are a lot of invalid
>>> translations that are occurring. Everything to an address less than 0x1000
>>> is likely invalid. An invalid translation will return a fault (setting the
>>> fault pointer in the dynamic instruction to something other than NoFault)
>>> and the instruction will either be squashed by a mispredicted branch or
>>> redirect fetch to a kernel handler. I'm wondering if that isn't happening
>>> for some reason. You need to trace back some of these translations and see
>>> what the instruction serial number is for them and then see what the
>>> instruction's lifetime is like. Are they getting squashed? Looking at your
>>> graph, when the instructions fall to 0, what is the cause? Does an
>>> interrupt occur right before? Something else?
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Ali
>>>
>>>
>>>
>>>
>>>
>>> On 07.05.2012 20:53, Andrew Cebulski wrote:
>>>
>>> Hi Ali and Gabe,
>>> Here's the trace file:
>>> http://dl.dropbox.com/u/2953302/gem5/table_walker.out
>>> The pending queue size in the table walker follows the shape of the
>>> dynamic instruction curves. The L1 and L2 queue size never go above 0.
>>> Comparing DynInst count in cpu->instcount with pendingQueue size:
>>> http://dl.dropbox.com/u/2953302/gem5/pendingQueueSize.png
>>>
>>> -Andrew
>>>
>>> On Sun, May 6, 2012 at 12:01 PM, Ali Saidi <***@umich.edu> wrote:
>>>
>>>> Hi Andrew,
>>>>
>>>> Could you add some code to the table walker to see how big the
>>>> following are getting:
>>>> stateQueueL1.size()
>>>> stateQueueL2.size()
>>>> pendingQueue.size()
>>>>
>>>> Perhaps we're somehow getting into a loop where there are a lot of
>>>> translations to invalid addresses that get squashed and they pile up in the
>>>> table walker?
>>>>
>>>> Thanks,
>>>> Ali
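
Once the table walker prints those sizes, the trace can be reduced to a peak per queue with a short script. The sketch below assumes the usual gem5 trace layout of "tick: object: message"; the message text ("<queue> size: <n>") and the object name in the sample are hypothetical and would need to match whatever DPRINTF is actually added.

```python
import re

# Assumed line shape: "<tick>: <object>: <queue> size: <n>".
# The message text is hypothetical; adapt the pattern to the DPRINTF you add.
LINE_RE = re.compile(r"^\s*(\d+):\s*(\S+):\s*(\w+) size:\s*(\d+)\s*$")


def peak_sizes(lines):
    """Return {queue_name: (peak_size, tick_of_peak)} over all matching lines."""
    peaks = {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip unrelated trace output
        tick, queue, size = int(m.group(1)), m.group(3), int(m.group(4))
        if queue not in peaks or size > peaks[queue][0]:
            peaks[queue] = (size, tick)
    return peaks


# Made-up sample lines, only to exercise the parser.
sample = [
    "1000: system.cpu.dtb.walker: pendingQueue size: 1",
    "2000: system.cpu.dtb.walker: stateQueueL1 size: 0",
    "3000: system.cpu.dtb.walker: pendingQueue size: 9",
    "3500: system.cpu.fetch: unrelated message",
    "4000: system.cpu.dtb.walker: pendingQueue size: 4",
]
print(peak_sizes(sample))
```

Keeping the tick of the peak makes it easy to line the result up against the in-flight instruction plot.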
>>>>
>>>>
>>>>
>>>> On May 4, 2012, at 7:53 AM, Gabriel Michael Black wrote:
>>>>
>>>> > I haven't had a chance to study what's going on here, but could the
>>>> problem be that we don't have bandwidth limits/back pressure implemented
>>>> for the TLB and delayed translation? It could be that the CPU is pumping
>>>> instructions into translation which eventually drain out/are squashed, and
>>>> if too many accumulate they trip that assert.
>>>> >
>>>> > That may not actually make any sense as far as what the code is
>>>> actually doing, but it occurred to me as a possibility and I thought I'd
>>>> throw it out there.
>>>> >
>>>> > Gabe
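
The effect Gabe describes can be illustrated with a toy queue model. This is not gem5 code: the function, its parameters and the numbers are made up, and it only shows how an unbounded queue behind a slow consumer grows without limit while a bounded one stalls the producer instead.

```python
from collections import deque


def simulate(cycles, issue_per_cycle, service_every, limit=None):
    """Toy model: requests are enqueued every cycle and one is serviced
    every `service_every` cycles. With `limit` set, the producer stalls
    when the queue is full (back pressure). Returns (peak_depth, stalls)."""
    queue = deque()
    peak = stalls = 0
    for cycle in range(cycles):
        for _ in range(issue_per_cycle):
            if limit is not None and len(queue) >= limit:
                stalls += 1          # producer must wait; nothing is enqueued
            else:
                queue.append(cycle)
        if cycle % service_every == service_every - 1 and queue:
            queue.popleft()          # slow consumer, e.g. a long-latency walk
        peak = max(peak, len(queue))
    return peak, stalls


print("unbounded:", simulate(1000, 1, 4))
print("limit=8:  ", simulate(1000, 1, 4, limit=8))
```

In the unbounded run the depth keeps climbing for as long as the simulation lasts, which is the kind of accumulation that could eventually trip a fixed in-flight limit; with a limit the depth is capped and the cost shows up as producer stalls.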
>>>> >
>>>> > Quoting Andrew Cebulski <***@drexel.edu>:
>>>> >
>>>> >> I double-checked by looking at the config.ini file. It turns out I
>>>> did
>>>> >> actually create the checkpoint with an Atomic CPU without caches.
>>>> Sorry
>>>> >> for the confusion.
>>>> >>
>>>> >> -Andrew
>>>> >>
>>>> >> On Wed, May 2, 2012 at 10:12 PM, Andrew Cebulski <***@drexel.edu>
>>>> wrote:
>>>> >>
>>>> >>> I started hitting this assertion (that the number of insts in
>>>> flight was >
>>>> >>> 1500) before I started using a checkpoint. I created the checkpoint
>>>> >>> afterwards to decrease the time needed to run simulations to debug
>>>> this
>>>> >>> problem. I'll create a new checkpoint, then send the new trace
>>>> output.
>>>> >>>
>>>> >>> -Andrew
>>>> >>>
>>>> >>>
>>>> >>> On Wed, May 2, 2012 at 9:53 PM, Ali Saidi <***@umich.edu> wrote:
>>>> >>>
>>>> >>>>
>>>> >>>> It's likely the cause for all of your problems. Dirty data in the
>>>> caches
>>>> >>>> doesn't get restored either. You should always create checkpoints
>>>> with an
>>>> >>>> atomic cpu and without caches.
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>> Ali
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>> On 02.05.2012 21:23, Andrew Cebulski wrote:
>>>> >>>>
>>>> >>>> Sorry, I created the checkpoint I referred to with an O3 CPU with
>>>> caches.
>>>> >>>> From what I recall reading, caches don't get restored from
>>>> checkpoints.
>>>> >>>> Since the checkpoint wasn't during the benchmark run, I assumed
>>>> that was
>>>> >>>> okay.
>>>> >>>> -Andrew
>>>> >>>>
>>>> >>>> On Wed, May 2, 2012 at 9:07 PM, Ali Saidi <***@umich.edu> wrote:
>>>> >>>>
>>>> >>>>> You haven't answered the question about if you created the
>>>> checkpoints
>>>> >>>>> with an atomic cpu without caches.
>>>> >>>>>
>>>> >>>>> Ali
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> On 02.05.2012 19:58, Andrew Cebulski wrote:
>>>> >>>>>
>>>> >>>>> I have not run with the checker CPU recently. Here's the stderr
>>>> output
>>>> >>>>> from a run I did awhile back:
>>>> >>>>> http://dl.dropbox.com/u/2953302/gem5/err.0
>>>> >>>>> Note that the instruction match error is before my benchmark
>>>> actually
>>>> >>>>> starts running. The start of my boot script checks to see if my
>>>> files
>>>> >>>>> image is mounted (which it is), then continues on to run the
>>>> benchmark. I
>>>> >>>>> booted the system, mounted my files image, then took a
>>>> checkpoint. I've
>>>> >>>>> been running all my tests from that checkpoint. I found where my
>>>> benchmark
>>>> >>>>> started based on the ASID (from ExecAsid debug flag).
>>>> >>>>> I delayed the start of gathering trace data until the
>>>> second-to-last
>>>> >>>>> linear increase in dynamic instructions in-flight. I'm running a
>>>> new trace
>>>> >>>>> now.
>>>> >>>>> -Andrew
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> On Wed, May 2, 2012 at 5:28 PM, Ali Saidi <***@umich.edu>
>>>> wrote:
>>>> >>>>>
>>>> >>>>>> Something is wrong well before this point. There is no reason
>>>> that
>>>> >>>>>> address 0x0 or 0x4 should be translated.
>>>> >>>>>>
>>>> >>>>>> Did you happen to create a checkpoint when caches were in the
>>>> system?
>>>> >>>>>>
>>>> >>>>>> Have you tried to run with the checker cpu and see if it detects
>>>> any
>>>> >>>>>> errors?
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> Ali
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> On 02.05.2012 17:22, Andrew Cebulski wrote:
>>>> >>>>>>
>>>> >>>>>> They are data TLB misses that occur as the in-flight instruction
>>>> count
>>>> >>>>>> rises (at 0x0 and 0x4). The last TLB miss before the in-flight
>>>> instruction
>>>> >>>>>> count finally linearly decreases is to 0x200. Also, at the
>>>> start of the
>>>> >>>>>> rising slope, I see a miss to 0x8 and 0x2508c.
>>>> >>>>>> Here's a trace file:
>>>> >>>>>> http://dl.dropbox.com/u/2953302/gem5/tlb.out
>>>> >>>>>> To reduce size, I just have lines that have either TLB or walker
>>>> in
>>>> >>>>>> them.
>>>> >>>>>> I do see only a handful of instruction TLB misses.
>>>> >>>>>> -Andrew
>>>> >>>>>>
>>>> >>>>>> On Wed, May 2, 2012 at 11:10 AM, Ali Saidi <***@umich.edu>
>>>> wrote:
>>>> >>>>>>
>>>> >>>>>>> Hi Andrew,
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>> Thanks for digging into this. I think there is an issue
>>>> somewhere, but
>>>> >>>>>>> I'm still not sure where.
>>>> >>>>>>>
>>>> >>>>>>> Ali
>>>> >>>>>>>
>>>> >>>>>>> On 01.05.2012 23:34, Andrew Cebulski wrote:
>>>> >>>>>>>
>>>> >>>>>>> Okay, I'm positive now that the issue lies with delayed
>>>> translations
>>>> >>>>>>> that are squashed before finishing.
>>>> >>>>>>>
>>>> >>>>>>> On the data or instruction side? You seem to allude to data in
>>>> the
>>>> >>>>>>> paragraph below, but then instructions in the latter text.
>>>> >>>>>>>
>>>> >>>>>>> It seems to me like speculative load/stores are being executed,
>>>> >>>>>>> rather than waiting for the instructions to commit. Once the
>>>> instructions
>>>> >>>>>>> begin getting (speculatively) executed in the TLB, a reference
>>>> is left
>>>> >>>>>>> there, which seems hard to root out and dereference after the
>>>> instruction
>>>> >>>>>>> ends up being squashed. At least, I have not been able to find
>>>> that out in
>>>> >>>>>>> the source code as of yet. Can anyone clarify on this?
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>> There should only be one translation outstanding from each
>>>> >>>>>>> instruction and data side walker. Any nested transactions
>>>> should be queued
>>>> >>>>>>> in the walker. Until one finishes, I'm not sure how multiple
>>>> would ever be
>>>> >>>>>>> outstanding.
>>>> >>>>>>>
>>>> >>>>>>> Recall the following image that shows how the number of dynamic
>>>> >>>>>>> instruction (DynInst) objects in-flight increases linearly for
>>>> varying
>>>> >>>>>>> periods of time:
>>>> >>>>>>>
>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>>>> >>>>>>> After enabling the TLB debug flag, I see that the linear
>>>> increase in
>>>> >>>>>>> instructions in flight is proportional to the number of TLB
>>>> misses. These
>>>> >>>>>>> TLB misses have a much larger delay (resulting in translation
>>>> delays) due
>>>> >>>>>>> to the fact that DramSim2 models the memory system more
>>>> accurately. It
>>>> >>>>>>> seems that with the classic memory system, TLB misses often do
>>>> not have
>>>> >>>>>>> translation delays. For whatever reason, it would also seem
>>>> that every
>>>> >>>>>>> instruction that has a TLB miss also is eventually squashed...
>>>> >>>>>>>
>>>> >>>>>>> From a data side perspective this is reasonable. While a miss is
>>>> >>>>>>> outstanding at some point instructions will stop committing and
>>>> thus the
>>>> >>>>>>> instructions in flight will begin to rise until the miss is
>>>> satisfied.
>>>> >>>>>>>
>>>> >>>>>>> Here's a summary of outputs from my trace. These two DPRINTF
>>>> >>>>>>> messages appear on the rising slopes (repeated up until the
>>>> peak):
>>>> >>>>>>> TLB Miss: Starting hardware table walker for 0(656)
>>>> >>>>>>> TLB Miss: Starting hardware table walker for 0x4(656)
>>>> >>>>>>>
>>>> >>>>>>> This is interesting/odd. I don't know a good reason why (1) a
>>>> miss
>>>> >>>>>>> would be outstanding to both address 0 and address 4 at the
>>>> same time. In
>>>> >>>>>>> almost all cases these pages are marked as no-access to detect
>>>> segfaults.
>>>> >>>>>>> Perhaps there is an issue where the cpu is getting into a loop
>>>> faulting on
>>>> >>>>>>> a bad access and then faulting again on the fault handler. I
>>>> could imagine
>>>> >>>>>>> this would happen if there was some corruption in the memory
>>>> system (for
>>>> >>>>>>> example the timings in dramsim exposing a bug in the cache
>>>> models or
>>>> >>>>>>> something).
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>> At the peak, the following message appears (from fetch) almost
>>>> every
>>>> >>>>>>> tick for (what I believe to be) every single one of the table
>>>> walkers that
>>>> >>>>>>> were squashed.
>>>> >>>>>>> Fetch is waiting ITLB walk to finish!
>>>> >>>>>>>
>>>> >>>>>>> There must be another walk in flight? The instruction side will
>>>> only
>>>> >>>>>>> have one fault outstanding at once. Successive branch
>>>> mispredicts will
>>>> >>>>>>> re-direct fetch but there is code that catches the fact that a
>>>> different
>>>> >>>>>>> walk completed than expected and "does the right thing."
>>>> >>>>>>>
>>>> >>>>>>> The problem is that these ITLB table walks are for instructions
>>>> that
>>>> >>>>>>> were squashed as much as 0.3 billion cycles earlier, and have since
>>>> been removed
>>>> >>>>>>> from the CPU's instruction list.
>>>> >>>>>>>
>>>> >>>>>>> I'm not following here.
>>>> >>>>>>>
>>>> >>>>>>> Any help will be greatly appreciated in solving this problem.
>>>> I've
>>>> >>>>>>> hit a roadblock with getting Ruby working with ARM, most likely
>>>> due to the
>>>> >>>>>>> fact that ARM has disjoint memory (x86 and Alpha do not).
>>>> There's the 256
>>>> >>>>>>> MB for physical memory, then the 64 MB for the boot loader. I
>>>> brought this
>>>> >>>>>>> up in my last email about trying to get Ruby working.
>>>> Therefore, I'm
>>>> >>>>>>> trying to get this DramSim2 integration fixed so I can start
>>>> modeling FS
>>>> >>>>>>> with DRAM memory.
>>>> >>>>>>>
>>>> >>>>>>> Brad/Steve/Nilay anyone have a suggestion on how to make this
>>>> work?
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>> Note that these problems also occur in Soplex from the Spec
>>>> CPU2006
>>>> >>>>>>> benchmark suite (also hits 1500 in-flight instructions
>>>> assertion). Due to
>>>> >>>>>>> time constraints, I haven't tested on other benchmarks.
>>>> >>>>>>> Thanks,
>>>> >>>>>>> Andrew
>>>> >>>>>>> On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski <
>>>> ***@drexel.edu> wrote:
>>>> >>>>>>>
>>>> >>>>>>>> Hey Gabe,
>>>> >>>>>>>> Thanks for this...very helpful. I just recently got back
>>>> into
>>>> >>>>>>>> debugging this problem. I made a small change in
>>>> src/base/refcnt.hh to
>>>> >>>>>>>> allow me to return the current count of references to a
>>>> DynInst object.
>>>> >>>>>>>> I then modified existing DPRINTFs to also print out
>>>> reference
>>>> >>>>>>>> counts, then added some of my own when I needed extra
>>>> visibility.
>>>> >>>>>>>> I've found one memory store instruction that seems to be
>>>> getting
>>>> >>>>>>>> lost. What's happening is that it progresses as far as
>>>> getting executed in
>>>> >>>>>>>> the IEW once, but a delayed translation occurs, deferring the
>>>> store. By
>>>> >>>>>>>> the time it reenters the IEW, the IQ has marked the
>>>> instruction as
>>>> >>>>>>>> squashed. Everything progresses as usual from here on out,
>>>> with one
>>>> >>>>>>>> exception. When the instruction is removed from the CPU's
>>>> instruction list,
>>>> >>>>>>>> there is one reference count hanging.
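
The symptom described here, an object that outlives its removal from the instruction list because one more owner still exists, can be reproduced in miniature. The sketch below uses Python's own reference counting as a stand-in for gem5's reference-counted pointers; the DynInst class and the "pending translation" owner are hypothetical, and the thread had not confirmed at this point which component actually holds the extra reference.

```python
import gc
import weakref


class DynInst:
    """Stand-in for a dynamic instruction; not gem5's class."""

    def __init__(self, seq_num):
        self.seq_num = seq_num


inst_list = [DynInst(1), DynInst(2)]
# Weak references observe object lifetime without owning the objects.
probes = [weakref.ref(inst) for inst in inst_list]

# Hypothetical stray owner: a pending translation that captured instruction 2.
pending_translation = inst_list[1]

inst_list.clear()   # analogous to erasing every entry from the list
gc.collect()
alive_after_erase = [p() is not None for p in probes]
print("alive after erase:", alive_after_erase)

pending_translation = None   # the last owner finally lets go
gc.collect()
alive_after_release = [p() is not None for p in probes]
print("alive after release:", alive_after_release)
```

Instruction 1 is freed as soon as it leaves the list, while instruction 2 survives until the stray owner drops it. If that owner never lets go, such objects simply accumulate.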
>>>> >>>>>>>> I've added in some additional debugging for my traces to
>>>> help
>>>> >>>>>>>> narrow down where this reference is coming from. As far as I
>>>> can tell,
>>>> >>>>>>>> it's because of a call to initiateAcc() within the
>>>> executeStore function in
>>>> >>>>>>>> the lsq unit. Please see the following two traces. The first
>>>> trace shows
>>>> >>>>>>>> what I just discussed. The second trace is another memory
>>>> store
>>>> >>>>>>>> instruction that got squashed, however, it was squashed upon
>>>> its first
>>>> >>>>>>>> entry into the IEW, therefore it never started execution.
>>>> >>>>>>>> http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
>>>> >>>>>>>> http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
>>>> >>>>>>>> Let me know if you have any ideas based on these two
>>>> instruction
>>>> >>>>>>>> traces. I do not understand how the initiateAcc function
>>>> results in
>>>> >>>>>>>> another reference, but maybe someone else does.... Since I
>>>> don't see how
>>>> >>>>>>>> it makes a reference, it's hard to find out how to make sure
>>>> it gets
>>>> >>>>>>>> dereferenced...
>>>> >>>>>>>> Unfortunately, I haven't been able to add a DPRINTF in
>>>> >>>>>>>> src/base/refcnt.hh ...this would make things more clear (i.e.
>>>> exactly when
>>>> >>>>>>>> references/dereferences occur). Let me know if you have any
>>>> advice on
>>>> >>>>>>>> this...if it's possible. I can't seem to get the right
>>>> include files, and
>>>> >>>>>>>> likely right SConscript compile order...
>>>> >>>>>>>> Thanks,
>>>> >>>>>>>> Andrew
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>> On Sat, Apr 7, 2012 at 9:48 PM, Gabe Black <
>>>> ***@eecs.umich.edu> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>>> Without digging into things too deeply, it looks like you may
>>>> be
>>>> >>>>>>>>> leaking references to dynamic instructions. The CPU may think
>>>> it's done
>>>> >>>>>>>>> with one, but until that final reference is removed, the
>>>> object will hang
>>>> >>>>>>>>> around forever. I think I've had problems before where the
>>>> reference
>>>> >>>>>>>>> count ended up off by one somehow and instructions would
>>>> start piling up.
>>>> >>>>>>>>> It's also possible that a clog develops in O3's pipeline and
>>>> some internal
>>>> >>>>>>>>> structure stops letting instructions through and starts
>>>> accumulating them.
>>>> >>>>>>>>> Either of these problems will be annoying to track down, but
>>>> with enough
>>>> >>>>>>>>> digging I've been able to fix these sorts of things.
>>>> >>>>>>>>>
>>>> >>>>>>>>> This may have more to do with O3 not handling the benchmark
>>>> you're
>>>> >>>>>>>>> running well rather than a problem with your new DRAM model.
>>>> There may be
>>>> >>>>>>>>> some interaction between the two, though, where the new
>>>> memory makes the
>>>> >>>>>>>>> timing line up to cause O3 to behave poorly. What you can do
>>>> is instrument
>>>> >>>>>>>>> dynamic instruction creation and destruction and reference
>>>> counting (try
>>>> >>>>>>>>> printing "this" for both the reference counting wrapper and the
>>>> dyn inst
>>>> >>>>>>>>> itself) and turn it on as close as you can to where things go
>>>> bad tick
>>>> >>>>>>>>> wise. Then look for an instruction which gets lost, and look
>>>> for where its
>>>> >>>>>>>>> reference count is incremented and decremented. It should be
>>>> relatively
>>>> >>>>>>>>> easy to pair up where references are created and destroyed,
>>>> and you should
>>>> >>>>>>>>> be able to identify the reference which never goes away. Then
>>>> you need to
>>>> >>>>>>>>> figure out where that reference is being created. After that,
>>>> you should
>>>> >>>>>>>>> have enough information to identify why the reference
>>>> counting isn't being
>>>> >>>>>>>>> done correctly. It's arduous, but that's the only way.
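
The pairing step described above can be partly automated once the increments and decrements are in a trace. The sketch below assumes each event has already been parsed into (op, instruction id, holder), where the holder would be something like the address of the reference-counting wrapper; the event format and the holder labels are hypothetical.

```python
from collections import defaultdict


def unmatched_refs(events):
    """events: iterable of (op, inst_id, holder) with op in {"inc", "dec"}.
    Returns {inst_id: [holders that still own a reference]}."""
    held = defaultdict(list)
    for op, inst, holder in events:
        if op == "inc":
            held[inst].append(holder)
        elif op == "dec":
            if holder not in held[inst]:
                # A decrement with no matching increment points at the
                # opposite bug: a count that reaches zero too early.
                raise ValueError(f"dec without inc: inst {inst}, holder {holder}")
            held[inst].remove(holder)
        else:
            raise ValueError(f"unknown op {op!r}")
    return {inst: owners for inst, owners in held.items() if owners}


# Made-up events: instruction 656 keeps one reference, 657 is clean.
log = [
    ("inc", 656, "instList"), ("inc", 656, "ROB"), ("inc", 656, "translation"),
    ("dec", 656, "ROB"), ("dec", 656, "instList"),
    ("inc", 657, "instList"), ("dec", 657, "instList"),
]
print(unmatched_refs(log))
```

Whatever remains in the result is a reference that was created but never released, which narrows the search to the code path that created it.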
>>>> >>>>>>>>>
>>>> >>>>>>>>> It's important to also make sure reference counts aren't
>>>> decremented
>>>> >>>>>>>>> to zero prematurely. I had a problem once where that happened
>>>> and the
>>>> >>>>>>>>> memory behind the object was updated by something that didn't
>>>> know it was
>>>> >>>>>>>>> dead. The memory had since been reallocated to another object
>>>> of the same
>>>> >>>>>>>>> type, so that other object reflected what happened to the
>>>> phantom one. If I
>>>> >>>>>>>>> remember that manifested as something weird like an add
>>>> causing a page
>>>> >>>>>>>>> fault or something.
>>>> >>>>>>>>>
>>>> >>>>>>>>> Gabe
>>>> >>>>>>>>>
>>>> >>>>>>>>>
>>>> >>>>>>>>> On 04/07/12 18:21, Andrew Cebulski wrote:
>>>> >>>>>>>>>
>>>> >>>>>>>>> Hi all,
>>>> >>>>>>>>> I've looked into this problem some more, and have put
>>>> together a
>>>> >>>>>>>>> couple traces. I've been becoming more familiar with how
>>>> gem5 handles
>>>> >>>>>>>>> dynamic instructions, in particular how it destroys them. I
>>>> have two
>>>> >>>>>>>>> traces to compare, one with the physical memory, and the
>>>> other with the
>>>> >>>>>>>>> integrated dramsim2 dram memory. I also have two plots
>>>> showing instruction
>>>> >>>>>>>>> counts over time (sim ticks). All of these are linked at the
>>>> end of the
>>>> >>>>>>>>> email.
>>>> >>>>>>>>> First, I'm going to go into what I've been able to interpret
>>>> >>>>>>>>> regarding how instructions are destroyed. In particular,
>>>> comparing when
>>>> >>>>>>>>> DynInst's vs. DynInstPtr's are deconstructed/removed from the
>>>> cpu. I
>>>> >>>>>>>>> separate these because I've seen a difference, as I discuss
>>>> later. These
>>>> >>>>>>>>> explanations are fairly non-existent on the wiki. There is a
>>>> section
>>>> >>>>>>>>> header waiting to be filled...
>>>> >>>>>>>>> From what I have been able to gather from the code, there is
>>>> a list
>>>> >>>>>>>>> of all the instructions in flight in cpu/o3/cpu.cc called
>>>> instList, with
>>>> >>>>>>>>> the type DynInstPtr. There are three conditions to
>>>> instructions being
>>>> >>>>>>>>> cleaned from this list:
>>>> >>>>>>>>> 1.) The ROB retires its head instruction
>>>> >>>>>>>>> 2.) Fetch receives a rob squashing signal from the commit,
>>>> >>>>>>>>> resulting in removing any instruction not in the ROB
>>>> >>>>>>>>> 3.) Decode detects an incorrect branch prediction, resulting
>>>> in
>>>> >>>>>>>>> removal of all instructions back to the bad seq num.
>>>> >>>>>>>>> Once all five stages have completed, the CPU cleans up all the
>>>> >>>>>>>>> removed in-flight instructions. This line in particular
>>>> >>>>>>>>> in cleanUpRemovedInsts() in cpu/o3/cpu.cc deconstructs a
>>>> DynInstPtr:
>>>> >>>>>>>>> instList.erase(removeList.front());
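
A toy model in Python may help show why erasing a handle from the list and destroying the instruction can happen at different times. It is only a sketch under the assumption that the handle type behaves like a reference-counting pointer, so the object dies when its last handle is released. The class names `Inst` and `InstPtr` and the `pending` holder are hypothetical and are not gem5 identifiers.

```python
class Inst:
    """Toy stand-in for a dynamic instruction carrying its own reference count."""
    live = 0  # analogue of a live-instruction counter

    def __init__(self, seq_num):
        self.seq_num = seq_num
        self.refs = 0
        Inst.live += 1

    def destroy(self):
        Inst.live -= 1


class InstPtr:
    """Toy counted handle: the instruction is destroyed with its last handle."""

    def __init__(self, inst):
        self.inst = inst
        inst.refs += 1

    def release(self):
        if self.inst is None:
            return  # already released
        self.inst.refs -= 1
        if self.inst.refs == 0:
            self.inst.destroy()
        self.inst = None


def demo():
    inst = Inst(seq_num=42)
    inst_list = [InstPtr(inst)]   # handle held by the in-flight list
    pending = InstPtr(inst)       # hypothetical second holder elsewhere
    inst_list.pop(0).release()    # analogue of instList.erase(...)
    after_erase = Inst.live       # the instruction is still alive here
    pending.release()             # the last handle goes away
    return after_erase, Inst.live
```

In this model the live count only drops once `pending` is released. A holder that never releases its handle would keep the count growing even though the list itself stays small.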
>>>> >>>>>>>>> When I turn on the debug flag O3CPU, I see the message
>>>> "Removing
>>>> >>>>>>>>> instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum
>>>> and pcState
>>>> >>>>>>>>> after all 5 cpu stages have completed, and one of the
>>>> conditions above is
>>>> >>>>>>>>> met. I also see what tick it occurs on.
>>>> >>>>>>>>> When I turn on the DynInst debug flag, I see when
>>>> instructions are
>>>> >>>>>>>>> created and destroyed (cpu/base_dyn_inst_impl.hh) and what
>>>> tick. From
>>>> >>>>>>>>> analyzing the trace files, I've gathered that this takes into
>>>> account that
>>>> >>>>>>>>> instructions have different execution lengths. So if one
>>>> tick a memory
>>>> >>>>>>>>> instruction in the instList (DynInstPtr) is removed, the
>>>> DynInst for that
>>>> >>>>>>>>> memory instruction will occur much later (i.e. 1M ticks
>>>> later). I have yet
>>>> >>>>>>>>> to determine how this is implemented.
>>>> >>>>>>>>> Now for the problem.
>>>> >>>>>>>>> What I'm seeing when I run dramsim2 dram memory is a
>>>> significant
>>>> >>>>>>>>> difference between the size of the instList vector (of
>>>> DynInstPtr objects),
>>>> >>>>>>>>> and the size of dynamic instruction count (of DynInst
>>>> objects). The
>>>> >>>>>>>>> benchmark I'm running is libquantum from SPEC 2006. For the
>>>> first roughly
>>>> >>>>>>>>> 130B ticks, the dynamic instruction count kept in
>>>> cpu/base_dyn_inst.impl.hh
>>>> >>>>>>>>> shadows the instList size in o3/cpu.cc (figure linked below)
>>>> very closely.
>>>> >>>>>>>>> Around tick 130B after libquantum started, it starts hitting
>>>> what I'm
>>>> >>>>>>>>> assuming are loops (therefore branch prediction), resulting
>>>> in some
>>>> >>>>>>>>> behavior that seems to imply improper instruction handling
>>>> (i.e. more
>>>> >>>>>>>>> instructions in flight than allowed by ROB).
>>>> >>>>>>>>> I wasn't able to sync-up the physical and dramsim2 traces
>>>> exactly by
>>>> >>>>>>>>> trace, but they should represent roughly the same area of
>>>> execution. They
>>>> >>>>>>>>> don't execute the same due to the dramsim2 modeling the
>>>> memory differently
>>>> >>>>>>>>> (i.e. latency and other delays).
>>>> >>>>>>>>> I've shared both traces on my public Dropbox here --
>>>> >>>>>>>>>
>>>> >>>>>>>>>
>>>> http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
>>>> >>>>>>>>>
>>>> >>>>>>>>>
>>>> http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
>>>> >>>>>>>>> Here are a couple plots of tick versus instruction count, with
>>>> >>>>>>>>> respect to cpu->instcount in cpu/base_dyn_inst.impl.hh and
>>>> instList.size()
>>>> >>>>>>>>> in cpu/o3/cpu.cc. --
>>>> >>>>>>>>>
>>>> >>>>>>>>>
>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
>>>> >>>>>>>>>
>>>> >>>>>>>>>
>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>>>> >>>>>>>>> Note that I added the printout of the instList size to an
>>>> existing
>>>> >>>>>>>>> O3CPU DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.
>>>> >>>>>>>>> Here are the commands I ran to parse the traces into data
>>>> files to
>>>> >>>>>>>>> analyze in MATLAB and create the plots:
>>>> >>>>>>>>> zgrep DynInst
>>>> >>>>>>>>> dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz |
>>>> grep destroyed
>>>> >>>>>>>>> | awk '{print $1,$11}' > cpuinstcount.out
>>>> >>>>>>>>> zgrep instList
>>>> >>>>>>>>> dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz |
>>>> awk '{print
>>>> >>>>>>>>> $1,$11}' > instlistsize.out
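
For anyone who prefers to do the extraction without the shell, the two pipelines above can be approximated in Python. This is a sketch only: `extract_fields` is a hypothetical helper, and it assumes the same whitespace-separated layout the awk commands rely on (field 1 and field 11). Unlike awk, it skips lines that have fewer fields than requested.

```python
import gzip


def extract_fields(path, must_contain, fields=(1, 11)):
    """Yield the selected 1-based whitespace-separated fields of matching lines.

    path:         trace file, gzip-compressed if the name ends in ".gz"
    must_contain: substrings that must all appear in a line (like chained greps)
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", errors="replace") as trace:
        for line in trace:
            if not all(word in line for word in must_contain):
                continue
            parts = line.split()
            if len(parts) < max(fields):
                continue  # too short to hold the requested fields
            yield tuple(parts[i - 1] for i in fields)
```

Calling `extract_fields(trace_path, ["DynInst", "destroyed"])` corresponds to the first command and `extract_fields(trace_path, ["instList"])` to the second. The yielded tuples can be written out or plotted directly.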
>>>> >>>>>>>>> It seems to me like the problem might lie in gem5, but has
>>>> just been
>>>> >>>>>>>>> exposed by integrating this more detailed memory model,
>>>> dramsim2, into
>>>> >>>>>>>>> gem5. Either that, or there are some timing errors in how
>>>> dramsim2 was
>>>> >>>>>>>>> integrated. I doubt this, however, since those first 190B
>>>> ticks executed
>>>> >>>>>>>>> used the dramsim2 memory. I believe the problem is a
>>>> combination of memory
>>>> >>>>>>>>> instructions + complex loops (branch prediction), resulting
>>>> in improper
>>>> >>>>>>>>> destroying of instructions.
>>>> >>>>>>>>> I've included the ROB, Commit, Fetch, DynInst and O3CPU debug
>>>> flags.
>>>> >>>>>>>>> There are 192 ROB entries, which is why the instList size
>>>> generally has a
>>>> >>>>>>>>> max of about 192 instructions. The dynamic instruction
>>>> counts (seen in the
>>>> >>>>>>>>> dramsim2 plot) seem to also imply that instructions are
>>>> incorrectly being
>>>> >>>>>>>>> removed from the ROB, and then from the cpu's instruction
>>>> list in cpu.cc,
>>>> >>>>>>>>> which allows more and more instructions to be added to the
>>>> system (possibly
>>>> >>>>>>>>> from a bad branch).
>>>> >>>>>>>>> I appreciate any help in debugging this and further figuring
>>>> out the
>>>> >>>>>>>>> root problem, just let me know if you need anything else from
>>>> me. I don't
>>>> >>>>>>>>> have much more time at the moment to debug, but I can take
>>>> any advice for
>>>> >>>>>>>>> quick changes and/or additional traces, then send the results
>>>> back to the
>>>> >>>>>>>>> list for discussion.
>>>> >>>>>>>>> Thanks,
>>>> >>>>>>>>> Andrew
>>>> >>>>>>>>> P.S. Paul - I did try decreasing the size of the dramsim2
>>>> >>>>>>>>> transaction (and even command) queue from 512 to 32. The
>>>> same instructions
>>>> >>>>>>>>> problem occurred. It basically just decreased the execution
>>>> time.
>>>> >>>>>>>>>
>>>> >>>>>>>>> On Wed, Mar 14, 2012 at 2:10 PM, Ali Saidi <***@umich.edu>
>>>> wrote:
>>>> >>>>>>>>>
>>>> >>>>>>>>>> The error is that there are more than 1500 instructions
>>>> currently
>>>> >>>>>>>>>> in flight in the system. It could mean several things:
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> 1. The value is somewhat arbitrarily defined and maybe there
>>>> are
>>>> >>>>>>>>>> more than 1500 in your system at one time?
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> 2. Instructions aren't being destroyed correctly
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> You could try to run a debug binary so you'll get a list
>>>> of
>>>> >>>>>>>>>> instructions when it happens or increase the number which may
>>>> >>>>>>>>>> be appropriate for certain situations (but 1500 is quite a
>>>> few inflight
>>>> >>>>>>>>>> instructions).
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Ali
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> On 13.03.2012 10:56, Andrew Cebulski wrote:
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Hi Xiangyu,
>>>> >>>>>>>>>> I just started looking into this some more. So at first I
>>>> >>>>>>>>>> thought it was due to updating to a more recent revision,
>>>> but then I went
>>>> >>>>>>>>>> back to revision 8643, added your patch, built and
>>>> ran....and now get the
>>>> >>>>>>>>>> error with it too (when running ARM_FS/gem5.opt). I'm
>>>> testing now to see
>>>> >>>>>>>>>> if an update to SWIG might have resulted in this error,
>>>> maybe someone on
>>>> >>>>>>>>>> the mailing list would know if that's possible. The
>>>> difference is 1.3.40
>>>> >>>>>>>>>> vs. 2.0.3, both of which are supported according to the
>>>> dependencies wiki
>>>> >>>>>>>>>> page.
>>>> >>>>>>>>>> Just for completeness, here's the error from revision 8643:
>>>> >>>>>>>>>> build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
>>>> >>>>>>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion
>>>> `cpu->instcount
>>>> >>>>>>>>>> I have not tried running with gem5.debug, so I will be
>>>> doing
>>>> >>>>>>>>>> that today. Maybe this is an assertion that is occurring
>>>> due to an
>>>> >>>>>>>>>> optimization. That would mean it wouldn't be triggered in
>>>> gem5.debug since
>>>> >>>>>>>>>> it runs without optimizations. Have you tested all debug,
>>>> opt and fast
>>>> >>>>>>>>>> with your tests?
>>>> >>>>>>>>>> Thanks,
>>>> >>>>>>>>>> Andrew
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> On Tue, Mar 13, 2012 at 1:37 PM, Rio Xiangyu Dong <
>>>> >>>>>>>>>> ***@gmail.com> wrote:
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>> Hi Andrew,
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> I didn't see this error in my simulations. May I ask which
>>>> gem5
>>>> >>>>>>>>>>> version you are using? I find some of the latest code
>>>> updates do not comply
>>>> >>>>>>>>>>> with my changes. I am still using the DRAMsim2 patch on
>>>> Gem5 repo8643, and
>>>> >>>>>>>>>>> have run all the runnable benchmarks in SPEC2006, SPEC2000,
>>>> EEMBC2, and
>>>> >>>>>>>>>>> PARSEC2 on ARM_SE.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Thank you!
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Best,
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Xiangyu
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> *From:* Andrew Cebulski [mailto:***@drexel.edu]
>>>> >>>>>>>>>>> *Sent:* Thursday, March 08, 2012 6:52 PM
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> *To:* gem5 users mailing list
>>>> >>>>>>>>>>> *Cc:****@gmail.com; ***@umich.edu
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> *Subject:* Re: [gem5-users] A Patch for DRAMsim2 Integration
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Xiangyu,
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> I've been having an issue recently with the number of
>>>> >>>>>>>>>>> instructions I've been seeing committed to the CPU (I have
>>>> a separate
>>>> >>>>>>>>>>> thread on this). It turns out the issue seems to be coming
>>>> from this patch
>>>> >>>>>>>>>>> you created to integrate DramSim2 with Gem5.
>>>> Unfortunately, I've been
>>>> >>>>>>>>>>> running with gem5.fast, not gem5.opt. So up until now, I
>>>> haven't been
>>>> >>>>>>>>>>> seeing assertions. I thought I'd run it with gem5.opt or
>>>> debug back in
>>>> >>>>>>>>>>> December, but I must not have. My runs on the Arm O3 cpu
>>>> fail with this
>>>> >>>>>>>>>>> assertion:
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> build/ARM/cpu/base_dyn_inst_impl.hh:149: void
>>>> >>>>>>>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion
>>>> `cpu->instcount
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> -Andrew
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Date: Sun, 18 Dec 2011 01:48:58 -0800
>>>> >>>>>>>>>>> From: "Dong, Xiangyu" <***@gmail.com>
>>>> >>>>>>>>>>> To: "gem5 users mailing list" <gem5-***@gem5.org>
>>>> >>>>>>>>>>> Subject: [gem5-users] A Patch for DRAMsim2 Integration
>>>> >>>>>>>>>>> Message-ID: gmail.com>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Content-Type: text/plain; charset="us-ascii"
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Hi all,
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> I have a Gem5+DRAMsim2 patch. I've tested it under both SE
>>>> and FS
>>>> >>>>>>>>>>> modes.
>>>> >>>>>>>>>>> I'm willing to share it here.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> For those who have such needs, please go to my website
>>>> >>>>>>>>>>> www.cse.psu.edu/~xydong <http://www.cse.psu.edu/~xydong>
>>>> to
>>>> >>>>>>>>>>> download the patch and test it. To enable
>>>> >>>>>>>>>>> DRAMSim2, use se_dramsim2.py script instead of se.py (for
>>>> FS, you
>>>> >>>>>>>>>>> can create
>>>> >>>>>>>>>>> by yourself). The basic idea to enable the DRAMsim2 module
>>>> is to
>>>> >>>>>>>>>>> use the
>>>> >>>>>>>>>>> derived DRAMMemory class instead of PhysicalMemory class.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Please let me know if there are bugs.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Thank you!
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Best,
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Xiangyu Dong
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>> _______________________________________________
>>>> >>>>>>>>>> gem5-users mailing list
>>>> >>>>>>>>>> gem5-***@gem5.org
>>>> >>>>>>>>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
Gabriel Michael Black
2012-05-02 21:34:50 UTC
Yes, thanks for your perseverance. I've been meaning to reply but I
haven't found the time to look at your email carefully. I'll try to do
that soon.

Gabe

Quoting Ali Saidi <***@umich.edu>:

>
>
> Hi Andrew,
>
> Thanks for digging into this. I think there is an issue
> somewhere, but I'm still not sure where.
>
> Ali
>
> On 01.05.2012 23:34,
> Andrew Cebulski wrote:
>
>> Okay, I'm positive now that the issue lies
> with delayed translations that are squashed before finishing.
>
> On the
> data or instruction side? You seem to allude to data in the paragraph
> below, but then instructions in the latter text.
>
>> It seems to me like
> speculative load/stores are being executed, rather than waiting for the
> instructions to commit. Once the instructions begin getting
> (speculatively) executed in the TLB, a reference is left there, which
> seems hard to root out and dereference after the instruction ends up
> being squashed. At least, I have not been able to find that out in the
> source code as of yet. Can anyone clarify on this?
>
> There should only be
> one translation outstanding from each instruction and data side walker.
> Any nested transactions should be queued in the walker. Until one
> finishes, I'm not sure how multiple would ever be outstanding.
> R
>
>> ses
> linearly for varying periods of time:
>>
> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
> [1]
>> After enabling the TLB debug flag, I see that the linear increase
> in instructions in flight is proportional to the number of TLB misses.
> These TLB misses have a much larger delay (resulting in translation
> delays) due to the fact the DramSim2 models the memory system more
> accurately. It seems that with the classic memory system, TLB misses
> often do not have translation delays. For whatever reason, it would also
> seem that every instruction that has a TLB miss also is eventually
> squashed...
>>
>> From a data side perspective this is reasonable. While
> a miss is outstanding at
> structions will stop committing and thus the
> instructions in flight will begin to rise until the miss is satisfied.
>
>
> Here's a summary of outputs from my trace. These two DPRINTF messages
> appear on the rising slopes (repeated up until the peak):
> TLB Miss
>
>>
> This is interesting/odd. I don't know a good reason why (1) a miss would
> be outstanding to both address 0 and address 4 at the same time. In
> almost all cases these pages are marked as no-access to detect
> segfaults. Perhaps there is an issue where the
> g into a loop faulting on
> a bad access and then faulting again on the fault handler. I could
> imagine this would happen if there was some corruption in the memory
> system (for example the timings in dramsim exposing a bug in the cache
> models or something).
>
> At the peak, the following message appears
> (from fetch) almost every tick for (what I believe to be) every single
> one of the table walkers that were squashed.
> Fetch is waiting ITLB walk
> to finish!
>
> There must be another walk in flight? The instruction side
> will only have one fault outstanding at once. Successive branch
> mispredicts will re-direct
>
>> ht thing."
>>
>> The problem is that
> these ITLB table walks are for instructions that were squashed as
> much
> on cycles earlier, and since been removed from the CPU's
> instruction list.
>
> I'm not following here.
>
> Any help will be greatly
> appreciated in solving this problem. I've hit a roadblock with getting
> Ruby working with ARM, most likely due to the fact that ARM has disjoint
> m
>
>> r. I brought this up in my last email about trying to get Ruby
> working. Therefore, I'm trying to get this DramSim2 integration fixed so
> I can start modeling FS with DRAM memory.
>
> Brad/Steve/Nilay anyone have
> a suggestion on how to make this work?
>
> Note that these problems also
> occur in Soplex from the Spec CP
>
>> en't tested on other benchmarks.
>>
> Thanks,
>> Andrew
>>
>> On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski
> <***@drexel.edu [2]> wrote:
>> Hey Gabe,
>> Thanks for this...very
> helpful. I just recently got back into debugging this problem. I made a
> small
> c/base/refcnt.hh to allow me to return the current count of
> references to a DynInst object.
> I then modified existing DPRINTFs to
> also print out reference counts, then added some of my own when I needed
> extra
>
>> What's happening is that it progresses as far as getting
> executed in the IEW once, but a delayed translation occurs, deferring
> the store. By the time it reenters the IEW, the IQ has marked the
> instruction as squashed. Everything progresses as usual from here on
> out, with one exception. When the instruction is removed from the CPU's
> instruction list, there is one reference count hanging.
>> I've added in
> some additional debugging for my traces to help narrow down where this
> reference is coming from. As far as I can tell, it's because of a call
> to initiateAcc() within the executeStore function in the lsq unit.
> Please see the following two traces. The first trace shows what I just
> discussed. The second trace is another memory store instruction that got
> squashed, however, it was squashed upon its first entry into the IEW,
> therefore it never started execution.
>>
> http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out [21]
>>
> http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out [22]
>> Let
> me know if you have any ideas based on these two instruction traces. I
> do not understand how the initiateAcc function results in another
> reference, but maybe someone else does.... Since I don't see how it
> makes a reference, it's hard to find out how to make sure it gets
> dereferenced...
>> Unfortunately, I haven't been able to add a DPRINTF
> in src/base/refcnt.hh ...this would make things more clear (i.e. exactly
> when references/dereferences occur). Let me know if you have any advice on
> this...if it's possible. I can't seem to get the right include files,
> and likely right SConscript compile order...
>> Thanks,
>> Andrew
>>
>>
> On Sat, Apr 7, 2012 at 9:48 PM, Gabe Black <***@eecs.umich.edu [23]>
> wrote:
>>
>>> Without digging into things too deeply, it looks like you
> may be leaking references to dynamic instructions. The CPU may think
> it's done with one, but until that final reference is removed, the
> object will hang around forever. I think I've had problems before where
> the reference count ended up off by one somehow and instructions would
> start piling up. It's also possible that a clog develops in O3's
> pipeline and some internal structure stops letting instructions through
> and starts accumulating them. Either of these problems will be annoying
> to track down, but with enough digging I've been able to fix these sorts
> of things.
>>>
>>> This may have more to do with O3 not handling the
> benchmark you're running well rather than a problem with your new DRAM
> model. There may be some interaction between the two, though, where the
> new memory makes the timing line up to cause O3 to behave poorly. What
> you can do is instrument dynamic instruction creation and destruction
> and reference counting (try printing "this" for both the reference counting
> wrapper and the dyn inst itself) and turn it on as close as you can to
> where things go bad tick wise. Then look for an instruction which gets
> lost, and look for where its reference count is incremented and
> decremented. It should be relatively easy to pair up where references
> are created and destroyed, and you should be able to identify the
> reference which never goes away. Then you need to figure out where that
> reference is being created. After that, you should have enough
> information to identify why the reference counting isn't being done
> correctly. It's arduous, but that's the only way.
>>>
>>> It's important
> to also make sure reference counts aren't decremented to zero
> prematurely. I had a problem once where that happened and the memory
> behind the object was updated by something that didn't know it was dead.
> The memory had since been reallocated to another object of the same
> type, so that other object reflected what happened to the phantom one.
> If I remember that manifested as something weird like an add causing a
> page fault or something.
>>>
>>> Gabe
>>>
>>> On 04/07/12 18:21, Andrew
> Cebulski wrote:
>>>
>>>> Hi all,
>>>> I've looked into this problem some
> more, and have put together a couple traces. I've been becoming more
> familiar with how gem5 handles dynamic instructions, in particular how
> it destroys them. I have two traces to compare, one with the physical
> memory, and the other with the integrated dramsim2 dram memory. I also
> have two plots showing instruction counts over time (sim ticks). All of
> these are linked at the end of the email.
>>>> First, I'm going to go
> into what I've been able to interpret regarding how instructions are
> destroyed. In particular, comparing when DynInst's vs. DynInstPtr's are
> deconstructed/removed from the cpu. I separate these because I've seen a
> difference, as I discuss later. These explanations are fairly
> non-existent on the wiki. There is a section header waiting to be
> filled...
>>>> From what I have been able to gather from the code, there
> is a list of all the instructions in flight in cpu/o3/cpu.cc called
> instList, with the type DynInstPtr. There are three conditions to
> instructions being cleaned from this list:
>>>> 1.) The ROB retires its
> head instruction
>>>> 2.) Fetch receives a rob squashing signal from the
> commit, resulting in removing any instruction not in the ROB
>>>> 3.)
> Decode detects an incorrect branch prediction, resulting in removal of
> all instructions back to the bad seq num.
>>>> Once all five stages have
> completed, the CPU cleans up all the removed in-flight instructions.
> This line in particular in cleanUpRemovedInsts() in cpu/o3/cpu.cc
> deconstructs a DynInstPtr:
>>>> instList.erase(removeList.front());
>>>>
> When I turn on the debug flag O3CPU, I see the message "Removing
> instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum and
> pcState after all 5 cpu stages have completed, and one of the conditions
> above is met. I also see what tick it occurs on.
>>>> When I turn on the
> DynInst debug flag, I see when instructions are created and destroyed
> (cpu/base_dyn_inst_impl.hh) and what tick. From analyzing the trace
> files, I've gathered that this takes into account that instructions have
> different execution lengths. So if one tick a memory instruction in the
> instList (DynInstPtr) is removed, the DynInst for that memory
> instruction will occur much later (i.e. 1M ticks later). I have yet to
> determine how this is implemented.
>>>> Now for the problem.
>>>> What
> I'm seeing when I run dramsim2 dram memory is a significant difference
> between the size of the instList vector (of DynInstPtr objects), and the
> size of dynamic instruction count (of DynInst objects). The benchmark
> I'm running is libquantum from SPEC 2006. For the first roughly 130B
> ticks, the dynamic instruction count kept in cpu/base_dyn_inst.impl.hh
> shadows the instList size in o3/cpu.cc (figure linked below) very
> closely. Around tick 130B after libquantum started, it starts hitting
> what I'm assuming are loops (therefore branch prediction), resulting in
> some behavior that seems to imply improper instruction handling (i.e.
> more instructions in flight than allowed by ROB).
>>>> I wasn't able to
> sync-up the physical and dramsim2 traces exactly by trace, but they
> should represent roughly the same area of execution. They don't execute
> the same due to the dramsim2 modeling the memory differently (i.e.
> latency and other delays).
>>>> I've shared both traces on my public
> Dropbox here --
>>>>
> http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
> [14]
>>>>
> http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
> [15]
>>>> Here are a couple plots of tick versus instruction count,
> with respect to cpu->instcount in cpu/base_dyn_inst.impl.hh and
> instList.size() in cpu/o3/cpu.cc. --
> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
> [16]
>>>>
>>>>
> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
> [17]
>>>> Note that I added the printout of the instList size to an
> existing O3CPU DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.
>>>>
> Here are the commands I ran to parse the traces into data files to
> analyze in MATLAB and create the plots:
>>>> zgrep DynInst
> dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | grep
> destroyed | awk '{print $1,$11}' > cpuinstcount.out
>>>> zgrep instList
> dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | awk '{print
> $1,$11}' > instlistsize.out
>>>> It seems to me like the problem might
> lie in gem5, but has just been exposed by integrating this more detailed
> memory model, dramsim2, into gem5. Either that, or there are some timing
> errors in how dramsim2 was integrated. I doubt this, however, since
> those first 190B ticks executed used the dramsim2 memory. I believe the
> problem is a combination of memory instructions + complex loops (branch
> prediction), resulting in improper destroying of instructions.
>>>> I've
> included the ROB, Commit, Fetch, DynInst and O3CPU debug flags. There
> are 192 ROB entries, which is why the instList size generally has a max
> of about 192 instructions. The dynamic instruction counts (seen in the
> dramsim2 plot) seem to also imply that instructions are incorrectly being
> removed from the ROB, and then from the cpu's instruction list in
> cpu.cc, which allows more and more instructions to be added to the
> system (possibly from a bad branch).
>>>> I appreciate any help in
> debugging this and further figuring out the root problem, just let me
> know if you need anything else from me. I don't have much more time at
> the moment to debug, but I can take any advice for quick changes and/or
> additional traces, then send the results back to the list for
> discussion.
>>>> Thanks,
>>>> Andrew
>>>> P.S. Paul - I did try
> decreasing the size of the dramsim2 transaction (and even command) queue
> from 512 to 32. The same instructions problem occurred. It basically
> just decreased the execution time.
>>>>
>>>> On Wed, Mar 14, 2012 at
> 2:10 PM, Ali Saidi <***@umich.edu [18]> wrote:
>>>>
>>>>> The error is
> that there are more than 1500 instructions currently in flight in the
> system. It could mean several things:
>>>>>
>>>>> 1. The value is
> somewhat arbitrarily defined and maybe there are more than 1500 in your
> system at one time?
>>>>>
>>>>> 2. Instructions aren't being destroyed
> correctly
>>>>>
>>>>> You could try to run a debug binary so you'll
> get a list of instructions when it happens or increase the number which
> may be appropriate for certain situations (but 1500 is quite a few
> inflight instructions).
>>>>>
>>>>> Ali
>>>>>
>>>>> On 13.03.2012 10:56,
> Andrew Cebulski wrote:
>>>>>
>>>>>> Hi Xiangyu,
>>>>>> I just started
> looking into this some more. So at first I thought it was due to
> updating to a more recent revision, but then I went back to revision
> 8643, added your patch, built and ran....and now get the error with it
> too (when running ARM_FS/gem5.opt). I'm testing now to see if an update
> to SWIG might have resulted in this error, maybe someone on the mailing
> list would know if that's possible. The difference is 1.3.40 vs. 2.0.3,
> both of which are supported according to the dependencies wiki page.
>
>>>>>> Just for completeness, here's the error from revision 8643:
>
>>>>>> build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion
> `cpu->instcount
>>>>>>
>>>>>> I have not tried running with gem5.debug,
> so I will be doing that today. Maybe this is an assertion that is
> occurring due to an optimization. That would mean it wouldn't be
> triggered in gem5.debug since it runs without optimizations. Have you
> tested all debug, opt and fast with your tests?
>>>>>> Thanks,
>>>>>>
> Andrew
>>>>>>
>>>>>> On Tue, Mar 13, 2012 at 1:37 PM, Rio Xiangyu Dong
> <***@gmail.com [11]> wrote:
>>>>>>
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>> I didn't see this error in my simulations. May I ask which gem5
>>>>>>> version you are using? I find that some of the latest code updates
>>>>>>> are not compatible with my changes. I am still using the DRAMsim2
>>>>>>> patch on Gem5 repo8643, and have run all the runnable benchmarks in
>>>>>>> SPEC2006, SPEC2000, EEMBC2, and PARSEC2 on ARM_SE.
>>>>>>>
>>>>>>> Thank you!
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Xiangyu
>>>>>>>
>>>>>>> FROM: Andrew Cebulski [mailto:***@drexel.edu [8]]
>>>>>>> SENT: Thursday, March 08, 2012 6:52 PM
>>>>>>> TO: gem5 users mailing list
>>>>>>> CC: ***@gmail.com [9]; ***@umich.edu [10]
>>>>>>> SUBJECT: Re: [gem5-users] A Patch for DRAMsim2 Integration
>>>>>>>
>
>>>>>>> Xiangyu,
>>>>>>>
>>>>>>> I've been having an issue recently with the number of instructions
>>>>>>> I've been seeing committed to the CPU (I have a separate thread on
>>>>>>> this). It turns out the issue seems to be coming from this patch you
>>>>>>> created to integrate DramSim2 with Gem5. Unfortunately, I've been
>>>>>>> running with gem5.fast, not gem5.opt, so up until now I haven't been
>>>>>>> seeing assertions. I thought I'd run it with gem5.opt or debug back
>>>>>>> in December, but I must not have. My runs on the ARM O3 CPU fail
>>>>>>> with this assertion:
>>>>>>>
>>>>>>> build/ARM/cpu/base_dyn_inst_impl.hh:149: void BaseDynInst::initVars()
>>>>>>> [with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>>>>>
>>>>>>> -Andrew
>>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> gem5-users mailing list
>>>>>> gem5-***@gem5.org [12]
>>>>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users [13]
>
>
> Links:
> ------
> [1] http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
> [2] mailto:***@drexel.edu
> [3] mailto:***@gmail.com
> [4] mailto:gem5-***@gem5.org
> [5] http://gmail.com
> [6] http://www.cse.psu.edu/%7Exydong
> [7] http://m5sim.org/cgi-bin/mailman/private/gem5-users/attachments/20111218/f3fdf5da/attachment.html
> [8] mailto:***@drexel.edu
> [9] mailto:***@gmail.com
> [10] mailto:***@umich.edu
> [11] mailto:***@gmail.com
> [12] mailto:gem5-***@gem5.org
> [13] http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
> [14] http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
> [15] http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
> [16] http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
> [17] http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
> [18] mailto:***@umich.edu
> [19] mailto:gem5-***@gem5.org
> [20] http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
> [21] http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
> [22] http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
> [23] mailto:***@eecs.umich.edu
>