There's a limit on the size of the TLB itself, but there may not be a
limit on the number of translations it's doing at one time. I suspect
that's an important part of the problem.
Gabe
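
A minimal sketch of the kind of limit discussed above, written as a stand-alone Python model rather than gem5 code. The class name BoundedWalker, its method names, and the default limit of 8 are assumptions for illustration only (8 is simply the queue depth at which the invalid-descriptor faults were reported to start, not a gem5 parameter). The idea it shows: once the pending queue is full, a push is refused, so the requester has to stall and retry instead of letting translations pile up.

```python
from collections import deque


class BoundedWalker:
    """Toy model of a table walker whose pending queue has a hard limit."""

    def __init__(self, max_pending=8):
        self.max_pending = max_pending  # hypothetical limit, not a gem5 parameter
        self.pending = deque()
        self.stalls = 0  # requests refused because the queue was full

    def try_push(self, vaddr):
        """Queue a translation; return False (stall) if the queue is full."""
        if len(self.pending) >= self.max_pending:
            self.stalls += 1
            return False
        self.pending.append(vaddr)
        return True

    def pop_front(self):
        """Finish the oldest walk, freeing one slot; return None if idle."""
        return self.pending.popleft() if self.pending else None


def simulate(pushes_per_tick, pops_per_tick, ticks, max_pending=8):
    """Drive the model at fixed push/pop rates; return (peak size, stalls)."""
    walker = BoundedWalker(max_pending)
    peak = 0
    for tick in range(ticks):
        for i in range(pushes_per_tick):
            walker.try_push((tick, i))
        peak = max(peak, len(walker.pending))
        for _ in range(pops_per_tick):
            walker.pop_front()
    return peak, walker.stalls
```

With pushes arriving faster than pops, the bounded model caps the queue and counts stalls, while a very large max_pending lets the queue grow for as long as the imbalance lasts.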
On 05/15/12 00:44, Andrew Cebulski wrote:
> Here is the latest in my debugging:
>
> http://dl.dropbox.com/u/2953302/gem5/pendingQueuePushPop.png
>
> The frequency of occurrence of the doL2DescriptorWrapper function
> (where I was seeing invalid faults) actually controls the size of the
> pendingQueue. What I'm showing are where the pendingQueue size is
> increased (with a push_back) and where it is decreased (pop_front). I
> put my DPRINTF for the decrease at the end of the
> doL2DescriptorWrapper function in table_walker.cc. This is actually
> right after a function call to nextWalk, which schedules a process
> event (doProcessEvent aka processWalkWrapper()) for the next tick,
> which is where the pop of the pendingQueue occurs.
>
> My first bin is large, just to show how the push/pop rate roughly
> averages out at the start of the plot (there is still an imbalance,
> just finer grained). The bins where no push_backs are visible contain
> fewer than 20 of them. Note how the difference between the push and
> pop counts is roughly the peak of each rise/fall. I'm still trying
> to debug why the imbalance between the pendingQueue pushes and the L2
> function calls is occurring, namely at the changes from rise to fall
> in the size, but I seem to be narrowing in on it.
>
> Basically, it looks like there isn't a limit in place for the size of
> the TLB, therefore no stalls are being sent to stop more TLB
> transactions from initiating. The invalid accesses are likely a
> result of this too. Looking more closely in my traces, it looks like
> the L2 descriptor invalid errors start occurring once the pendingQueue
> increases above roughly 8 entries.
>
> Here are the sizes of each bin (N1 is the push_back, N2 the L2
> function):
>
> N1 =
>
> 867
> 11
> 388
> 11
> 775
> 3
> 1535
> 17
> 2127
> 0
>
> N2 =
>
> 788
> 205
> 95
> 300
> 189
> 588
> 376
> 1184
> 751
> 1
>
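
The per-bin imbalance can be worked out directly from the two lists above. This is a small sketch, assuming that the N2 counts (calls to the L2 function) can stand in for pops, as described in the message; the function names are made up for the example, and the running total is only a rough indicator since the bins are coarse.

```python
from itertools import accumulate

# Bin counts copied from the message above.
N1 = [867, 11, 388, 11, 775, 3, 1535, 17, 2127, 0]     # push_back calls per bin
N2 = [788, 205, 95, 300, 189, 588, 376, 1184, 751, 1]  # L2 function calls per bin


def net_per_bin(pushes, pops):
    """Pushes minus pops for each bin (positive means the queue grew)."""
    return [p - q for p, q in zip(pushes, pops)]


def running_total(deltas):
    """Cumulative sum of the per-bin differences."""
    return list(accumulate(deltas))


if __name__ == "__main__":
    deltas = net_per_bin(N1, N2)
    for i, (d, total) in enumerate(zip(deltas, running_total(deltas)), start=1):
        print(f"bin {i:2d}: net {d:+6d}  running {total:+6d}")
```

The differences alternate in sign from bin to bin, which matches the rise/fall pattern in the plot.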
> -Andrew
>
>
>
> On Mon, May 14, 2012 at 11:15 AM, Andrew Cebulski <***@drexel.edu
> <mailto:***@drexel.edu>> wrote:
>
> Ali,
>
> Looking at the trace file for the TLB walker that I sent earlier,
> I see a considerable number of these faults:
>
> L2 descriptor invalid, causing fault
>
> This is within the doL2Descriptor function in table_walker.cc.
>
> Here's a look at the frequency of these faults, with bins centered
> around the base of each rise/fall of the pendingQueue size (see
> small arrows on x-axis):
>
> http://dl.dropbox.com/u/2953302/gem5/L2faults.png
>
> I'm still looking into how this fault is handled, along with your
> other questions. I probably won't have much of a chance to get
> into it more until late today or tomorrow though. Let me know if
> you have any new ideas based on these results.
>
> Thanks,
> Andrew
>
> On Fri, May 11, 2012 at 12:17 AM, Ali Saidi <***@umich.edu
> <mailto:***@umich.edu>> wrote:
>
> Hi Andrew,
>
> Looking at the trace it seems like there are a lot of invalid
> translations that are occurring. Everything to an address less
> than 0x1000 is likely invalid. An invalid translation will
> return a fault (setting the fault pointer in the dynamic
> instruction to something other than NoFault), and the
> instruction will either be squashed by a mispredicted branch
> or redirect fetch to a kernel handler. I'm wondering if that
> isn't happening for some reason. You need to trace back some
> of these translations, see what the instruction serial
> number is for them, and then see what the instruction's lifetime
> is like. Are they getting squashed? Looking at your graph,
> when the instructions fall to 0, what is the cause? Does an
> interrupt occur right before? Something else?
>
>
>
> Thanks,
>
> Ali
>
>
>
>
>
> On 07.05.2012 20:53, Andrew Cebulski wrote:
>
>> Hi Ali and Gabe,
>>
>> Here's the trace file:
>> http://dl.dropbox.com/u/2953302/gem5/table_walker.out
>> The pending queue size in the table walker follows the
>> shape of the dynamic instruction curves. The L1 and L2 queue
>> size never go above 0. Comparing DynInst count in
>> cpu->instcount with pendingQueue size:
>> http://dl.dropbox.com/u/2953302/gem5/pendingQueueSize.png
>>
>> -Andrew
>>
>> On Sun, May 6, 2012 at 12:01 PM, Ali Saidi <***@umich.edu
>> <mailto:***@umich.edu>> wrote:
>>
>> Hi Andrew,
>>
>> Could you add some code to the table walker to see how
>> big the following are getting:
>> stateQueueL1.size()
>> stateQueueL2.size()
>> pendingQueue.size()
>>
>> Perhaps we're somehow getting into a loop where there
>> are a lot of translations to invalid addresses that get
>> squashed and they pile up in the table walker?
>>
>> Thanks,
>> Ali
>>
>>
>>
>> On May 4, 2012, at 7:53 AM, Gabriel Michael Black wrote:
>>
>> > I haven't had a chance to study what's going on here,
>> but could the problem be that we don't have bandwidth
>> limits/back pressure implemented for the TLB and delayed
>> translation? It could be that the CPU is pumping
>> instructions into translation which eventually drain
>> out/are squashed, and if too many accumulate they trip
>> that assert.
>> >
>> > That may not actually make any sense as far as what the
>> code is actually doing, but it occurred to me as a
>> possibility and I thought I'd throw it out there.
>> >
>> > Gabe
>> >
>> > Quoting Andrew Cebulski <***@drexel.edu
>> <mailto:***@drexel.edu>>:
>> >
>> >> I double-checked by looking at the config.ini file.
>> It turns out I did
>> >> actually create the checkpoint with an Atomic CPU
>> without caches. Sorry
>> >> for the confusion.
>> >>
>> >> -Andrew
>> >>
>> >> On Wed, May 2, 2012 at 10:12 PM, Andrew Cebulski
>> <***@drexel.edu <mailto:***@drexel.edu>> wrote:
>> >>
>> >>> I started hitting this assertion (that the number of
>> insts in flight was >
>> >>> 1500) before I started using a checkpoint. I created
>> the checkpoint
>> >>> afterwards to decrease the time needed to run
>> simulations to debug this
>> >>> problem. I'll create a new checkpoint, then send the
>> new trace output.
>> >>>
>> >>> -Andrew
>> >>>
>> >>>
>> >>> On Wed, May 2, 2012 at 9:53 PM, Ali Saidi
>> <***@umich.edu <mailto:***@umich.edu>> wrote:
>> >>>
>> >>>>
>> >>>> It's likely the cause for all of your problems.
>> Dirty data in the caches
>> >>>> doesn't get restored either. You should always
>> create checkpoints with an
>> >>>> atomic cpu and without caches.
>> >>>>
>> >>>>
>> >>>>
>> >>>> Ali
>> >>>>
>> >>>>
>> >>>>
>> >>>> On 02.05.2012 21:23, Andrew Cebulski wrote:
>> >>>>
>> >>>> Sorry, I created the checkpoint I referred to with
>> an O3 CPU with caches.
>> >>>> From what I recall reading, caches don't get
>> restored from checkpoints.
>> >>>> Since the checkpoint wasn't during the benchmark
>> run, I assumed that was
>> >>>> okay.
>> >>>> -Andrew
>> >>>>
>> >>>> On Wed, May 2, 2012 at 9:07 PM, Ali Saidi
>> <***@umich.edu <mailto:***@umich.edu>> wrote:
>> >>>>
>> >>>>> You haven't answered the question about if you
>> created the checkpoints
>> >>>>> with an atomic cpu without caches.
>> >>>>>
>> >>>>> Ali
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On 02.05.2012 19:58, Andrew Cebulski wrote:
>> >>>>>
>> >>>>> I have not run with the checker CPU recently.
>> Here's the stderr output
>> >>>>> from a run I did awhile back:
>> >>>>> http://dl.dropbox.com/u/2953302/gem5/err.0
>> >>>>> Note that the instruction match error is before my
>> benchmark actually
>> >>>>> starts running. The start of my boot script checks
>> to see if my files
>> >>>>> image is mounted (which it is), then continues on
>> to run the benchmark. I
>> >>>>> booted the system, mounted my files image, then
>> took a checkpoint. I've
>> >>>>> been running all my tests from that checkpoint. I
>> found where my benchmark
>> >>>>> started based on the ASID (from ExecAsid debug flag).
>> >>>>> I delayed the start of gathering trace data until
>> the second-to-last
>> >>>>> linear increase in dynamic instructions in-flight.
>> I'm running a new trace
>> >>>>> now.
>> >>>>> -Andrew
>> >>>>>
>> >>>>>
>> >>>>> On Wed, May 2, 2012 at 5:28 PM, Ali Saidi
>> <***@umich.edu <mailto:***@umich.edu>> wrote:
>> >>>>>
>> >>>>>> Something is wrong well before this point. There
>> is no reason that
>> >>>>>> address 0x0 or 0x4 should be translated.
>> >>>>>>
>> >>>>>> Did you happen to create a checkpoint when caches
>> were in the system?
>> >>>>>>
>> >>>>>> Have you tried to run with the checker cpu and see
>> if it detects any
>> >>>>>> errors?
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> Ali
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On 02.05.2012 17:22, Andrew Cebulski wrote:
>> >>>>>>
>> >>>>>> They are data TLB misses that occur as the
>> in-flight instruction count
>> >>>>>> rises (at 0x0 and 0x4). The last TLB miss before
>> the in-flight instruction
>> >>>>>> count finally linearly decreases is to 0x200.
>> Also, at the start of the
>> >>>>>> rising slope, I see a miss to 0x8 and 0x2508c.
>> >>>>>> Here's a trace file:
>> >>>>>> http://dl.dropbox.com/u/2953302/gem5/tlb.out
>> >>>>>> To reduce size, I just have lines that have either
>> TLB or walker in
>> >>>>>> them.
>> >>>>>> I do see only a handful of instruction TLB misses.
>> >>>>>> -Andrew
>> >>>>>>
>> >>>>>> On Wed, May 2, 2012 at 11:10 AM, Ali Saidi
>> <***@umich.edu <mailto:***@umich.edu>> wrote:
>> >>>>>>
>> >>>>>>> Hi Andrew,
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Thanks for digging into this. I think there is an
>> issue somewhere, but
>> >>>>>>> I'm still not sure where.
>> >>>>>>>
>> >>>>>>> Ali
>> >>>>>>>
>> >>>>>>> On 01.05.2012 23:34, Andrew Cebulski wrote:
>> >>>>>>>
>> >>>>>>> Okay, I'm positive now that the issue lies with
>> delayed translations
>> >>>>>>> that are squashed before finishing.
>> >>>>>>>
>> >>>>>>> On the data or instruction side? You seem to
>> allude to data in the
>> >>>>>>> paragraph below, but then instructions in the
>> latter text.
>> >>>>>>>
>> >>>>>>> It seems to me like speculative load/stores are
>> being executed,
>> >>>>>>> rather than waiting for the instructions to
>> commit. Once the instructions
>> >>>>>>> begin getting (speculatively) executed in the
>> TLB, a reference is left
>> >>>>>>> there, which seems hard to root out and
>> dereference after the instruction
>> >>>>>>> ends up being squashed. At least, I have not
>> been able to find that out in
>> >>>>>>> the source code as of yet. Can anyone clarify on
>> this?
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> There should only be one translation outstanding
>> from each
>> >>>>>>> instruction and data side walker. Any nested
>> transactions should be queued
>> >>>>>>> in the walker. Until one finishes, I'm not sure
>> how multiple would ever be
>> >>>>>>> outstanding.
>> >>>>>>>
>> >>>>>>> Recall the following image that shows how the
>> number of dynamic
>> >>>>>>> instruction (DynInst) objects in-flight increases
>> linearly for varying
>> >>>>>>> periods of time:
>> >>>>>>>
>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>> >>>>>>> After enabling the TLB debug flag, I see that the
>> linear increase in
>> >>>>>>> instructions in flight is proportional to the
>> number of TLB misses. These
>> >>>>>>> TLB misses have a much larger delay (resulting in
>> translation delays) due
>> >>>>>>> to the fact that DramSim2 models the memory system
>> more accurately. It
>> >>>>>>> seems that with the classic memory system, TLB
>> misses often do not have
>> >>>>>>> translation delays. For whatever reason, it
>> would also seem that every
>> >>>>>>> instruction that has a TLB miss also is
>> eventually squashed...
>> >>>>>>>
>> >>>>>>> From a data side perspective this is reasonable.
>> While a miss is
>> >>>>>>> outstanding at some point instructions will stop
>> committing and thus the
>> >>>>>>> instructions in flight will begin to rise until
>> the miss is satisfied.
>> >>>>>>>
>> >>>>>>> Here's a summary of outputs from my trace. These
>> two DPRINTF
>> >>>>>>> messages appear on the rising slopes (repeated
>> up until the peak):
>> >>>>>>> TLB Miss: Starting hardware table walker for 0(656)
>> >>>>>>> TLB Miss: Starting hardware table walker for 0x4(656)
>> >>>>>>>
>> >>>>>>> This is interesting/odd. I don't know a good
>> reason why (1) a miss
>> >>>>>>> would be outstanding to both address 0 and
>> address 4 at the same time. In
>> >>>>>>> almost all cases these pages are marked as
>> no-access to detect segfaults.
>> >>>>>>> Perhaps there is an issue where the cpu is
>> getting into a loop faulting on
>> >>>>>>> a bad access and then faulting again on the fault
>> handler. I could imagine
>> >>>>>>> this would happen if there was some corruption in
>> the memory system (for
>> >>>>>>> example the timings in dramsim exposing a bug in
>> the cache models or
>> >>>>>>> something).
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> At the peak, the following message appears (from
>> fetch) almost every
>> >>>>>>> tick for (what I believe to be) every single one
>> of the table walkers that
>> >>>>>>> were squashed.
>> >>>>>>> Fetch is waiting ITLB walk to finish!
>> >>>>>>>
>> >>>>>>> There must be another walk in flight? The
>> instruction side will only
>> >>>>>>> have one fault outstanding at once. Successive
>> branch mispredicts will
>> >>>>>>> re-direct fetch but there is code that catches
>> the fact that a different
>> >>>>>>> walk completed than expected and "does the right
>> thing."
>> >>>>>>>
>> >>>>>>> The problem is that these ITLB table walks are
>> for instructions that
>> >>>>>>> were squashed as much as 0.3 billion cycles
>> earlier, and have since been removed
>> >>>>>>> from the CPU's instruction list.
>> >>>>>>>
>> >>>>>>> I'm not following here.
>> >>>>>>>
>> >>>>>>> Any help will be greatly appreciated in solving
>> this problem. I've
>> >>>>>>> hit a roadblock with getting Ruby working with
>> ARM, most likely due to the
>> >>>>>>> fact that ARM has disjoint memory (x86 and Alpha
>> do not). There's the 256
>> >>>>>>> MB for physical memory, then the 64 MB for the
>> boot loader. I brought this
>> >>>>>>> up in my last email about trying to get Ruby
>> working. Therefore, I'm
>> >>>>>>> trying to get this DramSim2 integration fixed so
>> I can start modeling FS
>> >>>>>>> with DRAM memory.
>> >>>>>>>
>> >>>>>>> Brad/Steve/Nilay anyone have a suggestion on how
>> to make this work?
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Note that these problems also occur in Soplex
>> from the Spec CPU2006
>> >>>>>>> benchmark suite (also hits 1500 in-flight
>> instructions assertion). Due to
>> >>>>>>> time constraints, I haven't tested on other
>> benchmarks.
>> >>>>>>> Thanks,
>> >>>>>>> Andrew
>> >>>>>>> On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski
>> <***@drexel.edu <mailto:***@drexel.edu>> wrote:
>> >>>>>>>
>> >>>>>>>> Hey Gabe,
>> >>>>>>>> Thanks for this...very helpful. I just
>> recently got back into
>> >>>>>>>> debugging this problem. I made a small change
>> in src/base/refcnt.hh to
>> >>>>>>>> allow me to return the current count of
>> references to a DynInst object.
>> >>>>>>>> I then modified existing DPRINTFs to also
>> print out reference
>> >>>>>>>> counts, then added some of my own when I needed
>> extra visibility.
>> >>>>>>>> I've found one memory store instruction that
>> seems to be getting
>> >>>>>>>> lost. What's happening is that it progresses as
>> far as getting executed in
>> >>>>>>>> the IEW once, but a delayed translation occurs,
>> deferring the store. By
>> >>>>>>>> the time it reenters the IEW, the IQ has marked
>> the instruction as
>> >>>>>>>> squashed. Everything progresses as usual from
>> here on out, with one
>> >>>>>>>> exception. When the instruction is removed from
>> the CPU's instruction list,
>> >>>>>>>> there is one reference count hanging.
>> >>>>>>>> I've added in some additional debugging for
>> my traces to help
>> >>>>>>>> narrow down where this reference is coming from.
>> As far as I can tell,
>> >>>>>>>> it's because of a call to initiateAcc() within
>> the executeStore function in
>> >>>>>>>> the lsq unit. Please see the following two
>> traces. The first trace shows
>> >>>>>>>> what I just discussed. The second trace is
>> another memory store
>> >>>>>>>> instruction that got squashed, however, it was
>> squashed upon its first
>> >>>>>>>> entry into the IEW, therefore it never started
>> execution.
>> >>>>>>>>
>> http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
>> >>>>>>>>
>> http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
>> >>>>>>>> Let me know if you have any ideas based on
>> these two instruction
>> >>>>>>>> traces. I do not understand how the initiateAcc
>> function results in
>> >>>>>>>> another reference, but maybe someone else
>> does.... Since I don't see how
>> >>>>>>>> it makes a reference, it's hard to find out how
>> to make sure it gets
>> >>>>>>>> dereferenced...
>> >>>>>>>> Unfortunately, I haven't been able to add a
>> DPRINTF in
>> >>>>>>>> src/base/refcnt.hh ...this would make things
>> more clear (i.e. exactly when
>> >>>>>>>> references/dereferences occur). Let me know if
>> you have any advice on
>> >>>>>>>> this...if it's possible. I can't seem to get
>> the right include files, and
>> >>>>>>>> likely right SConscript compile order...
>> >>>>>>>> Thanks,
>> >>>>>>>> Andrew
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Sat, Apr 7, 2012 at 9:48 PM, Gabe Black
>> <***@eecs.umich.edu <mailto:***@eecs.umich.edu>> wrote:
>> >>>>>>>>
>> >>>>>>>>> Without digging into things too deeply, it
>> looks like you may be
>> >>>>>>>>> leaking references to dynamic instructions. The
>> CPU may think it's done
>> >>>>>>>>> with one, but until that final reference is
>> removed, the object will hang
>> >>>>>>>>> around forever. I think I've had problems
>> before where the reference
>> >>>>>>>>> count ended up off by one somehow and
>> instructions would start piling up.
>> >>>>>>>>> It's also possible that a clog develops in O3's
>> pipeline and some internal
>> >>>>>>>>> structure stops letting instructions through
>> and starts accumulating them.
>> >>>>>>>>> Either of these problems will be annoying to
>> track down, but with enough
>> >>>>>>>>> digging I've been able to fix these sorts of
>> things.
>> >>>>>>>>>
>> >>>>>>>>> This may have more to do with O3 not handling
>> the benchmark you're
>> >>>>>>>>> running well rather than a problem with your
>> new DRAM model. There may be
>> >>>>>>>>> some interaction between the two, though, where
>> the new memory makes the
>> >>>>>>>>> timing line up to cause O3 to behave poorly.
>> What you can do is instrument
>> >>>>>>>>> dynamic instruction creation and destruction
>> and reference counting (try
>> >>>>>>>>> printing "this" for both the reference counting
>> wrapper and the dyn inst
>> >>>>>>>>> itself) and turn it on as close as you can to
>> where things go bad tick
>> >>>>>>>>> wise. Then look for an instruction which gets
>> lost, and look for where its
>> >>>>>>>>> reference count is incremented and decremented.
>> It should be relatively
>> >>>>>>>>> easy to pair up where references are created
>> and destroyed, and you should
>> >>>>>>>>> be able to identify the reference which never
>> goes away. Then you need to
>> >>>>>>>>> figure out where that reference is being
>> created. After that, you should
>> >>>>>>>>> have enough information to identify why the
>> reference counting isn't being
>> >>>>>>>>> done correctly. It's arduous, but that's the
>> only way.
>> >>>>>>>>>
>> >>>>>>>>> It's important to also make sure reference
>> counts aren't decremented
>> >>>>>>>>> to zero prematurely. I had a problem once where
>> that happened and the
>> >>>>>>>>> memory behind the object was updated by
>> something that didn't know it was
>> >>>>>>>>> dead. The memory had since been reallocated to
>> another object of the same
>> >>>>>>>>> type, so that other object reflected what
>> happened to the phantom one. If I
>> >>>>>>>>> remember that manifested as something weird
>> like an add causing a page
>> >>>>>>>>> fault or something.
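
The pairing step described above can be sketched as a small trace post-processor. Everything about the input here is an assumption: the "incref"/"decref" line format, the "inst=" and "site=" fields, and the idea that a reference is released at the same site that took it are all hypothetical, since the thread does not show what the instrumented refcnt.hh output looks like. The sketch reports, per instruction, the sites that still hold a reference at the end of the trace, and flags a decrement with no matching increment as a possible premature release.

```python
import re
from collections import defaultdict

# Hypothetical trace line format, e.g. "1000: incref inst=0x7f10 site=iew"
EVENT_RE = re.compile(r"^(\d+): (incref|decref) inst=(\S+) site=(\S+)$")


def find_leaks(lines):
    """Return {inst: [sites still holding a reference]} for unbalanced insts."""
    holders = defaultdict(list)
    for line in lines:
        m = EVENT_RE.match(line.strip())
        if not m:
            continue  # ignore unrelated trace output
        _tick, kind, inst, site = m.groups()
        if kind == "incref":
            holders[inst].append(site)
        elif site in holders[inst]:
            holders[inst].remove(site)  # matched an earlier incref from this site
        else:
            # decref with no matching incref: possible premature release
            holders[inst].append("UNMATCHED-decref@" + site)
    return {inst: sites for inst, sites in holders.items() if sites}
```

An instruction that drops out of the result was balanced; anything left over names the site to inspect first.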
>> >>>>>>>>>
>> >>>>>>>>> Gabe
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> On 04/07/12 18:21, Andrew Cebulski wrote:
>> >>>>>>>>>
>> >>>>>>>>> Hi all,
>> >>>>>>>>> I've looked into this problem some more, and
>> have put together a
>> >>>>>>>>> couple traces. I've been becoming more
>> familiar with how gem5 handles
>> >>>>>>>>> dynamic instructions, in particular how it
>> destroys them. I have two
>> >>>>>>>>> traces to compare, one with the physical
>> memory, and the other with the
>> >>>>>>>>> integrated dramsim2 dram memory. I also have
>> two plots showing instruction
>> >>>>>>>>> counts over time (sim ticks). All of these are
>> linked at the end of the
>> >>>>>>>>> email.
>> >>>>>>>>> First, I'm going to go into what I've been able
>> to interpret
>> >>>>>>>>> regarding how instructions are destroyed. In
>> particular, comparing when
>> >>>>>>>>> DynInst's vs. DynInstPtr's are
>> deconstructed/removed from the cpu. I
>> >>>>>>>>> separate these because I've seen a difference,
>> as I discuss later. These
>> >>>>>>>>> explanations are fairly non-existent on the
>> wiki. There is a section
>> >>>>>>>>> header waiting to be filled...
>> >>>>>>>>> From what I have been able to gather from the
>> code, there is a list
>> >>>>>>>>> of all the instructions in flight in
>> cpu/o3/cpu.cc called instList, with
>> >>>>>>>>> the type DynInstPtr. There are three
>> conditions to instructions being
>> >>>>>>>>> cleaned from this list:
>> >>>>>>>>> 1.) The ROB retires its head instruction
>> >>>>>>>>> 2.) Fetch receives a rob squashing signal from
>> the commit,
>> >>>>>>>>> resulting in removing any instruction not in
>> the ROB
>> >>>>>>>>> 3.) Decode detects an incorrect branch
>> prediction, resulting in
>> >>>>>>>>> removal of all instructions back to the bad seq
>> num.
>> >>>>>>>>> Once all five stages have completed, the CPU
>> cleans up all the
>> >>>>>>>>> removed in-flight instructions. This line in
>> particular
>> >>>>>>>>> in cleanUpRemovedInsts() in cpu/o3/cpu.cc
>> deconstructs a DynInstPtr:
>> >>>>>>>>> instList.erase(removeList.front());
>> >>>>>>>>> When I turn on the debug flag O3CPU, I see the
>> message "Removing
>> >>>>>>>>> instruction, ..." (from o3/cpu.cc) with the
>> threadNum, seqNum and pcState
>> >>>>>>>>> after all 5 cpu stages have completed, and one
>> of the conditions above is
>> >>>>>>>>> met. I also see what tick it occurs on.
>> >>>>>>>>> When I turn on the DynInst debug flag, I see
>> when instructions are
>> >>>>>>>>> created and destroyed
>> (cpu/base_dyn_inst_impl.hh) and what tick. From
>> >>>>>>>>> analyzing the trace files, I've gathered that
>> this takes into account that
>> >>>>>>>>> instructions have different execution lengths.
>> So if one tick a memory
>> >>>>>>>>> instruction in the instList (DynInstPtr) is
>> removed, the DynInst for that
>> >>>>>>>>> memory instruction will occur much later (i.e.
>> 1M ticks later). I have yet
>> >>>>>>>>> to determine how this is implemented.
>> >>>>>>>>> Now for the problem.
>> >>>>>>>>> What I'm seeing when I run dramsim2 dram memory
>> is a significant
>> >>>>>>>>> difference between the size of the instList
>> vector (of DynInstPtr objects),
>> >>>>>>>>> and the size of dynamic instruction count (of
>> DynInst objects). The
>> >>>>>>>>> benchmark I'm running is libquantum from SPEC
>> 2006. For the first roughly
>> >>>>>>>>> 130B ticks, the dynamic instruction count kept
>> in cpu/base_dyn_inst.impl.hh
>> >>>>>>>>> shadows the instList size in o3/cpu.cc (figure
>> linked below) very closely.
>> >>>>>>>>> Around tick 130B after libquantum started, it
>> starts hitting what I'm
>> >>>>>>>>> assuming are loops (therefore branch
>> prediction), resulting in some
>> >>>>>>>>> behavior that seems to imply improper
>> instruction handling (i.e. more
>> >>>>>>>>> instructions in flight than allowed by ROB).
>> >>>>>>>>> I wasn't able to sync-up the physical and
>> dramsim2 traces exactly by
>> >>>>>>>>> trace, but they should represent roughly the
>> same area of execution. They
>> >>>>>>>>> don't execute the same due to the dramsim2
>> modeling the memory differently
>> >>>>>>>>> (i.e. latency and other delays).
>> >>>>>>>>> I've shared both traces on my public Dropbox
>> here --
>> >>>>>>>>>
>> >>>>>>>>>
>> http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
>> >>>>>>>>>
>> >>>>>>>>>
>> http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
>> >>>>>>>>> Here are a couple plots of tick versus
>> instruction count, with
>> >>>>>>>>> respect to cpu->instcount in
>> cpu/base_dyn_inst_impl.hh and instList.size()
>> >>>>>>>>> in cpu/o3/cpu.cc. --
>> >>>>>>>>>
>> >>>>>>>>>
>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
>> >>>>>>>>>
>> >>>>>>>>>
>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>> >>>>>>>>> Note that I added the printout of the instList
>> size to an existing
>> >>>>>>>>> O3CPU DPRINTF in cleanUpRemovedInsts() in
>> cpu/o3/cpu.cc.
>> >>>>>>>>> Here are the commands I ran to parse the traces
>> into data files to
>> >>>>>>>>> analyze in MATLAB and create the plots:
>> >>>>>>>>> zgrep DynInst
>> >>>>>>>>>
>> dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz |
>> grep destroyed
>> >>>>>>>>> | awk '{print $1,$11}' > cpuinstcount.out
>> >>>>>>>>> zgrep instList
>> >>>>>>>>>
>> dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz |
>> awk '{print
>> >>>>>>>>> $1,$11}' > instlistsize.out
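
For anyone who prefers to do the same extraction without zgrep and awk, here is a rough Python equivalent. It assumes only what the commands above imply: lines are whitespace-separated, and the wanted values sit in columns 1 and 11. The function names are invented for the example, and matching is by plain substring rather than regular expression.

```python
import gzip


def extract_fields(lines, must_contain, columns=(1, 11)):
    """Yield the requested 1-based whitespace-separated columns, like awk."""
    for line in lines:
        if not all(word in line for word in must_contain):
            continue
        fields = line.split()
        # awk prints an empty string for a missing column; do the same here
        yield " ".join(fields[c - 1] if c <= len(fields) else "" for c in columns)


def extract_from_gz(path, must_contain, columns=(1, 11)):
    """Run extract_fields over a gzip-compressed trace file."""
    with gzip.open(path, "rt", errors="replace") as trace:
        yield from extract_fields(trace, must_contain, columns)
```

For the first command above, the call would be extract_from_gz(path, ["DynInst", "destroyed"]), with each yielded string written as one line of cpuinstcount.out.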
>> >>>>>>>>> It seems to me like the problem might lie in
>> gem5, but has just been
>> >>>>>>>>> exposed by integrating this more detailed
>> memory model, dramsim2, into
>> >>>>>>>>> gem5. Either that, or there are some timing
>> errors in how dramsim2 was
>> >>>>>>>>> integrated. I doubt this, however, since those
>> first 190B ticks executed
>> >>>>>>>>> used the dramsim2 memory. I believe the
>> problem is a combination of memory
>> >>>>>>>>> instructions + complex loops (branch
>> prediction), resulting in improper
>> >>>>>>>>> destroying of instructions.
>> >>>>>>>>> I've included the ROB, Commit, Fetch, DynInst
>> and O3CPU debug flags.
>> >>>>>>>>> There are 192 ROB entries, which is why the
>> instList size generally has a
>> >>>>>>>>> max of about 192 instructions. The dynamic
>> instruction counts (seen in the
>> >>>>>>>>> dramsim2 plot) seem to also imply that
>> instructions are incorrectly being
>> >>>>>>>>> removed from the ROB, and then from the cpu's
>> instruction list in cpu.cc,
>> >>>>>>>>> which allows more and more instructions to be
>> added to the system (possibly
>> >>>>>>>>> from a bad branch).
>> >>>>>>>>> I appreciate any help in debugging this and
>> further figuring out the
>> >>>>>>>>> root problem, just let me know if you need
>> anything else from me. I don't
>> >>>>>>>>> have much more time at the moment to debug, but
>> I can take any advice for
>> >>>>>>>>> quick changes and/or additional traces, then
>> send the results back to the
>> >>>>>>>>> list for discussion.
>> >>>>>>>>> Thanks,
>> >>>>>>>>> Andrew
>> >>>>>>>>> P.S. Paul - I did try decreasing the size of
>> the dramsim2
>> >>>>>>>>> transaction (and even command) queue from 512
>> to 32. The same instructions
>> >>>>>>>>> problem occurred. It basically just decreased
>> the execution time.
>> >>>>>>>>>
>> >>>>>>>>> On Wed, Mar 14, 2012 at 2:10 PM, Ali Saidi
>> <***@umich.edu <mailto:***@umich.edu>> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> The error is that there are more than 1500
>> instructions currently
>> >>>>>>>>>> in flight in the system. It could mean several
>> things:
>> >>>>>>>>>>
>> >>>>>>>>>> 1. The value is somewhat arbitrarily defined
>> and maybe there are
>> >>>>>>>>>> more than 1500 in your system at one time?
>> >>>>>>>>>>
>> >>>>>>>>>> 2. Instructions aren't being destroyed correctly
>> >>>>>>>>>>
>> >>>>>>>>>> You could try to run a debug binary so
>> you'll get a list of
>> >>>>>>>>>> instructions when it happens or increase the
>> number which may
>> >>>>>>>>>> be appropriate for certain situations (but
>> 1500 is quite a few inflight
>> >>>>>>>>>> instructions).
>> >>>>>>>>>>
>> >>>>>>>>>> Ali
>> >>>>>>>>>>
>> >>>>>>>>>> On 13.03.2012 10:56, Andrew Cebulski wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> Hi Xiangyu,
>> >>>>>>>>>> I just started looking into this some more.
>> So at first I
>> >>>>>>>>>> thought it was due to updating to a more
>> recent revision, but then I went
>> >>>>>>>>>> back to revision 8643, added your patch, built
>> and ran....and now get the
>> >>>>>>>>>> error with it too (when running
>> ARM_FS/gem5.opt). I'm testing now to see
>> >>>>>>>>>> if an update to SWIG might have resulted in
>> this error, maybe someone on
>> >>>>>>>>>> the mailing list would know if that's
>> possible. The difference is 1.3.40
>> >>>>>>>>>> vs. 2.0.3, both of which are supported
>> according to the dependencies wiki
>> >>>>>>>>>> page.
>> >>>>>>>>>> Just for completeness, here's the error from
>> revision 8643:
>> >>>>>>>>>> build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
>> >>>>>>>>>> BaseDynInst::initVars() [with Impl =
>> O3CPUImpl]: Assertion `cpu->instcount
>> >>>>>>>>>> I have not tried running with gem5.debug, so
>> I will be doing
>> >>>>>>>>>> that today. Maybe this is an assertion that
>> is occurring due to an
>> >>>>>>>>>> optimization. That would mean it wouldn't be
>> triggered in gem5.debug since
>> >>>>>>>>>> it runs without optimizations. Have you
>> tested all debug, opt and fast
>> >>>>>>>>>> with your tests?
>> >>>>>>>>>> Thanks,
>> >>>>>>>>>> Andrew
>> >>>>>>>>>>
>> >>>>>>>>>> On Tue, Mar 13, 2012 at 1:37 PM, Rio Xiangyu
>> Dong <
>> >>>>>>>>>> ***@gmail.com
>> <mailto:***@gmail.com>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>> Hi Andrew,
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> I didn't see this error in my simulations.
>> May I ask which gem5
>> >>>>>>>>>>> version you are using? I find some of the
>> latest code updates do not comply
>> >>>>>>>>>>> with my changes. I am still using the
>> DRAMsim2 patch on Gem5 repo8643, and
>> >>>>>>>>>>> have run all the runnable benchmarks in
>> SPEC2006, SPEC2000, EEMBC2, and
>> >>>>>>>>>>> PARSEC2 on ARM_SE.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thank you!
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> Best,
>> >>>>>>>>>>>
>> >>>>>>>>>>> Xiangyu
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> *From:* Andrew Cebulski
>> [mailto:***@drexel.edu <mailto:***@drexel.edu>]
>> >>>>>>>>>>> *Sent:* Thursday, March 08, 2012 6:52 PM
>> >>>>>>>>>>>
>> >>>>>>>>>>> *To:* gem5 users mailing list
>> >>>>>>>>>>> *Cc:* ***@gmail.com
>> <mailto:***@gmail.com>; ***@umich.edu
>> <mailto:***@umich.edu>
>> >>>>>>>>>>>
>> >>>>>>>>>>> *Subject:* Re: [gem5-users] A Patch for
>> DRAMsim2 Integration
>> >>>>>>>>>>>
>> >>>>>>>>>>> Xiangyu,
>> >>>>>>>>>>>
>> >>>>>>>>>>> I've been having an issue recently with the
>> number of
>> >>>>>>>>>>> instructions I've been seeing committed to
>> the CPU (I have a separate
>> >>>>>>>>>>> thread on this). It turns out the issue
>> seems to be coming from this patch
>> >>>>>>>>>>> you created to integrate DramSim2 with Gem5.
>> Unfortunately, I've been
>> >>>>>>>>>>> running with gem5.fast, not gem5.opt. So up
>> until now, I haven't been
>> >>>>>>>>>>> seeing assertions. I thought I'd run it with
>> gem5.opt or debug back in
>> >>>>>>>>>>> December, but I must not have. My runs on
>> the Arm O3 cpu fails with this
>> >>>>>>>>>>> assertion:
>> >>>>>>>>>>>
>> >>>>>>>>>>> build/ARM/cpu/base_dyn_inst_impl.hh:149: void
>> >>>>>>>>>>> BaseDynInst::initVars() [with Impl =
>> O3CPUImpl]: Assertion `cpu->instcount
>> >>>>>>>>>>>
>> >>>>>>>>>>> -Andrew
>> >>>>>>>>>>>
>> >>>>>>>>>>> Date: Sun, 18 Dec 2011 01:48:58 -0800
>> >>>>>>>>>>> From: "Dong, Xiangyu" <***@gmail.com>
>> >>>>>>>>>>> To: "gem5 users mailing list" <gem5-***@gem5.org>
>> >>>>>>>>>>> Subject: [gem5-users] A Patch for DRAMsim2 Integration
>> >>>>>>>>>>> Message-ID: gmail.com <http://gmail.com>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> Content-Type: text/plain; charset="us-ascii"
>> >>>>>>>>>>>
>> >>>>>>>>>>> Hi all,
>> >>>>>>>>>>>
>> >>>>>>>>>>> I have a Gem5+DRAMsim2 patch. I've tested it under both SE and FS
>> >>>>>>>>>>> modes, and I'm willing to share it here.
>> >>>>>>>>>>>
>> >>>>>>>>>>> For those who need it, please go to my website,
>> >>>>>>>>>>> www.cse.psu.edu/~xydong <http://www.cse.psu.edu/%7Exydong>, to
>> >>>>>>>>>>> download the patch and test it. To enable DRAMSim2, use the
>> >>>>>>>>>>> se_dramsim2.py script instead of se.py (for FS, you can create
>> >>>>>>>>>>> an equivalent script yourself). The basic idea for enabling the
>> >>>>>>>>>>> DRAMsim2 module is to use the derived DRAMMemory class instead of
>> >>>>>>>>>>> the PhysicalMemory class.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Please let me know if there are bugs.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thank you!
>> >>>>>>>>>>>
>> >>>>>>>>>>> Best,
>> >>>>>>>>>>> Xiangyu Dong
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>> _______________________________________________
>> >>>>>>>>>> gem5-users mailing list
>> >>>>>>>>>> gem5-***@gem5.org
>> >>>>>>>>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users