Discussion: System Hangs
Ivan Stalev via gem5-users
2014-05-12 15:52:00 UTC
Hello,

I am running X86 in full-system mode. With just one CPU, both the atomic
and detailed CPU models work fine. With more than one CPU, atomic still
works fine, but in detailed mode the system appears to hang shortly after
boot-up. gem5 itself doesn't crash, but the simulated system stops
producing any output. Looking at the stats, instructions are still being
committed, yet the applications/benchmarks make no progress. The issue
persists with the latest version of gem5, and I have also tried two
different kernel versions and several different disk images.

This may be the same issue that was reported about a year ago and appears
never to have been fixed:
https://www.mail-archive.com/gem5-***@gem5.org/msg08839.html

Can anyone reproduce this, or does anyone know of a solution?

Thank you,

Ivan
Joel Hestness via gem5-users
2014-05-16 02:51:31 UTC
Permalink
Hi Ivan,
Can you please give more detail on what you're running? Specifically,
can you share your command line and say which kernel and disk image you're
using? Are you using checkpointing?

Joel


--
Joel Hestness
PhD Student, Computer Architecture
Dept. of Computer Science, University of Wisconsin - Madison
http://pages.cs.wisc.edu/~hestness/
Ivan Stalev via gem5-users
2014-05-16 04:16:41 UTC
Permalink
Hi Joel,

I have tried several different kernels and disk images, including the
default ones provided on the gem5 website in the x86-system.tar.bz2
download. I boot with this command:

build/X86/gem5.fast -d m5out/test_run configs/example/fs.py
--kernel=/home/mdl/ids103/full_system_images/binaries/x86_64-vmlinux-2.6.22.9.smp
-n 2 --mem-size=4GB --cpu-type=atomic --cpu-clock=2GHz
--script=rcs_scripts/run.rcS --max-checkpoints=1

My run.rcS script simply contains:

#!/bin/sh
/sbin/m5 resetstats    # zero the stats accumulated during boot
/sbin/m5 checkpoint    # take the checkpoint that is restored later
echo 'booted'
/extras/run            # dummy benchmark: infinite counter loop
/sbin/m5 exit

where "/extras/run" is simply a C program with an infinite loop that prints
a counter.

I then restore:

build/X86/gem5.fast -d m5out/test_run configs/example/fs.py
--kernel=/home/mdl/ids103/full_system_images/binaries/x86_64-vmlinux-2.6.22.9.smp
-r 1 -n 2 --mem-size=4GB --cpu-type=detailed --cpu-clock=2GHz --caches
--l2cache --num-l2caches=1 --l1d_size=32kB --l1i_size=32kB --l1d_assoc=4
--l1i_assoc=4 --l2_size=4MB --l2_assoc=8 --cacheline_size=64

I specified the disk image file in Benchmarks.py. Restoring from the same
checkpoint and running in atomic mode works fine. I also tried booting the
system in detailed mode and letting it run for a while, but once it boots
there is no more output, so checkpointing does not seem to be the issue.
The "run" program is just a dummy; the same problem also occurs when
running SPEC benchmarks or any other program.

My dummy program is simply (the complete file):

#include <stdio.h>

int main(void) {
    int count = 0;
    printf("**************************** HEYY \n");
    while (1)
        printf("\n %d \n", count++);
}

Letting it run for a while, the only output is exactly this:

booted
*******

It doesn't even finish printing the first printf.

One more data point: in a separate experiment, I modified the kernel to
issue an m5 pseudo-instruction on every context switch, so that gem5 prints
a message whenever a context switch occurs. Once again, this worked as
expected in atomic mode. In detailed mode, however, even gem5's own output
(a printf inside gem5 itself) stopped along with the system output in the
terminal.
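
The hook itself was just gem5's magic x86 opcode emitted inline. Roughly,
it looked like the sketch below; the helper name, the call site, and the
0x55 function number are illustrative (I registered my own pseudo-op number
with a printf on the gem5 side), and the 0x0F 0x04 encoding with a 16-bit
function number follows util/m5/m5op_x86.S as I understand it:

/* called from the scheduler's context_switch() -- illustrative sketch */
static inline void m5_note_ctx_switch(void)
{
        /* gem5 x86 pseudo-op: bytes 0x0F 0x04 followed by a 16-bit
           function number; 0x55 stands in for my custom op (assumption) */
        asm volatile (".byte 0x0f, 0x04; .word 0x55" ::: "memory");
}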

Thank you for your help.

Ivan


Joel Hestness via gem5-users
2014-05-16 15:18:05 UTC
Hi Ivan,
I believe the email thread you referenced was related to a bug that we
identified and fixed in changeset 9624
<http://permalink.gmane.org/gmane.comp.emulators.m5.devel/19326>.
That bug was causing interrupts to be dropped on x86 when running with the
O3 CPU. Are you using a version of gem5 from after that changeset? If not,
I'd recommend updating to a more recent version and trying to reproduce
this issue again.
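
For reference, updating is just a Mercurial pull plus a rebuild, along
these lines (a sketch; the repo.gem5.org URL is from memory):

hg pull -u http://repo.gem5.org/gem5    # in your existing gem5 checkout
scons build/X86/gem5.opt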

If you are using a more recent version of gem5, first, please let us know
which changeset, and whether you've made any changes. Then I'd recommend
compiling gem5.opt and using the DPRINTF tracing functionality to see if
you can zero in on what is happening. Specifically, first try passing
--debug-flags=Exec to look at what the CPU cores are executing (you may
also want to pass --trace-start=<tick> with a simulator tick time close
to when the hang happens). This trace will include Linux kernel symbols
for at least some of the lines when executing in the kernel (e.g. while
handling an interrupt). If you've compiled your benchmark without
debugging symbols, the trace may only show the memory addresses of the
instructions executed within the application. My guess is that you'll see
kernel symbols for at least some of the executed instructions, namely
those handling interrupts.

If it appears that the CPUs are continuing to execute, try running with
--debug-flags=LocalApic. This will print the interrupts that each core
receives; if the printing stops at any point, something has gone wrong and
we'd have to do some deeper digging.
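
For example, applied to your restore command, that might look like the
following (a sketch; substitute a real tick value for <tick> and keep the
rest of your fs.py options unchanged):

build/X86/gem5.opt -d m5out/test_run --debug-flags=Exec --trace-start=<tick> configs/example/fs.py -r 1 -n 2 ...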

Keep us posted on what you find,
Joel



--
Joel Hestness
PhD Student, Computer Architecture
Dept. of Computer Science, University of Wisconsin - Madison
http://pages.cs.wisc.edu/~hestness/
Ivan Stalev via gem5-users
2014-05-17 05:50:49 UTC
Hi Joel,

I am using revision 10124. I removed all of my own modifications just to be
safe.

Running with gem5.opt and restoring from a boot-up checkpoint with
--debug-flags=Exec, it appears that the CPUs are stuck in some sort of
infinite loop, executing this continuously:

5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.0 :
CMP_M_I : limm t2d, 0 : IntAlu : D=0x0000000000000000
5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.1 :
CMP_M_I : ld t1d, DS:[rdi] : MemRead : D=0x00000000fffffffe
A=0xffffffff80822400
5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.2 :
CMP_M_I : sub t0d, t1d, t2d : IntAlu : D=0x0000000000000000
5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.0 : JLE_I
: rdip t1, %ctrl153, : IntAlu : D=0xffffffff80596897
5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.1 : JLE_I
: limm t2, 0xfffffffffffffff9 : IntAlu : D=0xfffffffffffffff9
5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.2 : JLE_I
: wrip , t1, t2 : IntAlu :
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+16 : NOP
: IntAlu :
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.0 :
CMP_M_I : limm t2d, 0 : IntAlu : D=0x0000000000000000
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.1 :
CMP_M_I : ld t1d, DS:[rdi] : MemRead : D=0x00000000fffffffe
A=0xffffffff80822400
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.2 :
CMP_M_I : sub t0d, t1d, t2d : IntAlu : D=0x0000000000000000
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.0 : JLE_I
: rdip t1, %ctrl153, : IntAlu : D=0xffffffff80596897
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.1 : JLE_I
: limm t2, 0xfffffffffffffff9 : IntAlu : D=0xfffffffffffffff9
5268959012000: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.2 : JLE_I
: wrip , t1, t2 : IntAlu :
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+16 : NOP
: IntAlu :
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.0 :
CMP_M_I : limm t2d, 0 : IntAlu : D=0x0000000000000000
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.1 :
CMP_M_I : ld t1d, DS:[rdi] : MemRead : D=0x00000000fffffffe
A=0xffffffff80822400
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.2 :
CMP_M_I : sub t0d, t1d, t2d : IntAlu : D=0x0000000000000000
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.0 : JLE_I
: rdip t1, %ctrl153, : IntAlu : D=0xffffffff80596897
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.1 : JLE_I
: limm t2, 0xfffffffffffffff9 : IntAlu : D=0xfffffffffffffff9
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.2 : JLE_I
: wrip , t1, t2 : IntAlu :
5268959013000: system.switch_cpus1 T0 : @_spin_lock_irqsave+16 : NOP
: IntAlu :

....and so on repetitively without stopping.
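
If I read the microops right, both cores are sitting in the spin-wait loop
of _spin_lock_irqsave: they repeatedly load the lock word at
0xffffffff80822400 (its value stays 0xfffffffe, i.e. -2), compare it with
0, and branch back while it is <= 0. In rough C, the trace corresponds to
something like this sketch (my reading, not actual kernel source):

    /* spin while the lock word is <= 0, i.e. held; cpu_relax() is
       presumably the "rep; nop" pause behind the NOP microop at +16 */
    while (*(volatile int *)lock_addr <= 0)
        cpu_relax();

so neither core ever sees the lock become free again.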

Using --debug-flags=LocalApic, the output does indeed stop shortly after
restoring from the checkpoint. The last output is:
..
5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
5269570990500: system.cpu1.interrupts: Generated regular interrupt fault
object.
5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
5269570990500: system.cpu1.interrupts: Interrupt 239 sent to core.
5269571169000: system.cpu1.interrupts: Writing Local APIC register 5 at
offset 0xb0 as 0.

...and no more output from this point on.

I appreciate your help tremendously.

Ivan



Ivan Stalev via gem5-users
2014-05-25 00:27:41 UTC
Hi,

Has anyone been able to reproduce this issue?

Thanks,

Ivan


Joel Hestness via gem5-users
2014-06-03 23:53:23 UTC
Hi Ivan,
Sorry for the delay on this.

I haven't had an opportunity to try to reproduce your problem yet, though
the traces you've supplied here help a bit. Specifically, the stalled
LocalApics (plural, because there are two CPU cores) are fishy, because
we'd expect periodic interrupts to continue. However, the last interrupt
on CPU 1 appears to get cleared, which looks fine. The CPU spin lock is
normal for threads that don't have any work to complete, but it's puzzling
why they wouldn't be doing something.

The next thing to dig into would be to figure out what the CPUs were doing
last, just before they entered the spin loop. For this we may need to
trace a bit earlier in time using the Exec flag, and since it's likely
that messages/responses are getting lost in the memory hierarchy or the
devices, we'll also need the ProtocolTrace flag to see what is being
communicated. You could try playing around with these as a start.
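
For example, on top of your restore command (a sketch; substitute a real
tick for <tick>):

build/X86/gem5.opt -d m5out/test_run --debug-flags=Exec,ProtocolTrace --trace-start=<tick> configs/example/fs.py -r 1 -n 2 ...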

I may also have time to try to reproduce this over the next week, so I'm
hoping you can give me some more information: can you send me your gem5
command line, config.ini, and system.pc.com_1.terminal output from your
simulation, along with details on the kernel and disk image that you're
using?


Thanks!
Joel




--
Joel Hestness
PhD Student, Computer Architecture
Dept. of Computer Science, University of Wisconsin - Madison
http://pages.cs.wisc.edu/~hestness/
Ivan Stalev via gem5-users
2014-06-05 23:14:34 UTC
Hi Joel,

Thanks for getting back to me.

I ran it again with the ProtocolTrace flag, and the only output there is:
0: rtc: Real-time clock set to Sun Jan 1 00:00:00 2012
(perhaps because ProtocolTrace is a Ruby-side flag and I'm using the
classic memory system?)

With the Exec flag, I do see spin-lock output on and off in the beginning,
during regular execution, so that is normal as you said. But once the
"problem" occurs shortly afterwards, the Exec output is just the
continuous spin lock forever, as I posted previously.

The exact gem5 command lines I use are posted in my previous message. The
kernel and disk image are simply the default ones from the gem5 downloads
page: http://www.m5sim.org/dist/current/x86/x86-system.tar.bz2

I have attached a zip file containing the following files:

BOOT-config.ini - The config.ini from the first run, i.e. booting in atomic
mode, creating a checkpoint, and exiting.
BOOT-system.pc.com_1.terminal - The terminal output from the first run
CPT-config.ini - The config.ini when restoring from the checkpoint in
detailed mode
CPT-system.pc.com_1.terminal - The system output after restoring from the
checkpoint
run.c - The dummy program started by the run script
run.rcS - The run script
flag-exec-partial.out - The Exec-flag output from just before the
"problem" occurs; the infinite spin lock starts at tick 5268700121500

Again, this problem occurs even without checkpointing, and I have also
tried a few different kernels and disk images. I did the same test with
both Alpha and ARM64 and it works there, so it appears to be an
x86-specific issue.

Thank you,

Ivan



Castillo Villar, Emilio via gem5-users
2014-06-06 15:49:06 UTC
Hello,

I have seen similar issues when running x86 timing and detailed CPUs with
the classic memory system, mostly because x86 atomic (locked) memory
accesses are not implemented there. The stdout freezes, but instructions
are still being committed.

If you want to run timing or detailed CPUs in x86 full-system mode with
multiple cores, I am afraid you will need to use Ruby.
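
For example, something along these lines (a sketch; the protocol name and
the --ruby option depend on your revision, and older trees use the
separate configs/example/ruby_fs.py instead):

scons build/X86/gem5.opt PROTOCOL=MESI_CMP_directory
build/X86/gem5.opt configs/example/fs.py --ruby -n 2 ...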

Emilio
________________________________
From: gem5-users [gem5-users-***@gem5.org] on behalf of Ivan Stalev via gem5-users [gem5-***@gem5.org]
Sent: Friday, June 6, 2014 1:14
To: Joel Hestness
CC: gem5 users mailing list
Subject: Re: [gem5-users] System Hangs

Hi Joel,

Thanks for getting back to me.

I ran it again with the ProtocolTrace flag and the only output there is: 0: rtc: Real-time clock set to Sun Jan 1 00:00:00 2012

With the Exec flag, I do see spinlock output on and off in the beginning during regular execution, so that is normal as you said. But once the "problem" occurs shortly after, the Exec flag output is just continuous spinlock forever as I posted previously.

The exact gem5 command lines I use are posted in my previous post. The kernel and disk image are simply the default ones from the GEM5 downloads page: http://www.m5sim.org/dist/current/x86/x86-system.tar.bz2

I have attached a zip file containing the following files:

BOOT-config.ini - The config.ini from the first run, i.e. booting in atomic mode, creating a checkpoint, and exiting.
BOOT-system.pc.com_1.terminal - The terminal output from the first run
CPT-config.ini - The config.ini when restoring from the checkpoint in detailed mode
CPT-system.pc.com_1.terminal - The system output after restoring from the checkpoint
run.c - The dummy program started by the run script
run.rcS - The run script
flag-exec-partial.out - The output from the Exec flag, right before the "problem" occurs. The infinite spinlock starts at tick 5268700121500

Again, this problem occurs even without checkpointing. I have also tried a few different kernels and disk images. I did the same test with both alpha and arm64 and it works, so it appears to just be an issue with x86.

Thank you,

Ivan



On Tue, Jun 3, 2014 at 7:53 PM, Joel Hestness <***@gmail.com<mailto:***@gmail.com>> wrote:
Hi Ivan,
Sorry for the delay on this.

I haven't had an opportunity to try to reproduce your problem, though the traces you've supplied here can help a bit. Specifically, the stalled LocalApics (plural, because there are 2 CPU cores) are fishy, because we'd expect periodic interrupts to continue. However, the last interrupt on CPU 1 appears to get cleared, which looks fine. The CPU spin lock is normal for threads that don't have any work to complete, but it's confusing why they wouldn't be doing something.

The next thing to dig into would be to figure out what the CPUs were doing last before they entered the spin loop. For this we may need to trace a bit earlier in time using the Exec flags, and since it is likely that messages/responses may be getting lost in the memory hierarchy or devices, we'll need to use the ProtocolTrace flag to see what is being communicated. You could try playing around with these as a start.
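
As a concrete starting point, that could look something like this -- the -d directory here is illustrative, the bracketed pieces are placeholders for your own values, and everything else should come from your earlier restore command:

build/X86/gem5.opt -d m5out/debug_run --debug-flag=Exec,ProtocolTrace --trace-start=[tick shortly before the hang] configs/example/fs.py [your usual restore options]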

I may also have time to try to reproduce this over the next week, so I'm hoping you could give me some more information: can you send me your gem5 command line, config.ini, and system.pc.com_1.terminal output from your simulation, and details on the kernel and disk image that you're trying to use?


Thanks!
Joel




On Sat, May 24, 2014 at 7:27 PM, Ivan Stalev <***@psu.edu<mailto:***@psu.edu>> wrote:
Hi,

Has anyone been able to reproduce this issue?

Thanks,

Ivan


On Sat, May 17, 2014 at 1:50 AM, Ivan Stalev <***@psu.edu<mailto:***@psu.edu>> wrote:
Hi Joel,

I am using revision 10124. I removed all of my own modifications just to be safe.

Running with gem5.opt and restoring from a boot-up checkpoint with --debug-flag=Exec, it appears that the CPU is stuck in some sort of infinite loop, executing this continuously:

5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.0 : CMP_M_I : limm t2d, 0 : IntAlu : D=0x0000000000000000
5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.1 : CMP_M_I : ld t1d, DS:[rdi] : MemRead : D=0x00000000fffffffe A=0xffffffff80822400
5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.2 : CMP_M_I : sub t0d, t1d, t2d : IntAlu : D=0x0000000000000000
5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.0 : JLE_I : rdip t1, %ctrl153, : IntAlu : D=0xffffffff80596897
5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.1 : JLE_I : limm t2, 0xfffffffffffffff9 : IntAlu : D=0xfffffffffffffff9
5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.2 : JLE_I : wrip , t1, t2 : IntAlu :
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+16 : NOP : IntAlu :
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.0 : CMP_M_I : limm t2d, 0 : IntAlu : D=0x0000000000000000
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.1 : CMP_M_I : ld t1d, DS:[rdi] : MemRead : D=0x00000000fffffffe A=0xffffffff80822400
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.2 : CMP_M_I : sub t0d, t1d, t2d : IntAlu : D=0x0000000000000000
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.0 : JLE_I : rdip t1, %ctrl153, : IntAlu : D=0xffffffff80596897
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.1 : JLE_I : limm t2, 0xfffffffffffffff9 : IntAlu : D=0xfffffffffffffff9
5268959012000: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.2 : JLE_I : wrip , t1, t2 : IntAlu :
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+16 : NOP : IntAlu :
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.0 : CMP_M_I : limm t2d, 0 : IntAlu : D=0x0000000000000000
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.1 : CMP_M_I : ld t1d, DS:[rdi] : MemRead : D=0x00000000fffffffe A=0xffffffff80822400
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.2 : CMP_M_I : sub t0d, t1d, t2d : IntAlu : D=0x0000000000000000
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.0 : JLE_I : rdip t1, %ctrl153, : IntAlu : D=0xffffffff80596897
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.1 : JLE_I : limm t2, 0xfffffffffffffff9 : IntAlu : D=0xfffffffffffffff9
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.2 : JLE_I : wrip , t1, t2 : IntAlu :
5268959013000: system.switch_cpus1 T0 : @_spin_lock_irqsave+16 : NOP : IntAlu :

....and so on repetitively without stopping.
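
Reading that trace: each iteration re-reads the lock word at 0xffffffff80822400 and gets 0xfffffffe, which is negative as a signed 32-bit value, so the JLE_I at +21 is taken and control jumps back to +16. In C terms, the repeating pattern amounts to roughly the sketch below (my reading of the trace, not the 2.6.22 kernel source):

/* Roughly what each repeating iteration of the trace is doing. */
static void spin_wait(volatile int *lock_word)
{
    while (*lock_word <= 0)
        ;  /* NOP; CMP [lock], 0; JLE back to the top */
}

In other words, both cores appear to be spinning on a lock that is never released.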

Using --debug-flag=LocalApic, the output does indeed stop shortly after restoring from the checkpoint. The last output is:
..
5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
5269570990500: system.cpu1.interrupts: Generated regular interrupt fault object.
5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
5269570990500: system.cpu1.interrupts: Interrupt 239 sent to core.
5269571169000: system.cpu1.interrupts: Writing Local APIC register 5 at offset 0xb0 as 0.

...and no more output from this point on.

I appreciate your help tremendously.

Ivan



On Fri, May 16, 2014 at 11:18 AM, Joel Hestness <***@gmail.com<mailto:***@gmail.com>> wrote:
Hi Ivan,
I believe that the email thread you previously referenced was related to a bug that we identified and fixed with changeset 9624<http://permalink.gmane.org/gmane.comp.emulators.m5.devel/19326>. That bug was causing interrupts to be dropped in x86 when running with the O3 CPU. Are you using a version of gem5 after that changeset? If not, I'd recommend updating to a more recent version and trying to replicate this issue again.

If you are using a more recent version of gem5, first, please let us know which changeset and whether you've made any changes. Then, I'd recommend compiling gem5.opt and using the DPRINTF tracing functionality to see if you can zero in on what is happening. Specifically, first try passing the flag --debug-flag=Exec to look at what the CPU cores are executing (you may also want to pass --trace-start=<<tick>> with a simulator tick time close to when the hang happens). This trace will include Linux kernel symbols for at least some of the lines if executing in the kernel (e.g. handling an interrupt). If you've compiled your benchmark without debugging symbols, it may just show the memory addresses of instructions being executed within the application. I will guess that you'll see kernel symbols for at least some of the executed instructions for interrupts.

If it appears that the CPUs are continuing to execute, try running with --debug-flag=LocalApic. This will print the interrupts that each core is receiving, and if it stops printing at any point, it means something has gone wrong and we'd have to do some deeper digging.
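
Concretely, the two runs would look something like the following -- the -d directory is illustrative, the bracketed pieces are placeholders for your own values, and the fs.py options should stay the same as in your restore command:

build/X86/gem5.opt -d m5out/debug_run --debug-flag=Exec --trace-start=[tick near the hang] configs/example/fs.py [your usual restore options]
build/X86/gem5.opt -d m5out/debug_run --debug-flag=LocalApic configs/example/fs.py [your usual restore options]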

Keep us posted on what you find,
Joel

Joel Hestness via gem5-users
2014-06-08 21:27:17 UTC
Permalink
Hi guys,
I've been able to reproduce Ivan's issue in the latest gem5 (rev. 10235).
It seems this may be the same bug as a report that I filed about a year ago
<http://flyspray.gem5.org/index.php?do=details&task_id=7>. Previously, I
had overlooked that Ivan's tests were using the classic memory system (and
frankly, I had forgotten I had submitted that bug report). I'll second
Emilio and recommend using Ruby for now.

For anyone's future reference: Contrary to Emilio's statement, I'm not
clear that simulations necessarily make progress after terminal output
interrupts are lost with the classic memory model. It is possible that
unimplemented x86 atomics in the classic memory hierarchy could be the
problem, and if so, many other problems besides hung terminal output could
arise.

Joel

--
Joel Hestness
PhD Student, Computer Architecture
Dept. of Computer Science, University of Wisconsin - Madison
http://pages.cs.wisc.edu/~hestness/
Castillo Villar, Emilio via gem5-users
2014-06-08 22:44:20 UTC
Permalink
Hello,

Mr. Hestness is right, simulations do not make progress once the output has hung.
The CPUs keep executing some code and committing instructions, but in almost all cases they are spinning on locks.
So although the CPUs are actually executing, the simulation is completely frozen. That's what I meant by "instructions are still being committed", sorry for not being clear enough.

Just running in FS with the classic memory system and doing a "cat /proc/cpuinfo" will crash the system.

Kind regards.

Ivan Stalev via gem5-users
2014-06-09 05:29:05 UTC
Permalink
Hi,

Thank you for verifying and reproducing the bug. I am now attempting to run
with Ruby as you suggested; however, I am getting a seg fault during
boot-up. I compile like this:

scons build/X86/gem5.fast -j 12 PROTOCOL=MOESI_hammer

and then run/boot-up like this:

build/X86/gem5.fast -d m5out/test_run configs/example/fs.py
--kernel=/home/mdl/ids103/full_system_images/binaries/x86_64-vmlinux-2.6.22.9.smp
--ruby -n 2 --mem-size=4GB --cpu-type=detailed --cpu-clock=2GHz
--script=rcs_scripts/run.rcS --caches --l2cache --num-l2caches=1
--l1d_size=32kB --l1i_size=32kB --l1d_assoc=4 --l1i_assoc=4 --l2_size=4MB
--l2_assoc=8 --cacheline_size=64 --max-checkpoints=1

The last line in system.pc.com_1.terminal is "Kernel command line:
earlyprintk=ttyS0 console=ttyS0 lpj=7999923 root=/dev/hda1". Comparing with
the system.pc.com_1.terminal of a successful boot-up, the next line should
be "Initializing CPU#0".

Reading through the forums, there seem to have been some previous issues with Ruby X86 FS, with specific protocols, and with checkpointing (i.e. checkpointing using atomic and restoring with detailed). Can you suggest what a working configuration would be (i.e. which protocol, and whether checkpointing works)? Essentially, I need to run the same setup as the one I tried with the classic memory model, but with minimum simulation-time overhead from using Ruby.

Thanks,

Ivan
>>>>>>>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Joel Hestness
>>>>>>>> PhD Student, Computer Architecture
>>>>>>>> Dept. of Computer Science, University of Wisconsin - Madison
>>>>>>>> http://pages.cs.wisc.edu/~hestness/
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Joel Hestness
>>>>>> PhD Student, Computer Architecture
>>>>>> Dept. of Computer Science, University of Wisconsin - Madison
>>>>>> http://pages.cs.wisc.edu/~hestness/
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Joel Hestness
>>> PhD Student, Computer Architecture
>>> Dept. of Computer Science, University of Wisconsin - Madison
>>> http://pages.cs.wisc.edu/~hestness/
>>>
>>
>>
>
>
> --
> Joel Hestness
> PhD Student, Computer Architecture
> Dept. of Computer Science, University of Wisconsin - Madison
> http://pages.cs.wisc.edu/~hestness/
>
Andreas Hansson via gem5-users
2014-06-09 07:49:22 UTC
Permalink
Hi all,

It would be valuable if someone actually dug in and fixed the issue so that the non-Ruby memory system works for X86. I would imagine that the speed/fidelity trade-off of the classic memory system appeals to a wider audience. If not, there is always ARM :-).

Andreas

From: Ivan Stalev via gem5-users <gem5-***@gem5.org>
Reply-To: Ivan Stalev <***@psu.edu>, gem5 users mailing list <gem5-***@gem5.org>
Date: Monday, 9 June 2014 06:29
To: "Castillo Villar, Emilio" <***@unican.es>
Cc: gem5 users mailing list <gem5-***@gem5.org>
Subject: Re: [gem5-users] System Hangs

Hi,

Thank you for verifying and reproducing the bug. I am now attempting to run with Ruby as you suggested; however, I am getting a seg fault during boot-up. I compile like this:

scons build/X86/gem5.fast -j 12 PROTOCOL=MOESI_hammer

and then run/boot up like this:

build/X86/gem5.fast -d m5out/test_run configs/example/fs.py --kernel=/home/mdl/ids103/full_system_images/binaries/x86_64-vmlinux-2.6.22.9.smp --ruby -n 2 --mem-size=4GB --cpu-type=detailed --cpu-clock=2GHz --script=rcs_scripts/run.rcS --caches --l2cache --num-l2caches=1 --l1d_size=32kB --l1i_size=32kB --l1d_assoc=4 --l1i_assoc=4 --l2_size=4MB --l2_assoc=8 --cacheline_size=64 --max-checkpoints=1

The last line in system.pc.com_1.terminal is "Kernel command line: earlyprintk=ttyS0 console=ttyS0 lpj=7999923 root=/dev/hda1". Comparing with the system.pc.com_1.terminal of a successful boot-up, the next line should be "Initializing CPU#0".

Reading through the forums, there seem to have been some previous issues with Ruby X86 FS, specific protocols, and checkpointing (i.e. checkpointing using atomic and restoring with detailed). Can you guys suggest what a working configuration would be (i.e. which protocol, and whether to checkpoint)? Essentially, I need to run the same setup as the one I tried with the classic model, but with minimum overhead in terms of simulation time due to using Ruby.

Thanks,

Ivan


On Sun, Jun 8, 2014 at 6:44 PM, Castillo Villar, Emilio <***@unican.es> wrote:
Hello,

Mr. Hestness is right: simulations do not make progress once the output has hung. The CPUs keep executing some code and committing instructions, but in almost all cases they are spinning on locks. So although the CPUs are actually executing, the simulation is completely frozen. That's what I meant by "instructions are still being committed"; sorry for not being clear enough.

Just running in FS with the classic memory system and doing a "cat /proc/cpuinfo" will crash the system.

Kind regards.


________________________________
From: Joel Hestness [***@gmail.com]
Sent: Sunday, 8 June 2014 23:27
To: Castillo Villar, Emilio
CC: Ivan Stalev; gem5 users mailing list
Subject: Re: [gem5-users] System Hangs

Hi guys,
I've been able to reproduce Ivan's issue in the latest gem5 (rev. 10235). It seems this may be the same bug as a report that I filed about a year ago <http://flyspray.gem5.org/index.php?do=details&task_id=7>. Previously, I had overlooked that Ivan's tests were using the classic memory system (and frankly, I had forgotten I had submitted that bug report). I'll second Emilio and recommend using Ruby for now.

For anyone's future reference: contrary to Emilio's statement, I'm not clear that simulations necessarily make progress after terminal output interrupts are lost with the classic memory model. It is possible that unimplemented x86 atomics in the classic memory hierarchy could be the problem, and if so, many other problems besides hung terminal output could arise.

Joel
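
To make the hypothesis above concrete: a spinlock only works if the read-modify-write on the lock word is atomic across cores. The sketch below is plain C11 with pthreads, nothing gem5-specific, and all names are illustrative. The atomic counter always ends at exactly 2 * ITERS, while the plain counter typically loses updates; a lock word updated with a lossy RMW can likewise be "taken" by two cores at once.

/* Illustrative only: why lost atomic RMWs break spinlocks.
 * Build with: cc -std=c11 -pthread atomics_demo.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define ITERS 1000000

static atomic_int atomic_ctr;   /* updated with an atomic RMW */
static volatile int plain_ctr;  /* updated with a plain (racy) RMW */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        atomic_fetch_add(&atomic_ctr, 1); /* never loses an update */
        plain_ctr = plain_ctr + 1;        /* two cores can both read the
                                             old value and each write
                                             back old+1: lost update */
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("atomic=%d plain=%d (expected %d)\n",
           atomic_load(&atomic_ctr), plain_ctr, 2 * ITERS);
    return 0;
}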



On Fri, Jun 6, 2014 at 10:49 AM, Castillo Villar, Emilio <***@unican.es> wrote:
Hello,

I have seen similar issues when running X86 timing and detailed CPUs with the classic memory system, mostly due to X86 atomic memory accesses not being implemented. The stdout freezes but instructions are still being committed.

If you want to run with timing or detailed CPUs in X86 & FS & multi-core, I am afraid you will need to use Ruby.

Emilio
________________________________
From: gem5-users [gem5-users-***@gem5.org] on behalf of Ivan Stalev via gem5-users [gem5-***@gem5.org]
Sent: Friday, 6 June 2014 1:14
To: Joel Hestness
CC: gem5 users mailing list
Subject: Re: [gem5-users] System Hangs

Hi Joel,

Thanks for getting back to me.

I ran it again with the ProtocolTrace flag, and the only output there is: 0: rtc: Real-time clock set to Sun Jan 1 00:00:00 2012

With the Exec flag, I do see spinlock output on and off in the beginning during regular execution, so that is normal as you said. But once the "problem" occurs shortly after, the Exec flag output is just continuous spinlock forever, as I posted previously.

The exact gem5 command lines I use are posted in my previous post. The kernel and disk image are simply the default ones from the GEM5 downloads page: http://www.m5sim.org/dist/current/x86/x86-system.tar.bz2

I have attached a zip file containing the following files:

BOOT-config.ini - The config.ini from the first run, i.e. booting in atomic mode, creating a checkpoint, and exiting.
BOOT-system.pc.com_1.terminal - The terminal output from the first run
CPT-config.ini - The config.ini when restoring from the checkpoint in detailed mode
CPT-system.pc.com_1.terminal - The system output after restoring from the checkpoint
run.c - The dummy program started by the run script
run.rcS - The run script
flag-exec-partial.out - The output from the Exec flag, right before the "problem" occurs. The infinite spinlock starts at tick 5268700121500.

Again, this problem occurs even without checkpointing. I have also tried a few different kernels and disk images. I did the same test with both alpha and arm64 and it works, so it appears to be an issue with x86 only.

Thank you,

Ivan



On Tue, Jun 3, 2014 at 7:53 PM, Joel Hestness <***@gmail.com> wrote:
Hi Ivan,
Sorry for the delay on this.

I haven't had an opportunity to try to reproduce your problem, though the traces you've supplied here can help a bit. Specifically, the stalled LocalApics (plural, because there are 2 CPU cores) are fishy, because we'd expect periodic interrupts to continue. However, the last interrupt on CPU 1 appears to get cleared, which looks fine. The CPU spin lock is normal for threads that don't have any work to complete, but it's confusing why they wouldn't be doing something.

The next thing to dig into would be to figure out what the CPUs were doing last before they entered the spin loop. For this we may need to trace a bit earlier in time using the Exec flags, and since it is likely that messages/responses may be getting lost in the memory hierarchy or devices, we'll need to use the ProtocolTrace flag to see what is being communicated. You could try playing around with these as a start.

I may also have time to try to reproduce this over the next week, so I'm hoping you could give me some more information: can you send me your gem5 command line, config.ini, and system.pc.com_1.terminal output from your simulation, and details on the kernel and disk image that you're trying to use?

Thanks!
Joel




On Sat, May 24, 2014 at 7:27 PM, Ivan Stalev <***@psu.edu> wrote:
Hi,

Has anyone been able to reproduce this issue?

Thanks,

Ivan

On Sat, May 17, 2014 at 1:50 AM, Ivan Stalev <***@psu.edu> wrote:
Hi Joel,

I am using revision 10124. I removed all of my own modifications just to be safe.

Running with gem5.opt and restoring from a boot-up checkpoint with --debug-flag=Exec, it appears that the CPU is stuck in some sort of infinite loop, executing this continuously:

5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.0 : CMP_M_I : limm t2d, 0 : IntAlu : D=0x0000000000000000
5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.1 : CMP_M_I : ld t1d, DS:[rdi] : MemRead : D=0x00000000fffffffe A=0xffffffff80822400
5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.2 : CMP_M_I : sub t0d, t1d, t2d : IntAlu : D=0x0000000000000000
5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.0 : JLE_I : rdip t1, %ctrl153, : IntAlu : D=0xffffffff80596897
5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.1 : JLE_I : limm t2, 0xfffffffffffffff9 : IntAlu : D=0xfffffffffffffff9
5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.2 : JLE_I : wrip , t1, t2 : IntAlu :
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+16 : NOP : IntAlu :
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.0 : CMP_M_I : limm t2d, 0 : IntAlu : D=0x0000000000000000
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.1 : CMP_M_I : ld t1d, DS:[rdi] : MemRead : D=0x00000000fffffffe A=0xffffffff80822400
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.2 : CMP_M_I : sub t0d, t1d, t2d : IntAlu : D=0x0000000000000000
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.0 : JLE_I : rdip t1, %ctrl153, : IntAlu : D=0xffffffff80596897
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.1 : JLE_I : limm t2, 0xfffffffffffffff9 : IntAlu : D=0xfffffffffffffff9
5268959012000: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.2 : JLE_I : wrip , t1, t2 : IntAlu :
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+16 : NOP : IntAlu :
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.0 : CMP_M_I : limm t2d, 0 : IntAlu : D=0x0000000000000000
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.1 : CMP_M_I : ld t1d, DS:[rdi] : MemRead : D=0x00000000fffffffe A=0xffffffff80822400
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.2 : CMP_M_I : sub t0d, t1d, t2d : IntAlu : D=0x0000000000000000
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.0 : JLE_I : rdip t1, %ctrl153, : IntAlu : D=0xffffffff80596897
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.1 : JLE_I : limm t2, 0xfffffffffffffff9 : IntAlu : D=0xfffffffffffffff9
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.2 : JLE_I : wrip , t1, t2 : IntAlu :
5268959013000: system.switch_cpus1 T0 : @_spin_lock_irqsave+16 : NOP : IntAlu :

....and so on repetitively without stopping.
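
Decoded, that trace is a textbook spin-wait: each iteration reloads the lock word at the address in rdi, compares it against zero, and jumps back while the value is non-positive. A rough C rendering of the loop follows (illustrative only, not the kernel source; a bail-out is added so the demo terminates, whereas the simulated cores never see the value change):

#include <stdio.h>

/* Roughly what the traced micro-ops implement:
 * ld t1d, DS:[rdi]; sub t0d, t1d, 0; JLE back to the NOP. */
static long spin_wait(volatile int *lock_word)
{
    long spins = 0;
    while (*lock_word <= 0) {  /* in the trace the word stays at     */
        spins++;               /* 0xfffffffe (-2), so this condition */
        if (spins == 5)        /* never clears; bail out for demo    */
            return -1;
    }
    return spins;
}

int main(void)
{
    volatile int stuck_lock = -2;  /* 0xfffffffe, the value in the trace */
    printf("spin_wait returned %ld\n", spin_wait(&stuck_lock));
    return 0;
}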

Using --debug-flag=LocalApic, the output does indeed stop shortly after restoring from the checkpoint. The last output is:
..
5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
5269570990500: system.cpu1.interrupts: Generated regular interrupt fault object.
5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
5269570990500: system.cpu1.interrupts: Interrupt 239 sent to core.
5269571169000: system.cpu1.interrupts: Writing Local APIC register 5 at offset 0xb0 as 0.

...and no more output from this point on.

I appreciate your help tremendously.

Ivan



On Fri, May 16, 2014 at 11:18 AM, Joel Hestness <***@gmail.com> wrote:
Hi Ivan,
I believe that the email thread you previously referenced was related to a bug that we identified and fixed with changeset 9624 <http://permalink.gmane.org/gmane.comp.emulators.m5.devel/19326>. That bug was causing interrupts to be dropped in x86 when running with the O3 CPU. Are you using a version of gem5 after that changeset? If not, I'd recommend updating to a more recent version and trying to replicate this issue again.

If you are using a more recent version of gem5, first, please let us know which changeset and whether you've made any changes. Then, I'd recommend compiling gem5.opt and using the DPRINTF tracing functionality to see if you can zero in on what is happening. Specifically, first try passing the flag --debug-flag=Exec to look at what the CPU cores are executing (you may also want to pass --trace-start=<<tick>> with a simulator tick time close to when the hang happens). This trace will include Linux kernel symbols for at least some of the lines if executing in the kernel (e.g. handling an interrupt). If you've compiled your benchmark without debugging symbols, it may just show the memory addresses of instructions being executed within the application. I will guess that you'll see kernel symbols for at least some of the executed instructions for interrupts.

If it appears that the CPUs are continuing to execute, try running with --debug-flag=LocalApic. This will print the interrupts that each core is receiving, and if it stops printing at any point, it means something has gone wrong and we'd have to do some deeper digging.

Keep us posted on what you find,
Joel



On Thu, May 15, 2014 at 11:16 PM, Ivan Stalev <***@psu.edu> wrote:
Hi Joel,

I have tried several different kernels and disk images, including the default ones provided on the GEM5 website in the x86-system.tar.bz2 download. I run with these commands:

build/X86/gem5.fast -d m5out/test_run configs/example/fs.py --kernel=/home/mdl/ids103/full_system_images/binaries/x86_64-vmlinux-2.6.22.9.smp -n 2 --mem-size=4GB --cpu-type=atomic --cpu-clock=2GHz --script=rcs_scripts/run.rcS --max-checkpoints=1

My run.rcS script simply contains:

#!/bin/sh
/sbin/m5 resetstats
/sbin/m5 checkpoint
echo 'booted'
/extras/run
/sbin/m5 exit

where "/extras/run" is simply a C program with an infinite loop that prints a counter.

I then restore:

build/X86/gem5.fast -d m5out/test_run configs/example/fs.py --kernel=/home/mdl/ids103/full_system_images/binaries/x86_64-vmlinux-2.6.22.9.smp -r 1 -n 2 --mem-size=4GB --cpu-type=detailed --cpu-clock=2GHz --caches --l2cache --num-l2caches=1 --l1d_size=32kB --l1i_size=32kB --l1d_assoc=4 --l1i_assoc=4 --l2_size=4MB --l2_assoc=8 --cacheline_size=64

I specified the disk image file in Benchmarks.py. Restoring from the same checkpoint and running in atomic mode works fine. I also tried booting the system in detailed and letting it run for a while, but once it boots, there is no more output. So it seems that checkpointing is not the issue. The "run" program is just a dummy, and the same issue also persists when running SPEC benchmarks or any other program.

My dummy program is simply:

#include <stdio.h>
int main(void) {
    int count = 0;
    printf("**************************** HEYY \n");
    while (1)
        printf("\n %d \n", count++);
}

Letting it run for a while, the only output is exactly this:

booted
*******

It doesn't even finish printing the first printf.

Another thing to add: in another scenario, I modified the kernel to call an m5 pseudo instruction on every context switch, and then GEM5 prints that a context switch occurred. Once again, in atomic mode this worked as expected. However, in detailed, even the GEM5 output (printf inside GEM5 itself) stopped along with the system output in the terminal.

Thank you for your help.

Ivan


马久跃 via gem5-users
2014-06-09 08:37:20 UTC
Permalink
Hi all,

I found that someone in the linux-rt community had reported the same uart deadlock problem (http://www.spinics.net/lists/linux-rt-users/msg09246.html); they point out a recursive deadlock in the uart driver. The call stack is shown below:

------------------------------------------------------------
mpc52xx_uart_int()
  lock(port->lock);
  mpc52xx_psc_handle_irq()
    mpc52xx_uart_int_tx_chars()
      uart_write_wakeup()
        tty_wakeup()
          hci_uart_tx_wakeup()
            len = tty->ops->write(tty, skb->data, skb->len);
            /* the associated write function is uart_write */
            uart_write()
              lock(port->lock)   --> deadlock
------------------------------------------------------------

It seems the uart driver uses a single lock (port->lock) for both "struct uart_port" and "struct uart_info", which causes the recursive deadlock.

I have made a patch (based on the linux-2.6.22 kernel) for this problem. It does the following to avoid the deadlock:
- add an extra lock for struct uart_info
- protect struct uart_info using port->info->lock instead of port->lock

It works fine for me, but I'm not sure if it can really solve this system-hangs problem. You guys can try this patch.
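
For illustration, here is a minimal user-space sketch of that split-lock scheme, using pthread mutexes in place of the kernel spinlocks. The names (port_lock, info_lock, uart_interrupt, uart_write_patched) are illustrative stand-ins, not the actual kernel symbols or the patch code. The point of the split is that the interrupt path can keep holding port_lock while the re-entered write path synchronizes on a different lock, so the acquisition is no longer recursive:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t port_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t info_lock = PTHREAD_MUTEX_INITIALIZER; /* the extra lock */

/* Write path. With the original single-lock scheme this would take
 * port_lock again (recursive acquisition of a non-recursive lock ->
 * deadlock); the patched scheme takes the separate info_lock. */
static void uart_write_patched(const char *buf)
{
    pthread_mutex_lock(&info_lock);   /* was: lock(port->lock) --> deadlock */
    printf("wrote: %s\n", buf);
    pthread_mutex_unlock(&info_lock);
}

/* Interrupt path: holds port_lock across the tx-wakeup chain, which
 * calls back into the write path (as in the call stack above). */
static void uart_interrupt(void)
{
    pthread_mutex_lock(&port_lock);
    uart_write_patched("wakeup data");  /* safe: a different lock */
    pthread_mutex_unlock(&port_lock);
}

int main(void)
{
    uart_interrupt();
    puts("no deadlock");
    return 0;
}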

Jiuyue Ma


Ivan Stalev via gem5-users
2014-06-09 17:03:09 UTC
Permalink
Hi Jiuyue,

Thanks for sharing that information. I applied your patch and the problem
appears to be fixed! I will therefore continue to use the classic memory
model with X86 FS detailed CPUs and will post if I run into any further
issues. So it appears that this was a kernel bug and not a GEM5 bug.

Ivan


On Mon, Jun 9, 2014 at 4:37 AM, 马久跃 via gem5-users <gem5-***@gem5.org>
wrote:

> Hi all,
>
> I found someone in linux-rt community had report the same uart deadlock
> problem (http://www.spinics.net/lists/linux-rt-users/msg09246.html), they
> point out a recursive dead lock in uart driver, the call stack is show as
> following:
>
> ------------------------------------------------------------
> mpc52xx_uart_int()
>
> *lock(port->lock);*
>
> mpc52xx_psc_handle_irq()
>
> mpc52xx_uart_int_tx_chars()
>
> uart_write_wakeup()
>
> tty_wakeup()
>
> hci_uart_tx_wakeup()
>
> len = tty->ops->write(tty, skb->data, skb->len);
>
> The associated write function is uart_write
>
> uart_write()
>
> lock(port->lock) *--> deadlock*
> ------------------------------------------------------------
>
> it seems uart driver use a single lock (port->lock) for both "struct uart_port"
> and "struct uart_info", which caused the recursive deadlock.
>
> I have make a patch (based on linux-2.6.22 kernel) for this problem, it
> did following things to avoid deadlock:
> - add extra lock for struct uart_info
> - protect struct uart_info using port->info->lock instead of port->lock
>
> it works fine for me, but I'm not sure if it can really solve this system
> hangs problem. You guys can try this patch.
>
>
> Jiuyue Ma
>
>
> ------------------------------
> To: ***@psu.edu; gem5-***@gem5.org; ***@unican.es
> Date: Mon, 9 Jun 2014 08:49:22 +0100
>
> Subject: Re: [gem5-users] System Hangs
> From: gem5-***@gem5.org
>
>
> Hi all,
>
> It would be valuable if someone actually dig in and fixed the issue to
> that the non-Ruby memory system works for X86. I would imagine that the
> speed/fidelity trade-off of the classic memory system appeals to a wider
> audience. If not, there is always ARM :-).
>
> Andreas
>
> From: Ivan Stalev via gem5-users <gem5-***@gem5.org>
> Reply-To: Ivan Stalev <***@psu.edu>, gem5 users mailing list <
> gem5-***@gem5.org>
> Date: Monday, 9 June 2014 06:29
> To: "Castillo Villar, Emilio" <***@unican.es>
> Cc: gem5 users mailing list <gem5-***@gem5.org>
> Subject: Re: [gem5-users] System Hangs
>
> Hi,
>
> Thank you for verifying and reproducing the bug. I am now attempting to
> run with Ruby as you suggested; however, I am getting a seg fault during
> boot-up. I compile like this:
>
> scons build/X86/gem5.fast -j 12 PROTOCOL=MOESI_hammer
>
> and then run/boot-up like this:
>
> build/X86/gem5.fast -d m5out/test_run configs/example/fs.py
> --kernel=/home/mdl/ids103/full_system_images/binaries/x86_64-vmlinux-2.6.22.9.smp
> *--ruby* -n 2 --mem-size=4GB --cpu-type=detailed --cpu-clock=2GHz
> --script=rcs_scripts/run.rcS --caches --l2cache --num-l2caches=1
> --l1d_size=32kB --l1i_size=32kB --l1d_assoc=4 --l1i_assoc=4 --l2_size=4MB
> --l2_assoc=8 --cacheline_size=64 --max-checkpoints=1
>
> The last line in system.pc.com_1.terminal is "Kernel command line:
> earlyprintk=ttyS0 console=ttyS0 lpj=7999923 root=/dev/hda1". Comparing with
> the system.pc.com_1.terminal of a successful boot-up, the next line should
> be "Initializing CPU#0".
>
> Reading through the forums, there seem to have been some previous issues
> with Ruby X86 FS, specific protocols, as well as checkpointing (i.e.
> checkpointing using atomic and restoring with detailed). Can you guys
> suggest what a working configuration would be (i.e. which protocol,
> checkpointing)? Essentially, I need to run the same setup as the one I
> tried with the classical model, but with minimum overhead in terms of
> simulation time due to using Ruby.
>
> Thanks,
>
> Ivan
>
>
> On Sun, Jun 8, 2014 at 6:44 PM, Castillo Villar, Emilio <
> ***@unican.es> wrote:
>
> Hello,
>
> Mr. Hestness is right, simulations do not make progress once the output
> has hung.
> The cpus keep executing some code and committing instructions, but they
> are doing spinlocks in almost all cases.
> So although cpus are actually executing stuff, the simulation is
> completely freeze. Thats what I meat by "instructions are still being
> committed", sorry for not being clear enough.
>
> Just running in FS with the classic memory system and do a "cat
> /proc/cpuinfo" will crash the system.
>
> Kind regards.
>
>
> ------------------------------
> *De:* Joel Hestness [***@gmail.com]
> *Enviado:* domingo, 08 de junio de 2014 23:27
> *Para:* Castillo Villar, Emilio
> *CC:* Ivan Stalev; gem5 users mailing list
>
> *Asunto:* Re: [gem5-users] System Hangs
>
> Hi guys,
> I've been able to reproduce Ivan's issue in the latest gem5 (rev.
> 10235). It seems this may be the same bug as a report that I filed about
> a year ago <http://flyspray.gem5.org/index.php?do=details&task_id=7>.
> Previously, I had overlooked that Ivan's tests were using the classic
> memory system (and frankly, I had forgotten I had submitted that bug
> report). I'll second Emilio and recommend using Ruby for now.
>
> For anyone's future reference: Contrary to Emilio's statement, I'm not
> clear that simulations necessarily make progress after terminal output
> interrupts are lost with the classic memory model. It is possible that
> unimplemented x86 atomics in the classic memory hierarchy could be the
> problem, and if so, many other problems besides hung terminal output could
> arise.
>
> Joel
>
>
>
> On Fri, Jun 6, 2014 at 10:49 AM, Castillo Villar, Emilio <
> ***@unican.es> wrote:
>
> Hello,
>
> I have seen similar issues when running X86 timing and detailed cpus with
> the Classic memory system. Mostly due to X86 atomic memory accesses not
> being implemented. The stdout freezes but instructions are still being
> committed.
>
> If you want to run with timing or detailed cpus in X86 & FS & multi-core I
> am afraid you will need to use Ruby.
>
> Emilio
> ------------------------------
> *De:* gem5-users [gem5-users-***@gem5.org] en nombre de Ivan Stalev
> via gem5-users [gem5-***@gem5.org]
> *Enviado:* viernes, 06 de junio de 2014 1:14
> *Para:* Joel Hestness
> *CC:* gem5 users mailing list
> *Asunto:* Re: [gem5-users] System Hangs
>
> Hi Joel,
>
> Thanks for getting back to me.
>
> I ran it again with the ProtocolTrace flag and the only output there
> is: 0: rtc: Real-time clock set to Sun Jan 1 00:00:00 2012
>
> With the Exec flag, I do see spinlock output on and off in the beginning
> during regular execution, so that is normal as you said. But once the
> "problem" occurs shortly after, the Exec flag output is just continuous
> spinlock forever as I posted previously.
>
> The exact gem5 command lines I use are posted in my previous post. The
> kernel and disk image are the simply the default ones from the GEM5
> downloads page: http://www.m5sim.org/dist/current/x86/x86-system.tar.bz2
>
> I have attached a zip file containing the following files:
>
> BOOT-config.ini - The config.ini from the first run, i.e. booting in
> atomic mode, creating a checkpoint, and exiting.
> BOOT-system.pc.com_1.terminal - The terminal output from the first run
> CPT-config.ini - The config.ini when restoring from the checkpoint in
> detailed mode
> CPT-system.pc.com_1.terminal - The system output after restoring from the
> checkpoint
> run.c - The dummy program started by the run script
> run.rcS - The run script
> flag-exec-partial.out - The output from the Exec flag, right before the
> "problem" occurs, The infinite spinlock starts at tick 5268700121500
>
> Again, this problem occurs even without checkpointing. I have also tried
> a few different kernels and disk images. I did the same test with both
> alpha and arm64 and it works, so it appears to just be an issue with x86.
>
> Thank you,
>
> Ivan
>
>
>
> On Tue, Jun 3, 2014 at 7:53 PM, Joel Hestness <***@gmail.com>
> wrote:
>
> Hi Ivan,
> Sorry for the delay on this.
>
> I haven't had an opportunity to try to reproduce your problem, though
> the traces you've supplied here can help a bit. Specifically, the stalled
> LocalApics (plural, because 2 CPU cores) is fishy, because we'd expect
> periodic interrupts to continue. However, the last interrupt on CPU 1
> appears to get cleared, which looks fine. The CPU spin lock is normal for
> threads that don't have any work to complete, but it's confusing why they
> wouldn't be doing something.
>
> The next thing to dig into would be to figure out what the CPUs were
> doing last before they entered the spin loop. For this we may need to
> trace a bit earlier in time using the Exec flags, and since it is likely
> that messages/responses may be getting lost in the memory hierarchy or
> devices, we'll need to use the ProtocolTrace flag to see what is being
> communicated. You could try playing around with these as a start.
>
> I may also have time to try to reproduce this over the next week, so
> I'm hoping you could give me some more information: can you send me your
> gem5 command line, config.ini, and system.pc.com_1.terminal output from
> your simulation, and details on the kernel and disk image that you're
> trying to use?
>
>
> Thanks!
> Joel
>
>
>
>
> On Sat, May 24, 2014 at 7:27 PM, Ivan Stalev <***@psu.edu> wrote:
>
> Hi,
>
> Has anyone been able to reproduce this issue?
>
> Thanks,
>
> Ivan
>
>
> On Sat, May 17, 2014 at 1:50 AM, Ivan Stalev <***@psu.edu> wrote:
>
> Hi Joel,
>
> I am using revision 10124. I removed all of my own modifications just to
> be safe.
>
> Running with gem5.opt and restoring from a boot-up checkpoint
> with--debug-flag=Exec, it appears that the CPU is stuck in some sort of
> infinite loop, executing this continuously:
>
> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.0 :
> CMP_M_I : limm t2d, 0 : IntAlu : D=0x0000000000000000
> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.1 :
> CMP_M_I : ld t1d, DS:[rdi] : MemRead : D=0x00000000fffffffe
> A=0xffffffff80822400
> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.2 :
> CMP_M_I : sub t0d, t1d, t2d : IntAlu : D=0x0000000000000000
> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.0 :
> JLE_I : rdip t1, %ctrl153, : IntAlu : D=0xffffffff80596897
> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.1 :
> JLE_I : limm t2, 0xfffffffffffffff9 : IntAlu : D=0xfffffffffffffff9
> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.2 :
> JLE_I : wrip , t1, t2 : IntAlu :
> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+16 : NOP
> : IntAlu :
> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.0 :
> CMP_M_I : limm t2d, 0 : IntAlu : D=0x0000000000000000
> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.1 :
> CMP_M_I : ld t1d, DS:[rdi] : MemRead : D=0x00000000fffffffe
> A=0xffffffff80822400
> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.2 :
> CMP_M_I : sub t0d, t1d, t2d : IntAlu : D=0x0000000000000000
> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.0 :
> JLE_I : rdip t1, %ctrl153, : IntAlu : D=0xffffffff80596897
> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.1 :
> JLE_I : limm t2, 0xfffffffffffffff9 : IntAlu : D=0xfffffffffffffff9
> 5268959012000: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.2 :
> JLE_I : wrip , t1, t2 : IntAlu :
> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+16 : NOP
> : IntAlu :
> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.0 :
> CMP_M_I : limm t2d, 0 : IntAlu : D=0x0000000000000000
> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.1 :
> CMP_M_I : ld t1d, DS:[rdi] : MemRead : D=0x00000000fffffffe
> A=0xffffffff80822400
> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.2 :
> CMP_M_I : sub t0d, t1d, t2d : IntAlu : D=0x0000000000000000
> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.0 :
> JLE_I : rdip t1, %ctrl153, : IntAlu : D=0xffffffff80596897
> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.1 :
> JLE_I : limm t2, 0xfffffffffffffff9 : IntAlu : D=0xfffffffffffffff9
> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.2 :
> JLE_I : wrip , t1, t2 : IntAlu :
> 5268959013000: system.switch_cpus1 T0 : @_spin_lock_irqsave+16 : NOP
> : IntAlu :
>
> ....and so on repetitively without stopping.
>
> Using --debug-flag=LocalApic, the output does indeed stop shortly after
> restoring from the checkpoint. The last output is:
> ..
> 5269570990500: system.cpu1.interrupts: Reported pending regular
> interrupt.
> 5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
> 5269570990500: system.cpu1.interrupts: Generated regular interrupt fault
> object.
> 5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
> 5269570990500: system.cpu1.interrupts: Interrupt 239 sent to core.
> 5269571169000: system.cpu1.interrupts: Writing Local APIC register 5 at
> offset 0xb0 as 0.
>
> ...and no more output from this point on.
>
> I appreciate your help tremendously.
>
> Ivan
>
>
>
> On Fri, May 16, 2014 at 11:18 AM, Joel Hestness <***@gmail.com>
> wrote:
>
> Hi Ivan,
> I believe that the email thread you previously referenced was related to
> a bug that we identified and fixed with changeset 9624
> <http://permalink.gmane.org/gmane.comp.emulators.m5.devel/19326>. That
> bug was causing interrupts to be dropped in x86 when running with the O3
> CPU. Are you using a version of gem5 after that changeset? If not, I'd
> recommend updating to a more recent version and trying to replicate this
> issue again.
>
> If you are using a more recent version of gem5, first, please let us
> know which changeset and whether you've made any changes. Then, I'd
> recommend compiling gem5.opt and using the DPRINTF tracing functionality to
> see if you can zero in on what is happening. Specifically, first try
> passing the flag --debug-flag=Exec to look at what the CPU cores are
> executing (you may also want to pass --trace-start=<<tick>> with a
> simulator tick time close to when the hang happens). This trace will
> include Linux kernel symbols for at least some of the lines if executing in
> the kernel (e.g. handling an interrupt). If you've compiled your benchmark
> without debugging symbols, it may just show the memory addresses of
> instructions being executed within the application. I will guess that
> you'll see kernel symbols for at least some of the executed instructions
> for interrupts.
>
> If it appears that the CPUs are continuing to execute, try running
> with --debug-flag=LocalApic. This will print the interrupts that each core
> is receiving, and if it stops printing at any point, it means something has
> gone wrong and we'd have to do some deeper digging.
>
> Keep us posted on what you find,
> Joel
>
>
>
> On Thu, May 15, 2014 at 11:16 PM, Ivan Stalev <***@psu.edu> wrote:
>
> Hi Joel,
>
> I have tried several different kernels and disk images, including the
> default ones provided on the GEM5 website in the x86-system.tar.bz2
> download. I run with these commands:
>
> build/X86/gem5.fast -d m5out/test_run configs/example/fs.py
> --kernel=/home/mdl/ids103/full_system_images/binaries/x86_64-vmlinux-2.6.22.9.smp
> -n 2 --mem-size=4GB --cpu-type=atomic --cpu-clock=2GHz
> --script=rcs_scripts/run.rcS --max-checkpoints=1
>
> My run.rcS script simply contains:
>
> #!/bin/sh
> /sbin/m5 resetstats
> /sbin/m5 checkpoint
> echo 'booted'
> /extras/run
> /sbin/m5 exit
>
> where "/extras/run" is simply a C program with an infinite loop that
> prints a counter.
>
> I then restore:
>
> build/X86/gem5.fast -d m5out/test_run configs/example/fs.py
> --kernel=/home/mdl/ids103/full_system_images/binaries/x86_64-vmlinux-2.6.22.9.smp
> -r 1 -n 2 --mem-size=4GB --cpu-type=detailed --cpu-clock=2GHz --caches
> --l2cache --num-l2caches=1 --l1d_size=32kB --l1i_size=32kB --l1d_assoc=4
> --l1i_assoc=4 --l2_size=4MB --l2_assoc=8 --cacheline_size=64
>
> I specified the disk image file in Benchmarks.py. Restoring from the
> same checkpoint and running in atomic mode works fine. I also tried
> booting the system in detailed mode and letting it run for a while, but
> once it boots, there is no more output. So checkpointing does not seem
> to be the issue. The "run" program is just a dummy; the same issue
> persists when running SPEC benchmarks or any other program.
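>
> (Concretely, the Benchmarks.py change is just a SysConfig entry pointing
> at the image; a minimal sketch, where the benchmark name and image
> filename are placeholders:)
>
> # configs/common/Benchmarks.py
> Benchmarks['MyTest'] = [SysConfig('run.rcS', '4GB', 'linux-x86.img')]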
>
> My dummy program is simply:
>
> #include <stdio.h>
> int main(void) {
>     int count = 0;
>     printf("**************************** HEYY \n");
>     while (1)
>         printf("\n %d \n", count++);
> }
>
> Letting it run for a while, the only output is exactly this:
>
> booted
> *******
>
> It doesn't even finish printing the first printf.
>
> Another thing to add: in another experiment, I modified the kernel to
> execute an m5 pseudo instruction on every context switch, and GEM5 then
> prints that a context switch occurred. Once again, in atomic mode this
> worked as expected. However, in detailed mode, even GEM5's own output
> (a printf inside GEM5 itself) stopped along with the guest's output in
> the terminal.
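>
> (For reference, the hook amounts to something like the sketch below,
> assuming gem5's util/m5 ops are compiled into the kernel. The function
> name and the use of m5_work_begin are illustrative stand-ins, not my
> exact patch, which uses a pseudo op that makes GEM5 print.)
>
> /* Illustrative sketch: assumes util/m5/m5op_x86.S is linked into the
>    kernel image and m5op.h is on the include path. */
> #include "m5op.h"
>
> /* Hypothetical hook, called from the scheduler's context-switch path. */
> static inline void gem5_mark_context_switch(void)
> {
>     m5_work_begin(0, 0); /* stock pseudo op used as a stand-in */
> }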
>
> Thank you for your help.
>
> Ivan