Discussion:
dist-gem5 checkpointing
David Kim
2018-03-05 17:59:58 UTC
Hello,

I am trying to checkpoint dist-gem5 in the middle of an application's
execution.
Below are the script files that I use to run dist-gem5 (with 2 nodes)
after Linux has booted.

< for node 1 (node1.rcS)>
#!/bin/sh

# Set up IP address for node 1
/sbin/ifconfig eth0 hw ether 00:90:00:00:00:02
/sbin/ifconfig eth0 192.168.0.2 netmask 255.255.255.0 up

cd /root/NPB3.3.1/NPB3.3-MPI/bin

# checkpoint after delay (in ns, so the below delay represents 50000
# seconds! I have also tested 0.1s, 10s, and 100s delay)
/sbin/m5 checkpoint 50000000000000

/sbin/m5 loadsymbol

/sbin/m5 resetstats
mpiexec -hosts=node1,node2 -np 2 ./cg.S.2
/sbin/m5 exit

< for node 2 (node2.rcS) >
#!/bin/sh

# Set up IP address for node 2
/sbin/ifconfig eth0 hw ether 00:90:00:00:00:03
/sbin/ifconfig eth0 192.168.0.3 netmask 255.255.255.0 up

And here are my command lines for running dist-gem5 (for certain reasons I
did not use gem5-dist.sh; the following command lines work well in general).

For the switch node:

./build/ARM/gem5.opt -d ./m5out.switch ./configs/dist/sw.py --is-switch \
    --dist-size=2 --dist-server-name=localhost --dist-server-port=2200

For the compute nodes (this one is for node1):

./build/ARM/gem5.opt -d ./m5out.0 ./configs/example/fs.py \
    --machine-type=VExpress_EMM64 \
    --disk-image=aarch64-ubuntu-trusty-headless.img \
    --kernel=vmlinux.aarch64.20140821 \
    --dtb-filename=vexpress.aarch64.20140821.dtb --cpu-type=TimingSimpleCPU \
    --num-cpus=1 --caches --l2cache --mem-size=512MB --mem-channels=1 \
    --mem-ranks=1 --script=./node1.rcS --dist --dist-rank=0 --dist-size=2 \
    --dist-server-name=localhost --dist-server-port=2200 \
    --dist-sync-start=1000000000000t

I have increased the checkpoint delay to see whether the checkpoint image
changes, but the behavior always looks the same: the simulation seems to wait
for that amount of time (without running the application) and then takes the
checkpoint. No progress is displayed on the console until the checkpoint, and
restoring from it prints all of the application output from the beginning.
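
As a quick sanity check on the numbers in the script and command line above, here is a small C++ sketch of the unit arithmetic. It assumes gem5's default tick resolution of 1 ps (1000 ticks per ns); the helper names are made up for illustration and are not gem5 APIs.

```cpp
#include <cstdint>

// Assumption: gem5's default tick resolution of 1 ps, i.e. 1000 ticks per ns.
constexpr std::uint64_t kTicksPerNs = 1000;
constexpr std::uint64_t kNsPerSecond = 1000000000ULL;

// Hypothetical helpers (not part of gem5).
constexpr std::uint64_t nsToSeconds(std::uint64_t ns)
{
    return ns / kNsPerSecond;
}

constexpr std::uint64_t ticksToNs(std::uint64_t ticks)
{
    return ticks / kTicksPerNs;
}

// Argument passed to 'm5 checkpoint' in node1.rcS, in ns.
constexpr std::uint64_t kCkptDelayNs = 50000000000000ULL;
// Value passed to --dist-sync-start, in ticks.
constexpr std::uint64_t kSyncStartTicks = 1000000000000ULL;

// 5e13 ns is 50000 s; 1e12 ticks is 1 s under the 1 ps assumption.
static_assert(nsToSeconds(kCkptDelayNs) == 50000, "delay should be 50000 s");
static_assert(nsToSeconds(ticksToNs(kSyncStartTicks)) == 1, "sync start should be 1 s");
```

The two static_assert lines make the compiler confirm the conversions, so the comment in the script (50000 seconds) is consistent with the value passed to m5.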

To checkpoint in the middle of a running application, for example
1 billion cycles after the application starts, do I have to use
m5_roi_begin() and m5_roi_end() calls in the application's source code (I
have not tested this yet, but I guess it will work)? Or can I just add a
delay to the checkpoint command as shown above, and so avoid changing the
application source code?

Any comments would be appreciated.

Thanks.

Regards,
Dong Wan Kim
Mohammad Alian
2018-03-06 00:01:41 UTC
Hi,

What you have should work. Are you sure that the application starts
after the checkpoint command (i.e. that you are not blocking anywhere)? For
example, what is the output if you add an echo right before starting the MPI app:

/sbin/m5 checkpoint 50000000000000

/sbin/m5 loadsymbol

/sbin/m5 resetstats
echo "start the app"
mpiexec -hosts=node1,node2 -np 2 ./cg.S.2


Do you see immediate progress in your application if you remove "/sbin/m5
checkpoint 50000000000000"?

Best,
Mohammad
_______________________________________________
gem5-users mailing list
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
David Kim
2018-03-07 16:21:41 UTC
Hello,

I looked at the output again, and it contains the following message:
info: m5 checkpoint called with non-zero delay => triggering immediate checkpoint (at the next sync)

So I looked at the source code that prints that message; the relevant
snippets are below.

@ src/dev/net/dist_iface.cc

bool
DistIface::readyToCkpt(Tick delay, Tick period)
{
    bool ret = true;
    DPRINTF(DistEthernet, "DistIface::readyToCkpt() called, delay:%lu "
            "period:%lu\n", delay, period);
    if (master) {
        if (delay == 0) {
            inform("m5 checkpoint called with zero delay => triggering collaborative "
                   "checkpoint\n");
            sync->requestCkpt(ReqType::collective);
        } else {
            inform("m5 checkpoint called with non-zero delay => triggering immediate "
                   "checkpoint (at the next sync)\n");
            sync->requestCkpt(ReqType::immediate);
        }
        if (period != 0)
            inform("Non-zero period for m5_ckpt is ignored in "
                   "distributed gem5 runs\n");
        ret = false;
    }
    return ret;
}

@ src/sim/pseudo_inst.cc

void
m5checkpoint(ThreadContext *tc, Tick delay, Tick period)
{
    DPRINTF(PseudoInst, "PseudoInst::m5checkpoint(%i, %i)\n", delay, period);
    if (!tc->getCpuPtr()->params()->do_checkpoint_insts)
        return;

    if (DistIface::readyToCkpt(delay, period)) {
        Tick when = curTick() + delay * SimClock::Int::ns;
        Tick repeat = period * SimClock::Int::ns;
        exitSimLoop("checkpoint", 0, when, repeat);
    }
}

Since the checkpoint delay is non-zero, this seems to force a checkpoint
at the next sync rather than after the requested delay.
In this simulation I passed 'dist-sync-start=1000000000000t', so I think
a sync will happen every 1s of simulated time, right?
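
To make the control flow in the two snippets above easier to follow, here is a stripped-down, standalone C++ model of it. This is not gem5 code: `FakeSim` and its fields are hypothetical stand-ins, the two functions only mirror the shape of the originals, and the tick conversion assumes 1000 ticks per ns. The sketch only illustrates that, when a dist-gem5 interface is present, the delay selects the request type and the local delayed exit event is never scheduled.

```cpp
#include <cstdint>

using Tick = std::uint64_t;

enum class ReqType { none, immediate, collective };

// Hypothetical stand-in for the simulator state touched by the two functions.
struct FakeSim {
    bool hasDistMaster = false;         // models 'master' being non-null
    ReqType requested = ReqType::none;  // models sync->requestCkpt(...)
    bool localExitScheduled = false;    // models exitSimLoop("checkpoint", ...)
    Tick localExitAt = 0;               // tick of the modelled exit event
    Tick curTick = 0;
};

// Mirrors the shape of DistIface::readyToCkpt(): in a distributed run the
// request is handed to the sync logic and false is returned.
bool readyToCkpt(FakeSim &sim, Tick delay)
{
    if (!sim.hasDistMaster)
        return true;
    sim.requested = (delay == 0) ? ReqType::collective : ReqType::immediate;
    return false;
}

// Mirrors the shape of m5checkpoint(): the delay is applied only when
// readyToCkpt() returns true, i.e. in a non-distributed run.
void m5checkpoint(FakeSim &sim, Tick delayNs)
{
    constexpr Tick ticksPerNs = 1000;   // assumption: 1 ps ticks
    if (readyToCkpt(sim, delayNs)) {
        sim.localExitScheduled = true;
        sim.localExitAt = sim.curTick + delayNs * ticksPerNs;
    }
}
```

In this model, calling m5checkpoint() with hasDistMaster set to true and any non-zero delay leaves localExitScheduled false and records an immediate request, which matches the info message quoted above.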

FYI, I added the 'echo' command, but it was not printed, so I think the
simulation did not reach that point.

Can you explain what exactly happens in the dist-gem5 checkpoint
routine? Any suggestions or ideas would be appreciated.

Thanks.

Dong Wan Kim
Gabor Dozsa
2018-03-08 09:43:57 UTC
Hi,

dist-gem5 does not currently support delayed checkpoints. The delay option is only used to decide whether a 'collective' or an 'immediate' checkpoint is to be taken.

If delay == 0 then the checkpoint is triggered only when all gem5 instances (participating in the dist-gem5 run) have completed an 'm5 checkpoint 0' command. An example use case is when one wants to take a checkpoint at a synchronisation point in the simulated distributed application (e.g. take a checkpoint just before an MPI_Barrier() completes in an MPI application).

On the other hand, if a checkpoint command with a delay != 0 parameter is hit in any of the gem5 processes then a checkpoint is taken immediately across all participating gem5 instances.
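
The two modes described above can be summarised with a toy C++ model. It is only a sketch of the semantics as stated in this reply, not the actual dist-gem5 synchronisation code; `CkptCoordinator` and its members are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model with one flag per participating gem5 process.
class CkptCoordinator
{
  public:
    explicit CkptCoordinator(std::size_t numNodes)
        : collectiveRequested(numNodes, false) {}

    // Called when node 'rank' executes 'm5 checkpoint <delay>'.
    void request(std::size_t rank, std::uint64_t delay)
    {
        if (delay == 0)
            collectiveRequested.at(rank) = true;  // wait for every node
        else
            immediateRequested = true;            // one request is enough
    }

    // Evaluated at a sync point: should all nodes checkpoint now?
    bool shouldCheckpointAtSync() const
    {
        if (immediateRequested)
            return true;
        for (bool requested : collectiveRequested) {
            if (!requested)
                return false;
        }
        return !collectiveRequested.empty();
    }

  private:
    std::vector<bool> collectiveRequested;
    bool immediateRequested = false;
};
```

In terms of this model, a script in which only one node issues 'm5 checkpoint' with a non-zero delay corresponds to a single request() call that sets the immediate flag, so the checkpoint is taken at the next sync regardless of the delay value.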

Regards,
Gabor Dozsa

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
Dong Wan Kim
2018-03-09 16:53:01 UTC
I see, Gabor.

Sorry for the slow reply; I wanted to check whether the other checkpoint
option (e.g. m5_checkpoint()) works, and it seems to.

So calling functions such as m5_checkpoint() is "currently" the right way
to checkpoint dist-gem5 during the execution of an application, right?

Also, I have tested checkpointing the NPB benchmarks that you used in your
dist-gem5 evaluation, but some of them do not seem to work correctly.
For example, DT terminated with a message like "BAD TERMINATION OF ONE OF
YOUR APPLICATION PROCESSES", which implies that the application was terminated
for some reason other than MPICH itself (according to the MPICH doc. :-) ).
Since you used a subset of the NPB suite in your evaluation, do you have any
insight into why some test cases could not finish successfully, and do you
have a list of those applications?

Thanks.

Dong Wan Kim
Gabor Dozsa
2018-03-14 09:23:13 UTC
Hi,

m5_checkpoint(delay,period) works the same way as the “m5 checkpoint” command (from gem5/util) does in dist-gem5. If you want to initiate the checkpoint from your application (by modifying the source code) then you can use m5_checkpoint(x,0). If you want to initiate the checkpoint from your OS bootscript then you can add “m5 checkpoint x” to your bootscript.
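
A minimal sketch of the application-side approach follows, assuming the usual m5ops prototype `void m5_checkpoint(uint64_t ns_delay, uint64_t ns_period)` from gem5/util/m5. So that the snippet compiles outside gem5, the real call is replaced by a local stub in a hypothetical `stub` namespace; `runSolver` and its parameters are hypothetical as well. In a real build one would include the m5ops header, link against the m5 library, and call m5_checkpoint() directly.

```cpp
#include <cstdint>

namespace stub {
// Stand-in for the real m5 op; it only records how it was called.
int calls = 0;
std::uint64_t lastDelay = 0;
std::uint64_t lastPeriod = 0;

void m5_checkpoint(std::uint64_t nsDelay, std::uint64_t nsPeriod)
{
    ++calls;
    lastDelay = nsDelay;
    lastPeriod = nsPeriod;
}
} // namespace stub

// Hypothetical solver loop: request one checkpoint when iteration
// 'ckptIter' is reached. The period is 0 because dist-gem5 ignores it.
void runSolver(int numIters, int ckptIter, std::uint64_t delay)
{
    for (int iter = 0; iter < numIters; ++iter) {
        if (iter == ckptIter)
            stub::m5_checkpoint(delay, 0);
        // ... one iteration of the real computation would go here ...
    }
}
```

With delay 0, every participating gem5 process has to execute the call before the checkpoint is taken; with a non-zero delay, a single process reaching it is enough.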

Regarding DT, you may find some clues in the system terminal log file (by default it is in m5out.n/system.terminal where n=[0,N-1] if your dist-gem5 run includes N gem5 processes).
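
As a small convenience, the per-node log locations can be generated and scanned mechanically. The C++ sketch below assumes the default m5out.n/system.terminal layout mentioned above; `terminalLogPaths` and `logContains` are hypothetical helpers, and any search string (for instance the error text quoted earlier in the thread) is supplied by the caller.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Build "m5out.0/system.terminal" ... "m5out.<N-1>/system.terminal".
std::vector<std::string> terminalLogPaths(std::size_t numNodes)
{
    std::vector<std::string> paths;
    for (std::size_t n = 0; n < numNodes; ++n)
        paths.push_back("m5out." + std::to_string(n) + "/system.terminal");
    return paths;
}

// Return true if any line of the file contains 'needle'.
// A missing or unreadable file simply yields false.
bool logContains(const std::string &path, const std::string &needle)
{
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        if (line.find(needle) != std::string::npos)
            return true;
    }
    return false;
}
```

Looping over terminalLogPaths(N) and calling logContains(path, "BAD TERMINATION") on each entry would show which node's terminal log recorded the failure, if any did.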

Regards,
- Gabor
