Discussion:
Running Dist-gem5
Vitorio Cargnini (lcargnini)
2017-12-06 00:03:06 UTC
Permalink
Hello,

What exactly do I need in order to run dist-gem5 with --dist?

I am trying, but it fails with "Failed to start switch".

Also, what do I need in place for it to start distributed across nodes, instead of launching multiple parallel runs on 'localhost'?

Regards,
Vitorio.
Mohammad Alian
2017-12-06 05:18:08 UTC
Permalink
Hi Vitorio,

You should check the content of log.switch to find out why the gem5 process
simulating the switch cannot start. There are many reasons a gem5 process
can fail to run. If you post the content of log.switch here, I can help.

Regarding the "distributed run": you first need to set up passwordless ssh
between your simulation (physical) hosts, and then use the "LSB_MCPU_HOSTS"
environment variable to assign gem5 processes to physical hosts. E.g., if your
simulated cluster size is 8 and you want to run 4 gem5 processes on host_name0
and 4 on host_name1, then your LSB_MCPU_HOSTS looks like this:

export LSB_MCPU_HOSTS="host_name0 4 host_name1 4"
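
As a rough illustration of the format above, the following Python sketch builds the LSB_MCPU_HOSTS value from (host, process count) pairs and sums the counts so they can be compared with the simulated cluster size. The helper names build_lsb_mcpu_hosts and total_processes are made up for this example and are not part of gem5.

```python
def build_lsb_mcpu_hosts(slots):
    """Join (host, count) pairs into the 'host count host count ...' form."""
    parts = []
    for host, count in slots:
        if count <= 0:
            raise ValueError(f"process count for {host} must be positive")
        parts.append(f"{host} {count}")
    return " ".join(parts)


def total_processes(value):
    """Sum the per-host counts in an LSB_MCPU_HOSTS-style string."""
    tokens = value.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating host names and counts")
    # Counts sit at the odd positions: host0 N0 host1 N1 ...
    return sum(int(count) for count in tokens[1::2])


# Same split as the example above: 4 processes on each of two hosts.
hosts = build_lsb_mcpu_hosts([("host_name0", 4), ("host_name1", 4)])
print(f'export LSB_MCPU_HOSTS="{hosts}"')
print("total gem5 processes assigned:", total_processes(hosts))
```

The printed export line can be pasted into the shell that runs the launch script; the total is only a sanity check against the cluster size you intend to simulate.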


Best,
Mohammad


_______________________________________________
gem5-users mailing list
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
Vitorio Cargnini (lcargnini)
2017-12-06 19:50:22 UTC
Permalink
Hi Mohammad,

Thank you for the prompt response. I checked log.switch; the first error was the path, since the script needs full paths to work. I fixed that and tried again. This time it started, but failed a little later.

Got the following output:
launch switch gem5 process on node0 ...
waiting for switch to start ..
node #switch started
START Wed Dec 6 12:36:04 MST 2017
starting gem5 on node0...
starting gem5 on node0...
starting gem5 on node1...
starting gem5 on node1...
starting gem5 on node2 ...
starting gem5 on node2 ...
starting gem5 on node3 ...
starting gem5 on node3 ...
(I) (some) gem5 process(es) exited
KILLED Wed Dec 6 12:37:35 MST 2017
ABORT Wed Dec 6 12:37:35 MST 2017

The log.switch had the following:
command line: /wada/wada/gem5/build/ARM/gem5.opt -d /wada/wada/gem5/m5out.switch --debug-flags=DistEthernet /wada/wada/gem5/configs/dist/sw.py --checkpoint-dir=/wada/wada/gem5/m5out.switch --is-switch --dist-size=8 --dist-server-port=2200

info: Standard input is not a terminal, disabling listeners.
Global frequency set at 1000000000000 ticks per second
0: system.portlink0: DistEtherLink::DistEtherLink() link delay:10000000 ticksPerByte:800
0: global: DistIface() ctor rank:0
info: tcp_iface listening on port 2200
Killed by signal 15.

Mohammad Alian
2017-12-07 00:28:02 UTC
Permalink
Again, you need to look at log.* to find out why the simulation gets killed.
Don't look only at log.switch. If one of the gem5 processes aborts, then the
entire dist-gem5 simulation will be killed.
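
One way to follow this advice without opening every file by hand is a small scan over all log.* files. The sketch below is only an illustration: the marker strings "panic:", "Killed by signal" and "Program aborted" are taken from output posted in this thread, "fatal:" is added as an assumption about other gem5 error output, and the function names are hypothetical.

```python
import glob
import os

# "panic:", "Killed by signal" and "Program aborted" appear in this thread;
# "fatal:" is an assumed extra marker. Extend the tuple as needed.
FAILURE_MARKERS = ("panic:", "fatal:", "Killed by signal", "Program aborted")


def failure_lines(lines):
    """Return the lines that contain any of the failure markers."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(marker in line for marker in FAILURE_MARKERS)
    ]


def scan_logs(run_dir="."):
    """Map each log.* file in run_dir to its suspicious lines (if any)."""
    report = {}
    for path in sorted(glob.glob(os.path.join(run_dir, "log.*"))):
        with open(path, errors="replace") as handle:
            hits = failure_lines(handle)
        if hits:
            report[path] = hits
    return report


if __name__ == "__main__":
    for path, hits in scan_logs().items():
        print(path)
        for hit in hits:
            print("   ", hit)
```

A log that shows only "Killed by signal 15." was probably terminated from outside, so the process that failed first is more likely to be found in one of the other logs.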

Gabe Black
2017-12-07 05:42:33 UTC
Permalink
It's also possible that, if a bunch of copies of gem5 are all running on the
same machine, that machine ran out of memory and started killing processes to
stay afloat.

Gabe
Vitorio Cargnini (lcargnini)
2017-12-07 18:31:49 UTC
Permalink
Thanks Gabe,

But no, I set it to 4 nodes with 2 processes per node to start testing.


Vitorio Cargnini (lcargnini)
2017-12-07 17:55:22 UTC
Permalink
Hi,

The m5out.*/stats.txt files are all empty.

However, m5out.switch/config.ini is populated.
It has portlink sections numbered 0 to 7; this is the last one:
[system.portlink7]
type=DistEtherLink
delay=10000000
delay_var=0
dist_rank=0
dist_size=8
dist_sync_on_pseudo_op=false
dump=Null
eventq_index=0
is_switch=true
num_nodes=8
server_name=127.0.0.1
server_port=2200
speed=800.000000
sync_repeat=0
sync_start=5200000000000
int0=system.interface[7]

I wonder whether server_name could be the problem.
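
To compare these settings across the switch and node processes, the config.ini files can be read with Python's standard configparser module. This is a minimal sketch, assuming the files parse as ordinary INI; the embedded sample repeats a few keys from the section above, and dist_link_settings is a hypothetical helper name.

```python
import configparser

# A few keys copied from the [system.portlink7] section posted above.
SAMPLE = """\
[system.portlink7]
type=DistEtherLink
dist_rank=0
dist_size=8
is_switch=true
server_name=127.0.0.1
server_port=2200
sync_start=5200000000000
"""


def dist_link_settings(parser):
    """Collect connection and sync settings of every DistEtherLink section."""
    wanted = ("server_name", "server_port", "dist_rank", "dist_size", "sync_start")
    settings = {}
    for section in parser.sections():
        if parser.get(section, "type", fallback="") == "DistEtherLink":
            settings[section] = {
                key: parser.get(section, key, fallback=None) for key in wanted
            }
    return settings


# interpolation=None keeps any '%' characters in values from being interpreted.
parser = configparser.ConfigParser(interpolation=None, strict=False)
parser.read_string(SAMPLE)  # for a real run: parser.read("m5out.switch/config.ini")
for section, values in dist_link_settings(parser).items():
    print(section, values)
```

Running the same helper on each m5out.*/config.ini makes it easy to see whether every process points at the same server_name and server_port and uses the same sync_start.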



Mohammad Alian
2017-12-07 17:59:46 UTC
Permalink
Please look at the content of log.*, not m5out.*/stats.txt. It's not
surprising that stats.txt is empty.

Vitorio Cargnini (lcargnini)
2017-12-08 18:36:00 UTC
Permalink
Thanks, Mohammad. I made some changes and tried again. It worked, but for some reason it simply dies after a while; I am not sure why.

I got the following message on my terminal:
0: global: DistIface::startup() done
info: Entering event queue @ 0. Starting simulation...
panic: panic condition recv_tick <= curTick() occurred: Simulators out of sync - missed packet receive by 771635016399 ticks(rev_recv_tick: 0 send_tick: 4428354726000 send_delay: 257601 linkDelay: 10000000 )
Memory Usage: 402472 KBytes
Program aborted at tick 5200000000000




This is what I got in log.switch:

**** REAL SIMULATION ****
0: system.portlink0: DistEtherLink::startup() called
0: global: DistIface::startup() started
info: Dist sync scheduled at 5200000000000 and repeats 0: global: DistIface::startup() done
10000000
0: system.portlink1: DistEtherLink::startup() called
0: global: DistIface::startup() started
0: global: DistIface::startup() done
0: system.portlink2: DistEtherLink::startup() called
0: global: DistIface::startup() started
0: global: DistIface::startup() done
0: system.portlink3: DistEtherLink::startup() called
0: global: DistIface::startup() started
0: global: DistIface::startup() done
0: system.portlink4: DistEtherLink::startup() called
0: global: DistIface::startup() started
0: global: DistIface::startup() done
0: system.portlink5: DistEtherLink::startup() called
0: global: DistIface::startup() started
0: global: DistIface::startup() done
0: system.portlink6: DistEtherLink::startup() called
0: global: DistIface::startup() started
0: global: DistIface::startup() done
0: system.portlink7: DistEtherLink::startup() called
0: global: DistIface::startup() started
0: global: DistIface::startup() done
info: Entering event queue @ 0. Starting simulation...
panic: panic condition recv_tick <= curTick() occurred: Simulators out of sync - missed packet receive by 771635016399 ticks(rev_recv_tick: 0 send_tick: 4428354726000 send_delay: 257601 linkDelay: 10000000 )
Memory Usage: 402472 KBytes
Program aborted at tick 5200000000000
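
The numbers in the panic message are consistent with a simple reading: the packet was due at send tick + send delay + link delay, but the receiving process was already at tick 5200000000000 (the sync_start value shown in the config.ini posted earlier in the thread). The short Python check below reproduces the "missed by" figure under that assumption; it is arithmetic on the logged values, not a copy of gem5's internal code.

```python
# Values copied from the panic message above (all in ticks).
send_tick = 4428354726000
send_delay = 257601
link_delay = 10000000
abort_tick = 5200000000000  # "Program aborted at tick ..."

# Assumption: the packet is due at send tick + send delay + link delay.
expected_recv_tick = send_tick + send_delay + link_delay

# How far past the due time the receiver already was when it aborted.
missed_by = abort_tick - expected_recv_tick

print("expected receive tick:", expected_recv_tick)
print("missed by:", missed_by)
```

The result, 771635016399 ticks, matches the value in the panic message, which suggests the packet was sent well before the tick at which synchronization was scheduled to begin.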

Mohammad Alian
2017-12-11 03:53:45 UTC
Permalink
Oh, you should start synchronization between the gem5 nodes before
communication starts inside the simulated cluster. Use the "--dist-sync-start"
option to start synchronization before the send tick (4428354726000). You
should pass this option to all gem5 processes (FS nodes + switch node), so
set --dist-sync-start as a "--cf-args" argument in your launch script:

--cf-args --dist-sync-start=1000000000000
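
The constraint described here can be written down as a small check: the chosen sync start must be earlier than the first send tick reported in the panic. The Python sketch below is illustrative only; cf_args_for_sync_start is a hypothetical helper, and the two tick values are the ones quoted in this thread.

```python
def cf_args_for_sync_start(sync_start, first_send_tick):
    """Return the launch-script argument if sync starts before the first send."""
    if sync_start >= first_send_tick:
        raise ValueError(
            f"sync start {sync_start} is not before first send tick {first_send_tick}"
        )
    return f"--cf-args --dist-sync-start={sync_start}"


# Suggested sync start versus the send tick from the panic message.
print(cf_args_for_sync_start(1000000000000, 4428354726000))
```

With the global frequency of 1000000000000 ticks per second reported in the earlier log, the suggested value corresponds to 1 simulated second, while the first packet was sent at roughly 4.43 simulated seconds. The previous sync start of 5200000000000 fails this check, which matches the panic that was observed.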


Best,
Mohammad




On Fri, Dec 8, 2017 at 12:36 PM, Vitorio Cargnini (lcargnini) <
Post by Vitorio Cargnini (lcargnini)
Thanks Mohammad I made some changes and attempted again, it worked but
for some reason it simplies 
 dies after a while, not sure why.
0: global: DistIface::startup() done
panic: panic condition recv_tick <= curTick() occurred: Simulators out of
sync - missed packet receive by 771635016399 ticks(rev_recv_tick: 0
send_tick: 4428354726000 send_delay: 257601 linkDelay: 10000000 )
Memory Usage: 402472 KBytes
Program aborted at tick 5200000000000
**** REAL SIMULATION ****
0: system.portlink0: DistEtherLink::startup() called
0: global: DistIface::startup() started
DistIface::startup() done
10000000
0: system.portlink1: DistEtherLink::startup() called
0: global: DistIface::startup() started
0: global: DistIface::startup() done
0: system.portlink2: DistEtherLink::startup() called
0: global: DistIface::startup() started
0: global: DistIface::startup() done
0: system.portlink3: DistEtherLink::startup() called
0: global: DistIface::startup() started
0: global: DistIface::startup() done
0: system.portlink4: DistEtherLink::startup() called
0: global: DistIface::startup() started
0: global: DistIface::startup() done
0: system.portlink5: DistEtherLink::startup() called
0: global: DistIface::startup() started
0: global: DistIface::startup() done
0: system.portlink6: DistEtherLink::startup() called
0: global: DistIface::startup() started
0: global: DistIface::startup() done
0: system.portlink7: DistEtherLink::startup() called
0: global: DistIface::startup() started
0: global: DistIface::startup() done
panic: panic condition recv_tick <= curTick() occurred: Simulators out of
sync - missed packet receive by 771635016399 ticks(rev_recv_tick: 0
send_tick: 4428354726000 send_delay: 257601 linkDelay: 10000000 )
Memory Usage: 402472 KBytes
Program aborted at tick 5200000000000
Alian
*Sent:* Thursday, December 7, 2017 10:00 AM
*Subject:* Re: [gem5-users] [EXT] Re: Running Dist-gem5
Please look at the content of log.* not m5out.*/stats.txt . It's not
surprising that stats.txt is empty ...
On Thu, Dec 7, 2017 at 11:55 AM, Vitorio Cargnini (lcargnini) <
Hi,
The m5out.*/stats.txt from everyone are empty.
[system.portlink7]
type=DistEtherLink
delay=10000000
delay_var=0
dist_rank=0
dist_size=8
dist_sync_on_pseudo_op=false
dump=Null
eventq_index=0
is_switch=true
num_nodes=8
server_name=127.0.0.1
server_port=2200
speed=800.000000
sync_repeat=0
sync_start=5200000000000
int0=system.interface[7]
I’m thinking if the server_name could be the problem

Alian
*Sent:* Wednesday, December 6, 2017 4:28 PM
*Subject:* Re: [gem5-users] [EXT] Re: Running Dist-gem5
Again you need to look at log.* to find out why the simulation gets
killed. Don't only look at log.switch. If one of the gem5 processes aborts
then the entire dist-gem5 simulation will be killed.
On Wed, Dec 6, 2017 at 1:50 PM, Vitorio Cargnini (lcargnini) <
Hi Mohammad,
Thank you for the prompt response. I checked the log.switch the first
erros and I fixed was the path, the script needs full-paths to work, so, I
fixed that, once I tried again, it executed and failed a little later.
launch switch gem5 process on node0 ...
waiting for switch to start ..
node #switch started
START Wed Dec 6 12:36:04 MST 2017
starting gem5 on node0...
starting gem5 on node0...
starting gem5 on node1...
starting gem5 on node1...
starting gem5 on node2 ...
starting gem5 on node2 ...
starting gem5 on node3 ...
starting gem5 on node3 ...
(I) (some) gem5 process(es) exited
KILLED Wed Dec 6 12:37:35 MST 2017
ABORT Wed Dec 6 12:37:35 MST 2017
command line: /wada/wada/gem5/build/ARM/gem5.opt -d
/wada/wada/gem5/m5out.switch --debug-flags=DistEthernet
/wada/wada/gem5/configs/dist/sw.py --checkpoint-dir=/wada/wada/gem5/m5out.switch
--is-switch --dist-size=8 --dist-server-port=2200
info: Standard input is not a terminal, disabling listeners.
Global frequency set at 1000000000000 ticks per second
0: system.portlink0: DistEtherLink::DistEtherLink() link
delay:10000000 ticksPerByte:800
0: global: DistIface() ctor rank:0
info: tcp_iface listening on port 2200
Killed by signal 15.
Alian
*Sent:* Tuesday, December 5, 2017 9:18 PM
*Subject:* [EXT] Re: [gem5-users] Running Dist-gem5
Hi Vitorio,
You should check the content of log.switch and why gem5 node simulating
switch cannot start. There can be so many reasons that a gem5 process fails
to run. If you print the content of switch.log here then I can help.
Regarding "distributed run", you first need to setup passwordless ssh
between your simulation (physical) hosts and then use "LSB_MCPU_HOSTS" env
variable to assign gem5 processes to physical hosts. E.g. if your simulated
cluster size is 8 and you want to run 4 gem5 processes on host_name0 and 4
on host_name1, then your LSB_MCPU_HOSTS looks like this:
export LSB_MCPU_HOSTS="host_name0 4 host_name1 4"
Best,
Mohammad
On Tue, Dec 5, 2017 at 6:03 PM, Vitorio Cargnini (lcargnini) <
Hello,
Please, what exactly do I need to run dist-gem5 with the –-dist?
I’m trying, however it fails with “Failed ot start switch”
Also, what do I need in place for it start distributed acroos nodes,
instead of launching multiple/parallel runs in the ‘localhost’.
Regards,
Vitorio.
_______________________________________________
gem5-users mailing list
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
Vitorio Cargnini (lcargnini)
2017-12-11 18:40:03 UTC
Permalink
Thanks Mohammad,

I tried that and got the following in log.switch (below). I also tried passing the parameters in different orders. What should I look for now?

Log:
gem5 Simulator System. http://gem5.org
gem5 is copyrighted software; use the --copyright option for details.

gem5 compiled Dec 6 2017 14:35:43
gem5 started Dec 11 2017 11:16:10
gem5 executing on rndarch11, pid 9841
command line: /wada/gem5/build/ARM/gem5.opt -d /wada/gem5/m5out.switch --debug-flags=DistEthernet /wada/gem5/configs/dist/sw.py --dist-sync-start=1000000000000 --checkpoint-dir=/wada/gem5/m5out.switch --is-switch --dist-size=8 --dist-server-port=2200

info: Standard input is not a terminal, disabling listeners.
Global frequency set at 1000000000000 ticks per second
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/wada/gem5/src/python/m5/main.py", line 433, in main
    exec filecode in scope
  File "/wada/gem5/configs/dist/sw.py", line 79, in <module>
    main()
  File "/wada/gem5/configs/dist/sw.py", line 76, in main
    Simulation.run(options, root, None, None)
  File "/wada/gem5/configs/common/Simulation.py", line 589, in run
    m5.instantiate(checkpoint_dir)
  File "/wada/gem5/src/python/m5/simulate.py", line 115, in instantiate
    for obj in root.descendants(): obj.createCCObject()
  File "/wada/gem5/src/python/m5/SimObject.py", line 1484, in createCCObject
    self.getCCParams()
  File "/wada/gem5/src/python/m5/SimObject.py", line 1439, in getCCParams
    setattr(cc_params, param, value)
TypeError: (): incompatible function arguments. The following argument types are supported:
    1. (self: _m5.param_DistEtherLink.DistEtherLinkParams, arg0: int) -> None

Invoked with: <_m5.param_DistEtherLink.DistEtherLinkParams object at 0x7f9a37b8fd80>, 999999999999999983222784L

From: gem5-users [mailto:gem5-users-***@gem5.org] On Behalf Of Mohammad Alian
Sent: Sunday, December 10, 2017 7:54 PM
To: gem5 users mailing list <gem5-***@gem5.org>
Subject: Re: [gem5-users] [EXT] Re: Running Dist-gem5

Oh, you should start synchronization between gem5 nodes before you start communication inside the simulated cluster. Use "--dist-sync-start" option to start synchronization before send tick (4428354726000). You should pass this option to all gem5 processes (FS nodes + switch node). So you should set --dist-sync-start as a "--cf-args" argument in your launch script:

--cf-args --dist-sync-start=1000000000000


Best,
Mohammad




On Fri, Dec 8, 2017 at 12:36 PM, Vitorio Cargnini (lcargnini) <***@micron.com<mailto:***@micron.com>> wrote:
Thanks Mohammad. I made some changes and tried again. It ran, but for some reason it simply dies after a while, and I'm not sure why.

I got the following message on my terminal:
0: global: DistIface::startup() done
info: Entering event queue @ 0. Starting simulation...
panic: panic condition recv_tick <= curTick() occurred: Simulators out of sync - missed packet receive by 771635016399 ticks(rev_recv_tick: 0 send_tick: 4428354726000 send_delay: 257601 linkDelay: 10000000 )
Memory Usage: 402472 KBytes
Program aborted at tick 5200000000000




On log.switch this is what I got:

**** REAL SIMULATION ****
0: system.portlink0: DistEtherLink::startup() called
0: global: DistIface::startup() started
info: Dist sync scheduled at 5200000000000 and repeats 0: global: DistIface::startup() done
10000000
0: system.portlink1: DistEtherLink::startup() called
0: global: DistIface::startup() started
0: global: DistIface::startup() done
0: system.portlink2: DistEtherLink::startup() called
0: global: DistIface::startup() started
0: global: DistIface::startup() done
0: system.portlink3: DistEtherLink::startup() called
0: global: DistIface::startup() started
0: global: DistIface::startup() done
0: system.portlink4: DistEtherLink::startup() called
0: global: DistIface::startup() started
0: global: DistIface::startup() done
0: system.portlink5: DistEtherLink::startup() called
0: global: DistIface::startup() started
0: global: DistIface::startup() done
0: system.portlink6: DistEtherLink::startup() called
0: global: DistIface::startup() started
0: global: DistIface::startup() done
0: system.portlink7: DistEtherLink::startup() called
0: global: DistIface::startup() started
0: global: DistIface::startup() done
info: Entering event queue @ 0. Starting simulation...
panic: panic condition recv_tick <= curTick() occurred: Simulators out of sync - missed packet receive by 771635016399 ticks(rev_recv_tick: 0 send_tick: 4428354726000 send_delay: 257601 linkDelay: 10000000 )
Memory Usage: 402472 KBytes
Program aborted at tick 5200000000000
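
The numbers in the panic message are consistent with each other, which makes the message easier to read. A minimal sketch, assuming the expected receive tick is simply send_tick + send_delay + linkDelay (an inference from the logged values, not a statement about the gem5 source):

```python
# Values copied from the panic message in the log.
send_tick = 4428354726000
send_delay = 257601
link_delay = 10000000

# Tick at which the program aborted; it equals the sync_start value
# shown in the quoted config.ini.
abort_tick = 5200000000000

# Assumption: the packet should have been delivered at this tick.
expected_recv_tick = send_tick + send_delay + link_delay

# How far the simulation had already run past the delivery time.
missed_by = abort_tick - expected_recv_tick

print("expected receive tick:", expected_recv_tick)
print("missed by:", missed_by)
```

Under that assumption the gap comes out to the same 771635016399 ticks that the panic reports, i.e. the packet was sent well before synchronization started.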

From: gem5-users [mailto:gem5-users-***@gem5.org<mailto:gem5-users-***@gem5.org>] On Behalf Of Mohammad Alian
Sent: Thursday, December 7, 2017 10:00 AM

To: gem5 users mailing list <gem5-***@gem5.org<mailto:gem5-***@gem5.org>>
Subject: Re: [gem5-users] [EXT] Re: Running Dist-gem5

Please look at the content of log.* not m5out.*/stats.txt . It's not surprising that stats.txt is empty ...

On Thu, Dec 7, 2017 at 11:55 AM, Vitorio Cargnini (lcargnini) <***@micron.com<mailto:***@micron.com>> wrote:
Hi,

The m5out.*/stats.txt from everyone are empty.

However, the m5out.switch/config.ini is filled with:
It goes from 0 to 7:
[system.portlink7]
type=DistEtherLink
delay=10000000
delay_var=0
dist_rank=0
dist_size=8
dist_sync_on_pseudo_op=false
dump=Null
eventq_index=0
is_switch=true
num_nodes=8
server_name=127.0.0.1
server_port=2200
speed=800.000000
sync_repeat=0
sync_start=5200000000000
int0=system.interface[7]
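
For comparing these settings across all eight portlink sections (or across the config.ini of every node), a small script can help. This is only a sketch: it assumes config.ini is a plain INI file laid out like the excerpt above, and the helper name dist_link_settings is made up for illustration.

```python
import configparser


def dist_link_settings(config_text):
    """Return {section: (sync_start, server_name, server_port)} for
    every section whose type is DistEtherLink."""
    # interpolation=None keeps any '%' characters in values literal.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(config_text)
    settings = {}
    for section in parser.sections():
        if parser.get(section, "type", fallback="") != "DistEtherLink":
            continue
        settings[section] = (
            parser.getint(section, "sync_start"),
            parser.get(section, "server_name"),
            parser.getint(section, "server_port"),
        )
    return settings


# Two-section sample modelled on the excerpt above.
SAMPLE = """
[system.portlink6]
type=DistEtherLink
server_name=127.0.0.1
server_port=2200
sync_start=5200000000000

[system.portlink7]
type=DistEtherLink
server_name=127.0.0.1
server_port=2200
sync_start=5200000000000
int0=system.interface[7]
"""

for name, values in dist_link_settings(SAMPLE).items():
    print(name, values)
```

To use it on a real run, read the file first, e.g. `dist_link_settings(open("m5out.switch/config.ini").read())`.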

I’m thinking if the server_name could be the problem



From: gem5-users [mailto:gem5-users-***@gem5.org<mailto:gem5-users-***@gem5.org>] On Behalf Of Mohammad Alian
Sent: Wednesday, December 6, 2017 4:28 PM
To: gem5 users mailing list <gem5-***@gem5.org<mailto:gem5-***@gem5.org>>
Subject: Re: [gem5-users] [EXT] Re: Running Dist-gem5

Again you need to look at log.* to find out why the simulation gets killed. Don't only look at log.switch. If one of the gem5 processes aborts then the entire dist-gem5 simulation will be killed.
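
Checking every log.* file by hand gets tedious with eight nodes plus the switch. A rough sketch of automating it follows; the keyword list is an assumption based on the messages seen in this thread, and first_failures is a hypothetical helper, not part of gem5.

```python
from pathlib import Path

# Strings that marked a failure in the logs posted in this thread.
KEYWORDS = ("panic:", "fatal:", "Traceback", "Killed by signal")


def first_failures(run_dir, pattern="log.*"):
    """Map each matching log file name to (line number, text) of the
    first line that contains one of KEYWORDS."""
    hits = {}
    for path in sorted(Path(run_dir).glob(pattern)):
        if not path.is_file():
            continue
        with path.open(errors="replace") as handle:
            for lineno, line in enumerate(handle, 1):
                if any(keyword in line for keyword in KEYWORDS):
                    hits[path.name] = (lineno, line.rstrip())
                    break
    return hits


if __name__ == "__main__":
    # Scan the current directory; adjust to wherever the launch
    # script writes its log.* files.
    for name, (lineno, text) in first_failures(".").items():
        print(f"{name}:{lineno}: {text}")
```

A "Killed by signal 15" line usually only means that process was terminated after another one failed, so the log with a panic or traceback is the one to read first.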

On Wed, Dec 6, 2017 at 1:50 PM, Vitorio Cargnini (lcargnini) <***@micron.com<mailto:***@micron.com>> wrote:
Hi Mohammad,

Thank you for the prompt response. I checked the log.switch the first erros and I fixed was the path, the script needs full-paths to work, so, I fixed that, once I tried again, it executed and failed a little later.

Got the following output:
launch switch gem5 process on node0 ...
waiting for switch to start ..
node #switch started
START Wed Dec 6 12:36:04 MST 2017
starting gem5 on node0...
starting gem5 on node0...
starting gem5 on node1...
starting gem5 on node1...
starting gem5 on node2 ...
starting gem5 on node2 ...
starting gem5 on node3 ...
starting gem5 on node3 ...
(I) (some) gem5 process(es) exited
KILLED Wed Dec 6 12:37:35 MST 2017
ABORT Wed Dec 6 12:37:35 MST 2017

The log.switch had the following:
command line: /wada/wada/gem5/build/ARM/gem5.opt -d /wada/wada/gem5/m5out.switch --debug-flags=DistEthernet /wada/wada/gem5/configs/dist/sw.py --checkpoint-dir=/wada/wada/gem5/m5out.switch --is-switch --dist-size=8 --dist-server-port=2200

info: Standard input is not a terminal, disabling listeners.
Global frequency set at 1000000000000 ticks per second
0: system.portlink0: DistEtherLink::DistEtherLink() link delay:10000000 ticksPerByte:800
0: global: DistIface() ctor rank:0
info: tcp_iface listening on port 2200
Killed by signal 15.

From: gem5-users [mailto:gem5-users-***@gem5.org<mailto:gem5-users-***@gem5.org>] On Behalf Of Mohammad Alian
Sent: Tuesday, December 5, 2017 9:18 PM
To: gem5 users mailing list <gem5-***@gem5.org<mailto:gem5-***@gem5.org>>
Subject: [EXT] Re: [gem5-users] Running Dist-gem5

Hi Vitorio,

You should check the content of log.switch and why gem5 node simulating switch cannot start. There can be so many reasons that a gem5 process fails to run. If you print the content of switch.log here then I can help.

Regarding "distributed run", you first need to setup passwordless ssh between your simulation (physical) hosts and then use "LSB_MCPU_HOSTS" env variable to assign gem5 processes to physical hosts. E.g. if your simulated cluster size is 8 and you want to run 4 gem5 processes on host_name0 and 4 on host_name1, then your LSB_MCPU_HOSTS looks like this:

export LSB_MCPU_HOSTS="host_name0 4 host_name1 4"
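
To make the format concrete: the variable is a flat list of "host count" pairs. The sketch below expands it into one host name per gem5 process. How the launch script actually maps processes to hosts is not shown in this thread, so the ordering here is an assumption, and expand_hosts is a hypothetical helper.

```python
def expand_hosts(spec):
    """Expand 'hostA 4 hostB 4' into a list with one entry per process."""
    fields = spec.split()
    if len(fields) % 2 != 0:
        raise ValueError("expected alternating host names and counts")
    hosts = []
    # Walk the fields two at a time: (name, count), (name, count), ...
    for name, count in zip(fields[0::2], fields[1::2]):
        hosts.extend([name] * int(count))
    return hosts


assignment = expand_hosts("host_name0 4 host_name1 4")
for rank, host in enumerate(assignment):
    print(rank, host)
```

The total of the counts should match the simulated cluster size (8 in this example).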


Best,
Mohammad


On Tue, Dec 5, 2017 at 6:03 PM, Vitorio Cargnini (lcargnini) <***@micron.com<mailto:***@micron.com>> wrote:
Hello,

Please, what exactly do I need to run dist-gem5 with the –-dist?

I’m trying, however it fails with “Failed ot start switch”

Also, what do I need in place for it start distributed acroos nodes, instead of launching multiple/parallel runs in the ‘localhost’.

Regards,
Vitorio.









_______________________________________________
gem5-users mailing list
gem5-***@gem5.org<mailto:gem5-***@gem5.org>
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users


Mohammad Alian
2017-12-12 02:02:54 UTC
Permalink
add a "t" at the end of tick number:

--dist-sync-start=1000000000000t
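
One plausible reading of the TypeError reported in the previous message, stated as an assumption rather than taken from the gem5 source: a bare number is treated as seconds and scaled by the global frequency from the log (1000000000000 ticks per second), which no longer fits in a 64-bit tick count, while the "t" suffix marks the value as already being in ticks. The converter below is hypothetical and only illustrates the arithmetic.

```python
# "Global frequency set at 1000000000000 ticks per second" in the log.
TICKS_PER_SECOND = 1000000000000
UINT64_MAX = 2**64 - 1


def to_ticks(value):
    """Hypothetical converter: a trailing 't' means ticks, a bare
    number is assumed to mean seconds."""
    if value.endswith("t"):
        return int(value[:-1])
    return int(float(value) * TICKS_PER_SECOND)


bare = to_ticks("1000000000000")
suffixed = to_ticks("1000000000000t")

print(bare, "fits in 64 bits:", bare <= UINT64_MAX)
print(suffixed, "fits in 64 bits:", suffixed <= UINT64_MAX)
```

With this reading the bare value becomes 999999999999999983222784, the same number that appears in the "Invoked with" line of the traceback.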

Vitorio Cargnini (lcargnini)
2017-12-12 21:02:19 UTC
Permalink
Thanks Mohammad,

Now it executed flawlessly.

Appreciate all the assistance.

Now I'm going to play with some benchmarks.

Regards,
Vitorio.

Vitorio Cargnini (lcargnini)
2017-12-15 20:20:01 UTC
Permalink
Thanks Mohammad,

Everything is working now.

However, one question: how can I capture the ‘virtual terminal’? The script I passed was supposed to run /bin/hostname and /bin/date and then call m5 exit, so where can I observe the terminal output?

Also, can I use checkpoints from my rcS script?
I have already added them; however, I’m not sure whether that will work.

One last thing: once it is up and running, can I connect using m5term?

Regards,
Vitorio.



From: gem5-users [mailto:gem5-users-***@gem5.org] On Behalf Of Mohammad Alian
Sent: Monday, December 11, 2017 6:03 PM
To: gem5 users mailing list <gem5-***@gem5.org>
Subject: Re: [gem5-users] [EXT] Re: Running Dist-gem5

add a "t" at the end of tick number:

--dist-sync-start=1000000000000t

On Mon, Dec 11, 2017 at 12:40 PM, Vitorio Cargnini (lcargnini) <***@micron.com<mailto:***@micron.com>> wrote:
Thanks Mohammad,

I tried that and got the following in log.switch (below). I also tried passing the parameters in different orders. What should I look for now?

Log:
gem5 Simulator System. http://gem5.org
gem5 is copyrighted software; use the --copyright option for details.

gem5 compiled Dec 6 2017 14:35:43
gem5 started Dec 11 2017 11:16:10
gem5 executing on rndarch11, pid 9841
command line: /wada/gem5/build/ARM/gem5.opt -d /wada/gem5/m5out.switch --debug-flags=DistEthernet /wada/gem5/configs/dist/sw.py --dist-sync-start=1000000000000 --checkpoint-dir=/wada/gem5/m5out.switch --is-switch --dist-size=8 --dist-server-port=2200

info: Standard input is not a terminal, disabling listeners.
Global frequency set at 1000000000000 ticks per second
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/wada/gem5/src/python/m5/main.py", line 433, in main
    exec filecode in scope
  File "/wada/gem5/configs/dist/sw.py", line 79, in <module>
    main()
  File "/wada/gem5/configs/dist/sw.py", line 76, in main
    Simulation.run(options, root, None, None)
  File "/wada/gem5/configs/common/Simulation.py", line 589, in run
    m5.instantiate(checkpoint_dir)
  File "/wada/gem5/src/python/m5/simulate.py", line 115, in instantiate
    for obj in root.descendants(): obj.createCCObject()
  File "/wada/gem5/src/python/m5/SimObject.py", line 1484, in createCCObject
    self.getCCParams()
  File "/wada/gem5/src/python/m5/SimObject.py", line 1439, in getCCParams
    setattr(cc_params, param, value)
TypeError: (): incompatible function arguments. The following argument types are supported:
    1. (self: _m5.param_DistEtherLink.DistEtherLinkParams, arg0: int) -> None

Invoked with: <_m5.param_DistEtherLink.DistEtherLinkParams object at 0x7f9a37b8fd80>, 999999999999999983222784L

From: gem5-users [mailto:gem5-users-***@gem5.org] On Behalf Of Mohammad Alian
Sent: Sunday, December 10, 2017 7:54 PM
To: gem5 users mailing list <gem5-***@gem5.org>
Subject: Re: [gem5-users] [EXT] Re: Running Dist-gem5

Oh, you should start synchronization between the gem5 nodes before communication starts inside the simulated cluster. Use the "--dist-sync-start" option to start synchronization before the send tick (4428354726000). You should pass this option to all gem5 processes (FS nodes + switch node), so set --dist-sync-start as a "--cf-args" argument in your launch script:

--cf-args --dist-sync-start=1000000000000
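
The numbers in the panic message quoted below fit this explanation. Assuming the expected receive tick is send_tick + send_delay + linkDelay (an assumption that matches the reported figures, not something taken from the gem5 source), the packet was due well before the first sync at tick 5200000000000. A small Python check of the arithmetic:

```python
# Values copied from the panic message and from sync_start in config.ini.
send_tick = 4428354726000
send_delay = 257601
link_delay = 10000000
first_sync_tick = 5200000000000   # sync_start; also "Program aborted at tick"

# Assumed relation for this sketch: the tick at which the packet should arrive.
recv_tick = send_tick + send_delay + link_delay

# How far past the due time the receiver already was at the first sync.
missed_by = first_sync_tick - recv_tick
print(recv_tick)   # 4428364983601
print(missed_by)   # 771635016399, matching "missed packet receive by ..."

# A sync start of 1000000000000 ticks comes before the first send.
new_sync_start = 1000000000000
print(new_sync_start < send_tick)   # True
```

So any sync start earlier than the first send tick avoids this particular panic; 1000000000000 ticks is simply one such value.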


Best,
Mohammad




On Fri, Dec 8, 2017 at 12:36 PM, Vitorio Cargnini (lcargnini) <***@micron.com> wrote:
Thanks Mohammad. I made some changes and tried again. It worked, but for some reason it simply dies after a while, and I'm not sure why.

I got the following message on my terminal:
0: global: DistIface::startup() done
info: Entering event queue @ 0. Starting simulation...
panic: panic condition recv_tick <= curTick() occurred: Simulators out of sync - missed packet receive by 771635016399 ticks(rev_recv_tick: 0 send_tick: 4428354726000 send_delay: 257601 linkDelay: 10000000 )
Memory Usage: 402472 KBytes
Program aborted at tick 5200000000000




In log.switch this is what I got:

**** REAL SIMULATION ****
0: system.portlink0: DistEtherLink::startup() called
0: global: DistIface::startup() started
info: Dist sync scheduled at 5200000000000 and repeats 0: global: DistIface::startup() done
10000000
0: system.portlink1: DistEtherLink::startup() called
0: global: DistIface::startup() started
0: global: DistIface::startup() done
0: system.portlink2: DistEtherLink::startup() called
0: global: DistIface::startup() started
0: global: DistIface::startup() done
0: system.portlink3: DistEtherLink::startup() called
0: global: DistIface::startup() started
0: global: DistIface::startup() done
0: system.portlink4: DistEtherLink::startup() called
0: global: DistIface::startup() started
0: global: DistIface::startup() done
0: system.portlink5: DistEtherLink::startup() called
0: global: DistIface::startup() started
0: global: DistIface::startup() done
0: system.portlink6: DistEtherLink::startup() called
0: global: DistIface::startup() started
0: global: DistIface::startup() done
0: system.portlink7: DistEtherLink::startup() called
0: global: DistIface::startup() started
0: global: DistIface::startup() done
info: Entering event queue @ 0. Starting simulation...
panic: panic condition recv_tick <= curTick() occurred: Simulators out of sync - missed packet receive by 771635016399 ticks(rev_recv_tick: 0 send_tick: 4428354726000 send_delay: 257601 linkDelay: 10000000 )
Memory Usage: 402472 KBytes
Program aborted at tick 5200000000000

From: gem5-users [mailto:gem5-users-***@gem5.org] On Behalf Of Mohammad Alian
Sent: Thursday, December 7, 2017 10:00 AM
To: gem5 users mailing list <gem5-***@gem5.org>
Subject: Re: [gem5-users] [EXT] Re: Running Dist-gem5

Please look at the content of log.*, not m5out.*/stats.txt. It's not surprising that stats.txt is empty ...

On Thu, Dec 7, 2017 at 11:55 AM, Vitorio Cargnini (lcargnini) <***@micron.com> wrote:
Hi,

The m5out.*/stats.txt files from all nodes are empty.

However, m5out.switch/config.ini is populated.
It has one section like this for each port link, from 0 to 7:
[system.portlink7]
type=DistEtherLink
delay=10000000
delay_var=0
dist_rank=0
dist_size=8
dist_sync_on_pseudo_op=false
dump=Null
eventq_index=0
is_switch=true
num_nodes=8
server_name=127.0.0.1
server_port=2200
speed=800.000000
sync_repeat=0
sync_start=5200000000000
int0=system.interface[7]

I’m wondering whether the server_name could be the problem.



From: gem5-users [mailto:gem5-users-***@gem5.org] On Behalf Of Mohammad Alian
Sent: Wednesday, December 6, 2017 4:28 PM
To: gem5 users mailing list <gem5-***@gem5.org>
Subject: Re: [gem5-users] [EXT] Re: Running Dist-gem5

Again you need to look at log.* to find out why the simulation gets killed. Don't only look at log.switch. If one of the gem5 processes aborts then the entire dist-gem5 simulation will be killed.
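
One quick way to follow this advice is to scan every log.* file for the first line that looks like a failure. The sketch below assumes the logs sit together in one run directory; only log.switch is named in this thread, so the other file names are whatever the log.* pattern matches. The marker list is built from messages quoted in this thread plus gem5's usual "fatal:" prefix, and the function name is made up.

```python
import glob
import os

# Markers based on messages seen in this thread; extend as needed.
FAILURE_MARKERS = ("panic:", "fatal:", "Traceback", "Killed by signal")


def first_failures(run_dir: str) -> dict:
    """Map each log.* file name to its first failure-looking line, if any."""
    found = {}
    for path in sorted(glob.glob(os.path.join(run_dir, "log.*"))):
        with open(path, errors="replace") as log:
            for line in log:
                if any(marker in line for marker in FAILURE_MARKERS):
                    found[os.path.basename(path)] = line.strip()
                    break
    return found


if __name__ == "__main__":
    # Scan the current directory and print one line per failing process.
    for name, line in first_failures(".").items():
        print(f"{name}: {line}")
```

A log that only shows "Killed by signal 15." was probably terminated because some other process failed first, so the logs with a panic or a traceback are the ones to read.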

On Wed, Dec 6, 2017 at 1:50 PM, Vitorio Cargnini (lcargnini) <***@micron.com> wrote:
Hi Mohammad,

Thank you for the prompt response. I checked log.switch; the first error I fixed was the path. The script needs full paths to work, so I fixed that. When I tried again, it ran and then failed a little later.

Got the following output:
launch switch gem5 process on node0 ...
waiting for switch to start ..
node #switch started
START Wed Dec 6 12:36:04 MST 2017
starting gem5 on node0...
starting gem5 on node0...
starting gem5 on node1...
starting gem5 on node1...
starting gem5 on node2 ...
starting gem5 on node2 ...
starting gem5 on node3 ...
starting gem5 on node3 ...
(I) (some) gem5 process(es) exited
KILLED Wed Dec 6 12:37:35 MST 2017
ABORT Wed Dec 6 12:37:35 MST 2017

The log.switch had the following:
command line: /wada/wada/gem5/build/ARM/gem5.opt -d /wada/wada/gem5/m5out.switch --debug-flags=DistEthernet /wada/wada/gem5/configs/dist/sw.py --checkpoint-dir=/wada/wada/gem5/m5out.switch --is-switch --dist-size=8 --dist-server-port=2200

info: Standard input is not a terminal, disabling listeners.
Global frequency set at 1000000000000 ticks per second
0: system.portlink0: DistEtherLink::DistEtherLink() link delay:10000000 ticksPerByte:800
0: global: DistIface() ctor rank:0
info: tcp_iface listening on port 2200
Killed by signal 15.

From: gem5-users [mailto:gem5-users-***@gem5.org] On Behalf Of Mohammad Alian
Sent: Tuesday, December 5, 2017 9:18 PM
To: gem5 users mailing list <gem5-***@gem5.org>
Subject: [EXT] Re: [gem5-users] Running Dist-gem5

Hi Vitorio,

You should check the content of log.switch to see why the gem5 process simulating the switch cannot start. There can be many reasons why a gem5 process fails to run. If you post the content of log.switch here, I can help.

Regarding the "distributed run", you first need to set up passwordless ssh between your simulation (physical) hosts and then use the "LSB_MCPU_HOSTS" env variable to assign gem5 processes to physical hosts. E.g. if your simulated cluster size is 8 and you want to run 4 gem5 processes on host_name0 and 4 on host_name1, then your LSB_MCPU_HOSTS looks like this:

export LSB_MCPU_HOSTS="host_name0 4 host_name1 4"
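
To sanity-check the variable before launching, the host/count pairs can be expanded and compared against the cluster size. The sketch below only reflects the "host count host count" format described above; the function name is hypothetical, and this is not the launch script's own parsing code.

```python
from typing import List


def expand_hosts(lsb_mcpu_hosts: str) -> List[str]:
    """Expand 'hostA 4 hostB 4' into one host name per gem5 process."""
    fields = lsb_mcpu_hosts.split()
    if len(fields) % 2 != 0:
        raise ValueError("expected alternating host names and counts")
    hosts = []
    for name, count in zip(fields[0::2], fields[1::2]):
        hosts.extend([name] * int(count))
    return hosts


assignment = expand_hosts("host_name0 4 host_name1 4")
print(len(assignment))                 # 8, one slot per simulated node
print(assignment[0], assignment[-1])   # host_name0 host_name1
```

If the number of slots does not match the simulated cluster size, the value of LSB_MCPU_HOSTS is worth a second look before starting the run.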


Best,
Mohammad


On Tue, Dec 5, 2017 at 6:03 PM, Vitorio Cargnini (lcargnini) <***@micron.com> wrote:
Hello,

Please, what exactly do I need to run dist-gem5 with --dist?

I’m trying; however, it fails with “Failed to start switch”.

Also, what do I need in place for it to start distributed across nodes, instead of launching multiple parallel runs on localhost?

Regards,
Vitorio.









Mohammad Alian
2017-12-16 18:16:04 UTC
Permalink
On Fri, Dec 15, 2017 at 2:20 PM, Vitorio Cargnini (lcargnini) <
Post by Vitorio Cargnini (lcargnini)
Thanks Mohammad,
Everything is working now.
However, one question: how can I capture the ‘virtual terminal’? The script
I passed was supposed to run /bin/hostname and /bin/date, then
call m5 exit. Where can I observe the terminal output?
You can check terminal output from: m5out.*/system.terminal
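
For example, a few lines of Python can gather the console output of every simulated node in one place. This assumes the output directories sit under one run directory and match the m5out.* pattern mentioned above; the function name is made up for illustration.

```python
import glob
import os


def read_terminals(run_dir: str) -> dict:
    """Return {output directory name: console text} for each m5out.*/system.terminal."""
    outputs = {}
    pattern = os.path.join(run_dir, "m5out.*", "system.terminal")
    for path in sorted(glob.glob(pattern)):
        node_dir = os.path.basename(os.path.dirname(path))
        with open(path, errors="replace") as term:
            outputs[node_dir] = term.read()
    return outputs


if __name__ == "__main__":
    # Print the captured console of each node found under the current directory.
    for node_dir, text in read_terminals(".").items():
        print(f"===== {node_dir} =====")
        print(text)
```

The output of commands such as /bin/hostname and /bin/date run from the rcS script should show up in these files.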
Post by Vitorio Cargnini (lcargnini)
Also, can I use checkpoints from my rcS script?
I already added one. However, I’m not sure whether that will work.
Sure. Check $CKPT_DIR/m5out.* to see if you have successfully dumped a
checkpoint. You can also check log.* and look for a checkpoint dump
printout (e.g. "Writing checkpoint" and "info: m5 checkpoint called with
non-zero delay => triggering immediate checkpoint (at the next sync)").

Please note that in dist-gem5 you can take a checkpoint in two flavors. You
should choose one based on your needs.

1- collaborative:

When all nodes in the simulated cluster have called "/sbin/m5 checkpoint",
dist-gem5 starts dumping a checkpoint.

2- immediate:

Only one node calls "/sbin/m5 checkpoint <some delay (e.g. 1)>" and
dist-gem5 starts dumping a checkpoint at the beginning of the next sync quantum.
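
The difference between the two flavors can be written down as a tiny decision rule. This is only a toy model of the behaviour described above, with hypothetical names; it is not how dist-gem5 implements it.

```python
from typing import Dict


def checkpoint_triggered(requests: Dict[int, int], num_nodes: int) -> bool:
    """Decide whether a checkpoint would be started.

    requests maps node rank -> delay argument passed to "m5 checkpoint"
    (0 when the command was called without a delay).
    """
    # Immediate: a single request with a non-zero delay is enough;
    # the dump then starts at the next sync quantum.
    if any(delay != 0 for delay in requests.values()):
        return True
    # Collaborative: every node must have asked for the checkpoint.
    return len(requests) == num_nodes


print(checkpoint_triggered({0: 0, 1: 0}, num_nodes=8))              # False
print(checkpoint_triggered({r: 0 for r in range(8)}, num_nodes=8))  # True
print(checkpoint_triggered({3: 1}, num_nodes=8))                    # True
```

In practice this means an rcS script that runs on every node can use the plain call, while a script that runs on only one node needs the delay argument.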

Post by Vitorio Cargnini (lcargnini)
Last thing: once it is up and running, can I connect using m5term?
It should be possible, although not very useful: you should try to work with
rcS scripts, as they are the most efficient way of communicating with gem5. But I
guess that when you redirect the output of gem5 to a file (which is the case for
dist-gem5), gem5 automatically disables all listeners, including the
sockets for connecting m5term. If that's the case (most likely), you should
first find a way to re-enable the listeners before you try to connect m5term.
Vitorio Cargnini (lcargnini)
2017-12-20 02:23:00 UTC
Permalink
Thanks Mohammad,
I really appreciate all the help. I’ll work on dist-gem5 now and test some more: a new kernel, new images, new DTBs in dist mode, and new scripts, and I’ll take some measurements.

If I hit another wall, I’ll ping again. Also, if I can help with anything, let me know. In addition, if you have finished your graduate studies, are near the end, or are just looking, ping me off-line.

Best Regards,
Luis Vitorio Cargnini.
