Discussion:
Running multi-threaded workloads in O3 CPU
(too old to reply)
abhishek rawat
2010-04-16 03:59:37 UTC
Permalink
Hi,
I have this doubt related to O3 SMT model in ALPHA_SE mode. So I understand
that for running multi-programmed benchmarks one can pass a vector of
benchmarks as the parameters and set the cpu.numThreads to the size of
vector. But is it possible for the O3 CPU to extract the TLP in a (single)
given multithreaded benchmark. So what I am trying to say is lets say I just
want to run a single multi-threaded benchmark and just by specifying the
number of SMT threads (say 4 SMT threads) can the O3 CPU extract the threads
from the benchmark and run in a SMT fashion? I would appreciate any
directions in this regard.

Something tells me that it is not possible. And if that is the case then how
do I compare different SMT configurations for a given benchmark ? So the
Idea is that I want to change the SMT configuration for a O3 CPU and compare
the preformance of a given workload.

Thanks,
Abhishek
--
Korey Sewell
2010-04-16 04:26:50 UTC
Permalink
Post by abhishek rawat
I have this doubt related to O3 SMT model in ALPHA_SE mode. So I understand
that for running multi-programmed benchmarks one can pass a vector of
benchmarks as the parameters and set the cpu.numThreads to the size of
vector. But is it possible for the O3 CPU to extract the TLP in a (single)
given multithreaded benchmark. So what I am trying to say is lets say I just
want to run a single multi-threaded benchmark and just by specifying the
number of SMT threads (say 4 SMT threads) can the O3 CPU extract the threads
from the benchmark and run in a SMT fashion? I would appreciate any
directions in this regard.
As far as I know, O3CPU will only run the threads that you give it, so your
actual benchmark needs to specify how many threads are going to be run in
parallel in order for you to see any benefit in SE_SMT. The problem of
automatically extracting TLP is in fact it's own subset of the research
community, so if you wish, I'm sure you can find plenty of literature on
that topic.
Post by abhishek rawat
Something tells me that it is not possible. And if that is the case then
how do I compare different SMT configurations for a given benchmark ?
That's a bit of a abstract question and on the M5-side, I'm not sure there
is a definite *right* answer here just user-dependent ones. For some people,
comparing a 2-threaded SMT core v. a 4-threaded SMT core is fine. For
others, comparing a 2T-SMT core vs. 2 CMP cores is their objective.

Considering a configuration to be the benchmark running and the hardware
that it is to be run on, It's up to the user to specify different M5
configurations that are interesting to them, run M5 for different
configurations, figure out what configurations are comparable to each other,
and then decide what to do next with their results.
--
- Korey
abhishek rawat
2010-04-16 04:43:18 UTC
Permalink
Thanks for the reply. I have a follow up question. So, lets say I am only
using one O3 core and If I want to run 2-threaded SMT that means I will have
to provide as input for instance 2 FFT benchmarks. And for 4-threaded SMT
similarly I will have to provide as input 4 FFT benchmark. In this case how
can I do the comparison as the benchmarks have changed, in size. So my
question was can I just provide lets say 1 FFT benchmark as input and then
try to run it with 2 threaded-SMT and then with 4 threaded-SMT
configuration. And I think you are saying it is not possible.

Thanks,
--Abhishek
Post by abhishek rawat
I have this doubt related to O3 SMT model in ALPHA_SE mode. So I understand
Post by abhishek rawat
that for running multi-programmed benchmarks one can pass a vector of
benchmarks as the parameters and set the cpu.numThreads to the size of
vector. But is it possible for the O3 CPU to extract the TLP in a (single)
given multithreaded benchmark. So what I am trying to say is lets say I just
want to run a single multi-threaded benchmark and just by specifying the
number of SMT threads (say 4 SMT threads) can the O3 CPU extract the threads
from the benchmark and run in a SMT fashion? I would appreciate any
directions in this regard.
As far as I know, O3CPU will only run the threads that you give it, so your
actual benchmark needs to specify how many threads are going to be run in
parallel in order for you to see any benefit in SE_SMT. The problem of
automatically extracting TLP is in fact it's own subset of the research
community, so if you wish, I'm sure you can find plenty of literature on
that topic.
Post by abhishek rawat
Something tells me that it is not possible. And if that is the case then
how do I compare different SMT configurations for a given benchmark ?
That's a bit of a abstract question and on the M5-side, I'm not sure there
is a definite *right* answer here just user-dependent ones. For some people,
comparing a 2-threaded SMT core v. a 4-threaded SMT core is fine. For
others, comparing a 2T-SMT core vs. 2 CMP cores is their objective.
Considering a configuration to be the benchmark running and the hardware
that it is to be run on, It's up to the user to specify different M5
configurations that are interesting to them, run M5 for different
configurations, figure out what configurations are comparable to each other,
and then decide what to do next with their results.
--
- Korey
_______________________________________________
m5-users mailing list
http://m5sim.org/cgi-bin/mailman/listinfo/m5-users
--
-Abhi
Korey Sewell
2010-04-16 04:50:00 UTC
Permalink
Post by abhishek rawat
Thanks for the reply. I have a follow up question. So, lets say I am only
using one O3 core and If I want to run 2-threaded SMT that means I will have
to provide as input for instance 2 FFT benchmarks. And for 4-threaded SMT
similarly I will have to provide as input 4 FFT benchmark. In this case how
can I do the comparison as the benchmarks have changed, in size. So my
question was can I just provide lets say 1 FFT benchmark as input and then
try to run it with 2 threaded-SMT and then with 4 threaded-SMT
configuration. And I think you are saying it is not possible.
Unless your FFT has pthreads or some other multithreaded software that will
spawn threads for you, then no, I dont see it as possible. But, I'll let
someone else respond if they have additional thoughts.
--
- Korey
Steve Reinhardt
2010-04-16 12:50:50 UTC
Permalink
If the benchmark is explicitly multithreaded and compiled against the
m5threads library, then you can run it on an SMT O3 CPU and it should
work.

Steve
Post by abhishek rawat
Thanks for the reply. I have a follow up question. So, lets say I am only
using one O3 core and If I want to run 2-threaded SMT that means I will have
to provide as input for instance 2 FFT  benchmarks. And for 4-threaded SMT
similarly I will have to provide as input 4 FFT benchmark. In this case how
can I do the comparison as the benchmarks have changed, in size. So my
question was can I just provide lets say 1 FFT benchmark as input and then
try to run it with 2 threaded-SMT and then with 4 threaded-SMT
configuration. And I think you are saying it is not possible.
Thanks,
--Abhishek
Post by Korey Sewell
Post by abhishek rawat
I have this doubt related to O3 SMT model in ALPHA_SE mode. So I
understand that for running multi-programmed benchmarks one can pass a
vector of benchmarks as the parameters and set the cpu.numThreads to the
size of vector. But is it possible for the O3 CPU to extract the TLP in a
(single) given multithreaded benchmark. So what I am trying to say is lets
say I just want to run a single multi-threaded benchmark and just by
specifying the number of SMT threads (say 4 SMT threads) can the O3 CPU
extract the threads from the benchmark and run in a SMT fashion? I would
appreciate any directions in this regard.
As far as I know, O3CPU will only run the threads that you give it, so
your actual benchmark needs to specify how many threads are going to be run
in parallel in order for you to see any benefit in SE_SMT.  The problem of
automatically extracting TLP is in fact it's own subset of the research
community, so if you wish, I'm sure you can find plenty of literature on
that topic.
Post by abhishek rawat
Something tells me that it is not possible. And if that is the case then
how do I compare different SMT configurations for a given benchmark ?
That's a bit of a abstract question and on the M5-side, I'm not sure there
is a definite *right* answer here just user-dependent ones. For some people,
comparing a 2-threaded SMT core v. a 4-threaded SMT core is fine. For
others, comparing a 2T-SMT core vs. 2 CMP cores is their objective.
Considering a configuration to be the benchmark running and the hardware
that it is to be run on, It's up to the user to specify different M5
configurations that are interesting to them, run M5 for different
configurations, figure out what configurations are comparable to each other,
and then decide what to do next with their results.
--
- Korey
_______________________________________________
m5-users mailing list
http://m5sim.org/cgi-bin/mailman/listinfo/m5-users
--
-Abhi
_______________________________________________
m5-users mailing list
http://m5sim.org/cgi-bin/mailman/listinfo/m5-users
abhishek rawat
2010-04-17 14:19:50 UTC
Permalink
Thanks Steve! I tried using the run.py in configs/splash2 directory but I
keep getting segmentation fault. I am running FFT benchmark (I downloaded
the binaries from the link provided at M5 wiki's splash benchmark's page). I
made following changes in my python script (run.py):

................................
class FFT(LiveProcess):
cwd = options.rootdir + '/kernels/fft'
executable = options.rootdir + '/kernels/fft/FFT'
cmd = ['FFT', '-p', str(*options.numcpus*options.smt*), '-m18']

..............................
..............................
cpus = [DerivO3CPU(cpu_id=i,
clock=options.frequency,
*numThreads=options.smt,
smtNumFetchingThreads=options.smt,
smtFetchPolicy='RoundRobin'*)
for i in xrange(options.numcpus)]

So, if I understood what Steve said in last mail then I think I am running
it correctly. I tried debugging but it didn't help. I am using a latest
version of M5 which I downloaded from the repositories about a week ago.
Also when I used smtNumFetchingThreads=1 and numThreads=1 then it works!!


This is what I got after running gdb:

: /af20/ar8eb/Desktop/CS8501/m5-ad784e759a74 ; gdb build/ALPHA_SE/m5.debug
GNU gdb 6.8-debian
................................................................
This GDB was configured as "x86_64-linux-gnu"...
(gdb) run configs/splash2/run_test.py -n 1 --smt=2
--rootdir=../m5-stable-94c016415053/v1-splash-alpha/splash2/codes/ -d -b FFT
....................................................................
command line:
/net/af20/ar8eb/Desktop/CS8501/m5-ad784e759a74/build/ALPHA_SE/m5.debug
configs/splash2/run_test.py -n 1 --smt=2
--rootdir=../m5-stable-94c016415053/v1-splash-alpha/splash2/codes/ -d -b FFT
Global frequency set at 1000000000000 ticks per second
0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000
0: system.remote_gdb.listener: listening for remote gdb #1 on port 7001

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f6d50dc76f0 (LWP 18930)]
0x00000000004522a0 in std::vector<int, std::allocator<int> >::push_back
(this=0x40, __x=@0x7fff58dea6b4) at
/usr/include/c++/4.2/bits/stl_vector.h:599
599 if (this->_M_impl._M_finish != this->_M_impl._M_end_of_storage)


Thanks,
Abhishek
Post by Steve Reinhardt
If the benchmark is explicitly multithreaded and compiled against the
m5threads library, then you can run it on an SMT O3 CPU and it should
work.
Steve
Post by abhishek rawat
Thanks for the reply. I have a follow up question. So, lets say I am only
using one O3 core and If I want to run 2-threaded SMT that means I will
have
Post by abhishek rawat
to provide as input for instance 2 FFT benchmarks. And for 4-threaded
SMT
Post by abhishek rawat
similarly I will have to provide as input 4 FFT benchmark. In this case
how
Post by abhishek rawat
can I do the comparison as the benchmarks have changed, in size. So my
question was can I just provide lets say 1 FFT benchmark as input and
then
Post by abhishek rawat
try to run it with 2 threaded-SMT and then with 4 threaded-SMT
configuration. And I think you are saying it is not possible.
Thanks,
--Abhishek
Post by Korey Sewell
Post by abhishek rawat
I have this doubt related to O3 SMT model in ALPHA_SE mode. So I
understand that for running multi-programmed benchmarks one can pass a
vector of benchmarks as the parameters and set the cpu.numThreads to
the
Post by abhishek rawat
Post by Korey Sewell
Post by abhishek rawat
size of vector. But is it possible for the O3 CPU to extract the TLP in
a
Post by abhishek rawat
Post by Korey Sewell
Post by abhishek rawat
(single) given multithreaded benchmark. So what I am trying to say is
lets
Post by abhishek rawat
Post by Korey Sewell
Post by abhishek rawat
say I just want to run a single multi-threaded benchmark and just by
specifying the number of SMT threads (say 4 SMT threads) can the O3 CPU
extract the threads from the benchmark and run in a SMT fashion? I
would
Post by abhishek rawat
Post by Korey Sewell
Post by abhishek rawat
appreciate any directions in this regard.
As far as I know, O3CPU will only run the threads that you give it, so
your actual benchmark needs to specify how many threads are going to be
run
Post by abhishek rawat
Post by Korey Sewell
in parallel in order for you to see any benefit in SE_SMT. The problem
of
Post by abhishek rawat
Post by Korey Sewell
automatically extracting TLP is in fact it's own subset of the research
community, so if you wish, I'm sure you can find plenty of literature on
that topic.
Post by abhishek rawat
Something tells me that it is not possible. And if that is the case
then
Post by abhishek rawat
Post by Korey Sewell
Post by abhishek rawat
how do I compare different SMT configurations for a given benchmark ?
That's a bit of a abstract question and on the M5-side, I'm not sure
there
Post by abhishek rawat
Post by Korey Sewell
is a definite *right* answer here just user-dependent ones. For some
people,
Post by abhishek rawat
Post by Korey Sewell
comparing a 2-threaded SMT core v. a 4-threaded SMT core is fine. For
others, comparing a 2T-SMT core vs. 2 CMP cores is their objective.
Considering a configuration to be the benchmark running and the hardware
that it is to be run on, It's up to the user to specify different M5
configurations that are interesting to them, run M5 for different
configurations, figure out what configurations are comparable to each
other,
Post by abhishek rawat
Post by Korey Sewell
and then decide what to do next with their results.
--
- Korey
_______________________________________________
m5-users mailing list
http://m5sim.org/cgi-bin/mailman/listinfo/m5-users
--
-Abhi
_______________________________________________
m5-users mailing list
http://m5sim.org/cgi-bin/mailman/listinfo/m5-users
_______________________________________________
m5-users mailing list
http://m5sim.org/cgi-bin/mailman/listinfo/m5-users
--
-Abhi
Continue reading on narkive:
Loading...