Discussion:
Micro-op Data Dependency
Alec Roelke
2016-07-27 16:10:15 UTC
Hello,

I'm trying to add an ISA to gem5 which has several atomic read-modify-write
instructions. Currently I have them implemented as pairs of micro-ops
which read data in the first operation and then modify-write in the
second. This works for the simple CPU model, but runs into trouble for the
minor and O3 models, which want to execute the modify-write half before the
load half is complete. I tried forcing both parts of the instruction to
have the same src and dest register indices, but that causes other problems
with the O3 model.

Is there a way to indicate that there is a data dependency between the two
micro-ops in the instruction? Or, better yet, is there a way I could
somehow have two memory accesses in one instruction without having to split
it into micro-ops?

Thanks,
Alec Roelke
Steve Reinhardt
2016-07-28 20:45:10 UTC
There are really two issues here, I think:

1. Managing the ordering of the two micro-ops in the pipeline, which seems
to be the issue you're facing.
2. Providing atomicity when you have multiple cores.

I'm surprised you're having problems with #1, because that's the easy part.
I'd assume that you'd have a direct data dependency between the micro-ops
(the load would write a register that the store reads, for the load to pass
data to the store) which should enforce ordering. In addition, since
they're both accessing the same memory location, there shouldn't be any
reordering of the memory operations either.

Providing atomicity in the memory system is the harder part. The x86 atomic
RMW memory ops are implemented by setting LOCKED_RMW on both the load and
store operations (see
http://grok.gem5.org/source/xref/gem5/src/mem/request.hh#145, as well
as src/arch/x86/isa/microops/ldstop.isa). This works with AtomicSimpleCPU
and with Ruby, but there is no support for enforcing this atomicity in the
classic cache in timing mode. I have a patch that provides this but you
have to apply it manually: http://reviews.gem5.org/r/2691.
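
To see why atomicity needs help from the memory system once several cores are involved, here is a toy Python sketch. It is not gem5 code: the `ToyMemory` class and its `locked_load`, `locked_store`, and `store` methods are hypothetical names that only mimic the idea of marking both halves of an RMW as locked. How gem5 actually honors LOCKED_RMW may differ in detail.

```python
class ToyMemory:
    """Toy memory that refuses accesses from other cores during a locked RMW."""

    def __init__(self):
        self.data = {}
        self.lock_owner = {}  # address -> id of the core holding the lock

    def locked_load(self, core, addr):
        # First half of the RMW: take the lock, then read the value.
        if self.lock_owner.get(addr, core) != core:
            return None  # another core holds the lock; the caller must retry
        self.lock_owner[addr] = core
        return self.data.get(addr, 0)

    def locked_store(self, core, addr, value):
        # Second half of the RMW: write the value and release the lock.
        assert self.lock_owner.get(addr) == core, "store without matching locked load"
        self.data[addr] = value
        del self.lock_owner[addr]

    def store(self, core, addr, value):
        # Ordinary store: refused while a different core holds the lock.
        if self.lock_owner.get(addr, core) != core:
            return False
        self.data[addr] = value
        return True


def atomic_add(mem, core, addr, operand):
    """Return the old value, or None if another core holds the lock."""
    old = mem.locked_load(core, addr)
    if old is None:
        return None
    mem.locked_store(core, addr, old + operand)
    return old
```

Without the lock, a store from a second core could land between the two halves and be silently overwritten by the first core's modify-write.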

Steve
_______________________________________________
gem5-users mailing list
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
Alec Roelke
2016-07-29 16:32:39 UTC
Yes, that sums up my issues. I haven't gotten to tackling the second one
yet; I'm still working on the first. Thanks for the patch link, though,
that should help a lot when I get to it.

To be more specific, I can get it to work with either the minor CPU model
or the O3 model, but not both at the same time. To get it to work with the
O3 model, I added the "IsNonSpeculative" flag to the modify-write micro-op,
which I assumed would prevent the O3 model from speculating on its
execution (which I also had to do with regular store instructions to ensure
that registers containing addresses would have the proper values when the
instruction executed). This works, but when I use it in the minor CPU
model, it issues the modify-write micro-op before the read micro-op
executes, meaning it hasn't loaded the memory address from the register
file yet and causes a segmentation fault.

I assume this is caused by the fact that the code for the read operation
doesn't reference any register, as the instruction writes the value that
was read from memory to a dest register before modifying it and writing it
back. Because the dest register can be the same as a source register, I
have to pass the memory value from the read micro-op to the modify-write
micro-op without writing it to a register to avoid potentially polluting
the data written back.
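
The aliasing concern can be made concrete with a toy Python model of a register file. This is a sketch, not gem5 code: the function names and the extra `tmp` register are assumptions for illustration, and the semantics assumed are an atomic add in which Rd receives the old memory value and memory receives the old value plus Rs2.

```python
def amo_add_naive(regs, mem, rd, rs1, rs2):
    # Load micro-op writes the loaded value straight into Rd.
    regs[rd] = mem[regs[rs1]]
    # Store micro-op then reads Rs2, which is already clobbered if rd == rs2.
    mem[regs[rs1]] = regs[rs2] + regs[rd]


def amo_add_with_temp(regs, mem, rd, rs1, rs2, tmp="tmp"):
    # Load micro-op writes a microcode-only temporary register instead.
    regs[tmp] = mem[regs[rs1]]
    # Store micro-op reads Rs2 and the temporary, then updates memory and Rd.
    mem[regs[rs1]] = regs[rs2] + regs[tmp]
    regs[rd] = regs[tmp]
```

With `rd` and `rs2` naming the same register, the naive version adds the loaded value to itself, while the version with a separate temporary keeps Rs2 intact until the sum has been computed.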

My fix was to explicitly set the source and dest registers of both
micro-ops to what was decoded by the macro-op so gem5 can infer
dependencies, but then when I try it using the O3 model, the modify-write
portion does not appear to actually write back to memory.
Steve Reinhardt
2016-07-29 18:50:30 UTC
I'm still confused about the problems you're having. Stores should never
be executed speculatively in O3, even without the non-speculative flag.
Also, assuming the store micro-op reads a register that is written by the
load micro-op, then that true data dependence through the intermediate
register should enforce an ordering. Whether that destination register is
also a source or not should be irrelevant, particularly in O3 where all the
registers get renamed anyway.
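
A toy rename table illustrates why it should not matter whether the destination is also a source. This is a sketch under simplifying assumptions: the `rename` helper and the `p0`, `p1`, ... physical register names are hypothetical and far simpler than O3's actual rename stage.

```python
import itertools


def rename(instructions):
    """Map architectural registers to fresh physical registers, one per write.

    Each instruction is (dest, sources). Sources are looked up before the
    destination is remapped, so an instruction that reads and writes the same
    architectural register sees the old mapping as its input.
    """
    fresh = (f"p{n}" for n in itertools.count())
    table = {}
    renamed = []
    for dest, sources in instructions:
        phys_sources = []
        for src in sources:
            if src not in table:
                table[src] = next(fresh)
            phys_sources.append(table[src])
        table[dest] = next(fresh)
        renamed.append((table[dest], tuple(phys_sources)))
    return renamed
```

For a load that writes a temporary `t0` followed by a store half that both reads and writes `x2`, the second instruction ends up reading the physical register the first one wrote, and its own destination is a different physical register from any of its sources.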

Perhaps if you show some snippets of your actual code it will be clearer to
me what's going on.

Steve
Alec Roelke
2016-07-30 02:37:23 UTC
Sure, I can show some code snippets. First, here is the code for the read
micro-op for an atomic read-add-write:

temp = Mem_sd;

And the modify-write micro-op:

Rd_sd = temp;
Mem_sd = Rs2_sd + temp;

The memory address comes from Rs1. The variable "temp" is a temporary
location shared between the read and modify-write micro-ops (the address
from Rs1 is shared similarly to ensure it's the same when the instructions
are issued).

In the constructor for the macro-op, I've included some code that
explicitly sets the src and dest register indices so that they are
displayed properly for execution traces:

_numSrcRegs = 2;
_srcRegIdx[0] = RS1;
_srcRegIdx[1] = RS2;
_numDestRegs = 1;
_destRegIdx[0] = RD;

So far, this works for the O3 model. But, in the minor model, it tries to
execute the modify-write micro-op before the read micro-op is executed.
The address is never loaded from Rs1, and so a segmentation fault often
occurs. To try to fix it, I added this code to the constructors of each of
the two micro-ops:

_numSrcRegs = _p->_numSrcRegs;
for (int i = 0; i < _numSrcRegs; i++)
    _srcRegIdx[i] = _p->_srcRegIdx[i];
_numDestRegs = _p->_numDestRegs;
for (int i = 0; i < _numDestRegs; i++)
    _destRegIdx[i] = _p->_destRegIdx[i];

_p is a pointer to the "parent" macro-op. With this code, it works with
the minor model, but the final calculated value in the modify-write
micro-op never gets written at the end of the instruction in the O3 model.
Steve Reinhardt
2016-07-30 18:14:01 UTC
You shouldn't be passing values between micro-ops using C++ variables; you
should pass the data in a register. (If necessary, create microcode-only
temporary registers for this purpose, like x86 does.) This is
microarchitectural state, so you can't hide it from the CPU model. The main
problem here is that, since this "hidden" data dependency isn't visible to
the CPU model, it doesn't know that the micro-ops must be executed in
order. If you pass that data in a register, the pipeline model will
enforce the dependency.
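
The difference between a hidden C++ variable and a register can be sketched with a toy issue check in Python. This is not gem5 code: `MicroOp`, `can_issue`, and the register name `tmp` are hypothetical. The check sees only the declared source and destination registers, which is the only information a CPU model has about dependencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroOp:
    name: str
    srcs: frozenset
    dests: frozenset


def can_issue(op, in_flight):
    """An op may issue once no in-flight op still has to write one of its sources."""
    pending = set()
    for older in in_flight:
        pending |= older.dests
    return not (op.srcs & pending)


# Hidden variable: the store half declares no dependence on the load half.
load_hidden = MicroOp("load", frozenset({"rs1"}), frozenset())
store_hidden = MicroOp("store", frozenset({"rs1", "rs2"}), frozenset({"rd"}))

# Temporary register: the load writes tmp and the store reads it.
load_reg = MicroOp("load", frozenset({"rs1"}), frozenset({"tmp"}))
store_reg = MicroOp("store", frozenset({"rs1", "rs2", "tmp"}), frozenset({"rd"}))
```

In this model `store_hidden` is free to issue while `load_hidden` is still in flight, whereas `store_reg` has to wait until `load_reg` has written `tmp`.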

Also, where do you set the address for the memory accesses? Again, both
micro-ops should read that out of a register; it should not be passed
implicitly via hidden variables.

You shouldn't have to explicitly set the internal fields like _srcRegIdx
and _destRegIdx; the ISA parser should do that for you.

Unfortunately the ISA description system wasn't originally designed to
support microcode, and that support was kind of shoehorned in after the
fact, so it is a little messy. Is your whole ISA microcoded, or just a few
specific instructions?

Steve
Alec Roelke
2016-07-31 14:21:07 UTC
That makes sense. Would it be enough for me to just create a new IntReg
operand, like this:

'Rt': ('IntReg', 'ud', None, 'IsInteger', 4)

and then increase the number of integer registers? The other integer
operands take their index from a bit field in the instruction, but since
the ISA doesn't specify that these RMW instructions should be microcoded,
there's no way to decode a temporary register from the instruction bits.
Will gem5 understand that and pick any integer register that's available?

The memory address is taken from Rs1 before the load micro-op, and then
stored in a C++ variable for the remainder of the instruction. That was
done to ensure that other intervening instructions that might get executed
in the O3 model don't change Rs1 between the load and modify-write
micro-ops, but if I can get the temp register to work then that might fix
itself.

I was only setting _srcRegIdx and _destRegIdx for disassembly reasons;
since the macro-op and first micro-op don't make use of Rs2, the
instruction wasn't setting _srcRegIdx[1] and the disassembly would show
something like 4294967295. Setting them then also looked like a potential
solution to the minor CPU model problem I described before.

No, most of the ISA is not microcoded. In fact, as I said, these RMW
instructions are not specified to be microcoded by the ISA, but since they
each have two memory transactions they didn't appear to work unless I split
them into two micro-ops.
Post by Steve Reinhardt
You shouldn't be passing values between micro-ops using C++ variables, you
should pass the data in a register. (If necessary, create microcode-only
temporary registers for this purpose, like x86 does.) This is
microarchitectural state so you can't hide it from the CPU model. The main
problem here is that, since this "hidden" data dependency isn't visible to
the CPU model, it doesn't know that the micro-ops must be executed in
order. If you pass that data in a register, the pipeline model will
enforce the dependency.
Also, where do you set the address for the memory accesses? Again, both
micro-ops should read that out of a register, it should not be passed
implicitly via hidden variables.
You shouldn't have to explicitly set the internal fields like _srcRegIdx
and _destRegIdx, the ISA parser should do that for you.
Unfortunately the ISA description system wasn't originally designed to
support microcode, and that support was kind of shoehorned in after the
fact, so it is a little messy. Is your whole ISA microcoded, or just a few
specific instructions?
Steve
Post by Alec Roelke
Sure, I can show some code snippets. First, here is the code for the
temp = Mem_sd;
Rd_sd = temp;
Mem_sd = Rs2_sd + temp;
The memory address comes from Rs1. The variable "temp" is a temporary
location shared between the read and modify-write micro-ops (the address
from Rs1 is shared similarly to ensure it's the same when the instructions
are issued).
In the constructor for the macro-op, I've included some code that
explicitly sets the src and dest register indices so that they are
_numSrcRegs = 2;
_srcRegIdx[0] = RS1;
_srcRegIdx[1] = RS2;
_numDestRegs = 1;
_destRegIdx[0] = RD;
So far, this works for the O3 model. But, in the minor model, it tries
to execute the modify-write micro-op before the read micro-op is executed.
The address is never loaded from Rs1, and so a segmentation fault often
occurs. To try to fix it, I added this code to the constructors of each of
_numSrcRegs = _p->_numSrcRegs;
for (int i = 0; i < _numSrcRegs; i++)
_srcRegIdx[i] = _p->_srcRegIdx[i];
_numDestRegs = _p->_numDestRegs;
for (int i = 0; i < _numDestRegs; i++)
_destRegIdx[i] = _p->_destRegIdx[i];
_p is a pointer to the "parent" macro-op. With this code, it works with
minor model, but the final calculated value in the modify-write micro-op
never gets written at the end of the instruction in the O3 model.
Post by Steve Reinhardt
I'm still confused about the problems you're having. Stores should
never be executed speculatively in O3, even without the non-speculative
flag. Also, assuming the store micro-op reads a register that is written
by the load micro-op, then that true data dependence through the
intermediate register should enforce an ordering. Whether that destination
register is also a source or not should be irrelevant, particularly in O3
where all the registers get renamed anyway.
Perhaps if you show some snippets of your actual code it will be clearer
to me what's going on.
Steve
Post by Alec Roelke
Yes, that sums up my issues. I haven't gotten to tackling the second
one yet; I'm still working on the first. Thanks for the patch link,
though, that should help a lot when I get to it.
To be more specific, I can get it to work with either the minor CPU
model or the O3 model, but not both at the same time. To get it to work
with the O3 model, I added the "IsNonSpeculative" flag to the modify-write
micro-op, which I assumed would prevent the O3 model from speculating on
its execution (which I also had to do with regular store instructions to
ensure that registers containing addresses would have the proper values
when the instruction executed). This works, but when I use it in the minor
CPU model, it issues the modify-write micro-op before the read micro-op
executes, meaning it hasn't loaded the memory address from the register
file yet and causes a segmentation fault.
I assume this is caused by the fact that the code for the read
operation doesn't reference any register, as the instruction writes the
value that was read from memory to a dest register before modifying it and
writing it back. Because the dest register can be the same as a source
register, I have to pass the memory value from the read micro-op to the
modify-write micro-op without writing it to a register to avoid potentially
polluting the data written back.
My fix was to explicitly set the source and dest registers of both
micro-ops to what was decoded by the macro-op so GEM5 can infer
dependencies, but then when I try it using the O3 model, the modify-write
portion does not appear to actually write back to memory.
Steve Reinhardt
2016-08-01 14:58:01 UTC
Permalink
You don't need to worry about the size of the bitfield in the instruction
encoding, because the temporary register(s) will never be directly
addressed by any machine instruction. You should define a new
architectural register using an index that doesn't appear in any
instruction (e.g., if the ISA includes r0 to r31, then the temp reg can be
r32). This register will get renamed in the O3 model.

Steve
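A minimal sketch of the indexing idea in the message above, assuming the r0 to r31 example and a 5-bit register field in the encoding. The field width and all names here are assumptions for illustration, not gem5 identifiers.

```python
# Toy illustration only: register counts follow the r0..r31 example above;
# every name is hypothetical.
NUM_ARCH_INT_REGS = 32      # r0..r31, addressable by machine instructions
REG_FIELD_BITS = 5          # assumed width of a register field in the encoding

TEMP_REG_IDX = NUM_ARCH_INT_REGS        # r32: microcode-only temporary
NUM_INT_REGS = NUM_ARCH_INT_REGS + 1    # the register file grows by one entry


def encodable(reg_idx, field_bits=REG_FIELD_BITS):
    """Return True if reg_idx fits in an instruction's register field."""
    return 0 <= reg_idx < (1 << field_bits)


print(encodable(31))            # True: r31 can appear in an instruction
print(encodable(TEMP_REG_IDX))  # False: r32 can only be named by micro-ops
```

Because no machine instruction can name the extra index, ordinary code cannot clobber the temporary between the two micro-ops.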
Post by Alec Roelke
That makes sense. Would it be enough for me to just create a new IntReg
'Rt': ('IntReg', 'ud', None, 'IsInteger', 4)
and then increase the number of integer registers? The other integer
operands have a bit field from the instruction bits, but since the ISA
doesn't specify that these RMW instructions should be microcoded, there's
no way to decode a temporary register from the instruction bits. Will gem5
understand that and pick any integer register that's available?
The memory address is taken from Rs1 before the load micro-op, and then
stored in a C++ variable for the remainder of the instruction. That was
done to ensure that other intervening instructions that might get executed
in the O3 model don't change Rs1 between the load and modify-write
micro-ops, but if I can get the temp register to work then that might fix
itself.
I was only setting _srcRegIdx and _destRegIdx for disassembly reasons;
since the macro-op and first micro-op don't make use of Rs2, the
instruction wasn't setting _srcRegIdx[1] and the disassembly would show
something like 4294967295. Then it presented a potential solution to the
minor CPU model problem I described before.
No, most of the ISA is not microcoded. In fact, as I said, these RMW
instructions are not specified to be microcoded by the ISA, but since they
each have two memory transactions they didn't appear to work unless I split
them into two micro-ops.
Post by Steve Reinhardt
You shouldn't be passing values between micro-ops using C++ variables,
you should pass the data in a register. (If necessary, create
microcode-only temporary registers for this purpose, like x86 does.) This
is microarchitectural state so you can't hide it from the CPU model. The
main problem here is that, since this "hidden" data dependency isn't
visible to the CPU model, it doesn't know that the micro-ops must be
executed in order. If you pass that data in a register, the pipeline model
will enforce the dependency.
Also, where do you set the address for the memory accesses? Again, both
micro-ops should read that out of a register, it should not be passed
implicitly via hidden variables.
You shouldn't have to explicitly set the internal fields like _srcRegIdx
and _destRegIdx, the ISA parser should do that for you.
Unfortunately the ISA description system wasn't originally designed to
support microcode, and that support was kind of shoehorned in after the
fact, so it is a little messy. Is your whole ISA microcoded, or just a few
specific instructions?
Steve
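The point about hidden dependencies can be illustrated with a toy issue check. This is only a sketch of the general scoreboard idea, not how the minor or O3 models are implemented. The register numbers and names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MicroOp:
    name: str
    srcs: tuple    # register indices the micro-op reads
    dests: tuple   # register indices the micro-op writes


def can_issue(uop, pending_writes):
    """A micro-op may issue only if none of its sources is still pending."""
    return not any(r in pending_writes for r in uop.srcs)


RS1, RS2, RD, RT = 1, 2, 3, 32   # RT is a hypothetical temporary register

load = MicroOp("load", srcs=(RS1,), dests=(RT,))
# Hidden dependency: the value travels in a C++ variable, so the store
# declares no link to the load's result.
store_hidden = MicroOp("store", srcs=(RS1, RS2), dests=(RD,))
# Visible dependency: the store reads RT, which the load writes.
store_visible = MicroOp("store", srcs=(RS1, RS2, RT), dests=(RD,))

pending = set(load.dests)        # load has issued, result not yet written
print(can_issue(store_hidden, pending))    # True: may issue too early
print(can_issue(store_visible, pending))   # False: waits for the load
```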
Post by Alec Roelke
Sure, I can show some code snippets. First, here is the code for the
temp = Mem_sd;
Rd_sd = temp;
Mem_sd = Rs2_sd + temp;
The memory address comes from Rs1. The variable "temp" is a temporary
location shared between the read and modify-write micro-ops (the address
from Rs1 is shared similarly to ensure it's the same when the instructions
are issued).
In the constructor for the macro-op, I've included some code that
explicitly sets the src and dest register indices so that they are
_numSrcRegs = 2;
_srcRegIdx[0] = RS1;
_srcRegIdx[1] = RS2;
_numDestRegs = 1;
_destRegIdx[0] = RD;
So far, this works for the O3 model. But, in the minor model, it tries
to execute the modify-write micro-op before the read micro-op is executed.
The address is never loaded from Rs1, and so a segmentation fault often
occurs. To try to fix it, I added this code to the constructors of each of
_numSrcRegs = _p->_numSrcRegs;
for (int i = 0; i < _numSrcRegs; i++)
    _srcRegIdx[i] = _p->_srcRegIdx[i];
_numDestRegs = _p->_numDestRegs;
for (int i = 0; i < _numDestRegs; i++)
    _destRegIdx[i] = _p->_destRegIdx[i];
_p is a pointer to the "parent" macro-op. With this code, it works with
the minor model, but the final calculated value in the modify-write micro-op
never gets written at the end of the instruction in the O3 model.
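One possible way to split the three-line snippet above into two micro-ops that communicate through a register rather than a shared variable is sketched below. The temporary register index (32) and the function names are assumptions. The sketch only models the data flow, including the case where Rd and Rs2 are the same register.

```python
def load_uop(regs, mem, rs1, rt):
    """First micro-op: read memory at the address in Rs1 into the temp reg."""
    regs[rt] = mem[regs[rs1]]


def modify_write_uop(regs, mem, rs1, rs2, rd, rt):
    """Second micro-op: store Rs2 + old value, then write old value to Rd."""
    addr = regs[rs1]
    old = regs[rt]
    mem[addr] = regs[rs2] + old   # all sources are read before Rd changes
    regs[rd] = old


RT = 32                           # hypothetical microcode-only register
regs = [0] * 33
regs[1] = 0x100                   # Rs1 holds the address
regs[2] = 5                       # Rs2 holds the addend
mem = {0x100: 10}

# Rd deliberately aliases Rs2 (both are register 2).
load_uop(regs, mem, rs1=1, rt=RT)
modify_write_uop(regs, mem, rs1=1, rs2=2, rd=2, rt=RT)
print(mem[0x100], regs[2])        # 15 10
```

Since Rd is written only after Rs2 has been read, the aliasing case does not pollute the stored value.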
Alec Roelke
2016-08-02 20:54:15 UTC
Permalink
Okay, thanks. How do I tell the ISA parser that the 'Rt' operand I've
created refers to the extra architectural register? Or is there some
function I can call inside the instruction's code that writes directly to
an architectural register? All I can see from the code gem5 generates is
"setIntRegOperand," which takes indices into _destRegIdx rather than
register indices.
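The indirection described above can be modelled roughly as follows. The classes are hypothetical stand-ins written for this sketch, not gem5's actual classes or signatures. The operand index selects a slot in the instruction's destination list, and that slot holds the register index.

```python
class ToyInst:
    """Hypothetical stand-in for a decoded instruction."""

    def __init__(self, dest_reg_idx):
        # Plays the role of _destRegIdx: slot -> register index.
        self.dest_reg_idx = list(dest_reg_idx)


class ToyContext:
    """Hypothetical stand-in for an execution context."""

    def __init__(self, num_regs):
        self.regs = [0] * num_regs

    def set_int_reg_operand(self, inst, operand_idx, value):
        # operand_idx is a position in the destination list, not a
        # register number; the list supplies the register number.
        self.regs[inst.dest_reg_idx[operand_idx]] = value


TEMP_REG_IDX = 32                        # assumed index of the extra register
ctx = ToyContext(num_regs=33)
load_uop = ToyInst(dest_reg_idx=[TEMP_REG_IDX])
ctx.set_int_reg_operand(load_uop, 0, 42)
print(ctx.regs[TEMP_REG_IDX])            # 42
```

Under this model, reaching the extra register is a matter of getting its index into the destination list, which is what the operand definition controls.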
Steve Reinhardt
2016-08-02 21:49:02 UTC
Permalink
I don't know that off the top of my head---the ISAs I'm familiar with are
either not microcoded, or use a micro-op assembler to generate all the
micro-ops (i.e., x86). Have you looked at how ARM micro-ops are
constructed? That's the one ISA that I believe is mostly not microcoded
but still has some microcode in it.

Though come to think of it, it may be as easy as just using a constant
where the other operands specify the machine code bitfield, if there's
syntax that allows that.

Steve
Post by Alec Roelke
Okay, thanks. How do I tell the ISA parser that the 'Rt' operand I've
created refers to the extra architectural register? Or is there some
function I can call inside the instruction's code that writes directly to
an architectural register? All I can see from the code GEM5 generates is
"setIntRegOperand," which takes indices into _destRegIdx rather than
register indices.
Post by Steve Reinhardt
You don't need to worry about the size of the bitfield in the instruction
encoding, because the temporary register(s) will never be directly
addressed by any machine instruction. You should define a new
architectural register using an index that doesn't appear in any
instruction (e.g., if the ISA includes r0 to r31, then the temp reg can be
r32). This register will get renamed in the O3 model.
Steve
Post by Alec Roelke
That makes sense. Would it be enough for me to just create a new IntReg
'Rt': ('IntReg', 'ud', None, 'IsInteger', 4)
and then increase the number of integer registers? The other integer
operands have a bit field from the instruction bits, but since the ISA
doesn't specify that these RMW instructions should be microcoded, there's
no way to decode a temporary register from the instruction bits. Will GEM5
understand that and pick any integer register that's available?
The memory address is taken from Rs1 before the load micro-op, and then
stored in a C++ variable for the remainder of the instruction. That was
done to ensure that other intervening instructions that might get executed
in the O3 model don't change Rs1 between the load and modify-write
micro-ops, but if I can get the temp register to work then that might fix
itself.
I was only setting _srcRegIdx and _destRegIdx for disassembly reasons;
since the macro-op and first micro-op don't make use of Rs2, the
instruction wasn't setting _srcRegIdx[1] and the disassembly would show
something like 4294967295. Then it presented a potential solution to the
minor CPU model problem I described before.
No, most of the ISA is not microcoded. In fact, as I said, these RMW
instructions are not specified to be microcoded by the ISA, but since they
each have two memory transactions they didn't appear to work unless I split
them into two micro-ops.
Post by Steve Reinhardt
You shouldn't be passing values between micro-ops using C++ variables,
you should pass the data in a register. (If necessary, create
microcode-only temporary registers for this purpose, like x86 does.) This
is microarchitectural state so you can't hide it from the CPU model. The
main problem here is that, since this "hidden" data dependency isn't
visible to the CPU model, it doesn't know that the micro-ops must be
executed in order. If you pass that data in a register, the pipeline model
will enforce the dependency.
Also, where do you set the address for the memory accesses? Again,
both micro-ops should read that out of a register, it should not be passed
implicitly via hidden variables.
You shouldn't have to explicitly set the internal fields like
_srcRegIdx and _destRegIdx, the ISA parser should do that for you.
Unfortunately the ISA description system wasn't originally designed to
support microcode, and that support was kind of shoehorned in after the
fact, so it is a little messy. Is your whole ISA microcoded, or just a few
specific instructions?
Steve
Post by Alec Roelke
Sure, I can show some code snippets. First, here is the code for the
temp = Mem_sd;
Rd_sd = temp;
Mem_sd = Rs2_sd + temp;
The memory address comes from Rs1. The variable "temp" is a temporary
location shared between the read and modify-write micro-ops (the address
from Rs1 is shared similarly to ensure it's the same when the instructions
are issued).
In the constructor for the macro-op, I've included some code that
explicitly sets the src and dest register indices so that they are
_numSrcRegs = 2;
_srcRegIdx[0] = RS1;
_srcRegIdx[1] = RS2;
_numDestRegs = 1;
_destRegIdx[0] = RD;
So far, this works for the O3 model. But, in the minor model, it
tries to execute the modify-write micro-op before the read micro-op is
executed. The address is never loaded from Rs1, and so a segmentation
fault often occurs. To try to fix it, I added this code to the
_numSrcRegs = _p->_numSrcRegs;
for (int i = 0; i < _numSrcRegs; i++)
_srcRegIdx[i] = _p->_srcRegIdx[i];
_numDestRegs = _p->_numDestRegs;
for (int i = 0; i < _numDestRegs; i++)
_destRegIdx[i] = _p->_destRegIdx[i];
_p is a pointer to the "parent" macro-op. With this code, it works
with minor model, but the final calculated value in the modify-write
micro-op never gets written at the end of the instruction in the O3 model.
Post by Steve Reinhardt
I'm still confused about the problems you're having. Stores should
never be executed speculatively in O3, even without the non-speculative
flag. Also, assuming the store micro-op reads a register that is written
by the load micro-op, then that true data dependence through the
intermediate register should enforce an ordering. Whether that destination
register is also a source or not should be irrelevant, particularly in O3
where all the registers get renamed anyway.
Perhaps if you show some snippets of your actual code it will be
clearer to me what's going on.
Steve
Post by Alec Roelke
Yes, that sums up my issues. I haven't gotten to tackling the
second one yet; I'm still working on the first. Thanks for the patch link,
though, that should help a lot when I get to it.
To be more specific, I can get it to work with either the minor CPU
model or the O3 model, but not both at the same time. To get it to work
with the O3 model, I added the "IsNonSpeculative" flag to the modify-write
micro-op, which I assumed would prevent the O3 model from speculating on
its execution (which I also had to do with regular store instructions to
ensure that registers containing addresses would have the proper values
when the instruction executed). This works, but when I use it in the minor
CPU model, it issues the modify-write micro-op before the read micro-op
executes, meaning it hasn't loaded the memory address from the register
file yet and causes a segmentation fault.
I assume this is caused by the fact that the code for the read
operation doesn't reference any register, as the instruction writes the
value that was read from memory to a dest register before modifying it and
writing it back. Because the dest register can be the same as a source
register, I have to pass the memory value from the read micro-op to the
modify-write micro-op without writing it to a register to avoid potentially
polluting the data written back.
My fix was to explicitly set the source and dest registers of both
micro-ops to what was decoded by the macro-op so GEM5 can infer
dependencies, but then when I try it using the O3 model, the modify-write
portion does not appear to actually write back to memory.
Post by Steve Reinhardt
1. Managing the ordering of the two micro-ops in the pipeline, which seems
to be the issue you're facing.
2. Providing atomicity when you have multiple cores.
I'm surprised you're having problems with #1, because that's the easy part.
I'd assume that you'd have a direct data dependency between the micro-ops
(the load would write a register that the store reads, for the load to pass
data to the store) which should enforce ordering. In addition, since
they're both accessing the same memory location, there shouldn't be any
reordering of the memory operations either.
Providing atomicity in the memory system is the harder part. The x86 atomic
RMW memory ops are implemented by setting LOCKED_RMW on both the load and
store operations (see
http://grok.gem5.org/source/xref/gem5/src/mem/request.hh#145, as well
as src/arch/x86/isa/microops/ldstop.isa). This works with AtomicSimpleCPU
and with Ruby, but there is no support for enforcing this atomicity in the
classic cache in timing mode. I have a patch that provides this but you
have to apply it manually: http://reviews.gem5.org/r/2691.
Steve
Post by Alec Roelke
Hello,
I'm trying to add an ISA to gem5 which has several atomic
read-modify-write instructions. Currently I have them
implemented as pairs
Post by Alec Roelke
of micro-ops which read data in the first operation and then
modify-write
Post by Alec Roelke
in the second. This works for the simple CPU model, but runs
into trouble
Post by Alec Roelke
for the minor and O3 models, which want to execute the
modify-write half
Post by Alec Roelke
before the load half is complete. I tried forcing both parts of
the
Post by Alec Roelke
instruction to have the same src and dest register indices, but
that causes
Post by Alec Roelke
other problems with the O3 model.
Is there a way to indicate that there is a data dependency
between the two
Post by Alec Roelke
micro-ops in the instruction? Or, better yet, is there a way I
could
Post by Alec Roelke
somehow have two memory accesses in one instruction without
having to split
Post by Alec Roelke
it into micro-ops?
Thanks,
Alec Roelke
_______________________________________________
gem5-users mailing list
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
Alec Roelke
2016-08-04 21:50:24 UTC
Permalink
Yeah, I looked at them first to figure out what I had to do--I don't think
they have intermediate registers like mine have to, or at least I didn't
see it when I first looked. Anyway, your suggestion for creating a
constant with the value of the register index to use in the operand
definition worked, and so now the RMW instructions work for all four CPU
models. Thanks for your help!
Post by Steve Reinhardt
I don't know that off the top of my head---the ISAs I'm familiar with are
either not microcoded, or use a micro-op assembler to generate all the
micro-ops (i.e., x86). Have you looked at how ARM micro-ops are
constructed? That's the one ISA that I believe is mostly not microcoded
but still has some microcode in it.
Though come to think of it, it may be as easy as just using a constant
where the other operands specify the machine code bitfield, if there's
syntax that allows that.
Steve
Post by Alec Roelke
Okay, thanks. How do I tell the ISA parser that the 'Rt' operand I've
created refers to the extra architectural register? Or is there some
function I can call inside the instruction's code that writes directly to
an architectural register? All I can see from the code GEM5 generates is
"setIntRegOperand," which takes indices into _destRegIdx rather than
register indices.
Post by Steve Reinhardt
You don't need to worry about the size of the bitfield in the
instruction encoding, because the temporary register(s) will never be
directly addressed by any machine instruction. You should define a new
architectural register using an index that doesn't appear in any
instruction (e.g., if the ISA includes r0 to r31, then the temp reg can be
r32). This register will get renamed in the O3 model.
Steve
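To illustrate the register-numbering idea in the reply above, here is a minimal, self-contained C++ sketch. It is not gem5 code: the names NumArchIntRegs, MicroTempReg, NumIntRegs, RegFieldBits, IntRegFile and decodeRegField are all hypothetical, and the 32-register, 5-bit-field layout is only the example given in the message. The point it shows is that a temporary register placed just past the encodable range can never be named by a machine instruction, while the register file is sized to hold it.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical names and sizes; not taken from gem5 or from the ISA in this thread.
constexpr int NumArchIntRegs = 32;              // r0..r31, encodable in instructions
constexpr int MicroTempReg = NumArchIntRegs;    // r32, reachable only from micro-ops
constexpr int NumIntRegs = NumArchIntRegs + 1;  // register file holds both kinds

// A 5-bit register field can encode 0..31, so it can never name the temp register.
constexpr int RegFieldBits = 5;
static_assert(MicroTempReg >= (1 << RegFieldBits),
              "temp register must lie outside the encodable range");

// Toy integer register file sized for the architectural registers plus the temp.
struct IntRegFile {
    std::array<uint64_t, NumIntRegs> regs{};

    uint64_t read(int idx) const {
        assert(idx >= 0 && idx < NumIntRegs);
        return regs[idx];
    }
    void write(int idx, uint64_t value) {
        assert(idx >= 0 && idx < NumIntRegs);
        regs[idx] = value;
    }
};

// Extract a register index from a 5-bit field starting at bit `lsb`.
constexpr int decodeRegField(uint32_t inst, int lsb) {
    return (inst >> lsb) & ((1u << RegFieldBits) - 1);
}
```

Whatever bits an instruction carries, decodeRegField yields at most 31, so only micro-ops that use the constant MicroTempReg can touch the extra register.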
Post by Alec Roelke
That makes sense. Would it be enough for me to just create a new
    'Rt': ('IntReg', 'ud', None, 'IsInteger', 4)
and then increase the number of integer registers? The other integer
operands have a bit field from the instruction bits, but since the ISA
doesn't specify that these RMW instructions should be microcoded, there's
no way to decode a temporary register from the instruction bits. Will GEM5
understand that and pick any integer register that's available?
The memory address is taken from Rs1 before the load micro-op, and then
stored in a C++ variable for the remainder of the instruction. That was
done to ensure that other intervening instructions that might get executed
in the O3 model don't change Rs1 between the load and modify-write
micro-ops, but if I can get the temp register to work then that might fix
itself.
I was only setting _srcRegIdx and _destRegIdx for disassembly reasons;
since the macro-op and first micro-op don't make use of Rs2, the
instruction wasn't setting _srcRegIdx[1] and the disassembly would show
something like 4294967295. Then it presented a potential solution to the
minor CPU model problem I described before.
No, most of the ISA is not microcoded. In fact, as I said, these RMW
instructions are not specified to be microcoded by the ISA, but since they
each have two memory transactions they didn't appear to work unless I split
them into two micro-ops.
Post by Steve Reinhardt
You shouldn't be passing values between micro-ops using C++ variables,
you should pass the data in a register. (If necessary, create
microcode-only temporary registers for this purpose, like x86 does.) This
is microarchitectural state so you can't hide it from the CPU model. The
main problem here is that, since this "hidden" data dependency isn't
visible to the CPU model, it doesn't know that the micro-ops must be
executed in order. If you pass that data in a register, the pipeline model
will enforce the dependency.
Also, where do you set the address for the memory accesses? Again,
both micro-ops should read that out of a register, it should not be passed
implicitly via hidden variables.
You shouldn't have to explicitly set the internal fields like
_srcRegIdx and _destRegIdx, the ISA parser should do that for you.
Unfortunately the ISA description system wasn't originally designed to
support microcode, and that support was kind of shoehorned in after the
fact, so it is a little messy. Is your whole ISA microcoded, or just a few
specific instructions?
Steve
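The argument in the reply above (a dependency is only enforced if the CPU model can see it) can be sketched with a toy issue check in C++. This is an assumption-laden illustration, not how gem5's minor or O3 models are implemented: MicroOp, canIssue and pendingAfterIssue are hypothetical names, and the model only tracks which registers an in-flight micro-op will write.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Toy model, not gem5 code: a micro-op is reduced to the registers it reads and writes.
struct MicroOp {
    std::vector<int> srcRegs;
    std::vector<int> destRegs;
};

// A micro-op may issue only if none of its source registers is still being
// produced by an older micro-op that has not finished.
bool canIssue(const MicroOp &op, const std::set<int> &pendingDests) {
    for (int reg : op.srcRegs) {
        if (pendingDests.count(reg) != 0)
            return false;
    }
    return true;
}

// Registers that become "pending" once the given micro-op is in flight.
std::set<int> pendingAfterIssue(const MicroOp &op) {
    return std::set<int>(op.destRegs.begin(), op.destRegs.end());
}
```

If the load micro-op lists the temporary register as a destination and the modify-write micro-op lists it as a source, canIssue holds the second micro-op back until the load completes. If the value travels through a C++ variable instead, the second micro-op has no pending source and nothing stops it from issuing first.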
Post by Alec Roelke
Sure, I can show some code snippets. First, here is the code for the
    temp = Mem_sd;
    Rd_sd = temp;
    Mem_sd = Rs2_sd + temp;
The memory address comes from Rs1. The variable "temp" is a
temporary location shared between the read and modify-write micro-ops (the
address from Rs1 is shared similarly to ensure it's the same when the
instructions are issued).
In the constructor for the macro-op, I've included some code that
explicitly sets the src and dest register indices so that they are
    _numSrcRegs = 2;
    _srcRegIdx[0] = RS1;
    _srcRegIdx[1] = RS2;
    _numDestRegs = 1;
    _destRegIdx[0] = RD;
So far, this works for the O3 model. But, in the minor model, it
tries to execute the modify-write micro-op before the read micro-op is
executed. The address is never loaded from Rs1, and so a segmentation
fault often occurs. To try to fix it, I added this code to the
    _numSrcRegs = _p->_numSrcRegs;
    for (int i = 0; i < _numSrcRegs; i++)
        _srcRegIdx[i] = _p->_srcRegIdx[i];
    _numDestRegs = _p->_numDestRegs;
    for (int i = 0; i < _numDestRegs; i++)
        _destRegIdx[i] = _p->_destRegIdx[i];
_p is a pointer to the "parent" macro-op. With this code, it works
with minor model, but the final calculated value in the modify-write
micro-op never gets written at the end of the instruction in the O3 model.
Post by Steve Reinhardt
I'm still confused about the problems you're having. Stores should
never be executed speculatively in O3, even without the non-speculative
flag. Also, assuming the store micro-op reads a register that is written
by the load micro-op, then that true data dependence through the
intermediate register should enforce an ordering. Whether that destination
register is also a source or not should be irrelevant, particularly in O3
where all the registers get renamed anyway.
Perhaps if you show some snippets of your actual code it will be
clearer to me what's going on.
Steve
Post by Alec Roelke
Yes, that sums up my issues. I haven't gotten to tackling the
second one yet; I'm still working on the first. Thanks for the patch link,
though, that should help a lot when I get to it.
To be more specific, I can get it to work with either the minor CPU
model or the O3 model, but not both at the same time. To get it to work
with the O3 model, I added the "IsNonSpeculative" flag to the modify-write
micro-op, which I assumed would prevent the O3 model from speculating on
its execution (which I also had to do with regular store instructions to
ensure that registers containing addresses would have the proper values
when the instruction executed). This works, but when I use it in the minor
CPU model, it issues the modify-write micro-op before the read micro-op
executes, meaning it hasn't loaded the memory address from the register
file yet and causes a segmentation fault.
I assume this is caused by the fact that the code for the read
operation doesn't reference any register, as the instruction writes the
value that was read from memory to a dest register before modifying it and
writing it back. Because the dest register can be the same as a source
register, I have to pass the memory value from the read micro-op to the
modify-write micro-op without writing it to a register to avoid potentially
polluting the data written back.
My fix was to explicitly set the source and dest registers of both
micro-ops to what was decoded by the macro-op so GEM5 can infer
dependencies, but then when I try it using the O3 model, the modify-write
portion does not appear to actually write back to memory.
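The worry in the message above is that the destination register may alias a source register, so the loaded value cannot safely pass through the destination. A minimal C++ sketch of the alternative discussed elsewhere in the thread, passing the value through a dedicated temporary register, is given here. It is a functional toy only, under stated assumptions: Machine, TempReg, loadMicroOp and modifyWriteMicroOp are hypothetical names, memory is word-addressed, and the operation is the add-style RMW from the snippet quoted in this thread.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int TempReg = 32;  // hypothetical micro-op-only register, outside r0..r31

// Toy machine state: 32 architectural registers plus one temp, and a tiny memory.
struct Machine {
    std::array<int64_t, 33> regs{};
    std::array<int64_t, 16> mem{};  // word-addressed toy memory
};

// Micro-op 1: read memory at the address in rs1 into the temporary register.
void loadMicroOp(Machine &m, int rs1) {
    m.regs[TempReg] = m.mem[m.regs[rs1]];
}

// Micro-op 2: write (rs2 + old value) back to memory, then put the old value in rd.
// All sources are read before rd is written, so rd may alias rs1 or rs2.
void modifyWriteMicroOp(Machine &m, int rd, int rs1, int rs2) {
    const int64_t oldValue = m.regs[TempReg];
    m.mem[m.regs[rs1]] = m.regs[rs2] + oldValue;
    m.regs[rd] = oldValue;
}
```

Because the old memory value lives in TempReg rather than in rd, the modify-write step still sees the original rs2 even when rd and rs2 are the same register.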
Post by Steve Reinhardt
There are really two issues here, I think:
1. Managing the ordering of the two micro-ops in the pipeline, which seems
to be the issue you're facing.
2. Providing atomicity when you have multiple cores.
I'm surprised you're having problems with #1, because that's the easy part.
I'd assume that you'd have a direct data dependency between the micro-ops
(the load would write a register that the store reads, for the load to pass
data to the store) which should enforce ordering. In addition, since
they're both accessing the same memory location, there shouldn't be any
reordering of the memory operations either.
Providing atomicity in the memory system is the harder part. The x86 atomic
RMW memory ops are implemented by setting LOCKED_RMW on both the load and
store operations (see
http://grok.gem5.org/source/xref/gem5/src/mem/request.hh#145, as well
as src/arch/x86/isa/microops/ldstop.isa). This works with AtomicSimpleCPU
and with Ruby, but there is no support for enforcing this atomicity in the
classic cache in timing mode. I have a patch that provides this but you
have to apply it manually: http://reviews.gem5.org/r/2691.
Steve
Post by Alec Roelke
Hello,
I'm trying to add an ISA to gem5 which has several atomic read-modify-write
instructions. Currently I have them implemented as pairs of micro-ops
which read data in the first operation and then modify-write in the
second. This works for the simple CPU model, but runs into trouble for the
minor and O3 models, which want to execute the modify-write half before the
load half is complete. I tried forcing both parts of the instruction to
have the same src and dest register indices, but that causes other problems
with the O3 model.
Is there a way to indicate that there is a data dependency between the two
micro-ops in the instruction? Or, better yet, is there a way I could
somehow have two memory accesses in one instruction without having to split
it into micro-ops?
Thanks,
Alec Roelke