Discussion:
Non-copying RC-Immix (was: Putting the stack on an RC-Immix heap)
Ben Kloosterman
2013-11-12 05:42:16 UTC
Permalink
In contrast to a buddy allocator, however, this space can mostly be
defragmented. The stacks *may* (not always) be non-relocatable because of
C frames, but all of the other blocks can be evacuated over time. And we
need to do that anyway to avoid fragmentation of the *virtual* address
space in the LOS. This layer is also the layer at which real pages get
returned to the underlying operating system.
There is some cost, however; if you could have any 2^N size then there is
far less point in copying, yet I agree it should not be an issue ...
which is interesting..

So let me propose a non-copying RC-Immix heap - it's not as silly as it
first looks. I.e. you start with large blocks as normal (likely larger), but
when you scan for space looking for opportunities to copy, you can split
existing blocks to reclaim holes. In fact even with 3 sizes this is a
very useful property which could reduce copying substantially, maybe to the
point of not being needed (I'd gladly give 20% fragmentation space, a few
percent in locality and some work on a background thread for no copying and
great scaling / low pauses). With RC-Immix the main reason we copy (and
it's not a lot of copying compared to a gen collector) is because a few
objects can hold an entire block and, crucially, there are too many of these
for the recycled block allocation. But we can, for example, just break the
512K block into eight 64K blocks and possibly even break the last block into
8K blocks. (Likely you would leave the recycled 128K block as 128K and
only under heavy memory pressure or after a long time break it down
further.)
Quick Notes
- The nice part is that this block splitting / merging can be done by a
background thread.
- It doesn't have the nice property nursery collectors have of
improving locality of mid-aged objects by compacting, but RC-Immix does a
poor job on this as well (e.g. they have poor small-heap performance).
As far as I can see this is the main negative, but it should be at least as
good as Immix, which is not bad. Immix copying is only for defragmentation.
- Copying has a cost as well. Adding a stage-1 nursery like Gen-Immix
will likely be better (esp. for low thread counts), but this has to be
weighed against the cost of copying, especially for highly concurrent
cases, i.e. the requirement for barriers and safe points as well as a
complexity cost which restricts optimization (and I'm not talking about
just LLVM). When trying to run fully concurrent RC-Immix (as opposed to
mutators and collectors running) you will get worse performance. A nursery
is much simpler than a fully concurrent collector, requiring just a write
barrier, and will be better than RC-Immix; a non-relocating heap by
itself is likely only a little worse, and likely similar to a concurrent
RC-Immix.
- It scales well.
- It doesn't have the low L1-cache hit rate of size-segregated heaps,
because different sized objects (related / similar lifetimes) are still
allocated adjacent as normal.
- RC-Immix inserts new objects into holes in recycled blocks; a new
smaller block will make this more efficient, as the mutator does not have to
scan for a hole (which is one of the main reasons for lines, but you still
need them for object / fragmentation counts and ref counts). We are only
left with recycled blocks that have poor splitting potential.
- These empty blocks are then reused as new, and can be remerged with
adjacent blocks (blocks that are still partially full after being released
by the TLA, and obviously adjacent empty blocks are also merged), thus
allowing blocks to reform. When all blocks are in use they should all be
default size (minus blocks held for stacks, regions and the TLA). This is
important so the heap doesn't just degrade to minimum-sized blocks.
- Simple to implement.
- Under high memory pressure / fragmentation you can break blocks down
further, reducing fragmentation but increasing the block-splitting work;
the TLA also has to change blocks more frequently. I think this will be
more efficient than scanning for lots of holes, which Immix has to do.
- As stated earlier, unlike a buddy allocator we can work with smaller
blocks when needed (as long as the minimum size is equal to or larger than
the minimum large object), and as you say only a few sizes are used, which
reduces the fragmentation cost.
- It should not be that hard to test on Jikes (but would take a while to
tune), e.g. remove copying from the RC-Immix patch and change the block
manager/allocator, try various sizes, and change the TLA allocator block
strategy. (E.g. instead of "use all empty, use all recycled, collect" you
need to work out when to use smaller blocks and when to look for them.
Likely: use large empty blocks, use smaller sized empty blocks (there
will only be many small blocks when fragmentation is high, but medium blocks
will be common), use recycled blocks.)
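The split / remerge idea above can be sketched roughly like this (purely
illustrative C - the names `split_block` / `merge_blocks` and the 8K minimum
are my own inventions, not Jikes code): a block is carved into equal smaller
blocks, and adjacent free blocks can later be remerged so the heap does not
degrade to minimum-sized blocks.

```c
#include <assert.h>
#include <stddef.h>

#define MIN_BLOCK (8 * 1024)   /* assumed minimum = minimum large object */

typedef struct {
    size_t base;   /* offset of the block within its chunk */
    size_t size;   /* block size in bytes */
} Block;

/* Split 'b' into 'n' equal sub-blocks, writing them to 'out'.
 * Returns 0 if the split would go below the minimum block size
 * or does not divide evenly. */
int split_block(Block b, int n, Block *out) {
    size_t child = b.size / (size_t)n;
    if (child < MIN_BLOCK || child * (size_t)n != b.size)
        return 0;
    for (int i = 0; i < n; i++) {
        out[i].base = b.base + (size_t)i * child;
        out[i].size = child;
    }
    return 1;
}

/* Remerge two adjacent free blocks (a background-thread job). */
int merge_blocks(Block a, Block b, Block *out) {
    if (a.base + a.size != b.base)
        return 0;              /* not adjacent */
    out->base = a.base;
    out->size = a.size + b.size;
    return 1;
}
```

The real work, of course, is in the bookkeeping around this (free queues per
size, deciding when to split), which this sketch leaves out.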


I think I need to think further on this (especially the cost of managing
block sizes and the cost of the locality changes for mid-aged objects) as
it seems too obvious, but I thought I'd throw it out there as it seems a good
compromise.

Ben
Jonathan S. Shapiro
2013-11-12 18:34:15 UTC
Permalink
Post by Ben Kloosterman
There is some cost however, if you could have any 2^N size then there is
far less point in copying, yet I agree it should not be an issue ...
which is interesting..
I'm not being clear enough about the 2^k thing.

When you are running a bump allocator *from the mutator*, you really want
to know the size of the block at static compile time (alternatively: at JIT
time). This is true because you need to know what mask to apply to the bump
pointer in order to find the front of the block. You need that to update
the reference count, and you *really* don't want to be doing a table lookup
for it. So what this means in practice is that the size of any immix block *in
the nursery* really wants to be fixed. I guess in a JIT system you could
change it on the fly if you are willing to flush the existing translation
cache.
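For what it's worth, the masking trick looks like this (a minimal sketch
assuming a fixed power-of-two block size; `BLOCK_SIZE` and `block_front` are
invented names): because the size is a compile-time 2^k, the front of the
block - where the metadata and reference count live - falls out of the bump
pointer with a single AND, no table lookup.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed fixed, statically known block size (must be a power of two). */
#define BLOCK_SIZE ((uintptr_t)(32 * 1024))

/* Recover the start of the block containing 'bump' with one AND. */
uintptr_t block_front(uintptr_t bump) {
    return bump & ~(BLOCK_SIZE - 1);
}
```

With a dynamic block size the mask would have to be loaded or computed,
which is exactly the extra inline-path cost under discussion.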

You *do* know at static compile time whether you are stack allocating or
heap allocating, so there is no need for those two sizes to match, but it
sure helps if the stack is 2^k pages. On some architectures this is
desirable in order to find the thread-local data pointer in any case.

But in both cases, the need to know the block size statically seems like an
impediment to some of the things you were talking about.

Relocating objects is another matter. I don't have any experience to draw
from on that, but my intuition is that a block-size lookup operation isn't
prohibitive there. The bigger issue is that you have now partitioned the
blocks by size, and a free block can no longer go naively onto a queue of
available free blocks.

So I think that there are three "core" block sizes: the one for the
nursery, the one for the heap, and the one for the stack. My guess is that
block sizes in LOS need to be viewed as a separate problem, if only because
they don't satisfy the 2^k property.

I do *not* think that the size of the block changes the argument for
copying one way or the other. 10% or so of objects in a "young" tenured
block will turn out to be long lived, regardless of the size of the block.
The question, I suppose, would be the spatial distribution of those islands
within the block. I don't think that "splitting" such a block really helps,
since the bump allocator is already prepared to skip past such islands.
Post by Ben Kloosterman
- It doesn't have the low L1-cache hit rate of size-segregated heaps because
different sized objects (related / similar lifetimes) are still allocated
adjacent as normal.
Up to some new, potentially larger size. Yes. But it's an open question how
important that would be in practice. Given what we know about object size
demographics, I'm dubious about how much difference it makes. Still, it
seems like an easy thing to test and measure.



shap
Ben Kloosterman
2013-11-17 15:35:15 UTC
Permalink
Have done some testing - see at end.
Post by Jonathan S. Shapiro
Post by Ben Kloosterman
There is some cost however, if you could have any 2^N size then there is
far less point in copying, yet I agree it should not be an issue ...
which is interesting..
I'm not being clear enough about the 2^k thing.
No, I did follow. BTW the Immix code has a concept called Chunks, which is
just N blocks - basically a bunch of pages, but with some metadata.
More on this later; I'm still exploring it.
Post by Jonathan S. Shapiro
But in both cases, the need to know the block size statically seems like
an impediment to some of the things you were talking about.
At first I thought I had some options here:
1) Use the last page as a guard. The fault, though expensive, works out at
roughly 200+ cycles (on x86_64), and a 256K block can have somewhere
between 32 and 16000 objects, so on average the cost here is low. (Setting
the pages can be done in the background.)
2) Divide the block into fixed sizes with metadata at each division, but
allocate them together for the bump allocator (this is important, as you
don't want to go to the block allocator too often).
3) Use a lookup in FS/GS for the current block size.

But all this is rather silly when checking the code, as it allocates on
lines, so it needs to store a byte count for the number of bytes in
the line (to check overflow), and the number of bytes in a line is fixed.
Changing the number of lines in a block into a variable is far less of an
issue - either just a line# variable, or combined with option (2). There is
also a #-of-lines-allocated counter, so there are a few options around this.
Post by Jonathan S. Shapiro
Relocating objects is another matter. I don't have any experience to draw
from on that, but my intuition is that a block-size lookup operation isn't
prohibitive there. The bigger issue is that you have now partitioned the
blocks by size, and a free block can no longer go naively onto a queue of
available free blocks.
It's just a free block; once you get the block you get the size, and the
allocator is responsible for ensuring the bigger blocks get handed out
first. Since blocks get reclaimed by the background ref counting (esp.
cycles) it can just add them in size order.

That said, I think there will only be 2 sizes (because lines are the 3rd
size), so you could just add them to the free queue. The allocator just
loads the line size from &block+lineCount, or you can go with option (2)
above.
Post by Jonathan S. Shapiro
So I think that there are three "core" block sizes: the one for the
nursery, the one for the heap, and the one for the stack. My guess is that
block sizes in LOS need to be viewed as a separate problem, if only because
they don't satisfy the 2^k property.
It's very nice to have different sized blocks for the stack; if calling some
C lib with heavy recursion you don't want to allocate a large number of
blocks, and the cost of this may outweigh other issues.
Post by Jonathan S. Shapiro
I do *not* think that the size of the block changes the argument for
copying one way or the other. 10% or so of objects in a "young" tenured
block will turn out to be long lived, regardless of the size of the block.
The question, I suppose, would be the spacial distribution of those islands
within the block. I don't think that "splitting" such a block really helps,
since the bump allocator is already prepared to skip past such islands.
Yes, the benefit of bump allocating a recycled block is pretty low (I
thought it would be higher, but the line mechanism is mutator-efficient at
the cost of fragmentation). One huge benefit of splitting I was thinking
of is zeroing memory - it's a huge problem with many allocators. If a block
is split it can be zeroed far more efficiently. That said, this applies
more to an Immix-style collector than RC-Immix, because with ref counting
you can zero when freeing.

The procedure used is interesting, and not a line scan at all. I have
seen no evidence of object start marks so far (the scan goes from
object to object, but I have not looked deeper). It's quite simple actually:
when the collect / new-object ref count goes through, it marks the number of
objects in each line and hence whether the line is full or empty. The
allocator finds the first and last empty line and just grabs contiguous
empty lines, and that is where it then allocates. This avoids more
expensive scanning within the line, and is why they talked about a worst-case
fragmentation of 1 object per line. In Immix the only purpose of copying is
to reduce this worst-case fragmentation cost.
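As I understand it, that hole-finding amounts to something like the
following (a hypothetical sketch, not the Jikes code - the name
`find_empty_run` and the byte-sized counts are my assumptions): the per-line
object counts maintained by the RC / collect phase say which lines are
empty, and the allocator just looks for a long enough run of zeros.

```c
#include <assert.h>

/* 'objs_per_line[i]' is the per-line object count; count 0 = empty line.
 * Return the index of the first run of at least 'need' contiguous empty
 * lines, or -1 if no such run exists in the block. */
int find_empty_run(const unsigned char *objs_per_line, int nlines, int need) {
    int run = 0;
    for (int i = 0; i < nlines; i++) {
        run = (objs_per_line[i] == 0) ? run + 1 : 0;
        if (run >= need)
            return i - need + 1;   /* start of the run */
    }
    return -1;
}
```

Note this never looks inside a line, which is exactly why a single live
object can waste a whole line - the worst-case fragmentation mentioned above.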

This does mean fragmentation will be higher for Immix / RC-Immix. One thing
that may be done is that the defrag histogram calculations could build a
list of space, or indicate a large hole in a line, which the allocator can
pick up when it does an extreme collection. This would have a similar effect
to copying, but at a cost of mutator performance when memory is low.
Post by Jonathan S. Shapiro
Post by Ben Kloosterman
- It doesn't have the low L1-cache hit rate of size-segregated heaps because
different sized objects (related / similar lifetimes) are still allocated
adjacent as normal.
Up to some new, potentially larger size. Yes. But it's an open question
how important that would be in practice. Given what we know about object
size demographics, I'm dubious about how much difference it makes. Still,
it seems like an easy thing to test and measure.
It's a reasonable job to do a fuller implementation. I have got Jikes and
RCImmix building and testing and have started with some quick changes, e.g.
disable defragmenting on Immix and measure the impact. If you want to see
the RC-Immix code yourself I can put it on github. One big issue is that
most of the more concurrent tests don't seem to run on Jikes (and I only
have a 4-core machine), but I will check this later.


Some Immix / RC-Immix notes.

I'm not sold on having a whole 2nd block held by the allocator (it looks
like a late mod in the code) just for medium sized objects which don't fit
at the end (note: end of block, not line). It's ok to request the block and
put the object there, but when you finish the current block why not just use
that? This can waste 256K per thread and reduce locality, for little gain
compared to just using the 2nd block when the first is full.

Except for very high core counts, RC-Immix is probably better off with a
nursery than its copy-new-objects scheme.

There is a lot of alignment code.

Oh, one more thing: Immix builds a histogram which highlights fragmentation
and hence objects which are candidates for copying. It may be possible
with ref counting (or the background cycle mark) to identify candidate
blocks for such objects, where the objects in the candidate block have a
reference to the object to be moved. This should improve the locality and
L1 cache behaviour for such long-lived objects. I doubt it's viable for new
objects.

Performance testing on Jikes
Have done some testing and it's a bit tricky, as they use inheritance and
shared code everywhere and changing the base class is tricky. Observations:

- Disable copy of new objects (easy to do as it's just a test).
No real performance difference except for small heaps, e.g. 1-1.5x min heap
is about 4% worse. A proper nursery like Gen-Immix will be better, as
covered in the RC-Immix paper. (Which means non-copying is much worse
than Gen-Immix.)

- Stop the defragment run.
The defragment run is interwoven with cycle detection and refcount updates
(all get done at this point in a class that inherits from
StopTheWorldCollector). No real performance impact from not defragmenting;
generally the difference either way was less than the variation. (It
would take quite a lot of work to calculate it exactly, which would also
entail doing some optimizations by removing a lot of code.) However, minimum
memory increased, e.g. for DaCapo bloat: RCImmix 29M vs ncRCImmix 33M
(note: I think I removed the dedicated copy space, but I'm not 100% sure).

The most notable part is the performance compared to other non-copying GCs,
e.g. ncRCImmix bloat 50M (10 runs, taking the highest): 3164ms vs Mark-Sweep
3942ms; jython 75M: 3164ms vs 3942ms (and those were the benchmarks I just
ran; others were a bit less). These were about min heap x1.5.

This is very promising where you can't copy ... and non-moving has huge
benefits, e.g. no read barrier, a lighter write barrier (just for RC), C
interop, compiler optimization, internal references / slices, no stack
map, lower pauses, and simplicity allowing more time for optimization /
tuning of algorithms; scaling to high core counts is trivial, etc., but the
cost has to be reasonable. Non-copying scales very well, and it may even be
competitive due to better optimization and no safe zones and barriers (at a
cost of say 10% more memory, e.g. extra fragmentation - copy store). A
Mark-Sweep nursery + no-copy RCImmix is likely to be better than RCImmix.

I think this is a very attractive model to begin with, e.g. no-copy RCImmix
has a complexity equivalent to most generational collectors. You can get it
completely stable, then tune it, then add a Mark-Sweep nursery (if the
runtime allows) and evaluate highly concurrent solutions vs a synced nursery
vs the base - but the base will be competitive unless you need to run very
tight heaps.


I won't go further because it would be a significant amount of work (I would
want 8+ core machines and more benchmarks) and won't really prove anything.
The block solution I mention may help, though: we can clear whole lines
when a line is marked free (not done in the code), but it doesn't help
fragmentation and is just an optimization. A lower line size would be
better, e.g. non-copy would go back to Immix 128-byte lines.

There are also significant differences between the way Jikes/Java uses
memory and a language with unboxed types, as you will get a lot fewer small
objects (and hence fewer but larger objects), which means the optimizations
for small objects are less worthwhile and all the tuning would need to be
re-evaluated.

Ben
Jonathan S. Shapiro
2013-11-17 16:27:53 UTC
Permalink
Post by Ben Kloosterman
Have done some testing see at end.
Great!

Post by Ben Kloosterman
BTW the Immix code has a concept called Chunks which is just N blocks.
Basically a bunch of pages .. but it has some metadata.
More on this later , im still exploring it.
Look forward to hearing about it.
Post by Ben Kloosterman
At first i thought I had some options here..
1) Use the last page as a guard .. the fault though expensive works out at
roughly 200+ Cycles ( on x86_64) and a 256K block can have somewhere
between 32-16000 objects on average the cost here is low. ( Setting the
pages can be done in the background)
2) Divide the block into fixed sizes with meta data at each division but
allocate them together for the bump allocator ( this is important as you
dont want to go to the block allocator to often) .
3) Use a lookup in FS /GS for the current block size
But all this is rather silly when checking the code, as it allocates them
on lines .. so it needs to store a byte count for the amount of bytes in
the line ( to check overflow) and the amount of bytes in a line is fixed .
Changing number of lines in a block to a variable is far less of an issue
and either just a line# variable or combined with option (2). There is
also a #of lines allocated counter so there are a few options around this.
I think (2) won't help because you have metadata in the middle of the
stack. You end up having to do all of the call-time guards to avoid overrun
that you would need for a segmented stack. But you are right that this
check is subsumed by the bump limit pointer check. The only real difference
is that the stack grows down while most blocks will grow up, but that's
just a sign change on some of the comparisons.

You make it sound as if the current RC-immix mechanism is computing
bytes-allocated-in-line explicitly. That surprises me. Don't the low-order
bits of the bump pointer already encode that?

Changing the (static) size of the block isn't a problem, except that at
some point you are going to hit architectural costs imposed by short
immediate values (for masking) in the underlying instruction set. It's
better if you don't have to do multiple instructions to build the mask
value, but as long as the size is statically known the mask is just a
constant value.

But if the size of the chunk is *dynamic*, you need some instructions to
look up the size so that you can determine where the metadata is and decide
what mask to use. You could certainly turn the block base pointer into
thread-local data, and you could store bits-to-mask in the first (otherwise
unused) byte of the metadata, but it still seems like more instructions to
run in the inline path. This is a marginal (new) computation in the inline
path, and I'm already somewhat concerned about the number of instructions
in the inline path.

I'm making a bunch of assumptions here, though. Since you have actually
looked at how this thing works, can you give pseudo-code for the fast path
of the bump allocator?

One thing I'm particularly curious about: does the fast path check the
limit pointer each time an allocation occurs, or are limit checks across
multiple allocations combined in the obvious way?
Post by Ben Kloosterman
Its very nice to have diffirent sized blocks for the stack if calling some
C lib with heavy recursions you dont want to allocate a large amount of
blocks and the cost of this may outweigh other issues.
Let me come back to this in my next note.

Post by Ben Kloosterman
If a block is split it can be zeroed far more efficiently. That said, this
applies more to an Immix style collector than RC-Immix because with ref
counting you can zero when freeing.
You can, but you really don't want to. The zeroing speed is one of the
under-appreciated advantages of a bump allocator. Zeroing long chunks can
be done with CLZ (cache line zero) on a lot of modern machines. That's
hugely faster than word-at-a-time can do it.
Post by Ben Kloosterman
The procedure used is interesting and not a line scan at all , i have
seen no evidence of object start marks at all so far ( the scan goes from
object to object but i have not looked deeper) ...
Right. Immix doesn't use object start marks, just a count of objects per
line. Their "continued" bit is in the object header, not the per-line
metadata.
Post by Ben Kloosterman
Im not sold on is having a whole 2nd block held by the allocator ( it
looks like a late mod in the code) just for medium sized blocks which dont
fit at the end ( note end of block not line) , its ok to request the block
and put the object there but when you finish the current block why not just
use that . This can waste 256K per thread and reduce locality for little
gain compared to just using the 2nd block when the first is full.
According to the paper, that's not what the second block is for. The second
block gets used when the object won't fit in the current range of the first
block. If you just skip over holes that are too small, you can end up with
a lot of unused ranges.


shap
Jonathan S. Shapiro
2013-11-17 16:37:57 UTC
Permalink
I just realized that the immix papers don't mention one of the big
advantages to bump allocations: when you are allocating multiple objects in
sequence, the limit checks can be consolidated into a single check. It
seems to me that this is not true for immix, because the occupancy count on
each line makes the consolidation difficult. If things can't be
consolidated, that's a pretty significant fast-path overhead. Does anybody
know if they are consolidating successfully?

Regarding the stack, there is another issue that I hadn't considered: if we
use an immix block of some form for the stack, we can't rely on the
collector to zero it in the background. We would have needed to zero
reference slots anyway, so it's not like this is a new cost, but the
stack-of-objects approach may change how eagerly we need to do that.
Actually, I don't think it does, but it needs looking at.

In the end, I'm coming to the view that a stack map is likely to be better.
The stack-of-objects idea ends up having to build the same data that we
need for the stack map; it just evades the hash table lookup to find the
map record. If the hash is self-scaling, that shouldn't be a big issue, so
at this point I'm leaning away from the stack-of-objects idea.


shap
Ben Kloosterman
2013-11-18 13:57:57 UTC
Permalink
Post by Jonathan S. Shapiro
I just realized that the immix papers don't mention one of the big
advantages to bump allocations: when you are allocating multiple objects in
sequence, the limit checks can be consolidated into a single check. It
seems to me that this is not true for immix, because the occupancy count on
each line makes the consolidation difficult. If things can't be
consolidated, that's a pretty significant fast-path overhead. Does anybody
know if they are consolidating successfully?
I think there is a runtime constraint where the runtime presents one call to
alloc at a time.

I see no reason why, if the runtime/compiler supports it, you can't go:

var ptr = TryAllocateContiguous(size_of_multiple_objects);
if (ptr == NULL)
    // allocate each object individually
// construct the objects

Though it does add to the inline path.

I think (I checked but I'm not 100% sure) that the occupancy count is done
during the collect phase (when processing new roots). Note the metadata
for lines is located at the start of the Chunk, not the block (I think all
metadata except the object & object GC header is in the chunk, and hence a
block is just unmarked objects).
Post by Jonathan S. Shapiro
Regarding the stack, there is another issue that I hadn't considered: if
we use an immix block of some form for the stack, we can't rely on the
collector to zero it in the background. We would have needed to zero
reference slots anyway, so it's not like this is a new cost, but the
stack-of-objects approach may change how eagerly we need to do that.
Actually, I don't think it does, but it needs looking at.
I take it we can't rely on the collector to do this because collector
threads have an immix block for a stack. Immix should do it in the
background but it doesn't; there is no issue doing it in the allocator.
Worst case you can have cleared and uncleared blocks, where you manually
clear one if none have been cleared. This is probably a safer method
because there is no guarantee the collector has even run.

One huge issue is block availability. The normal RC-Immix cycle is: allocate
all new blocks, allocate all recycled blocks, run collect. We probably
don't want a stack to use recycled blocks, so we need some sort of reserve
here. At this point I think you would want to redesign this and
incorporate some sort of New Uncleared Blocks / Cleared Blocks / Recycled
Blocks / Emergency Blocks (copy reserve and emergency stack, e.g. for the
collector stack) scheme.
Post by Jonathan S. Shapiro
In the end, I'm coming to the view that a stack map is likely to be
better. The stack-of-objects idea ends up having to build the same data
that we need for the stack map; it just evades the hash table lookup to
find the map record. If the hash is self-scaling, that shouldn't be a big
issue, so at this point I'm leaning away from the stack-of-objects idea.
There is still some benefit to a uniform block memory model; that said,
Immix blocks are likely allocated from large-page chunks on some archs,
which is good, but stacks need guard pages.

Writing a good self-scaling hash table is not easy..

I kind of like the idea of writing a reference to static metadata
regardless of the method, since:
- You want the writing to be as fast as possible and the parsing is less
important.
- It has constant time regardless of the number of variables.


Ben
Jonathan S. Shapiro
2013-11-18 15:41:11 UTC
Permalink
Post by Ben Kloosterman
Post by Jonathan S. Shapiro
I just realized that the immix papers don't mention one of the big
advantages to bump allocations: when you are allocating multiple objects in
sequence, the limit checks can be consolidated into a single check. It
seems to me that this is not true for immix, because the occupancy count on
each line makes the consolidation difficult. If things can't be
consolidated, that's a pretty significant fast-path overhead. Does anybody
know if they are consolidating successfully?
I think there is a runtime contraint where the runtime presents 1 call to
alloc at a time.
Right. But if you know you are going to allocate an 8 word object followed
by a 12 word object, you can just do a single 24 word (allowing for
headers) allocation. Given the inline sequence you sent out, a conventional
compiler would do this optimization more or less automatically, except that
the procedure calls to the long path are an impediment.
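That consolidation might look like this in the fast path (a sketch only -
the names are invented, the slow path is elided, and sizes here are taken to
include headers, following the 8-word + 12-word = 24-word example):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models the thread-local bump/limit pair of a bump allocator. */
typedef struct { uintptr_t bump, limit; } Alloc;

/* Allocate two objects with a single consolidated limit check.
 * Returns 0 to signal that the out-of-line slow path must run. */
int alloc_pair(Alloc *a, size_t sz1, size_t sz2,
               uintptr_t *obj1, uintptr_t *obj2) {
    if (a->bump + sz1 + sz2 > a->limit)
        return 0;                  /* take the slow path */
    *obj1 = a->bump;               /* both objects carved from one bump */
    *obj2 = a->bump + sz1;
    a->bump += sz1 + sz2;
    return 1;
}
```

The point of the surrounding discussion is whether Immix's per-line
occupancy counts break this; if the counts are maintained during collection
rather than at allocation time, they don't.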
Post by Ben Kloosterman
I see no reason why if the runtime/compiler supports it you cant go.
var ptr= tryallocatecontiguious ( size_of_multiple_objects.)
if ( ptr == NULL)
// allocate each object individually
//construct the objects
Though it does add to the inline path.
You wouldn't do it conditionally. You'd just do it. You'd probably end up
allocating more chunks out of the "free" block, but that's not necessarily
bad.
Post by Ben Kloosterman
I think ( i checked but im not 100%) that the occupancy count is done
durring the collect phase (when processing newroots ) .
(slaps head) yes, of course. And that would make all the difference. If the
object counts are handled during nursery collection, dynamically sized
blocks aren't a problem at all.

Post by Ben Kloosterman
Note the metadata for lines is located at the start of the Chunk not the
block ( i think all metadata except the object & object GC header is in the
chunk and hence a block is just unmarked objects)
That's strange, if only because it contradicts the paper.
Post by Ben Kloosterman
Post by Jonathan S. Shapiro
Regarding the stack, there is another issue that I hadn't considered: if
we use an immix block of some form for the stack, we can't rely on the
collector to zero it in the background. We would have needed to zero
reference slots anyway, so it's not like this is a new cost, but the
stack-of-objects approach may change how eagerly we need to do that.
Actually, I don't think it does, but it needs looking at.
I take it we cant rely on the collector to do this because collector
threads have immix block for a stack .
No no. We'll get an initially zero block just fine. The problem is that the
stack pointer is going up and down within the block. As procedures return,
they leave dirty lines "below" the stack. This happens *way* too rapidly
for background zeroing to work.

Maybe I haven't been clear enough. I'm not imagining that we allocate a
frame as a nursery object. I'm imagining that we manage a *conventional* stack
in an immix block, using the bump limit pointer to guard the end of the
block so we know when to create a new stack block/segment. All I was doing
was laying out the frames in a way that made the stack walkable by the
conventional object mark/scan logic.
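A minimal sketch of that arrangement (layout and field names entirely
hypothetical): frames are pushed by bumping toward the limit of the block,
and each frame carries an object-style header so that the ordinary object
mark/scan logic can walk the stack without libunwind.

```c
#include <assert.h>
#include <stdint.h>

/* Mimics an object header so mark/scan can walk frames like objects. */
typedef struct {
    uint32_t type_id;   /* lets the scanner find the frame's ref slots */
    uint32_t size;      /* frame size in bytes, for walking to the next */
} FrameHeader;

/* One stack segment living inside a block; 'limit' guards the end. */
typedef struct { uintptr_t sp, limit; } StackSeg;

/* Push a frame; returns its address, or 0 when the segment is full
 * and a new stack block/segment must be chained on. */
uintptr_t push_frame(StackSeg *s, uint32_t type_id, uint32_t size) {
    if (s->sp + size > s->limit)
        return 0;                   /* start a new stack block */
    FrameHeader *h = (FrameHeader *)s->sp;
    h->type_id = type_id;
    h->size = size;
    uintptr_t frame = s->sp;
    s->sp += size;
    return frame;
}
```

(For simplicity this grows upward; as noted above, a downward-growing stack
is just a sign change on the comparison.)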

Post by Ben Kloosterman
One huge issue is block availability...
Right. Except now that I've said more clearly that I'm not allocating stack
frames as heap objects, hopefully it's clearer why I don't think that's an
issue.
Post by Ben Kloosterman
Normal Rc-Immix cycle is Allocate all new blocks ,Allocate all recycled
blocks , Run Collect. We probably dont want a stack to use Recycled blocks
so we need some sort of reserve here.
For what I have in mind, the stack *definitely* can't use recycled blocks.
Post by Ben Kloosterman
Writing a good self-scaling hash table is not easy..
True. But in this case, doubling the hash table size each time is a good
and simple approach.
Post by Ben Kloosterman
I kind of like the idea of writing a reference to static meta data
regardless of the method since
- You want the writing to be as fast as possible and the parsing is less
important .
- It has constant time regardless of the amount of variables.
Sure, but the stack map is effectively the same. It's using the same actual
map data, but it's statically precomputed and doesn't require any writes at
run time at all.


shap
Jonathan S. Shapiro
2013-11-18 15:43:09 UTC
Permalink
Post by Jonathan S. Shapiro
Sure, but the stack map is effectively the same. It's using the same
actual map data, but it's statically precomputed and doesn't require any
writes at run time at all.
Writing the conventional stack in such a way that the stack frames look
like conventional objects does two things:

1. It enables reuse of the object mark logic
2. It eliminates the reliance on libunwind.

That's really all it does for us.
Ben Kloosterman
2013-11-19 04:46:56 UTC
Permalink
+

3. It gives a convenient, cheap way to check for end of stack if you choose
to use it instead of a guard page, which in turn allows allocation from
large pages in a shared heap.


Ben
Post by Jonathan S. Shapiro
Post by Jonathan S. Shapiro
Sure, but the stack map is effectively the same. It's using the same
actual map data, but it's statically precomputed and doesn't require any
writes at run time at all.
Writing the conventional stack in such a way that the stack frames look
1. It enables reuse of the object mark logic
2. It eliminates the reliance on libunwind.
That's really all it does for us.
_______________________________________________
bitc-dev mailing list
http://www.coyotos.org/mailman/listinfo/bitc-dev
Ben Kloosterman
2013-11-19 03:32:36 UTC
Permalink
Post by Jonathan S. Shapiro
Post by Ben Kloosterman
Post by Jonathan S. Shapiro
I just realized that the immix papers don't mention one of the big
advantages to bump allocations: when you are allocating multiple objects in
sequence, the limit checks can be consolidated into a single check. It
seems to me that this is not true for immix, because the occupancy count on
each line makes the consolidation difficult. If things can't be
consolidated, that's a pretty significant fast-path overhead. Does anybody
know if they are consolidating successfully?
I think there is a runtime contraint where the runtime presents 1 call to
alloc at a time.
Right. But if you know you are going to allocate an 8 word object followed
by a 12 word object, you can just do a single 24 word (allowing for
headers) allocation. Given the inline sequence you sent out, a conventional
compiler would do this optimization more or less automatically, except that
the procedure calls to the long path are an impediment.
In Jikes this is external C code, not likely to be inlined.. but that
shouldn't stop another runtime.
Post by Jonathan S. Shapiro
Post by Ben Kloosterman
I see no reason why if the runtime/compiler supports it you cant go.
var ptr = tryallocatecontiguous(size_of_multiple_objects);
if (ptr == NULL)
    // allocate each object individually
// construct the objects
Though it does add to the inline path.
You wouldn't do it conditionally. You'd just do it. You'd probably end up
allocating more chunks out of the "free" block, but that's not necessarily
bad.
Yes.. missed that. That should work provided there is not too much
consolidation, else you get big holes at the end of each block.
Post by Jonathan S. Shapiro
Post by Ben Kloosterman
I think ( i checked but im not 100%) that the occupancy count is done
durring the collect phase (when processing newroots ) .
(slaps head) yes, of course. And that would make all the difference. If
the object counts are handled during nursery collection, dynamically sized
blocks aren't a problem at all.
Except a minor cost in managing them..
Post by Jonathan S. Shapiro
Note the metadata for lines is located at the start of the Chunk not the
Post by Ben Kloosterman
block ( i think all metadata except the object & object GC header is in the
chunk and hence a block is just unmarked objects)
That's strange, if only because it contradicts the paper.
Pretty sure on this: I saw the line count look up the Chunk table and put
it at an offset there. Note the history here:

Immix paper
Immix continued developed as production GC for Jikes
Gen Immix
RC Immix paper
Patch for Jikes.

It's kind of nice to have clean blocks and lines.. also all the non-object
header metadata is close together, which will be better.
Post by Jonathan S. Shapiro
Post by Ben Kloosterman
Post by Jonathan S. Shapiro
if we use an immix block of some form for the stack, we can't rely on the
collector to zero it in the background. We would have needed to zero
reference slots anyway, so it's not like this is a new cost, but the
stack-of-objects approach may change how eagerly we need to do that.
Actually, I don't think it does, but it needs looking at.
I take it we cant rely on the collector to do this because collector
threads have immix block for a stack .
No no. We'll get an initially zero block just fine. The problem is that
the stack pointer is going up and down within the block. As procedures
return, they leave dirty lines "below" the stack. This happens *way* too
rapidly for background zeroing to work.
I was thinking of just the initial clear when the block is received, as it
is currently received dirty..
Post by Jonathan S. Shapiro
Maybe I haven't been clear enough. I'm not imagining that we allocate a
frame as a nursery object. I'm imagining that we manage a *conventional* stack
in an immix block, using the bump limit pointer to guard the end of the
block so we know when to create a new stack block/segment. All I was doing
was laying out the frames in a way that made the stack walkable by the
conventional object mark/scan logic.
I understand, though it's a bit unconventional.. I keep thinking I can
look up the type now, since I have a vtable for the method... :-) Also the
bump pointer check is not conventional, and you can go either way: use the
bump pointer to check for end of stack vs a guard page. (No reason you
can't have a guard page at the end of an Immix block, though it does need
a block made of 4K pages.)

A bit concerned about the bump pointer: the paper I posted showed the
stack check cost was massive, and here we have a bump pointer check doing
effectively the same thing (yet this is a super cheap check).
Post by Jonathan S. Shapiro
One huge issue is block availability...
Right. Except now that I've said more clearly that I'm not allocating
stack frames as heap objects, hopefully it's clearer why I don't think
that's an issue.
It's a minor issue, as you need some code to get a block, clear it and
then set it up as a stack. You're probably thinking of a separate pool
which fetches its own pages, but I don't think that's needed; just reserve
blocks in the existing pool.
Post by Jonathan S. Shapiro
Post by Ben Kloosterman
Writing a good self-scaling hash table is not easy..
True. But in this case, doubling the hash table size each time is a good
and simple approach.
Post by Ben Kloosterman
I kind of like the idea of writing a reference to static meta data
regardless of the method since
- You want the writing to be as fast as possible and the parsing is less
important .
- It has constant time regardless of the amount of variables.
Sure, but the stack map is effectively the same. It's using the same
actual map data, but it's statically precomputed and doesn't require any
writes at run time at all.
Yeah, I figured you can put the map data at a constant word offset from
the address (say as a header to the method), but since it's not the
mutator, you may as well use a hash and not pollute the cache.

Ben
Jonathan S. Shapiro
2013-11-19 17:27:20 UTC
Permalink
Post by Ben Kloosterman
Post by Jonathan S. Shapiro
Right. But if you know you are going to allocate an 8 word object
followed by a 12 word object, you can just do a single 24 word (allowing
for headers) allocation. Given the inline sequence you sent out, a
conventional compiler would do this optimization more or less
automatically, except that the procedure calls to the long path are an
impediment.
In Jikes this is external C code not likeley to be inlined .. but that
shouldnt stop another runtime.
Wait. Are you saying that the *fast path* isn't inlined?

Combining allocations requires that the front end knows something about the
allocator code, so that it knows the call to the slow path (which is an
intrinsic call) can be combined. The ability to do this combination is one
of the big advantages of a copying nursery.

Hmm. Come to that, I'm not thinking about this right, and the sequence you
sent is wrong. That is: I'm confident they are using the sequence as you
describe it, but that's not the one we want in mainline code.

The sequence you sent is the one that would be used for the object
relocator. That's the one that gets used to allocate storage when
evacuating a block or copying things out of the nursery. The reason I'm
sure of this is the test

if (bytes > BYTES_IN_LINE) {

which we wouldn't be doing in a nursery allocation (because the size there
is statically known).

But if we use a copying nursery (i.e. a hybrid approach), then the nursery
itself is *always* a free block (perhaps bigger than 32K, but still a free
block). When allocating from a free block, there is no internal
fragmentation, and if the allocation fails you're going to have to do a
full nursery collection (and coalesce or copy) in any case. Under those
conditions, the bump allocation reduces to (borrowing from your code,
possibly incorrectly):

Address start, end;
do {
  start = alignAllocationNoFill(cursor, align, offset);
  end = start.plus(bytes);
  if (end.LE(limit)) break;

  GC_nursery();
} while (true);
cursor = end;

We can very easily impose a rule in the nursery that maintains /start/
at a two-word boundary. In that case, for typically aligned
allocations, it reduces to:

Address start, end;
do {
  start = cursor + 2 * sizeof(uintptr_t); /* GC hdr, vtable */
  end = start.plus(CompilerRoundsUp(bytes, 8));
  if (end.LE(limit)) break;

  GC_nursery();
} while (true);
cursor = end;

And if we institute that alignment requirement on the cursor, then the
probe hoists to the top of the procedure, and we get:

/* Pre-probe the available nursery space */
end = cursor + SpaceWeNeed;
if (end.GT(limit)) GC_nursery();

/* At this point, we have permission to allocate up to SpaceWeNeed
   without checking. */
Address start = cursor + 2 * sizeof(uintptr_t);
cursor += StaticObjectSizeAfterRoundup;

And, of course, a lot of this will peephole out if the compiler is
aggressive about it. This approach won't work for loops, but it still
helps.

Note here that I'm *only* talking about nursery allocations! This
would hypothetically work in a recycled block, but it would leave big
holes in practice.
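A runnable toy of the hoisted probe above (my approximation; the `Bump`, `probe`, and `alloc_unchecked` names and the stand-in collection are assumptions, not the Jikes code): one limit check authorizes a whole statically sized sequence, after which the individual allocations are unchecked pointer bumps.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Toy bump allocator over a free block; one pre-probe authorizes a whole
 * sequence of allocations without further checks. */
typedef struct {
    uintptr_t *cursor, *limit;
    int nursery_collections;
} Bump;

#define HDR_WORDS 2   /* GC header + vtable, as in the sequence above */

static void gc_nursery(Bump *b, uintptr_t *base, size_t words) {
    b->nursery_collections++;          /* stand-in for a real collection */
    b->cursor = base; b->limit = base + words;
}

/* Pre-probe: make sure `words` of space exist; collect if not. */
static void probe(Bump *b, uintptr_t *base, size_t cap, size_t words) {
    if (b->cursor + words > b->limit)
        gc_nursery(b, base, cap);
}

/* Unchecked allocation, valid only under a prior probe's budget. */
static uintptr_t *alloc_unchecked(Bump *b, size_t payload_words) {
    uintptr_t *start = b->cursor + HDR_WORDS;
    b->cursor += HDR_WORDS + payload_words;
    return start;
}
```

An 8-word object followed by a 12-word object then costs a single comparison up front instead of two on the fast path.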
Post by Ben Kloosterman
Pretty sure on this i saw the line count lookup the Chunk table and put it
on an offset there. Note the history here
Immix paper
Immix continued developed as production GC for Jikes
Gen Immix
RC Immix paper
Patch for Jikes.
Its kind of nice to have clean blocks and lines.. also all the non object
header metadata is close together which will be better.
It was already close together in the block design, and it was already a
nice cache line multiple, because it was 256 bytes. But it doesn't really
matter if the metadata is chunkwise or line-wise, and moving it to a 4M
chunk regularizes some things and enables large PTEs.

The other subtle effect here is that the metadata itself is now 32 Kb,
which has an obvious relationship to the L1 cache size on most machines.
One effect is to change the compulsory miss rate during block scans. So it
makes sense to do this if you have enough total system memory to be able to
afford it.
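A sketch of chunk-resident line metadata (constants are my assumptions, chosen to match the figures above: a 4 MB chunk of 128-byte lines gives 32768 lines, i.e. 32 Kb of metadata at one byte per line):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Assumed geometry: 4 MB chunks, 128-byte lines, one metadata byte per
 * line stored in a table at the chunk base. */
#define CHUNK_SIZE ((uintptr_t)4u << 20)
#define LINE_SIZE  ((uintptr_t)128u)

/* Byte offset, within the chunk's metadata table, of the entry for the
 * line containing `addr`. The chunk base is found by masking, so the
 * blocks themselves stay free of metadata. */
static size_t line_meta_offset(uintptr_t addr) {
    uintptr_t in_chunk = addr & (CHUNK_SIZE - 1);
    return (size_t)(in_chunk / LINE_SIZE);
}
```

The masking works because chunks are aligned to their own (power-of-two) size, which is also what enables backing them with large PTEs.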

Regarding the stack, there is another issue that I hadn't considered: if we
Post by Ben Kloosterman
Post by Jonathan S. Shapiro
Post by Ben Kloosterman
Post by Jonathan S. Shapiro
use an immix block of some form for the stack, we can't rely on the
collector to zero it in the background. We would have needed to zero
reference slots anyway, so it's not like this is a new cost, but the
stack-of-objects approach may change how eagerly we need to do that.
Actually, I don't think it does, but it needs looking at.
I take it we cant rely on the collector to do this because collector
threads have immix block for a stack .
No no. We'll get an initially zero block just fine. The problem is that
the stack pointer is going up and down within the block. As procedures
return, they leave dirty lines "below" the stack. This happens *way* too
rapidly for background zeroing to work.
I was thinking just the intial clear when the block is received as it is
received at the moment dirty..
Zeroing it at allocation rather than release simply shifts the work from
the GC to the mutator. My guess is it doesn't matter from a performance
perspective.

A bit concerned about the bump pointer , the paper i posted showed the
Post by Ben Kloosterman
stack check cost was massive, here we have a bump pointer check doing
effectively the same thing ( yet this is a super cheap check).
Yes. But I think the key problem with the stack check isn't the
end-of-stack test. It's the case where things bounce back and forth across
segments.

A procedure typically decrements %SP on entry. If you have a guard page,
then the stack overflow test is simply:

subl %sp, %sp, $frame_size   ; exists already
mov  $0, (%sp)               ; probe: faults if we hit the guard page


All that needs to be done is to ensure that this instruction happens early
enough. It's very hard to believe that the probe itself would induce a
performance problem. If you *don't* want to use a guard page, you can
instead use:

subl %sp, %sp, $frame_size
subl %tmp, %sp, $limit       ; %tmp = %sp - limit
jge  $procedure_body         ; in bounds: proceed
jmp  $stub_that_grows_the_stack


Which, I concede, is unpleasantly longer, especially since it deals with
such a low-likelihood event. The guard page is almost certainly more
efficient.

In the end, though, I think it's also important to remember that the use
case for automatically growable stacks doesn't arise that frequently.

Yeah i figured you can put the map data as a constant -word offset of
Post by Ben Kloosterman
the address ( say as a header to the method ) but since its not the mutator
may as well use a hash and not pollute the cache.
There are a whole bunch of ways to do it, and several places it can go.
Which one is right can only be determined by measurement.

The real issue is that in an optimizing implementation, you may end up with
multiple stack maps for a single frame, because locations on the stack may
be used for integers at one moment and object references the next. You
already need some sort of lookup of the form (base PC, bound PC, stack map
addr), but you'll get a lot more entries if the shape of the frame is
changing.

Not clear that a hash table is even the best choice, but that's an
implementation detail at this point.
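One alternative to a hash is a sorted table searched by PC, sketched here (illustrative only; the entry layout follows the (base PC, bound PC, stack map addr) form mentioned above, and a function whose frame shape changes simply contributes several entries):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* One entry per PC range over which the frame shape is constant. */
typedef struct {
    uintptr_t base_pc, bound_pc;   /* half-open range [base, bound) */
    const void *stack_map;
} MapEntry;

/* Binary search over entries sorted by base_pc (ranges non-overlapping). */
static const void *find_map(const MapEntry *tab, size_t n, uintptr_t pc) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (pc < tab[mid].base_pc)        hi = mid;
        else if (pc >= tab[mid].bound_pc) lo = mid + 1;
        else return tab[mid].stack_map;
    }
    return NULL;   /* no map: not a GC-visible PC */
}
```

Lookup cost is O(log n) with no cache-polluting writes, at the price of more entries when frame shapes change often.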


Jonathan
Jonathan S. Shapiro
2013-11-19 17:36:48 UTC
Permalink
Oh. I forgot a whole topic in my last: stack copying.

The performance killer in a segmented stack isn't the check. It's the fact
that many call sites (or equivalently, entry sites) now need to be prepared
to shuffle their own frame onto a different call stack. That's both a lot
of code and a lot of overhead. The only real way to avoid this is to
relocate the stack.

One of the nice things about a precise stack map combined with a precise
register map is that the stack can be copied. So the goal here shouldn't be
stitching stack segments together. It should be *growing* the stack up to
some thread-defined guard size (above which a stack overflow is really an
error).

Unfortunately, there may be C code on the stack. That can't be moved,
because we don't know what it does. When a stack relocation occurs while C
stack frames are present, what it needs to do is leave a return trampoline
at the top of the new stack that returns to the C frame on the old stack.
We then leave a marker at the *top* of the C frames to indicate that we
should migrate the stuff *above* the C frames onto the larger stack frame
when the C code has returned.

Note that the call from C to managed code can arrange to always perform the
code on the correct (newer and larger) stack.

Also note that during the "copy on return from C" code, the new stack
region is known to be completely empty, just as it was when we did the
earlier stack relocation. Offhand, I provisionally believe that whenever a
stack relocation occurs, it is always the case that frames are being
migrated into an empty chunk.
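The core of why precise maps make the stack copyable can be sketched as follows (a simplification that ignores C frames and registers; the per-slot map representation and names are my assumptions): every slot known to hold a stack-internal pointer is rebased after the copy.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Relocate a stack region into a larger one. `slot_is_stack_ptr[i]` is a
 * precise map saying whether slot i holds a pointer into the old stack
 * (saved frame pointers, links), which must be rebased after copying. */
static uintptr_t *relocate_stack(uintptr_t *old, size_t used_slots,
                                 size_t new_slots,
                                 const unsigned char *slot_is_stack_ptr) {
    uintptr_t *nu = malloc(new_slots * sizeof *nu);
    memcpy(nu, old, used_slots * sizeof *nu);
    ptrdiff_t delta = (char *)nu - (char *)old;
    for (size_t i = 0; i < used_slots; i++)
        if (slot_is_stack_ptr[i])
            nu[i] = nu[i] + (uintptr_t)delta;   /* rebase intra-stack links */
    return nu;
}
```

Without the map, an integer that happens to look like a stack address would be corrupted by the rebase, which is exactly why imprecise runtimes cannot move their stacks.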


shap
Ben Kloosterman
2013-11-20 06:20:19 UTC
Permalink
I put the stack comments here
Post by Jonathan S. Shapiro
In the end, though, I think it's also important to remember that the use
case for automatically growable stacks doesn't arise that frequently.
I'm firmly in the one-big-stack-and-tune-it-down camp (it's simple, known,
and is the status quo), with growable stacks added as a feature
(preferably by copying). The large number of threads is driven by the
actor model (and a few others), but I think that's a bit misguided, and
there are other ways, like multiplexing actor/thread-pumped services onto
fewer system threads instead of dealing with hundreds of threads.
Post by Jonathan S. Shapiro
There are a whole bunch of ways to do it, and several places it can go.
Which one is right can only be determined by measurement.
The real issue is that in an optimizing implementation, you may end up
with multiple stack maps for a single frame, because locations on the stack
may be used for integers at one moment and object references the next. You
already need some sort of lookup of the form (base PC, bound PC, stack map
addr), but you'll get a lot more entries if the shape of the frame is
changing.
Wouldn't it be better to burn the stack memory and reserve slots for each
potential use? Variable size stacks don't mix well with Immix blocks and
method metadata.

Also the metadata map would need to be made after inlining (or the
inlining made smart enough to update the metadata), which is quite low
level.
Post by Jonathan S. Shapiro
Oh. I forgot a whole topic in my last: stack copying.
The performance killer in a segmented stack isn't the check. It's the fact
that many call sites (or equivalently, entry sites) now need to be prepared
to shuffle their own frame onto a different call stack. That's both a lot
of code and a lot of overhead. The only real way to avoid this is to
relocate the stack.
I know these costs and the cache costs of using a new memory region, but
what I don't get (and which may be misleading me) was how, in the paper
mentioned before, size analysis of the worst-case stack path significantly
reduced the cost. Maybe they could specify a stack segment that covered
the worst-case path and hence reduce these costs (e.g. 7K for some worker
threads may be statically determined). Their use case of 100's of threads
is unusual, so in effect the analysis was used to reduce segments rather
than to reduce the cost of the check..
Post by Jonathan S. Shapiro
One of the nice things about a precise stack map combined with a precise
register map is that the stack can be copied. So the goal here shouldn't be
stitching stack segments together. It should be *growing* the stack up to
some thread-defined guard size (above which a stack overflow is really an
error).
Unfortunately, there may be C code on the stack. That can't be moved,
because we don't know what it does. When a stack relocation occurs while C
stack frames are present, what it needs to do is leave a return trampoline
at the top of the new stack that returns to the C frame on the old stack.
We then leave a marker at the *top* of the C frames to indicate that we
should migrate the stuff *above* the C frames onto the larger stack frame
when the C code has returned.
Note that the call from C to managed code can arrange to always perform
the code on the correct (newer and larger) stack.
Also note that during the "copy on return from C" code, the new stack
region is known to be completely empty, just as it was when we did the
earlier stack relocation. Offhand, I provisionally believe that whenever a
stack relocation occurs, it is always the case that frames are being
migrated into an empty chunk.
Badly behaved C is a pain, but you are right: you can leave the old stack
there and handle it, at the cost of wasting memory.

Ben
Jonathan S. Shapiro
2013-11-22 18:41:24 UTC
Permalink
Post by Ben Kloosterman
I put the stack comments here
Thanks!
Post by Ben Kloosterman
Im firmly in the 1 big stack and tune it down camp ( its simple ,known and
is the status quo) and add growable stacks as a feature.
I agree. I felt it was worth understanding what copying the stack would
entail, but I think this is the right place to start. I do think that
different activities have different stack size requirements, so I want that
to be something you can say at thread allocation time, but I don't even
think saying *that* needs to be doable in the default thread creation call.
Post by Ben Kloosterman
The real issue is that in an optimizing implementation, you may end up
Post by Jonathan S. Shapiro
with multiple stack maps for a single frame, because locations on the stack
may be used for integers at one moment and object references the next. You
already need some sort of lookup of the form (base PC, bound PC, stack map
addr), but you'll get a lot more entries if the shape of the frame is
changing.
Wouldnt it be better to burn the stack memory and reserve slots for each
potential use ? Variable size stacks dont mix well with Immix Blocks and
method metadata.
All stacks are variable size. You're talking about variable frame size. In
the presence of alloca or general stack allocation, a variable frame size
is unavoidable. As I've noted elsewhere, I was never contemplating treating
frames as immix-allocated objects. I was only contemplating using the immix
block layout for certain purposes of convenience. And I think we've
concluded that it probably isn't worth it.

All that being said, you may be right, and it may be that every stack frame
should be strongly partitioned along the lines you suggest. The problem
case is going to arise when a single procedure first has a loop
manipulating a larger object in sequence (that does not survive the loop)
and then has a second loop doing integer-only computation. We'll end up
paying the stack space for that object after we are done with it. This is
an observed problem in C compilers (where objects are not a consideration),
which is why many of those compilers do interference analysis on locals to
try to shrink the stack frame sizes. It will be greatly exacerbated by
aggressive stack allocation.

So: what you propose is the right place to start, but it's an optimization
we probably need to address long term.
Post by Ben Kloosterman
Also the Metadata map would need to be made after inlining which is quite
low level.
Definitely. And that's why all of the intermediate transformations along
the way need to maintain a distinction between object referencing SSA
registers and value-only SSA registers.
Post by Ben Kloosterman
...what i dont get ( and which maybe misleading me ) was in the paper
mentioned before how size analysis of the worst case stack path
significantly reduced the cost...
Remember that they were allocating segments dynamically without benefit of
a guard page, so in the absence of interprocedural analysis every procedure
had to do a painful check in the preamble. The interprocedural analysis
lets them clone procedures into checking and non-checking implementations,
and then roll up the depth checks into a single dominating location.

Think of it as an optimization to the calling convention for leaf
procedures, but extending the notion of "leaf" to cover more than one
layer.
Post by Ben Kloosterman
Badly behaved C is a pain but you are right you can leave the old stack
there and handle it , at the cost of wasting memory .
Usually temporarily, because it's rare for C code to call back into the
managed subsystem.


shap
Ben Kloosterman
2013-11-20 04:55:51 UTC
Permalink
Post by Jonathan S. Shapiro
Post by Ben Kloosterman
Post by Jonathan S. Shapiro
Right. But if you know you are going to allocate an 8 word object
followed by a 12 word object, you can just do a single 24 word (allowing
for headers) allocation. Given the inline sequence you sent out, a
conventional compiler would do this optimization more or less
automatically, except that the procedure calls to the long path are an
impediment.
In Jikes this is external C code not likeley to be inlined .. but that
shouldnt stop another runtime.
Wait. Are you saying that the *fast path* isn't inlined?
The new-object code is static C which calls the Java allocator; the Java
is JIT-ted.. So I'm not sure, but I doubt it (since the static C code
would need to inline the JIT code).. but note Jikes' goal is not great
performance: it uses the latest algorithms but is an almost pure Java
platform, so it lags the Oracle JRE. For GC research, showing relative
performance on the same platform is sufficient.

I think it goes something like this (C code):

if (size > LARGE_SIZE)
    allocator = 1;
else
    allocator = 2;

Then call the code below, which then calls the alloc method.
Post by Jonathan S. Shapiro
Combining allocations requires that the front end knows something about
the allocator code, so that it knows the call to the slow path (which is an
intrinsic call) can be combined. The ability to do this combination is one
of the big advantages of a copying nursery.
Hmm. Come to that, I'm not thinking about this right, and the sequence you
sent is wrong. That is: I'm confident they are using the sequence as you
describe it, but that's not the one we want in mainline code.
The sequence you sent is the one that would be used for the object
relocator. That's the one that gets used to allocate storage when
evacuating a block or copying things out of the nursery. The reason I'm
sure of this is the test
if (bytes > BYTES_IN_LINE) {
which we wouldn't be doing in a nursery allocation (because the size there
is statically known).
Pretty sure this is incorrect: the copy uses the same alloc code. Immix
has no Nursery at all; RC-Immix has a bunch of empty blocks called the
reserved copy space, and when it copies it reallocates a new object (e.g.
the young allocator is of type RCImmixAllocator, and the copy allocator is
of type RCImmixAllocator), so it's just a form of compaction; I would not
call it a Nursery either. These filled blocks are then no longer part of
the copy space (which is just empty blocks to make sure there is room for
a copy).

The line above is already in the slow path, so it doesn't matter much
(e.g. limit exceeded means you need to find a new block or hole), and the
bytes is passed in by the C runtime, so it has to be a variable.
Post by Jonathan S. Shapiro
But if we use a copying nursery (i.e. a hybrid approach), then the nursery
itself is *always* a free block (perhaps bigger than 32K, but still a
free block).
Correct, and note that with such a copying Nursery I see no reason to
move objects or defragment.. the Nursery will remove most of the
fragmentation, and the benefit is not worth the cost since the heap fills
most holes. This means for a highly concurrent GC you only need to be
concerned with moves from the copying Nursery.

There is a complication, however: right now there is no ref counting at
all in RC-Immix if there is sufficient space. Ref counting only occurs
during collections, and this sync is important. E.g. allocate from free,
Collect (ref count and create holes), then repeat: allocate from free,
allocate to holes, Collect.. I'm not sure if there are any holes created
by processing a modbuf/newbuf and ref counts going to 0 outside of the
collect phase. The complication is that freeing these objects can't use
ref counts, else it would need to be synced with the ref count; hence it
will just do a GC mark, and you probably don't want these synced with the
ref count, as Nursery sweeps are too frequent.

Note that if these counts are only done during a GC, then IMHO it's not
really a ref-counting GC but merely a standard GC where the ref counting
replaces the mark and card table.
Post by Jonathan S. Shapiro
When allocating from a free block, there is no internal fragmentation, and
if the allocation fails you're going to have to do a full nursery
collection (and coalesce or copy) in any case. Under those conditions, the
do {
Address start = alignAllocationNoFill(cursor, align, offset);
Address end = start.plus(bytes);
if (end.LE(limit)) break;
GC_nursery();
}
cursor = end;
Correct.. you changed the sign on the LE, which got me (a horrible
construct, but they can compare addresses like that and not make it a
numeric type).
Post by Jonathan S. Shapiro
do {
Address start = cursor + 2 * sizeof(uintptr_t); /* GC hdr, vtable */
Address end = start.plus(CompilerRoundsUp(bytes, 8));
if (end.GT(limit) GC_nursery();
}
cursor = end;
Agree. I was thinking the same thing; there is quite a lot of alignment
code, which I wondered about, but it's because Jikes code is uniform
(there are no platform-specific conditionals), so when running on a 386 it
would use align 4.
Post by Jonathan S. Shapiro
And if we institute that alignment requirement on the cursor, then the
/* Pre-probe the available nursery space */
end = cursor + SpaceWeNeed; if (end.GT(limit)) GC_nursery();
/* At this point, we have permission to allocate up to SpaceWeNeed without checking. */
Address start = cursor + 2 * sizeof(uintptr_t);
cursor += StaticObjectSizeAfterRoundup;
Whether you can do static object sizes depends on the integration between
the runtime object creation, allocator, compiler/JIT and app. If it was
all a single same-language blob, then sure.
Post by Jonathan S. Shapiro
And, of course, a lot of this will peephole out if the compiler is aggressive about it. This approach won't work for loops, but it still helps.
Aggressive loop unrolling will help, though.
Note here that I'm *only* talking about nursery allocations! This would hypothetically work in a recycled block, but it would leave big holes in practice.
A hybrid like the Gen-Immix Nursery would not allocate to recycled
blocks, but the sweep would. Allocating to recycled blocks would
complicate the mark/sweep and the references from old to new space. No
point going there...
Post by Jonathan S. Shapiro
Regarding the stack, there is another issue that I hadn't considered: if
Post by Ben Kloosterman
Post by Jonathan S. Shapiro
Post by Ben Kloosterman
Post by Jonathan S. Shapiro
we use an immix block of some form for the stack, we can't rely on the
collector to zero it in the background. We would have needed to zero
reference slots anyway, so it's not like this is a new cost, but the
stack-of-objects approach may change how eagerly we need to do that.
Actually, I don't think it does, but it needs looking at.
I take it we cant rely on the collector to do this because collector
threads have immix block for a stack .
No no. We'll get an initially zero block just fine. The problem is that
the stack pointer is going up and down within the block. As procedures
return, they leave dirty lines "below" the stack. This happens *way* too
rapidly for background zeroing to work.
I was thinking just the intial clear when the block is received as it is
received at the moment dirty..
Zeroing it at allocation rather than release simply shifts the work from
the GC to the mutator. My guess is it doesn't matter from a performance
perspective.
Agree: at the moment it doesn't matter, but if the collector were
concurrent, or free objects were discovered through counts between
collects, then it would.

Ben
Jonathan S. Shapiro
2013-11-22 18:28:20 UTC
Permalink
...if we use a copying nursery (i.e. a hybrid approach), then the nursery
Post by Jonathan S. Shapiro
itself is *always* a free block (perhaps bigger than 32K, but still a
free block).
Correct and note with such a copying Nursery i see no reason to move
objects or defragment...
This question about whether we need to relocate is important, because it
drives a *whole* lot of (eventual) work, and it constrains some aspects of
the source level language design. I'd really like to try to get a better
handle on it.

It would be *really* nice to know how the fragmentation numbers work out
over time.

We have a whole bunch of input that says only 10% of the nursery generally
survives. That number has regime shifts, but it seems to be a sensible
guideline. The survivors get copied into a new generation. Let's call that
G1.

I don't know of good demographics about survival in G1; I suspect it's
higher than 10%, but how much higher? Some of the objects in G1 will have
been young in the nursery. 90% of those will die. But how many is that?

We *do* have various data that suggests G1 immix blocks exist after GC in
which 90% of the objects have died but 10% remain. We don't know the
distribution of those pesky live objects within the block, but it doesn't
seem like an insurmountable number. I don't have any intuition about how
that number changes over multiple recyclings. Problem is: a 90% free block
is not a fully free block, especially if it has bad fragmentation
characteristics.

My guess is that a lot of those 10% occupied blocks arise from regime
shifts. That is: the dead 90% seem likely to be dying in bulk.

If blocks are 32Kb/128b, then a 10% block has something approximating 10
free regions which, on average, are roughly 10 lines (1280 bytes) long. I
suspect that if we manage blocks in a *vaguely* size-partitioned way, we
can probably achieve an average block occupancy of 85% without relocating.
That part, at least, seems pretty reasonable to me. The problem is that we
still aren't going to get any fully free blocks, which means that mid-sized
allocations are going to tend to push us to grow the heap. It *does* seem
possible to me that a *rough* partitioning scheme in which mid-sized
objects go to "big blocks" might sufficiently resolve this, but I
*really* hate to concede a 15% heap inefficiency and an indefinitely
growing heap at the gate.

When I weigh this against the cost of building a whole optimizing compiler
that tracks enough information to do relocating *well*, it's undeniably a
hard choice, and there is a very strong temptation to toss relocation out
the window, at least initially.

But having said that, I'm also mindful of something that Steve got right at
NeXT: he never built a machine with a single-plane display. There are ways
to "cheat" in code with a black-and-white bitmap that you can't do with a
multilayer bitmap, and he wanted to make sure that bad algorithms never got
into the code in the first place.

I think we're in much the same position with respect to relocating
collection. My itchy expectation is that we *do* eventually need to do a
relocating collector, and consequently an optimizing compiler with the
right infrastructure. And I think that means that we need a relocating
option *now* to keep ourselves honest. It doesn't have to be a good
relocating collector, and it doesn't have to be a good optimizer. It
*does* need
to be enough to serve to keep us honest on language and runtime design and
semantics.

In the end, I'm pretty sure that Ben is right, and that relocation is only
necessary in corner cases. For those "stuck" objects, I'd really like to
know what the in-degree is, and I'd like some sense of how read-hot the
containing blocks are once the other 90% of the objects are gone. That
would tell us if a "mostly not copying" design is feasible. At the moment I
don't feel like I have any basis for intuitions on that.

Unfortunately, the only way to get that data appears to be building an
implementation we can measure. As nice and as clean as Jikes is, it is
built to support a language with a qualitatively different set of objects
and a significantly different BCVM semantics available to the compiler. And
we're contemplating making further changes to the BCVM semantics.

Do we know of any comparative data on *Java* programs between CLR and JVM
that would tell us whether, for those programs, the behaviors of the
respective BCVM machines are comparable?

There is a complication, however: right now there is no ref counting at all
in RC-Immix if there is sufficient space. Ref counting only occurs during
collections, and this sync is important. E.g.: allocate from free, collect
(ref count and create holes), then repeat: allocate from free, allocate to
holes, collect. I'm not sure if there are any holes created by processing a
modbuf/newbuf and ref counts going to 0 outside of the collect phase. The
complication is that freeing these objects can't use ref counts, else it
would need to be synced with the ref count; hence it will just do a GC
mark, and you probably don't want these synced with the ref count, as
nursery sweeps are too frequent.
This got too compacted for me to follow. I'm not sure what "sync" you are
referring to. Could you expand on this, but could you please do it in a
separate message of its own? No need to change the subject line; I just
feel like the responses are getting too spread out.

In the end, I *think* you're saying that there is an overall heap threshold
size below which the right strategy is to grow the heap in preference to
doing *any* GC. I'm not clear whether, during this phase, you bother
keeping track of deferred references or not. Tracking them would require
more space, but it probably isn't initially necessary if you are prepared
to rebuild the counts during the first GC (which seems reasonable).

Note that if these counts are only done during a GC then IMHO it's not
really a ref-counting GC but merely a standard GC where the ref counting
replaces the mark and card table.
Or in an alternative view, it's a hybrid scheme. But yes, I agree.
We can very easily impose a rule in the nursery that maintains /start/ at
Post by Jonathan S. Shapiro
a two-word boundary. In that case, for typically aligned allocations, it
reduces to...
Agreed. I was thinking the same thing. There is quite a lot of alignment code which I wondered about, but it's because the Jikes code is uniform (there are no platform-specific conditionals), so when running on a 386 it would use align 4.
Yup.
I think part of the confusion is that you are looking at the general
allocator implementation (the one used by the relocator) rather than the
initial inline allocator implementation. The initial allocator has a lot of
static type information that it can use to partially evaluate a bunch of
this code. The general allocator no longer has access to that, and so has
to actually do the work. :-)
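The inline-versus-general distinction above can be sketched concretely.
This is a minimal illustration, not Jikes or BCVM code: all names are
invented, and the two-word alignment is assumed to be 16 bytes on a 64-bit
target. The point is that when the cursor invariant holds and the static
type fixes the (already rounded) size, only the limit probe survives in the
inline path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { BLOCK = 32 * 1024 };

typedef struct {
    uintptr_t cursor; /* next free byte in the current hole */
    uintptr_t limit;  /* end of the current contiguous hole */
} tla_t;

/* General path: must align and bounds-check at runtime. */
static void *bump_alloc_general(tla_t *t, size_t size, size_t align) {
    uintptr_t p = (t->cursor + (align - 1)) & ~(uintptr_t)(align - 1);
    if (p + size > t->limit) return NULL; /* slow path: fetch a new hole */
    t->cursor = p + size;
    return (void *)p;
}

/* Inline path for a hypothetical 24-byte object: with the cursor kept
   16-byte aligned and the size rounded up to 32 to preserve that
   invariant, the alignment arithmetic constant-folds away and only the
   limit probe remains. */
static void *bump_alloc_inline_24(tla_t *t) {
    if (t->cursor + 32 > t->limit) return NULL;
    void *p = (void *)t->cursor;
    t->cursor += 32;
    return p;
}
```

The general path is what the relocator (and anything without static type
information) has to fall back on.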

And if we institute that alignment requirement on the cursor, then the
Post by Jonathan S. Shapiro
probe hoists to the top of the procedure, and we get...
Whether you can do static object sizes depends on the integration between
the runtime object creation, allocator, compiler/JIT, and app. If it were
all a single-language blob then sure.
Yes. For nursery allocations, at least, it seems very doable.
And, of course, a lot of this will peephole out if the compiler is aggressive about it. This approach won't work for loops, but it still helps.
Post by Jonathan S. Shapiro
aggressive loop unrolling will help though.
Definitely. The issue is a loop that does allocations which outlive the
loop, where you don't know a (low) static bound on the number of loop
iterations. When those conditions hold, you can't preallocate a (possibly
conservative) frame size. It's similar to the conditions under which you
need a frame pointer vs. just the stack pointer. If you prefer to think of
it in C-like terms, think of it as similar to using alloca() inside a loop
in C code.


shap
Ben Kloosterman
2013-11-24 09:45:45 UTC
Permalink
Post by Jonathan S. Shapiro
In the end, I'm pretty sure that Ben is right, and that relocation is only
necessary in corner cases. For those "stuck" objects, I'd really like to
know what the in-degree is, and I'd like some sense of how read-hot the
containing blocks are once the other 90% of the objects are gone. That
would tell us if a "mostly not copying" design is feasible. At the moment I
don't feel like I have any basis for intuitions on that.
I don't disagree with your comments. A few comments.

People looking at Boehm / mark-sweep and uncooperative environments should
definitely look at a non-moving RC-Immix-style collector.

The worst case is not unsustainable growth, because new blocks are not
needed: you can reallocate holes at decreasing mutator efficiency until all
space is used. We do have to be concerned about the fragmentation cost.

For fragmentation, the best case is every object being a multiple of the
line size, in which case there is zero fragmentation; the worst case is
where you have lots of minimum-sized objects (say 32 bytes) and exactly 1
survives, for a 75% fragmentation cost. Also, it's tunable: a smaller line
size has lower fragmentation at a slight cost to mutator performance. I
think the cost is about 10-15%, which could even drop by half for, say, a
64-byte line size; if memory is crucial, e.g. on a mobile device, run a
space-efficient allocator. Either way it's not huge, as many allocators
have some fragmentation cost and in some cases 4-word headers. One thing to
note is that nursery fragmentation is a bigger issue due to the small size
of most short-lived objects.
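The 75% worst case is just line-granularity arithmetic: a line is the unit
of reclamation, so one surviving object pins its whole line, and the
unusable remainder is the fragmentation cost. A minimal sketch, using the
line and object sizes from the text:

```c
#include <assert.h>

/* Percentage of a line wasted when a single object of live_bytes
   survives in a line of line_size bytes. */
static int wasted_percent(int line_size, int live_bytes) {
    return 100 * (line_size - live_bytes) / line_size;
}
```

A 32-byte survivor in a 128-byte line wastes 75% of it; halving the line
size to 64 bytes halves that waste, which is the tuning knob mentioned
above.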

Potentially we could even fill fragmented lines, but the cost of
allocating to these small holes is pretty big; it would require an
allocation list.

The much bigger issue is that RC-Immix has poor performance (compared to
cutting edge) for heaps up to 1.5*, and it isn't great until about a 2*
heap. The reasons are not so much related to copying, but I will restate
for emphasis.
1) Either copying cost is not as efficient as mark-sweep, or no copying has
poor Gen 1 locality (the difference between these is low) (for 1-1.5*
heap performance).
2) More importantly, because of the greater space availability there is far
less work, especially ref counting, so cycles are more dispersed.
Concurrency can help here depending on barrier cost... but it changes the
whole nature of the collector.

The fact that most GC heaps require 1.5+* memory for really good
performance is a bigger issue than 10% fragmentation.

I believe the best initial solution is non-relocating RC-Immix: instrument
it and optimize it, which is much easier to do when you don't need to worry
about barrier costs / mutator non-allocating behaviour.

For v2 I think this can be significantly improved by adding a fixed
nursery (e.g. N blocks which are always used first and swept) for 1-1.5*
heap performance, and it reduces a lot of our fragmentation. This only
requires a cheap write barrier with a fast and slow path: i.e. if the
nursery is the first N blocks, all we need is a conditional on write to see
if the address is < x. Still a cost, but much cheaper than a card table;
actually, since we have a modbuf, we could avoid this and process the
modbuf when the nursery is swept.
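A minimal sketch of that fast/slow-path barrier, assuming a hypothetical
layout in which the nursery occupies the low N blocks of the heap (all
names invented; deduplication and modbuf overflow handling omitted). The
fast path for nursery-resident holders is a single compare:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { NURSERY_BLOCKS = 8, BLOCK_SIZE = 32 * 1024 };

/* If the nursery is the first N blocks, "is this holder old?" is just
   an address compare against the nursery's end. */
static uintptr_t nursery_end = NURSERY_BLOCKS * BLOCK_SIZE;

/* modbuf: remembered set of old objects that had a pointer stored into
   them, processed when the nursery is swept. */
static uintptr_t modbuf[1024];
static size_t modbuf_len;

static void write_barrier(uintptr_t holder, uintptr_t *slot,
                          uintptr_t value) {
    *slot = value;                     /* the store itself */
    if (holder >= nursery_end)         /* fast path: one compare */
        modbuf[modbuf_len++] = holder; /* slow path: log the old object */
}
```

This is the sense in which the conditional is much cheaper than a card
table: no table lookup or dirty-byte store on the common (nursery) path.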

After this we can consider the high-concurrency cases: we can either
resort to non-copying, or have individual mark/sweep nurseries for each
thread/core which need to be synced (as discussed earlier, possibly by
forcing a sweep if a nursery references another nursery's object, or
disallowing it), or use more complex barriers which would also allow the
main heap to copy.
Post by Jonathan S. Shapiro
Unfortunately, the only way to get that data appears to be building an
implementation we can measure. As nice and as clean as Jikes is, it is
built to support a language with a qualitatively different set of objects
and a significantly different BCVM semantics available to the compiler. And
we're contemplating making further changes to the BCVM semantics.
Agreed, the best way is to build it with instrumentation first.
Unfortunately we have no baseline (that's the nice part of Jikes), so we
may need to write a mark-sweep or ref count as a comparison. Though we
could just use malloc as the baseline...
Post by Jonathan S. Shapiro
Do we know of any comparative data on *Java* programs between CLR and JVM
that would tell us whether, for those programs, the behaviors of the
respective BCVM machines are comparable?
There is nothing published I'm aware of. I can only guess at things like
bigger but many fewer objects (which is an issue with the 8K block size /
LOB size).

Ben
Jonathan S. Shapiro
2013-11-24 20:15:20 UTC
Permalink
Post by Ben Kloosterman
People looking at Boehm / mark-sweep and uncooperative environments
should definitely look at a non-moving RC-Immix-style collector.
Unfortunately, the false retention rate for conservative collectors is
astonishingly high, and rises rapidly as a function of heap size (I speak
from *bitter* experience). This can easily turn that 10% stuck object
number into 30%, which is a different set of conditions entirely.

Non-relocation and conservative retention are two different issues.
Conservative collectors, of necessity, can't relocate anything that has
been conservatively marked, but it's the fact that pointer identification
is conservative that leads to the high false retention rate.
Post by Ben Kloosterman
The worst case is not unsustainable growth, because new blocks are not
needed: you can reallocate holes at decreasing mutator efficiency until all
space is used. We do have to be concerned about the fragmentation cost.
I've been wondering about that. My version of this question is: "How many
islands can a 'free' block have and still reasonably be treated as free?"

If we ignore the question of nursery allocation, then I think free blocks
have two purposes:

1. They provide for mid-sized object allocation, and
2. They avoid the need to leave holes in the current TLA's recycled
block.

I suspect it might work out to define a free block as "a block having at
least *k* free regions suitable for mid-sized object allocation".
Allocating from the "free" block might now lead to unused subranges if
allocation requests come in some unfortunate order. When you fall off the
end of the free block, you stick that block in a queue, from which it will
be taken the next time the mutator's TLA requires a recycled block. In this
way the holes you left behind get filled in.
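One way to make "a block having at least *k* free regions suitable for
mid-sized object allocation" concrete is a scan over per-line liveness
marks. This is a hypothetical sketch, not the RC-Immix metadata layout:
assume one mark byte per line, nonzero meaning the line holds live data,
and count maximal free runs of at least min_lines lines.

```c
#include <assert.h>
#include <stddef.h>

/* Count maximal runs of free lines of length >= min_lines. A block
   qualifies as "free enough" when this count reaches k. */
static int count_free_regions(const unsigned char *line_live, size_t lines,
                              size_t min_lines) {
    int regions = 0;
    size_t run = 0;
    for (size_t i = 0; i < lines; i++) {
        if (!line_live[i]) {
            run++;
        } else {
            if (run >= min_lines) regions++;
            run = 0;
        }
    }
    if (run >= min_lines) regions++; /* run ending at the block boundary */
    return regions;
}
```

Raising min_lines is how one would favour contiguous runs for mid-sized
objects rather than the raw count of free lines.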
Post by Ben Kloosterman
For fragmentation, the best case is every object being a multiple of the
line size, in which case there is zero fragmentation; the worst case is
where you have lots of minimum-sized objects (say 32 bytes) and exactly 1
survives, for a 75% fragmentation cost. Also, it's tunable: a smaller line
size has lower fragmentation at a slight cost to mutator performance. I
think the cost is about 10-15%, which could even drop by half for, say, a
64-byte line size; if memory is crucial, e.g. on a mobile device, run a
space-efficient allocator. Either way it's not huge, as many allocators
have some fragmentation cost and in some cases 4-word headers. One thing to
note is that nursery fragmentation is a bigger issue due to the small size
of most short-lived objects.
That's internal fragmentation, and it's not what worries me. What worries
me is *external* fragmentation, which comes from free ranges of lines that
don't get used because of bad luck in allocation sizes and orderings.
Post by Ben Kloosterman
The much bigger issue is that RC-Immix has poor performance (compared to
cutting edge) for heaps up to 1.5*, and it isn't great until about a 2*
heap. The reasons are not so much related to copying, but I will restate
for emphasis.
1) Either copying cost is not as efficient as mark-sweep, or no copying
has poor Gen 1 locality (the difference between these is low) (for
1-1.5* heap performance).
2) More importantly, because of the greater space availability there is
far less work, especially ref counting, so cycles are more dispersed.
Concurrency can help here depending on barrier cost... but it changes the
whole nature of the collector.
The fact that most GC heaps require 1.5+* memory for really good
performance is a bigger issue than 10% fragmentation.
No no. Those 10% islands will likely lead to 15% or 20% fragmentation.
Dropping the heap requirement by 20% is the difference between a 1.5* heap
at a given performance level and a 1.2* heap at the same performance.
That's why this matters so much.

I believe the best initial solution is non-relocating RC-Immix: instrument
Post by Ben Kloosterman
it and optimize it, which is much easier to do when you don't need to worry
about barrier costs / mutator non-allocating behaviour.
Yes. Associated with a precise marking capability.

But in parallel with this we want to do a lame-ass relocating
implementation, because we want to keep ourselves honest at the runtime and
language semantics layers.

Though I'd still do a copying nursery in G1.



Also, the "1.5* heap" metric is a very peculiar metric, because nobody runs
comparable examinations of manually allocated heaps.


shap
Ben Kloosterman
2013-11-25 04:34:32 UTC
Permalink
Post by Jonathan S. Shapiro
Post by Ben Kloosterman
People looking at Boehm / Mark Sweep and uncooperative environments
should definetly look at non moving RC-Immix style collector.
Unfortunately, the false retention rate for conservative collectors is
astonishingly high, and rises rapidly as a function of heap size (I speak
from *bitter* experience). This can easily turn that 10% stuck object
number into 30%, which is a different set of conditions entirely.
Non-relocation and conservative retention are two different issues.
Conservative collectors, of necessity, can't relocate anything that has
been conservatively marked, but it's the fact that pointer identification
is conservative that leads to the high false retention rate.
Interesting, though a lot of the literature says these false references
are very low, and I didn't think it was that bad when using Boehm with
Mono (why is the literature wrong? does it require high virtual memory
addresses to avoid false references?). Anyway, I'm not advocating
non-precise, but they are still better off with RC-Immix.
Post by Jonathan S. Shapiro
Post by Ben Kloosterman
The worst case is not unsustainable growth, because new blocks are not
needed: you can reallocate holes at decreasing mutator efficiency until all
space is used. We do have to be concerned about the fragmentation cost.
I've been wondering about that. My version of this question is: "How many
islands can a 'free' block have and still reasonably be treated as free?"
The rate is configurable and they found 1 was best.
Post by Jonathan S. Shapiro
If we ignore the question of nursery allocation, then I think free blocks
1. They provide for mid-sized object allocation, and
2. They avoid the need to leave holes in the current TLA's recycled
block.
I suspect it might work out to define a free block as "a block having at
least *k* free regions suitable for mid-sized object allocation".
Allocating from the "free" block might now lead to unused subranges if
allocation requests come in some unfortunate order. When you fall off the
end of the free block, you stick that block in a queue, from which it will
be taken the next time the mutator's TLA requires a recycled block. In this
way the holes you left behind get filled in.
Yes, that is the right way... but you will favour groups of contiguous
lines, not the number of free lines, for mid-size.
Post by Jonathan S. Shapiro
Post by Ben Kloosterman
For fragmentation, the best case is every object being a multiple of the
line size, in which case there is zero fragmentation; the worst case is
where you have lots of minimum-sized objects (say 32 bytes) and exactly 1
survives, for a 75% fragmentation cost. Also, it's tunable: a smaller line
size has lower fragmentation at a slight cost to mutator performance. I
think the cost is about 10-15%, which could even drop by half for, say, a
64-byte line size; if memory is crucial, e.g. on a mobile device, run a
space-efficient allocator. Either way it's not huge, as many allocators
have some fragmentation cost and in some cases 4-word headers. One thing to
note is that nursery fragmentation is a bigger issue due to the small size
of most short-lived objects.
That's internal fragmentation, and it's not what worries me. What worries
me is *external* fragmentation, which comes from free ranges of lines
that don't get used because of bad luck in allocation sizes and orderings.
Unused portions of lines are external to the object (so external
fragmentation, yes; headers are internal), but lack of a large enough free
range is an issue. That said, LOBs / malloc have the same issue;
buddy-derived collectors I think have over 30% fragmentation. Malloc could
certainly relocate and defragment/compact, but we don't, as we think the
pain is not worth the memory cost. We do have a worse issue because those
collectors have size-based lists; while we could build a size-based index
in the background for mid-sized blocks, I think it would be
counterproductive outside the mobile space.
Post by Jonathan S. Shapiro
Post by Ben Kloosterman
The much bigger issue is that RC-Immix has poor performance (compared to
cutting edge) for heaps up to 1.5*, and it isn't great until about a 2*
heap. The reasons are not so much related to copying, but I will restate
for emphasis.
1) Either copying cost is not as efficient as mark-sweep, or no copying
has poor Gen 1 locality (the difference between these is low) (for
1-1.5* heap performance).
2) More importantly, because of the greater space availability there is
far less work, especially ref counting, so cycles are more dispersed.
Concurrency can help here depending on barrier cost... but it changes the
whole nature of the collector.
The fact that most GC heaps require 1.5+* memory for really good
performance is a bigger issue than 10% fragmentation.
No no. Those 10% islands will likely lead to 15% or 20% fragmentation.
Dropping the heap requirement by 20% is the difference between a 1.5* heap
at a given performance level and a 1.2* heap at the same performance.
That's why this matters so much.
For really good performance you want extra blocks anyway, guaranteeing
enough space for medium objects (though instead of free blocks we want
mostly-free, which led me to sub-blocks, but mostly-free will do). I'm
perfectly happy saying you need 20-25% more memory for the heap until we
get a nursery happening, and a mark-sweep will do a lot of compaction and
likely halve that. With region analysis pushing things to the stack and
regions, and with objects in the LOB, it's quite possible the heap itself
will be 1/3-1/2 the size, so you're looking at 7-12% of total data storage
cost, and probably half that with a nursery.

Note that by having 8K blocks we push a lot of objects to the LOB (compare
the CLR, where it's 80K)... it is likely this has a higher fragmentation.
Post by Jonathan S. Shapiro
I believe the best initial solution is non-relocating RC-Immix:
Post by Ben Kloosterman
instrument it and optimize it, which is much easier to do when you don't
need to worry about barrier costs / mutator non-allocating behaviour.
Yes. Associated with a precise marking capability.
But in parallel with this we want to do a lame-ass relocating
implementation, because we want to keep ourselves honest at the runtime and
language semantics layers.
Though I'd still do a copying nursery in gen1.
The copy-nursery global pause could do a lame-ass copy of some blocks
identified in the background. A proper region system with analysis is
likely more valuable than non-nursery copying.
Post by Jonathan S. Shapiro
Also, the "1.5* heap" metric is a very peculiar metric, because nobody
runs comparable examinations of manually allocated heaps.
Agree 100%... I nearly started a discussion on this.
- There is no comparison to malloc and its hidden fragmentation (which
could be good or bad).
- Min heap is the "no memory error" level; who decides the actual request
of pages from the kernel? E.g. do most CLR implementations request 2* min
heap? (RC-Immix has a high-water-mark scheme on the chunks; when it
reaches it, it requests more.)
- Except for rarer cycle detection, hybrid GCs should play nicer with the
MM since they don't visit every page.

Ben
Ben Kloosterman
2013-11-25 04:51:10 UTC
Permalink
" buddy derived collectors i think have over 30% fragmentation" obviously
thats worst case..

Ben
Jonathan S. Shapiro
2013-11-25 05:04:57 UTC
Permalink
Post by Ben Kloosterman
Post by Jonathan S. Shapiro
Unfortunately, the false retention rate for conservative collectors is
astonishingly high, and rises rapidly as a function of heap size (I speak
from *bitter* experience). This can easily turn that 10% stuck object
number into 30%, which is a different set of conditions entirely.
Non-relocation and conservative retention are two different issues.
Conservative collectors, of necessity, can't relocate anything that has
been conservatively marked, but it's the fact that pointer identification
is conservative that leads to the high false retention rate.
Interesting, though a lot of the literature says these false references
are very low, and I didn't think it was that bad when using Boehm with
Mono (why is the literature wrong? does it require high virtual memory
addresses to avoid false references?). Anyway, I'm not advocating
non-precise, but they are still better off with RC-Immix.
IIRC, Mono uses Boehm in a hybrid mode. Boehm has a way to allocate objects
with an associated pointer map. Any object that has an associated pointer
map is traced precisely. This reduces the conservative portion of the
tracing to the stack and the registers. You still get false accusals, but
it's from a much smaller pool of candidate words.

But if you stop and think about it, you'll conclude that for any given word
that is examined conservatively, there is a k% chance that this value, when
imagined to be a pointer, falls within an object. That percent k is the
percentage of the VA space that is utilized. As the heap gets to be around
1G on a 4G VA machine, the false hit rate hits a tipping point. Using the
reference maps helps a lot, but it's not perfect.

Actually, the hit rate is slightly worse than the % heap utilization,
because Boehm considers pointers *after* an object to point to the object
too if they fall within a bound. This is to deal with optimizations that
produce pointers that are off the ends of vectors and may be the only
surviving live pointer.
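The tipping-point argument above is simple arithmetic: treating each
conservatively scanned word as an arbitrary value, the chance it "hits" a
live object is roughly the fraction of the VA space occupied by the heap.
A back-of-envelope sketch (the real rate is slightly worse because
off-the-end interior pointers are also accepted):

```c
#include <assert.h>
#include <stdint.h>

/* False-hit probability, in truncated percent, for one conservatively
   scanned word: approximately heap size over VA size. */
static int false_hit_percent(uint64_t heap_bytes, uint64_t va_bytes) {
    return (int)(100 * heap_bytes / va_bytes);
}
```

At a 1G heap on a 4G VA machine, roughly one word in four examined
conservatively looks like a heap pointer, which is the tipping point
described above.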
Post by Ben Kloosterman
The worst case is not unsustainable growth, because new blocks are not
Post by Jonathan S. Shapiro
Post by Ben Kloosterman
needed: you can reallocate holes at decreasing mutator efficiency until all
space is used. We do have to be concerned about the fragmentation cost.
I've been wondering about that. My version of this question is: "How many
islands can a 'free' block have and still reasonably be treated as free?"
The rate is configurable and they found 1 was best.
In the paper, they said that a free block had to be entirely free. If we
are using the term "island" the same way, you are now saying that a block
can have a single contiguous region of surviving allocations and still
usefully be treated as free. Can you point me at where you saw this
discussion? I must have missed it.
Post by Ben Kloosterman
I suspect it might work out to define a free block as "a block having at
Post by Jonathan S. Shapiro
least *k* free regions suitable for mid-sized object allocation".
Yes, that is the right way... but you will favour groups of contiguous
lines, not the number of free lines, for mid-size.
Yes. Though if you then re-use the block as a recycled block that may
become less important.
Post by Ben Kloosterman
Malloc could certainly relocate and defragment/compact, but we don't, as
Post by Jonathan S. Shapiro
we think the pain is not worth the memory cost.
Most current implementations of malloc use a buddy allocator. They coalesce.

Malloc actually *can't* defragment/compact, because it is used in languages
that don't have a pointer map. If you move an object, there is no way to
update the references.

Note that by having 8K blocks we push a lot of objects to the LOB (compare
Post by Ben Kloosterman
the CLR, where it's 80K)... it is likely this has a higher fragmentation.
Who said anything about 8K blocks? I was thinking 32K.

LOB objects have a minimum size of one page and are allocated in page
multiples. Compacting the LOB is *much* easier than compacting the general
heap. That's probably the easiest part of the problem.


shap
William ML Leslie
2013-11-25 05:38:40 UTC
Permalink
Post by Ben Kloosterman
Interesting, though a lot of the literature says these false references
are very low, and I didn't think it was that bad when using Boehm with
Mono (why is the literature wrong? does it require high virtual memory
addresses to avoid false references?). Anyway, I'm not advocating
non-precise, but they are still better off with RC-Immix.
A character I used to know (his name was Benny actually) once said to
me: "Sometimes you just gotta breve".
--
William Leslie

Notice:
Likely much of this email is, by the nature of copyright, covered
under copyright law. You absolutely may reproduce any part of it in
accordance with the copyright law of the nation you are reading this
in. Any attempt to deny you those rights would be illegal without
prior contractual agreement.
Ben Kloosterman
2013-11-24 12:03:19 UTC
Permalink
Post by Jonathan S. Shapiro
This got too compacted for me to follow. I'm not sure what "sync" you are
referring to. Could you expand on this, but could you please do it in a
separate message of its own?
Ref counting only occurs during collections; this sync between the various
ref-counting tasks, like processing the modbuf and processing GC roots
(which are new objects), and standard GC things, like the nursery sweep
and the collect, is important.

The main loop goes like this:
Allocate from free.
Collect (ref count and create holes). Then

repeat:
Allocate from free,
Allocate to holes (holes are n free lines),
Collect...

Right now there is no ref counting at all in RC-Immix if there is
sufficient space; if there is no collect (or few collects) then the
modbufs could be an enormous chain. Also new objects may not be evacuated...

You probably don't want these synced with the whole-heap ref count, as
nursery sweeps are too frequent. So by breaking the synchronization you
are completely breaking out of how RC-Immix works.

E.g. 1: There is a complication when we introduce a (locally allocated,
shared) nursery. While we can use the modbuf to track changes (it's the
write barrier), we need to process it when we sweep the nursery (e.g.
surviving objects are objects referred to from a root, the modbuf, or a
surviving nursery object). We need a way so we don't look at the whole
modbuf (it could be massive), though this is easily done*. Then we copy
the objects (via a short pause) and update references to modbuf objects
and roots. Do we count them now? This is far more frequent than RC-Immix
collections (but we know all the references).

E.g. 2: I'm not sure if there are any holes (free lines) created by
processing a modbuf/newbuf and ref counts going to 0 outside of the
stop-the-world collect phase. I don't think so, and it is desirable, as
holes would open up without a collect. This is hard to do without barriers,
and RC-Immix just settles for doing it during the stop-the-world GC mark
(meaning these recycled blocks are not available).

E.g. 3: When using threads with their own nursery it gets much harder to
sync the counts (unless there are no cross-nursery references).

Ben

* It may be possible to build an index of the modbuf in the background.
This may allow us to reclaim objects whenever we stop the world (e.g.
during a nursery sweep, for non-nursery objects) by processing a list of
objects where the ref count is 0 (I think there is such a list, but I'm
not 100% sure it's updated outside of the collect) and checking modbuf
references against the index. Heck, it may even be possible to have short
pauses and process a fraction of the likely-dead list whenever the index
has been built in the background.
Jonathan S. Shapiro
2013-11-24 21:04:33 UTC
Permalink
Post by Ben Kloosterman
Right now there is no ref counting at all in RC-Immix if there is
sufficient space; if there is no collect (or few collects) then the
modbufs could be an enormous chain. Also new objects may not be evacuated
...
If we assume that we use the allocate-as-dead and lazy-modbuf-insertion
optimizations described on p. 9 of the "Down for the Count?" paper, then
the modbuf is *empty* prior to the first collection. We can even perform
several *non-RC* collections during this phase, as long as surviving
objects continue to be marked N,M (new, logged). At some point we decide to
shift regimes, and the next major collection turns off the (N, M) bits as it
proceeds. This in turn triggers coalesced reference counting that is
performed by the mutator.
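A rough sketch of that barrier, as I read the paper: new objects skip logging entirely, and a mature object is pushed onto the modbuf at most once per epoch, on its first mutation. The bit layout and all names below are mine, not Jikes', so treat this as an illustration of the idea rather than the real implementation:

```java
import java.util.ArrayDeque;

// Sketch of a lazy-modbuf-insertion write barrier.
final class WriteBarrierSketch {
    static final class Header {
        boolean isNew;    // N bit: allocated since the last collection
        boolean isLogged; // M bit: already present in some modbuf
    }

    final ArrayDeque<Header> modbuf = new ArrayDeque<>();

    // Called before any pointer store into 'src'. New objects will be
    // traced at the next collection anyway; a mature object is logged
    // only on its first mutation of the epoch.
    void objectReferenceWrite(Header src) {
        if (!src.isNew && !src.isLogged) {
            src.isLogged = true;
            modbuf.add(src); // logged at most once per epoch
        }
    }
}
```

Under this scheme the modbuf can only grow as fast as the set of distinct mature objects mutated since the last epoch flip, which is why it stays empty before the first collection.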

The only point here is that you can not only imagine a heap using mixed
regimes in different places (RC vs. collection), you can also imagine a
regime shift that occurs between program startup and program steady state.
Given that you need the background collector for cycle detection anyway, it
doesn't feel like it ought to be that much marginal effort.

Whether it's worthwhile is a matter for measurement.

eg 1. There is a complication when we introduce a ( locally allocated
Post by Ben Kloosterman
shared) Nursery...
Confused. How can a nursery be both locally allocated and shared?
Post by Ben Kloosterman
...while we can use the modbuf to track changes ( its the write barrier )
we need to process it when we sweep the nursery...
Yes. In fact, I'm inclined to think that the modbuf should actually be
*allocated* from the nursery, since the lifespans are connected.

We need a way so we don't look at the whole modbuf (it could be massive).
I'm not sure why the modbuf should be massive, given the
lazy-modbuf-insertion optimization. The modbuf only gets into play when a
mature object that has already survived a collection is modified *again*.
What you suggest could certainly be right, but do we have any data?

eg 2, Im not sure if there are any holes ( free lines) created by
Post by Ben Kloosterman
processing a modbuff/newbuff and ref counts going to 0 outside of the stop
the world collect phase. I dont think so and it is desirable as holes open
up without a collect.
I think there should be. Processing the modbuf and ZCT will cause some
object reference counts to go to zero. That in turn causes the
objects-in-line count to decrement. The objects-in-line decrement can be
done eagerly or lazily. My sense is that it should be done lazily, because
we're not really interested in free lines until we reconsider the block as
a candidate for recycling, and we have to make a pass then in any case.
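For concreteness, here is the *eager* variant of the objects-in-line bookkeeping as a sketch (the lazy variant Shapiro prefers would defer this pass until the block is reconsidered for recycling). The line size, the assumption that objects don't span lines, and all names are illustrative:

```java
// Per-line live-object counters for one block: decremented when an
// object's ref count reaches zero, so an empty line surfaces without
// waiting for a full collection.
final class BlockLineMap {
    static final int LINE_SIZE = 256; // illustrative

    private final int[] liveOnLine;

    BlockLineMap(int blockSize) {
        liveOnLine = new int[blockSize / LINE_SIZE];
    }

    void objectAllocated(int offsetInBlock) {
        liveOnLine[offsetInBlock / LINE_SIZE]++;
    }

    // Returns true when the object's line just became empty, i.e. the
    // line is now a candidate hole for recycled allocation.
    boolean objectDied(int offsetInBlock) {
        return --liveOnLine[offsetInBlock / LINE_SIZE] == 0;
    }
}
```

The lazy alternative keeps only per-object death marks and recomputes the line counts in one pass when the block is scanned as a recycling candidate, which is the pass Shapiro notes must happen anyway.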


shap
Ben Kloosterman
2013-11-25 04:52:03 UTC
Permalink
Post by Jonathan S. Shapiro
The only point here is that you can not only imagine a heap using mixed
regimes in different places (RC vs. collection), you can also imagine a
regime shift that occurs between program startup and program steady state.
Given that you need the background collector for cycle detection anyway, it
doesn't feel like it ought to be that much marginal effort.
Whether it's worthwhile is a matter for measurement.
I think it will be, even for non-copying, but yes, we can measure it.
Post by Jonathan S. Shapiro
eg 1. There is a complication when we introduce a ( locally allocated
Post by Ben Kloosterman
shared) Nursery...
Confused. How can a nursery be both locally allocated and shared?
By sharing a fixed block pool that is swept/cleared in unison. One thread
may dominate, but that is fine as long as the pool is proportional to the
number of threads.
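A minimal sketch of that arrangement, assuming threads claim private blocks from one fixed pool that is reset in unison at each nursery sweep (names and sizes are illustrative):

```java
import java.util.concurrent.atomic.AtomicInteger;

// "Locally allocated, shared" nursery: each mutator bump-allocates
// privately inside the blocks it claims from a single shared pool.
final class SharedNurseryPool {
    private final byte[][] blocks;
    private final AtomicInteger next = new AtomicInteger(0);

    SharedNurseryPool(int blockCount, int blockSize) {
        blocks = new byte[blockCount][blockSize];
    }

    // One hot thread may claim many blocks, which is fine as long as
    // the pool is sized proportionally to the thread count.
    byte[] claimBlock() {
        int i = next.getAndIncrement();
        return i < blocks.length ? blocks[i] : null; // null => time to sweep
    }

    // The nursery sweep reclaims every block at once.
    void resetAfterSweep() {
        next.set(0);
    }
}
```

Allocation inside a claimed block needs no synchronization; the only contended operation is the one atomic increment per block claim.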
Post by Jonathan S. Shapiro
Post by Ben Kloosterman
...while we can use the modbuf to track changes ( its the write barrier )
we need to process it when we sweep the nursery...
Yes. In fact, I'm inclined to think that the modbuf should actually be
*allocated* from the nursery, since the lifespans are connected.
We need a way so we don't look at the whole modbuf (it could be massive).
I'm not sure why the modbuf should be massive, given the
lazy-modbuf-insertion optimization. The modbuf only gets into play when a
mature object that has already survived a collection is modified *again*.
What you suggest could certainly be right, but do we have any data?
No data yet; the next phase is to collect it, but a frequently changing
mature object, e.g. a model object, would not be uncommon (though much
less so when you have value types).
Post by Jonathan S. Shapiro
eg 2, Im not sure if there are any holes ( free lines) created by
Post by Ben Kloosterman
processing a modbuff/newbuff and ref counts going to 0 outside of the stop
the world collect phase. I dont think so and it is desirable as holes open
up without a collect.
I think there should be. Processing the modbuf and ZCT will cause some
object reference counts to go to zero. That in turn causes the
objects-in-line count to decrement. The objects-in-line decrement can be
done eagerly or lazily. My sense is that it should be done lazily, because
we're not really interested in free lines until we reconsider the block as
a candidate for recycling, and we have to make a pass then in any case.
I think it should be lazy as well, but it should not wait until a full
collection; once a line is empty the block can be marked for recycling.


Note the key point here: the moment we detach from doing everything during
the collection, we really have a new collector and we have to redesign
lots of behaviour.

Ben
Jonathan S. Shapiro
2013-11-25 05:11:15 UTC
Permalink
Post by Jonathan S. Shapiro
I'm not sure why the modbuf should be massive, given the
Post by Jonathan S. Shapiro
lazy-modbuf-insertion optimization. The modbuf only gets into play when a
mature object that has already survived a collection is modified *again*.
What you suggest could certainly be right, but do we have any data?
No data yet; the next phase is to collect it, but a frequently changing
mature object, e.g. a model object, would not be uncommon (though much
less so when you have value types).
Ah. So you are saying that if the object is write-hot it will end up
getting put in *somebody's* modbuf almost all the time, and this will occur
at approximately the rate of nursery collections if the modbuf lives in the
nursery. Yes. I can see how that might happen.

Really need some trace data to understand how often this happens, though.
Post by Jonathan S. Shapiro
I think there should be. Processing the modbuf and ZCT will cause some
Post by Jonathan S. Shapiro
object reference counts to go to zero. That in turn causes the
objects-in-line count to decrement. The objects-in-line decrement can be
done eagerly or lazily. My sense is that it should be done lazily, because
we're not really interested in free lines until we reconsider the block as
a candidate for recycling, and we have to make a pass then in any case.
I think it should be lazy as well, but it should not wait until a full
collection; once a line is empty the block can be marked for recycling.
But you don't *know* a line is empty until somebody processes the
containing block - which is what the major collection does (or possibly a
generational collection). This part can be done very incrementally, though.
Post by Jonathan S. Shapiro
Note the key point here: the moment we detach from doing everything
during the collection, we really have a new collector and we have to
redesign lots of behaviour.
Yup. Though it's useful to know which things take us over that cliff and
which are easily done within the current conceptual framework.


Jonathan
Ben Kloosterman
2013-11-25 06:42:53 UTC
Permalink
Post by Jonathan S. Shapiro
Post by Jonathan S. Shapiro
I'm not sure why the modbuf should be massive, given the
Post by Jonathan S. Shapiro
lazy-modbuf-insertion optimization. The modbuf only gets into play when a
mature object that has already survived a collection is modified *again*.
What you suggest could certainly be right, but do we have any data?
No data yet; the next phase is to collect it, but a frequently changing
mature object, e.g. a model object, would not be uncommon (though much
less so when you have value types).
Ah. So you are saying that if the object is write-hot it will end up
getting put in *somebody's* modbuf almost all the time, and this will
occur at approximately the rate of nursery collections if the modbuf lives
in the nursery. Yes. I can see how that might happen.
Really need some trace data to understand how often this happens, though.
Yep
Post by Jonathan S. Shapiro
Post by Jonathan S. Shapiro
I think there should be. Processing the modbuf and ZCT will cause some
Post by Jonathan S. Shapiro
object reference counts to go to zero. That in turn causes the
objects-in-line count to decrement. The objects-in-line decrement can be
done eagerly or lazily. My sense is that it should be done lazily, because
we're not really interested in free lines until we reconsider the block as
a candidate for recycling, and we have to make a pass then in any case.
I think it should be lazy as well, but it should not wait until a full
collection; once a line is empty the block can be marked for recycling.
But you don't *know* a line is empty until somebody processes the
containing block - which is what the major collection does (or possibly a
generational collection). This part can be done very incrementally, though.
Yes, I'm talking about doing some of this incrementally, so if you can
free the object between collections (i.e. nursery sweeps) it's pretty
cheap to see if the line is free.

Ben
Ben Kloosterman
2013-11-25 09:55:01 UTC
Permalink
Post by Jonathan S. Shapiro
Post by Jonathan S. Shapiro
I'm not sure why the modbuf should be massive, given the
Post by Jonathan S. Shapiro
lazy-modbuf-insertion optimization. The modbuf only gets into play when a
mature object that has already survived a collection is modified *again*.
What you suggest could certainly be right, but do we have any data?
No data yet; the next phase is to collect it, but a frequently changing
mature object, e.g. a model object, would not be uncommon (though much
less so when you have value types).
Ah. So you are saying that if the object is write-hot it will end up
getting put in *somebody's* modbuf almost all the time, and this will
occur at approximately the rate of nursery collections if the modbuf lives
in the nursery. Yes. I can see how that might happen.
Really need some trace data to understand how often this happens, though.
Yep
Post by Jonathan S. Shapiro
Post by Jonathan S. Shapiro
I think there should be. Processing the modbuf and ZCT will cause some
Post by Jonathan S. Shapiro
object reference counts to go to zero. That in turn causes the
objects-in-line count to decrement. The objects-in-line decrement can be
done eagerly or lazily. My sense is that it should be done lazily, because
we're not really interested in free lines until we reconsider the block as
a candidate for recycling, and we have to make a pass then in any case.
I think it should be lazy as well, but it should not wait until a full
collection; once a line is empty the block can be marked for recycling.
But you don't *know* a line is empty until somebody processes the
containing block - which is what the major collection does (or possibly a
generational collection). This part can be done very incrementally, though.
Yes, I'm talking about doing some of this incrementally, so if you can
free the object between collections (i.e. nursery sweeps) it's pretty
cheap to see if the line is free (unless you don't have stop-the-world
nurseries).

Ben

Jonathan S. Shapiro
2013-11-24 19:57:26 UTC
Permalink
For those "stuck" objects, I'd really like to know what the in-degree is,
and I'd like some sense of how read-hot the containing blocks are once the
other 90% of the objects are gone. That would tell us if a "mostly not
copying" design is feasible.
That was important, and much too elided. Let me unpack it.

If the last 10% of objects in a block have low in-degree (i.e. the refcount
hasn't hit "top"), and it isn't *write* hot (I said read-hot above, but
that's a mistake), then we can relocate the objects *even* in a collector
that is otherwise non-relocating. Here's how:

   1. At start of GC, we take note of the blocks we would like to clear.
   We *read and write protect* these blocks if the mutator is concurrent,
   mirroring at some other address so that relocation can proceed.
2. As we rebuild reference counts, we keep track of the slots containing
pointers into evacuating blocks.
3. We forward the objects to new locations.
4. We patch the locations containing in-bound pointers. What's left is
copies that may have been made by the mutators.
   5. Halt each mutator in turn long enough to forward its current frame,
install a stack barrier, and collect the nursery, forwarding references as
we go. This flushes the RC deferred update table, which has the side effect
of forwarding any pointers to the old location that may have been stored to
tenured space since we forwarded the object.
6. Resume that mutator.

Hmm. No, that's not quite it, because a reference could bounce back and
forth between two CPUs via the heap and this won't necessarily catch that
case. Darn. I really hate to put a forwarding check in the store barrier...

OK. I don't have it yet.
Ben Kloosterman
2013-11-25 04:49:05 UTC
Permalink
Some more ideas (I will think more deeply when I implement it and
understand all the limitations):

1. You could isolate cases where objects have a low number of references;
this creates a very narrow scope of change. E.g. if an object only has one
reference you can put a guard on the source object (or even just the 4K
around the field) and move it. This will prevent it going onto the stack
or anywhere else.
2. Use non-closed heap areas (assuming the heap can be divided; note they
don't need to be closed, but we do need to avoid frequent changes, i.e.
identify loops/recursions) and note which mutators have access to which
regions (e.g. when they change, it is noted in a similar manner to ref
counting, e.g. lazy release, so frequent changes don't incur a huge
cost). If the from/to reference is not in one of those areas for current
mutators then they can't move; mutators will stall/spin when a region
they request is unavailable due to a move. Basically a low-cost (CPU
time, not dev time) form of locking, trying to take advantage of the fact
that some threads have very narrow allocations and don't really need to
be stopped.
3. If you are using a mark-sweep nursery you have global pauses (at least
for the stack check, even if using a non-shared nursery). These may be
small but provide a good opportunity to move the worst offenders.

Ben


Hmm. No, that's not quite it, because a reference could bounce back and
Post by Jonathan S. Shapiro
forth between two CPUs via the heap and this won't necessarily catch that
case. Darn. I really hate to put a forwarding check in the store barrier...
OK. I don't have it yet.
_______________________________________________
bitc-dev mailing list
http://www.coyotos.org/mailman/listinfo/bitc-dev
Ben Kloosterman
2013-11-18 11:54:22 UTC
Permalink
Post by Jonathan S. Shapiro
BTW the Immix code has a concept called Chunks which is just N blocks.
Post by Ben Kloosterman
Basically a bunch of pages .. but it has some metadata.
More on this later , im still exploring it.
Look forward to hearing about it.
Things I note:
- The heap is a list of chunks.
- Chunks seem to hold the metadata for the blocks.
- It is a Jikes structure.
- Chunks are 4M.
- Chunks, lines and blocks are all 2^N structures with masks.
- For systems with pages < 32K, there are 128 blocks in a chunk.
- Blocks are allocated from chunks and marked in-use / recycled.
- There is a high-water mark, used for getting more memory.

Bits I don't understand for now:
- Discontiguous chunks.
- Blocks being freed releasing pages.
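The payoff of the all-2^N layout is that the containing chunk/block and the line index fall out of plain shifts and masks. A sketch, using illustrative sizes (4M chunks, 32K blocks, 256-byte lines) rather than necessarily the Jikes constants:

```java
// Mask-based address arithmetic over power-of-two heap structures.
final class HeapGeometry {
    static final int LINE_BITS  = 8;   // 256-byte lines
    static final int BLOCK_BITS = 15;  // 32K blocks
    static final int CHUNK_BITS = 22;  // 4M chunks

    // Base address of the containing chunk / block: clear the low bits.
    static long chunkBase(long addr) { return addr & ~((1L << CHUNK_BITS) - 1); }
    static long blockBase(long addr) { return addr & ~((1L << BLOCK_BITS) - 1); }

    // Line index within the containing block: shift out the line offset,
    // then mask to the lines-per-block range.
    static int lineInBlock(long addr) {
        return (int) ((addr >>> LINE_BITS) & ((1 << (BLOCK_BITS - LINE_BITS)) - 1));
    }
}
```

With these constants a chunk holds 2^(22-15) = 128 blocks, matching the "128 blocks in a chunk" note above.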
Post by Jonathan S. Shapiro
Post by Ben Kloosterman
At first i thought I had some options here..
1) Use the last page as a guard .. the fault though expensive works out
at roughly 200+ Cycles ( on x86_64) and a 256K block can have somewhere
between 32-16000 objects on average the cost here is low. ( Setting the
pages can be done in the background)
2) Divide the block into fixed sizes with meta data at each division but
allocate them together for the bump allocator ( this is important as you
dont want to go to the block allocator to often) .
3) Use a lookup in FS /GS for the current block size
But all this is rather silly when checking the code, as it allocates on
lines, so it needs to store a byte count for the number of bytes in the
line (to check overflow), and the number of bytes in a line is fixed.
Changing the number of lines in a block to a variable is far less of an
issue, either via just a line# variable or combined with option (2). There
is also a number-of-lines-allocated counter, so there are a few options
around this.
I think (2) won't help because you have metadata in the middle of the
stack. You end up having to do all of the call-time guards to avoid overrun
that you would need for a segmented stack. But you are right that this
check is subsumed by the bump limit pointer check. The only real difference
is that the stack grows down while most blocks will grow up, but that's
just a sign change on some of the comparisons.
Yep, definitely won't work for the stack. It was only for memory blocks.
Post by Jonathan S. Shapiro
You make it sound as if the current RC-immix mechanism is computing
bytes-allocated-in-line explicitly. That surprises me. Don't the low-order
bits of the bump pointer already encode that?
Changing the (static) size of the block isn't a problem, except that at
some point you are going to hit architectural costs imposed by short
immediate values (for masking) in the underlying instruction set. It's
better if you don't have to do multiple instructions to build the mask
value, but as long as the size is statically known the mask is just a
constant value.
But if the size of the chunk is *dynamic*, you need some instructions to
look up the size so that you can determine where the metadata is and decide
what mask to use. You could certainly turn the block base pointer into
thread-local data, and you could store bits-to-mask in the first (otherwise
unused) byte of the metadata, but it still seems like more instructions to
run in the inline path. This is a marginal (new) computation in the inline
path, and I'm already somewhat concerned about the number of instructions
in the inline path.
I'm making a bunch of assumptions here, though. Since you have actually
looked at how this thing works, can you give pseudo-code for the fast path
of the bump allocator?
One thing I'm particularly curious about: does the fast path check the
limit pointer each time an allocation occurs, or are limit checks across
multiple allocations combined in the obvious way?
Posted below, since I talk about limit at the end. It is needed for
finding holes. Possibly you could use a different allocator for new
(constant) vs recycled (variable) blocks, but my gut is the alignment
code will cost a lot more than the variable. For RC-Immix now, multiple
block sizes have no cost to the fast allocator (you just set a different
limit address in the slow path).
Post by Jonathan S. Shapiro
If a block is split it can be zeroed far more efficiently . That said this
Post by Ben Kloosterman
applies more to an Immix style collector than Rc-Immic because with ref
counting you can zero when freeing.
You can, but you really don't want do. The zeroing speed is one of the
under-appreciated advantages of a bump allocator. Zeroing long chunks can
be done with (cache line zero) on a lot of modern machines. That's hugely
faster than word-at-a-time can do it.
Does ARM/x86 have it? I know IBM POWER does.

I note zeroing is done by the mutator, when a new block is received by an
allocator or when a new hole is found (e.g. zero lines N to M). This is
efficient enough and provides no gains from having sub-blocks.
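In Java terms, "zero lines N to M" is one bulk fill over the hole rather than word-at-a-time stores; the JIT/runtime is then free to lower it to wide or cache-line-granular zeroing. A small sketch with an illustrative line size:

```java
import java.util.Arrays;

// Mutator-side hole zeroing: clear just the lines of a freshly found
// hole in one bulk operation.
final class HoleZeroer {
    static final int LINE_SIZE = 256; // illustrative

    // Zero lines [firstLine, lastLine) of the block.
    static void zeroLines(byte[] block, int firstLine, int lastLine) {
        Arrays.fill(block, firstLine * LINE_SIZE, lastLine * LINE_SIZE, (byte) 0);
    }
}
```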
Post by Jonathan S. Shapiro
According to the paper, that's not what the second block is for. The
second block gets used when the object won't fit in the current range of
the first block. If you just skip over holes that are too small, you can
end up with a lot of unused ranges.
Actually that is correct; there was a check against "limit", which is
normally equal to the block size, but buried away in the Immix code
(which RC-Immix inherits from) there is a call in getlines which drops
the limit. I still think you should use that block instead of fetching a
new one (unless it is mostly full).

When allocating new blocks, this spare block is only used for the end of
a block. When processing recycled blocks (which is done once all new
blocks are used), this block will be used a lot.


Code for the fast path. I don't think pseudocode is needed.

// Fast path is straightforward; the rest is not.

  @Inline
  public final Address alloc(int bytes, int align, int offset) {
    /* establish how much we need */
    // BK: tons of alignment code everywhere.
    Address start = alignAllocationNoFill(cursor, align, offset);
    Address end = start.plus(bytes);

    /* check whether we've exceeded the limit */
    // BK: limit is either the block end or the end of the hole (the hole
    // is cursor..limit); this is the slow path and it gets complicated
    // quickly.
    if (end.GT(limit)) {
      if (bytes > BYTES_IN_LINE) {
        return overflowAlloc(bytes, align, offset);
      } else {
        // BK: look for a new hole or get a new block.
        return allocSlowHot(bytes, align, offset);
      }
    }

    /* sufficient memory is available, so we can finish performing the
       allocation */
    // BK: does nothing much except write a known alignment byte into
    // unaligned dead space. Probably useful for debugging.
    fillAlignmentGap(cursor, start);
    cursor = end;

    // BK: obviously the return value is used to place the object.
    return start;
  }


// This gets called by the runtime, which calls the above. There is a
// constant for the large object size.

  @Override
  @Inline
  public Address alloc(int bytes, int align, int offset, int allocator, int site) {
    switch (allocator) {
      case RCBase.ALLOC_DEFAULT:
        return rc.alloc(bytes, align, offset);
      case RCBase.ALLOC_NON_MOVING:
      case RCBase.ALLOC_CODE:
      case RCBase.ALLOC_LOS:
      case RCBase.ALLOC_PRIMITIVE_LOS:
      case RCBase.ALLOC_LARGE_CODE:
        return rclos.alloc(bytes, align, offset);
      case RCBase.ALLOC_IMMORTAL:
        return super.alloc(bytes, align, offset, allocator, site);
      default:
        VM.assertions.fail("Allocator not understood by RC");
        return Address.zero();
    }
  }


Ben
Jonathan S. Shapiro
2013-11-18 15:24:50 UTC
Permalink
Post by Ben Kloosterman
Post by Jonathan S. Shapiro
BTW the Immix code has a concept called Chunks which is just N blocks.
Post by Ben Kloosterman
Basically a bunch of pages .. but it has some metadata.
More on this later , im still exploring it.
Look forward to hearing about it.
Things i note
- The Heap is a list of chunks
- Chunks seem to hold the metadata for the blocks.
- It is a Jikes structure
- Chunks are 4Meg
- Chunks , Lines and Blocks are all 2^N structures with Masks .
- For systems with pages < 32K , 128 blocks in a Chunk.
- Blocks are allocated from chunks and marked in use / recycled.
- There is a high water mark ..so used for getting more memory .
Bits i dont understand for now
- Discontiguous chunks
- Blocks being freed release pages.
The way you write this leads me to ask: is per-line metadata being stored
in the block, or in the chunk?

Also, you write "chunks are 4M" and then "chunks are 128 blocks". Which
happens to be correct if blocks are 32K, but I suspect the 4M number
reflects the intent better - it matches the large-page page table entry
span on most architectures. At 4M, it's also perfectly reasonable for a
32-bit implementation to allocate a 1Kbyte table for chunk metadata, making
frame type discovery easier.
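That 1K table is easy to picture: with 4M chunks a 32-bit address space holds 2^32 / 2^22 = 1024 chunks, so one flat array maps any address to its chunk's metadata with a single shift and load. A sketch (the "kind" byte and all names are illustrative):

```java
// Flat chunk-metadata table for a 32-bit address space with 4M chunks.
final class ChunkTable {
    static final int CHUNK_BITS = 22; // 4M chunks

    // 2^(32-22) = 1024 entries, one per possible chunk.
    final byte[] chunkKind = new byte[1 << (32 - CHUNK_BITS)];

    void setKind(long addr, byte kind) {
        chunkKind[(int) (addr >>> CHUNK_BITS)] = kind;
    }

    // Frame-type discovery: one shift, one load.
    byte kindOf(long addr) {
        return chunkKind[(int) (addr >>> CHUNK_BITS)];
    }
}
```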

Releasing the backing pages when a block is freed makes good sense - if you
don't release the pages, you continue to have swap space occupancy.
Post by Ben Kloosterman
You can, but you really don't want do. The zeroing speed is one of the
Post by Jonathan S. Shapiro
under-appreciated advantages of a bump allocator. Zeroing long chunks can
be done with (cache line zero) on a lot of modern machines. That's hugely
faster than word-at-a-time can do it.
Does ARM/X86 have it ? I know IBM Power do .
X86 does, and I *think* Arm does too.

Thank you for digging in to the fast path allocation. I didn't anticipate
the /cursor/ variable, but it's otherwise very much what I expected. It's
not clear to me why they are bothering to fill the alignment gap. And come
to that, I guess I'm a little surprised that they aren't using a common
alignment for nearly everything. But very little of that really matters.
The code makes sense.


Jonathan