Discussion:
Carmack : GC, functional, STM, concurrent, imperative hybrid
David Jeske
2013-08-04 06:04:08 UTC
Permalink
John Carmack has a bunch of juicy PL bits in his 2013 QuakeCon keynote. I'm
going to post a larger summary, but I want to split this segment out, since
I think it might be the most discussion-worthy...

Below is a direct link to a segment where he describes a "research"
game-coding structure idea, based on his unfinished research project
porting Wolfenstein 3d to Haskell.



In a nutshell, he hypothesizes an explicitly two-phase memory structure,
which is a curious hybridization of GC, pure-functional immutability, STM,
and imperative mutability.

I think what he's trying to do is get a system which has some of the
benefits of functional immutability and STM, without the overhead of the
extra copies. To do this, he wants to intertwine the GC's copy/compact with
establishing immutability. In a sense, the explicit triggering of the GC is
like a global STM commit. In this model, all code runs at
mutable/imperative speed, and the only copies are the ones that already
happen because of compacting GC.

It sounds interesting, since we want GC but struggle with unpredictable
timing and want STM/immutability but struggle with the cost.

----

Here is a more complete description of my understanding of his idea..

An explicit "per game frame" compacting collector is responsible not only
for GC, but also for "phase shifting" data. The last phase is not
immediately discarded, but is kept around for one extra cycle, shifted into
self-referential immutable data.

Code-components (entities) are allowed to see mutable pointers to their own
data, but only read-only pointers to the *previous phase* of data in all
other entities in the system. Using this structure, one can spin up a
thread-per entity, knowing they can't clobber each-other.

Cross-communication between entities is handled by message-passing
partially bound functions which entities apply to themselves during the
next processing phase.
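A minimal sketch of my reading of the idea (all names hypothetical; the real
thing would fold the phase shift into a compacting GC's relocation rather than
deep-copying, and would actually run entities on threads):

```python
import copy

class Entity:
    def __init__(self, state):
        self.current = state         # mutable: only this entity writes it
        self.previous = dict(state)  # read-only snapshot of the last phase
        self.inbox = []              # partially bound functions, applied next phase

class World:
    def __init__(self, entities):
        self.entities = entities

    def step(self):
        # Each entity mutates only its own state and may only *read* other
        # entities' previous-phase snapshots, so entities could safely run
        # one-thread-per-entity (shown sequentially here).
        for name, e in self.entities.items():
            for msg in e.inbox:      # apply messages sent last phase
                msg(e.current)
            e.inbox = []
            e.current["ticks"] = e.current.get("ticks", 0) + 1
        # "Phase shift": the analogue of the per-frame GC / global commit.
        # Current state becomes the immutable previous phase.
        for e in self.entities.values():
            e.previous = copy.deepcopy(e.current)

    def send(self, target, fn):
        # Cross-entity communication: enqueue a partially bound function
        # that the target applies to itself during the next phase.
        self.entities[target].inbox.append(fn)

world = World({"player": Entity({"hp": 100}), "imp": Entity({"hp": 20})})
world.send("imp", lambda s: s.update(hp=s["hp"] - 7))  # player attacks imp
world.step()
print(world.entities["imp"].previous["hp"])  # 13
```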

It's not clear whether entities are only allowed borrowed pointers to the
immutable-phase (assuring they can't hold them past the phase shift), or
whether the GC has a two-phase fixup, where it updates references one phase
behind.
David Jeske
2013-09-19 02:50:18 UTC
Permalink
Fishing around for work on STM+GC integration, I found these references,
which seem at least remotely related to Carmack's brainstorm..

"Synchronization phases (to speed up transactional memory)" - this is the
closest research I can find.

http://www.tik.ee.ethz.ch/file/82b807b0317988c73dbe6025e3aeb4c3/TIK-Report-340.pdf

"Concurrent GC leveraging Transactional Memory" - while this is general
write-contention STM, not Carmack's no-write-contention "phase flipping"
concept, they shared some of the copying and barriers between the STM and
GC subsystems. They achieve low average pause times and per-thread pausing
rather than stop-the-world. However, they have no bound on worst-case pause
times, which in their test was still >100ms.

http://dl.acm.org/citation.cfm?id=1345238

PyPy STM + GC experiments:

http://morepypy.blogspot.com/2013/06/stm-on-drawing-board.html
http://morepypy.blogspot.com/2013/07/software-transactional-memory-lisp.html
Sandro Magi
2013-09-19 03:03:58 UTC
Permalink
Optimistic GC will never fly. Abort rates are too high, and the very
possibility of aborts makes reasoning about programs with side-effects
infeasible.

The only way forward IMO, is either a pessimistic STM which restores
reasoning and composition for transactions [1], or preferably,
declaratively stating the commutative properties of concurrent variables
so concurrency is fully deterministic and inherently eventually
consistent [2]. One of their earliest papers on concurrent revisions
ported a game and demonstrated promising near-linear speedups [3].
They've since extended it to include transparent parallel and
incremental computation [4], all using the same API.

It's the most promising API for scaling mutable state I've seen to date.

Sandro

[1] http://mcg.cs.tau.ac.il/papers/transact2012-pessimistic.pdf
[2] http://research.microsoft.com/en-us/projects/revisions/
[3] http://research.microsoft.com/apps/pubs/default.aspx?id=132619
[4] http://research.microsoft.com/apps/pubs/default.aspx?id=150180
_______________________________________________
bitc-dev mailing list
http://www.coyotos.org/mailman/listinfo/bitc-dev
Jonathan S. Shapiro
2013-09-19 04:48:20 UTC
Permalink
Optimistic GC will never fly. Abort rates are too high...
"Optimistic", in this context, is a code word for "speculative".

That being said, it's worth looking harder at things that will "never fly".
Sandro Magi
2013-09-19 04:59:46 UTC
Permalink
Hah, I totally wrote the wrong word. I meant "optimistic STM will never
fly", and I think there's ample evidence for that conclusion at this point.

Sandro
William ML Leslie
2013-09-19 05:58:54 UTC
Permalink
Post by Sandro Magi
Optimistic GC will never fly. Abort rates are too high, and the very
possibility of aborts makes reasoning about programs with side-effects
infeasible.
Abort rates can be high, and this probably will be a scalability
problem in the future. I don't really know what you're saying about
side-effects, though; in most STM systems, transactions can't do IO
unless they are inevitable.
Post by Sandro Magi
The only way forward IMO, is either a pessimistic STM which restores
reasoning and composition for transactions [1],
[1] http://mcg.cs.tau.ac.il/papers/transact2012-pessimistic.pdf
Splork!

For the unprepared: the abstract for this paper describes a technique
such that programming with it

"is logically as simple as with locks"
--
William Leslie

Notice:
Likely much of this email is, by the nature of copyright, covered
under copyright law. You absolutely may reproduce any part of it in
accordance with the copyright law of the nation you are reading this
in. Any attempt to deny you those rights would be illegal without
prior contractual agreement.
Sandro Magi
2013-09-19 06:53:01 UTC
Permalink
The workarounds for I/O in optimistic STM are exactly the problem. They tried to solve it by essentially eliminating the optimism without bringing the conveniences of real pessimism, namely direct expression of programs with I/O.

And your quote of the paper is misleading and out of context. The "convenience of locks" is specifically referring to the I/O problem, not transactional memory. Optimistic STMs are far less convenient to work with than locks when it comes to I/O. An STM that provides transactional memory while allowing you to express direct I/O with no chance of aborts is a huge win over the current state of the art.

Sandro
William ML Leslie
2013-09-30 12:06:53 UTC
Permalink
Post by Sandro Magi
The work arounds for I/O in optimistic STM are exactly the problem. They tried to solve it by essentially eliminating the optimism without bringing the conveniences of real pessimism, namely direct expression of programs with I/O.
I think I would phrase 'the workarounds for I/O in ... STM' as 'unlike
memory, optimistic rules for alias detection don't apply generally to
IO', and consequently, they deserve unique treatment and finer-grained
control. There are many cases where IO can be decomposed into
interactions with different services, but this is not always the case.
For most STM systems, if they even provide a mechanism for defining
distinct services (that is, a way to describe the implied partial
order on IO effects or separate them out into different monads), it
usually lives outside the language.

I don't think these are problems with existing AME scheduling
mechanisms, rather, they are a problem with a lack of specification of
the various algebrae that comprise IO.

---

Nevertheless, what Carmack seems to be describing is GC (with card
tables, I guess). He wants more control, but like many, doesn't
quite seem to grasp what that control needs to be. There are
allocation patterns where lifetime is almost statically determined by
control flow, programmers can see these clearly, but sometimes have
trouble describing what these patterns are in a composable way. Such
thinking is where systems like Rust and linear types come from, but
neither feels like it describes the full picture.
--
William Leslie

david j
2013-09-30 15:19:07 UTC
Permalink
Sandro, thanks for posting those links [3,4]. I have to agree with you,
this is the most promising abstraction for structuring fine grained
parallelism I have seen to date.

I'm suspicious it may be very hard to keep the overhead near their observed
5% for less trivial data structures, but it is worth a shot!
Post by William ML Leslie
Nevertheless, what Carmack seems to be describing is GC (with card
tables, I guess).
I don't think so. I think he is describing a two-revision GC integrated
version of [3], where the indirection cost is removed by replacing their
COW and version lookup with a full copy integrated with the GC relocate.

[3] http://research.microsoft.com/apps/pubs/default.aspx?id=132619
[4] http://research.microsoft.com/apps/pubs/default.aspx?id=150180
Post by William ML Leslie
He wants more control, but like many, doesn't
quite seem to grasp what that control needs to be.
That is not what I see. I see a very specific desire to synchronize GC
sweeps with frame rendering - letting standard gc handle lifetime. (No
static lifetime analysis)

However, I suspect [3] may be dramatically more efficient, capable, and
flexible ... Just so long as GC stop the world is sufficiently minimized.
Sandro Magi
2013-10-01 00:33:51 UTC
Permalink
Post by david j
Sandro, thanks for posting those links [3,4]. I have to agree with
you, this is the most promising abstraction for structuring fine
grained parallelism I have seen to date.
I'm suspicious it may be very hard to keep the overhead near their
observed 5% for less trivial data structures, but it is worth a shot!
A properly implemented language founded on CR for mutable state could
totally achieve within 10% overhead, IMO. Since CR variables are implemented via
thread-local arrays, the overhead CR adds to reads and writes is a
pointer offset + an additional pointer indirection (which adds 5-10%
overhead last I checked up on Brooks-style read barriers); keep a
pointer to the block of thread-local storage in a register to reduce
overhead as much as possible, and copy-on-write optimized by the
compiler only when needed, and 5-10% seems reasonable.

A library approach to CR is obviously less efficient, but the CLR on
which they're conducting their experiments retains runtime types, so
thread-local state is also type-indexed, allowing it to be much more
efficient once code generation stabilizes. IIRC, their 5% figure
consists mostly of variables of atomic/immutable types, like int32,
float32, etc. so cloning overheads are low. However, you can efficiently
albeit conservatively check at runtime whether a type is immutable [1],
so if you ensure your objects are largely immutable, or you ensure that
all mutable state is encapsulated in Versioned<T> instances, then that
overhead could also be relatively small.

The LVars paper I just posted about is also an interesting point in the
design space that I'm still looking into.

Sandro

[1]
http://sourceforge.net/p/sasa/code/ci/default/tree/Sasa.Dynamics/Type.cs#l79
david j
2013-10-01 01:48:50 UTC
Permalink
Post by Sandro Magi
A properly implemented language founded on CR for mutable state could
totally achieve within 10% IMO.
I can see how this is true if all your versioned elements are scalars or
reference types, but I don't see how this could possibly be true in the
general case of any larger sequential data-structure.

For example, consider looping over a large value type array. The
non-versioned value type array is going to be large cache-efficient
sequential memory access. If we have to version writes to this data, we
have to decide between either (a) expensively copying the entire array
because we changed one entry, or (b) dealing with indirection overhead for
every value of the array (which is going to cost a lot more than 10% vs a
sequential scan).

Even a large class-object of scalars turned into versioned scalars has
comparably disastrous cache effects if you access many fields of the
object.

Do you see what I was getting at now?

I've been considering whether it's reasonable to build a value-type
implementation of the versions container. Of course it would have a
fixed-size number of in-place slots after which it would need to use a
pointer to 'overflow', but I think this would be better than adding a
pointer indirection to every value type.
david j
2013-10-01 01:50:44 UTC
Permalink
...or I suppose (c) implementing a versions-aware array container, which
does some kind of change-overlay deltas (like their list implementation).
Sandro Magi
2013-10-01 13:31:58 UTC
Permalink
Post by Sandro Magi
A properly implemented language founded on CR for mutable state
could totally achieve within 10% IMO.
I can see how this is true if all your versioned elements are scalars
or reference types, but I don't see how this could possibly be true in
the general case of any larger sequential data-structure.
For example, consider looping over a large value type array. The
non-versioned value type array is going to be large cache-efficient
sequential memory access. If we have to version writes to this data,
we have to decide between either (a) expensively copying the entire
array because we changed one entry, or (b) dealing with indirection
overhead for every value of the array (which is going to cost a lot
more than 10% vs a sequential scan).
Even a large class-object of scalars turned into versioned scalars has
comparably disastrous cache effects if you access many fields of
the object.
1. I don't think they're nearly as disastrous as you're implying, for
the simple reason that the Versioned<T> instance is allocated right
after its containing class, which means it's very likely to be in the
same cache line. Indirection is then comparable to a Brooks-style read
barrier when the forwarding pointer is pointing to itself, which has
been empirically validated at less than 10%.

2. You're assuming a Versioned<T>[], which permits efficient update of
individual cells, but if you're more frequently doing batch updates over
the whole array, you're better off doing Versioned<T[]>, which
aggregates all the individual cloning ops into a single cloning op.
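A sketch of that Versioned<T[]> point in Python terms (names are mine, not the
library's): one clone per revision on first write, however many cells you touch,
instead of a per-cell indirection:

```python
class VersionedArray:
    """Toy Versioned<T[]>: snapshot the whole array once per revision on
    first write, aggregating all per-cell cloning into a single clone."""
    def __init__(self, data):
        self._committed = list(data)
        self._working = None          # lazily cloned on first write

    def __getitem__(self, i):
        src = self._working if self._working is not None else self._committed
        return src[i]

    def __setitem__(self, i, v):
        if self._working is None:
            # One clone per revision, however many cells get written:
            self._working = list(self._committed)
        self._working[i] = v

    def commit(self):
        # The "join"/phase end: working copy becomes the committed version.
        if self._working is not None:
            self._committed = self._working
            self._working = None

a = VersionedArray([0] * 8)
a[3] = 42
a[5] = 7          # no second clone: batch updates share one snapshot
a.commit()
print(a[3], a[5])  # 42 7
```

Reads between writes stay sequential over plain memory, which is the
cache-friendliness the array-scanning objection was about.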

A language based on CR can be even more aggressive, since 1. it can
identify individual points where mutation occurs and thus avoid full
clones, and 2. it can allocate value type Versioned<T> instances as you
suggest below.
Post by david j
Do you see what I was getting at now?
I've been considering whether it's reasonable to build a value-type
implementation of the versions container. Of course it would have a
fixed-size number of in-place slots after which it would need to use a
pointer to 'overflow', but I think this would be better than adding a
pointer indirection to every value type.
Yes it's possible in some cases [1], but it must be fully encapsulated,
and management of the thread-local slot's lifetime must be pushed up to
the slot container, instead of being provided by the slot itself.

Sandro

[1] Basically, anywhere the CLR allows a private mutable field will
permit a value type VersionedField<T>, but the instance must never
escape the container, and the container must implement a finalizer which
returns the slot to the pool (basically reproducing the finalization
work of the Versioned<T> class type). I've implemented this in a private
prototype since the same thread-local logic and lifetime management
applies to instance-specific ThreadLocal<T> variables, so I plan to
reuse it in a few different contexts.

Sandro Magi
2013-10-01 00:10:54 UTC
Permalink
Post by William ML Leslie
I think I would phrase 'the workarounds for I/O in ... STM' as 'unlike
memory, optimistic rules for alias detection don't apply generally to
IO',
Two points:

1. Alias detection in memory doesn't even work that well; STM often
relies on timeouts to avoid deadlock because ordering lock acquisition
is too costly.

2. The predominant I/O to files is a historical artefact of Multics and
the C programming language it eventually spawned. In principle, disk is
just another level of the memory hierarchy, and we can push this all the
way down to the OS, as the orthogonal persistence in KeyKOS, EROS/CapROS
and L4 have shown (and as E and Waterken show within safe language
environments). Any properties we expect of in-memory objects thus also
apply to I/O-level objects; we just hack around it via file system
abstractions.

I agree that some of these problems can be mitigated as you suggest, i.e.
with more rigorous specification of IO, but why should I bother when
other solutions don't have this problem and further provide even more
compelling properties?

Furthering point 2 above, concurrent revisions and other related
approaches [1] provide just such a unified view of all levels of a
memory hierarchy. I recently came across [2] which purports to provide a
more formal account of precisely why concurrent revisions and Bloom
provide better fundamental abstractions. LVars are generalizations of
futures permitting multiple assignments (monotonically increasing as
determined by a user-specified lattice), thus automatically providing
high availability and eventual consistency in the presence of partitions.
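To make the LVar idea concrete, here is a toy LVar over the max lattice on
integers (illustrative only; the paper's formulation is in Haskell and far more
general): writes are joins, so concurrent writers commute, and only blocking
threshold reads are allowed, which keeps the outcome deterministic:

```python
import threading

class MaxLVar:
    """Toy LVar: the value only climbs the lattice (here, integer max),
    so the final state is independent of write interleaving."""
    def __init__(self, bottom=0):
        self._value = bottom
        self._changed = threading.Condition()

    def put(self, v):
        with self._changed:
            if v > self._value:       # least-upper-bound update
                self._value = v
                self._changed.notify_all()

    def get_threshold(self, t):
        # Threshold read: block until the value has climbed to at least t.
        # Exact-value reads are forbidden, since they would expose timing.
        with self._changed:
            while self._value < t:
                self._changed.wait()
            return True

lv = MaxLVar()
threads = [threading.Thread(target=lv.put, args=(v,)) for v in (3, 9, 5)]
for t in threads:
    t.start()
lv.get_threshold(9)       # deterministic: eventually observes the join
for t in threads:
    t.join()
print(lv._value)  # 9, regardless of scheduling
```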

These are the sorts of properties that are actually important to
express. I'm not impressed with my concurrency library imposing
arbitrary restrictions on my IO when there are more important properties
I need to concern myself with.

Sandro

[1] http://www.bloom-lang.net/calm/
[2] http://www.cs.indiana.edu/~lkuper/papers/lvars-fhpc13.pdf