Discussion: Retrospective Thoughts on BitC
David Jeske
2012-04-09 10:32:03 UTC
Forgive me for being under-educated in PL research relative to some of the
audience here...

http://www.coyotos.org/pipermail/bitc-dev/2012-March/003300.html

Shapiro wrote:
Jonathan S. Shapiro
2012-04-09 23:11:02 UTC
The reason I sought out BitC is that I'm looking for solutions not to
program-expression, but to software modularity....
The two are not entirely separable. One requirement for modularity is that
the language does not enforce overspecialization through insufficient
expressiveness.
Some observations...
2) type-inference doesn't fix this... type inference systems are merely a
technique for software brevity
I don't entirely agree. Type inference is relevant because programmers are
notoriously bad at specifying "most general" types.
3) subtyping doesn't fix this... it merely creates more challenges in both
modular forward compatibility and the performance of subtyping across
module boundaries
It may. But it is also an excellent basis for encapsulation of existential
type, which is very closely tied to modularity issues. On the other hand,
it's not the only candidate solution for that.
4) type-classes don't fix this... they are a PL or PL-implementation
feature and are not solving a modularity problem
Also don't agree here. While it doesn't have to be type classes, having
some mechanism for expressing constraints on types is very useful for
modularity.
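
To make "constraints on types" concrete, here is a tiny sketch using C++20
concepts as a stand-in (not type classes, and certainly not BitC syntax; the
Describable constraint and the Point type are invented purely for
illustration):

#include <concepts>
#include <string>

// The constraint a generic module states about the types its clients supply:
// T must be equality-comparable and must provide a to_string() member.
template <typename T>
concept Describable = std::equality_comparable<T> && requires(const T& t) {
    { t.to_string() } -> std::convertible_to<std::string>;
};

// A generic "module" written purely against the constraint, never against a
// concrete implementation.
template <Describable T>
std::string describe_if_equal(const T& a, const T& b) {
    return a == b ? "equal: " + a.to_string() : std::string("differ");
}

// A client type that satisfies the constraint without naming it anywhere.
struct Point {
    int x, y;
    bool operator==(const Point&) const = default;
    std::string to_string() const {
        return "(" + std::to_string(x) + "," + std::to_string(y) + ")";
    }
};

The payoff for modularity is that the compiler checks the contract at the
boundary; the generic code and its clients can evolve independently as long
as the constraint continues to hold.
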
IMO, if we want to move past C-shared-libraries, the solution we need is a
model for type-safe high-performance loadable software components with
upgradability (manageable forward compatibility).
I'd certainly be very happy to see that, but I think we need to take one
step at a time. We still don't know how to do modularity in the *absence* of
versioning issues.
Microsoft Singularity is quite interesting research in this direction, as
it is attempting to eclipse the C-shared-library.
The "whole program" idea in Singularity rests on a claim that David Tarditi
made about a process he called "tree shaking". Tree shaking, in brief, is a
process for removing unreachable code from programs. David claimed, without
supporting data or careful study, that the binary size reduction obtained
from tree shaking was greater than the size reduction obtained from shared
libraries.

There are two problems with this claim:

1. It disregards the fact that the two optimizations are orthogonal. The
ability to remove unreached code does not reduce the value of gathering
*reused* code in a common place.
2. The metric of interest isn't the size reduction in a single program,
but the total code footprint across the system as a whole (that is: across
*many* programs). The tree shaking approach results in a system where
*most* programs will duplicate a certain body of code that is commonly
used. That's the code that belongs in shared libraries.

Another way to say this is that what you really need is "whole system" tree
shaking rather than "whole program" tree shaking, and there are scaling
issues with that.
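
For readers who haven't seen the term: tree shaking is just reachability
analysis over the program's call/reference graph. A toy sketch in C++ (the
graph representation is invented for illustration):

#include <map>
#include <set>
#include <string>
#include <vector>

// Everything not returned here is "shaken out" of the final binary.
std::set<std::string> reachable(
    const std::map<std::string, std::vector<std::string>>& calls,
    const std::string& entry) {
    std::set<std::string> seen{entry};
    std::vector<std::string> work{entry};
    while (!work.empty()) {
        std::string fn = work.back();
        work.pop_back();
        auto it = calls.find(fn);
        if (it == calls.end()) continue;           // leaf or external function
        for (const std::string& callee : it->second)
            if (seen.insert(callee).second)        // newly discovered
                work.push_back(callee);
    }
    return seen;
}

Run per program, this is "whole program" shaking; run over every program plus
the libraries they share, it becomes the "whole system" version, which is
where the scaling problem bites.
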

I believe that subsequent practical experience may have caused David to
revise his opinion on this.
a,b,e,f,h), (MSIL: a,b,c,d,e,f,h,i,j), but are still both missing a
critical missing link to replace C-shared libraries... "e" (i.e.
deterministic soft-real-time performance), making them unsuitable for
layered subsystems. (because worst-case GC pauses are unacceptably large
both in large-heaps and layered small-heaps)
I'm not sure why you say that for layered small heaps, and I'm fairly
convinced that it is wrong for large heaps *provided* concurrent collection
is used. Unfortunately, concurrent collector technology hasn't been widely
deployed.


shap
David Jeske
2012-04-10 00:54:20 UTC
Post by Jonathan S. Shapiro
The reason I sought out BitC is that I'm looking for solutions not to
program-expression, but to software modularity....
The two are not entirely separable. One requirement for modularity is that
the language does not enforce overspecialization through insufficient
expressiveness.
Agreed.
Post by Jonathan S. Shapiro
Some observations...
2) type-inference doesn't fix this... type inference systems are merely a
technique for software brevity
I don't entirely agree. Type inference is relevant because programmers are
notoriously bad at specifying "most general" types.
True. Sather's (non-inference-based) notion that one can always implement a
type-compatible object without implementation inheritance (every type is an
interface) seemed to go further than rigid implementation-inheritance class
models. However, it does seem we can go further still with structural
compatibility inference (like Google Go is attempting) rather than
named-type compatibility. I don't know if they have a model for separate
compilation and upgradability, or if one is possible.
Post by Jonathan S. Shapiro
IMO, if we want to move past C-shared-libraries, the solution we need is a
model for type-safe high-performance loadable software components with
upgradability (manageable forward compatibility).
I'd certainly be very happy to see that, but I think we need to take one
step at a time. We still don't know how to do modularity in the *absence* of
versioning issues.
This I'm surprised at.

In the absence of versioning issues, don't most dynamic language runtimes
handle modularity? Java/JVM, C#/CIL, Python, Ruby, Javascript... all can
have interfaces specified (whether checked by a compiler or not) and load
compatible classes at runtime.

As far as I can see, C-shlibs and CIL both (to some degree) have at least
some capability to handle multi-versioned dependencies, because both of
them record the required dependency version in a way that is visible to the
runtime loader/linker. Further, they can both load multiple versions of the
same shlib/DLL. Compare this with Java, Python, Javascript, etc., which
don't record any such information to make available to the runtime, nor can
the runtime handle more than one version of a module/class loaded, because
of name clashes.

Do we have a different definition of modularity? Ignoring the limits of
program expression, and simply thinking about how to get past the
C-warzone... I'd be satisfied if I could have badly overspecified interfaces
between DLLs and actually load two modules without a version type-clash. I
think C-shlibs and CIL both give me this. Do they not?
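
As a concrete, if simplified, illustration of what I mean on the C-shlib
side (assuming ELF sonames and POSIX dlopen; the library and symbol names
are invented):

#include <dlfcn.h>
#include <cstdio>

int main() {
    // Two sonames, so the loader treats them as distinct objects even though
    // they are two versions of the "same" library.
    void* v1 = dlopen("libfoo.so.1", RTLD_NOW | RTLD_LOCAL);
    void* v2 = dlopen("libfoo.so.2", RTLD_NOW | RTLD_LOCAL);
    if (!v1 || !v2) { std::fprintf(stderr, "%s\n", dlerror()); return 1; }

    // Each handle resolves symbols against its own version.
    using VersionFn = const char* (*)();
    auto f1 = reinterpret_cast<VersionFn>(dlsym(v1, "foo_version"));
    auto f2 = reinterpret_cast<VersionFn>(dlsym(v2, "foo_version"));
    if (f1 && f2) std::printf("loaded %s and %s\n", f1(), f2());

    dlclose(v2);
    dlclose(v1);
    return 0;
}
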
Post by Jonathan S. Shapiro
There are two problems with this claim: [regarding Microsoft Singularity
'tree shaking']
1. It disregards the fact that the two optimizations are orthogonal.
The ability to remove unreached code does not reduce the value of gathering
*reused* code in a common place.
2. The metric of interest isn't the size reduction in a single
program, but the total code footprint across the system as a whole (that
is: across *many* programs). The tree shaking approach results in a
system where *most* programs will duplicate a certain body of code
that is commonly used. That's the code that belongs in shared libraries.
I see and agree fully with your view here. I wasn't aware of this 'tree
shaking'. When I referenced Singularity I was thinking of the attempt to
provide isolation between modules via a mechanism which resembles the type
of brute-force message-passing we do today (between kernel and user or on
socket interfaces), but without the copying... (i.e. Singularity's
immutable object transfer model). I admit it was many years ago I read this
paper, so I may be remembering it incorrectly.
Post by Jonathan S. Shapiro
a,b,e,f,h), (MSIL: a,b,c,d,e,f,h,i,j), but are still both missing a
critical missing link to replace C-shared libraries... "e" (i.e.
deterministic soft-real-time performance), making them unsuitable for
layered subsystems. (because worst-case GC pauses are unacceptably large
both in large-heaps and layered small-heaps)
I'm not sure why you say that for layered small heaps, and I'm fairly
convinced that it is wrong for large heaps *provided* concurrent
collection is used. Unfortunately, concurrent collector technology hasn't
been widely deployed.
I am unaware of a so-named "concurrent" collector which does not have an
unacceptable stop-the-world pause of the old-generation (I believe for
initial-mark). Certainly those in modern Java and C# have unacceptable and
unpredictable worst-case pause times. The only research I'm aware of which
attempts to solve this problem is Microsoft STOPLESS, CHICKEN
<http://www.cs.technion.ac.il/~erez/Papers/real-time-pldi.pdf>, and
hardware-assisted Pauseless GC
<http://static.usenix.org/events/vee05/full_papers/p46-click.pdf> by Azul.
None of these are available in generally useful x86 implementations.

In real systems we expect to be able to respond to user actions within
10-20ms. This is not possible to do reliably with today's GC systems. In C
manual-management or ref-counted systems, expensive work can be "hidden"
from the user by either manually amortizing it or performing it when users
are not waiting. This is simply impossible in a GC system which
unpredictably stops the world. This in turn makes stop-the-world GC
impossible to use for many systems.
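
A sketch of the kind of manual amortization I mean, in C++ (names invented;
the only point is that reclamation work is bounded per call and scheduled
when the user isn't waiting):

#include <cstddef>
#include <deque>
#include <memory>

template <typename T>
class DeferredReclaimer {
public:
    // Ownership of a dead object is parked here instead of freeing it inline.
    void retire(std::unique_ptr<T> dead) { pending_.push_back(std::move(dead)); }

    // Called from the idle portion of the event loop: at most 'budget'
    // destructions per call, so no single call can cause a visible hitch.
    void drain_some(std::size_t budget = 64) {
        while (budget-- > 0 && !pending_.empty())
            pending_.pop_front();   // destroys (frees) exactly one object
    }

private:
    std::deque<std::unique_ptr<T>> pending_;
};
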

The problem of stop-the-world pauses is exacerbated by threading and
layered systems. Much like cumulative MTBF, when many subsystems with
unpredictable worst-case pauses are overlaid, the systemic worst-case pause
is additive. In a threaded heap, every thread is delayed by the world-stop.
We simply have to stop stopping the world.

It's all too easy to dismiss these pauses as insignificant, but in real
situations in real systems, we are avoiding GC every day because of this
issue. It must be fixed for GC-based systems to replace C-shlibs.
Jonathan S. Shapiro
2012-04-10 01:07:50 UTC
Post by David Jeske
I am unaware of a so-named "concurrent" collector which does not have an
unacceptable stop-the-world pause of the old-generation (I believe for
initial-mark). Certainly those in modern Java and C# have unacceptable and
unpredictable worst-case pause times. The only research I'm aware of which
attempts to solve this problem is Microsoft STOPLESS, CHICKEN
<http://www.cs.technion.ac.il/~erez/Papers/real-time-pldi.pdf>, and
hardware-assisted Pauseless GC
<http://static.usenix.org/events/vee05/full_papers/p46-click.pdf> by Azul.
None of these are available in generally useful x86 implementations.
The STOPLESS collector and the Azul collector were the two I was thinking
about.
Post by David Jeske
In real systems we expect to be able to respond to user actions within 10-20ms.
This is not possible to do reliably with today's GC systems.
I think that overstates the requirement in some cases and grossly
*understates* it in others - in audio applications, the proper target is
0.7ms to 1ms for certain inputs.

But it is generally my experience that programs with truly hard
requirements of this kind have phases in which they apply and phases in
which they do not. If that is true, then the real problem is less the cost
of GC than the inability to write GC-free subprograms. What seems to have
happened is that the safe programming world, having bought off on the idea
that GC was an inevitable requirement, then declared that no other form of
memory handling was interesting. This, of course, was an issue that BitC
was trying to address explicitly.
Post by David Jeske
This is simply impossible in a GC system which *unpredictably* stops the
world. This in turn makes stop-the-world GC impossible to use for many
systems.
I don't like stop-the-world collectors either, but that "unpredictably"
word in your sentence is critical. The only collectors I know of that are
really unpredictable in this way are collectors for concurrent programs
that require a universal thread rendezvous.
Post by David Jeske
We simply have to stop stopping the world.
That *really* needs to go in your signature. :-)
Post by David Jeske
It's all too easy to dismiss these pauses as insignificant, but in real
situations in real systems, we are avoiding GC every day because of this
issue. It must be fixed for GC-based systems to replace C-shlibs.
I hope you don't feel that I was dismissing them. I tend to agree with you,
though I do think that a language along the lines of what BitC was doing
probably offers a range of exploitable middle positions. Certainly doesn't
solve the problem entirely, but maybe enough for a lot more applications to
be able to make the jump to safe runtimes.


shap
David Jeske
2012-04-10 10:11:51 UTC
Post by Jonathan S. Shapiro
The STOPLESS collector and the Azul collector were the two I was thinking
about.

I would be very happy if either were available and usable on x86. I would
very much appreciate your opinion on a thought of mine...

The general technique in STOPLESS of using dual-machine-code versions seems
challenging. I've wondered if it would be possible to do a reasonable
facsimile of the Azul hardware-assisted barriers with the help of the x86
MMU and phantom writes (or reads). My hope is to allow compiled code to
carry "optional barriers" with fairly low inactive-cost and then use the
MMU/TLB to make those barriers active when needed. This would allow us to
convert stop-the-world into slow-the-world, which I think would be a
significant advancement.
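
To make the idea a little more concrete, here is the rough shape of what I'm
imagining -- a completely untested sketch against POSIX mmap/mprotect/SIGSEGV,
with every name invented:

#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>

// One shared "barrier page". Mutator code always performs a phantom write to
// it; while the page is writable that store is nearly free.
static volatile char* g_barrier_page = nullptr;

static void barrier_fault_handler(int, siginfo_t*, void*) {
    // Real barrier work (e.g. recording the mutated reference) would go here.
    // Unprotect so the faulting store can be restarted and complete.
    mprotect(const_cast<char*>(g_barrier_page), getpagesize(),
             PROT_READ | PROT_WRITE);
}

void install_barrier_page() {
    g_barrier_page = static_cast<char*>(
        mmap(nullptr, getpagesize(), PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    struct sigaction sa = {};
    sa.sa_sigaction = barrier_fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, nullptr);
}

// Emitted next to every mutator store: the "optional barrier".
inline void optional_write_barrier() { *g_barrier_page = 0; }

// Collector side: one protection change flips every barrier "on"; mutators
// that hit it are slowed (they trap), not stopped.
void activate_barriers() {
    mprotect(const_cast<char*>(g_barrier_page), getpagesize(), PROT_NONE);
}
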

I have no practical experience with MMU/TLB coding, so I'm not sure if the
above theory works out in practice. Do you have an opinion on this? Does it
seem at all possible/practical?
Post by Jonathan S. Shapiro
Post by David Jeske
In real systems we expect to be able to respond to user actions within 10-20ms.
This is not possible to do reliably with today's GC systems.
Post by Jonathan S. Shapiro
I think that overstates the requirement in some cases and grossly
understates it in others - in audio applications, the proper target is
0.7ms to 1ms for certain inputs.
Post by Jonathan S. Shapiro
But it is generally my experience that programs with truly hard
requirements of this kind have phases in which they apply and phases in
which they do not. If that is true, then the real problem is less the cost
of GC than the inability to write GC-free subprograms.

My experience is less forgiving.

Consider any reasonably large GUI application -- a word-processor, computer
game, or a 3d graphics modeling program. There is no acceptable time to
stop the world because one can't predict when the user will interact next.
Moving most of the data out of the view of the GC is a solution, but then
we're right back to non-safety.

The same lack of forgiveness is present in threaded web applications. There
is normally no time that all threads are simultaneously in a pausable
activity. I did hear about one GC based system that solved this problem by
using an external load balancer to remove traffic from the process in
advance of a GC, so the system was under no load during the world-stop. Not
a very generalizable solution.
Post by Jonathan S. Shapiro
What seems to have happened is that the safe programming world, having
bought off on the idea that GC was an inevitable requirement, then declared
that no other form of memory handling was interesting. This, of course, was
an issue that BitC was trying to address explicitly.

I confess to being fully in the GC camp. I'm there not through foregone
conclusion, but through decades of watching the safe/unsafe tradeoffs in
manual/refcounting/gc techniques.

I see no way to write 'real' apps (a game, word processor dom, web browser
dom, 3d modeler, caching webserver) without heap mutability, and no way to
provide typesafe heap mutability without GC.

If there is a general pattern for solving these problems in a typesafe way
without GC, I'd love to be shown the light.

In the absence of that, I'm very happy with the direction of the CIL design
(relative to JVM or those before it) because value-type-structs and
parametric-type-instantiation offer us practical ways to reduce traceable
pointers and thus pressure on the GC. That said, we still need a way to
stop stopping the world.
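
In C++ terms (only an analogy, since C++ has no GC; the Particle type is
invented) the difference in memory shape that value-type structs buy you
looks like this:

#include <cstddef>
#include <memory>
#include <vector>

struct Particle { float x, y, z, vx, vy, vz; };

// Value-type storage: one contiguous allocation, cache-friendly, and zero
// pointers for a collector to trace.
std::vector<Particle> make_flat(std::size_t n) {
    return std::vector<Particle>(n);
}

// "Boxed" storage: n + 1 allocations and n traceable pointers to chase.
std::vector<std::unique_ptr<Particle>> make_boxed(std::size_t n) {
    std::vector<std::unique_ptr<Particle>> v;
    v.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        v.push_back(std::make_unique<Particle>());
    return v;
}
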
Post by Jonathan S. Shapiro
Post by David Jeske
This is simply impossible in a GC system which unpredictably stops the
world. This in turn makes stop-the-world GC impossible to use for many
systems.
Post by Jonathan S. Shapiro
I don't like stop-the-world collectors either, but that "unpredictably"
word in your sentence is critical. The only collectors I know of that are
really unpredictable in this way are collectors for concurrent programs
that require a universal thread rendezvous.

I do not understand the significance of your observation. AFAIK, every
currently shipping software collector, for every language system, requires
an unpredictable world-stop. If I am mis-informed, please point me to one
that does not.

The MS Research work on dual code paths with STOPLESS and CHICKEN is the
only non-world-stop software collector design I'm aware of. It is not
available, and the dual-code-path requirement is quite difficult to satisfy.

If my proposal to use MMU/TLB hardware-assist to get low-cost
optional-barriers is at all viable, I'm very much interested in putting
that idea into practice.
Bennie Kloosteman
2012-04-10 11:51:35 UTC
Post by David Jeske
The same lack of forgiveness is present in threaded web applications.
There is normally no time that all threads are simultaneously in a pausable
activity. I did hear about one GC based system that solved this problem by
using an external load balancer to remove traffic from the process in
advance of a GC, so the system was under no load during the world-stop. Not
a very generalizable solution.
Actually it is fairly general, as the same can be done for any
message-pump-based service - a runtime can manage and load-balance them.
This can work for many things except where there is a shared resource. You
can then try to place those resources in a single service with either a
small or a static heap.



Ben
Jonathan S. Shapiro
2012-04-10 17:09:17 UTC
Post by David Jeske
Post by Jonathan S. Shapiro
But it is generally my experience that programs with truly hard
requirements of this kind have phases in which they apply and phases in
which they do not. If that is true, then the real problem is less the cost
of GC than the inability to write GC-free subprograms.
My experience is less forgiving.
Consider any reasonably large GUI application -- a word-processor,
computer game, or a 3d graphics modeling program. There is no acceptable
time to stop the world because one can't predict when the user will
interact next. Moving most of the data out of the view of the GC is a
solution, but then we're right back to non-safety....
So I agree that the behavior you describe exists, but I do *not* agree that
it is any sort of impediment to use. The occasional, rare, 20ms pause in an
application like this just isn't a big deal
Post by David Jeske
The same lack of forgiveness is present in threaded web applications.
There is normally no time that all threads are simultaneously in a pausable
activity.
A general stop isn't actually required. What's required is for all threads
to enter a new allocation regime before the GC can start. So basically,
you're complaining that the existing GCs are excessively naive and
therefore aren't meeting your needs. Where concurrent programs are
concerned, I tend to agree.

So what should we *do* about it?
Post by David Jeske
Post by Jonathan S. Shapiro
What seems to have happened is that the safe programming world, having
bought off on the idea that GC was an inevitable requirement, then declared
that no other form of memory handling was interesting. This, of course, was
an issue that BitC was trying to address explicitly.
I confess to being fully in the GC camp. I'm there not through foregone
conclusion, but through decades of watching the safe/unsafe tradeoffs in
manual/refcounting/gc techniques.
But that's *exactly* the false dichotomy that the PL world lives in. You
assume that the only alternative is manual memory management. The thing is:
you are mistaken. There are two other, very useful, alternatives:

1. Language support for unboxing, which reduces the number of heap
allocations and permits stack allocation
2. First-class regions, which permit swathes of objects to be eliminated
simultaneously.
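
For concreteness, C++17's polymorphic allocators give a rough feel for the
region idea -- this is just an off-the-shelf analogy, not a BitC design:
objects are carved out of one arena, and the whole swath is released in a
single operation.

#include <memory_resource>
#include <string>
#include <vector>

void region_example() {
    // The "region": a monotonic arena backed by one upfront buffer.
    std::pmr::monotonic_buffer_resource region(64 * 1024);

    // Every element (and its string storage) is carved out of the region.
    std::pmr::vector<std::pmr::string> lines(&region);
    for (int i = 0; i < 1000; ++i)
        lines.emplace_back("line data");

    // Leaving scope (or calling region.release()) frees the whole swath at
    // once -- no per-object reclamation, and nothing for a tracing GC to do.
}
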

Post by David Jeske
I see no way to write 'real' apps (a game, word processor dom, web browser
dom, 3d modeler, caching webserver) without heap mutability, and no way to
provide typesafe heap mutability without GC.
But the argument here isn't about removing GC. It's about combining GC with
other *safe* techniques so that (a) the load on the GC is reduced, and (b)
the triggering of GC can be predicted, and (c) the triggering of major
collections can be avoided
Post by David Jeske
In the absence of that, I'm very happy with the direction of the CIL
design (relative to JVM or those before it) because value-type-structs and
parametric-type-instantiation offer us practical ways to reduce traceable
pointers and thus pressure on the GC.
Yes and no. The value types in CIL remain second class, the absence of a
fixed-length array type is crippling to some very important cases, and the
atomicity rules in the presence of concurrency defeat a whole bunch of
inlining options.
Post by David Jeske
Post by Jonathan S. Shapiro
I don't like stop-the-world collectors either, but that "unpredictably"
word in your sentence is critical. The only collectors I know of that are
really unpredictable in this way are collectors for concurrent programs
that require a universal thread rendezvous.
I do not understand the significance of your observation. AFAIK, every
currently shipping software collector, for every language system, requires
an unpredictable world-stop. If I am mis-informed, please point me to one
that does not.
As far as I know, there are *no* shipping collectors whose GC trigger
conditions are unpredictable. In every case, GC is an optional side effect
of allocation. The conclusion, then, is that we need the ability to
construct allocation-free subprograms.
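
A trivial sketch of what an allocation-free subprogram looks like in
practice (C++ here, and the audio example is invented): allocate everything
up front, then guarantee the hot phase allocates nothing, so in a GC'd
runtime the collector would simply never be triggered.

#include <cstddef>
#include <vector>

class AudioRenderer {
public:
    explicit AudioRenderer(std::size_t max_frames)
        : scratch_(max_frames), mix_(max_frames) {}   // all allocation up front

    // The real-time phase: touches only pre-allocated storage, so it cannot
    // trigger allocation (and therefore cannot trigger collection).
    void render(const float* input, float* output, std::size_t frames) {
        for (std::size_t i = 0; i < frames && i < mix_.size(); ++i) {
            scratch_[i] = input[i] * 0.5f;
            mix_[i] = scratch_[i] + output[i];
            output[i] = mix_[i];
        }
    }

private:
    std::vector<float> scratch_;
    std::vector<float> mix_;
};
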


shap
Pal-Kristian Engstad
2012-04-10 18:00:02 UTC
On 4/10/2012 10:09 AM, Jonathan S. Shapiro wrote:
So I agree that the behavior you describe exists, but I do *not* agree that
it is any sort of impediment to use. The occasional, rare, 20ms pause in an
application like this just isn't a big deal
That depends on the application. Games, as an example, would never tolerate
a 20 msec pause. The application has to produce images at 30 frames per
second, or sometimes at 60 frames per second. In a 30 FPS game, you have
1,000 [msec] / 30 [frames] = 33.33.. [msec/frame] to respond to user input,
perform game logic, prepare rendering information and then produce an image.
As you can clearly see, 20 msec is *huge*, it is 60% of a frame!

I would think that a lot of real-time and soft real-time systems would have
similar reasons to abandon any system that paused the program for such a
long period of time.

PKE
Bennie Kloosteman
2012-04-11 07:53:56 UTC
On Wed, Apr 11, 2012 at 2:00 AM, Pal-Kristian Engstad <
Post by Jonathan S. Shapiro
So I agree that the behavior you describe exists, but I do *not* agree
that it is any sort of impediment to use. The occasional, rare, 20ms pause
in an application like this just isn't a big deal
That depends on the application. Games, as an example, would never
tolerate a 20 msec pause. The application has to produce images at 30
frames per second, or sometimes at 60 frames per second. In a 30 FPS game,
you have 1,000 [msec] / 30 [frames] = 33.33.. [msec/frame] to respond to
user input, perform game logic, prepare rendering information and then
produce an image. As you can clearly see, 20 msec is *huge*, it is 60% of a
frame!
I would think that a lot of real-time and soft real-time systems would
have similar reasons to abandon any system that paused the program for such
a long period of time.
3D is the hardest GC case because
1) It uses a very large active 1G heap (and GC times for 1+G heaps are not
pretty)
2) It is bound to a HW device, to which many parts of the heap communicate

I'm not sure one frame missed every half hour is a big deal; if it is an
issue, you can move the vertex buffers and some basic movement code into a
separate service with its own non-GC thread and use either ref counting OR
static memory management on those buffers. On 2+ cores with a
multi-threaded graphics runtime this is likely to work better anyway.

Managed DirectX seems to be OK from what I hear (warning: hearsay) with the
new background GC, though no one has really pushed it for a big game.

Speaking of Midori, WinRT (which has some of the Midori ideas, I noticed)
uses ref counting for some of its internals, and it's likely a new 3D
runtime will use ref counting while the app uses a GC (or C++), as above.

Ben
Geoffrey Irving
2012-04-12 00:07:01 UTC
Post by Bennie Kloosteman
On Wed, Apr 11, 2012 at 2:00 AM, Pal-Kristian Engstad
So I agree that the behavior you describe exists, but I do not agree that
it is any sort of impediment to use. The occasional, rare, 20ms pause in an
application like this just isn't a big deal
That depends on the application. Games, as an example, would never
tolerate a 20 msec pause. The application has to produce images at 30 frames
per second, or sometimes at 60 frames per second. In a 30 FPS game, you have
1,000 [msec] / 30 [frames] = 33.33.. [msec/frame] to respond to user input,
perform game logic, prepare rendering information and then produce an image.
As you can clearly see, 20 msec is *huge*, it is 60% of a frame!
I would think that a lot of real-time and soft real-time systems would
have similar reasons to abandon any system that paused the program for such
a long period of time.
3D is the hardest GC case because
1) It uses a very large active 1G heap (and GC times for 1+G heaps are not
pretty)
2) It is bound to a HW device, to which many parts of the heap communicate
It's usually not nearly as bad as a general 1G heap, since most of the
memory should be in the form of large blocks of binary data which
don't need to be scanned for pointers.

Geoffrey
Jonathan S. Shapiro
2012-04-10 19:26:35 UTC
Crap! This was supposed to go to the list too.

Pal: I imagine that you are using "reply-all" instead of "reply". Please
don't. Replies to bitc-dev are directed to bitc-dev automatically. When
your email agent adds me as a recipient, it breaks this behavior, with the
result that I end up replying to you privately instead of to the list. I'll
try to pay better attention, but could you please try not to add me
individually when you are replying to the list?

Thanks
On Tue, Apr 10, 2012 at 11:00 AM, Pal-Kristian Engstad <
Post by Jonathan S. Shapiro
So I agree that the behavior you describe exists, but I do *not* agree
that it is any sort of impediment to use. The occasional, rare, 20ms pause
in an application like this just isn't a big deal
That depends on the application. Games, as an example, would never
tolerate a 20 msec pause. The application has to produce images at 30
frames per second, or sometimes at 60 frames per second. In a 30 FPS game,
you have 1,000 [msec] / 30 [frames] = 33.33.. [msec/frame] to respond to
user input, perform game logic, prepare rendering information and then
produce an image. As you can clearly see, 20 msec is *huge*, it is 60% of a
frame!
I'm well aware of that. But well-written games do not allocate memory
during frame render, so GC shouldn't be triggered at all.
Or at least, that would be the case *except* that current safe languages
*force* you to allocate in various circumstances where it is not really
necessary, because the language/runtime design favors boxed objects.
It's also worth noting that *malloc* can be expensive!
Post by Pal-Kristian Engstad
I would think that a lot of real-time and soft real-time systems would
have similar reasons to abandon any system that paused the program for such
a long period of time.
Depends a lot on when, why, and for how long. 20ms is definitely a long
time. On the other hand, it's pretty clear to me that I could build an
audio system with 0.7-1ms requirements in a GC'd language provided the
language offered adequate support for unboxed types.
shap
Pal-Kristian Engstad
2012-04-12 00:24:33 UTC
Post by Jonathan S. Shapiro
I'm well aware of that. But well-written games do not allocate memory
during frame render, so GC shouldn't be triggered at all.
Or at least, that would be the case *except* that current safe languages
*force* you to allocate in various circumstances where it is not really
necessary, because the language/runtime design favors boxed objects.
It's also worth noting that *malloc* can be expensive!
I would think that a lot of real-time and soft real-time systems would
have similar reasons to abandon any system that paused the program for
such a long period of time.
Depends a lot on when, why, and for how long. 20ms is definitely a long
time. On the other hand, it's pretty clear to me that I could build an
audio system with 0.7-1ms requirements in a GC'd language provided the
language offered adequate support for unboxed types.
shap

TL;DR: Things have changed; modern games these days allocate quite a bit
during "render", and make use of a whole slew of techniques, from
pre-allocating buffers and simple heap arena allocations all the way to
using garbage compaction and collection. In particular, game developers
are hoping to see robust, fast, pause-less and concurrent garbage
collection techniques.

Long version, if you are interested:

In our games, we continuously stream data from the BluRay in order to
avoid load-times for the players. This necessitates a series of allocation
systems, from malloc-like systems, GC-like systems and the like. Also,
script-based game-play systems are written by game-play designers, and
these systems need GC.

In the same notion, we heavily "use" memory - though "allocation" is
perhaps not the right name. What we do is pre-allocate many buffers for
use by different systems. We group things into "single-buffered memory" /
"double-buffered memory", as well as other private buffers and
ring-buffers. Most allocation is done through heap arenas. We simply keep
track of the base pointer, the current pointer and the end-of-memory
pointer, and allocation is as simple as this:

// Asserts and memory tracking are present in debug builds, but not shown
// here for clarity.
void *HeapArena::Allocate(size_t size)
{
    U8 *pRet = m_pCurr;
    U8 *pNext = m_pCurr + size;
    if (pNext > m_pEnd) return NULL;
    m_pCurr = pNext;
    return pRet;
}

The objects on the heap itself can never be deleted, but the whole heap
can be reset (m_pCurr = m_pBuffer). This happens once a frame for the
single-buffered memory buffers, and every other frame for double-buffered
memory buffers. Essentially, when you allocate into a single-buffered
memory buffer, you promise that the data will be temporary and that it is
not needed next frame. (Garbage collection given a life-time promise?
Wonder if you could statically type-check something like that...)

Finally, texture memory is a huge part of our memory management system.
Most data is textures. For this, we need to garbage collect texture blocks
based on actual usage. What we do is make the graphics chip move chunks of
memory into holes created by unused textures, while at the same time
streaming in new textures on top of the stack.

So, as you can see, your notion that "well-written games do not allocate
memory during frame-render" is not quite accurate. :-)

Thanks,

PKE
Jonathan S. Shapiro
2012-04-12 02:38:16 UTC
On Wed, Apr 11, 2012 at 5:24 PM, Pal-Kristian Engstad <
So, as you can see, your notion that "well-written games do not allocate
memory during frame-render" is not quite accurate. :-)
Actually, the techniques you describe are exactly what I expected. As you
note, you pre-allocate arenas. In consequence, you don't do heap allocation
during frame render.

And yes, you are allocating, but not at the layer of abstraction that
triggers GC - which was the real point I was trying to make.
David Jeske
2012-04-12 00:57:35 UTC
Post by Jonathan S. Shapiro
It's also worth noting that *malloc* can be expensive!
...even an expensive malloc does not stop all threads.

I'm sad to see folks still continuing this discussion of how such-and-such
pause is okay in so-and-so situation. This might be okay for a
whole-program Haskell compile, however, this doesn't fly in an industry
wide modular systems runtime.

Software is built with leverage, layer upon layer of abstractions combined
together. When we write a modular library, we have no idea who is going to
use it and for what, and with how large a heap, and under what conditions.
Any runtime which serves a subset of the industry is going to get a subset
of the support. If we want a runtime to have a chance of eclipsing the
C-shlib, it needs to have everyone's support, for every application -- and
reporting from the industry, I'll say that stop-the-world is more
unacceptable than some here seem to realize.

All the PL research in the world is not going to get us a system to replace
the C-shlib runtime if it ignores the nearly universal need for
no-world-stop, modular compilation, and modular forward upgradability. Just
consider where these whole-program functional or inference languages would
be if there were no c-shared libraries beneath them.

I'm really happy with the CIL increments over JVM. Structs are a really
needed mechanism to get data-cache locality and avoid some
heap-allocations. Modular assemblies are a really necessary improvement
over JVM style "string-name class loading". Type-instantiation is a really
important performance and data-cache locality improvement over
type-erasure.

I'm thrilled (as a result of this thread) to see the work by the Azul
folks to make a no-world-stop compacting GC in their C4 collector. If it
works as well as they claim, it could really help us move the industry
forward. I'll gladly take a 15% single-threaded performance hit to end the
world-stop -- in every one of my applications.

These pieces seem so close. Whether it's in an industry wide CIL+C4,
JVM2+C4, or something else entirely, we need to bring these pieces together
and make something that can really move modular systems programming and
shared libraries past the "era of C". It's hard to say when it will occur,
but I predict it can only occur once we provide a typesafe environment with
no world-stop, modular compilation, and modular forward upgradability --
just like C-shlibs have had for decades.
Bennie Kloosteman
2012-04-13 09:48:37 UTC
Post by David Jeske
I'm sad to see folks still continuing this discussion of how such-and-such
pause is okay in so-and-so situation. This might be okay for a
whole-program Haskell compile, however, this doesn't fly in an industry
wide modular systems runtime.
I'm really happy with the CIL increments over JVM. Structs are a really
needed mechanism to get data-cache locality and avoid some
heap-allocations. Modular assemblies are a really necessary improvement
over JVM style "string-name class loading". Type-instantiation is a really
important performance and data-cache locality improvement over
type-erasure.
I think CIL is an improvement, but to go further you are really talking
about re-use (including team factors such as discovery), versioning, and
contracts, and I ask: is the problem of modularity at the lib/dll/code
level inherently flawed, especially in an 8+ core world? Consider
browsers - is the current plugin solution good? The problem is so bad
that some browsers are going as far as having no plugins - Apple and IE11
Metro, for example. SOA services are the right idea, if a bit heavy, and
some advocates like Juval Lowey are even going as far as 1 class = 1
service (which I disagree with); however, SOA is constrained by
cross-machine limitations. I also note multi-processing is moving to
message-passing "agent" libraries.

To me, you want to break a program (or OS/runtime) down into as many
services as practical, with a runtime that facilitates this in an
asynchronous, resilient and highly parallel way. These services can use
different allocation methods, including manual/static/ref counting, or even
be scheduled to not receive messages (like the browser example you gave),
and hence solve the GC problem.
Post by David Jeske
I'm thrilled (as a result of this thread) to see the work by the Azul
folks to make a no-world-stop compacting GC in their C4 collector. If it
works as well as they claim, it could really help us move the industry
forward. I'll gladly take a 15% single-threaded performance hit to end the
world-stop -- in every one of my applications.
If you can deal with 15%, what is wrong with a ref counting collector?
There are advanced ones, but even primitive ones can run a mark/sweep to
remove cycles when the programmer wants - e.g. level loads, when there is
a dialog, etc.

These are probably of interest:

http://www.managedruntime.org/webfm_send/4 - C4

http://www.managedruntime.org - a runtime based on C4

It is interesting that C4 does incur significant pauses for small (4G)
heaps (obviously for the current thread)
Post by David Jeske
These pieces seem so close. Whether it's in an industry wide CIL+C4,
JVM2+C4, or something else entirely, we need to bring these pieces together
and make something that can really move modular systems programming and
shared libraries past the "era of C". It's hard to say when it will occur,
but I predict it can only occur once we provide a typesafe environment with
no world-stop, modular compilation, and modular forward upgradability --
just like C-shlibs have had for decades.
There needs to be something done, but it's difficult; the problem has been
around for a long time. Many solutions and attempts have been made, and
many languages were written to address the modularity concept, but they
don't really get there.

Ben
David Jeske
2012-04-13 15:59:51 UTC
Post by Bennie Kloosteman
I think CIL is an improvement, but to go further you are really talking
about re-use (including team factors such as discovery), versioning, and
contracts, and I ask: is the problem of modularity at the lib/dll/code
level inherently flawed, especially in an 8+ core world?
To me your statement sounds like classic naive futurism. My answer to your
question is obviously "No! in-process dll-style code modularity will remain
a critical type of modularity long into the future, because it's very very
fast."

90% of our stack ( kernel<->driver and program<->shlib ) is based on
in-process DLL/code-level modularity. Microkernels attempted to place the
kernel<->driver interface out-of-process and have all since died for
performance reasons. Deep program modularity is still handled with
in-process extension (photoshop plugins, VBA, chrome-extensions, gimp
plugins, emacs extensions, to name a few)
Post by Bennie Kloosteman
Consider browsers - is the current plugin solution good? The problem is
so bad that some browsers are going as far as having no plugins - Apple
and IE11 Metro, for example. SOA services are the right idea, if a bit
heavy, and some advocates like Juval Lowey are even going as far as 1
class = 1 service (which I disagree with); however, SOA is constrained by
cross-machine limitations. I also note multi-processing is moving to
message-passing "agent" libraries.
There are many types of code-modularity. Designing modularity for a given
situation depends on factors such as.... How rich is the data-interaction?
What is the frequency of the interaction? What is the bandwidth of the
interaction? What are the safety requirements of the boundary?

Browsers are beginning to act as operating systems themselves. We shouldn't
confuse "plug-ins" such as a full-panel PDF reader, which are really
separate programs loaded within the browser, with "extensions" that require
rich access to program data structures.

For rich performance-dependent code-modularity, such as program-extension
and kernel<->driver boundaries: Solutions with GC pauses will not be used
because pauses are unacceptable for a leveraged software stack. Out of
process solutions will not be used because they are too slow. These are the
situations where time-and-time-again we are using C-shlibs. This is not a
small amount of our software stack, it's most of it.

Post by Bennie Kloosteman
If you can deal with 15%, what is wrong with a ref counting collector?
There are advanced ones, but even primitive ones can run a mark/sweep to
remove cycles when the programmer wants - e.g. level loads, when there is
a dialog, etc.
There is simply never a good time to incur a stop-the-world pause, so any
solution which requires it will not succeed in replacing C and shlibs.
Implementing a mark/sweep cycle finder without stop-the-world is the same
challenge as GC without stop-the-world. Refcounting+AzulC4-style no-stop
cycle finding is entirely acceptable.

The reason pauses are unacceptable in our leveraged software stack is in
some sense an economic one. In software technology businesses, software
engineering costs generally don't scale with profits.

--> Which means in very-popular or very-profitable software there is no
reason to compromise user-experience for coding-efficiency. <--

If it makes the customer-experience better (i.e. no pauses), economics and
game-theory will dictate that we rewrite it all in assembler if it's necessary
(which is basically what C is, and basically what we are doing). This
economic basis is why pausing-GC-systems are only viable for the top of our
stack.... Lower-volume solutions such as scripting, business specific
customization, low-volume apps, low-volume webservers, and top-of-stack
programs where authors don't realize the pauses are a problem until later
(like webservers).

High-volume software must be pause free because customers would choose a
non-pausing alternative otherwise. As a result, it's written in
C/C++/Obj-C with no pausing-GC. This includes system software on Windows,
Mac, Linux, Win7Phone, iOS, plus all major applications (word, photoshop,
Firefox, Chrome, Safari, Apache, IIS, MySQL, MS-SQL, etc. etc.)
Post by Bennie Kloosteman
It is interesting that C4 does incur significant pauses for small (4G)
heaps (obviously for the current thread)
By my read of the paper, this wasn't because of a small heap; this was
because of the ratio of used space to free space. (i.e. when there is
memory pressure, there is a chance that allocation catches up to
reclamation, and then there is a pause. Note that malloc/free could
deadlock in such a situation because of fragmentation and lack of
compaction.)
Post by Bennie Kloosteman
There needs to be something done, but it's difficult; the problem has been
around for a long time. Many solutions and attempts have been made, and
many languages were written to address the modularity concept, but they
don't really get there.
If it's so difficult, then why has such a simple system (the C-shlib) solved
this problem so well and for so long? I think the "difficult" part about
building a better system is avoiding the classic v2.0 problem of throwing
out the critical requirements of v1.0. We need to admit the requirements
are zero-pause, and typed forward-upgradable in-process code-modularity.
For as long as we think we can ignore these goals, we will watch 90% of our
stack be written in C and wonder why.

If we admit these two goals, we can design something which is the
safe-and-strong-typed replacement for the C-shlib. Then we can write
typesafe kernels and typesafe system runtimes... instead of coding in C.
Will it fundamentally change software? Hard to say, but it will make us all
a heck of a lot more productive.

I believe the path to this is zero-pause GC, combined with at least
CIL-level support for structs/value-types, JIT, and possibly a constrained
forward-compatible-upgradable polymorphic interface boundary for modules.
There might be room for immutability and linear-types in some form of
fast-safe message-passing without copying. I don't think it's really that
"hard" to do zero-pause GC.. the technique Azul is using could have been
written a decade ago. Nobody is working on this stuff, we're all too busy
telling our selves the pauses are acceptable and scratching our heads about
why everything is still written in C.
Jonathan S. Shapiro
2012-04-16 13:45:06 UTC
Post by David Jeske
Microkernels attempted to place the kernel<->driver interface
out-of-process and have all since died for performance reasons.
David: Based on market share numbers, this statement is flatly wrong. There
are more microkernels running in the field today than there are monolithic
kernels.

And for some strange reason, the most widely deployed microkernel is a
capability system....
David Jeske
2012-04-16 16:27:43 UTC
Post by David Jeske
Microkernels attempted to place the kernel<->driver interface
out-of-process and have all since died for performance reasons.
Post by Jonathan S. Shapiro
David: Based on market share numbers, this statement is flatly wrong.
There are more microkernels running in the field today than there are
monolithic kernels.

.. now we have to define market share. If we count by instructions
executed annually, do you still think that to be true? Neither Windows,
Linux, Android-Linux, nor iOS are microkernels... and since they have the
lion's share of fast processors, I believe they dominate instructions
executed per year.

That said, I'll gladly retreat and simply qualify to end-user operating
systems since my entire argument revolves around end-user platforms not
embedded systems. (One can do very different things in small embedded,
including using stop-the-world, since leverage is generally less, and
there is no uncontrolled usage or installation)
Jonathan S. Shapiro
2012-04-16 17:56:42 UTC
Post by David Jeske
Microkernels attempted to place the kernel<->driver interface
out-of-process and have all since died for performance reasons.
Post by Jonathan S. Shapiro
David: Based on market share numbers, this statement is flatly wrong.
There are more microkernels running in the field today than there are
monolithic kernels.
.. now we have to define market share. If we count by instructions
executed annually, do you still think that to be true? Neither Windows,
Linux, Android-Linux, nor iOS are microkernels... and since they have the
lion's share of fast processors, I believe they dominate instructions
executed per year.
Count either instructions or unit numbers. I'm not sure about iOS devices,
but every other mobile device is running a capability microkernel under
those OS's. Most of those qualify as "end user systems".

I can also make a strong case that every direct-on-hardware virtual machine
solution is a microkernel. That's certainly a fair description of Hyper-V.
shap
David Jeske
2012-04-17 00:45:27 UTC
Post by Jonathan S. Shapiro
Count either instructions or unit numbers. I'm not sure about iOS devices,
but every other mobile device is running a capability microkernel under
those OS's. Most of those qualify as "end user systems".
There is no reason to debate facts, and this is quite a large tangent from
where we started, but I'm curious about what facts lead to our differing
view of this situation.

My definition of Microkernel <http://en.wikipedia.org/wiki/Microkernel> is
"hardware drivers live in hardware-protected-spaces and communicate via
IPC", such as pure Mach 3.0, (some parts of) NT 3.51, QNX, QNX Neutrino,
and more recently hardware-enforced applications of L4.

AFAIK...

...NT 4.0+ and Nextstep/openstep/macOS-darwin admitted defeat on drivers
outside ring0 and pulled them into the macrokernel for performance.

.... which means all desktop WinNT4/Vista/W7, OSX/Mach-BSD, plus
smartphones running Android/Linux and iOS/Darwin all load drivers into the
privileged space (ring0 equivalent) and are effectively macrokernels....
this is the source of my comment that "most" systems are macrokernels.

....non-smartphones mostly don't have MMUs, and thus they can't be
evaluated as a micro-or-macro kernel based on this idea of MMU protection
-- they are a single-address-space non-protected embedded system (basically
one giant macrokernel system).

....QNX and QNX Neutrino are among the commercial MMU protected
true-microkernels, and I believe they are
Single-Address-Space-Operating-Systems.

Some quick searching found this research project experimenting with an L4
version of Darwin <http://www.ssrg.nicta.com.au/publications/papers/Lee_Gray_06.pdf>,
which used an L4 hypervisor to offer microkernel style protections, but a
cursory glance looks like they took the QNX SASOS route.. and it's just a
research project.

I'm curious what body of "every other mobile device" has hardware-isolated
driver processes? Or are we talking about a different definition of
microkernel?
Post by Jonathan S. Shapiro
I can also make a strong case that every direct-on-hardware virtual
machine solution is a microkernel. That's certainly a fair description of
Hyper-V.
What is the definition of microkernel that reaches this conclusion?
Jonathan S. Shapiro
2012-04-17 05:01:44 UTC
Post by David Jeske
Post by Jonathan S. Shapiro
Count either instructions or unit numbers. I'm not sure about IOS
devices, but every other mobile device is running a capability microkernel
under those OS's. Most of those qualify as "end user systems".
There is no reason to debate facts, and this is quite a large tangent from
where we started, but I'm curious about what facts lead to our differing
view of this situation.
My definition of Microkernel <http://en.wikipedia.org/wiki/Microkernel> is
"hardware drivers live in hardware-protected-spaces and communicate via
IPC", such as pure Mach 3.0, (some parts of) NT 3.51, QNX, QNX Neutrino,
and more recently hardware-enforced applications of L4.
Right.
Post by David Jeske
.... which means all ... smartphones running Android/Linux and iOS/Darwin
all load drivers into the privileged space (ring0 equivalent) and are
effectively macrokernels.... this is the source of my comment that "most"
systems are macrokernels.
Indeed. But it turns out that the majority of these smart phones don't run
on bare metal. They run on L4. It's effectively part of the microcode load
on modern mobile phone platforms.
Bennie Kloosteman
2012-04-18 12:45:02 UTC
To be fair, I don't think you can really count these billion L4
derivatives; each is just a shell to abstract the HW and manage the
flashing. I think most drivers are in Linux.

On the other hand, QNX passes messages between processes in different
address spaces. I'm not sure if they can share an address space - I doubt
it (you can definitely share memory blocks) - but even if shared, you
still get the microkernel IPC cost, as this IS copied between address
spaces.

Anyway, QNX is now on Blackberries; there are a lot of microkernels out
there, and monolithic kernels have the first-mover advantage.

Ben
Shap Wrote
Indeed. But it turns out that the majority of these smart phones don't run
on bare metal. They run on L4. It's effectively part of the microcode load on
modern mobile phone platforms.
Rob Earhart
2012-04-17 18:13:58 UTC
Post by David Jeske
Post by Jonathan S. Shapiro
I can also make a strong case that every direct-on-hardware virtual
machine solution is a microkernel. That's certainly a fair description of
Hyper-V.
What is the definition of microkernel that reaches this conclusion?
(Caveat: I was one of the Hyper-V hypervisor devs back when it was
initially developed, but it's been a few years since I worked on it; things
have almost certainly changed.)

Hyper-V is really two parts: a hypervisor, and a virtualization stack. I
suspect Shap was talking about the hypervisor part.

One good way to look at the hypervisor is that it's a kernel whose process
ABI happens to almost exactly match the x64 hardware ABI, with just a
couple of extensions for partition management. It provides hardware
partitions, *not* virtual machines; virtual machine technology is provided
by the root partition. It does all the other things a kernel does, though:
it schedules threads onto cores, manages physical memory and the hypervisor
virtual address space, handles inter-core messaging, inter-partition
messaging, &c.

The root partition runs Windows. Along with the virtualization stack, that
partition contains almost all of the platform management and general device
driver code: the hypervisor itself manages the x64 processor (including the
LAPICs), and has a very small amount of platform/device knowledge (1394 and
UART support for debugging, the reboot path, &c), but that's about it.

I'm not 100% sure the hypervisor counts as a microkernel: the device
drivers aren't in the hypervisor, but they're all in the same partition
(and the machine will go down if that single partition dies). They don't
have to be in the same partition, though; there's no real reason why they
couldn't be pushed out to other partitions, especially with IOMMUs.

)Rob
Sandro Magi
2012-04-17 18:22:01 UTC
Permalink
Post by Rob Earhart
One good way to look at the hypervisor is that it's a kernel whose
process ABI happens to almost exactly match the x64 hardware ABI, with
just a couple of extensions for partition management. It provides
hardware partitions, *not* virtual machines; virtual machine technology
is provided by the root partition. It does all the other things a
kernel does, though: it schedules threads onto cores, manages physical
memory and the hypervisor virtual address space, handles inter-core
messaging, inter-partition messaging, &c.
These are exactly the functions of a microkernel, which is why Jonathan
could make the case that hypervisors are microkernels. It's essentially
a microkernel with a monolithic driver service.

Sandro
Bennie Kloosteman
2012-04-18 12:17:44 UTC
Permalink
Post by David Jeske
Post by Jonathan S. Shapiro
Count either instructions or unit numbers. I'm not sure about IOS devices,
but every other mobile device is running a capability microkernel under
those OS's. Most of those qualify as "end user systems".
There is no reason to debate facts, and this is quite a large tangent from
where we started, but I'm curious about what facts lead to our differing
view of this situation.
My definition of Microkernel is "hardware drivers live in
hardware-protected-spaces and communicate via IPC", such as pure Mach 3.0,
(some parts of) NT 3.51, QNX, QNX Neutrino, and more recently
hardware-enforced applications of L4.
AFAIK...
...NT 4.0+ and Nextstep/openstep/macOS-darwin admitted defeat to drivers
outside ring0 and pulled them into the macrokernel for performance.
NT4 only pulled the video driver in; Windows device drivers can still
run in ring 3 or ring 0 today. That was partly because NT4 had to
compete with Windows 3.1 and Windows 95 DOS games, which had little
OS overhead and managed the SVGA video card and its banks themselves via
the DOS int 10 and 16 calls. Most of these OSes were based on Mach 3;
the L4 techniques, papers, and analysis that removed the performance
penalties were published after they were designed, but the Mach 3 /
GNU Hurd stench stayed. See, e.g., the 20-fold improvement of L4 over
Mach 3: http://os.ibds.kit.edu/65_1059.php ...

Kind of makes me wonder about the original designers: "we are going to
do a lot of IPC, so let's make it an inefficient API (OK, call it secure
and extensible)". Well, they got the results... The monolithic call
traps, by contrast, had been optimized for decades (and even accelerated
in the hardware).

Very few drivers even benefit from ring 0 performance these days;
CPUs and memory are just so fast, and microkernels can use big
shared-memory buffers to avoid large amounts of copying.
Futexes have probably solved one of the biggest remaining issues; the main
ones now are the key services such as the scheduler and the memory
manager. Some microkernels still put the scheduler in the kernel.
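
For concreteness, a minimal POSIX sketch of that shared-buffer idea (an
illustration only; this is not how QNX or L4 actually implement their IPC,
and the struct and sizes are made up). The payload is written once into a
mapped page and read in place instead of being copied through a kernel
message path; the spin loop marks where a futex-style wait would go:

    // C++ sketch: parent and child share one mapped buffer (MAP_SHARED).
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <atomic>
    #include <cstdio>
    #include <cstring>
    #include <new>

    struct SharedBuf {
        std::atomic<int> ready{0};   // 0 = empty, 1 = filled by the producer
        char data[4096];             // payload lives in the shared page itself
    };

    int main() {
        void* p = mmap(nullptr, sizeof(SharedBuf), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) return 1;
        auto* buf = new (p) SharedBuf;      // construct in the shared mapping

        if (fork() == 0) {                  // child acts as the "driver"
            std::strcpy(buf->data, "reply placed directly in the shared page");
            buf->ready.store(1, std::memory_order_release);
            _exit(0);
        }
        while (buf->ready.load(std::memory_order_acquire) == 0) {
            // busy-wait; a real system would sleep on a futex here
        }
        std::printf("client read in place: %s\n", buf->data);
        wait(nullptr);
        munmap(p, sizeof(SharedBuf));
    }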

Ben
Jonathan S. Shapiro
2012-04-16 17:15:38 UTC
Permalink
So a couple of comments about David's thoughts on modularity and versioning.

First: At a high level, I agree with him that these are important issues.
Where we may possibly disagree is how many of these problems should be
solved at the language level. The issue that concerns me is that different
platforms have different solutions, and an effective language needs to work
with all of them. Which is rather a nuisance. Beyond that, we don't know a
really good solution. It would be a shame to commit ourselves only to find
a better solution emerging that we cannot take advantage of. Kinda like
Java and 16 bit Unicode, for example.

My take for BitC:

- Arriving at an effective module system is in scope
- Dealing with module versioning is *not* in scope - or at least not yet.
Solutions with GC pauses will not be used because pauses are unacceptable
for a leveraged software stack. Out of process solutions will not be used
because they are too slow.
David, I'm sorry, but you are simply wrong on *both* of these points. What
you say concerning pauseless GC is true for large, monolithic applications.
It's worth noting that DLLs/shlibs exacerbate the problem of application
scale. It is *not* true for more rationally sized applications. The pause
times for such applications are negligible. The more pressing issue is the
need for a hand-over-hand collector. The requirement for hand-over-hand
tends to drive us back into concurrent collector technology, so the end
result is the same, but we should be careful to remain clear about which
requirements are being driven from which use cases and why.

What you are arguing amounts to "we need a pauseless collector because we
are too stupid to write our applications in manageable form". Which is,
sadly, true.

Out of process solutions are indeed horrendously slow in Windows. This is
*not* generally true in other systems.

Refcounting+AzulC4-style no-stop cycle finding is entirely acceptable.
No, it isn't. Refcounting collectors aren't coalescing collectors, and that
turns out to be important: a non-moving reclaimer can hold plenty of total
free space scattered in small holes between live objects and still fail a
single larger allocation, whereas a compacting collector slides the live
objects together and turns that space into one usable block.
The reason pauses are unacceptable in our leveraged software stack is in
some sense an economic one. In software technology businesses, software
engineering costs generally don't scale with profits.
I agree with the scaling statement, but it isn't clear how it relates to
the economics statement.

--> Which means in very-popular or very-profitable software there is no
reason to compromise user-experience for coding-efficiency. <--
That's just nonsense. The reason to avoid compromising user experience is
entirely profit-driven. The user experience must be maintained. If that
requires improving coding efficiency, so be it.
This economic basis is why pausing-GC systems are only viable for the top
of our stack.... lower-volume solutions such as scripting,
business-specific customization, low-volume apps, low-volume webservers,
and top-of-stack programs where the authors don't realize the pauses are a
problem until later (like webservers).
Well, it's a theory. Unfortunately it's based on a bad assumption.
Scripting languages are no longer used only for small or low-volume
applications. Some of the applications being written in these languages are
surprisingly large. That's the main motivation for Dart.
High-volume software must be pause free because customers would choose a
non-pausing alternative otherwise.
Bennie Kloosteman
2012-04-18 12:21:43 UTC
Permalink
I think CIL is an improvement, but to go further you are really talking
about re-use (including team factors such as discovery), versioning, and
contracts, and I ask: is the problem of modularity at the lib/dll/code
level inherently flawed, especially in an 8+ core world?
To me your statement sounds like classic naive futurism. My answer to your
question is obviously "No! in-process dll-style code modularity will remain
a critical type of modularity long into the future, because it's very very
fast."
90% of our stack (kernel<->driver and program<->shlib) is based on
in-process DLL/code-level modularity.
I'm not suggesting replacing DLLs as a compilation module; they still
have a role, e.g. a 3D maths lib. But services which wrap a bunch of
code have a lot of advantages as a reusable "module". I have seen
a lot of issues with DLLs as a reusable code base and the associated
dependencies and versioning.

The key to doing it with services is breaking along paths with little
communication. SOA systems are doing exactly that, and I know many
practitioners who go too far (IMHO) and use full SOA services (with
the whole SOAP stack, as a separate process) within an application;
some even go down to one class per service.
Microkernels attempted to place the
kernel<->driver interface out-of-process and have all since died for
performance reasons.
Answered below.
Deep program modularity is still handled with
in-process extension (photoshop plugins, VBA, chrome-extensions, gimp
plugins, emacs extensions, to name a few)
Which is dying for reliability and security reasons: no add-ins in
IE11 Metro and in some of Apple's stuff. (Note they even tried the
separate-process model.) VBA had lots of issues with crashing and got
replaced with .NET assemblies, which communicate via COM.
Consider browsers: is the current plugin solution good? The problem is
so bad that some browsers are going as far as having no plugins -- Apple
and IE11 Metro, for example. SOA services are the right idea, if a bit
heavy, and some advocates like Juval Lowy are even going as far as one
class = one service (which I disagree with); however, SOA is constrained
by cross-machine limitations. I also note that multiprocessing is moving
to message-passing "agent" libraries.
For rich performance-dependent code-modularity, such as program-extension
and kernel<->driver boundaries: Solutions with GC pauses will not be used
because pauses are unacceptable for a leveraged software stack. Out of
process solutions will not be used because they are too slow.
Yet browsers have done exactly that -- not out of choice but out of
necessity (reliability, security). The malware companies are now saying the
scanning cost is on an exponential track, with serious consequences
for the current model. It doesn't matter if you call a browser an OS;
the problem is the same.

Your comments about performance are important (up to a point), though
most software houses tend to write slow stuff cheaply. However,
security and reliability are important considerations as well: if your
critical program is compromised through an extension, it can look bad
for your business. Then again, Flash's business sadly is not suffering
(I hope it will soon), even though it's slow and insecure.
By my read of the paper, this wasn't because of a small heap, this was
because of the ratio of used space to free space. (I.e., when there is
memory pressure, there is a chance that allocation catches up to
reclamation, and then there is a pause. Note that malloc/free could
deadlock in such a situation because of fragmentation and the lack of
compacting.)
Yes, malloc can do this and can have some bad pauses on the current
thread. And yes, lots of free space makes GCs pause a lot less.
There needs to be something done, but it's difficult; the problem has
been around for a long time. Many solutions and attempts have been made,
and many languages were written to address the modularity concept, but
they don't really get there.
If it's so difficult, then why has such a simple system (the C shlib)
solved this problem so well and for so long?
Do you think that system is flexible and extensible? The first lib
did it that way, and most of the existing functions are still there 40
years later.
I think the "difficult" part about
building a better system is avoiding the classic v2.0 problem of throwing
out the critical requirements of v1.0. We need to admit the requirements are
zero-pause, and typed forward-upgradable in-process code-modularity. For as
long as we think we can ignore these goals, we will watch 90% of our stack
be written in C and wonder why.
I think our requirements should be stability, security, and an option:
95% of C performance with small pauses, OR
85% with no pauses.

The rest has been answered.
Bennie Kloosteman
2012-04-13 09:57:35 UTC
Permalink
While I think we can get somewhere by breaking things into services and
using appropriate allocation in each service (like ref counting or static
allocation), that is mainly for other purposes, like reuse and resilience.
I really don't see what the issue is with having 10-15% of a runtime code
base in C++ or asm (after all, we do this with SSE anyway). The cost of
that 15% is probably much lower than the cost of trying to make a GC'd
code base work, and even today many C-based OSes have asm chunks anyway.
Post by David Jeske
It's also worth noting that *malloc* can be expensive!
...even an expensive malloc does not stop all threads.
I'm sad to see folks still continuing this discussion of how such-and-such
a pause is okay in so-and-so situation. This might be okay for a
whole-program Haskell compile; however, it doesn't fly in an industry-wide
modular systems runtime.
David Jeske
2012-04-13 16:03:02 UTC
Permalink
I really don't see what the issue is with having 10-15 % of a runtime code
base in C++ or asm ( after all we do this with SSE anyway ) ? The
cost of that 15% is probably much lower than trying to make a GC code base
work and even today many C based OS have asm chunks anyway
Your math is really different from mine. By my view, 90% of our stack is
written in C.

90% C/C++: kernel, driver, display, algorithm, core shlibs,
popular applications

10% pausing-GC: low-volume software, scripting, customer-specific
applications,

Plus my favorite: app prototyping where they haven't seen the pauses yet;
when they do, they rewrite in C/C++.
Jonathan S. Shapiro
2012-04-13 17:15:54 UTC
Permalink
Post by David Jeske
Your math is really different from mine. By my view, 90% of our stack is
written in C.
90% C/C++: kernel, driver, display, algorithm, core shlibs,
popular applications
But of course, *none* of this needs to be written in C/C++, barring a
couple hundred lines of ASM code in the kernel for a few hardware
interactions and maybe a hundred lines of ASM/C in the language-level
runtime.

Once the BitC type system is augmented with the Habit stuff, it appears
possible to me to write an allocation-free, type-safe runtime within BitC
itself, except for some very, very small low-level operations.


shap
Florian Weimer
2012-04-11 19:32:50 UTC
Permalink
Post by Jonathan S. Shapiro
As far as I know, there are *no* shipping collectors whose GC trigger
conditions are unpredictable. In every case, GC is an optional side effect
of allocation.
What about concurrent collectors? They compete for certain resources
(CPU, memory bandwidth, some locks) with the mutator, and this is not
very predictable.
Jonathan S. Shapiro
2012-04-12 02:32:56 UTC
Permalink
Post by Florian Weimer
Post by Jonathan S. Shapiro
As far as I know, there are *no* shipping collectors whose GC trigger
conditions are unpredictable. In every case, GC is an optional side
effect of allocation.
What about concurrent collectors? They compete for certain resources
(CPU, memory bandwidth, some locks) with the mutator, and this is not
very predictable.
Hoist on my own petard! Yes, concurrent collectors run when they please.
However, they do not run in a fashion that blocks the main thread. The
original concern in this sub-discussion was pause times.

shap
David Jeske
2012-04-10 23:51:37 UTC
Permalink
Post by David Jeske
Consider any reasonably large GUI application -- a word-processor,
computer game, or a 3d graphics modeling program. There is no acceptable
time to stop the world because one can't predict when the user will
interact next. Moving most of the data out of the view of the GC is a
solution, but then we're right back to non-safety....
So I agree that the behavior you describe exists, but I do *not* agree
that it is any sort of impediment to use. The occasional, rare, 20ms pause
in an application like this just isn't a big deal
..either our experiences or our expectations are dramatically different
here.

I have yet to see a collector that has a global stop shorter than several
**seconds** on a 1GB heap. IMO, we are still writing these apps (and
kernels) in C/C++ with manual memory management in large part because the
pauses are unacceptable. Our server architectures are suffering, and we're
being forced to write all but one final tier in C/C++ because cumulative GC
pauses kill. Even in the final tier, where we allow GC, we have to be extra
careful about heap sizes because of the dramatically horrific effects of GC
pauses on end-user latency.

To believe that GC pauses are "generally not that big of a deal" is, IMO,
incredibly naive. They are a huge and constant deal in real systems. In the
real world, we are constantly being forced to use C++ as an implementation
language because of the need to avoid GC pauses. Either that, or we are
producing products with sub-standard end-user latency, because we can't
eliminate the pauses and are unwilling to use C/C++. This problem is as
large as or larger than the modularity problem.
Post by David Jeske
The same lack of forgiveness is present in threaded web applications.
There is normally no time that all threads are simultaneously in a pausable
activity.
A general stop isn't actually required. What's required is for all threads
to enter a new allocation regime before the GC can start. So basically,
you're complaining that the existing GCs are excessively naive and
therefore aren't meeting your needs. Where concurrent programs are
concerned, I tend to agree.
I don't see the distinction for concurrent programs. Which GC for
non-concurrent programs has no unpredictable world-stop? The only one I can
think of is refcounting (i.e. Python if you turn off the cycle-finder GC).

We are naive in not rejecting stop-the-world entirely. It should have been
declared a transitory dead-end long ago and fixed.
Post by David Jeske
So what should we *do* about it?
We should reject stop-the-world in our collectors.

We should produce mainstream royalty-free x86 runtimes for JVM and CIL that
have no world-stop. Looks like Azul may have a commercial JVM with no
world-stop. We need tech like this to be 'standard'.

We should admit these runtimes are not viable C/C++ alternatives for many
applications until we eliminate stop-the-world (and fix modularity).
Post by David Jeske
But that's *exactly* the false dichotomy that the PL world lives in. You
1. Language support for unboxing, which reduces the number of heap
allocations and permits stack allocation
2. First-class regions, which permit swathes of objects to be
eliminated simultaneously.
I see no way to write 'real' apps (a game, word processor dom, web
browser dom, 3d modeler, caching webserver) without heap mutability, and no
way to provide typesafe heap mutability without GC.
But the argument here isn't about removing GC. It's about combining GC
with other *safe* techniques so that (a) the load on the GC is reduced,
and (b) the triggering of GC can be predicted, and (c) the triggering of
major collections can be avoided
Once it is admitted that stopping-the-world is unacceptable, then "reducing
pauses" is not interesting --- because it still does not stop the pause.
Once we have pause-free GC, then these techniques become what they should
be, methods to increase GC efficiency and throughput... Using them with
stop-the-world collectors is like trying to use band-aids to triage a
multiple-stab-wound victim. They are too little too late. In reality this
doesn't even happen, because we have written the app in C/C++ long before
this situation could occur.
Post by David Jeske
In the absence of that, I'm very happy with the direction of the CIL
design (relative to the JVM or those before it) because value-type structs
and parametric-type instantiation offer us practical ways to reduce
traceable pointers and thus pressure on the GC.
Yes and no. The value types in CIL remain second class, the absence of a
fixed-length array type is crippling to some very important cases, and the
atomicity rules in the presence of concurrency defeat a whole bunch of
inlining options.
I'm interested to know more about this. Is there something written on it?
Post by David Jeske
Post by Jonathan S. Shapiro
I don't like stop-the-world collectors either, but that
"unpredictably" word in your sentence is critical. The only collectors I
know of that are really unpredictable in this way are collectors for
concurrent programs that require a universal thread rendezvous.
I do not understand the significance of your observation. AFAIK, every
currently shipping software collector, for every language system, requires
an unpredictable world-stop. If I am mis-informed, please point me to one
that does not.
As far as I know, there are *no* shipping collectors whose GC trigger
conditions are unpredictable. In every case, GC is an optional side effect
of allocation. The conclusion, then, is that we need the ability to
construct allocation-free subprograms.
Interesting. I don't see how to solve real-world problems with
allocation-free subprograms, such as extending an HTML DOM in response to
user input, or moving 3D polygons through an octree (which requires polygon
subdivision and tree splitting).

C/C++ programs don't solve these problems in "allocation free ways", so I
don't see why that should be our goal. To me it seems simpler and more
straightforward to solve the direct problem (GC pauses). Then we can shift
from unsafe-to-safe programming without debating whether a "pause is too
big or not".

After we have a zero-pause runtime, we can spend time and research on
alternative techniques such as regions and advancements to value-types to
relieve pressure on the GC system. Applying them will become a tradeoff of
efficiency vs. programmer time. This is a tradeoff we are very good at
dealing with. GC pauses are not something we can "trade off"; they are
simply an unacceptable wall that forces us back to C/C++.
Jonathan S. Shapiro
2012-04-13 17:09:59 UTC
Permalink
Post by David Jeske
We should produce mainstream royalty-free x86 runtimes for JVM and CIL
that have no world-stop. Looks like Azul may have a commercial JVM with no
world-stop. We need tech like this to be 'standard'.
I like this idea, but having worked a bit with such a collector I feel
confident saying that it's not a panacea. The hard truth is that we will
continue to need a diversity of collection technologies. For small heaps,
in particular, you are better off doing stop-the-world than you are doing
concurrent collection; the overhead is lower.

Post by David Jeske
Once it is admitted that stopping-the-world is unacceptable, then "reducing
pauses" is not interesting --- because it still does not stop the pause.
Except that the proposition is silly. On a 128KB heap, stop-the-world is
perfectly fine, and the overheads of concurrent collection become
unwarranted.

I raise this because systems having a larger number of smaller processes
are of great interest from a security and robustness perspective.

*However*
Data sharing across heaps is very important in systems based on safe
languages, and *that* introduces some important constraints on GC...

Post by David Jeske
Post by Jonathan S. Shapiro
Yes and no. The value types in CIL remain second class, the absence of a
fixed-length array type is crippling to some very important cases, and the
atomicity rules in the presence of concurrency defeat a whole bunch of
inlining options.
I'm interested to know more about this. Is there something written on it?
For starters, a CIL struct (the user defined value type) does not admit
inheritance. For seconders, I *think* there are some constraints associated
with their use in generics.
Post by David Jeske
Interesting. I don't see how to solve real-world problems with
allocation-free subprograms. Such as extending an HTML DOM in response to
user-input, or moving 3d polygons through an octree (which requires polygon
subdivision and tree splitting).
Look. You ultimately have three choices:

1. Don't allocate
2. Pre-plan your required resource
3. Use explicit regions to relegate allocations to safe points in your
program

But there just isn't any silver bullet here. Every explicit memory manager
that is non-coalescing can suffer fragmentation. Every explicit memory
manager that is coalescing has asymptotic cases.

That said, the pre-planning approach bears real thought. The main concern
with explicit memory management is safety. When you are explicitly
allocating and deallocating from a pre-reserved pool containing objects of
like type, there is a dangling pointer issue but no safety issue. For the
most part, that's actually what ends up happening under the covers in the
explicit memory management scenario.
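
To make that concrete, here is a rough sketch of the kind of like-typed
pool described above (names and sizes are invented for the example). A
stale index still refers to *some* Particle, so it can return wrong data,
but it can never be reinterpreted as an object of a different type:

    // C++ sketch: a fixed, pre-reserved pool of like-typed objects.
    #include <array>
    #include <cstddef>
    #include <optional>

    struct Particle { float x, y, z; int generation; };

    class ParticlePool {
        std::array<Particle, 1024> slots{};
        std::array<bool, 1024> in_use{};
    public:
        std::optional<std::size_t> alloc() {
            for (std::size_t i = 0; i < slots.size(); ++i)
                if (!in_use[i]) { in_use[i] = true; slots[i] = Particle{}; return i; }
            return std::nullopt;             // pool exhausted: pre-planning failed
        }
        void release(std::size_t i) { in_use[i] = false; }
        // A stale index is a dangling reference, but always to a Particle.
        Particle& get(std::size_t i) { return slots[i]; }
    };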

So really, I think there are just a *lot* of things that can be done to
help once we acknowledge that no allocator is a panacea and that real
programs sometimes have to express things at the "blood and the mud" level
- it's not all just abstract, which is how some people in the PL community
would still prefer to view it.

That attitude in the PL community is changing, by the way. It's not that
they abandoned concrete issues because they were unimportant. Rather, they
did so because they wanted to establish formal underpinnings for
programming languages and there were only so many things they could deal
with successfully at a time. 30 years later we know how to do things like
fancy type systems for languages that don't speak of concrete issues. Now
that this is true, *some* members of the PL community are coming back and
working actively with systems people again. Greg Morrisett's group and
Stephanie Weirich's group are two that come immediately to mind.
Post by David Jeske
After we have a zero-pause runtime, we can spend time and research on
alternative techniques such as regions and advancements to value-types to
relieve pressure on the GC system.
That's backwards. The region and other advances are available to us today.
The pauseless collectors are still not understood well enough to be
mainstream yet.


shap
David Jeske
2012-04-13 20:19:28 UTC
Permalink
Post by Jonathan S. Shapiro
Post by David Jeske
After we have a zero-pause runtime, we can spend time and research on
alternative techniques such as regions and advancements to value-types to
relieve pressure on the GC system.
Post by Jonathan S. Shapiro
That's backwards. The region and other advances are available to us
today. The pausless collectors are still not understood well enough to be
mainstream yet.

I'm working on a longer response to some other points, but I think this
divide above deserves some digging into on both sides. Here is my
reasoning. I'm very interested to hear yours.

If our goal is a typesafe systems runtime to represent all programs (i.e. a
replacement for C and C shlibs), we need a general-purpose typesafe heap
memory reclamation scheme. Regions, refcounting, and linear types _alone_
cannot represent all programs. Thus, if we wish to create a typesafe
runtime to represent all programs, these solutions can only serve as an
optimization for a subset of such a system. They can only represent all
programs by combining them with manual management (which is not typesafe)
or GC (which brings us back to my point).

The only type-safe heap reclamation schemes I'm aware of which are capable
of representing all programs are (a) some variation of mark-sweep-compact
tracing GC, and (b) refcounting with some variation of mark-sweep cycle
finding. Therefore, one of these two schemes is a good candidate for the
base of a typesafe runtime (once it meets our no-pause/low-jitter
requirement) to replace C/shlibs/malloc, whereas regions et al. are only
suitable as optimizations within such a system.
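
As a tiny illustration of why option (b) needs the cycle finder at all, a
C++ sketch using shared_ptr as a stand-in for a refcounted heap; breaking
one edge with weak_ptr, or running a tracing cycle detector as Python does,
would fix it:

    #include <memory>

    struct Node {
        std::shared_ptr<Node> next;   // strong edge: participates in the count
    };

    int main() {
        auto a = std::make_shared<Node>();
        auto b = std::make_shared<Node>();
        a->next = b;
        b->next = a;   // each node now has a reference count of 2
    }                  // a and b go out of scope: counts drop to 1, never 0 -> leaked cycle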

What is your argument?

Do you believe regions and such are capable of representing all heap
reclamation needs and we simply don't know how? It seems to me that any
other argument devolves into GC or manual management, which simply proves
my point.
Matt Oliveri
2012-04-16 05:56:37 UTC
Permalink
DavidJ, I know that was mainly a question for Shap, but I have a few
points which might be useful:

The way I interpret Shap's point is that we don't know yet if
zero-pause GC is good enough for _all_ programs, or even all those
programs which would otherwise use C and C shlibs.
It's probably a lot closer than stop-the-world GC, but it's too early
to put all our eggs in that basket.

At any rate, maybe a typesafe runtime for all programs is not the right goal.
For starters, if zero-pause GC turns out to be unacceptable for some
programs, then we have no evidence that the goal is attainable.
Further, even if it is, it's restrictive; not all applications need GC.
For example, those programs which can be shown to be safe just using
region typing should not be using GC.
But of course, we want to be able to build a system out of components
that use GC and components that only use region-typesafe allocation.
More generally, we want to be able to build a system out of components
that use all sorts of memory management policies.
The problem then is that anyone could come up with a new and
reasonable memory management policy at any time, and we'd want to be
able to integrate it.
So any "fixed-function" runtime with a fixed set of memory management
policies it integrates will be insufficient sooner or later.
This is more obviously true when you consider the more general problem
of storage management which includes filesystems and databases,
including those split among multiple servers.
Ideally, programs that deal with all that should be demonstrably safe too.
But if our runtime is so extensible that new memory management
policies can be added and used, then type checking (as it's usually
understood) using a fixed type system is not strong enough to
guarantee safety of the resulting integrated system.

So here are the options I see:
- Live dangerously: Use various tricks to demonstrate the safety of
_some_ programs, with some remaining amount of low-level unsafe
infrastructure.
- Compromise: Settle on some typesafe runtime and shoehorn as many
programs into it as practical.
- Formal verification: Basically like "live dangerously" in that it
uses as many tricks as possible to provide simple safety arguments,
but instead of defaulting to unsafety, it defaults to a
machine-checkable proof using formal models and general-purpose logic.

If you want a typesafe runtime to replace C, then you are essentially
saying we should write all programs on top of some fixed runtime.
In other words, as I see it, you are asking for compromise.
But you have argued yourself that essentially people turn to C (live
dangerously) when they find the compromise option to be unacceptable,
currently because of GC pauses.
So with your proposed system based on zero-pause GC, the unacceptable
compromise would be something else, but it would still exist (due to
the inherent restriction of type-safety arguments), and it would lead
some programmers (probably far fewer) back to C or some other unsafe
infrastructure (in which case, why not C?).

In summary, I think zero-pause GC is promising, and a safe runtime
that uses it could be extremely popular, but it will not replace C.
So if the (much-better) compromise of a typesafe, zero-pause GC'ed
runtime is OK with you, then you are essentially right, but should
probably de-emphasize replacing C.

If you really want to replace C, your killer app is end-to-end system
safety proofs, and you want formal verification (upon which type
safety can be proven and used), because type checking alone is not
flexible enough.

I originally got interested in BitC because I thought its design goal
was to be a practical substrate for formal verification.
(I don't think C is a practical substrate because hardly anyone
completely follows the C spec anyway.)
But now it seems like Shap is pursuing the "live dangerously" option,
but trying to make it less dangerous using strong, low-level types.

All this is to say:
I would summarize your difference in opinion by saying Shap is trying
to make "live dangerously" less dangerous, and DavidJ is trying to
make "compromise" a more realistic compromise.
To restate again, you're both approaching the ideal of a safe,
universal systems language, but from different directions.
Shap wants to make languages safer while guaranteeing universality.
DavidJ wants (I think?) to make languages (and runtimes) more
universal while guaranteeing safety.

And as I said before, I don't believe the ideal (safe, universal
language) can be reached without some amount of formal verification.
Post by David Jeske
Post by Jonathan S. Shapiro
Post by David Jeske
After we have a zero-pause runtime, we can spend time and research on
alternative techniques such as regions and advancements to value-types to
relieve pressure on the GC system.
That's backwards. The region and other advances are available to us today.
The pausless collectors are still not understood well enough to be
mainstream yet.
I'm working on a longer response to some other points, but i think this
divide above deserves some digging into on both sides. Here is my reasoning.
Im very interested to hear yours.
If our goal is a typesafe systems runtime to represent all programs (i.e. a
replacement for c and c-shlibs), we need a general purpose typesafe heap
memory reclamation scheme. Regions, refcounting, and linear-types _alone_
can not represent all programs. Thus, if we wish to create a typesafe
runtime to represent all programs, these solutions can only serve as an
optimization for a subset of such a system. They can only represent all
programs by combining them with manual-management (which is not typesafe) or
gc (which brings us back to my point).
The only type-safe heap reclamation schemes im aware of which are capable of
representing all programs are (a) some variation of mark-sweep-compact
tracing gc, and (b) ref-counting, with some variation of mark-sweep cycle
finding.  Therefore, one of these two schemes is a good candidate for the
base of a typesafe runtime (once it meets our no-pause/low-jitter
requirement) to replace c/shlibs/malloc, wheras regions et. al. are only
suitable as optimizations within such a system.
What is your argument?
Do you believe regions and such are capable of representing all heap
reclamation needs and we simply don't know how? It seems to me any other
argument devolves into gc or manual management, which simply proves my
point.
David Jeske
2012-04-16 11:33:23 UTC
Permalink
Matt - Thanks for the very well-worded and clear response. Of course we
can't speak for Shap, but I can see those two well-reasoned perspectives
clearly in your explanation. I don't know if you anticipated this or not,
but I have one objection...
Post by Matt Oliveri
If you want a typesafe runtime to replace C, then you are essentially
saying we should write all programs on top of some fixed runtime.
Yes. That runtime would evolve over time, but yes.
Post by Matt Oliveri
In other words, as I see it, you are asking for compromise.
But you have argued yourself that essentially people turn to C (live
dangerously) when they find the compromise option to be unacceptable,
currently because of GC pauses.
I object to this. I do not believe people turn to C to avoid compromise. If
so, they would always turn to assembler. I believe they turn to C because it
offers an improvement in programmer productivity in exchange for a
linear performance degradation.

The problem with GC pausing is not that it is a compromise; the problem is
that it's non-linear. IMO, by eliminating this non-linearity, we would
provide a runtime which was strictly preferred in all but the most extreme
cases (i.e. the same situations that turn to assembler today).
Matt Oliveri
2012-04-16 12:35:30 UTC
Permalink
You're welcome. Some more specific responses below.
Post by David Jeske
Matt - Thanks for the very well-worded and clear response. Of course we
can't speak for Shap, but I can see those two well-reasoned perspectives
well in your explanation. I don't know if you anticipated this or not, but I
have one objection..
Post by Matt Oliveri
If you want a typesafe runtime to replace C, then you are essentially
saying we should write all programs on top of some fixed runtime.
Yes. That runtime would evolve over time, but yes.
Post by Matt Oliveri
In other words, as I see it, you are asking for compromise.
But you have argued yourself that essentially people turn to C (live
dangerously) when they find the compromise option to be unacceptable,
currently because of GC pauses.
I object to this. I do not believe people turn to C to avoid compromise. If
so they would always turn to assembler. I believe they turn to C because it
offers an improvement in programmer productivity in exchange for a
linear-performance degregation.
Hmm, yeah I guess I overstated the point you argued. Sorry.
Post by David Jeske
The problem with GC pausing is not that it is a compromise, that problem is
that it's non-linear. IMO, by eliminating this non-linearity, we would
provide a runtime which was strictly preferred in all but the most extreme
cases. (i.e. the same situations that turn to assembler today)
I see. So it's specifically GC pauses you have a problem with.
Yeah, I really don't think a typesafe, zero-pause GC'ed runtime would
get rid of C by itself, but it sure would be nice.
And if you're right about how it's the GC pauses holding back typesafe
runtimes, then I wouldn't be at all surprised if almost all programs
end up being written for a runtime like that.

At any rate, I'm personally more interested in technologies that let
programmers avoid performance compromises, so I don't know what else I
can say.
I'm glad my points were helpful.
Jonathan S. Shapiro
2012-04-16 17:52:50 UTC
Permalink
Of course we can't speak for Shap...
Why not? You are entirely welcome to take my name in vain, just so long as
you shall have no other gods before me. :-)

Okay. I admit it. I'm punchy this morning. No idea why.
Post by Matt Oliveri
If you want a typesafe runtime to replace C, then you are essentially
saying we should write all programs on top of some fixed runtime.
Yes. That runtime would evolve over time, but yes.
I more or less agree. One thing I would like to see is a successor to CIL
that provides for regions and parametric types, and maybe for "shape types"
(which I need to explain).
Post by Matt Oliveri
In other words, as I see it, you are asking for compromise.
But you have argued yourself that essentially people turn to C (live
dangerously) when they find the compromise option to be unacceptable,
currently because of GC pauses.
I object to this. I do not believe people turn to C to avoid compromise.
If so, they would always turn to assembler. I believe they turn to C because
it offers an improvement in programmer productivity in exchange for a
linear performance degradation.
But you yourself note that C is essentially asm with human syntax.

I think there are a bunch of reasons people turn to C:

- Maybe I'm writing something that is dependent on an existing library
in C (e.g. OpenSSL)
- Maybe I need decent I/O performance, and safety is pushing me into
data copies that I can't afford to do. Biagioni's work on TCP in ML is
instructive here. Foxnet is painted as a success, but actually it's pretty
depressing.
- Maybe my app warm-up time matters (so JIT gets in the way). This was
the main issue for in-transaction database servlets
- Maybe pause times worry me
- Maybe the problem of debugging inadvertently retained storage was too
hard
- Maybe I just don't know any better
- Maybe I care about performance, therefore about memory placement, and
therefore need a rich system for unboxed types
- Maybe Brian, Dennis, and Bjarne are/were my colleagues (okay, that one
applies to fewer and fewer of us, but still)
- Maybe I'm a tea drinker
- Maybe the *physical* heap footprint of my app matters. Either I can't
afford 10x the memory initially, or I'm operating at a scale where I can't
afford to *cool* 10x the memory. That's why Google can't use some of
this stuff.
- Maybe I care about security and modularity, and I've worked out that
taming the Java standard library is utterly hopeless.

All I'm saying here is that GC is just one of *several* issues that are
getting in the way of safety adoption.


shap
David Jeske
2012-04-17 01:59:54 UTC
Permalink
Post by Jonathan S. Shapiro
Post by David Jeske
Post by Matt Oliveri
If you want a typesafe runtime to replace C, then you are essentially
saying we should write all programs on top of some fixed runtime.
Yes. That runtime would evolve over time, but yes.
I more or less agree. One thing I would like to see is a successor to CIL
that provides for regions and parametric types, and maybe for "shape types"
(which I need to explain).
Sounds great! (As long as it doesn't stop the world.) Wait, I think that
angelic sound and light beaming down from the heavens is Shape-Jeske
unification. Let's go build that.

- Are CIL apartments not a different way to achieve Regions?

- What about CIL's type-instantiated generics does not meet your
definition of parametrics?

- Very interested to hear about shape types...

- As for your previous comment about wanting subtyping in structs... I'm
not convinced subtyping is the right reuse mechanism. It creates very large
challenges for modular forward upgradability. I'd prefer to see us
experiment more with composition and aggregation, more like COM or
Google Go (rough sketch below). Composition is a common pattern at the
program-architecture level as well, because of its greater flexibility vs.
locked single-inheritance hierarchies (i.e. traits, game component
architectures, flat driver interfaces, flat plugin interfaces, etc.)
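
For instance, a hypothetical component-style composition sketch (invented
names, C++ used only for illustration): the object aggregates flat
components rather than sitting in a single-inheritance hierarchy, so a
component can be added or swapped without re-deriving the whole tree:

    #include <memory>
    #include <vector>

    struct Component {                        // flat, plugin-style interface
        virtual ~Component() = default;
        virtual void update(double dt) = 0;
    };

    struct Physics : Component { void update(double dt) override { /* integrate */ } };
    struct Render  : Component { void update(double)    override { /* draw */ } };

    struct GameObject {
        std::vector<std::unique_ptr<Component>> parts;   // composition, not subtyping
        void update(double dt) { for (auto& p : parts) p->update(dt); }
    };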
Post by Jonathan S. Shapiro
Post by David Jeske
I object to this. I do not believe people turn to C to avoid compromise.
If so, they would always turn to assembler. I believe they turn to C because
it offers an improvement in programmer productivity in exchange for a
linear performance degradation.
But you yourself note that C is essentially asm with human syntax.
Touché! Guilty as charged.

I still maintain it is primarily the non-linearity of GC pausing that has
us stuck in C-land, not a linear performance degradation.

Linear performance degradation is not a limit to deployment or scalability,
and is only an issue in the most hyper-optimized of cases (which hardly
exist), whereas GC pausing is a hard response-time wall. GC systems which
stop the world don't give us ways to avoid it for many application types.
(The best we can do is get data out of the heap, which often means just
moving to C++.)
I'm really thinking about the reasons people continue to turn to C "in the
limit case" (i.e. into the foreseeable future, until we give them a way not
to), not the instantaneous reasons we do.

Broadening my scope somewhat, my list is much shorter than yours...

- non-linear performance cases (GC-pausing and mandatory stack-walking at
exception throw)
- lack of a suitable contract model for exceptions (which is a whole
additional discussion in itself)
- lack of a truly-permissive royalty-free-runtime... (mono-runtime is LGPL,
and there is Microsoft patent fear)

(Aside: I wish Microsoft would just solve the patent fear thing already.
They kinda did, but didn't, with the vague promise of
non-discriminatory licensing. It would be better if CIL were declared free
or had a licensing price list for runtime patents. Otherwise the industry
will probably steer clear for a long time. When I was at Google I
encouraged the consideration of CIL for Android, but the MSFT rivalry is
challenging to work through. Ironic now, given the Oracle lawsuit.
Licensing CIL might have been reasonable in comparison.)

I really like your list, so I'm going to deep-end a little commenting on
it...
Post by Jonathan S. Shapiro
- Maybe I'm writing something that is dependent on an existing library
in C (e.g. OpenSSL)
Software is kinda magical in that we can copy it for free, so I'm pretty
sure we'd build up a healthy core of pretty much everything as soon as a
typesafe system met the criteria. Heck, we've built up a pretty healthy
core of stuff in lots of systems that don't actually meet the criteria.
Post by Jonathan S. Shapiro
- Maybe I need decent I/O performance, and safety is pushing me into
data copies that I can't afford to do. Biagioni's work on TCP in ML is
instructive here. Foxnet is painted as a success, but actually it's pretty
depressing.
I don't see what data-copies are required if CIL talks to the syscall
interface directly. CIL also opens the door for post-syscall kernel
designs. Again, I'm thinking limit-case here.... of course today I agree
with your comment.
Post by Jonathan S. Shapiro
- Maybe my app warm-up time matters (so JIT gets in the way). This was
the main issue for in-transaction database servlets
This was a problem in the JVM, but CIL AOT pretty much solves it.
In fact, CIL provides the best of both worlds, because you can AOT-compile
something that approaches C-shlib-level startup times, and still get JIT
inlining across boundaries with polymorphic inline caches and other tricks.
CIL can also do a safer and better job of assuring there are no
field-offset-compatibility mismatches (like expectations of differently
laid-out stat_t structs) and can even fix some of them automagically.
Post by Jonathan S. Shapiro
- Maybe pause times worry me
Isn't that the ugly truth!
- Maybe the problem of debugging inadvertently retained storage was
too hard
I read that as "debugging memory leaks". It's always seemed roughly the
same as debugging C/C++ memory leaks to me. I suppose automatic memory
management does allow us to build contracts with much more complex memory
ownership consequences, which in turn could lead to this problem. I'll have
to consider that.
Post by Jonathan S. Shapiro
- Maybe I just don't know any better
We need to make a better CIL-ish runtime before we can say this. Currently
the C runtime *IS* actually better.
Post by Jonathan S. Shapiro
- Maybe I care about performance, therefore about memory placement,
and therefore need a rich system for unboxed types
I've been really happy with CIL structs, struct arrays, and
type-instantiated generics. What do you think we need that CIL can't do?
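
The memory-placement property in question, sketched in C++ terms (purely
illustrative): an array of value types is one contiguous block, while an
array of references scatters the payloads across the heap and adds a
pointer chase per element:

    #include <memory>
    #include <vector>

    struct Vec3 { float x, y, z; };                      // unboxed value type

    void layout_demo() {
        std::vector<Vec3> unboxed(1000);                 // one contiguous ~12 KB block
        std::vector<std::unique_ptr<Vec3>> boxed(1000);  // 1000 pointers, payloads scattered
        for (auto& p : boxed) p = std::make_unique<Vec3>();
    }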
Post by Jonathan S. Shapiro
- Maybe the *physical* heap footprint of my app matters. Either I
can't afford 10x the memory initially, or I'm operating at a scale where I
can't afford to *cool* 10x the memory. That's why Google can't use
some of this stuff.
Why would CIL take 10x the heap? I don't see that at all. Except for maybe
the dubious decision to make all strings UCS-2, I don't see any reason CIL
needs to be much bigger than C++. Azul-C4 has a constant heap overhead.

One of the projects I ran while at Google started out on .NET 2.0 / Windows
as a bit of an experiment. We tried to move it to Linux-mono, but it wasn't
viable at the time. It ultimately moved to a mix of C++ and Java. Memory
overhead (in CIL or Java) was not an issue. GC pausing was/is. It had "very
significant" traffic.
Post by Jonathan S. Shapiro
- Maybe I care about security and modularity, and I've worked out that
taming the Java standard library is utterly hopeless.
All I'm saying here is that GC is just one of *several* issues that are
getting in the way of safety adoption
Certainly there are more issues than GC pausing. I guess I single it out
because it's among the few issues that no amount of library work can solve.
One has to change the runtime and, if Azul C4 is any indication, the
operating system, and hopefully one day the processor.
Jonathan S. Shapiro
2012-04-16 17:42:29 UTC
Permalink
Matt makes several tempting comments here. They are thought-provoking, and
I think they bear some consideration, but I also think there is reason for
cautious skepticism.
Post by Matt Oliveri
The way I interpret Shap's point is that we don't know yet if
zero-pause GC is good enough for _all_ programs, or even all those
programs which would otherwise use C and C shlibs.
It's probably a lot closer than stop-the-world GC, but it's too early
to put all our eggs in that basket.
Hmm. Well, that's certainly a point I *should* have considered. The truth
is that my brain was more in the "let's tackle the stuff we actually know
how to do first" mode.

C4 illustrates a *foundational* problem. There is no such thing as
pauseless GC unless the collector can operate faster than the mutator. In
the limit that really isn't possible. The best we will ever achieve is
collectors that are pauseless *up to* a well-characterized allocation rate.
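
To put illustrative (made-up) numbers on that budget: if the mutator
allocates at 600 MB/s while the concurrent collector reclaims at 400 MB/s,
the heap grows at 200 MB/s, and 1 GB of headroom is gone in about five
seconds:

    headroom / (allocation rate - reclamation rate) = 1024 MB / 200 MB/s ~= 5 s

after which the collector has no choice but to stall allocation. So
"pauseless" is always relative to a sustainable allocation rate plus the
headroom available over a collection cycle.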
Post by Matt Oliveri
At any rate, maybe a typesafe runtime for all programs is not the right goal.
Splat! (That's the sound of spitting something disgusting out of my mouth
disgustedly).

I have seen many runtimes that cannot be implemented efficiently in
currently deployed commercial type systems. I have not seen *any* runtime
systems that inherently need to be type-unsafe. I have not seen *any* runtime
systems that would need to give up performance in order to be type-safe.
The latter statements assume that we apply modern type knowledge to these
systems. You can't do efficient type safety in the type systems of Java,
C#, or ML.
Post by Matt Oliveri
For starters, if zero-pause GC turns out to be unacceptable for some
programs, then we have no evidence that the goal is attainable.
You mean *when*. There *will* be applications with high, sustained
allocation rates that do not release storage.

But this is silly. There are applications whose allocation characteristics
don't fit malloc() too. Some of those applications are widely used. The JVM
comes to mind.
Post by Matt Oliveri
Further, even if it is, it's restrictive; not all applications need GC.
For example, those programs which can be shown to be safe just using
region typing should not be using GC.
Region safety does not guarantee timing of release or place any bound on
heap growth. A region-safe and region-managed program may not release heap
rapidly enough to run to completion. Regions definitely help, but they are
not a replacement for GC.
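
A quick sketch of that caveat (a hypothetical C++ arena standing in for a
region): everything allocated in the region is released in bulk when the
region dies, which is safe, but a long-running loop reclaims nothing until
the region exits, so the footprint is unbounded in the meantime:

    #include <cstddef>
    #include <vector>

    class Region {
        std::vector<char*> blocks;
    public:
        void* alloc(std::size_t n) {
            char* p = new char[n];
            blocks.push_back(p);
            return p;
        }
        ~Region() { for (char* p : blocks) delete[] p; }  // bulk release at region exit
    };

    void handle_requests_forever(Region& r) {
        for (;;) {
            void* scratch = r.alloc(4096);  // piles up: freed only when r is destroyed
            (void)scratch;
        }
    }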
Post by Matt Oliveri
But of course, we want to be able to build a system out of components
that use GC and components that only use region-typesafe allocation.
Why? I would say rather: we may be *forced* by the limits of a particular
GC to do this, but what we *want* is to adopt the simplest regime that will
actually solve our problem.
Post by Matt Oliveri
More generally, we want to be able to build a system out of components
that use all sorts of memory management policies.
The problem then is that anyone could come up with a new and
reasonable memory management policy at any time, and we'd want to be
able to integrate it.
That's a lovely image, but I think it's a bad goal.

The hard experience is that this just isn't possible in practice. We have
extended experience with mingling safe/unsafe environments. As it happens,
we also have experience with fully safe environments where different
processes use different GCs but can share objects in some form. Both
introduce nearly insurmountable complexity and risk into the overall system.
Post by Matt Oliveri
In summary, I think zero-pause GC is promising, and a safe runtime
that uses it could be extremely popular, but it will not replace C.
Actually, I disagree. If a pauseless GC supporting a high enough sustained
allocation rate becomes widely used, and a low-level type system that is
friendly to high-performance I/O comes into place, I suspect you would find
regulatory agencies all over the world requiring a cutover as a condition
of certification for their member companies.
Post by Matt Oliveri
I originally got interested in BitC because I thought its design goal
was to be a practical substrate for formal verification.
But now it seems like Shap is pursuing the "live dangerously" option,
but trying to make it less dangerous using strong, low-level types.
Shap found out that formal verification often yields less confidence than
strong typing, which makes the effort of formal methods not worth the gain.


shap
Matt Oliveri
2012-04-17 01:05:08 UTC
Permalink
I guess I should clarify.
Hopefully our disagreement is not too fundamental.
Also, I should put up a disclaimer that I'm not that experienced yet,
and systems programming is not my specialty, so I did not and do not
intend to disagree with you on performance issues.
Post by Jonathan S. Shapiro
Post by Matt Oliveri
At any rate, maybe a typesafe runtime for all programs is not the right goal.
Splat! (That's the sound of spitting something disgusting out of my mouth
disgustedly).
I have seen many runtimes that cannot be implemented efficiently in
currently deployed commercial type systems. I have not seen any runtime
systems that  inherently need to be type-unsafe. I have not seen any runtime
systems that would need to give up performance in order to be type-safe. The
latter statements assume that we apply modern type knowledge to these
systems. You can't do efficient type safety in the type systems of Java, C#,
or ML.
The restriction as I see it is not the general idea of type safety,
but the idea that there is some "one true" safe type system in which
all programming can be done without giving up performance.
You are most probably right that pretty much any runtime can be made type safe.
But good luck finding a single type system complete enough to admit
all those runtimes without modification.

The problem is that it's backwards to commit to a type system before
the program you want is written.
I believe we've gotten away with it so far because successful type
systems so far are still pretty simple, or at least based on something
simple, and because with simple type systems you expect to be unable
to write a program exactly the way you want.
When you make a type system, you've essentially chosen your well-typed
programs, it's just not necessarily obvious what things those programs
can or can't do.
For example, you discovered that the BitC type system, as implemented,
does not admit the IO library you wanted.
If there's any moral to the story, and these post-mortem discussions,
maybe it's that as type systems get fancier, there are lots and lots
of tricky design decisions, and the wrong decision will subtly rule
out a program you want.

My goal is a better way to design type systems, so that your program,
_exactly as you want it_, is well typed in your type system by
construction (of the type system, not the program).
A natural consequence of this goal is the idea that different
components can, and sometimes should be written using different type
systems.
If my goal were reached, we would design type systems by abstracting
out common arguments from provably-safe programs, not by picking a set
of abstractions haphazardly, and then seeing what programs they are
sufficient for.
This is analogous to how abstract interfaces for software components
can be designed either by abstracting details of known
implementations, or by intuition and good luck.
This reversal is possible for type systems because of semantic
approaches to type systems.
In other words, a type system is an abstract interface to certain
logical safety arguments.
Let me explain:

The current popular way to prove type system properties is syntactic.
You reason by inductions over the syntax of programs, types, and type
derivations.
These techniques are (relatively) straightforward, and pretty
powerful, but they leave the "meaning", the semantics, of types up to
intuition.
And intuition is often short-sighted, leading to selection of the
wrong type system.

But there is also the wider family of approaches for studying type
systems based on semantic modeling.
Basically you interpret features of a language (including types) as
mathematical or logical objects.
I'm specifically interested in approaches where:
- Data values are interpreted as objects in some model of computation
- Types are interpreted as predicates on (the interpretations of) data values
- The judgement that a value has a type is just the predicate applied
to the object
- Type derivations are interpreted as proofs of (the interpretations
of) type judgements
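As a toy sketch of that reading (illustrative only, with made-up C# names): the value domain is a set of untyped objects, and each "type" is nothing more than a predicate over them.

    // Illustrative sketch: types as predicates over an untyped value domain.
    delegate bool TypePred(object v);

    static class Semantics
    {
        // A "type" is just a predicate on values.
        public static readonly TypePred Nat  = v => v is int && (int)v >= 0;
        public static readonly TypePred Bool = v => v is bool;

        // The judgement "v : T" is nothing more than T applied to v; a type
        // derivation is a proof that this application holds.
        public static bool HasType(object v, TypePred t) { return t(v); }
    }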

In this view, type derivations are a high-level language for formal proofs.
A type checker is in effect a highly specialized automatic theorem prover.
I've left out technical details and useful generalizations to get the
point across.
This is the basic approach being used by the recent projects at
Princeton led by Andrew Appel.
Appel is using it to ground all the study of type systems,
compilation, and program verification in formal logic, but I like it
mainly because it provides a more flexible safe foundation than any
type system, upon which useful type systems can be _discovered_, so
that they don't have to be invented.
You decide what things your type system should be able to reason
about, and this dictates your choice of computation model, in a much
more direct way than the leap to a finished type system.
Equipped with your computation model, and a bag of formal semantics
tricks, you let your type system design itself by working through the
proofs for short examples, and generalizing to yield expressive types
and sound typing rules.

My current project is to build a framework and sort-of rapid
prototyping tool for advanced type systems, based on a proof
assistant, low-level computation model, and semantic modeling.
Even if a framework like this doesn't itself end up as the future
foundation of software, I bet it will be useful for finding a "one
true" type system, if it exists.
Jonathan S. Shapiro
2012-04-16 17:27:33 UTC
Permalink
Post by David Jeske
Post by Jonathan S. Shapiro
Post by David Jeske
After we have a zero-pause runtime, we can spend time and research on
alternative techniques such as regions and advancements to value-types to
relieve pressure on the GC system.
Post by Jonathan S. Shapiro
That's backwards. The region and other advances are available to us
today. The pauseless collectors are still not understood well enough to be
mainstream yet.
I'm working on a longer response to some other points, but I think this
divide above deserves some digging into on both sides. Here is my
reasoning. I'm very interested to hear yours...
My reasoning is actually very simple.

Pauseless GC is important, but there are many other issues that demand our
attention if we are going to write high-performance systems. Regions turn
out to be among them, and we know - today - how to do them.

When I wrote the note above, I wasn't aware of how far C4 had managed to
come. In that context, it was my view that production-ready concurrent
collection remained an open research problem, while regions are no longer
an open research problem. I did not intend to propose regions as a complete
replacement for pauseless GC. I was merely trying to say that regions solve
*some* of the hard problems that we need to solve, that we need them
anyway, and that we should pursue the things we know how to do while we
continue to pursue open research problems.

I agree with your point that we don't have a solution for all programs
until pauseless GC is available. That said, we are unlikely to get to a
complete solution in one jump, and we shouldn't neglect the steps that will
help us get further along.

My belief, by the way, is that pause times mostly are not the issue in the
eyes of the world. The much bigger issue, in my view, is data copy delays.
It is exceptionally hard to design any sort of efficient I/O subsystem
within the limitations of current safe type systems. The advantage to C
doesn't lie in manual memory management. It lies in type-unsafety, and the
fact that type-unsafe idioms are very effective when you are building
mechanisms for copy avoidance. As a canonical example, look at the BIO
mechanism of OpenSSL.
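To make "copy avoidance" concrete, here is a rough C# sketch (hypothetical names; ArraySegment stands in for a buffer view): the layers hand around views into one buffer rather than copying bytes. The hard cases for safe type systems are the ones such wrappers don't cover, e.g. viewing raw bytes as structured data without a copy.

    using System;

    // Illustrative sketch: protocol layers pass views of one buffer, not copies.
    struct Packet
    {
        public ArraySegment<byte> Header;
        public ArraySegment<byte> Payload;
    }

    static class Wire
    {
        public static Packet Parse(byte[] wire, int headerLen)
        {
            // Both segments alias 'wire'; no bytes are copied as the packet
            // moves up the stack.
            return new Packet
            {
                Header  = new ArraySegment<byte>(wire, 0, headerLen),
                Payload = new ArraySegment<byte>(wire, headerLen, wire.Length - headerLen)
            };
        }
    }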

One of the interesting properties of the *low*-level type system that I
envision for BitC is that it can preserve type safety for these or
equivalent idioms.

I agree with you that a 20ms pause time is a bad thing. It isn't the only
bad thing, and it may not even be the most important bad thing. We need to
look at the problem space as a complete system.

Post by David Jeske
The only type-safe heap reclamation schemes I'm aware of which are capable
of representing all programs are (a) some variation of mark-sweep-compact
tracing GC, and (b) ref-counting, with some variation of mark-sweep cycle
finding. Therefore, one of these two schemes is a good candidate for the
base of a typesafe runtime (once it meets our no-pause/low-jitter
requirement) to replace C/shlibs/malloc, whereas regions et al. are only
suitable as optimizations within such a system.
With the release of C4, I think your list may need to be extended. But
also, I think you underweight the possibility of (a) being able to choose
when to take the GC hit, and (b) being able to control which operations
allocate. Point [b] is crucial. In most safe languages there is no contract
between me and the compiler regarding what actions may result in heap
allocations.
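As a small illustration of that missing contract: none of the C# statements below mentions 'new', yet each can allocate on the managed heap, and nothing in the language marks them as doing so.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class HiddenAllocation
    {
        static void Demo(int n, IEnumerable<int> xs)
        {
            object boxed = n;                        // boxing allocates
            string s = "n = " + n;                   // concatenation allocates a string
            Func<int, int> f = x => x + n;           // closure capturing n allocates
            foreach (int x in xs.Where(v => v > 0))  // delegate + iterator allocate
            {
                // ...
            }
        }
    }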
Post by David Jeske
What is your argument?
Do you believe regions and such are capable of representing all heap
reclamation needs and we simply don't know how? It seems to me any other
argument devolves into gc or manual management, which simply proves my
point.
I definitely don't believe that. What I believe is that when a cave man
plans a trip to Alpha Centauri, he's unlikely to make it as far as the
moon. Starting with a trip to Mars may be a better plan.


Jonathan
Raoul Duke
2012-04-10 17:23:29 UTC
Permalink
Post by David Jeske
I would be very happy if either were available and usable on x86.
i am saying this somewhat tongue-in-cheek, having never used it.
http://lmgtfy.com/?q=azul+x86+commercial
David Jeske
2012-04-11 04:11:27 UTC
Permalink
Raoul - Thanks for posting that reference to Azul.

While we've heard of Azul before (and already mentioned them in this
thread), I was not aware that they had already done exactly what I had
wondered about. They mated their Pauseless GC algorithm (now called C4) to
an x86 MMU supported barrier. Their paper claims sub 0.02s worst-case pause
times with a 30GB+ heap, where Sun's concurrent collector has a worst-case
pause of over 10 seconds.

http://www.azulsystems.com/technology/c4-garbage-collector

http://dl.acm.org/citation.cfm?id=1993491&dl=ACM&coll=DL&CFID=96709212&CFTOKEN=86276521

Very interesting read for anyone who hasn't seen it.
Jonathan S. Shapiro
2012-04-12 02:35:25 UTC
Permalink
Their paper claims sub 0.02s worst-case pause times with a 30GB+ heap,
where Sun's concurrent collector has a worst-case pause of over 10 seconds.
For those of you computer folks who are only comfortable in multiples of
two, that's a 20ms pause time. :-)

Which is still a factor of 20 larger than I really want to see.

But thanks for the pointer, David!
wren ng thornton
2012-04-11 22:24:51 UTC
Permalink
Post by David Jeske
I do not understand the significance of your observation. AFAIK, every
currently shipping software collector, for every language system, requires
an unpredictable world-stop. If I am mis-informed, please point me to one
that does not.
Nope. For instance, the places where GC can happen are fully defined in
the Glasgow Haskell Compiler. In particular, the need for GC is only
checked on allocation; thus, you can write tight loops which are strict
enough to guarantee no allocation and therefore no GC. It's not scalable
in any sense of the word; but it is doable, and it's even predictable
(if inscrutable) once you know GHC's internals.

I'm sure most other industry-quality compilers have similar techniques
in order to make GC pauses more predictable and/or less noticeable.
--
Live well,
~wren
Jonathan S. Shapiro
2012-04-12 02:30:55 UTC
Permalink
Post by wren ng thornton
Nope. For instance, the places where GC can happen are fully defined in
the Glasgow Haskell Compiler....
This, of course, is insufficient for multithreaded programs.
Post by wren ng thornton
...you can write tight loops which are strict
enough to guarantee no allocation and therefore no GC.
My sense, as yet undemonstrated, is that it may be a bit more scalable in
something like BitC, where unboxed types are available.

shap
Bennie Kloosteman
2012-04-12 13:02:21 UTC
Permalink
Post by Jonathan S. Shapiro
This, of course, is insufficient for multithreaded programs.
Post by wren ng thornton
...you can write tight loops which are strict
enough to guarantee no allocation and therefore no GC.
My sense, as yet undemonstrated, is that it may be a bit more scalable in
something like BitC, where unboxed types are available.
You can do something similar in C# with stackalloc and some trickery
(which BitC would do significantly better); however, in a heavily threaded
app all the other threads can't be causing allocations either, so it works
better with a single thread/message pump.
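A minimal sketch of the trick Ben describes (illustrative names; needs an unsafe context): the hot loop takes its scratch space from the stack and performs no managed allocation, so it cannot trigger a collection by itself - though, as he says, other threads still can.

    static class Dsp
    {
        // Compile with /unsafe. No 'new', no boxing, no closures in the loop,
        // so the loop body never triggers GC on its own.
        public static unsafe int SumSquares(int n)
        {
            int* scratch = stackalloc int[256];   // stack memory, not the GC heap
            int total = 0;
            for (int i = 0; i < n; i++)
            {
                scratch[i % 256] = i * i;
                total += scratch[i % 256];
            }
            return total;
        }
    }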

Ben
Jonathan S. Shapiro
2012-04-13 17:12:42 UTC
Permalink
Post by Bennie Kloosteman
Post by Jonathan S. Shapiro
This, of course, is insufficient for multithreaded programs.
Post by wren ng thornton
...you can write tight loops which are strict
enough to guarantee no allocation and therefore no GC.
My sense, as yet undemonstrated, is that it may be a bit more scalable in
something like BitC, where unboxed types are available.
You can do something similar in C# with stackalloc and some trickery (
which Bitc would do significantly better) however in a many threaded app
all the other threads cant be causing allocations either so it works
better a single thread/message pump .
Yes. This is actually one of the places where regions help a lot. One of
the kinds of analysis you can do (conservatively) is determine whether an
object ever escapes to some thread *other than* its allocating thread.
Thread-affine objects can be GC'd without coordination, and can be accessed
without concerns over concurrency.

shap
Sandro Magi
2012-04-12 05:17:06 UTC
Permalink
Post by David Jeske
I do not understand the significance of your observation. AFAIK, every
currently shipping software collector, for every language system, requires
an unpredictable world-stop. If I am mis-informed, please point me to one
that does not.
I previously pointed out on-the-fly reference counting collectors as a
counter example. On-the-fly collectors interleave collection with the
mutator, so they don't suffer high latencies or stop the world. The
overheads are higher, but they are naturally concurrent and incremental.

See my previous note where I replied to you after I replied to Jonathan.

Sandro
Florian Weimer
2012-04-12 06:25:09 UTC
Permalink
Post by Sandro Magi
I previously pointed out on-the-fly reference counting collectors as a
counter example. On-the-fly collectors interleave collection with the
mutator, so they don't suffer high latencies or stop the world. The
overheads are higher, but they are naturally concurrent and incremental.
And some of us want the extra throughput, which means that we cannot
have a shared heap with low latency code and we need some sort of JIT
to provide different barriers for different garbage collectors. Large
heaps are still very difficult.

(And it's not even true that everything is better with properly-used
malloc(): you can run into fragmentation issues.)
Jonathan S. Shapiro
2012-04-10 01:18:47 UTC
Permalink
Post by David Jeske
Post by Jonathan S. Shapiro
I'd certainly be very happy to see that, but I think we need to take one
step at a time. We still don't know how to do modularity in the *absence* of
versioning issues.
This I'm surprised at.
In the absence of versioning issues, don't most dynamic language runtimes
handle modularity? Java/JVM, C#/CIL, Python, Ruby, Javascript... all can
have interfaces specified (whether checked by a compiler or not) and load
compatible classes at runtime.
I think that most of the "object as hash-map" languages are immodular - the
problem is that in most of those languages an object's hash-map is
modifiable by third-party code.

Regarding Java and MSIL, I think you make a good point. Java, regrettably,
has insufficient support for unboxed types.

Post by David Jeske
As far as I can see, both C-shlibs and CIL both (to some degree) at least
have some capability to handle multi-versioned dependencies because both of
them record the required dependency version in a way that is visible to the
runtime load/linker. Further, they can both load multiple versions of the
same shlib/DLL. Compare this with Java, Python, Javascript, etc.. which
don't have any record of such information to make available to the runtime,
nor can the runtime handle more than one version of a module/class loaded
because of name-clash.
Right. So recording the version number seems like a good place to start,
and perhaps it may be as much as can be done - though it's also important
that those version numbers reflect interface contract versions.

I also agree that Java, MSIL, and friends seem to have this wrong, though
it's certainly easy enough to arrange a class path with a known set of
versions.

Post by David Jeske
I'd be satisfied if I could have badly overspecified interfaces between DLLs
and actually load two modules without version typeclash. I think C-shlibs
and CIL both give me this. Do they not?
I'm not sure about CIL.

Another interesting approach here would be to declare that the effect of
loading a library is to get back a function which, when called,
instantiates that library. The point being that if there is no static
global state then you can load multiple versions of a subsystem
simultaneously into the same address space.
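A rough C# sketch of that idea (hypothetical names): the library exports nothing but a factory, and all of its state lives in the instance the factory returns, so two versions can coexist in one address space.

    // Illustrative sketch: a library with no static state, exposed as a factory.
    public interface ICodec
    {
        byte[] Encode(byte[] data);
    }

    public static class CodecLibraryV1
    {
        // "Loading" the library yields this function; calling it instantiates
        // the library. Nothing is kept in statics.
        public static ICodec Instantiate()
        {
            return new CodecV1();
        }

        private sealed class CodecV1 : ICodec
        {
            // Per-instance "library state" would live here, not in statics.
            public byte[] Encode(byte[] data)
            {
                return data;   // placeholder behaviour
            }
        }
    }
    // A second version exposes its own Instantiate(); a host can hold an
    // ICodec from each version in the same process.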

Post by David Jeske
I see and agree fully with your view here. I wasn't aware of this 'tree
shaking'. When I referenced Singularity I was thinking of the attempt to
provide isolation between modules via a mechanism which resembles the type
of brute-force message-passing we do today (between kernel and user or on
socket interfaces), but without the copying... (i.e. singularity's
immutable object transfer model) I admit it was many years ago I read this
paper, so I may be remembering it incorrectly.
You're thinking of the linear typing work. It turns out that linear typing
really isn't a good match for the way that data tends to flow in real
systems, and in any case, merely mortal programmers can't manage linear
types when they are exposed in their full glory.

Post by David Jeske
The problem of stop-the-world pauses is exacerbated by threading and
layered systems. Much like cumulative MTBF, when many subsystems may have
unpredictable worst-case pauses overlaid, the systemic worst-case pause
is additive....
Ah. When I read this initially, I was considering a slightly different
issue - the cumulative memory overheads arising from the fact that
two-space (or more) GCs hold twice as much young memory as they actually
need, and this tends to present a requirement on real RAM. Your point is
important as well.

However: it is often the case that in layered systems of the type you
describe, the layers in the "waiting for message" state are sitting at the
top of an event loop, and therefore have an effectively empty heap. I think
it is an interesting question whether this can be usefully exploited
somehow.


shap
David Jeske
2012-04-10 10:50:25 UTC
Permalink
Post by Jonathan S. Shapiro
Post by David Jeske
In the absence of versioning issues, don't most dynamic language
runtimes handle modularity? Java/JVM, C#/CIL, Python, Ruby, Javascript...
all can have interfaces specified (whether checked by a compiler or not)
and load compatible classes at runtime.
Post by Jonathan S. Shapiro
I think that most of the "object as hash-map" languages are immodular -
the problem is that in most of those languages an object's hash-map is
modifiable by third party code.

I agree with this design criticism, though in practice this is often not an
issue. For example, the large majority of Python software does not modify
classes at runtime. Javascript (and prototype languages in general) do,
however, have this problem.

In practice I have the most trouble because
source-dynamic languages don't have version specificity. For example, the
tiny Python program (#!/usr/bin/python import foo) offers absolutely no
means of dependent version disambiguation of either the runtime or the
module.

This problem is exacerbated by traditional OS/UNIX design, where
/usr/bin/python is a single installed version of python. How then are we to
support two python programs which require different versions of python to
coexist on the same system?

This is not at all a problem for the C-shlib runtime, in which each
coff/elf executable references specific versions of dependent shlibs, and
where those shlibs are uniquely named to allow multiple versions to coexist.
Post by Jonathan S. Shapiro
Regarding Java and MSIL, I think you make a good point. Java,
regrettably, has insufficient support for unboxed types.

Agreed. In my view CIL seems a strictly better technical design than JVM
in all ways.

If CIL has a weakness, it's that ECMA/ISO standardization and a single
open-source implementation (mono) have not been enough to eclipse industry
concern over Microsoft's intentions.
Post by Jonathan S. Shapiro
Right. So recording the version number seems like a good place to start,
and perhaps it may be as much as can be done - though it's also important
that those version numbers reflect interface contract versions.

Yes. Version-specific runtime linking (like C-shlibs) is critical...
Post by Jonathan S. Shapiro
I also agree that Java, MSIL, and friends seem to have this wrong, though
it's certainly easy enough to arrange a class path with a known set of
versions.

My understanding is that MSIL/CIL does this correctly (and is the only
one). CIL Assemblies make reference both to a specific runtime version, and
specific versions of dependencies, just like C-shlibs.
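To make that concrete (assembly names, versions, and token below are hypothetical): on the CLR an assembly's identity includes its version, so code can bind to, and even load, two versions side by side.

    using System;
    using System.Reflection;

    static class SideBySide
    {
        static void Main()
        {
            // Fully qualified names include the version; the two loads below
            // resolve to distinct assemblies even though the simple name matches.
            Assembly v1 = Assembly.Load(
                "Acme.Widgets, Version=1.0.0.0, Culture=neutral, PublicKeyToken=0123456789abcdef");
            Assembly v2 = Assembly.Load(
                "Acme.Widgets, Version=2.0.0.0, Culture=neutral, PublicKeyToken=0123456789abcdef");

            object w1 = v1.CreateInstance("Acme.Widgets.Widget");
            object w2 = v2.CreateInstance("Acme.Widgets.Widget");
            Console.WriteLine(w1.GetType() != w2.GetType());   // distinct runtime types
        }
    }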
Post by Jonathan S. Shapiro
Another interesting approach here would be to declare that the effect of
loading a library is to get back a function which, when called,
instantiates that library. The point being that if there is no static
global state then you can load multiple versions of a subsystem
simultaneously into the same address space.

The true requirement is the elimination of absolute-fixed-addresses. AFAIK,
shlibs achieve this through PIC and load-time resolution, while still
having state they believe is 'global'. We can load three versions of a
C-shlib into the same process without the global state colliding.

Either way, we're thinking along the same lines. Pre-requisites for
handling versioning issues include being able to see the versions
requested, and load multiple versions into the runtime. AFAIK, C-shlibs and
CIL can do this, other systems can not.
Post by Jonathan S. Shapiro
However: it is often the case that in layered systems of the type you
describe, the layers in the "waiting for message" state are sitting at the
top of an event loop, and therefore have an effectively empty heap. I think
it is an interesting question whether this can be usefully exploited
somehow.

I struggle to understand what system this is with an empty heap. Take a
look at some open-source projects struggling with this now... HBase and
Cassandra... two data-storage engines built with GC. They have very busy
heaps, and their GC pauses are pretty unacceptable for clients.
Jonathan S. Shapiro
2012-04-10 17:13:22 UTC
Permalink
Post by David Jeske
Post by Jonathan S. Shapiro
Post by David Jeske
In the absence of versioning issues, don't most dynamic language
runtimes handle modularity? Java/JVM, C#/CIL, Python, Ruby, Javascript...
all can have interfaces specified (whether checked by a compiler or not)
and load compatible classes at runtime.
Post by Jonathan S. Shapiro
I think that most of the "object as hash-map" languages are immodular -
the problem is that in most of those languages an object's hash-map is
modifiable by third party code.
I agree with this design criticism, though in practice this is often not
an issue. For example, the large majority of Python software does not
modify classes at runtime. Javascript (and prototype languages in general)
do, however, have this problem.
I was interested to note in ECMAScript 5 that the defaults for mutability
in this regard were changed. You can still have mutable hash-maps, but it
is now the default that they are *not* mutable.
Post by David Jeske
This problem is exacerbated by traditional OS/UNIX design, where
/usr/bin/python is a single installed version of python. How then are we to
support two python programs which require different versions of python to
coexist on the same system?
Change the path in the #! .
Post by David Jeske
Agreed. By my view CIL seems a strictly better technical design than JVM
in all ways.
Other than public availability on non-M$ platforms out from under a legal
cloud.
Post by David Jeske
Post by Jonathan S. Shapiro
However: it is often the case that in layered systems of the type you
describe, the layers in the "waiting for message" state are sitting at the
top of an event loop, and therefore have an effectively empty heap. I think
it is an interesting question whether this can be usefully exploited
somehow.
I struggle to understand what system this is with an empty heap. Take a
look at some open-source projects struggling with this now... HBase and
Cassandra... two data-storage engines built with GC. They have very busy
heaps, and their GC pauses are pretty unacceptable for clients.
Then they aren't structured in the way I have in mind.
William ML Leslie
2012-04-16 13:22:57 UTC
Permalink
Post by David Jeske
In the absence of versioning issues, don't most dynamic language runtimes
handle modularity? Java/JVM, C#/CIL, Python, Ruby, Javascript... all can
have interfaces specified (whether checked by a compiler or not) and load
compatible classes at runtime.
As far as I can see, both C-shlibs and CIL both (to some degree) at least
have some capability to handle multi-versioned dependencies because both of
them record the required dependency version in a way that is visible to the
runtime load/linker. Further, they can both load multiple versions of the
same shlib/DLL. Compare this with Java, Python, Javascript, etc.. which
don't have any record of such information to make available to the runtime,
nor can the runtime handle more than one version of a module/class loaded
because of name-clash.
I've always figured this problem is in the other direction. Java has
classloaders (and e.g. OSGi). In Python, classes are first-class, and the
problem is actually in the module system (although exocet showed that
this could be fixed without the obvious hack, if anyone cared). In
Javascript, the syntactic mechanisms for module loading imply that
types are first-class there, too (although extending existing types is
a known problem there).

The real problem is C; where the dynamic linker appears to provide one
big ugly flat untyped namespace. You may be able to use and load two
shared objects with versioned interfaces, but if they both provide
functions from the earlier version, there's no telling which you will
get. If they don't provide versioned symbols, you're just SOL. And
forget about linking against the same library twice, say, to get two
copies of all of the static state. You just don't do that in C; so I
don't see how you can really call it an example of modularity.

Mind you, I never figured BitC was attempting to fix the dynamic linker.
--
William Leslie
William ML Leslie
2012-04-16 14:53:25 UTC
Permalink
Post by William ML Leslie
Post by David Jeske
In the absence of versioning issues, don't most dynamic language runtimes
handle modularity? Java/JVM, C#/CIL, Python, Ruby, Javascript... all can
have interfaces specified (whether checked by a compiler or not) and load
compatible classes at runtime.
As far as I can see, both C-shlibs and CIL both (to some degree) at least
have some capability to handle multi-versioned dependencies because both of
them record the required dependency version in a way that is visible to the
runtime load/linker. Further, they can both load multiple versions of the
same shlib/DLL. Compare this with Java, Python, Javascript, etc.. which
don't have any record of such information to make available to the runtime,
nor can the runtime handle more than one version of a module/class loaded
because of name-clash.
I've always figured this problem is in the other direction. Java has
classloaders (and e.g. OSGi). In Python, classes are first-class, and the
problem is actually in the module system (although exocet showed that
this could be fixed without the obvious hack, if anyone cared). In
Javascript, the syntactic mechanisms for module loading imply that
types are first-class there, too (although extending existing types is
a known problem there).
The real problem is C; where the dynamic linker appears to provide one
big ugly flat untyped namespace. You may be able to use and load two
shared objects with versioned interfaces, but if they both provide
functions from the earlier version, there's no telling which you will
get. If they don't provide versioned symbols, you're just SOL. And
forget about linking against the same library twice, say, to get two
copies of all of the static state. You just don't do that in C; so I
don't see how you can really call it an example of modularity.
Apologies: it seems that dependency ordering is used by default,
according to dlsym() documentation. So the big ugly flat untyped
namespace you *think* you get when writing C is only an illusion.
Good times all round.
--
William Leslie
David Jeske
2012-04-16 16:31:14 UTC
Permalink
Post by William ML Leslie
Apologies: it seems that dependency ordering is used by default,
according to dlsym() documentation. So the big ugly flat untyped
namespace you *think* you get when writing C is only an illusion.
Good times all round.
Yes... now compare that with Python or Java, where neither source nor
compiled code includes any version information.

Perhaps we are using the wrong term when talking about 'binary
compatibility'... perhaps we should be using the term 'machine version
dependency resolution'... C shared libraries and CIL can do it; AFAIK no
other system can. IMO this is a big fault that needs to be fixed.
Bennie Kloosteman
2012-04-10 10:12:26 UTC
Permalink
Post by Jonathan S. Shapiro
Microsoft singularity is quite interesting research in this direction, as
it is attempting to eclipse the C-shared-library.
The "whole program" idea in Singularity rests on a claim that David
Tarditi made about a process he called "tree shaking". Tree shaking, in
brief, is a process for removing unreachable code from programs. David
claimed, without supporting data or careful study, that the binary size
reduction obtained from tree shaking was greater than the size reduction
obtained from shared libraries.
1. It disregards the fact that the two optimizations are orthogonal.
The ability to remove unreached code does not reduce the value of gathering
*reused* code in a common place.
2. The metric of interest isn't the size reduction in a single
program, but the total code footprint across the system as a whole (that
is: across *many* programs). The tree shaking approach results in a
system where *most* programs will duplicate a certain body of code
that is commonly used. That's the code that belongs in shared libraries.
Re 1. In Singularity's case its IPC was clearly geared to the concept of
many async services taking some of the roles of DLLs and hence
encouraging re-use at the service level, more like Minix/microkernels. (I
mean Singularity didn't even have DLLs in the public release, and it was a
long way from supporting them.)

Re 2. If you have a single large static system runtime like .NET, the
remaining shared code becomes small... This remaining code is often very
domain-specific and hence well placed in services. In many cases, for a
Singularity-type system/runtime this would reduce the overall system
footprint significantly.


Ben
Jonathan S. Shapiro
2012-04-10 16:57:29 UTC
Permalink
Unfortunately we are treading on the edge of things I am unable to discuss
as a result of being on the Midori team. If the conversation heads off in
the right direction without my leading it, I may be able to comment
selectively.
Post by Bennie Kloosteman
Re 2. If you have a single large static system runtime like .NET the
remaining shared code becomes small... This remaining code is often very
domain specific and hence well placed in services. In many cases for a
Singularity type system/ runtime this would reduce the overall system
footprint significantly
This isn't correct, mainly because there is *no* shared code in a typical
JIT execution. It's certainly possible to use AOT compilation, but I'm not
sure whether the M$ JIT for CLR does so.
Bennie Kloosteman
2012-04-11 07:39:00 UTC
Permalink
Post by Jonathan S. Shapiro
Unfortunately we are treading on the edge of things I am unable to discuss
as a result of being on the Midori team. If the conversation heads off in
the right direction without my leading it, I may be able to comment
selectively.
Post by Bennie Kloosteman
Re 2. If you have a single large static system runtime like .NET the
remaining shared code becomes small... This remaining code is often very
domain specific and hence well placed in services. In many cases for a
Singularity type system/ runtime this would reduce the overall system
footprint significantly
This isn't correct, mainly because there is *no* shared code in a typical
JIT execution. It's certainly possible to use AOT compilation, but I'm not
sure whether the M$ JIT for CLR does so.
There is nothing stopping a JIT from patching in a call address for shared
code; you can even do it on flat memory, but it's a lot more painful. With
.NET you can AOT-compile (ngen.exe) an assembly (DLL) into binary, and the
framework assemblies are compiled for the target machine when installed
(for all machines). Now when your code uses the JIT, it should just
ensure the binary DLL is loaded (with the usual shared code / private
data) and patch the call address.

Also, Singularity used AOT (compile on install?) and only had a rudimentary JIT.


It may be that the JIT times for this are too high due to security checks
on the assembly, but I'm not aware of this. I also doubt that when I start 10
hello-world C# programs my memory goes up 200 MB from multiple copies of
the shared framework.



Ben
wren ng thornton
2012-04-12 22:13:10 UTC
Permalink
Post by Jonathan S. Shapiro
Post by wren ng thornton
Nope. For instance, the places where GC can happen are fully defined in
the Glasgow Haskell Compiler....
This, of course, is insufficient for multithreaded programs.
How's that now? GHC has extremely good multithreaded support, and if all
active threads are in non-allocating loops then what I said holds...
Post by Jonathan S. Shapiro
Post by wren ng thornton
...you can write tight loops which are strict
enough to guarantee no allocation and therefore no GC.
My sense, as yet undemonstrated, is that it may be a bit more scalable in
something like BitC, where unboxed types are available.
Surely. There are unboxed types in GHC though they're not first-class, as
they would be in BitC. I wasn't claiming that Haskell solves this problem,
far from it; I was merely pointing to GHC as a constructive proof that
there are shipping garbage collectors which are not unpredictable in the
way claimed by whoever I was replying to.
--
Live well,
~wren
David Jeske
2012-04-12 22:36:54 UTC
Permalink
Post by wren ng thornton
Surely. There are unboxed types in GHC though they're not first-class, as
they would be in BitC. I wasn't claiming that Haskell solves this problem,
far from it; I was merely pointing to GHC as a constructive proof that
there are shipping garbage collectors which are not unpredictable in the
way claimed by whoever I was replying to.
Claiming a collector has no unpredictable pauses because it only pauses
during allocation is only an interesting observation when it is coupled
with general case allocation-free solutions to common problems. This is
greatly complicated when one wants to use standard components and libraries
written by others which "may or may not" use these techniques.

We just have to stop stopping the world...
Jonathan S. Shapiro
2012-04-13 16:50:57 UTC
Permalink
Post by David Jeske
Claiming a collector has no unpredictable pauses because it only pauses
during allocation is only an interesting observation when it is coupled
with general case allocation-free solutions to common problems. This is
greatly complicated when one wants to use standard components and libraries
written by others which "may or may not" use these techniques.
That seems like a very valid point.

Perhaps it raises a small additional retrospective point concerning BitC:
the language had some very minor features that were designed to address
this type of need (which, sadly, didn't get implemented). One of these was
effect typing over a small set of effects:

- heap allocation
- noescape
- confined

The first is important because it addresses exactly the issue you raise.
It's not practical to try to build non-allocating subprograms unless you
can state contracts about allocation in the language. I'd note that heap
allocation isn't the only interesting effect here. I/O is another.
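Purely as a hypothetical illustration (neither the attribute nor the check below exists in C#, nor in the implemented BitC): the shape of such a contract is an annotation on a subprogram that a compiler could enforce.

    using System;

    // Hypothetical: a compiler honouring this attribute would reject any 'new',
    // boxing, or closure capture in the annotated body.
    [AttributeUsage(AttributeTargets.Method)]
    sealed class NoHeapAllocationAttribute : Attribute { }

    static class Checksums
    {
        [NoHeapAllocation]
        public static int Fletcher16(byte[] data)
        {
            int sum1 = 0, sum2 = 0;
            for (int i = 0; i < data.Length; i++)
            {
                sum1 = (sum1 + data[i]) % 255;
                sum2 = (sum2 + sum1) % 255;
            }
            return (sum2 << 8) | sum1;
        }
    }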

Escape analysis is another interesting case, since it lets us know whether
we are dealing with contained subsystems.

Confined is merely escape analysis in both directions, which is important
because it tells us that the subsystem is deterministic.
Post by David Jeske
We just have to stop stopping the world...
I think we all basically agree about that. Get to work on that collector!

I'm going to send out an addendum about STOPLESS on the other thread.