Author of the post here. Happy to answer any questions I can, and FYI, we (Sourcegraph) are liveblogging all of GopherCon at http://gophercon.sourcegraph.com. Let us know if you have any questions or find it useful!
Thanks for mentioning this. I initially included "(unofficial liveblog post)" in the HN title, but it apparently was edited out by a moderator. So, I added a clarification to the top of the post itself.
It's great that they are aiming for an automated conversion from C to Go. It's clear they aspire to convert their own code, which was written in a certain way. But I think it would be a huge boost to Go usage if they could eventually aim to transpile any C code into Go code.
A little dream of mine would be if in the future when Rust is stable Mozilla developed a transpiler from C++ to Rust. That would be brilliant.
By the way, all the other talks at GopherCon seem pretty interesting, I hope someone uploads videos of them soon.
It is unclear how one would compile arbitrary C code into useful Go. The stereotyped conventions of well-written compiler code allow for a more idiomatic translation than a general translator could ever aspire to.
C++ to Rust would be even crazier.
(Transpile is a silly word. It's "compile". Compiling already trans-es.)
I know it's crazy difficult to do it, that's why I was talking about dreams.
But maybe a set of guidelines about translatable C code plus a translator which is good enough for a variety of cases would make refactoring the resulting code such a manageable task that many projects could consider switching to Go directly.
You are technically correct in that compile already implies a translation, but we usually use that term referring to a translation into a lower level language and not another language at the same level. You can say transpiler or source-to-source compiler for those cases, and I think it's clearer and more accurate for the reader.
What part of his definition didn't you understand? He said a transpiler translates from a higher-level language to a lower-level language. We are talking about translating C -> Go and C++ -> Rust, both of which go in the opposite direction (i.e., low -> high).
> He said a transpiler translates from a higher-level language to a lower-level language
No, he said a compiler translates from a higher-level language to a lower-level language:
"You are technically correct in that compile already implies a translation, but we usually use that term [= compile] referring to a translation into a lower level language"
This should not have been down-voted. We are talking about translation from C -> Go and C++ -> Rust (which are both in the opposite direction of the definition given for 'transpiling').
I agree that your comment added value to the discussion, and gave a counter-vote.
However, arguably the term could make more sense there, if we ignore the earlier definition and instead assume "trans" is short for "transcendent", as in "climbing to a higher level".
If we then simplify that to "compile from one language to another of roughly equal or higher level", it becomes a useful word to indicate a specific subset of ways one can do compilation.
Of course, I'm completely pulling this out of my ass and I might piss off actual CompSci people who use very specific and strict definitions in their papers (kind of like how some get annoyed with the procedure/function mixup), so take this with a grain of salt.
> You are technically correct in that compile already implies a translation, but we usually use that term [ie compile] referring to a translation into a lower level language and not another language at the same level.
The way I understood what he wrote, he was making the same point as you. Presumably the people downvoting you had a similar interpretation of what he wrote.
Good automated translations involve matching and translating the project's idioms. This results in code that fits the target language's idioms. Automated translation only at the level of language constructs results in odd-looking generated code.
The use case is to bootstrap the conversion from C to Go, if one has made a decision to do so. You pick a cutoff time and say, “now we're going to do our work in Go”.
What comes out the other end, as Russ phrases it, is a C program in Go syntax. The next phase of work is to turn it into “Go” as a human might write it.
The translated program is not meant for deployment, really. It’s to give the humans a starting point.
I think the best approach for something like this is a decompiler for LLVM IR. I am not extremely familiar with LLVM IR, but I figure there are common patterns, or debug symbols clang produces, that could give you a decent idea of what the higher-level source looked like. If it isn't targeted too closely at clang's output, you could then transpile any LLVM-able language to Rust.
I think, but am not sure, that the problem with Rust being a target for these other languages (my pipe dream is to write a JVM in it w/ a JIT using the bootstrapped rustc lib) is that you have to use unsafe code all the time (unless there's 0 performance penalty for ARC) because the original code is not written with ownership/borrowing semantics.
I think there are probably lots of chunks of code that would be perfectly valid under Rust's ownership semantics and could therefore be transpiled safely. Figuring out which chunks are and aren't would probably be quite difficult though.
They want to go quite a step up from doing some global regex replaces, but in the end, step 3 is "cleaning up and documenting the code, and adding unit tests as appropriate". I don't think they aim to automate anything there.
It has been stated that there will be videos available, but I'd not expect them until next week sometime (when the organizers get home and have recuperated).
Automated rewrite FTW! This can help you avoid freezing a project while it is being ported. Also, if you have a code base with its own idioms, then those idioms can be matched and translated, which can produce cleaner target code.
I was also talking about projects and porting in general. Incidentally, I've been paid to do exactly what we're discussing. (Porting a production program using automated translation.) It works.
I'm reminded of the fortran to c compiler f2c that was used to produce pure c implementations of lots of libraries like LAPACK.
I'm curious how they will handle things like pointer arithmetic and memory safety in C vs Go. If they manage to do so in a performant way, I could see translating lots of numerical or computationally intensive code to Go so that it could be run in a shared cloud environment without worries about memory safety, and without having to resort to VMs for separation.
If you took arbitrary fortran, especially with i/o, the results were an unreadable mess - but they compiled and ran.
For a project I had (in the early 90s iirc), I had to extensively modify the fortran and make multiple versions before the generated C was readable. Still, much easier than a rewrite.
> There are currently 1032 goto statements in the Go compiler jumping to 241 labels.
Wow, that's really striking. I know that goto statements have their uses but for something written in the last couple of years to have over a thousand of them is very surprising (it might not be surprising at all for those who write C code all the time). I guess they're mostly just for error handlers?
As others have pointed out, this is actually idiomatic C. You have a portion of your function below a label which is "clean up everything and return this variable (usually error status)." Then at any point where you have an error, you goto that label and everything is released and the error gets returned. It's sort of a manual 'defer' pattern.
"goto considered harmful" is taken out of context and bemoaned mostly by people unfamiliar with idiomatic C, I think. (Edit: sorry, I mean just for C. For other languages, e.g. Go, there may be more appropriate patterns like the 'defer' keyword.)
Encoding state implicitly in control flow tends to give you nicer and more natural code. A big state machine implemented with gotos can be tough to read if it's also a parser (which may need to build a tree, recurse, etc.). A small hand-rolled lexer might be reasonable to implement with goto. Although I would prefer to enumerate the states and use a switch.
If you're writing a recursive descent parser for C while keeping minimal lookahead, I can think of a place or two where goto might really help. One is to deal with casts, compound literals and subexpressions. They all begin with a left-parenthesis, so you can't tell what you're dealing with until a bit further. Since these all appear in the parsing of simple expressions (identifiers, constants, unary operators), the code can go in one function and you can jump to the right place with goto as soon as you know what you are dealing with.
I did not look in the Go code but C programmers mostly use goto for error paths.
I dunno, a few of them seem hard to justify. The one on line 475 could just be an "else" branch, for example, and the ones in the loop on 327 could as easily be continues.
When the full video of the talk comes out, you will see that Russ agrees with you. He lists the types of uses of goto. Various are used to make loops or elses, and he says those should not have occurred.
I would be curious to hear the justification behind the implementation of angryrealloc, the "goto ok" is strictly equivalent to a "continue" as far as I know. Maybe it triggers an optimization with a certain compiler.
I use that style all the time. I like having the function name start in column 1, it can make searching easier.
I do wish that in C1x (for x > 1), they could find a way to let us declare multiple formal arguments of the same type without repeating the type:
float
my_graphics_hack(float x, y, z, r, g, b, u, v) { ...
instead of
float my_graphics_hack(float x, float y, float z, float r, float g, float b, float u, float v) { ...
which gets really tedious and obfuscates the fact that they are all the same type, especially when the typename is more complex than just "float". When the new syntax was added to ANSI C, it wasn't possible to do multiple variables, for reasons I've forgotten but which I think had to do with forward type references. It would be awfully nice to find a fix for this.
That's a poor justification: there are plenty of tools to index and search C(++) code, like ctags or cscope. And your example could use a vector or a struct; I have a hard time imagining when a function with such a prototype would be useful in real life.
Compound literals are not too pretty and not too well known. Plus, if you care about c89... So, do you wrap all your values in a struct because the prototype gets too long otherwise? What a chore.
Preferring that function names start at the first column to make searching easier is a perfectly good justification.
You, on the other hand, seemingly would impose your tools (which have their shortcomings) and workflow on people with no justification at all.
Just because a tool exists, doesn't mean it's good (or better than what people are accustomed to) or that everyone must use it.
Do you think everyone should use Vim on Linux too?
No, I think everyone should use Emacs on FreeBSD, but that's not my point.
Adding syntactic sugar to a language makes the language bigger and harder to completely understand, and it makes it easier to misuse a feature (C already has a lot of trickery with types, like when you declare a parameter as an array but it behaves like a pointer). I'm a strong believer that explicit is better than implicit, and that while there are many ways to do the same thing, some are better than others in practice. Of course, "in practice" can change from one project to the other; what matters is consistency. I applaud the choice of Go to standardize on a single coding style, for instance.
For the particular example of the parent: while I write a lot of C code for work and for fun, I have yet to encounter a situation where I saw a function taking 10 floats as arguments and thought "yup, that's completely the right way to do that". If you have an example of such code, I'd be more than willing to reconsider my position; otherwise we're just talking about the best way to tame a unicorn.
Agreed, but there are still times when I end up having several arguments of the same type, and I see no virtue in having to repeat the type for each formal parameter.
BSD/Plan 9 style is extremely useful because it allows grepping for functions. More specifically, it allows the creation of acme(1) addresses of the form /some/file.c:/^foo where foo is a function. In acme you can just B3-click on that string and it will open the file at the correct function...
It is also a pretty common style in the open source world.
Yep, that's pretty weird. It harks back to the early days of C. When I first saw the currently recommended form:
int main(int argc, char **argv) {
//body..
}
I thought it was nice but not obviously advantageous, but over the years I've come around to thinking it makes code easier to follow. Even many bash shell scripting guides now recommend a similar form:
Are you sure you aren't confusing it with K&R style with the argument names on a different line? That would indicate old code. Having the return type on a different line is incredibly common, virtually every C project with decent code uses it.
Yeah, there are a lot of them in e.g. Linux kernel code for jumping to end-of-function cleanup. I suspect that in the process of translation to Go most will be replaced by the defer keyword.
This might become a nice benchmark for the Go language; the same code base implemented in C and Go! It may not be fine-tuned for optimisation, neither in C nor in Go, but may still give a good ball-park estimate.
Indeed, but it would be much more useful and impressive if they rewrote the compiler in Go by hand. I'm disappointed that they are aiming for an automatic translation instead - some people are going to ask themselves whether Go actually isn't that much fun to program in. An independent reimplementation is better for correctness too, they can compare outputs and find bugs in both implementations instead of porting over old bugs and adding new bugs where the translation goes wrong.
I'm really thankful for the liveblogging, as I couldn't manage to get my body to the conference.
I understand the desire to promote the Sourcegraph app by doing the blogging, and I think it's effective. However, the blog is really annoying to browse, as every (prominent?) link points to Sourcegraph the app instead of the blog.
Just to clarify, because I think my original comment is unbalanced.
I'm REALLY thankful for sourcegraph's liveblogging. The above comment was a suggestion as I thought they might want to know that (at least for me), the navigation of the blog was confusing.
Thanks! We are having a fun time liveblogging and are glad you're finding it useful. We got complaints when the blog image did NOT link to Sourcegraph, too. :) It's almost the end of this conference, but next time we liveblog, we'll have 2 separate header images, or try something else to make it less confusing.
"They’re deciding to automatically convert the Go compiler written in C to Go, because writing from scratch would be too much hassle."
When transcribing a talk, there isn't any need to write "They're". Just use the same pronoun the presenter used, otherwise it stands out like a sore thumb.
The general way that self-hosting influences language design is that the compiler is often one of the first major projects to be built using the language. This does not give it more influence than other major projects, but if your goal is to have a language designed for use case X, it is generally best to have your early projects with it be for X. Additionally, self-hosting may encourage a language design that makes bootstrapping easier (such as a stricter divide between the stage-1 language and the general language).
I sort of had the general sense of that already. I guess I was hoping for something more specific about what language features are so useful when writing a compiler but get in the way for general problems.
The part I quoted above almost sounds like wiping sweat from one's brow after having dodged a bullet: "phew, the language is safe from influence by those gull-durn compiler-writers..."
To expand a bit on your last point, self-hosting helps answer the question "can I build Real Software (tm) with this language, or is it really difficult?" earlier in the lifetime of the language.
>A Union is like a struct, but you’re only supposed to use one value
>(they all occupy the same space in memory). It’s up to the programmer to know which variable to use.
> There’s a joke in some of the original C code:
> #define struct union /* Great space saver */
> This inspired a solution:
> #define union struct /* keeps code correct, just wastes some space */
Somewhere in Scotland, a sum type sheds a single tear.
It works when unions are used correctly. If they are used to deliberately subvert the type system (e.g., put in an int, get an array of char out) it doesn't work. But C has so many other ways to subvert the type system that there's no need to do that.
Probably about the only place you'll see them used like that is in the code produced by web2c in compiling TeX. Knuth used variant records a lot to get around Pascal's type safety, and they get translated to unions.
That code is not valid according to the C standard, so there is no guarantee it will work anywhere. In particular, many modern compilers have optimizations that would break that code. I would be a little surprised if modern web2c still uses unions this way and gets away with it.
The only standard-compliant way to, say, reinterpret an int's bits as a float is to use memmove (or memcpy):

    uint32_t i;
    float f;
    i = 0x80000000;
    memmove(&f, &i, sizeof f);
I finally managed to track down the definition of memoryword from texlive-20130530-source/texk/web2c/texmfmem.h. Here it is:
typedef union
{
#ifdef TeX
  glueratio gr;
  twohalves hh;
#else
  twohalves hhfield;
#endif
#ifdef XeTeX
  voidpointer ptr;
#endif
#ifdef WORDS_BIGENDIAN
  integer cint;
  fourquarters qqqq;
#else /* not WORDS_BIGENDIAN */
  struct
  {
#if defined (TeX) && !defined (SMALLTeX) || defined (MF) && !defined (SMALLMF) || defined (MP) && !defined (SMALLMP)
    halfword junk;
#endif /* big {TeX,MF,MP} */
    integer CINT;
  } u;
  struct
  {
#ifndef XeTeX
#if defined (TeX) && !defined (SMALLTeX) || defined (MF) && !defined (SMALLMF) || defined (MP) && !defined (SMALLMP)
    halfword junk;
#endif /* big {TeX,MF,MP} */
#endif
    fourquarters QQQQ;
  } v;
#endif /* not WORDS_BIGENDIAN */
} memoryword;
Once you sort through all the ifdefs and typedefs, it boils down to a union of a float (glueratio), an int32 (integer), a pair of int16s (twohalves), and a struct of 4 bytes (fourquarters). When TeX creates its memory dump file, it does it all in terms of the bytes in the fourquarters struct. (And I think there are other times that it does similar things.)
This may very well be undefined behavior, but it seems to work when compiled with GCC anyway. Thankfully gc isn't such crazy code as this.
C99 6.5.2.3.5 seems to suggest that non-overlapping union members could break conforming code:
``One special guarantee is made in order to simplify the use of unions: if a union contains several structures that share a common initial sequence (see below), and if the union object currently contains one of these structures, it is permitted to inspect the common initial part of any of them anywhere that a declaration of the complete type of the union is visible. Two structures share a common initial sequence if corresponding members have compatible types (and, for bit-fields, the same widths) for a sequence of one or more initial members.''
Which part of the C standard would forbid type punning a uint32 to a float32? http://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_283.htm suggests that C89 explicitly allowed this, C99 at first glance did not, and was errata'ed to make it clear that this is still allowed. C11 seems to have the same verbiage.
union hack { int i; float f; };

int foo(int *i, float *f)
{
    *i = 3;
    *f = 42.0f;
    return *i;
}
Aliasing rules say foo need not read from that int pointer, may switch the order of the write to f and the write to i, and can assume that foo returns 3, so
union hack h;
foo(&h.i, &h.f);
might return 3 or something else. I think that last call introduces undefined behavior, but only because of the definition of foo, which the writer of that call might not even have the source for.
But of course, that is an "you shouldn't do that" edge case. One could also claim that the corrigendum doesn't apply because foo doesn't "use a member to access the contents of a union".
Sure, but that's an aliasing issue, not a type punning issue. I agree that the code you cite there is a violation of the aliasing rules, and will not work "as expected" on modern compilers unless one does the equivalent of gcc's -fno-strict-aliasing.
And I agree that the corrigendum doesn't apply in this case. Once you hand different-typed pointers to the same memory to people, whether via union or just casting pointers, the aliasing rules will up and bite you.
I've used unions like that as a way to work around the BSD socket API without horrible casting all over the place. Change the "union" to a "struct" and the code no longer works. Is it subverting the type system? Yes. Is casting subverting the type system? Yes. Pick your poison.