WIP:parser_combinator #51
The final entry in the series #47, #49, #50 exploring parsing packages for the expression language. This one still uses the Tokenizr lexer, and then parses the resulting token stream with parser-combinator. For the initial fragment implemented so far, parser-combinator feels very clean and simple, so I remain hopeful that this may be the best option yet.
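To give a flavor of why this style feels clean: a combinator parser is built from small functions that each consume part of a token stream and can be composed to read almost like grammar productions. The sketch below is a minimal hand-rolled illustration of that idea; the helper names (`token`, `choice`, `seq`, `map`) and the token shapes are illustrative, not the actual parser-combinator package's API.

```javascript
// Minimal illustration of the combinator style over a token stream.
// Each parser is a function: (toks, pos) => { ok, value, pos }.
// Helper names here are illustrative, not the npm package's API.

const token = (type) => (toks, pos) =>
  toks[pos] && toks[pos].type === type
    ? { ok: true, value: toks[pos].value, pos: pos + 1 }
    : { ok: false, pos };

// Try alternatives in order; first success wins.
const choice = (...parsers) => (toks, pos) => {
  for (const p of parsers) {
    const r = p(toks, pos);
    if (r.ok) return r;
  }
  return { ok: false, pos };
};

// Run parsers in sequence, collecting their values.
const seq = (...parsers) => (toks, pos) => {
  const values = [];
  for (const p of parsers) {
    const r = p(toks, pos);
    if (!r.ok) return { ok: false, pos };
    values.push(r.value);
    pos = r.pos;
  }
  return { ok: true, value: values, pos };
};

// Transform a successful parse result (e.g., build an AST node).
const map = (p, f) => (toks, pos) => {
  const r = p(toks, pos);
  return r.ok ? { ...r, value: f(r.value) } : r;
};

// A rule reads almost like the grammar production it implements:
//   addExpr := number ('+' | '-') number
const addExpr = map(
  seq(token('number'), choice(token('plus'), token('minus')), token('number')),
  ([left, op, right]) => ({ type: 'BinaryOp', op, left, right })
);

const toks = [
  { type: 'number', value: 2 },
  { type: 'plus', value: '+' },
  { type: 'number', value: 3 },
];
console.log(addExpr(toks, 0).value);
// → { type: 'BinaryOp', op: '+', left: 2, right: 3 }
```

The appeal is that the rule definitions mirror the grammar's structure directly, with no separate code-generation step.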
OK, the parser created via parser-combinators seems to be complete; I will be migrating the parser tests from ohm/peggy to this branch soon.
OK, the Tokenizr lexer and parser-combinator parser are now passing all of the tests from the prior series of parsing packages, including ones that did not pass in the previous rounds. While there are still more tests from mainline mathjs to transfer here, this milestone means this parser is mature enough that its code weight is not going to increase appreciably from where it is now. Therefore, the next step is to test how much this particular lexer/parser combination increases the mathjs bundle.
Changed title from parser_combinator to WIP:parser_combinator

All right, @josdejong, I totally don't understand it: even though the parser code file is only 20K (less than half the size of the current mathjs parser), the entire parser-combinators library I am using is 20K, the tokenizr library is 20K, and the unraw string-escape interpreter package is 10K -- which only add up to 70K -- the math.js bundle increases by a whopping 150K (23%) when I just install it in main-line mathjs and call its parse function from parseStart. (I checked there are no other new dependencies: these are all of the new files.) How is that even possible? I would have thought that 70K would have been the maximum possible increase, since that's the sum of the sizes of all of the new files...
Anyhow, the zipped bundle increases by a lot less in this case, only 20K (11.4%) for a total of 195K.
I'm assuming that's still more than you're willing to devote to a new parser...
I'm really discouraged and frustrated. I was quite happy with this latest parser, it seems very clean and simple. And I assumed that since the packages it is using are very small that it would be the least size increase yet. Alas.
So here are some things I could try:
(A) Go ahead and get the Nearley parser working even though Nearley is not maintained. I was pretty close to done when I realized that Nearley is not maintained, so it wouldn't be a lot of work. That way we could see how much it adds with a more substantial grammar (so far I only measured it with a trivial grammar, and that was very lightweight, so there's clearly not a lot of overhead from just the library itself). Then if a full Nearley parser is still good on resource usage, we could consider forking/cloning it and maintaining it just enough to use in mathjs, as probably it's my favorite to use of all four, with this current parser-combinators one a close second.
(B) The parser-combinators library feels so lightweight that I could just suck a streamlined version of it fully into one source file in our repo (i.e., not use the npm package at all). Again, that would be a pretty simple experiment, to just have our own little combinator library based on parser-combinators, but slimmer, in our source tree, and see if that reduces the bundle size increase at all.
(C) At this point, having been through the entire parser four times now, I could just go ahead and hand-write the most streamlined custom parser I can think of, and see what that does to the bundle size by way of comparison.
Which of (A), (B), (C) do you think I should try next? As before, my efforts on this are on hold until I get some feedback about this. Thanks!
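For concreteness on option (C): a hand-written parser would be a classic recursive-descent affair, roughly along the lines of the fragment below. This is purely an illustrative sketch (made-up token types, tiny grammar), not the actual mathjs parser.

```javascript
// Illustrative recursive-descent fragment in the hand-written style of
// option (C) -- not the actual mathjs parser. Operates on a token array.
function parseExpression(tokens) {
  let pos = 0;
  const peek = () => tokens[pos];
  const next = () => tokens[pos++];

  // addExpr := mulExpr (('+' | '-') mulExpr)*
  function addExpr() {
    let node = mulExpr();
    while (peek() && (peek().type === 'plus' || peek().type === 'minus')) {
      const op = next().value;
      node = { type: 'BinaryOp', op, left: node, right: mulExpr() };
    }
    return node;
  }

  // mulExpr := primary (('*' | '/') primary)*
  // Layering mulExpr under addExpr is what encodes operator precedence.
  function mulExpr() {
    let node = primary();
    while (peek() && (peek().type === 'times' || peek().type === 'divide')) {
      const op = next().value;
      node = { type: 'BinaryOp', op, left: node, right: primary() };
    }
    return node;
  }

  function primary() {
    const t = next();
    if (!t) throw new SyntaxError('Unexpected end of input');
    if (t.type === 'number') return { type: 'Constant', value: Number(t.value) };
    throw new SyntaxError(`Unexpected token ${t.type}`);
  }

  return addExpr();
}

console.log(parseExpression([
  { type: 'number', value: '2' },
  { type: 'plus', value: '+' },
  { type: 'number', value: '3' },
  { type: 'times', value: '*' },
  { type: 'number', value: '4' },
]));
// parses as 2 + (3 * 4): precedence falls out of the rule layering
```

This style carries zero library weight, at the cost of the grammar living implicitly in the control flow rather than in a declarative formalism.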
Oh, good news. I decided I would break it down by the three new libraries, just to see what was going on. Cutting out parser-combinator didn't have a big effect, but putting in a dummy (non-working!) lexer and uninstalling the tokenizr library made a huge difference. Without that package, the bundle size only grows by 20K and the zipped bundle size only grows by 6K, which seem totally acceptable to me (especially with the savings we will eventually get from removing the existing parser).
So actually, the next step is clear: substitute a different or handwritten tokenizer (I really don't mind a handbuilt tokenizer, as they are much simpler pieces of software) and then see how things are. So I am back on the project (not tonight) and will keep you posted. I really wonder what it is about the tokenizr package that led to the blowup: it doesn't at all seem like a heavyweight package. As I said, the entire distribution is only 20K. How could it expand the bundle so much?
So this is very strange. I decided to try writing the tokenizer with parser-combinators as well: why not, we have the parser-combinators package for the parser anyway, so just use the same package to do tokenizing too. It seemed that should be lightweight.
But sadly, that code adds 225K to the bundle and 18K to the zip (!). And what's totally weird is if I cut out the lexer -- which uses the same package as the parser -- but leave in the parser, the bundle size instead only increases a completely acceptable 18K and the zip only 5K. So I totally don't understand what is going on, since the lexer source file is only 13K, and it only imports packages that the parser already imports. There is clearly something I don't comprehend about bundling.
In any case, this new parser-combinators lexer is no good either, so it seems the only alternative is to try a new hand-written tokenizer, which as I said I don't really mind. It's just sort of a pain since even for the tokenizer, I find one of the tokenizer/grammar formalisms clearer and easier to read/write than handwritten code. But it's clearly the next thing to try -- I assume that I can write a very lean tokenizer...
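A hand-written tokenizer really can be a small piece of software. The sketch below shows the general shape such a lexer might take -- a longest-match loop over a table of regexes. The token names and patterns are illustrative placeholders, not the actual nanomath/mathjs token set.

```javascript
// Sketch of a lean hand-written tokenizer (illustrative token set,
// not the actual nanomath/mathjs lexer).
const TOKEN_SPECS = [
  ['space',      /^\s+/],               // skipped, not emitted
  ['number',     /^\d+(?:\.\d+)?/],
  ['identifier', /^[A-Za-z_]\w*/],
  ['plus',       /^\+/],
  ['minus',      /^-/],
  ['times',      /^\*/],
  ['divide',     /^\//],
  ['lparen',     /^\(/],
  ['rparen',     /^\)/],
];

function tokenize(input) {
  const tokens = [];
  let rest = input;
  while (rest.length > 0) {
    let matched = false;
    for (const [type, re] of TOKEN_SPECS) {
      const m = re.exec(rest);
      if (m) {
        if (type !== 'space') tokens.push({ type, value: m[0] });
        rest = rest.slice(m[0].length);
        matched = true;
        break;
      }
    }
    if (!matched) throw new SyntaxError(`Unexpected character: ${rest[0]}`);
  }
  return tokens;
}

console.log(tokenize('2 + 3 * x'));
// → number '2', plus '+', number '3', times '*', identifier 'x'
```

Since everything here is plain JavaScript with no dependencies, its bundle cost is essentially just the source file itself.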
OK, the results are in: with the handwritten lexer and the parser-combinators grammar, the bundle size increases by 90K (just under 14%) and the zipped version by 14K (8%). And if I strip out all the rest of the current parse.js and just return what this nanomath parser produces, we only get back down to an 80K (12.3%) bundle size increase and 10K (6%) increase in the zipped bundle. So that experiment provides a decent estimate of what the "permanent" size increase of the mathjs package would be as a result of changing to a "grammar" for the parser. (I put "grammar" in quotes because the parser-combinator approach is really just a system for organizing the writing of parsers in a functional style -- it's definitely more systematic and readable than the hand-written approach, but it is definitely not as clean as the formal grammar-based approaches of Nearley, peggy, or Ohm.)
Frankly, I am surprised at how much these approaches seem to add.
OK @josdejong I guess it is time to make some decision about how to proceed. To me, almost any of these approaches seem worthwhile -- I sort of can't imagine proceeding without some systematization of the parsing. But I don't have any basis to trade that off versus a certain amount of increase in the bundle size, so that's a call you're going to have to make.
If you think it's OK to go ahead with one of these parsing packages, the question becomes which one. I definitely like the Nearley grammar the best -- it is the clearest and most comfortable to work with. But Nearley is not maintained, so we might not want to shoulder the extra maintenance burden we would take on by adopting it, although I like it nearly enough to be game for that. And Ohm is clearly the most heavyweight of the ones we tried.
So presuming Nearley and Ohm are both out, that leaves peggy vs parser-combinators. Both are up-to-date with their maintenance, and the actual structure of the two parsers turns out to be pretty similar (and a bit different from the Nearley and Ohm parsers, which are similar to each other; this grouping is a result of the parsing approach of each package).
The main pro of peggy over parser-combinators is the clarity/cleanliness of an actual grammar formalism. That's traded off against four advantages on the parser-combinators side: (a) smaller (14K vs 20K bump in the bundle size), (b) faster (on the Chevrotain benchmark page using a JSON parser, 1630 ops/sec vs 1200 ops/sec), (c) able to run on a token stream instead of the raw characters (peggy is devoted to the PEG principle of combined tokenizing and parsing, but personally I find the division into two layers clarifies and simplifies both), and (d) no additional build step -- parser-combinators is pure Javascript.
OK, I look forward to your feedback, and don't plan to proceed further on this until we have some consensus on which way to go. Thanks!
@glen impressive job, thanks for exploring these options! I think these bundle sizes like 6% are indeed acceptable.
In the past I've regularly used source-map-explorer to get a bit of insight into what is inside a JS bundle:
Some thoughts:
OK, I am leery of Ohm because all of the development effort there is going into compiling a grammar into a wasm thingie, and the wasm compiler doesn't have some of the features that we use and it's not clear when it's going to get them, and currently it was definitely the heaviest of all the approaches we tried. I mean, it might get much lighter weight once the wasm compiling gets to the feature level of the original, but it's not clear when that's happening, and still serving the wasm thingie may be a bit of extra mechanics on our end (does our bundler incorporate wasm chunks automagically? I know nothing about bundling). So I'd prefer us to steer away from Ohm.
[To answer your question 2: in terms of my pure syntax preference it was nearley > ohm > peggy > parser-combinators, but they were all fine, so in light of all of the other issues I don't think it makes much difference. Yes, parser-combinators feels a bit more verbose because it's stuck being actual JavaScript syntax, so it uses a lot of function calls, but actually the peggy and parser-combinators grammars are the same number of bytes of source code.]
So based on all the above, merging your views and mine would seem to indicate moving ahead with a peggy parser. As a reminder, that was 78K (12%) added to the unzipped bundle and 20K (11.5%) to the zipped bundle, as compared with increases of 90K (14%) unzipped and 14K (8%) zipped for parser-combinators (interesting that its bundle increases more but shrinks a lot under compression). And we may save 10K bundle/4K zipped when we strip out the existing parser. You're OK at those levels? I'd like to be sure before proceeding, so please let me know.
The reason I'd like to be sure is that I am very strongly against your point 4. I simply do not want to be involved in writing, tuning, and maintaining a grammar-formalism grammar only to have to reimplement it as a "hand parser" and then co-maintain both. That really strikes me as an unpleasant maintenance nightmare, and to me just loses the value of the grammar formalism. If your roadmap for starting down the path of a peggy parser is to likely do that sooner or later, then I am not willing to start down that path. I don't so much mind translating to some other formalism in the future, in case we picked the wrong formalism, but I am not going to enter a state of having a "reference" parser and a "real" parser, nor having switched to a formalism am I going to be willing to go back to a handwritten parser.
That's one of the reasons you might want to give parser-combinators another thought. It basically is just a disciplined way of organizing a hand-written parser. I think that's why, in the Chevrotain benchmarks, it's the fastest of the parsers at its level of power (other than handwritten). The functional style is just a little less efficient in JavaScript than the hand-written parser style, which would be written iteratively rather than functionally, but on the other hand the functional style really exposes the structure of the grammar, nearly as well as a domain-specific language does.
So if the decision were entirely up to me, I'd proceed with parser-combinators as apparently fast and lightweight and scalable, but I am happy instead to proceed with a peggy parser if you're amenable to the position that we might in the future change grammar formalisms but we would not have two parallel parsing implementations and we wouldn't go back to handwritten. Looking forward to your call between peggy and parser-combinators (again holding off til then).
I have a couple comments about proceeding with peggy which I will put in #49, which is about peggy.
So I thought some more about proceeding with peggy, and tried to start on the next couple of steps down that road as outlined in my latest comments in #49. The upshot is that I found the peggy system more cumbersome and less easily adaptable to the mathjs language than I had thought. With some tricks it can be adapted to parsing a tokenized stream, but it will be a little wonky: matching a token will consist of an assertion to check the token type, followed by accepting any "character", which will really be the token. E.g., to accept either a plus token or a minus token, it would look something like `addOp = &{ return tok('plus') } [^] / &{ return tok('minus') } [^]` -- a bit klunky. (Maybe I could write a plugin to syntactic-sugar this...) And I think I would want to make a plugin that would label the return value of each rule with the rule name, to facilitate writing a second pass transforming the raw parse tree into the Node tree we want. (Yes, I know it can be done by annotating each rule with in-line code to generate the Node for that rule, but I find that disrupting the grammar itself with all of those bits of code makes the grammar much more difficult to read and get a picture of what's going on. I would much prefer it if peggy had a way to specify all the rules, and then the action code for those rules that need it in a separate block below. So proposing and trying to get such a feature merged would be another way to go.)

So for me, it seems like one could manage to get all this working with peggy, but it feels like it would be a bit of a struggle -- trying to fit a right foot into a left shoe, that sort of thing.
On the other hand, parser-combinators is lightweight, maintained, and nicely organized (although a bit light on documentation, which we can help with). There were two main objections above that I saw:
Therefore, I would like to strengthen my recommendation to proceed with parser-combinators (either with or without providing a DSL for defining parser-combinators rules). I really think that is the way we should go, while acknowledging that I am willing to go with peggy if you are strongly against parser-combinators. The peggy option will definitely be more effort; at this point, having the proofs-of-concept working, parser-combinators is the path of least resistance. If you end up deciding on parser-combinators, just let me know how far you want me to go with the DSL variant: (a) drop it, (b) do the easy operators like `/`, `*`, and `+` (so the mathjs grammar would end up being roughly half in the DSL and half in "direct parser-combinators"), or (c) go far enough that the entire mathjs grammar is expressed in the DSL, or at least very nearly so.

Thanks, and I continue to look forward to your feedback.
@glen wrote in #51 (comment):
Oh, actually I had time to fully implement the slightly extended peggy-like DSL (it was really quite easy, which I also think speaks well of parser-combinators) so that the entire mathjs grammar can be expressed that way. So if you are willing to go ahead with parser-combinators (potentially for a long term), the choice is just between (A) "straight-up" JavaScript functional-notation use of parser combinators, as in this PR, or (B) a DSL very close to peggy's notation, with a few small extensions, in which to express the grammar as per #52 (which of course would be documented to describe the full grammar for the expressions defining the rules of the mathjs grammar).