WIP: feat: define a peggy parser for the mathjs expression language #49
This is just an amendment to #47, but now based on the peggy parser. The status is back to roughly where the Ohm PR was. It should be ready to test the bundle size impacts of this parser.
Oh, but it also needs a build system, since there is an extra compilation step to generate the parser from the source code.
Impressive, that looks quite neat 😎!
Yeah it needs a build step, but the nice thing of course is that it doesn't have to bundle the parser generator.
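To make the extra compilation step concrete: a peggy grammar is plain text that gets compiled ahead of time into a standalone JavaScript module, so only the generated parser (not peggy itself) ships in the bundle. A toy fragment in peggy's syntax — purely illustrative, not the actual mathjs grammar — looks like:

```pegjs
// Toy arithmetic fragment (hypothetical, NOT the mathjs grammar).
Expression
  = head:Term tail:(_ ("+" / "-") _ Term)* {
      return tail.reduce(
        (acc, [, op, , term]) => (op === "+" ? acc + term : acc - term),
        head
      );
    }

Term
  = "(" _ expr:Expression _ ")" { return expr; }
  / Number

Number
  = [0-9]+ { return parseInt(text(), 10); }

_ "whitespace"
  = [ \t\n\r]*
```

Running something like `npx peggy grammar.peggy` (the exact CLI invocation may differ) then emits a `grammar.js` with no runtime dependency on the generator.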
@josdejong wrote in #49 (comment):
Indeed, that will be tonight's experiment, to just append this grammar to mathjs and check the bundle size effect.
Here are the sad results: the generated javascript grammar.js file is 120K; it adds 78K (12%) to the unzipped bundle and 20K (11.5%) to the zipped bundle. Not a lot better than Ohm at 100K (15%) to the unzipped bundle and 25K (14%) to the zipped bundle. (I mean, we'll get something back from stripping out the 52K parse.js, but still either of these packages will be a big bump...)
Should I look into other parsing libraries? Is there a size increase you're comfortable with for the benefits of a grammar-driven rather than hand-coded parser? Thanks for letting me know as I feel a bit derailed at the moment.
I guess there are at least two other parsing packages I can try. Both of these have very different approaches from peggy and Ohm and from each other, so I think they are worth a try in the order listed, to see whether each is small enough. (The package sizes on npm appear to be smaller than peggy/Ohm, so there's some hope.)
I'll try Nearley as soon as I can. Looking at the list of parsers on the benchmark page linked in the previous paragraph, there doesn't seem to be anything else worth trying besides these two and the ones we've already done. So if neither of these works out, I guess let's just revert to a hand-built parser.
OK, Nearley with a trivial grammar and a trivial Tokenizr lexer -- Moo actually didn't quite seem up to the job of tokenizing mathjs expression language with its context-sensitive whitespace rules -- adds 19K (3%) to the raw bundle and 6K (3.5%) to the zipped bundle. I am going to assume those are small enough, especially as we will get some back if they work well and we rip out the old parser. With the current parse.js at 52K, they seem small enough that it could end up as a wash. Of course, I will re-check these figures when I have a lexer and parser that actually handle mathjs expressions, but for the moment, I am moving forward with Nearley.
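To illustrate what "context-sensitive whitespace" means for a lexer, here is a stripped-down sketch in plain JavaScript. This is my own illustration, not Tokenizr's actual API, and the rule shown (whitespace acting as a separator only inside `[ ... ]`) is a made-up stand-in for mathjs's real, more involved rules:

```javascript
// Hand-rolled sketch (NOT Tokenizr's API, and a hypothetical rule, not
// mathjs's real grammar): whitespace is a meaningful separator only inside
// [ ... ], and is discarded everywhere else. The lexer tracks that context.
function tokenize(src) {
  const tokens = [];
  let depth = 0; // nesting level of [ ... ]; whitespace matters when depth > 0
  let i = 0;
  while (i < src.length) {
    const ch = src[i];
    if (/\s/.test(ch)) {
      while (i < src.length && /\s/.test(src[i])) i++;
      // Inside brackets, whitespace separates elements; elsewhere, drop it.
      if (depth > 0) tokens.push({ type: "sep", value: " " });
      continue;
    }
    if (ch === "[" || ch === "]") {
      depth += ch === "[" ? 1 : -1;
      tokens.push({ type: ch === "[" ? "lbracket" : "rbracket", value: ch });
      i++;
      continue;
    }
    if (/[0-9]/.test(ch)) {
      let j = i;
      while (j < src.length && /[0-9.]/.test(src[j])) j++;
      tokens.push({ type: "number", value: src.slice(i, j) });
      i = j;
      continue;
    }
    tokens.push({ type: "op", value: ch });
    i++;
  }
  return tokens;
}
```

The same spaces lex differently depending on where they appear: `tokenize("1 + [1 2]")` drops the whitespace around `+` but emits a `sep` token between the bracketed numbers.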
@josdejong wrote in #49 (comment):
Question for @josdejong: if/when we get one of these parser generator PRs to a point near merging (working well and sufficiently lightweight to consider inclusion in the main mathjs repository), do you have a recommendation on your preferred tool to manage the build procedure? Heretofore in mathjs, one can just edit the source files and then execute `npm run test`, and the unit tests run with no prior steps necessary, guaranteed to depend on just the latest code. With the introduction of peggy or Nearley, if the grammar file has been touched, it will be necessary to run a grammar compilation step prior to running mocha. On the other hand, for 90%+ of changes you will not be touching the grammar file, and it would be a shame to run the grammar compiler on every invocation of `npm run test`.

It's for this sort of reason that in the Numberscope project we have adopted the (GNU) `make` utility to track the dependencies among files, so that all and only the necessary build steps are taken whenever they are needed. As has been my experience since the start of my software-related career, it works quite well. On the other hand, I do recognize that it is an unfamiliar tool in the web-development community (for reasons I don't particularly understand; it seems to me it has simply been overlooked because of a sort of generation skip between C-family coders and web coders). So I would be happy to start using `make` in connection with nanomath and/or mathjs. But if you have a particular desire to avoid that tool and can suggest another one that lets the system easily check, whenever you run the equivalent of `npm run test`, whether the compiled grammar code is out of date compared to the grammar definition file and (only) if so recompile it, I'd be happy to hear.

Thanks for testing all of this out.
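For concreteness, the kind of Makefile I have in mind would only be a few lines (file names and the peggy invocation are hypothetical; recipe lines must be tab-indented):

```make
# Hypothetical layout -- adjust paths to the real repository.
GRAMMAR_SRC = src/grammar.peggy
GRAMMAR_OUT = src/grammar.js

$(GRAMMAR_OUT): $(GRAMMAR_SRC)
	npx peggy --output $@ $<

test: $(GRAMMAR_OUT)
	npx mocha

.PHONY: test
```

With this, `make test` recompiles the grammar only when `grammar.peggy` is newer than the generated `grammar.js`, and otherwise goes straight to mocha.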
Hm so Peggy generates a bit smaller parser compared to Ohm but still much larger than the current handwritten parser. I had expected a bigger difference, so kudos to Ohm I guess. Just for reference, I checked how large our current parse.js (52KB) is when minified: 14kB, and when minified+gzipped: 5kB. Could be even a bit less when it is part of the mathjs bundle. Of course bundle size is not the only thing, at least to me the performance is also relevant. And of course maintainability, which is currently an issue and will be improved by using a parser generator.
I'm indeed hesitant to go for Nearley because it isn't maintained anymore, and parser-combinators because we don't know whether it will still be maintained in a while.

We can go for the plan B we discussed last week: describe the grammar in a parser generator to ensure soundness and a clear formal description, and also write a parser by hand, optimized for performance and bundle size. Both have to be tested against the same test suite. In the POC phase the size of the bundle doesn't matter yet, so we can just use a parser generator for the time being.
I think the bundle size will change a lot anyway if we rewrite the whole codebase of mathjs into nanomath, for better or worse. So I think it's a bit too early to drop solutions in this phase already. Since it is relatively little work to implement a parser generator, I think the best choice for now is to go for a convenient parser generator, and in a second step, implement a handwritten parser if still needed.
Do you have a preference for either Ohm or Peggy?
Hmm. Since I have the Tokenizr lexer working well now (and it does some of the heavy lifting for mathjs's, shall we say, unique syntax), and Tokenizr seems reasonably well maintained, and parser-combinator is the fastest parsing package on chevrotain's benchmark list except for chevrotain itself (which is not powerful enough for us), and parser-combinator seems designed for effective tree-shaking and solidly maintained at the moment (413 merged pull requests, 1 open, last merge yesterday), I am going to try parsing with parser-combinator on the token stream emitted by the Tokenizr lexer.
I think if that turns out well, it will be near the sweet spot among performance, lightweight-ness, and maintainability. So fingers crossed.
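For anyone unfamiliar with the approach, parser combinators are just higher-order functions composed over an input stream. Here is a hand-rolled toy version over a token stream — my own sketch to show the shape of the technique, not the parser-combinator package's actual API, which I haven't reproduced here:

```javascript
// Toy parser combinators over an array of { type, value } tokens.
// Each parser is (tokens, pos) => { ok, value?, pos }.

// Match a single token of the given type.
const tok = (type) => (tokens, pos) =>
  pos < tokens.length && tokens[pos].type === type
    ? { ok: true, value: tokens[pos].value, pos: pos + 1 }
    : { ok: false, pos };

// Run parsers in sequence, collecting their values.
const seq = (...ps) => (tokens, pos) => {
  const values = [];
  for (const p of ps) {
    const r = p(tokens, pos);
    if (!r.ok) return { ok: false, pos };
    values.push(r.value);
    pos = r.pos;
  }
  return { ok: true, value: values, pos };
};

// Try alternatives in order; first success wins.
const alt = (...ps) => (tokens, pos) => {
  for (const p of ps) {
    const r = p(tokens, pos);
    if (r.ok) return r;
  }
  return { ok: false, pos };
};

// Transform a successful parse's value.
const map = (p, f) => (tokens, pos) => {
  const r = p(tokens, pos);
  return r.ok ? { ...r, value: f(r.value) } : r;
};

// "number" or "number + number" -- enough to show composition on tokens.
const number = map(tok("number"), Number);
const sum = alt(
  map(seq(number, tok("plus"), number), ([a, , b]) => a + b),
  number
);
```

The appeal, to me, is exactly this compositionality: the grammar lives in ordinary code, so it tree-shakes and reads like a formal description at the same time.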
To answer your question, with my experience so far among Ohm, peggy, and Nearley, I like Nearley the best. If it were well-maintained, I would put it in the sweet spot.
So let us see.
I'm a bit confused, would you like to go for Nearley then even though it's not maintained?
Just for reference: a well known established parser generator is antlr4. I'm not sure if it is suitable for our case, but I have used it before and really liked its syntax. It seems possible to have it generate JS: https://www.scriptol.com/programming/antlr4-javascript.php
Sorry to have written confusingly. I am currently working with parser-combinator, which is quite lightweight and seems very actively maintained. I have begun to hope that the combination of parser-combinator and the Tokenizr lexer (which makes the job of the parser significantly lighter) will produce a very clear, easy-to-maintain, lightweight, and sufficiently fast parsing solution for our plans to beef up the mathjs language (any names occur to you in the shower yet?). We're getting pretty close to knowing. See #51 for how far I have gotten to this point.
As far as I know antlr4 is not "powerful" enough for the mathjs language; I don't think LL-based parsing is suitable.