WIP: feat: define a peggy parser for the mathjs expression language #49

Draft
glen wants to merge 15 commits from peggy_parse into main
Owner

This is just an amendment to #47, but now based on the peggy parser. The status is back to roughly where the Ohm PR was. It should be ready to test the bundle size impacts of this parser.

Oh, but it also needs a build system, since there is an extra compilation step to generate the parser from the source code.

glen added 14 commits 2026-02-05 22:06:17 +00:00
feat: define an Ohm.js parser for maje
All checks were successful
/ test (pull_request) Successful in 19s
998e9c80c5
test: first set of parse tests
All checks were successful
/ test (pull_request) Successful in 18s
45f8f8087a
test: another set of parse tests and associated fixes
All checks were successful
/ test (pull_request) Successful in 18s
ae98a750e7
test: comment tests and associated grammar fixes
All checks were successful
/ test (pull_request) Successful in 18s
f482f2b252
test: number tests and associated grammar fixes
All checks were successful
/ test (pull_request) Successful in 19s
507056064a
chore: update to latest ohm, and remove 'holes' workaround
All checks were successful
/ test (pull_request) Successful in 18s
cf0706286a
feat: add transducer to take nested \n to \r
All checks were successful
/ test (pull_request) Successful in 18s
ae19e46adc
feat: add mageParse to match newlineTransduced string
All checks were successful
/ test (pull_request) Successful in 18s
e4a759e07e
refactor: simplify Accessor recognition
All checks were successful
/ test (pull_request) Successful in 18s
81a45a7d65
test: doublequote string tests and associated fixes
All checks were successful
/ test (pull_request) Successful in 18s
4244b52158
test: add tests for error messages as well
Some checks failed
/ test (pull_request) Failing after 21s
f1288f47a6
refactor: Switch to peggy parser
Some checks failed
/ test (pull_request) Failing after 18s
1866ba7832
We are back in the state that all of the expressions so far that should
  parse do. Just most of the error messages for parses that are supposed
  to fail don't match the expected format (which is unsurprising), or no
  format has been set.
chore: add the peggy grammar
Some checks failed
/ test (pull_request) Failing after 16s
d46b27508d
Collaborator

Impressive, that looks quite neat 😎!

Yeah it needs a build step, but the nice thing of course is that it doesn't have to bundle the parser generator.

Author
Owner

@josdejong wrote in #49 (comment):

but the nice thing of course is that it doesn't have to bundle the parser generator.

Indeed, that will be tonight's experiment, to just append this grammar to mathjs and check the bundle size effect.

Author
Owner

Here are the sad results: the generated javascript grammar.js file is 120K; it adds 78K (12%) to the unzipped bundle and 20K (11.5%) to the zipped bundle. Not a lot better than Ohm at 100K (15%) to the unzipped bundle and 25K (14%) to the zipped bundle. (I mean, we'll get something back from stripping out the 52K parse.js, but still either of these packages will be a big bump...)

Should I look into other parsing libraries? Is there a size increase you're comfortable with for the benefits of a grammar-driven rather than hand-coded parser? Thanks for letting me know as I feel a bit derailed at the moment.

Author
Owner

I guess there are at least two other parsing packages I can try. Both of these have very different approaches from peggy and Ohm and from each other, so I think they are worth a try in the order listed to see if each is small enough. (The package sizes on npm appear to be smaller than peggy/Ohm, so there's some hope.)

  • Nearley (https://nearley.js.org/) with a Moo (https://github.com/tjvr/moo) tokenizer. This seems to be the only major general-grammar parsing package that welcomes a tokenizer, which I think will actually be really helpful in our case, so it seems worth trying independently. Also, it's the only parser that produces all parses in case the grammar is ambiguous (which will also help us discover whether the grammar we have is ambiguous or not); PEG parsers just return the first match they hit.
  • parser-combinators (https://github.com/michalusio/Parser/). This builds the top-level parser bottom-up from simpler parsers, which is an altogether different approach, so if Nearley doesn't work out, it still seems worth trying. Also, according to the benchmarks it's faster (for parsing JSON) than any of the other parsers we've tried, except for Chevrotain, which turned out not to be powerful enough for our language.

I'll try Nearley as soon as I can. Looking at the list of parsers on the benchmark page linked in the previous paragraph, there doesn't seem to be anything else worth trying besides these two and the ones we've already done. So if neither of these works out, I guess let's just revert to a handbuilt parser.

Author
Owner

OK, Nearley with a trivial grammar and a trivial Tokenizr (https://github.com/rse/tokenizr) lexer -- Moo actually didn't quite seem up to the job of tokenizing the mathjs expression language with its context-sensitive whitespace rules -- adds 19K (3%) to the raw bundle and 6K (3.5%) to the zipped bundle. I am going to assume those are small enough, especially as we will get some back if they work well and we rip out the old parser. With the current parse.js at 52K, they seem small enough that it could end up as a wash. Of course, I will re-check these figures when I have a lexer and parser that actually handle mathjs expressions, but for the moment, I am moving forward with Nearley.

Author
Owner

@josdejong wrote in #49 (comment):

Yeah it needs a build step,

Question for @josdejong: if/when we get one of these parser generator PRs to a point near merging (working well and sufficiently lightweight to consider inclusion in the main mathjs repository), do you have a recommendation on your preferred tool to manage the build procedure? Heretofore in mathjs, one can just edit the source files and then execute `npm run test`; the unit tests run with no prior steps necessary and are sure to depend on the latest code. With the introduction of peggy or Nearley, if the grammar file has been touched, it will be necessary to run a grammar compilation step prior to running mocha. On the other hand, for 90%+ of changes you will not be touching the grammar file, and it would be a shame to run the grammar compiler on every invocation of `npm run test`.

It's for this sort of reason that in the Numberscope project we have adopted the (GNU) `make` utility to track the dependencies among files, so that all and only the necessary build steps are taken whenever they are needed. As has been my experience since the start of my software-related career, it works quite well. On the other hand, I do recognize that it is an unfamiliar tool in the web-development community (for reasons I don't particularly understand; it seems to me that it's just been overlooked because of a sort of generation skip between C-language-family coders and web coders). So I would be happy to start using `make` in connection with nanomath and/or mathjs. But if you have a particular desire to avoid that tool and can suggest another one that will let the system easily check, whenever you do the equivalent of `npm run test`, whether the compiled grammar code is out of date compared to the grammar definition file and (only) if so re-compile it, I'd be happy to hear it.

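For what it's worth, here is a minimal sketch of what the `make` approach could look like. The file names are hypothetical — the actual grammar and output paths would depend on how the repository ends up laid out:

```makefile
# Rebuild the generated parser only when the grammar source is newer
# (paths are illustrative placeholders).
src/grammar.js: src/grammar.pegjs
	npx peggy --output $@ $<

.PHONY: test
test: src/grammar.js
	npm run test
```

With this, running `make test` recompiles grammar.js only when grammar.pegjs has changed, and otherwise goes straight to the unit tests.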
Collaborator

Thanks for testing all of this out.

Hm so Peggy generates a bit smaller parser compared to Ohm but still much larger than the current handwritten parser. I had expected a bigger difference, so kudos to Ohm I guess. Just for reference, I checked how large our current parse.js (52KB) is when minified: 14kB, and when minified+gzipped: 5kB. Could be even a bit less when it is part of the mathjs bundle. Of course bundle size is not the only thing, at least to me the performance is also relevant. And of course maintainability, which is currently an issue and will be improved by using a parser generator.

I'm indeed hesitant to go for Nearley because it isn't maintained anymore, and parser-combinators because we don't know whether it will still be maintained a while from now.

We can go for the plan B we discussed last week: describe the grammar in a parser generator to ensure soundness and a clear formal description, and also write a parser by hand, optimized for performance and bundle size. Both have to be tested against the same test suite. In the POC phase the size of the bundle doesn't matter yet, so we can just use a parser generator for the time being.

I think the bundle size will change a lot anyway if we rewrite the whole codebase of mathjs into nanomath, for better or worse. So I think it's a bit too early to drop solutions in this phase already. Since it is relatively little work to implement a parser generator, I think the best choice for now is to go for a convenient parser generator, and in a step two, implement a handwritten parser if still needed.

Do you have a preference for either Ohm or Peggy?

Author
Owner

Hmm. Since I have the Tokenizr lexer working well now (and it does some of the heavy lifting for mathjs's, shall we say, unique syntax), and Tokenizr seems reasonably well maintained, and parser-combinator is the fastest parsing package on Chevrotain's benchmark list except for Chevrotain itself (which is not powerful enough for us), and parser-combinator seems designed for effective tree-shaking and solidly maintained at the moment (413 merged pull requests, 1 open, last merge yesterday), I am going to try parsing with parser-combinator on the token stream emitted by the Tokenizr lexer.

I think if that turns out well, it will be near the sweet spot among performance, lightweight-ness, and maintainability. So fingers crossed.

To answer your question, with my experience so far among Ohm, peggy, and Nearley, I like Nearley the best. If it were well-maintained, I would put it in the sweet spot.

So let us see.

Collaborator

I'm a bit confused, would you like to go for Nearley then even though it's not maintained?

Just for reference: a well-known, established parser generator is antlr4 (https://github.com/antlr/antlr4). I'm not sure if it is suitable for our case, but I have used it before and really liked its syntax. It seems possible to have it generate JS: https://www.scriptol.com/programming/antlr4-javascript.php

Author
Owner

Sorry to have written confusingly. I am currently working with parser-combinator, which is quite lightweight and seems very actively maintained. I have begun to hope that the combination of parser-combinator and the Tokenizr lexer (which makes the job of the parser significantly lighter) will produce a very clear, easy-to-maintain, lightweight, and sufficiently fast parsing solution for our plans to beef up the mathjs language (any names occur to you in the shower yet?). We're getting pretty close to knowing. See #51 for how far I have gotten to this point.

As far as I know antlr4 is not "powerful" enough for the mathjs language; I don't think LL-based parsing is suitable.

Some checks failed
/ test (pull_request) Failing after 16s
This pull request is marked as a work in progress.