WIP: feat: define a peggy parser for the mathjs expression language #49

Draft
glen wants to merge 15 commits from peggy_parse into main
Owner

This is just an amendment to #47, but now based on the peggy parser. The status is back to roughly where the Ohm PR was. It should be ready to test the bundle size impacts of this parser.

Oh, but it also needs a build system, since there is an extra compilation step to generate the parser from the source code.

glen added 14 commits 2026-02-05 22:06:17 +00:00
feat: define an Ohm.js parser for maje
All checks were successful
/ test (pull_request) Successful in 19s
998e9c80c5
test: first set of parse tests
All checks were successful
/ test (pull_request) Successful in 18s
45f8f8087a
test: another set of parse tests and associated fixes
All checks were successful
/ test (pull_request) Successful in 18s
ae98a750e7
test: comment tests and associated grammar fixes
All checks were successful
/ test (pull_request) Successful in 18s
f482f2b252
test: number tests and associated grammar fixes
All checks were successful
/ test (pull_request) Successful in 19s
507056064a
chore: update to latest ohm, and remove 'holes' workaround
All checks were successful
/ test (pull_request) Successful in 18s
cf0706286a
feat: add transducer to take nested \n to \r
All checks were successful
/ test (pull_request) Successful in 18s
ae19e46adc
feat: add mageParse to match newlineTransduced string
All checks were successful
/ test (pull_request) Successful in 18s
e4a759e07e
refactor: simplify Accessor recognition
All checks were successful
/ test (pull_request) Successful in 18s
81a45a7d65
test: doublequote string tests and associated fixes
All checks were successful
/ test (pull_request) Successful in 18s
4244b52158
test: add tests for error messages as well
Some checks failed
/ test (pull_request) Failing after 21s
f1288f47a6
refactor: Switch to peggy parser
Some checks failed
/ test (pull_request) Failing after 18s
1866ba7832
We are back in the state that all of the expressions so far that should
  parse do. Just most of the error messages for parses that are supposed
  to fail don't match the expected format (which is unsurprising), or no
  format has been set.
chore: add the peggy grammar
Some checks failed
/ test (pull_request) Failing after 16s
d46b27508d
Collaborator

Impressive, that looks quite neat 😎!

Yeah it needs a build step, but the nice thing of course is that it doesn't have to bundle the parser generator.

Author
Owner

@josdejong wrote in #49 (comment):

but the nice thing of course is that it doesn't have to bundle the parser generator.

Indeed, that will be tonight's experiment, to just append this grammar to mathjs and check the bundle size effect.

Author
Owner

Here are the sad results: the generated javascript grammar.js file is 120K; it adds 78K (12%) to the unzipped bundle and 20K (11.5%) to the zipped bundle. Not a lot better than Ohm at 100K (15%) to the unzipped bundle and 25K (14%) to the zipped bundle. (I mean, we'll get something back from stripping out the 52K parse.js, but still either of these packages will be a big bump...)

Should I look into other parsing libraries? Is there a size increase you're comfortable with for the benefits of a grammar-driven rather than hand-coded parser? Thanks for letting me know as I feel a bit derailed at the moment.

Author
Owner

I guess there are at least two other parsing packages I can try. Both of these have very different approaches from peggy and Ohm and from each other, so I think they are worth a try in the order listed to see if each is small enough. (The package sizes on npm appear to be smaller than peggy/Ohm, so there's some hope.)

  • Nearley (https://nearley.js.org/) with a Moo (https://github.com/tjvr/moo) tokenizer. This seems to be the only major general-grammar parsing package that welcomes a tokenizer, which I think will actually be really helpful in our case, so it seems worth trying independently. Also, it's the only parser that produces all parses in case the grammar is ambiguous (which will also help us discover whether the grammar we have is ambiguous or not); PEG parsers just return the first match they hit.
  • parser-combinators (https://github.com/michalusio/Parser/). This builds the top-level parser bottom-up from simpler parsers, which is an altogether different approach, so if Nearley doesn't work out, it still seems worth trying. Also, according to the benchmarks it's faster (for parsing JSON) than any of the other parsers we've tried, except for Chevrotain, which turned out not to be powerful enough for our language.

I'll try Nearley as soon as I can. Looking at the list of parsers on the benchmark page linked in the previous paragraph, there doesn't seem to be anything else worth trying besides these two and the ones we've already done. So if neither of these works out, I guess let's just revert to a handbuilt parser.

Author
Owner

OK, Nearley with a trivial grammar and a trivial Tokenizr (https://github.com/rse/tokenizr) lexer -- Moo actually didn't quite seem up to the job of tokenizing the mathjs expression language with its context-sensitive whitespace rules -- adds 19K (3%) to the raw bundle and 6K (3.5%) to the zipped bundle. I am going to assume those are small enough, especially as we will get some back if they work well and we rip out the old parser. With the current parse.js at 52K, they seem small enough that it could end up as a wash. Of course, I will re-check these figures when I have a lexer and parser that actually handle mathjs expressions, but for the moment, I am moving forward with Nearley.

Author
Owner

@josdejong wrote in #49 (comment):

Yeah it needs a build step,

Question for @josdejong: if/when we get one of these parser generator PRs to a point near merging (working well and sufficiently lightweight to consider inclusion in the main mathjs repository), do you have a recommendation on your preferred tool to manage the build procedure? Heretofore in mathjs, one can just edit the source files and then execute `npm run test`; the unit tests run with no prior steps necessary and are sure to depend on the latest code. With the introduction of peggy or Nearley, if the grammar file has been touched, it will be necessary to run a grammar compilation step prior to running mocha. On the other hand, for 90%+ of changes you will not be touching the grammar file, and it would be a shame to run the grammar compiler on every invocation of `npm run test`.

It's for this sort of reason that in the Numberscope project we have adopted the (GNU) `make` utility to track the dependencies among files, so that all and only the necessary build steps are taken whenever they are needed. As has been my experience since the start of my software-related career, it works quite well. On the other hand, I do recognize that it is an unfamiliar tool in the web-development community (for reasons I don't particularly understand; it seems to me that it's just been overlooked because of a sort of generation skip between C-language-family coders and web coders). So I would be happy to start using `make` in connection with nanomath and/or mathjs. But if you have a particular desire to avoid that tool and can suggest another one that will let the system easily check, whenever you do the equivalent of `npm run test`, whether the compiled grammar code is out of date compared to the grammar definition file and (only) if so re-compile it, I'd be happy to hear it.

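For what it's worth, here is a minimal sketch of what the `make` approach could look like. The file names are hypothetical — the actual grammar and output paths would depend on how the repository ends up laid out:

```makefile
# Rebuild the generated parser only when the grammar source is newer
# (paths are illustrative placeholders).
src/grammar.js: src/grammar.pegjs
	npx peggy --output $@ $<

.PHONY: test
test: src/grammar.js
	npm run test
```

With this, running `make test` recompiles grammar.js only when grammar.pegjs has changed, and otherwise goes straight to the unit tests.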
Collaborator

Thanks for testing all of this out.

Hm so Peggy generates a bit smaller parser compared to Ohm but still much larger than the current handwritten parser. I had expected a bigger difference, so kudos to Ohm I guess. Just for reference, I checked how large our current parse.js (52KB) is when minified: 14kB, and when minified+gzipped: 5kB. Could be even a bit less when it is part of the mathjs bundle. Of course bundle size is not the only thing, at least to me the performance is also relevant. And of course maintainability, which is currently an issue and will be improved by using a parser generator.

I'm indeed hesitant to go for Nearley because it isn't maintained anymore, and parser-combinators because we don't know whether it will still be maintained a while from now.

We can go for the plan B we discussed last week: describe the grammar in a parser generator to ensure soundness and a clear formal description, and also write a parser by hand, optimized for performance and bundle size. Both have to be tested against the same test suite. In the POC phase the size of the bundle doesn't matter yet, so we can just use a parser generator for the time being.

I think the bundle size will change a lot anyway if we rewrite the whole codebase of mathjs into nanomath, for better or worse. So I think it's a bit too early to drop solutions in this phase already. Since it is relatively little work to implement a parser generator, I think the best choice for now is to go for a convenient parser generator, and in a step two, implement a handwritten parser if still needed.

Do you have a preference for either Ohm or Peggy?

Author
Owner

Hmm. Since I have the Tokenizr lexer working well now (and it does some of the heavy lifting for mathjs's, shall we say, unique syntax), and Tokenizr seems reasonably well maintained, and parser-combinator is the fastest parsing package on Chevrotain's benchmark list except for Chevrotain itself (which is not powerful enough for us), and parser-combinator seems designed for effective tree-shaking and solidly maintained at the moment (413 merged pull requests, 1 open, last merge yesterday), I am going to try parsing with parser-combinator on the token stream emitted by the Tokenizr lexer.

I think if that turns out well, it will be near the sweet spot among performance, lightweight-ness, and maintainability. So fingers crossed.

To answer your question, with my experience so far among Ohm, peggy, and Nearley, I like Nearley the best. If it were well-maintained, I would put it in the sweet spot.

So let us see.

Collaborator

I'm a bit confused, would you like to go for Nearley then even though it's not maintained?

Just for reference: a well-known, established parser generator is antlr4 (https://github.com/antlr/antlr4). I'm not sure if it is suitable for our case, but I have used it before and really liked its syntax. It seems possible to have it generate JS: https://www.scriptol.com/programming/antlr4-javascript.php

Author
Owner

Sorry to have written confusingly. I am currently working with parser-combinator, which is quite lightweight and seems very actively maintained. I have begun to hope that the combination of parser-combinator and the Tokenizr lexer (which makes the job of the parser significantly lighter) will produce a very clear, easy-to-maintain, lightweight, and sufficiently fast parsing solution for our plans to beef up the mathjs language (any names occur to you in the shower yet?). We're getting pretty close to knowing. See #51 for how far I have gotten to this point.

As far as I know antlr4 is not "powerful" enough for the mathjs language; I don't think LL-based parsing is suitable.

Some checks failed
/ test (pull_request) Failing after 16s
This pull request is marked as a work in progress.