WIP: feat: define an Ohm.js parser for maje

glen commented

2025-12-19 21:51:45 +00:00

Owner

Adds an Ohm parser for maje, the mathjs expression language. Note the grammar is 200 lines as compared to 1850 for mathjs src/expression/parse.js, although the Ohm grammar does not perform any of the semantic interpretation, it is just a recognizer, whereas mathjs parse.js also produces an AST for the expression.

This is indubitably not fully correct yet, but it appears to be a plausible start that nominally captures every feature of maje per mathjs parse.js. The next step going forward to head toward making this merge-ready is to ensure that every example in mathjs test/unit-tests/expression/parse.test.js either succeeds in parsing or fails in parsing, as appropriate. That should shake out the (probably numerous) grammar bugs. I think the only other ingredient necessary to merge this is to ensure that it's possible to generate good error messages in the failed parsing cases. Semantics, including any kind of AST if we decide to do one on top of the built-in concrete syntax tree produced by Ohm, will be left for future PRs.

Adds an [Ohm](https://ohmjs.org/) parser for maje, the mathjs expression language. Note the grammar is 200 lines as compared to 1850 for mathjs src/expression/parse.js, although the Ohm grammar does not perform any of the semantic interpretation, it is just a recognizer, whereas mathjs parse.js also produces an AST for the expression. This is indubitably not fully correct yet, but it appears to be a plausible start that nominally captures every feature of maje per mathjs parse.js. The next step going forward to head toward making this merge-ready is to ensure that every example in mathjs test/unit-tests/expression/parse.test.js either succeeds in parsing or fails in parsing, as appropriate. That should shake out the (probably numerous) grammar bugs. I think the only other ingredient necessary to merge this is to ensure that it's possible to generate good error messages in the failed parsing cases. Semantics, including any kind of AST if we decide to do one on top of the built-in concrete syntax tree produced by Ohm, will be left for future PRs.

glen added 1 commit

2025-12-19 21:51:46 +00:00

feat: define an Ohm.js parser for maje

/ test (pull_request) Successful in 19s

Details

998e9c80c5

glen referenced this pull request

2025-12-19 21:54:15 +00:00

Possible next direction: Parsing and type-enhanced Node trees #45

glen commented

2025-12-19 22:00:37 +00:00

Author

Owner

Note that the scannerless nature of Ohm coupled with maje's convention of not generally allowing newlines at the "top level" while allowing them inside nested expressions complicates the Ohm grammar: it obliges us to carry a "white" parameter to most of the rules, indicating what kind of whitespace is active. (This parameter effectively produces two "flavors" of every rule, one in which newlines are ordinary whitespace, the other in which newlines are not always allowed.)

Similarly, the fact that we just want to skip the range notation a:b:c in the true branch of the ternary conditional introduces another parameter qRange in all of the rules above range, that indicates whether we are currently skipping range constructions.

It's conceivable there might be another mechanism for ensuring that ranges are not allowed after condition ? ... but that would be a refinement for the future. Best to just get this parsing to work.

Note that the scannerless nature of Ohm coupled with maje's convention of not generally allowing newlines at the "top level" while allowing them inside nested expressions complicates the Ohm grammar: it obliges us to carry a "white" parameter to most of the rules, indicating what kind of whitespace is active. (This parameter effectively produces two "flavors" of every rule, one in which newlines are ordinary whitespace, the other in which newlines are not always allowed.) Similarly, the fact that we just want to skip the range notation `a:b:c` in the true branch of the ternary conditional introduces another parameter `qRange` in all of the rules above `range`, that indicates whether we are currently skipping range constructions. It's conceivable there might be another mechanism for ensuring that ranges are not allowed after `condition ? ...` but that would be a refinement for the future. Best to just get this parsing to work.

glen added 1 commit

2025-12-20 16:27:11 +00:00

test: first set of parse tests

/ test (pull_request) Successful in 18s

Details

45f8f8087a

glen added 1 commit

2025-12-20 17:15:39 +00:00

test: another set of parse tests and associated fixes

/ test (pull_request) Successful in 18s

Details

ae98a750e7

glen added 2 commits

2025-12-20 22:21:03 +00:00

test: a couple of negative tests; comments next 567999f868

test: comment tests and associated grammar fixes

/ test (pull_request) Successful in 18s

Details

f482f2b252

glen added 1 commit

2025-12-21 07:44:18 +00:00

test: number tests and associated grammar fixes

/ test (pull_request) Successful in 19s

Details

507056064a

glen added 1 commit

2025-12-21 07:47:33 +00:00

chore: update to latest ohm, and remove 'holes' workaround

/ test (pull_request) Successful in 18s

Details

cf0706286a

glen added 1 commit

2025-12-21 09:26:11 +00:00

feat: add transducer to take nested \n to \r

/ test (pull_request) Successful in 18s

Details

ae19e46adc

glen added 1 commit

2025-12-21 09:31:16 +00:00

feat: add mageParse to match newlineTransduced string

/ test (pull_request) Successful in 18s

Details

e4a759e07e

glen added 2 commits

2025-12-21 11:56:36 +00:00

refactor: Use newline transducer to simplify grammar 920401209c

refactor: simplify Accessor recognition

/ test (pull_request) Successful in 18s

Details

81a45a7d65

glen added 1 commit

2025-12-25 08:48:54 +00:00

test: doublequote string tests and associated fixes

/ test (pull_request) Successful in 18s

Details

4244b52158

josdejong commented

2026-01-07 13:41:02 +00:00

Collaborator

That is a nice start! Yeah I expect there are some specific cases that are hard to define in a grammer 😅.

We'll also have to see how it performs and how much code (KB's) the generated parser is compared to a handwritten parser.

That is a nice start! Yeah I expect there are some specific cases that are hard to define in a grammer 😅. We'll also have to see how it performs and how much code (KB's) the generated parser is compared to a handwritten parser.

glen commented

2026-01-07 13:54:11 +00:00

Author

Owner

There is no "generated parser," at least not as javascript code -- a string containing the grammar definition is parsed on the fly and the parser is constructed in memory.

The next step here is for Jos to respond to mathjs 3621 which attempting to implement this parser uncovered.

There is no "generated parser," at least not as javascript code -- a string containing the grammar definition is parsed on the fly and the parser is constructed in memory. The next step here is for Jos to respond to [mathjs 3621](https://github.com/josdejong/mathjs/issues/3621) which attempting to implement this parser uncovered.

👍 1

glen commented

2026-01-07 13:58:21 +00:00

Author

Owner

Of course that means the ohm compiler will have to be in the bundle, which will surely add bundle size.

josdejong commented

2026-01-07 14:06:06 +00:00

Collaborator

@glen wrote in #47 (comment):

Of course that means the ohm compiler will have to be in the bundle, which will surely add bundle size.

O, Hmm. Then we really have to see how much code that is, including Ohmjs in the bundle. I haven't used Ohmjs myself. I have (positive) experience with Pegjs which allows generating code for the parser. But Pegjs seems not maintained anymore.

I think it will be good to quickly get a feel of the bundle size and performance: if that is a blocker it is best to figure that out asap.

@glen wrote in https://code.studioinfinity.org/StudioInfinity/nanomath/pulls/47#issuecomment-3630: > Of course that means the ohm compiler will have to be in the bundle, which will surely add bundle size. O, Hmm. Then we really have to see how much code that is, including Ohmjs in the bundle. I haven't used Ohmjs myself. I have (positive) experience with Pegjs which allows [generating code for the parser](https://github.com/pegjs/pegjs/blob/master/docs/guides/generating-a-parser.md). But Pegjs seems not maintained anymore. I think it will be good to quickly get a feel of the bundle size and performance: if that is a blocker it is best to figure that out asap.

glen commented

2026-01-07 14:27:39 +00:00

Author

Owner

Yes, I tried to do a relatively diligent search for a parse package, and Pegjs came up, but I took it off the list because of its maintenance status. My first thought was Chevrotain but it doesn't handle the extensive left recursion of the current mathjs parser. I considered parjs but that looked closer to a handwritten parser, harder to read, reason about, and extend than a parser defined by a specific collection, and it was also hard to know by reading the documentation whether it would handle the left recursion. So that's when I tried Ohm, and so far it seems like a pretty good fit.

I can't imagine that Ohm parsing performance will be any kind of an obstacle; are there any use cases for mathjs that are parsing-bound?

Unfortunately, bundling is a total mystery to me (any independent projects I do are unbundled and just rely on es modules that are served individually). I guess I should just make a dummy branch on mathjs in which I just add a function that gratuitously calls ohmjs to create a simple parser, and the see how much the bundle size increases? Does that sound like the correct next step?

Yes, I tried to do a relatively diligent search for a parse package, and Pegjs came up, but I took it off the list because of its maintenance status. My first thought was [Chevrotain](https://chevrotain.io/docs/) but it doesn't handle the extensive left recursion of the current mathjs parser. I considered [parjs](https://github.com/GregRos/parjs) but that looked closer to a handwritten parser, harder to read, reason about, and extend than a parser defined by a specific collection, and it was also hard to know by reading the documentation whether it would handle the left recursion. So that's when I tried Ohm, and so far it seems like a pretty good fit. I can't imagine that Ohm _parsing_ performance will be any kind of an obstacle; are there any use cases for mathjs that are parsing-bound? Unfortunately, bundling is a total mystery to me (any independent projects I do are unbundled and just rely on es modules that are served individually). I guess I should just make a dummy branch on mathjs in which I just add a function that gratuitously calls ohmjs to create a simple parser, and the see how much the bundle size increases? Does that sound like the correct next step?

josdejong commented

2026-01-07 14:42:27 +00:00

Collaborator

@glen wrote in #47 (comment):

I guess I should just make a dummy branch on mathjs in which I just add a function that gratuitously calls ohmjs to create a simple parser, and the see how much the bundle size increases?

Yes please. That is a simple way to get an idea indeed 👍.

I had a look at the bundle size of the files on npm too, but I'm not sure which bundle is actually needed and if all of the code is used or just a part of it (looking inside the /dist folder): https://www.npmjs.com/package/ohm-js?activeTab=code

@glen wrote in https://code.studioinfinity.org/StudioInfinity/nanomath/pulls/47#issuecomment-3632: > I guess I should just make a dummy branch on mathjs in which I just add a function that gratuitously calls ohmjs to create a simple parser, and the see how much the bundle size increases? Yes please. That is a simple way to get an idea indeed 👍. I had a look at the bundle size of the files on npm too, but I'm not sure which bundle is actually needed and if all of the code is used or just a part of it (looking inside the `/dist` folder): https://www.npmjs.com/package/ohm-js?activeTab=code

glen commented

2026-01-07 14:46:50 +00:00

Author

Owner

@josdejong wrote in #47 (comment):

Yes please. That is a simple way to get an idea indeed 👍.

OK, will do as soon as I can.

I had a look at the bundle size of the files on npm too, but I'm not sure which bundle is actually needed and if all of the code is used or just a part of it (looking inside the /dist folder): https://www.npmjs.com/package/ohm-js?activeTab=code

I am seeing ohm.min.js which is 96K -- if that is the increase in the bundle, how does it strike you? What's the threshold of acceptable here?

@josdejong wrote in https://code.studioinfinity.org/StudioInfinity/nanomath/pulls/47#issuecomment-3633: > Yes please. That is a simple way to get an idea indeed :+1:. OK, will do as soon as I can. > I had a look at the bundle size of the files on npm too, but I'm not sure which bundle is actually needed and if all of the code is used or just a part of it (looking inside the `/dist` folder): https://www.npmjs.com/package/ohm-js?activeTab=code I am seeing ohm.min.js which is 96K -- if that is the increase in the bundle, how does it strike you? What's the threshold of acceptable here?

josdejong commented

2026-01-07 15:21:25 +00:00

Collaborator

The minified math.js bundle is 650K right now (not gzipped). An additional 96K is indeed a lot :( . But let's first get the full picture before drawing conclusions. Let's see how performant it is, and how much time is required to parse the parser itself via Ohm-js on the first time use. And see how the code works out including the grammer and parsing it into parse Node classes. My feeling is that it will be slower than the hand-written parser but let's see.

If the solution with Ohm-js is both significantly slower and results in a significantly larger bundle I think I would prefer a hand-written parser. What do you think?

The minified `math.js` bundle is 650K right now (not gzipped). An additional 96K is indeed a lot :( . But let's first get the full picture before drawing conclusions. Let's see how performant it is, and how much time is required to parse the parser itself via Ohm-js on the first time use. And see how the code works out including the grammer and parsing it into parse `Node` classes. My feeling is that it will be slower than the hand-written parser but let's see. If the solution with Ohm-js is both significantly slower and results in a significantly larger bundle I think I would prefer a hand-written parser. What do you think?

glen commented

2026-01-07 15:51:52 +00:00

Author

Owner

@josdejong wrote in #47 (comment):

Let's see how performant it is, and how much time is required to parse the parser itself via Ohm-js on the first time use. And see how the code works out including the grammer and parsing it into parse Node classes. My feeling is that it will be slower than the hand-written parser but let's see.

OK, when I am able instead of adding just a simple grammar, I will add the current close-but-not-quite-exact grammar from this branch to a function in mathjs. That way, I can see the bundle size, and benchmark parsing the grammar itself, and then I can benchmark using that parser to recognize some long mathjs formulas against current parse() on those same formulas.

The Ohm grammar won't produce Node trees. That's implemented separately via crawling the concrete parse tree, and doesn't seem worth it to me for a benchmark to decide if we are going to use Ohm.

If the solution with Ohm-js is both significantly slower and results in a significantly larger bundle I think I would prefer a hand-written parser. What do you think?

I can't judge about the importance of bundle size. I just have trouble wrapping my head around serving bundles vs. serving a directory of modules that are loaded by the imports in the code... so I will have to defer to your judgment on bundle size.

On the parsing speed, though, I respectfully disagree. As long as the cold-start one-time penalty and the expression parse times are "fast enough" (which I agree is a judgment call, but could be much slower than currently), I would go with a parsing package (even if it's not Ohm -- I am not at all wedded to Ohm, it was just a decent-looking second candidate after Chevrotain proved inadequate). I am not aware of applications in which parsing expressions is any kind of important computational bottleneck. The performance of the compiled versions of those parsed expressions is far more important, and that will be unaffected by the parser. And there are significant advantages to a parsing package: ease of reading and understanding the parser, conciseness of the description of the grammar, maintainability/modifiability, discovery of issues like mathjs 3621, to name a few.

@josdejong wrote in https://code.studioinfinity.org/StudioInfinity/nanomath/pulls/47#issuecomment-3635: > Let's see how performant it is, and how much time is required to parse the parser itself via Ohm-js on the first time use. And see how the code works out including the grammer and parsing it into parse `Node` classes. My feeling is that it will be slower than the hand-written parser but let's see. OK, when I am able instead of adding just a simple grammar, I will add the current close-but-not-quite-exact grammar from this branch to a function in mathjs. That way, I can see the bundle size, and benchmark parsing the grammar itself, and then I can benchmark using that parser to _recognize_ some long mathjs formulas against current `parse()` on those same formulas. The Ohm grammar won't produce Node trees. That's implemented separately via crawling the concrete parse tree, and doesn't seem worth it to me for a benchmark to decide if we are going to use Ohm. > If the solution with Ohm-js is both significantly slower and results in a significantly larger bundle I think I would prefer a hand-written parser. What do you think? I can't judge about the importance of bundle size. I just have trouble wrapping my head around serving bundles vs. serving a directory of modules that are loaded by the imports in the code... so I will have to defer to your judgment on bundle size. On the parsing speed, though, I respectfully disagree. As long as the cold-start one-time penalty and the expression parse times are "fast enough" (which I agree is a judgment call, but could be much slower than currently), I would go with a parsing package (even if it's not Ohm -- I am not at all wedded to Ohm, it was just a decent-looking second candidate after Chevrotain proved inadequate). I am not aware of applications in which parsing expressions is any kind of important computational bottleneck. The performance of the compiled versions of those parsed expressions is far more important, and that will be unaffected by the parser. And there are significant advantages to a parsing package: ease of reading and understanding the parser, conciseness of the description of the grammar, maintainability/modifiability, discovery of issues like [mathjs 3621](https://github.com/josdejong/mathjs/issues/3621), to name a few.

glen added 1 commit

2026-02-04 19:07:01 +00:00

test: add tests for error messages as well

/ test (pull_request) Failing after 21s

Details

f1288f47a6

glen referenced this pull request

2026-02-05 22:06:17 +00:00

WIP: feat: define a peggy parser for the mathjs expression language #49

glen referenced this pull request

2026-02-07 23:52:56 +00:00

WIP: nearley parser for the expression language #50