WIP: feat: define an Ohm.js parser for maje #47

Draft
glen wants to merge 12 commits from ohm_parse into main
Owner

Adds an Ohm parser for maje, the mathjs expression language. Note the grammar is 200 lines as compared to 1850 for mathjs src/expression/parse.js, although the Ohm grammar does not perform any of the semantic interpretation, it is just a recognizer, whereas mathjs parse.js also produces an AST for the expression.

This is indubitably not fully correct yet, but it appears to be a plausible start that nominally captures every feature of maje per mathjs parse.js. The next step going forward to head toward making this merge-ready is to ensure that every example in mathjs test/unit-tests/expression/parse.test.js either succeeds in parsing or fails in parsing, as appropriate. That should shake out the (probably numerous) grammar bugs. I think the only other ingredient necessary to merge this is to ensure that it's possible to generate good error messages in the failed parsing cases. Semantics, including any kind of AST if we decide to do one on top of the built-in concrete syntax tree produced by Ohm, will be left for future PRs.

Adds an [Ohm](https://ohmjs.org/) parser for maje, the mathjs expression language. Note the grammar is 200 lines as compared to 1850 for mathjs src/expression/parse.js, although the Ohm grammar does not perform any of the semantic interpretation, it is just a recognizer, whereas mathjs parse.js also produces an AST for the expression. This is indubitably not fully correct yet, but it appears to be a plausible start that nominally captures every feature of maje per mathjs parse.js. The next step going forward to head toward making this merge-ready is to ensure that every example in mathjs test/unit-tests/expression/parse.test.js either succeeds in parsing or fails in parsing, as appropriate. That should shake out the (probably numerous) grammar bugs. I think the only other ingredient necessary to merge this is to ensure that it's possible to generate good error messages in the failed parsing cases. Semantics, including any kind of AST if we decide to do one on top of the built-in concrete syntax tree produced by Ohm, will be left for future PRs.
feat: define an Ohm.js parser for maje
All checks were successful
/ test (pull_request) Successful in 19s
998e9c80c5
Author
Owner

Note that the scannerless nature of Ohm coupled with maje's convention of not generally allowing newlines at the "top level" while allowing them inside nested expressions complicates the Ohm grammar: it obliges us to carry a "white" parameter to most of the rules, indicating what kind of whitespace is active. (This parameter effectively produces two "flavors" of every rule, one in which newlines are ordinary whitespace, the other in which newlines are not always allowed.)

Similarly, the fact that we just want to skip the range notation a:b:c in the true branch of the ternary conditional introduces another parameter qRange in all of the rules above range, that indicates whether we are currently skipping range constructions.

It's conceivable there might be another mechanism for ensuring that ranges are not allowed after condition ? ... but that would be a refinement for the future. Best to just get this parsing to work.

Note that the scannerless nature of Ohm coupled with maje's convention of not generally allowing newlines at the "top level" while allowing them inside nested expressions complicates the Ohm grammar: it obliges us to carry a "white" parameter to most of the rules, indicating what kind of whitespace is active. (This parameter effectively produces two "flavors" of every rule, one in which newlines are ordinary whitespace, the other in which newlines are not always allowed.) Similarly, the fact that we just want to skip the range notation `a:b:c` in the true branch of the ternary conditional introduces another parameter `qRange` in all of the rules above `range`, that indicates whether we are currently skipping range constructions. It's conceivable there might be another mechanism for ensuring that ranges are not allowed after `condition ? ...` but that would be a refinement for the future. Best to just get this parsing to work.
test: first set of parse tests
All checks were successful
/ test (pull_request) Successful in 18s
45f8f8087a
test: another set of parse tests and associated fixes
All checks were successful
/ test (pull_request) Successful in 18s
ae98a750e7
test: number tests and associated grammar fixes
All checks were successful
/ test (pull_request) Successful in 19s
507056064a
chore: update to latest ohm, and remove 'holes' workaround
All checks were successful
/ test (pull_request) Successful in 18s
cf0706286a
feat: add transducer to take nested \n to \r
All checks were successful
/ test (pull_request) Successful in 18s
ae19e46adc
feat: add mageParse to match newlineTransduced string
All checks were successful
/ test (pull_request) Successful in 18s
e4a759e07e
test: doublequote string tests and associated fixes
All checks were successful
/ test (pull_request) Successful in 18s
4244b52158
Collaborator

That is a nice start! Yeah I expect there are some specific cases that are hard to define in a grammer 😅.

We'll also have to see how it performs and how much code (KB's) the generated parser is compared to a handwritten parser.

That is a nice start! Yeah I expect there are some specific cases that are hard to define in a grammer 😅. We'll also have to see how it performs and how much code (KB's) the generated parser is compared to a handwritten parser.
Author
Owner

There is no "generated parser," at least not as javascript code -- a string containing the grammar definition is parsed on the fly and the parser is constructed in memory.

The next step here is for Jos to respond to mathjs 3621 which attempting to implement this parser uncovered.

There is no "generated parser," at least not as javascript code -- a string containing the grammar definition is parsed on the fly and the parser is constructed in memory. The next step here is for Jos to respond to [mathjs 3621](https://github.com/josdejong/mathjs/issues/3621) which attempting to implement this parser uncovered.
Author
Owner

Of course that means the ohm compiler will have to be in the bundle, which will surely add bundle size.

Of course that means the ohm compiler will have to be in the bundle, which will surely add bundle size.
Collaborator

@glen wrote in #47 (comment):

Of course that means the ohm compiler will have to be in the bundle, which will surely add bundle size.

O, Hmm. Then we really have to see how much code that is, including Ohmjs in the bundle. I haven't used Ohmjs myself. I have (positive) experience with Pegjs which allows generating code for the parser. But Pegjs seems not maintained anymore.

I think it will be good to quickly get a feel of the bundle size and performance: if that is a blocker it is best to figure that out asap.

@glen wrote in https://code.studioinfinity.org/StudioInfinity/nanomath/pulls/47#issuecomment-3630: > Of course that means the ohm compiler will have to be in the bundle, which will surely add bundle size. O, Hmm. Then we really have to see how much code that is, including Ohmjs in the bundle. I haven't used Ohmjs myself. I have (positive) experience with Pegjs which allows [generating code for the parser](https://github.com/pegjs/pegjs/blob/master/docs/guides/generating-a-parser.md). But Pegjs seems not maintained anymore. I think it will be good to quickly get a feel of the bundle size and performance: if that is a blocker it is best to figure that out asap.
Author
Owner

Yes, I tried to do a relatively diligent search for a parse package, and Pegjs came up, but I took it off the list because of its maintenance status. My first thought was Chevrotain but it doesn't handle the extensive left recursion of the current mathjs parser. I considered parjs but that looked closer to a handwritten parser, harder to read, reason about, and extend than a parser defined by a specific collection, and it was also hard to know by reading the documentation whether it would handle the left recursion. So that's when I tried Ohm, and so far it seems like a pretty good fit.

I can't imagine that Ohm parsing performance will be any kind of an obstacle; are there any use cases for mathjs that are parsing-bound?

Unfortunately, bundling is a total mystery to me (any independent projects I do are unbundled and just rely on es modules that are served individually). I guess I should just make a dummy branch on mathjs in which I just add a function that gratuitously calls ohmjs to create a simple parser, and the see how much the bundle size increases? Does that sound like the correct next step?

Yes, I tried to do a relatively diligent search for a parse package, and Pegjs came up, but I took it off the list because of its maintenance status. My first thought was [Chevrotain](https://chevrotain.io/docs/) but it doesn't handle the extensive left recursion of the current mathjs parser. I considered [parjs](https://github.com/GregRos/parjs) but that looked closer to a handwritten parser, harder to read, reason about, and extend than a parser defined by a specific collection, and it was also hard to know by reading the documentation whether it would handle the left recursion. So that's when I tried Ohm, and so far it seems like a pretty good fit. I can't imagine that Ohm _parsing_ performance will be any kind of an obstacle; are there any use cases for mathjs that are parsing-bound? Unfortunately, bundling is a total mystery to me (any independent projects I do are unbundled and just rely on es modules that are served individually). I guess I should just make a dummy branch on mathjs in which I just add a function that gratuitously calls ohmjs to create a simple parser, and the see how much the bundle size increases? Does that sound like the correct next step?
Collaborator

@glen wrote in #47 (comment):

I guess I should just make a dummy branch on mathjs in which I just add a function that gratuitously calls ohmjs to create a simple parser, and the see how much the bundle size increases?

Yes please. That is a simple way to get an idea indeed 👍.

I had a look at the bundle size of the files on npm too, but I'm not sure which bundle is actually needed and if all of the code is used or just a part of it (looking inside the /dist folder): https://www.npmjs.com/package/ohm-js?activeTab=code

@glen wrote in https://code.studioinfinity.org/StudioInfinity/nanomath/pulls/47#issuecomment-3632: > I guess I should just make a dummy branch on mathjs in which I just add a function that gratuitously calls ohmjs to create a simple parser, and the see how much the bundle size increases? Yes please. That is a simple way to get an idea indeed 👍. I had a look at the bundle size of the files on npm too, but I'm not sure which bundle is actually needed and if all of the code is used or just a part of it (looking inside the `/dist` folder): https://www.npmjs.com/package/ohm-js?activeTab=code
Author
Owner

@josdejong wrote in #47 (comment):

Yes please. That is a simple way to get an idea indeed 👍.

OK, will do as soon as I can.

I had a look at the bundle size of the files on npm too, but I'm not sure which bundle is actually needed and if all of the code is used or just a part of it (looking inside the /dist folder): https://www.npmjs.com/package/ohm-js?activeTab=code

I am seeing ohm.min.js which is 96K -- if that is the increase in the bundle, how does it strike you? What's the threshold of acceptable here?

@josdejong wrote in https://code.studioinfinity.org/StudioInfinity/nanomath/pulls/47#issuecomment-3633: > Yes please. That is a simple way to get an idea indeed :+1:. OK, will do as soon as I can. > I had a look at the bundle size of the files on npm too, but I'm not sure which bundle is actually needed and if all of the code is used or just a part of it (looking inside the `/dist` folder): https://www.npmjs.com/package/ohm-js?activeTab=code I am seeing ohm.min.js which is 96K -- if that is the increase in the bundle, how does it strike you? What's the threshold of acceptable here?
Collaborator

The minified math.js bundle is 650K right now (not gzipped). An additional 96K is indeed a lot :( . But let's first get the full picture before drawing conclusions. Let's see how performant it is, and how much time is required to parse the parser itself via Ohm-js on the first time use. And see how the code works out including the grammer and parsing it into parse Node classes. My feeling is that it will be slower than the hand-written parser but let's see.

If the solution with Ohm-js is both significantly slower and results in a significantly larger bundle I think I would prefer a hand-written parser. What do you think?

The minified `math.js` bundle is 650K right now (not gzipped). An additional 96K is indeed a lot :( . But let's first get the full picture before drawing conclusions. Let's see how performant it is, and how much time is required to parse the parser itself via Ohm-js on the first time use. And see how the code works out including the grammer and parsing it into parse `Node` classes. My feeling is that it will be slower than the hand-written parser but let's see. If the solution with Ohm-js is both significantly slower and results in a significantly larger bundle I think I would prefer a hand-written parser. What do you think?
Author
Owner

@josdejong wrote in #47 (comment):

Let's see how performant it is, and how much time is required to parse the parser itself via Ohm-js on the first time use. And see how the code works out including the grammer and parsing it into parse Node classes. My feeling is that it will be slower than the hand-written parser but let's see.

OK, when I am able instead of adding just a simple grammar, I will add the current close-but-not-quite-exact grammar from this branch to a function in mathjs. That way, I can see the bundle size, and benchmark parsing the grammar itself, and then I can benchmark using that parser to recognize some long mathjs formulas against current parse() on those same formulas.

The Ohm grammar won't produce Node trees. That's implemented separately via crawling the concrete parse tree, and doesn't seem worth it to me for a benchmark to decide if we are going to use Ohm.

If the solution with Ohm-js is both significantly slower and results in a significantly larger bundle I think I would prefer a hand-written parser. What do you think?

I can't judge about the importance of bundle size. I just have trouble wrapping my head around serving bundles vs. serving a directory of modules that are loaded by the imports in the code... so I will have to defer to your judgment on bundle size.

On the parsing speed, though, I respectfully disagree. As long as the cold-start one-time penalty and the expression parse times are "fast enough" (which I agree is a judgment call, but could be much slower than currently), I would go with a parsing package (even if it's not Ohm -- I am not at all wedded to Ohm, it was just a decent-looking second candidate after Chevrotain proved inadequate). I am not aware of applications in which parsing expressions is any kind of important computational bottleneck. The performance of the compiled versions of those parsed expressions is far more important, and that will be unaffected by the parser. And there are significant advantages to a parsing package: ease of reading and understanding the parser, conciseness of the description of the grammar, maintainability/modifiability, discovery of issues like mathjs 3621, to name a few.

@josdejong wrote in https://code.studioinfinity.org/StudioInfinity/nanomath/pulls/47#issuecomment-3635: > Let's see how performant it is, and how much time is required to parse the parser itself via Ohm-js on the first time use. And see how the code works out including the grammer and parsing it into parse `Node` classes. My feeling is that it will be slower than the hand-written parser but let's see. OK, when I am able instead of adding just a simple grammar, I will add the current close-but-not-quite-exact grammar from this branch to a function in mathjs. That way, I can see the bundle size, and benchmark parsing the grammar itself, and then I can benchmark using that parser to _recognize_ some long mathjs formulas against current `parse()` on those same formulas. The Ohm grammar won't produce Node trees. That's implemented separately via crawling the concrete parse tree, and doesn't seem worth it to me for a benchmark to decide if we are going to use Ohm. > If the solution with Ohm-js is both significantly slower and results in a significantly larger bundle I think I would prefer a hand-written parser. What do you think? I can't judge about the importance of bundle size. I just have trouble wrapping my head around serving bundles vs. serving a directory of modules that are loaded by the imports in the code... so I will have to defer to your judgment on bundle size. On the parsing speed, though, I respectfully disagree. As long as the cold-start one-time penalty and the expression parse times are "fast enough" (which I agree is a judgment call, but could be much slower than currently), I would go with a parsing package (even if it's not Ohm -- I am not at all wedded to Ohm, it was just a decent-looking second candidate after Chevrotain proved inadequate). I am not aware of applications in which parsing expressions is any kind of important computational bottleneck. The performance of the compiled versions of those parsed expressions is far more important, and that will be unaffected by the parser. And there are significant advantages to a parsing package: ease of reading and understanding the parser, conciseness of the description of the grammar, maintainability/modifiability, discovery of issues like [mathjs 3621](https://github.com/josdejong/mathjs/issues/3621), to name a few.
All checks were successful
/ test (pull_request) Successful in 18s
Required
Details
This pull request is marked as a work in progress.
View command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin ohm_parse:ohm_parse
git switch ohm_parse
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
StudioInfinity/nanomath!47
No description provided.