WIP:parser_combinator #51
The final entry in the series #47, #49, #50 exploring parsing packages for the expression language. This one still uses the Tokenizr lexer, and then parses the resulting token stream with parser-combinator. For the initial fragment implemented so far, parser-combinator feels very clean and simple, so I remain hopeful that this may be the best option yet.
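To give a flavor of why this style feels clean: a combinator parser is built from small functions that each consume part of a token stream and can be composed to read almost like grammar productions. The sketch below is a minimal hand-rolled illustration of that idea; the helper names (`token`, `choice`, `seq`, `map`) and the token shapes are illustrative, not the actual parser-combinator package's API.

```javascript
// Minimal illustration of the combinator style over a token stream.
// Each parser is a function: (toks, pos) => { ok, value, pos }.
// Helper names here are illustrative, not the npm package's API.

const token = (type) => (toks, pos) =>
  toks[pos] && toks[pos].type === type
    ? { ok: true, value: toks[pos].value, pos: pos + 1 }
    : { ok: false, pos };

// Try alternatives in order; first success wins.
const choice = (...parsers) => (toks, pos) => {
  for (const p of parsers) {
    const r = p(toks, pos);
    if (r.ok) return r;
  }
  return { ok: false, pos };
};

// Run parsers in sequence, collecting their values.
const seq = (...parsers) => (toks, pos) => {
  const values = [];
  for (const p of parsers) {
    const r = p(toks, pos);
    if (!r.ok) return { ok: false, pos };
    values.push(r.value);
    pos = r.pos;
  }
  return { ok: true, value: values, pos };
};

// Transform a successful parse result (e.g., build an AST node).
const map = (p, f) => (toks, pos) => {
  const r = p(toks, pos);
  return r.ok ? { ...r, value: f(r.value) } : r;
};

// A rule reads almost like the grammar production it implements:
//   addExpr := number ('+' | '-') number
const addExpr = map(
  seq(token('number'), choice(token('plus'), token('minus')), token('number')),
  ([left, op, right]) => ({ type: 'BinaryOp', op, left, right })
);

const toks = [
  { type: 'number', value: 2 },
  { type: 'plus', value: '+' },
  { type: 'number', value: 3 },
];
console.log(addExpr(toks, 0).value);
// → { type: 'BinaryOp', op: '+', left: 2, right: 3 }
```

The appeal is that the rule definitions mirror the grammar's structure directly, with no separate code-generation step.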
OK, the parser created via parser-combinators seems to be complete; I will be migrating the parser tests from ohm/peggy to this branch soon.
OK, the Tokenizr lexer and parser-combinator parser are now passing all of the tests from the prior series of parsing packages, including ones that did not pass in the previous rounds. While there are still more tests from mainline mathjs to transfer here, this milestone means this parser is mature enough that its code weight is not going to increase appreciably from where it is now. Therefore, the next step is to test how much this particular lexer/parser combination increases the mathjs bundle.
Changed title from parser_combinator to WIP:parser_combinator

All right, @josdejong, I totally don't understand it: even though the parser code file is only 20K (less than half the size of the current mathjs parser), the entire parser-combinators library I am using is 20K, the tokenizr library is 20K, and the unraw string-escape interpreter package is 10K -- which only add up to 70K -- the math.js bundle increases by a whopping 150K (23%) when I just install it in main-line mathjs and call its parse function from parseStart. (I checked there are no other new dependencies: these are all of the new files.) How is that even possible? I would have thought that 70K would have been the maximum possible increase, since that's the sum of the sizes of all of the new files...
Anyhow, the zipped bundle increases by a lot less in this case, only 20K (11.4%) for a total of 195K.
I'm assuming that's still more than you're willing to devote to a new parser...
I'm really discouraged and frustrated. I was quite happy with this latest parser, it seems very clean and simple. And I assumed that since the packages it is using are very small that it would be the least size increase yet. Alas.
So here are some things I could try:
(A) Go ahead and get the Nearley parser working even though Nearley is not maintained. I was pretty close to done when I realized that Nearley is not maintained, so it wouldn't be a lot of work. That way we could see how much it adds with a more substantial grammar (so far I only measured it with a trivial grammar, and that was very lightweight, so there's clearly not a lot of overhead from just the library itself). Then if a full Nearley parser is still good on resource usage, we could consider forking/cloning it and maintaining it just enough to use in mathjs, as probably it's my favorite to use of all four, with this current parser-combinators one a close second.
(B) The parser-combinators library feels so lightweight that I could just suck a streamlined version of it fully into one source file in our repo (i.e., not use the npm package at all). Again, that would be a pretty simple experiment, to just have our own little combinator library based on parser-combinators, but slimmer, in our source tree, and see if that reduces the bundle size increase at all.
(C) At this point, having been through the entire parser four times now, I could just go ahead and hand-write the most streamlined custom parser I can think of, and see what that does to the bundle size by way of comparison.
Which of (A), (B), (C) do you think I should try next? As before, my efforts on this are on hold until I get some feedback about this. Thanks!
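For concreteness on option (C): a hand-written parser would be a classic recursive-descent affair, roughly along the lines of the fragment below. This is purely an illustrative sketch (made-up token types, tiny grammar), not the actual mathjs parser.

```javascript
// Illustrative recursive-descent fragment in the hand-written style of
// option (C) -- not the actual mathjs parser. Operates on a token array.
function parseExpression(tokens) {
  let pos = 0;
  const peek = () => tokens[pos];
  const next = () => tokens[pos++];

  // addExpr := mulExpr (('+' | '-') mulExpr)*
  function addExpr() {
    let node = mulExpr();
    while (peek() && (peek().type === 'plus' || peek().type === 'minus')) {
      const op = next().value;
      node = { type: 'BinaryOp', op, left: node, right: mulExpr() };
    }
    return node;
  }

  // mulExpr := primary (('*' | '/') primary)*
  // Layering mulExpr under addExpr is what encodes operator precedence.
  function mulExpr() {
    let node = primary();
    while (peek() && (peek().type === 'times' || peek().type === 'divide')) {
      const op = next().value;
      node = { type: 'BinaryOp', op, left: node, right: primary() };
    }
    return node;
  }

  function primary() {
    const t = next();
    if (!t) throw new SyntaxError('Unexpected end of input');
    if (t.type === 'number') return { type: 'Constant', value: Number(t.value) };
    throw new SyntaxError(`Unexpected token ${t.type}`);
  }

  return addExpr();
}

console.log(parseExpression([
  { type: 'number', value: '2' },
  { type: 'plus', value: '+' },
  { type: 'number', value: '3' },
  { type: 'times', value: '*' },
  { type: 'number', value: '4' },
]));
// parses as 2 + (3 * 4): precedence falls out of the rule layering
```

This style carries zero library weight, at the cost of the grammar living implicitly in the control flow rather than in a declarative formalism.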
Oh, good news. I decided I would break it down by the three new libraries, just to see what was going on. Cutting out parser-combinator didn't have a big effect, but putting in a dummy (non-working!) lexer and uninstalling the tokenizr library made a huge difference. Without that package, the bundle size only grows by 20K and the zipped bundle size only grows by 6K, which seem totally acceptable to me (especially with the savings we will eventually get from removing the existing parser).
So actually, the next step is clear: substitute a different or handwritten tokenizer (I really don't mind a handbuilt tokenizer, as they are much simpler pieces of software) and then see how things are. So I am back on the project (not tonight) and will keep you posted. I really wonder what it is about the tokenizr package that led to the blowup: it doesn't at all seem like a heavyweight package. As I said, the entire distribution is only 20K. How could it expand the bundle so much?
So this is very strange. I decided to try writing the tokenizer with parser-combinators as well: why not, we have the parser-combinators package for the parser anyway, so just use the same package to do tokenizing too. It seemed that should be lightweight.
But sadly, that code adds 225K to the bundle and 18K to the zip (!). And what's totally weird is if I cut out the lexer -- which uses the same package as the parser -- but leave in the parser, the bundle size instead only increases a completely acceptable 18K and the zip only 5K. So I totally don't understand what is going on, since the lexer source file is only 13K, and it only imports packages that the parser already imports. There is clearly something I don't comprehend about bundling.
In any case, this new parser-combinators lexer is no good either, so it seems the only alternative is to try a new hand-written tokenizer, which as I said I don't really mind. It's just sort of a pain since even for the tokenizer, I find one of the tokenizer/grammar formalisms clearer and easier to read/write than handwritten code. But it's clearly the next thing to try -- I assume that I can write a very lean tokenizer...
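A hand-written tokenizer really can be a small piece of software. The sketch below shows the general shape such a lexer might take -- a longest-match loop over a table of regexes. The token names and patterns are illustrative placeholders, not the actual nanomath/mathjs token set.

```javascript
// Sketch of a lean hand-written tokenizer (illustrative token set,
// not the actual nanomath/mathjs lexer).
const TOKEN_SPECS = [
  ['space',      /^\s+/],               // skipped, not emitted
  ['number',     /^\d+(?:\.\d+)?/],
  ['identifier', /^[A-Za-z_]\w*/],
  ['plus',       /^\+/],
  ['minus',      /^-/],
  ['times',      /^\*/],
  ['divide',     /^\//],
  ['lparen',     /^\(/],
  ['rparen',     /^\)/],
];

function tokenize(input) {
  const tokens = [];
  let rest = input;
  while (rest.length > 0) {
    let matched = false;
    for (const [type, re] of TOKEN_SPECS) {
      const m = re.exec(rest);
      if (m) {
        if (type !== 'space') tokens.push({ type, value: m[0] });
        rest = rest.slice(m[0].length);
        matched = true;
        break;
      }
    }
    if (!matched) throw new SyntaxError(`Unexpected character: ${rest[0]}`);
  }
  return tokens;
}

console.log(tokenize('2 + 3 * x'));
// → number '2', plus '+', number '3', times '*', identifier 'x'
```

Since everything here is plain JavaScript with no dependencies, its bundle cost is essentially just the source file itself.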
OK, the results are in: with the handwritten lexer and the parser-combinators grammar, the bundle size increases by 90K (just under 14%) and the zipped version by 14K (8%). And if I strip out all the rest of the current parse.js and just return what this nanomath parser produces, we only get back down to an 80K (12.3%) bundle size increase and 10K (6%) increase in the zipped bundle. So that experiment provides a decent estimate of what the "permanent" size increase of the mathjs package would be as a result of changing to a "grammar" for the parser. (I put "grammar" in quotes because the parser-combinator approach is really just a system for organizing the writing of parsers in a functional style -- it's definitely more systematic and readable than the hand-written approach, but it is definitely not as clean as the formal grammar-based approaches of Nearley, peggy, or Ohm.)
Frankly, I am surprised at how much these approaches seem to add.
OK @josdejong I guess it is time to make some decision about how to proceed. To me, almost any of these approaches seem worthwhile -- I sort of can't imagine proceeding without some systematization of the parsing. But I don't have any basis to trade that off versus a certain amount of increase in the bundle size, so that's a call you're going to have to make.
If you think it's OK to go ahead with one of these parsing packages, the question becomes which one. I definitely like the Nearley grammar the best -- it is the clearest and most comfortable to work with. But Nearley is not maintained, so we might not want to shoulder the extra maintenance burden we would take on by adopting it, although I like it nearly enough to be game for that. And Ohm is clearly the most heavyweight of the ones we tried.
So presuming Nearley and Ohm are both out, that leaves peggy vs parser-combinators. Both are up-to-date with their maintenance, and the actual structure of the two parsers turns out to be pretty similar (and a bit different from the Nearley and Ohm parsers, which are similar to each other; this grouping is a result of the parsing approach of each package).
The main pro of peggy over parser-combinators is the clarity/cleanliness of an actual grammar formalism. That's traded off against four advantages on the parser-combinators side: (a) smaller (14K vs 20K bump in the bundle size), (b) faster (on the Chevrotain benchmark page using a JSON parser, 1630 ops/sec vs 1200 ops/sec), (c) able to run on a token stream instead of the raw characters (peggy is devoted to the PEG principle of combined tokenizing and parsing, but personally I find the division into two layers clarifies and simplifies both), and (d) no additional build step -- parser-combinators is pure Javascript.
OK, I look forward to your feedback, and don't plan to proceed further on this until we have some consensus on which way to go. Thanks!
@glen impressive job, thanks for exploring these options! I think these bundle sizes like 6% are indeed acceptable.
In the past I've regularly used source-map-explorer to get a bit of insight into what is inside a JS bundle:
Some thoughts:
OK, I am leery of Ohm because all of the development effort there is going into compiling a grammar into a wasm thingie, and the wasm compiler doesn't have some of the features that we use and it's not clear when it's going to get them, and currently it was definitely the heaviest of all the approaches we tried. I mean, it might get much lighter weight once the wasm compiling gets to the feature level of the original, but it's not clear when that's happening, and still serving the wasm thingie may be a bit of extra mechanics on our end (does our bundler incorporate wasm chunks automagically? I know nothing about bundling). So I'd prefer us to steer away from Ohm.
[To answer your question 2: in terms of my pure syntax preference it was nearley > ohm > peggy > parser-combinators, but they were all fine, so in light of all of the other issues I don't think it makes much difference. Yes, parser-combinators feels a bit more verbose because it's stuck being actual JavaScript syntax, so it uses a lot of function calls, but actually the peggy and parser-combinators grammars are the same number of bytes of source code.]
So based on all the above, merging your views and mine would seem to indicate moving ahead with a peggy parser. As a reminder, that was 78K (12%) added to the unzipped bundle and 20K (11.5%) to the zipped bundle, as compared with increases of 90K (14%) unzipped and 14K (8%) zipped for parser-combinators (interesting that its bundle increases more but shrinks a lot under compression). And we may save 10K bundle/4K zipped when we strip out the existing parser. You're OK at those levels? I'd like to be sure before proceeding, so please let me know.
The reason I'd like to be sure is that I am very strongly against your point 4. I simply do not want to be involved in writing, tuning, and maintaining a grammar-formalism grammar only to have to reimplement it as a "hand parser" and then co-maintain both. That really strikes me as an unpleasant maintenance nightmare, and to me just loses the value of the grammar formalism. If your roadmap for starting down the path of a peggy parser is to likely do that sooner or later, then I am not willing to start down that path. I don't so much mind translating to some other formalism in the future, in case we picked the wrong formalism, but I am not going to enter a state of having a "reference" parser and a "real" parser, nor having switched to a formalism am I going to be willing to go back to a handwritten parser.
That's one of the reasons you might want to give parser-combinators another thought. It basically is just a disciplined way of organizing a hand-written parser. I think that's why, in the Chevrotain benchmarks, it's the fastest of the parsers at its level of power (other than handwritten). The functional style is just a little less efficient in JavaScript than the hand-written parser style, which would be written iteratively rather than functionally, but on the other hand the functional style really exposes the structure of the grammar, nearly as well as a domain-specific language does.
So if the decision were entirely up to me, I'd proceed with parser-combinators as apparently fast and lightweight and scalable, but I am happy instead to proceed with a peggy parser if you're amenable to the position that we might in the future change grammar formalisms but we would not have two parallel parsing implementations and we wouldn't go back to handwritten. Looking forward to your call between peggy and parser-combinators (again holding off til then).
I have a couple comments about proceeding with peggy which I will put in #49, which is about peggy.
So I thought some more about proceeding with peggy, and tried to start on the next couple of steps down that road as outlined in my latest comments in #49. The upshot is that I found the peggy system more cumbersome and less easily adaptable to the mathjs language than I had thought. With some tricks it can be adapted to parsing a tokenized stream, but it will be a little wonky: matching a token will consist of an assertion to check the token type, followed by accepting any "character", which will really be the token. E.g., to accept either a plus token or a minus token, it would look something like `addOp = &{ return tok('plus') } [^] / &{ return tok('minus') } [^]` -- a bit klunky. (Maybe I could write a plugin to syntactic-sugar this...) And I think I would want to make a plugin that would label the return value of each rule with the rule name, to facilitate writing a second pass transforming the raw parse tree into the Node tree we want. (Yes, I know it can be done by annotating each rule with in-line code to generate the Node for that rule, but I find that disrupting the grammar itself with all of those bits of code makes the grammar much more difficult to read and get a picture of what's going on. I would much prefer it if peggy had a way to specify all the rules, and then the action code for those rules that need it in a separate block below. So proposing and trying to get such a feature merged would be another way to go.)

So for me, it seems like one could manage to get all this working with peggy, but it feels like it would be a bit of a struggle -- trying to fit a right foot into a left shoe, that sort of thing.
On the other hand, parser-combinators is lightweight, maintained, and nicely organized (although a bit light on documentation, which we can help with). There were two main objections above that I saw:
Therefore, I would like to strengthen my recommendation to proceed with parser-combinators (either with or without providing a DSL for defining parser-combinators rules). I really think that is the way we should go, while acknowledging that I am willing to go with peggy if you are strongly against parser-combinators. The peggy option will definitely be more effort; at this point, having the proofs-of-concept working, parser-combinators is the path of least resistance. If you end up deciding on parser-combinators, just let me know how far you want me to go with the DSL variant: (a) drop it, (b) do the easy operators like `/`, `*`, and `+` (so the mathjs grammar would end up being roughly half in the DSL and half in "direct parser-combinators"), or (c) go far enough that the entire mathjs grammar is expressed in the DSL, or at least very nearly so.

Thanks, and I continue to look forward to your feedback.
@glen wrote in #51 (comment):
Oh, actually I had time to fully implement the slightly extended peggy-like DSL (it was really quite easy, which I also think speaks well of parser-combinators) so that the entire mathjs grammar can be expressed that way. So if you are willing to go ahead with parser-combinators (potentially for a long term), the choice is just between (A) "straight-up" JavaScript functional-notation use of parser combinators, as in this PR, or (B) a DSL very close to peggy's notation, with a few small extensions, in which to express the grammar as per #52 (which of course would be documented to describe the full grammar for the expressions defining the rules of the mathjs grammar).