WIP:parser_combinator #51

Draft
glen wants to merge 13 commits from parser_combinator into main
Owner


The final entry in the series (#47, #49, #50) exploring parsing packages for the expression language. This one still uses the Tokenizr lexer, and then parses the resulting token stream with [parser-combinator](https://github.com/michalusio/Parser). For the initial fragment implemented so far, parser-combinator feels very clean and simple, so I remain hopeful that this may be the best option yet.
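For anyone unfamiliar with the approach: the Tokenizr lexer turns the source string into a token array, and small composable parser functions then consume that array. The sketch below is a hypothetical, self-contained illustration of the combinator style, not the actual parser-combinator package API (every name in it is made up for illustration):

```javascript
// A parser here is a function: (tokens, pos) -> { ok, value, pos } | { ok: false }
const fail = { ok: false };

// Match a single token of the given type.
const tok = (type) => (ts, pos) =>
  ts[pos] && ts[pos].type === type
    ? { ok: true, value: ts[pos].value, pos: pos + 1 }
    : fail;

// Run parsers in sequence, collecting their results.
const seq = (...ps) => (ts, pos) => {
  const values = [];
  for (const p of ps) {
    const r = p(ts, pos);
    if (!r.ok) return fail;
    values.push(r.value);
    pos = r.pos;
  }
  return { ok: true, value: values, pos };
};

// Try alternatives in order, returning the first success.
const alt = (...ps) => (ts, pos) => {
  for (const p of ps) {
    const r = p(ts, pos);
    if (r.ok) return r;
  }
  return fail;
};

// Post-process a successful parse result.
const map = (p, f) => (ts, pos) => {
  const r = p(ts, pos);
  return r.ok ? { ok: true, value: f(r.value), pos: r.pos } : fail;
};

// Tiny grammar fragment: addition over numbers, e.g. "1 + 2".
const num = map(tok('number'), (v) => ({ type: 'Constant', value: v }));
const addition = alt(
  map(seq(num, tok('plus'), num),
      ([l, , r]) => ({ type: 'Add', left: l, right: r })),
  num
);

// Tokens as a lexer like Tokenizr might emit them for "1 + 2":
const tokens = [
  { type: 'number', value: 1 },
  { type: 'plus', value: '+' },
  { type: 'number', value: 2 },
];
// Logs an Add node with two Constant children.
console.log(JSON.stringify(addition(tokens, 0).value));
```

The appeal is that the grammar reads almost like BNF: each nonterminal is just a named combinator expression.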
feat: start a lexer for the expression language
Some checks failed
/ test (pull_request) Failing after 16s
59497f5113
feat: more token types
Some checks failed
/ test (pull_request) Failing after 17s
89ce689c67
feat: first full draft of lexer with decent test coverage
Some checks failed
/ test (pull_request) Failing after 18s
ccad800be8
wip: partway through Nearley grammar, switch to parser-combinator
Some checks failed
/ test (pull_request) Failing after 19s
5a9e01e650
feat: parser-combinator parser for a fragment of language
All checks were successful
/ test (pull_request) Successful in 17s
ebebd1146b
feat: clearest definition of Block yet
All checks were successful
/ test (pull_request) Successful in 17s
62dbbdcb04
Author
Owner

OK, the parser created via parser-combinators seems to be complete; I will be migrating the parser tests from ohm/peggy to this branch soon.

test: incorporate parser tests, many still fail
Some checks failed
/ test (pull_request) Failing after 17s
cd11ec0db0
fix: Improved string parser with better error messages
Some checks failed
/ test (pull_request) Failing after 17s
cbfc9c3b03
feat: parser-combinator passes all tests from prior packages
All checks were successful
/ test (pull_request) Successful in 17s
fb608b095c
Author
Owner

OK, the Tokenizr lexer and parser-combinator parser are now passing all of the tests from the prior series of parsing packages, including ones that did not pass in the previous rounds. While there are still more tests from mainline mathjs to transfer here, this milestone means this parser is mature enough that its code weight is not going to increase appreciably from where it is now. Therefore, the next step is to test how much this particular lexer/parser combination increases the mathjs bundle.

glen changed title from parser_combinator to WIP:parser_combinator 2026-02-14 21:58:44 +00:00
Author
Owner

All right, @josdejong, I totally don't understand it: the parser code file is only 20K (less than half the size of the current mathjs parser), the entire parser-combinators library I am using is 20K, the tokenizr library is 20K, and the `unraw` string-escape interpreter package is 10K, which adds up to only 70K. Yet the mathjs bundle increases by a whopping 150K (23%) when I just install it in mainline mathjs and call its parse function from parseStart. (I checked that there are no other new dependencies: these are all of the new files.)

How is that even possible? I would have thought that 70K would have been the maximum possible increase, since that's the sum of the sizes of all of the new files...

Anyhow, the zipped bundle increases by a lot less in this case, only 20K (11.4%) for a total of 195K.

I'm assuming that's still more than you're willing to devote to a new parser...

I'm really discouraged and frustrated. I was quite happy with this latest parser, it seems very clean and simple. And I assumed that since the packages it is using are very small that it would be the least size increase yet. Alas.

So here are some things I could try:

(A) Go ahead and get the Nearley parser working even though Nearley is not maintained. I was pretty close to done when I realized that Nearley is not maintained, so it wouldn't be a lot of work. That way we could see how much it adds with a more substantial grammar (so far I only measured it with a trivial grammar, and that was very lightweight, so there's clearly not a lot of overhead from just the library itself). Then if a full Nearley parser is still good on resource usage, we could consider forking/cloning it and maintaining it just enough to use in mathjs, as probably it's my favorite to use of all four, with this current parser-combinators one a close second.

(B) The parser-combinators library feels so lightweight that I could just suck a streamlined version of it fully into one source file in our repo (i.e., not use the npm package at all). Again, that would be a pretty simple experiment, to just have our own little combinator library based on parser-combinators, but slimmer, in our source tree, and see if that reduces the bundle size increase at all.

(C) At this point, having been through the entire parser four times now, I could just go ahead and hand-write the most streamlined custom parser I can think of, and see what that does to the bundle size by way of comparison.

Which of (A), (B), (C) do you think I should try next? As before, my efforts on this are on hold until I get some feedback about this. Thanks!

Author
Owner

Oh, good news. I decided I would break it down by the three new libraries, just to see what was going on. Cutting out parser-combinator didn't have a big effect, but putting in a dummy (non-working!) lexer and uninstalling the tokenizr library made a huge difference. Without that package, the bundle size only grows by 20K and the zipped bundle size only grows by 6K, which seem totally acceptable to me (especially with the savings we will eventually get from removing the existing parser).

So actually, the next step is clear: substitute a different or handwritten tokenizer (I really don't mind a handbuilt tokenizer, as they are much simpler pieces of software) and then see how things are. So I am back on the project (not tonight) and will keep you posted. I really wonder what it is about the tokenizr package that led to the blowup: it doesn't at all seem like a heavyweight package. As I said, the entire distribution is only 20K. How could it expand the bundle so much?

refactor: implement tokenizer in parser-combinator also
Some checks failed
/ test (pull_request) Failing after 17s
a2a7eb5d1e
fix: adjust new lexer-parser combo to pass all prior tests
Some checks failed
/ test (pull_request) Failing after 17s
5edb7ebe6c
Author
Owner

So this is very strange. I decided to try writing the tokenizer with parser-combinators as well: why not, since we have the parser-combinators package for the parser anyway, just use the same package to do the tokenizing too? It seemed like that should be lightweight.

But sadly, that code adds 225K to the bundle and 18K to the zip (!). And what's totally weird is if I cut out the lexer -- which uses the same package as the parser -- but leave in the parser, the bundle size instead only increases a completely acceptable 18K and the zip only 5K. So I totally don't understand what is going on, since the lexer source file is only 13K, and it only imports packages that the parser already imports. There is clearly something I don't comprehend about bundling.

In any case, this new parser-combinators lexer is no good either, so it seems the only alternative is to try a new hand-written tokenizer, which as I said I don't really mind. It's just sort of a pain since even for the tokenizer, I find one of the tokenizer/grammar formalisms clearer and easier to read/write than handwritten code. But it's clearly the next thing to try -- I assume that I can write a very lean tokenizer...
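To make the "very lean tokenizer" idea concrete, here is a minimal hand-rolled sketch of the kind of thing meant; the rule set and token names are assumptions for illustration, not the actual expression-language grammar:

```javascript
// A minimal hand-written tokenizer sketch for an expression-language fragment.
// The token types and rules below are illustrative only.
function tokenize(input) {
  const tokens = [];
  let pos = 0;
  // Anchored regex rules tried in order; a null type means "skip the match".
  const rules = [
    [/^\s+/, null],                        // whitespace
    [/^\d+(\.\d+)?/, 'number'],
    [/^[A-Za-z_][A-Za-z0-9_]*/, 'identifier'],
    [/^[+\-*\/^()]/, 'operator'],
  ];
  outer: while (pos < input.length) {
    for (const [re, type] of rules) {
      const m = re.exec(input.slice(pos));
      if (m) {
        if (type) tokens.push({ type, value: m[0], pos });
        pos += m[0].length;
        continue outer;
      }
    }
    throw new SyntaxError(`Unexpected character at ${pos}: ${input[pos]}`);
  }
  return tokens;
}

console.log(tokenize('2 + 3*x').map((t) => `${t.type}:${t.value}`).join(' '));
```

A table of regex rules tried in order keeps this almost as declarative as a lexer-generator spec, while pulling in zero dependencies.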

Some checks failed
/ test (pull_request) Failing after 17s
This pull request is marked as a work in progress.
View command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin parser_combinator:parser_combinator
git switch parser_combinator

Reference
StudioInfinity/nanomath!51