Updates on Highlighting and a Pythongorean Theorem

So as is surely very apparent, I still haven’t fully resolved the issues with highlighting the singular/countable nouns in the Regular mode of Grammarbuffet. I thought I would mention a few of the issues that are complicating matters, just in case anyone else is dealing with/has dealt with something similar:

Part-of-Speech Tagging

For some strange reason, none of the POS modules for Node.js (which is what the backend of the app is written in) is particularly good. The main problem is that they tend to make strange tagging decisions: some nouns aren't tagged as nouns, and some words from other word classes ARE tagged as nouns. This is clearly unhelpful, and though I've been thinking of inserting logic to deal with it (e.g., only caring about singular/countable noun tags within 3-4 words of the articles), that's unpredictable and a less clean solution than I would like.
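To give a rough idea of what I mean, here's a quick sketch of that article-window logic (in Python rather than the actual Node backend, and with a made-up helper name), assuming the tagger hands back (word, tag) pairs using Penn Treebank-style tags:

```python
# Hypothetical helper: only trust singular-noun (NN) tags that fall within a
# few words of an article. Assumes Penn Treebank-style (word, tag) pairs.
ARTICLES = {"a", "an", "the"}

def nouns_near_articles(tagged_tokens, window=4):
    """Return NN-tagged words appearing within `window` tokens after an article."""
    hits = []
    for i, (word, tag) in enumerate(tagged_tokens):
        if word.lower() in ARTICLES:
            for nxt_word, nxt_tag in tagged_tokens[i + 1 : i + 1 + window]:
                if nxt_tag == "NN":
                    hits.append(nxt_word)
    return hits

# nouns_near_articles([("I", "PRP"), ("saw", "VBD"), ("a", "DT"),
#                      ("very", "RB"), ("old", "JJ"), ("dog", "NN")])
# -> ["dog"]
```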

Obviously, even the best POS taggers don't reach 100% accuracy (Google's new Parsey McParseface is only at 94%), but even so, the Node modules seem far from that. A bigger problem is that they like to break words up into smaller pieces (I'm -> I / 'm) and then tag each piece individually, which wouldn't be an issue except that I'm splitting all the text into words at the whitespace between them. So at the moment I'm getting two different counts of the number of "things" in a paragraph, for example.
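The backend is Node, but NLTK's tokenizer does the same sort of splitting, so a quick Python illustration of the mismatch (assuming NLTK and its "punkt" tokenizer data are installed) looks like this:

```python
import nltk  # pip install nltk; then nltk.download("punkt")

text = "I'm highlighting the nouns in a paragraph."

whitespace_words = text.split()           # what my word count is based on
tagger_tokens = nltk.word_tokenize(text)  # what the tagger actually sees

print(len(whitespace_words))  # 7
print(len(tagger_tokens))     # 9 -- "I'm" becomes "I" + "'m", and the final "." is split off too
```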

I suppose this could be addressed programmatically by just stripping out the apostrophes, such that "I'm" becomes "Im"; given that "Im" isn't one of the singular/countable nouns I'm targeting anyway, it wouldn't matter whether it's tagged correctly, and it would preserve the correct word count. This is the current line of attack, and I hope to have something sorted out later this month.
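As a minimal sketch of the idea (again in Python, with a hypothetical helper name), it would just mean stripping the apostrophes before tokenizing, so the tagger has no contractions left to split:

```python
import re
import nltk  # same assumptions as above

def strip_apostrophes(text):
    # Collapse contractions ("I'm" -> "Im"); handles both the straight
    # apostrophe and the curly right single quote.
    return re.sub(r"['\u2019]", "", text)

cleaned = strip_apostrophes("I'm sure it's fine")
print(cleaned)                           # Im sure its fine
print(len(cleaned.split()))              # 4
print(len(nltk.word_tokenize(cleaned)))  # 4 -- the counts line up again
```

In the long term, however, I think I might need to move to something with slightly better POS tagging (and language processing support in general), which leads me to…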

Python

I haven’t really messed around with Python much, beyond the very basic tutorials and setting up super simple websites, but I think I might need to move to building back-ends in Python, mostly because its NLP ecosystem is vastly superior to what’s available for any other programming language out there (see: TensorFlow, which is what Parsey McParseface has under the hood, and NLTK, widely recognized as one of the best overall NLP packages).
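For what it’s worth, getting a tagged sentence out of NLTK is only a couple of lines (it needs the "punkt" and "averaged_perceptron_tagger" data packages downloaded first), and the output looks roughly like this:

```python
import nltk
# nltk.download("punkt")
# nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The student wrote an essay about a very old dog.")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('student', 'NN'), ('wrote', 'VBD'), ('an', 'DT'),
#  ('essay', 'NN'), ('about', 'IN'), ('a', 'DT'), ('very', 'RB'),
#  ('old', 'JJ'), ('dog', 'NN'), ('.', '.')]
```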

So going forward, the short-term fix might be to throw in some tricksy logic, but in the long term, if I’m serious about integrating NLP into the text-based micromaterials, I think I’m going to have to dig much deeper into Python.
