
It’s been a while since the last entry, mostly because I have approx. 5 other projects going on concurrently, and I also did some work on side projects that aren’t really noteworthy (except that one of them made me the author of 57% of all cloud-based kubernetes-the-hard-way walkthroughs, nbd)…
But now I FINALLY have something to blog about, even though it’s technically just an update to a previous project, the Parsons problems code block ordering web app.
I’ve been able to add some JavaScript activities, which effectively doubles the number of languages available (from 1 to 2). Most of the work was just figuring out which AST parser to use for JavaScript and then what to do with the results. Once that got sorted out, it was easy to put the output into a data structure similar to the one used for the earlier Python work. The work to compile the actual data is in this repo.
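To give a rough idea of what "parse out the functions and put them in a data structure" can look like, here is a minimal sketch for the Python side using the standard library's `ast` module. The record fields (`language`, `name`, `args`, `lines`) and the `extract_functions` helper are made up for illustration and are not necessarily what the repo uses.

```python
import ast

# Small sample input; in practice this would be the text of a real source file.
SOURCE = '''
def add(a, b):
    """Return the sum."""
    return a + b

class Greeter:
    def greet(self, name):
        message = "Hello, " + name
        return message
'''


def extract_functions(source, language="python"):
    """Return one record per function or method found in `source`.

    The record layout is a hypothetical example, not the repo's actual format.
    """
    tree = ast.parse(source)
    records = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # get_source_segment (Python 3.8+) returns the exact text of the node
            segment = ast.get_source_segment(source, node)
            records.append({
                "language": language,
                "name": node.name,
                "args": [arg.arg for arg in node.args.args],
                # one entry per line, ready to be shuffled for an ordering activity
                "lines": segment.splitlines(),
            })
    return records


if __name__ == "__main__":
    for record in extract_functions(SOURCE):
        print(record["name"], record["args"], len(record["lines"]))
```

Storing each function as a list of lines is one possible choice that suits a block-ordering activity; a JavaScript parser would need to emit the same kind of record for the two languages to share a format.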
A New Project Theme
The BIG news, though, and the main reason for even spending time working out how to parse out functions for different languages, is that I’m now turning more towards actually compiling a corpus of functions in different languages. Since both JavaScript and Python are at the top of the list, it made sense to start with them.
TypeScript is also gaining popularity, so that’s my next target to add data for. And though it’s not listed in the above-linked GitHub list, a personal rising favorite of mine is Golang, so I’m going to add function corpus data for that later this year as well.
What struck me as very surprising, however, is that I can’t seem to find any mention of existing research into how to compile a corpus of software code. A lot of the existing work analyzing source code on GitHub seems to be geared towards either visualization or pure stylistic description within a single project, rather than actually generalizing across different repositories.
So creating a corpus of source code is basically uncharted territory and will involve a lot of decisions (how big is “big enough”, where should we store the data, how should we represent the data, what structure should it be in, etc.). In addition, nobody (to my knowledge) has looked at using a corpus as an explicit tool to generate authentic learning materials, so this also shows great promise as an avenue of exploration.
Autogenerated learning materials were one of the original ideas behind “micromaterials”, so having a dataset to start compiling them from is very exciting, and I’m hoping to have preliminary results later this year.