It’s been a while since the last entry, mostly because I have roughly five other projects going on concurrently, plus some side-project work that isn’t really noteworthy (except that one of them made me the author of 57% of all cloud-based kubernetes-the-hard-way walkthroughs, nbd)…
But now I FINALLY have something to blog about, even though it’s technically just an update to a previous project: the Parsons problems code block ordering web app.
A New Project Theme
TypeScript is also gaining popularity, so that’s my next target to add data for. And though it doesn’t appear in the above-linked GitHub list, a personal rising favorite of mine is Golang, so I’m going to add function corpus data for that later this year as well.
What surprised me, however, is that I can’t seem to find any existing research into how to compile a corpus of software code. Most of the existing work analyzing source code on GitHub seems geared towards either visualization or pure stylistic description within a single project, rather than generalizing across different repositories.
So creating a corpus of source code is basically uncharted territory and will involve a lot of decisions (how big is “big enough”, where should we store the data, how should we represent the data, what structure should it be in, etc.). In addition, nobody (to my knowledge) has looked at using a corpus as an explicit tool for generating authentic learning materials, so this also shows great promise as an avenue of exploration.
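To make the “how should we represent the data” question concrete, here’s a minimal sketch of what one corpus-compilation step might look like: walking a directory of source files and extracting each function as a structured record. This is purely illustrative, not a settled design — the choice of Python, the record fields (`file`, `name`, `args`, `source`), and the flat-JSON representation are all assumptions I’d revisit for the real corpus.

```python
# Illustrative sketch: extract top-level function definitions from
# Python files under a directory into JSON-serializable records.
# The schema here (file, name, args, source) is a placeholder choice.
import ast
import json
from pathlib import Path


def extract_functions(root: str) -> list[dict]:
    """Walk a directory tree and collect every function definition."""
    records = []
    for path in Path(root).rglob("*.py"):
        source = path.read_text(encoding="utf-8", errors="ignore")
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue  # skip files that don't parse
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                records.append({
                    "file": str(path),
                    "name": node.name,
                    "args": [a.arg for a in node.args.args],
                    "source": ast.get_source_segment(source, node),
                })
    return records


if __name__ == "__main__":
    corpus = extract_functions(".")
    print(json.dumps(corpus[:3], indent=2))
```

Even a toy like this forces the open questions into view: do we keep whole functions or just signatures, do we deduplicate across repositories, and is a flat list of JSON records “structure” enough?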
Autogenerated learning materials were one of the original ideas behind “micromaterials”, so a dataset for compiling them is very exciting, and I’m hoping to have preliminary results later this year.