Recommendations for code documentation generation?

I just joined a team where our current assignment is to take a project that our company has just acquired and get it up to snuff with our internal standards. The project we acquired was a single developer's day job for around a decade, and I guess because he was a solo developer, he rarely saw any need to comment his code, since he was the one who wrote it-- I'm sure we all can relate to a certain extent.

Anyways, the scope of this project is pretty large-- 1500 source files, around 500K LOC. I wrote a quick script to count and categorize all of the text characters in the project in order to give context for this post; 97% of all meaningful (non-whitespace, non-semicolon, etc.) characters in the project are source code tokens, while the remaining 3% are comments. There is little to no documentation external to the project, either-- whatever insight is to be gleaned here is found solely within the 1500 source files we've received.
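
For anyone who wants to take a similar measurement on their own codebase, here is a rough sketch in Python. It is not the script behind the numbers above; the function name is made up for illustration, it assumes Java-style `//` and `/* */` comments, it only excludes whitespace (not semicolons or other punctuation), and it does not handle Java text blocks (`"""`).

```python
def count_code_and_comment_chars(source: str) -> tuple[int, int]:
    """Return (code_chars, comment_chars), ignoring whitespace and comment delimiters."""
    code = comment = 0
    state = "code"  # "code", "line", "block", or the quote character of an open literal
    i = 0
    while i < len(source):
        ch, pair = source[i], source[i:i + 2]
        step = 1
        if state == "code":
            if pair == "//":
                state, step = "line", 2
            elif pair == "/*":
                state, step = "block", 2
            else:
                if ch in "\"'":
                    state = ch  # remember which quote closes this literal
                if not ch.isspace():
                    code += 1
        elif state == "line":
            if ch == "\n":
                state = "code"
            elif not ch.isspace():
                comment += 1
        elif state == "block":
            if pair == "*/":
                state, step = "code", 2
            elif not ch.isspace():
                comment += 1
        else:  # inside a string or char literal
            if ch == "\\":
                code += len(pair)  # count the backslash and the escaped character
                step = 2
            else:
                if ch == state:
                    state = "code"
                if not ch.isspace():
                    code += 1
        i += step
    return code, comment
```

Summing the two counts over every source file and dividing gives a code/comment split comparable to the one quoted above, though the exact percentages depend on what you choose to exclude.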

That being said, the source code itself is relatively self-explanatory-- the few comments found throughout the project are almost all found near code snippets of high complexity, and everywhere else the author tended to use verbose and descriptive naming patterns for variables, methods, classes etc.

So, given all of that, I think it might be a decent idea to try to offload some of the documentation effort onto some sort of LLM or AI service. The capabilities that I believe will be most important in my company's case are the following:

  • The tool should be capable of either (A) writing inline comments (Javadoc style for methods, line comments wherever else is appropriate) in the source code itself, or (B) writing external "sidecar" files (e.g. a source file located at com/example/SomeClass.java would have a "sidecar" documentation file generated at com/example/SomeClass.md).

  • The tool should be capable of finding references to symbols across the project to glean more context as to what they are used for-- I'm imagining the tool looking at some method with a non-descriptive name and opaque functionality, then finding a common pattern in the contexts where the method is called, and using that common pattern to inform whatever documentation the tool writes for the method.

  • The tool should be capable of taking git history and/or commit messages into account, as for this project specifically it seems a great deal of the "story" behind the source code is recorded in its git history.
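
To make the three requirements above a bit more concrete, here is a minimal Python sketch of the context-gathering side. All function names are hypothetical. The reference search is a plain regex over `*.java` files, so it is only a textual approximation of real symbol resolution (it also matches the method's own declaration and same-named methods on unrelated classes). The history helper assumes `git` is on the PATH and that `source_file` is given relative to the repository root.

```python
import re
import subprocess
from pathlib import Path


def sidecar_path(source_file: Path) -> Path:
    """Map com/example/SomeClass.java to com/example/SomeClass.md."""
    return source_file.with_suffix(".md")


def find_call_sites(root: Path, method_name: str, context: int = 2) -> list[str]:
    """Collect text snippets around every line that looks like a call to method_name."""
    pattern = re.compile(r"\b" + re.escape(method_name) + r"\s*\(")
    snippets = []
    for java_file in sorted(root.rglob("*.java")):
        lines = java_file.read_text(encoding="utf-8", errors="replace").splitlines()
        for idx, line in enumerate(lines):
            if pattern.search(line):
                start, end = max(0, idx - context), idx + context + 1
                header = f"{java_file}:{idx + 1}"  # 1-based line number of the match
                snippets.append(header + "\n" + "\n".join(lines[start:end]))
    return snippets


def commit_subjects(repo: Path, source_file: Path, limit: int = 20) -> list[str]:
    """Return short hash plus subject line for recent commits touching one file."""
    result = subprocess.run(
        ["git", "-C", str(repo), "log", "--follow", f"--max-count={limit}",
         "--format=%h %s", "--", str(source_file)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()
```

The snippets and commit subjects returned here are meant to be pasted into whatever prompt the documentation step uses; a proper Java parser would give more reliable references than the regex does.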

So, with all that being said, my question: Where might I find a tool with some (or hopefully all) of my desired capabilities? If no such tool exists, my company isn't averse to spending some developer hours on writing this for us (e.g. some script that combines a source code parser with OpenAI API calls to generate the outputs that we are looking for), but in the event that anyone has gone down this seemingly dark and uncharted road, I'd be interested in hearing whatever advice or tips you guys might have for me.
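
For what the in-house option might look like, here is a rough Python sketch of the glue code only. Everything in it is an assumption for illustration: the function names, the prompt wording, and the `complete` parameter, which stands in for whichever LLM client ends up being used (it is just a callable that takes a prompt string and returns the generated text, so no specific vendor API is implied).

```python
from pathlib import Path
from typing import Callable, Iterable


def build_prompt(source: str, call_sites: list[str], commits: list[str]) -> str:
    """Assemble one prompt from the source file plus optional project context."""
    sections = [
        "Write Markdown documentation for the Java source file below.",
        "Describe only behavior that the code or the context supports.",
        "## Source\n" + source,
    ]
    if call_sites:
        sections.append("## Call sites elsewhere in the project\n" + "\n\n".join(call_sites))
    if commits:
        sections.append("## Commit subjects touching this file\n" + "\n".join(commits))
    return "\n\n".join(sections)


def write_sidecar(
    source_file: Path,
    complete: Callable[[str], str],  # hypothetical wrapper around the chosen LLM service
    call_sites: Iterable[str] = (),
    commits: Iterable[str] = (),
) -> Path:
    """Generate documentation for one file and save it next to the source as .md."""
    source = source_file.read_text(encoding="utf-8")
    prompt = build_prompt(source, list(call_sites), list(commits))
    target = source_file.with_suffix(".md")
    target.write_text(complete(prompt), encoding="utf-8")
    return target
```

Keeping the model call behind a plain callable means the same loop can be dry-run with a stub before any API budget is spent, and the generated sidecar files can go through normal code review like any other change.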

Cheers


Crossposts:

r/LocalLLaMA

submitted by /u/I_Lift_for_zyzz