An idea for watermarking source code IP on projects uploaded to GitHub

I'm not going to lie, GitHub is doing well for themselves, but I can't forgive them for the various examples I've seen where developers watched GPT quote code from their private repositories almost verbatim and, of course, with no attribution.

I have thought of ideas including attribution systems for LLMs, and it seems Microsoft and others are on top of this... but I still can't bring myself to trust any of these large corps. Even GitLab, however well they might do, might one day succumb to the larger corporations, and then what happens to any code I thought was private?

So the idea is simple... create a database (decentralized or not is a political debate at this point) which issues unique identifiers and code patterns for your specific code. A filter then randomly inserts this identifier into your source, not only as comments but also as harmless, superfluous code snippets (obviously in non-system-critical pieces of code) that carry your identifier somehow, as a string or even as a specific number stored in a constant or the like. Any future model trained on your code might then reproduce this identifier at inference time, and you will be able to show that you registered it in the database. The db could also be referenced by the corporations, like a do-not-call (DNC) list for IP.
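Here is a rough sketch of what the insertion filter could look like, in Python. Everything in it is made up for illustration: `WatermarkRegistry` is just an in-memory dict standing in for the database, and `watermark_source`, `find_watermarks` and the `wm_` identifier format are hypothetical names, not an existing tool. It only handles Python source, and it says nothing about whether a trained model would actually reproduce the identifier.

```python
import ast
import hashlib
import random
import re
import secrets


class WatermarkRegistry:
    """Hypothetical in-memory stand-in for the identifier database."""

    def __init__(self):
        self._owners = {}

    def issue(self, owner):
        # Assumed identifier format: "wm_" followed by 16 random hex digits.
        wm_id = "wm_" + secrets.token_hex(8)
        self._owners[wm_id] = owner
        return wm_id

    def lookup(self, wm_id):
        return self._owners.get(wm_id)


def watermark_source(source, wm_id, seed=None):
    """Insert one comment and one unused constant carrying wm_id.

    Insert points are chosen only right after complete top-level
    statements, so the result still parses.
    """
    rng = random.Random(seed)
    lines = source.splitlines()
    body = ast.parse(source).body
    # Stay below any `from __future__` import, which must come first.
    start = max(
        (i for i, node in enumerate(body)
         if isinstance(node, ast.ImportFrom) and node.module == "__future__"),
        default=0,
    )
    boundaries = [node.end_lineno for node in body[start:]] or [len(lines)]
    # The constant's name is derived from the identifier to avoid clashes.
    tag = hashlib.sha256(wm_id.encode()).hexdigest()[:8].upper()
    inserts = [
        (rng.choice(boundaries), f"# watermark: {wm_id}"),
        (rng.choice(boundaries), f"_WM_{tag} = {wm_id!r}  # superfluous constant"),
    ]
    # Insert from the bottom up so earlier positions stay valid.
    for pos, text in sorted(inserts, reverse=True):
        lines.insert(pos, text)
    return "\n".join(lines) + "\n"


def find_watermarks(text):
    """Return every identifier-shaped string found in text."""
    return sorted(set(re.findall(r"wm_[0-9a-f]{16}", text)))
```

You would call `issue()` once per project, run `watermark_source()` over each file before pushing, and later run `find_watermarks()` over suspicious model output and pass any hits to `lookup()`. The `seed` argument is only there to make the random placement reproducible.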

I have no plan to code something like this, and I don't even know if it is really viable with how fast everything is moving both legally and technologically, but I do know one thing: I don't feel like giving corps any IP they didn't buy or license from me directly. I don't feel like their AI getting the credit for "creating" or "generating" my code... UNLESS it either gives me attribution or somehow has a way to pay me through an attribution network for the use of that code, if the person receiving it is someone who would pay for this code and use it in their business.

I have no problem with proper business practice. What I have a problem with is being stolen from and lied to about what code is being used to train new models. I know they are "handling it", but this is the same entity that perpetrated such shady doings in the first place. Anywho, I have started the practice of putting anything I dearly want to keep as my own private code into private Git repos, but yeah, you think you know a site... I should have guessed, with Microsoft's track record. Joke is on us, right? Good one, guys.

This one goes out to all my fellow 3 billion data entry slaves!

submitted by /u/enspiralart