I thought of several ways to achieve the result and found two main options: 1. use video to detect whether someone is moving their mouth, or 2. use audio with an algorithm that can differentiate voices. An important constraint is that it needs to run on a CPU, so it should be as computationally cheap as possible. Is there any pre-existing approach for this purpose? I am familiar with tracking and detection, but for this problem I am a little hazy about which approach to use or which would be best.
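
To make option 1 concrete, here is a rough sketch of the kind of thing I have in mind, not a tested solution. It assumes some face landmark detector (not shown) already gives me four mouth points per frame; the function names, the 0.05 threshold, and the idea of judging a short window of frames are all my own placeholders.

```python
import math
from statistics import pstdev


def mouth_open_ratio(upper_lip, lower_lip, left_corner, right_corner):
    """Lip gap divided by mouth width, so the value does not depend on face size.

    Each argument is an (x, y) point, assumed to come from a landmark detector.
    """
    gap = math.dist(upper_lip, lower_lip)
    width = math.dist(left_corner, right_corner)
    if width == 0:
        return 0.0  # degenerate landmarks; treat as a closed mouth
    return gap / width


def is_speaking(ratios, threshold=0.05):
    """Guess that a window of frames shows speech if the ratio fluctuates enough.

    `threshold` is a hypothetical value and would need tuning on real footage.
    """
    if len(ratios) < 2:
        return False
    return pstdev(ratios) > threshold


# Example: one ratio per frame over a short window
window = [0.10, 0.40, 0.10, 0.40]
print(is_speaking(window))
```

This only does a few arithmetic operations per frame, so the cost on a CPU would mostly come from whichever landmark detector is used. It would also be fooled by chewing or yawning, which is part of why I am unsure it beats the audio route.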