Question about testing an AI

I have a research question related to ChatGPT or other LLMs: can ChatGPT answer this set of standardized exam questions? I know, not very interesting. My question to you, though, is this: given that such an AI answers each question differently every time, sometimes wrong and sometimes right, how many times should I ask it EACH question before concluding whether it can answer that question or not (e.g., by averaging the correct rate for each question)? Are there standard guidelines in AI research for that?
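To make the "averaging correct rate" idea concrete, here is a minimal sketch of one way to look at it. It assumes each attempt at a question is an independent right/wrong trial, and it uses a Wilson score interval, which is a general statistics tool and not an established AI-research guideline. The function name `wilson_interval` and the counts below are hypothetical, made up only to illustrate how the uncertainty narrows as the number of attempts grows.

```python
import math


def wilson_interval(correct, trials, z=1.96):
    """Approximate 95% Wilson score interval for a per-question correct rate.

    Assumes every attempt is an independent right/wrong trial.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = correct / trials  # observed correct rate
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)
    ) / denom
    # Clamp to [0, 1] to guard against tiny floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical (correct, trials) counts for one question, not real data.
for correct, trials in [(4, 5), (14, 20), (70, 100)]:
    low, high = wilson_interval(correct, trials)
    print(f"{correct}/{trials} correct: interval [{low:.2f}, {high:.2f}]")
```

With only a handful of attempts the interval is wide, so the number of repetitions needed depends on how precisely the per-question rate has to be pinned down.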

submitted by /u/Substantial-Ad2200