Best practices for labelling/benchmarking RAG chatbot systems?

Hi.
Are there any best practices for developing, and especially labelling/benchmarking, RAG chatbot systems? We have found that having humans label every new version/architecture of the chatbot is demanding and time-consuming in a company environment, but automation has not really worked for us.