I have the best car; it can go from zero to sixty in under 4 seconds.
No, no, my car is the best - it can carry seven people.
Wait, mine is better than both of yours - it gets 40+ MPG.
As a performance guy at VMware, I get involved in lots of discussions comparing physical and virtual performance. What might seem like a simple thing to do can actually be quite difficult to get right, and in many cases it really is like comparing apples and oranges.
Getting performance tests configured so that the physical and virtual configurations match almost always involves using settings that you would not normally use if you weren't doing a performance comparison.
Current generation servers commonly have more than eight cores (and with hyperthreading it's double that number of logical cores). The maximum number of vCPUs that a VM can have today is eight. This means the physical system has to somehow be limited to eight or fewer cores to enable a comparison. This can be done in the BIOS, in the OS, or through physical reconfiguration of the hardware (removing processors). It gets more complex when you also have to factor in the architecture of the server. All new x86-based servers use NUMA architectures, where performance can be affected by how an application or VM is or isn't spread across the NUMA nodes. When you limit a system to a subset of its processors, you have to consider how this is done with respect to the NUMA nodes.
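For example, before disabling cores it helps to see how the CPUs you plan to keep line up with the NUMA nodes. The sketch below is a minimal illustration that only assumes a Linux system exposing the standard sysfs NUMA layout; it is not a VMware tool, and the paths are the usual Linux ones rather than anything vendor-specific.

```python
#!/usr/bin/env python3
"""Show which logical CPUs belong to which NUMA node via Linux sysfs.

A minimal sketch for sanity-checking a test host before limiting it to a
subset of its processors; it only reads /sys/devices/system/node, which is
assumed to exist on a typical Linux install.
"""
import glob
import os

def numa_topology():
    topology = {}
    for node_dir in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
        node = os.path.basename(node_dir)
        # cpulist contains a range string like "0-3,8-11" for the node's CPUs
        with open(os.path.join(node_dir, "cpulist")) as f:
            topology[node] = f.read().strip()
    return topology

if __name__ == "__main__":
    for node, cpus in numa_topology().items():
        print(f"{node}: CPUs {cpus}")
```

If the eight cores you leave enabled end up split unevenly across nodes, the "physical" side of the comparison is already behaving differently than a default configuration would.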
It goes the other way as well. The size of the VM is increased to 4 or 8 vCPUs to be able to compare its performance against physical, but when the VM is run in production it will be a 1 or 2 vCPU system. The scalability of the application itself can become the gating factor if it doesn't do as well with more cores (the sketch below shows why). It is also common to run a single VM on a host for these comparisons. A single VM on a host doesn't make a lot of sense in most cases, so this becomes another aspect of the test that isn't reflective of how things will be run in production.
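To make the scaling point concrete, here is a back-of-the-envelope Amdahl's law calculation. The 90%-parallel figure is purely an assumption for illustration, not a measurement of any particular application.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup when only part of the work benefits from extra cores."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

# Assume 90% of the application's work is parallelizable (illustrative only).
for cores in (1, 2, 4, 8):
    print(f"{cores} vCPUs -> {amdahl_speedup(0.9, cores):.2f}x speedup")

# 8 vCPUs yield roughly 4.7x, not 8x, so a result measured at 8 vCPUs says
# little about how the same application behaves as a 1 or 2 vCPU VM.
```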
After all of this, it is possible to get a test that allows for a decent comparison if things are done carefully. The problem is that this isn't how the workload would actually be run. While these tests are great for getting a basic idea of performance differences, they don't really take into account the bigger picture of why you virtualize in the first place.
It would probably be much better to run a test with your applications in an environment that is as close as possible to how you would run them in production. Base the results and analysis on end-user experience or response-time measurements (a simple example of the latter is sketched below). I know that this is usually not practical to do, which is why we end up doing these other tests. Just remember to take the constraints of a test into account when looking at the results.
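As one illustration of what a response-time measurement can look like, the sketch below times a simple request against a URL and reports the average and 95th percentile latency. The URL and sample count are placeholders; substitute whatever end-user transaction your application actually serves.

```python
import time
import urllib.request
from statistics import mean, quantiles

URL = "http://app.example.com/login"   # placeholder transaction, not a real endpoint
SAMPLES = 50                           # illustrative sample size

def measure(url, samples):
    """Time a simple end-user transaction and return per-request latencies."""
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        urllib.request.urlopen(url).read()
        latencies.append(time.perf_counter() - start)
    return latencies

if __name__ == "__main__":
    lat = measure(URL, SAMPLES)
    p95 = quantiles(lat, n=20)[-1]     # 95th percentile
    print(f"avg {mean(lat)*1000:.1f} ms, p95 {p95*1000:.1f} ms")
```

The point isn't the specific numbers but that the metric reflects what users actually feel, rather than how a constrained benchmark configuration behaves.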
If you were buying a car to carry your family of five, with luggage, on vacations every summer, then your definition of performance probably wouldn't rely too heavily on the 0-60 time (although we would all probably still want to know what that time was).