A few years ago there was a considerable uproar when numerous major manufacturers were caught cheating on benchmarks. OEMs of all sizes (including Samsung, HTC, Sony, and LG) took part in this arms race of attempting to fool users without getting caught, but thankfully they eventually stopped their benchmark cheating after some frank discussions with industry experts and journalists.
Back in 2013, it was discovered that Samsung was artificially boosting its GPU clock speeds in certain applications, sparking a series of investigations into benchmark cheating across the whole range of manufacturers. At the time, the investigation found that almost every manufacturer except for Google/Motorola was engaging in benchmark cheating. They were all investing time and money into attempts to eke a little bit of extra performance out of their phones in benchmarks, in ways that wouldn't have any positive effect on everyday usage, in an attempt to fool users into thinking that their phones were faster than they actually were. These development efforts ran the whole gamut, from setting clock speed floors, to forcing the clock speeds to their maximum settings, to even creating special higher power states and special clock speeds that were only available when benchmarking, with these efforts often resulting in just a couple of percentage points' increase in benchmark scores.
There was substantial outrage when it was discovered, as these attempts at benchmark cheating ran counter to the very point of the benchmarks themselves. Most benchmarks aren't there to tell you the theoretical maximum performance of a phone in lab conditions that aren't reproducible in day to day use, but rather they are there to give you a point of reference for real world comparisons between phones. After a bit of a public berating (and some private conversations) from technology publications, industry leaders, and the general public, most manufacturers got the message that benchmark cheating was simply not acceptable, and stopped as a result. Most of the few that didn't stop at that point stopped soon after, as there were substantial changes made to how many benchmarks run, in an attempt to discourage benchmark cheating (by reducing the benefit from it). Many benchmarks were made longer so that the thermal throttling from maximizing clock speeds would become immediately apparent.
When we interviewed John Poole, the creator of Geekbench, the topic of benchmark cheating and what companies like Primate Labs can do to prevent it came up. Primate Labs in particular made Geekbench 4 quite a bit longer than Geekbench 3, in part to reduce the effects of benchmark cheating: shrinking the benefits helps ensure that the development costs of benchmark cheating aren't worth it.
"The problem is that once we have these large runtimes if you start gaming things by ramping up your clock speeds or disabling governors or something like that, you're going to start putting actual real danger in the phone. … If you're going to game it … you won't get as much out of it. You might still get a couple percent, but is it really worth it?" – John Poole
What Happened
Unfortunately, we must report that some OEMs have started cheating again, meaning we should be on the lookout once more. Thankfully, manufacturers have become increasingly responsive to issues like this, and with the right attention being drawn to it, this can be fixed quickly. It is a bit shocking to see manufacturers implementing benchmark cheating in light of how bad the backlash was last time it was attempted (with some benchmarks completely excluding cheating devices from their performance lists). With that backlash contrasting against how tiny the performance gains from benchmark cheating typically are (with most of the attempts resulting in less than a 5% score increase last time), we had truly hoped that this would all be behind us.
The timing of this attempt is especially inopportune, as a couple months ago benchmark cheating left the world of being purely an enthusiast concern, and entered the public sphere when Volkswagen and Fiat Chrysler were both caught cheating on their emissions benchmarks. Both companies implemented software to detect when their diesel cars were being put through emissions testing, and had them switch into a low emissions mode that saw their fuel economy drop, in an attempt to compete with gasoline cars in fuel efficiency while still staying within regulatory limits for emissions tests. So far the scandal has resulted in billions in fines, tens of billions of recall costs, and charges being laid — certainly not the kind of retribution OEMs would ever see for inflating their benchmark scores, which are purely for user comparisons and are not used for measuring any regulatory requirements.
While investigating how Qualcomm achieves faster app opening speeds on the then-new Qualcomm Snapdragon 821, we noticed something strange on the OnePlus 3T that we could not reproduce on the Xiaomi Mi Note 2 or the Google Pixel XL, among other Snapdragon 821 devices. Our editor-in-chief, Mario Serrafero, was using Qualcomm Trepn and the Snapdragon Performance Visualizer to monitor how Qualcomm "boosts" the CPU clock speed when opening apps, and noticed that certain apps on the OnePlus 3T were not falling back down to their normal idling speeds after opening. As a general rule of thumb, we avoid testing benchmarks with performance monitoring tools open whenever possible due to the additional performance overhead that they bring (particularly on non-Snapdragon devices, where there are no official desktop tools). In this instance, however, they helped us notice some strange behavior that we likely would have missed otherwise.
When entering certain benchmarking apps, the OnePlus 3T's cores would stay above 0.98 GHz for the little cores and 1.29 GHz for the big cores, even when the CPU load dropped to 0%. This is quite strange, as normally both sets of cores drop down to 0.31 GHz on the OnePlus 3T when there is no load. Upon first seeing this we were worried that OnePlus' CPU scaling was simply set a bit strangely, but upon further testing we came to the conclusion that OnePlus must be targeting specific applications. Our hypothesis was that OnePlus was targeting these benchmarks by name, and was entering an alternate CPU scaling mode to pump up their benchmark scores. One of our main concerns was that OnePlus was possibly setting looser thermal restrictions in this mode in order to avoid the problems they had with the OnePlus One, OnePlus X, and OnePlus 2, where the phones were handling the additional cores coming online for the multi-core section of Geekbench poorly, and occasionally throttling down substantially as a result (to the point where the OnePlus X sometimes scored lower in the multi-core section than in the single-core section). You can find heavy throttling in our OnePlus 2 review, where we found the device could shed up to 50% of its Geekbench 3 multi-core score. Later, when we began comparing throttling and thermals across devices, the OnePlus 2 became a textbook example of what OEMs should avoid.
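To make the idle-floor observation concrete, here is a minimal Python sketch of the kind of check described above. It is not the tooling we used (that was Qualcomm Trepn and the Snapdragon Performance Visualizer). The function names, the 5% idle threshold, the 0.05 GHz tolerance, and the under-load sample values are all assumptions made for illustration; only the 0.31 GHz and 1.29 GHz idle figures come from our observations.

```python
def idle_floor_ghz(samples, idle_load_pct=5.0):
    """Return the lowest clock speed seen while the core was (nearly) idle.

    samples is a list of (cpu_load_percent, clock_ghz) pairs.
    Returns None if no idle samples were recorded.
    """
    idle_freqs = [freq for load, freq in samples if load <= idle_load_pct]
    return min(idle_freqs) if idle_freqs else None


def has_artificial_floor(samples, expected_idle_ghz, tolerance_ghz=0.05):
    """True if the idle clock never drops back to the expected idle speed."""
    floor = idle_floor_ghz(samples)
    return floor is not None and floor > expected_idle_ghz + tolerance_ghz


# Illustrative big-core traces. The loaded samples (2.0 GHz) are made up;
# the idle samples use the figures reported in this article.
normal_app = [(85.0, 2.0), (0.0, 0.31), (0.0, 0.31)]
benchmark_app = [(90.0, 2.0), (0.0, 1.29), (0.0, 1.29)]

print(has_artificial_floor(normal_app, expected_idle_ghz=0.31))
print(has_artificial_floor(benchmark_app, expected_idle_ghz=0.31))
```

The point of filtering on load first is that a high clock speed under load tells us nothing; it is only the clock speed at 0% load that reveals a floor.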
We reached out to the team at Primate Labs (the creators of Geekbench), who were instrumental in exposing the first wave of benchmark cheating, and partnered with them for further testing. We brought a OnePlus 3T to Primate Labs' office in Toronto for some initial analysis. The initial testing included a ROM dump, which found that the OnePlus 3T was directly looking for quite a few apps by name. Most notably, the OnePlus 3T was looking for Geekbench, AnTuTu, Androbench, Quadrant, Vellamo, and GFXBench. Since by this point we had fairly clear evidence that OnePlus was engaging in benchmark cheating, Primate Labs built a "Bob's Mini Golf Putt" version of Geekbench 4 for us. Thanks to the substantial changes between Geekbench 3 and 4, the "Mini Golf" version had to be rebuilt from the ground up specifically for this testing. This version of Geekbench 4 is designed to avoid any benchmark detection, in order to allow Geekbench to run as a normal application on phones that are cheating (going beyond the package renaming that fools most attempts at benchmark cheating).
A Surprising Example
Immediately upon opening the app, the difference was clear. The OnePlus 3T was idling at 0.31 GHz, the way it does in most apps, rather than at 1.29 GHz for the big cores and 0.98 GHz for the little cores like it does in the regular Geekbench app. OnePlus was making its CPU governor more aggressive, resulting in a practical artificial clock speed floor in Geekbench that wasn't there in the hidden Geekbench build. It wasn't based on the CPU workload, but rather on the app's package name, which the hidden build could fool. While the difference in individual runs was minimal, the thermal throttling relaxations shine in our sustained performance test, shown below.
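For readers who want to picture what "targeting by package name" means, the following Python sketch models the behavior we observed. It is purely illustrative and is not OnePlus' code: the package names are hypothetical placeholders, the function and constant names are ours, and the real mechanism lives in the ROM rather than in anything resembling this. The clock speed figures are the idle speeds we measured.

```python
# Idle clock speed floors in GHz, as observed in this article.
DEFAULT_IDLE_GHZ = {"little": 0.31, "big": 0.31}
BOOSTED_IDLE_GHZ = {"little": 0.98, "big": 1.29}

# Hypothetical placeholder package names, NOT the real list from the ROM dump.
TARGETED_PACKAGES = frozenset({
    "com.example.benchmark.one",
    "com.example.benchmark.two",
})


def idle_floors_for(package_name):
    """Pick idle floors from the app's name alone, ignoring its actual workload."""
    if package_name in TARGETED_PACKAGES:
        return BOOSTED_IDLE_GHZ
    return DEFAULT_IDLE_GHZ


# The same benchmark code under an unrecognized name gets the default behavior.
print(idle_floors_for("com.example.benchmark.one"))
print(idle_floors_for("com.example.minigolf"))
```

Because the decision depends only on the name, an identical workload shipped under a name that is not on the list is treated like any other app, which is why a disguised build exposes the difference.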
From our testing, it appears that this has been a "feature" of Hydrogen OS for quite a while now, and was not added to Oxygen OS until the community builds leading up to the Nougat release (after the two ROMs were merged). It is a bit disappointing to see, especially in light of the software problems that OnePlus has had this month following the merging of the ROMs, from bootloader vulnerabilities to GPL compliance issues. We are hopeful that as the dust settles down following the merger of the two teams, OnePlus will return to form, and continue to position themselves as a developer-friendly option.
With the "Mini Golf" version of Geekbench in hand, we went out and started testing other phones for benchmark cheating as well. Thankfully our testing shows no cheating by the companies which were involved in the scandal half a decade ago. HTC, Xiaomi, Huawei, Honor, Google, Sony, and others appear to have consistent scores between the regular Geekbench build and the "Mini Golf" build on our testing devices.
Unfortunately, we did find possible evidence of benchmark cheating that we have not yet been able to confirm from a couple other companies, which we will be investigating further. The very worst example of this was in the Exynos 8890-powered Meizu Pro 6 Plus, which took the benchmark cheating to another extreme.
A Terrible Example
Meizu has historically set their CPU scaling extremely conservatively. Notably, they often set their phones up so that the big cores rarely come online, even when in their "performance mode", making the flagship processors (like the excellent Exynos 8890) that they put into their flagship phones act like midrange processors. This came to a head last year when Anandtech called Meizu out for their poor performance on Anandtech's JavaScript benchmarks on the Mediatek Helio X25 based Meizu Pro 6, and noted that the big cores stayed offline for most of the test (when the test should have been running nearly exclusively on the big cores). Anandtech noticed last week that a software update had been pushed to the Meizu Pro 6 that was finally allowing the Meizu to use those cores to their fullest. Anandtech's Smartphone Senior Editor, Matt Humrick, remarked that "After updating to Flyme OS 5.2.5.0G, the PRO 6 performs substantially better. The Kraken, WebXPRT 2015, and JetStream scores improve by about 2x-2.5x. Meizu apparently adjusted the load threshold value, allowing threads to migrate to the A72 cores more frequently for better performance."
Unfortunately, rather than improving the CPU scaling on their new devices to obtain better benchmark scores, Meizu appears to have set the phone to switch to using the big cores only when certain apps are running.
Upon opening a benchmarking app, our Meizu Pro 6 Plus recommends that you switch into "Performance Mode" (which alone is enough to confirm that they are looking for specific package names), and it seems to make a substantial difference. When in the standard "Balance Mode", the phone consistently scores around 604 and 2220 on Geekbench's single-core and multi-core sections, but in "Performance Mode" it scores 1473 and 3906, largely thanks to the big cores staying off for most of the test in "Balance Mode", and turning on in "Performance Mode". Meizu appears to lock the little cores to their maximum speed of 1.48 GHz, and set a hard floor for two of their big cores of 1.46 GHz when running Geekbench while in "Performance Mode" (with the other two big cores being allowed to scale freely, and quite aggressively), which we do not see when running the "Mini Golf" build.
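To put those scores in proportion, here is a short Python calculation of the gap between the two modes. The scores are the ones quoted above; the helper name is ours.

```python
def speedup(boosted_score, baseline_score):
    """Ratio of the boosted score to the baseline score."""
    return boosted_score / baseline_score


# Geekbench scores reported above for the Meizu Pro 6 Plus.
single_core = speedup(1473, 604)   # "Performance Mode" vs "Balance Mode"
multi_core = speedup(3906, 2220)

print(f"single-core: {single_core:.2f}x")  # prints "single-core: 2.44x"
print(f"multi-core:  {multi_core:.2f}x")   # prints "multi-core:  1.76x"
```

In other words, the mode switch is worth roughly 2.4x in the single-core section and roughly 1.8x in the multi-core section, far beyond the couple of percentage points that benchmark cheating typically yields.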
While being able to choose between a high power mode and a low power mode can be a nice feature, in this case it appears to be nothing more than a parlor trick. The Meizu Pro 6 Plus sees decent scores in "Performance Mode" for the regular Geekbench app, but when using the "Mini Golf" build of Geekbench, it drops right back down to the same level of performance as it has when it is set to "Balance Mode". The higher performance state on the Meizu Pro 6 Plus is just for benchmarking, not for actual day to day use.
One thing of note is that when we tested the Meizu Pro 6 Plus in "Performance Mode" with the secret build of Geekbench, the big cores came online if we were recording the clock speeds with Qualcomm Trepn. We have not yet determined if the Meizu is recognizing that Trepn is running and turning on the big cores in part because of it, or if it simply is turning on the big cores because of the extra CPU load that it creates. While it might sound counter-intuitive that an additional load in the background (such as when we kept performance graphs on during the test) would increase the results of a benchmark, Meizu's conservative scaling could mean that the extra overhead was enough to push it over the edge, and call the big cores into action, thus improving performance for all tasks.
When receptive OEMs address feedback…
Following our testing, we reached out to OnePlus about the issues we found. In response, OnePlus swiftly promised to stop targeting benchmarking apps with their benchmark cheating, but still intends to keep it for games (which also get benchmarked). In a future build of OxygenOS, this mechanism will not be triggered by benchmarks. OnePlus has been receptive to our suggestion to add a toggle as well, so that users know what is going on under the hood, and at the very least the unfair and misleading advantage in benchmarks should be corrected. Due to the Chinese New Year holiday and their feature backlog, though, it might be a while before we see user-facing customization options for this performance feature. While correcting the behavior alone is an improvement, it is still a bit disappointing to see in regular applications (like games), as it is a crutch to target specific apps instead of improving actual performance scaling. By artificially boosting the aggressiveness of the processor, and thus the clock speeds for specific apps, instead of improving their phones' ability to identify when they actually need higher clock speeds, OnePlus creates inconsistent performance for their phones, which will only become more apparent as the phone gets older and more games that OnePlus hasn't targeted are released. However, the implementation does currently allow games to perform better. OnePlus also provided a statement for this article, which you can read below:
'In order to give users a better user experience in resource intensive apps and games, especially graphically intensive ones, we implemented certain mechanisms in the community and Nougat builds to trigger the processor to run more aggressively. The trigger process for benchmarking apps will not be present in upcoming OxygenOS builds on the OnePlus 3 and OnePlus 3T.'
We are pleased to hear that OnePlus will be removing the benchmark cheating from their phones. Going forward we will continue to attempt to pressure OEMs to be more consumer friendly whenever possible, and will be keeping an eye out for future benchmark cheating.
Unfortunately, the only real answer to this type of deceit is constant vigilance. As the smartphone enthusiast community, we need to keep our eyes out for attempts to deceive users like this. It is not the benchmark scores themselves that we are interested in, but rather what the benchmarks say about the phone's performance. While the benchmark cheating was not yet active on the OnePlus 3 when we reviewed it, a simple software update was enough to add this misleading "feature", which clearly illustrates that checking devices for benchmark cheating when they first launch is not enough. Issues like this one can be added days, weeks, months, or even years after the device launches, artificially inflating the global averages gathered by benchmarks months down the line and influencing the final database result. It should be noted that even with these tweaks that manufacturers had to invest time and money to develop, we are typically only seeing an increase of a couple of percentage points in benchmark scores (excluding a couple of fringe cases like Meizu, where the cheating is covering up much larger problems). That is much smaller than the gap between the best performing and worst performing devices. We'd argue, though, that with devices running increasingly similar hardware, those extra percentage points might be the deciding factor in the ranking charts that users ultimately look up. Better driver optimization and smarter CPU scaling can have an absolutely massive effect on device performance, with the difference between the score of the top performing Qualcomm Snapdragon 820 based device and the worst performing one (from a major OEM) exceeding 20% on Geekbench. Twenty percent from driver optimization, rather than a couple of percentage points from spending time and money to deceive your users. And that's just talking about the development efforts that can affect benchmark scores.
Many of the biggest benefits of investing in improving a device's software don't always show up on benchmarks, with OnePlus offering excellent real world performance in their devices. It really should be clear cut where a company's development efforts should be focused in this case. We are reaching out to more companies who cheat on benchmarks as we find them, and we hope they are every bit as receptive as OnePlus.
We would like to thank the team at Primate Labs once again for working with us to uncover this issue. It would have been substantially more difficult to properly test for benchmark cheating without the "Mini Golf" edition of Geekbench.
  from xda-developers http://ift.tt/2jRGZRr
  via IFTTT
 