What AT&T’s AMVOTS Tells Us About the State of Video QoE Measurement

AT&T recently published a paper on AMVOTS, their Automated Mobile Video Objective Testing System, built in collaboration with Ericsson. It is a lab-based platform for measuring video quality on mobile devices under realistic network conditions. For anyone working in video QoE — ourselves included — it is worth a closer look. We talked with Ericsson researcher David Lindero, who explained to us the background of their joint work. Read on to learn more!
What AMVOTS Does
Internet service providers like AT&T have a strong interest in measuring and improving video QoE on their networks. To that end, they are researching ways to capture and analyze video quality on mobile devices under realistic network conditions, in order to optimize the overall QoE on a mobile network. AT&T’s AMVOTS system captures the HDMI output of a mobile phone and compares it, frame by frame, against a reference video using VMAF (Netflix’s well-known open-source video quality metric). The researchers use this information to drive decisions about network resource allocation.
When you compare a reference clip with a mobile phone screen recording, you run into some obvious issues, and AMVOTS handles the important ones: frame alignment and visual corrections (cropping out UI overlays, masking logos, color correction). Without cropping out UI elements, for instance, you would get a lower picture quality score than what the user actually perceives, which would be misleading; the corrected scores reflect the actual picture quality rather than measurement artifacts. AMVOTS also implements the ITU-T standard P.1203.3 to assess the impact of buffering and stalling, using VMAF as “video quality input” to the model.
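To make the alignment step concrete, here is a minimal sketch (our own illustration, not AMVOTS’s actual algorithm): given per-frame signatures of the reference and the capture, such as average luma, find the temporal offset that minimizes the mean absolute difference between the two signals.

```python
# Toy temporal alignment: find the frame offset at which the captured
# signal best matches the reference, using per-frame average-luma
# signatures. Illustrative only -- a real system does much more
# (sub-frame alignment, dropped-frame handling, spatial registration).

def best_offset(reference, capture, max_offset=30):
    """Return the shift (in frames) minimizing mean absolute difference."""
    best, best_cost = 0, float("inf")
    for offset in range(max_offset + 1):
        pairs = list(zip(reference, capture[offset:]))
        if not pairs:
            break
        cost = sum(abs(r - c) for r, c in pairs) / len(pairs)
        if cost < best_cost:
            best, best_cost = offset, cost
    return best

ref = [10, 20, 30, 40, 50, 60, 70, 80]
cap = [0, 0, 10, 20, 30, 40, 50, 60]   # capture starts 2 frames late
print(best_offset(ref, cap))            # -> 2
```

Only once frames are paired up like this does a frame-by-frame metric such as VMAF produce meaningful scores.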
The system runs on a Dell server with a dedicated capture card, processing 1080p at 60fps in real time. Every 10 seconds, it produces a combined QoE score that covers both spatial quality (VMAF) and temporal factors like stalls and frame freezes.
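As an illustration of what such a windowed score might look like, here is a toy aggregation that combines a mean spatial score with a stall penalty. The actual combination in AMVOTS follows P.1203.3; the penalty formula below is entirely made up.

```python
# Toy per-window QoE score: average spatial quality (VMAF-like, 0-100)
# penalized by stall time within the window. The real integration in
# AMVOTS follows ITU-T P.1203.3; this weighting is invented.

def window_score(vmaf_frames, stall_seconds, window_seconds=10.0):
    spatial = sum(vmaf_frames) / len(vmaf_frames)       # mean VMAF
    stall_fraction = min(stall_seconds / window_seconds, 1.0)
    return spatial * (1.0 - stall_fraction)             # crude stall penalty

print(window_score([90] * 600, stall_seconds=0.0))   # -> 90.0
print(window_score([90] * 540, stall_seconds=1.0))   # -> 81.0
```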
As David says, “AMVOTS was created to be able to assess the end-to-end service quality for ‘any’ video based service that runs on a cellular network, or any network for that matter. Initially it was to test the behavior of new services where content providers were blaming the network, to be able to show that usually the problem was due to badly configured streamers. Now it has developed into a more capable tool that, in Eric Petajan’s (AT&T) vision, could become an open source alternative for running drive testing or evaluations of emulated services.”
The interesting part is the “QoE-in-the-Loop” concept: these scores could then be fed back into the Radio Access Network (RAN) in near real time, allowing the base station to allocate radio resources based on what users are actually experiencing rather than just raw throughput. AT&T’s results suggest that roughly 3x more video flows can be supported at “acceptable” quality levels when the RAN is QoE-aware — mostly by not wasting bandwidth on streams that already look fine and redirecting it to those that need it. This concept has been discussed in a VQEG work item on 5G Key Performance Indicators (5G KPI).
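The reallocation idea can be sketched as a greedy loop that skims bandwidth from flows already comfortably above an “acceptable” quality target and hands it to flows below it. Everything in this sketch (the quality curve, the thresholds, the step size) is invented for illustration:

```python
# Toy QoE-aware reallocation: move bandwidth from flows whose predicted
# quality exceeds the target to flows below it. The quality model and
# all numbers are made up for illustration.

TARGET = 70.0  # "acceptable" quality score

def quality(mbps):
    """Made-up diminishing-returns quality curve (score 0-100)."""
    return min(100.0, 40.0 * mbps ** 0.5)

def rebalance(flows, step=0.1):
    """flows: dict name -> Mbps. One greedy pass of reallocation."""
    donors = [f for f in flows if quality(flows[f]) > TARGET + 5]
    needy = [f for f in flows if quality(flows[f]) < TARGET]
    for d, n in zip(donors, needy):
        flows[d] -= step
        flows[n] += step
    return flows

flows = {"a": 9.0, "b": 1.0}
for _ in range(50):
    rebalance(flows)
print({f: round(quality(mbps), 1) for f, mbps in flows.items()})
```

Because the quality curve has diminishing returns, taking bandwidth from flow “a” barely hurts it, while flow “b” climbs above the target, which is exactly the intuition behind the reported capacity gains.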
The key in this feedback loop is that “acceptable” is defined in terms of actual user experience, not in terms of guesses about what particular network metrics mean. The quality achieved at a given throughput may well differ depending on how the videos are encoded and what type of content is being streamed. Including an actual quality metric therefore improves the overall efficiency of the system significantly.
The Bigger Picture: From Lab to Production
The paper describes a lab tool. But as Eric Petajan explained in an interview, AT&T’s longer-term goal goes further: using the ground-truth data from AMVOTS, combined with subjective testing (MOS scores collected at their Austin lab), to train prediction models that can estimate QoE from network traffic alone — without needing HDMI capture.
This is where it gets interesting for the industry. If you can predict QoE from network-side data, you can do it at scale across your entire subscriber base. But building that pipeline is a serious undertaking. You need:
- The lab hardware
- Good video QoE models and/or the subjective testing infrastructure to create subjective ground truth data
- The network and ML expertise to perform the training
- Enough training data to make the models generalize
- A pathway to real deployment
There’s also a real challenge with encrypted traffic: from TCP-level information alone, you can only estimate video quality very coarsely. And as QUIC adoption grows, the network-visible data that these models rely on becomes even thinner. Petajan was very open about the scalability question, calling it “kind of an open question” whether AMVOTS-derived insights can work beyond AT&T’s own environment.
Where Standardized Bitstream Models Fit in
As our readers may be aware, at AVEQ we focus on standardized QoE models like ITU-T Rec. P.1203 and P.1204 (you can find an overview of video quality models here). P.1203 was developed specifically for adaptive streaming (DASH, HLS), and its “integration module” already models the subjective impact of quality switches, stalling, and resolution changes. VMAF does not do that; with VMAF you can only work with statistical aggregations of per-frame scores.
Most importantly, the P.1204.1 (metadata-based) and P.1204.3 (bitstream-based) models need neither a reference video nor HDMI capture. Unlike VMAF, they work from metadata that the player already has (buffer state, bitrate, resolution, codec parameters), or from an analysis (decoding) of the transmitted video segments.
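To make the contrast with full-reference metrics concrete, here is the kind of per-segment metadata a metadata-based model consumes. The scoring function is a deliberately made-up placeholder; the real model and its coefficients are defined in ITU-T Rec. P.1204.1.

```python
# The kind of player-side metadata a P.1204.1-style (metadata-based)
# model works from -- no reference video, no HDMI capture needed.
# toy_score() is a made-up placeholder, NOT the standardized model.

segments = [
    {"codec": "h264", "bitrate_kbps": 1500, "width": 1280, "height": 720, "fps": 30},
    {"codec": "h264", "bitrate_kbps": 4500, "width": 1920, "height": 1080, "fps": 30},
]

def toy_score(seg):
    """Placeholder: bits per pixel per frame, mapped onto a 1-5 scale."""
    bpp = seg["bitrate_kbps"] * 1000 / (seg["width"] * seg["height"] * seg["fps"])
    return round(min(5.0, 1.0 + 40 * bpp), 2)

for seg in segments:
    print(seg["height"], toy_score(seg))
```

The point is not the formula but the inputs: every field above is available to the player (or to a probe decoding the segments) without ever touching the original source video.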
A full-reference approach like VMAF, where you need access to both source and output, will always be more precise for a given frame. But in many cases you cannot feed a known reference video into the service in the first place (think of streams from services like Disney+, or live TV). P.1204.1, by contrast, is deployable wherever you have access to the player, and P.1204.3 can be embedded into active probing scenarios where man-in-the-middle decryption (on your own device!) is still possible.
Since active probes capture data directly from the player, we can run them in real-life deployments, at the mobile edge, in network-centric locations, and so on, without needing dedicated hardware (they can run as Docker containers). This is a huge advantage for operators who want to monitor real user experience across their entire subscriber base. Where the full-reference VMAF-based approach no longer works, you can deploy our SDK to hundreds of sensors and get P.1203 scores for representative video sessions. These probes then report on the current state of the network and could be used to trigger optimizations in the RAN or elsewhere.
Different Tools for Different Problems
As you can imagine from the previous explanations, such deployments are never simple. Whether and how you can measure the QoE depends on:
- Who you are (an ISP, a CDN provider, a streaming provider)
- Where you need to perform measurements (mobile devices, desktops)
- How you want to use the measurements (for lab tests, or in real-life production deployments)
- What you can control (the RAN, routing, CDN configuration, …)
AMVOTS is interesting because it covers the case where you want a detailed lab-based evaluation of possibly closed-source applications. The biggest caveat is that it requires being able to feed a known input into the system. At AVEQ we focus on another angle: measuring the quality of third-party apps (like YouTube and Netflix) with our mobile and desktop Surfmeter apps, where you cannot control the video input.
It would be easy to frame this as “our approach vs. theirs” — but that’s not our point. AMVOTS is a good example of how to build a lab-based ground truth setup that might extend to other use cases (e.g., conferencing). When it comes to real deployments, AVEQ’s Surfmeter solves different problems at a different stage.
What Matters for Operators
The fact that AT&T, one of the largest operators in the world, invested this effort into video QoE measurement validates the research direction we have long been advocating. The underlying premise: managing networks by throughput alone is not enough, because video services differ, and what “acceptable” means for a user varies across contents and services.
Here’s David’s view: “We need tools like this to show what telco operators and content providers are missing by not sharing data. Showing that ‘bitrate is not enough’, etc., is the first step in this story, and with QoE reports between clients and network, we could reach a much higher utilization of cellular networks with more, and happier, users. And maybe that even enables smart improvements that we haven’t even thought about yet…”
Ultimately, you need to understand what the user actually sees. For example, we know that a satellite operator can tune the bandwidth requirements for YouTube and Netflix differently, because they use different codecs and playout algorithms. Models like P.1203/P.1204 can tell you exactly how those differences translate into QoE, and how to optimize for them.
Now, operators like AT&T can make significant R&D investments to run such labs, create in-house prediction models, and validate the results subjectively. For operators who do not have that scale, a standards-based approach with active probes potentially offers a shorter path: deploy the tool, measure third-party streams, get P.1203 scores, and start making decisions based on actual QoE data.
To summarize, we’re happy to see the industry is moving toward QoE-aware network management. The key is to choose the right tool for the right problem, and to focus on what ultimately matters: delivering the best possible experience to users.


