Question / Help: Can't record high-resolution captures despite beefy CPU

war4peace

New Member
Hi all,
I am currently reviewing a series of HEDT CPUs for a roundup focusing on gaming, game streaming, dual-streaming and high-resolution captures. Using OBS Studio as the primary streaming and recording solution, I have encountered the following issue:
Using top CPUs such as the Intel 7980XE and the Threadripper 2990WX & 2950X, I expected recording to work smoothly at resolutions such as 2560x1440, 3440x1440, 3840x1080 and 3840x2160. Instead, x264 yielded huge frame drops despite the CPU load being rather low. For example, at 3440x1440 the frame drop was humongous:
Code:
Video stopped, number of skipped frames due to encoding lag: 39988/46113 (86.7%)
At the same time, the total CPU usage was 22%-27% according to HWInfo64 v.5.87-3495.

It can't be the CPU, so it's either some manual encoder setting I can't figure out, or simply the encoder being unable to use the CPU's resources properly.
Is this another case of hardware preceding software? Ironically, the only real-time, non-synthetic workload currently using the full capability of these processors is CPU cryptomining :)

Any help much appreciated! I'd very much like to show the world what these CPUs are capable of in the world of gaming.
(log, of course, attached).

Thanks again in advance for your responses!
 

Attachments

  • 2018-08-12 22-22-44.txt
    30.4 KB · Views: 58

koala

Active Member
You're doing something strange. You set your base resolution to 1680x1050, which is your monitor resolution, and rescale that to 1120x700 as the output resolution. Then you let the encoder blow that 1120x700 back up to full HD 1920x1080, or even 3440x1440, by setting that resolution as the rescale target in the encoder settings. This way you not only get mediocre quality by first scaling the image down and then scaling the downscaled image back up, you also waste CPU cycles on the upscaling. That second rescale in the encoder takes place on the CPU, which is not as optimized as the first rescale, which happens on the GPU.
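To make the chain explicit, this is roughly what your settings amount to (labels as they appear in the OBS UI; the values are the ones discussed above):
Code:
Settings > Video > Base (Canvas) Resolution:     1680x1050   (monitor resolution)
Settings > Video > Output (Scaled) Resolution:   1120x700    (first rescale, done on the GPU)
Settings > Output > Recording > Rescale Output:  1920x1080 or 3440x1440   (second rescale, done on the CPU inside the encoder)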

In addition, first test reports about the Threadripper WX family hint that memory-intensive multithreaded applications (such as video encoding) suffer greatly from the architecture (dies without a directly attached memory controller) in comparison to the X family. This is mentioned in a news article from the German IT news site heise online (https://www.heise.de/newsticker/mel...Kernen-an-die-Performance-Spitze-4133456.html). In the comments section of the article, the author refers to the upcoming complete test in the German c't magazine and explicitly mentions video encoding: https://www.heise.de/forum/heise-on...endungen-wie-3D-Spiele/posting-32882143/show/

For actual encoding, see this thread, especially the comment about the threads parameter when 2 x264 sessions are running in parallel:
https://obsproject.com/forum/threads/can-you-please-explain-x264-option-threads.76917/
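For example, with two x264 sessions on a 32-core/64-thread CPU, you could cap each encoder's thread count explicitly via the custom x264 options field instead of leaving the default, something like this (the exact number is only a starting point to experiment with, not a recommendation):
Code:
threads=16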

In your setup, you seem to combine all the things that negatively affect performance:
- rescaling to 4K resolution on the CPU
- a CPU that behaves badly with memory-intensive multithreaded applications
- 2 x264 encoders running in parallel on a huge number of CPU cores (32) with the default threads= configuration, so the threads may step on each other

I don't know a definite answer, only that there is more research to do.
My conclusion so far is that this CPU isn't a plain "just buy it and you get 200% more power in everything".
It seems to be more of a compute CPU that primarily crunches numbers, not a CPU built to move large amounts of data around in memory.
It's the same as with AMD in the past: it's not a universal CPU. You have to evaluate your use case exactly.
 

war4peace

New Member
Hi Koala,
Yes, I am aware of the image quality issue; however, that is irrelevant in the context of this test. The results look bad quality-wise, but they will never be published anywhere, so they don't matter; the resulting files will be discarded once the review is finished.
These are CPU-intensive tests, and they are meant to be CPU-intensive. That's exactly what I am testing: how does the CPU behave when everything is offloaded to it? Does multithreading work out of the box? Are all threads being used correctly and judiciously? Only... they are not. As mentioned above, CPU usage stays low, and since encoding is smooth at 2560x1440 while anything above that resolution behaves badly, this points to a software issue (be it the operating system, the encoder, or both).
Interestingly, ffmpeg works better at 3440x1440, although it doesn't take full advantage of the CPU architecture either.

I guess future patches (both to OS and encoders) will optimize encoding.
 

koala

Active Member
In my opinion, your approach is not correct. You are testing the worst case: the situation where a user takes all the wrong steps. But the worst case is only produced by a layman, by a dilettante, not by a professional. A layman doesn't (usually) buy a $1800 CPU. A professional, on the other hand, exercises best practice. He optimizes. He works from the common case and tries to tune his setup toward the best case.

Thus, in a review that intends to tell the audience about the fitness of a device for a given purpose, you should mainly test and report the common case: document what happens when the device is used as it was meant to be used. You should also work out and report what can be achieved in the best case. And at the end, least importantly, you can document what happens in the worst case. But the worst case is not relevant in everyday use, so you don't need to dig too deep into it.

Focus on the common case and on the best case: try to optimize your test setup as much as possible, rather than making it as artificially stressful as possible.
 

war4peace

New Member
You would be right on all counts if the CPU were overloaded; however, it is not.

It's just like when gaming was tested on the 7700K and 8700K: many reviewers used resolutions as small as 1024x768 to fully load the CPU, because otherwise the bottleneck would have been the GPU. Nobody actually plays games at those resolutions anymore, and certainly not on those CPUs, but it was the only way to properly load the CPU to its maximum for comparative results and roundups. Was that a real-life scenario? Of course not. Neither is almost any synthetic benchmark out there, but they are used because they are standardized and repeatable. The same argument applies ("a real user would never do that"), and still they are used :)

The issue: I am unable to load the CPU enough using OBS. Optimizing would reduce CPU load instead of increasing it.
The question: which CPU should I buy for real-time encoding at resolutions such as 3440x1440, 3840x1080, 3840x2160, etc.?
The answer: currently it is not possible to encode at these resolutions in real time with any CPU, no matter how powerful, because the software can't take advantage of the CPU's hardware capabilities.
 

koala

Active Member
It's not entirely true that optimizing always reduces load. If you have a bottleneck somewhere, you are not able to use the other components to their full capacity. For example, if the CPU can process data faster than the data can be moved through memory to it, the memory bus is the bottleneck and you are not able to fully load the CPU. As soon as you resolve that bottleneck, you are able to fully load the CPU, its load goes up, and you have optimized the system as a whole.
 

war4peace

New Member
Agreed. However! The machine has 32 GB of DDR4-3200 CL14 in quad channel. The scratch disk is a brand new (never used) Samsung 970 EVO (NVMe). The games and the OS are installed on separate SSDs. There is no bottleneck to speak of.
The discussion is becoming academic ("what if, what about"), moving farther away from the issue at hand. I can't find any explanation for the behavior shown other than "the software is not prepared for the hardware yet".
 

R1CH

Forum Admin
Developer
I believe output rescaling is not only CPU-based but also a single-threaded operation, and thus should be avoided at all costs.
 

war4peace

New Member
Interesting, I didn't know that. It could be why the process starts stumbling at higher resolutions. I'll skip that step when I test the 2950X tomorrow and report back. Thank you for this nugget.
 

koala

Active Member
If you want to test the same operating mode people in the real world will use for 4K streaming, set the base resolution to 4K and fit every source to this resolution. Set the output resolution to the same value, so there is no rescaling at all. In the encoder, don't rescale either. Just encode what comes directly from the canvas.
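In terms of settings, that means roughly this (labels as they appear in the OBS UI; the exact wording may differ slightly between versions):
Code:
Settings > Video > Base (Canvas) Resolution:     3840x2160
Settings > Video > Output (Scaled) Resolution:   3840x2160
Settings > Output > Recording > Rescale Output:  unchecked (same for Streaming)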

I talked about bottlenecks and how they prevent full usage of system resources. If what R1ch believes is true, then by removing output rescaling from the encoder configuration you will remove a bottleneck, and you will get higher CPU usage and more performance.
 

TryHD

Member
The test results would be interesting with a dual-PC setup: one source PC and one encoding PC with a capture card in it. That is the normal setup in this price range. What you are testing there is a setup nobody would actually build, because it is just stupid and inefficient; nobody would do it that way.

Could you post a link to your results here after you have finished?
Thank you
 

koala

Active Member
I'd like to add some info, now that I have my copy of the current issue of c't magazine (18/2018), which contains a detailed report about the new Threadripper. It includes a section about video encoding, and interestingly it mirrors exactly what the OP experienced:

At the same time, the total CPU usage was 22%-27%
The article contains this:
Beim Kodieren eines Videos in Full-HD-Auflösung war der 2990WX lediglich zu 30 Prozent ausgelastet und dabei 25 Prozent langsamer als der Threadripper 2950X mit halb soviel Kernen.
In English:
While encoding a full-HD video, the 2990WX was only about 30 percent utilized, and at the same time it was 25 percent slower than the Threadripper 2950X, which has half as many cores.
This tells us that the NUMA architecture of the new Threadripper is responsible for the low performance of a multi-threaded, memory-intensive application, not bad programming in the encoder, in the capture, or in the application in general.

There is an OS scheduler component missing that until now was only required by hypervisors: assigning threads to NUMA nodes. Currently, it seems the Windows scheduler doesn't really respect NUMA nodes when allocating processor time to ordinary processes.

There is only an API in the OS that you can use to explicitly allocate and reserve processors on NUMA nodes (via "processor groups"), but if you as an application programmer don't use it, your threads are scheduled onto every core regardless of which NUMA node that core belongs to. This means that if 2 threads run on different NUMA nodes, memory shared by both of them is "slow": a thread cannot run at full speed, because the processor has to wait extra cycles for memory access. This results in 30% CPU usage instead of 100% while encoding video, wasting 70%.
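To illustrate what that looks like (a minimal sketch of the Win32 NUMA/processor-group calls; this is not something OBS or x264 does today), an application would have to pin its own threads to a node, roughly like this:
Code:
// Minimal sketch: restrict the current thread to the processors of NUMA node 0,
// so that the memory it works with stays local to that node.
#include <windows.h>
#include <stdio.h>

int main(void)
{
    ULONG highest_node = 0;
    if (!GetNumaHighestNodeNumber(&highest_node)) {
        printf("GetNumaHighestNodeNumber failed: %lu\n", GetLastError());
        return 1;
    }
    printf("NUMA nodes available: 0..%lu\n", highest_node);

    // Get the processor group and affinity mask that make up NUMA node 0.
    GROUP_AFFINITY node_affinity = {0};
    if (!GetNumaNodeProcessorMaskEx(0, &node_affinity)) {
        printf("GetNumaNodeProcessorMaskEx failed: %lu\n", GetLastError());
        return 1;
    }

    // Pin the current thread to exactly those processors.
    if (!SetThreadGroupAffinity(GetCurrentThread(), &node_affinity, NULL)) {
        printf("SetThreadGroupAffinity failed: %lu\n", GetLastError());
        return 1;
    }

    // From here on the thread only runs on node 0, so memory it touches first
    // is normally allocated node-local and the cross-node penalty is avoided.
    return 0;
}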

Hypervisors assign VMs to NUMA nodes to prevent this kind of memory thrashing. They have done this for ages. The Windows scheduler doesn't seem to do this yet for ordinary processes like OBS or games.

I expect support in upcoming Windows 10 versions over the next few years, if this kind of NUMA CPU becomes more common in the consumer space. Currently, these processors are almost exclusively used in data centers, under hypervisors for virtualized server farms, not in consumer machines.
 

Pewpau

New Member
Hypervisors assign VMs to NUMA nodes to prevent this kind of memory thrashing. They have done this for ages. The Windows scheduler doesn't seem to do this yet for ordinary processes like OBS or games.

I'm having the same problems on the 2990WX and am looking for at least a way to fix it, so I could stream with the same encoding settings as on the first-gen Threadripper (TR1).

Specs:
AMD Ryzen Threadripper 1950X, upgraded to a 2990WX
Asus ROG Zenith Extreme (TR4, X399)
1x Nvidia Titan Xp
Samsung 960 Pro M.2, 1 TB
128 GB G.Skill Trident Z DDR4-3200 (16 GB x 4)
Custom in-desk PC case by Badvolf ("BV Volchara" dual-system desk PC)
Corsair AX1500i Titanium
 