LOD_RADIUS=15 but it's NOT about distance

October 23, 2013

Howellerman, on 21 Oct 2013 - 10:06 AM, said:

I think I will wait for P3D V2 to re-install my FS Genesis products

Maybe I missed it, but I didn't hear any news that P3D V2 was going to be compatible with any existing FSX/P3D products?

odourboy, on 23 Oct 2013 - 09:50 AM, said:

The main point of my post was actually to highlight the significance of mesh detail in the equation. Any thoughts on how to increase the mesh being applied to more distant scenery?

Howellerman, on 18 Oct 2013 - 08:38 AM, said:

It allows tasks to ping-pong back and forth between cores, hyper-threaded or not, and the ensuing cache thrashing just slows things down.

In theory, if you set the affinity for the thread and also set the "ideal" parameter, the scheduler should NOT move the thread around to other cores (HT or real). I haven't tested this in the real world yet ... it's in scope for a project I'm currently working on (nothing to do with FSX), but I'm not there yet.

Hi Rob,

With respect to the first note, I have yet to read anything that suggests they *won't* work, but time will tell. EDIT: I see Simmerhead beat me to the punch!

With respect to the ping-pong phenomenon, UNIX systems (perhaps LINUX as well) have a BIND parameter that you can apply to a thread or thread class, where it basically says you run on this core and nowhere else. Not sure if that function exists in Windows, but my empirical evidence suggests not: I watched the single threaded task get tossed back and forth between two cores on a database update. In general, with regards to SMP operation, Windows starts slow and gets slower faster the more cores you add.

John Howell

Prepar3D V5, Windows 10 Pro, I7-9700K @ 4.6GHz, EVGA GTX1080, 32GB Corsair Dominator 3200MHz, SanDisk Ultimate Pro 480GB SSD (OS), 2x Samsung 1TB 970 EVO M.2 (P3D), Corsair H80i V2 AIO Cooler, Fulcrum One Yoke, Samsung 34" 3440x1440 curved monitor, Honeycomb Bravo throttle quadrant, Thrustmaster TPR rudder pedals, Thrustmaster T1600M stick

October 24, 2013

I did notice a slight improvement in some very distant hills, but for all intents, useless. However, since the OP was using photo scenery, I thought it might be of interest.

I tried these settings. They definitely had an impact, but I can't say it was a positive one. The Photo scenery textures looked terrible ... best way I can describe it is as if someone added a "film grain" filter ... looked like cloth fibers everywhere ... it was very bizarre. The values didn't "seem" to sharpen the textures nor add mesh detail, maybe the values need to be balanced (I tried the MIN 12 MAX 15 with a LOD 7, 7.5, 9, 9.5, 11).

October 24, 2013

Bjoern - I just realized that your photos are of far-off scenery viewed from far off. Did you inspect the scenery zoomed in at all? I zoomed in to 3.00 to see the full effect of the application of detail1.bmp. Not suggesting you're wrong, but the pictures you posted MAY be misleading.

The main point of my post was actually to highlight the significance of mesh detail in the equation. Any thoughts on how to increase the mesh being applied to more distant scenery?

Further, I totally disagree with you. It obviously improves the appearance of my SW photoscenery. It very marginally improves my FTGX scenery. It has no negative impact on the appearance of any scenery. It has no significant performance or footprint impact. That makes it a pretty fine tweak!

If you look at the background of my images, you can clearly see differences in mesh resolution when running different LOD settings. It's most clear near the bodies of water, which appear larger at lower LODs because the terrain is displayed with fewer vertices. Thus, elevation points that would obscure the line of sight are not present or are greatly simplified.

So your hypothesis about LOD having an effect on mesh display is basically right. I've only added vector lines (i.e. road resolution) to the equation.

I've already mentioned the suitability for photoscenery, if a bit fuzzily.

The zoom factor is irrelevant. If there *are* differences, you will see them regardless.

7950X3D + 7900 XT + 64 GB + Linux | 4800H + RTX2060 + 32 GB + Linux
My add-ons from my FS9/FSX days

October 24, 2013

I tried these settings. They definitely had an impact, but I can't say it was a positive one. The Photo scenery textures looked terrible ... best way I can describe it is as if someone added a "film grain" filter ... looked like cloth fibers everywhere ... it was very bizarre. The values didn't "seem" to sharpen the textures nor add mesh detail, maybe the values need to be balanced (I tried the MIN 12 MAX 15 with a LOD 7, 7.5, 9, 9.5, 11).

Interesting. As I understand the use of detail1.bmp (and correct me if I'm wrong), FSX applies that texture (as a transparency, I think) to all scenery tiles anyway (not water, cloud or sky, of course). All the min and max detail parameters do is force it to use the higher MIP version further out. They are the equivalent of a negative LOD bias on detail1.bmp. Surprising that you find them so objectionable, since I believe a lower MIP version is already being applied. Oh well. I appreciate that you tried it.

[email protected] - ROG Strix Z790-E - 2X16Gb G.Skill Trident DDR5 6400 CL32 - MSI RTX 4090 Suprim X - WD SN850X 2 TB M.2 - XPG S70 Blade 2 TB M.2 - MSI A1000G PCIE5 1000 W 80+ Gold PSU - Liam Li 011 Dynamic Razer case - 58" Panasonic TC-58AX800U 4K - Pico 4 VR HMD - WinWing HOTAS Orion2 MAX - ProFlight Pedals - TrackIR 5 - W11 Pro (Passmark:12574, CPU:63110-Single:4785, GPU:50688)

October 24, 2013

Not sure if that function exists in Windows, but my empirical evidence suggests not: I watched the single threaded task get tossed back and forth between two cores on a database update.

It does exist in Kernel32.dll -- SetThreadIdealProcessor and SetThreadIdealProcessorEx (http://msdn.microsoft.com/en-us/library/windows/desktop/ms686253(v=vs.85).aspx) -- good for WinXP onwards. I haven't tested it yet to see if it does what I think it does ... as with a lot of MSDN documentation, accuracy isn't guaranteed. And when reading it, you find the documentation uses terms like:

The system schedules threads on their preferred processors whenever possible.

The "whenever possible" part ... love the ambiguity of Microsoft's documentation. But I still haven't been able to determine HT vs. real core (they both come up in the list of Logical cores) yet even on my Win7 box -- GetLogicalProcessorInformationEx (http://msdn.microsoft.com/en-us/library/windows/desktop/dd405488(v=vs.85).aspx) wasn't returning information that would help me identify a difference.
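For anyone who wants to experiment with this, here is a minimal Python sketch of the "set the affinity and also set the ideal processor" idea. It is only a sketch under stated assumptions, not tested FSX tooling: the helper names `single_core_mask` and `pin_current_thread` are made up for illustration, the Windows branch calls the real kernel32 functions `SetThreadAffinityMask` and `SetThreadIdealProcessor` through ctypes, and the other branch uses `os.sched_setaffinity`, which is the closest Linux equivalent of the UNIX "bind" mentioned earlier. Whether the scheduler then really keeps the thread in place is exactly the open question above.

```python
import ctypes
import os
import sys


def single_core_mask(core: int) -> int:
    """Bitmask with only the bit for the given zero-based logical core set."""
    return 1 << core


def pin_current_thread(core: int) -> str:
    """Restrict the calling thread to one logical core; return the method used."""
    if sys.platform == "win32":
        kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
        kernel32.GetCurrentThread.restype = ctypes.c_void_p
        kernel32.SetThreadAffinityMask.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
        kernel32.SetThreadAffinityMask.restype = ctypes.c_size_t
        kernel32.SetThreadIdealProcessor.argtypes = [ctypes.c_void_p, ctypes.c_uint32]
        kernel32.SetThreadIdealProcessor.restype = ctypes.c_uint32

        thread = kernel32.GetCurrentThread()
        # Hard limit: the thread may only run on this one logical core.
        if kernel32.SetThreadAffinityMask(thread, single_core_mask(core)) == 0:
            raise ctypes.WinError(ctypes.get_last_error())
        # Hint only: the "ideal" processor the scheduler should prefer.
        if kernel32.SetThreadIdealProcessor(thread, core) == 0xFFFFFFFF:
            raise ctypes.WinError(ctypes.get_last_error())
        return "win32"
    if hasattr(os, "sched_setaffinity"):
        # Linux: pid 0 means "the calling thread".
        os.sched_setaffinity(0, {core})
        return "sched_setaffinity"
    return "unsupported"
```

Calling `pin_current_thread(2)` from the thread you care about would pin it to logical core 2. Note that this says nothing about whether core 2 is a "real" core or an HT sibling.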

Rob

October 25, 2013

The "are we there yet" is an idle core, bored, while the "mine, mine, mine" is that same core taking a queued thread from another core (yes, I know: technically, the OS gives the queued task to the idle core). That is why I personally prefer HT off and use Affinity=14 - I just don't think the base ESP code is really capable of efficiently utilizing more than a couple of cores, despite being a "multi-threaded" application.

So, you are saying that with 3 cores, the main one running with two others, the two others are just duplicates and barely utilized? Wouldn't it then be better to just stick to 2 cores for FSX? CPUs in a wait state aren't useful, IMHO.

October 25, 2013

So, you are saying that with 3 cores, the main one running with two others, the two others are just duplicates and barely utilized? Wouldn't it then be better to just stick to 2 cores for FSX? CPUs in a wait state aren't useful, IMHO.

To a point...

If the high limit is 80% busy, and low load is 20% busy, then yes: subtracting a core would be effective. However, and I think I will test for this this weekend, if the low load is 30%, then no, since you will have taken 110% CPU_busy down to 100%.
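The arithmetic behind that can be written out as a tiny Python sketch. It assumes the busy percentages of the two cores simply add when their work is merged onto one core, and it ignores scheduling overhead; the function names are made up for illustration.

```python
def merged_load(busy_percentages):
    """Total busy percentage if all of this work ran on a single core."""
    return sum(busy_percentages)


def fits_on_one_core(busy_percentages, capacity=100):
    """True if the merged work does not exceed one core's capacity."""
    return merged_load(busy_percentages) <= capacity


for loads in ([80, 20], [80, 30]):
    print(loads, merged_load(loads), fits_on_one_core(loads))
```

With 80% and 20% the merged load is exactly 100%, so one core can just carry it. With 80% and 30% the merged load is 110%, so dropping a core means some work has to wait.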

And as I pointed out a little earlier, since Windows is running on a single-socket, multi-core processor with a shared last-level cache, the penalties for playing ping pong (sometimes ya just have to love the English language) are pretty minor. The thread working set is transferred from core to core in tens of CPU cycles per instance versus hundreds.

John Howell


October 25, 2013

Just to toss in some more testing I did on my system. I'm running with HT ON and CPU is 6 real cores (so 12 Logical cores).

Affinity = 14 (which, according to the others, is not correct for a 6-core HT setup)

I get consistently better FPS, but at the cost of delayed photo scenery loading. Changes to Fiber_Frame_Time_Fraction make no difference (from 0.1 to 0.99).

Affinity = 62

I get lower FPS, but photo scenery loading is much faster and it will respond to changes in Fiber_Frame_Time_Fraction. Using a value of 0.99 will decrease FPS but ensures the photo scenery tiles get loaded (using a high LOD like 9.5).
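For reference, the two Affinity values can be decoded with a few lines of Python. This sketch assumes the usual reading of the setting as a bitmask with one bit per logical processor, bit 0 being logical processor 0. The function name `cores_in_mask` is made up for illustration, and which logical processors share a physical core with HT on is not something the mask itself tells you.

```python
def cores_in_mask(mask: int) -> list[int]:
    """Return the zero-based logical processor numbers whose bits are set."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


for mask in (14, 62):
    # 12 binary digits: one per logical processor on a 6-core HT CPU.
    print(mask, format(mask, "012b"), cores_in_mask(mask))
```

So 14 (binary 1110) selects logical processors 1, 2 and 3, while 62 (binary 111110) selects 1 through 5. Both leave logical processor 0 unused.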

I'm keeping an Excel spreadsheet of all this testing, BUT I think it will be difficult to determine what exactly happens with the Affinity settings across all variants of hardware and OS. Win8.1 does a better job of scheduling than Win7, which does a better job of scheduling than WinXP ... I'm using Win7 currently.

So I guess I have no conclusions. I don't really know or understand what is really going on with the Affinity settings in FSX, and given that HT design across different Intel CPUs is NOT the same, that adds even more variables to the FSX equation. Sometimes I think it would be easier to solve the grand unification theory than to figure out how FSX Affinity will work across all the variances in OS and hardware. Once again it's a case of "try it and see" how it works on my specific setup.

October 25, 2013

The thread working set is transferred from core to core in tens of CPU cycles per instance versus hundreds.

But as I understand it the transfer forces a cache clear and that's not great for performance.

October 25, 2013

But as I understand it the transfer forces a cache clear and that's not great for performance.

That used to be the case a long while ago. Now a cache snoop operation that results in a different processor core owning the line just invalidates the respective line in the L1 and L2 caches. For the L3 (what Intel likes to call Last Level Cache now) it already belongs to the thread so things run merrily along without change, and the reason why L3 caches are so large now.

There is still a penalty, and for a short summary (and numbers are "rounded" accordingly to my increasingly faulty memory and enhancements in Intel processor technology): an L1 cache hit consumes 1 cycle, which is hidden in the parallel pipeline stages. An L1 miss consumes around 6-9 cycles, as the access has to go out to L2 (which is an L2 hit). If the line is not in L2, the miss consumes around 20-25 cycles going to L3. This is, therefore, the penalty for a ping-pong operation on a multi-core Intel processor: a couple of dozen CPU cycles.
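To put rough numbers on that, here is a small Python sketch. The latencies are the rounded figures quoted in this thread (including the "hundred cycles or more" for a trip to memory); the 1,000-line working set and the function name `refill_cost` are purely hypothetical, and it assumes the worst case where every line of the working set has to be fetched once more after the thread lands on a different core.

```python
# Rounded (low, high) access latencies in CPU cycles, as quoted in this thread.
LATENCY_CYCLES = {
    "L1": (1, 1),
    "L2": (6, 9),
    "L3": (20, 25),
    "memory": (100, 100),  # "a hundred cycles or more": treated as a floor here
}


def refill_cost(lines: int, source: str) -> tuple[int, int]:
    """Cycles (low, high) to fetch `lines` cache lines once from `source`."""
    low, high = LATENCY_CYCLES[source]
    return lines * low, lines * high


working_set_lines = 1000  # hypothetical working set size, in cache lines
for source in ("L3", "memory"):
    low, high = refill_cost(working_set_lines, source)
    print(f"{source}: {low} to {high} cycles")
```

Under these assumptions, refilling from the shared L3 costs 20,000 to 25,000 cycles, against at least 100,000 if every line had to come from memory.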

In the days of single core designs with a Front Side Bus memory architecture (less than 10 years ago!), that last cache invalidation operation was a see-ya-bye-bye snoop operation that was dozens of cycles. Still, not as bad as an all-miss to memory: a hundred cycles or more, depending upon design.

All of these penalties are mitigated by the cache characteristics of the application working set, but that is a different conversation... :lol:

John Howell


October 25, 2013

That used to be the case a long while ago. Now a cache snoop operation that results in a different processor core owning the line just invalidates the respective line in the L1 and L2 caches.

What's the cycle hit for the snoop operation?

L1 to L2 (per core caches) are so important to overall performance, I've often wondered why Intel haven't increased those Cache sizes rather than squeeze on more cores. But L3 is shared across all cores ... I thought that was the primary reason for size increase as more cores are added?

But if I understand you correctly, the L3 cache is entirely owned by "a" thread? So if multiple cores are busy executing and they need to check L3 and it is "owned" by another core (thread), then it must make the very long (and slow) trip to DDR? So shared means all cores have access to it, but only one core at a time can use it? This seems horribly inefficient for any process that can use more than a single core?

For example, rather than Intel having 6 real cores in one physical processor, I would think that removing 2 cores and adding 15MB of dedicated L3 cache for each core would result in much higher performance.

Please don't take these questions in the wrong way, I'm just trying to get a better understanding.

Cheers, Rob.

October 25, 2013

I'm still trying to figure out how we got from terrain LOD to processor architecture and cache lines.

October 25, 2013

Heya Rob,

Well, we are ranging pretty far afield from LOD radius, but since you are the topic starter...

First, I would like to defer the answer on snoop activity for later (or even in a later thread), and hopefully explain why caches are sized as they are.

The basic trade-off for any cache design is speed versus size. A large cache is always going to be slower than a small cache, if only because address resolution on a large cache, where the CPU must hash the operand data address into an LRU (least recently used) cache LINE, consumes more time than on a small cache. Cache efficiency is also a balance between speed and size. A large cache will have a much higher hit-rate (the complement of which is the miss-rate), but you need to temper that with the access time. For example, say you have a cache that is sized to have a miss rate of 0.001, which is most commonly measured in instructions, not accesses. This means that for every 1,000 instructions executed, there will be a miss. In practice, this is a very good miss rate. However, a cache large enough to accommodate that miss rate is going to be big, so there is a penalty on "hits". If the penalty is even a couple of cycles, it will contribute those couple of cycles to EACH and EVERY instruction. This is bad.

So what the CPU designers do is put in intermediate caches that are smaller but much faster. So, an L1 cache, accessible in 1 cycle, contributes only 1 cycle to each instruction. This access time, however, is almost always hidden because the CPU pipeline is doing something else in parallel.

I did not want to do this, but I must. :smile: CPU performance is based upon the sum of E (Execute), D (Delay), and S (Storage) terms, where the E-term is the intrinsic instruction execution time, the D-term is conflicts between sequential instructions, such as an arithmetic instruction needing the data from the immediately preceding load instruction, and the S-term is the overall CPU cycles due to cache misses and memory access.

Not to get toooooo deep into the weeds, the E-term is pretty easy. On RISC (Reduced Instruction Set Computing) processors the E-term is often proudly stated as "1 CPU cycle". Yeah, yeah, yeah, but if you are moving a 256-byte cache line 8 bytes at a time, you are going to run a lot of those instructions. Really fast baby steps. Intel, being a CISC (Complex Instruction Set Computing) design, will take 1 instruction to move all 256 bytes over the execution time (the E-term) of the instruction.

The D-term is where the CPU guys are most proud: they will do little things, like implement a load-store bypass in the pipeline (relatively simple) or allow out-of-order execution (OOE, or sometimes OoOE), which is very complex. The objective again is to remove delays in instruction processing due to resource conflicts.

Now we come to the S-term, which is always the largest contribution to instruction execution cycles. This is because of two primary things: caches got bigger (and slower), and CPU clock speeds just blasted past memory speeds. Back when I did CPU models for mainframes, the cycle speed was 25MHz. A pair of 256 KB caches was accessible in 1 cycle, always. And memory? If we had a cache miss, memory was exactly 25 cycles away. Now there are 40+ MB caches, and memory is 32 TB on big UNIX and mainframe systems.
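The three-term sum can be sketched in a few lines of Python. The class name and the per-instruction cycle values below are invented purely for illustration (they are not measurements of any real CPU); the only thing taken from the description above is that the cost per instruction is E + D + S and that the S-term tends to dominate.

```python
from dataclasses import dataclass


@dataclass
class InstructionCost:
    """Average cycles per instruction, split into the three terms."""
    execute: float  # E-term: intrinsic instruction execution time
    delay: float    # D-term: conflicts between sequential instructions
    storage: float  # S-term: cache misses and memory access

    @property
    def cycles_per_instruction(self) -> float:
        return self.execute + self.delay + self.storage

    def share(self, term: str) -> float:
        """Fraction of the total contributed by 'execute', 'delay' or 'storage'."""
        return getattr(self, term) / self.cycles_per_instruction


# Hypothetical numbers, chosen only to show the shape of the calculation.
cost = InstructionCost(execute=1.0, delay=0.5, storage=2.5)
print(cost.cycles_per_instruction, cost.share("storage"))
```

With these made-up values the total is 4.0 cycles per instruction, and the S-term accounts for 62.5% of it.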

So, what is an S-Unit designer to do? Introduce smaller, faster caches. They may have a miss-rate of 0.01, but they don't contribute any delay on hits. And if you do miss on that first access, mitigate it by having larger per-core secondary caches that impart only single-digit penalties on a hit.
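As a rough illustration of that trade-off, here is a minimal Python sketch comparing a big, slow cache with a small, fast one. The miss rates (0.001 and 0.01) and the "couple of cycles" hit penalty come from the discussion above; the 25-cycle miss penalty and the function name are assumptions picked only to make the comparison concrete.

```python
def storage_cycles_per_instruction(hit_penalty, miss_rate, miss_penalty):
    """Average extra cycles per instruction attributable to one cache level.

    hit_penalty is paid on every instruction; miss_penalty only on misses.
    """
    return hit_penalty + miss_rate * miss_penalty


MISS_PENALTY = 25  # hypothetical cycles to reach the next level on a miss

big_slow = storage_cycles_per_instruction(
    hit_penalty=2, miss_rate=0.001, miss_penalty=MISS_PENALTY)
small_fast = storage_cycles_per_instruction(
    hit_penalty=0, miss_rate=0.01, miss_penalty=MISS_PENALTY)

print(f"big, slow cache:   {big_slow:.3f} cycles per instruction")
print(f"small, fast cache: {small_fast:.3f} cycles per instruction")
```

Even though the small cache misses ten times as often, it comes out far ahead here (0.25 versus about 2.03 cycles per instruction), because the big cache charges its hit penalty on every single instruction.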

Moving onto your next point, "the L3 cache is entirely owned by a thread". Not so. Caches are shared constructs: the L1 and L2 caches, while private to individual cores, are a hodge-podge of cache lines belonging to the OS, applications, threads, etc. Which belong to what is determined by tag bits in the cache line. Same for the L3 cache, but here the cache contains a hodge-podge of lines from all the cores on the socket. Again, tag bits sort them out. And all modern cache architectures are multi-ported, and can handle concurrent access from multiple threads/cores.

For your last point, about removing cores to increase cache size: that is indeed the trend, particularly for server designs. The new Ivy Bridge "server" processors have up to 40MB of L3 cache. This size is appropriate because it is a "server" design, and will handle hundreds of concurrent threads running on up to 15 cores. Makes sense in that context. In the context of a PC, however, that would be overkill. Not for FSX, mind you, but for the general class of everything else a PC typically does, such as email, spreadsheets, web surfing, etc. It's a trade-off.

Sheesh. I hope I did not ramble on too much here. This knowledge is from my "second career", and I still miss the certainty of it (versus my current career in Marketing, where nothing is certain).

John

John Howell


October 25, 2013

Well, we are ranging pretty far afield from LOD radius, but since you are the topic starter...

I wouldn't worry about that too much, I've yet to see any thread on any forum on any web site "stay on topic" ... doesn't bother me at all ... I'm ok with out-of-order execution

Moving onto your next point, "the L3 cache is entirely owned by a thread". Not so. Caches are shared constructs: the L1 and L2 caches, while private to individual cores, are a hodge-podge of cache lines belonging to the OS, applications, threads, etc. Which belong to what is determined by tag bits in the cache line.

That's what I thought, I misread this:

For the L3 (what Intel likes to call Last Level Cache now) it already belongs to the thread so things run merrily along without change, and the reason why L3 caches are so large now.

Sheesh. I hope I did not ramble on too much here. This knowledge is from my "second career", and I still miss the certainty of it (versus my current career in Marketing, where nothing is certain).

Absolutely not rambling, thank you for your insight. It's caused me some bit alignment in my own old CPU

Cheers, Rob.

October 25, 2013

I'm still trying to figure out how we got from terrain LOD to processor architecture and cache lines.

I think we got here from the LOD process that uses multiple cores to load textures ... which then brought up the AffinityMask setting ... which led to setting of thread affinity ... which then led to HT vs. real core, which got ugly but returned to being civil ... no worries, I think it's been a good thread for the most part (to me anyway).
