Battler's speed issues
- neozeed: There would probably be a big improvement if whatever is causing rendering to be very slow when the guest is not in 16-bit/24-bit/32-bit color modes, was fixed. As a comparison, look at the Windows 98 welcome screen in 256 color mode and in 32-bit color mode. In the former, it will take some time to fully display, in the latter it will pretty much display instantly.
Edit: I just realized, the difference might be due to Voodoo which might be disabled in 256 color mode but enabled in 16-bit/24-bit/32-bit color modes. And Voodoo might render faster because it's threaded. Which makes me wonder just how much faster rendering would be in 256 color mode if that was made threaded. Probably a lot. And if disk I/O (as in, for hard disks and CD-ROM's) was made threaded, it would probably improve performance even more.
Edit #2: Nope, not Voodoo. Just logged the PCI write that enables it, and it never did. This means the answer lies elsewhere.
Re: Networking discussion
Battler wrote: "- neozeed: There would probably be a big improvement if whatever is causing rendering to be very slow when the guest is not in 16-bit/24-bit/32-bit color modes, was fixed. As a comparison, look at the Windows 98 welcome screen in 256 color mode and in 32-bit color mode. In the former, it will take some time to fully display, in the latter it will pretty much display instantly.
Edit: I just realized, the difference might be due to Voodoo which might be disabled in 256 color mode but enabled in 16-bit/24-bit/32-bit color modes. And Voodoo might render faster because it's threaded. Which makes me wonder just how much faster rendering would be in 256 color mode if that was made threaded. Probably a lot. And if disk I/O (as in, for hard disks and CD-ROM's) was made threaded, it would probably improve performance even more.
Edit #2: Nope, not Voodoo. Just logged the PCI write that enables it, and it never did. This means the answer lies elsewhere."
I tested a bit with DooM v1.1 and it seems quite playable...
I guess I'd have to build some test program and see what happens regarding write combining...
- SarahWalker
- Site Admin
- Posts: 2055
- Joined: Thu 24 Apr, 2014 4:18 pm
Re: Networking discussion
Battler wrote: "- neozeed: There would probably be a big improvement if whatever is causing rendering to be very slow when the guest is not in 16-bit/24-bit/32-bit color modes, was fixed. As a comparison, look at the Windows 98 welcome screen in 256 color mode and in 32-bit color mode. In the former, it will take some time to fully display, in the latter it will pretty much display instantly."
What CPU? What graphics card? What video driver? Is PCem hitting 100% during this?
Also, what does this have to do with networking?
Quote: "Edit: I just realized, the difference might be due to Voodoo which might be disabled in 256 color mode but enabled in 16-bit/24-bit/32-bit color modes."
Voodoo won't have anything to do with basic GDI rendering, which is what the Windows 98 welcome screen is.
Quote: "Which makes me wonder just how much faster rendering would be in 256 color mode if that was made threaded. Probably a lot."
Very little most likely, as it's either purely software rendering or the blitter is being fed (and therefore limited) by the host CPU.
Quote: "And if disk I/O (as in, for hard disks and CD-ROM's) was made threaded, it would probably improve performance even more."
Less than you might think.
Re: Networking discussion
TomWalker wrote: "What CPU? What graphics card? What video driver? Is PCem hitting 100% during this?"
What do you mean? CPU and graphics card of host or guest?
Guest is a Phoenix S3 Trio64 on Pentium MMX 233 MHz.
Host is an AMD Radeon HD 5450 on Pentium Dual-Core E5700 @ 3.0 GHz.
Windows 98 SE renders much faster in 32-bit color mode than at 256 colors. Anything at 640x480x256 colors lags inside PCem, especially if it's doing any kind of transition (like the Windows 98 SE Welcome screen does, or Little Big Adventure 1 when it's displaying the Adeline Software International logo).
And no idea how many % it hit at 256 colors, all I know is that at 32-bit color, it never went below 100% even while the Welcome screen was being rendered *AND* sound was playing.
Quote: "Very little most likely, as it's either purely software rendering or the blitter is being fed (and therefore limited) by the host CPU."
My host CPU can handle the same 640x480x256 rendering with much less lag when it's done by VMWare, Virtual PC 2007, QEMU, or DOSBox. If there's anything I found affecting it, it's the emulated *guest* CPU, as the lag at 640x480x256 inside PCem is much less noticeable with a 486 than with a Pentium MMX 233 MHz. And considering the lag does not occur at even 1024x768x32-bit color on a Pentium MMX 233 MHz, I don't think the problem is my host CPU being unable to handle the emulation.
TomWalker wrote: "Less than you might think."
The emulation goes down from 100% to as low as 29% during I/O to an emulated hard disk or CD-ROM drive, e.g. during the IDE test at BIOS POST or while a DOS CD-ROM driver is being loaded. And this even if I emulate a 486. And I can assure you, no real PC I've ever had (and I've gone from 386 upwards) slowed down THAT drastically during I/O.
Just as an example, Little Big Adventure 2 reads from the disk (and writes to it, for example the autosave data) while you're moving, which causes a noticeable stall on PCem, regardless of whether I emulate a 486 or a Pentium MMX 233 MHz. On a real Pentium 100 MHz there was no such stall, neither is there on VMWare, Virtual PC 2007, DOSBox, or QEMU. Only on PCem. And only during any kind of disk I/O. When there's no disk I/O going on, the emulator hits 98-100% during the game these days and the game is smooth.
And yes, please split this discussion off to a separate thread as I agree it's off-topic here.
- SarahWalker
Re: Networking discussion
Quote: "Very little most likely, as it's either purely software rendering or the blitter is being fed (and therefore limited) by the host CPU."
Quote: "My host CPU can handle the same 640x480x256 rendering with much less lag when it's done by VMWare, Virtual PC 2007, QEMU, or DOSBox."
Yes, let's compare the performance of the free emulator written largely by one person in their spare time with commercial emulators and much older free emulators with many more people working on them. Didn't we have this conversation about a year ago?
There probably are improvements that can be made in this area. I haven't focused on them because most 'performance critical' uses of PCem don't rely on the 2D blitter, or other esoteric corner cases you may be hitting. I have been looking a lot at performance this last year and there is no performance issue with 8-bit colour in general. Possibly the guest video driver you're running hits some of PCem's weak points in terms of performance. The Windows 98 case is probably the performance issue related to Windows idling, which hits the self-modifying code paths a lot, and also performs a lot of selector loads and other slow protected mode stuff. Possibly in 8-bit mode it's using the blitter, rather than software in 32-bit colour mode, and hence it idles more. Probably the LBA1 issue is also self-modifying code, or either your video card has really terrible host throughput. Most software doesn't hit these issues.
TomWalker wrote: "Less than you might think."
Quote: "The emulation goes down from 100% to as low as 29% during I/O to an emulated hard disk or CD-ROM drive, eg. during IDE test at BIOS POST or while a DOS CD-ROM driver is being loaded. And this even if I emulate a 486."
I'm beginning to think there's something seriously wrong with your host machine.
Re: Battler's speed issues
TomWalker wrote: "I'm beginning to think there's something seriously wrong with your host machine."
Well the 29% is the lowest I've seen, and it was on the Pentium MMX 233 MHz. On 486 it lowers to about 48%. Not as bad but still a slowdown. :p Maybe I should defragment my hard disk and see if that changes anything.
Edit: It seems I misremembered and I apologize for that. Just did a test, and it slows down to 81% on an emulated 486, ~45% on Pentium MMX 233 MHz, and ~29% if I modify the code to increase said Pentium MMX's clock speed to 266 MHz.
Quote: "or either your video card has really terrible host throughput."
Or it might be the host video driver which is AMD Catalyst from last year. Going to update to the latest version now and see what happens.
Edit: Making some tests now. It seems the bigger the factor I multiply the clock cycles for the IN and OUT instructions by, the faster the I/O inside the emulator is and the higher the percentage too. Got it to ~53% at Pentium MMX overclocked to 266 MHz by just multiplying those cycles by 8.
Edit #2: And while I have no idea what Windows 98 SE does, I know that LBA 1 gradually changes the palette from all black to the regular palette so that the colors are brighter on each redraw, until the white of the Adeline logo is fully white. And I suspect it might be doing a lot of I/O writes to the palette registers when it does that. Remember that LBA 1 is for DOS and it uses its own set of graphics drivers.
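To put a rough number on the palette theory, here is a minimal sketch that counts the port writes a fade-in would need. It assumes standard VGA DAC programming (one index write to port 0x3C8, then three 6-bit component writes per entry to port 0x3C9). The function names, the linear fade and the step count are hypothetical; nothing here is taken from LBA 1 or from PCem, and the OUTs are only counted, not performed.

```c
#include <assert.h>
#include <stdint.h>

#define VGA_DAC_WRITE_INDEX 0x3C8  /* standard VGA DAC write-index port */
#define VGA_DAC_DATA        0x3C9  /* standard VGA DAC data port */
#define PALETTE_ENTRIES     256

/* Hypothetical stand-in for the guest's OUT instruction: it only counts writes. */
static unsigned long port_writes;

static void fake_outb(uint16_t port, uint8_t value)
{
    (void)port;
    (void)value;
    port_writes++;
}

/* Upload one full palette scaled to step/steps brightness (0 = black, steps = full). */
static void upload_faded_palette(const uint8_t target[PALETTE_ENTRIES][3],
                                 unsigned step, unsigned steps)
{
    assert(steps > 0 && step <= steps);
    fake_outb(VGA_DAC_WRITE_INDEX, 0);      /* start at entry 0; the index auto-increments */
    for (int i = 0; i < PALETTE_ENTRIES; i++)
        for (int c = 0; c < 3; c++)         /* R, G, B components */
            fake_outb(VGA_DAC_DATA, (uint8_t)(target[i][c] * step / steps));
}

/* Count the OUTs needed for a whole fade-in made of `steps` palette uploads. */
static unsigned long count_fade_port_writes(unsigned steps)
{
    static const uint8_t target[PALETTE_ENTRIES][3] = {{0}}; /* values do not affect the count */
    port_writes = 0;
    for (unsigned s = 1; s <= steps; s++)
        upload_faded_palette(target, s, steps);
    return port_writes;
}
```

Under these assumptions one full upload is 769 port writes, so a 64-step fade would issue 49,216 OUTs. Whether LBA 1 really uploads the whole palette on every step is a guess.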
Re: Battler's speed issues
It isn't Catalyst. If there's a difference from that, it's pure placebo.
Also, I should mention blindly applying optimization flags (like all those "faster superior secret weapon" SSE-related flags you fail to mention) will actually slow down the emulation and bring horrible cache hiccups, even on Intel Core i7. There is a reason why PCem is strictly -march=i686.
Finally don't expect to emulate a PMMX233 at full speed on a low-end budget processor. Pentium is the new Celeron. Lower your high expectations and go emulate a P75 on A lot cache for decent performance, and if it's still not decent, then maybe the claim will have some merit.
Fast x86 emulation is not some walk in the park. It took DOSBox 7 years of intense hacking to get their dynarec where it is today, and even then it's still not PMMX233 fast for a CPU like that (not to mention Win9x is far slower and buggier)
Re: Battler's speed issues
leilei wrote: "Also, I should mention blindly applying optimization flags (like all those "faster superior secret weapon" SSE-related flags you fail to mention) will actually slow down the emulation and bring horrible cache hiccups, even on Intel Core i7. There is a reason why PCem is strictly -march=i686."
Yet for me, it's faster when compiled with those flags.
Quote: "Finally don't expect to emulate a PMMX233 at full speed on a low-end budget processor. Pentium is the new Celeron. Lower your high expectations and go emulate a P75 on A lot cache for decent performance, and if it's still not decent, then maybe the claim will have some merit."
Then please explain to me why the Pentium MMX 233 MHz is at ~100% when the guest is in 32-bit color, and the only times there's lag is when it's at 256 colors or less. Also, I have cache set to "A little", which I found to be the fastest option.
Oh and also, apparently on my low-end budget hardware, I can watch 1080p @ 60 fps videos on YouTube with essentially no lag. Are you going to tell me x86 emulation is more intensive than that?
Not to mention that I know plenty of people (you can connect to irc.rol.im / #RIS to talk to them) that have much better hardware than I do (talking i5's and i7's here) and yet still complain about PCem being slow. Granted for them it's slightly faster than for me, but evidently still not enough.
Last edited by Battler on Sun 30 Aug, 2015 12:41 am, edited 1 time in total.
Re: Battler's speed issues
Doesn't occur here on my r330 i686 build. Still remains 100% when I switch to 32-bit and back.
Attachment: PCem256.png (15.76 KiB)
Re: Battler's speed issues
- leilei: Yes, it remains at 100% also for me. But try opening the Welcome screen at 32-bit color and then at 256 colors. You'll then notice just how big the difference in rendering speed is.
Edit: I just did another test. I set PCem's affinity to only 1 core and the performance was not affected at all. This means PCem never takes advantage of both cores. So is my hardware too slow, or does the emulator simply not take the fullest advantage of my hardware? I'd personally say the latter.
Re: Battler's speed issues
What is a good way to get numbers on the overall performance?
I just tried a bunch of different things, and all I've seemed to prove is that PCem can give performance that measures from the inside at a constant rate... which is a 'good thing'..
This is running doom v1.1 as "-forcedemo -timedemo demo1" ... eyeballing it, they all slow down about the same rate in the same places.. Oddly enough even -O0 is quite playable. Is there something to 'sample' the status to a file, and a special way to exit PCem from within? Just so there is some way to stress it with a consistent test and see what changes have what, if any effect?
1282 -O3 -march=native -DRELEASE_BUILD
1283 -O2 -DRELEASE_BUILD
1283 -O2 -DRELEASE_BUILD -fomit-frame-pointer
1283 -O3 -DRELEASE_BUILD -fomit-frame-pointer
1283 -O3 -march=i686 -DRELEASE_BUILD -fomit-frame-pointer
sound off
1233 -O3 -march=i686 -mtune=generic -mfpmath=387 -fomit-frame-pointer -DRELEASE_BUILD
1234 -O1 -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD
1234 -O3 -march=core2 -mtune=generic -mavx -mfpmath=sse -fomit-frame-pointer -DRELEASE_BUILD
1234 -O3 -march=core2 -mtune=generic -mavx -mfpmath=sse -fomit-frame-pointer -DRELEASE_BUILD
1231 -O0 -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD
for what it's worth -march=native -O3 -funroll-loops -fomit-frame-pointer 'feels' the fastest but I'd rather have numbers than my feelings....
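As a starting point for the "sample the status to a file" idea, here is a minimal sketch of what such a logger could look like. Everything in it is hypothetical (the struct, the function names, the CSV layout); it is not an existing PCem facility, and it assumes the emulator can report how much guest time it emulated during a given stretch of host time.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical sample: how much guest time was emulated during a host interval. */
struct speed_sample {
    double host_ms;   /* wall-clock time elapsed on the host */
    double guest_ms;  /* emulated time that passed in the guest */
};

/* Speed as a percentage: 100 means the guest runs in real time. */
static double speed_percent(struct speed_sample s)
{
    assert(s.host_ms > 0.0);
    return 100.0 * s.guest_ms / s.host_ms;
}

/* Append one CSV row per sample so runs with different CFLAGS can be compared later.
   Returns 0 on success, -1 if the file could not be opened or closed. */
static int log_sample(const char *path, const char *label, struct speed_sample s)
{
    FILE *f = fopen(path, "a");
    if (!f)
        return -1;
    fprintf(f, "%s,%.1f\n", label, speed_percent(s));
    return fclose(f) == 0 ? 0 : -1;
}

/* Mean speed over a whole run, weighted by host time. */
static double mean_speed_percent(const struct speed_sample *samples, int count)
{
    assert(count > 0);
    double host = 0.0, guest = 0.0;
    for (int i = 0; i < count; i++) {
        host += samples[i].host_ms;
        guest += samples[i].guest_ms;
    }
    assert(host > 0.0);
    return 100.0 * guest / host;
}
```

Logging one row per second with the CFLAGS string as the label, over the same timedemo, would give a file per build that can be averaged instead of eyeballed.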
- ppgrainbow
- Posts: 479
- Joined: Thu 04 Sep, 2014 7:03 am
Re: Battler's speed issues
My current video card, the EVGA 8400GS with 1 GB DDR3 video RAM (which will lose all driver support starting in April 2016), does not have very good host throughput either.
But I'm lucky that I can fully emulate a machine up to an Intel 80486 DX4 running at 100 MHz and anything up to Windows NT 3.51 SP5. Anything faster or later than that will run at a slower pace.
I had updated my graphics driver to the 341.XX series, but I suspect that the newer graphics drivers have bugs and may not work correctly with this graphics card. If they don't work correctly, I guess that I'll have to do a System Restore and revert.
Re: Battler's speed issues
I just noticed this unnecessary know-it-all snap added to an edited post
Battler wrote: "Oh and also, apparently on my low-end budget hardware, I can watch 1080p @ 60 fps videos on YouTube with essentially no lag. Are you going to tell me x86 emulation is more intensive than that?"
Accelerated video decoding is nowhere near full x86 emulation. You might as well claim about how easy it is to emulate an Xbox because your video card can do Direct3D 8.
Battler wrote: "- leilei: Yes, it remains at 100% also for me. But try opening the Welcome screen at 32-bit color and then at 256 colors. You'll then notice just how big the difference in rendering speed is."
Because the guest CPU is shoving pixels through color translation lookup tables, of course it's going to be fully utilized. It's worse in 16 colors.
Battler wrote: "This means PCem never takes advantage of both cores."
No shit, sherlock. It's not emulating dual CPUs or dual core CPUs. CPUs don't work like video cards where you can thread out alternating lines.
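For readers wondering what "shoving pixels through color translation lookup tables" might involve, here is a minimal sketch of one plausible reading: a palettized blit where each source pixel index is remapped to the nearest screen palette entry through a 256-entry table. The function names and the nearest-color rule are assumptions for illustration only; this is not Windows GDI code or PCem code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct rgb { uint8_t r, g, b; };

/* Find the entry of a 256-colour palette closest to `c` (squared RGB distance). */
static uint8_t nearest_index(const struct rgb palette[256], struct rgb c)
{
    long best_dist = -1;
    uint8_t best = 0;
    for (int i = 0; i < 256; i++) {
        long dr = palette[i].r - c.r, dg = palette[i].g - c.g, db = palette[i].b - c.b;
        long dist = dr * dr + dg * dg + db * db;
        if (best_dist < 0 || dist < best_dist) {
            best_dist = dist;
            best = (uint8_t)i;
        }
    }
    return best;
}

/* Build a source-index -> screen-index translation table once per bitmap. */
static void build_xlate(const struct rgb src_pal[256],
                        const struct rgb screen_pal[256], uint8_t xlate[256])
{
    for (int i = 0; i < 256; i++)
        xlate[i] = nearest_index(screen_pal, src_pal[i]);
}

/* The per-pixel work: every pixel costs an extra table lookup on the guest CPU. */
static void blit_translated(uint8_t *dst, const uint8_t *src, size_t count,
                            const uint8_t xlate[256])
{
    assert(dst && src && xlate);
    for (size_t i = 0; i < count; i++)
        dst[i] = xlate[src[i]];
}
```

In a true-color mode the source colors could be written out directly, so this extra per-pixel indirection would not be needed; that is the gist of the argument, not a measurement.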
Re: Battler's speed issues
leilei wrote: "I just noticed this unneccessary knowitall snap added to an edited post"
Please cut your ad hominem attacks.
Quote: "Accelerated video decoding is nowhere around full x86 emulation. You might as well claim about how easy it is to emulate an Xbox because your video can do Direct3D 8."
Do you have any data to back up your claim that x86 emulation is more computationally intensive than decoding 1080p60 H.264?
Re: Battler's speed issues
OK, I am testing with a 430HX BIOS now, and with it, the I/O is damned FAST. 101% at the LBA 1 intro screen and no lag. I'm starting to think the BIOS/emulated chipset affects I/O speed.
Re: Battler's speed issues
neozeed wrote: "for what it's worth -march=native -O3 -funroll-loops -fomit-frame-pointer 'feels' the fastest but I'd rather have numbers than my feelings...."
It's a different story for me.
I found a nice spot in Quake to get nice consistent percentage of execution - E2M1's initial deathmatch starting position, in 320x400. It's the end-of-level room with the laser shooters which gives a constant and consistent surfacecache update, the dynamic lights put the Pentium CPU to work..
so here's my data with an AMD!!!
As you can see, -O2 beats -O3, and gratuitous SSE optimization flags don't mean jack shit.
s3vbe20 /install
cd quake
quake -nosound -nojoy -nocdaudio -listen 8 +map e2m1 +vid_mode 7
80% - -O3 -march=native -DRELEASE_BUILD
83% - -O2 -DRELEASE_BUILD
84% - -O2 -DRELEASE_BUILD -fomit-frame-pointer
81% - -O3 -DRELEASE_BUILD -fomit-frame-pointer
81% - -O3 -march=i686 -DRELEASE_BUILD -fomit-frame-pointer
81% - -O3 -march=i686 -mtune=generic -mfpmath=387 -fomit-frame-pointer -DRELEASE_BUILD
79% - -O1 -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD
CRASH - -O3 -march=core2 -mtune=generic -mavx -mfpmath=sse -fomit-frame-pointer -DRELEASE_BUILD
CRASH - -O3 -march=core2 -mtune=generic -mavx -mfpmath=sse -fomit-frame-pointer -DRELEASE_BUILD
51% - -O0 -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD
83% - -march=native -O2 -funroll-loops -fomit-frame-pointer
CRASH - -O3 -march=amdfam10 -mtune=generic -mavx -mfpmath=sse -fomit-frame-pointer -DRELEASE_BUILD
81% - -O3 -march=amdfam10 -mtune=generic -mfpmath=sse -fomit-frame-pointer -DRELEASE_BUILD
85% - -O2 -march=native -funroll-loops -fomit-frame-pointer
Attachment: pcems.png (7.33 KiB)
Re: Battler's speed issues
Yay real numbers!
I get 100% on the first test with an MMX 166... So I have to crank it up to 266 MHz (lucky me!)
Well that was.. underwhelming.
Pretty amazing to see that for the most part the compiler flags really don't matter. -O2 is 'good enough' on its own. Well, that is what I really wanted to see: some good metrics!
OK, I'm using MS-DOS 5.00 and no config.sys just the commands above in autoexec.bat.
My Processor is..
Code:
Number of cores 4 (max 8)
Number of threads 8 (max 16)
Name Intel Xeon E3 1230 v3
Codename Haswell-WS
Specification Intel(R) Xeon(R) CPU E3-1230 v3 @ 3.30GHz
Package (platform ID) Socket 1150 LGA (0x1)
CPUID 6.C.3
Extended CPUID 6.3C
Core Stepping C0
Technology 22 nm
TDP Limit 80.0 Watts
Tjmax 100.0 °C
Core Speed 3500.4 MHz
Multiplier x Bus Speed 35.0 x 100.0 MHz
Stock frequency 3300 MHz
Instructions sets MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, EM64T, VT-x, AES, AVX, AVX2, FMA3
L1 Data cache 4 x 32 KBytes, 8-way set associative, 64-byte line size
L1 Instruction cache 4 x 32 KBytes, 8-way set associative, 64-byte line size
L2 cache 4 x 256 KBytes, 8-way set associative, 64-byte line size
L3 cache 8 MBytes, 16-way set associative, 64-byte line size
FID/VID Control yes
Code:
81% CFLAGS = -O3 -march=native -fomit-frame-pointer -DRELEASE_BUILD
80% CFLAGS = -O3 -march=i686 -mtune=generic -mfpmath=387 -fomit-frame-pointer -DRELEASE_BUILD
80% CFLAGS = -O3 -march=core2 -mtune=generic -mfpmath=sse -fomit-frame-pointer -DRELEASE_BUILD
81% CFLAGS = -O3 -march=core2 -mtune=generic -mavx -mfpmath=sse -fomit-frame-pointer -DRELEASE_BUILD
81% CFLAGS = -O3 -march=corei7-avx -mtune=corei7-avx -mno-avx -mno-aes -fomit-frame-pointer -DRELEASE_BUILD
55% CFLAGS = -O0 -g
79% CFLAGS = -Os -fomit-frame-pointer -DRELEASE_BUILD
79% CFLAGS = -Os -fomit-frame-pointer -funroll-loops -DRELEASE_BUILD
79% CFLAGS = -O2 -fomit-frame-pointer -funroll-loops -DRELEASE_BUILD
80% CFLAGS = -O2 -fomit-frame-pointer -DRELEASE_BUILD
80% CFLAGS = -O2 -DRELEASE_BUILD
79% CFLAGS = -O1 -DRELEASE_BUILD
79% CFLAGS = -Os -DRELEASE_BUILD
81% -O3 -march=corei7-avx -mtune=corei7-avx -mno-avx -mno-aes -mfpmath=387 -fomit-frame-pointer -DRELEASE_BUILD
80% -O2 -march=corei7-avx -mtune=corei7-avx -mno-avx -mno-aes -mfpmath=387 -fomit-frame-pointer -DRELEASE_BUILD
Now let's try the IDT 166 with no compiler!
(feels VERY choppy though)
Code:
81% CFLAGS = -O3 -march=i686 -mtune=generic -mfpmath=387 -fomit-frame-pointer -DRELEASE_BUILD
83% CFLAGS = -O3 -march=core2 -mtune=generic -mfpmath=sse -fomit-frame-pointer -DRELEASE_BUILD
83% CFLAGS = -O3 -march=core2 -mtune=generic -mavx -mfpmath=sse -fomit-frame-pointer -DRELEASE_BUILD
86% CFLAGS = -O3 -march=corei7-avx -mtune=corei7-avx -mno-avx -mno-aes -fomit-frame-pointer -DRELEASE_BUILD
84% CFLAGS = -O2 -DRELEASE_BUILD
74% CFLAGS = -O1 -DRELEASE_BUILD
70% CFLAGS = -Os -DRELEASE_BUILD
51% CFLAGS = -O0 -DRELEASE_BUILD
85% -O3 -march=corei7-avx -mtune=corei7-avx -mno-avx -mno-aes -mfpmath=387 -fomit-frame-pointer -DRELEASE_BUILD
85% -O2 -march=corei7-avx -mtune=corei7-avx -mno-avx -mno-aes -mfpmath=387 -fomit-frame-pointer -DRELEASE_BUILD
Re: Battler's speed issues
-flto and -ffast-math seem to help a bit. More random switch trial-and-error results
i've also noticed -funroll-loops and -O2 is slower for PGO builds for some reason, and that would take a long time to test and compare PGO builds so i didn't do that. Also when trying to do some profile generation I set the Voodoo threads to 1 and crank the CPU to PMMX 300 with Infinite cache and then run Battlezone 2 demo (which has no menu system, DX6 zealousy check or movies to fail on) and watch that crawl through the mission intro for awhile, since BZ2 is a particularly Intel-favoring MMX-abusing game.
80% - -march=native -Os -flto -ffast-math -funroll-loops -fomit-frame-pointer
80% - -march=native -Os -flto -ffast-math -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD
79% - -march=native -Os -flto -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD
79% - -march=i686 -Os -flto -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD
87% - -march=native -O2 -flto -ffast-math -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD
86% - -march=native -O2 -flto -ffast-math -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD -fno-branch-count-reg
85% - -march=native -O2 -flto -ffast-math -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD -fmerge-all-constants
86% - -march=native -O2 -flto -ffast-math -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD -fgcse-sm
84% - -march=native -O2 -flto -ffast-math -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD -ffloat-store
85% - -march=native -O2 -flto -ffast-math -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD -fno-math-errno
86% - -march=native -O2 -flto -ffast-math -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD -funsafe-math-optimizations
85% - -march=native -O2 -flto -ffast-math -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD -fno-trapping-math
86% - -march=native -O2 -flto -ffast-math -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD -funroll-all-loops
86% - -march=native -O2 -flto -ffast-math -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD -fprefetch-loop-arrays
And i've already explored the options for openmp parallel instruction tree automatic parallelization whatever, even including a bleeding edge TDM-GCC setup with aggressive parallelize-all-loops usage. Not any faster and introduces a nasty crash when exiting.
Re: Battler's speed issues
Have you tested the build with "-fprofile-use"?
Re: Battler's speed issues
PCem-X is even slower when set to use a Pentium II Overdrive 333 on a 440FX motherboard. BTW, just in case Tom needs an acid test for Pentium 2 additions, XP setup stumped OBattler for quite a while. He eventually had to bypass most of Tom's segment code in SYSENTER/SYSEXIT since XP doesn't work otherwise. I've also done a little work on adding limit checks to the segment handling code, and it works wonderfully in Windows 98 SE and Little Big Adventure 2. However, XP is far slower on the Pentium Pro than on a Pentium MMX at the same clock speed, since SYSENTER/SYSEXIT are currently interpreted rather than recompiled. Our P6 timing is also inaccurate currently, using the Pentium timings right now. This is another priority since accurate P6 emulation will pave the way for emulating much more modern processors with correct timings.
Another problem that has arisen is a RIVA 128-specific crashing bug that seems to be triggered by the drivers. The bug appears to be dynarec-related since that's where it crashes, and we have yet to pinpoint the exact cause. However, I can say without a doubt that it is not the RIVA 128 code causing it, since reverting to earlier revisions that did work doesn't help. The evidence seems to point to this commit, however: https://github.com/OBattler/PCem-X/comm ... 686613d2af
- SarahWalker
Re: Battler's speed issues
Alegend45 wrote: "Our P6 timing is also inaccurate currently, using the Pentium timings right now. This is another priority since accurate P6 emulation will pave the way for emulating much more modern processors with correct timings."
With the current interpreter and recompiler designs you will not get anything even remotely close to accurate timings for P6 processors, or any out-of-order processors. It's really not worth the effort.
Quote: "The evidence seems to point to this commit, however: https://github.com/OBattler/PCem-X/comm ... 686613d2af"
I am not going to waste my time trying to fix bugs in other branches of PCem. I have quite enough to be working on as it is.
Re: Battler's speed issues
- TomWalker: And no one even asked you to. You completely missed the point of Alegend45's post.
Edit: And also, my work is there on GitHub for the taking. I would personally love it if you integrated at least some of it into your emulator. So it's up to you if you want to do it or not.
- SarahWalker
Re: Battler's speed issues
Battler wrote: "- TomWalker: And noone even asked you to. You completely missed the point of Alegend45's post."
If his post wasn't implying that I should fix it, why did he bother posting it here?
Re: Battler's speed issues
- TomWalker: No idea. I guess just to mention what we were doing. :p
- SarahWalker
Re: Battler's speed issues
Battler wrote: "Edit: And also, my work is there on GitHub for the taking. I would personally love it if you integrated at least some of it into your emulator. So it's up to you if you want to do it or not."
If you want your work in mainline then please provide patches. I'm rather unlikely to find the time to dig through GitHub to isolate worthwhile changes, rebase them, test them, and then commit them here.
Re: Battler's speed issues
Battler wrote: "- TomWalker: No idea. I guess just to mention what we were doing. :p"
Basically, yes.
Alegend45 wrote: "Our P6 timing is also inaccurate currently, using the Pentium timings right now. This is another priority since accurate P6 emulation will pave the way for emulating much more modern processors with correct timings."
TomWalker wrote: "With the current interpreter and recompiler designs you will not get anything even remotely close to accurate timings for P6 processors, or any out-of-order processors. It's really not worth the effort."
Oh REALLY? Then how did you get somewhat accurate timings for the Pentium? Oh, you counted the number of cycles for a timing block and used that to increment the cycle count? Because that's basically what we would be doing, except on a grander scale, by modelling micro-ops (Agner docs have LOTS of info on this) and out-of-order processing to basically come up with timing blocks that we can calculate the approximate cycles for.
Anyway, I also have not only bugs in your dynarec to fix, but I also have to write an 8042 interpreter so that I can create a more accurate AT keyboard emulation. I might do the same for XT keyboards, but I'm not sure. I also plan on adding accurate Roland MT-32 emulation, which, while not using Munt code, will use an 8096 interpreter and will be based on studying the MAME implementation, with maybe some Munt influence.
Re: Battler's speed issues
- Alegend45: Pentium is not out-of-order, while Pentium Pro and Pentium II are. So I actually agree with what Tom is saying here. We can get a bad approximation at best, and even that won't be much more accurate than what we already have. :p
- SarahWalker
Re: Battler's speed issues
My main concern with modelling out-of-order CPUs is that while, with a vast amount of effort, you could model execution time within a block, you would be unable to model time across multiple blocks. This would be something of an issue for a P6 core, as the average code block is about 6 instructions, and a P6 can have something like 60 uops in flight at any one time. Any timing accuracy on such an emulator would fall to pieces extremely quickly on anything beyond the most trivial code.
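The cross-block problem can be illustrated with a toy model. The sketch below is only an illustration under heavy assumptions: an idealised machine with unlimited issue width, one optional dependency per instruction, and made-up latencies. The struct and function names are hypothetical and bear no relation to PCem's recompiler or to real P6 behaviour.

```c
#include <assert.h>

#define MAX_INSNS 64

struct insn {
    int latency;  /* cycles until the result is available */
    int dep;      /* index of the earlier instruction it waits on, or -1 */
};

/* Idealised out-of-order machine: an instruction starts as soon as its input is ready. */
static int cycles_whole_stream(const struct insn *code, int count)
{
    int finish[MAX_INSNS];
    int total = 0;
    assert(count >= 0 && count <= MAX_INSNS);
    for (int i = 0; i < count; i++) {
        assert(code[i].dep < i);
        int ready = code[i].dep >= 0 ? finish[code[i].dep] : 0;
        finish[i] = ready + code[i].latency;
        if (finish[i] > total)
            total = finish[i];
    }
    return total;
}

/* Block-at-a-time model: each block is timed in isolation and the results are
   summed, so work from one block can never overlap with the next. */
static int cycles_per_block(const struct insn *code, int count, int block_len)
{
    int finish[MAX_INSNS];  /* finish times relative to the start of the current block */
    int total = 0;
    assert(count >= 0 && count <= MAX_INSNS && block_len > 0);
    for (int start = 0; start < count; start += block_len) {
        int end = start + block_len < count ? start + block_len : count;
        int block_cycles = 0;
        for (int i = start; i < end; i++) {
            assert(code[i].dep < i);
            /* Inputs produced by earlier blocks are treated as already available. */
            int ready = code[i].dep >= start ? finish[code[i].dep] : 0;
            finish[i] = ready + code[i].latency;
            if (finish[i] > block_cycles)
                block_cycles = finish[i];
        }
        total += block_cycles;
    }
    return total;
}
```

With two 6-instruction blocks that each begin with an independent 10-cycle operation, the whole-stream model gives 10 cycles while the per-block sum gives 20, because the second long operation cannot overlap with the first. In this toy the per-block figure can only err on the high side; a real core with limited resources would not be that tidy.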