Battler's speed issues

Discussion of development and patch submission.
Battler
Posts: 793
Joined: Sun 06 Jul, 2014 7:05 pm

Battler's speed issues

Post by Battler »

- neozeed: There would probably be a big improvement if whatever is causing rendering to be very slow when the guest is not in 16-bit/24-bit/32-bit color modes, was fixed. As a comparison, look at the Windows 98 welcome screen in 256 color mode and in 32-bit color mode. In the former, it will take some time to fully display, in the latter it will pretty much display instantly.

Edit: I just realized, the difference might be due to Voodoo which might be disabled in 256 color mode but enabled in 16-bit/24-bit/32-bit color modes. And Voodoo might render faster because it's threaded. Which makes me wonder just how much faster rendering would be in 256 color mode if that was made threaded. Probably a lot. And if disk I/O (as in, for hard disks and CD-ROMs) was made threaded, it would probably improve performance even more.

Edit #2: Nope, not Voodoo. Just logged the PCI write that enables it, and it never did. This means the answer lies elsewhere.
neozeed
Posts: 176
Joined: Tue 08 Jul, 2014 4:41 am
Location: Hong Kong SAR

Re: Networking discussion

Post by neozeed »

Battler wrote:- neozeed: There would probably be a big improvement if whatever is causing rendering to be very slow when the guest is not in 16-bit/24-bit/32-bit color modes, was fixed. As a comparison, look at the Windows 98 welcome screen in 256 color mode and in 32-bit color mode. In the former, it will take some time to fully display, in the latter it will pretty much display instantly.

Edit: I just realized, the difference might be due to Voodoo which might be disabled in 256 color mode but enabled in 16-bit/24-bit/32-bit color modes. And Voodoo might render faster because it's threaded. Which makes me wonder just how much faster rendering would be in 256 color mode if that was made threaded. Probably a lot. And if disk I/O (as in, for hard disks and CD-ROMs) was made threaded, it would probably improve performance even more.

Edit #2: Nope, not Voodoo. Just logged the PCI write that enables it, and it never did. This means the answer lies elsewhere.
I tested a bit with DooM v1.1 and it seems quite playable...

I guess I'd have to build some test program and see what happens regarding write combining...
SarahWalker
Site Admin
Posts: 2055
Joined: Thu 24 Apr, 2014 4:18 pm

Re: Networking discussion

Post by SarahWalker »

Battler wrote:- neozeed: There would probably be a big improvement if whatever is causing rendering to be very slow when the guest is not in 16-bit/24-bit/32-bit color modes, was fixed. As a comparison, look at the Windows 98 welcome screen in 256 color mode and in 32-bit color mode. In the former, it will take some time to fully display, in the latter it will pretty much display instantly.
What CPU? What graphics card? What video driver? Is PCem hitting 100% during this?

Also, what does this have to do with networking?
Edit: I just realized, the difference might be due to Voodoo which might be disabled in 256 color mode but enabled in 16-bit/24-bit/32-bit color modes.
Voodoo won't have anything to do with basic GDI rendering, which is what the Windows 98 welcome screen is.
Which makes me wonder just how much faster rendering would be in 256 color mode if that was made threaded. Probably a lot.
Very little most likely, as it's either purely software rendering or the blitter is being fed (and therefore limited) by the host CPU.
And if disk I/O (as in, for hard disks and CD-ROMs) was made threaded, it would probably improve performance even more.
Less than you might think.
Battler
Posts: 793
Joined: Sun 06 Jul, 2014 7:05 pm

Re: Networking discussion

Post by Battler »

TomWalker wrote:What CPU? What graphics card? What video driver? Is PCem hitting 100% during this?
What do you mean? CPU and graphics card of host or guest?
Guest is a Phoenix S3 Trio64 on Pentium MMX 233 MHz.
Host is an AMD Radeon HD 5450 on Pentium Dual-Core E5700 @ 3.0 GHz.
Windows 98 SE renders much faster in 32-bit color mode than at 256 colors. Anything at 640x480x256 colors lags inside PCem, especially if it's doing any kind of transition (like the Windows 98 SE Welcome screen does, or Little Big Adventure 1 when it's displaying the Adeline Software International logo).
And no idea how many % it hit at 256 colors, all I know is that at 32-bit color, it never went below 100% even while the Welcome screen was being rendered *AND* sound was playing.
Very little most likely, as it's either purely software rendering or the blitter is being fed (and therefore limited) by the host CPU.
My host CPU can handle the same 640x480x256 rendering with much less lag when it's done by VMware, Virtual PC 2007, QEMU, or DOSBox. If there's anything I found affecting it, it's the emulated *guest* CPU, as the lag at 640x480x256 inside PCem is much less noticeable with a 486 than with a Pentium MMX 233 MHz. And considering the lag does not occur at even 1024x768x32-bit color on a Pentium MMX 233 MHz, I don't think it's my host CPU being unable to handle the emulation.
TomWalker wrote:Less than you might think.
The emulation goes down from 100% to as low as 29% during I/O to an emulated hard disk or CD-ROM drive, e.g. during the IDE test at BIOS POST or while a DOS CD-ROM driver is being loaded. And this happens even if I emulate a 486. And I can assure you, no real PC I've ever had (and I've gone from 386 upwards) slowed down THAT drastically during I/O.
Just as an example, Little Big Adventure 2 reads from the disk (and writes to it, for example the autosave data) while you're moving, which causes a noticeable stall on PCem, regardless of whether I emulate a 486 or a Pentium MMX 233 MHz. On a real Pentium 100 MHz there was no such stall, nor is there one on VMware, Virtual PC 2007, DOSBox, or QEMU. Only on PCem. And only during any kind of disk I/O. When there's no disk I/O going on, the emulator hits 98-100% during the game these days and the game is smooth.

And yes, please split this discussion off to a separate thread as I agree it's off-topic here.
SarahWalker
Site Admin
Posts: 2055
Joined: Thu 24 Apr, 2014 4:18 pm

Re: Networking discussion

Post by SarahWalker »

Very little most likely, as it's either purely software rendering or the blitter is being fed (and therefore limited) by the host CPU.
My host CPU can handle the same 640x480x256 rendering with much less lag when it's done by VMware, Virtual PC 2007, QEMU, or DOSBox.
Yes, let's compare the performance of the free emulator written largely by one person in their spare time with commercial emulators and much older free emulators with many more people working on them. Didn't we have this conversation about a year ago?

There probably are improvements that can be made in this area. I haven't focused on them because most 'performance critical' uses of PCem don't rely on the 2D blitter, or other esoteric corner cases you may be hitting. I have been looking a lot at performance this last year and there is no performance issue with 8-bit colour in general. Possibly the guest video driver you're running hits some of PCem's weak points in terms of performance. The Windows 98 case is probably the performance issue related to Windows idling, which hits the self-modifying code paths a lot, and also performs a lot of selector loads and other slow protected mode stuff. Possibly in 8-bit mode it's using the blitter, rather than software in 32-bit colour mode, and hence it idles more. Probably the LBA1 issue is also self-modifying code, or else your video card has really terrible host throughput. Most software doesn't hit these issues.
TomWalker wrote:Less than you might think.
The emulation goes down from 100% to as low as 29% during I/O to an emulated hard disk or CD-ROM drive, eg. during IDE test at BIOS POST or while a DOS CD-ROM driver is being loaded. And this even if I emulate a 486.
I'm beginning to think there's something seriously wrong with your host machine.
Battler
Posts: 793
Joined: Sun 06 Jul, 2014 7:05 pm

Re: Battler's speed issues

Post by Battler »

TomWalker wrote:I'm beginning to think there's something seriously wrong with your host machine.
Well, the 29% is the lowest I've seen, and it was on the Pentium MMX 233 MHz. On a 486 it drops to about 48%. Not as bad but still a slowdown. :p Maybe I should defragment my hard disk and see if that changes anything.

Edit: It seems I misremembered and I apologize for that. Just did a test, and it slows down to 81% on an emulated 486, ~45% on Pentium MMX 233 MHz, and ~29% if I modify the code to increase said Pentium MMX's clock speed to 266 MHz.
or else your video card has really terrible host throughput.
Or it might be the host video driver which is AMD Catalyst from last year. Going to update to the latest version now and see what happens.

Edit: Making some tests now. It seems the bigger the factor I multiply the clock cycles for the IN and OUT instructions by, the faster the I/O inside the emulator is and the higher the percentage too. Got it to ~53% on the Pentium MMX overclocked to 266 MHz by just multiplying those cycles by 8.

Edit #2: And while I have no idea what Windows 98 SE does, I know that LBA 1 gradually changes the palette from all black to the regular palette so that the colors are brighter on each redraw, until the white of the Adeline logo is fully white. And I suspect it might be doing a lot of I/O writes to the palette registers when it does that. Remember that LBA 1 is for DOS and it uses its own set of graphics drivers.
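
As a rough illustration of the suspicion above (not taken from LBA 1 or PCem): on standard VGA hardware the palette is loaded by writing a start index to port 0x3C8 and then three 6-bit components per entry to port 0x3C9. The sketch below counts the port writes a black-to-full fade would generate if the whole 256-entry palette were reloaded on every step. The step count of 64 and the per-OUT cycle costs are assumptions chosen for illustration only, and the function names are hypothetical.

```python
DAC_WRITE_INDEX = 0x3C8  # VGA DAC write-address register
DAC_DATA = 0x3C9         # VGA DAC data register (R, G, B written in turn)


def fade_port_writes(target_palette, steps):
    """Yield (port, value) pairs for a black-to-target palette fade.

    target_palette: 256 (r, g, b) tuples with 6-bit components (0..63).
    steps: number of intermediate palettes (an assumption, not measured).
    """
    for step in range(1, steps + 1):
        yield (DAC_WRITE_INDEX, 0)          # restart at palette entry 0
        for r, g, b in target_palette:
            for component in (r, g, b):
                # Scale each component towards its final value.
                yield (DAC_DATA, component * step // steps)


def estimate_guest_cycles(n_writes, cycles_per_out):
    """Guest cycles spent in OUT instructions alone (assumed flat cost)."""
    return n_writes * cycles_per_out


white = [(63, 63, 63)] * 256
writes = list(fade_port_writes(white, steps=64))
print(len(writes), "port writes for the whole fade")
for cost in (12, 12 * 8):  # hypothetical cycle costs per OUT
    print(cost, "cycles/OUT ->", estimate_guest_cycles(len(writes), cost), "guest cycles")
```

Each fade step costs 769 port writes under these assumptions (one index write plus 768 data writes), so a 64-step fade issues 49216 of them. How an emulator charges each of those writes therefore has a visible effect on how long the fade takes.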
leilei
Posts: 1040
Joined: Fri 25 Apr, 2014 4:47 pm

Re: Battler's speed issues

Post by leilei »

It isn't Catalyst. If there's a difference from that, it's pure placebo.

Also, I should mention blindly applying optimization flags (like all those "faster superior secret weapon" SSE-related flags you fail to mention) will actually slow down the emulation and bring horrible cache hiccups, even on Intel Core i7. There is a reason why PCem is strictly -march=i686.

Finally, don't expect to emulate a PMMX233 at full speed on a low-end budget processor. Pentium is the new Celeron. Lower your high expectations and go emulate a P75 on "A lot" cache for decent performance, and if it's still not decent, then maybe the claim will have some merit.


Fast x86 emulation is not some walk in the park. It took DOSBox 7 years of intense hacking to get their dynarec where it is today, and even then it's still not PMMX233 fast for a CPU like that (not to mention Win9x is far slower and buggier)
Battler
Posts: 793
Joined: Sun 06 Jul, 2014 7:05 pm

Re: Battler's speed issues

Post by Battler »

leilei wrote:Also, I should mention blindly applying optimization flags (like all those "faster superior secret weapon" SSE-related flags you fail to mention) will actually slow down the emulation and bring horrible cache hiccups, even on Intel Core i7. There is a reason why PCem is strictly -march=i686.
Yet for me, it's faster when compiled with those flags.
Finally, don't expect to emulate a PMMX233 at full speed on a low-end budget processor. Pentium is the new Celeron. Lower your high expectations and go emulate a P75 on "A lot" cache for decent performance, and if it's still not decent, then maybe the claim will have some merit.
Then please explain to me why the Pentium MMX 233 MHz is at ~100% when the guest is in 32-bit color, and the only times there's lag is when it's at 256 colors or less. Also, I have cache set to "A little", which I found to be the fastest option.

Oh and also, apparently on my low-end budget hardware, I can watch 1080p @ 60 fps videos on YouTube with essentially no lag. Are you going to tell me x86 emulation is more intensive than that?

Not to mention that I know plenty of people (you can connect to irc.rol.im / #RIS to talk to them) that have much better hardware than I do (talking i5's and i7's here) and yet still complain about PCem being slow. Granted for them it's slightly faster than for me, but evidently still not enough.
Last edited by Battler on Sun 30 Aug, 2015 12:41 am, edited 1 time in total.
leilei
Posts: 1040
Joined: Fri 25 Apr, 2014 4:47 pm

Re: Battler's speed issues

Post by leilei »

Doesn't occur here on my r330 i686 build. Still remains 100% when I switch to 32-bit and back.
Attachments
PCem256.png
Battler
Posts: 793
Joined: Sun 06 Jul, 2014 7:05 pm

Re: Battler's speed issues

Post by Battler »

- leilei: Yes, it also remains at 100% for me. But try opening the Welcome screen at 32-bit color and then at 256 colors. You'll then notice just how big the difference in rendering speed is.

Edit: I just did another test. I set PCem's affinity to only 1 core and the performance was not affected at all. This means PCem never takes advantage of both cores. So is my hardware too slow, or does the emulator simply not take the fullest advantage of my hardware? I'd personally say the latter.
neozeed
Posts: 176
Joined: Tue 08 Jul, 2014 4:41 am
Location: Hong Kong SAR

Re: Battler's speed issues

Post by neozeed »

What is a good way to get numbers on the overall performance?

I just tried a bunch of different things, and all I've seemed to prove is that PCem can give performance that measures from the inside at a constant rate... which is a 'good thing'...
1282 -O3 -march=native -DRELEASE_BUILD
1283 -O2 -DRELEASE_BUILD
1283 -O2 -DRELEASE_BUILD -fomit-frame-pointer
1283 -O3 -DRELEASE_BUILD -fomit-frame-pointer
1283 -O3 -march=i686 -DRELEASE_BUILD -fomit-frame-pointer
sound off
1233 -O3 -march=i686 -mtune=generic -mfpmath=387 -fomit-frame-pointer -DRELEASE_BUILD
1234 -O1 -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD
1234 -O3 -march=core2 -mtune=generic -mavx -mfpmath=sse -fomit-frame-pointer -DRELEASE_BUILD
1234 -O3 -march=core2 -mtune=generic -mavx -mfpmath=sse -fomit-frame-pointer -DRELEASE_BUILD
1231 -O0 -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD
This is running doom v1.1 as "-forcedemo -timedemo demo1" ... eyeballing it, they all slow down at about the same rate in the same places... Oddly enough even -O0 is quite playable. Is there something to 'sample' the status to a file, and a special way to exit PCem from within? Just so there is some way to stress it with a consistent test and see which changes have what effect, if any?

for what it's worth -march=native -O3 -funroll-loops -fomit-frame-pointer 'feels' the fastest but I'd rather have numbers than my feelings....
ppgrainbow
Posts: 479
Joined: Thu 04 Sep, 2014 7:03 am

Re: Battler's speed issues

Post by ppgrainbow »

My current video card, the EVGA 8400GS with 1 GB DDR3 video RAM (which will lose all driver support starting in April 2016), does not have very good host throughput either.

But I'm lucky that I can fully emulate a machine up to an Intel 80486 DX4 running at 100 MHz and anything up to Windows NT 3.51 SP5. Anything faster or later than that will run at a slower pace.

I had updated my graphics driver to the 341.XX series, but I suspect that the newer graphics drivers have bugs and may not work correctly with this graphics card. If they don't work correctly, I guess that I'll have to do a System Restore and revert.
leilei
Posts: 1040
Joined: Fri 25 Apr, 2014 4:47 pm

Re: Battler's speed issues

Post by leilei »

I just noticed this unnecessary know-it-all snap added to an edited post
Battler wrote: Oh and also, apparently on my low-end budget hardware, I can watch 1080p @ 60 fps videos on YouTube with essentially no lag. Are you going to tell me x86 emulation is more intensive than that?
Accelerated video decoding is nowhere near full x86 emulation. You might as well claim it's easy to emulate an Xbox because your video card can do Direct3D 8.
Battler wrote:- leilei: Yes, it also remains at 100% for me. But try opening the Welcome screen at 32-bit color and then at 256 colors. You'll then notice just how big the difference in rendering speed is.
Because the guest CPU is shoving pixels through color translation lookup tables, of course it's going to be fully utilized. It's worse in 16 colors.
Battler wrote: This means PCem never takes advantage of both cores.
No shit, sherlock. It's not emulating dual CPUs or dual core CPUs. CPUs don't work like video cards where you can thread out alternating lines.
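
For readers unfamiliar with the lookup tables mentioned above, here is a minimal sketch of a per-pixel table translation, assuming an 8-bit indexed source and a packed 0x00RRGGBB destination. Which direction and format Windows 98's GDI actually translates in is not stated in this thread, and the function names are hypothetical.

```python
def build_lut(palette):
    """Turn 256 (r, g, b) tuples (8 bits each) into packed 0x00RRGGBB values."""
    return [(r << 16) | (g << 8) | b for r, g, b in palette]


def translate_scanline(indices, lut):
    """Map every 8-bit pixel index through the table: one lookup per pixel."""
    return [lut[i] for i in indices]


# A greyscale palette keeps the example easy to check by eye.
grey_lut = build_lut([(v, v, v) for v in range(256)])
line = translate_scanline(bytes([0, 1, 255]), grey_lut)
print([hex(p) for p in line])

# At 640x480 that is one table lookup for each of this many pixels per redraw:
print(640 * 480)
```

The point is only that indexed modes add a lookup for every pixel touched, which is work the CPU doing the drawing has to pay for.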
Battler
Posts: 793
Joined: Sun 06 Jul, 2014 7:05 pm

Re: Battler's speed issues

Post by Battler »

leilei wrote:I just noticed this unnecessary know-it-all snap added to an edited post
Please cut your ad hominem attacks.
Accelerated video decoding is nowhere near full x86 emulation. You might as well claim it's easy to emulate an Xbox because your video card can do Direct3D 8.
Do you have any data to back up your claim that x86 emulation is more computationally intensive than decoding 1080p60 h.264?
Battler
Posts: 793
Joined: Sun 06 Jul, 2014 7:05 pm

Re: Battler's speed issues

Post by Battler »

OK, I am testing with a 430HX BIOS now, and with it, the I/O is damned FAST. 101% at the LBA 1 intro screen and no lag. I'm starting to think the BIOS/emulated chipset affects I/O speed.
leilei
Posts: 1040
Joined: Fri 25 Apr, 2014 4:47 pm

Re: Battler's speed issues

Post by leilei »

:roll:
neozeed wrote:for what it's worth -march=native -O3 -funroll-loops -fomit-frame-pointer 'feels' the fastest but I'd rather have numbers than my feelings....
It's a different story for me.

I found a nice spot in Quake to get a nice consistent percentage of execution - E2M1's initial deathmatch starting position, in 320x400. It's the end-of-level room with the laser shooters, which gives a constant and consistent surface cache update, and the dynamic lights put the Pentium CPU to work.

so here's my data with an AMD!!!
s3vbe20 /install
cd quake
quake -nosound -nojoy -nocdaudio -listen 8 +map e2m1 +vid_mode 7


80% - -O3 -march=native -DRELEASE_BUILD
83% - -O2 -DRELEASE_BUILD
84% - -O2 -DRELEASE_BUILD -fomit-frame-pointer
81% - -O3 -DRELEASE_BUILD -fomit-frame-pointer
81% - -O3 -march=i686 -DRELEASE_BUILD -fomit-frame-pointer
81% - -O3 -march=i686 -mtune=generic -mfpmath=387 -fomit-frame-pointer -DRELEASE_BUILD
79% - -O1 -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD
CRASH - -O3 -march=core2 -mtune=generic -mavx -mfpmath=sse -fomit-frame-pointer -DRELEASE_BUILD
CRASH - -O3 -march=core2 -mtune=generic -mavx -mfpmath=sse -fomit-frame-pointer -DRELEASE_BUILD
51% - -O0 -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD
83% - -march=native -O2 -funroll-loops -fomit-frame-pointer

CRASH - -O3 -march=amdfam10 -mtune=generic -mavx -mfpmath=sse -fomit-frame-pointer -DRELEASE_BUILD
81% - -O3 -march=amdfam10 -mtune=generic -mfpmath=sse -fomit-frame-pointer -DRELEASE_BUILD
85% - -O2 -march=native -funroll-loops -fomit-frame-pointer
As you can see, -O2 beats -O3, and gratuitous sse optimization flags don't mean jack shit.
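
One hedged way to summarise the list above is to group the rows by optimisation level and average them. The sketch below does that with the figures copied verbatim from this post; the grouping is an illustration added here, not something the poster ran, and since each row is a single run with other flags varying too, the averages are indicative at best.

```python
import re
from collections import defaultdict
from statistics import mean

# Results copied from the list above (percentage of full emulation speed).
RESULTS = """\
80% - -O3 -march=native -DRELEASE_BUILD
83% - -O2 -DRELEASE_BUILD
84% - -O2 -DRELEASE_BUILD -fomit-frame-pointer
81% - -O3 -DRELEASE_BUILD -fomit-frame-pointer
81% - -O3 -march=i686 -DRELEASE_BUILD -fomit-frame-pointer
81% - -O3 -march=i686 -mtune=generic -mfpmath=387 -fomit-frame-pointer -DRELEASE_BUILD
79% - -O1 -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD
CRASH - -O3 -march=core2 -mtune=generic -mavx -mfpmath=sse -fomit-frame-pointer -DRELEASE_BUILD
CRASH - -O3 -march=core2 -mtune=generic -mavx -mfpmath=sse -fomit-frame-pointer -DRELEASE_BUILD
51% - -O0 -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD
83% - -march=native -O2 -funroll-loops -fomit-frame-pointer
CRASH - -O3 -march=amdfam10 -mtune=generic -mavx -mfpmath=sse -fomit-frame-pointer -DRELEASE_BUILD
81% - -O3 -march=amdfam10 -mtune=generic -mfpmath=sse -fomit-frame-pointer -DRELEASE_BUILD
85% - -O2 -march=native -funroll-loops -fomit-frame-pointer
"""

LINE_RE = re.compile(r"^(\d+)%\s+-\s+(.+)$")


def group_by_opt_level(text):
    """Return {"-O2": [83, 84, ...], ...}; rows without a percentage are skipped."""
    groups = defaultdict(list)
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # CRASH rows and blank lines carry no percentage
        percent = int(match.group(1))
        flags = match.group(2).split()
        level = next((f for f in flags if f.startswith("-O")), "none")
        groups[level].append(percent)
    return dict(groups)


groups = group_by_opt_level(RESULTS)
for level in sorted(groups):
    values = groups[level]
    print(f"{level}: n={len(values)} mean={mean(values):.2f}")
```

On these numbers the four -O2 rows average 83.75% against 80.8% for the five -O3 rows that completed.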
Attachments
pcems.png
neozeed
Posts: 176
Joined: Tue 08 Jul, 2014 4:41 am
Location: Hong Kong SAR

Re: Battler's speed issues

Post by neozeed »

Yay real numbers!

OK, I'm using MS-DOS 5.00 and no config.sys just the commands above in autoexec.bat.

My Processor is..

	Number of cores		4 (max 8)
	Number of threads	8 (max 16)
	Name			Intel Xeon E3 1230 v3
	Codename		Haswell-WS
	Specification		Intel(R) Xeon(R) CPU E3-1230 v3 @ 3.30GHz
	Package (platform ID)	Socket 1150 LGA (0x1)
	CPUID			6.C.3
	Extended CPUID		6.3C
	Core Stepping		C0
	Technology		22 nm
	TDP Limit		80.0 Watts
	Tjmax			100.0 °C
	Core Speed		3500.4 MHz
	Multiplier x Bus Speed	35.0 x 100.0 MHz
	Stock frequency		3300 MHz
	Instructions sets	MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, EM64T, VT-x, AES, AVX, AVX2, FMA3
	L1 Data cache		4 x 32 KBytes, 8-way set associative, 64-byte line size
	L1 Instruction cache	4 x 32 KBytes, 8-way set associative, 64-byte line size
	L2 cache		4 x 256 KBytes, 8-way set associative, 64-byte line size
	L3 cache		8 MBytes, 16-way set associative, 64-byte line size
	FID/VID Control		yes
I get 100% on the first test with an MMX 166... So I have to crank it up to 266 MHz (lucky me!)

81% CFLAGS = -O3 -march=native -fomit-frame-pointer -DRELEASE_BUILD
80% CFLAGS = -O3 -march=i686 -mtune=generic -mfpmath=387 -fomit-frame-pointer -DRELEASE_BUILD
80% CFLAGS = -O3 -march=core2 -mtune=generic -mfpmath=sse -fomit-frame-pointer -DRELEASE_BUILD
81% CFLAGS = -O3 -march=core2 -mtune=generic -mavx -mfpmath=sse -fomit-frame-pointer -DRELEASE_BUILD
81% CFLAGS = -O3 -march=corei7-avx -mtune=corei7-avx -mno-avx -mno-aes -fomit-frame-pointer -DRELEASE_BUILD
55% CFLAGS = -O0 -g
79% CFLAGS = -Os -fomit-frame-pointer -DRELEASE_BUILD
79% CFLAGS = -Os -fomit-frame-pointer -funroll-loops -DRELEASE_BUILD
79% CFLAGS = -O2 -fomit-frame-pointer -funroll-loops -DRELEASE_BUILD
80% CFLAGS = -O2 -fomit-frame-pointer -DRELEASE_BUILD
80% CFLAGS = -O2 -DRELEASE_BUILD
79% CFLAGS = -O1 -DRELEASE_BUILD
79% CFLAGS = -Os -DRELEASE_BUILD
81% -O3 -march=corei7-avx -mtune=corei7-avx -mno-avx -mno-aes -mfpmath=387 -fomit-frame-pointer -DRELEASE_BUILD
80% -O2 -march=corei7-avx -mtune=corei7-avx -mno-avx -mno-aes -mfpmath=387 -fomit-frame-pointer -DRELEASE_BUILD
Well that was.. underwhelming.

Now let's try the IDT 166 with no compiler!
(feels VERY choppy though)

81% CFLAGS = -O3 -march=i686 -mtune=generic -mfpmath=387 -fomit-frame-pointer -DRELEASE_BUILD  
83% CFLAGS = -O3 -march=core2 -mtune=generic -mfpmath=sse -fomit-frame-pointer -DRELEASE_BUILD
83% CFLAGS = -O3 -march=core2 -mtune=generic -mavx -mfpmath=sse -fomit-frame-pointer -DRELEASE_BUILD
86% CFLAGS = -O3 -march=corei7-avx -mtune=corei7-avx -mno-avx -mno-aes -fomit-frame-pointer -DRELEASE_BUILD
84% CFLAGS = -O2 -DRELEASE_BUILD
74% CFLAGS = -O1 -DRELEASE_BUILD
70% CFLAGS = -Os -DRELEASE_BUILD
51% CFLAGS = -O0 -DRELEASE_BUILD
85% -O3 -march=corei7-avx -mtune=corei7-avx -mno-avx -mno-aes -mfpmath=387 -fomit-frame-pointer -DRELEASE_BUILD
85% -O2 -march=corei7-avx -mtune=corei7-avx -mno-avx -mno-aes -mfpmath=387 -fomit-frame-pointer -DRELEASE_BUILD
Pretty amazing to see that for the most part the compiler flags really don't matter. -O2 is 'good enough' on its own. Well, that is what I really wanted to see: some good metrics!!!! :D
leilei
Posts: 1040
Joined: Fri 25 Apr, 2014 4:47 pm

Re: Battler's speed issues

Post by leilei »

-flto and -ffast-math seem to help a bit. More random switch trial-and-error results:
80% - -march=native -Os -flto -ffast-math -funroll-loops -fomit-frame-pointer
80% - -march=native -Os -flto -ffast-math -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD
79% - -march=native -Os -flto -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD
79% - -march=i686 -Os -flto -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD
87% - -march=native -O2 -flto -ffast-math -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD
86% - -march=native -O2 -flto -ffast-math -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD -fno-branch-count-reg
85% - -march=native -O2 -flto -ffast-math -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD -fmerge-all-constants
86% - -march=native -O2 -flto -ffast-math -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD -fgcse-sm
84% - -march=native -O2 -flto -ffast-math -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD -ffloat-store
85% - -march=native -O2 -flto -ffast-math -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD -fno-math-errno
86% - -march=native -O2 -flto -ffast-math -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD -funsafe-math-optimizations
85% - -march=native -O2 -flto -ffast-math -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD -fno-trapping-math
86% - -march=native -O2 -flto -ffast-math -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD -funroll-all-loops
86% - -march=native -O2 -flto -ffast-math -funroll-loops -fomit-frame-pointer -DRELEASE_BUILD -fprefetch-loop-arrays
i've also noticed -funroll-loops and -O2 is slower for PGO builds for some reason, and that would take a long time to test and compare PGO builds so i didn't do that. Also when trying to do some profile generation I set the Voodoo threads to 1 and crank the CPU to PMMX 300 with Infinite cache and then run Battlezone 2 demo (which has no menu system, DX6 zealousy check or movies to fail on) and watch that crawl through the mission intro for awhile, since BZ2 is a particularly Intel-favoring MMX-abusing game.

And i've already explored the options for openmp parallel instruction tree automatic parallelization whatever, even including a bleeding edge TDM-GCC setup with aggressive parallelize-all-loops usage. Not any faster and introduces a nasty crash when exiting. :)
startmenu
Posts: 104
Joined: Sat 29 Nov, 2014 7:39 am

Re: Battler's speed issues

Post by startmenu »

Have you tested the build with "-fprofile-use"?
Alegend45
Posts: 85
Joined: Sat 26 Apr, 2014 4:33 am

Re: Battler's speed issues

Post by Alegend45 »

PCem-X is even slower when set to use a Pentium II Overdrive 333 on a 440FX motherboard. BTW, just in case Tom needs an acid test for Pentium 2 additions, XP setup stumped OBattler for quite a while. He eventually had to bypass most of Tom's segment code in SYSENTER/SYSEXIT since XP doesn't work otherwise. I've also done a little work on adding limit checks to the segment handling code, and it works wonderfully in Windows 98 SE and Little Big Adventure 2. However, XP is far slower on the Pentium Pro than on a Pentium MMX at the same clock speed, since SYSENTER/SYSEXIT are currently interpreted rather than recompiled. Our P6 timing is also inaccurate currently, using the Pentium timings right now. This is another priority since accurate P6 emulation will pave the way for emulating much more modern processors with correct timings.

Another problem that has arisen is a RIVA 128-specific crashing bug that seems to be triggered by the drivers. The bug appears to be dynarec-related since that's where it crashes, and we have yet to pinpoint the exact cause. However, I can say without a doubt that it is not the RIVA 128 code causing it, since reverting to earlier revisions that did work doesn't help. The evidence seems to point to this commit, however: https://github.com/OBattler/PCem-X/comm ... 686613d2af
SarahWalker
Site Admin
Posts: 2055
Joined: Thu 24 Apr, 2014 4:18 pm

Re: Battler's speed issues

Post by SarahWalker »

Alegend45 wrote:Our P6 timing is also inaccurate currently, using the Pentium timings right now. This is another priority since accurate P6 emulation will pave the way for emulating much more modern processors with correct timings.
With the current interpreter and recompiler designs you will not get anything even remotely close to accurate timings for P6 processors, or any out-of-order processors. It's really not worth the effort.
The evidence seems to point to this commit, however: https://github.com/OBattler/PCem-X/comm ... 686613d2af
I am not going to waste my time trying to fix bugs in other branches of PCem. I have quite enough to be working on as it is.
Battler
Posts: 793
Joined: Sun 06 Jul, 2014 7:05 pm

Re: Battler's speed issues

Post by Battler »

- TomWalker: And no one even asked you to. You completely missed the point of Alegend45's post.

Edit: And also, my work is there on GitHub for the taking. I would personally love it if you integrated at least some of it into your emulator. So it's up to you if you want to do it or not.
SarahWalker
Site Admin
Posts: 2055
Joined: Thu 24 Apr, 2014 4:18 pm

Re: Battler's speed issues

Post by SarahWalker »

Battler wrote:- TomWalker: And no one even asked you to. You completely missed the point of Alegend45's post.
If his post wasn't implying that I should fix it, why did he bother posting it here?
Battler
Posts: 793
Joined: Sun 06 Jul, 2014 7:05 pm

Re: Battler's speed issues

Post by Battler »

- TomWalker: No idea. I guess just to mention what we were doing. :p
SarahWalker
Site Admin
Posts: 2055
Joined: Thu 24 Apr, 2014 4:18 pm

Re: Battler's speed issues

Post by SarahWalker »

Battler wrote:Edit: And also, my work is there on GitHub for the taking. I would personally love it if you integrated at least some of it into your emulator. So it's up to you if you want to do it or not.
If you want your work in mainline then please provide patches. I'm rather unlikely to find the time to dig through GitHub to isolate worthwhile changes, rebase them, test them, and then commit them here.
Alegend45
Posts: 85
Joined: Sat 26 Apr, 2014 4:33 am

Re: Battler's speed issues

Post by Alegend45 »

Battler wrote:- TomWalker: No idea. I guess just to mention what we were doing. :p
Basically, yes.
TomWalker wrote:
Alegend45 wrote:Our P6 timing is also inaccurate currently, using the Pentium timings right now. This is another priority since accurate P6 emulation will pave the way for emulating much more modern processors with correct timings.
With the current interpreter and recompiler designs you will not get anything even remotely close to accurate timings for P6 processors, or any out-of-order processors. It's really not worth the effort.
Oh REALLY? Then how did you get somewhat accurate timings for the Pentium? Oh, you counted the number of cycles for a timing block and used that to increment the cycle count? Because that's basically what we would be doing, except on a grander scale, by modelling micro-ops (Agner docs have LOTS of info on this) and out-of-order processing to basically come up with timing blocks that we can calculate the approximate cycles for.

Anyway, I also have not only bugs in your dynarec to fix, but I also have to write an 8042 interpreter so that I can create a more accurate AT keyboard emulation. I might do the same for XT keyboards, but I'm not sure. I also plan on adding accurate Roland MT-32 emulation, which, while not using Munt code, will use an 8096 interpreter and will be based on studying the MAME implementation, with maybe some Munt influence.
Battler
Posts: 793
Joined: Sun 06 Jul, 2014 7:05 pm

Re: Battler's speed issues

Post by Battler »

- Alegend45: Pentium is not out-of-order, while Pentium Pro and Pentium II are. So I actually agree with what Tom is saying here. We can get a bad approximation at best, and even that won't be much more accurate than what we already have. :p
SarahWalker
Site Admin
Posts: 2055
Joined: Thu 24 Apr, 2014 4:18 pm

Re: Battler's speed issues

Post by SarahWalker »

My main concern with modelling out-of-order CPUs is that while, with a vast amount of effort, you could model execution time within a block, you would be unable to model time across multiple blocks. This would be something of an issue for a P6 core, as the average code block is about 6 instructions, and a P6 can have something like 60 uops in flight at any one time. Any timing accuracy on such an emulator would fall to pieces extremely quickly on anything beyond the most trivial code.
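
The arithmetic behind this concern can be sketched as follows. The figures of roughly 6 instructions per block and roughly 60 uops in flight come from the post above; the one-uop-per-instruction default is a simplifying assumption added here, and the function names are hypothetical.

```python
def blocks_spanned(uops_in_flight, instrs_per_block, uops_per_instr=1.0):
    """Estimate how many average-sized code blocks the in-flight window covers."""
    return uops_in_flight / (instrs_per_block * uops_per_instr)


def per_block_total(block_cycle_counts):
    """Per-block timing: each block is costed in isolation and the costs summed.

    Any overlap between neighbouring blocks is invisible to this model.
    """
    return sum(block_cycle_counts)


window = blocks_spanned(uops_in_flight=60, instrs_per_block=6)
print(f"in-flight window covers about {window:.0f} blocks")
print("per-block estimate for three blocks:", per_block_total([4, 7, 5]), "cycles")
```

Under these assumptions the window covers about ten blocks at once, so a model that only sees one block at a time cannot account for most of the overlap the real core would exploit.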