20140315, 15:11  #12 
Tribal Bullet
Oct 2004
3×1,181 Posts 
Nice to hear!
I just saw your previous post about the 1MB limit to performance. Could it be a limit on the number of TLB's on your machine? A single translation lookaside buffer speeds up access to a single 4kB page, so if your machine has a TLB with 128 entries and you pull in data from more than 128 different widely separated addresses then you will probably get noticeable slowdowns. Usually a TLB miss takes much less time than an access from RAM, but if you would otherwise be running from L2 then the miss penalty should be noticeable. Yes, this means modern x86 machines don't have enough TLB entries to cover their own caches; those caches are huge now, and a TLB is logic that is in the critical path of every memory access. The good news on linux is that you can prove whether this is happening, by allocating a partition somewhere of type 'hugetlbfs' that is backed by RAM. Accesses inside this partition are backed by 'second level TLB' entries, which on x86 cover 2MB or 4MB. If your code speeds up drastically in this case, you know that it's TLB thrashing. The bad news is that you can't do much about it, the TLB size is a fundamental machine limit and the only way to speed up is to respect that limit. The worse news is that I have no idea how to do the test :) 
20140315, 21:39  #13 
∂^{2}ω=0
Sep 2002
República de California
2×13×449 Posts 
Thanks for the commentary and suggestions  I considered that the effect might be a TLB issue rather than L2 cache per se, but as you note, saw no easy way to test it. Perhaps George can provide some useful insight here, as I know he did much wrestling with TLB issues in days of yore.
From my perspective, as I now have a nice unifiedcode methodology [most code aside from the DFT cores common for different radices] for implementing largeradix leadingpass DFTs (the ones that wrap the carry phase as noted above), absent a simpler directfiddlewithTLBusage fix I will simply add largerradix DFTs as needed to keep the perthread data chunksizes < 1 MB. Note that on preHaswell chips this is closely tied to the compactobjectcode issue, because previously those L1 Icache overflow issues were negating the chunksizebeneficial effects of the larger radices. 
20140316, 03:55  #14  
P90 years forever!
Aug 2002
Yeehaw, FL
2×11×349 Posts 
Quote:
Prime95 in pass 1 where we are using a large stride reads scattered data from RAM and copies it to a scratch area, the various FFT steps and normalization are done in the contiguous scratch area where TLBs are not a problem before copying the final results back to scattered RAM. Yes, there are some TLB misses at the start and end, but the cost appears to be insignificant. I've implemented large pages FFTs in Windows and did not observe a measurable speed difference. Now if Ernst is trying to do the entire large stride pass 1 inplace, then the TLB costs could be much more significant. 

20140316, 06:07  #15  
∂^{2}ω=0
Sep 2002
República de California
11674_{10} Posts 
Quote:
I can see why the localarray aspect of the DFT/carry/DFT stage is TLBinsensitive, but what about the scattereddata reads and writes bookending things? One more note of possible relevance: For large radices I've found it expedient for the scalardouble code path to write things back to the scattered main array locs rather than to local memory. Interestingly I see no significant performance hit from doing so, contrary to my expectations. (This is not an option currently for the SIMD builds since those use carry macros which assume data are in contiguous memory.) 

20140317, 12:28  #16  
Jan 2008
France
2×5^{2}×11 Posts 
Quote:


20140317, 23:49  #17 
"Marv"
May 2009
near the Tannhäuser Gate
2A6_{16} Posts 
Intel Vtune
If one had the time, one could download Intel's excellent utility, Vtune, to determine what exactly is going on. It's free for 30 days. Quite a few years ago, for a project, I had the chance to use an early version of it and was VERY impressed; it was quite amazing to actually look inside the cpu to see what's going on and make changes to improve response rather that doing the trial and error approach. Although there are lots of features requiring some dedicated time to learn, I also remember it had some "quick start" modes to get results quickly. I believe there are some YouTube videos that demonstrate its features.

20140318, 03:39  #18 
∂^{2}ω=0
Sep 2002
República de California
2·13·449 Posts 

20140318, 13:01  #19 
"Marv"
May 2009
near the Tannhäuser Gate
2×3×113 Posts 
It says it doesn't. With Visual Studio being so prevalent on Windoze boxes, it makes sense that they wouldn't limit it.
In examining its new features, one that I found VERY interesting that they improved from the time I used it is that you can collect data from remote machines by running a collection piece on the other machine that doesn't require a seperate license for it. Previously, if you had 3 different Intel cpu architectures you were targeting, you needed a license for each machine or an expensive floating license. Now, it looks like 1 will do the trick. Very cool. BTW, its name is now Vtune Amplifier XE 2013. sheesh, I'll never understand marketing types. 
20140528, 07:09  #20 
∂^{2}ω=0
Sep 2002
República de California
2×13×449 Posts 
Second run of F_{28}, at FFT length 16384 kdoubles, finished today, and match the one for 15360k I posted last November in this thread. The timings in the attached checkpointsummarystatus file reflect the first part of the run being done using leading radix r0 = 64 (yielding a perthread dataset size of ~2 MB) and then switching to r0 = 128 (1MB/thread, right at the 1 MB threshold which is critical on Intel x86) and later r0 = 256 (~0.5MB/thread) as the new code for those radices became available.
The summaries for the first few and last few checkpoints and the final SelfridgeHurwitz residues are F28: using FFT length 16384K = 16777216 8byte floats. this gives an average 16.000000000000000 bits per digit Using complex FFT radices 64 16 16 16 32 [Nov 26 09:00:00] F28 Iter# = 10000 clocks = 00:00:00.000 [ 0.0695 sec/iter] Res64: 0F38D26B35C6794C. AvgMaxErr = 0.010993669. MaxErr = 0.013671875 [Nov 26 09:11:38] F28 Iter# = 20000 clocks = 00:00:00.000 [ 0.0696 sec/iter] Res64: 6E621D1C6A05BC46. AvgMaxErr = 0.011034805. MaxErr = 0.015625000 ... Using complex FFT radices 128 16 16 16 16 [Dec 14 14:50:10] F28 Iter# = 22690000 clocks = 00:00:00.000 [ 0.0580 sec/iter] Res64: 262012254B592782. AvgMaxErr = 0.010806998. MaxErr = 0.013671875 [Dec 14 14:59:50] F28 Iter# = 22700000 clocks = 00:00:00.000 [ 0.0579 sec/iter] Res64: 0002AF208FE85365. AvgMaxErr = 0.010797848. MaxErr = 0.013671875 ... Using complex FFT radices 256 8 16 16 16 [Jan 09 14:55:42] F28 Iter# = 61130000 clocks = 00:00:00.000 [ 0.0567 sec/iter] Res64: F3699600BCBFECE0. AvgMaxErr = 0.011603018. MaxErr = 0.015625000 [Jan 09 15:05:10] F28 Iter# = 61140000 clocks = 00:00:00.000 [ 0.0567 sec/iter] Res64: 2DD786952E04FC17. AvgMaxErr = 0.011608487. MaxErr = 0.015625000 ... [May 27 04:12:06] F28 Iter# = 268390000 clocks = 00:00:00.000 [ 0.0575 sec/iter] Res64: E529CBABAE1D8B17. AvgMaxErr = 0.011612073. MaxErr = 0.015625000 [May 27 04:21:42] F28 Iter# = 268400000 clocks = 00:00:00.000 [ 0.0574 sec/iter] Res64: 7D195E1B0DD4A93E. AvgMaxErr = 0.011611240. MaxErr = 0.015625000 [May 27 04:31:17] F28 Iter# = 268410000 clocks = 00:00:00.000 [ 0.0574 sec/iter] Res64: 840DBB498F57E870. AvgMaxErr = 0.011617642. MaxErr = 0.015625000 [May 27 04:40:53] F28 Iter# = 268420000 clocks = 00:00:00.000 [ 0.0574 sec/iter] Res64: 64BE2132426CDD1C. AvgMaxErr = 0.011605388. MaxErr = 0.015625000 [May 27 04:50:28] F28 Iter# = 268430000 clocks = 00:00:00.000 [ 0.0575 sec/iter] Res64: 360A9E1308A3165A. AvgMaxErr = 0.011610274. MaxErr = 0.015625000 [May 27 04:55:43] F28 Iter# = 268435455 clocks = 00:00:00.000 [ 0.0574 sec/iter] Res64: 468731C3E6F13E8F. AvgMaxErr = 0.006339752. MaxErr = 0.015625000 F28 is not prime. Res64: 468731C3E6F13E8F. Program: E3.0x R28 mod 2^36 = 16759471759 R28 mod 2^35  1 = 30476727665 R28 mod 2^36  1 = 9104636564 Tarchive of the 2 checkpointsummary files and the final bytewise residue file (identical for both runs, thus just 1 copy) in the previouslydescribed format are here (33 MB, so please do not download frivolously  the runstatus file alone is in the much smaller attachment below, if you just want the summary data). The MD5 checksum on the extracted bytewise residue file is md5sum f28 = 1554face46e633eb2fd773df2647cd2a F29 (@30720K) is taking ~110 ms/iter on all 4 cores of my nonOCed Haswell 4670K; I'd really like to get that down to ~86ms (i.e. 1 Miter/day) which would still need 18 months for the complete run, so it may be time to consider Last fiddled with by ewmayer on 20171115 at 00:21 Reason: Last para: F28 > R28 
20141010, 04:07  #21  
∂^{2}ω=0
Sep 2002
República de California
2×13×449 Posts 
Quote:
Fwd: 174 FMA [Breakdown: 87 FMA, 2 FMS, 85 FNMA, 0 FNMS.] Inv: 174 FMA [a whopping 112 of which have a trivial unity multiplicand], 34 MUL. Upshot: I probably get the desired FMAspeedup in the forward FFT routines restructured thusly, but get little speedup in the iFFT due to the above issues. Partly as a result of the effort expended to achieve this mediocre gain I did not consider widening my FMArelated optimizations beyond the above 2 "heavy lifting" routines until recently. However, I will henceforth be focusing on morelocalized "peephole" style FMA code opts, which can be done incrementally and in a localized fashion. For starters I FMAized the other code macros (radix15 and twiddleless radix16 assembled into radix240, and fused radix16 finalfwdFFT/dyadicsquare/initialinvFFTpass) used in the above set of radices for my ongoing F29 run. That and a halfdozen other miscellaneous 1%level tweaks have cut the periteration timing @30 Mdoubles to 95 msec, which is within 10% of my "stretch goal" of 86 msec. (Still running at stock  if I added a ginormocooler and OCed up the wazoo I could probably hit 80 msec with the current code, but where's the challenge in that? :) 

20141010, 14:45  #22 
P90 years forever!
Aug 2002
Yeehaw, FL
2·11·349 Posts 
In the radix16 code, have you tried prefetching the next radix16 data into the L1 cache?

Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
P1/P+1 on Fermat numbers  ATH  Operazione Doppi Mersennes  2  20150125 06:27 
What are the Primality Tests ( not factoring! ) for Fermat Numbers?  Erasmus  Math  46  20140808 20:05 
LLT numbers, linkd with Mersenne and Fermat numbers  T.Rex  Math  4  20050507 08:25 
Two Primality tests for Fermat numbers  T.Rex  Math  2  20040911 07:26 
Fermat Numbers  devarajkandadai  Math  8  20040727 12:27 