Computing "magic squares" on a GPU

Hello, Habr.

The topic of "magic squares" is quite interesting: on the one hand, they have been known since antiquity; on the other hand, computing a "magic square" is still a very hard computational problem today. Recall that to construct an NxN "magic square", the numbers 1..N*N must be written into it so that the sums of its rows, columns and diagonals are all equal to the same number. If you simply enumerate every possible placement of the numbers in a 4x4 square, you get 16! = 20 922 789 888 000 options.

Let's think about how this can be done more efficiently.

First, let's restate the problem. The numbers must be arranged in a square so that none of them repeats and the sums of the rows, columns and diagonals are all equal to the same number.

It is easy to prove that this sum is always the same and, for any n, is given by the formula:

`S = n * (n^2 + 1) / 2`

We will consider 4x4 squares, so the sum is S = 34.

Let's denote the cells by X; our square then has the form:

X11 X12 X13 X14

X21 X22 X23 X24

X31 X32 X33 X34

X41 X42 X43 X44

The first, and obvious, property: since the sum is known, the last element of each row and column can be expressed through the other three:

`X14 = S - X11 - X12 - X13`

`X24 = S - X21 - X22 - X23`

`X41 = S - X11 - X21 - X31`

Thus, the 4x4 square effectively turns into a 3x3 square, which reduces the number of options to search from 16! to 9!, that is, by a factor of about 57 million. Knowing this, let's start writing code and see how hard this search is for modern computers.

### C++: single-threaded version

The principle of the program is very simple. We take the set of numbers 1..16 and run a for loop over it: this gives x11. Then we take a second set, consisting of the first one without the number x11, and so on.

The approximate form of the program looks like this:

```cpp
int squares = 0;
int digits[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
Set mset(digits, digits + N * N);
for (int x11 = 1; x11 <= MAX; x11++) {
    Set set12(mset); set12.erase(x11);
    for (SetIterator it12 = set12.begin(); it12 != set12.end(); it12++) {
        int x12 = *it12;
        Set set13(set12); set13.erase(x12);
        for (SetIterator it13 = set13.begin(); it13 != set13.end(); it13++) {
            int x13 = *it13;
            int x14 = S - x11 - x12 - x13;
            if (x14 < 1 || x14 > MAX) continue;
            if (x14 == x11 || x14 == x12 || x14 == x13) continue;

            // ... the nested loops for the remaining rows are omitted here ...

            int sh1 = x11 + x12 + x13 + x14, sh2 = x21 + x22 + x23 + x24,
                sh3 = x31 + x32 + x33 + x34, sh4 = x41 + x42 + x43 + x44;
            int sv1 = x11 + x21 + x31 + x41, sv2 = x12 + x22 + x32 + x42,
                sv3 = x13 + x23 + x33 + x43, sv4 = x14 + x24 + x34 + x44;
            int sd1 = x11 + x22 + x33 + x44, sd2 = x14 + x23 + x32 + x41;
            if (sh1 != S || sh2 != S || sh3 != S || sh4 != S ||
                sv1 != S || sv2 != S || sv3 != S || sv4 != S ||
                sd1 != S || sd2 != S)
                continue;

            // If the numbers have passed all the checks, the square is found
            printf("%d %d %d %d %d %d %d %d %d %d %d %d %d %d %d %d\n",
                   x11, x12, x13, x14, x21, x22, x23, x24,
                   x31, x32, x33, x34, x41, x42, x43, x44);
            squares++;
        }
    }
}
printf("CNT: %d\n", squares);
```

The full text of the program can be found under the spoiler.

**Full source code**

```cpp
#include <cstdio>
#include <ctime>
#include <set>
// #include "stdafx.h"  // only needed for Visual Studio precompiled headers

typedef std::set<int> Set;
typedef Set::iterator SetIterator;

#define N 4
#define MAX (N * N)
#define S 34

int main(int argc, char *argv[])
{
  // x11 x12 x13 x14
  // x21 x22 x23 x24
  // x31 x32 x33 x34
  // x41 x42 x43 x44

  const clock_t begin_time = clock();

  int squares = 0;
  int digits[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
  Set mset(digits, digits + N * N);
  for (int x11 = 1; x11 <= MAX; x11++) {
    Set set12(mset); set12.erase(x11);
    for (SetIterator it12 = set12.begin(); it12 != set12.end(); it12++) {
      int x12 = *it12;
      Set set13(set12); set13.erase(x12);
      for (SetIterator it13 = set13.begin(); it13 != set13.end(); it13++) {
        int x13 = *it13;
        int x14 = S - x11 - x12 - x13;
        if (x14 < 1 || x14 > MAX) continue;
        if (x14 == x11 || x14 == x12 || x14 == x13) continue;

        Set set21(set13); set21.erase(x13); set21.erase(x14);
        for (SetIterator it21 = set21.begin(); it21 != set21.end(); it21++) {
          int x21 = *it21;
          Set set22(set21); set22.erase(x21);
          for (SetIterator it22 = set22.begin(); it22 != set22.end(); it22++) {
            int x22 = *it22;
            Set set23(set22); set23.erase(x22);
            for (SetIterator it23 = set23.begin(); it23 != set23.end(); it23++) {
              int x23 = *it23, x24 = S - x21 - x22 - x23;
              if (x24 < 1 || x24 > MAX) continue;
              if (x24 == x11 || x24 == x12 || x24 == x13 || x24 == x14 ||
                  x24 == x21 || x24 == x22 || x24 == x23) continue;

              Set set31(set23);
              set31.erase(x23); set31.erase(x24);
              for (SetIterator it31 = set31.begin(); it31 != set31.end(); it31++) {
                int x31 = *it31;
                Set set32(set31); set32.erase(x31);
                for (SetIterator it32 = set32.begin(); it32 != set32.end(); it32++) {
                  int x32 = *it32;
                  Set set33(set32); set33.erase(x32);
                  for (SetIterator it33 = set33.begin(); it33 != set33.end(); it33++) {
                    int x33 = *it33, x34 = S - x31 - x32 - x33;
                    if (x34 < 1 || x34 > MAX) continue;
                    if (x34 == x11 || x34 == x12 || x34 == x13 || x34 == x14 ||
                        x34 == x21 || x34 == x22 || x34 == x23 || x34 == x24 ||
                        x34 == x31 || x34 == x32 || x34 == x33) continue;

                    // The last row is fully determined by the column sums
                    int x41 = S - x11 - x21 - x31, x42 = S - x12 - x22 - x32,
                        x43 = S - x13 - x23 - x33, x44 = S - x14 - x24 - x34;
                    if (x41 < 1 || x41 > MAX || x42 < 1 || x42 > MAX ||
                        x43 < 1 || x43 > MAX || x44 < 1 || x44 > MAX) continue;
                    if (x41 == x11 || x41 == x12 || x41 == x13 || x41 == x14 ||
                        x41 == x21 || x41 == x22 || x41 == x23 || x41 == x24 ||
                        x41 == x31 || x41 == x32 || x41 == x33 || x41 == x34)
                      continue;
                    if (x42 == x11 || x42 == x12 || x42 == x13 || x42 == x14 ||
                        x42 == x21 || x42 == x22 || x42 == x23 || x42 == x24 ||
                        x42 == x31 || x42 == x32 || x42 == x33 || x42 == x34 ||
                        x42 == x41)
                      continue;
                    if (x43 == x11 || x43 == x12 || x43 == x13 || x43 == x14 ||
                        x43 == x21 || x43 == x22 || x43 == x23 || x43 == x24 ||
                        x43 == x31 || x43 == x32 || x43 == x33 || x43 == x34 ||
                        x43 == x41 || x43 == x42)
                      continue;
                    if (x44 == x11 || x44 == x12 || x44 == x13 || x44 == x14 ||
                        x44 == x21 || x44 == x22 || x44 == x23 || x44 == x24 ||
                        x44 == x31 || x44 == x32 || x44 == x33 || x44 == x34 ||
                        x44 == x41 || x44 == x42 || x44 == x43)
                      continue;

                    int sh1 = x11 + x12 + x13 + x14, sh2 = x21 + x22 + x23 + x24,
                        sh3 = x31 + x32 + x33 + x34, sh4 = x41 + x42 + x43 + x44;
                    int sv1 = x11 + x21 + x31 + x41, sv2 = x12 + x22 + x32 + x42,
                        sv3 = x13 + x23 + x33 + x43, sv4 = x14 + x24 + x34 + x44;
                    int sd1 = x11 + x22 + x33 + x44, sd2 = x14 + x23 + x32 + x41;
                    if (sh1 != S || sh2 != S || sh3 != S || sh4 != S ||
                        sv1 != S || sv2 != S || sv3 != S || sv4 != S ||
                        sd1 != S || sd2 != S)
                      continue;

                    printf("%d %d %d %d %d %d %d %d %d %d %d %d %d %d %d %d\n",
                           x11, x12, x13, x14, x21, x22, x23, x24,
                           x31, x32, x33, x34, x41, x42, x43, x44);
                    squares++;
                  }
                }
              }
            }
          }
        }
      }
    }
  }
  printf("CNT: %d\n", squares);

  float diff_t = float(clock() - begin_time) / CLOCKS_PER_SEC;
  printf("T = %.2fs\n", diff_t);
  return 0;
}
```

Result: in total, **7040** 4x4 "magic squares" were found, and the search took **102 s**.

By the way, it is interesting to check whether the list contains the square depicted in Dürer's engraving. Of course it does, because the program outputs *all* 4x4 squares.

It should be noted that Dürer placed the square in the engraving for a reason: the numbers *1514* in it also indicate the year the engraving was created.

As you can see, the program works (we can consider the task verified in 1514 by Albrecht Dürer ;), but the execution time is rather long for a computer with a Core i7 processor. Obviously, the program runs in a single thread, and it makes sense to put all the other cores to work as well.

### C++: multithreaded version

Rewriting the program with threads is, in principle, not hard, though a bit cumbersome. Fortunately, there is an option that is almost forgotten today: **OpenMP** (Open Multi-Processing). This technology has existed since the late 1990s and lets you use preprocessor directives to tell the compiler which parts of the program to run in parallel. OpenMP support is also available in Visual Studio, so to make the program multithreaded, it is enough to add just one line to the code:

```cpp
int squares = 0;
#pragma omp parallel for reduction(+: squares)
for (int x11 = 1; x11 <= MAX; x11++) {
    // ... the same nested loops as before ...
}
printf("CNT: %d\n", squares);
```

The directive *#pragma omp parallel for* indicates that the following for loop can be executed in parallel, and the additional *reduction* parameter names the variable squares: each thread gets its own copy of it, and the copies are summed when the loop finishes (without this the increment does not work correctly).

The result is obvious: the execution time dropped from 102 s to **18 s**.

**Full source code**

```cpp
#include <cstdio>
#include <ctime>
#include <set>
// #include "stdafx.h"  // only needed for Visual Studio precompiled headers

typedef std::set<int> Set;
typedef Set::iterator SetIterator;

#define N 4
#define MAX (N * N)
#define S 34

int main(int argc, char *argv[])
{
  // x11 x12 x13 x14
  // x21 x22 x23 x24
  // x31 x32 x33 x34
  // x41 x42 x43 x44

  // Note: on some platforms clock() reports CPU time summed over all threads
  // rather than wall-clock time.
  const clock_t begin_time = clock();

  int squares = 0;
  #pragma omp parallel for reduction(+: squares)
  for (int x11 = 1; x11 <= MAX; x11++) {
    // Every thread builds its own set, so nothing mutable is shared
    int digits[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
    Set mset(digits, digits + N * N);
    Set set12(mset); set12.erase(x11);
    for (SetIterator it12 = set12.begin(); it12 != set12.end(); it12++) {
      int x12 = *it12;
      Set set13(set12); set13.erase(x12);
      for (SetIterator it13 = set13.begin(); it13 != set13.end(); it13++) {
        int x13 = *it13;
        int x14 = S - x11 - x12 - x13;
        if (x14 < 1 || x14 > MAX) continue;
        if (x14 == x11 || x14 == x12 || x14 == x13) continue;

        Set set21(set13); set21.erase(x13); set21.erase(x14);
        for (SetIterator it21 = set21.begin(); it21 != set21.end(); it21++) {
          int x21 = *it21;
          Set set22(set21); set22.erase(x21);
          for (SetIterator it22 = set22.begin(); it22 != set22.end(); it22++) {
            int x22 = *it22;
            Set set23(set22); set23.erase(x22);
            for (SetIterator it23 = set23.begin(); it23 != set23.end(); it23++) {
              int x23 = *it23, x24 = S - x21 - x22 - x23;
              if (x24 < 1 || x24 > MAX) continue;
              if (x24 == x11 || x24 == x12 || x24 == x13 || x24 == x14 ||
                  x24 == x21 || x24 == x22 || x24 == x23) continue;

              Set set31(set23);
              set31.erase(x23); set31.erase(x24);
              for (SetIterator it31 = set31.begin(); it31 != set31.end(); it31++) {
                int x31 = *it31;
                Set set32(set31); set32.erase(x31);
                for (SetIterator it32 = set32.begin(); it32 != set32.end(); it32++) {
                  int x32 = *it32;
                  Set set33(set32); set33.erase(x32);
                  for (SetIterator it33 = set33.begin(); it33 != set33.end(); it33++) {
                    int x33 = *it33, x34 = S - x31 - x32 - x33;
                    if (x34 < 1 || x34 > MAX) continue;
                    if (x34 == x11 || x34 == x12 || x34 == x13 || x34 == x14 ||
                        x34 == x21 || x34 == x22 || x34 == x23 || x34 == x24 ||
                        x34 == x31 || x34 == x32 || x34 == x33) continue;

                    // The last row is fully determined by the column sums
                    int x41 = S - x11 - x21 - x31, x42 = S - x12 - x22 - x32,
                        x43 = S - x13 - x23 - x33, x44 = S - x14 - x24 - x34;
                    if (x41 < 1 || x41 > MAX || x42 < 1 || x42 > MAX ||
                        x43 < 1 || x43 > MAX || x44 < 1 || x44 > MAX) continue;
                    if (x41 == x11 || x41 == x12 || x41 == x13 || x41 == x14 ||
                        x41 == x21 || x41 == x22 || x41 == x23 || x41 == x24 ||
                        x41 == x31 || x41 == x32 || x41 == x33 || x41 == x34)
                      continue;
                    if (x42 == x11 || x42 == x12 || x42 == x13 || x42 == x14 ||
                        x42 == x21 || x42 == x22 || x42 == x23 || x42 == x24 ||
                        x42 == x31 || x42 == x32 || x42 == x33 || x42 == x34 ||
                        x42 == x41)
                      continue;
                    if (x43 == x11 || x43 == x12 || x43 == x13 || x43 == x14 ||
                        x43 == x21 || x43 == x22 || x43 == x23 || x43 == x24 ||
                        x43 == x31 || x43 == x32 || x43 == x33 || x43 == x34 ||
                        x43 == x41 || x43 == x42)
                      continue;
                    if (x44 == x11 || x44 == x12 || x44 == x13 || x44 == x14 ||
                        x44 == x21 || x44 == x22 || x44 == x23 || x44 == x24 ||
                        x44 == x31 || x44 == x32 || x44 == x33 || x44 == x34 ||
                        x44 == x41 || x44 == x42 || x44 == x43)
                      continue;

                    int sh1 = x11 + x12 + x13 + x14, sh2 = x21 + x22 + x23 + x24,
                        sh3 = x31 + x32 + x33 + x34, sh4 = x41 + x42 + x43 + x44;
                    int sv1 = x11 + x21 + x31 + x41, sv2 = x12 + x22 + x32 + x42,
                        sv3 = x13 + x23 + x33 + x43, sv4 = x14 + x24 + x34 + x44;
                    int sd1 = x11 + x22 + x33 + x44, sd2 = x14 + x23 + x32 + x41;
                    if (sh1 != S || sh2 != S || sh3 != S || sh4 != S ||
                        sv1 != S || sv2 != S || sv3 != S || sv4 != S ||
                        sd1 != S || sd2 != S)
                      continue;

                    printf("%d %d %d %d %d %d %d %d %d %d %d %d %d %d %d %d\n",
                           x11, x12, x13, x14, x21, x22, x23, x24,
                           x31, x32, x33, x34, x41, x42, x43, x44);
                    squares++;
                  }
                }
              }
            }
          }
        }
      }
    }
  }
  printf("CNT: %d\n", squares);

  float diff_t = float(clock() - begin_time) / CLOCKS_PER_SEC;
  printf("T = %.2fs\n", diff_t);
  return 0;
}
```

This is much better: since the task parallelizes almost perfectly (the computations in each branch do not depend on each other), the time drops by a factor roughly equal to the number of processor cores. But alas, you cannot get fundamentally *more* out of this code, although some optimizations might win a few percent. Let's move on to heavier artillery: computing on the GPU.

## Calculations with NVIDIA CUDA

Without going into details, a computation running on a video card can be pictured as several parallel hardware blocks, each of which runs several threads.

As an example, here is a function that adds two vectors, taken from the CUDA documentation:

```cuda
__global__
void add(int n, float *x, float *y)
{
    int index = threadIdx.x;
    int stride = blockDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}
```

The arrays x and y are shared by all blocks, and the function itself runs on many processors at once. The key here is parallelism: the processors of a graphics card are much simpler than an ordinary CPU, but there are a lot of them, and they are designed specifically for processing numeric data.

That is exactly what we need. We have a matrix of numbers X11, X12, ..., X44. Let's launch 16 blocks, each running 16 threads. The block number will correspond to X11, the thread number to X12, and the code itself will compute all possible squares for the selected X11 and X12. It is all simple, but there is one subtlety: the data must not only be computed but also transferred back from the video card. For this, we will store the number of squares found in element zero of the array.

The main code is very simple:

```cuda
#define N 4
#define MAX (N * N)
#define SQ_MAX (8 * 1024)
#define BLOCK_SIZE (SQ_MAX * N * N + 1)

int main(int argc, char *argv[])
{
    const clock_t begin_time = clock();

    int *results = (int *)malloc(BLOCK_SIZE * sizeof(int));
    results[0] = 0;  // element 0 holds the number of squares found

    int *gpu_out = NULL;
    cudaMalloc(&gpu_out, BLOCK_SIZE * sizeof(int));
    cudaMemcpy(gpu_out, results, BLOCK_SIZE * sizeof(int), cudaMemcpyHostToDevice);

    // MAX blocks of MAX threads: block index -> x11, thread index -> x12
    squares<<<MAX, MAX>>>(gpu_out);
    cudaDeviceSynchronize();

    cudaMemcpy(results, gpu_out, BLOCK_SIZE * sizeof(int), cudaMemcpyDeviceToHost);

    // Print results (at most SQ_MAX squares are actually stored)
    int squares = results[0];
    for (int p = 0; p < squares && p < SQ_MAX; p++) {
        int i = MAX * p + 1;
        printf("[%d %d %d %d %d %d %d %d %d %d %d %d %d %d %d %d]\n",
               results[i], results[i+1], results[i+2], results[i+3],
               results[i+4], results[i+5], results[i+6], results[i+7],
               results[i+8], results[i+9], results[i+10], results[i+11],
               results[i+12], results[i+13], results[i+14], results[i+15]);
    }
    printf("CNT: %d\n", squares);

    float diff_t = float(clock() - begin_time) / CLOCKS_PER_SEC;
    printf("T = %.2fs\n", diff_t);

    cudaFree(gpu_out);
    free(results);
    return 0;
}
```

We allocate a block of memory on the video card with cudaMalloc, launch the squares function with two parameters, 16 and 16 (the number of blocks and the number of threads), corresponding to the numbers 1..16 being enumerated, wait for it to finish with cudaDeviceSynchronize, and then copy the data back with cudaMemcpy.

The squares function itself essentially repeats the code from the previous part, with the difference that the counter of squares found is incremented with atomicAdd: this guarantees that the variable changes correctly when it is updated from several threads at once.

**Full source code**

```cuda
// Compile:
// nvcc -o magic4_gpu.exe magic4_gpu.cu

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4
#define MAX (N * N)
#define SQ_MAX (8 * 1024)
#define BLOCK_SIZE (SQ_MAX * N * N + 1)
#define S 34

// Magic square:
// x11 x12 x13 x14
// x21 x22 x23 x24
// x31 x32 x33 x34
// x41 x42 x43 x44

__global__ void squares(int *res_array) {
  int index1 = blockIdx.x, index2 = threadIdx.x;
  if (index1 + 1 > MAX || index2 + 1 > MAX) return;

  const int x11 = index1 + 1, x12 = index2 + 1;
  if (x11 == x12) return;  // early exit: numbers in a square never repeat
  for (int x13 = 1; x13 <= MAX; x13++) {
    if (x13 == x11 || x13 == x12) continue;
    int x14 = S - x11 - x12 - x13;
    if (x14 < 1 || x14 > MAX) continue;
    if (x14 == x11 || x14 == x12 || x14 == x13) continue;
    for (int x21 = 1; x21 <= MAX; x21++) {
      if (x21 == x11 || x21 == x12 || x21 == x13 || x21 == x14) continue;
      for (int x22 = 1; x22 <= MAX; x22++) {
        if (x22 == x11 || x22 == x12 || x22 == x13 || x22 == x14 || x22 == x21)
          continue;
        for (int x23 = 1; x23 <= MAX; x23++) {
          int x24 = S - x21 - x22 - x23;
          if (x24 < 1 || x24 > MAX) continue;
          if (x23 == x11 || x23 == x12 || x23 == x13 || x23 == x14 ||
              x23 == x21 || x23 == x22)
            continue;
          if (x24 == x11 || x24 == x12 || x24 == x13 || x24 == x14 ||
              x24 == x21 || x24 == x22 || x24 == x23)
            continue;
          for (int x31 = 1; x31 <= MAX; x31++) {
            if (x31 == x11 || x31 == x12 || x31 == x13 || x31 == x14 ||
                x31 == x21 || x31 == x22 || x31 == x23 || x31 == x24)
              continue;
            for (int x32 = 1; x32 <= MAX; x32++) {
              if (x32 == x11 || x32 == x12 || x32 == x13 || x32 == x14 ||
                  x32 == x21 || x32 == x22 || x32 == x23 || x32 == x24 ||
                  x32 == x31)
                continue;
              for (int x33 = 1; x33 <= MAX; x33++) {
                int x34 = S - x31 - x32 - x33;
                if (x34 < 1 || x34 > MAX) continue;
                if (x33 == x11 || x33 == x12 || x33 == x13 || x33 == x14 ||
                    x33 == x21 || x33 == x22 || x33 == x23 || x33 == x24 ||
                    x33 == x31 || x33 == x32)
                  continue;
                if (x34 == x11 || x34 == x12 || x34 == x13 || x34 == x14 ||
                    x34 == x21 || x34 == x22 || x34 == x23 || x34 == x24 ||
                    x34 == x31 || x34 == x32 || x34 == x33)
                  continue;

                // The last row is fully determined by the column sums
                const int x41 = S - x11 - x21 - x31, x42 = S - x12 - x22 - x32,
                          x43 = S - x13 - x23 - x33, x44 = S - x14 - x24 - x34;
                if (x41 < 1 || x41 > MAX || x42 < 1 || x42 > MAX ||
                    x43 < 1 || x43 > MAX || x44 < 1 || x44 > MAX)
                  continue;
                if (x41 == x11 || x41 == x12 || x41 == x13 || x41 == x14 ||
                    x41 == x21 || x41 == x22 || x41 == x23 || x41 == x24 ||
                    x41 == x31 || x41 == x32 || x41 == x33 || x41 == x34)
                  continue;
                if (x42 == x11 || x42 == x12 || x42 == x13 || x42 == x14 ||
                    x42 == x21 || x42 == x22 || x42 == x23 || x42 == x24 ||
                    x42 == x31 || x42 == x32 || x42 == x33 || x42 == x34 ||
                    x42 == x41)
                  continue;
                if (x43 == x11 || x43 == x12 || x43 == x13 || x43 == x14 ||
                    x43 == x21 || x43 == x22 || x43 == x23 || x43 == x24 ||
                    x43 == x31 || x43 == x32 || x43 == x33 || x43 == x34 ||
                    x43 == x41 || x43 == x42)
                  continue;
                if (x44 == x11 || x44 == x12 || x44 == x13 || x44 == x14 ||
                    x44 == x21 || x44 == x22 || x44 == x23 || x44 == x24 ||
                    x44 == x31 || x44 == x32 || x44 == x33 || x44 == x34 ||
                    x44 == x41 || x44 == x42 || x44 == x43)
                  continue;

                int sh1 = x11 + x12 + x13 + x14, sh2 = x21 + x22 + x23 + x24,
                    sh3 = x31 + x32 + x33 + x34, sh4 = x41 + x42 + x43 + x44;
                int sv1 = x11 + x21 + x31 + x41, sv2 = x12 + x22 + x32 + x42,
                    sv3 = x13 + x23 + x33 + x43, sv4 = x14 + x24 + x34 + x44;
                int sd1 = x11 + x22 + x33 + x44, sd2 = x14 + x23 + x32 + x41;
                if (sh1 != S || sh2 != S || sh3 != S || sh4 != S ||
                    sv1 != S || sv2 != S || sv3 != S || sv4 != S ||
                    sd1 != S || sd2 != S)
                  continue;

                // Square found: save in array (MAX numbers for each square)
                int p = atomicAdd(res_array, 1);
                if (p >= SQ_MAX) continue;
                int i = MAX * p + 1;
                res_array[i] = x11; res_array[i+1] = x12; res_array[i+2] = x13; res_array[i+3] = x14;
                res_array[i+4] = x21; res_array[i+5] = x22; res_array[i+6] = x23; res_array[i+7] = x24;
                res_array[i+8] = x31; res_array[i+9] = x32; res_array[i+10] = x33; res_array[i+11] = x34;
                res_array[i+12] = x41; res_array[i+13] = x42; res_array[i+14] = x43; res_array[i+15] = x44;

                // Warning: printf from kernel makes calculation 2-3x slower
                // printf("%d %d %d %d %d %d %d %d %d %d %d %d %d %d %d %d\n",
                //        x11, x12, x13, x14, x21, x22, x23, x24,
                //        x31, x32, x33, x34, x41, x42, x43, x44);
              }
            }
          }
        }
      }
    }
  }
}

int main(int argc, char *argv[])
{
  int *gpu_out = NULL;
  cudaMalloc(&gpu_out, BLOCK_SIZE * sizeof(int));

  const clock_t begin_time = clock();

  int *results = (int *)malloc(BLOCK_SIZE * sizeof(int));
  results[0] = 0;
  cudaMemcpy(gpu_out, results, BLOCK_SIZE * sizeof(int), cudaMemcpyHostToDevice);

  squares<<<MAX, MAX>>>(gpu_out);
  cudaDeviceSynchronize();

  cudaMemcpy(results, gpu_out, BLOCK_SIZE * sizeof(int), cudaMemcpyDeviceToHost);

  // Print results (at most SQ_MAX squares are actually stored)
  int squares = results[0];
  for (int p = 0; p < squares && p < SQ_MAX; p++) {
    int i = MAX * p + 1;
    printf("[%d %d %d %d %d %d %d %d %d %d %d %d %d %d %d %d]\n",
           results[i], results[i+1], results[i+2], results[i+3],
           results[i+4], results[i+5], results[i+6], results[i+7],
           results[i+8], results[i+9], results[i+10], results[i+11],
           results[i+12], results[i+13], results[i+14], results[i+15]);
  }
  printf("CNT: %d\n", squares);

  float diff_t = float(clock() - begin_time) / CLOCKS_PER_SEC;
  printf("T = %.2fs\n", diff_t);

  cudaFree(gpu_out);
  free(results);
  return 0;
}
```

The result needs no comment: the execution time was 2.7 s, which is more than 30 times better than the original single-threaded version.

Most likely, this is far from ideal. For example, you could launch more blocks on the GPU, but this would make the code more convoluted and harder to understand.

## Conclusion

The problem of finding "magic squares" turned out to be technically very interesting and, at the same time, far from easy. Even with GPU computation, the search for all 5x5 squares may take several hours, and the optimization needed to search for magic squares of 7x7 and above has yet to be done.

Mathematically and algorithmically, several points also remain open:

- The dependence of the number of "magic squares" on N. It is known that a 2x2 square does not exist and that there is only 1 3x3 square (not counting rotations and reflections). As we found out, there are 7040 4x4 squares, but excluding rotations and reflections has not yet been added to the algorithm. For larger dimensions, the question is still open.

- Exclusion of squares that are rotations or reflections of ones already found.

- Speed and optimization of the algorithm. Unfortunately, I have no way to test the code on a supercomputer or at least on an NVIDIA Tesla; if someone can run it, it would be interesting to see the results. If anyone has ideas about the algorithm itself, they can be tried as well. If desired, one could even start a distributed project to search for squares, provided there are enough readers, of course ;)

The analysis and properties of magic squares themselves could fill a separate article, if there is interest.

PS: To the question that will surely follow, "but why is this needed?": in terms of power consumption, computing magic squares is no better or worse than mining bitcoins, so why not? Besides, it is an interesting workout for the mind and an interesting problem in applied programming.

Author: weber

Publication date: 29-09-2018, 23:31

Category: Programming / Parallel Programming / Abnormal programming / Mathematics
