A tenth-grader from Siberia wants to become a processor designer. Why shouldn't she build an FPGA neural network accelerator?

Yesterday I received a letter from a tenth-grader from Siberia who wants to become a microprocessor designer. She already has some results in this area: she added a multiply instruction to the very simple schoolMIPS processor, synthesized it for an Intel MAX 10 FPGA, determined the maximum clock frequency, and improved the performance of simple programs. She first did all this in the village of Burmistrovo, Novosibirsk Region, and then presented it at a conference in Tomsk.
 
 
Now Dasha Krivoruchko (that is the tenth-grader's name) has moved to a boarding school in Moscow and asks me what to design next. I think that at this stage of her career she should design a hardware accelerator for neural networks based on a systolic array for matrix multiplication. She should use the Verilog hardware description language and an Intel FPGA: not the cheap MAX 10, but something more expensive that can accommodate a large systolic array.
 
 
After that, she should compare the performance of the hardware solution with a program running on the schoolMIPS processor, as well as with a Python program running on a desktop computer. As a test case, she can use recognition of digits from a small matrix.
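As a rough idea of what the desktop Python baseline could look like, here is a minimal sketch of a single-layer classifier built on a plain matrix multiplication. The 3x3 glyphs, the +1/-1 weights derived from them, and the names `matmul`, `relu` and `classify` are all illustrative assumptions, not part of Dasha's project or of any real dataset.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]


def relu(values):
    """A simple activation function: negative scores become zero."""
    return [max(0, v) for v in values]


# Hypothetical 3x3 binary glyphs, flattened row by row (made up for this sketch).
GLYPHS = {
    "0": [1, 1, 1,
          1, 0, 1,
          1, 1, 1],
    "1": [0, 1, 0,
          0, 1, 0,
          0, 1, 0],
    "4": [1, 0, 1,
          1, 1, 1,
          0, 0, 1],
}
LABELS = list(GLYPHS)

# One weight column per class: +1 where the glyph has a pixel, -1 where it does not.
WEIGHTS = [[2 * GLYPHS[label][p] - 1 for label in LABELS] for p in range(9)]


def classify(pixels):
    """Return the label whose weight column gives the highest score."""
    scores = relu(matmul([pixels], WEIGHTS)[0])
    return LABELS[scores.index(max(scores))]


if __name__ == "__main__":
    for label, pixels in GLYPHS.items():
        print(label, "->", classify(pixels))
```

The `matmul` call is the part a systolic array would replace in hardware. For the timing comparison, the call to `classify` can be wrapped in `time.perf_counter()` measurements from the standard library.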
 
 
Such a project may be of interest as a basis for Olympiad problems, for example to the NTI Olympiad, whose organizers I raised this question with a couple of weeks ago in Moscow. Olympiad participants could extend such an example with hardware for various activation functions. Here are colleagues from the NTI Olympiad:
 
 
 
 
So if Dasha develops this, she could in principle submit her well-documented accelerator to both RUSNANO and the NTI Olympiad. I think it would also benefit her school's administration: they could show it on TV or even send her to an Intel FPGA competition. Here is a pair of Russians from St. Petersburg at the finals of the Intel FPGA contest in Santa Clara, California:
 
 
 
 
Now let's talk about the technical side of the project. The idea of a systolic array accelerator is described in the article "Why are TPUs so well suited for deep learning?", which was translated by Habr editor Vyacheslav Golovanov (SLY_G).
 
 
This is the dataflow graph of a neural network for simple recognition:
 
 
 
 
A primitive computational element that performs multiplication and addition:
 
 
 
 
A heavily pipelined structure of such elements forms a systolic array for matrix multiplication:
 
 
There is plenty of Verilog and VHDL code on the Internet that implements a systolic array, for example the code from this blog post:
 
 
 
 
    // 3x3 systolic array: a values enter from the left, b values from the top.
    module top (clk, reset, a1, a2, a3, b1, b2, b3, c1, c2, c3, c4, c5, c6, c7, c8, c9);
        parameter data_size = 8;
        input wire clk, reset;
        input wire [data_size-1:0] a1, a2, a3, b1, b2, b3;
        output wire [2*data_size:0] c1, c2, c3, c4, c5, c6, c7, c8, c9;
        // Links between neighbors: aXY goes from peX to peY along a row,
        // bXY goes from peX to peY down a column.
        wire [data_size-1:0] a12, a23, a45, a56, a78, a89, b14, b25, b36, b47, b58, b69;
        pe pe1 (.clk (clk), .reset (reset), .in_a (a1), .in_b (b1), .out_a (a12), .out_b (b14), .out_c (c1));
        pe pe2 (.clk (clk), .reset (reset), .in_a (a12), .in_b (b2), .out_a (a23), .out_b (b25), .out_c (c2));
        pe pe3 (.clk (clk), .reset (reset), .in_a (a23), .in_b (b3), .out_a (), .out_b (b36), .out_c (c3));
        pe pe4 (.clk (clk), .reset (reset), .in_a (a2), .in_b (b14), .out_a (a45), .out_b (b47), .out_c (c4));
        pe pe5 (.clk (clk), .reset (reset), .in_a (a45), .in_b (b25), .out_a (a56), .out_b (b58), .out_c (c5));
        pe pe6 (.clk (clk), .reset (reset), .in_a (a56), .in_b (b36), .out_a (), .out_b (b69), .out_c (c6));
        pe pe7 (.clk (clk), .reset (reset), .in_a (a3), .in_b (b47), .out_a (a78), .out_b (), .out_c (c7));
        pe pe8 (.clk (clk), .reset (reset), .in_a (a78), .in_b (b58), .out_a (a89), .out_b (), .out_c (c8));
        pe pe9 (.clk (clk), .reset (reset), .in_a (a89), .in_b (b69), .out_a (), .out_b (), .out_c (c9));
    endmodule

    // Processing element: multiply-accumulate plus one-cycle forwarding of both inputs.
    module pe (clk, reset, in_a, in_b, out_a, out_b, out_c);
        parameter data_size = 8;
        input wire reset, clk;
        input wire [data_size-1:0] in_a, in_b;
        output reg [2*data_size:0] out_c;
        output reg [data_size-1:0] out_a, out_b;
        always @(posedge clk) begin
            if (reset) begin
                out_a <= 0;
                out_b <= 0;
                out_c <= 0;
            end
            else begin
                out_c <= out_c + in_a * in_b;  // accumulate the product
                out_a <= in_a;                 // pass a to the right neighbor
                out_b <= in_b;                 // pass b to the neighbor below
            end
        end
    endmodule
 
I note that this code is not optimized and is generally clumsy (and even unprofessionally written: the source in the post uses blocking assignments inside always @(posedge clk), which I corrected). Dasha could, for example, use Verilog generate constructs for more elegant code.
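Before rewriting the Verilog, it can help to have a software model to check it against. Below is a minimal cycle-level Python sketch of the same structure: each processing element keeps the `out_a`, `out_b` and `out_c` registers of the `pe` module and updates them once per clock edge. The listing above does not show how inputs are fed, so the skewed schedule here (row r of A delayed by r cycles, column c of B delayed by c cycles) is an assumption, as is the function name `systolic_matmul`. The model uses unbounded Python integers, so it ignores any overflow of the 17-bit `out_c` accumulator.

```python
def systolic_matmul(a, b):
    """Cycle-level model of an n x n output-stationary systolic array.

    Returns (c, cycles), where c is the product a x b and cycles is the
    number of clock edges simulated.
    """
    n = len(a)
    reg_a = [[0] * n for _ in range(n)]  # out_a register of each PE
    reg_b = [[0] * n for _ in range(n)]  # out_b register of each PE
    acc = [[0] * n for _ in range(n)]    # out_c accumulator of each PE
    cycles = 3 * n - 2                   # the bottom-right PE finishes last
    for t in range(cycles):
        # Compute all next-state values from the current state first,
        # mirroring non-blocking assignments in the Verilog.
        new_a = [[0] * n for _ in range(n)]
        new_b = [[0] * n for _ in range(n)]
        new_acc = [[0] * n for _ in range(n)]
        for r in range(n):
            for c in range(n):
                if c == 0:
                    # Left edge: row r of A, delayed by r cycles (assumed skew).
                    in_a = a[r][t - r] if 0 <= t - r < n else 0
                else:
                    in_a = reg_a[r][c - 1]
                if r == 0:
                    # Top edge: column c of B, delayed by c cycles (assumed skew).
                    in_b = b[t - c][c] if 0 <= t - c < n else 0
                else:
                    in_b = reg_b[r - 1][c]
                new_acc[r][c] = acc[r][c] + in_a * in_b
                new_a[r][c] = in_a
                new_b[r][c] = in_b
        reg_a, reg_b, acc = new_a, new_b, new_acc
    return acc, cycles


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
    product, cycles = systolic_matmul(a, b)
    print(product, "in", cycles, "clock cycles")
```

With this feeding schedule, element (r, c) of the result ends up in the PE at row r and column c, which corresponds to outputs `c1` to `c9` in row-major order. The same input sequence could be reused in a Verilog testbench and the accumulator values compared against the model.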
 
 
In addition to the two extreme implementations of the neural network (on the processor and on the systolic array), Dasha could consider other options that are faster than the processor but not as hungry for multipliers as a systolic array. True, this is probably a task for university students rather than schoolchildren.
 
 
One option is an execution unit with a large number of parallel functional units, as in an out-of-order processor:
 
 
 
 
Another option is the so-called Coarse-Grained Reconfigurable Array (CGRA): a matrix of quasi-processor elements, each of which has a small program. These processing elements are conceptually similar to FPGA cells, but they operate not on individual signals but on groups of bits (numbers) on buses and in registers; see "Live reporting from the birth of a major player in hardware AI, which accelerates TensorFlow and competes with NVidia".
 
And now the original letter from Dasha:
 
Good day, Yuri.
 
 
I attended your workshop at LSUP in 2017, and in October of the same year I took part in a conference in Tomsk with a project on adding a multiplication unit to the schoolMIPS processor.
 
 
I would now like to continue this work. I have obtained permission from my school to take this topic as a small coursework project. Would you be able to help me continue it?
 
 
P.S. Since the work has to follow a specific format, I need to write an introduction and a literature review. Please recommend sources where I can find information on the history of this field, on architecture philosophies and so on, if you have any in mind.
 
 
Also, I now live in Moscow at a boarding school, so it may be easier to work together.
 
 
Regards,
 
 
Daria Krivoruchko.
 
Dasha learned Verilog and register-transfer-level design with my help and from the book "Digital Design and Computer Architecture" by David Harris and Sarah Harris. However, if you are a school student and want to understand the basic concepts at a very simple level, the publishing house DMK-Press has released a Russian translation of a 2013 Japanese manga about digital circuits, created by Amano Hideharu and Meguro Koji. Despite the light-hearted presentation, the book correctly introduces logic gates and D flip-flops and then ties them to FPGAs:
 
 
 
 
This is what the Summer School for Young Programmers in the Novosibirsk Region looked like, where Dasha learned Verilog, FPGAs, and the Register Transfer Level (RTL) design methodology:
 
 
 
 
 
And here is Dasha's presentation at the conference in Tomsk, together with another tenth-grader, Arseny Chegodaev:
 
 

 
After the presentation, Dasha with me and with Stanislav Zelio (sparf), the main creator of the schoolMIPS educational processor core for FPGA implementation:
 
 
 
The schoolMIPS project is located at https://github.com/MIPSfpga/schoolMIPS . In its simplest configuration, this educational processor core has only 300 lines of Verilog, while an industrial mid-range embedded core has about 300 thousand lines. Nevertheless, Dasha got a feel for what the work of industry designers looks like: they change the decoder and the execution unit in the same way when they add a new instruction to a processor:
 
 
 
In conclusion, here is a photo of Ilya Kudryavtsev, a dean at Samara University, who is interested in creating a summer school and competitions with FPGA-based processors for prospective students:
 
 
 
And a photo of staff from MIET in Zelenograd, who are already planning such a summer school for next year:
 
 
 
At both places, the materials from RUSNANO and possible materials for the NTI Olympiad should come in handy, as well as the work done over the last couple of years on introducing FPGAs and microarchitecture at HSE MIEM, Moscow State University, and Innopolis University in Kazan.