Flying Pigs, or Optimizing Bytecode Interpreters

 
"You can, however, make a faster pig" (a comment in the Emacs source code)

Everyone knows that pigs don't fly. Just as popular is the belief that bytecode interpreters, as a technique for executing high-level languages, cannot be sped up without resorting to labor-intensive dynamic compilation.

In the second part of a series of articles on bytecode interpreters, I will try to show, using the example of a small stack virtual machine called PigletVM, that not all is lost for hardworking piglets with ambitions, and that within (mostly) standard C such interpreters can be sped up at least one and a half times.

PigletVM is an ordinary stack machine based on the example from the first part of the series. Our piglet knows only one data type, the 64-bit machine word, and all (integer) computations are performed on a stack with a maximum depth of 256 machine words. In addition to the stack, the piglet has a working memory of 65,536 machine words. The result of a program, a single machine word, can either be placed in the result register or simply printed to standard output (stdout).

All of PigletVM's state is stored in a single structure:

#include <stdint.h>

#define STACK_MAX 256
#define MEMORY_SIZE 65536

static struct {
    /* Current instruction pointer */
    uint8_t *ip;
    /* Fixed-size stack */
    uint64_t stack[STACK_MAX];
    uint64_t *stack_top;
    /* Operational memory */
    uint64_t memory[MEMORY_SIZE];
    /* A single register containing the result */
    uint64_t result;
} vm;

This makes PigletVM a low-level virtual machine, in which almost all of the overhead goes into running the main interpreter loop:

interpret_result vm_interpret(uint8_t *bytecode)
{
    vm_reset(bytecode);
    for (;;) {
        uint8_t instruction = NEXT_OP();
        switch (instruction) {
        case OP_PUSHI: {
            /* get the argument, push it to stack */
            uint16_t arg = NEXT_ARG();
            PUSH(arg);
            break;
        }
        case OP_ADD: {
            /* Pop 2 values, add 'em, push the result back to the stack */
            uint64_t arg_right = POP();
            *TOS_PTR() += arg_right;
            break;
        }
        /*
         * Lots of other instruction handlers here
         */
        case OP_DONE: {
            return SUCCESS;
        }
        default:
            return ERROR_UNKNOWN_OPCODE;
        }
    }
    return ERROR_END_OF_STREAM;
}

The code shows that for each opcode the piglet has to:

1. Fetch the opcode from the instruction stream.
2. Make sure that the opcode is within the allowed range of opcode values (the C compiler adds this logic when generating the switch code).
3. Jump to the instruction body.
4. Fetch the instruction's arguments from the stack, or decode an immediate argument placed directly in the bytecode.
5. Perform the operation.
6. If the operation produces a result, push it onto the stack.
7. Advance the pointer from the current instruction to the next one.

Only step 5 is useful work; everything else is overhead: decoding or fetching instruction arguments from the stack (step 4), checking the opcode value (step 2), and repeatedly returning to the top of the main loop followed by a hard-to-predict jump (step 3).

In short, the piglet is clearly over its recommended body mass index, and if we want to get it into shape, we will have to deal with all this excess.

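To make the loop above concrete, here is a minimal, self-contained sketch of a switch-based interpreter with just a few instructions. The opcode numbering, the macro definitions (NEXT_OP, NEXT_ARG, PUSH, POP, TOS_PTR) and the 16-bit big-endian argument encoding are assumptions for illustration only; the article does not show PigletVM's actual definitions. OP_ABORT is given the value 0 to match the zero-filled bytecode tail the article relies on later.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_MAX 256

/* Hypothetical opcode numbering: OP_ABORT is 0, so zeroed bytes abort */
enum { OP_ABORT = 0, OP_PUSHI, OP_ADD, OP_POP_RES, OP_DONE };

typedef enum { SUCCESS, ERROR_ABORT, ERROR_UNKNOWN_OPCODE } interpret_result;

static struct {
    uint8_t *ip;
    uint64_t stack[STACK_MAX];
    uint64_t *stack_top;
    uint64_t result;
} vm;

/* Assumed encoding: 1-byte opcode, optional 2-byte big-endian argument */
#define NEXT_OP() (*vm.ip++)
#define NEXT_ARG() (vm.ip += 2, (uint16_t)((vm.ip[-2] << 8) | vm.ip[-1]))
#define PUSH(val) (*vm.stack_top++ = (val))
#define POP() (*--vm.stack_top)
#define TOS_PTR() (vm.stack_top - 1)

static void vm_reset(uint8_t *bytecode)
{
    vm.ip = bytecode;
    vm.stack_top = vm.stack;
    vm.result = 0;
}

static interpret_result vm_interpret(uint8_t *bytecode)
{
    vm_reset(bytecode);
    for (;;) {
        uint8_t instruction = NEXT_OP();
        switch (instruction) {
        case OP_PUSHI: {
            uint16_t arg = NEXT_ARG();
            assert(vm.stack_top < vm.stack + STACK_MAX); /* overflow check */
            PUSH(arg);
            break;
        }
        case OP_ADD: {
            uint64_t arg_right = POP();
            *TOS_PTR() += arg_right; /* add in place on the top of the stack */
            break;
        }
        case OP_POP_RES:
            vm.result = POP(); /* move the top of the stack to the result register */
            break;
        case OP_DONE:
            return SUCCESS;
        case OP_ABORT:
            return ERROR_ABORT;
        default:
            return ERROR_UNKNOWN_OPCODE;
        }
    }
}
```

Feeding it the bytes `{OP_PUSHI, 0x01, 0x00, OP_PUSHI, 0x00, 0x2a, OP_ADD, OP_POP_RES, OP_DONE}` leaves 298 (256 + 42) in `vm.result`. Every instruction, however trivial, pays for the trip through `NEXT_OP()` and the `switch`.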
Piglet Assembly Language and the Sieve of Eratosthenes

To begin with, let's set the rules of the game.

Writing programs for a virtual machine directly in C is bad form, but creating a programming language takes a long time, so the piglet and I decided to limit ourselves to a piglet assembly language.

In this assembly language, a program that computes the sum of the numbers from 1 to 65535 looks like this:

# sum numbers from 1 to 65535
# init the current sum and the index
PUSHI 1
PUSHI 1
# stack: s = 1, i = 1
STOREI 0
# stack: s = 1
# routine: increment the counter, add it to the current sum
incrementandadd:
# check if index is too big
LOADI 0
# stack: s, i
ADDI 1
# stack: s, i + 1
DUP
# stack: s, i + 1, i + 1
GREATER_OR_EQUALI 65535
# stack: s, i + 1, 1 or 0
JUMP_IF_TRUE done
# stack: s, i + 1
DUP
# stack: s, i + 1, i + 1
STOREI 0
# stack: s, i + 1
ADD
# stack: s + i + 1
JUMP incrementandadd
done:
DISCARD
PRINT
DONE

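For readers who want to check what the assembly actually computes, here is a line-by-line C transliteration. It assumes, as the stack comments suggest, that STOREI pops the top of the stack into the given memory address and that JUMP_IF_TRUE pops its condition. Tracing it reveals a detail the header comment glosses over: the loop exits as soon as i + 1 reaches 65535, before that value is added, so the program really sums 1 through 65534.

```c
#include <assert.h>
#include <stdint.h>

/* Line-by-line C transliteration of the piglet assembly sum program.
 * Assumes STOREI pops into memory and JUMP_IF_TRUE pops its condition. */
static uint64_t sum_program(void)
{
    uint64_t memory[1];
    uint64_t s = 1;                    /* PUSHI 1 (the sum) */
    memory[0] = 1;                     /* PUSHI 1; STOREI 0 (the index) */
    for (;;) {                         /* incrementandadd: */
        uint64_t next = memory[0] + 1; /* LOADI 0; ADDI 1 */
        if (next >= 65535)             /* DUP; GREATER_OR_EQUALI 65535; JUMP_IF_TRUE done */
            break;                     /* done: DISCARD drops the extra copy of i + 1 */
        memory[0] = next;              /* DUP; STOREI 0 */
        s += next;                     /* ADD; JUMP incrementandadd */
    }
    return s;                          /* PRINT */
}
```

`sum_program()` returns 2147385345, which is 65534 * 65535 / 2.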
It's not Python, of course, but everything needed for piglet happiness is here: comments, labels, conditional and unconditional jumps to those labels, instruction mnemonics, and immediate instruction arguments.

PigletVM comes with an assembler and a disassembler, which brave readers with plenty of free time can try out in battle themselves.

The numbers add up very quickly, so to test performance I wrote another program: a naive implementation of the sieve of Eratosthenes.

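The article does not list the sieve program itself. For reference, here is a naive sieve of Eratosthenes in plain C. It is only a sketch of the algorithm being benchmarked, not the author's PigletVM program; the function name and the choice to count the primes below a limit are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Naive sieve of Eratosthenes: counts the primes below `limit`.
 * `is_composite` is caller-provided scratch space of `limit` entries. */
static size_t sieve_count(bool *is_composite, size_t limit)
{
    size_t count = 0;
    memset(is_composite, 0, limit * sizeof *is_composite);
    for (size_t i = 2; i < limit; i++) {
        if (is_composite[i])
            continue;
        count++;                                  /* i survived: it is prime */
        for (size_t j = 2 * i; j < limit; j += i) /* cross out every multiple */
            is_composite[j] = true;
    }
    return count;
}
```

With a 100-entry buffer, `sieve_count(buf, 100)` returns 25.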
The piglet actually runs pretty fast (its instructions are close to machine instructions), so to get clear results I'll take every measurement over one hundred runs of the program.

The first, unoptimized version of our piglet runs like this:

> ./pigletvm runtimes test/sieve-unoptimized.bin 100 > /dev/null
PROFILE: switch code finished took 545ms

Half a second! The comparison is certainly unfair, but the same algorithm in Python is a bit slower over a hundred runs:

> python test/sieve.py > /dev/null

4.5 seconds, or roughly eight times slower. We have to give the piglet credit: it has talent! Now let's see whether our piglet can work on its abs.

Exercise One: Static Superinstructions

The first rule of fast code is to not do extra work. The second rule of fast code is to never do any extra work. So what extra work does PigletVM do?

Observation one: profiling our program shows that some instruction sequences occur more often than others. We won't torment our piglet too much and will limit ourselves to pairs of instructions:

- LOADI 0, ADD: load a number from memory address 0 and add it to the number on top of the stack.
- PUSHI 65535, GREATER_OR_EQUAL: push a number onto the stack and compare it with the number that was previously on top of the stack, pushing the result of the comparison (0 or 1) back onto the stack.
- PUSHI 1, ADD: push a number onto the stack, add it to the number that was previously on top of the stack, and push the result of the addition back onto the stack.

PigletVM has a little over 20 instructions, and a whole byte (256 values) is used to encode them, so adding new instructions is not a problem. Here's what we'll do:

for (;;) {
    uint8_t instruction = NEXT_OP();
    switch (instruction) {
    /*
     * Other instructions here
     */
    case OP_LOADADDI: {
        /* get the memory address from the immediate argument and add
         * the value stored there to the top of the stack */
        uint16_t addr = NEXT_ARG();
        uint64_t val = vm.memory[addr];
        *TOS_PTR() += val;
        break;
    }
    case OP_GREATER_OR_EQUALI: {
        /* compare the top of the stack with the immediate argument */
        uint64_t arg_right = NEXT_ARG();
        *TOS_PTR() = PEEK() >= arg_right;
        break;
    }
    case OP_ADDI: {
        /* Add immediate value to the stack */
        uint16_t arg_right = NEXT_ARG();
        *TOS_PTR() += arg_right;
        break;
    }
    /*
     * Other instructions here
     */
    }
}

Nothing complicated. Let's see what came of it:

> ./pigletvm runtimes test/sieve.bin 100 > /dev/null
PROFILE: switch code finished 410ms

Wow! Just three new instructions, and we gained about 135 milliseconds!

The gain comes from the fact that our piglet makes no unnecessary moves when executing such instructions: the execution flow doesn't drop back into the main loop, nothing extra is decoded, and the instruction arguments don't make another trip through the stack.

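Static superinstructions only pay off if the assembler actually emits them. Here is a sketch of a peephole pass that fuses the three pairs listed above into single instructions. The instruction representation, the opcode enum and the function names are hypothetical, not part of PigletVM. The pass works on decoded instructions before jump targets are resolved; a real implementation must also refuse to fuse a pair whose second instruction is a jump target, which this sketch does not check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcode subset; not PigletVM's real numbering */
typedef enum {
    OP_PUSHI, OP_LOADI, OP_ADD, OP_GREATER_OR_EQUAL, OP_DONE,
    OP_ADDI, OP_LOADADDI, OP_GREATER_OR_EQUALI /* static superinstructions */
} opcode;

typedef struct {
    opcode op;
    uint16_t arg; /* immediate argument; unused by instructions without one */
} insn;

/* Returns the superinstruction replacing the pair (first, second), or -1. */
static int fuse(opcode first, opcode second)
{
    if (first == OP_LOADI && second == OP_ADD)
        return OP_LOADADDI;
    if (first == OP_PUSHI && second == OP_GREATER_OR_EQUAL)
        return OP_GREATER_OR_EQUALI;
    if (first == OP_PUSHI && second == OP_ADD)
        return OP_ADDI;
    return -1;
}

/* Peephole pass: rewrites `code` in place and returns the new length.
 * Assumes no label points at the second instruction of a fused pair. */
static size_t fuse_superinstructions(insn *code, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        int super = (i + 1 < n) ? fuse(code[i].op, code[i + 1].op) : -1;
        if (super >= 0) {
            uint16_t arg = code[i].arg; /* the first instruction carries the immediate */
            code[out].op = (opcode)super;
            code[out].arg = arg;
            out++;
            i++; /* skip the instruction that was folded in */
        } else {
            code[out++] = code[i];
        }
    }
    return out;
}
```

Run over `PUSHI 1, ADD, LOADI 0, ADD, PUSHI 65535, GREATER_OR_EQUAL, DONE`, the pass leaves four instructions: `ADDI 1, LOADADDI 0, GREATER_OR_EQUALI 65535, DONE`.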
This technique is called static superinstructions, because the additional instructions are defined statically, that is, by the virtual machine's programmer at the design stage. It is a simple and effective technique that all programming-language virtual machines use in one form or another.

The main problem with static superinstructions is that without a specific program it is impossible to tell which instructions should be combined. Different programs use different instruction sequences, and these sequences can only be discovered when a specific piece of code is run.

The next step could be dynamically compiling superinstructions in the context of a specific program, that is, dynamic superinstructions (in the 1990s and early 2000s, this technique played the role of a primitive JIT compiler).

Creating instructions on the fly is impossible within plain C, and our piglet quite rightly considers that unfair competition. Fortunately, I have a couple of better exercises for it.

Exercise Two: Checking the Opcode Value Range

Following our rules of fast code, let's once again ask ourselves the eternal question: what can we avoid doing?

When we got acquainted with PigletVM's internals, I listed all the actions the virtual machine performs for each opcode. Step 2 (checking that the opcode falls within the valid range of switch values) is the most suspicious one.

Let's look at how GCC compiles the switch construct:

1. A jump table is built, that is, a table that maps each opcode value to the address of the code that executes the instruction body.
2. Code is inserted that checks whether the received opcode is within the range of all possible switch values and jumps to the default label if there is no handler for that opcode.
3. Code is inserted that jumps to the handler.

But why check the range of values for every instruction? We assume that an opcode is either valid, with execution eventually ending at an OP_DONE instruction, or invalid because execution went past the end of the bytecode. The tail of the opcode stream is filled with zeros, and zero is the opcode of the OP_ABORT instruction, which terminates bytecode execution with an error.

It turns out that this check isn't needed at all! And the piglet should be able to get this idea across to the compiler. Let's tweak the main switch a little:

uint8_t instruction = NEXT_OP();
/* Let the compiler know that opcodes are always between 0 and 31 */
switch (instruction & 0x1f) {
    /* All the instructions here */
case 26 ... 0x1f: {
    /* Handle the remaining 6 non-existing opcodes */
    return ERROR_UNKNOWN_OPCODE;
}
}

Knowing that we only have 26 instructions, we apply a bit mask to the opcode (the hexadecimal value 0x1f is binary 0b11111, covering the range from 0 to 31) and add a handler for the unused values from 26 to 31.

Bitwise instructions are among the cheapest on the x86 architecture, and they are certainly cheaper than troublesome conditional branches like the one the range check uses. In theory, if the compiler takes our hint, we should save several cycles on every executed instruction.

By the way, specifying a range of values in a case label is not standard C but a GCC extension. For our purposes, though, this code is fine, especially since it is easy to convert it into separate handlers for each of the unused values.

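Here is the masking trick on its own as a small sketch, assuming 26 opcodes numbered 0 to 25. The `_Static_assert` guards the assumption that every opcode fits under the mask, `case a ... b` is the GCC/Clang case-range extension just mentioned, and the names `dispatch`, `OP_COUNT` and `dispatch_result` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: 26 real opcodes, numbered 0..25 */
enum { OP_COUNT = 26 };
_Static_assert(OP_COUNT <= 0x1f + 1, "every opcode must fit under the 0x1f mask");

typedef enum { HANDLED, UNKNOWN_OPCODE } dispatch_result;

static dispatch_result dispatch(uint8_t instruction)
{
    /* The mask tells the compiler the switch value is always 0..31,
     * so it may drop the range check before indexing the jump table. */
    switch (instruction & 0x1f) {
    case 0 ... OP_COUNT - 1: /* GCC/Clang case-range extension */
        return HANDLED;      /* the real instruction bodies would go here */
    case OP_COUNT ... 0x1f:  /* the 6 unused slots, 26..31 */
        return UNKNOWN_OPCODE;
    }
    return UNKNOWN_OPCODE;   /* not reached: every masked value has a case */
}
```

Note the trade-off: `dispatch(32)` returns `HANDLED`, because 32 & 0x1f is 0, so a corrupted byte silently aliases to a valid opcode instead of being rejected. That is acceptable only under the assumption above that the bytecode is well-formed.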
Let's try it:

> ./pigletvm runtimes test/sieve.bin 100 > /dev/null
PROFILE: switch code finished 437ms
PROFILE: switch code (no range check) finished took 383ms

Another 50 milliseconds! Piglet, you're really squaring your shoulders!

Exercise Three: Traces

What other exercises can help our piglet? The biggest time savings came from superinstructions: they reduce the number of exits to the main loop and let us get rid of the associated overhead.

The central switch is the main trouble spot for any processor with out-of-order execution. Modern branch predictors have learned to predict even such complex indirect jumps well, but spreading the branch points throughout the code can help the processor move quickly from instruction to instruction.

Another problem is reading instruction opcodes and immediate arguments from the bytecode byte by byte. Physical machines operate on 64-bit machine words and don't much like code that works with smaller values.

Compilers often operate on basic blocks, that is, sequences of instructions with no branches or labels inside. A basic block starts either at the beginning of the program or at a label, and ends at the end of the program, at a conditional branch, or at a direct jump to the label that starts the next basic block.

Working with basic blocks has many advantages, but our piglet is interested in one key property: the instructions within a basic block are executed sequentially. It would be great to somehow identify these basic blocks and execute their instructions without wasting time on exits to the main loop.

In our case, we can even extend the notion of a basic block to a trace. In PigletVM terms, a trace includes all basic blocks chained together sequentially (that is, via unconditional jumps).

Besides executing instructions sequentially, it would be nice to decode the instructions' immediate arguments in advance.

All this sounds pretty scary and resembles dynamic compilation, which we decided not to use. The piglet even doubted its strength a little, but in practice it wasn't so bad.

Let's first think about how to represent an instruction that is part of a trace:

typedef struct scode scode;
typedef void trace_op_handler(scode *code);
struct scode {
    uint64_t arg;              /* pre-decoded instruction argument */
    trace_op_handler *handler; /* instruction logic */
};

Here arg is a pre-decoded instruction argument, and handler is a pointer to the function that implements the instruction's logic.

Each trace then looks like this:

typedef scode trace[MAX_TRACE_LEN];

That is, a trace is a sequence of s-codes of bounded length. The trace cache inside the virtual machine looks like this:

trace trace_cache[MAX_CODE_LEN];

This is simply an array of traces, one for each possible bytecode offset, so its length never exceeds the maximum bytecode length. The solution is lazy: in practice, to save memory, it makes sense to use a hash table.

At interpreter startup, the first handler of every trace is the one that compiles that trace:

for (size_t trace_i = 0; trace_i < MAX_CODE_LEN; trace_i++)
    vm_trace.trace_cache[trace_i][0].handler = trace_compile_handler;

The main interpreter loop now looks like this:

while (vm_trace.is_running) {
    scode *code = &vm_trace.trace_cache[vm_trace.pc][0];
    code->handler(code);
}

The trace-compiling handler is a bit more complicated. Besides building the trace starting from the current instruction, it does the following:

static void trace_compile_handler(scode *trace_head)
{
    scode *trace_tail = trace_head;
    /*
     * Trace building here
     */
    /* now, run the chain that has a trace_compile_handler replaced with
     * the proper instruction handler function pointer */
    trace_head->handler(trace_head);
}

An ordinary instruction handler:

static void op_add_handler(scode *code)
{
    uint64_t arg_right = POP();
    *TOS_PTR() += arg_right;
    /*
     * Call the next trace handler
     */
    /* s-codes are laid out one after another, so the next handler is in the next s-code */
    code++;
    code->handler(code);
}

Every trace ends with a handler that makes no call at the end of the function:

static void op_done_handler(scode *code)
{
    (void)code;
    vm_trace.is_running = false;
    vm_trace.error = SUCCESS;
}

All this is, of course, more complicated than adding superinstructions, but let's see whether it got us anything:

> ./pigletvm runtimes test/sieve.bin 100 > /dev/null
PROFILE: switch code finished 427ms
PROFILE: switch code (no range check) finished took 395ms
PROFILE: trace code finished took 367ms

Hooray, another 30 milliseconds!

How so? Instead of simple jumps to labels, we build chains of calls to instruction handlers and spend time on calls and argument passing, yet our piglet still runs along the traces faster than a simple switch with its labels.

This speedup comes from three factors:

- Branches scattered across different places in the code are easier to predict.
- Handler arguments are always pre-decoded into a full machine word, and this happens only once, when the trace is compiled.
- The compiler turns the chains of function calls into a single call to the first handler, which is possible thanks to tail call optimization.

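The fragments above leave trace building out. Here is a compact, self-contained sketch of the whole scheme for a tiny instruction subset, assuming a 1-byte opcode followed by a 2-byte big-endian argument for PUSHI. The trace cache, the self-replacing trace_compile_handler and the handler-to-handler calls follow the article's description; the opcode values, the sizes, the `aborted` flag and the hypothetical op_exit_handler (which hands control back to the main loop when a trace reaches MAX_TRACE_LEN) are my own assumptions, and jumps are omitted. C does not guarantee tail-call elimination; optimizing GCC and Clang builds usually turn these calls into jumps, and the bounded trace length keeps stack depth small either way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CODE_LEN 64
#define MAX_TRACE_LEN 16

/* Hypothetical opcodes; PUSHI is followed by a 2-byte big-endian argument */
enum { OP_ABORT = 0, OP_PUSHI, OP_ADD, OP_POP_RES, OP_DONE };

typedef struct scode scode;
typedef void trace_op_handler(scode *code);
struct scode {
    uint64_t arg;              /* pre-decoded immediate argument */
    trace_op_handler *handler; /* instruction logic */
};
typedef scode trace[MAX_TRACE_LEN];

static struct {
    const uint8_t *bytecode;
    size_t pc;                 /* bytecode offset of the next trace to run */
    uint64_t stack[256];
    uint64_t *stack_top;
    uint64_t result;
    bool is_running;
    bool aborted;
    trace trace_cache[MAX_CODE_LEN];
} vm_trace;

/* Ordinary handlers do their work, then call the next s-code's handler
 * as their last action: a tail call the optimizer can turn into a jump. */
static void op_pushi_handler(scode *code) {
    *vm_trace.stack_top++ = code->arg;
    code[1].handler(code + 1);
}
static void op_add_handler(scode *code) {
    uint64_t right = *--vm_trace.stack_top;
    vm_trace.stack_top[-1] += right;
    code[1].handler(code + 1);
}
static void op_pop_res_handler(scode *code) {
    vm_trace.result = *--vm_trace.stack_top;
    code[1].handler(code + 1);
}
static void op_done_handler(scode *code) { /* terminator: no further call */
    (void)code;
    vm_trace.is_running = false;
}
static void op_abort_handler(scode *code) {
    (void)code;
    vm_trace.aborted = true;
    vm_trace.is_running = false;
}
static void op_exit_handler(scode *code) {
    vm_trace.pc = (size_t)code->arg; /* main loop resumes at this offset */
}

static void trace_compile_handler(scode *trace_head) {
    scode *tail = trace_head;
    size_t pc = vm_trace.pc;
    for (size_t len = 0; len < MAX_TRACE_LEN - 1; len++, tail++) {
        switch (vm_trace.bytecode[pc++]) {
        case OP_PUSHI: /* decode the argument once, at trace-compile time */
            tail->arg = ((uint64_t)vm_trace.bytecode[pc] << 8) | vm_trace.bytecode[pc + 1];
            pc += 2;
            tail->handler = op_pushi_handler;
            break;
        case OP_ADD: tail->handler = op_add_handler; break;
        case OP_POP_RES: tail->handler = op_pop_res_handler; break;
        case OP_DONE: tail->handler = op_done_handler; goto run;
        default: tail->handler = op_abort_handler; goto run;
        }
    }
    tail->arg = pc; /* trace is full: hand control back to the main loop */
    tail->handler = op_exit_handler;
run:
    /* slot 0 now holds a real handler, so this trace is compiled only once */
    trace_head->handler(trace_head);
}

static void vm_trace_run(const uint8_t *bytecode) {
    vm_trace.bytecode = bytecode;
    vm_trace.pc = 0;
    vm_trace.stack_top = vm_trace.stack;
    vm_trace.is_running = true;
    vm_trace.aborted = false;
    for (size_t trace_i = 0; trace_i < MAX_CODE_LEN; trace_i++)
        vm_trace.trace_cache[trace_i][0].handler = trace_compile_handler;
    while (vm_trace.is_running) {
        scode *code = &vm_trace.trace_cache[vm_trace.pc][0];
        code->handler(code);
    }
}
```

Running `{OP_PUSHI, 0, 2, OP_PUSHI, 0, 3, OP_ADD, OP_POP_RES, OP_DONE}` compiles a single trace and leaves 5 in `vm_trace.result`. A program longer than MAX_TRACE_LEN - 1 instructions is split into several traces joined through op_exit_handler.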
Before wrapping up our training sessions, the piglet and I decided to try one more ancient interpretation technique: threaded code.

Exercise Four: Threaded Code

Anyone interested in the history of interpreters has heard of threaded code. There are many variants of this technique, but they all boil down to walking an array of, for example, function pointers or labels instead of an array of opcodes, and jumping through them directly, without an intermediate opcode.

Function calls are expensive and serve no special purpose these days; most other variants of threaded code cannot be implemented within standard C. Even the technique described below relies on a widespread but non-standard C extension: pointers to labels.

In the variant of threaded code I chose for our piglet goals (token threaded code), we keep the bytecode, but before interpreting it we build a table that maps instruction opcodes to the addresses of the instruction handlers:

const void *labels[] = {
    [OP_PUSHI] = &&op_pushi,
    [OP_LOADI] = &&op_loadi,
    [OP_LOADADDI] = &&op_loadaddi,
    [OP_STORE] = &&op_store,
    [OP_STOREI] = &&op_storei,
    [OP_LOAD] = &&op_load,
    [OP_DUP] = &&op_dup,
    [OP_DISCARD] = &&op_discard,
    [OP_ADD] = &&op_add,
    [OP_ADDI] = &&op_addi,
    [OP_SUB] = &&op_sub,
    [OP_DIV] = &&op_div,
    [OP_MUL] = &&op_mul,
    [OP_JUMP] = &&op_jump,
    [OP_JUMP_IF_TRUE] = &&op_jump_if_true,
    [OP_JUMP_IF_FALSE] = &&op_jump_if_false,
    [OP_EQUAL] = &&op_equal,
    [OP_LESS] = &&op_less,
    [OP_LESS_OR_EQUAL] = &&op_less_or_equal,
    [OP_GREATER] = &&op_greater,
    [OP_GREATER_OR_EQUAL] = &&op_greater_or_equal,
    [OP_GREATER_OR_EQUALI] = &&op_greater_or_equali,
    [OP_POP_RES] = &&op_pop_res,
    [OP_DONE] = &&op_done,
    [OP_PRINT] = &&op_print,
    [OP_ABORT] = &&op_abort,
};

Note the && characters: these are pointers to the labels of the instruction bodies, the very non-standard GCC extension mentioned above.

To start executing the code, it is enough to jump through the label pointer that corresponds to the program's first opcode:

goto *labels[NEXT_OP()];

There is no loop here, and there won't be one: each instruction jumps to the next handler itself:

op_pushi: {
    /* get the argument, push it to stack */
    uint16_t arg = NEXT_ARG();
    PUSH(arg);
    /* jump to the next instruction */
    goto *labels[NEXT_OP()];
}

Without a switch, the branch points are "smeared" across the instruction bodies, which in theory should help the branch predictor of a processor with out-of-order execution. It's as if we had embedded the switch directly into the instructions and built the jump table by hand.

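Here is a self-contained sketch of token threaded code with just four handlers, using the GCC/Clang labels-as-values extension (`&&label` and `goto *`). The opcode values, the 2-byte big-endian argument encoding and the function name `run_threaded` are assumptions for illustration. As in the article, there is no range check: a byte without a table entry would index past the `labels` array, so the bytecode must be well-formed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical opcodes; PUSHI takes a 2-byte big-endian argument */
enum { OP_ABORT = 0, OP_PUSHI, OP_ADD, OP_DONE };

/* Returns the top of the stack on OP_DONE, or UINT64_MAX on OP_ABORT.
 * Requires the GCC/Clang "labels as values" extension. */
static uint64_t run_threaded(const uint8_t *ip)
{
    static void *labels[] = {
        [OP_ABORT] = &&op_abort,
        [OP_PUSHI] = &&op_pushi,
        [OP_ADD] = &&op_add,
        [OP_DONE] = &&op_done,
    };
    uint64_t stack[256];
    uint64_t *top = stack;

#define DISPATCH() goto *labels[*ip++]

    DISPATCH(); /* no loop: every handler jumps straight to the next one */

op_pushi: {
    uint16_t arg = (uint16_t)((ip[0] << 8) | ip[1]);
    ip += 2;
    *top++ = arg;
    DISPATCH();
}
op_add: {
    uint64_t right = *--top;
    top[-1] += right;
    DISPATCH();
}
op_done:
    return top[-1];
op_abort:
    return UINT64_MAX;
#undef DISPATCH
}
```

For example, `run_threaded` on `{OP_PUSHI, 0, 40, OP_PUSHI, 0, 2, OP_ADD, OP_DONE}` returns 42. Each handler ends in its own indirect jump, which is exactly the "smeared" dispatch described above.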
That's the whole technique. The piglet liked it for its simplicity. Let's see how it does in practice:

> ./pigletvm runtimes test/sieve.bin 100 > /dev/null
PROFILE: switch code finished 443ms
PROFILE: switch code (no range check) finished took 389ms
PROFILE: threaded code finished took 477ms
PROFILE: trace code finished took 364ms

Oops! This is the slowest of all our techniques! What happened? Let's run the same tests with all GCC optimizations turned off:

> ./pigletvm runtimes test/sieve.bin 100 > /dev/null
PROFILE: switch code finished 969ms
PROFILE: switch code (no range check) finished took 940ms
PROFILE: threaded code finished took 824ms
PROFILE: trace code finished took 1169ms

Here the threaded code does best.

Three factors are at play here:

- An optimizing compiler builds a jump table no worse than our handmade label table.
- Modern compilers are remarkably good at eliminating unnecessary function calls.
- Starting roughly with the Haswell generation, Intel's branch predictors have learned to accurately predict jumps through a single branch point.

Out of old habit, this technique is still used in code such as the Python VM interpreter, but, to be honest, today it is an anachronism.

Let's finally sum up and assess what our piglet has achieved.

Flight Debriefing
I'm not sure this can be called flying, but let's admit it: our piglet has come a long way, from 550 milliseconds for a hundred runs of the sieve to a final 370 milliseconds. We used a range of techniques: superinstructions, eliminating value range checks, the intricate mechanics of traces, and finally even threaded code. Along the way, we mostly stayed within what all popular C compilers implement. A one-and-a-half-times speedup is, I think, a good result, and the piglet has earned an extra portion of bran in its trough.

One of the implicit conditions the piglet and I set ourselves was keeping PigletVM's stack architecture. Switching to a register architecture usually reduces the number of instructions a program needs and can therefore help eliminate unnecessary exits to the instruction dispatcher. I think that could cut another 10-20% of the running time.

Our main condition, no dynamic compilation, is not a law of nature either. Pumping a piglet full of steroids in the form of JIT compilation is very easy these days: libraries like GNU Lightning or LibJIT have done all the dirty work. But even with these libraries, development time and the total amount of code grow considerably.

There are, of course, other techniques that are beyond our piglet's hooves. But there is no limit to perfection, and our piggy journey, the second part of a series of articles on bytecode interpreters, has to end somewhere. If readers come up with interesting ways to speed up the piglet, the piglet and I will be happy to try them.

PS
Special thanks to my sister, Renate Casanova, for the illustration sketches, and to our illustrator, Vladimir Shopotov (https://www.instagram.com/vovazomb/), for the final drawings.