Neural networks: solving the mushroom problem with TensorFlow and Python

TensorFlow is Google's framework for building and working with neural networks. It lets you abstract away the internal details of machine learning and focus directly on solving your problem. It is a very powerful tool that allows you to create, train and use neural networks of any known type. I did not find any explanatory text on this subject on Habr, so I am writing my own. Below I describe a solution to the mushroom classification problem using the TensorFlow library. Incidentally, the algorithm described below is suitable for predictions in almost any domain: for example, the probability that a person will develop cancer, or the cards in an opponent's hand in poker. The data set comes from the Machine Learning Repository. The problem can thus be considered a kind of Hello World of machine learning, alongside the iris problem, where the flower's parameters are expressed as numeric values.
 
 

Sources


 
You can download all the sources from my repository on GitHub: reference. Do this to see the code in action, and use the sources from the repository, because they preserve all the necessary indentation and encoding. The whole process is analyzed in detail below.
 
 

Preparation


 
It is assumed that you have a working TensorFlow installation. If not, you will need to install it first.
 
 

Source code


 
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os

import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split


# The function is responsible for preparing the data from the repository.
# The result is two CSV files with data for training and testing the
# neural network. In particular, the categorical mushroom parameters
# are converted into numeric ones (0 and 1)
def prepare_data(data_file_name):
    # CSV file header as an array, built from the file
    # 'agaricus-lepiota.names' in the repository
    header = ['class', 'cap_shape', 'cap_surface',
              'cap_color', 'bruises', 'odor', 'gill_attachment',
              'gill_spacing', 'gill_size', 'gill_color', 'stalk_shape',
              'stalk_root', 'stalk_surface_above_ring',
              'stalk_surface_below_ring', 'stalk_color_above_ring',
              'stalk_color_below_ring', 'veil_type', 'veil_color',
              'ring_number', 'ring_type', 'spore_print_color',
              'population', 'habitat']
    df = pd.read_csv(data_file_name, sep=',', names=header)

    # A "?" instead of a parameter value marks its absence;
    # we throw such records out of our data set
    df.replace('?', np.nan, inplace=True)
    df.dropna(inplace=True)

    # Edibility or toxicity is marked in our data set by the characters
    # 'e' and 'p' respectively. These values must be represented
    # numerically, so we put 0 for poisonous and 1 for edible
    df['class'].replace('p', 0, inplace=True)
    df['class'].replace('e', 1, inplace=True)

    # Initially the mushroom parameters are symbolic, that is, words.
    # TensorFlow can only work with numeric data. The pandas function
    # "get_dummies" converts our data into numbers
    cols_to_transform = header[1:]
    df = pd.get_dummies(df, columns=cols_to_transform)

    # Now the converted data must be split into two sets: one for
    # training the neural network (the larger one) and one for
    # testing it (the smaller one)
    df_train, df_test = train_test_split(df, test_size=0.1)

    # Determine the number of rows and columns in each data set
    num_train_entries = df_train.shape[0]
    num_train_features = df_train.shape[1] - 1
    num_test_entries = df_test.shape[0]
    num_test_features = df_test.shape[1] - 1

    # Write the resulting sets into temporary CSV files, because the
    # number of rows and columns must be written at the beginning of
    # the working CSV, as TensorFlow requires
    df_train.to_csv('train_temp.csv', index=False)
    df_test.to_csv('test_temp.csv', index=False)

    # Write the counts into the training file, then into the test file
    open("mushroom_train.csv", "w").write(str(num_train_entries) +
                                          "," + str(num_train_features) +
                                          "," + open("train_temp.csv").read())
    open("mushroom_test.csv", "w").write(str(num_test_entries) +
                                         "," + str(num_test_features) +
                                         "," + open("test_temp.csv").read())

    # Delete the temporary files, they are no longer needed
    os.remove("train_temp.csv")
    os.remove("test_temp.csv")


# The function generates the test input data for TensorFlow
def get_test_inputs():
    x = tf.constant(test_set.data)
    y = tf.constant(test_set.target)
    return x, y


# The function generates the training input data for TensorFlow
def get_train_inputs():
    x = tf.constant(training_set.data)
    y = tf.constant(training_set.target)
    return x, y


# The function returns the data of two test mushrooms for predicting
# their edibility (expected result: poisonous, edible).
# In other words, this is a function for checking the trained and
# tested neural network
def new_samples():
    return np.array([[0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
                      1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
                      0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0,
                      0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0,
                      0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
                      0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1,
                      0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                      1, 0, 1, 0, 0, 0, 0],
                     [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
                      0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
                      0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0,
                      0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0,
                      0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
                      0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0,
                      0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
                      0, 0, 0, 0, 0, 0, 1]], dtype=np.int)


if __name__ == "__main__":
    MUSHROOM_DATA_FILE = "agaricus-lepiota.data"

    # Prepare the mushroom data for TensorFlow by creating
    # two CSV files (training and test)
    prepare_data(MUSHROOM_DATA_FILE)

    # Load the prepared data
    training_set = tf.contrib.learn.datasets.base.load_csv_with_header(
        filename='mushroom_train.csv',
        target_dtype=np.int,
        features_dtype=np.int,
        target_column=0)
    test_set = tf.contrib.learn.datasets.base.load_csv_with_header(
        filename='mushroom_test.csv',
        target_dtype=np.int,
        features_dtype=np.int,
        target_column=0)

    # Specify that all feature columns have real values (details below)
    feature_columns = [tf.contrib.layers.real_valued_column("", dimension=98)]

    # Create a three-layer DNN with 10, 20 and 10 neurons per layer
    classifier = tf.contrib.learn.DNNClassifier(
        feature_columns=feature_columns,
        hidden_units=[10, 20, 10],
        n_classes=2,
        model_dir="/tmp/mushroom_model")

    # Train the neural network
    classifier.fit(input_fn=get_train_inputs, steps=2000)

    # Evaluate the neural network on the test data set
    accuracy_score = classifier.evaluate(input_fn=get_test_inputs,
                                         steps=1)["accuracy"]
    print("\nAccuracy: {0:f}\n".format(accuracy_score))

    # Try the neural network on our two trial mushrooms
    predictions = list(classifier.predict_classes(input_fn=new_samples))
    print("Predicting the edibility of test mushrooms: {}\n"
          .format(predictions))

 

Loading and preparing the data from the repository


 
The data for training and testing the neural network will be taken from the Machine Learning Repository, created specially for such purposes. The data come as two files: agaricus-lepiota.data and agaricus-lepiota.names. The first contains 8124 rows, one row per mushroom: a class column followed by 22 columns for the mushroom's parameters, each encoded as a single-character abbreviation of the full parameter word. The legend for all the symbols is in the file agaricus-lepiota.names.
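To get a feel for the raw format before any processing, a few rows can be parsed directly with pandas. The rows below are illustrative samples laid out like agaricus-lepiota.data (one class symbol followed by 22 attribute symbols); this snippet is a sketch for inspection only, not part of the article's script:

```python
import io

import pandas as pd

# Three illustrative rows in the layout of agaricus-lepiota.data:
# the first column is the class ('p' = poisonous, 'e' = edible),
# the remaining 22 are single-letter mushroom attributes.
sample = io.StringIO(
    "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u\n"
    "e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g\n"
    "e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m\n"
)
df = pd.read_csv(sample, header=None)
print(df.shape)  # (3, 23): one class column plus 22 attribute columns
```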
 
 
Data from the repository must be processed to bring it into a form acceptable to TensorFlow. First we import the libraries we will need:
 
 
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import os

 
Then we form a header from the mushroom parameters for TensorFlow, so that the library knows which column in the data file corresponds to which parameter. The header is glued onto the data file. We build it as an array whose elements we take from the file agaricus-lepiota.names.
 
 
header = ['class', 'cap_shape', 'cap_surface',
          'cap_color', 'bruises', 'odor', 'gill_attachment',
          'gill_spacing', 'gill_size', 'gill_color', 'stalk_shape',
          'stalk_root', 'stalk_surface_above_ring',
          'stalk_surface_below_ring', 'stalk_color_above_ring',
          'stalk_color_below_ring', 'veil_type', 'veil_color',
          'ring_number', 'ring_type', 'spore_print_color',
          'population', 'habitat']
df = pd.read_csv(data_file_name, sep=',', names=header)

 
Now we need to deal with missing data. In this data set the symbol "?" appears in the agaricus-lepiota.data file instead of a missing parameter value. There are many methods of handling such cases; we will simply delete every row with at least one missing parameter.
 
 
df.replace('?', np.nan, inplace=True)
df.dropna(inplace=True)
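Dropping rows is only one of the many methods just mentioned. As an alternative sketch (on a toy column, not the article's code), a missing value can instead be imputed with the column's most frequent value:

```python
import numpy as np
import pandas as pd

# Toy column with a missing value marked '?', as in the mushroom data
df = pd.DataFrame({"odor": ["a", "?", "a", "l"]})
df.replace("?", np.nan, inplace=True)

# Instead of dropping the row, fill the gap with the most frequent value
df["odor"] = df["odor"].fillna(df["odor"].mode()[0])
print(df["odor"].tolist())  # ['a', 'a', 'a', 'l']
```

This keeps every row at the cost of inventing a value; for the mushroom problem the simple dropna approach is perfectly adequate.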

 
Next, the symbolic edibility parameter must be replaced with a numeric one by hand. That is, "p" and "e" are replaced by 0 and 1.
 
 
df['class'].replace('p', 0, inplace=True)
df['class'].replace('e', 1, inplace=True)

 
After that, the remaining data can be converted to numbers. This is handled by the get_dummies function of the pandas library.
 
 
cols_to_transform = header[1:]
df = pd.get_dummies(df, columns=cols_to_transform)

 
Any neural network must be trained. But besides that, it also needs to be tested in order to estimate how accurate it will be in real conditions. For this, our data set is split in two: a training set and a test set. The first is larger than the second, as it should be.
 
 
df_train, df_test = train_test_split(df, test_size=0.1)
 
And the last step. TensorFlow requires that the number of rows and columns be indicated at the beginning of the data files. We manually extract this information from our training and test sets and then write it into the resulting CSV files.
 
 
# Determine the number of rows and columns in each set
num_train_entries = df_train.shape[0]
num_train_features = df_train.shape[1] - 1
num_test_entries = df_test.shape[0]
num_test_features = df_test.shape[1] - 1

# Write the sets into temporary CSV files
df_train.to_csv('train_temp.csv', index=False)
df_test.to_csv('test_temp.csv', index=False)

# Write the counts obtained above into the final CSV files,
# then append the contents of the temporary files
open("mushroom_train.csv", "w").write(str(num_train_entries) +
                                      "," + str(num_train_features) +
                                      "," + open("train_temp.csv").read())
open("mushroom_test.csv", "w").write(str(num_test_entries) +
                                     "," + str(num_test_features) +
                                     "," + open("test_temp.csv").read())

 
As a result, you should get two files: training and test.
 
 

Feeding the prepared data into TensorFlow


 
Now that the CSV files with the mushroom data have been downloaded from the repository and processed, they can be fed to TensorFlow for training. This is done using the load_csv_with_header() function provided by the framework itself:
 
 
training_set = tf.contrib.learn.datasets.base.load_csv_with_header(
    filename='mushroom_train.csv',
    target_dtype=np.int,
    features_dtype=np.int,
    target_column=0)
test_set = tf.contrib.learn.datasets.base.load_csv_with_header(
    filename='mushroom_test.csv',
    target_dtype=np.int,
    features_dtype=np.int,
    target_column=0)

 
The load_csv_with_header() function builds a training data set from the files we assembled above. Besides the data file, the function takes target_dtype as an argument: the type of the value to be predicted in the end. In our case the neural network must learn to predict whether a mushroom is edible or poisonous, which can be expressed as 1 or 0; thus, in our case, target_dtype is an integer. features_dtype sets the type of the parameters to learn from. In our case it is also integer (initially the parameters were strings, but, as you remember, we converted them to numbers). Finally, the target_column parameter is the index of the column holding the value the neural network should predict, that is, the edibility column.
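The header convention can be illustrated without TensorFlow. The script prepends "<rows>,<features>," to each file, so the very first line starts with the two counts. A minimal sketch of parsing them back (the miniature file content here is invented for the example):

```python
# Miniature file in the same layout the script produces: the first line
# starts with "<rows>,<features>," followed by the original pandas
# header, and the data rows come after it.
content = "2,3,class,f1,f2,f3\n1,0,1,0\n0,1,0,1\n"

# The two leading fields of the first line are the counts
first_line = content.splitlines()[0].split(",")
num_rows, num_features = int(first_line[0]), int(first_line[1])
print(num_rows, num_features)  # 2 3
```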
 
 

Creating the TensorFlow classifier object


 
That is, an object of the class that directly produces the predictions; in other words, the neural network itself.
 
 
feature_columns = [tf.contrib.layers.real_valued_column("", dimension=98)]
classifier = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[10, 20, 10],
    n_classes=2,
    model_dir="/tmp/mushroom_model")

 
The first parameter is feature_columns: the mushroom parameters. Note that its value is created just above, where dimension=98 is given as input, meaning 98 distinct mushroom parameters, everything except edibility.
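Rather than hard-coding 98, the dimension can be derived from the one-hot-encoded frame itself: it is simply the number of columns minus the target column. A toy illustration (the columns here are invented, not the real mushroom attributes):

```python
import pandas as pd

# Toy frame: 'class' is the target, the other two columns are
# categorical attributes with 2 and 3 distinct values respectively
df = pd.DataFrame({
    "class": [0, 1, 0],
    "odor": ["a", "l", "a"],
    "habitat": ["g", "m", "u"],
})
df = pd.get_dummies(df, columns=["odor", "habitat"])

# After encoding: 1 target column + 2 + 3 dummy columns
dimension = df.shape[1] - 1
print(dimension)  # 5
```

On the real mushroom frame the same expression, df.shape[1] - 1, yields the 98 used above.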
 
 
hidden_units is the number of neurons in each layer of the neural network. Choosing the right number of layers and neurons is something of an art in machine learning; these values can only really be determined with experience. We took these particular numbers simply because they appear in one of the TensorFlow tutorials, and they work.
 
 
n_classes is the number of classes to predict. We have two: edible and not.
 
 
model_dir is the path where the trained neural network model is saved. It will be used later to predict results without retraining the network every time.
 
 

Training


 
For convenience later on, we create two functions:
 
 
def get_test_inputs():
    x = tf.constant(test_set.data)
    y = tf.constant(test_set.target)
    return x, y


def get_train_inputs():
    x = tf.constant(training_set.data)
    y = tf.constant(training_set.target)
    return x, y

 
Each function provides its own set of input data, one for training and one for testing. x and y are the TensorFlow constants the framework needs for its work. Do not go into the details; just accept that these functions act as an intermediary between the data and the neural network.
 
 
We train the network:
 
 
classifier.fit(input_fn=get_train_inputs, steps=2000)
 
The first parameter takes the input data formed just above, the second the number of training steps. Again, this figure comes from one of the TensorFlow manuals; an understanding of this setting will come with experience.
 
 
Next, we evaluate the trained network on the test data set prepared above. The result is the accuracy of the network's future predictions (accuracy_score).
 
 
accuracy_score = classifier.evaluate(input_fn=get_test_inputs,
                                     steps=1)["accuracy"]
print("\nAccuracy: {0:f}\n".format(accuracy_score))

 

Trying it out


 
Now the neural network is ready, and you can try to use it to predict the edibility of a mushroom.
 
 
def new_samples():
    return np.array([[0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
                      1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
                      0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0,
                      0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0,
                      0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
                      0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1,
                      0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                      1, 0, 1, 0, 0, 0, 0],
                     [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
                      0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
                      0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0,
                      0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0,
                      0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
                      0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0,
                      0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
                      0, 0, 0, 0, 0, 0, 1]], dtype=np.int)

 
The function above returns the data of two completely new mushrooms that appeared in neither the training set nor the test set (in fact, they were simply pulled out of the latter). Imagine, for example, that you bought them at the market and want to know whether you can eat them. The code below determines this:
 
 
predictions = list(classifier.predict_classes(input_fn=new_samples))
print("Predicting the edibility of test mushrooms: {}\n"
      .format(predictions))

 
The result should be the following:
 
 
Predicting the edibility of test mushrooms: [0, 1]
 
This means the first mushroom is poisonous and the second is edible. Predictions like this can be made from any data, be it mushrooms, people, animals or anything else. It is enough to form the input data in the right way, and you can predict, for example, the probability of a patient developing arrhythmia, or the movement of stock quotes on an exchange.
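Hand-writing a 98-element one-hot vector, as new_samples() does, is error-prone. A more practical sketch (with abbreviated, invented column names, not the real 98) encodes a new symbolic sample with get_dummies and aligns it to the training columns via reindex, so that categories absent from the sample become 0:

```python
import pandas as pd

# Columns produced by get_dummies on the training data
# (abbreviated here; the real frame has 98 such columns)
train_columns = ["odor_a", "odor_l", "odor_n", "habitat_g", "habitat_m"]

# A new mushroom described symbolically, as in agaricus-lepiota.data
new_mushroom = pd.DataFrame({"odor": ["l"], "habitat": ["g"]})
encoded = pd.get_dummies(new_mushroom, columns=["odor", "habitat"], dtype=int)

# Align with the training layout; missing dummy columns are filled with 0
encoded = encoded.reindex(columns=train_columns, fill_value=0)
print(encoded.values.tolist())  # [[0, 1, 0, 1, 0]]
```

The resulting row has exactly the layout the trained classifier expects, without any manual counting of positions.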