Association rules, or beer with diapers
Introduction to the theory
Association rule learning (ARL) is, on the one hand, a simple method of finding relationships (associations) in datasets, or more specifically in itemsets; on the other hand, it is a method used quite often in real life. Piatetsky-Shapiro G. first discussed it in detail in "Discovery, Analysis, and Presentation of Strong Rules" (1991) [1]. Agrawal R., Imielinski T., and Swami A. developed the subject further in "Mining Association Rules between Sets of Items in Large Databases" (1993) [2] and "Fast Algorithms for Mining Association Rules" (1994) [3].
In 1992, Teradata's retail consulting group, led by Thomas Blischok, analyzed 1.2 million transactions in 25 stores for the retailer Osco Drug [4] (no, they did not sell drugs, or rather not only medicines: a drug store is a convenience store near home). After analyzing all these transactions, the strongest rule turned out to be "between 5:00 pm and 7:00 pm, beer and diapers are most often bought together." Unfortunately, this rule seemed so counterintuitive to Osco Drug's management that they did not put diapers on the shelves next to the beer. Although the explanation for the beer-and-diapers pair was quite obvious: when both members of a young family returned home from work (at around 5 pm), wives usually sent their husbands to the nearest store for diapers. And the husbands, without thinking long, combined business with pleasure: they bought diapers on their wives' instructions and beer for their own evening pastime.
Describing an association rule
So, we have found out that beer and diapers make a fine gentleman's set. What's next?
Suppose that we have a certain dataset (collection) D = {d1, d2, ..., dn}, where each d is a unique transaction-itemset (for example, a cash receipt). Inside each d there is a set of items (i-items), ideally represented in binary form: d1 = [{Beer: 1}, {Water: 0}, {Cola: 1}, ...], d2 = [{Beer: 0}, {Water: 1}, {Cola: 1}, ...]. It is customary to describe each itemset by the number of its nonzero values (k-itemset); for example, [{Beer: 1}, {Water: 0}, {Cola: 1}] is a 2-itemset.
If the data does not come in this form, you can convert it yourself (one-hot encoding, e.g. via pd.get_dummies).
Thus, the dataset is a sparse matrix with values in {0, 1}. This is a binary dataset. There are other kinds of representation: the vertical dataset (for each individual item, it lists the vector of transactions in which that item is present) and the transaction dataset (roughly as in a cash receipt).
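As a quick sketch of such a conversion (the items and transaction ids below are invented for illustration):

```python
import pandas as pd

# a toy transaction log in "long" form: one row per (transaction, item)
transactions = pd.DataFrame({
    'transaction': [1, 1, 2, 2, 3],
    'item': ['Beer', 'Cola', 'Water', 'Cola', 'Beer'],
})

# one-hot encode the items and collapse to one row per transaction:
# the result is the binary dataset described above, with values in {0, 1}
binary = (pd.get_dummies(transactions['item'])
          .groupby(transactions['transaction'])
          .max()
          .astype(int))
print(binary)
```

Each row of `binary` is now a transaction and each column an item, exactly the sparse {0, 1} matrix described above.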
OK, the data is transformed. How do we find the rules?
There are a number of key (basic, if you like) concepts in ARL that will help us derive these rules.
Support
The first concept in ARL is support:
supp(X) = |{t ∈ T : X ⊆ t}| / |T|, where X is an itemset containing i-items and T is the set of transactions. I.e., in general this is an indicator of how "frequent" the itemset is across all analyzed transactions. But that concerns only X. We are more interested in the case when one itemset contains, say, x1 and x2. Here, too, everything is simple. Suppose that x1 = {Beer} and x2 = {Diapers}; then we need to count in how many transactions this pair occurs together:
supp(X → Y) = |t(X ∪ Y)| / |T|, where |t(X ∪ Y)| is the number of transactions containing both X and Y.
We will deal with this concept on a toy dataset.
play_set  # a micro-dataset listing transaction numbers and, in binary form, what each transaction contained
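A minimal sketch of computing support on such a toy dataset (the binary values are invented for illustration):

```python
import pandas as pd

# toy binary dataset: one row per transaction, one column per item
play_set = pd.DataFrame({
    'Beer':    [1, 0, 1, 1, 0],
    'Diapers': [1, 1, 1, 0, 0],
    'Cola':    [0, 1, 1, 0, 1],
})

def support(df, items):
    # fraction of transactions containing every item in `items`
    return df[list(items)].all(axis=1).mean()

print(support(play_set, ['Beer']))             # 0.6
print(support(play_set, ['Beer', 'Diapers']))  # 0.4
```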
Confidence (reliability)
The next key concept is confidence. This is an indicator of how often our rule works for the entire data set.
Let's take an example: we want to calculate confidence for the rule "whoever buys beer also buys diapers."

To do this, we first compute the support of the rule "buys beer," then the support of the rule "buys beer and diapers," and simply divide one by the other:

conf(X → Y) = supp(X ∪ Y) / supp(X)

I.e., we count in how many transactions beer was bought, supp(X), and in how many beer and diapers were bought together, supp(X ∪ Y).
Doesn't this remind you of anything? Conditional probability: Bayes looks at all this somewhat puzzled and with contempt.
Let's look at our microdata.
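Continuing the toy example (same invented data), confidence is just a ratio of two supports:

```python
import pandas as pd

play_set = pd.DataFrame({
    'Beer':    [1, 0, 1, 1, 0],
    'Diapers': [1, 1, 1, 0, 0],
})

def support(df, items):
    return df[list(items)].all(axis=1).mean()

# conf(Beer -> Diapers) = supp(Beer and Diapers) / supp(Beer)
conf = support(play_set, ['Beer', 'Diapers']) / support(play_set, ['Beer'])
print(round(conf, 3))  # 0.667
```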
Lift
The next concept on our list is lift. Roughly speaking, lift is the ratio of the items' "dependence" to their "independence." Lift shows how strongly items depend on each other. This is evident from the formula:

lift(X → Y) = supp(X ∪ Y) / (supp(X) × supp(Y))
For example, we want to understand the dependence between buying beer and buying diapers. To do this, we take the support of the rule "bought beer and diapers" and divide it by the product of the supports of the rules "bought beer" and "bought diapers." If lift = 1, we say the items are independent and there is no joint-purchase rule. If lift > 1, the amount by which lift exceeds one shows us the "strength" of the rule: the further above one, the stronger. Alternatively, lift can be defined as the ratio of confidence to expected confidence, i.e. the ratio of the reliability of the rule when both elements (or however many there are) are bought together to the reliability of the rule when one of the elements was bought (with or without the second).
I.e., our rule that beer is bought together with diapers is 11% stronger than the rule that diapers are simply bought.
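On the same invented toy data, lift relates the joint support to the product of the individual supports:

```python
import pandas as pd

play_set = pd.DataFrame({
    'Beer':    [1, 0, 1, 1, 0],
    'Diapers': [1, 1, 1, 0, 0],
})

def support(df, items):
    return df[list(items)].all(axis=1).mean()

# lift(Beer -> Diapers) = supp(Beer and Diapers) / (supp(Beer) * supp(Diapers))
lift = support(play_set, ['Beer', 'Diapers']) / (
    support(play_set, ['Beer']) * support(play_set, ['Diapers']))
print(round(lift, 2))  # 1.11, i.e. the rule is ~11% stronger than independence
```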
Conviction (persuasiveness)
In general, conviction is the "error rate" of our rule: e.g., how often beer was bought without diapers, i.e. how often the rule fails:

conv(X → Y) = (1 − supp(Y)) / (1 − conf(X → Y))
The further the value produced by the formula above is from 1, the better. For example, if the conviction of buying beer and diapers together equals 1.2, this means the rule "bought beer and diapers" is 1.2 times (20%) more accurate than if the co-occurrence of these items in one transaction were purely random. A somewhat unintuitive concept, and it is not used as often as the previous three.
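And conviction on the same invented toy data:

```python
import pandas as pd

play_set = pd.DataFrame({
    'Beer':    [1, 0, 1, 1, 0],
    'Diapers': [1, 1, 1, 0, 0],
})

def support(df, items):
    return df[list(items)].all(axis=1).mean()

conf = support(play_set, ['Beer', 'Diapers']) / support(play_set, ['Beer'])
# conv(Beer -> Diapers) = (1 - supp(Diapers)) / (1 - conf(Beer -> Diapers))
conv = (1 - support(play_set, ['Diapers'])) / (1 - conf)
print(round(conv, 2))  # 1.2
```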
There are a number of commonly used classical algorithms that find rules in itemsets based on the concepts above: the naive or brute force algorithm, the Apriori algorithm, the ECLAT algorithm, the FP-growth algorithm, and others.
Brute force
Finding rules in itemsets is generally not difficult; what is difficult is doing it efficiently. The brute force algorithm is the simplest and, at the same time, the most inefficient way.
In pseudocode, the algorithm looks like this:

INPUT: dataset D containing the transaction list
OUTPUT: sets of itemsets F(D, s), where F(D, s) is the set of itemsets of size k that occur at least s times in D
APPROACH:
1. R is an integer array of counters for all combinations of items in D, of size 2^|I| − 1
2. for n ∈ [1, |D|] do:
       F — all possible combinations of items in transaction n
       increment each value in R according to the combinations in F
3. return all itemsets in R that occur at least s times
The complexity of the brute force algorithm is obvious: in order to find all possible association rules by brute force, we need to enumerate all subsets X of the set I and compute supp(X) for each subset X. This approach consists of the following steps:

- candidate generation. This step enumerates all possible subsets X of the set I. They are called candidates, since each subset can theoretically turn out to be a frequent itemset. The set of potential candidates has 2^|I| − 1 elements (here |I| denotes the number of distinct items in the set I, and |D| the number of transactions).
- support calculation. At this step, supp(X) is computed for each candidate X.

Thus, the complexity of our algorithm is O(2^|I| · |I| · |D|), where 2^|I| − 1 is the number of possible candidates and O(|I| · |D|) is the complexity of computing supp(X), since to compute supp(X) we have to check every item of every transaction in D. In other words, the number of operations grows exponentially with the number of items.
Thus, to apply the brute force algorithm we need 2^i memory cells, where i is the number of individual items. E.g., to store and count combinations of 34 items we need 128 GB of RAM (with 64-bit integer cells: 2^34 × 8 bytes = 2^37 bytes = 128 GB).
Thus, brute force will help us count transactions at a shawarma stand by the station, but for anything more serious it is completely unsuitable.
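To make the idea concrete, here is a naive brute-force sketch (the transactions are invented; this is exactly the exponential enumeration described above, so use it only on tiny data):

```python
from itertools import combinations

def brute_force_itemsets(transactions, min_count=2):
    """Enumerate every possible itemset and count its support."""
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count >= min_count:
                frequent[candidate] = count
    return frequent

transactions = [{'Beer', 'Diapers'}, {'Beer', 'Cola'}, {'Beer', 'Diapers', 'Cola'}]
print(brute_force_itemsets(transactions))
# {('Beer',): 3, ('Cola',): 2, ('Diapers',): 2, ('Beer', 'Cola'): 2, ('Beer', 'Diapers'): 2}
```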
Apriori Algorithm
Theory
Concepts used:

- a set of items (itemset): I = {i1, i2, ..., in}
- a set of transaction identifiers (tidset): T = {t1, t2, ..., tm}
- a set of transactions: D = {(t, X) | t ∈ T, X ⊆ I}

We additionally introduce a few more concepts.
We will consider a prefix tree in which two elements X and Y are connected if X is a direct subset of Y. This lets us enumerate all subsets of the set I.
So, consider the algorithm Apriori.
Apriori uses the following statement: if X ⊆ Y, then supp(X) ≥ supp(Y).
Two properties follow from this:

- if Y occurs frequently, then any subset X ⊆ Y also occurs frequently
- if X is rare, then any superset Y ⊇ X is also rare
The Apriori algorithm traverses the prefix tree level by level and computes the frequency of occurrence of subsets X in D. In accordance with the algorithm:

- rare subsets and all their supersets are excluded from the search;
- supp(X) is computed for each surviving candidate X of size k at level k.
In pseudocode, the Apriori algorithm looks like this:

INPUT: dataset D containing the transaction list, and σ, the user-defined support threshold
OUTPUT: list of itemsets F(D, σ)
APPROACH:
1. C1 ← [{i} | i ∈ J]   # initial candidates: all individual items
2. k ← 1
3. while Ck ≠ {} do:
4.     # count support for all candidates:
       for all transactions (tid, I) ∈ D do:
           for all candidates X ∈ Ck do:
               if X ⊆ I:
                   X.support++
5.     # extract all the frequent itemsets:
       Fk = {X | X.support ≥ σ}
6.     # generate new candidates:
       for all X, Y ∈ Fk with X[i] = Y[i] for 1 ≤ i ≤ k−1 and X[k] < Y[k] do:
           I = X ∪ {Y[k]}
           if ∀ J ⊆ I, |J| = k: J ∈ Fk
           then Ck+1 ← Ck+1 ∪ {I}
       k++
Thus, Apriori first finds all single itemsets (containing one item) that satisfy the user-defined support, then combines them into pairs following the anti-monotonicity principle: only sets all of whose subsets occur frequently can themselves occur frequently.
A clear disadvantage of this approach is that the entire dataset has to be "scanned" and all supports counted at every level of the breadth-first search (BFS). It can also devour RAM on large datasets, although in terms of speed the algorithm is still much more efficient than brute force.
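The level-wise scheme above can be sketched in a few lines of plain Python (an illustrative, unoptimized implementation; the transactions are invented):

```python
from itertools import combinations

def apriori_sketch(transactions, min_count=2):
    """Level-wise Apriori: extend only candidates all of whose subsets are frequent."""
    transactions = [set(t) for t in transactions]
    candidates = sorted({frozenset([i]) for t in transactions for i in t}, key=sorted)
    frequent, k = {}, 1
    while candidates:
        # pass over the data: count support of the current candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # generate size-(k+1) candidates whose every k-subset is frequent
        items = sorted({i for c in level for i in c})
        candidates = [frozenset(c) for c in combinations(items, k + 1)
                      if all(frozenset(s) in level for s in combinations(c, k))]
        k += 1
    return frequent

transactions = [{'Beer', 'Diapers'}, {'Beer', 'Cola'}, {'Beer', 'Diapers', 'Cola'}]
freq = apriori_sketch(transactions)
print(freq[frozenset({'Beer', 'Diapers'})])  # 2
```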
Implementation in Python
from sklearn import... emmm, and there is nothing to import. At present there are no modules for ARL in sklearn. Well, no matter, we google or write our own, right?
A number of implementations float around the net, for example [here], [here], and even [here].
In practice we stick to the apyori module, written by Yu Mochizuki. We will not reproduce the full code (you can see it [here]), but we will show the architecture of the solution and a usage example.
The Mochizuki solution can be divided into 4 parts: Data structure, Internal functions, API and Application functions.
The first part of the module (Data Structure) works with the original dataset. It implements the TransactionManager class, whose methods assemble the transactions into a matrix, form the list of candidate rules, and compute support for each rule. The internal functions then build the lists of rules based on support and rank them accordingly. The API lets you work directly with datasets, and the application functions process transactions and output the result in readable form. No rocket science.
Let's see how to use the module on a real (well, in this case - toy) dataset.
Data loading
import pandas as pd
# load the data
dataset = pd.read_csv('data/Market_Basket_Optimisation.csv', header=None)

Let's look at the dataset:

dataset.head()
We see that the data set is a sparse matrix, where in the rows we have a set of items in each transaction.
We will replace NaN with the last value inside the transaction, so that later it would be easier to process the entire data set.
Code

dataset.fillna(method='ffill', axis=1, inplace=True)
dataset.head()

# create a matrix of items
transactions = []
for i in range(0, 7501):
    transactions.append([str(dataset.values[i, j]) for j in range(0, 20)])
# load apriori
import apriori

%%time
# and learn the rules. Note that we pick the threshold values ourselves,
# depending on how "strong" the rules we want to get are
# min_support - minimum support for rules (dtype=float)
# min_confidence - minimum confidence for rules (dtype=float)
# min_lift - minimum lift (dtype=float)
# max_length - maximum itemset length (remember the k-itemset) (dtype=integer)
result = list(apriori.apriori(transactions, min_support=????, min_confidence=0.?, min_lift=?, min_length=2))
We visualize the output:

Code

import shutil, os
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO
import json  # we will convert to json using the built-in methods

output = []
for RelationRecord in result:
    o = StringIO()
    apriori.dump_as_json(RelationRecord, o)
    output.append(json.loads(o.getvalue()))
data_df = pd.DataFrame(output)

# and glance at the results
pd.set_option('display.max_colwidth', -1)
from IPython.display import display, HTML
display(HTML(data_df.to_html()))
In total we see:

1. items - the pair of items
2. items_base - the first element of the pair
3. items_add - the second (added by the algorithm) element of the pair
4. confidence - the confidence value for the pair
5. lift - the lift value for the pair
6. support - the support value for the pair; if desired, you can sort by it.
The results are logical: escalope and macaroni, escalope and creamy mushroom sauce, chicken and low-fat sour cream, soft cheese and honey, etc. - all this is quite logical and, most importantly, tasty combinations.
Implementation in R
ARL is a case where R-philes can gloat (Java-philes generally look at us mortals with contempt, but more on that below). R has the arules library, which implements apriori and other algorithms. The official docs can be viewed [here].
Let's look at it in action:
First, install it (if not already installed).
Installing the library

install.packages('arules')
We read the data and convert it into a transaction matrix.
Reading the data

library(arules)
dataset = read.csv('Market_Basket_Optimisation.csv', header = FALSE)
dataset = read.transactions('Market_Basket_Optimisation.csv', sep = ',', rm.duplicates = TRUE)
Let's look at the data:
summary (dataset)
itemFrequencyPlot (dataset, topN = 10)
The graph shows the frequency of individual items in transactions.
Let's learn our rules:
In general, the apriori call looks like this: apriori(data, parameter = NULL, appearance = NULL, control = NULL), where

- data - our dataset
- parameter - the list of model parameters: minimum support, confidence, and lift
- appearance - controls the display of the data; it can take the values lhs, rhs, both, items, none, which determine the position of items in the output
- control - controls the sorting of the output (ascending, descending, no sorting) and whether to display a progress bar (the verbose parameter)
Train the model:
rules = apriori(data = dataset, parameter = list(support = ????, confidence = 0.2))
And look at the results:
inspect(sort(rules, by = 'lift')[1:10])
Make sure that the output has roughly the same results as when using the apyori module in Python:
1. {light cream} => {chicken}
2. {pasta} => {escalope}
3. {pasta} => {shrimp}
4. {eggs, ground beef} => {herb & pepper}
5. {whole wheat pasta} => {olive oil}
6. {herb & pepper, spaghetti} => {ground beef}
7. {herb & pepper, mineral water} => {ground beef}
8. {tomato sauce} => {ground beef}
9. {mushroom cream sauce} => {escalope}
10. {frozen vegetables, mineral water, spaghetti} => {ground beef}
ECLAT Algorithm
Theory
The idea of the ECLAT (Equivalence CLAss Transformation) algorithm is to speed up the computation of supp(X). To do this, we need to index our database D so that supp(X) can be computed quickly.
It is easy to see that if t(X) denotes the set of all transactions in which the itemset X occurs, then

t(X ∪ Y) = t(X) ∩ t(Y)

and

supp(X ∪ Y) = |t(X ∪ Y)|,

that is, supp(X ∪ Y) equals the cardinality (size) of the set t(X ∪ Y).
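In the vertical representation this intersection is trivial; a tiny sketch (the tidsets are invented for illustration):

```python
# vertical representation: item -> set of ids of transactions containing it
tidsets = {
    'Beer':    {1, 2, 3, 5},
    'Diapers': {1, 3, 4},
}

# t(Beer ∪ Diapers) = t(Beer) ∩ t(Diapers), and supp is just the cardinality
t_xy = tidsets['Beer'] & tidsets['Diapers']
print(sorted(t_xy), len(t_xy))  # [1, 3] 2
```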
This approach can be significantly improved by shrinking the intermediate sets of transaction identifiers (tidsets). Namely, at intermediate levels we can store not the full tidsets but only the differences between them (diffsets). Suppose that Xa = X ∪ {a} and Xab = X ∪ {a, b}. Then we get the diffset

d(Xab) = t(Xa) \ t(Xb),

i.e. the set of all transaction ids that contain the prefix Xa but do not contain the element b.
Unlike the Apriori algorithm, ECLAT searches depth-first (DFS). It is sometimes called "vertical" (as opposed to the "horizontal" Apriori).
INPUT: dataset D containing the transaction list, σ - the user-defined support threshold, and a prefix I (initially empty)
OUTPUT: list of itemsets F[I](D, σ) for the corresponding prefix I
APPROACH:
1. F[I] ← {}
2. for all i ∈ J occurring in D do:
       F[I] := F[I] ∪ {I ∪ {i}}
3.     # create D_i:
       D_i ← {}
       for all j ∈ J occurring in D such that j > i do:
           C ← cover({i} ∪ {j})
           if |C| ≥ σ then:
               D_i ← D_i ∪ {(j, C)}
4.     # DFS recursion:
       compute F[I ∪ {i}](D_i, σ)
       F[I] := F[I] ∪ F[I ∪ {i}]
The key concept for the ECLAT algorithm is the I-prefix. At the start, an empty set I is generated; this lets us select all the frequent itemsets on the first pass. Then the algorithm calls itself, growing I by one element at each step, until the user-specified length of I is reached.
To store the values, a prefix tree is used (a trie, not a tree :), [more details here]). First the null root of the tree is constructed (the empty set I); then, as it passes over the itemsets, the algorithm registers the items contained in each itemset, with the leftmost branch being a child of the null root, and so on downward. There are as many branches as there are distinct item combinations in the itemsets. This approach allows an itemset to be written to memory only once, which makes ECLAT faster than Apriori.
Implementation in Python
There are several implementations of this algorithm in Python; those who wish can google them and even try to make them work, but we'll write our own: simple and, most importantly, working.
import numpy as np

"""
The class is initialized with 3 parameters:
- min_support - the minimum support we consider for an itemset,
  calculated as a fraction of the number of transactions
- max_items - the maximum number of elements in an itemset
- min_items - the minimum number of elements in an itemset
"""
class Eclat:
    # initialize an object of the class
    def __init__(self, min_support=0.01, max_items=5, min_items=2):
        self.min_support = min_support
        self.max_items = max_items
        self.min_items = min_items
        self.item_lst = list()
        self.item_len = 0
        self.item_dict = dict()
        self.final_dict = dict()
        self.data_size = 0

    # create a dictionary of non-zero objects from all transactions (vertical dataset)
    def read_data(self, dataset):
        for index, row in dataset.iterrows():
            row_wo_na = set(row[0])
            for item in row_wo_na:
                item = item.strip()
                if item in self.item_dict:
                    self.item_dict[item][0] += 1
                else:
                    self.item_dict.setdefault(item, []).append(1)
                self.item_dict[item].append(index)
        # set the instance variables
        self.data_size = dataset.shape[0]
        self.item_lst = list(self.item_dict.keys())
        self.item_len = len(self.item_lst)
        self.min_support = self.min_support * self.data_size
        # print("min_supp", self.min_support)

    # a recursive method for finding all itemsets using the ECLAT algorithm
    # data structure: {item: [supp number, tid1, tid2, tid3, ...]}
    def recur_eclat(self, item_name, tids_array, minsupp, num_items, k_start):
        if tids_array[0] >= minsupp and num_items <= self.max_items:
            for k in range(k_start + 1, self.item_len):
                if self.item_dict[self.item_lst[k]][0] >= minsupp:
                    new_item = item_name + "|" + self.item_lst[k]
                    new_tids = np.intersect1d(tids_array[1:], self.item_dict[self.item_lst[k]][1:])
                    new_tids_size = new_tids.size
                    new_tids = np.insert(new_tids, 0, new_tids_size)
                    if new_tids_size >= minsupp:
                        if num_items >= self.min_items:
                            self.final_dict.update({new_item: new_tids})
                        self.recur_eclat(new_item, new_tids, minsupp, num_items + 1, k)

    # sequential call of the functions defined above
    def fit(self, dataset):
        i = 0
        self.read_data(dataset)
        for w in self.item_lst:
            self.recur_eclat(w, self.item_dict[w], self.min_support, 2, i)
            i += 1
        return self

    # output as a dictionary {itemset: support(itemset)}
    def transform(self):
        return {k: "{0:.4f}%".format((v[0] + 0.0) / self.data_size * 100)
                for k, v in self.final_dict.items()}
We will test it on a toy sample:

# create an instance of the class with the required parameters
model = Eclat(min_support=0.01, max_items=4, min_items=3)
# train
model.fit(dataset)
# and render the results
model.transform()
Meanwhile in real-life
So, the algorithm works. But is it applicable in real life? Let's check.
There is a real business problem that came to us from a large premium-segment grocery retailer (we will not disclose the name or the product names, trade secret, sir): find the most frequent sets in grocery baskets.
We load the data exported from the POS system (Point of Sale, the system that processes transactions at the checkouts).
Loading the data

df = pd.read_csv('data/tranprod1.csv', delimiter=';', header=0)
df.columns = ['trans', 'item']
print(df.shape)
df.head()
Let's change the table format to "transaction | list of all items in the transaction":

Transformation

df.trans = pd.to_numeric(df.trans, errors='coerce')
df.dropna(axis=0, how='all', inplace=True)
df.trans = df.trans.astype(int)
df = df.groupby('trans').agg(lambda x: x.tolist())
df.head()
Run the algorithm:

model = Eclat(min_support=0.0001, max_items=4, min_items=3)

%%time
model.fit(df)

Data read successfully
CPU times: user 6h 47min 9s, sys: 22.2 s, total: 6h 47min 31s
Wall time: 6h 47min 28s

model.transform()
On an ordinary computer, solving this problem took a whole night. On cloud platforms it would be faster. But, most importantly, the customer is satisfied.
As you can see, implementing the algorithm yourself is quite easy, although its efficiency still needs work.
Implementation in R
And again R users rejoice: for them there is no need to dance with a tambourine, everything works by analogy with apriori.
We load the library and read the data (for speed, we take the same toy dataset on which apriori was run):
Preparing the data

library(arules)
dataset = read.csv('Market_Basket_Optimisation.csv')
dataset = read.transactions('Market_Basket_Optimisation.csv', sep = ',', rm.duplicates = TRUE)
A quick look at the data set:
summary (dataset)
itemFrequencyPlot (dataset, topN = 10)
The same frequencies as before
Rules:
rules = eclat(data = dataset, parameter = list(support = ????, minlen = 2))
Note that we configure only the support and the minimum length (the k in k-itemset).
And look at the results:
inspect (sort (rules, by = 'support')[1:10])
FP-Growth Algorithm
Theory
The FP-Growth (Frequent Pattern Growth) algorithm is the youngest of our trio; it was first described in 2000 in [7].
FP-Growth proposes something radical: abandoning candidate generation (recall that candidate generation underlies Apriori and ECLAT). In theory, this approach further increases the speed of the algorithm and uses even less memory.
This is achieved by storing in memory a prefix tree (trie) built not from candidate combinations but from the transactions themselves.
Along the way, FP-Growth builds a header table for each item whose supp is above the user-specified threshold. This header table stores a linked list of all identical nodes of the prefix tree. Thus the algorithm combines the advantage of BFS (through the header table) with that of DFS (through the trie construction). The algorithm's pseudocode is similar to ECLAT, with some exceptions.
INPUT: dataset D containing the transaction list, σ - the user-defined support threshold, and a prefix I (initially empty)
OUTPUT: list of itemsets F[I](D, σ) for the corresponding prefix I
APPROACH:
1. F[I] ← {}
2. for all i ∈ J occurring in D do:
       F[I] := F[I] ∪ {I ∪ {i}}
3.     # create D_i:
       D_i ← {}
       H ← {}
       for all j ∈ J occurring in D such that j > i do:
           if supp(I ∪ {i, j}) ≥ σ then:
               H ← H ∪ {j}
       for all (tid, X) ∈ D such that i ∈ X do:
           D_i ← D_i ∪ {(tid, X ∩ H)}
4.     # DFS recursion:
       compute F[I ∪ {i}](D_i, σ)
       F[I] := F[I] ∪ F[I ∪ {i}]
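To illustrate the idea of storing transactions (rather than candidates) in a trie, here is a minimal sketch of FP-tree construction with a header table (the transactions and threshold are invented; this is only the tree-building step, not the full mining):

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fp_tree(transactions, min_count=2):
    """Build an FP-tree; header[item] links all tree nodes holding that item."""
    freq = defaultdict(int)
    for t in transactions:
        for i in t:
            freq[i] += 1
    keep = {i for i, n in freq.items() if n >= min_count}  # prune rare items
    root, header = Node(None, None), defaultdict(list)
    for t in transactions:
        node = root
        # insert items in descending frequency order, as FP-Growth prescribes
        for i in sorted((i for i in t if i in keep), key=lambda i: (-freq[i], i)):
            if i in node.children:
                node.children[i].count += 1
            else:
                child = Node(i, node)
                node.children[i] = child
                header[i].append(child)  # thread the header table
            node = node.children[i]
    return root, header

transactions = [{'Beer', 'Diapers'}, {'Beer', 'Cola'}, {'Beer', 'Diapers', 'Cola'}]
root, header = build_fp_tree(transactions)
print(root.children['Beer'].count)              # 3: all transactions share one path
print(sum(n.count for n in header['Diapers']))  # 2
```

Because every transaction here starts with the most frequent item, the three transactions share a single 'Beer' node, which is exactly the compression the full algorithm exploits.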
Implementation in Python
The FP-Growth implementation in Python has been less fortunate than the other ARL algorithms: there are no standard libraries for it.
A decent FP-Growth is available in pyspark, see [here].
On GitHub you can also find several solutions of the Neolithic era, for example [here] and [here].
Let's test the second variant:
Installation and import

!pip install pyfpgrowth
import pyfpgrowth

# generate the patterns
patterns = pyfpgrowth.find_frequent_patterns(transactions, 2)

Let's learn the rules:

rules = pyfpgrowth.generate_association_rules(patterns, 30)

# show
rules
Implementation in R
In this case R has no advantage over Python: FP-Growth is missing from the otherwise convenient and native arules library.
At the same time, as with Python, an implementation exists in Spark: [link].
But in fact
But in fact, if you want to apply ARL in your business tasks, we strongly recommend learning Java.
Weka (Waikato Environment for Knowledge Analysis). This is free machine-learning software written in Java, developed at the University of Waikato, New Zealand, starting in 1993. Weka offers both a GUI and a command line. Among its advantages is the ease of use of the graphical interface: there is no need to write code to solve applied problems. To use Weka libraries from Python, you can install a Python wrapper for Weka: python-weka-wrapper3. The wrapper uses the javabridge package to access the Weka API. Detailed installation instructions can be found [here].
SPMF. This is an open-source data mining library written in Java that specializes in finding patterns in data ([link]). It is claimed that SPMF implements more than 55 data mining algorithms. Unfortunately, there is no official SPMF wrapper for Python (at least as of this writing).
Conclusion, or once more about efficiency
In conclusion, let's empirically compare the efficiency of the algorithms: the runtime metric as a function of dataset density and transaction length [9].
By dataset density we mean the ratio of transactions that contain frequent items to the total number of transactions.
Efficiency of algorithms with different dataset density
The graph makes it obvious that the efficiency of the Apriori algorithm (the smaller the runtime, the more efficient) falls as the density of the dataset grows.
By transaction length we mean the number of items in the itemset.
Efficiency of the algorithms for different transaction lengths
Obviously, with an increase in transaction length, Apriori also copes much worse.
On the selection of hyperparameters
It is time to say a little about how to choose the hyperparameters of our models (the same support, confidence, etc. we met earlier).
Since ARules algorithms are sensitive to hyperparameters, the choice must be approached competently: the difference in execution time between different combinations of hyperparameters can be substantial.
The main hyperparameter in any ARules algorithm is min support. Roughly speaking, it is responsible for the minimum relative frequency of an association rule in the sample. When choosing this value, you should first of all be guided by the task at hand. Most often the customer needs a small number of top combinations, so you can set a high min support. However, in that case outliers (bundled goods, for example) can get into our sample, and we will not find any interesting combinations. In general, the higher you set the support value, the stronger the rules you find, and the faster the algorithm runs. We recommend setting a smaller value on the first run, to see which pairs, triples, etc. of goods occur in the dataset at all, and then gradually increasing it step by step up to the desired value (specified by the customer, for example). In practice, a good min support value would be 0.1% (for a very large dataset).
Below is an approximate graph of the dependence of the algorithm's execution time on this parameter.
In general, everything as usual depends on the task and the data.
Results
So, we got acquainted with the basic theory of ARL ("who bought x, also bought y") and basic concepts and metrics (support, confidence, lift and conviction).
We looked at the 3 most popular (and old as the world) algorithms (Apriori, ECLAT, FP-Growth), envied R users and the arules library, and tried to implement ECLAT ourselves.
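The metrics recapped above can be computed directly from their textbook definitions. The sketch below does this for a rule X -> Y on a toy dataset (the function name and the data are assumptions): support = supp(X∪Y), confidence = supp(X∪Y)/supp(X), lift = confidence/supp(Y), conviction = (1 − supp(Y))/(1 − confidence):

```python
def rule_metrics(transactions, X, Y):
    """Compute support, confidence, lift and conviction of the rule X -> Y."""
    n = len(transactions)
    supp = lambda s: sum(1 for t in transactions if s <= set(t)) / n
    support = supp(X | Y)
    confidence = support / supp(X)
    lift = confidence / supp(Y)
    conviction = float("inf") if confidence == 1 else (1 - supp(Y)) / (1 - confidence)
    return support, confidence, lift, conviction

transactions = [
    {"beer", "diapers"}, {"beer", "diapers", "cola"},
    {"beer", "water"}, {"cola", "water"},
]
s, c, l, v = rule_metrics(transactions, {"beer"}, {"diapers"})
# support 0.5, confidence ~0.67, lift ~1.33, conviction ~1.5
print(s, c, l, v)
```

A lift above 1 (here ~1.33) says beer and diapers co-occur more often than independence would predict, which is what makes the rule interesting.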
Highlights:
1. ARL is the basis of recommendation systems
2. ARL is widely applicable - from traditional and online retail (from Ozon to Steam) and conventional procurement of goods and materials, to banks and telecoms (connected services and options)
3. ARL is relatively easy to use; implementations of varying levels of maturity exist for different tasks
4. ARL results are well interpreted and do not require special skills
5. The algorithms, especially the classical ones, cannot be called super-efficient. If you work with them out of the box on large datasets, you may need a lot of computing power. But nothing prevents us from improving them, right?
In addition to the above-mentioned algorithms, there are modifications and branches:
The CHARM algorithm for searching for closed itemsets. This algorithm reduces the complexity of rule search from exponential (i.e., growing with the size of the dataset) to polynomial. A closed itemset is an itemset for which there is no superset (i.e., a set that includes our itemset plus other items) with the same frequency (= support).
It is worth explaining a little here - so far we have considered simply frequent itemsets. There are also the concepts of closed (see above) and maximal itemsets. A maximal itemset is a frequent itemset for which there is no frequent superset.
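These two definitions are short enough to check mechanically. Given frequent itemsets with known supports (the toy dictionary below is an assumption for illustration), closed ones have no proper superset with the same support, and maximal ones have no frequent proper superset at all:

```python
# Toy frequent itemsets: frozenset -> support
freq = {
    frozenset({"beer"}): 0.6,
    frozenset({"diapers"}): 0.5,
    frozenset({"beer", "diapers"}): 0.5,
}

def closed_itemsets(freq):
    # Closed: no proper superset with exactly the same support
    return {s for s in freq
            if not any(s < t and freq[t] == freq[s] for t in freq)}

def maximal_itemsets(freq):
    # Maximal: no frequent proper superset at all
    return {s for s in freq if not any(s < t for t in freq)}

print(closed_itemsets(freq))   # {beer} and {beer, diapers}
print(maximal_itemsets(freq))  # only {beer, diapers}
```

Here {diapers} is not closed because {beer, diapers} has the same support 0.5, and every maximal itemset is also closed, which matches the nesting shown in the picture below.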
The relationship between these three kinds of itemsets is shown in the picture below:
AprioriDP (Dynamic Programming) - stores support values in a special data structure and works a little faster than the classic Apriori
FP Bonsai - an improved FP-Growth with pruning of the prefix tree (an example of a constraint-based algorithm)
In conclusion, we cannot fail to mention the grim genius of ARL, Dr. Christian Borgelt from the University of Konstanz.
Christian has implemented almost all of the algorithms mentioned above in C, Python, Java and R. There is even a GUI of his own authorship, where in a couple of clicks you can load a dataset, select the desired algorithm and find the rules. Provided that it works for you :)
For simple tasks, what we have considered in this article is enough. And if it is not - we urge you to write the implementation yourself!
References:
Mining Association Rules between Sets of Items in Large Databases
Fast Algorithms for Mining Association Rules
Ask Dan!
Introduction to arules - A computational environment for mining association rules and frequent item sets
Publications of Dr. Borgelt
Jeff Heaton. Comparing Dataset Characteristics that Favor the Apriori, Eclat or FP-Growth Frequent Itemset Mining Algorithms
Sources
Authors: Pavel Golubev , Nikolay Bashlykov
Published: 19-04-2018, 15:33
Category: Machine learning / Algorithms / Python