Storing a large number of files

 
 
Good health to you, readers!
 
 
While working on a dating site project, I needed to organize the storage of user photos. According to the terms of reference, each user is limited to 10 photos. But there can be tens of thousands of users, especially since the project in its current form has existed since the early 2000s. That is, there are already thousands of users in the database. Almost any file system, as far as I know, reacts badly to a large number of child nodes in a single folder. From experience I can say that the problems begin after 1000-1500 files or folders in the parent folder.
 
Other solutions to this problem have been published, but I did not find one that exactly matches my approach. Besides, in this article I am only sharing my own experience of solving the problem.
 
 

Theory


 
In addition to the storage task itself, the terms of reference also required that photos keep their captions and titles. Of course, this cannot be done without a database. So the first thing we do is create a table that maps metadata (captions, titles, etc.) to files on disk. Each file corresponds to one row in the table, and accordingly each file has an identifier.
 
 
A small digression. Let's talk about auto-increment. A dating site may have ten or twenty thousand users at any one time. The question is how many users pass through the project over its entire lifetime. For example, the active audience of "dating-ru" is several hundred thousand. But just imagine how many users have deleted their accounts during the lifetime of that project, and how many have never been activated. Now add our legislation, which requires storing information about users for at least six months. Sooner or later the roughly 4.29 billion values of UNSIGNED INT will run out. That is why it is best to use BIGINT for the primary key.
 
 
Now try to picture a BIGINT number. It is 8 bytes, and each byte takes a value from 0 to 255. At most 256 child nodes per folder is perfectly fine for any file system. So we take the file identifier in hexadecimal representation and split it into two-character chunks. We use these chunks as folder names, and the last one as the name of the physical file. PROFIT!
 
 
0f/65/84/10/67/68/19/ff.file
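The mapping from identifier to path described above can be sketched in a few lines of PHP. This is only an illustration under my own assumptions: the function name idToPath is hypothetical, and a 64-bit PHP build is assumed so that the identifier fits into an int.

```php
<?php
// Sketch: map a numeric file ID to a nested path, one folder per byte.
// idToPath() is a hypothetical helper name used only for this example.
function idToPath(int $id): string
{
    // 8 bytes of BIGINT give 16 hexadecimal digits, left-padded with zeros
    $hex = str_pad(dechex($id), 16, '0', STR_PAD_LEFT);
    // Two characters per chunk: seven folder names plus the file name
    $chunks = str_split($hex, 2);
    $name = array_pop($chunks);
    return implode('/', $chunks) . '/' . $name . '.file';
}
```

With this scheme every folder on the first seven levels holds at most 256 subfolders, and every folder on the last level holds at most 256 files.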
 
 
Elegant and simple. The file extension is not important here. The file will be served by a script anyway, which will send the browser, among other things, the MIME type, which we also store in the database. In addition, storing information about the file in the database lets us override its path for the browser. Say the file is actually located, relative to the project directory, at /content/files/0f/65/84/10/67/68/19/ff.file. In the database we can assign it a URL such as /content/users/678/files/somefile. The SEO folks are probably smiling by now. All this means we no longer have to worry about where the file is physically placed.
 
 

Table in DB


 
In addition to the identifier, MIME type, URL and physical location, we will store the MD5 and SHA1 hashes of the files in the table, to filter out identical files if necessary. Of course, we also need to store relationships with entities in this table, for example the ID of the user the files belong to. And if the project is not very large, the same system can also store, say, product photos. For this reason we will also store the class name of the entity the record belongs to.
 
 
Speaking of which. If you close the folder to outside access with .htaccess, the file can only be obtained through the script, and the script can decide who gets access to the file. Getting a little ahead of myself, I will say that in my CMS (where the project above is currently being piloted) access is determined by basic user groups, of which I have eight: guests, users, managers, admins, non-activated, blocked, deleted and super-admins. The super-admin can do everything, so he does not take part in access definitions: if the user has the super-admin flag, he is a super-admin. It's that simple. So we define access for the remaining seven groups. Access is binary: either serve the file or not. Seven bits are enough, so a field of type TINYINT will do.
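To make the seven-bit idea concrete, here is a minimal sketch of such an access bitmap. The group names, their bit order and both function names are assumptions of this example, not a fixed part of the scheme.

```php
<?php
// Sketch: one bit per basic group. The order of the groups (and therefore
// the bit positions) is an assumption made for this example.
const ACCESS_GROUPS = ['guest', 'user', 'manager', 'admin', 'inactive', 'blocked', 'deleted'];

// Build a bitmap that grants access to the listed groups
function accessMask(array $groups): int
{
    $mask = 0;
    foreach ($groups as $group) {
        $bit = array_search($group, ACCESS_GROUPS, true);
        if ($bit === false) {
            throw new InvalidArgumentException("Unknown group: {$group}");
        }
        $mask |= 1 << $bit;
    }
    return $mask;
}

// Check whether a group may receive a file with the given bitmap
function isGranted(string $group, int $mask, bool $isSuperAdmin = false): bool
{
    if ($isSuperAdmin) {
        return true; // the super-admin bypasses the bitmap entirely
    }
    $bit = array_search($group, ACCESS_GROUPS, true);
    return $bit !== false && ($mask & (1 << $bit)) !== 0;
}
```

All seven bits set give 127, which is the largest value a signed TINYINT can hold, so the bitmap fits the column exactly.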
 
 
One more point. According to our legislation, we have to keep user pictures physically. That is, we need to mark pictures as deleted instead of physically removing them. A bit field is the most convenient tool for this. In such cases I usually use a field of type INT, with room to spare, so to speak. Besides, I have an established tradition of placing the DELETED flag in the 5th bit from the end. But again, this is not essential.
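A soft-delete flag in a bit field might look like the sketch below. The value 0x08000000 is taken from the FLAG_DELETED constant of the class shown later in the article; the two helper function names are hypothetical.

```php
<?php
// Sketch: soft deletion through a bit field.
// 0x08000000 mirrors the FLAG_DELETED constant of the BigFiles class.
const FLAG_DELETED = 0x08000000;

// Set or clear the "deleted" bit without touching the other flags
function markDeleted(int $flags, bool $deleted = true): int
{
    return $deleted ? ($flags | FLAG_DELETED) : ($flags & ~FLAG_DELETED);
}

// Test the "deleted" bit
function isDeleted(int $flags): bool
{
    return ($flags & FLAG_DELETED) !== 0;
}
```

Because only one bit is changed, any other flags stored in the same field survive both marking and unmarking.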
 
 
What we have in the end:
 
    create table `files` (
        `id` bigint not null auto_increment, -- Primary key
        `entity_type` char(32) not null default '', -- Entity type
        `entity` bigint null, -- Entity ID
        `mime` char(32) not null default '', -- MIME type
        `md5` char(32) not null default '', -- MD5
        `sha1` char(40) not null default '', -- SHA1
        `file` char(64) not null default '', -- Physical location
        `url` varchar(250) not null default '', -- URL
        `meta` text null, -- Metadata as JSON or a serialized array
        `size` bigint not null default '0', -- Size
        `created` datetime not null, -- Creation date
        `updated` datetime null, -- Date of last edit
        `access` tinyint not null default '0', -- Access bitmap
        `flags` int not null default '0', -- Flags
        primary key (`id`),
        index (`entity_type`),
        index (`entity`),
        index (`mime`),
        index (`md5`),
        index (`sha1`),
        index (`url`)
    ) engine = InnoDB;

 
 

Class-dispatcher


 
Now we need to create a class that we will use to upload files. The class should make it possible to create, replace/modify and delete files. In addition, two points are worth considering. First, the project may be moved from server to server, so the class needs a property holding the root directory of the files. Second, it will be very unpleasant if someone trashes the table in the database, so we need to provide for data recovery. The first point is clear enough. As for backing up data, we will back up only what cannot be restored.
 
 
- id - restored from the physical location of the file
- entity_type - not restorable
- entity - not restorable
- mime - restored using the finfo extension
- md5 - restored from the file itself
- sha1 - restored from the file itself
- file - restored from the physical location of the file
- url - not restorable
- meta - not restorable
- size - restored from the file itself
- created - can be taken from the file information
- updated - can be taken from the file information
- access - not restorable
- flags - not restorable
 
 
The meta-information can be discarded right away: it is not critical for the functioning of the system. And for faster recovery it is still worth saving the MIME type. In total: entity type, entity ID, MIME, URL, access and flags. To make the system more reliable, we will store the backup information for each destination folder separately, in the folder itself.
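The restorable half of a row can be rebuilt from the file alone. The sketch below shows one way to do it; the function name recoverRow and its parameters are hypothetical, the fileinfo extension is assumed to be available, and only the modification time is used because a true creation time is not reliably available on every file system.

```php
<?php
// Sketch: rebuild the restorable columns of a row from the file on disk.
// $root is the storage root with a trailing slash; $relPath looks like
// "00/00/00/00/00/00/00/ff.file". Both names are illustrative.
function recoverRow(string $root, string $relPath): array
{
    $full = $root . $relPath;
    // The ID is the path with the slashes and the ".file" suffix removed
    $hex = str_replace('/', '', substr($relPath, 0, -strlen('.file')));
    // MIME type detected from the file contents
    $finfo = new finfo(FILEINFO_MIME_TYPE);
    return [
        'id'      => hexdec($hex),
        'file'    => $relPath,
        'mime'    => $finfo->file($full),
        'md5'     => md5_file($full),
        'sha1'    => sha1_file($full),
        'size'    => filesize($full),
        'updated' => date('Y-m-d H:i:s', filemtime($full)),
    ];
}
```

Everything this function cannot produce (entity type, entity ID, URL, access, flags) is exactly what has to go into the per-folder backup.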
 
 
The class code:
    <?php
    class BigFiles
    {
        const FLAG_DELETED = 0x08000000; // So far the only flag: "deleted"
        /** @var mysqli $_db */
        protected $_db = null;
        protected $_webRoot = '';
        protected $_realRoot = '';
        function __construct(mysqli $db = null) {
            $this->_db = $db;
        }
        /**
         * Set/read the root for URLs
         * @param string $v Value
         * @return string
         */
        public function webRoot($v = null) {
            if (!is_null($v)) {
                $this->_webRoot = $v;
            }
            return $this->_webRoot;
        }
        /**
         * Set/read the root for files
         * @param string $v Value
         * @return string
         */
        public function realRoot($v = null) {
            if (!is_null($v)) {
                $this->_realRoot = $v;
            }
            return $this->_realRoot;
        }
        /**
         * Upload a file
         * @param array $data Request data (an element of $_FILES)
         * @param string $url URL of the virtual folder
         * @param string $eType Entity type
         * @param int $eID Entity ID
         * @param mixed $meta Metadata
         * @param int $access Access bitmap (127 = all seven group bits set)
         * @param int $flags Flags
         * @param int $fileID ID of an existing file
         * @return int|bool File ID, or false if there is nothing to do
         * @throws Exception
         */
        public function upload(array $data, $url, $eType = '', $eID = null, $meta = null, $access = 127, $flags = 0, $fileID = 0) {
            $meta = is_array($meta) ? serialize($meta) : $meta;
            $meta = empty($meta) ? 'null' : "'" . $this->_db->real_escape_string($meta) . "'";
            // Values interpolated into SQL below are escaped or cast first
            $eType = $this->_db->real_escape_string($eType);
            $access = intval($access);
            $flags = intval($flags);
            if (empty($data['tmp_name']) || empty($data['name'])) {
                // No file uploaded: only update the metadata of an existing record
                $fid = intval($fileID);
                if (empty($fid)) {
                    return false;
                }
                $q = "`meta` = {$meta}, `updated` = now()";
                $this->_db->query("UPDATE `files` SET {$q} WHERE (`id` = {$fid}) AND (`entity_type` = '{$eType}')");
                return $fid;
            }
            // File data
            $finfo = finfo_open(FILEINFO_MIME_TYPE);
            $mime = finfo_file($finfo, $data['tmp_name']);
            finfo_close($finfo);
            // FID, file name
            if (empty($fileID)) {
                $eID = empty($eID) ? 'null' : intval($eID);
                $q = "insert into `files` set
                    `mime` = '{$mime}',
                    `entity` = {$eID},
                    `entity_type` = '{$eType}',
                    `created` = now(),
                    `access` = {$access},
                    `flags` = {$flags}";
                $this->_db->query($q);
                $fid = $this->_db->insert_id;
                list($ffs, $fhn) = self::fid($fid);
                $url = $this->_webRoot . $url . '/' . $fid;
                $fdir = $this->_realRoot . $ffs;
                self::validateDir($fdir);
                // Save the non-restorable data in the backup index of the folder
                $index = self::getIndex($fdir);
                $index[$fhn] = array($fhn, $mime, $url, ($eID === 'null' ? 0 : $eID), $access, $flags);
                self::setIndex($fdir, $index);
                $fname = $ffs . '/' . $fhn . '.file';
            } else {
                $fid = intval($fileID);
                $fname = $this->fileName($fid);
            }
            // Move the file
            $fdir = $this->_realRoot . $fname;
            if (!move_uploaded_file($data['tmp_name'], $fdir)) {
                throw new Exception('Upload error');
            }
            $q = "`md5` = '" . md5_file($fdir) . "', `sha1` = '" . sha1_file($fdir) . "', "
                . '`size` = ' . filesize($fdir) . ', `meta` = ' . $meta . ', '
                . (empty($fileID)
                    ? "`url` = '" . $this->_db->real_escape_string($url) . "', `file` = '{$fname}'"
                    : '`updated` = now()');
            $this->_db->query("UPDATE `files` SET {$q} WHERE (`id` = {$fid}) AND (`entity_type` = '{$eType}')");
            return $fid;
        }
        /**
         * Read a file and send it to the browser
         * @param string $url URL
         * @param string $basicGroup The user's basic group
         * @throws Exception
         */
        public function read($url, $basicGroup = 'anonimous') {
            // Only letters, digits and / . - _ are allowed in the URL
            if (!ctype_alnum(str_replace(array('/', '.', '-', '_'), '', $url))) {
                header('HTTP/1.1 400 Bad Request');
                exit;
            }
            $url = $this->_db->real_escape_string($url);
            $q = "SELECT * FROM `files` WHERE `url` = '{$url}' ORDER BY `created` ASC";
            if ($result = $this->_db->query($q)) {
                $vars = array();
                $ints = array('id', 'entity', 'size', 'access', 'flags');
                while ($row = $result->fetch_assoc()) {
                    foreach ($ints as $i) {
                        $row[$i] = intval($row[$i]);
                    }
                    $fid = $row['id'];
                    $vars[$fid] = $row;
                }
                if (empty($vars)) {
                    header('HTTP/1.1 404 Not Found');
                    exit;
                }
                $deleted = false;
                $access = true;
                $found = '';
                $mime = '';
                // The last accessible, non-deleted record wins
                foreach ($vars as $fdata) {
                    $flags = intval($fdata['flags']);
                    $deleted = ($flags & self::FLAG_DELETED) != 0;
                    $access = self::granted($basicGroup, $fdata['access']);
                    if (!$access || $deleted) {
                        continue;
                    }
                    $found = $fdata['file'];
                    $mime = $fdata['mime'];
                }
                if (empty($found)) {
                    if ($deleted) {
                        header('HTTP/1.1 410 Gone');
                        exit;
                    } elseif (!$access) {
                        header('HTTP/1.1 403 Forbidden');
                        exit;
                    }
                } else {
                    header('Content-Type: ' . $mime . '; charset=utf-8');
                    readfile($this->_realRoot . $found);
                    exit;
                }
            }
            header('HTTP/1.1 404 Not Found');
            exit;
        }
        /**
         * Delete a file (files) from the storage
         * @param mixed $fid Identifier(s): an int, an array or a comma-separated string
         * @return bool
         * @throws Exception
         */
        public function delete($fid) {
            // Normalize the input to an array of integers before using it in SQL
            $ids = array_map('intval', is_array($fid) ? $fid : explode(',', $fid));
            $q = 'delete from `files` where `id` in (' . implode(',', $ids) . ')';
            $this->_db->query($q);
            $result = true;
            foreach ($ids as $fid_i) {
                list($ffs, $fhn) = self::fid($fid_i);
                $fdir = $this->_realRoot . $ffs;
                $index = self::getIndex($fdir);
                unset($index[$fhn]);
                self::setIndex($fdir, $index);
                $result = unlink($fdir . '/' . $fhn . '.file') && $result;
            }
            return $result;
        }
        /**
         * Mark the file(s) with the "deleted" flag
         * @param mixed $fid Identifier(s): an int, an array or a comma-separated string
         * @param bool $value The value of the flag
         * @return bool
         */
        public function setDeleted($fid, $value = true) {
            $ids = implode(',', array_map('intval', is_array($fid) ? $fid : explode(',', $fid)));
            $o = $value ? ' | ' . self::FLAG_DELETED : ' & ' . (~self::FLAG_DELETED);
            $this->_db->query("update `files` set `flags` = `flags` {$o} where `id` in ({$ids})");
            return true;
        }
        /**
         * File name
         * @param int $fid Identifier
         * @return string
         * @throws Exception
         */
        public function fileName($fid) {
            list($ffs, $fhn) = self::fid($fid);
            self::validateDir($this->_realRoot . $ffs);
            return $ffs . '/' . $fhn . '.file';
        }
        /**
         * Process the file identifier.
         * Returns an array with the folder of the file and the hexadecimal
         * representation of the low-order byte.
         * @param int $fid File identifier
         * @return array
         */
        public static function fid($fid) {
            // 8 bytes of BIGINT = 16 hexadecimal digits, split into pairs
            $ffs = str_split(str_pad(dechex($fid), 16, '0', STR_PAD_LEFT), 2);
            $fhn = array_pop($ffs);
            $ffs = implode('/', $ffs);
            return array($ffs, $fhn);
        }
        /**
         * Check the directory of the file, creating it if necessary
         * @param string $f The full path to the directory
         * @return bool
         * @throws Exception
         */
        public static function validateDir($f) {
            if (!is_dir($f)) {
                if (!mkdir($f, 0700, true)) {
                    throw new Exception('can not make dir: ' . $f);
                }
            }
            return true;
        }
        /**
         * Read the backup index
         * @param string $f The full path to the folder with the backup index file
         * @return array
         */
        public static function getIndex($f) {
            $index = array();
            if (file_exists($f . '/.index')) {
                $_ = file($f . '/.index');
                foreach ($_ as $_i) {
                    // One record per line, fields separated by "|"
                    $row = array_map('trim', explode('|', trim($_i)));
                    $rid = $row[0];
                    $index[$rid] = $row;
                }
            }
            return $index;
        }
        /**
         * Write the backup index
         * @param string $f The full path to the folder with the backup index file
         * @param array $index An array of data from the index
         * @return bool
         */
        public static function setIndex($f, array $index) {
            $_ = array();
            foreach ($index as $row) {
                $_[] = implode('|', $row);
            }
            return file_put_contents($f . '/.index', implode("\r\n", $_)) !== false;
        }
        /**
         * Check access
         * @param string $group Name of the group (see below)
         * @param int $value Access bitmap
         * @return bool
         */
        public static function granted($group, $value = 0) {
            // The position of a group in this list is its bit number in the bitmap
            $groups = array('anonimous', 'user', 'manager', 'admin', 'inactive', 'blocked', 'deleted');
            if ($group == 'root') {
                return true;
            }
            foreach ($groups as $groupID => $groupName) {
                if ($groupName == $group) {
                    return (((1 << $groupID) & $value) != 0);
                }
            }
            return false;
        }
    }

 

 
 
A few points to note:
- realRoot - the full path to the folder with the file system, ending with a slash.
- webRoot - the path from the root of the site without a leading slash (see below why).
- To work with the DBMS I use the MySQLi extension.
- The first argument of the upload method is, in fact, an element of the $_FILES array.
- If you pass the ID of an existing file to the upload method, the file will be replaced, provided tmp_name in the input array is non-empty.
- You can delete files and change their flags several at a time. To do this, pass either an array of identifiers or a comma-separated string of them in place of the file identifier.
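Since identifiers may arrive as a single number, an array or a comma-separated string, it helps to bring them to one form before they reach SQL or the file system. The helper below is a hypothetical sketch of such normalization, not part of the class above.

```php
<?php
// Sketch: accept an int, an array or a comma-separated string of IDs and
// return a clean list of positive integers. normalizeIds() is a
// hypothetical helper name used only for this example.
function normalizeIds($ids): array
{
    if (!is_array($ids)) {
        $ids = explode(',', (string) $ids);
    }
    // Cast everything to int; non-numeric garbage becomes 0
    $ids = array_map('intval', $ids);
    // Drop zeros and negatives, which can never be valid auto-increment IDs
    $ids = array_filter($ids, function ($id) {
        return $id > 0;
    });
    return array_values(array_unique($ids));
}
```

Casting to integers also means the resulting list can be joined with commas and placed into an IN (...) clause without further escaping.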
 
 

Routing


 
Actually, it all comes down to a few lines in the .htaccess file at the root of the site (it is assumed that mod_rewrite is enabled):
 
 
    RewriteCond %{REQUEST_URI} ^/content/(.*)$
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteRule ^(.+)$ content/index.php?file=$1 [L,QSA]

In my case "content" is a folder in the root of the site. Of course, you can name the folder differently. And here is index.php itself, which in my case is stored in the content folder:
 
 
    <?php
    // The BigFiles class must be loaded before this point (autoloader or require)
    $dbHost = '127.0.0.1';
    $dbUser = 'user';
    $dbPass = '****';
    $dbName = 'database';
    try {
        if (empty($_REQUEST['file'])) {
            header('HTTP/1.1 400 Bad Request');
            exit;
        }
        $userG = 'anonimous';
        // Determine the user group here; any solution of your choice
        $files = new BigFiles(new mysqli($dbHost, $dbUser, $dbPass, $dbName));
        $files->realRoot(dirname(__FILE__) . '/files/');
        $files->read($_REQUEST['file'], $userG);
    } catch (Exception $e) {
        header('HTTP/1.1 500 Internal Server Error');
        header('Content-Type: text/plain; charset=utf-8');
        echo $e->getMessage();
        exit;
    }

 
 
And, naturally, we close the file system itself from external access. Put an .htaccess file with a single line into the content/files folder:
 
 
    Deny from all  
 
 

Result


 
This solution avoids the loss of file system performance caused by a growing number of files. At the very least, trouble in the form of thousands of files in one folder can definitely be avoided. At the same time, we can organize and control access to files at human-readable addresses. Plus compliance with our dismal legislation. One caveat right away: this solution is not a full-fledged way of protecting content. Remember: if something is played in the browser, it can be downloaded for free.