Research of Access Optimization of Small
Files on Basis of B + Tree on Hadoop
Yan Wang, YuQiang Li, YuWen Li, Yilong Shi and Weiwei Li
Abstract Hadoop, the open-source software for reliable, scalable, distributed
computing used to process and store extremely large data sets, was originally
designed to store large numbers of large files. When dealing with massive numbers
of small files, it therefore wastes DataNode storage space and inflates NameNode
memory usage. To address these shortcomings, this paper puts forward an optimized
small-file access scheme for the Hadoop platform based on a B+ tree index, which
speeds up small-file location through the file index and thereby improves the access
efficiency of small files. The effectiveness of the proposed scheme is validated
experimentally.
Keywords Hadoop · HDFS · B+ tree · Small files · SequenceFile
1 Introduction
In recent years, with continuous advances in information science and technology, "big data" technology has gradually become a focus of attention for both
industry and academia [1, 2].
Faced with the explosive growth of Internet data [3], Google published papers in
2003 that first proposed GFS, MapReduce, and other distributed data-processing
technologies to deal with these massive data sets [4, 5]. Thereafter, the Apache
Foundation developed the distributed system Hadoop.
Since the Hadoop platform was originally intended to store large numbers of
large files, it wastes considerable DataNode storage space and inflates NameNode
memory usage when dealing with massive numbers of small files. Improving the
ability of HDFS to handle small files has therefore attracted more and more
attention. At present, there is a great deal of research
Y. Wang (✉) · Y. Li · Y. Li · Y. Shi · W. Li
School of Computer Science and Technology, Wuhan University of Technology,
Wuhan, China
e-mail: wy_hult@163.com
© Springer Nature Singapore Pte Ltd. 2018
R.K. Choudhary et al. (eds.), Advanced Computing
and Communication Technologies, Advances in Intelligent Systems
and Computing 562, https://doi.org/10.1007/978-981-10-4603-2_19
work on Hadoop small-file access methods, both in academic research and in the
applications of major Internet companies.
Mackey et al. [6] put forward a scheme that assigns a quota to each client in
the HDFS file system, and verified its effectiveness. Building on the solutions of
Mackey et al., Vorapongkitipun and Nupairoj [7] put forward a new Hadoop
archive (HAR) scheme to address the problem that the metadata of small files in
HDFS occupies NameNode memory, improving both the NameNode's metadata
memory utilization and the access efficiency of small files.
In China, many researchers and Internet companies, such as Taobao and Tencent,
have put forward their own solutions to the small-file access problem. Changtong [8]
merges small files into one large file and builds a hash index for each of the
merged files to improve the efficiency of small-file access. Focusing on the
heavy burden that huge numbers of small files impose on the HDFS NameNode,
Chen et al. [9] present a cache strategy to improve the reading efficiency of
small files on HDFS. In 2012, Zhang and Liu [10] put forward a new scheme whose
core idea is to use an I/O-multiplexing reactor model to handle large numbers of
small-file request tasks while continuously building indexes for the merged files;
however, they do not effectively account for different file types. Building on
this research, Zhang et al. [11] put forward a mechanism that applies different
file merging and prefetching strategies to files of different sizes, increasing
read and write performance by 70%.
Hadoop itself notes the impact that small-file access has on the platform, and
offers three small-file merging mechanisms: Hadoop Archive, SequenceFile, and
CombineFileInputFormat. However, the SequenceFile consolidation scheme searches
the merged small files sequentially, which is inefficient, cannot locate a file
quickly when the number of files is large, and seriously degrades the read
performance of small files.
This paper establishes a B+ tree index over the merged file on the basis of
SequenceFile, to speed up small-file location through the file index and thereby
improve the reading efficiency of small files.
2 The Search for SequenceFile Based on B+ Tree
2.1 The Construction of the B+ Tree Index
Combining the SequenceFile merging scheme with a B+ tree index construction
strategy, the concrete design idea is as follows. Because the massive small image
files are classified by year into multiple large SequenceFile-based files, the
B+ tree index design requires a two-level search process. First, a B+ tree index
is built over all small files and their corresponding large SequenceFiles, so
that a small file can be quickly located to its large file by its file name.
Then, a B+ tree index is built for each large SequenceFile, so that the file's
specific information can be located within the large file rapidly and the file
restored. In this article, the B+ tree index mapping all small files to their
corresponding large SequenceFiles is called the primary index, and the B+ tree
index within each large SequenceFile is called the secondary index.
2.1.1 The Design of the Primary Index Structure
The primary B+ tree index structure is shown in Fig. 1. To match the massive
small-image data, it is assumed that there are six pictures, a.jpg, b.jpg, c.jpg,
d.jpg, e.jpg, and f.jpg, each classified into the large file of its corresponding
year. Thus a.jpg, c.jpg, and f.jpg are classified into the file called 2014, and
b.jpg, d.jpg, and e.jpg into the file called 2015. Here the order of the B+ tree
is three; that is, when a node holds more than three keys, the node splits.
The primary B+ tree index structure is designed as follows:
public class BplusTree {
    protected Node root;  /* root node */
    protected int m;      /* order m */
    protected Node head;  /* head of the linked list of leaf nodes */
}
The structure of a single node in the B+ tree is designed as follows. The node's
key entries are stored as a list holding the key information of the node. Each
key entry has the generic type Entry⟨K, V⟩, which represents a Map entity, i.e.,
a ⟨Key, Value⟩ pair. In the primary B+ tree index, K is the name of each picture
file (a String), and V is the name of the SequenceFile it is classified into (an
Object).
Fig. 1 The primary index structure
public class Node {
    protected boolean isLeafNode;
    protected boolean isRootNode;
    protected Node parentNode;
    protected Node previousNode;
    protected Node nextNode;
    protected List<Entry<String, Object>> entries;
    protected List<Node> children;
}
2.1.2 The Structural Design of the Secondary Index
Through the primary index, a small file can be quickly located to its large file
by its file name. However, to find the specific small-file information within a
SequenceFile, the secondary index, i.e., the B+ tree index for each large
SequenceFile, is needed. The secondary index structure is shown in Fig. 2. When
the picture c.jpg is searched, it is found that c.jpg is in the file named 2014,
together with four other pictures: a.jpg, f.jpg, h.jpg, and m.jpg.
The secondary B+ tree index has the same structure as the primary B+ tree
index; its realization is given in Sect. 2.1.1. However, K and V carry different
information: K is the image name within the merged SequenceFile, and V is the
binary content of that image in the merged SequenceFile. Because each image's
binary content is large, it is inconvenient to depict in the figure, so the value
shown stands in for the binary code of the image.
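A minimal sketch of the secondary index's key-value contract, with
java.util.TreeMap standing in for the per-SequenceFile B+ tree and placeholder
byte arrays standing in for real image binaries:

```java
import java.util.TreeMap;

public class SecondaryIndexSketch {
    public static void main(String[] args) {
        // Secondary index for the SequenceFile "2014":
        // key = image name, value = that image's binary content (placeholder)
        TreeMap<String, byte[]> index2014 = new TreeMap<>();
        index2014.put("a.jpg", new byte[]{10, 20});
        index2014.put("c.jpg", new byte[]{30, 40, 50});
        index2014.put("f.jpg", new byte[]{60});

        // Locate c.jpg inside "2014" and recover its bytes for restoring
        byte[] content = index2014.get("c.jpg");
        System.out.println(content.length);  // prints 3
    }
}
```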
Fig. 2 The secondary index structure
2.2 The Optimization of the Small-File Read Scheme Based on B+ Tree
If the B+ tree index file were stored on the NameNode, it would undoubtedly
increase the NameNode's workload. Moreover, if the NameNode fails, the Hadoop
cluster stops working, so a B+ tree index file stored on the NameNode would be
lost. In conclusion, this paper puts forward an optimized small-file reading
scheme for the Hadoop platform based on the B+ tree index: the secondary B+ tree
index file is stored on the DataNodes, in order to reduce the number of disk I/O
accesses to the DataNodes and speed up the small-file search. The improved HDFS
file reading process is shown in Fig. 3, with the improved part in the dotted
box. The improved access process is as follows:
Step 1: The client initiates an RPC request to the remote NameNode.
Step 2: After receiving the client's request, the NameNode returns the
DataNodes on which the index-file Blocks reside, using the mapping
between the B+ tree index file and its Blocks and the mapping between
Blocks and DataNodes, both stored by the NameNode.
Step 3: The client selects the nearest DataNode and reads the Blocks of the
B+ tree index file.
Step 4: If the needed file is found through the index-file Block on the current
DataNode, its binary content is read and the file is restored. If it is
not found, the connection with the current DataNode is closed and the
next index-file Block is read to find the most suitable DataNode.
Step 5: After finishing reading the whole index file, the client closes the
file system.
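The fallback logic of Steps 3 and 4 can be sketched as follows; each map below
stands for the B+ tree index Block held on one DataNode, and all names and
contents are invented for the demonstration:

```java
import java.util.List;
import java.util.Map;

public class ReadFlowSketch {
    // Scan the index Blocks in order of proximity (nearest DataNode first);
    // stop at the first Block that resolves the requested file name.
    static String locate(List<Map<String, String>> indexBlocks, String file) {
        for (Map<String, String> block : indexBlocks) {
            String seqFile = block.get(file);
            if (seqFile != null) {
                return seqFile;  // found: read the binary and restore the file
            }
            // not found: drop this DataNode and read the next index Block
        }
        return null;             // index exhausted: close the file system
    }

    public static void main(String[] args) {
        List<Map<String, String>> blocks = List.of(
                Map.of("b.jpg", "2015", "d.jpg", "2015"),
                Map.of("a.jpg", "2014", "c.jpg", "2014"));
        System.out.println(locate(blocks, "c.jpg"));  // prints 2014
    }
}
```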
Fig. 3 The improved HDFS file reading process
3 The Experiment
3.1 Experimental Environment
The experiment requires a Hadoop cluster. In this paper, the cluster consists of
four Lenovo M4350 business machines: one serves as the NameNode, and the other
three as DataNodes. The hardware and software configurations of the cluster
machines are shown in Table 1. The machine names and IP address configuration
are shown in Table 2.
3.2 Experimental Setup
The comparison experiment is carried out on Hadoop version 1.0.3. The data-block
size is set to 64 MB and the replication factor to 3. The picture-type small
files range from 100 KB to 15 MB in size, as do the document-type small files.
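For reference, the block-size and replication settings above correspond to the
following hdfs-site.xml fragment under Hadoop 1.x (an illustrative sketch; the
experiment's actual configuration files are not given in the paper):

```xml
<!-- hdfs-site.xml: settings matching the experiment (Hadoop 1.x keys) -->
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value><!-- 64 MB, in bytes -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```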
This paper compares the read time and the NameNode memory utilization for the
two kinds of small files, testing the two search strategies when the number of
small files is 50, 100, 500, 1000, 5000, and 10,000.
3.3 Results and Analysis

3.3.1 Read Time of Small Files
The comparison curves of reading time are shown in Fig. 4.
According to the shape of the curves in the figure, for picture-type small
files, the smaller the number of small files, the better the MapFile index
reading method performs; the larger the number of small files, the better the B+ tree index
Table 1 Computer configuration

Hardware configuration           Software configuration
CPU        i3 3240               OS      Ubuntu 12.04 (64-bit)
Memory     4 GB                  JDK     JDK 1.6.0_45
Hard disk  320 GB                Hadoop  Hadoop 1.0.3
Table 2 IP addresses

Machine name    IP address
Master          192.168.1.144
Slave45         192.169.1.145
Slave46         192.168.1.146
Slave47         192.169.1.47
Fig. 4 Comparison of reading time
Fig. 5 Comparison of NameNode memory utilization
reading method performs. At the same time, according to the comparison results,
the reading time using the B+ tree index search after merging under the
SequenceFile merging plan is 80% less for picture-type files (30% for text-type
files) than with the original sequential search and the MapFile-index-based
search. The experimental results also verify that, for the reading efficiency of
massive picture-type small files, the proposed small-file reading scheme based
on the B+ tree index is efficient.
3.3.2 The Memory Utilization of the NameNode
The comparison histogram of NameNode memory utilization is shown in Fig. 5.
According to the histogram, with respect to NameNode memory utilization, the
smaller the number of small files, the better the MapFile index reading method
performs; the larger the number of small files, the better the B+ tree index
reading method performs. The experimental results again verify that, for massive
picture-type small files, the proposed small-file reading scheme based on the
B+ tree index is efficient.
4 Conclusion
This paper first analyzes the shortcomings of the SequenceFile merging scheme on
the Hadoop platform, and achieves small-file search by merging small files into
large files according to rules suited to the characteristics of the experimental
data. Because the SequenceFile merging scheme uses a sequential search to read
small files, it seriously reduces small-file reading efficiency. This paper
therefore presents a design scheme based on a B+ tree index file, and designs
and implements the structure and functions of the B+ tree index. It then
analyzes the original HDFS file-reading process in depth to find the problems
that need to be modified. Finally, the improved scheme is applied to the Hadoop
platform. Simulation experiments establish that the proposed reading scheme
based on the B+ tree index file is feasible and effective in improving the
efficiency of small-file access.
References
1. Xiao, F., Li-Lei, Q.I.: Exploration of big data processing technology. J. Comput. Mod. 1, 75–
77 (2013)
2. Meng, X., Xiang, C.: Big data management: concepts, techniques and challenges. J. Comput.
Res. Dev. 1, 146–169 (2013)
3. Wang, Y.Z.: Network big data: present and future. J. Chin. J. Comput. 6, 1125–1138 (2014)
4. Jiang, L., Li, B., Song, M.C.: The optimization of HDFS based on small files. In: 3rd IEEE
International Conference on Broadband Network and Multimedia Technology, pp. 912–915.
IEEE Computer Society, Beijing (2010)
5. Zhang, Y., Liu, D.C.: Improving the efficiency of storing for small files in HDFS. In: 2012
International Conference on Computer Science and Service System, pp. 2239–2242. IEEE
Computer Society, Jiangsu (2012)
6. Mackey, G., Sehrish, S., Wang, J.C.: Improving metadata management for small files in HDFS.
In: 2009 IEEE International Conference on Cluster Computing and Workshops, pp. 1–4.
Institute of Electrical and Electronics Engineers Inc., New York (2009)
7. Vorapongkitipun, C., Nupairoj, N.C.: Improving performance of small-file accessing in
Hadoop. In: 11th International Joint Conference on Computer Science and Software
Engineering, pp. 200–205. IEEE Computer Society (2014)
8. Changtong, L.: An improved HDFS for small file. In: 18th International Conference on
Advanced Communications Technology, pp. 474–477. Institute of Electrical and Electronics
Engineers Inc., New York (2016)
9. Chen, J., Wang, D., Fu, L., Zhao, W.: An improved small file processing method for HDFS.
Int. J. Digit. Content Technol. Appl. 6, 293–304 (2012)
10. Zhang, Y., Liu, D.C.: Improving the efficiency of storing for small files in HDFS. In: 2012
International Conference on Computer Science and Service System, pp. 2239–2242. IEEE
Computer Society, Washington (2012)
11. Zhang, S., Miao, L., Zhang, D., et al.: A Strategy to deal with mass small files in HDFS. In:
6th International Conference on Intelligent Human-Machine Systems and Cybernetics,
pp. 331–334. Institute of Electrical and Electronics Engineers Inc., New York (2014)