Joe's Blog----TECH: 04/01/2014

Monday, April 21, 2014

GlusterFS Performance Tuning

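Preface

GlusterFS is an open-source distributed file system. Simply put, it joins storage resources from multiple machines into one or more pools. Each pool supports a global (common) namespace, meaning that the same pool shows the same contents no matter where it is mounted.

Anyone who has used GlusterFS will have noticed that running it with the default configuration gives dismal performance. This post shares and verifies how much changing various GlusterFS parameters helps performance.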

Configurations

I group the parameters I changed into three categories: I/O, cache, and network.

I/O

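io-thread-count (16 -> 64)

The maximum number of concurrent I/O threads. A larger value allows more simultaneous reads and writes, so it is worth raising, especially when you have RAID. The default is 16.

write-behind-window-size (1MB -> 1GB)

The per-file write-behind buffer. The default is 1MB.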

Cache

cache-size (32MB -> 2GB)

The size of the read cache. The default is 32MB.

cache-max-file-size (-> 16384PB)

The largest file size that will be cached.

cache-min-file-size (-> 0)

The smallest file size that will be cached.
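The I/O and cache values above can be applied with `gluster volume set`. The following is a minimal sketch, not the exact commands used for this post: the volume name `gv0` is hypothetical, and the `performance.` option prefix is assumed from the GlusterFS volume-option naming. By default it only prints the commands so they can be reviewed before anything is changed.

```shell
#!/bin/sh
# Sketch: print (or apply) the I/O and cache tuning values discussed above.
# VOLUME is a hypothetical volume name; replace it with your own.
VOLUME="${VOLUME:-gv0}"
# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

tune_gluster() {
    run gluster volume set "$VOLUME" performance.io-thread-count 64
    run gluster volume set "$VOLUME" performance.write-behind-window-size 1GB
    run gluster volume set "$VOLUME" performance.cache-size 2GB
    run gluster volume set "$VOLUME" performance.cache-min-file-size 0
    # cache-max-file-size is not set here: the post lists 16384PB, and it is
    # worth checking that your gluster version accepts that literal first.
}

tune_gluster
```

Running it with `DRY_RUN=0 VOLUME=myvol` executes the commands instead of printing them; `gluster volume info` afterwards should list the reconfigured options.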

Network

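auto tuning (-> disable)

The OS dynamically adjusts the TCP window size according to network throughput. In a stable LAN environment this feature is not needed, so it is turned off to avoid odd problems.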
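The post does not say which knob was used to disable auto tuning. On Linux, one plausible reading is the receive-buffer auto-tuning switch `net.ipv4.tcp_moderate_rcvbuf`, together with the NIC offloads (tso, gso, gro) that come up later in the read test. The sketch below rests on that assumption; the interface name `eth0` is hypothetical, and by default the commands are only printed.

```shell
#!/bin/sh
# Sketch: turn off NIC offloads and TCP receive-buffer auto-tuning on Linux.
# IFACE is a hypothetical interface name; replace it with your own.
IFACE="${IFACE:-eth0}"
# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

tune_network() {
    # Segmentation and receive offloads (tso, gso, gro).
    run ethtool -K "$IFACE" tso off gso off gro off
    # Assumption: "auto tuning" means receive-buffer auto-tuning.
    run sysctl -w net.ipv4.tcp_moderate_rcvbuf=0
}

tune_network
```

Neither setting survives a reboot when applied this way, which is convenient for an experiment: the machine returns to its defaults once testing is over.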

Performance Testing

The sections above list several tuning parameters. Next, each one is adjusted in turn to verify how much it actually improves performance.

Experiment Environment and Method

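The test environment is two servers joined into a single GlusterFS cluster. The volume is then mounted on one server, and iozone is used to measure performance. All three machines have identical specifications, as shown in the figure below.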
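A run of this kind could look roughly like the sketch below. The server name, volume name, mount point, file size, and the list of record sizes are all assumptions for illustration; the post only says that iozone was used and that write, rewrite, read, and re-read were measured. By default the commands are only printed.

```shell
#!/bin/sh
# Sketch: mount the volume and run iozone for several record sizes.
# SERVER, VOLUME and MOUNTPOINT are hypothetical; replace them with your own.
SERVER="${SERVER:-server1}"
VOLUME="${VOLUME:-gv0}"
MOUNTPOINT="${MOUNTPOINT:-/mnt/gluster}"
# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

bench() {
    run mount -t glusterfs "$SERVER:/$VOLUME" "$MOUNTPOINT"
    for rec in 4k 64k 1024k; do
        # -i 0: write/rewrite, -i 1: read/re-read,
        # -r: record size, -s: file size, -f: test file location
        run iozone -i 0 -i 1 -r "$rec" -s 1g -f "$MOUNTPOINT/iozone.tmp"
    done
}

bench
```

Keeping the file size fixed and varying only the record size makes the runs comparable across the tuning cases.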

Experiment Configuration

To recap, the adjusted parameters are summarized here and each case is given a number.

Experiment Result

Write

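The write results show that performance improved dramatically after adjusting the write-behind buffer (write-behind-window-size). As expected, record size affects overall throughput; a record size of 1024KB gave the better results.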

ReWrite

The rewrite test gives results similar to write: the improvement again comes from adjusting the write-behind buffer.

Read

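In the read test, disabling tso, gro, and gso gave faster results at smaller record sizes. This result probably depends on the NIC and its driver, though.

Re-read

Re-read gives results similar to read, and likewise achieves better throughput at large record sizes.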

Conclusion

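The write-behind buffer clearly improved write speed. I suspect this is much like the difference between sync and async in NFS.

This experiment used no RAID card and did no multi-I/O testing, which is probably why no effect from io-thread-count was visible.

Friday, April 18, 2014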

Delete Symantec Netbackup Disk Pool

The process is:

Expire all backup images
Delete all image fragments
Delete the disk pool

First, run bpimmedia to get the backup IDs
# bpimmedia -L -client [client name]

and then expire the image
# bpexpdate -backupid [backupid] -d 0

Second, delete all image fragments by running nbdelete
# nbdelete -allvolumes -force

Finally, delete the disk pool from the Administration Console or the Java Administration Console
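When a client has many images, expiring them one by one gets tedious. The following sketch assumes the backup IDs have already been collected from the bpimmedia output into a list, one per line; the two IDs shown are made up. It only prints the bpexpdate commands, so the list can be checked before anything is expired.

```shell
#!/bin/sh
# Sketch: read backup IDs (one per line) on stdin and print one
# bpexpdate command for each, using the same options as above.
expire_cmds() {
    while IFS= read -r backup_id; do
        # Skip empty lines.
        [ -n "$backup_id" ] || continue
        printf 'bpexpdate -backupid %s -d 0\n' "$backup_id"
    done
}

# Example with two made-up backup IDs.
printf '%s\n' client1_1397800000 client1_1397886400 | expire_cmds
# prints:
# bpexpdate -backupid client1_1397800000 -d 0
# bpexpdate -backupid client1_1397886400 -d 0
```

Once the printed list looks right, the output can be saved to a file and run on the master server.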

Friday, April 11, 2014

No Dropbox icon in Ubuntu 13.10

1. Install package
# sudo apt-get install libappindicator1
2. Log out, and then log back in (or issue dropbox stop && dropbox start at the command line)

REF

Linux Kernel Performance Tuning and Hardening

[Security]Deactivates automatic answers to ICMP broadcasts and protects against smurf attacks.

net.ipv4.icmp_echo_ignore_broadcasts=1

[Security]Enable source validation by reversed path, as specified in RFC1812

/*
0 - No source validation.
1 - Strict mode as defined in RFC3704 Strict Reverse Path. 
    Each incoming packet is tested against the FIB and if the interface
    is not the best reverse path the packet check will fail. 
    By default failed packets are discarded.
2 - Loose mode as defined in RFC3704 Loose Reverse Path. 
    Each incoming packet's source address is also tested against the FIB
    and if the source address is not reachable via any interface the packet check will fail.
*/
net.ipv4.conf.all.rp_filter = 1

[Performance]The number of possible inotify(7) watches

fs.inotify.max_user_watches = 65536

[Tuning]Avoid deleting secondary IPs on deleting the primary IP.

/*
If you remove the primary IP address, all secondary addresses are purged by default as well. 
*/
net.ipv4.conf.default.promote_secondaries = 1
net.ipv4.conf.all.promote_secondaries = 1

[Security]Do not accept source-routed packets

net.ipv4.conf.default.accept_source_route = 0

[Tuning]Do not listen for router advertisements to choose an IPv6 address and router

net.ipv6.conf.all.autoconf = 0
net.ipv6.conf.all.accept_ra = 0

[Security]Do not accept redirects

net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.default.accept_redirects = 0
net.ipv4.conf.all.secure_redirects = 0
net.ipv4.conf.default.secure_redirects = 0

[Tuning]Avoid tcp timeout

net.ipv4.tcp_timestamps = 0

[Performance]Maximum and default sizes for receive and send socket memory

net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.core.rmem_default = 262144
net.core.wmem_default = 262144
net.ipv4.tcp_rmem = 32768 262144 33554432
net.ipv4.tcp_wmem = 32768 262144 33554432

[Performance]Maximum number of open files, system-wide

fs.file-max = 200000

[Performance]The number of IPC message queue resources allowed

kernel.msgmni = 1024

[Performance]Semaphore limits: maximum semaphores per array, maximum semaphores system-wide, maximum operations per semop call, and maximum number of semaphore arrays

kernel.sem = 400 307200 128 1024

[Performance]Keepalive time

net.ipv4.tcp_keepalive_time = 900
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 10

[Performance] Page cache limit.

/*Some applications use large amounts of memory for accelerated access to business data.
Parts of this memory are seldom accessed. When a user request then needs to
access paged-out memory, the response time is poor. It is even worse when an
SAP solution running on Java incurs a Java garbage collection. The system
starts heavy page-in (disk I/O) activity and has a poor response time for an
extended period of time.*/
vm.pagecache_limit_mb = 41000

Maximum number of open files per user (limits.conf format)

* hard nofile 32768
* soft nofile 32768
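The kernel settings above normally go into a file that `sysctl -p` loads. To try individual values before making them permanent, the `key = value` lines can be turned into `sysctl -w` commands. The helper below is a small sketch of that idea; `sysctl_cmds` is a name made up for this example, and it only prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch: print a "sysctl -w" command for every "key = value" line in a file.
sysctl_cmds() {
    # Drop comments and blank lines, trim leading whitespace, and
    # remove the spaces around the first "=".
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1" |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*=[[:space:]]*/=/' |
        while IFS= read -r line; do
            printf 'sysctl -w "%s"\n' "$line"
        done
}

# Example using two of the settings listed above.
conf="$(mktemp)"
cat > "$conf" <<'EOF'
# socket memory limits
net.core.rmem_max = 33554432
net.ipv4.tcp_rmem = 32768 262144 33554432
EOF
sysctl_cmds "$conf"
rm -f "$conf"
# prints:
# sysctl -w "net.core.rmem_max=33554432"
# sysctl -w "net.ipv4.tcp_rmem=32768 262144 33554432"
```

The quoting matters for multi-value settings such as tcp_rmem, where the three numbers have to reach sysctl as a single argument.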

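Wednesday, April 9, 2014

Set up a git server

Whenever I used git before, a colleague set up the server for me. This time I got the chance to do it myself, so I am writing it down.

Git Server

1. Install the required package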

# apt-get install git-core

2. Create the repository

Assume all projects are to be kept under /opt/git, and create a git repository named project1.
a. # mkdir /opt/git
b. # mkdir /opt/git/project1
# cd /opt/git/project1
# git --bare init

Git Client

On the client side, assume the project lives in ~/projectA

# mkdir ~/projectA

# cd ~/projectA

# git init

# echo "project init">>init

# git add .

# git commit -m "init project"

# git remote add origin user@[IP]:/opt/git/project1

# git push origin master
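The whole flow can be rehearsed on one machine before touching a real server. The sketch below uses temporary directories in place of /opt/git/project1 and ~/projectA, and a local path instead of user@[IP]; the user name and e-mail are placeholders. It pins the branch name to master so that the push matches the command above even when git is configured with a different default branch.

```shell
#!/bin/sh
# Sketch: rehearse the server/client steps locally in a scratch directory.
work="$(mktemp -d)"
server="$work/git/project1"   # stands in for /opt/git/project1
client="$work/projectA"       # stands in for ~/projectA

# Server side: create an empty bare repository.
mkdir -p "$server"
git init --quiet --bare "$server"

# Client side: create the project and make the first commit.
mkdir -p "$client"
git init --quiet "$client"
# Make sure the branch is called master whatever init.defaultBranch says.
git -C "$client" symbolic-ref HEAD refs/heads/master
echo "project init" >> "$client/init"
git -C "$client" add .
git -C "$client" -c user.name="Demo" -c user.email="demo@example.com" \
    commit --quiet -m "init project"

# Point the client at the "server" and push.
git -C "$client" remote add origin "$server"
git -C "$client" push --quiet origin master

# Check that the commit arrived in the bare repository.
pushed="$(git --git-dir="$server" log -1 --format=%s master)"
echo "$pushed"   # prints: init project

rm -rf "$work"
```

Against a real server, only the remote URL changes; other users can then fetch the project with git clone using that same URL.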

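Tuesday, April 8, 2014

iSCSI Target and Initiator

The iSCSI target is the server side; it provides the iSCSI storage.
The iSCSI initiator is the client side; it connects to existing iSCSI storage and uses it.

iSCSI Target

Installing on Ubuntu

1. Choose the device that will provide the storage

The backing storage can be a whole disk, a partition, an LVM logical volume, or a file created with dd.
The following command creates a 20G file named volume0 under /media.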

# dd if=/dev/zero of=/media/volume0 count=0 obs=1 seek=20G
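With count=0 nothing is actually written: dd only seeks to the requested offset and the file ends there, so the result is a sparse file that takes almost no disk space until data is stored in it. This can be checked on a small scale. The sketch below assumes GNU dd and uses a 20M file in a temporary location instead of 20G under /media.

```shell
#!/bin/sh
# Sketch: create a small sparse file the same way and inspect its size.
img="$(mktemp)"

# Same options as above, but 20M instead of 20G.
dd if=/dev/zero of="$img" count=0 obs=1 seek=20M 2>/dev/null

# Apparent size in bytes: 20 * 1024 * 1024.
size="$(wc -c < "$img" | tr -d ' ')"
echo "apparent size: $size bytes"   # prints: apparent size: 20971520 bytes

# Blocks actually allocated on disk (usually far fewer than the apparent size).
du -k "$img"

rm -f "$img"
```

Because the space is allocated lazily, it is worth keeping an eye on free space in /media once the LUN starts filling up.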

2. Install the required packages

Install the required packages with APT

# apt-get install iscsitarget iscsitarget-dkms

3. Configure iSCSI

a. Enable start at boot

Edit /etc/default/iscsitarget

change

ISCSITARGET_ENABLE=false

to

ISCSITARGET_ENABLE=true

b. Configure the iSCSI target

Edit /etc/iet/ietd.conf

and add

Target iqn.2010-12.nl.ytec.arbiter:arbiter.blabla.lun1

Lun 0 Path=/media/volume0,Type=fileio,ScsiId=lun0,ScsiSN=lun0

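The parameters that need changing are:

I. The name after Target, which you can choose yourself.

II. Lun numbering must start from 0.

III. The value after Path= is the location of the backing storage. If it is the file just created with dd, use /media/volume0; otherwise change it to the actual device, e.g. /dev/sdb1.

4. Restart the iSCSI target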

# /etc/init.d/iscsitarget restart

iSCSI Initiator

Using openSUSE as the example

1. Install the required package

# yast -i open-iscsi

2. Discover available iSCSI LUNs

Use iscsiadm to discover the available targets

# iscsiadm -m discovery --type=st --portal=[iscsi target的IP]

sample output:

192.168.1.101:3260,1 iqn.2010-12.nl.ytec.arbiter:arbiter.blabla.lun1

3. Connect to the device

Once the target has been found, log in to the iSCSI device

# iscsiadm -m node -T iqn.2010-12.nl.ytec.arbiter:arbiter.blabla.lun1 --login

Use lsscsi to check that the connection succeeded

# lsscsi

Sample output:

[0:2:0:0] disk Intel MegaSR 1.0 /dev/sda

[1:2:0:0] disk Intel RMS25CB080 3.22 /dev/sdb

[18:0:0:0] disk IET VIRTUAL-DISK 0 /dev/sdc

4. Mount the device

Suppose we want to format the device as XFS and mount it at /opt/iscsi

# mkfs.xfs /dev/sdc

# mkdir /opt/iscsi

# mount -t xfs /dev/sdc /opt/iscsi

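Friday, April 4, 2014

Big Data Database Storage -- NoSQL Comparison and Performance Differences (VII)

This final part of the big data database storage series compares the NoSQL systems and their performance.

NoSQL Comparison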

Experiment Environment

Hardware and Network Topology

Software Version

MySQL VS Big Data

Insert Data

Update Data

Look-up Data

Delete Data

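Big Data Database Storage -- Introduction to Memcache and Redis (VI)

Memcache and Redis are both NoSQL systems that are mostly used as cache systems. What they have in common is that both keep data in memory as key-value pairs, so read/write I/O is very fast.

There are two bigger differences. Redis can persist data to a local drive, whereas Memcache keeps data only in memory. And Memcache values are untyped, whereas Redis supports several data types, e.g. map, set, and list.

Memcache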

Memcache Use Scenario

REDIS

REDIS Data Type

Big Data Database Storage -- Introduction to MongoDB (V)

The previous post introduced HBase; this one continues with MongoDB.

MongoDB

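MongoDB is developed by a company called 10gen. Unlike Cassandra and HBase, MongoDB is a document-based NoSQL database. It supports auto-sharding, and one of its more unusual features is built-in support for querying data with the map-reduce concept, written in JavaScript.

Data Structure

Since MongoDB calls itself a document-based database, let me use the simplest possible application to illustrate its advantage. Suppose we have a database that records posts and the comments on each post. The simplest approach in an RDBMS is to create two tables, one for posts and one for comments, so fetching a post together with its comments takes two queries.

What about MongoDB? Because it is document-based, a single document can hold both the post and its comments, so fetching a post with its comments takes only one request!

Write Path

MongoDB has three roles: mongod, which stores the data; Cmongod (the config mongod), which stores the shard keys; and mongos, which does load balancing and routes each user request to the right shard server.

When to use MongoDB?

MongoDB can be very lightweight (some people want to use it in place of sqlite). If you have no complex transactions, e.g. it is not a banking system, MongoDB is worth a look.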

Big Data Database Storage -- Introduction to HBase (IV)

The previous post introduced Cassandra; this one continues with HBase.

HBase

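HBase is an Apache top-level project. It emphasizes the CP in the CAP theorem, i.e. consistency and partition tolerance.

HBase Data Model

Since the previous post introduced Cassandra, the Cassandra example is reused here to compare HBase with Cassandra.

The biggest difference in HBase is that each CF (column family) is a map, and the key set of a CF can differ from row to row. Another point is that in HBase all data are byte arrays.

HBase Data Path

HBase has a rather intimidating diagram whose defining feature is an abundance of arrows. It can nonetheless be understood quite simply.

Taking the figure below as an example, the Hadoop data nodes are simply many nodes that can store data. In the figure above, two HBase nodes are responsible for answering requests.
P.S. HBase can also store its data on local disk; HDFS is not mandatory.

The client first goes to ZooKeeper, which directs it to one of the HBase servers. When the client inserts a row, HBase first writes it to the HLog (so the data can be recovered after a failure), then stores it in the MemStore (think of it as a small in-memory database), and then writes the data to HDFS, where it is called an HFile.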

NoSQL DON'Ts

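After NoSQL had been all the rage for a few years, people began to ask: is NoSQL really better than a traditional RDBMS? Or, put differently, do we really need NoSQL?

Before choosing, you must first understand what NoSQL cannot do:

Another myth is that traditional RDBMSs cannot cope in the big data era. Is that really so? Take Facebook: they store users' posts using sharding. Do you have more data than they do?

Big Data Database Storage -- Introduction to Cassandra (III)

The previous post explained what NoSQL is. From here on, we will go through the common NoSQL systems one by one.

Cassandra

Cassandra is a top-level project of the Apache Software Foundation. It has a key-value structure; what sets it apart is tunable consistency. What does that mean? You can choose the degree of consistency, for example returning success only once two data nodes have been updated, or only once all of them have.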
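A common way to reason about these choices is the overlap rule: with N replicas, if a write waits for W replicas and a read asks R replicas, every read is guaranteed to touch at least one up-to-date replica when W + R > N. The sketch below is only the arithmetic, not Cassandra code; `quorum` and `overlaps` are helper names made up for this example, while QUORUM and ONE are the names of two of Cassandra's consistency levels.

```shell
#!/bin/sh
# quorum N: the smallest number of replicas that forms a majority of N.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# overlaps W R N: succeeds when every read set must intersect every
# write set, i.e. when W + R > N.
overlaps() {
    [ $(( $1 + $2 )) -gt "$3" ]
}

n=3
q="$(quorum "$n")"
echo "replicas=$n quorum=$q"   # prints: replicas=3 quorum=2

if overlaps "$q" "$q" "$n"; then
    echo "QUORUM write + QUORUM read: the read sees the latest acknowledged write"
fi
if ! overlaps 1 1 "$n"; then
    echo "ONE write + ONE read: the read may miss the latest write"
fi
```

Lower levels answer faster and tolerate more failed nodes, at the price of possibly stale reads; that trade-off is the tunable part.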

Cassandra Data Model

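Cassandra's data model consists mainly of the keyspace, which corresponds to an RDBMS database; the column family, which corresponds to a table; the row key, which corresponds to the primary key; and the column name/key, which corresponds to the column name.

We can simply view a column family inside a keyspace (a table in RDBMS terms) as a map of sorted maps. A column family is a sorted map whose keys are the row keys and whose values are maps holding the value of each column, e.g. map<column name, column value>.

Cassandra has two kinds of column family: static and dynamic. A static column family is much like an RDBMS table: the schema is defined when the column family (table) is created. With a dynamic column family, you can still insert columns that were never defined when you insert data. For example, if the schema defines name, email, and address, you can still insert data for a column called state.

Another difference is that in an RDBMS an empty value still takes up space. For example, if a row is inserted with an empty field defined as a 4-byte char, it still occupies 4 bytes. Cassandra does not have this problem.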

Cassandra write path

Cassandra Tunable Consistency

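Cassandra defines 6 consistency levels, listed in the table below:

Big Data Database Storage -- What is NoSQL (II)

The previous post, Big Data Database Storage -- Traditional Solutions (I), covered the traditional ways of handling large data volumes. So what is NoSQL, which became popular a few years ago, and how does it approach large data volumes differently?

The CAP Theorem

Before talking about NoSQL, the well-known CAP theorem needs introducing. Wikipedia puts it this way:

In theoretical computer science, the CAP theorem, also known as Brewer's theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: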

Consistency (all nodes see the same data at the same time)

Availability (a guarantee that every request receives a response about whether it was successful or failed)

Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

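Simply put, a distributed system can satisfy at most two of consistency, availability, and partition tolerance at the same time.

What is NoSQL?

Early on, some people took NoSQL literally, i.e. "no SQL". In fact, NoSQL is better read as "not only relational database", meaning databases that differ from relational databases.

What NoSQL sets out to solve is scalability, which traditional relational databases find harder (though not impossible) to solve, especially scaling out: how to store so much data and handle heavy access once data volume and requests grow. Another major difference is being schemaless: unlike a relational database, no schema has to be defined before inserting data.

The next post starts introducing the common NoSQL systems.

Big Data Database Storage -- Traditional Solutions (I)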

The problem

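Take MySQL as an example: storing 10,000 rows in one table is certainly no problem. But what happens to the database when the data grows to a hundred thousand, a million, or even ten million rows? And once there are many users, can the database still answer their requests within a reasonable time?

Traditional Solution

Traditional does not mean outdated or obsolete. On the contrary, these are methods that have been proven to work and can be implemented in production. In fact, most websites still use them to deal with their existing large data volumes.

The biggest difference between partitioning and sharding is this: partitioning splits one table's data across multiple tables, whereas sharding places one table's data in multiple databases (usually multiple database instances).

Pros and Cons

So what are the pros and cons of partitioning and sharding?

Partition

Most databases already support partitioning. Configuring it mainly means telling the database which column to partition on and what the split conditions are. For an int column, for example, you could put 1~1000 in one table, 2000~3000 in another, and everything else in a third.

The biggest advantage of partitioning over sharding is that the SQL logic does not need rewriting. The database runs the query against the appropriate table based on the search conditions.

The biggest drawback is that, although the data is split into multiple tables, they all still live in the same database. Usually more data also means more users and therefore more requests, and performance is still limited by the throughput that one database server can provide.

Sharding

Sharding, as the name suggests, splits one table's data across multiple database servers. To implement it, we can rewrite the SQL logic ourselves so that our split condition decides which database each request is sent to. Alternatively, there are many frameworks that do this for you, such as MySQL Fabric, Vitess, Gizzard, and Jetpants.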
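The do-it-yourself variant boils down to a routing function that the application calls before every query. Here is a minimal sketch of one possible split condition, the user ID modulo the number of shards; the host names are hypothetical and `shard_for` is a name made up for this example.

```shell
#!/bin/sh
# Hypothetical shard hosts, in a fixed order.
SHARDS="db0.example.internal db1.example.internal db2.example.internal"

# shard_for USER_ID: print the host that holds this user's rows,
# chosen as USER_ID modulo the number of shards.
shard_for() {
    user_id="$1"
    count=0
    for h in $SHARDS; do
        count=$((count + 1))
    done
    idx=$((user_id % count))
    i=0
    for h in $SHARDS; do
        if [ "$i" -eq "$idx" ]; then
            echo "$h"
            return 0
        fi
        i=$((i + 1))
    done
}

shard_for 42     # prints: db0.example.internal  (42 % 3 = 0)
shard_for 1001   # prints: db2.example.internal  (1001 % 3 = 2)
```

Modulo routing is easy to write but awkward to grow: adding a fourth host changes the result for most IDs, so existing rows would have to be moved. Handling that kind of resharding is part of what the frameworks listed above are for.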

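The advantage of sharding is that database throughput grows as database instances are added; in other words, it is a scale-out solution. The biggest drawbacks are that, because the data is scattered across different databases, you may not be able to do joins, you have no ACID transactions, and every query becomes more complex because of the split condition.

How about Big Requests?

As mentioned earlier, more data usually means more usage, and with it more requests. Assuming the data has already been sharded across databases, how do we spread the traffic? Open-source load balancer solutions include HAProxy, Nginx, and MySQL Proxy (beta); you can of course also buy a very expensive hardware load balancer.

MySQL also has MySQL Cluster, which offers load balancing, auto-sharding, and more, but that is outside the scope of this post.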
