Chapter 5 Hashing & Sorting
5.1 Hashing
- Hash 定義
- Dictionary problem, maintain a dictionary and support operations Get, Insert, and Delete be done in O(1) expected time
- Static hash存在table中(array),hash table size = Bucket * Slot,例如, int hash_table[B][S],則B為分類數,S為每一類的承載量。
- Hash 相關術語
- Key density:n/T,例如,本文單字量/全球單字量,通常Key density的數值非常小。
- Loading density:n /B*S,loading高表示hash table利用度高,collision也會偏高。
- Collision:home bucket of a new pair is not empty.
- Overflow:home bucket for a new pair is full. (有collision 不一定overflow,若Bucket只有一個slot, 則collision = overflow)
- Hash 設計:一個良好的hash function須滿足
- 計算簡單
- collision少
- hash table 不局部偏重
- perfect hash function : 保證無collision,data已知。
- uniform hash function : 均勻的,對應到每個bucket機率相等。
- 常見的Hash function
- Division:h(k) = k%D,In practice, we choose D that has no prime factor smaller than 20
- Multiplication:
h(k) = (A·k mod 2^w) right_shift (w–r)
- Mid-Square:compute the home bucket by taking square of k, using an appropriate number of bits from the middle of the square
- Folding:If k = 12320324111220, partition into 3 decimal digits long, we have
P1 =123, P2 =203, P3 =241, P4 =112, P5=20, - Shift, 直接相加:h(k)= Pi =123+203+241+112+20=699
- Boundary, 折紙, 偶數片段反向:h(k)=123+302+241+211+20=897
- Hash的Overflow處理
- Chaining:a bucket is a pointer to a list of keys having same home bucket
- Open Addressing (rehashing)
- Linear probing:(h(k) + i)%b
- 缺點:形成primary clustering,具有相同hash address的資料容易聚集在鄰近的Bucket中,造成search time增加
- Quadratic probing:(h(k) + i^2)%b and (h(k) − i^2)%b.
- 缺點:hash table 不保證充分利用,會有secondary clustering problem, 探測順序皆相同, 增加search time
- Double hashing:h(k,i) = ( h1(k) + i*h2(k) ) mod m
- This method generally produces excellent results, but h2(k) must be relatively prime to m. One way is to make m a power of 2 and design h2(k) to produce only odd numbers
- Security Hash function 必要的特性
- input 可變動大小
- output 長度固定
- Hash function 計算有效率
- 已知H(x), 難找x (preimage resistant)
- 已知x, 難找x != y 且 H(y) = H(x) (second preimage resistant, weak collision resistant
- 難找任一對x, y 使H(x) = H(y) (strong collision resistant)
- peudorandomness
5.2 Sorting
- 排序的種類
- Internal sort 與 External sort
- Internal sort:排序工作主要在memory完成,資料量較少者適用。
- External sort:排序工作主要是在disk完成,資料量較大者適用。
- stable sorting 與 unstable sorting
- stable sorting:如果鍵值相同之資料,在排序後相對位置與排序前相同。
排序前 : 3,5,19,1,3*,10
排序後 : 1,3,3*,5,10,19
(因為兩個 3, 3*的相對位置在排序前與後皆相同。) - simple sorting 與 Advanced sorting
- 常見的排序(重要)
- Bubble sorting:每一回合逐一比較相臨資料,依排序之順序交換位置,每回合至少會有一次交換位置,至沒交換位置則停止。
- Selection sorting:未排序前之第一筆資料可視為已排序好之資料,每一回合逐一比較相臨資料,依排序之順序交換位置。
- Insertion sorting:將待排序之資料逐一與已排序後之資料比較,再將資料放入適當位置
- Merge sorting:將兩個已排序過的記錄合併而得到另一個排序好的記錄。
- Heap Sort:將資料先以 “Bottom-up” 的方式建立Max-Heap,再執行 n-1 回合的 “Delete Max.” 動作
5.3 Quick sorting
- 原理:找資料pivot,將小於pivot放右邊,大於pivot放左邊,再以相同方法分左右兩邊序列直到排序完成。
- 性質:最差 O(n2)與平均時間 O(nlog2n),需要額外堆疊空間,不穩定排序,快速排序是平均時間最快之內部排序法。
- pseudocode for quick sort :Initial call: QUICKSORT(A, 1, n)
- QUICKSORT(A, p, r)
if p < r
then q ← PARTITION(A, p, r) - QUICKSORT(A, p, q–1)
- QUICKSORT(A, q+1, r)
- Partitioning subroutine (in-place version):
- Analysis of quick sort:Randomized quick sort, we can make sure we are usually lucky
- L(n) = 2U(n/2) + Θ (n) lucky
- U(n)= L(n –1) + Θ (n) unlucky
- Summary of randomized order-statistic selection
- Works fast: linear expected time.
- Excellent algorithm in practice.
- But, the worst case is very bad: Θ(n^2).
- Q. Is there an algorithm that runs in linear time in the worst case?
A. Yes, due to Blum, Floyd, Pratt, Rivest, and Tarjan [1973].
IDEA : Generate a good pivot recursively. - [改善pivot方法1] Median-of-3 Partition
- choose the pivot as the median of a set of 3 elements randomly selected from the array,for example…
- What is the probability of getting an OK split if the pivot is chosen at random? (A split is “OK” if the smaller piece has at least n/4 elements.)
- what is the probability of getting an OK split with Median-of-3 Partition?
- [改善pivot方法2] columns-of-5 Partition
- Choose the pivot : At least half the group medians are ≤ x , which is at least n/5/2= n/10 group medians.
- Therefore, at least 3n/10 elements are ≤ x.
- In practice, this algorithm runs slowly, because the constant in front of n is large.
The randomized algorithm is far more practical. - Exercise: Why not divide into groups of 3? 會失敗!
5.4 Linear-Time Sorting
- Decision-tree model
- Theorem:A decision tree can model the execution of any comparison sort:
Theorem:Any decision tree that can sort n elements must have height Ω(nlgn).
- Counting sort: No comparisons between elements.
- Input: A[1 . . n], where A[j]∈{1, 2, ..., k}.
- Output: B[1 . . n], sorted.
- Auxiliary storage: C[1 . . k].
- 原理
- 計算每個key的出現次數
- 計算「之後排好串列」的起始位置
- 放進去,C[A[j]] - 1使insert時按順序且stable
- 結論
- O(n + k), 假設key的範圍是受到限制的則 k 可視為constant
- 手上有counting sort可以排0~9 key的範圍可否排00~99, 000~999?
可! 用radix sort , LSD
- Radix sort:
- 可以降成O(n)的理由:若可知key的範圍限制,就可知要做幾回合
ex : 3位數需要3回合 - 原理
- radix sort 結論
- time : 每回合O(n + r), 總共花O(d*(n+r))
- 分派 : O(n), n個數字要分
- 合併 : O(r), r個桶子
- d:回合數
- space : Buckets space需求龐大, O(r*n)
- stable, linear time
- Downside: Unlike quicksort, radix sort displays little locality of reference, and thus a well-tuned quicksort fares better on modern processors, which feature steep memory hierarchies.
沒有留言:
張貼留言