标签:
算法错误率规则支持度 |
分类: 数据挖掘 |
关联规则(TextAssociation Rules)是反映一个事件和其他事件之间依赖或关联的知识。文本关联规则挖掘是从大量文本中发现项集之间有趣的关联或相关联系。
常用的关联规则方法有:
(1)Apriori算法和AprioriTid算法
Apriori算法是个基于频繁集理论递推的方法。Apriori的缺点是:在每一次产生候选项目集时都要扫描一遍数据库D。为此在Apriori的基础上提出了AprioriTid算法。AprioriTid的特点是:仅在第一次扫描时用事务数据库D计算候选频繁项目集的支持度,提高了算法的效率。
(2)MH算法和LSH算法
在Apriori和AprioriTid算法中,起决定作用的是支持度,如果想要挖掘具有高可信度的规则,则需要用到MH(Min_Hashing)算法和LSH(Locality_Sensitive_Hashing)算法。Min_Hashing算法遗漏率为零,错误率可以严格控制,但是时空效率相对较差。LSH算法遗漏率和错误率是无法同时降低的,但是时空效率却相对的好很多。
The Apriori Algorithm: Basics
The Apriori Algorithm is an influential algorithm for mining frequent itemsets for boolean association rules.
Key Concepts :
• Frequent Itemsets: The sets of item which has minimum support (denoted by Li for ith-Itemset).
• Apriori Property: Any subset of frequent itemset must be frequent.
• Join Operation: To find Lk , a set of candidate k-itemsets is generated by joining Lk-1 with itself.
The Apriori Algorithm in a Nutshell
• Find the frequent itemsets: the sets of items that have minimum support
– A subset of a frequent itemset must also be a frequent itemset
i.e., if {AB} is a frequent itemset, both {A} and {B} should be a frequent itemset
– Iteratively find frequent itemsets with cardinality from 1 to k (k-itemset)
• Use the frequent itemsets to generate association rules.
The Apriori Algorithm : Pseudo code
•Join Step: Ck is generated by joining Lk-1with itself
•Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
• Pseudo-code:
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk !=∅; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in Ck+1 that are contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
https://qmna8w.bay.livefilestore.com/y1mMy0nIAmYtvJ1QtV3bKEikZa7NW53eFVJBRIBjQkrFU8mG1XlLkTlGReh_GbAN_vJyYTqUZ1fV2NP1W5FP6OSA7QSg0Rur0cxU59uOJh_fw0mfJNlChrEbaaJwYBPjdrCphq3G4YMZwV22kfKwrVkPA/clip_image002_thumb%205E7FC07D.jpg
https://qmna8w.bay.livefilestore.com/y1mK-KVubHfXmv4AprWJF-XckMnbpRreF_6mN7ohMiaIKU4R_mnFmTfvWg53jXkI_erZ_1s45ix80On_EPfvGQSoqiECzfLNjA5UdMDnXAj3E1LL9qG7AVk-WNIZmkuQSXY8OqMTQRymj-R2pqcIvP89Q/clip_image004_thumb%20329380AF.jpg
https://qmna8w.bay.livefilestore.com/y1mQVw_1ez4wx2hkP9dLL56-r5d_asEHn0kPNzlhFUPpvbUeVLg9Xkvy1vnavR4kgmZvVre6ZVTBTMVMPuNJJ_mqmxOWljQbpqgAqLM7WZiQmTfjv9xEg9hNPZ6kfyQ4DF2Liym7o59_EJfQxHvKeJCIA/clip_image006_thumb.jpg
https://qmna8w.bay.livefilestore.com/y1mQzlOGpWwEhixDdz6pw2XzUmMSIR9aaJwP1KAGhiyV51GK2XyeUV_ioDoKxPDgNaSl8iQdy-0OwvbXzSjqsVFcZ0t3dSGup8vEfYdUGAX_t9siORV7ARAQEorZNq6Fsd-JHi1EMYmQ7YbiE2KbrRDXg/clip_image008_thumb.jpg
https://qmna8w.bay.livefilestore.com/y1mwAc5pdPJxXTW6sA4AHjedrFp0c3phanXkYP0uIv4w7RbtUsr6ti5bY5Xi_PyETVY-OWZcFKlCkRYUVyPIfempn1m0PQC5n1o5en7bgim8TVVEk-HxWq9rn8IAi-VD_vSuVZ3lKxvrzra_Bz44JbixA/clip_image010_thumb.jpg
https://qmna8w.bay.livefilestore.com/y1m-Yvk9eL_TlDjhC6AQ0SeK4PKC3F61pVWUAkRsw8PpM-ACiOtT8hUcU4RbOinBJ-f-vcbHfFUAfhyORP0Wm42S_9JPcaj8IAOHjGoSP1ClM9Q92C3SzjY_xdYVClGXhbj6gSSj_6QHkL_oyVJr-McNw/clip_image012_thumb.jpg
https://qmna8w.bay.livefilestore.com/y1m4W6QBrQLIX1-M6ExGgd2dPCaZQuiqQ-yWiPKKuuocunB6NHipjVfufxCqn8kUsSaI6gjtNk-l6AsdxztbTJkgOC8empZYJjPaxVmi-WWFmwQ9hhjAu1870GXrlur-rSUEmtl3rhHdqYRuOIks9D7AA/clip_image014_thumb.jpg
https://qmna8w.bay.livefilestore.com/y1mG-K0TBvXYUfpFW1ZDffmy-1pRiQGEbezRFw-hgAXpdUfMbhplLkK3k17oH4syM2XohXNk4ebF4TOOv8pCfk4GG_6k3rtTobjKq6mceFFBVvUgtrDvmI3YOugt0LRqSu-6IHiBdbSAavV26W4sbK-qw/clip_image016_thumb.jpg
https://qmna8w.bay.livefilestore.com/y1mou8y2_ceeP_sVGiPZOojGQmzhZHSmqX8IPv6um_Br8GeIKxCDN-1PzCPErEoNQDj1OAHS1ZNKIPxluW9Rxs7tEYNIazzDCRHT58JHbjH7mBdTbgYMG3zP1Xtp38rH25SaSX-QreEMtOx7QGvgO9cWw/clip_image018_thumb.jpg
https://qmna8w.bay.livefilestore.com/y1m3VkFZXs7DBNIM_VXJYQyk80DP4ri2hZDv7l02DCkUU3b1dJ3lVJaIuP5O5sRPZlwTcJYgDlxxnHPhFM2fDH7bRWF7egkZ7J1X0XWp_DAWxdne0IoVgNvyyIcEBC0mPnLwq0Er98FeRyL7bultHckAQ/clip_image020_thumb.jpg
Methods to Improve Apriori’s Efficiency
• Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
• Transaction reduction: A transaction that does not contain any frequent k-itemset is useless in subsequent scans.
• Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
• Sampling: mining on a subset of given data, lower support threshold + a method to determine the completeness.
• Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.