Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clustering

International Journal of Research in Computer Science
eISSN 2249-8265 Volume 2 Issue 4 (2012) pp. 7-12
© White Globe Publications
www.ijorcs.org

PRIVACY PRESERVING MFI BASED SIMILARITY
MEASURE FOR HIERARCHICAL DOCUMENT
CLUSTERING
P. Rajesh1, G. Narasimha2, N.Saisumanth3
1,3
Department of CSE, VVIT, Nambur, Andhra Pradesh, India
Email: rajesh.pleti@gmail.com
Email: saisumanth.nanduri@gmail.com
2
Department of CSE, JNTUH, Hyderabad, Andhra Pradesh, India
Email: narasimha06@gmail.com

Abstract: The increasing nature of World Wide Web navigation steps to find relevant documents. So we
has imposed great challenges for researchers in need a hierarchical clustering that is relatively flat that
improving the search efficiency over the internet. Now reduces the number of navigation steps. Therefore
days web document clustering has become an there is a great need for new document clustering
important research topic to provide most relevant algorithms, which are more efficient than conventional
documents in huge volumes of results returned in clustering algorithms [1, 2].
response to a simple query. In this paper, first we
The increasing nature of World Wide Web has
proposed a novel approach, to precisely define
imposed great challenges for researchers to cluster the
clusters based on maximal frequent item set (MFI) by
similar documents over the internet and their by
Apriori algorithm. Afterwards utilizing the same
improving the efficiency of search. Search engine uses
maximal frequent item set (MFI) based similarity
the getting more confused in selecting the relevant
measure for Hierarchical document clustering. By
documents among huge volumes of search results
considering maximal frequent item sets, the
returned to a simple query. A potential solution to this
dimensionality of document set is decreased. Secondly,
problem is to cluster the similar web documents, which
providing privacy preserving of open web documents
helps the user in identifying the relevant data easily
is to avoiding duplicate documents. There by we can
and effectively [3].
protect the privacy of individual copy rights of
documents. This can be achieved using equivalence The outline of this paper is divided into six
relation. sections. section II, briefly discusses related work. We
explained our proposed algorithm description
Keywords: Maximal Frequent Item set, Apriori
including common preprocessing steps and pseudo
algorithm, Hierarchical document clustering,
code of algorithm in section III. It also includes to
equivalence relation.
precisely defining clusters based on maximal frequent
item set (MFI) by Apriori algorithm. Section IV,
I. INTRODUCTION
describes exploiting the same maximal frequent item
Document clustering has been studied intensively set (MFI) based similarity measure for Hierarchical
because of its wide applicability in areas such as web document clustering with running example. In section
mining, search engines, text mining and information V, provides privacy preserving of open web
retrieval. The rapid progress of databases in every documents by using equivalence relation to protect the
aspect of human actions has resulted in enormous individual copy rights of a document.. Section VI,
demand for efficient algorithms for spinning data into consists of conclusion and future scope.
valuable knowledge.
II. RELATED WORK
Document clustering has undergone through
various methods, still document clustering is in its The related work of using maximal frequent item
inefficiency state for providing the required set in web document clustering is explained as follows.
information needed by the user exactly and Ling Zhuang Honghua Dai [4] introduced a new
approximately. Suppose the user makes an incorrect criterion to specifically locate the initial points using
selection while browsing the documents in hierarchy. maximal frequent item set. These initial points are then
If user may not notice his mistakes until he browses used as centers for k-means algorithm. However k-
into the deep portion of the hierarchy, then it decreases means clustering is completely unstructured approach,
the efficiency of search and increases the number of sensitive to noise and produces an unorganized

www.ijorcs.org

8 P. Rajesh, G. Narasimha, N.Saisumanth
collection of clusters that is not favorable to based similarity measure . The clusters in the resulting
interpretation [5, 6]. To minimize the overlapping of hierarchy are non-overlapping. The parent cluster
documents, Beil, Ester [7] were proposed a method contains only the general documents.
HFTC (Hierarchical Frequent Text Clustering) is
another frequent item set based approach to choose the III. ALGORITHM DESCRIPTION
next frequent item sets. But the clustering result In this section, we explained our proposed
depends on the order of choosing next frequent item algorithm description including common
sets. The resulting hierarchy in HFTC usually contains preprocessing steps and pseudo code of algorithm. It
many clusters at first level. As a result the documents also includes to precisely defining clusters based on
in the same class are to be distributed into different maximal frequent item set (MFI) by Apriori algorithm.
branches of hierarchy, which decreases the overall First, we will speak about some common
clustering accuracy. preprocessing steps for representing each document by
C.M.Fung [8] has introduced FIHC (Frequent Item item sets (terms). Second we will bring in vector space
set based Hierarchical Clustering) method for model by assigning weights to terms in all document
document clustering. Which employed, a cluster topic sets. Finally, we will explain the process of
tree is constructed based on the similarity among initialization of clusters seeds using MFI to perform
clusters. FIHC used the efficient child pruning when hierarchical clustering. Let Ds represents set of all
number of clusters is large and to apply the elaborated documents in collection of database.
sibling merging only when number of clusters is small. Ds= {d1, d2, d3………dM}: 1 ≤ i ≤ M
The experiment results FIHC actually outperforms all
other algorithms (bisecting-k means, UPGMA) in A. Pre-Processing
accuracy for most number of clusters. The document set Ds is converted from
The Apriori algorithm [9] is a well-known method unstructured format into some common representation
for computing frequent item sets in a transaction using the text preprocessing techniques, in which
database. The document under the same topic, shares words or terms are extracted (tokenization). The input
more common frequent item sets (terms) than the data set of documents in Ds are preprocessed using the
documents of different topics. The main advantage of techniques namely, removing HTML tags first, after
using frequent item sets is that it can identify the that apply stop words list and stemming algorithm.
relation among the more than two documents at a time a) HTML Tags: parsing of HTML Tag
in a document collection unlike similarity measure b) Stop words: Remove the stop words list like
between two documents [10, 11].By the means of “conjunctions, connectives, prepositions etc”
maximal frequent item sets, the dimensionality of the c) Stemming algorithm: We utilize porter 2
document set is reduced. More over maximal frequent stemmer algorithm in our approach.
item sets captures most related document sets. On the
other hand, hierarchical clustering most relevant for B. Vector representation of document:
browsing and maps most specific documents to
generalized documents in the whole collection. Vector space model is the most commonly used
document representation model in text mining, web
A conventional hierarchical clustering method mining and information retrieval areas. In this model
constructs the hierarchy by subdividing parent cluster each document is represented as n-dimensional term
or merging similar children clusters. It usually suffers vector. The value of each term in the n-dimensional
from its inability to perform tuning once a merge or vector reflects the importance of corresponding
split decision has been performed. This rigidity may document. Let N be the total number of terms and M
lower the clustering accuracy. Furthermore, due to the be the number of documents and each the document

𝐷 𝑖 = (𝑡𝑒𝑟𝑚 𝑖1 , 𝑡𝑒𝑟𝑚 𝑖2 , … … … … . . 𝑡𝑒𝑟𝑚 𝑖𝑛 ) 1≤ i≤ M. Where
fact that a parent cluster in the hierarchy always can be denoted as

𝑑𝑓(𝑡𝑒𝑟𝑚 𝑖𝑗 ) < 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑
contains all objects of its Childs, this kind of hierarchy

frequency 𝑡𝑒𝑟𝑚 𝑖𝑗 is less than the threshold value is
is not suitable for browsing. The user may have value. The document
difficulty to locate his intention object in such a large
cluster. considered to avoid the problem of more times a term
Our hierarchical clustering method is completely appears throughout all documents in the whole
different. The aim of this paper is, first we form all collection, the more poorly it discriminates between
the clusters by assigning documents to the most similar documents [12].Calculate term frequency tf is number
cluster using maximal frequent item sets by Apriori of times a term appears in a document. Document
frequency of a term df as no of documents that

documents vectors. 𝐷 𝑖 = (𝑤 𝑖1 , 𝑤12 , 𝑤13 , … … . . , 𝑤1 𝑖𝑛 )
algorithm and then construct the hierarchical
document clustering based on their inter-cluster contains term. Also construct the weights for

Where 𝑤 𝑖𝑗 = 𝑡𝑓𝑖𝑗 ∗ 𝐼𝐷𝑓(𝑗) and
similarities via same maximal frequent item set (MFI)

www.ijorcs.org

IDf (j) =𝑙𝑜𝑔 � �1≤j≤n.where IDf is the inverse
𝑚
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clustering 9

𝑑𝑓 𝑗
A frequent item set is a set of words which occurs
frequently together and are good candidates for

such that X ⊂ X1 and t(X) = t(X1), where t(X) defined
document frequency. clusters and are denoted by FI. An item set X is closed
Table 1: Table Representation of Transactional Database of if there does not exist an item set X1 such that X1,
Documents
as the set of transactions that contain item set X and it
Terms Doc 1 Doc 2 Doc 3 ..... Doc 4
is denoted by FCI(frequently closed items).If X is
Java 1 1 0 ..... 1
frequent and no superset of X is frequent among the
Beans 0 1 0 ..... 0

MFI. Then MFI⊂ FCI ⊂ FI Whenever there are very
set of items I in transactional databases. Then we say
..... ..... ….. ….. ..... ….. that X is maximal frequent item set and denoted by
Servlets 1 0 1 ..... 1

By the representation of document as vector form, long patterns are present in the data it is often
we can easily identify which documents Contains the impractical to generate the entire set if frequent item
same features .The more features documents have in sets or closed item sets [16]. In that case, maximal
common, the more related they are. Thus, it is realistic frequent item sets are adequate for such applications.
to find well related documents. Assume that each We employed maximal frequent item set algorithm
document is an item in the transactional database; each from [17] using apriori. These maximal frequent item
term corresponds to a transaction. Our aim is to search sets are initial seeds for hierarchical document
for highly related documents “appearing” together clustering.
with same features (the documents whose MFI features D. Pseudo code Algorithm
are closed). Similarly, the maximal frequent item set
discovery in the transaction database serves the For MFI Based Similarity Measure for Hierarchical
purpose of finding items of documents appearing Document Clustering
together in many transactions. i.e., document sets Input: Document set Ds.
which have large amount of feature in common.
Definition: MFI: Maximal Frequent Item set.
C. Apriori for maximal frequent item sets
(tf) Term frequency and (df) document frequency
Mining frequent item sets is a primary content of
Step 1. For each document in Ds, Remove the HTML
data mining that emphasizes particularly in finding the
relation of different items in the large database. Mining tags and perform stop word list and stemming.
Step 2. Calculate the term frequency (tf) and document

𝐷 𝑖 = (𝑡𝑒𝑟𝑚 𝑖1 , 𝑡𝑒𝑟𝑚 𝑖2 , … … … … . . 𝑡𝑒𝑟𝑚 𝑖𝑛 ) 1≤i≤M
frequent patterns is crucial problem in many data
mining applications such as the discovery of frequency (df).

Where df�𝑡𝑒𝑟𝑚 𝑖𝑗 � < Threshold value
association rules, correlations, multidimensional
patterns, and other numerous important inferring
patterns from consumer market basket analysis and
web access etc. The association mining problem is Step 3. Also construct the weighted document vectors

𝐷 𝑖 = (𝑤 𝑖1 , 𝑤12 , 𝑤13 , … … . . , 𝑤1 𝑖𝑛 ) 𝑤 𝑖𝑗 = 𝑡𝑓𝑖𝑗 ∗
formulated as follows: Given a large data base of set of for all the documents

𝐼𝐷𝑓(𝑗).Idf (j) =𝑙𝑜𝑔 � � 1≤j≤n.
items transactions, find all frequent item sets, where a
𝑚
Where
frequent item set is one that occurs in at least a user-
𝑑𝑓 𝑗
specified threshold value of the data base. Many of the
proposed item set mining algorithms are a variant of
Step 4. Now represent each documents by keywords
Apriori, which employs a bottom-up, breadth first
whose tf>support
search that enumerates every single frequent item set.

𝑀𝐹𝐼 = { 𝐹1 , 𝐹2 , 𝐹3 , … … … … . . 𝐹 𝑛 }
Apriori is a conventional algorithm that was first Calculate the Maximal Frequent Item set(MFI) of
introduced] for mining association rules. Association terms using Apriori algorithm

Where each 𝐹𝑖 = { 𝑑1 , 𝑑2 , 𝑑3 , … … … 𝑑 𝑘 }
can be viewed as two-step process as

a document 𝑑 𝑖 is in more than one maximal
frequent item set then choose 𝐼 𝑑 as a set
(1) Identifying all frequent item sets
Step 5. If
(2) Generating strong association rules from the

containing document 𝑑 𝑖 . Then Assign𝐼 𝑥 =𝐼 𝑑0 .For
frequent item sets
consisting of such maximal frequent item sets
At first, candidate item sets are generated and

the document 𝑑 𝑖
afterwards frequent item sets are mined with the help each the maximal frequent item sets containing

𝐼𝑓 [ 𝑗𝑎𝑐𝑐𝑎𝑟𝑑𝑠(𝑐𝑒𝑛𝑡𝑒𝑟 ( 𝐼 𝑥 , 𝑑 𝑖 ))
of these candidate item sets. In the proposed approach,

> 𝑗𝑎𝑐𝑐𝑎𝑟𝑑𝑠(𝑐𝑒𝑛𝑡𝑒𝑟 ( 𝐼 𝑑𝑖 , 𝑑 𝑖 ))]
we have used only the frequent item sets for further
processing so that, we undergone only the first step
(generation of maximal frequent item sets) of the
Apriori algorithm.

www.ijorcs.org

Then assign 𝐼 𝑥 = 𝐼 𝑑𝑖 .Assign the document 𝑑 𝑖 to 𝐼 𝑥 𝐹𝑖 𝑙𝑖𝑘𝑒 𝐹3 = { 𝑑1 , 𝑑5 , 𝑑7 } as one cluster in hierarchy

and discard 𝑑 𝑖 for other maximal frequent item sets.
Case 3: If 𝐹𝑖 , 𝐹𝑗 contains some same documents
and represent it by center (as in step6).
Repeat this process for all documents that occurs in

consider the case of document 𝑑2 is repeatedin more
more than one maximal frequent item set

these maximal frequent item sets 𝐹𝑖 as clusters than one maximal frequent item sets{𝐹1 𝐹4 }.Similarly
among the documents list obtained from MFI. Let us

and combine the documents in 𝐹𝑖 into a single
Step 6. Apply hierarchical document clustering to make

𝑑4 is repeated in{ 𝐹1 , 𝐹2 , 𝐹4 }. Then choose𝐼 𝑑 =
{ 𝐹1 , 𝐹2 , 𝐹4 } = { 𝐼 𝑑0 , 𝐼 𝑑1 , 𝐼 𝑑2 }for document𝑑4 .Assign
𝐼 𝑥 =𝐼 𝑑0 = 𝐹1 . For each the maximal frequent item sets
new document and represent it by centers of the

𝐼 𝑑 containing 𝑑4
maximal frequent item sets. These are obtained

𝐼 𝑑0 𝑡𝑜 𝐼 𝑑2 calculate the measure
by combining the features of maximal frequent in the document from

𝐼𝑓 [ 𝑗𝑎𝑐𝑐𝑎𝑟𝑑𝑠(𝑐𝑒𝑛𝑡𝑒𝑟 ( 𝐼 𝑥 , 𝑑4 ))
item set of terms that grouping the documents

> 𝑗𝑎𝑐𝑐𝑎𝑟𝑑𝑠(𝑐𝑒𝑛𝑡𝑒𝑟 ( 𝐼 𝑑𝑖 , 𝑑4 ))]
Step 7. Repeat the same process of hierarchical
document clustering based on maximal frequent

document 𝑑4 closest to which maximal frequent item
item sets for all levels in hierarchy and stop if
total number of documents equals to one else go By using this jaccards measure, we can identify the

document 𝑑4 .Then assign 𝐼 𝑥 = 𝐼 𝑑𝑖 .
to step 4.
set among maximal frequent item sets containing the

Let’s suppose that 𝑑4 is closed to the maximal
IV. HIERARCHICAL CLUSTERS BASED ON

frequent item set 𝐹4 . Assign the document𝑑4 to𝐼 𝑥 =
MAXIMAL FREQUENT ITEM SETS

𝐼 𝑑𝑖 = 𝐹4 and discard 𝑑4 for other maximal frequent
After finding maximal frequent item sets (MFI) by
using Apriori algorithm. We turn to describing the

exactly one cluster. Similarly 𝑑2 belongs to𝐹1 .Repeat
creation of hierarchical document clustering using item sets. After this step, each document belongs to
same similarity measure by MFI. A simple instance

among the whole collection of documents 𝐷 𝑆 by
case of example is also provided to demonstrate the

𝑑2 , 𝑑4 are repeated in𝐹1 , 𝐹4 . The clusters that will form
this process for all documents that occurs in more than

apriorialgorithm are 𝑀𝐹𝐼 = { 𝐹1 , 𝐹2 , 𝐹3 … . . 𝐹 𝑛 }.Where
entire process. The set of maximal frequent item sets one maximal frequent item set. Since the documents

by𝐹𝑖 = { 𝑑1 , 𝑑2 , 𝑑3 … . . 𝑑 𝑘 }.Then consider total number
at the first level of hierarchy by applying step5 and

𝐹1 = {𝑑2 , 𝑑6 }
each MFI consist of set of documents represented step 6 are as follows.

𝐹2 = {𝑑3 , , 𝑑8 }
of documents which occurs in maximal frequent item

𝑑1 , 𝑑2 , 𝑑3, 𝑑4 , 𝑑5 , 𝑑6 , 𝑑7 , 𝑑8 , 𝐹3 = { 𝑑1 , 𝑑5 , 𝑑7 }
sets in MFI as follows.

𝑀𝐹𝐼 = � �
𝑑9 , 𝑑10 , 𝑑11 , 𝑑12 , 𝑑13 , 𝑑14 , 𝑑15
𝐹4 = {𝑑4 , , 𝑑14 }
𝐹1 = {𝑑2 , 𝑑4 , 𝑑6 }
𝐹5 = {𝑑10 , 𝑑12 , 𝑑15 }
𝐹2 = {𝑑3 , 𝑑4 , 𝑑8 }
𝐹6 = {𝑑9 , 𝑑11 , 𝑑13 }
𝐹3 = { 𝑑1 , 𝑑5 , 𝑑7 }
𝐹4 = {𝑑4 , 𝑑2 , 𝑑14 }
The hierarchical diagram for the above form of

𝐹5 = {𝑑10 , 𝑑12 , 𝑑15 }
maximal frequent item set clusters can be representing

𝐹6 = {𝑑9 , 𝑑11 , 𝑑13 }
as follows. Repeat the same process of hierarchical
document clustering based on maximal frequent item
sets for all levels in hierarchy and stop if total number
The clusters in the resulting hierarchy are non- of documents equals to one else go to step 4.
overlapping. This can be achieved through the

Case1: If 𝐹𝑖 , 𝐹𝑗 are same then choose one in random
following cases.

Case2: If 𝐹𝑖 , 𝐹𝑗 are different then form clusters of
to form cluster.

documents contained in𝐹𝑖 , 𝐹𝑗 independently. In our

in 𝐹3 , 𝐹5 and 𝐹6 𝑎𝑟𝑒 different. So we form a clusters
example, the maximal frequent item set of documents

according to the documents contained in
Figure 1: Hierarchical document clustering using MFI

www.ijorcs.org

Represent each new document �𝐿 𝑖𝑗 � in hierarchy by
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clustering 11
itself. When we are classifying the documents into
maximal frequent item set of terms as centers (as in equivalence classes, we are not considering these ones
step 6).These maximal frequent item sets are obtained and put zeros. Jaccard similarity coefficient matrix for
by combining the features of maximal frequent item four documents can be represented as follows.
set of terms that grouping the documents. Each new
d1 d2 d3 d4

�𝐿 𝑖𝑗 � represents that jth document in the level of
document also consisting of corresponding updated
weights of maximal frequent item set of terms. Where d 1  1 0.4 0.8 0.5

hierarchy𝐿 𝑖 . In the figure { 𝐿12 = 𝐿21 }means that the
d 2 0.4 1 0.8 0.4
Rα =  

level 𝐿1 are not matched with other documents MFI set
d 3 0.8 0.8 1 0.9
maximal frequent item set of terms in 2nd document of  
d 4 0.5 0.4 0.9 1 
in same level𝐿1 .So it is repeated same for the next
level and it is also same for the document { 𝐿13 = Ds = {d1 , d2 , d3 , d4 }as the collectionof document pairs
𝐿22 }. The documents{ 𝐿11 , 𝐿15 } and{ 𝐿14 , 𝐿16 } in first
Where alpha is threshold. Let define a relation R on

value. i.e 𝑅 = {(𝑑 𝑖 , 𝑑 𝑗 )/ 𝐽 (𝑑 𝑖 , 𝑑 𝑗 ) ≥ 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 }
whose similarity measure is above some threshold

level as 𝐿23 , 𝐿24 .
level are combined using MFI based hierarchical

1. R is reflexive on Ds iff 𝑅 (𝑑 𝑖 , 𝑑 𝑖 ) = 1. i.e Every
clustering and represent these documents in the second

2. R is symmetric on Ds iff𝑅 �𝑑 𝑖 , 𝑑 𝑗 � = 𝑅 �𝑑 𝑗 , 𝑑 𝑖 �i.e
document is mostly related to itself.

if the document 𝑑 𝑖 is similar to 𝑑 𝑗 then the
V. PRIVACY PRESERVING OF WEB

document 𝑑 𝑗 is also similar to𝑑 𝑖 .
DOCUMENTS USING EQUIVALENCE
RELATION
Most internet web documents are publicly available

𝑅 (𝑑 𝑖 , 𝑑 𝑘 ) ≥ 𝑚𝑎𝑥 𝑗 { min{𝑅 �𝑑 𝑖 , 𝑑 𝑗 �, 𝑅 �𝑑 𝑗 , 𝑑 𝑖 �}}.
for providing services required by the user. In such 3. R is transitive on Ds iff
documents there is no confidential or sensitive data
(open to all). Then how can we provide privacy of
such documents. Now a days, same information will Then R is transitive by the definition.
be exists in more than one document in duplicate
Then R is an equivalence relation on Ds, which
forms. The way of providing privacy preserving of
partitions the input document set Ds into set of
documents is by avoiding duplicate documents. There
equivalence classes. Equivalence relation seems a
by we can protect the privacy of individual copy rights
natural technique for duplicate document
of documents. Many duplicate document detection
categorization. Any two documents in same
techniques are available such as syntactic, URL based,
equivalence class are related and are different if they
semantic approaches. In each technique, a processing
are coming from two equivalence classes. The set of
overhead of maintaining shingling’s, signatures,
all equivalence classes induces the document set Ds.
fingerprints [13, 14, 15, 18]. In this paper, we
High syntactic similarity pairs of documents typically
proposed a new technique for avoiding duplicate
referred to as duplicates or near duplicates except
documents using equivalence relation. Let Ds be the
diagonal elements. By using equivalence relation,
input duplicate document set is subset to web
easily we can identify the duplicate documents or we
document collection. First find the jaccard similarity
can perform the clustering on duplicate documents.
measure for every pair of documents in Ds using
Apart from the representation of feature document
weighted feature representation of maximal frequent
vector by MFI, we also need to consider that who is
item sets discussed in step 2 and step 3 in algorithm. If
the author of document, when the document was
the similarity measure of two documents is equal to 1,
created, where it is available, helps in effectively
then the two documents are most similar. If the
finding the duplicate documents. Each document in
measure is 0, then they are not duplicates. The Jaccard
input Ds must belong to unique equivalence class. If R
index or the Jaccard similarity coefficient is a
is equivalence relation on Ds = {d1, d2, d3, d4 …..dn}.
statistical measure of similarity between sample sets.
Then number of equivalence relations on Ds is always
For two sets, it is denoted as the cardinality of their
lies between n ≤ | R|≤ n2. i.e the time complexity of
intersection divided by the cardinality of their union.

|𝑑1 ∩ 𝑑2 |
calculating equivalence relation on Ds is O(n2).

.i.e𝐽 �𝑑 𝑖 , 𝑑 𝑗 � ≥ 0.8. Since the matrix is symmetric, the
Mathematically

𝐽(𝑑1 , 𝑑2 ) =
Choose the threshold α in equivalence relation as 0.8

|𝑑1 ∩ 𝑑2 | documents sets {(𝑑3 , 𝑑1 ), (𝑑3 , 𝑑2 ), (𝑑4 , 𝑑3 )} are
mostly related. Hence the documents are near
For every pair of two documents calculate jaccard duplicates and grouping the documents into clusters
measure of d1, d2.All the diagonal elements in matrix thereby providing privacy of individual copy rights of
are ones, because every document mostly related to documents.

www.ijorcs.org


0 0 1 0 Data mining 2002 (KDD-2002), Edmonton, Alberta,
0 0 1 0
Canada.
R 0.8 =   [8] BenjaminFung, C.M., Wang, Ke., Ester, Martin. (2003).
1 1 0 1 “Hierarchical Document Clustering using Frequent Item
  Sets”. In Proceedings SIAM International Conference
0 0 1 0 on Data Mining 2003 (SIAM DM-2003), pp:59-70.
[9] Agrawal, R., Srikant, R. (1994). “Fast Algorithms for
VI. CONCLUSION AND FUTURE SCOPE Mining Association Rules”. In the Proceedings of 20th
International Conference on Very Large Data Bases,
Cluster analysis can be used as powerful ,stranded
1994, Santiago, Chile, PP: 487-499.
alone data mining concept that gains insight
[10] Liu, W.L., and Zeng, X.S. (2005). “Document
information of knowledge from huge unstructured
Clustering Based on Frequent Term Sets”. Proceedings
databases. Most conventional clustering methods do of Intelligent Systems and Control, 2005.
not satisfy the document clustering requirements such
[11] Zamir, O., Etzioni, O. (1998). “Web Document
as high dimensionality, huge volumes and easy of
Clustering: A Feasibility Demonstration”. In the
accessing meaningful clusters labels. In this paper, we Proceedings of ACM,1998 (SIGIR-98), PP: 46-54.
presented novel approach; Maximal frequent item set
[12] Kjersti, (1997). “A Survey on Personalized Information
(MFI) Based Similarity Measure for Hierarchical
Filtering Systems for the World Wide Web”. Technical
Document Clustering to address these issues. Report 922, Norwegian Computing Center, 1997.
Dimensionality reduction can be achieved through
[13] Prasannakumar, J., Govindarajulu, P., “Duplicate and
MFI. By using the same MFI similarity measure in Near Duplicate Documents Detection: A Review”.
hierarchal document clustering, the number of levels European Journal of Scientific Research ISSN 1450-
will be decreased. It is easy for browsing. Clustering 216X Vol.32 No.4 ,2009, pp:514-527
has its paths in many areas, by applying MFI based [14] Syed Mudhasir,Y., Deepika,J., “Near Duplicate
techniques to clusters, including data mining, statistics, Detection and Elimination Based on Web Provenance
biology, and machine learning we can get the high for Efficient Web Search”. In the Proceedings of
quality of clusters. Moreover, by means of maximal International Journal on Internet and Distributed
frequent item sets, we can predict the most influenced Computing Systems, Vol.1, No.1, 2011.
objects of clusters in the entire dataset of applications [15] Alsulami, B.S., Abulkhair, F., Essa, E., “Near Duplicate
like business, marketing, world wide web, social Document Detection Survey”. In the Proceedings of
networking analysis. International Journal of Computer Science and
Communications Networks, Vol.2, N0.2, pp:147-151.
VII. REFEERENCES [16] Doug Burdick, Manuel Calimlim, Johannes Gehrke.
(2001). “A Maximal Frequent Itemset Algorithm for
[1] Ruxixu, Donald Wunsch., “A Survey of Clustering
Transactional Databases”. In the Proceedings of ICDE,
Algorithms”. In the Proceedings of IEEE Transactions
17th International Conference on Data Engineering
on Neural Networks, Vol. 16, No. 3, May 2005.
2001 (ICDE-2001).
[2] Jain, A.K., Murty, M.N., Flynn, P.J., “Data Clustering:
[17] Murali Krishna, S., Durga Bhavani, S., “An Efficient
A Review”. In the Proceedings of ACM Computing
Approach for Text Clustering Based On Frequent Item
Surveys, Vol.31, No.3, 1999, pp: 264-323.
Sets”. European Journal of Scientific Research ISSN
[3] Kleinberg, J.M., “Authoritative Sources in a 1450-216X, Vol.42, No.3, 2010, pp:399-410.
Hyperlinked Environment”. In the Journal of the ACM,
[18] Lopresti, D.P. (1999). "Models and Algorithms for
Vol. 46, No.5, 1999, pp: 604-632.
Duplicate Document Detection". In the Proceedings of
[4] Ling Zhuang, Honghua Dai. (2004). “A Maximal Fifth International Conference on Document Analysis
Frequent Item Set Approach for Web Document and Recognition 1999 (ICDAR-1999), 20th-22th Sep,
Clustering”. In Proceedings of the IEEE Fourth pp:297-300.
International Conference on Computer and Information
Technology 2004 (CIT-2004).
[5] Michael, W., Trosset. (2008). “Representing Clusters:
k-Means Clustering, Self-Organizing Maps and
Multidimensional Scaling”. Technical Report,
Department of Statistics, Indian University,
Bloomington, 2008.
[6] Michael Steinbach, George karypis, and Vipinkumar.
(2000). “A Comparison of Document Clustering
Techniques”. In Proceedings of the Workshop on Text
Mining, 2000 (KDD-2000), Boston, pp: 109-111.
[7] Beil, F., Ester, M., Xu, X. (2002). “Frequent Term-
Based Text Clustering”. In Proceedings of 8th
International Conference on Knowledge Discovery and

www.ijorcs.org

Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clustering

Recommended

Recommended

More Related Content

What's hot

What's hot (14)

Similar to Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clustering

Similar to Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clustering (20)

More from IJORCS

More from IJORCS (20)

Recently uploaded

Recently uploaded (20)

Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clustering