With the enormous number of e-documents on today's internet, such as HTML pages and digital libraries, occupying a considerable amount of cyberspace, organizing these documents has become a practical need. Clustering is an important technique that organizes a large number of objects into smaller coherent groups. This enables efficient and effective use of these documents for information retrieval and other NLP tasks. Email is one of the most frequently used e-documents by individuals and organizations, and email categorization is one of the major tasks of email mining. Categorizing emails into different groups enables easy retrieval and maintenance. Like other e-documents, emails can also be grouped using clustering algorithms. In this
paper a similarity measure called Similarity Measure for Text Processing is suggested for email clustering.
The suggested similarity measure takes three situations into account: a feature appears in both emails, a feature appears in only one email, or a feature appears in neither email. The effectiveness of the suggested similarity measure is evaluated on the Enron email data set for categorizing emails. The results indicate that the performance achieved with the suggested similarity measure is better than that achieved with other measures.
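The three-situation idea can be sketched as a toy measure. This is not the paper's exact formula: the per-feature contributions and the `lam` penalty weight are illustrative assumptions.

```python
def smtp_like_similarity(a, b, lam=0.5):
    """Toy similarity in the spirit of the three-situation idea: reward
    features present in both documents, penalise features present in only
    one, and ignore features absent from both. `lam` is an assumed
    penalty weight, and the result is rescaled into [0, 1]."""
    num, den = 0.0, 0.0
    for x, y in zip(a, b):
        if x > 0 and y > 0:       # feature appears in both emails
            num += 1.0
            den += 1.0
        elif x > 0 or y > 0:      # feature appears in only one email
            num -= lam
            den += 1.0
        # feature absent from both emails: contributes nothing
    if den == 0:
        return 0.0
    return (num / den + lam) / (1 + lam)
```

With this sketch, identical feature vectors score 1 and fully disjoint ones score 0, while shared absences are ignored, which is exactly what separates this family of measures from plain cosine similarity.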
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...iosrjce
1) The document discusses an approach to measure semantic similarity between multiple documents using an enhanced suffix tree. It involves preprocessing documents, constructing a suffix tree with documents' phrases as edges, calculating weights of shared nodes using TF-IDF, and applying cosine, dice, and hellinger similarity measures to determine pairwise document similarities.
2) The approach first preprocesses documents by removing stop words, special characters, and converting to lowercase. A suffix tree is constructed with documents' phrases as edges. Shared nodes in the tree represent common phrases between documents.
3) Node weights are calculated using TF-IDF, with higher weights given to rarer phrases. Several similarity measures (cosine, Dice, Hellinger) are then applied to compute the pairwise document similarities.
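Once the shared suffix-tree nodes are weighted, each document reduces to a vector of node weights, and the three listed measures can be computed directly. A minimal sketch in plain Python, assuming the weight vectors are already aligned over the same set of nodes:

```python
import math

def cosine(u, v):
    """Cosine similarity between two aligned weight vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def dice(u, v):
    """Dice coefficient over the same vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    den = sum(x * x for x in u) + sum(y * y for y in v)
    return 2.0 * dot / den if den else 0.0

def hellinger(p, q):
    """Hellinger distance; vectors are L1-normalised first."""
    sp, sq = sum(p), sum(q)
    return (1 / math.sqrt(2)) * math.sqrt(
        sum((math.sqrt(x / sp) - math.sqrt(y / sq)) ** 2 for x, y in zip(p, q)))
```

Note that cosine and Dice are similarities (1 means identical) while Hellinger is a distance (0 means identical), so the three are not interchangeable without inverting one convention.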
Different Similarity Measures for Text Classification Using KnnIOSR Journals
This document summarizes research on classifying textual data using the k-nearest neighbors (KNN) algorithm and different similarity measures. It explores generating 9 different vector representations of text documents and using KNN with similarity measures like Euclidean, Manhattan, squared Euclidean, etc. to classify documents. The researchers tested KNN on a Reuters news corpus with 5,485 training documents across 8 classes and found that normalization and k=4 produced the best accuracy of 94.47%. They conclude KNN with different similarity measures and vector representations is effective for multi-class text classification.
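The classification step itself is straightforward to sketch: KNN with a pluggable distance function. The vectors, labels, and `k=4` default mirror the setup described above, but the code is illustrative, not the authors' implementation.

```python
import math
from collections import Counter

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def manhattan(u, v):
    """Manhattan (city-block) distance."""
    return sum(abs(x - y) for x, y in zip(u, v))

def knn_classify(train, query, k=4, dist=euclidean):
    """train: list of (vector, label) pairs. Returns the majority label
    among the k training vectors nearest to `query` under `dist`."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

Swapping `dist` for squared Euclidean, Manhattan, or any other measure is how the different similarity measures in the study would plug into the same classifier.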
ONTOLOGY BASED DOCUMENT CLUSTERING USING MAPREDUCE ijdms
Nowadays, document clustering is considered a data-intensive task due to the dramatic, fast increase in the number of available documents. Moreover, the feature sets that represent those documents are also very large. The most common method for representing documents is the vector space model, which represents document features as a bag of words and does not capture semantic relations between words. In this paper we introduce a distributed implementation of bisecting k-means using the MapReduce programming model. The aim of our proposed implementation is to solve the problem of clustering data-intensive document collections. In addition, we propose integrating the WordNet ontology with bisecting k-means in order to exploit the semantic relations between words and enhance document clustering results. Our experimental results show that using lexical categories for nouns only enhances the internal evaluation measures of document clustering and reduces the document features from thousands to tens. Our experiments were conducted on Amazon Elastic MapReduce to deploy the bisecting k-means algorithm.
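Setting MapReduce aside, the underlying bisecting k-means loop can be sketched in a few lines: repeatedly pick the largest cluster and split it with 2-means. This is a single-machine toy on 2-D points, not the distributed implementation the paper describes.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(pts):
    """Component-wise centroid of a list of points."""
    n = len(pts)
    return tuple(sum(c) / n for c in zip(*pts))

def kmeans2(points, iters=20, seed=0):
    """Split points into two clusters with plain 2-means."""
    rng = random.Random(seed)
    c1, c2 = rng.sample(points, 2)   # two distinct points as initial centres
    g1, g2 = [c1], [c2]
    for _ in range(iters):
        n1 = [p for p in points if dist2(p, c1) <= dist2(p, c2)]
        n2 = [p for p in points if dist2(p, c1) > dist2(p, c2)]
        if not n1 or not n2:
            break                    # degenerate split: keep last valid one
        g1, g2 = n1, n2
        c1, c2 = mean(g1), mean(g2)
    return g1, g2

def bisecting_kmeans(points, k):
    """Grow from one cluster to k by always splitting the largest cluster."""
    clusters = [list(points)]
    while len(clusters) < k:
        big = max(clusters, key=len)  # heuristic choice of cluster to split
        clusters.remove(big)
        clusters.extend(kmeans2(big))
    return clusters
```

In the MapReduce version, the assignment step (which points go to which centre) is the map phase and the centroid update is the reduce phase; the outer bisecting loop stays sequential.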
International Journal of Engineering Research and Development (IJERD)IJERD Editor
This document presents a novel approach for clustering textual information in emails using text data mining techniques. It discusses using k-means clustering and a vector space model to group similar emails based on word patterns and frequencies. The methodology involves preprocessing emails, applying a Porter stemmer, calculating term frequencies, and using k-means to form clusters. Clusters will contain emails with similar content, allowing users to more easily process emails based on priority. This clustering approach could reduce the time users spend filtering through emails one by one.
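The preprocessing and term-frequency steps of such a pipeline can be sketched as follows. The toy stop list is an assumption, and a real pipeline would also apply a Porter stemmer before counting.

```python
import re
from collections import Counter

# Toy stop list for illustration; real pipelines use a much larger one.
STOP = {"the", "a", "an", "is", "to", "and", "of", "in"}

def term_frequencies(text):
    """Lowercase, tokenise, drop stop words, and count terms.
    A Porter stemmer would normally be applied to each token here."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP)
```

The resulting counters are the per-email term-frequency vectors that k-means then groups by word patterns and frequencies.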
A rough set based hybrid method to text categorizationNinad Samel
This document summarizes a hybrid text categorization method that combines Latent Semantic Indexing (LSI) and Rough Sets theory to reduce the dimensionality of text data and generate classification rules. It introduces LSI to reduce the feature space of text documents represented as high-dimensional vectors. Then it applies Rough Sets theory to the reduced feature space to locate a minimal set of keywords that can distinguish document classes and generate multiple knowledge bases for classification instead of a single one. The method is tested on text categorization tasks and shown to improve accuracy over previous Rough Sets approaches.
A Competent and Empirical Model of Distributed ClusteringIRJET Journal
This document discusses distributed document clustering. It begins with an introduction to how documents are stored and indexed in computers. It then discusses different clustering algorithms like hierarchical and k-means clustering that are used to group similar documents. The document proposes a new framework for efficiently clustering text documents stored across different distributed resources. It argues that traditional clustering algorithms cannot perfectly cluster text data in decentralized systems. The framework uses properties of traditional algorithms with the ability to cluster in distributed systems.
This paper proposes a Tamil document summarization system that utilizes statistical, semantic, and heuristic methods to generate a coherent multi-document summary based on a given query. The system performs Latent Dirichlet Allocation (LDA) topic modeling on document clusters to identify important topics and words. Sentences are then scored based on topic modeling results and redundancy is removed using Maximal Marginal Relevance. The summary is generated from the highest scoring sentences in different perspectives based on the query topic or entities. Evaluation results show the system effectively summarizes multiple documents according to the query.
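The redundancy-removal step can be illustrated with greedy Maximal Marginal Relevance. The `lam` trade-off weight and the dictionary-based similarity lookups below are illustrative assumptions, not the paper's implementation.

```python
def mmr_select(candidates, query_sim, pair_sim, lam=0.7, n=2):
    """Greedy Maximal Marginal Relevance: pick `n` sentences, balancing
    relevance to the query against redundancy with already-picked ones.
    candidates: list of sentence ids; query_sim[id] -> relevance score;
    pair_sim[(i, j)] -> similarity for i < j (symmetric)."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < n:
        def score(c):
            redundancy = max((pair_sim[tuple(sorted((c, s)))]
                              for s in selected), default=0.0)
            return lam * query_sim[c] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam` near 1 the selection is purely relevance-driven; lowering it makes the summariser skip a high-scoring sentence that merely restates one already chosen.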
With the ever increasing number of documents on the web and in other repositories, organizing and
categorizing these documents to meet the diverse needs of users by manual means is a complicated job; hence
a machine learning technique named clustering is very useful. Text documents are clustered by pairwise
similarity of documents using similarity measures like Cosine, Jaccard or Pearson. The best clustering results
are seen when the overlap of terms between documents is low, that is, when clusters are distinguishable. Hence,
for this problem, to find document similarity we apply the link and neighbor notions introduced in ROCK. A link
specifies the number of shared neighbors of a pair of documents; significantly similar documents are called
neighbors. This work applies links and neighbors to Bisecting K-means clustering for identifying seed
documents in the dataset, as a heuristic measure for choosing a cluster to be partitioned, and as a means to
find the number of partitions possible in the dataset. Our experiments on real-time datasets showed a
significant improvement in terms of accuracy with minimum time.
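The link and neighbor notions from ROCK can be sketched directly; the similarity threshold that defines neighbours is an assumed parameter.

```python
def neighbors(sims, threshold):
    """Documents i and j are neighbours when their similarity reaches
    `threshold`. sims: dict {(i, j): similarity} with i < j.
    Returns {doc: set of neighbour docs}."""
    nbrs = {}
    for (i, j), s in sims.items():
        if s >= threshold:
            nbrs.setdefault(i, set()).add(j)
            nbrs.setdefault(j, set()).add(i)
    return nbrs

def link(nbrs, i, j):
    """ROCK link: the number of neighbours shared by documents i and j."""
    return len(nbrs.get(i, set()) & nbrs.get(j, set()))
```

A high link count means two documents are similar to many of the same documents, which is more robust than their direct pairwise similarity when term overlap is sparse.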
This document presents an algorithm for a semantic-based similarity measure (SBSM) to improve text clustering. The algorithm assigns semantic weights to document terms and phrases based on their use as arguments in proposition bank notation. It calculates similarity between a document and a query based on matching weighted terms and phrases. Experimental results on a dataset show that the SBSM using proposition bank notation improves performance over traditional measures like cosine and Jaccard similarity. The algorithm captures semantic information within documents for more accurate similarity assessment and clustering.
This document summarizes a research paper that introduces a novel multi-viewpoint similarity measure for clustering text documents. The paper begins with background on commonly used similarity measures like Euclidean distance and cosine similarity. It then presents the novel multi-viewpoint measure, which considers multiple viewpoints (objects not assumed to be in the same cluster) rather than a single viewpoint. The paper proposes two new clustering criterion functions based on this measure and compares them to other algorithms on benchmark datasets. The goal is to develop a similarity measure and clustering methods that provide high-quality, consistent performance like k-means but can better handle sparse, high-dimensional text data.
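The core idea, measuring the similarity of two documents relative to many outside viewpoints rather than from the origin alone, can be sketched as follows. This is a simplified form; the paper's exact formula and its two criterion functions differ.

```python
import math

def _dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def _sub(u, v):
    return [x - y for x, y in zip(u, v)]

def _cos(u, v):
    """Cosine of the angle between two vectors."""
    nu, nv = math.sqrt(_dot(u, u)), math.sqrt(_dot(v, v))
    return _dot(u, v) / (nu * nv) if nu and nv else 0.0

def multiviewpoint_sim(di, dj, viewpoints):
    """Average cosine of (di - dh, dj - dh) over viewpoint documents dh,
    which are assumed to lie outside the cluster containing di and dj."""
    vals = [_cos(_sub(di, h), _sub(dj, h)) for h in viewpoints]
    return sum(vals) / len(vals) if vals else 0.0
```

Seen from a viewpoint `dh` outside the cluster, two genuinely close documents point in nearly the same direction, while ordinary cosine similarity always measures from the single viewpoint of the origin.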
The document describes latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA represents documents as random mixtures over latent topics, characterized by distributions over words. It is a three-level hierarchical Bayesian model where documents are generated by first sampling a per-document topic distribution from a Dirichlet prior, then repeatedly sampling topics and words from these distributions. LDA addresses limitations of previous models by capturing statistical structure within and between documents through the hierarchical Bayesian formulation.
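The generative process described above can be sketched with standard-library sampling (a Dirichlet draw via normalised Gamma variates). The topic matrices and vocabulary below are toy assumptions for illustration.

```python
import random

def generate_document(alpha, topics, vocab, doc_len, seed=0):
    """Sample one document from the LDA generative process:
    theta ~ Dirichlet(alpha); for each word position, draw a topic
    z ~ Multinomial(theta), then a word w ~ Multinomial(topics[z]).
    topics[k] is the word distribution of topic k over `vocab`."""
    rng = random.Random(seed)
    # Dirichlet sample: normalised independent Gamma(alpha_k, 1) draws.
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    theta = [x / sum(g) for x in g]
    words = []
    for _ in range(doc_len):
        z = rng.choices(range(len(topics)), weights=theta)[0]
        w = rng.choices(range(len(vocab)), weights=topics[z])[0]
        words.append(vocab[w])
    return words
```

A small symmetric `alpha` (e.g. 0.1) makes `theta` concentrate on few topics, so sampled documents tend to stay on-topic, which is the per-document mixture behaviour the abstract describes.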
Clustering the results of a search helps the user get an overview of the information returned. In this paper, we
treat the clustering task as cataloguing the search results. By catalogue we mean a structured label
list that helps the user understand the labels and search results. Cluster labelling is crucial because
meaningless or confusing labels may mislead users into checking the wrong clusters for the query and losing extra
time. Additionally, labels should accurately reflect the contents of the documents within the cluster. To be able
to label clusters effectively, a new cluster labelling method is introduced, with emphasis on
producing comprehensible and accurate cluster labels in addition to discovering document clusters. We
also present a new metric for assessing the success of cluster labelling. We adopt a comparative
evaluation strategy to derive the relative performance of the proposed method with respect to two
prominent search result clustering methods: Suffix Tree Clustering and Lingo.
We perform the experiments using the publicly available datasets Ambient and ODP-239.
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION cscpconf
Feature clustering is a powerful method to reduce the dimensionality of feature vectors for text
classification. In this paper, fast fuzzy feature clustering for text classification is proposed. It
is based on the framework proposed by Jung-Yi Jiang, Ren-Jia Liou and Shie-Jue Lee in 2011.
Each word in the feature vector of a document is grouped into a cluster in fewer iterations. The
number of iterations required to obtain the cluster centers is reduced by transforming the cluster
center dimension from n dimensions to 2 dimensions. Principal Component Analysis with a slight
change is used for dimension reduction. Experimental results show that this method improves
performance by significantly reducing the number of iterations required to obtain the cluster
centers. The same is verified on three benchmark datasets.
Clustering Algorithm with a Novel Similarity MeasureIOSR Journals
This document proposes a new multi-viewpoint based similarity measure for clustering text documents that aims to overcome limitations of existing measures. Existing measures use a single viewpoint to measure similarity between documents, but the proposed measure uses multiple viewpoints to ensure clusters exhibit all relationships between documents. The empirical study found that using a multi-viewpoint similarity measure forms more meaningful clusters by capturing more informative relationships between documents.
IRJET- Text Document Clustering using K-Means Algorithm IRJET Journal
This document discusses using the K-Means clustering algorithm to cluster text documents and compares it to using K-Means clustering with dimension reduction techniques. It uses the BBC Sports dataset containing 737 documents in 5 classes. The document outlines preprocessing the text, creating a document term matrix, applying K-Means clustering, and using dimension reduction techniques like InfoGain before clustering. It evaluates the different methods using precision, recall, accuracy, and F-measure, finding that K-Means with InfoGain dimension reduction outperforms standard K-Means clustering.
Textual Data Partitioning with Relationship and Discriminative AnalysisEditor IJMTER
Data partitioning methods are used to partition data values by similarity, and similarity
measures are used to estimate transaction relationships. The hierarchical clustering model produces tree-structured
results, while partitional clustering produces results in a grid format. Text documents are
unstructured data values with high-dimensional attributes. Document clustering groups unlabeled text
documents into meaningful clusters. Traditional clustering methods require the cluster count (K) for the
document grouping process, and clustering accuracy degrades drastically with an unsuitable
cluster count.
Textual data elements are divided into two types: discriminative words and non-discriminative
words. Only discriminative words are useful for grouping documents; the involvement of
non-discriminative words confuses the clustering process and leads to poor clustering solutions.
A variational inference algorithm is used to infer the document collection structure and the partition of
document words at the same time. The Dirichlet Process Mixture (DPM) model is used to partition
documents; the DPM clustering model uses both the data likelihood and the clustering property of the
Dirichlet Process (DP). The Dirichlet Process Mixture Model for Feature Partition (DPMFP) is used to
discover the latent cluster structure based on the DPM model, and DPMFP clustering is performed without
requiring the number of clusters as input.
Document labels are used to guide the discriminative word identification process. Concept
relationships are analyzed with ontology support, and a semantic weight model is used for document
similarity analysis. The system improves scalability with the support of labels and concept relations
for the dimensionality reduction process.
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...IJORCS
The document proposes a privacy-preserving approach for hierarchical document clustering using maximal frequent item sets (MFI). First, MFI are identified from document collections using the Apriori algorithm to define clusters precisely. Then, the same MFI-based similarity measure is used to construct a hierarchy of clusters. This approach decreases dimensionality and avoids duplicate documents, thereby protecting individual copyrights. The methodology and algorithm are described in detail.
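The maximal-frequent-itemset step can be illustrated with a tiny Apriori over documents viewed as term sets, followed by a maximality filter. This is a sketch of the general technique, not the paper's algorithm.

```python
from itertools import combinations

def frequent_itemsets(docs, minsup):
    """Tiny Apriori: docs are sets of terms; returns all termsets
    contained in at least `minsup` documents, level by level."""
    items = sorted({t for d in docs for t in d})
    level = [frozenset([i]) for i in items]
    frequent = []
    while level:
        # Keep candidates meeting the support threshold.
        level = [s for s in level
                 if sum(1 for d in docs if s <= d) >= minsup]
        frequent.extend(level)
        # Generate next-level candidates by joining frequent sets.
        level = list({a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1})
    return frequent

def maximal(itemsets):
    """Keep only itemsets with no frequent proper superset (the MFI)."""
    return [s for s in itemsets
            if not any(s < t for t in itemsets)]
```

Representing each document only by the maximal frequent itemsets it contains is what gives the dimensionality reduction the abstract mentions: the MFI are far fewer than the raw terms.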
Text Segmentation for Online Subjective Examination using Machine LearningIRJET Journal
This document discusses using k-Nearest Neighbor (K-NN) machine learning for text segmentation of online subjective exams. K-NN is an instance-based learning method that computes similarity between feature vectors to determine the similarity between texts. The goal is to implement natural language processing through text segmentation. The document also reviews related work applying machine learning methods such as K-NN, support vector machines, and decision trees to tasks like text categorization and clustering.
Feature selection, optimization and clustering strategies of text documentsIJECEIAES
Clustering is one of the most researched areas of data mining applications in the contemporary literature. The need for efficient clustering is observed across wide sectors including consumer segmentation, categorization, shared filtering, document management, and indexing. Research on the clustering task must be performed before adapting it to the text environment. Conventional approaches typically emphasized quantitative information, where the selected features are numbers. Efforts have also been put forward to achieve efficient clustering in the context of categorical information, where the selected features can assume nominal values. This manuscript presents an in-depth analysis of the challenges of clustering in the text environment. Further, this paper also details prominent models proposed for clustering, along with the pros and cons of each model. In addition, it focuses on various recent developments in the clustering task in social networks and associated environments.
Semantic Based Model for Text Document Clustering with IdiomsWaqas Tariq
Text document clustering has become an increasingly important problem in recent years because of the tremendous amount of unstructured data available in various forms in online forums such as the web, social networks, and other information networks. Clustering is a very powerful data mining technique for organizing the large amount of information on the web. Traditionally, document clustering methods do not consider the semantic structure of the document. This paper addresses the task of developing an effective and efficient method that makes use of the semantic structure of text documents. A method has been developed that performs the following: tagging the documents for parsing, replacing idioms with their original meaning, calculating semantic weights for document words, and applying a semantic grammar. The similarity measure is computed between the documents, and the documents are then clustered using a hierarchical clustering algorithm. The method adopted in this work is evaluated on different data sets with standard performance measures, and its effectiveness in producing meaningful clusters has been demonstrated.
International Journal of Engineering Research and Development (IJERD)IJERD Editor
International Journal of Engineering Research and Development is an international premier peer reviewed open access engineering and technology journal promoting the discovery, innovation, advancement and dissemination of basic and transitional knowledge in engineering, technology and related disciplines.
This document discusses hierarchical clustering and similarity measures for document clustering. It summarizes that hierarchical clustering creates a hierarchical decomposition of data objects through either agglomerative or divisive approaches. The success of clustering depends on the similarity measure used, with traditional measures using a single viewpoint, while multiviewpoint measures use different viewpoints to increase accuracy. The paper then focuses on applying a multiviewpoint similarity measure to hierarchical clustering of documents.
A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERINGIJDKP
Fuzzy logic deals with degrees of truth. In this paper, we have shown how to apply fuzzy logic in text
mining in order to perform document clustering. We took an example of document clustering where the
documents had to be clustered into two categories. The method involved cleaning up the text and stemming
of words. Then, we chose ‘m’ features which differ significantly in their word frequencies (WF), normalized
by document length, between documents belonging to these two clusters. The documents to be clustered
were represented as a collection of ‘m’ normalized WF values. Fuzzy c-means (FCM) algorithm was used
to cluster these documents into two clusters. After the FCM execution finished, the documents in the two
clusters were analysed for the values of their respective ‘m’ features. It was known that documents
belonging to a document type ‘X’ tend to have higher WF values for some particular features. If the
documents belonging to a cluster had higher WF values for those same features, then that cluster was said
to represent ‘X’. With fuzzy logic, we obtain not only the cluster name but also the degree to which a document
belongs to a cluster.
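The membership update at the heart of FCM can be sketched as follows, with the standard fuzzifier m = 2 assumed; a full FCM run would alternate this step with a weighted centre update until convergence.

```python
def dist(p, q):
    """Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def fcm_memberships(points, centers, m=2.0):
    """One fuzzy c-means membership update: u[i][k] is the degree to
    which point i belongs to cluster k, with each row summing to 1.
    Distances are floored to avoid division by zero at a centre."""
    u = []
    for p in points:
        d = [max(dist(p, c), 1e-12) for c in centers]
        row = []
        for k in range(len(centers)):
            s = sum((d[k] / dj) ** (2.0 / (m - 1.0)) for dj in d)
            row.append(1.0 / s)
        u.append(row)
    return u
```

These graded memberships are exactly the "degree to which a document belongs to a cluster" that the abstract highlights as the advantage over hard k-means assignments.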
International Journal of Engineering and Science Invention (IJESI)inventionjournals
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews within the whole field Engineering Science and Technology, new teaching methods, assessment, validation and the impact of new technologies and it will continue to provide information on the latest trends and developments in this ever-expanding subject. The publications of papers are selected through double peer reviewed to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
Correlation Preserving Indexing Based Text ClusteringIOSR Journals
This document discusses a correlation preserving indexing (CPI) based text clustering method. CPI aims to find a low dimensional semantic subspace that maximizes correlation between similar documents while minimizing correlation between dissimilar documents. It is different from other methods like LSI and LPI that use Euclidean distance. The document outlines the CPI method and evaluates it on document clustering tasks, showing it doubles the accuracy of previous correlation-based methods. Hierarchical clustering algorithms are also discussed and compared to CPI in terms of evaluation metrics.
A rule based approach towards detecting human temperamentijcsit
This paper presents a rule-based system for detecting human temperament. The system was developed to
support an expert psychologist in properly predicting the temperament of an individual as well
as giving advice to the user. The system does this by following specified rules. From these, we have deduced
some features that make up the known temperament types, from which the system can accurately classify the
user's temperament based on the person's characteristics. Our work is solely limited to temperament; any
expert advice sought from and given by the system is limited to this scope.
This document provides tips and sample answers for common interview questions for a legal assistant position. It discusses how to answer questions about yourself, your strengths, career goals, reasons for leaving previous jobs, weaknesses, knowledge of the organization, and ways you've improved your skills. For each question, it offers steps and strategies for crafting effective responses that highlight your relevant qualifications and experience. Sample answers are provided for questions about your background, work history, goals, and steps taken to further your knowledge.
An iterative morphological decomposition algorithm for reduction of skeleton ...ijcsit
Shape representation is an important aspect of image processing and computer vision. There are several skeleton transforms that lead to morphological shape representation algorithms. One of the main problems with these algorithms is selecting the skeleton points that represent the shape component. If the number of skeleton subsets is reduced, then the reconstruction process becomes easier and less time-consuming. The present paper proposes a skeleton scheme that selects skeleton points based on the largest shape element, thereby reducing the overall number of skeleton subsets. The present method is applied to various images and is compared with the generalized skeleton transform and the octagon-generating decomposition algorithm.
PREDICTION FOR SHORT-TERM TRAFFIC FLOW BASED ON OPTIMIZED W...ijcsit
Short-term traffic forecasting has been a very important consideration in many areas of transportation research for more than 3 decades. Short-term traffic forecasting based on data-driven methods is one of the most dynamic and developing research arenas, with an enormous published literature. In order to improve the forecasting accuracy of the wavelet neural network model, an adaptive particle swarm optimization algorithm based on cloud theory was proposed, not only to improve search performance but also to speed up individual optimizing ability. The inertia weight adaptively changes depending on an X-conditional cloud generator, which has both a stable tendency and a randomness property. The adaptive particle swarm optimization algorithm based on cloud theory was then used to optimize the weights and thresholds of the wavelet BP neural network, instead of the traditional gradient descent method. Finally, the wavelet BP neural network was trained to search for the optimal solution. Based on the above theory, an improved wavelet neural network model based on a modified particle swarm optimization algorithm was proposed, and the availability of the modified prediction method was demonstrated by predicting a time series of real traffic flow. Computer simulations have shown that the nonlinear fitting and accuracy of the modified prediction method are better than those of other prediction methods.
Template matching is a basic method in image analysis for extracting useful information from images. In this paper, we suggest a new method for pattern matching. Our method transforms the template image from a two-dimensional image into a one-dimensional vector; likewise, all sub-windows (of the same size as the template) in the reference image are transformed into one-dimensional vectors. Three similarity measures, SAD, SSD, and Euclidean distance, are used to compute the likeness between the template and all sub-windows in the reference image to find the best match. The experimental results show the superior performance of the proposed method over conventional methods on various templates of different sizes.
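The pipeline described above (flatten the template and every same-size sub-window, then score each pair with SAD, SSD, or Euclidean distance) can be sketched in a few lines of numpy. The function name and toy images below are illustrative, not the authors' implementation:

```python
import numpy as np

def match_template(reference, template, measure="sad"):
    """Slide the flattened template over every same-size sub-window of the
    reference image and return the (row, col) of the best-scoring match."""
    th, tw = template.shape
    t = template.ravel().astype(float)          # 2-D template -> 1-D vector
    best_score, best_pos = None, None
    for r in range(reference.shape[0] - th + 1):
        for c in range(reference.shape[1] - tw + 1):
            w = reference[r:r+th, c:c+tw].ravel().astype(float)
            if measure == "sad":                # sum of absolute differences
                score = np.abs(w - t).sum()
            elif measure == "ssd":              # sum of squared differences
                score = ((w - t) ** 2).sum()
            else:                               # Euclidean distance
                score = np.sqrt(((w - t) ** 2).sum())
            if best_score is None or score < best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score
```

All three measures are minimized at the best match; a score of zero means the sub-window equals the template exactly.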
This document presents an algorithm for a semantic-based similarity measure (SBSM) to improve text clustering. The algorithm assigns semantic weights to document terms and phrases based on their use as arguments in proposition bank notation. It calculates similarity between a document and a query based on matching weighted terms and phrases. Experimental results on a dataset show that SBSM using proposition bank notation improves performance over traditional measures like cosine and Jaccard similarity. The algorithm captures semantic information within documents for more accurate similarity assessment and clustering.
This document summarizes a research paper that introduces a novel multi-viewpoint similarity measure for clustering text documents. The paper begins with background on commonly used similarity measures like Euclidean distance and cosine similarity. It then presents the novel multi-viewpoint measure, which considers multiple viewpoints (objects not assumed to be in the same cluster) rather than a single viewpoint. The paper proposes two new clustering criterion functions based on this measure and compares them to other algorithms on benchmark datasets. The goal is to develop a similarity measure and clustering methods that provide high-quality, consistent performance like k-means but can better handle sparse, high-dimensional text data.
The document describes latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA represents documents as random mixtures over latent topics, characterized by distributions over words. It is a three-level hierarchical Bayesian model where documents are generated by first sampling a per-document topic distribution from a Dirichlet prior, then repeatedly sampling topics and words from these distributions. LDA addresses limitations of previous models by capturing statistical structure within and between documents through the hierarchical Bayesian formulation.
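The three-level generative story summarized above can be stated directly in code. The sketch below samples one document under LDA's generative process (sampling only, no inference; parameter names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(n_words, alpha, beta):
    """Sample one document under the LDA generative story:
    theta ~ Dirichlet(alpha) is the per-document topic mixture; each word
    position first draws a topic z ~ Mult(theta), then a word
    w ~ Mult(beta[z]), where beta[z] is topic z's distribution over words."""
    theta = rng.dirichlet(alpha)                  # per-document topic weights
    words = []
    for _ in range(n_words):
        z = rng.choice(len(alpha), p=theta)       # topic for this position
        w = rng.choice(beta.shape[1], p=beta[z])  # word drawn from topic z
        words.append(w)
    return theta, words
```

With two topics over a four-word vocabulary, each generated document mixes the two topics in proportions drawn freshly from the Dirichlet prior, which is exactly the "random mixture over latent topics" described above.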
Clustering the results of a search helps the user to overview the information returned. In this paper, we look upon the clustering task as cataloguing the search results. By catalogue we mean a structured label list that can help the user relate the labels to the search results. Cluster labelling is crucial because meaningless or confusing labels may mislead users into checking the wrong clusters for the query and losing extra time. Additionally, labels should accurately reflect the contents of the documents within the cluster. To label clusters effectively, a new cluster labelling method is introduced, with emphasis on producing comprehensible and accurate cluster labels in addition to discovering the document clusters. We also present a new metric to assess the success of cluster labelling. We adopt a comparative evaluation strategy to derive the relative performance of the proposed method with respect to two prominent search result clustering methods, Suffix Tree Clustering and Lingo, performing the experiments on the publicly available Ambient and ODP-239 datasets.
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION cscpconf
Feature clustering is a powerful method to reduce the dimensionality of feature vectors for text classification. In this paper, fast fuzzy feature clustering for text classification is proposed, based on the framework introduced by Jung-Yi Jiang, Ren-Jia Liou and Shie-Jue Lee in 2011. Each word in a document's feature vector is grouped into a cluster in fewer iterations: the number of iterations required to obtain cluster centers is reduced by transforming the cluster center dimension from n dimensions to 2 dimensions, using Principal Component Analysis with a slight change for dimension reduction. Experimental results on three benchmark datasets show that this method improves performance by significantly reducing the number of iterations required to obtain the cluster centers.
Clustering Algorithm with a Novel Similarity MeasureIOSR Journals
This document proposes a new multi-viewpoint based similarity measure for clustering text documents that aims to overcome limitations of existing measures. Existing measures use a single viewpoint to measure similarity between documents, but the proposed measure uses multiple viewpoints to ensure clusters exhibit all relationships between documents. The empirical study found that using a multi-viewpoint similarity measure forms more meaningful clusters by capturing more informative relationships between documents.
IRJET- Text Document Clustering using K-Means Algorithm IRJET Journal
This document discusses using the K-Means clustering algorithm to cluster text documents and compares it to using K-Means clustering with dimension reduction techniques. It uses the BBC Sports dataset containing 737 documents in 5 classes. The document outlines preprocessing the text, creating a document term matrix, applying K-Means clustering, and using dimension reduction techniques like InfoGain before clustering. It evaluates the different methods using precision, recall, accuracy, and F-measure, finding that K-Means with InfoGain dimension reduction outperforms standard K-Means clustering.
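The core step of the pipeline above, K-Means applied to a document-term matrix, can be sketched with plain numpy (Lloyd's algorithm; the text preprocessing and InfoGain dimension-reduction stages are omitted, and the toy data is illustrative):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm on a document-term matrix X (rows = documents).
    Returns a cluster label per document and the final centroids."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each document to the nearest centroid (squared Euclidean)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # move each centroid to the mean of the documents assigned to it
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```

On term-count rows this recovers the obvious grouping when documents use clearly different vocabularies; in practice tf-idf weighting before clustering is the usual refinement.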
Textual Data Partitioning with Relationship and Discriminative AnalysisEditor IJMTER
Data partitioning methods are used to partition data values by similarity, with similarity measures used to estimate transaction relationships. Hierarchical clustering models produce tree-structured results; partitional clustering produces results in a grid format. Text documents are unstructured data values with high-dimensional attributes, and document clustering groups unlabeled text documents into meaningful clusters. Traditional clustering methods require the cluster count (K) for the document grouping process, and clustering accuracy degrades drastically under an unsuitable cluster count.
Textual data elements are divided into two types, discriminative words and non-discriminative words. Only discriminative words are useful for grouping documents; the involvement of non-discriminative words confuses the clustering process and leads to poor clustering solutions. A variational inference algorithm is used to infer the document collection structure and the partition of document words at the same time. A Dirichlet Process Mixture (DPM) model is used to partition documents; the DPM clustering model uses both the data likelihood and the clustering property of the Dirichlet Process (DP). The Dirichlet Process Mixture Model for Feature Partition (DPMFP) is used to discover the latent cluster structure based on the DPM model, and DPMFP clustering is performed without requiring the number of clusters as input.
Document labels are used to estimate the discriminative word identification process. Concept relationships are analyzed with ontology support, and a semantic weight model is used for document similarity analysis. The system improves scalability with the support of labels and concept relations for the dimensionality reduction process.
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...IJORCS
The document proposes a privacy-preserving approach for hierarchical document clustering using maximal frequent item sets (MFI). First, MFI are identified from document collections using the Apriori algorithm to define clusters precisely. Then, the same MFI-based similarity measure is used to construct a hierarchy of clusters. This approach decreases dimensionality and avoids duplicate documents, thereby protecting individual copyrights. The methodology and algorithm are described in detail.
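The first stage described above, finding maximal frequent item sets with an Apriori-style search, can be sketched in pure Python. This is a naive, illustrative version (recomputing support by scanning all transactions), not the paper's optimized implementation:

```python
def maximal_frequent_itemsets(transactions, min_support):
    """Apriori-style level-wise search for frequent itemsets, keeping only
    the maximal ones (those with no frequent proper superset).
    `transactions` is a list of sets of terms."""
    items = sorted({i for t in transactions for i in t})

    def support(itemset):
        # number of transactions containing every item of the itemset
        return sum(1 for t in transactions if itemset <= t)

    frequent, size = [], 1
    current = [frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support]
    while current:
        frequent.extend(current)
        size += 1
        # candidates: unions of frequent sets that are one item larger
        candidates = {a | b for a in current for b in current
                      if len(a | b) == size}
        current = [c for c in candidates if support(c) >= min_support]
    # keep only itemsets with no frequent proper superset
    return [f for f in frequent if not any(f < g for g in frequent)]
```

By the Apriori property, every subset of a frequent itemset is frequent, so the level-wise candidate generation never misses a frequent set; the final filter then discards the non-maximal ones.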
Text Segmentation for Online Subjective Examination using Machine LearningIRJET Journal
This document discusses using k-Nearest Neighbor (K-NN) machine learning for text segmentation of online exams. K-NN is an instance-based learning method that computes similarity between feature vectors to determine the similarity between texts. The goal is to apply natural language processing through text segmentation. It reviews related work applying various machine learning methods, such as K-NN, support vector machines, and decision trees, to tasks like text categorization and clustering.
Feature selection, optimization and clustering strategies of text documentsIJECEIAES
Clustering is one of the most researched areas of data mining applications in the contemporary literature. The need for efficient clustering is observed across wide sectors, including consumer segmentation, categorization, collaborative filtering, document management, and indexing. Research on the clustering task must be performed before adapting it to the text environment. Conventional approaches typically emphasized quantitative information, where the selected features are numbers; efforts have also been put forward to achieve efficient clustering in the context of categorical information, where the selected features can assume nominal values. This manuscript presents an in-depth analysis of the challenges of clustering in the text environment. Further, this paper details prominent models proposed for clustering, along with the pros and cons of each model. In addition, it focuses on the latest developments in the clustering task in social networks and associated environments.
Semantic Based Model for Text Document Clustering with IdiomsWaqas Tariq
Text document clustering has become an increasingly important problem in recent years because of the tremendous amount of unstructured data available in various forms in online forums such as the web, social networks, and other information networks. Clustering is a very powerful data mining technique for organizing the large amount of information on the web. Traditionally, document clustering methods do not consider the semantic structure of the document. This paper addresses the task of developing an effective and efficient method to improve the semantic structure of text documents. A method has been developed that performs the following: tag the documents for parsing, replace idioms with their original meaning, calculate semantic weights for document words, and apply a semantic grammar. The similarity measure is obtained between the documents, and the documents are then clustered using a hierarchical clustering algorithm. The method adopted in this work is evaluated on different data sets with standard performance measures, and its effectiveness in developing meaningful clusters has been demonstrated.
International Journal of Engineering Research and Development (IJERD)IJERD Editor
International Journal of Engineering Research and Development is an international premier peer reviewed open access engineering and technology journal promoting the discovery, innovation, advancement and dissemination of basic and transitional knowledge in engineering, technology and related disciplines.
This document discusses hierarchical clustering and similarity measures for document clustering. It summarizes that hierarchical clustering creates a hierarchical decomposition of data objects through either agglomerative or divisive approaches. The success of clustering depends on the similarity measure used, with traditional measures using a single viewpoint, while multiviewpoint measures use different viewpoints to increase accuracy. The paper then focuses on applying a multiviewpoint similarity measure to hierarchical clustering of documents.
A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERINGIJDKP
Fuzzy logic deals with degrees of truth. In this paper, we have shown how to apply fuzzy logic in text
mining in order to perform document clustering. We took an example of document clustering where the
documents had to be clustered into two categories. The method involved cleaning up the text and stemming
of words. Then, we chose ‘m’ features which differ significantly in their word frequencies (WF), normalized
by document length, between documents belonging to these two clusters. The documents to be clustered
were represented as a collection of ‘m’ normalized WF values. Fuzzy c-means (FCM) algorithm was used
to cluster these documents into two clusters. After the FCM execution finished, the documents in the two
clusters were analysed for the values of their respective ‘m’ features. It was known that documents
belonging to a document type ‘X’ tend to have higher WF values for some particular features. If the
documents belonging to a cluster had higher WF values for those same features, then that cluster was said
to represent ‘X’. By fuzzy logic, we not only get the cluster name but also the degree to which a document belongs to a cluster.
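The FCM step described above alternates between computing membership-weighted cluster centers and recomputing membership degrees from the distances to those centers. A minimal numpy sketch of that loop (the fuzzifier m and the toy document vectors are illustrative):

```python
import numpy as np

def fcm(X, c, m=2.0, iters=100, seed=0):
    """Fuzzy c-means: each document gets a membership degree in every
    cluster rather than a hard label. U[i, j] is how strongly document i
    belongs to cluster j; each row of U sums to 1."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(iters):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]   # weighted means
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-12
        U = 1.0 / (d ** (2.0 / (m - 1.0)))               # closer -> larger
        U /= U.sum(axis=1, keepdims=True)                # rows sum to 1
    return U, centers
```

The membership update is the standard FCM formula u_ij ∝ d_ij^(-2/(m-1)); reading off each document's largest membership gives the crisp cluster assignment, while the membership value itself gives the degree of belonging discussed above.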
International Journal of Engineering and Science Invention (IJESI)inventionjournals
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews within the whole field Engineering Science and Technology, new teaching methods, assessment, validation and the impact of new technologies and it will continue to provide information on the latest trends and developments in this ever-expanding subject. The publications of papers are selected through double peer reviewed to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
Correlation Preserving Indexing Based Text ClusteringIOSR Journals
This document discusses a correlation preserving indexing (CPI) based text clustering method. CPI aims to find a low dimensional semantic subspace that maximizes correlation between similar documents while minimizing correlation between dissimilar documents. It is different from other methods like LSI and LPI that use Euclidean distance. The document outlines the CPI method and evaluates it on document clustering tasks, showing it doubles the accuracy of previous correlation-based methods. Hierarchical clustering algorithms are also discussed and compared to CPI in terms of evaluation metrics.
A rule based approach towards detecting human temperamentijcsit
This paper presents a rule-based system for detecting human temperament. The system was developed to support an expert psychologist in properly predicting the temperament of an individual, as well as giving advice to the user, by following specified rules. From known temperament types we have deduced a set of features from which the system can accurately classify the user's temperament based on the person's characteristics. Our work is limited solely to temperament; any expert advice sought from and given by the system is limited to this scope.
INTEGRATIONS OF ICT IN EDUCATION SECTOR FOR THE ADVANCEMENT OF THE DEVELOPING...ijcsit
This document discusses challenges and recommendations regarding integrating information and communication technologies (ICT) into the education sector in Bangladesh. Some key challenges discussed include a lack of ICT infrastructure and funding, as well as issues related to proper vision/planning, social/cultural factors, teachers' attitudes and skills, and a lack of convergence between technology and education. Recommendations provided to address these challenges include building teachers' ICT skills, learning from other countries' approaches, connecting schools digitally, establishing a dedicated IT group, and implementing smart classroom technologies on a limited pilot basis. An overall national ICT policy for education is also recommended.
Multi objective predictive control a solution using metaheuristicsijcsit
The application of multi-objective model predictive control approaches is significantly limited by the computation time associated with optimization algorithms. Metaheuristics are general-purpose heuristics that have been used successfully to solve difficult optimization problems in reasonable computation time. In this work, we use and compare two multi-objective metaheuristics, Multi-Objective Particle Swarm Optimization (MOPSO) and the Multi-Objective Gravitational Search Algorithm (MOGSA), to generate a set of approximately Pareto-optimal solutions in a single run. Two examples are studied: a nonlinear system consisting of two mobile robots tracking trajectories and avoiding obstacles, and a linear multivariable system. The computation times and the quality of the solutions, in terms of the smoothness of the control signals and the precision of tracking, show that MOPSO can be an alternative for real-time applications.
This document summarizes a research paper on applying a multiviewpoint-based similarity measure to hierarchical document clustering. It begins by introducing document clustering and hierarchical clustering. It then discusses traditional similarity measures used for clustering and introduces a new multiviewpoint-based similarity measure (MVS) that uses multiple reference points to more accurately assess similarity. The paper applies MVS to both hierarchical and k-means clustering algorithms and evaluates the accuracy, precision, and recall of the resulting clusters. It finds that hierarchical clustering with MVS achieves better performance than k-means clustering with MVS based on these evaluation metrics.
Seeds Affinity Propagation Based on Text ClusteringIJRES Journal
The objective is to find, among all partitions of the data set, the best partition according to some quality measure. Affinity propagation is a low-error, high-speed, flexible, and remarkably simple clustering algorithm that may be used in forming teams of participants for business simulations and experiential exercises, and in organizing participants' preferences for the parameters of simulations. This paper proposes an efficient affinity propagation algorithm that guarantees the same clustering result as the original algorithm after convergence. The heart of our approach is (1) to prune unnecessary message exchanges during the iterations and (2) to compute the convergence values of pruned messages after the iterations to determine clusters.
1) The document discusses different clustering algorithms for text summarization including hierarchical clustering, query-based summarization, graph theoretic clustering, fuzzy c-means clustering, and DBSCAN clustering.
2) These algorithms are evaluated based on performance parameters like precision, recall, time complexity, space complexity, and summary quality.
3) The algorithm found to perform best based on these evaluations will be suggested as the better algorithm for query-dependent text document summarization.
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringIJMER
International Journal of Modern Engineering Research (IJMER) is Peer reviewed, online Journal. It serves as an international archival forum of scholarly research related to engineering and science education.
International Journal of Modern Engineering Research (IJMER) covers all the fields of engineering and science: Electrical Engineering, Mechanical Engineering, Civil Engineering, Chemical Engineering, Computer Engineering, Agricultural Engineering, Aerospace Engineering, Thermodynamics, Structural Engineering, Control Engineering, Robotics, Mechatronics, Fluid Mechanics, Nanotechnology, Simulators, Web-based Learning, Remote Laboratories, Engineering Design Methods, Education Research, Students' Satisfaction and Motivation, Global Projects, Assessment, and many more.
This document proposes using an enhanced suffix tree approach to measure semantic similarity between multiple documents. It involves preprocessing documents by removing stop words, special characters, and converting to lowercase. Phrases are extracted and used to construct a suffix tree, where internal nodes represent phrases shared across documents. Term frequency-inverse document frequency (tf-idf) is used to calculate weights for internal nodes. Cosine, Dice, and Hellinger similarity measures are then applied to calculate pairwise similarities between documents based on the weighted internal nodes. The approach aims to efficiently and accurately measure semantic similarity between documents.
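The three measures named above have standard forms on weighted vectors; a small Python sketch of each, as they might be applied to the node-weight vectors of two documents (the Hellinger variant here is the Bhattacharyya-style affinity on distribution-like vectors, which may differ in detail from the paper's exact formulation):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def dice(u, v):
    """Extended Dice coefficient for weighted vectors:
    2*(u.v) / (||u||^2 + ||v||^2)."""
    dot = sum(a * b for a, b in zip(u, v))
    return 2 * dot / (sum(a * a for a in u) + sum(b * b for b in v))

def hellinger(u, v):
    """Bhattacharyya-style affinity: normalize the non-negative weight
    vectors to distributions; 1 = identical, 0 = no shared support."""
    su, sv = sum(u), sum(v)
    return sum(math.sqrt((a / su) * (b / sv)) for a, b in zip(u, v))
```

All three return values in [0, 1] for non-negative tf-idf weights, so the resulting pairwise document similarities are directly comparable.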
Study, analysis and formulation of a new method for integrity protection of d...ijsrd.com
This document discusses a text-based fuzzy clustering algorithm to filter spam emails. It begins with an introduction discussing how most classification approaches are for structured data but large amounts of unstructured data are transmitted online. It then discusses spam emails being a major problem and filtering being an important approach. The paper aims to use a fuzzy clustering approach called Fuzzy C-Means to classify emails. It describes the training and testing modules, which extract features from emails to create vector space models and then applies the fuzzy clustering algorithm to determine if emails are spam or not spam. Evaluation results show the precision and accuracy of the approach on different datasets, with the author concluding the vector space model with fuzzy C-Means works well for both small and large datasets.
Text Based Fuzzy Clustering Algorithm to Filter Spam E-mailijsrd.com
This document summarizes a research paper that proposes using a text-based fuzzy clustering algorithm to filter spam emails. Specifically, it uses a fuzzy c-means algorithm on a vector space model of emails. The vector space model represents emails as vectors of weighted terms. The fuzzy c-means algorithm allows emails to belong to more than one cluster (spam or not spam) with varying membership degrees. The paper evaluates the proposed system on a testing dataset, calculating precision and accuracy metrics. Results show the approach works well for both small and large datasets, with larger training datasets yielding higher accuracy.
A Novel Clustering Method for Similarity Measuring in Text DocumentsIJMER
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...ijdmtaiir
In this study a comprehensive evaluation of two supervised feature selection methods for dimensionality reduction is performed: Latent Semantic Indexing (LSI) and Principal Component Analysis (PCA). These are gauged against unsupervised techniques such as fuzzy feature clustering using hard fuzzy C-means (FCM). The main objective of the study is to estimate the relative efficiency of the two supervised techniques against unsupervised fuzzy techniques in reducing the feature space. It is found that clustering using FCM leads to better accuracy in classifying documents than algorithms like LSI and PCA. The results show that the clustering of features improves the accuracy of document classification.
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
Reviews on swarm intelligence algorithms for text document clusteringIRJET Journal
This document reviews swarm intelligence algorithms that have been used for text document clustering. It discusses how text clustering is an unsupervised learning technique that groups similar documents into clusters while separating dissimilar documents. Various swarm intelligence algorithms like particle swarm optimization, artificial bee colony, grey wolf optimizer, and krill herd have been applied to text document clustering problems. The document surveys previous research that has used these swarm intelligence algorithms for text clustering and discusses their advantages and limitations. It aims to provide readers an overview of the different swarm intelligence algorithms available for text document clustering applications.
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...IRJET Journal
This document discusses several approaches for clustering textual documents, including:
1. TF-IDF, word embedding, and K-means clustering are proposed to automatically classify and organize documents.
2. Previous work on document clustering is reviewed, including partition-based techniques like K-means and K-medoids, hierarchical clustering, and approaches using semantic features, PSO optimization, and multi-view clustering.
3. Challenges of clustering large document collections at scale are discussed, along with potential solutions using frameworks like Hadoop.
A number of benefits have been reported for computer-based assessments over traditional paper-based exams, in terms of IT support for question development, reduced distribution and test administration costs, and the automated support made possible for ranking. However, existing computerized assessment systems do not support all kinds of questions, namely open questions that require written solutions. To overcome these limitations, the objective of this work is to build an intelligent evaluation system (IES) that responds to the problems identified and adapts to the different types of questions, especially open-ended questions whose answers require sentence writing or programming.
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMSijdkp
Subspace clustering discovers the clusters embedded in multiple, overlapping subspaces of high
dimensional data. Many significant subspace clustering algorithms exist, each having different
characteristics caused by the use of different techniques, assumptions, heuristics used etc. A comprehensive
classification scheme is essential which will consider all such characteristics to divide subspace clustering
approaches in various families. The algorithms belonging to same family will satisfy common
characteristics. Such a categorization will help future developers to better understand the quality criteria to
be used and similar algorithms to be used to compare results with their proposed clustering algorithms. In
this paper, we first proposed the concept of SCAF (Subspace Clustering Algorithms’ Family).
Characteristics of SCAF will be based on the classes such as cluster orientation, overlap of dimensions etc.
As an illustration, we further provided a comprehensive, systematic description and comparison of few
significant algorithms belonging to “Axis parallel, overlapping, density based” SCAF.
COMPARISON OF HIERARCHICAL AGGLOMERATIVE ALGORITHMS FOR CLUSTERING MEDICAL DO...ijseajournal
Extensive amount of data stored in medical documents require developing methods that help users to find
what they are looking for effectively by organizing large amounts of information into a small number of
meaningful clusters. The produced clusters contain groups of objects which are more similar to each other
than to the members of any other group. Thus, the aim of high-quality document clustering algorithms is to
determine a set of clusters in which the inter-cluster similarity is minimized and intra-cluster similarity is
maximized. The most important feature in many clustering algorithms is treating the clustering problem as
an optimization process, that is, maximizing or minimizing a particular clustering criterion function
defined over the whole clustering solution. The only real difference between agglomerative algorithms is
how they choose which clusters to merge. The main purpose of this paper is to compare different
agglomerative algorithms based on the evaluation of the clusters quality produced by different hierarchical
agglomerative clustering algorithms using different criterion functions for the problem of clustering
medical documents. Our experimental results showed that the agglomerative algorithm that uses I1 as its
criterion function for choosing which clusters to merge produced better clusters quality than the other
criterion functions in term of entropy and purity as external measures.
An effective pre processing algorithm for information retrieval systemsijdms
The Internet is probably the most successful distributed computing system ever. However, our capabilities
for data querying and manipulation on the internet are primordial at best. The user expectations are
enhancing over the period of time along with increased amount of operational data past few decades. The
data-user expects more deep, exact, and detailed results. Result retrieval for the user query is always
relative o the pattern of data storage and index. In Information retrieval systems, tokenization is an
integrals part whose prime objective is to identifying the token and their count. In this paper, we have
proposed an effective tokenization approach which is based on training vector and result shows that
efficiency/ effectiveness of proposed algorithm. Tokenization on documents helps to satisfy user’s
information need more precisely and reduced search sharply, is believed to be a part of information
retrieval. Pre-processing of input document is an integral part of Tokenization, which involves preprocessing
of documents and generates its respective tokens which is the basis of these tokens probabilistic
IR generate its scoring and gives reduced search space. The comparative analysis is based on the two
parameters; Number of Token generated, Pre-processing time.
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD Editor
This document discusses distance similarity measures that can be used for data mining classification and clustering techniques. It proposes a novel distance similarity measure called "Supervised & Unsupervised learning" that uses Euclidean distance similarity to partition training data into clusters. It then builds decision trees on each cluster to improve classification performance. The document also discusses using these measures for other applications like image processing, where k-means clustering can be used to segment images into clusters of similar pixel intensities. In conclusion, it states these similarity measures can help analyze complex datasets for business analysis purposes.
The document summarizes research on multi-document summarization using EM clustering. It begins with an introduction to the topic and issues with existing techniques. It then proposes using Expectation-Maximization (EM) clustering to identify clusters, which improves over other methods by identifying latent semantic variables between sentences. The architecture involves preprocessing, EM clustering, mutual reinforcement ranking algorithms RARP and RDRP, summarization, and post-processing. Experimental results on DUC2007 data show EM clustering identifies more clusters and sentences than affinity propagation clustering. The technique aims to improve summarization accuracy by better capturing semantic relationships between sentences.
Elevating forensic investigation system for file clusteringeSAT Journals
Abstract In computer forensic investigation, thousands of files are usually surveyed. Much of the data in those files consists of formless manuscript, whose investigation by computer examiners is very tough to accomplish. Clustering is the unverified organization of designs that is data items, remarks, or feature vectors into groups (clusters). To find a noble clarification for this automated method of analysis are of great interest. In particular, algorithms such as K-means, K-medoids, Single Link, Complete Link and Average Link can simplify the detection of new and valuable information from the documents under investigation. This paper is going to present an tactic that applies text clustering algorithms to forensic examination of computers seized in police investigations using multithreading technique for data clustering. Keywords- Clustering, forensic computing, text mining, multithreading.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
International Journal of Computer Science & Information Technology (IJCSIT) Vol 7, No 3, June 2015
DOI:10.5121/ijcsit.2015.7306
A CLUSTERING TECHNIQUE FOR EMAIL CONTENT
MINING
Deepa Patil and Yashwant Dongre
Department of Computer Engineering, VIIT, Pune, India
ABSTRACT
In today's world of the internet, with a large volume of e-documents such as HTML pages and digital libraries occupying considerable cyberspace, organizing these documents has become a practical need. Clustering is an important technique that organizes a large number of objects into smaller coherent groups. This helps in the efficient and effective use of these documents for information retrieval and other NLP tasks. Email is one of the most frequently used e-documents by individuals and organizations. Email categorization is one of the major tasks of email mining. Categorizing emails into different groups supports easy retrieval and maintenance. Like other e-documents, emails can also be grouped using clustering algorithms. In this paper, a similarity measure called the Similarity Measure for Text Processing (SMTP) is suggested for email clustering. The suggested similarity measure takes into account three situations: a feature appears in both emails, a feature appears in only one email, and a feature appears in neither email. The effectiveness of the suggested similarity measure is analyzed on the Enron email data set for categorizing emails. The outcome indicates that the accuracy achieved by the suggested similarity measure is better than that achieved by other measures.
KEYWORDS
Similarity Measure, Clustering Algorithm, Document Clustering, Email Mining
1. INTRODUCTION
We are living in a cyber world flooded with information. Efficient and accurate retrieval of information is important for the survival of many web portals. Text processing is an important aspect of information retrieval, data mining and web search. It is equally important to cluster emails into different groups, so that similar emails, i.e. emails containing a given search string or keywords, can be retrieved together. We can further group the mails based on time stamp for easier retrieval.

As the amount of digital documents has been increasing dramatically over the years with the growth of the internet, information management, search and retrieval have become an important concern. A similarity measure plays an important role in text classification and clustering algorithms. Often, a bag-of-words model [1], [2], [3] is used in information retrieval or text processing tasks, wherein a document is modelled as the collection of unique words it contains together with their frequencies; word order is completely ignored. The importance of a word in a document can be decided based on its term frequency (the number of times the word appears in the document), its relative term frequency (the ratio between the term frequency and the total number of occurrences of all terms in the document set) or its tf-idf (a combination of term frequency and inverse document frequency) [4]. The documents are converted into a matrix, where each word adds a new dimension, forming a row of the matrix, and each document is represented as a column vector. Each entry in the matrix thus gives the frequency of a word in a particular document, and it is easy to see that the matrix is sparse. The higher the frequency of a word in a document, the more descriptive it is of that document.
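As an illustration of this representation, here is a minimal sketch that builds such a term-document matrix with tf-idf weights (function names and the toy corpus are my own, not from the paper):

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build a term-document matrix of tf-idf weights.

    docs: list of token lists; returns (sorted vocabulary, matrix) where
    matrix[i][j] is the tf-idf weight of term i in document j.
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    vocab = sorted({term for doc in docs for term in doc})
    # document frequency: number of documents containing each term
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    matrix = []
    for t in vocab:
        idf = math.log(n_docs / df[t])               # inverse document frequency
        matrix.append([c[t] * idf for c in counts])  # tf * idf per document
    return vocab, matrix

vocab, m = tfidf_matrix([["sales", "report", "q1"],
                         ["sales", "meeting"],
                         ["meeting", "agenda"]])
# "q1" and "agenda" appear in only one document each, so they get the largest idf
```

Note how a term occurring in every document would receive idf = log(1) = 0, i.e. zero weight, which matches the intuition that ubiquitous words are not descriptive.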
Clustering is a technique that organizes a large number of objects into smaller coherent groups. Clustering aims at grouping similar documents together and separating each group as much as possible from groups that contain information on entirely different topics. Clustering algorithms [4] require a metric to quantify how similar or different two given documents are. This difference is often measured by a distance measure, and such measures are called similarity measures. Clustering thus requires the definition of a distance measure that assigns a numeric value to the extent of difference between two documents, which the clustering algorithm uses to partition a given dataset into groups.
Many similarity measures exist for computing the similarity between two documents. Euclidean distance [5] is one of the most popular similarity metrics, taken from the field of Euclidean geometry. Cosine similarity [4] is a measure that takes the cosine of the angle between two vectors. The Jaccard coefficient [6] is a statistic used for comparing the similarity of two document sets, defined as the size of the intersection divided by the size of the union of the sample sets. An information-theoretic measure for document similarity called IT_Sim [7], [8] is a phrase-based measure which computes similarity based on the Suffix Tree Document Model. Pairwise-adaptive similarity [9] is a measure which dynamically selects a number of features out of documents d1 and d2. In [7], [10] the Hamming distance is used; the Hamming distance between two document vectors is the number of positions at which the corresponding symbols differ. The Kullback-Leibler divergence [11] is a non-symmetric measure of the difference between the probability distributions associated with two vectors.
Many a time, it is essential to search email contents to retrieve mails containing similar content or keywords. It is equally essential to search the email body in order to put messages into different groups, or to bring up the emails that contain a given search string or search keywords. This feature is very useful in the business domain. For example, if an employee in an organization wants to retrieve emails containing information about recent sales in a particular area, he or she can specify the search string and search options accordingly and fetch the emails containing the required information.
Our work focuses on implementing the k-means clustering algorithm [12], [13] along with the Similarity Measure for Text Processing (SMTP) [14] on an email data set to categorize emails into different groups. The main purpose of this work is to test the effectiveness of SMTP used with the k-means clustering algorithm for email clustering.

SMTP has many advantages. It considers the presence or absence of features rather than the difference between the two values associated with a present feature. It ensures that the similarity degree increases when the difference between two non-zero values of a specific feature decreases, and that the similarity degree decreases when the number of presence-absence features (features present in only one of the two documents) increases. SMTP also takes into account that two documents are least similar to each other if none of the features has non-zero values in both documents. SMTP is a symmetric similarity measure. Finally, and most importantly, it considers the standard deviation of each feature when computing that feature's contribution to the similarity between two documents.
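The general shape of a measure with these properties can be sketched as follows. This is only an illustration patterned after the definition in [14]; the parameter names (sigma, lam) are my own, and the exact formula should be taken from that paper before any real use:

```python
import math

def smtp(d1, d2, sigma, lam=1.0):
    """Sketch of an SMTP-style similarity (see [14] for the exact definition).

    d1, d2: equal-length feature vectors (e.g. term frequencies);
    sigma:  per-feature standard deviations over the corpus;
    lam:    penalty parameter for presence-absence features.
    """
    num, n_union = 0.0, 0
    for x, y, s in zip(d1, d2, sigma):
        if x > 0 and y > 0:
            # both present: contribution grows as the two values get closer
            num += 0.5 * (1.0 + math.exp(-((x - y) / s) ** 2))
            n_union += 1
        elif x > 0 or y > 0:
            # present in only one document: penalized
            num -= lam
            n_union += 1
        # absent in both documents: contributes nothing
    if n_union == 0:
        return 0.0                      # no non-zero feature in either document
    f = num / n_union
    return (f + lam) / (1.0 + lam)      # rescaled to lie in [0, 1]
```

This sketch reproduces the listed properties: identical non-zero vectors score 1, vectors with no shared non-zero feature score 0, and closeness of shared feature values (scaled by the feature's standard deviation) raises the score.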
The rest of the paper is organized as follows. Related work is discussed briefly in section 2. The proposed system is described in section 3. Experimental results are presented in section 4. Finally, the conclusion is given in section 5.
2. RELATED WORK
Clustering is one of the data mining methods used for email mining. Clustering is used for grouping emails for easy management. Commonly used clustering algorithms for email grouping are hierarchical clustering [15] and the k-means clustering algorithm [12], [13]. The closeness of any two emails can be determined by distance measures such as Euclidean distance, cosine similarity, pairwise-adaptive similarity, the Jaccard coefficient, the Dice coefficient etc.
The Euclidean distance measure [5] is defined as the square root of the sum of squared differences between the respective coordinates of d1 = (p1, ..., pn) and d2 = (q1, ..., qn), i.e.

d(d1, d2) = sqrt((p1 − q1)² + (p2 − q2)² + ... + (pn − qn)²) = sqrt(Σi (pi − qi)²) (1)
Cosine similarity [4] measures the cosine of the angle between d1 and d2:

SIM_C(d1, d2) = cos θ = (d1 · d2) / (||d1|| ||d2||) = Σi (wi1 × wi2) / (sqrt(Σi wi1²) × sqrt(Σi wi2²)) (2)

where wi1 and wi2 are the weights of term i in d1 and d2 respectively.
The resulting similarity ranges from −1, meaning exactly opposite, to 1, meaning exactly the same, with 0 usually indicating independence and in-between values indicating intermediate similarity or dissimilarity.
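Both measures translate directly into code; a small sketch with illustrative term-frequency vectors:

```python
import math

def euclidean(p, q):
    """Euclidean distance of equation (1): smaller means more similar."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def cosine(p, q):
    """Cosine similarity of equation (2): the cosine of the angle between p and q."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm = math.sqrt(sum(pi * pi for pi in p)) * math.sqrt(sum(qi * qi for qi in q))
    return dot / norm

# two term-frequency vectors over the same vocabulary
d1, d2 = [2, 1, 0], [4, 2, 0]
# d2 is a scaled copy of d1, so cosine similarity is 1 (up to floating-point error)
# even though the Euclidean distance between them is non-zero
sim, dist = cosine(d1, d2), euclidean(d1, d2)
```

The example highlights a practical difference: cosine similarity ignores document length (vector magnitude), while Euclidean distance does not.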
The formula for the Jaccard coefficient [6] for data processing is:

SJ = a / (a + b + c) (3)

where
SJ = Jaccard similarity coefficient
a = number of terms common to (shared by) both documents
b = number of terms unique to the first document
c = number of terms unique to the second document

The Jaccard coefficient uses presence/absence data.

The Dice coefficient is similar to the Jaccard index. It also uses presence/absence data and is given as:

SS = 2a / (2a + b + c) (4)

where
SS = Dice similarity coefficient
a = number of terms common to (shared by) both documents
b = number of terms unique to the first document
c = number of terms unique to the second document
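Equations (3) and (4) can be computed directly from the two documents' term sets; a small sketch (the term lists are made up for illustration):

```python
def jaccard_dice(doc1_terms, doc2_terms):
    """Jaccard (eq. 3) and Dice (eq. 4) coefficients on presence/absence data."""
    t1, t2 = set(doc1_terms), set(doc2_terms)
    a = len(t1 & t2)                  # terms shared by both documents
    b = len(t1 - t2)                  # terms unique to the first document
    c = len(t2 - t1)                  # terms unique to the second document
    sj = a / (a + b + c)              # Jaccard similarity coefficient
    ss = 2 * a / (2 * a + b + c)      # Dice similarity coefficient
    return sj, ss

sj, ss = jaccard_dice(["sales", "report", "q1"], ["sales", "report", "agenda"])
# a = 2, b = 1, c = 1: Jaccard = 2/4 = 0.5, Dice = 4/6
```

Dice always scores at least as high as Jaccard for the same pair, since shared terms are counted twice in both numerator and denominator.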
IT_Sim [7], [8] is a phrase-based measure to compute the similarity based on Suffix Tree
Document Model. Pairwise-adaptive similarity [9] dynamically selects a number of features out
of d1 and d2. Hamming distance [7], [10] between two vectors is the number of positions at
which the corresponding symbols are different. The Kullback-Leibler divergence [11] is a non-symmetric measure of difference between probability distributions associated with two vectors.
Studies have shown that the above-mentioned similarity measures do not give optimal results for text classification. So a measure known as the Similarity Measure for Text Processing (SMTP) [14] is used for email categorization along with the k-means clustering algorithm.
3. PROPOSED SYSTEM
Figure 1. Proposed System
Email clustering is carried out on the email data set as follows.
Step 1: Email data is pre-processed. Pre-processing consists of:
1. Removal of stopwords
2. Stemming
Step 2: Keywords are identified.
Step 3: Term frequency is calculated.
Step 4: Similarity is calculated using a distance measure or similarity function.
Step 5: Documents are clustered using the k-means clustering algorithm.
An email consists of structured information, such as the email header, and unstructured information, such as the subject and body. Text processing is done on the unstructured information available in the message. Pre-processing of raw data is the first step of any email management task. In this step the email header, email body and attachments are parsed. From the parsed data, the subject, the email content and other fields are extracted. Once the required text is obtained, in our case the email content or body, we need to remove stop words such as 'the', 'for', 'of' etc. from it.
Next, a stemming algorithm can be used to stem the data. For example, the words 'connection', 'connecting' and 'connected' will be converted to 'connect'. After stopword removal and stemming, keywords or terms are identified; we can consider those nouns or pronouns which have a higher frequency of occurrence. Once the data is extracted from the email, it is represented in some format. The most prevalent model for representation is the vector space model. In this model every email message is represented by a single vector, with each element a token or feature. In such data the tokens are usually words or phrases. These tokens can be categorized mainly into three categories: unigrams, bigrams and co-occurrences. Unigrams are significant individual words. For example, in the text "Good morning, my dear friends" the unigrams are 'good', 'morning', 'my', 'dear' and 'friends'. Bigrams are pairs of two adjacent words. For example, in the text "Hello friends, how are you?" the bigrams are 'hello friends', 'friends how', 'how are' and 'are you'. In bigrams, word sequence is important; 'hello friends' and 'friends hello' are two different units. Co-occurrences are the same as bigrams, the only difference being that word sequence is not important. For example, 'hello friends' and 'friends hello' are treated as a single unit since their order does not matter. There is another feature, called target co-occurrence, which is the same as co-occurrence but with one target word inside each pair.
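The three token categories described above can be sketched as follows (the helper names are my own):

```python
def unigrams(tokens):
    # significant individual words, taken as-is
    return tokens

def bigrams(tokens):
    # ordered pairs of adjacent words: ('hello','friends') != ('friends','hello')
    return list(zip(tokens, tokens[1:]))

def cooccurrences(tokens):
    # like bigrams, but order-insensitive: each pair is stored in sorted order
    return [tuple(sorted(pair)) for pair in zip(tokens, tokens[1:])]

toks = "hello friends how are you".split()
pairs = bigrams(toks)
# [('hello', 'friends'), ('friends', 'how'), ('how', 'are'), ('are', 'you')]
```

Storing each co-occurrence pair in sorted order makes 'hello friends' and 'friends hello' compare equal, which is exactly the difference between the bigram and co-occurrence features.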
Once term frequencies are calculated, a document vector is generated for each email document. Using the document vectors and a similarity measure, the similarity between email documents is calculated, and the most similar documents are clustered together using the clustering algorithm.
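A compact end-to-end sketch of steps 1-5 follows. The paper's implementation is in Java; this Python sketch uses a toy stop-word list, a crude suffix-stripping stand-in for a real stemmer, squared Euclidean distance, and invented sample emails, none of which come from the paper:

```python
import random

STOPWORDS = {"the", "for", "of", "a", "an", "is", "to", "in"}  # tiny placeholder list

def preprocess(text):
    """Step 1: stop-word removal plus a crude suffix-stripping 'stemmer'."""
    tokens = [w.strip(".,?!").lower() for w in text.split()]
    tokens = [t for t in tokens if t and t not in STOPWORDS]
    stemmed = []
    for t in tokens:
        for suffix in ("ing", "ion", "ed", "s"):   # stand-in for a real stemmer
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

def vectorize(docs_tokens):
    """Steps 2-3: build the vocabulary and per-document term-frequency vectors."""
    vocab = sorted({t for doc in docs_tokens for t in doc})
    return vocab, [[doc.count(t) for t in vocab] for doc in docs_tokens]

def kmeans(vectors, k, distance, iters=20, seed=0):
    """Steps 4-5: k-means with a pluggable distance/similarity function."""
    random.seed(seed)
    centroids = random.sample(vectors, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            # assign each vector to its nearest centroid
            clusters[min(range(k), key=lambda i: distance(v, centroids[i]))].append(v)
        centroids = [
            [sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

emails = ["Connecting to the sales server", "Sales report for the quarter",
          "Cricket match on Sunday", "Sunday cricket practice"]
vocab, vecs = vectorize([preprocess(e) for e in emails])
sq_dist = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q))
clusters = kmeans(vecs, 2, sq_dist)
```

Because the distance function is a parameter, the same loop can be re-run with any of the measures discussed in section 2 (or an SMTP-style measure converted to a distance), which is exactly the comparison the experiments perform.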
4. EXPERIMENTAL RESULTS
The effectiveness of email clustering using the Similarity Measure for Text Processing (SMTP) [14] is tested by implementing the k-means clustering algorithm with SMTP. The results are compared with the k-means clustering algorithm implemented using other similarity measures: Euclidean distance, cosine similarity, the extended Jaccard coefficient and the Dice coefficient. We observed that the results obtained by using SMTP with the k-means clustering algorithm are better than those of the other similarity measures. For implementation we used an Intel(R) Core(TM) i3 processor at 1.70 GHz with 4 GB RAM. The entire implementation is done in the object-oriented programming language Java.
4.1 Data set
For experimental purposes we used the Enron email dataset, the most popular email dataset, which is freely available on the World Wide Web and can be downloaded from [16]. This dataset is cleaned and contains a large set of email messages organized in folders, with thousands of messages belonging to almost 150 users. One folder is allocated per user, under that user's name. Each such folder contains subfolders such as inbox, sent, sent-items, drafts and other user-created subfolders.

The experimental results are based on the emails contained in the inbox folders.

At the initial stage, clusters are formed manually on the basis of similarity. For experimental purposes, four clusters are created and similar emails are put into these four clusters.
Clusters generated by our system using the k-means clustering algorithm are saved in a folder dynamically. These system-generated clusters are then compared with the manually created clusters. The clusters returned by the system may not contain all the relevant mails, and may contain some irrelevant mails. For example, out of 40 mails in a particular cluster returned by the system, only 30 mails may be relevant, meaning that those 30 mails really are similar
among themselves based on the mail content. Finally, the relevant mails in the system-generated clusters are identified and accuracy is calculated as follows.

Suppose a manually generated cluster contains emails {11, 12, 13, 14} and the system-generated cluster contains emails {11, 12, 15}. The matching contents number 2, so matchCounter is 2.

Accuracy = (matchCounter / (length of manually created cluster + length of system-generated cluster − matchCounter)) × 100

Substituting the values into the above formula, the accuracy is (2 / (4 + 3 − 2)) × 100 = 40%.
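The accuracy computation above can be written directly as code:

```python
def cluster_accuracy(manual, system):
    """Accuracy as defined above: overlap over the union of both clusters, in percent."""
    match = len(set(manual) & set(system))   # matchCounter
    return match / (len(manual) + len(system) - match) * 100

# the worked example from the text
acc = cluster_accuracy([11, 12, 13, 14], [11, 12, 15])   # -> 40.0
```

Note that this is the Jaccard index of the two clusters expressed as a percentage: the overlap divided by the size of the union of both clusters.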
4.2 Figures and Tables
The following table shows the accuracy obtained for different clusters with five similarity measures. The overall results show that the similarity measure SMTP retrieves the maximum number of relevant emails.
Table 1. Accuracy obtained for different clusters for five similarity measures
Similarity Measure (Accuracy %) | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4
Cosine Similarity | 36.1111 | 6.4516 | 20.8333 | 28.5714
Euclidean Distance | 24.6575 | 25.00 | 2.38095 | 20.00
Dice Coefficient | 23.3333 | 8.8235 | 23.9130 | 3.8461
Extended Jaccard Coefficient | 24.2424 | 20.00 | 27.4509 | 26.3157
SMTP | 53.8461 | 23.0769 | 83.3333 | 31.5789
The following graphs show the accuracy in percentage obtained for different clusters with the five similarity measures.
Figure 2. Accuracy for cluster 1 Figure 3. Accuracy for cluster 2
Figure 4. Accuracy for cluster 3 Figure 5. Accuracy for cluster 4
The following screenshot shows the email clusters obtained using SMTP.
Figure 6. Output for k- means Clustering Algorithm with SMTP
5. CONCLUSION
The effectiveness of email clustering was tested using the k-means clustering algorithm with the similarity measure SMTP. From the experimental results we observed that SMTP gives better results for document (email) clustering than any of the other similarity measures used in the experiments. This is mainly because SMTP considers the presence or absence of features rather than the difference between the two values associated with a present feature. So we can conclude that the k-means clustering algorithm implemented with the similarity measure SMTP gives the best results for email (document) clustering among the measures compared.

On a concluding note, we suggest that it would be very interesting to test how SMTP works with other clustering algorithms, such as the fuzzy c-means clustering algorithm. It would also be worth analysing SMTP with classification algorithms such as KNN.
ACKNOWLEDGEMENT
The authors wish to thank everyone who has contributed to this work directly or indirectly.
REFERENCES
[1] T. Joachims, (1997) "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization", in Proc. 14th Int. Conf. Mach. Learn., San Francisco, CA, USA, pp. 143-151.
[2] H. Kim, P. Howland & H. Park, (2005) "Dimension reduction in text classification with support vector machines", J. Mach. Learn. Res., Vol. 6, pp. 37-53.
[3] G. Salton & M. J. McGill, (1983) Introduction to Modern Information Retrieval. London, U.K.: McGraw-Hill.
[4] J. Han & M. Kamber, (2006) Data Mining: Concepts and Techniques, 2nd ed. San Francisco, CA, USA: Elsevier.
[5] T. W. Schoenharl & G. Madey, (2008) "Evaluation of measurement techniques for the validation of agent-based simulations against streaming data", in Proc. ICCS, Krakow, Poland.
[6] C. G. Gonzalez, W. Bonventi, Jr. & A. L. V. Rodrigues, (2008) "Density of closed balls in real-valued and autometrized Boolean spaces for clustering applications", in Proc. 19th Brazilian Symp. Artif. Intell., Salvador, Brazil, pp. 8-22.
[7] J. A. Aslam & M. Frost, (2003) "An information-theoretic measure for document similarity", in Proc. 26th SIGIR, Toronto, ON, Canada, pp. 449-450.
[8] D. Lin, (1998) "An information-theoretic definition of similarity", in Proc. 15th Int. Conf. Mach. Learn., San Francisco, CA, USA.
[9] J. D'hondt, J. Vertommen, P. A. Verhaegen, D. Cattrysse & R. J. Duflou, (2010) "Pairwise-adaptive dissimilarity measure for document clustering", Inf. Sci., Vol. 180, No. 12, pp. 2341-2358.
[10] R. W. Hamming, (1950) "Error detecting and error correcting codes", Bell Syst. Tech. J., Vol. 29, No. 2, pp. 147-160.
[11] S. Kullback & R. A. Leibler, (1951) "On information and sufficiency", Ann. Math. Statist., Vol. 22, No. 1, pp. 79-86.
[12] G. H. Ball & D. J. Hall, (1967) "A clustering technique for summarizing multivariate data", Behav. Sci., Vol. 12, No. 2, pp. 153-155.
[13] R. O. Duda, P. E. Hart & D. G. Stork, (2001) Pattern Classification, New York, NY, USA: Wiley.
[14] Yung-Shen Lin, Jung-Yi Jiang & Shie-Jue Lee, (2014) "A similarity measure for text classification and clustering", IEEE Transactions on Knowledge and Data Engineering, Vol. 26, No. 7.
[15] M. B. Eisen, P. T. Spellman, P. O. Brown & D. Botstein, (1998) "Cluster analysis and display of genome-wide expression patterns", Proc. Natl. Acad. Sci. USA, Vol. 95, No. 25, pp. 14863-14868.
[16] https://www.cs.cmu.edu/~./enron/
Authors
Ms. Deepa B. Patil is a post-graduate student in Computer Engineering at Savitribai Phule Pune University, Pune, Maharashtra State, India.

Prof. Yashwant V. Dongre is an Assistant Professor at Vishwakarma Institute of Information Technology (VIIT), Savitribai Phule Pune University, Pune, Maharashtra State, India. His areas of interest include database management, data mining and information retrieval. He has several papers published in prestigious journals, including AIRCCE IJDMS.