Cluster Analysis of Online Shop Product Reviews Using K-Means Clustering

Technological developments have changed people's lifestyles, notably by shifting shopping behavior from in-person (offline) purchases to online ones. Online shopping offers many conveniences, but using e-commerce facilities also carries risks, namely problems with product or service quality, payment security, and fraud. This research mines review data from an e-commerce site and uses the K-Means Clustering algorithm to produce clusters that can help potential customers decide whether to buy a product or service.


Introduction
This figure is higher compared to other countries such as Malaysia (14%), Thailand (22%), and the Philippines (28%) [2]. Online shops have grown greatly in Indonesia, even in remote areas. The many conveniences of online shopping lead consumers to switch to these facilities; people need only pay for an internet subscription to use them. An online store hosts many sellers, which makes it easier for prospective customers to choose the products or services that fit their needs. It also makes competition among providers fierce, which benefits prospective buyers. There are, however, some obstacles in shopping online. Before deciding to buy a product or service, prospective buyers must check whether the seller's history shows that the seller can be trusted, because online transactions require trust between seller and buyer. One of the factors that most affects a consumer's decision to buy is knowledge of the seller's history and of the quality of the products offered. Prospective buyers can learn this by reading the existing product reviews on the online store site.

Data Mining
Data mining aims to extract information and knowledge that is accurate and potentially useful for decision making and problem solving. In general, data mining covers methods such as classification, regression, variable selection, and clustering. Based on the tasks that can be performed, data mining is divided into several sections [3]:
• Description: describes the patterns and trends contained in the data and provides possible explanations for a pattern.
• Estimation: the model is built using complete records that provide the values of the target variable as the value to predict.
• Prediction: similar to classification and estimation, except that the predicted results lie in the future.
• Classification: the target variable is categorical. For example, income can be classified into several categories, namely low income, medium income, and high income.
• Clustering: groups data or objects into clusters so that objects within a cluster have a very high level of similarity, while different clusters have different characteristics.

Clustering
Clustering divides data into groups based on similarities determined beforehand. A cluster is a group of data objects that are similar to one another within the same cluster and dissimilar to objects in other clusters.

K-Means Clustering
Algorithm: k-means. The k-means algorithm for partitioning, where each cluster's center is represented by the mean value of the objects in the cluster [4].
Input:
• k: the number of clusters
• D: a data set containing n objects
Output: a set of k clusters.
Method:
• Arbitrarily choose k objects from D as the initial cluster centers.
• Repeat:
• (Re)assign each object to the cluster to which the object is most similar, based on the mean value of the objects in the cluster.
• Update the cluster means, that is, calculate the mean value of the objects for each cluster.
• Until no change.
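The method above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this research: the function name `kmeans`, the squared Euclidean distance as the similarity measure, the `max_iter` cap, the fixed `seed`, and the toy data are all assumptions made for the example.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (lists of floats) into k groups; returns (labels, centers)."""
    rng = random.Random(seed)
    # Arbitrarily choose k objects from the data set as the initial centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # (Re)assign each object to the cluster whose mean is nearest
        # (squared Euclidean distance is an assumption of this sketch).
        new_labels = [
            min(range(k), key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centers[c])))
            for p in points
        ]
        # Stop as soon as no assignment changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Update each cluster mean from its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy data: two well-separated groups of three points each.
data = [[1.0, 1.0], [1.5, 1.5], [2.0, 2.0], [10.0, 10.0], [10.5, 10.5], [11.0, 11.0]]
labels, centers = kmeans(data, k=2)
print(labels)
print(centers)
```

Which group receives label 0 depends on the random initial centers, so only the grouping itself is meaningful, not the label numbers.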

Pre Process Data
In general, four stages are completed in this research model: collecting product review data, preprocessing the data, feature selection, and testing the K-Means clustering model together with its clustering performance.

A. Data collection
Data collection (data crawling) aims to capture product review data. This research uses product review data obtained from online buying and selling sites. The data are collected using the Octoparse application, a web crawling tool [5].

B. Text pre-processing
Initial processing, or text pre-processing, is the second stage in text mining [6]. It prepares the data for the pattern discovery stage, for example by eliminating data that contains noise, incomplete data, and inconsistent data.

C. Feature Selection
At this stage two processes are carried out: case folding and stop word removal. Case folding converts all uppercase letters to lowercase. Stop word removal requires a list of the words to be deleted; in general, stop words are common words that carry little meaning but appear often. Indonesian examples include the words for "to", "with", "which", "if", and "will". Such words are therefore removed.
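The two processes can be illustrated with a short Python sketch. The five-word stop list (assumed Indonesian equivalents of the examples above), the sample review, and the function names are hypothetical and serve only to show the mechanics; they are not the data or code used in this research.

```python
# Hypothetical stop list for illustration only: assumed Indonesian words for
# "to", "with", "which", "if", "will". A real stop list is much longer.
STOP_WORDS = {"ke", "dengan", "yang", "jika", "akan"}


def case_fold(text):
    """Case folding: convert every uppercase letter to lowercase."""
    return text.lower()


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in stop_words]


# Made-up review text, split on whitespace after case folding.
review = "Barang yang Dikirim Sesuai Dengan Pesanan"
tokens = remove_stop_words(case_fold(review).split())
print(tokens)
```

Case folding is applied first so that capitalized stop words such as "Dengan" still match the lowercase entries in the list.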

E. Stemming
Stemming in this research is based on the Nazief and Adriani algorithm. This algorithm is also known as the confix stripping algorithm, a stemming algorithm designed specifically for Indonesian text [7].
After all data have been transformed into numerical form, they can be grouped using the K-Means Clustering method. Grouping the data into clusters requires the following steps:
• Determine the desired number of clusters in advance. In this study the data are grouped into two clusters.
• Determine the starting point of each cluster. In this study the initial center points are generated randomly. The cluster centers of the initial solution are shown in Table 1.

Results of data collection
The research uses 888 online customer reviews, of which 806 are positive comments and 82 are negative comments.

Pre Processing Data
After the product review data have been collected, they are pre-processed so that the k-means clustering algorithm can be applied to them.
The pre-processing stages applied are Case Folding, Non-Alphanumeric Removal, Stop Word Removal, and Stemming. The stop word list for Indonesian consists of 760 words [8].
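Of these stages, non-alphanumeric removal is the simplest to show in code. The sketch below assumes that "alphanumeric" means ASCII letters and digits and that removed characters are replaced by a space; the function name and the sample review are hypothetical.

```python
import re


def remove_non_alphanumeric(text):
    """Replace punctuation and symbols with spaces, then collapse whitespace."""
    # Assumption: only ASCII letters, digits and whitespace are kept.
    cleaned = re.sub(r"[^A-Za-z0-9\s]+", " ", text)
    return " ".join(cleaned.split())


# Made-up review text with punctuation, an emoticon and a percent sign.
cleaned = remove_non_alphanumeric("Barang bagus!!! Pengiriman cepat :) 100% puas.")
print(cleaned)
```

Replacing symbols with a space rather than deleting them avoids gluing two words together when they are separated only by punctuation.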

A. String to word vector
To convert the string data into word vectors, the TF-IDF algorithm is applied. The TF-IDF implementation produces a data matrix with dimensions of 85 attributes x 888 data records.
The 85 terms in the data are shown in the following figure.
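The conversion from tokenized reviews to a TF-IDF matrix can be sketched as follows. The exact weighting variant used in this research is not stated, so the example assumes the common form tf × log(N / df) with raw term counts; the function name `tf_idf` and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return (terms, matrix); matrix[i][j] is the weight of terms[j] in document i."""
    n_docs = len(documents)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for doc in documents for term in set(doc))
    terms = sorted(df)
    matrix = []
    for doc in documents:
        counts = Counter(doc)  # raw term frequency within this document
        # Assumed weighting: tf * log(N / df); terms absent from the document get 0.
        matrix.append([counts[t] * math.log(n_docs / df[t]) for t in terms])
    return terms, matrix


# Made-up, already pre-processed reviews (one token list per review).
docs = [["barang", "bagus"], ["barang", "rusak"], ["pengiriman", "cepat", "bagus"]]
terms, matrix = tf_idf(docs)
print(terms)
for row in matrix:
    print([round(w, 3) for w in row])
```

Each row corresponds to one review and each column to one term, so a corpus with 888 reviews and 85 terms yields 888 rows of 85 weights each.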

Converting Data to Numeric
The following

Attribute Selection
The data above are still too large to process effectively, so the attributes must be filtered. Using the Cfs algorithm, 50 attributes are selected.

Conclusion
From Table 2 above it can be concluded that testing on the 888 reviews produced 2 clusters:
1. Cluster 1 contains 729 reviews (82%) that have a very high similarity to one another.
2. Cluster 2 contains 159 reviews (18%) that have a very high similarity to one another.