K-Means Clustering with Scikit-Learn. January 10, 2018. K-means clustering is one of the most widely used unsupervised machine learning algorithms; it groups data instances into clusters based on their similarity to one another. For the algorithm to work, the number of clusters has to be defined beforehand. The plots first show what a K-means algorithm would yield using three clusters, and then the effect of a bad initialization on the resulting cluster assignment.
Active 4 years, 6 months ago
I have a dataset which looks like this:
{'dns_query_count': '11', 'http_hostnames_count': '7', 'dest_port_count': '3', 'ip_count': '11', 'signature_count': '0', 'src_ip': '10.0.64.42', 'http_user_agent_count': '2'}
It has already been converted from CSV to a dict.
Then I use DictVectorizer to convert it.
Then I try to use KMeans on it.
My question is: how do I get information about which row of my data belongs to which cluster?
I expect to get something like this:
{'dns_query_count': '11', 'http_hostnames_count': '7', 'dest_port_count': '3', 'ip_count': '11', 'signature_count': '0', 'src_ip': '10.0.64.42', 'http_user_agent_count': '2', 'cluster': '1'}
Can someone give me a step-by-step example of how to go from raw data like I showed to the same data with information about which cluster each row belongs to?
For example, I used Weka for this dataset and it showed me what I want: I can click data points on the graphs and read exactly which data points belong to which cluster. How do I get similar results with sklearn?
Coolface
1 Answer
This will show how you can retrieve the cluster id for each row and the cluster centers. I have also measured the distance from each row to each centroid so you can see that the rows are properly assigned to the clusters.
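A minimal sketch of that approach could look like the following. Only the first row is taken from the question; the other rows are made up for illustration, `to_features` is a hypothetical helper, and `n_clusters=2` is an arbitrary choice for this toy data. It also assumes the count fields should be cast to numbers and that `src_ip` is an identifier rather than a feature: `DictVectorizer` one-hot encodes string values, so passing the raw dicts would turn every distinct count into its own category.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction import DictVectorizer

# First row is from the question; the remaining rows are hypothetical.
rows = [
    {'dns_query_count': '11', 'http_hostnames_count': '7', 'dest_port_count': '3',
     'ip_count': '11', 'signature_count': '0', 'src_ip': '10.0.64.42',
     'http_user_agent_count': '2'},
    {'dns_query_count': '12', 'http_hostnames_count': '6', 'dest_port_count': '3',
     'ip_count': '10', 'signature_count': '0', 'src_ip': '10.0.64.43',
     'http_user_agent_count': '2'},
    {'dns_query_count': '10', 'http_hostnames_count': '8', 'dest_port_count': '2',
     'ip_count': '12', 'signature_count': '1', 'src_ip': '10.0.64.44',
     'http_user_agent_count': '3'},
    {'dns_query_count': '250', 'http_hostnames_count': '90', 'dest_port_count': '40',
     'ip_count': '200', 'signature_count': '15', 'src_ip': '10.0.64.50',
     'http_user_agent_count': '20'},
    {'dns_query_count': '240', 'http_hostnames_count': '95', 'dest_port_count': '42',
     'ip_count': '210', 'signature_count': '14', 'src_ip': '10.0.64.51',
     'http_user_agent_count': '22'},
    {'dns_query_count': '260', 'http_hostnames_count': '88', 'dest_port_count': '38',
     'ip_count': '205', 'signature_count': '16', 'src_ip': '10.0.64.52',
     'http_user_agent_count': '19'},
]


def to_features(row):
    """Hypothetical helper: drop the identifier and cast counts to float."""
    return {key: float(value) for key, value in row.items() if key != 'src_ip'}


# Build a dense feature matrix; row i of X corresponds to rows[i].
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform([to_features(row) for row in rows])

# The number of clusters must be chosen up front (2 is an assumption here).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)  # labels[i] is the cluster id of rows[i]

# transform() gives the distance from each row to each cluster center.
distances = kmeans.transform(X)

# Attach the cluster id to a copy of each original row.
labeled_rows = [dict(row, cluster=str(label)) for row, label in zip(rows, labels)]

print("features:", vectorizer.feature_names_)
print("centers:\n", np.round(kmeans.cluster_centers_, 2))
for row, dist in zip(labeled_rows, distances):
    print(row['src_ip'], "cluster", row['cluster'], "distances", np.round(dist, 2))
```

Because `fit_predict` returns labels in the same order as the input, zipping them back onto the original dicts is enough to recover which row landed in which cluster. The cluster ids themselves are arbitrary, so the same grouping may come out numbered differently under another seed.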
jay s