Search Tool Data Analysis

by Kevin Schmidt (schmidke in BIT330, Fall 2008)

Questions and queries

Web search engines

The query I submitted to the Web search engines was about finding other people on the internet. I want sites that offer users some way of searching for and/or connecting with people from around the world. I'm looking for Web pages that list personal information such as addresses, phone numbers, ages, and incomes. In addition, the site must be able to search for people across at least the entire United States.

The search query that I submitted to all three search engines was "people search find".

Blog search engines

The query I submitted to the blog search engines was about the upcoming season of NBA basketball. Any blog that discusses any aspect of NBA basketball for the 2008-2009 season qualifies as relevant to my search.

The search query I submitted to all three blog search engines was "NBA basketball 2008 2009".

Data that I collected

Search engine overlap data

These tables list averages for the class. The raw individual data can be found here.

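| Web search | Live | Google | Yahoo Web |
|---|---|---|---|
| Live | 42.77 | 18.33 | 20 |
| Google | | 54.44 | 20.55 |
| Yahoo Web | | | 51.66 |
| All | 10 | | |

| Blog search | Technorati | Google Blog | Bloglines |
|---|---|---|---|
| Technorati | 33.05 | 3.61 | 9.16 |
| Google Blog | | 52.5 | 6.94 |
| Bloglines | | | 44.44 |
| All | 1.38 | | |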

The two tables above are pretty simple to read. Any number in a cell whose row and column have the same heading (e.g. Google and Google) is the percentage of relevant search results returned for search queries on that site. Any number in a cell whose row and column headings differ (e.g. Google and Yahoo Web) is the percentage of overlap between the search results of those two search or blog search engines. Finally, the number in the "All" row at the bottom of each table gives the percentage of overlap in search results across all three of the search or blog search engines.
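To make the reading of these two tables concrete, here is a minimal Python sketch of how one student's cells could be computed. The assignment's exact formula is not given on this page, so the definitions are assumptions: relevance is the share of an engine's results judged relevant, and overlap is the number of results shared by the engines divided by the length of the shortest result list. The result lists and relevance judgments below are hypothetical.

```python
from itertools import combinations


def precision(results, relevant):
    """Percentage of the returned results that were judged relevant."""
    if not results:
        return 0.0
    hits = sum(1 for url in results if url in relevant)
    return 100.0 * hits / len(results)


def overlap(*result_lists):
    """Percentage of results shared by every list (assumed definition:
    shared results divided by the size of the shortest list)."""
    shared = set(result_lists[0]).intersection(*result_lists[1:])
    smallest = min(len(set(r)) for r in result_lists)
    return 100.0 * len(shared) / smallest if smallest else 0.0


def overlap_table(results, relevant):
    """Build the cells of one table: diagonal, pairwise, and 'All'."""
    table = {}
    for name, urls in results.items():
        table[(name, name)] = precision(urls, relevant)      # diagonal cell
    for (n1, r1), (n2, r2) in combinations(results.items(), 2):
        table[(n1, n2)] = overlap(r1, r2)                    # off-diagonal cell
    table["All"] = overlap(*results.values())                # bottom row
    return table


# Hypothetical top-5 results for one query (letters stand in for URLs).
results = {
    "Live": ["a", "b", "c", "d", "e"],
    "Google": ["a", "f", "g", "h", "b"],
    "Yahoo Web": ["a", "g", "i", "j", "k"],
}
relevant = {"a", "b", "g", "i"}  # hypothetical relevance judgments

table = overlap_table(results, relevant)
for cell, value in table.items():
    print(cell, value)
```

The class tables would then be the averages of these per-student values, cell by cell.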

Search engine ranking overlap data

These tables list averages for the class. The raw individual data can be found here.

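Table GY. This table provides a measure of how much of Google's responses are reproduced by Yahoo (rows: Google's top N results; columns: Yahoo's top N results).

| Google \ Yahoo | 5 | 10 | 20 |
|---|---|---|---|
| 5 | 1.05 | 1.35 | 1.64 |
| 10 | 1.29 | 2 | 2.64 |
| 20 | 1.64 | 2.47 | 3.70 |

Table YG. This table provides a measure of how much of Yahoo's responses are reproduced by Google (rows: Yahoo's top N results; columns: Google's top N results).

| Yahoo \ Google | 5 | 10 | 20 |
|---|---|---|---|
| 5 | 1.05 | 1.17 | 1.64 |
| 10 | 1.47 | 1.94 | 2.47 |
| 20 | 1.88 | 2.64 | 3.76 |

Table BG. This table provides a measure of how much of Bloglines' responses are reproduced by Google Blog Search (rows: Bloglines' top N results; columns: Google Blog Search's top N results).

| Bloglines \ Google Blog | 5 | 10 | 20 |
|---|---|---|---|
| 5 | 0.294 | 0.352 | 0.588 |
| 10 | 0.411 | 0.529 | 0.823 |
| 20 | 0.529 | 0.882 | 1.117 |

Table GB. This table provides a measure of how much of Google Blog Search's responses are reproduced by Bloglines (rows: Google Blog Search's top N results; columns: Bloglines' top N results).

| Google Blog \ Bloglines | 5 | 10 | 20 |
|---|---|---|---|
| 5 | 0.294 | 0.352 | 0.470 |
| 10 | 0.411 | 0.470 | 0.823 |
| 20 | 0.705 | 0.764 | 1.058 |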

The four tables above are a bit harder to read than the first two, but I'll lead you through it. Once you understand how to read one of the tables, the rest are easy. Let's use the top "GY" table for our explanation. This table provides a measure of how much of Google's responses are reproduced by Yahoo. The row headings give the number of top Google results considered, and the column headings give the number of top Yahoo results considered. The first cell in the top left of the table has a row and column heading of 5. The quantity in this cell is the number of Google's top 5 search results that also appeared in Yahoo's top 5 search results, in this case 1.05 results on average. The cell directly to the right tells us how many of Google's top 5 search results were found in Yahoo's top 10 search results. The rest of this table and the others can be read in the same way.
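The reading just described can be sketched in Python. This is an illustration under assumptions rather than the procedure the class actually followed: a result counts as "reproduced" when the identical URL appears in both lists, and the class table is the cell-by-cell mean of the students' matrices. The ranked lists below are hypothetical.

```python
def topk_overlap(a, b, cutoffs=(5, 10, 20)):
    """matrix[i][j] = how many of a's top cutoffs[i] results
    also appear in b's top cutoffs[j] results."""
    return [
        [len(set(a[:ka]) & set(b[:kb])) for kb in cutoffs]
        for ka in cutoffs
    ]


def average_matrices(matrices):
    """Cell-by-cell mean of several equally sized matrices."""
    n = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(m[i][j] for m in matrices) / n for j in range(cols)]
        for i in range(rows)
    ]


# Hypothetical ranked results for one student's query (20 per engine).
google = [f"u{i}" for i in range(20)]  # u0 .. u19
yahoo = (["u3", "y1", "u0", "y2", "y3", "u7", "y4", "y5", "u15", "y6"]
         + [f"y{i}" for i in range(7, 17)])

gy = topk_overlap(google, yahoo)  # rows: Google top N, columns: Yahoo top N
for cutoff, row in zip((5, 10, 20), gy):
    print(cutoff, row)

# A second hypothetical student with no matches at all, to show averaging.
no_matches = [[0, 0, 0] for _ in range(3)]
class_average = average_matrices([gy, no_matches])
```

Here `gy[0][1]` corresponds to the cell described above: the number of Google's top 5 results found in Yahoo's top 10.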

Results

Web search

Judging from the class's averages for the search tool data collection assignment, Google appears to be the best site for obtaining relevant search results on the Web. Google had the highest precision among the Web search engines at about 54%, followed by Yahoo at about 52% and Live at about 43%. The gap between the engines is not large, but the results are in line with the common belief that Google is the best search engine on the Web. As for overlap, no pair of the three sites shared many results; every pairwise overlap was around 20%. The overlap across all three sites together was, as expected, even lower at 10%. The search engine ranking overlap tables suggest that Google reproduces slightly more of Yahoo's responses within the top 20 search results than vice versa (3.76 versus 3.70 results on average).

Blog search

Judging from the class's averages for the search tool data collection assignment, Google Blog appears to be the best site for finding relevant blogs on the Web. Google Blog returned the most relevant search results overall at about 53%, compared to 44% for Bloglines and 33% for Technorati. Overlap among the blog search engines' results was very low overall, with Bloglines and Technorati overlapping the most at about 9%. Results for Google Blog and Technorati overlapped only 3.6% of the time; similarly, results for Google Blog and Bloglines overlapped only 6.9% of the time. Again, there was almost no overlap (1.4%) when comparing all three blog search engines' results together. The search engine ranking overlap tables suggest that Google Blog reproduced slightly more of Bloglines' responses within the top 20 search results than vice versa (1.117 versus 1.058 results on average).
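The top-20 ranking comparisons in these Results are easier to judge as percentages. The short Python sketch below divides each top-20/top-20 cell from the ranking overlap tables by the cutoff of 20. Treating a cell as "average matches out of 20" is an assumption about how the tables were built; the four values are copied from the tables above.

```python
def percent_reproduced(avg_matches, cutoff):
    """Average number of matching results as a percentage of the cutoff."""
    return 100.0 * avg_matches / cutoff


CUTOFF = 20

# (engine whose results are checked, engine that reproduces them):
# top-20/top-20 cell from the class-average tables.
top20_matches = {
    ("Yahoo", "Google"): 3.76,            # table YG
    ("Google", "Yahoo"): 3.70,            # table GY
    ("Bloglines", "Google Blog"): 1.117,  # table BG
    ("Google Blog", "Bloglines"): 1.058,  # table GB
}

percentages = {
    pair: percent_reproduced(value, CUTOFF)
    for pair, value in top20_matches.items()
}

for (source, other), pct in percentages.items():
    print(f"{pct:.2f}% of {source}'s top {CUTOFF} "
          f"also appear in {other}'s top {CUTOFF}")
```

Under that assumption, Google reproduces roughly 18.8% of Yahoo's top 20 while Yahoo reproduces roughly 18.5% of Google's, and the blog figures are roughly 5.6% versus 5.3%, so the differences in both directions are small.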

Discussion

Web search

The purpose of this assignment was to explore and analyze three of the Web's major search engines: Live, Google, and Yahoo. After doing the exercises and looking at the data, two things really jump out at me. First, the overlap among the search engines is relatively low. What this means is that the Web is huge; not even the best search engines out there return similar lists of results when asked the exact same question. Second, Google is the best overall Web search engine. Google gave the most relevant search results, even over the small number of trials that we did. Google also covered a slightly larger share of Yahoo's results than vice versa, which hints that Google covers Yahoo's important results plus more, although the difference is small.

So as far as recommendations go, I would tell any Web user to try Google first, followed by Yahoo, and then Live when trying to find things on the internet. This order may not always hold, though. As evidenced by the table summarizing the search engine overlap, these search engines don't really return the same results even for the same search. What this means is that you should always be open to using an alternate search engine if you're not finding what you want with Google. From this assignment I learned a great deal about how search engines actually go about finding, retrieving, and listing search results for our search queries. I've definitely gotten some beneficial practice at formulating search queries effectively and using different search engine syntax.

A further question I'd like to investigate is how these summaries would look with a much larger data set, maybe 1,000 students' search results as opposed to just the 20 or so students in this class. That way, we could find out with much more confidence what the best Web search engine is, and how much overlap really exists among the most used search engines.

Blog search

The purpose of this assignment was to explore and analyze three of the Web's major blog search engines: Google Blog, Technorati, and Bloglines. After going through the exercises and analyzing the data, the first thing that jumps out at me is that there aren't nearly as many blogs out there as there are other Web pages. None of these three major blog search engines showed a high percentage of relevant search results for our given queries. Furthermore, there was almost no overlap among the blog search engines, which to me suggests that either these databases aren't very developed or there is simply a lack of blogs on the Web.

As far as recommendations go, I would tell any Web user to try Google Blog first, followed by Bloglines, and then Technorati when searching for useful blogs on the internet. This order may not always hold, though. As evidenced by the table summarizing the blog search engine overlap, these search engines don't really return the same results even for the same search. What this means is that you should always be open to using an alternate blog search engine if you're not getting the kind of results you want. From this assignment I learned, first and foremost, what the best blog search engines on the Web are. I've never really been one to read and use blogs, but now that I have some handy ways to find them, this may change in the future.

A further question I'd like to investigate is how these summaries would look with a much larger data set, maybe 1,000 students' search results as opposed to just the 20 or so students in this class. That way, we could find out with much more confidence what the best blog search engine is, and how much overlap really exists among the most used blog search engines.
