Web Crawler & Search Engine

CS314H (Honors Data Structures) ✦ Fall 2024

Skills: Java, JUnit, LaTeX, Documentation, GitHub, Pair Programming, Collaboration

Partner: Roy Yue (CSB Class of 2027)

Key Objectives

  • Code a Java web crawler and corresponding search engine

  • Design our program so that it can crawl and index 100+ web pages and answer basic search queries that use AND, OR, NOT, quotation marks, and parentheses within a reasonable time

  • Rigorously test our application and its components to ensure that it handles as many test cases as possible, including edge cases

  • Document our engineering process, code, testing methodology, and pair programming experience in a detailed report

Project Description

This excerpt from our original report summarizes the project well: “For this project, we were challenged to make a search engine, a program that crawls through web pages in search of more pages, indexes said pages, and enables a user to search for desired pages using queries hosted on a server. Our personal goals for this project were to not only code a functional search engine but also to gain more experience coding open-ended projects on a larger scale than we typically do in class” (1).

To guard against plagiarism by current and future CS314H students, I can’t go into detail about how our project works, but I can share details about the report that won’t compromise the integrity of upcoming projects. In our report, we documented the project’s assumptions, scope, quality, limitations, problems encountered, interesting results, and work log in order to contextualize our work. We also documented every method in WebCrawler, WebIndex, CrawlingMarkupHandler, and WebQueryEngine, explaining what each one does. For our testing methodology, we combined black box testing and white box testing where appropriate. We documented the common test cases that we ran against our project, as well as edge cases of varying complexity.

We were also asked to reflect on our pair programming experience. Roy is a much more experienced and knowledgeable programmer than I am, so there was a considerable skill difference. However, we were able to collaborate in a way that benefited both of us. He guided the overall direction of the project but still encouraged me to figure out algorithms when needed and gave me the support to do so. I devised small but clever test cases that broke our code, which pushed us to reinforce our program. Out of all 7 programming projects we completed in the Fall 2023 semester, this one is honestly my favorite, even if it was technically considered one of the hardest, because I was able to learn so much! We completed this project in only 12 working hours, including the extra credit that was offered.

To guard against plagiarism by current and future CS314H students, I cannot display our GitHub repository or full report, so here are some trivial screenshots of our report instead. Posting more specific content would reveal essential hints for figuring out the project, and we don’t want that happening!
